<a href="https://colab.research.google.com/github/data-space/datalab-notebooks/blob/master/Python/4.%20Workflows/1.1%20Introduction%20-%20diamonds%20dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Workflows - diamonds dataset

Updated: Aug 21 at 7:57 AM 

## Reference
- https://scikit-learn.org/stable/index.html
- https://scikit-learn.org/stable/datasets/index.html
- https://scikit-learn.org/stable/data_transforms.html
- https://scikit-learn.org/stable/supervised_learning.html
- https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
- https://github.com/data-space/datalab-notebooks/tree/master/Python/3.%20Pipelines
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html

## Table of Contents
1. Introduction
1. Setup
1. Create feature-target pipeline
1. Create estimator pipeline
1. Discussion
1. Next steps

## Introduction

From the [introduction notebook](https://github.com/data-space/datalab-notebooks/blob/master/Python/4.%20Workflows/1.0%20Introduction.ipynb): machine learning workflows (for supervised learning) have these steps:
1. Get the initial dataset
2. Create a feature-target dataset (from the initial dataset)
3. Create train and test datasets (from the feature-target dataset)
4. Fit a model (to the train dataset)
5. Make and evaluate predictions (made by the fit model on the test dataset)

There are four components that you provide as input to the workflow:
1. The initial dataset
2. The process to create the feature-target dataset, which is the feature-target pipeline
3. The process to fit the model, which is the estimator pipeline
4. The choice of metric to use in evaluating the model predictions

Our focus is on:
- component 2 (the feature-target pipeline) 
- component 3 (the estimator pipeline)

The takeaways from this notebook are:
- the ability to write transformer classes 
- the ability to use the `Pipeline` class to create a multi-stage pipeline 
- an understanding of the purpose and use of the `OneHotEncoder` class

## Setup

This section loads three libraries and displays their version numbers. 

Import the `pandas`,  `numpy` and `sklearn` libraries. 

In [0]:
import pandas  as pd
import numpy   as np
import sklearn as sk

Display the version numbers of the `pandas`, `numpy` and `sklearn` packages:

In [2]:
print('pandas :',pd.__version__)
print('numpy  :',np.__version__)
print('sklearn:',sk.__version__)

pandas : 0.24.2
numpy  : 1.16.4
sklearn: 0.21.3


### Standard/common functions

The following code cells define the standard functions that implement the machine learning workflow defined and discussed in the `1.0 Introduction` notebook, which is in the same folder as this notebook.

In [0]:
def get_feature_target_pdf(pdf, transformer_object): 
  return transformer_object.fit(pdf).transform(pdf)

In [0]:
def get_train_test_dict(pdf, target_name, **kwargs):
  from sklearn.model_selection import train_test_split
  X_train, X_test, y_train, y_test = train_test_split(pdf.drop(columns=target_name),
                                                      pdf[target_name], 
                                                      **kwargs)
  return {
      'x_train': X_train,
      'y_train': y_train,
      'x_test' : X_test,
      'y_test' : y_test
  }
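Because `get_train_test_dict` forwards `**kwargs` to `train_test_split`, options such as `test_size` and `random_state` can be passed through it. The following is a minimal sketch of that forwarding on a toy dataframe. The data values and the helper name `split` are hypothetical and used only for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy feature-target dataframe (hypothetical values, for illustration only)
toy_pdf = pd.DataFrame({'carat' : [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
                        'target': [300, 400, 500, 600, 700, 800, 900, 1000]})

def split(pdf, **kwargs):
  # Same call pattern as get_train_test_dict: keyword arguments are forwarded unchanged
  return train_test_split(pdf.drop(columns='target'), pdf['target'], **kwargs)

# Two calls with the same random_state give the same split
X_train_a, X_test_a, y_train_a, y_test_a = split(toy_pdf, test_size=0.25, random_state=0)
X_train_b, X_test_b, y_train_b, y_test_b = split(toy_pdf, test_size=0.25, random_state=0)

print(len(X_train_a), len(X_test_a))
print(X_test_a.index.equals(X_test_b.index))
```

Without a fixed `random_state`, each call draws a new random split, so an evaluation metric computed on the test rows can change from run to run.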

In [0]:
def get_fit_model(estimator_object, x_train, y_train):
  return estimator_object.fit(X=x_train,
                              y=y_train)

In [0]:
def get_predict_ser(model, x_test):
  return model.predict(X=x_test)

In [0]:
def get_actual_predict_pdf(actual,predict):
  import pandas as pd
  return pd.DataFrame(data={'actual' : actual,
                            'predict': predict},
                      index=actual.index)

In [0]:
def get_actual_predict_eval(actual_predict_pdf, metric_function):
  return metric_function(actual_predict_pdf['actual'], 
                         actual_predict_pdf['predict'])
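To make the role of the metric function concrete, here is a small worked sketch that computes the mean absolute error by hand and with `mean_absolute_error`. The three actual/predicted values are made up for illustration.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted values
actual_predict_pdf = pd.DataFrame({'actual' : [3.0, 5.0, 2.0],
                                   'predict': [2.5, 5.0, 4.0]})

# By hand: the mean of the absolute differences, (0.5 + 0.0 + 2.0) / 3
by_hand = (actual_predict_pdf['actual'] - actual_predict_pdf['predict']).abs().mean()

# With the metric function, called the same way as in get_actual_predict_eval
by_sklearn = mean_absolute_error(actual_predict_pdf['actual'],
                                 actual_predict_pdf['predict'])

print(by_hand, by_sklearn)
```

Any function with the signature `metric(actual, predict)` can be supplied in the same way.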

The `workflow` function bundles these standard functions together and takes as input the four workflow components. 

In [0]:
def workflow(initial_pdf,        # workflow component
             transformer_object, # workflow component
             estimator_object,   # workflow component
             metric_function     # workflow component
            ):
  feature_target_pdf = \
  get_feature_target_pdf(pdf               =initial_pdf,       # workflow component
                         transformer_object=transformer_object # workflow component
                        )

  train_test_dict = \
  get_train_test_dict(pdf        =feature_target_pdf,
                      target_name='target'
                     )

  fit_model = \
  get_fit_model(estimator_object=estimator_object, # workflow component
                x_train         =train_test_dict.get('x_train'),
                y_train         =train_test_dict.get('y_train'),
               )

  actual_predict_pdf = \
  get_actual_predict_pdf(train_test_dict.get('y_test'),
                         get_predict_ser(fit_model,
                                         train_test_dict.get('x_test')
                                        )
                        ) 

  return get_actual_predict_eval(actual_predict_pdf,
                                 metric_function # workflow component
                                )

## Create feature-target pipeline

The diamonds dataset is used in this example.

In [0]:
def get_initial_pdf():
  import pandas as pd
  diamonds_file_link = 'https://raw.githubusercontent.com/datalab-datasets/file-samples/master/diamonds.csv'
  return pd.read_csv(diamonds_file_link) \
           .drop('Unnamed: 0',
                 axis=1)

In [11]:
get_initial_pdf().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
carat      53940 non-null float64
cut        53940 non-null object
color      53940 non-null object
clarity    53940 non-null object
depth      53940 non-null float64
table      53940 non-null float64
price      53940 non-null int64
x          53940 non-null float64
y          53940 non-null float64
z          53940 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB


The first step in the feature-target pipeline (step 2 of the workflow) is to rename the target column to `target`. The `rename` pandas dataframe method renames columns. 

In [12]:
(get_initial_pdf()
 .rename(columns={'price':'target'})
 .head(3)
)

Unnamed: 0,carat,cut,color,clarity,depth,table,target,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31


Our goal, though, is to use scikit-learn transformer, estimator and pipeline classes. The transformer class `RenameColumnsPDF` is designed to rename columns. 

In [0]:
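from sklearn.base import BaseEstimator, TransformerMixin
class RenameColumnsPDF(BaseEstimator, TransformerMixin):  
  def __init__(self,rename_dict={}):
    # store the init parameter unchanged under the same name
    self.rename_dict = rename_dict
    
  def fit(self, X, y=None): 
    # nothing is learned from the data, so fit simply returns the object
    return self

  def transform(self, X, y=None): 
    # return a new dataframe with the columns renamed
    return X.rename(columns=self.rename_dict)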

An object of this class is created and used to rename the `price` column to `target`. 

In [14]:
(RenameColumnsPDF(rename_dict={'price':'target'})
 .fit(get_initial_pdf())
 .transform(get_initial_pdf())
 .head(3)
)

Unnamed: 0,carat,cut,color,clarity,depth,table,target,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
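The two base classes are what make a hand-written transformer usable inside scikit-learn tooling: `BaseEstimator` supplies `get_params`/`set_params`, and `TransformerMixin` supplies `fit_transform`. The sketch below illustrates this on a toy dataframe. The class name `RenameColumnsSketch` is a hypothetical stand-in for `RenameColumnsPDF`, repeated here only so the example is self-contained.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class RenameColumnsSketch(BaseEstimator, TransformerMixin):
  """Hypothetical stand-in for RenameColumnsPDF."""
  def __init__(self, rename_dict=None):
    self.rename_dict = rename_dict

  def fit(self, X, y=None):
    return self  # nothing to learn

  def transform(self, X, y=None):
    return X.rename(columns=self.rename_dict or {})

# Toy dataframe (hypothetical values)
toy_pdf = pd.DataFrame({'carat': [0.23, 0.21], 'price': [326, 326]})
renamer = RenameColumnsSketch(rename_dict={'price': 'target'})

params      = renamer.get_params()            # inherited from BaseEstimator
renamed_pdf = renamer.fit_transform(toy_pdf)  # inherited from TransformerMixin

print(params)
print(list(renamed_pdf.columns))
```

Parameter names returned by `get_params` are taken from the `__init__` signature, which is why each init argument is stored on an attribute of the same name.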


Import the `LinearRegression` estimator class and the `mean_absolute_error` metric function. 

In [0]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics      import mean_absolute_error

Calling the `workflow` function will produce an error as the feature-target dataframe contains non-numeric features. 

In [16]:
try:
  workflow(initial_pdf       =get_initial_pdf(),
           transformer_object=RenameColumnsPDF(rename_dict={'price':'target'}),
           estimator_object  =LinearRegression(), # supply an appropriate estimator object
           metric_function   =mean_absolute_error # supply an appropriate metric function
          )
except Exception as error:
  print('Error: ' + repr(error))


Error: ValueError("could not convert string to float: 'Good'",)


Specifically, these non-numeric features are `cut`, `color` and `clarity`.

In [17]:
(RenameColumnsPDF(rename_dict={'price':'target'})
 .fit(get_initial_pdf())
 .transform(get_initial_pdf())
 .head(3)
)

Unnamed: 0,carat,cut,color,clarity,depth,table,target,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
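The failure can be reproduced in miniature. The sketch below, on a hypothetical four-row dataframe, shows that `LinearRegression` raises a `ValueError` when a string column is among the features and fits without complaint once only numeric features are supplied.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy dataframe with one numeric and one string feature (hypothetical values)
toy_pdf = pd.DataFrame({'carat' : [0.23, 0.21, 0.23, 0.29],
                        'cut'   : ['Ideal', 'Premium', 'Good', 'Premium'],
                        'target': [326, 326, 327, 334]})

# Fitting with the string column fails during conversion to a float array
try:
  LinearRegression().fit(toy_pdf[['carat', 'cut']], toy_pdf['target'])
  error_type = None
except ValueError as error:
  error_type = type(error).__name__
  print('Error:', error)

# Fitting with only the numeric column succeeds
numeric_model = LinearRegression().fit(toy_pdf[['carat']], toy_pdf['target'])
```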


An easy way to get around this problem is to use the `select_dtypes` method to select only the `float` and `int` columns. 

In [18]:
(RenameColumnsPDF(rename_dict={'price':'target'})
 .fit(get_initial_pdf())
 .transform(get_initial_pdf())
 .select_dtypes(include=['float','int'])
 .head(3)
)

Unnamed: 0,carat,depth,table,target,x,y,z
0,0.23,61.5,55.0,326,3.95,3.98,2.43
1,0.21,59.8,61.0,326,3.89,3.84,2.31
2,0.23,56.9,65.0,327,4.05,4.07,2.31
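As a side note, `select_dtypes` also accepts the generic selector `'number'`, which matches every numeric dtype and may be more robust when integer columns are stored with different widths. A minimal sketch on a hypothetical toy dataframe:

```python
import pandas as pd

# Toy dataframe with float, string and integer columns (hypothetical values)
toy_pdf = pd.DataFrame({'carat': [0.23, 0.21],
                        'cut'  : ['Ideal', 'Premium'],
                        'price': [326, 326]})

numeric_pdf     = toy_pdf.select_dtypes(include='number')  # keep numeric columns
non_numeric_pdf = toy_pdf.select_dtypes(exclude='number')  # keep everything else

print(list(numeric_pdf.columns))
print(list(non_numeric_pdf.columns))
```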


The `FilterColumnsPDF` transformer class does this work of filtering columns based on `dtype`. 

In [0]:
from sklearn.base import BaseEstimator, TransformerMixin
class FilterColumnsPDF(BaseEstimator, TransformerMixin):  
  def __init__(self,include_dtype_list=['object','float','int','bool']):
    # store the init parameter unchanged under the same name
    self.include_dtype_list = include_dtype_list
    
  def fit(self, X, y=None): 
    # nothing is learned from the data, so fit simply returns the object
    return self

  def transform(self, X, y=None): 
    # keep only the columns whose dtype is in the include list
    return X.select_dtypes(include=self.include_dtype_list)

In [20]:
(FilterColumnsPDF(include_dtype_list=['float','int'])
 .fit(get_initial_pdf())
 .transform(get_initial_pdf())
 .head(3)
)

Unnamed: 0,carat,depth,table,price,x,y,z
0,0.23,61.5,55.0,326,3.95,3.98,2.43
1,0.21,59.8,61.0,326,3.89,3.84,2.31
2,0.23,56.9,65.0,327,4.05,4.07,2.31


You could use an intermediate variable to pass the output of the `transform` method of a `RenameColumnsPDF` object to the `fit` and `transform` methods of a `FilterColumnsPDF` object, but there is a better way.

In [21]:
renamed_pdf = \
(RenameColumnsPDF(rename_dict={'price':'target'})
 .fit(get_initial_pdf())
 .transform(get_initial_pdf())
)

(FilterColumnsPDF(include_dtype_list=['float','int'])
 .fit(renamed_pdf)
 .transform(renamed_pdf)
 .head(3)
)

Unnamed: 0,carat,depth,table,target,x,y,z
0,0.23,61.5,55.0,326,3.95,3.98,2.43
1,0.21,59.8,61.0,326,3.89,3.84,2.31
2,0.23,56.9,65.0,327,4.05,4.07,2.31


An object of the `Pipeline` class will organize this work for us. 

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

In [0]:
from sklearn.pipeline import Pipeline

The following `Pipeline` object does exactly the same work as calling the fit/transform methods of the `RenameColumnsPDF` object. 

In [23]:
(Pipeline([('ren',RenameColumnsPDF(rename_dict={'price':'target'}))
          ]
         )
 .fit(get_initial_pdf())
 .transform(get_initial_pdf())
 .head(3)
)

Unnamed: 0,carat,cut,color,clarity,depth,table,target,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31


The `Pipeline` object below sends the output of the `transform` method of the `RenameColumnsPDF` object to the `fit` and `transform` methods of the `FilterColumnsPDF` object. 

In [24]:
(Pipeline([('ren',RenameColumnsPDF(rename_dict       ={'price':'target'})),
           ('flt',FilterColumnsPDF(include_dtype_list=['int','float']))
          ]
         )
 .fit      (get_initial_pdf())
 .transform(get_initial_pdf())
 .head(3)
)

Unnamed: 0,carat,depth,table,target,x,y,z
0,0.23,61.5,55.0,326,3.95,3.98,2.43
1,0.21,59.8,61.0,326,3.89,3.84,2.31
2,0.23,56.9,65.0,327,4.05,4.07,2.31


Compared with the intermediate-variable approach, there is less code involved, which makes it
- easier to read
- easier to debug
- less likely to contain errors
- easier to extend (while remaining easy to read and debug and unlikely to contain errors)
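To see what a `Pipeline` of transformers does, it can help to compare it with the equivalent hand-written loop. The sketch below is an illustration under assumptions: it uses `FunctionTransformer` objects as stand-ins for the custom transformer classes, and a hypothetical two-row dataframe.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy dataframe (hypothetical values)
toy_pdf = pd.DataFrame({'carat': [0.23, 0.21],
                        'cut'  : ['Ideal', 'Premium'],
                        'price': [326, 326]})

# Two stateless steps, stand-ins for RenameColumnsPDF and FilterColumnsPDF
rename_step = FunctionTransformer(lambda pdf: pdf.rename(columns={'price': 'target'}))
filter_step = FunctionTransformer(lambda pdf: pdf.select_dtypes(include='number'))

# By hand: each step is fit on, and then transforms, the output of the previous step
by_hand = toy_pdf
for step in [rename_step, filter_step]:
  by_hand = step.fit(by_hand).transform(by_hand)

# With a Pipeline: the same chaining is done for us
by_pipeline = (Pipeline([('ren', rename_step), ('flt', filter_step)])
               .fit(toy_pdf)
               .transform(toy_pdf))

print(by_hand.equals(by_pipeline))
```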

The following code cell creates the feature-target dataset from 
- the initial dataframe 
- the feature-target pipeline.

In [25]:
feature_target_pipeline = Pipeline([('ren',RenameColumnsPDF(rename_dict       ={'price':'target'})),
                                    ('flt',FilterColumnsPDF(include_dtype_list=['int','float']))
                                   ]
                                  )

(get_feature_target_pdf(pdf               =get_initial_pdf(), 
                        transformer_object=feature_target_pipeline
                       )
 .head(3)
)

Unnamed: 0,carat,depth,table,target,x,y,z
0,0.23,61.5,55.0,326,3.95,3.98,2.43
1,0.21,59.8,61.0,326,3.89,3.84,2.31
2,0.23,56.9,65.0,327,4.05,4.07,2.31


The dataframe (which is the result of calling the `get_initial_pdf` function) and the feature-target pipeline (which is stored in the `feature_target_pipeline` variable) are two of the four input components to the `workflow` function and in general to a machine learning workflow. 

The other two components are the estimator object `LinearRegression` and the metric function `mean_absolute_error`.

In [26]:
feature_target_pipeline = Pipeline([('ren',RenameColumnsPDF(rename_dict       ={'price':'target'})),
                                    ('flt',FilterColumnsPDF(include_dtype_list=['int','float']))
                                   ]
                                  )
workflow(initial_pdf       =get_initial_pdf(),
         transformer_object=feature_target_pipeline,
         estimator_object  =LinearRegression(), 
         metric_function   =mean_absolute_error 
        )

877.9317020941659

## Create estimator pipeline

In the last section a `FilterColumnsPDF` object was used to keep only the numeric columns and so drop the three `object` columns. Below, a `OneHotEncoder` object is used to transform these three columns into 20 binary columns. Each of the 20 columns corresponds to one distinct value of one of the three variables. 

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

Import the `OneHotEncoder` class. 

In [0]:
from sklearn.preprocessing import OneHotEncoder

Recall the columns of the diamonds dataset. 

In [28]:
get_initial_pdf().head(3)

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31


The `cut` column of the dataframe is transformed into an array with binary columns. 

In [29]:
(OneHotEncoder()
 .fit      (get_initial_pdf()[['cut']])
 .transform(get_initial_pdf()[['cut']])
 .todense()
)

matrix([[0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 1., 0., 0., 0.],
        ...,
        [0., 0., 0., 0., 1.],
        [0., 0., 0., 1., 0.],
        [0., 0., 1., 0., 0.]])

Notice that these five columns correspond to the five distinct values found in the `cut` column.

In [30]:
get_initial_pdf()['cut'].value_counts()

Ideal        21551
Premium      13791
Very Good    12082
Good          4906
Fair          1610
Name: cut, dtype: int64
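The mapping from binary columns to values is stored on the fitted encoder in its `categories_` attribute, with the values in sorted order. The following sketch shows this on a hypothetical four-row `cut` column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy column with three distinct values (hypothetical rows)
toy_pdf = pd.DataFrame({'cut': ['Ideal', 'Premium', 'Good', 'Ideal']})

encoder    = OneHotEncoder().fit(toy_pdf[['cut']])
categories = list(encoder.categories_[0])                 # one array per input column
encoded    = encoder.transform(toy_pdf[['cut']]).toarray()

print(categories)  # binary column i corresponds to categories[i]
print(encoded)
```

The same ordering is consistent with the matrix shown earlier for the full dataset: the first row (`Ideal`) has its 1 in the third column, and `Ideal` is third among the sorted values `Fair`, `Good`, `Ideal`, `Premium`, `Very Good`.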

The `ColumnTransformer` class is an efficient way to transform all three columns (`cut`, `color`, `clarity`) into binary columns.

First import the class. 

In [0]:
from sklearn.compose import ColumnTransformer

The `ColumnTransformer` object is given a `OneHotEncoder` object, which is applied to each of the three columns. The remaining columns are "passed through" unchanged and appear after the encoded columns in the result.

In [32]:
(ColumnTransformer(transformers=[('ohe',OneHotEncoder(),['cut','color','clarity'])],
                   remainder='passthrough'
                  )
 .fit      (get_initial_pdf())
 .transform(get_initial_pdf())
 [0]
)

array([0.00e+00, 0.00e+00, 1.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
       1.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
       0.00e+00, 0.00e+00, 0.00e+00, 1.00e+00, 0.00e+00, 0.00e+00,
       0.00e+00, 0.00e+00, 2.30e-01, 6.15e+01, 5.50e+01, 3.26e+02,
       3.95e+00, 3.98e+00, 2.43e+00])

Notice the 20 binary columns, corresponding to the 20 distinct values of these three `object` columns (5 for `cut`, 7 for `color`, 8 for `clarity`), followed by the 7 remaining numeric columns. 

In [33]:
for object_var in ['cut','color','clarity']:
  print(get_initial_pdf()[object_var].value_counts())

Ideal        21551
Premium      13791
Very Good    12082
Good          4906
Fair          1610
Name: cut, dtype: int64
G    11292
E     9797
F     9542
H     8304
D     6775
I     5422
J     2808
Name: color, dtype: int64
SI1     13065
VS2     12258
SI2      9194
VS1      8171
VVS2     5066
VVS1     3655
IF       1790
I1        741
Name: clarity, dtype: int64
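The column layout produced by `ColumnTransformer` can be checked on a small example. The sketch below uses a hypothetical three-row dataframe with two categorical columns. Setting `sparse_threshold=0` is a choice made here so that the result is always a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy dataframe (hypothetical values)
toy_pdf = pd.DataFrame({'carat': [0.23, 0.21, 0.29],
                        'cut'  : ['Ideal', 'Premium', 'Good'],
                        'color': ['E', 'E', 'I'],
                        'price': [326, 326, 334]})

column_transformer = ColumnTransformer(
    transformers=[('ohe', OneHotEncoder(), ['cut', 'color'])],
    remainder='passthrough',
    sparse_threshold=0  # always return a dense array
)
transformed = column_transformer.fit_transform(toy_pdf)

# 3 columns for cut + 2 columns for color + 2 passed-through columns
print(transformed.shape)
print(transformed[0])
```

The encoded columns come first (three for `cut`, then two for `color`), and the passed-through columns `carat` and `price` follow in their original order.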


Recall the workflow given at the end of the previous section.

In [34]:
feature_target_pipeline = Pipeline([('ren',RenameColumnsPDF(rename_dict       ={'price':'target'})),
                                    ('flt',FilterColumnsPDF(include_dtype_list=['int','float']))
                                   ]
                                  )
workflow(initial_pdf       =get_initial_pdf(),
         transformer_object=feature_target_pipeline,
         estimator_object  =LinearRegression(),
         metric_function   =mean_absolute_error
        )

893.9119130504884

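Notice the mean absolute error. It differs from the value obtained at the end of the previous section because `train_test_split` is called without a fixed `random_state`, so each run produces a different random train/test split. 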

The feature-target pipeline of the following code cell only renames the target column. It does not drop the `object` columns. 

The estimator pipeline transforms these `object` columns to binary columns and then passes the transformed matrix to the linear regression. 

In [35]:
feature_target_pipeline = Pipeline([('ren',RenameColumnsPDF(rename_dict={'price':'target'}))
                                   ]
                                  )
estimator_pipeline = Pipeline([('ohe',ColumnTransformer(transformers=[('ohe',
                                                                       OneHotEncoder(),
                                                                       ['cut','color','clarity']
                                                                      )
                                                                     ],
                                                        remainder='passthrough')),
                               ('lin', LinearRegression())
                              ]
                             )
workflow(initial_pdf       =get_initial_pdf(),
         transformer_object=feature_target_pipeline,
         estimator_object  =estimator_pipeline,
         metric_function   =mean_absolute_error 
        )

746.6343344456803

Not surprisingly, the mean absolute error is lower than in the workflow where these variables were dropped. 

## Discussion

The scikit-learn transformer, estimator and pipeline classes, with their `fit`, `transform` and `predict` methods, seem to have been designed to provide an approach that is easy to use, understand and extend, and that allows one to follow good modeling practices. The essential modeling requirement is to not allow data from the test dataset to leak into the train dataset. In addition, it is important that the final evaluation of a model's predictions not be used to choose or configure the model.

- In what order are the `fit` and `transform` methods called in a feature-target pipeline?
- In what order are the `fit`, `transform` and `predict` methods called in an estimator pipeline? 
- What are the differences between feature-target pipelines and estimator pipelines? 
- Why is the `FilterColumnsPDF` object in the feature-target pipeline?
- Why is the `OneHotEncoder` object in the estimator pipeline?


In what order are ... ?
- Run the [`5.4 Pipelines`](https://colab.research.google.com/github/data-space/datalab-notebooks/blob/master/Python/3.%20Pipelines/5.4%20Pipelines.ipynb) data lab notebook from GitHub on Colab

What are the differences between feature-target pipelines and estimator pipelines? 
- A feature-target pipeline is a sequence of transformer objects. 
- An estimator pipeline is a sequence of transformer objects followed by an estimator object. 
- The `fit` methods of feature-target pipelines do not aggregate information over the entire dataset. 
- The `transform` methods of feature-target pipelines transform rows with information only from the row being transformed.
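The distinction matters because a transformer whose `fit` method aggregates over rows will learn different values depending on which rows it sees. The sketch below illustrates this with `StandardScaler`, which is not used elsewhere in this notebook and is chosen here only as a simple example of a stateful transformer. The data values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train and test rows for a single feature
train = np.array([[1.0], [2.0], [3.0]])
test  = np.array([[10.0]])

# Fit on the train rows only: the learned mean ignores the test row
scaler_train_only = StandardScaler().fit(train)

# Fit on all rows: the test row leaks into the learned mean
scaler_leaky = StandardScaler().fit(np.vstack([train, test]))

print(scaler_train_only.mean_[0], scaler_leaky.mean_[0])
```

A stateful step like this belongs in the estimator pipeline, where it is fit on the train dataset only. A stateless step, whose `fit` learns nothing, gives the same result wherever it is placed.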


Why is the `FilterColumnsPDF` object in the feature-target pipeline?
- The `fit` method does nothing and so does not aggregate information over the entire dataset.
- It could be placed in the estimator pipeline, but is simple enough to be performed just once.

Why is the `OneHotEncoder` object in the estimator pipeline?
- It does aggregate information over the entire dataset by finding the list of unique values contained in a column. For this reason it should not be in the feature-target pipeline.
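One visible consequence is that an encoder knows only the values it saw when it was fit. The following sketch, on hypothetical train and test rows, shows the default behavior when a new value appears and the `handle_unknown='ignore'` option, which is one possible way to tolerate such values.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: 'Fair' appears only in the test rows
train_pdf = pd.DataFrame({'cut': ['Ideal', 'Premium', 'Good']})
test_pdf  = pd.DataFrame({'cut': ['Fair']})

# Default (handle_unknown='error'): an unseen value raises a ValueError
strict_encoder = OneHotEncoder().fit(train_pdf)
try:
  strict_encoder.transform(test_pdf)
  strict_failed = False
except ValueError:
  strict_failed = True

# handle_unknown='ignore': an unseen value is encoded as all zeros
lenient_encoder = OneHotEncoder(handle_unknown='ignore').fit(train_pdf)
lenient_row     = lenient_encoder.transform(test_pdf).toarray()

print(strict_failed)
print(lenient_row)
```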

## Next steps

1. Modify the init parameter `normalize` of the final estimator object in the estimator pipeline. Rerun the workflow and check your results. (This parameter exists in the scikit-learn 0.21.3 used here; it was removed from `LinearRegression` in later releases.)
1. Change the final estimator object (try `Lasso` or `Ridge`) in the estimator pipeline. Rerun the workflow and check your results. 
1. Modify the init parameter `alpha` of the `Lasso`/`Ridge` estimator object. Rerun the workflow and check your results. 
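One possible way to carry out these steps is with the `set_params` method of a `Pipeline`, which can replace a named step or set a parameter of a step using the `step__parameter` naming scheme. The sketch below uses a hypothetical one-step pipeline and made-up data rather than the estimator pipeline of this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline

# Hypothetical data: y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

pipeline = Pipeline([('lin', LinearRegression())])

# Replace the final estimator by referring to its step name
pipeline.set_params(lin=Ridge())

# Set a parameter of that step with the step__parameter syntax
pipeline.set_params(lin__alpha=10.0)

pipeline.fit(X, y)
predictions = pipeline.predict(X)
print(pipeline.named_steps['lin'])
```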

__The End__

### Ignore this for now

In [36]:
(feature_target_pipeline
 .set_params(ren__rename_dict={'price':'target'})
 .fit_transform(get_initial_pdf())
 .head(3)
)

Unnamed: 0,carat,cut,color,clarity,depth,table,target,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31


In [37]:
(estimator_pipeline.set_params(lin__normalize=True)
)

Pipeline(memory=None,
         steps=[('ohe',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('ohe',
                                                  OneHotEncoder(categorical_features=None,
                                                                categories=None,
                                                                drop=None,
                                                                dtype=<class 'numpy.float64'>,
                                                                handle_unknown='error',
                                                                n_values=None,
                                                                sparse=True),
                                                  ['cut', 'color', 'clarity'])],
                                   ver