
# Workflows - Diamonds dataset

## Reference
- https://scikit-learn.org/stable/index.html
- https://scikit-learn.org/stable/datasets/index.html
- https://scikit-learn.org/stable/data_transforms.html
- https://scikit-learn.org/stable/supervised_learning.html
- https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
- https://github.com/data-space/datalab-notebooks/tree/master/Python/3.%20Pipelines
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html

## Table of Contents
1. Introduction
1. Setup
1. Workflow - Diamonds dataset (hands-on)
1. Next steps

## Introduction

Supervised machine learning has the general goal of creating a model/method to make _good_ predictions on unseen data. 

Machine learning workflows (for supervised learning) organize the steps that accomplish this goal. These steps are:
1. Get the initial dataset
2. Create a feature-target dataset (from the initial dataset)
3. Create train and test datasets (from the feature-target dataset)
4. Fit model (to the train dataset)
5. Make and evaluate predictions (made by the fit model on the test dataset)
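
The five steps can be sketched end to end on a small synthetic dataset. This is only an illustration: the columns `a`, `b` and `y`, and the choice of `LinearRegression` and `mean_absolute_error`, are assumptions made for the example, not part of the diamonds workflow itself.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Step 1: get the initial dataset (synthetic here: y = 3*a - 2*b + 5, no noise).
rng = np.random.RandomState(0)
initial_pdf = pd.DataFrame({'a': rng.rand(100), 'b': rng.rand(100)})
initial_pdf['y'] = 3 * initial_pdf['a'] - 2 * initial_pdf['b'] + 5

# Step 2: create the feature-target dataset (only a rename is needed here).
feature_target_pdf = initial_pdf.rename(columns={'y': 'target'})

# Step 3: create train and test datasets.
X_train, X_test, y_train, y_test = train_test_split(
    feature_target_pdf.drop(columns='target'),
    feature_target_pdf['target'],
    random_state=0)

# Step 4: fit the model to the train dataset.
model = LinearRegression().fit(X_train, y_train)

# Step 5: make and evaluate predictions on the test dataset.
mae = mean_absolute_error(y_test, model.predict(X_test))
print(mae)
```

Because the synthetic target is an exact linear function of the features, the reported error should be close to zero.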

There are four components (that you provide) as input to the workflow:
1. The initial dataset
2. The process to create the feature-target dataset
3. The process to fit the model
4. The choice of metric to use in evaluating the model predictions

Our focus is on:
- component 2 and the processes to prepare the data in step 2
- component 3 and the process to fit the model in step 4

In this notebook, these steps will be described and implemented in Python code and run on the diamonds dataset.

Later notebooks will apply this workflow to more involved datasets, which will give an opportunity to focus on steps 2 and 4.

There are a few requirements of the workflow. In general, information used to evaluate the model should not be available when creating the model. Specifically, 
- the train dataset should be used to fit the model
- the test dataset should be used to evaluate the model
- step 5 should only be run once
- the feature-target dataset should be created using only per-row transformations
- predictor columns of the feature-target dataset should have only numeric types 
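
One way to read the "per-row transformations" requirement is that the transformed value of a row must not depend on any other row. The sketch below contrasts a per-row transformation (a logarithm) with a whole-column one (standardization). The helper names `per_row` and `whole_column` and the toy `carat` values are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({'carat': [0.2, 0.5, 1.0, 2.0]})

def per_row(df):
    # Each output value depends only on its own row.
    return np.log(df['carat'])

def whole_column(df):
    # Each output value depends on the mean and std of all rows passed in.
    return (df['carat'] - df['carat'].mean()) / df['carat'].std()

first_two = pdf.iloc[:2]

# Per-row: the first two rows get the same values with or without the others.
same_per_row = np.allclose(per_row(pdf).iloc[:2], per_row(first_two))

# Whole-column: the first two rows change when the other rows are removed.
same_whole_column = np.allclose(whole_column(pdf).iloc[:2],
                                whole_column(first_two))

print(same_per_row, same_whole_column)
```

A whole-column transformation applied before the train/test split would let the test rows influence the values seen during training, which is the kind of information leak the requirements are meant to rule out.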

The first two requirements will be implemented in the code below. The remaining requirements will be discussed in later notebooks. 

The takeaways from this notebook are:
- workflow steps
- workflow components
- the realization that this workflow isn't really that difficult 

The requirements will be revisited in later notebooks.

## Setup

This section loads three libraries and displays their version numbers. 

Import the `pandas`,  `numpy` and `sklearn` libraries. 

In [0]:
import pandas  as pd
import numpy   as np
import sklearn as sk

Display the version numbers of the `pandas`, `numpy` and `sklearn` packages:

In [2]:
print('pandas :',pd.__version__)
print('numpy  :',np.__version__)
print('sklearn:',sk.__version__)

pandas : 0.24.2
numpy  : 1.16.4
sklearn: 0.21.3


### Standard/common functions

The following code cells contain the standard functions used by the workflow. (`get_iris_pdf` is not included.)

In [0]:
def get_feature_target_pdf(pdf, transformer_object): 
  return transformer_object.fit(pdf).transform(pdf)

In [0]:
def get_train_test_dict(pdf, target_name, **kwargs):
  from sklearn.model_selection import train_test_split
  X_train, X_test, y_train, y_test = train_test_split(pdf.drop(columns=target_name),
                                                      pdf[target_name], 
                                                      **kwargs)
  return {
      'x_train': X_train,
      'y_train': y_train,
      'x_test' : X_test,
      'y_test' : y_test
  }

In [0]:
def get_fit_model(estimator_object, x_train, y_train):
  return estimator_object.fit(X=x_train,
                              y=y_train)

In [0]:
def get_predict_ser(model, x_test):
  return model.predict(X=x_test)

In [0]:
def get_actual_predict_pdf(actual,predict):
  import pandas as pd
  return pd.DataFrame(data={'actual' : actual,
                            'predict': predict},
                      index=actual.index)

In [0]:
def get_actual_predict_eval(actual_predict_pdf, metric_function):
  return metric_function(actual_predict_pdf['actual'], 
                         actual_predict_pdf['predict'])

The `workflow` function bundles these standard functions together and takes as input the four workflow components. 

The result is the evaluation of the predictions 
- made on the test dataset by the model, 
- which was fit to the train dataset, 
- which was split from the feature-target dataset 
- which was prepared from the initial dataset.

In [0]:
def workflow(initial_pdf,        # workflow component
             transformer_object, # workflow component
             estimator_object,   # workflow component
             metric_function     # workflow component
            ):
  feature_target_pdf = \
  get_feature_target_pdf(pdf               =initial_pdf,       # workflow component
                         transformer_object=transformer_object # workflow component
                        )

  train_test_dict = \
  get_train_test_dict(pdf        =feature_target_pdf,
                      target_name='target'
                     )

  fit_model = \
  get_fit_model(estimator_object=estimator_object, # workflow component
                x_train         =train_test_dict.get('x_train'),
                y_train         =train_test_dict.get('y_train'),
               )

  actual_predict_pdf = \
  get_actual_predict_pdf(train_test_dict.get('y_test'),
                         get_predict_ser(fit_model,
                                         train_test_dict.get('x_test')
                                        )
                        ) 

  return get_actual_predict_eval(actual_predict_pdf,
                                 metric_function # workflow component
                                )

## Workflow - Diamonds dataset (hands-on)

In [0]:
def get_initial_pdf():
  import pandas as pd
  diamonds_file_link = 'https://raw.githubusercontent.com/datalab-datasets/file-samples/master/diamonds.csv'
  return pd.read_csv(diamonds_file_link) \
           .drop('Unnamed: 0',
                 axis=1)

In [11]:
get_initial_pdf().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
carat      53940 non-null float64
cut        53940 non-null object
color      53940 non-null object
clarity    53940 non-null object
depth      53940 non-null float64
table      53940 non-null float64
price      53940 non-null int64
x          53940 non-null float64
y          53940 non-null float64
z          53940 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB


In [0]:
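from sklearn.base import BaseEstimator, TransformerMixin
class RenameColumnsPDF(BaseEstimator, TransformerMixin):  
  # Transformer that renames DataFrame columns according to rename_dict.
  def __init__(self, rename_dict=None):
    # None instead of {} avoids sharing one mutable default across instances.
    self.rename_dict = rename_dict
    
  def fit(self, X, y=None): 
    # Nothing to learn: renaming is stateless.
    return self

  def transform(self, X, y=None): 
    return X.rename(columns=self.rename_dict or {})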
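
The two base classes are what make a small class like this usable inside scikit-learn tooling. The sketch below uses a stand-in class, `RenameColumnsSketch` (a hypothetical name, redefined here so the cell runs on its own), to show what each base class contributes.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone

class RenameColumnsSketch(BaseEstimator, TransformerMixin):
    # Hypothetical stand-in with the same shape as RenameColumnsPDF.
    def __init__(self, rename_dict=None):
        self.rename_dict = rename_dict

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X.rename(columns=self.rename_dict or {})

toy_pdf = pd.DataFrame({'price': [326, 327], 'carat': [0.23, 0.21]})
renamer = RenameColumnsSketch(rename_dict={'price': 'target'})

# TransformerMixin supplies fit_transform (fit followed by transform).
renamed = renamer.fit_transform(toy_pdf)

# BaseEstimator supplies get_params, which reads the __init__ argument names.
params = renamer.get_params()

# clone builds a fresh, unfitted copy from those parameters.
renamer_copy = clone(renamer)

print(list(renamed.columns), params)
```

For `get_params` and `clone` to work, `__init__` should only store its arguments under the same names, which is why the class does nothing else there.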

In [0]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics      import mean_absolute_error

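The following call to `workflow` supplies `LinearRegression()` as the `estimator_object` and `mean_absolute_error` as the `metric_function`, the two parameters left for you to choose.

Even with both supplied, the call produces an error: the feature-target dataset still contains string-valued `object` columns (`cut`, `color`, `clarity`), which the estimator cannot convert to floats.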

In [14]:
try:
  workflow(initial_pdf       =get_initial_pdf(),
           transformer_object=RenameColumnsPDF(rename_dict={'price':'target'}),
           estimator_object  =LinearRegression(), # supply an appropriate estimator object
           metric_function   =mean_absolute_error # supply an appropriate metric function
          )
except Exception as error:
  print('Error: ' + repr(error))


Error: ValueError("could not convert string to float: 'Premium'",)
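
The same kind of failure can be reproduced on a three-row frame, which makes the cause easier to see. The toy values below are made up for illustration; only the column names follow the diamonds dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

toy_x = pd.DataFrame({'carat': [0.23, 0.21, 0.29],
                      'cut':   ['Ideal', 'Premium', 'Good']})
toy_y = pd.Series([326, 326, 334])

try:
    # The string column cannot be converted to floats, so fit raises.
    LinearRegression().fit(toy_x, toy_y)
    error_name = None
except ValueError as error:
    error_name = type(error).__name__
    print('Error:', error)

# Dropping the string column lets the same call succeed.
model = LinearRegression().fit(toy_x[['carat']], toy_y)
```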


In [15]:
get_feature_target_pdf(get_initial_pdf(), 
                       RenameColumnsPDF({'price':'target'})
                      ).head()

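|   | carat | cut | color | clarity | depth | table | target | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |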


In [16]:
(get_feature_target_pdf(get_initial_pdf(), 
                       RenameColumnsPDF({'price':'target'})
                      )
 .select_dtypes(include=['float','int'])
 .head()
)

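|   | carat | depth | table | target | x | y | z |
|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |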
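
The pandas documentation suggests `'number'` (or `np.number`) as the selector for all numeric dtypes, which avoids listing `'float'` and `'int'` separately. A minimal sketch on a made-up three-column frame:

```python
import pandas as pd

toy_pdf = pd.DataFrame({'carat':  [0.23, 0.21],
                        'cut':    ['Ideal', 'Premium'],
                        'target': [326, 326]})

# 'number' matches every numeric dtype, whatever its width.
numeric_pdf = toy_pdf.select_dtypes(include='number')

# exclude is the complement: everything that is not numeric.
other_pdf = toy_pdf.select_dtypes(exclude='number')

print(list(numeric_pdf.columns), list(other_pdf.columns))
```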


In [0]:
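from sklearn.base import BaseEstimator, TransformerMixin
class FilterColumnsPDF(BaseEstimator, TransformerMixin):  
  # Transformer that keeps only the columns whose dtype is in include_dtype_list.
  def __init__(self, include_dtype_list=None):
    # None stands for the default list; a list literal as default would be shared across instances.
    self.include_dtype_list = include_dtype_list
    
  def fit(self, X, y=None): 
    # Nothing to learn: the selection depends only on column dtypes.
    return self

  def transform(self, X, y=None): 
    include = self.include_dtype_list
    if include is None:
      include = ['object','float','int','bool']
    return X.select_dtypes(include=include)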

In [18]:
(FilterColumnsPDF(include_dtype_list=['float','int'])
 .fit(get_initial_pdf())
 .transform(get_initial_pdf())
 .head()
)

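|   | carat | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |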


Reference: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

In [0]:
from sklearn.pipeline import Pipeline

In [20]:
(Pipeline([('ren',RenameColumnsPDF(rename_dict={'price':'target'}))
          ]
         )
 .fit(get_initial_pdf())
 .transform(get_initial_pdf())
 .head()
)

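|   | carat | cut | color | clarity | depth | table | target | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |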


In [21]:
(Pipeline([('ren',RenameColumnsPDF(rename_dict={'price':'target'})),
           ('flt',FilterColumnsPDF())
          ]
         )
 .fit(get_initial_pdf())
 .transform(get_initial_pdf())
 .head()
)

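|   | carat | cut | color | clarity | depth | table | target | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |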


In [22]:
(Pipeline([('ren',RenameColumnsPDF(rename_dict={'price':'target'})),
           ('flt',FilterColumnsPDF(include_dtype_list=['int','float']))
          ]
         )
 .fit(get_initial_pdf())
 .transform(get_initial_pdf())
 .head()
)

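|   | carat | depth | table | target | x | y | z |
|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |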
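
A `Pipeline` applies its steps in the order listed, feeding the output of each step to the next. The sketch below shows the same rename-then-filter idea with scikit-learn's `FunctionTransformer` swapped in for the notebook's custom classes. The helper functions `rename_price` and `keep_numeric` and the toy frame are hypothetical.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def rename_price(df):
    # First step: rename the price column to target.
    return df.rename(columns={'price': 'target'})

def keep_numeric(df):
    # Second step: keep only the numeric columns.
    return df.select_dtypes(include='number')

toy_pdf = pd.DataFrame({'carat': [0.23, 0.21],
                        'cut':   ['Ideal', 'Premium'],
                        'price': [326, 326]})

toy_pipe = Pipeline([('ren', FunctionTransformer(rename_price)),
                     ('flt', FunctionTransformer(keep_numeric))])

# fit_transform runs 'ren' first, then passes its output to 'flt'.
result = toy_pipe.fit_transform(toy_pdf)

print(list(result.columns))
print(list(toy_pipe.named_steps))
```

The step names (`'ren'`, `'flt'`) are the keys of `named_steps`, so an individual step can be looked up by name after the pipeline is built.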


In [23]:
# previous code
(get_feature_target_pdf(get_initial_pdf(), 
                       RenameColumnsPDF({'price':'target'})
                      )
 .select_dtypes(include=['float','int'])
 .head()
)

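|   | carat | depth | table | target | x | y | z |
|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |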


In [24]:
fea_pipe = Pipeline([('ren',RenameColumnsPDF(rename_dict={'price':'target'})),
                     ('flt',FilterColumnsPDF(include_dtype_list=['int','float']))
                    ]
                   )

(get_feature_target_pdf(pdf=get_initial_pdf(), 
                        transformer_object=fea_pipe
                       )
 .head()
)

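|   | carat | depth | table | target | x | y | z |
|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |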


In [25]:
fea_pipe = Pipeline([('ren',RenameColumnsPDF(rename_dict={'price':'target'})),
                     ('flt',FilterColumnsPDF(include_dtype_list=['int','float']))
                    ]
                   )
workflow(initial_pdf       =get_initial_pdf(),
         transformer_object=fea_pipe,
         estimator_object  =LinearRegression(), # supply an appropriate estimator object
         metric_function   =mean_absolute_error # supply an appropriate metric function
        )

887.9020841294825
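
The value returned is a mean absolute error: the average of the absolute differences between actual and predicted values on the test rows, in the same units as `price`. A small worked example with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

actual  = np.array([326, 334, 335, 400])
predict = np.array([300, 350, 335, 420])

# Mean of the absolute differences: (26 + 16 + 0 + 20) / 4.
by_hand = np.mean(np.abs(actual - predict))
by_sklearn = mean_absolute_error(actual, predict)

print(by_hand, by_sklearn)
```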

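## Next steps

But the numeric-only pipeline discards the information in the `object` columns (`cut`, `color` and `clarity`). One-hot encoding turns each of these columns into zero/one columns that a linear model can use.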

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

In [0]:
from sklearn.preprocessing import OneHotEncoder

In [27]:
get_initial_pdf().head()

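|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |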


In [28]:
(OneHotEncoder()
 .fit(get_initial_pdf()[['cut']])
 .transform(get_initial_pdf()[['cut']])
 .todense()
)

matrix([[0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 1., 0., 0., 0.],
        ...,
        [0., 0., 0., 0., 1.],
        [0., 0., 0., 1., 0.],
        [0., 0., 1., 0., 0.]])

In [29]:
get_initial_pdf()['cut'].value_counts()

Ideal        21551
Premium      13791
Very Good    12082
Good          4906
Fair          1610
Name: cut, dtype: int64
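
To see which output column corresponds to which `cut` value, the fitted encoder's `categories_` attribute can be inspected. The sketch below uses a five-row toy frame with one row per `cut` value rather than the full dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy_pdf = pd.DataFrame({'cut': ['Ideal', 'Premium', 'Good', 'Very Good', 'Fair']})

encoder = OneHotEncoder().fit(toy_pdf[['cut']])

# categories_ holds one array per input column, sorted alphabetically for strings.
cut_categories = list(encoder.categories_[0])

# Output column i is 1 exactly when the row's value equals cut_categories[i].
encoded = encoder.transform(toy_pdf[['cut']]).toarray()

print(cut_categories)
print(encoded[0])
```

With this ordering, `Ideal` is the third category, which is consistent with the first row of the matrix above having its 1 in the third position.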

In [0]:
from sklearn.compose import ColumnTransformer

In [31]:
(ColumnTransformer(transformers=[('ohe',OneHotEncoder(),['cut','color','clarity'])],
                   remainder='passthrough'
                  )
 .fit(get_initial_pdf())
 .transform(get_initial_pdf())
 [0]
)

array([0.00e+00, 0.00e+00, 1.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
       1.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
       0.00e+00, 0.00e+00, 0.00e+00, 1.00e+00, 0.00e+00, 0.00e+00,
       0.00e+00, 0.00e+00, 2.30e-01, 6.15e+01, 5.50e+01, 3.26e+02,
       3.95e+00, 3.98e+00, 2.43e+00])

In [32]:
for object_var in ['cut','color','clarity']:
  print(get_initial_pdf()[object_var].value_counts())

Ideal        21551
Premium      13791
Very Good    12082
Good          4906
Fair          1610
Name: cut, dtype: int64
G    11292
E     9797
F     9542
H     8304
D     6775
I     5422
J     2808
Name: color, dtype: int64
SI1     13065
VS2     12258
SI2      9194
VS1      8171
VVS2     5066
VVS1     3655
IF       1790
I1        741
Name: clarity, dtype: int64


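Good: the three `object` columns have 5 + 7 + 8 = 20 distinct values in total, and the first 20 entries of the numpy array produced by the above transformation are zeros and ones. The remaining 7 entries are the numeric columns passed through unchanged.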
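
The same count check can be written as code. The sketch below runs it on a made-up three-row frame with two `object` columns, so the expected numbers are smaller than for the diamonds dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

toy_pdf = pd.DataFrame({'carat': [0.23, 0.21, 0.29],
                        'cut':   ['Ideal', 'Premium', 'Good'],
                        'color': ['E', 'E', 'I'],
                        'price': [326, 326, 334]})

object_columns = ['cut', 'color']
transformer = ColumnTransformer(
    transformers=[('ohe', OneHotEncoder(), object_columns)],
    remainder='passthrough')
encoded = transformer.fit_transform(toy_pdf)

# One output column per distinct value in the encoded columns...
n_indicator = sum(toy_pdf[name].nunique() for name in object_columns)
# ...followed by the untouched remainder columns.
n_passthrough = toy_pdf.shape[1] - len(object_columns)

print(encoded.shape, n_indicator, n_passthrough)
```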

In [33]:
fea_pipe = Pipeline([('ren',RenameColumnsPDF(rename_dict={'price':'target'})),
                     ('flt',FilterColumnsPDF(include_dtype_list=['int','float']))
                    ]
                   )
workflow(initial_pdf       =get_initial_pdf(),
         transformer_object=fea_pipe,
         estimator_object  =LinearRegression(), # supply an appropriate estimator object
         metric_function   =mean_absolute_error # supply an appropriate metric function
        )

882.1290491891065
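
This run uses the same numeric-only pipeline as before, yet the error differs (887.90 earlier, 882.13 here). A likely reason is that `train_test_split` shuffles the rows randomly when no `random_state` is given, and `workflow` does not pass one. The sketch below shows the effect on a toy frame. The helper `get_test_index` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

toy_pdf = pd.DataFrame({'x': np.arange(20), 'target': np.arange(20) * 2})

def get_test_index(**kwargs):
    # Return the row labels that land in the test split.
    _, X_test, _, _ = train_test_split(toy_pdf.drop(columns='target'),
                                       toy_pdf['target'],
                                       **kwargs)
    return list(X_test.index)

# With a fixed random_state the split is the same on every call.
fixed_a = get_test_index(random_state=42)
fixed_b = get_test_index(random_state=42)

# Without it, each call draws a new shuffle, so the test rows usually differ.
free_a = get_test_index()
free_b = get_test_index()

print(fixed_a == fixed_b, free_a == free_b)
```

Since `get_train_test_dict` forwards its keyword arguments to `train_test_split`, one possible change (not made in this notebook) would be to pass a `random_state` through it so that scores from different pipelines are computed on the same split.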

In [34]:
fea_pipe = Pipeline([('ren',RenameColumnsPDF(rename_dict={'price':'target'}))])
est_pipe = Pipeline([('ohe',ColumnTransformer(transformers=[('ohe',
                                                             OneHotEncoder(),
                                                             ['cut','color','clarity'])],
                                              remainder='passthrough')),
                     ('lin', LinearRegression())
                    ]
                   )
workflow(initial_pdf       =get_initial_pdf(),
         transformer_object=fea_pipe,
         estimator_object  =est_pipe,
         metric_function   =mean_absolute_error 
        )

741.1661661104931
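
Placing the encoder inside the estimator pipeline means it is fit on the train rows only. One consequence is that a category appearing only in the test rows is unknown to the encoder. The sketch below illustrates this on made-up data and shows the `handle_unknown='ignore'` option of `OneHotEncoder` as one way to tolerate it. The helper `make_pipe` is hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

train_x = pd.DataFrame({'carat': [0.2, 0.3, 0.5, 0.7],
                        'cut':   ['Good', 'Ideal', 'Good', 'Ideal']})
train_y = pd.Series([300, 500, 900, 1500])
# 'Fair' never appears in the training rows.
test_x = pd.DataFrame({'carat': [0.4], 'cut': ['Fair']})

def make_pipe(encoder):
    # Same shape as est_pipe: encode 'cut', pass the rest through, then regress.
    return Pipeline([
        ('ohe', ColumnTransformer(transformers=[('ohe', encoder, ['cut'])],
                                  remainder='passthrough')),
        ('lin', LinearRegression())])

# Default encoder: an unseen category raises an error at predict time.
try:
    make_pipe(OneHotEncoder()).fit(train_x, train_y).predict(test_x)
    strict_failed = False
except ValueError:
    strict_failed = True

# handle_unknown='ignore': the unseen category is encoded as all zeros.
lenient_pipe = make_pipe(OneHotEncoder(handle_unknown='ignore'))
prediction = lenient_pipe.fit(train_x, train_y).predict(test_x)

print(strict_failed, prediction.shape)
```

For the diamonds dataset every category is frequent enough that this is unlikely to matter, but it is worth keeping in mind for columns with rare values.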

__The End__