
# Introduction to Machine Learning Workflows

## Reference
- https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Old references
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- https://scikit-learn.org/stable/datasets/index.html
- https://machinelearningmastery.com/time-series-forecasting-supervised-learning/
- https://machinelearningmastery.com/time-series-datasets-for-machine-learning/
- https://plot.ly/python/time-series/
- https://www.plotly.express/plotly_express/

## Table of Contents
1. Introduction
1. Setup


## Introduction

Supervised machine learning has the general goal of creating a model/method to make _good_ predictions on unseen data. 

Machine learning workflows for supervised learning organize the steps that accomplish this goal. These steps are:
1. Get the initial dataset
2. Create a feature-target dataset (from the initial dataset)
3. Create train and test datasets (from the feature-target dataset)
4. Fit model (to the train dataset)
5. Make and evaluate predictions (made by the fit model on the test dataset)

There are four components (that you provide) as input to the workflow:
1. The initial dataset
2. The process to create the feature-target dataset
3. The process to fit the model
4. The choice of metric to use in evaluating the model predictions

Our focus will be on the processes to prepare the data in step 2 and to fit the model in step 4. 

In this notebook, these steps will be described and implemented in Python code and run with two simple datasets. 

Later notebooks will implement this workflow with more involved datasets, which will provide the opportunity to focus on steps 2 and 4. 

There are a few requirements of the workflow. In general, information used to evaluate the model should not be available when creating the model. Specifically, 
- the train dataset should be used to fit the model
- the test dataset should be used to evaluate the model
- step 5 should only be run once
- the feature-target dataset should be created using only per-row transformations and have predictor columns only of numeric types 

The first two requirements will be implemented in the code below. The last two requirements will be discussed in later notebooks. 
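To make the per-row requirement more concrete, here is a minimal sketch. The toy dataframe and the column name `width_cm` are made up for illustration and are not part of the iris example. A unit conversion depends only on the row itself, whereas mean-centering depends on every row in the dataframe, so its result for a given row changes when other rows are added or removed.

```python
import pandas as pd

# Hypothetical toy data; not part of the iris example.
pdf = pd.DataFrame({'width_cm': [1.0, 2.0, 3.0, 4.0]})

def per_row(pdf):
  # Uses only the values in each row: cm -> mm.
  return pdf.assign(width_mm=pdf['width_cm'] * 10)

def uses_aggregate(pdf):
  # Uses the mean of the whole column, so each row depends on all rows.
  return pdf.assign(width_centered=pdf['width_cm'] - pdf['width_cm'].mean())

first_two = pdf.head(2)

# Per-row: transforming a subset equals subsetting the transformed dataframe.
same_per_row = per_row(first_two).equals(per_row(pdf).head(2))

# Aggregate: the values for the same rows change once other rows are removed.
same_aggregate = uses_aggregate(first_two).equals(uses_aggregate(pdf).head(2))

print(same_per_row, same_aggregate)
```

The first comparison holds and the second does not, which is why an aggregate-based transformation applied before the train/test split lets information from test rows influence the training rows.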

The takeaways from this notebook are:
- workflow steps
- workflow components

The requirements will be revisited in later notebooks.

The __Setup__ section loads three libraries and displays their version numbers. Following this is a section for each of the workflow steps.

## Setup

Import the `pandas`,  `numpy` and `sklearn` libraries. 

In [0]:
import pandas  as pd
import numpy   as np
import sklearn as sk

Display the version numbers of the numpy, pandas and scikit-learn packages:

In [2]:
print('numpy  :',np.__version__)
print('pandas :',pd.__version__)
print('sklearn:',sk.__version__)

numpy  : 1.16.4
pandas : 0.24.2
sklearn: 0.21.3


## Workflow steps (demonstration)

### Step 1. Get initial dataframe

This step is implemented with a single Python function (here `get_iris_pdf`) that reads data from its source and returns a pandas dataframe (`pdf`). The function for this example retrieves the iris dataset from the `sklearn.datasets` module and returns the features and target concatenated into a single dataframe. 

In [0]:
def get_iris_pdf():
  import pandas as pd
  from sklearn.datasets import load_iris
  iris_features     = load_iris().data
  iris_target       = load_iris().target
  iris_target_names = load_iris().target_names

  iris_feature_columns = [feature_name.replace(' ','_')
                                      .replace('(','')
                                      .replace(')','') 
                          for feature_name in load_iris().get('feature_names')]
  
  iris_features_pdf = pd.DataFrame(data=iris_features,
                                   columns=iris_feature_columns
                                  )
  iris_target_pdf = pd.DataFrame(data={'species': iris_target}) \
                      .replace(to_replace={n:iris_target_names[n]
                                           for n in [0,1,2]}) \
                      .astype('object')
  iris_pdf = pd.concat([iris_features_pdf, iris_target_pdf],
                       axis='columns',
                       join='inner')
  return iris_pdf

Notice that the column names have been changed to use snake case, the parentheses have been removed from the feature column names, and the target column has been named `species` and its values replaced with strings. 
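The renaming can be seen in isolation with a short sketch that applies the same chain of `replace` calls to the four feature names that `load_iris()` provides (typed out here so the cell runs without loading the dataset):

```python
# Feature names as provided by load_iris().feature_names
raw_names = ['sepal length (cm)', 'sepal width (cm)',
             'petal length (cm)', 'petal width (cm)']

# Spaces become underscores; parentheses are removed.
clean_names = [name.replace(' ', '_').replace('(', '').replace(')', '')
               for name in raw_names]

print(clean_names)
```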

Store the data frame in `initial_pdf` for input to the next step. 

In [0]:
initial_pdf = get_iris_pdf()

Notice that the datatypes look correct. 

In [5]:
initial_pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_length_cm    150 non-null float64
sepal_width_cm     150 non-null float64
petal_length_cm    150 non-null float64
petal_width_cm     150 non-null float64
species            150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB


In [6]:
initial_pdf.head()

| | sepal_length_cm | sepal_width_cm | petal_length_cm | petal_width_cm | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [7]:
initial_pdf['species'].value_counts()

virginica     50
setosa        50
versicolor    50
Name: species, dtype: int64

The `get_iris_pdf` function returns a pandas dataframe with columns that have correct datatypes, which is the primary concern at this step. This dataframe will be passed to the next step, which creates a feature-target dataframe.

### Step 2. Create feature-target dataframe by preprocessing initial dataframe

This step takes as input the initial dataframe produced by the previous step and returns a dataframe with a target variable and predictor variables. In general, this step consists of per-row transformations of the initial dataframe, which __do not__ use aggregate data summarized across the entire dataframe. Later notebooks will focus on this constraint.

In addition, it is essential that the target does not include any missing values.
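One way to check this condition is sketched below on a made-up dataframe (the iris data has no missing targets, so the example is hypothetical). Dropping the affected rows is only one possible remedy and is shown here as an assumption, not as a step this notebook performs.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with one missing target value.
pdf = pd.DataFrame({'x': [1.0, 2.0, 3.0],
                    'target': ['a', np.nan, 'b']})

# Count the rows whose target is missing.
n_missing = pdf['target'].isna().sum()

# Keep only the rows that have a target value.
clean_pdf = pdf.dropna(subset=['target'])

print(n_missing, len(clean_pdf))
```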

In this example, the only work involved is to
- drop rows where `species` is equal to `setosa`
- label the `species` variable as the `target` variable

Below, this work is done by a _transformer object_ of class `IrisFeaTgtPDF`, which is passed to the function `get_feature_target_pdf`. Using a transformer class doesn't seem useful now, but it will later when things get more complicated. 

In [0]:
from sklearn.base import BaseEstimator, TransformerMixin
class IrisFeaTgtPDF(BaseEstimator, TransformerMixin):  
  def __init__(self):
    return
    
  def fit(self, X, y=None): 
    return self

  def transform(self, X, y=None): 
    return X.loc[lambda pdf: pdf.species!='setosa'] \
            .rename(columns={'species':'target'})


The `get_feature_target_pdf` function passes the pandas dataframe `pdf` to the `fit` and `transform` methods of the `transformer_object`.

In [0]:
def get_feature_target_pdf(pdf, transformer_object): 
  return transformer_object.fit(pdf).transform(pdf)
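As a side note, `TransformerMixin` also supplies a `fit_transform` method that chains the two calls. The sketch below uses a made-up transformer class (`RenameTarget`) and a made-up two-row dataframe to suggest that both spellings give the same result:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class RenameTarget(BaseEstimator, TransformerMixin):  # hypothetical example class
  def fit(self, X, y=None):
    return self

  def transform(self, X, y=None):
    return X.rename(columns={'price': 'target'})

pdf = pd.DataFrame({'carat': [0.3, 0.5], 'price': [400, 900]})

chained  = RenameTarget().fit(pdf).transform(pdf)
shortcut = RenameTarget().fit_transform(pdf)  # method supplied by TransformerMixin

print(chained.equals(shortcut))
```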

This function is called with the initial dataframe as input. 

In [0]:
feature_target_pdf = get_feature_target_pdf(initial_pdf, 
                                            IrisFeaTgtPDF()
                                           )

In [11]:
feature_target_pdf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 50 to 149
Data columns (total 5 columns):
sepal_length_cm    100 non-null float64
sepal_width_cm     100 non-null float64
petal_length_cm    100 non-null float64
petal_width_cm     100 non-null float64
target             100 non-null object
dtypes: float64(4), object(1)
memory usage: 4.7+ KB


In [12]:
feature_target_pdf['target'].value_counts()

virginica     50
versicolor    50
Name: target, dtype: int64

### Step 3. Create train and test datasets

This is a simple but important step in the process. The two steps, creating the model (step 4) and evaluating the model (step 5), must use different datasets. Step 4 uses the train dataset and step 5 uses the test dataset. Otherwise there is no reason to believe that the results for predictions on unseen data would be similar to your results. In particular, 
- The model is created from the train dataset
- Predictions are made from the predictor variables of the test dataset
- These predictions will be compared to the target variable of the test dataset

The Scikit-learn library makes available the `train_test_split` function that creates train and test datasets. This function is _wrapped_ below to create a function that takes as input a dataframe and the name of the target variable and returns a dictionary of results. 

In [0]:
def get_train_test_dict(pdf, target_name, **kwargs):
  from sklearn.model_selection import train_test_split
  X_train, X_test, y_train, y_test = train_test_split(pdf.drop(columns=target_name),
                                                      pdf[target_name], 
                                                      **kwargs)
  return {
      'x_train': X_train,
      'y_train': y_train,
      'x_test' : X_test,
      'y_test' : y_test
  }

In [0]:
train_test_dict = get_train_test_dict(feature_target_pdf,'target')

The function returns a dictionary with one key-value pair for each dataset.

In [15]:
train_test_dict.keys()

dict_keys(['x_train', 'y_train', 'x_test', 'y_test'])

Notice the differences in datatype and shape of the output. 


In [16]:
[(name, type(val), val.shape) for (name,val) in train_test_dict.items()]

[('x_train', pandas.core.frame.DataFrame, (75, 4)),
 ('y_train', pandas.core.series.Series, (75,)),
 ('x_test', pandas.core.frame.DataFrame, (25, 4)),
 ('y_test', pandas.core.series.Series, (25,))]
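The 75/25 shapes above come from the default settings of `train_test_split`. Because the wrapper forwards `**kwargs`, other settings can be passed through it, for example `get_train_test_dict(feature_target_pdf, 'target', test_size=0.2)`. The sketch below calls `train_test_split` directly on a made-up balanced dataframe to show two such arguments; the data and the chosen values are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced two-class dataframe, 50 rows per class.
pdf = pd.DataFrame({'x': range(100),
                    'target': ['a'] * 50 + ['b'] * 50})

X_train, X_test, y_train, y_test = train_test_split(
    pdf.drop(columns='target'),
    pdf['target'],
    test_size=0.2,            # 20% of rows for testing instead of the default 25%
    stratify=pdf['target'])   # keep the class proportions equal in both parts

print(X_train.shape, X_test.shape)
print(y_test.value_counts().to_dict())
```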

### Step 4. Fit model 

A _model_ is _fit_ on a training dataset so that it (the fit model) can be used to make predictions on unseen datasets (that contain the same predictor columns as the training dataset). 

Linear regression and logistic regression are common models. When they are fit to a training dataset, a coefficient is determined for each predictor column/variable. These coefficients are then used to make a prediction from any row that has values for these predictor variables.

This is accomplished in Python using the Scikit-learn library with estimator objects that have `fit` and `predict` methods. The `fit` method takes as input the `x_train` predictor dataframe and the `y_train` target series. 

In [0]:
def get_fit_model(estimator_object, x_train, y_train):
  return estimator_object.fit(X=x_train,
                              y=y_train)

The `get_fit_model` function (defined above) is called with the estimator object `LogisticRegression()` and with the predictor training dataframe and the target training series. The function returns the fit model. 

In [18]:
from sklearn.linear_model import LogisticRegression

fit_model = get_fit_model(estimator_object=LogisticRegression(),
                          x_train         =train_test_dict.get('x_train'),
                          y_train         =train_test_dict.get('y_train'),
                         )
fit_model



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
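The coefficients mentioned at the start of this step are stored on the fit model. The sketch below uses a made-up one-column dataset rather than the iris data, so the numbers it prints are not those of the model above; it only shows where the fitted values live (`classes_`, `coef_`, `intercept_`) and that they drive `predict`.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature dataset: small values are 'low', large values are 'high'.
x_train = pd.DataFrame({'x': [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]})
y_train = pd.Series(['low', 'low', 'low', 'high', 'high', 'high'])

model = LogisticRegression().fit(x_train, y_train)

print(model.classes_)    # class labels, in sorted order
print(model.coef_)       # one coefficient per predictor column
print(model.intercept_)

# Predictions for two new rows, one near each end of the range.
new_rows = pd.DataFrame({'x': [0.5, 4.5]})
predictions = model.predict(new_rows)
print(predictions)
```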

### Step 5. Make and evaluate predictions 

After a model has been fit to the training dataset, it can be used to make predictions with its `predict` method. The `get_predict_ser` function takes as input a fit model and a predictor dataframe and returns the predictions. (Scikit-learn's `predict` returns a NumPy array rather than a pandas series.) 

In [0]:
def get_predict_ser(model, x_test):
  return model.predict(X=x_test)

As the name of the `x_test` parameter suggests, its value should be the test predictor dataframe. As mentioned above, it is important to use different datasets for model evaluation (the test dataset) and for model fitting (the train dataset). 

Create the predictions for the test predictor dataframe. 

In [0]:
predict_ser = get_predict_ser(model =fit_model, 
                              x_test=train_test_dict.get('x_test')
                             )

The `get_actual_predict_pdf` function (below) simply creates a dataframe with one column for actual values and one column for predicted values. In addition, the dataframe index is set to be the same as the index of the actual values (target test series). 

In [0]:
def get_actual_predict_pdf(actual,predict):
  import pandas as pd
  return pd.DataFrame(data={'actual' : actual,
                            'predict': predict},
                      index=actual.index)
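The index handling matters because the test target keeps the row labels it had before the split. A minimal sketch with made-up labels and values: a NumPy array of predictions is matched to the rows by position, whereas a pandas series with its own default index would be matched by label and would not line up.

```python
import numpy as np
import pandas as pd

# Hypothetical test target whose index is not 0..n-1, as after a random split.
actual  = pd.Series(['a', 'b', 'a'], index=[90, 53, 99])
predict = np.array(['a', 'b', 'b'])  # predictions as a NumPy array

# An array is matched to the rows by position, so it takes on actual's index.
aligned = pd.DataFrame({'actual': actual, 'predict': predict},
                       index=actual.index)

# A Series with a default index (0, 1, 2) is matched by label instead,
# and none of those labels exist in actual's index.
misaligned = pd.DataFrame({'actual': actual, 'predict': pd.Series(predict)},
                          index=actual.index)

print(aligned['predict'].tolist())
print(misaligned['predict'].isna().all())
```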

This function is run with input of the test target series and the predictions made from the test predictor dataframe. 

In [0]:
actual_predict_pdf = get_actual_predict_pdf(train_test_dict.get('y_test'),
                                            get_predict_ser(fit_model,train_test_dict.get('x_test'))
                                           ) 


Notice the values and the count of differences. 

In [23]:
actual_predict_pdf.head()

| | actual | predict |
|---|---|---|
| 90 | versicolor | versicolor |
| 53 | versicolor | versicolor |
| 99 | versicolor | versicolor |
| 94 | versicolor | versicolor |
| 69 | versicolor | versicolor |


In [24]:
(    actual_predict_pdf['actual']
 .ne(actual_predict_pdf['predict'])
 .sum()
)

2

The `get_actual_predict_eval` function takes as input a metric function and a dataframe with columns named `actual` and `predict`. It returns the result of applying the metric to the two columns. 

In [0]:
def get_actual_predict_eval(actual_predict_pdf, metric_function):
  return metric_function(actual_predict_pdf['actual'], 
                         actual_predict_pdf['predict']
                        )

In [26]:
from sklearn.metrics import accuracy_score

get_actual_predict_eval(actual_predict_pdf=actual_predict_pdf,
                        metric_function   =accuracy_score
                       )

0.92
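The accuracy is the fraction of rows where the prediction equals the actual value, so it ties back to the count of differences: 2 mismatches in 25 rows leave 23/25 = 0.92. The sketch below reproduces that arithmetic on made-up labels with the same counts (not the actual test rows) and compares it with `accuracy_score`.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical labels: 25 rows with 2 mismatches, the same counts as above.
actual  = pd.Series(['versicolor'] * 13 + ['virginica'] * 12)
predict = actual.copy()
predict.iloc[[0, 24]] = ['virginica', 'versicolor']  # flip two predictions

n_wrong    = actual.ne(predict).sum()
by_hand    = 1 - n_wrong / len(actual)
by_sklearn = accuracy_score(actual, predict)

print(n_wrong, by_hand, by_sklearn)
```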

## Workflow (essentials)

The code cell below collects only the essential commands to implement the workflow. 

In [27]:
initial_pdf      = get_iris_pdf()

feature_target_pdf = get_feature_target_pdf(pdf               =initial_pdf, 
                                            transformer_object=IrisFeaTgtPDF()
                                           )

train_test_dict = get_train_test_dict(pdf        =feature_target_pdf,
                                      target_name='target'
                                     )

from sklearn.linear_model import LogisticRegression

fit_model = get_fit_model(estimator_object=LogisticRegression(),
                          x_train         =train_test_dict.get('x_train'),
                          y_train         =train_test_dict.get('y_train'),
                         )

actual_predict_pdf = get_actual_predict_pdf(                             
                              train_test_dict.get('y_test'),
    get_predict_ser(fit_model,train_test_dict.get('x_test'))
) 

from sklearn.metrics import accuracy_score

get_actual_predict_eval(actual_predict_pdf,
                        accuracy_score
                       )



0.96
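This value differs from the 0.92 obtained in step 5 because `train_test_split` shuffles the rows randomly on each call unless a seed is supplied, so each run evaluates on a different test dataset. A minimal sketch on made-up data, assuming a fixed `random_state` is acceptable for your use; the helper `split_index` is hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

pdf = pd.DataFrame({'x': range(20), 'target': [0, 1] * 10})  # hypothetical data

def split_index(**kwargs):
  # Return the row labels that end up in the test part.
  _, X_test, _, _ = train_test_split(pdf[['x']], pdf['target'], **kwargs)
  return X_test.index.tolist()

# With a fixed seed, two calls select the same test rows.
repeatable = split_index(random_state=42) == split_index(random_state=42)
print(repeatable)
```

The same argument can be forwarded through the wrapper, for example `get_train_test_dict(feature_target_pdf, 'target', random_state=42)`.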

## Workflow (function)

### Standard/common functions

In [0]:
def get_feature_target_pdf(pdf, transformer_object): 
  return transformer_object.fit(pdf).transform(pdf)

In [0]:
def get_train_test_dict(pdf, target_name, **kwargs):
  from sklearn.model_selection import train_test_split
  X_train, X_test, y_train, y_test = train_test_split(pdf.drop(columns=target_name),
                                                      pdf[target_name], 
                                                      **kwargs)
  return {
      'x_train': X_train,
      'y_train': y_train,
      'x_test' : X_test,
      'y_test' : y_test
  }

In [0]:
def get_fit_model(estimator_object, x_train, y_train):
  return estimator_object.fit(X=x_train,
                              y=y_train)

In [0]:
def get_predict_ser(model, x_test):
  return model.predict(X=x_test)

In [0]:
def get_actual_predict_pdf(actual,predict):
  import pandas as pd
  return pd.DataFrame(data={'actual' : actual,
                            'predict': predict},
                      index=actual.index)

In [0]:
def get_actual_predict_eval(actual_predict_pdf, metric_function):
  return metric_function(actual_predict_pdf['actual'], 
                         actual_predict_pdf['predict']
                        )

In [0]:
def workflow(initial_pdf,
             transformer_object,
             estimator_object,
             metric_function
            ):
  feature_target_pdf = \
  get_feature_target_pdf(pdf               =initial_pdf, 
                         transformer_object=transformer_object
                        )

  train_test_dict = \
  get_train_test_dict(pdf        =feature_target_pdf,
                      target_name='target'
                     )

  fit_model = \
  get_fit_model(estimator_object=estimator_object,
                x_train         =train_test_dict.get('x_train'),
                y_train         =train_test_dict.get('y_train'),
               )

  actual_predict_pdf = \
  get_actual_predict_pdf(train_test_dict.get('y_test'),
                         get_predict_ser(fit_model,
                                         train_test_dict.get('x_test')
                                        )
                        ) 

  return get_actual_predict_eval(actual_predict_pdf,
                                 metric_function
                                )

### Your functions (specific to your dataset)

In [0]:
def get_iris_pdf():
  import pandas as pd
  from sklearn.datasets import load_iris
  iris_features     = load_iris().data
  iris_target       = load_iris().target
  iris_target_names = load_iris().target_names

  iris_feature_columns = [feature_name.replace(' ','_')
                                      .replace('(','')
                                      .replace(')','') 
                          for feature_name in load_iris().get('feature_names')]
  
  iris_features_pdf = pd.DataFrame(data=iris_features,
                                   columns=iris_feature_columns
                                  )
  iris_target_pdf = pd.DataFrame(data={'species': iris_target}) \
                      .replace(to_replace={n:iris_target_names[n]
                                           for n in [0,1,2]}) \
                      .astype('object')
  iris_pdf = pd.concat([iris_features_pdf, iris_target_pdf],
                       axis='columns',
                       join='inner')
  return iris_pdf

In [0]:
from sklearn.base import BaseEstimator, TransformerMixin
class IrisFeaTgtPDF(BaseEstimator, TransformerMixin):  
  def __init__(self):
    return
    
  def fit(self, X, y=None): 
    return self

  def transform(self, X, y=None): 
    return X.loc[lambda pdf: pdf.species!='setosa'] \
            .rename(columns={'species':'target'})


In [37]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics      import accuracy_score

workflow(initial_pdf       =get_iris_pdf(),
         transformer_object=IrisFeaTgtPDF(),
         estimator_object  =LogisticRegression(),
         metric_function   =accuracy_score
        )




1.0

## Workflow - Boston housing dataset

To configure this workflow for the Boston housing dataset:
1. Define the `get_boston_housing_pdf` function to return the Boston housing dataset as a single pandas dataframe. 
1. No preparation is needed beyond naming the target, so define the `BostonHousingFeaTgtPDF` transformer class so that its `transform` method returns the `X` input with the `price` column renamed to `target`. 
1. Pass `LinearRegression()` as the `estimator_object` parameter of the `workflow` function, since the target is numeric. 
1. Import `mean_absolute_error` and pass it as the `metric_function` parameter of the `workflow` function. 

Define `get_boston_housing_pdf` to return the features and the target (`price`) concatenated into a single dataframe:

In [0]:
def get_boston_housing_pdf(): # for boston housing dataset
  # load_boston was removed in scikit-learn 1.2, so this cell needs an
  # older release (this notebook was run with 0.21.3)
  import pandas as pd
  from sklearn.datasets import load_boston
  boston = load_boston()  # load once and reuse below
  feature_names = [name.lower() 
                   for name in boston.get('feature_names').tolist()
                  ]
  features_pdf = pd.DataFrame(data=boston.get('data'),
                              columns=feature_names
                             )
  target_pdf = pd.DataFrame(data={'price': boston.get('target')}
                           )
  return pd.concat([features_pdf, target_pdf],
                   axis='columns',
                   join='inner')

In [0]:
from sklearn.base import BaseEstimator, TransformerMixin
class BostonHousingFeaTgtPDF(BaseEstimator, TransformerMixin):  
  def __init__(self):
    return
    
  def fit(self, X, y=None): 
    return self

  def transform(self, X, y=None): 
    return X.rename(columns={'price':'target'})


In [40]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics      import mean_absolute_error

workflow(initial_pdf       =get_boston_housing_pdf(),
         transformer_object=BostonHousingFeaTgtPDF(),
         estimator_object  =LinearRegression(),
         metric_function   =mean_absolute_error
        )


3.748539807410097
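The mean absolute error is the average size of the prediction errors, expressed in the units of the target (for the Boston housing target, thousands of dollars), and lower is better. A minimal sketch of the calculation on made-up numbers, not the actual test rows:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted prices, in the units of the target.
actual  = np.array([24.0, 21.6, 34.7, 33.4])
predict = np.array([30.0, 25.0, 30.6, 28.6])

by_hand    = np.abs(actual - predict).mean()   # average of |actual - predict|
by_sklearn = mean_absolute_error(actual, predict)

print(by_hand, by_sklearn)
```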

## Workflow - Diamonds dataset

In [0]:
def get_diamonds_pdf():
  import pandas as pd
  diamonds_file_link = 'https://raw.githubusercontent.com/datalab-datasets/file-samples/master/diamonds.csv'
  return pd.read_csv(diamonds_file_link) \
           .drop('Unnamed: 0',
                 axis=1)

In [0]:
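from sklearn.base import BaseEstimator, TransformerMixin
class DiamondsFeaTgtPDF(BaseEstimator, TransformerMixin):  
  def __init__(self):
    return
    
  def fit(self, X, y=None): 
    return self

  def transform(self, X, y=None): 
    # Keep only numeric columns (assumption: the CSV has the usual diamonds
    # layout with string columns such as cut, color and clarity, which
    # LinearRegression cannot fit), then label price as the target.
    return X.select_dtypes(include='number') \
            .rename(columns={'price':'target'})


In [0]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics      import mean_absolute_error

# Run the workflow on the diamonds dataset
workflow(initial_pdf       =get_diamonds_pdf(),
         transformer_object=DiamondsFeaTgtPDF(),
         estimator_object  =LinearRegression(),
         metric_function   =mean_absolute_error
        )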

__The End__