<a href="https://colab.research.google.com/github/data-space/datalab-notebooks/blob/master/Python/4.%20Workflows/1.0%20Introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Machine Learning Workflows

## Reference
- https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Old references
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- https://scikit-learn.org/stable/datasets/index.html
- https://machinelearningmastery.com/time-series-forecasting-supervised-learning/
- https://machinelearningmastery.com/time-series-datasets-for-machine-learning/
- https://plot.ly/python/time-series/
- https://www.plotly.express/plotly_express/

## Table of Contents
1. Introduction
1. Setup


## Introduction

Supervised machine learning has the general goal of creating a model/method to make _good_ predictions on unseen data. 

Machine learning workflows for supervised learning organize the steps that accomplish this goal. These steps are:
1. Get the raw dataset
2. Create prepared dataset (from the raw dataset)
3. Create train and test datasets (from the prepared dataset)
4. Fit model (to the train dataset using a machine learning algorithm)
5. Make predictions (on the test dataset using the fit model)
6. Evaluate predictions (made by the model on the test dataset)

Our focus is on preparing the data in step 2 and determining the algorithm in step 4. In this notebook, all steps will be described in words and implemented in Python code for a couple of simple datasets. 

Later notebooks will implement this workflow with more involved datasets, which will provide the opportunity to focus on steps 2 and 4. 

The essential requirement of the workflow is that prediction accuracy is estimated only once and only on unseen data. Specifically, the test dataset should be used only in steps 5 and 6, and these two steps should be run only once. 

In addition, no information from the test dataset should be available when the model is fit to the train dataset. The test dataset is used only to make and evaluate predictions. 
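
One way to keep test information out of the fitting step is to put any data-dependent preprocessing inside a scikit-learn `Pipeline` (the class in the Reference section) and fit the whole pipeline on the train rows only. The sketch below is a minimal illustration under assumptions: the `StandardScaler` step, the step names, and the `random_state` values are illustrative choices and are not part of the workflow defined in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# The scaler's mean and variance are learned inside fit(),
# so they are computed from the train rows only.
leak_free_model = Pipeline(steps=[
    ('scale', StandardScaler()),                      # illustrative step
    ('classify', LogisticRegression(max_iter=1000)),
])
leak_free_model.fit(X_train, y_train)

# Check that the scaler never saw the test rows.
scaler_mean = leak_free_model.named_steps['scale'].mean_
print(np.allclose(scaler_mean, X_train.mean(axis=0)))

# The test rows are used exactly once, here.
test_accuracy = leak_free_model.score(X_test, y_test)
print(test_accuracy)
```

Fitting the scaler on the full dataset before splitting would let the test rows influence the scaling, which is the kind of leakage the requirement above rules out.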

The __Setup__ section loads three libraries and displays their version numbers. Following this is a section for each of the workflow steps.

## Setup

Import the `pandas`,  `numpy` and `sklearn` libraries. 

In [0]:
import pandas  as pd
import numpy   as np
import sklearn as sk

Display the version numbers of the numpy, pandas and scikit-learn packages:

In [0]:
print('numpy  :',np.__version__)
print('pandas :',pd.__version__)
print('sklearn:',sk.__version__)

numpy  : 1.16.4
pandas : 0.24.2
sklearn: 0.21.3


## Workflow steps (iris demo)

### Step 1. Get raw dataframe

This step is implemented with a single Python function (`get_initial_pdf`) that reads data from its source and returns a pandas dataframe. The function for this example retrieves the iris dataset from the `sklearn.datasets` module, concatenates the features and target into a single dataframe, and returns this dataframe. 

In [0]:
def get_initial_pdf():
  import pandas as pd
  from sklearn.datasets import load_iris
  iris = load_iris()  # load the dataset once and reuse it

  # snake_case feature names without parentheses, e.g. 'sepal_length_cm'
  iris_feature_columns = [feature_name.replace(' ','_')
                                      .replace('(','')
                                      .replace(')','') 
                          for feature_name in iris.get('feature_names')]
  
  iris_features_pdf = pd.DataFrame(data=iris.data,
                                   columns=iris_feature_columns
                                  )
  # replace the integer class codes 0, 1, 2 with the species names
  iris_target_pdf = pd.DataFrame(data={'species': iris.target}) \
                      .replace(to_replace={n:iris.target_names[n]
                                           for n in [0,1,2]}) \
                      .astype('object')
  iris_pdf = pd.concat([iris_features_pdf, iris_target_pdf],
                       axis='columns',
                       join='inner')
  return iris_pdf

Notice that the column names have been changed to use snake case, the parentheses have been removed from the feature column names, and the target column has been named `species` and its values replaced with strings. 

In [0]:
initial_pdf = get_initial_pdf()

In [0]:
initial_pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_length_cm    150 non-null float64
sepal_width_cm     150 non-null float64
petal_length_cm    150 non-null float64
petal_width_cm     150 non-null float64
species            150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB


In [0]:
initial_pdf['species'].value_counts()

virginica     50
versicolor    50
setosa        50
Name: species, dtype: int64

The `get_initial_pdf` function returns a pandas dataframe whose columns have the correct datatypes. This dataframe will be passed to the next step (data preparation).

### Step 2. Create prepared dataframe

This step takes as input the raw dataframe produced by the previous step and returns a dataframe with a target variable and predictor variables. In general this step consists of per row transformations of the raw dataframe, which __do not__ use aggregate data summarized across the entire dataframe. 

In this example, the only work involved is to
- drop rows where `species` is equal to `setosa`
- label the `species` variable as the `target` variable

Below, this work is done by a _transformer object_ of the (transformer) class `PreparePDF`. The object is created by calling `PreparePDF()` and is passed to the function `get_prepared_pdf`. Using a transformer class may not seem useful now, but it will be later when things get more complicated. 

In [0]:
from sklearn.base import BaseEstimator, TransformerMixin
class PreparePDF(BaseEstimator, TransformerMixin):  
  def __init__(self):
    return
    
  def fit(self, X, y=None): 
    return self

  def transform(self, X, y=None): 
    return X.loc[lambda pdf: pdf.species!='setosa'] \
            .rename(columns={'species':'target'})


In [0]:
def get_prepared_pdf(pdf, transformer_object): 
  return transformer_object.fit_transform(pdf)

The `fit_transform` method of this object is called with the raw dataframe as input. 

In [0]:
prepared_pdf = get_prepared_pdf(initial_pdf, 
                                PreparePDF())

In [0]:
prepared_pdf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 50 to 149
Data columns (total 5 columns):
sepal_length_cm    100 non-null float64
sepal_width_cm     100 non-null float64
petal_length_cm    100 non-null float64
petal_width_cm     100 non-null float64
target             100 non-null object
dtypes: float64(4), object(1)
memory usage: 4.7+ KB


In [0]:
prepared_pdf['target'].value_counts()

virginica     50
versicolor    50
Name: target, dtype: int64
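
Step 2 is restricted to per-row transformations because their results do not depend on which other rows are present, so they can safely run before the train/test split. The sketch below contrasts a per-row transformation with one that uses an aggregate. The dataframe, its column names, and the two helper functions are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the distinction
sample_pdf = pd.DataFrame({'length_cm': [1.0, 2.0, 3.0, 4.0],
                           'width_cm' : [0.5, 1.0, 1.0, 2.0]})

def add_ratio(pdf):
  # Per-row: each output value depends only on its own row
  return pdf.assign(ratio=lambda p: p.length_cm / p.width_cm)

def center_length(pdf):
  # Aggregate-based: every output value depends on the mean of all rows
  return pdf.assign(length_centered=lambda p: p.length_cm - p.length_cm.mean())

first_two = sample_pdf.iloc[:2]

# Per-row results are identical whether or not the other rows are present
ratio_is_stable = add_ratio(first_two).equals(add_ratio(sample_pdf).iloc[:2])

# Aggregate-based results change when the set of rows changes
centered_is_stable = center_length(first_two).equals(center_length(sample_pdf).iloc[:2])

print(ratio_is_stable, centered_is_stable)  # True False
```

A transformation like `center_length` belongs after the split, fit on the train dataset only, because its result on the train rows would otherwise depend on the test rows.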

### Step 3. Create train and test datasets

This is a simple, but important step in the process. The processes of creating the model and evaluating the model must be kept separate. Otherwise there is no reason to believe that the results for predictions on unseen data would be similar to your results on the test dataset. 

The train dataset will be used to create the model, then predictions will be made from the predictor variables in the test dataset, and then these predictions will be compared to the target variable in the test dataset.

In [0]:
def get_train_test_dict(pdf, target_name, **kwargs):
  from sklearn.model_selection import train_test_split
  X_train, X_test, y_train, y_test = train_test_split(pdf.drop(columns=target_name),
                                                      pdf[target_name], 
                                                      **kwargs)
  return {
      'x_train': X_train,
      'y_train': y_train,
      'x_test' : X_test,
      'y_test' : y_test
  }

In [0]:
train_test_dict = get_train_test_dict(prepared_pdf,'target',test_size=0.2)

In [0]:
train_test_dict.keys()

dict_keys(['x_train', 'y_train', 'x_test', 'y_test'])

In [0]:
[(name, type(val), val.shape) for (name,val) in train_test_dict.items()]

[('x_train', pandas.core.frame.DataFrame, (80, 4)),
 ('y_train', pandas.core.series.Series, (80,)),
 ('x_test', pandas.core.frame.DataFrame, (20, 4)),
 ('y_test', pandas.core.series.Series, (20,))]
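
The split above is random, so the rows in each dataset (and the accuracy reported later) can differ from run to run. A minimal sketch of a reproducible, class-balanced split follows, using the `random_state` and `stratify` parameters of `train_test_split`. The value `42` is an arbitrary choice, and the two-class iris dataframe is rebuilt here only to keep the example self-contained.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
example_pdf = pd.DataFrame(iris.data, columns=iris.feature_names) \
                .assign(target=iris.target_names[iris.target]) \
                .loc[lambda pdf: pdf.target != 'setosa']

X_train, X_test, y_train, y_test = train_test_split(
    example_pdf.drop(columns='target'),
    example_pdf['target'],
    test_size=0.2,
    random_state=42,                 # arbitrary seed: same split on every run
    stratify=example_pdf['target'],  # keep the class proportions in both parts
)

print(y_train.value_counts().to_dict())
print(y_test.value_counts().to_dict())
```

Because `get_train_test_dict` forwards its keyword arguments to `train_test_split`, the same two parameters can be passed to it directly, for example `get_train_test_dict(prepared_pdf, 'target', test_size=0.2, random_state=42, stratify=prepared_pdf['target'])`.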

### Step 4. Fit model 

(from the train dataset using a machine learning algorithm)

In [0]:
def get_fit_model(estimator_object, x_train, y_train):
  return estimator_object.fit(X=x_train,
                              y=y_train)

In [0]:
from sklearn.linear_model import LogisticRegression

fit_model = get_fit_model(estimator_object=LogisticRegression(),
                          x_train         =train_test_dict.get('x_train'),
                          y_train         =train_test_dict.get('y_train'),
                         )
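
When several algorithms are candidates for step 4, they can be compared without touching the test dataset by cross-validating on the train dataset only. The sketch below is one possible approach, not the method used in this notebook. The candidate list, the 5-fold setting, and the `random_state` values are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Illustrative candidates; any scikit-learn classifiers could be listed here
candidates = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'decision_tree'      : DecisionTreeClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy, computed on the train rows only
cv_accuracy = {name: cross_val_score(estimator, X_train, y_train, cv=5).mean()
               for name, estimator in candidates.items()}

best_name = max(cv_accuracy, key=cv_accuracy.get)

# Refit the chosen estimator on the whole train dataset
fit_model = candidates[best_name].fit(X_train, y_train)

print(cv_accuracy)
print(best_name)
```

The test dataset is still untouched at this point, so steps 5 and 6 can be run once on the selected model.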



### Step 5. Make predictions 

(on the test dataset using the model)

In [0]:
def get_predict_ser(model, x_test):
  return model.predict(X=x_test)

In [0]:
predict_ser = get_predict_ser(model =fit_model, 
                              x_test=train_test_dict.get('x_test')
                             )


### Step 6. Evaluate predictions


In [0]:
def get_actual_predict_pdf(actual,predict):
  import pandas as pd
  return pd.DataFrame(data={'actual' : actual,
                            'predict': predict},
                      index=actual.index)

In [0]:
actual_predict_pdf = get_actual_predict_pdf(                             
                              train_test_dict.get('y_test'),
    get_predict_ser(fit_model,train_test_dict.get('x_test'))
) 
actual_predict_pdf.head()

| | actual | predict |
|---|---|---|
| 81 | versicolor | versicolor |
| 94 | versicolor | versicolor |
| 75 | versicolor | versicolor |
| 124 | virginica | virginica |
| 100 | virginica | virginica |


In [0]:
def get_actual_predict_eval(pdf, func):
  return func(pdf.actual, 
              pdf.predict)

In [0]:
from sklearn.metrics import accuracy_score

get_actual_predict_eval(actual_predict_pdf,
                        accuracy_score
                       )

0.9
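
Accuracy is a single number, so it does not show which class the errors fall in. Since `get_actual_predict_eval` accepts any function of the form `func(actual, predict)`, other metrics from `sklearn.metrics` can be used the same way. The sketch below computes a confusion matrix on a small set of made-up labels. The data is hypothetical and is not the output of the model above.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels standing in for actual_predict_pdf
example_pdf = pd.DataFrame({
    'actual' : ['versicolor', 'versicolor', 'virginica', 'virginica', 'virginica'],
    'predict': ['versicolor', 'virginica',  'virginica', 'virginica', 'versicolor'],
})

labels = ['versicolor', 'virginica']

accuracy = accuracy_score(example_pdf.actual, example_pdf.predict)

# Rows are the actual classes, columns are the predicted classes
matrix_pdf = pd.DataFrame(
    data=confusion_matrix(example_pdf.actual, example_pdf.predict, labels=labels),
    index=['actual_' + label for label in labels],
    columns=['predict_' + label for label in labels],
)

print(accuracy)
print(matrix_pdf)
```

In this made-up example, three of the five predictions are correct, with one error in each direction.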

## Workflow steps (iris condensed)

In [0]:
initial_pdf      = get_initial_pdf()

prepared_pdf = get_prepared_pdf(pdf               =initial_pdf, 
                                transformer_object=PreparePDF()
                               )

train_test_dict = get_train_test_dict(pdf        =prepared_pdf,
                                      target_name='target'
                                     )

from sklearn.linear_model import LogisticRegression

fit_model = get_fit_model(estimator_object=LogisticRegression(),
                          x_train         =train_test_dict.get('x_train'),
                          y_train         =train_test_dict.get('y_train'),
                         )

actual_predict_pdf = get_actual_predict_pdf(                             
                              train_test_dict.get('y_test'),
    get_predict_ser(fit_model,train_test_dict.get('x_test'))
) 

from sklearn.metrics import accuracy_score

get_actual_predict_eval(actual_predict_pdf,
                        accuracy_score
                       )

## Workflow - Boston housing dataset

To configure this workflow for the Boston housing dataset:
1. Define the `get_initial_pdf` function to return the Boston housing dataset as a single pandas dataframe. 
1. There is no work to do to prepare the Boston housing dataframe, so modify the `PreparePDF` transformer class so that its `transform` method returns the `X` input parameter unchanged. 
1. Pass a `LinearRegression` estimator object to the `estimator_object` parameter of the `get_fit_model` function. Import the class too. 
1. Pass `mean_absolute_error` to the second parameter `func` of the `get_actual_predict_eval` function. Import the function too. 

Define `get_initial_pdf` to return the Boston housing dataset as a single pandas dataframe:

In [0]:
def get_initial_pdf(): # for boston housing dataset
  # load_boston was removed in scikit-learn 1.2, so this cell needs an older
  # release such as the 0.21.3 shown in the Setup section
  import pandas as pd
  from sklearn.datasets import load_boston
  boston = load_boston()  # load the dataset once and reuse it
  feature_names = [name.lower() 
                   for name in boston.get('feature_names').tolist()
                  ]
  features_pdf = pd.DataFrame(data=boston.get('data'),
                              columns=feature_names
                             )
  target_pdf = pd.DataFrame(data={'target': boston.get('target')}
                           )
  return pd.concat([features_pdf, target_pdf],
                   axis='columns',
                   join='inner')

In [0]:
from sklearn.base import BaseEstimator, TransformerMixin
class PreparePDF(BaseEstimator, TransformerMixin):  
  def __init__(self):
    return
    
  def fit(self, X, y=None): 
    return self

  def transform(self, X, y=None): 
    return X

In [0]:
initial_pdf  = get_initial_pdf() # read the initial dataset

prepared_pdf = get_prepared_pdf(pdf               =initial_pdf, 
                                transformer_object=PreparePDF() # preprocessing class
                               )

train_test_dict = get_train_test_dict(pdf        =prepared_pdf,
                                      target_name='target' # column name of target
                                     )
from sklearn.linear_model import LinearRegression # estimator object

fit_model = get_fit_model(estimator_object=LinearRegression(), # estimator object
                          x_train         =train_test_dict.get('x_train'),
                          y_train         =train_test_dict.get('y_train'),
                         )


actual_predict_pdf = get_actual_predict_pdf(                             
                              train_test_dict.get('y_test'),
    get_predict_ser(fit_model,train_test_dict.get('x_test'))
) 

from sklearn.metrics import mean_absolute_error # evaluation metric 

get_actual_predict_eval(actual_predict_pdf,
                        mean_absolute_error # evaluation metric 
                       )
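
The `load_boston` function was removed in scikit-learn 1.2, so the cells above only run on older releases. As a hedged alternative, the sketch below applies the same regression configuration to the diabetes dataset from `sklearn.datasets`. The dataset swap, the function name `get_diabetes_pdf`, and the `random_state` value are assumptions made for this illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def get_diabetes_pdf():  # hypothetical stand-in for the Boston get_initial_pdf
  diabetes = load_diabetes()
  features_pdf = pd.DataFrame(data=diabetes.data,
                              columns=diabetes.feature_names)
  return features_pdf.assign(target=diabetes.target)

diabetes_pdf = get_diabetes_pdf()

X_train, X_test, y_train, y_test = train_test_split(
    diabetes_pdf.drop(columns='target'),
    diabetes_pdf['target'],
    test_size=0.2,
    random_state=0,  # arbitrary seed for a repeatable split
)

# Same estimator and metric as the Boston configuration
regression_model = LinearRegression().fit(X_train, y_train)
mae = mean_absolute_error(y_test, regression_model.predict(X_test))

print(diabetes_pdf.shape)
print(mae)
```

Only the data-loading function changes. The estimator, the metric, and the order of the workflow steps stay the same as in the Boston version.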

## Datasets

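First, plot the iris dataset used in the demonstration.

In [0]:
def get_iris_pdf():
  # rebuild the iris dataframe, since get_initial_pdf now returns the Boston data
  from sklearn.datasets import load_iris
  iris = load_iris()
  columns = [name.replace(' ','_').replace('(','').replace(')','')
             for name in iris.feature_names]
  return pd.DataFrame(data=iris.data, columns=columns) \
           .assign(species=iris.target_names[iris.target])

def plot_iris_pdf():
  import plotly.express as px
  fig = px.scatter(get_iris_pdf(), x='sepal_length_cm', y='sepal_width_cm', color='species')
  fig.show()
plot_iris_pdf()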

### Shampoo dataset

In [0]:
def get_shampoo_pdf():
  import pandas as pd
  shampoo_url='https://raw.githubusercontent.com/jbrownlee/Datasets/master/shampoo.csv'
  return pd.read_csv(shampoo_url)

print(get_shampoo_pdf().info())
get_shampoo_pdf().head()

In [0]:
def plot_shampoo_pdf():
  import plotly.express as px
  import pandas as pd
  fig = px.line(get_shampoo_pdf(), x='Month', y='Sales')
  fig.show()
  
plot_shampoo_pdf()

### Daily temperatures dataset

In [0]:
import pandas as pd
daily_temps_url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv'
daily_temps_pdf = pd.read_csv(daily_temps_url)
print(daily_temps_pdf.info())
daily_temps_pdf.head()

In [0]:
def get_daily_temps_pdf():
  daily_temps_url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv'
  return pd.read_csv(daily_temps_url)

In [0]:
def plot_daily_temps_pdf():
  import plotly.express as px
  import pandas as pd
  fig = px.line(get_daily_temps_pdf(), x='Date', y='Temp')
  fig.show()
plot_daily_temps_pdf()

## Time series

In [0]:
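def train_test_split_timeseries(pdf, train_pct=0.8, start_test=None, target_column_name=None):
  # Split a dataframe with a DatetimeIndex into train and test sets by time
  assert target_column_name is not None, "target_column_name cannot be None"
  import pandas as pd
  pdf = pdf.sort_index()
  if start_test is not None:
    start_test_dt = pd.to_datetime(start_test)
  else: # no date given: the test set is the last (1 - train_pct) of the rows
    start_test_dt = pdf.index[int(len(pdf) * train_pct)]
  assert start_test_dt <= pdf.index.max(), "start_test is after the last row"
  train_pdf = pdf.loc[pdf.index <  start_test_dt] # rows before start_test
  test_pdf  = pdf.loc[pdf.index >= start_test_dt] # rows from start_test onward
  return {'train_x': train_pdf.drop(columns=target_column_name),
          'train_y': train_pdf.loc[:,[target_column_name]],
          'test_x' : test_pdf.drop(columns=target_column_name),
          'test_y' : test_pdf.loc[:,[target_column_name]],
         }

## Time series datasets

In [0]:
# assumption: use the daily temperatures data as the sample time series
sample_timeseries_pdf = get_daily_temps_pdf() \
                          .assign(Date=lambda pdf: pd.to_datetime(pdf.Date)) \
                          .set_index('Date')
train_test_split_timeseries(sample_timeseries_pdf, target_column_name='Temp')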
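
For time series, the train rows must come before the test rows in time, otherwise the model is fit on information from the future. scikit-learn provides `TimeSeriesSplit` for this kind of ordered splitting. It is not used elsewhere in this notebook, and the small synthetic series below is made up for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical daily series with made-up values
dates = pd.date_range('2020-01-01', periods=10, freq='D')
series_pdf = pd.DataFrame({'value': np.arange(10.0)}, index=dates)

splitter = TimeSeriesSplit(n_splits=3)

folds = []
for train_positions, test_positions in splitter.split(series_pdf):
  train_pdf = series_pdf.iloc[train_positions]
  test_pdf  = series_pdf.iloc[test_positions]
  # Every train date is earlier than every test date in the same fold
  folds.append((train_pdf.index.max(), test_pdf.index.min()))
  print(len(train_pdf), len(test_pdf),
        train_pdf.index.max().date(), test_pdf.index.min().date())
```

Each fold extends the train window forward in time and tests on the rows that immediately follow it, which mirrors how a forecasting model would be used in practice.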

In [0]:
# Using graph_objects
import plotly.graph_objects as go

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/finance-charts-apple.csv')

fig = go.Figure([go.Scatter(x=df['Date'], y=df['AAPL.High'])])
fig.show()

__The End__