# Python scikit-learn Machine Learning Workflow 

# Outcomes

- Overview of scikit-learn for supervised machine learning
- Create training and testing data sets
- Create a workflow pipeline
    - Add a column transformer object
    - Add a feature selection object
    - Add a regression object
    - Fit a model
    - Set hyperparameters for tuning
    - Tune the model using grid search cross validation

# Setting up

In [1]:
# Import the libraries
from datetime import datetime
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso, LassoCV, LinearRegression
from sklearn.model_selection import (
    cross_val_score, GridSearchCV, train_test_split
)
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn import set_config
from xgboost import XGBRegressor

In [2]:
# Set the global parameters
pd.options.display.max_rows = None
pd.options.display.max_columns = None
filename = 'lunch_and_learn.csv'
numrows = 500
target = 'Y'
features = ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7',
            'X8', 'X9', 'X10', 'X11', 'X12', 'X13']
set_config(display='diagram')

# Cleaning the data

In [3]:
# Read the data file into a pandas DataFrame
data = pd.read_csv(filename, nrows=numrows)

In [4]:
# Set lower and upper values to remove outliers
mask_values = [
    ('X1', -20, 20),
    ('X2', -25, 25),
    ('X3', -5, 5),
    ('X4', -10, 10),
    ('X5', -3, 3),
    ('X6', -5, 5),
    ('X7', -13, 13),
    ('X8', -9, 15),
    ('X9', -17, 15),
    ('X10', -16, 15),
    ('X11', -16, 17),
    ('X12', -16, 17),
    ('X13', -20, 23)
]
# Replace outliers with NaN
for column, lowvalue, highvalue in mask_values:
    data[column] = data[column].mask(
        (data[column] <= lowvalue) |
        (data[column] >= highvalue)
    )
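
To make the masking step concrete, here is a minimal sketch on a hypothetical toy column (the values are invented for illustration; only the -5 and 5 bounds echo the X3 limits above). Note that the comparison is inclusive, so values exactly equal to a bound are also replaced with NaN.

```python
import pandas as pd

# Hypothetical toy column; the bounds mirror the (-5, 5) limits used for X3
toy = pd.DataFrame({'X3': [-7.0, -5.0, 0.5, 4.9, 5.0, 12.0]})
low, high = -5, 5

# mask() replaces values where the condition is True (NaN by default)
toy['X3'] = toy['X3'].mask((toy['X3'] <= low) | (toy['X3'] >= high))

n_missing = int(toy['X3'].isna().sum())
print(toy['X3'].tolist())
print(n_missing)  # 4 values fall on or outside the bounds
```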

# Splitting the data

In [5]:
# Create training and testing data sets
X = data[features]
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
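
As a rough illustration of what the split produces, the sketch below uses synthetic data (an assumption, since the notebook's CSV file is not available here) with the same `test_size` and `random_state`. It checks that the two subsets partition the rows and that a fixed `random_state` makes the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data: 500 rows, 3 features
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 3)), columns=['X1', 'X2', 'X3'])
y_demo = pd.Series(rng.normal(size=500), name='Y')

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# Roughly one third of the rows go to the testing set
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state gives the same rows
X_tr_again = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)[0]
print(X_tr.index.equals(X_tr_again.index))
```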

# Machine learning workflow

A typical workflow involves several sequential steps:

- Column transformation such as imputing missing values
- Feature selection
- Modeling

These steps are combined in a workflow object called a pipeline, which is simply a series of sequential steps. The output of each step is passed to the next step.
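
The idea that each step's output feeds the next can be sketched by comparing a manual two-step version with an equivalent pipeline. The data here are synthetic and the names `X_demo`, `y_demo`, and `pipe_demo` are illustrative, not part of the notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic data with a few missing values (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1]
X_demo[::10, 0] = np.nan

# Manual version: the imputer's output is passed to the regression by hand
imputer = SimpleImputer()
X_imputed = imputer.fit_transform(X_demo)
manual_pred = LinearRegression().fit(X_imputed, y_demo).predict(X_imputed)

# Pipeline version: the same two steps chained in one object
pipe_demo = make_pipeline(SimpleImputer(), LinearRegression())
pipe_pred = pipe_demo.fit(X_demo, y_demo).predict(X_demo)

print(np.allclose(manual_pred, pipe_pred))
```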

# Workflow 1

- Impute using the mean
- Select features using SelectFromModel(DecisionTreeRegressor)
- Fit with LinearRegression

In [6]:
start_time = datetime.now()

## Creating a column transformer

In [7]:
# Create the imputer object
imp = SimpleImputer()

In [8]:
# Create the column transformer object
ct = make_column_transformer(
     (imp, features),
     remainder='passthrough'
)
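
To see what the column transformer does, consider this small sketch on a hypothetical three-column frame. Columns `A` and `B` are imputed with the default mean strategy, while `C` is not listed and so is handled by `remainder='passthrough'`. In the notebook itself every feature is listed, so nothing is left to pass through.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer

# Toy frame: 'A' and 'B' are imputed, 'C' is passed through unchanged
toy = pd.DataFrame({
    'A': [1.0, np.nan, 3.0],
    'B': [np.nan, 4.0, 6.0],
    'C': [10.0, 20.0, 30.0],
})
ct_demo = make_column_transformer(
    (SimpleImputer(), ['A', 'B']),   # default strategy is 'mean'
    remainder='passthrough'
)

# Transformed columns come first, followed by the passed-through columns
result = np.asarray(ct_demo.fit_transform(toy))
print(result)
```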

## Creating a feature selection object

In [9]:
# Create objects to use for feature selection
linreg_selection = LinearRegression()
dtr_selection = DecisionTreeRegressor()
lasso_selection = Lasso()
lassocv_selection = LassoCV()
rfr_selection = RandomForestRegressor()

In [10]:
# Create the feature selection object
selection = SelectFromModel(estimator=dtr_selection,
                            threshold='median')
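
The following sketch shows how `threshold='median'` behaves, using synthetic data in which only the first feature drives the target (an assumption made for illustration). Features whose importance is at least the median importance are kept.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
# Only the first feature drives the target; the rest are noise
y_demo = 10 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

selector = SelectFromModel(
    estimator=DecisionTreeRegressor(random_state=0),
    threshold='median')
selector.fit(X_demo, y_demo)

# Importances come from the fitted tree; the mask marks the kept features
importances = selector.estimator_.feature_importances_
mask = selector.get_support()
print(importances.round(3))
print(mask)
```

Because the cut-off is the median, about half of the features are always kept, regardless of how weak they are. That is worth bearing in mind when reading the selected feature lists later in the notebook.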

## Creating a regression object

In [11]:
# Potential objects to use for regression
# linreg = LinearRegression()
# dtr = DecisionTreeRegressor()
# lasso = Lasso()
# lassocv = LassoCV()
# rfr = RandomForestRegressor()
# xgb = XGBRegressor()

In [12]:
# Create an object to use for regression
linreg = LinearRegression()

## Creating a workflow object

In [13]:
# Create the workflow object
pipe = make_pipeline(ct, selection, linreg)

In [14]:
# Fit the pipeline (imputation, feature selection, linear regression)
pipe.fit(X_train, y_train)

In [15]:
# Show the selected features 
X.columns[selection.get_support()]

Index(['X1', 'X2', 'X4', 'X5', 'X6', 'X7', 'X13'], dtype='object')

In [16]:
# Display the regression intercept
pipe.named_steps.linearregression.intercept_.round(3)

69.173

In [17]:
# Display the regression coefficients of the features
pipe.named_steps.linearregression.coef_.round(3)

array([  7.076, -12.948,  -9.146,   3.347,   4.631,  -8.16 ,  -0.061])

Cross validation estimates how accurately a predictive model will perform in practice. The training data set is split into folds. The model is fitted on all but one fold and tested on the held-out fold, and this is repeated so that each fold serves once as the test fold. Because the whole pipeline is passed to cross_val_score, the column transformations specified in the column transformer are fitted after each split, on the training folds only, which prevents data leakage.

In [18]:
# Cross-validate the updated pipeline
cross_val_score(pipe, X_train, y_train, cv=5).mean().round(3)

0.986
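
What `cross_val_score` does with a pipeline can be sketched as an explicit loop. This assumes synthetic data and relies on the fact that, for a regressor, an integer `cv` corresponds to an unshuffled `KFold`. In each fold a fresh copy of the pipeline is fitted on the training folds only, so the imputer never sees the held-out rows.

```python
import numpy as np
from sklearn.base import clone
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan  # about 10% missing values

pipe_demo = make_pipeline(SimpleImputer(), LinearRegression())
auto_scores = cross_val_score(pipe_demo, X_demo, y_demo, cv=5)

# Manual equivalent: refit an unfitted copy of the pipeline on each split
manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    fold_pipe = clone(pipe_demo)
    fold_pipe.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_pipe.score(X_demo[test_idx], y_demo[test_idx]))

print(np.allclose(auto_scores, manual_scores))
```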

## Hyperparameter optimization

Hyperparameters are values you set, such as 'mean' in SimpleImputer(). Parameters are values learned in the fit step. We tune the hyperparameters for all of the objects in the pipeline. You define the values to try for each hyperparameter. GridSearchCV performs cross-validation for every possible combination of these values for the entire pipeline.

In [19]:
# Determine the step names of the pipeline
pipe.named_steps.keys()

dict_keys(['columntransformer', 'selectfrommodel', 'linearregression'])

In [20]:
pipe.named_steps.columntransformer.get_params

<bound method ColumnTransformer.get_params of ColumnTransformer(remainder='passthrough',
                  transformers=[('simpleimputer', SimpleImputer(),
                                 ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7',
                                  'X8', 'X9', 'X10', 'X11', 'X12', 'X13'])])>

In [21]:
pipe.named_steps.selectfrommodel.get_params

<bound method BaseEstimator.get_params of SelectFromModel(estimator=DecisionTreeRegressor(), threshold='median')>

In [22]:
pipe.named_steps.linearregression.get_params

<bound method BaseEstimator.get_params of LinearRegression()>
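
The three cells above reference `get_params` without calling it, which is why the notebook displays the bound method. Calling it with parentheses returns a dictionary of hyperparameter names and values. A minimal sketch on a smaller, illustrative pipeline (`pipe_demo` is not part of the notebook):

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

pipe_demo = make_pipeline(SimpleImputer(), LinearRegression())

# Calling get_params() returns a dict; nested keys use <step>__<hyperparameter>
params = pipe_demo.get_params()
step_params = sorted(k for k in params if '__' in k)
print(step_params)

# set_params() accepts the same keys
pipe_demo.set_params(simpleimputer__strategy='median')
print(pipe_demo.named_steps.simpleimputer.strategy)
```

These double-underscore keys are the ones a grid search expects when naming hyperparameters to tune.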

In [23]:
# Set the hyperparameters for optimization
# Create a dictionary
# The dictionary key is the step name, followed by two underscores,
# followed by the hyperparameter name
# The dictionary value is the list of values to try for a hyperparameter
# There are 4 x 5 x 3 x 2 = 120 combinations
hyperparams = {}
hyperparams['columntransformer__simpleimputer__strategy'] =\
    ['mean', 'median', 'most_frequent', 'constant']
hyperparams['selectfrommodel__estimator'] =\
    [linreg_selection, dtr_selection, lasso_selection,
     lassocv_selection, rfr_selection]
hyperparams['selectfrommodel__threshold'] = [None, 'mean', 'median']
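# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2;
# omit this entry when running a newer version
hyperparams['linearregression__normalize'] = [False, True]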
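
The combination count in the comment can be checked with `ParameterGrid`, which enumerates the same grid a grid search would walk through. In this sketch, string labels stand in for the estimator objects (an assumption made to keep the example self-contained).

```python
from sklearn.model_selection import ParameterGrid

# String labels stand in for the estimator objects used in the notebook
demo_grid = {
    'columntransformer__simpleimputer__strategy':
        ['mean', 'median', 'most_frequent', 'constant'],
    'selectfrommodel__estimator':
        ['linreg', 'dtr', 'lasso', 'lassocv', 'rfr'],
    'selectfrommodel__threshold': [None, 'mean', 'median'],
    'linearregression__normalize': [False, True],
}

n_combinations = len(ParameterGrid(demo_grid))
n_fits = n_combinations * 5  # one fit per fold when cv=5
print(n_combinations, n_fits)  # 120 600
```

With five folds, each combination is fitted five times. That goes some way toward explaining why a grid search takes far longer than fitting a single pipeline.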

In [24]:
# Perform a grid search
grid = GridSearchCV(pipe, hyperparams, cv=5)
grid.fit(X_train, y_train);

In [25]:
# Present the results
pd.DataFrame(grid.cv_results_).sort_values('rank_test_score');

In [26]:
# Access the best score
grid.best_score_.round(3)

0.993

In [27]:
# Access the best hyperparameters
grid.best_params_

{'columntransformer__simpleimputer__strategy': 'mean',
 'linearregression__normalize': False,
 'selectfrommodel__estimator': LinearRegression(),
 'selectfrommodel__threshold': 'median'}
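
The best score reported above is a cross-validation score computed on the training data. As a hedged sketch of a possible next step, the tuned model can also be scored on the held-out testing set. The example below assumes synthetic data and a deliberately small grid; by default `GridSearchCV` refits the best combination on the full training set, so the fitted search object can be scored directly.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=300)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # about 5% missing values

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

pipe_demo = make_pipeline(SimpleImputer(), LinearRegression())
grid_demo = GridSearchCV(
    pipe_demo, {'simpleimputer__strategy': ['mean', 'median']}, cv=5)
grid_demo.fit(X_tr, y_tr)

# score() uses the refitted best pipeline; R^2 is the default regression score
test_r2 = grid_demo.score(X_te, y_te)
print(grid_demo.best_params_)
print(round(grid_demo.best_score_, 3), round(test_r2, 3))
```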

In [28]:
end_time = datetime.now()
round((end_time - start_time).total_seconds(), 3)

43.865

# Workflow 2

- Impute using the mean
- Select features using SelectFromModel(LinearRegression)
- Fit with LinearRegression

In [29]:
start_time = datetime.now()

## Creating a column transformer

In [30]:
# Create the imputer object
imp = SimpleImputer()

In [31]:
# Create the column transformer object
ct = make_column_transformer(
     (imp, features),
     remainder='passthrough'
)

## Creating a feature selection object

In [32]:
# Create objects to use for feature selection
linreg_selection = LinearRegression()

In [33]:
# Create the feature selection object
selection = SelectFromModel(estimator=linreg_selection,
                            threshold='median')

## Creating a regression object

In [34]:
# Create objects to use for regression
linreg = LinearRegression()

## Creating a workflow object

In [35]:
# Create the workflow object
pipe = make_pipeline(ct, selection, linreg)

In [36]:
# Fit the pipeline (imputation, feature selection, linear regression)
pipe.fit(X_train, y_train)

In [37]:
# Show the selected features 
X.columns[selection.get_support()]

Index(['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7'], dtype='object')

In [38]:
# Display the regression intercept
pipe.named_steps.linearregression.intercept_.round(3)

69.005

In [39]:
# Display the regression coefficients of the features
pipe.named_steps.linearregression.coef_.round(3)

array([  7.014, -13.01 ,   6.364,  -8.985,   3.36 ,   4.849,  -8.002])

In [40]:
# Cross-validate the updated pipeline
cross_val_score(pipe, X_train, y_train, cv=5).mean().round(3)

0.993

In [41]:
end_time = datetime.now()
round((end_time - start_time).total_seconds(), 3)

0.333

# References

## numpy

- [API](https://numpy.org/devdocs/reference/index.html)

## pandas

- [API](https://pandas.pydata.org/docs/reference/index.html)

- [isna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html)

- [mask](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mask.html#pandas.DataFrame.mask)

- [options.display.max_rows, options.display.max_columns](https://pandas.pydata.org/docs/reference/api/pandas.set_option.html#pandas.set_option)

- [read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv)

- [shape](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html)

## python

- [library reference](https://docs.python.org/3/library/index.html)

- [datetime](https://docs.python.org/3/library/datetime.html)

## scikit-learn

- [API](https://scikit-learn.org/stable/modules/classes.html#)

- [compose module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.compose)

- [compose.make_column_transformer function](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html#sklearn.compose.make_column_transformer)

- [ensemble module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble)

- [ensemble.RandomForestRegressor class](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor)

- [feature_selection module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection)

- [feature_selection.SelectFromModel class](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel)

- [feature_selection.SelectFromModel.get_support() method](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel.get_support)

- [feature_selection.SelectKBest class](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest)

- [feature_selection.f_regression scoring function](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html#sklearn.feature_selection.f_regression)

- [impute module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute)

- [impute SimpleImputer class](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer)

- [linear_model module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)

- [linear_model.Lasso class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso)

- [linear_model.LassoCV class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html#sklearn.linear_model.LassoCV)

- [linear_model.LinearRegression class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)

- [linear models User Guide](https://scikit-learn.org/stable/modules/linear_model.html#linear-model)

- [linear regression example](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)

- [model_selection module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection)

- [model_selection cross_val_score function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score)

- [model_selection GridSearchCV class](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV)

- [model_selection train_test_split function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split)

- [pipeline module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline)

- [pipeline.make_pipeline function](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline)

- [tree module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree)

- [tree.DecisionTreeRegressor class](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor)

## XGBoost

- [API](https://xgboost.readthedocs.io/en/latest/python/python_api.html)

- [XGBRegressor class](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn)


## Machine learning

- [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))

- [feature selection](https://en.wikipedia.org/wiki/Feature_selection)

- [hyperparameter optimization](https://en.wikipedia.org/wiki/Hyperparameter_optimization)

- [imputation](https://en.wikipedia.org/wiki/Imputation_(statistics))

- [linear regression](https://en.wikipedia.org/wiki/Linear_regression)

# Glossary

**ColumnTransformer** It is a class in scikit-learn that applies transformers to columns in a data set.

**Cross validation** It is a model validation technique for estimating how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a data set of known data on which training is performed (training data set) and a data set of unknown data (or first-seen data) against which the model is tested (testing data set). Cross validation tests the model's ability to predict new data that was not used in estimating it in order to identify problems such as overfitting or selection bias, and to give insight on how the model will generalize to an independent (unknown) data set. One round of cross validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (training set) and validating the analysis on the other subset (testing set). To reduce variability in the results, multiple rounds of cross validation are performed using different partitions and the validation results are combined (averaged) over the rounds to give an estimate of the model's predictive performance.

**Data leakage** It is inadvertently including knowledge from the testing data when training a model. The model will be less reliable. This may lead to incorrect decisions when tuning hyperparameters. This may lead to overestimating how well the model will perform on new data.

**Decision tree** It is a non-parametric supervised machine learning method used for classification and regression. It uses feature importance to determine potential features that could be in the model.

**Feature** It is an independent variable that is controlled in order to cause an outcome in the dependent variable.

**Feature selection** It is the process of selecting a set of features for a model. The data set probably contains features that are redundant or irrelevant, and can be removed with little effect on a model.

**Hyperparameter** It is a value you set before fitting a model, as opposed to a value learned from the data during fitting.

**Linear regression** It is a linear approach to modeling the relationship between a target and one or more features, using linear predictor functions where unknown model parameters are estimated from the data.

**Machine Learning** Machine learning algorithms build a mathematical model on a training data subset in order to make predictions on a test data subset. Various measures are used to compare the actual and predicted data in the test data subset to estimate the performance of the model.

**Mask** It is a pandas method that replaces values where a condition is true with another value, NaN by default.

**Parameter** It is a value learned during the model fitting process.

**Pipeline** A pipeline is a series of sequential steps. The output of each step is passed to the next step. It is a scikit-learn class that applies one or more transformations followed by a final estimator. The final estimator only needs to implement fit. The purpose is to assemble several steps that can be cross-validated together while setting different hyperparameters.

**Target** It is a dependent variable that represents the outcome resulting from altering features (independent variables).

**Testing data set** It is the data set used to evaluate the model. We pass its feature values to the fitted model to predict the target, then compare the predicted values with the actual target values.

**Training data set** It contains known values of the target. The model learns from these data, that is, we fit a model to estimate the relationship between the target and the features.

**Transformer** It is a scikit-learn object that transforms a column. For example, it can replace NaN with the average of all values in a column.

**Workflow** It is a repeatable pattern of activity, a sequence of operations. In scikit-learn, workflow is achieved through the Pipeline class.