# Python scikit-learn Machine Learning Workflow

In this example I use a simulated data set with seven features (X1-X7) that affect the target (Y) and six features (X8-X13) that have no effect on it.
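
The generator for the data file is not shown in this notebook. As a rough sketch only, a data set of this kind could be simulated as below. The intercept and coefficients are the true values listed near the end of this notebook and the row count matches `data.shape`; the standard-normal features, the noise level, the seed, and the names `X_sim` and `data_sim` are my assumptions.

```python
import numpy as np
import pandas as pd

# Assumptions: feature distributions, noise level and seed are illustrative
# only; the original generator is not shown in this notebook.
rng = np.random.default_rng(seed=42)
n_rows = 5000
feature_names = [f"X{i}" for i in range(1, 14)]

# True values stated in the notebook: X1-X7 affect Y, X8-X13 do not.
true_intercept = 69
true_coefs = np.array([7, -13, 6, -9, 3, 5, -8, 0, 0, 0, 0, 0, 0])

X_sim = pd.DataFrame(
    rng.normal(loc=0, scale=1, size=(n_rows, len(feature_names))),
    columns=feature_names,
)
noise = rng.normal(loc=0, scale=1, size=n_rows)

data_sim = X_sim.copy()
data_sim["Y"] = true_intercept + X_sim.to_numpy() @ true_coefs + noise

print(data_sim.shape)  # (5000, 14)
```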

pandas

- Use mask to replace outliers with NaN

scikit-learn

- Impute missing values (NaN) with averages
- Create training and testing data sets
- Perform linear regression for a dependent variable of type float (target) and independent variables of type float or integer (features)
- Create a workflow pipeline
- Perform n-fold cross-validation on the entire pipeline
- Perform feature selection
- Tune the hyperparameters of the model using grid search

## Setting up

In [32]:
# Import libraries
from datetime import datetime
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
import datasense as ds

In [33]:
start_time = datetime.now()

In [34]:
# Set global parameters
pd.options.display.max_rows = None
pd.options.display.max_columns = None
filename = 'lunch_and_learn.csv'
target = 'Y'
features = ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7',
            'X8', 'X9', 'X10', 'X11', 'X12', 'X13']

## Clean the data

In [35]:
# Read data file into a pandas DataFrame
data = pd.read_csv(filename)

In [36]:
# Describe information about the DataFrame
# ds.dataframe_info(data, filename)

In [37]:
# Check for missing values
data.isna().sum()

X1     0
X2     0
X3     0
X4     0
X5     0
X6     0
X7     0
X8     0
X9     0
X10    0
X11    0
X12    0
X13    0
Y      0
dtype: int64

In [38]:
# Determine the number of rows and columns
data.shape

(5000, 14)

In [39]:
# Describe the feature columns
# for column in features:
#     print(column)
#     result = ds.nonparametric_summary(data[column])
#     print(result, '\n')

In [40]:
# Set lower and upper values to remove outliers
mask_values = [
    ('X1', -20, 20),
    ('X2', -25, 25),
    ('X3', -5, 5),
    ('X4', -10, 10),
    ('X5', -3, 3),
    ('X6', -5, 5),
    ('X7', -13, 13),
    ('X8', -9, 15),
    ('X9', -17, 15),
    ('X10', -16, 15),
    ('X11', -16, 17),
    ('X12', -16, 17),
    ('X13', -20, 23)
]
# Replace outliers (values at or beyond the limits) with NaN
for column, lowvalue, highvalue in mask_values:
    data[column] = data[column].mask(
        (data[column] <= lowvalue) |
        (data[column] >= highvalue)
    )
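
To see what the masking rule does, here is a small standalone sketch on a made-up Series. The values and the names `s`, `low`, `high` are hypothetical. Note that the comparisons are inclusive, so values exactly equal to a limit are also replaced.

```python
import pandas as pd

# Hypothetical values chosen only to illustrate the masking rule.
s = pd.Series([-25.0, -20.0, 0.0, 19.9, 20.0, 31.0])
low, high = -20, 20

# mask() replaces entries with NaN wherever the condition is True.
masked = s.mask((s <= low) | (s >= high))

print(masked.tolist())  # [nan, nan, 0.0, 19.9, nan, nan]
```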

In [41]:
# Describe the feature columns
# for column in features:
#     print(column)
#     result = ds.nonparametric_summary(data[column])
#     print(result, '\n')

## Start of machine learning

In [42]:
# Create training and testing data sets
X = data[features]
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [43]:
# Determine the number of rows and columns
X.shape, y.shape, X_train.shape, X_test.shape, y_train.shape, y_test.shape

((5000, 13), (5000,), (3350, 13), (1650, 13), (3350,), (1650,))

In [44]:
# Check the DataFrame for missing values
data.isna().sum()

X1      4
X2      4
X3      4
X4      4
X5      4
X6      4
X7      6
X8     12
X9      4
X10     4
X11     4
X12     4
X13     4
Y       0
dtype: int64

## Start of pipeline

In [45]:
# Create the imputer object
imp = SimpleImputer()
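
As a minimal illustration of what the default imputer does, the sketch below fills NaN with column means on a tiny made-up DataFrame. The frame `df` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical two-column frame with one missing value per column.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})

imputer = SimpleImputer()  # the default strategy is "mean"
filled = imputer.fit_transform(df)

# The learned column means are stored in statistics_.
print(imputer.statistics_)  # [ 2. 15.]
print(filled)
```

The means are learned in `fit`, so inside a pipeline they are computed from the training folds only.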

In [46]:
# Create the column transformer object
ct = make_column_transformer(
    (imp, features),
    remainder='passthrough'
)

In [47]:
# Create a regression object to use for the feature selection object
linreg_selection = LinearRegression(fit_intercept=True)

In [48]:
# Create a feature selection object
selection = SelectFromModel(linreg_selection, threshold='median')
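
A short sketch of how `threshold='median'` behaves, using synthetic data that I made up for illustration (four features, only the first two influence the target). With a linear model, the importances are the absolute values of the coefficients, and features whose importance is at or above the median are kept.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): four features, of which
# only the first two influence the target.
rng = np.random.default_rng(seed=0)
X_demo = rng.normal(size=(200, 4))
y_demo = 5 * X_demo[:, 0] - 3 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

selector = SelectFromModel(LinearRegression(), threshold="median")
selector.fit(X_demo, y_demo)

# Importances are the absolute coefficients of the fitted estimator.
importances = np.abs(selector.estimator_.coef_)
print(importances.round(2))
print(selector.threshold_)     # median of the importances
print(selector.get_support())  # [ True  True False False]
```

With 13 features the median is the seventh-largest importance, so seven features are kept (barring ties), which is consistent with the seven features selected in this notebook.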

In [49]:
# Create the linear regression object
linreg = LinearRegression(fit_intercept=True)

In [50]:
# Create the workflow object
pipe = make_pipeline(ct, selection, linreg)

In [51]:
# Determine the linear regression model for the training data set
pipe.fit(X_train, y_train);

In [52]:
# Set the hyperparameters for optimization
# Note: the 'normalize' parameter of LinearRegression was deprecated in
# scikit-learn 1.0 and removed in 1.2, so this grid needs an older version
hyperparams = {
    'columntransformer__simpleimputer__strategy':
        ['mean', 'median', 'most_frequent', 'constant'],
    'selectfrommodel__threshold': [None, 'mean', 'median'],
    'linearregression__normalize': [False, True],
}
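
The hyperparameter keys follow the pattern `step__parameter`, where the step names are generated by `make_pipeline` and `make_column_transformer` from the lowercased class names. The sketch below rebuilds a pipeline of the same structure to list those names; `demo_ct`, `demo_pipe`, and the two placeholder column names are mine.

```python
from sklearn.compose import make_column_transformer
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Same structure as the notebook's pipeline; column names are placeholders.
demo_ct = make_column_transformer(
    (SimpleImputer(), ["X1", "X2"]),
    remainder="passthrough",
)
demo_pipe = make_pipeline(
    demo_ct,
    SelectFromModel(LinearRegression(), threshold="median"),
    LinearRegression(),
)

# Step names are the lowercased class names.
step_names = list(demo_pipe.named_steps)
print(step_names)

# get_params() lists every key that a grid search can tune.
tunable = sorted(demo_pipe.get_params())
print("columntransformer__simpleimputer__strategy" in tunable)  # True
print("selectfrommodel__threshold" in tunable)                  # True
```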

In [53]:
# Perform a grid search
grid = GridSearchCV(pipe, hyperparams, cv=5)
grid.fit(X_train, y_train);
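
For a sense of how much work the grid search does, the sketch below counts the candidate combinations with `ParameterGrid`. The dictionary repeats the grid defined above; `grid_spec`, `n_candidates`, and `n_folds` are names I introduce here.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook; ParameterGrid only enumerates combinations
# and does not check the keys against an estimator.
grid_spec = {
    "columntransformer__simpleimputer__strategy":
        ["mean", "median", "most_frequent", "constant"],
    "selectfrommodel__threshold": [None, "mean", "median"],
    "linearregression__normalize": [False, True],
}

n_candidates = len(ParameterGrid(grid_spec))
n_folds = 5

print(n_candidates)            # 24 combinations (4 * 3 * 2)
print(n_candidates * n_folds)  # 120 fits during cross-validation
```

By default `GridSearchCV` then refits the best candidate once on the full training set.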

In [54]:
# Present the results
pd.DataFrame(grid.cv_results_).sort_values('rank_test_score')

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_columntransformer__simpleimputer__strategy | param_linearregression__normalize | param_selectfrommodel__threshold | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.006726 | 0.000156 | 0.002551 | 5.8e-05 | mean | False | median | {'columntransformer__simpleimputer__strategy':... | 0.992061 | 0.989913 | 0.986155 | 0.9939 | 0.988982 | 0.990202 | 0.002649 | 1 |
| 5 | 0.006695 | 8.6e-05 | 0.00256 | 3.5e-05 | mean | True | median | {'columntransformer__simpleimputer__strategy':... | 0.992061 | 0.989913 | 0.986155 | 0.9939 | 0.988982 | 0.990202 | 0.002649 | 1 |
| 23 | 0.007288 | 0.001183 | 0.002692 | 0.000252 | constant | True | median | {'columntransformer__simpleimputer__strategy':... | 0.992084 | 0.989849 | 0.986147 | 0.993903 | 0.988996 | 0.990196 | 0.002656 | 3 |
| 20 | 0.008219 | 0.001564 | 0.003697 | 0.00065 | constant | False | median | {'columntransformer__simpleimputer__strategy':... | 0.992084 | 0.989849 | 0.986147 | 0.993903 | 0.988996 | 0.990196 | 0.002656 | 4 |
| 8 | 0.010751 | 0.001469 | 0.002969 | 0.000569 | median | False | median | {'columntransformer__simpleimputer__strategy':... | 0.992067 | 0.989829 | 0.986138 | 0.993902 | 0.988969 | 0.990181 | 0.002659 | 5 |
| 11 | 0.01159 | 0.001404 | 0.002992 | 0.000545 | median | True | median | {'columntransformer__simpleimputer__strategy':... | 0.992067 | 0.989829 | 0.986138 | 0.993902 | 0.988969 | 0.990181 | 0.002659 | 5 |
| 0 | 0.014632 | 0.003874 | 0.00551 | 0.001388 | mean | False | None | {'columntransformer__simpleimputer__strategy':... | 0.99156 | 0.989465 | 0.98536 | 0.993315 | 0.988622 | 0.989665 | 0.002704 | 7 |
| 3 | 0.006635 | 5.9e-05 | 0.002606 | 0.000174 | mean | True | None | {'columntransformer__simpleimputer__strategy':... | 0.99156 | 0.989465 | 0.98536 | 0.993315 | 0.988622 | 0.989665 | 0.002704 | 7 |
| 1 | 0.0067 | 0.000337 | 0.002474 | 3.5e-05 | mean | False | mean | {'columntransformer__simpleimputer__strategy':... | 0.99156 | 0.989465 | 0.98536 | 0.993315 | 0.988622 | 0.989665 | 0.002704 | 7 |
| 4 | 0.006606 | 8.1e-05 | 0.002419 | 5.1e-05 | mean | True | mean | {'columntransformer__simpleimputer__strategy':... | 0.99156 | 0.989465 | 0.98536 | 0.993315 | 0.988622 | 0.989665 | 0.002704 | 7 |


In [55]:
# Access the best score
grid.best_score_.round(3)

0.99

In [56]:
# Access the best hyperparameters
grid.best_params_

{'columntransformer__simpleimputer__strategy': 'mean',
 'linearregression__normalize': False,
 'selectfrommodel__threshold': 'median'}

In [57]:
# Display the regression coefficients of the features
pipe.named_steps.linearregression.coef_.round(3)

array([  7.   , -12.979,   6.116,  -9.03 ,   3.03 ,   5.107,  -8.108])

In [58]:
# Display the regression intercept
pipe.named_steps.linearregression.intercept_.round(3)

68.972

In [59]:
# Show the selected features
X.columns[selection.get_support()]

Index(['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7'], dtype='object')

In [60]:
# The true coefficients are:
# intercept = 69
# X1 = 7, X2 = -13, X3 = 6, X4 = -9
# X5 = 3, X6 = 5, X7 = -8
# X8 = 0, X9 = 0, X10 = 0, X11 = 0, X12 = 0, X13 = 0
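
To make the comparison with the true coefficients explicit, the sketch below tabulates the estimates printed above against the true values. The numbers are copied from the outputs in this notebook; the names `comparison` and `max_abs_diff` are mine.

```python
import pandas as pd

# Values copied from the notebook outputs above.
selected = ["X1", "X2", "X3", "X4", "X5", "X6", "X7"]
estimated = [7.0, -12.979, 6.116, -9.03, 3.03, 5.107, -8.108]
true_values = [7, -13, 6, -9, 3, 5, -8]

comparison = pd.DataFrame(
    {"estimated": estimated, "true": true_values},
    index=selected,
)
comparison["difference"] = (
    comparison["estimated"] - comparison["true"]
).round(3)

print(comparison)

# Largest absolute deviation from the true coefficients.
max_abs_diff = comparison["difference"].abs().max()
print(max_abs_diff)
```

The largest deviation is for X3 (6.116 estimated against a true value of 6), and the estimated intercept of 68.972 is close to the true value of 69.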

In [61]:
# Cross-validate the pipeline; its settings already match the best hyperparameters
cross_val_score(pipe, X_train, y_train, cv=5).mean().round(3)

0.99
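
When no `scoring` argument is given, `cross_val_score` calls the estimator's own `score` method, which for regressors such as `LinearRegression` is the coefficient of determination (R²). The sketch below checks this on synthetic data that I made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (an assumption for illustration).
rng = np.random.default_rng(seed=1)
X_cv = rng.normal(size=(300, 3))
y_cv = 2 * X_cv[:, 0] - X_cv[:, 1] + rng.normal(scale=0.5, size=300)

model = LinearRegression()

# Default scoring versus an explicit request for R-squared.
default_scores = cross_val_score(model, X_cv, y_cv, cv=5)
r2_scores = cross_val_score(model, X_cv, y_cv, cv=5, scoring="r2")

print(default_scores.round(3))
print(np.allclose(default_scores, r2_scores))  # True
```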

In [62]:
end_time = datetime.now()
round((end_time - start_time).total_seconds(), 3)

2.063

# References

## pandas

- [API](https://pandas.pydata.org/docs/reference/index.html)

- [isna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html)

- [mask](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mask.html#pandas.DataFrame.mask)

- [options.display.max_rows, options.display.max_columns](https://pandas.pydata.org/docs/reference/api/pandas.set_option.html#pandas.set_option)

- [read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv)

- [shape](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html)

## Python

- [library reference](https://docs.python.org/3/library/index.html)

- [datetime](https://docs.python.org/3/library/datetime.html)

## scikit-learn

- [API](https://scikit-learn.org/stable/modules/classes.html#)

- [compose module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.compose)

- [compose.make_column_transformer function](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html#sklearn.compose.make_column_transformer)

- [feature_selection module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection)

- [feature_selection.SelectFromModel class](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel)

- [feature_selection.SelectFromModel.get_support() method](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel.get_support)

- [impute module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute)

- [impute.SimpleImputer class](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer)

- [linear_model module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)

- [linear_model.LinearRegression class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)

- [linear models User Guide](https://scikit-learn.org/stable/modules/linear_model.html#linear-model)

- [linear regression example](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)

- [model_selection module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection)

- [model_selection.cross_val_score function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score)

- [model_selection.GridSearchCV class](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV)

- [model_selection.train_test_split function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split)

- [pipeline module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline)

- [pipeline.make_pipeline function](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline)

## Machine learning

- [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))

- [feature selection](https://en.wikipedia.org/wiki/Feature_selection)

- [hyperparameter optimization](https://en.wikipedia.org/wiki/Hyperparameter_optimization)

- [imputation](https://en.wikipedia.org/wiki/Imputation_(statistics))

- [linear regression](https://en.wikipedia.org/wiki/Linear_regression)