# Python scikit-learn Machine Learning Workflow

## Outcomes

- Use scikit-learn to perform machine learning
- Create training and testing data sets
- Create a column transformer to impute missing values
- Apply various feature selection and modeling algorithms
    - Linear regression
    - Decision tree regression
    - Random forest
    - Extreme gradient boosting (XGBoost)
    - Lasso
    - SelectFromModel feature selection
    - SelectKBest feature selection
    - Cross validation
    - Grid search
- Create a workflow pipeline
- Perform n-fold cross validation on the entire pipeline
- Perform feature selection
- Tune the hyperparameters of the model using grid search

# Feature selection

Which features should you use to create a predictive model? Choosing them usually requires subject matter expertise, but algorithms can also select features automatically. Feature selection is the process of choosing the features (Xs) that contribute most to predicting the target (Y).

# Setting up

In [1]:
# Import libraries
from datetime import datetime
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, SelectFromModel,\
    f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso, LassoCV, LinearRegression
from sklearn.model_selection import cross_val_score, GridSearchCV,\
    train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
import datasense as ds

In [2]:
start_time = datetime.now()

In [3]:
# Set global parameters
pd.options.display.max_rows = None
pd.options.display.max_columns = None
filename = 'lunch_and_learn.csv'
target = 'Y'
features = ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7',
            'X8', 'X9', 'X10', 'X11', 'X12', 'X13']

In [4]:
# The true coefficients are:
# intercept = 69±3
# X1 = 7±1, X2 = -13±2, X3 = 6±0.5, X4 = -9±1.5
# X5 = 3±0.3, X6 = 5±0.5, X7 = -8±1.7
# X8 = 0±3, X9 = 0±3, X10 = 0±3, X11 = 0±3, X12 = 0±3, X13 = 0±3
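
The data file itself is not included here. As a stand-in, the sketch below generates a synthetic data set with the same shape and the nominal coefficients listed above. The feature distributions, the noise level, and the names `demo`, `X_demo`, and `true_coef` are assumptions made for illustration; the ± ranges in the comment are not modelled.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows = 5000
feature_names = [f'X{i}' for i in range(1, 14)]

# Nominal values copied from the comment above; X8 to X13 have no effect.
true_intercept = 69.0
true_coef = np.array([7, -13, 6, -9, 3, 5, -8, 0, 0, 0, 0, 0, 0], dtype=float)

# Assumption: standard normal features and unit-variance noise.
X_demo = pd.DataFrame(
    rng.normal(size=(n_rows, len(feature_names))),
    columns=feature_names
)
noise = rng.normal(scale=1.0, size=n_rows)

# Build the target as a linear combination of the features plus noise.
demo = X_demo.copy()
demo['Y'] = true_intercept + X_demo.to_numpy() @ true_coef + noise
print(demo.shape)
```

A frame built this way could be written with `demo.to_csv(...)` and read back in place of the original file, although its values will not match the outputs shown in this notebook.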

# Cleaning the data

In [5]:
# Read data file into a pandas DataFrame
data = pd.read_csv(filename)

In [6]:
# Describe information about the DataFrame
# ds.dataframe_info(data, filename)

In [7]:
# Determine the number of rows and columns
data.shape

(5000, 14)

In [8]:
# Check for missing values
data.isna().sum()

X1     0
X2     0
X3     0
X4     0
X5     0
X6     0
X7     0
X8     0
X9     0
X10    0
X11    0
X12    0
X13    0
Y      0
dtype: int64

In [9]:
# Describe the feature columns
# for column in features:
#     print(column)
#     result = ds.nonparametric_summary(data[column])
#     print(result, '\n')

In [10]:
# Set lower and upper values to remove outliers
mask_values = [
    ('X1', -20, 20),
    ('X2', -25, 25),
    ('X3', -5, 5),
    ('X4', -10, 10),
    ('X5', -3, 3),
    ('X6', -5, 5),
    ('X7', -13, 13),
    ('X8', -9, 15),
    ('X9', -17, 15),
    ('X10', -16, 15),
    ('X11', -16, 17),
    ('X12', -16, 17),
    ('X13', -20, 23)
]
# Replace outliers with NaN
for column, lowvalue, highvalue in mask_values:
    data[column] = data[column].mask(
        (data[column] <= lowvalue) |
        (data[column] >= highvalue)
    )
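
To see what the masking step does, here is a small sketch on a hypothetical Series (the values and the names `s`, `low`, and `high` are made up for illustration). Because the comparisons use `<=` and `>=`, values exactly at the limits are replaced with NaN as well.

```python
import pandas as pd

# Hypothetical values; -20 and 20 sit exactly on the limits.
s = pd.Series([-30.0, -20.0, 0.0, 5.0, 20.0, 40.0])
low, high = -20, 20

# mask() replaces entries with NaN wherever the condition is True.
masked = s.mask((s <= low) | (s >= high))
n_missing = int(masked.isna().sum())
print(masked.tolist())
print(n_missing)
```

Only 0.0 and 5.0 survive; the other four values become NaN and are left for the imputer to fill later.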

In [11]:
# Describe the feature columns
# for column in features:
#     print(column)
#     result = ds.nonparametric_summary(data[column])
#     print(result, '\n')

# Splitting the data

In [12]:
# Create training and testing data sets
X = data[features]
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [13]:
# Determine the number of rows and columns
X.shape, y.shape, X_train.shape, X_test.shape, y_train.shape, y_test.shape

((5000, 13), (5000,), (3350, 13), (1650, 13), (3350,), (1650,))

In [14]:
# Check the DataFrame for missing values
data.isna().sum()

X1      4
X2      4
X3      4
X4      4
X5      4
X6      4
X7      6
X8     12
X9      4
X10     4
X11     4
X12     4
X13     4
Y       0
dtype: int64

# Machine learning workflow

A typical workflow involves several sequential steps:

- Column transformation such as imputing missing values
- Feature selection
- Modeling

These steps are combined in a pipeline, which is simply a series of steps performed in sequence.
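
The three steps can be sketched as a single pipeline on synthetic data. The data and the names `X_demo`, `y_demo`, and `demo_pipe` are assumptions for illustration and are separate from the objects used in the rest of this notebook.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only the first two columns drive the target.
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)
# Introduce a few missing values in a column that does not affect y.
X_demo[::25, 2] = np.nan

demo_pipe = make_pipeline(
    SimpleImputer(strategy='mean'),          # column transformation
    SelectKBest(f_regression, k=2),          # feature selection
    LinearRegression()                       # modeling
)
demo_pipe.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name.
step_names = list(demo_pipe.named_steps)
r2 = demo_pipe.score(X_demo, y_demo)
print(step_names)
```

Calling `fit` runs each step in order, so the imputer sees the raw data, the selector sees the imputed data, and the model sees only the selected columns.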

# Creating a column transformer

In [15]:
# Create the imputer object
imp = SimpleImputer()

In [16]:
# Create the column transformer object
ct = make_column_transformer(
     (imp, features),
     remainder='passthrough'
)

Column transformers include imputers, encoders, scalers, and vectorizers. NumPy and pandas work well with scikit-learn, and their functions can easily be wrapped as custom transformers.
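
One way to turn a NumPy function into a transformer is `FunctionTransformer` from `sklearn.preprocessing`. The sketch below is an illustration only: the DataFrame `df`, its columns, and the choice of `np.log1p` are assumptions and are not part of this notebook's workflow.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer

# Hypothetical data: column 'a' has a missing value, column 'b' is skewed.
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [0.0, 9.0, 99.0]})

# Wrap a NumPy function so it can be used like any other transformer.
log_tf = FunctionTransformer(np.log1p)

demo_ct = make_column_transformer(
    (SimpleImputer(strategy='mean'), ['a']),   # fill NaN with the column mean
    (log_tf, ['b'])                            # apply log(1 + x) to column 'b'
)
out = np.asarray(demo_ct.fit_transform(df))
print(out.round(3))
```

The output columns follow the order in which the transformers are listed, so the first column holds the imputed `a` and the second holds the transformed `b`.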

# Selecting features using feature importance

Feature importance refers to techniques that assign a score to features based on how useful they are at predicting a target variable.

## Linear regression coefficients

In [17]:
# Create the linear regression object
linreg = LinearRegression(fit_intercept=True)

In [18]:
# Create the workflow object
pipe = make_pipeline(ct, linreg)

In [19]:
# Determine the linear regression model
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('simpleimputer',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='mean',
                                                                verbose=0),
                                                  ['X1', 'X2', 'X3', 'X4', 'X5',
                                                   'X6', 'X7', 'X8', 'X9',
                                            

In [20]:
# Calculate and rank the regression coefficients
lincoef = pipe.named_steps.linearregression.coef_.round(3)
result = pd.DataFrame(features, columns=['features'])
result['coefficients'] = pd.DataFrame(lincoef)
result = result.reindex(result['coefficients']
                        .abs().sort_values(ascending=False).index)
result

| | features | coefficients |
|---|---|---|
| 1 | X2 | -12.979 |
| 3 | X4 | -9.026 |
| 6 | X7 | -8.108 |
| 0 | X1 | 7.0 |
| 2 | X3 | 6.117 |
| 5 | X6 | 5.103 |
| 4 | X5 | 3.03 |
| 9 | X10 | 0.041 |
| 7 | X8 | -0.028 |
| 12 | X13 | -0.022 |


## Decision tree feature importance

In [21]:
# Create the decision tree regression object
dtr = DecisionTreeRegressor()

In [22]:
# Create the workflow object
pipe = make_pipeline(ct, dtr)

In [23]:
# Determine the decision tree regression model
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('simpleimputer',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='mean',
                                                                verbose=0),
                                                  ['X1', 'X2', 'X3', 'X4', 'X5',
                                                   'X6', 'X7', 'X8', 'X9',
                                            

In [24]:
# Calculate and rank the decision tree feature importances
dtr_feature_importances = pipe.named_steps.decisiontreeregressor\
    .feature_importances_
result = pd.DataFrame(features, columns=['features'])
result['feature_importances'] = pd.DataFrame(dtr_feature_importances)
result = result.reindex(result['feature_importances']
                        .abs().sort_values(ascending=False).index)
result

| | features | feature_importances |
|---|---|---|
| 1 | X2 | 0.447677 |
| 0 | X1 | 0.290626 |
| 6 | X7 | 0.119855 |
| 3 | X4 | 0.103615 |
| 2 | X3 | 0.006474 |
| 7 | X8 | 0.005089 |
| 8 | X9 | 0.004727 |
| 5 | X6 | 0.004684 |
| 12 | X13 | 0.004323 |
| 11 | X12 | 0.003776 |


## Random forest feature importance

In [25]:
# Create the random forest regression object
rfr = RandomForestRegressor()

In [26]:
# Create the workflow object
pipe = make_pipeline(ct, rfr)

In [27]:
# Determine the random forest regression model
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('simpleimputer',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='mean',
                                                                verbose=0),
                                                  ['X1', 'X2', 'X3', 'X4', 'X5',
                                                   'X6', 'X7', 'X8', 'X9',
                                            

In [28]:
# Calculate and rank the random forest feature importances
rfr_feature_importances = pipe.named_steps.randomforestregressor\
    .feature_importances_
result = pd.DataFrame(features, columns=['features'])
result['feature_importances'] = pd.DataFrame(rfr_feature_importances)
result = result.reindex(result['feature_importances']
                        .abs().sort_values(ascending=False).index)
result

| | features | feature_importances |
|---|---|---|
| 1 | X2 | 0.448009 |
| 0 | X1 | 0.291524 |
| 6 | X7 | 0.116818 |
| 3 | X4 | 0.104398 |
| 2 | X3 | 0.006334 |
| 5 | X6 | 0.005185 |
| 11 | X12 | 0.00422 |
| 9 | X10 | 0.004219 |
| 12 | X13 | 0.004093 |
| 7 | X8 | 0.003936 |


## XGBoost regression feature importance

In [29]:
# Create the extreme gradient boosting object
xgb = XGBRegressor()

In [30]:
# Create the workflow object
pipe = make_pipeline(ct, xgb)

In [31]:
# Determine the extreme gradient boosting model
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('simpleimputer',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='mean',
                                                                verbose=0),
                                                  ['X1', 'X2', 'X3', 'X4', 'X5',
                                                   'X6', 'X7', 'X8', 'X9',
                                            

In [32]:
# Calculate and rank the extreme gradient boosting feature importances
xgb_feature_importances = pipe.named_steps.xgbregressor\
    .feature_importances_
result = pd.DataFrame(features, columns=['features'])
result['feature_importances'] = pd.DataFrame(xgb_feature_importances)
result = result.reindex(result['feature_importances']
                        .abs().sort_values(ascending=False).index)
result

| | features | feature_importances |
|---|---|---|
| 1 | X2 | 0.460309 |
| 0 | X1 | 0.209623 |
| 6 | X7 | 0.164956 |
| 3 | X4 | 0.129992 |
| 2 | X3 | 0.009863 |
| 5 | X6 | 0.007966 |
| 4 | X5 | 0.003263 |
| 12 | X13 | 0.002573 |
| 7 | X8 | 0.002487 |
| 8 | X9 | 0.002399 |


## Lasso

In [33]:
# Create the lasso object
lasso = Lasso(max_iter=100)

In [34]:
# Create the workflow object
pipe = make_pipeline(ct, lasso)

In [35]:
# Determine the lasso model
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('simpleimputer',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='mean',
                                                                verbose=0),
                                                  ['X1', 'X2', 'X3', 'X4', 'X5',
                                                   'X6', 'X7', 'X8', 'X9',
                                            

In [36]:
# Calculate and rank the coefficients
lasso_coef = pipe.named_steps.lasso.coef_
result = pd.DataFrame(features, columns=['features'])
result['feature_coefficients'] = pd.DataFrame(lasso_coef)
result = result.reindex(result['feature_coefficients']
                        .abs().sort_values(ascending=False).index)
result

| | features | feature_coefficients |
|---|---|---|
| 1 | X2 | -12.747146 |
| 3 | X4 | -8.610564 |
| 6 | X7 | -7.787523 |
| 0 | X1 | 6.911849 |
| 2 | X3 | 2.163139 |
| 5 | X6 | 1.132501 |
| 12 | X13 | -0.0 |
| 11 | X12 | 0.0 |
| 10 | X11 | 0.0 |
| 9 | X10 | 0.0 |
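
`Lasso(max_iter=100)` leaves the penalty strength at its default, `alpha=1.0`. The penalty shrinks coefficients toward zero and can set small ones exactly to zero, which is probably why several coefficients above are smaller than their linear regression counterparts. The sketch below illustrates the effect on synthetic data; the data, the two `alpha` values, and the names `X_demo`, `y_demo`, and `nonzero_counts` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X_demo = rng.normal(size=(500, 4))
# One strong feature, one weak feature, two pure-noise features.
y_demo = (5 * X_demo[:, 0] + 0.3 * X_demo[:, 1]
          + rng.normal(scale=0.5, size=500))

nonzero_counts = {}
for alpha in (0.01, 1.0):
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    # Count how many coefficients survive the penalty.
    nonzero_counts[alpha] = int(np.count_nonzero(model.coef_))
    print(alpha, model.coef_.round(3))
```

With the larger penalty only the strong feature keeps a nonzero coefficient, and that coefficient is noticeably smaller than 5. With the smaller penalty the weak feature is retained as well.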


# Selecting features using algorithms

## SelectFromModel + LinearRegression

In [37]:
# Create the regression object to use for the feature selection object
linreg_selection = LinearRegression(fit_intercept=True)

In [38]:
# Create the feature selection object
selection = SelectFromModel(estimator=linreg_selection,
                            threshold='median')

In [39]:
# Create the workflow object
pipe = make_pipeline(ct, selection, linreg)

In [40]:
# Determine the linear regression model
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('simpleimputer',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='mean',
                                                                verbose=0),
                                                  ['X1', 'X2', 'X3', 'X4', 'X5',
                                                   'X6', 'X7', 'X8', 'X9',
                                            

In [41]:
# Set the hyperparameters for optimization
hyperparams = {}
hyperparams['columntransformer__simpleimputer__strategy'] = \
    ['mean', 'median', 'most_frequent', 'constant']
hyperparams['selectfrommodel__threshold'] = [None, 'mean', 'median']
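# Note: `normalize` was removed from LinearRegression in scikit-learn 1.2;
# omit this entry when running on newer versions.
hyperparams['linearregression__normalize'] = [False, True]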

In [42]:
# Perform a grid search
grid = GridSearchCV(pipe, hyperparams, cv=5)
grid.fit(X_train, y_train);

In [43]:
# Present the results
pd.DataFrame(grid.cv_results_).sort_values('rank_test_score');

In [44]:
# Access the best score
grid.best_score_.round(3)

0.99

In [45]:
# Access the best hyperparameters
grid.best_params_

{'columntransformer__simpleimputer__strategy': 'mean',
 'linearregression__normalize': False,
 'selectfrommodel__threshold': 'median'}

In [46]:
# Display the regression coefficients of the features
pipe.named_steps.linearregression.coef_.round(3)

array([  7.   , -12.979,   6.116,  -9.03 ,   3.03 ,   5.107,  -8.108])

In [47]:
# Display the regression intercept
pipe.named_steps.linearregression.intercept_.round(3)

68.972

In [48]:
# Show the selected features 
X.columns[selection.get_support()]

Index(['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7'], dtype='object')

In [49]:
# Cross-validate the updated pipeline
cross_val_score(pipe, X_train, y_train, cv=5).mean().round(3)

0.99
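
`GridSearchCV` fits clones of the pipeline, so `pipe` keeps the hyperparameters it was created with and is not changed by the search. A minimal sketch, on synthetic data, of two ways to work with the winning configuration: read it from `best_estimator_`, or copy `best_params_` back into the pipeline with `set_params`. The data and the names `demo_pipe`, `demo_grid`, and `param_grid` are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 6))
# Three informative features, three noise features.
y_demo = (5 * X_demo[:, 0] - 4 * X_demo[:, 1] + 3 * X_demo[:, 2]
          + rng.normal(scale=0.5, size=300))

demo_pipe = make_pipeline(
    SimpleImputer(),
    SelectFromModel(LinearRegression(), threshold='median'),
    LinearRegression()
)
param_grid = {
    'simpleimputer__strategy': ['mean', 'median'],
    'selectfrommodel__threshold': ['mean', 'median'],
}
demo_grid = GridSearchCV(demo_pipe, param_grid, cv=5)
demo_grid.fit(X_demo, y_demo)

# Option 1: use the refitted clone that holds the best hyperparameters.
best = demo_grid.best_estimator_
support = best.named_steps['selectfrommodel'].get_support()
best_coef = best.named_steps['linearregression'].coef_

# Option 2: copy the best hyperparameters into the original pipeline and refit.
demo_pipe.set_params(**demo_grid.best_params_)
demo_pipe.fit(X_demo, y_demo)
refit_coef = demo_pipe.named_steps['linearregression'].coef_
print(support, best_coef.round(3))
```

Either route gives coefficients and selected features that correspond to the configuration reported by `best_params_`.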

## SelectKBest + F test linear dependency

In [50]:
# Create a feature selection object
selection = SelectKBest(score_func=f_regression, k=10)

In [51]:
# Create the workflow object
pipe = make_pipeline(ct, selection, linreg)

In [52]:
# Determine the linear regression model for the training data set
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('simpleimputer',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='mean',
                                                                verbose=0),
                                                  ['X1', 'X2', 'X3', 'X4', 'X5',
                                                   'X6', 'X7', 'X8', 'X9',
                                            

In [53]:
# Set the hyperparameters for optimization
hyperparams = {}
hyperparams['columntransformer__simpleimputer__strategy'] = \
    ['mean', 'median', 'most_frequent', 'constant']
hyperparams['selectkbest__k'] = [7, 10, 'all']
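# Note: `normalize` was removed from LinearRegression in scikit-learn 1.2;
# omit this entry when running on newer versions.
hyperparams['linearregression__normalize'] = [False, True]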

In [54]:
# Perform a grid search
grid = GridSearchCV(pipe, hyperparams, cv=5)
grid.fit(X_train, y_train);

In [55]:
# Present the results
pd.DataFrame(grid.cv_results_).sort_values('rank_test_score');

In [56]:
# Access the best score
grid.best_score_.round(3)

0.99

In [57]:
# Access the best hyperparameters
grid.best_params_

{'columntransformer__simpleimputer__strategy': 'mean',
 'linearregression__normalize': False,
 'selectkbest__k': 10}

In [58]:
# Display the regression coefficients of the features
pipe.named_steps.linearregression.coef_.round(3)

array([ 6.9990e+00, -1.2979e+01,  6.1150e+00, -9.0280e+00,  3.0300e+00,
        5.1050e+00, -8.1090e+00,  4.1000e-02,  1.8000e-02, -4.0000e-03])

In [59]:
# Display the regression intercept
pipe.named_steps.linearregression.intercept_.round(3)

68.974

In [60]:
# Show the selected features 
X.columns[selection.get_support()]

Index(['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X10', 'X11', 'X12'], dtype='object')

In [61]:
# Cross-validate the updated pipeline
cross_val_score(pipe, X_train, y_train, cv=5).mean().round(3)

0.99
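
`SelectKBest` with `f_regression` scores each feature on its own with a univariate F-test against the target and keeps the `k` highest scores. The fitted selector exposes those scores through `scores_`. The sketch below shows how they might be inspected; the synthetic data and the names `X_demo`, `y_demo`, `kbest`, `scores`, and `kept` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(3)
cols = ['A', 'B', 'C', 'D']
X_demo = pd.DataFrame(rng.normal(size=(500, 4)), columns=cols)
# 'A' and 'B' drive the target; 'C' and 'D' are noise.
y_demo = 6 * X_demo['A'] - 2 * X_demo['B'] + rng.normal(scale=1.0, size=500)

kbest = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)

# Rank the features by their univariate F statistic.
scores = pd.Series(kbest.scores_, index=cols).sort_values(ascending=False)
kept = list(X_demo.columns[kbest.get_support()])
print(scores.round(1))
print(kept)
```

Because each feature is scored independently, this method does not account for interactions or redundancy among features.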

## SelectFromModel + Lasso

In [62]:
# Create the regression object to use for the feature selection object
lasso_selection = LassoCV(max_iter=100, selection='random')

In [63]:
# Create the feature selection object
selection = SelectFromModel(estimator=lasso_selection, threshold=None)

In [64]:
# Create the workflow object
pipe = make_pipeline(ct, selection, lasso)

In [65]:
# Determine the lasso model
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('simpleimputer',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='mean',
                                                                verbose=0),
                                                  ['X1', 'X2', 'X3', 'X4', 'X5',
                                                   'X6', 'X7', 'X8', 'X9',
                                            

In [66]:
# Set the hyperparameters for optimization
hyperparams = {}
hyperparams['columntransformer__simpleimputer__strategy'] = \
    ['mean', 'median', 'most_frequent', 'constant']
hyperparams['selectfrommodel__threshold'] = [None, 0.25,
                                             'mean', 'median']
hyperparams['lasso__max_iter'] = [100, 1000]
hyperparams['lasso__selection'] = ['cyclic', 'random']

In [67]:
# Perform a grid search
grid = GridSearchCV(pipe, hyperparams, cv=5)
grid.fit(X_train, y_train);

In [68]:
# Present the results
pd.DataFrame(grid.cv_results_).sort_values('rank_test_score');

In [69]:
# Access the best score
grid.best_score_.round(3)

0.984

In [70]:
# Access the best hyperparameters
grid.best_params_

{'columntransformer__simpleimputer__strategy': 'mean',
 'lasso__max_iter': 100,
 'lasso__selection': 'random',
 'selectfrommodel__threshold': 0.25}

In [71]:
# Display the regression coefficients of the features
pipe.named_steps.lasso.coef_.round(3)

array([  6.912, -12.747,   2.163,  -8.611,   0.   ,   1.133,  -7.788,
        -0.   ,   0.   ,   0.   ,  -0.   ])

In [72]:
# Display the regression intercept
pipe.named_steps.lasso.intercept_.round(3)

69.043

In [73]:
# Show the selected features 
X.columns[selection.get_support()]

Index(['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X10', 'X11', 'X13'], dtype='object')

In [74]:
# Cross-validate the updated pipeline
cross_val_score(pipe, X_train, y_train, cv=5).mean().round(3)

0.984

## SelectFromModel + DecisionTreeRegressor

In [75]:
# Create the decision tree regression object
dtr = DecisionTreeRegressor()

In [76]:
# Create the regression object to use for the feature selection object
dtr_selection = DecisionTreeRegressor()

In [77]:
# Create the feature selection object
selection = SelectFromModel(estimator=dtr_selection,
                            threshold='median')

In [78]:
# Create the workflow object
pipe = make_pipeline(ct, selection, dtr)

In [79]:
# Determine the decision tree regression model
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('simpleimputer',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='mean',
                                                                verbose=0),
                                                  ['X1', 'X2', 'X3', 'X4', 'X5',
                                                   'X6', 'X7', 'X8', 'X9',
                                            

In [80]:
# Set the hyperparameters for optimization
hyperparams = {}
hyperparams['columntransformer__simpleimputer__strategy'] = \
    ['mean', 'median', 'most_frequent', 'constant']
hyperparams['selectfrommodel__threshold'] = [None, 'mean', 'median']
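# Note: scikit-learn 1.2 and later name these criteria 'squared_error' and
# 'absolute_error' instead of 'mse' and 'mae'.
hyperparams['decisiontreeregressor__criterion'] = \
    ['mse', 'friedman_mse', 'mae']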

In [81]:
# Perform a grid search
grid = GridSearchCV(pipe, hyperparams, cv=5)
grid.fit(X_train, y_train);

In [82]:
# Present the results
pd.DataFrame(grid.cv_results_).sort_values('rank_test_score');

In [83]:
# Access the best score
grid.best_score_.round(3)

0.869

In [84]:
# Access the best hyperparameters
grid.best_params_

{'columntransformer__simpleimputer__strategy': 'median',
 'decisiontreeregressor__criterion': 'friedman_mse',
 'selectfrommodel__threshold': 'mean'}

In [85]:
# Display the feature importances of the selected features
pipe.named_steps.decisiontreeregressor.feature_importances_.round(3)

array([0.296, 0.451, 0.008, 0.107, 0.124, 0.007, 0.006])

In [86]:
# Show the selected features 
X.columns[selection.get_support()]

Index(['X1', 'X2', 'X3', 'X4', 'X7', 'X8', 'X9'], dtype='object')

In [87]:
# Cross-validate the updated pipeline
cross_val_score(pipe, X_train, y_train, cv=5).mean().round(3)

0.849
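
The decision trees in this section are created without a fixed `random_state`, so the low-importance features that pass the `'median'` threshold, and the resulting scores, can differ from run to run. A minimal sketch of making the selection repeatable by fixing the seed follows; the synthetic data, the seed value, and the helper `selected_columns` are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(400, 8))
# Two informative features, six noise features.
y_demo = (4 * X_demo[:, 0] - 3 * X_demo[:, 1]
          + rng.normal(scale=0.5, size=400))


def selected_columns(seed):
    """Fit a tree-based selector with a fixed seed and return its mask."""
    selector = SelectFromModel(
        DecisionTreeRegressor(random_state=seed),
        threshold='median'
    )
    selector.fit(X_demo, y_demo)
    return selector.get_support()


# The same seed gives the same selected columns on every call.
first = selected_columns(0)
second = selected_columns(0)
print(first)
```

The two informative columns are selected regardless of the seed in a setup like this; the seed mainly affects which noise columns land above the median.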

## SelectFromModel + DecisionTreeRegressor + LinearRegression

In [88]:
# Create the linear regression object
linreg = LinearRegression(fit_intercept=True)

In [89]:
# Create the regression object to use for the feature selection object
dtr_selection = DecisionTreeRegressor()

In [90]:
# Create the feature selection object
selection = SelectFromModel(estimator=dtr_selection,
                            threshold='median')

In [91]:
# Create the workflow object
pipe = make_pipeline(ct, selection, linreg)

In [92]:
# Determine the linear regression model
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('simpleimputer',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='mean',
                                                                verbose=0),
                                                  ['X1', 'X2', 'X3', 'X4', 'X5',
                                                   'X6', 'X7', 'X8', 'X9',
                                            

In [93]:
# Set the hyperparameters for optimization
hyperparams = {}
hyperparams['columntransformer__simpleimputer__strategy'] = \
    ['mean', 'median', 'most_frequent', 'constant']
hyperparams['selectfrommodel__threshold'] = [None, 'mean', 'median']
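# Note: `normalize` was removed from LinearRegression in scikit-learn 1.2;
# omit this entry when running on newer versions.
hyperparams['linearregression__normalize'] = [False, True]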

In [94]:
# Perform a grid search
grid = GridSearchCV(pipe, hyperparams, cv=5)
grid.fit(X_train, y_train);

In [95]:
# Present the results
pd.DataFrame(grid.cv_results_).sort_values('rank_test_score');

In [96]:
# Access the best score
grid.best_score_.round(3)

0.989

In [97]:
# Access the best hyperparameters
grid.best_params_

{'columntransformer__simpleimputer__strategy': 'mean',
 'linearregression__normalize': False,
 'selectfrommodel__threshold': 'median'}

In [98]:
# Display the regression coefficients of the features
pipe.named_steps.linearregression.coef_.round(3)

array([  7.029, -12.997,   6.063,  -9.049,  -8.11 ,  -0.048,  -0.014])

In [99]:
# Display the regression intercept
pipe.named_steps.linearregression.intercept_.round(3)

68.97

In [107]:
pipe.named_steps.selectfrommodel.estimator

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

In [100]:
# Show the selected features 
X.columns[selection.get_support()]

Index(['X1', 'X2', 'X3', 'X4', 'X7', 'X8', 'X13'], dtype='object')

In [101]:
# Cross-validate the updated pipeline
cross_val_score(pipe, X_train, y_train, cv=5).mean().round(3)

0.989

In [102]:
end_time = datetime.now()
round((end_time - start_time).total_seconds(), 3)

62.454

# References

## numpy

- [API](https://numpy.org/devdocs/reference/index.html)

## pandas

- [API](https://pandas.pydata.org/docs/reference/index.html)

- [isna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html)

- [mask](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mask.html#pandas.DataFrame.mask)

- [options.display.max_rows, options.display.max_columns](https://pandas.pydata.org/docs/reference/api/pandas.set_option.html#pandas.set_option)

- [read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv)

- [shape](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html)

## python

- [library reference](https://docs.python.org/3/library/index.html)

- [datetime](https://docs.python.org/3/library/datetime.html)

## scikit-learn

- [API](https://scikit-learn.org/stable/modules/classes.html#)

- [compose module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.compose)

- [compose.make_column_transformer function](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html#sklearn.compose.make_column_transformer)

- [ensemble module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble)

- [ensemble.RandomForestRegressor class](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor)

- [feature_selection module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection)

- [feature_selection.SelectFromModel class](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel)

- [feature_selection.SelectFromModel.get_support() method](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel.get_support)

- [feature_selection.SelectKBest class](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest)

- [feature_selection.f_regression scoring function](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html#sklearn.feature_selection.f_regression)

- [impute module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute)

- [impute SimpleImputer class](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer)

- [linear_model module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)

- [linear_model.Lasso class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso)

- [linear_model.LassoCV class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html#sklearn.linear_model.LassoCV)

- [linear_model.LinearRegression class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)

- [linear models User Guide](https://scikit-learn.org/stable/modules/linear_model.html#linear-model)

- [linear regression example](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)

- [model_selection module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection)

- [model_selection.cross_val_score function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score)

- [model_selection.GridSearchCV class](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV)

- [model_selection.train_test_split function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split)

- [pipeline module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline)

- [pipeline.make_pipeline function](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline)

- [tree module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree)

- [tree.DecisionTreeRegressor class](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor)

## XGBoost

- [API](https://xgboost.readthedocs.io/en/latest/python/python_api.html)

- [XGBRegressor class](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn)


## Machine learning

- [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))

- [feature selection](https://en.wikipedia.org/wiki/Feature_selection)

- [hyperparameter optimization](https://en.wikipedia.org/wiki/Hyperparameter_optimization)

- [imputation](https://en.wikipedia.org/wiki/Imputation_(statistics))

- [linear regression](https://en.wikipedia.org/wiki/Linear_regression)

# Glossary

**ColumnTransformer** It is a scikit-learn class that applies transformers to specified columns of a data set.
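
A minimal sketch of a column transformer that imputes missing values may help here. The two-column data set, the column names `X1` and `X2`, and the choice of mean imputation are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer

# Hypothetical data set with one missing value in column X1
df = pd.DataFrame({
    'X1': [1.0, np.nan, 3.0],
    'X2': [10.0, 20.0, 30.0],
})

# Impute X1 with its column mean; pass the other columns through unchanged
ct = make_column_transformer(
    (SimpleImputer(strategy='mean'), ['X1']),
    remainder='passthrough'
)

# The result is a NumPy array with the transformed column first
result = ct.fit_transform(df)
print(result)
```

The missing value in `X1` is replaced by the mean of the observed values in that column, while `X2` is left as it was.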

**Decision tree** It is a non-parametric supervised machine learning method used for classification and regression. A fitted tree reports feature importances, which can be used to identify candidate features for the model.

**Feature** It is an independent variable that is controlled in order to cause an outcome in the dependent variable.

**Feature selection** It is the process of selecting a set of features for a model. The data set probably contains features that are redundant or irrelevant, and can be removed with little effect on a model.
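
As a rough illustration of automatic feature selection, the sketch below uses `SelectKBest` with the `f_regression` scoring function. The synthetic data, the choice of `k=1`, and the fact that only the second column drives the target are all assumptions for this example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (assumed): four features, only column 1 affects the target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 4.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Keep the single feature with the highest F-statistic
selector = SelectKBest(score_func=f_regression, k=1).fit(X, y)

# Boolean array marking which features were kept
support = selector.get_support()
print(support)
```

The boolean array returned by `get_support()` can be used to index a list of feature names and recover the names of the selected features.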

**Hyperparameter** It is a value you set before the model fitting process. It is not learned from the data.

**Linear regression** It is a linear approach to modeling the relationship between a target and one or more features, using linear predictor functions where unknown model parameters are estimated from the data.

**Machine Learning** Machine learning algorithms build a mathematical model on a training data subset in order to make predictions on a test data subset. Various measures are used to compare the actual and predicted data in the test data subset to estimate the performance of the model.

**Mask** It is a pandas method that replaces values where a condition is true with another value, NaN by default.
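
The behaviour of `mask` can be sketched with a small Series. The values and the threshold of 100 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one implausibly large value
s = pd.Series([1.0, 250.0, 3.0])

# Replace values where the condition is True with NaN (the default)
masked = s.mask(s > 100)

# Replace values where the condition is True with an explicit value
replaced = s.mask(s > 100, 0.0)

print(masked)
print(replaced)
```

Values where the condition is false are left unchanged in both cases.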

**Parameter** It is a value learned during the model fitting process.
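
The difference between a hyperparameter and a parameter can be sketched with `Lasso`. The synthetic data and the value `alpha=0.1` are assumptions for this example.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumed): the target depends on the first feature only
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=50)

# alpha is a hyperparameter: it is supplied by you when creating the estimator
model = Lasso(alpha=0.1)
model.fit(X, y)

# coef_ and intercept_ are parameters: they are estimated from the data
coefficients = model.coef_
intercept = model.intercept_
print(coefficients, intercept)
```

In scikit-learn, attributes that end with an underscore, such as `coef_`, hold values learned during fitting.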

**Pipeline** It is a scikit-learn class that sequentially applies a list of transforms and a final estimator. The final estimator only needs to implement fit. The purpose is to assemble several steps that can be cross-validated together while setting different hyperparameters.
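
A small sketch of a pipeline that is cross-validated as a single unit follows. The synthetic data, the two steps chosen, and the use of five folds are assumptions for illustration and may differ from the pipeline built in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data (assumed): the target depends on the first two features
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=60)

# Introduce missing values in the third feature
X[::10, 2] = np.nan

# One transform (imputation) followed by a final estimator
pipe = make_pipeline(SimpleImputer(strategy='mean'), LinearRegression())

# Each fold refits the imputer and the model on that fold's training part
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Because the imputer is inside the pipeline, it is fitted only on the training portion of each fold, which avoids leaking information from the held-out portion.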

**Target** It is a dependent variable that represents the outcome resulting from altering features (independent variables).

**Testing data set** It is the data set held out from model fitting. We supply its feature values to the fitted model to predict the target, then compare the predicted target values with the actual values to evaluate the performance of the model.

**Training data set** It contains known values of the target. The model learns from these data, that is, we fit a model to estimate the relationship between the target and the features.
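
The roles of the training and testing data sets can be sketched as follows. The synthetic data, the 80/20 split, and `random_state=42` are assumptions for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumed): the target depends on the first feature only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.5 * X[:, 0] + rng.normal(scale=0.1, size=100)

# Hold out 20 % of the rows as the testing data set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training data set only
model = LinearRegression().fit(X_train, y_train)

# Evaluate on the testing data set; score() returns R squared
r2 = model.score(X_test, y_test)
print(r2)
```

The model never sees the testing rows during fitting, so the score is an estimate of how it might perform on new data.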

**Transformer** It is a scikit-learn object that transforms a column. For example, it can replace NaN with the average of all values in a column.

**Workflow** It is a repeatable pattern of activity, a sequence of operations. In scikit-learn, workflow is achieved through the Pipeline class.