# Python scikit-learn Machine Learning Workflow 

# Outcomes

- Overview of scikit-learn for supervised machine learning
- Create training and testing data sets
- Create a workflow pipeline
    - Add a column transformer object
    - Add a feature selection object
    - Add a regression object
    - Fit a model
    - Set hyperparameters for tuning
    - Tune the model using grid search cross validation

# Feature selection

Which features should you use to create a predictive model? Choosing them usually requires subject matter expertise, but algorithms can also select features automatically. Feature selection is the process of automatically selecting the features (Xs) that contribute most to the target (Y) that you are trying to predict.
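
As a minimal sketch of automatic selection, the example below builds a small synthetic data set (an assumption for illustration, not the workshop data) in which only the first two of five columns drive the target, and asks `SelectKBest` to keep the two columns with the largest F statistics. The names `X_demo`, `y_demo`, and `selector` are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (assumed for illustration): only columns 0 and 1
# influence the target, columns 2 to 4 are pure noise
rng = np.random.default_rng(seed=0)
X_demo = rng.normal(size=(200, 5))
y_demo = (4 * X_demo[:, 0] - 3 * X_demo[:, 1]
          + rng.normal(scale=0.1, size=200))

# Keep the two features with the largest univariate F statistics
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X_demo, y_demo)

# get_support() returns a boolean mask of the retained columns
selected_columns = np.flatnonzero(selector.get_support())
print(selected_columns)
```

The boolean mask from `get_support()` is the same mechanism used later in this notebook to list the selected feature names.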

# Setting up

In [1]:
# Import libraries
from datetime import datetime
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, SelectFromModel,\
    f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso, LassoCV, LinearRegression
from sklearn.model_selection import cross_val_score, GridSearchCV,\
    train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn import set_config
from xgboost import XGBRegressor
import datasense as ds


In [2]:
# Set the global parameters
pd.options.display.max_rows = None
pd.options.display.max_columns = None
filename = 'lunch_and_learn.csv'
numrows = 500
target = 'Y'
features = ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7',
            'X8', 'X9', 'X10', 'X11', 'X12', 'X13']
set_config(display='diagram')

In [3]:
# The true coefficients are:
# intercept = 69±3
# X1 = 7±1, X2 = -13±2, X3 = 6±0.5, X4 = -9±1.5
# X5 = 3±0.3, X6 = 5±0.5, X7 = -8±1.7
# X8 = 0±3, X9 = 0±3, X10 = 0±3, X11 = 0±3, X12 = 0±3, X13 = 0±3

# Cleaning the data

In [4]:
# Read the data file into a pandas DataFrame
data = pd.read_csv(filename, nrows=numrows)

In [5]:
# Describe information about the DataFrame
# ds.dataframe_info(data, filename)

In [6]:
# Determine the number of rows and columns
# data.shape

In [7]:
# Check for missing values
# data.isna().sum()

In [8]:
# Describe the feature columns
# for column in features:
#     print(column)
#     result = ds.nonparametric_summary(data[column])
#     print(result, '\n')

In [9]:
# Set lower and upper values to remove outliers
mask_values = [
    ('X1', -20, 20),
    ('X2', -25, 25),
    ('X3', -5, 5),
    ('X4', -10, 10),
    ('X5', -3, 3),
    ('X6', -5, 5),
    ('X7', -13, 13),
    ('X8', -9, 15),
    ('X9', -17, 15),
    ('X10', -16, 15),
    ('X11', -16, 17),
    ('X12', -16, 17),
    ('X13', -20, 23)
]
# Replace outliers with NaN
for column, lowvalue, highvalue in mask_values:
    data[column] = data[column].mask(
        (data[column] <= lowvalue) |
        (data[column] >= highvalue)
    )
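
To make the masking step concrete, here is a small sketch on a hypothetical five-value Series (the values and the names `demo`, `low_value`, and `high_value` are assumptions for illustration). It uses the same inclusive comparisons as the loop above, so values equal to a limit are also replaced with NaN.

```python
import pandas as pd

# Hypothetical values; the limits mirror the ('X1', -20, 20) entry
demo = pd.Series([-30.0, -5.0, 0.0, 12.0, 20.0])
low_value, high_value = -20, 20

# mask() replaces values with NaN wherever the condition is True
masked = demo.mask((demo <= low_value) | (demo >= high_value))
print(masked)
print(masked.isna().sum())
```

The NaN values created here are what the imputer in the column transformer later fills in.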

In [10]:
# Delete rows if target is NaN
data = data.dropna(subset=[target])

In [11]:
# Describe the feature columns
# for column in features:
#     print(column)
#     result = ds.nonparametric_summary(data[column])
#     print(result, '\n')

# Splitting the data

In [12]:
# Create training and testing data sets
X = data[features]
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [13]:
print(type(X))
print(type(y))
print(type(X_train))
print(type(X_test))
print(type(y_train))
print(type(y_test))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


In [14]:
# Determine the number of rows and columns
# X.shape, y.shape, X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [15]:
# Check the DataFrame for missing values
# data.isna().sum()

# Machine learning workflow

A typical workflow involves several sequential steps:

- Column transformation such as imputing missing values
- Feature selection
- Modeling

These steps are embedded in a workflow object called a pipeline, which is simply a series of sequential steps. The output of each step is passed to the next step.
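
The three steps can be sketched end to end on synthetic data. This is an illustrative example under stated assumptions: the data, the choice of `SelectKBest` with `k=2`, and the names `X_demo`, `y_demo`, and `demo_pipe` are not part of the workshop data set.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic data (assumed): columns 0 and 2 drive the target
rng = np.random.default_rng(seed=1)
X_demo = rng.normal(size=(100, 4))
y_demo = (2 * X_demo[:, 0] + 5 * X_demo[:, 2]
          + rng.normal(scale=0.1, size=100))
# Add a few missing values so the imputation step has work to do
X_demo[::10, 1] = np.nan

# Each step receives the output of the previous step
demo_pipe = make_pipeline(
    SimpleImputer(strategy='mean'),             # column transformation
    SelectKBest(score_func=f_regression, k=2),  # feature selection
    LinearRegression()                          # modeling
)
demo_pipe.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name
step_names = list(demo_pipe.named_steps)
print(step_names)
```

The automatically generated step names are what the `named_steps` lookups and the `step__parameter` hyperparameter keys in the rest of this notebook rely on.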

# Creating a feature selection object

In [16]:
# Create objects to use for feature selection
linreg_selection = LinearRegression()
dtr_selection = DecisionTreeRegressor()
lasso_selection = Lasso()
lassocv_selection = LassoCV()
rfr_selection = RandomForestRegressor()

# Creating a regression object

In [17]:
# Create objects to use for regression
linreg = LinearRegression()
dtr = DecisionTreeRegressor()
lasso = Lasso()
lassocv = LassoCV()
rfr = RandomForestRegressor()
xgb = XGBRegressor()

In [18]:
print(type(linreg))
print(type(dtr))
print(type(lasso))
print(type(lassocv))
print(type(rfr))
print(type(xgb))

<class 'sklearn.linear_model._base.LinearRegression'>
<class 'sklearn.tree._classes.DecisionTreeRegressor'>
<class 'sklearn.linear_model._coordinate_descent.Lasso'>
<class 'sklearn.linear_model._coordinate_descent.LassoCV'>
<class 'sklearn.ensemble._forest.RandomForestRegressor'>
<class 'xgboost.sklearn.XGBRegressor'>


# Creating a column transformer

In [19]:
# Create the imputer object
imp = SimpleImputer()

In [20]:
# Create the column transformer object
ct = make_column_transformer(
     (imp, features),
     remainder='passthrough'
)

Column transformation algorithms include imputers, encoders, scalers, and vectorizers. NumPy and pandas work well with scikit-learn, and their functions can easily be converted into custom transformers.
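
One way to turn a NumPy function into a custom transformation is scikit-learn's `FunctionTransformer`. The sketch below is an assumption about how this might look, not a step used in this notebook; the DataFrame `demo` and its columns `A` and `B` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer

# Hypothetical data: column A spans several orders of magnitude
demo = pd.DataFrame({'A': [1.0, 10.0, 100.0], 'B': [3.0, 4.0, 5.0]})

# Wrap np.log10 so that it behaves like a scikit-learn transformer
log_transformer = FunctionTransformer(np.log10)

# Apply the transformer to column A and pass column B through unchanged
demo_ct = make_column_transformer(
    (log_transformer, ['A']),
    remainder='passthrough'
)
transformed = demo_ct.fit_transform(demo)
print(transformed)
```

Transformed columns come first in the output, followed by the passed-through columns.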

# Selecting features using feature importance

Feature importance refers to techniques that assign a score to features based on how useful they are at predicting a target variable.
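
The subsections that follow repeat one pattern: pair each feature name with its score, then sort by absolute value. A hypothetical helper, `rank_scores` (not part of this notebook or of any library), sketches that pattern in one place.

```python
import pandas as pd


def rank_scores(names, scores, label='score'):
    """Pair names with scores and sort rows by absolute score, largest first."""
    result = pd.DataFrame({'features': names, label: scores})
    order = result[label].abs().sort_values(ascending=False).index
    return result.reindex(order)


# Hypothetical scores for three features
ranked = rank_scores(['X1', 'X2', 'X3'], [7.0, -13.0, 0.1],
                     label='coefficients')
print(ranked)
```

Sorting on the absolute value matters for coefficients, which can be negative; tree-based importances are already non-negative.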

## Linear regression coefficients

In [21]:
# Create the workflow object
pipe = make_pipeline(ct, linreg)

In [22]:
# Determine the linear regression model
pipe.fit(X_train, y_train)

In [23]:
# Calculate and rank the regression coefficients
lincoef = pipe.named_steps.linearregression.coef_.round(3)
result = pd.DataFrame(features, columns=['features'])
result['coefficients'] = pd.DataFrame(lincoef)
result = result.reindex(result['coefficients']
                        .abs().sort_values(ascending=False).index)
result

| | features | coefficients |
|---|---|---|
| 1 | X2 | -12.949 |
| 3 | X4 | -9.057 |
| 6 | X7 | -8.039 |
| 0 | X1 | 6.968 |
| 2 | X3 | 6.338 |
| 5 | X6 | 5.225 |
| 4 | X5 | 3.04 |
| 10 | X11 | 0.089 |
| 8 | X9 | 0.058 |
| 11 | X12 | -0.055 |


## Decision tree feature importance

In [24]:
# Create the workflow object
pipe = make_pipeline(ct, dtr)

In [25]:
# Determine the decision tree regression model
pipe.fit(X_train, y_train)

In [26]:
# Calculate and rank the decision tree feature importances
dtr_feature_importances = pipe.named_steps.decisiontreeregressor\
    .feature_importances_
result = pd.DataFrame(features, columns=['features'])
result['feature_importances'] = pd.DataFrame(dtr_feature_importances)
result = result.reindex(result['feature_importances']
                        .abs().sort_values(ascending=False).index)
result

| | features | feature_importances |
|---|---|---|
| 1 | X2 | 0.44574 |
| 0 | X1 | 0.239152 |
| 6 | X7 | 0.126769 |
| 3 | X4 | 0.110819 |
| 10 | X11 | 0.017051 |
| 7 | X8 | 0.012955 |
| 2 | X3 | 0.011311 |
| 8 | X9 | 0.009053 |
| 5 | X6 | 0.006996 |
| 11 | X12 | 0.005746 |


## Random forest feature importance

In [27]:
# Create the workflow object
pipe = make_pipeline(ct, rfr)

In [28]:
# Determine the random forest regression model
pipe.fit(X_train, y_train)

In [29]:
# Calculate and rank the random forest feature importances
rfr_feature_importances = pipe.named_steps.randomforestregressor\
    .feature_importances_
result = pd.DataFrame(features, columns=['features'])
result['feature_importances'] = pd.DataFrame(rfr_feature_importances)
result = result.reindex(result['feature_importances']
                        .abs().sort_values(ascending=False).index)
result

| | features | feature_importances |
|---|---|---|
| 1 | X2 | 0.453952 |
| 0 | X1 | 0.259632 |
| 6 | X7 | 0.111964 |
| 3 | X4 | 0.085964 |
| 5 | X6 | 0.012356 |
| 4 | X5 | 0.010693 |
| 2 | X3 | 0.010571 |
| 10 | X11 | 0.010561 |
| 11 | X12 | 0.009121 |
| 12 | X13 | 0.009056 |


## XGBoost regression feature importance

In [30]:
# Create the workflow object
pipe = make_pipeline(ct, xgb)

In [31]:
# Determine the extreme gradient boosting model
pipe.fit(X_train, y_train);

In [32]:
# Calculate and rank the extreme gradient boosting feature importances
xgb_feature_importances = pipe.named_steps.xgbregressor\
    .feature_importances_
result = pd.DataFrame(features, columns=['features'])
result['feature_importances'] = pd.DataFrame(xgb_feature_importances)
result = result.reindex(result['feature_importances']
                        .abs().sort_values(ascending=False).index)
result

| | features | feature_importances |
|---|---|---|
| 1 | X2 | 0.452074 |
| 6 | X7 | 0.209579 |
| 0 | X1 | 0.173878 |
| 3 | X4 | 0.101201 |
| 12 | X13 | 0.012283 |
| 2 | X3 | 0.011529 |
| 5 | X6 | 0.008845 |
| 7 | X8 | 0.008613 |
| 4 | X5 | 0.005388 |
| 10 | X11 | 0.005286 |


## Lasso

In [33]:
# Create the workflow object
pipe = make_pipeline(ct, lasso)

In [34]:
# Determine the lasso model
pipe.fit(X_train, y_train)

In [35]:
# Calculate and rank the coefficients
lasso_coef = pipe.named_steps.lasso.coef_
result = pd.DataFrame(features, columns=['features'])
result['feature_coefficients'] = pd.DataFrame(lasso_coef)
result = result.reindex(result['feature_coefficients']
                        .abs().sort_values(ascending=False).index)
result

| | features | feature_coefficients |
|---|---|---|
| 1 | X2 | -12.80313 |
| 3 | X4 | -8.862099 |
| 6 | X7 | -7.816439 |
| 0 | X1 | 6.928438 |
| 2 | X3 | 2.032962 |
| 5 | X6 | 1.574388 |
| 12 | X13 | -0.002544 |
| 4 | X5 | 0.0 |
| 7 | X8 | 0.0 |
| 8 | X9 | 0.0 |


## LassoCV

In [36]:
# Create the workflow object
pipe = make_pipeline(ct, lassocv)

In [37]:
# Determine the LassoCV model
pipe.fit(X_train, y_train)

In [38]:
# Calculate and rank the coefficients
lassocv_coef = pipe.named_steps.lassocv.coef_
result = pd.DataFrame(features, columns=['features'])
result['feature_coefficients'] = pd.DataFrame(lassocv_coef)
result = result.reindex(result['feature_coefficients']
                        .abs().sort_values(ascending=False).index)
result

| | features | feature_coefficients |
|---|---|---|
| 1 | X2 | -12.947752 |
| 3 | X4 | -9.055433 |
| 6 | X7 | -8.024839 |
| 0 | X1 | 6.959398 |
| 2 | X3 | 6.037589 |
| 5 | X6 | 4.997705 |
| 4 | X5 | 2.247338 |
| 10 | X11 | 0.081442 |
| 8 | X9 | 0.058628 |
| 12 | X13 | -0.045769 |


# Selecting features using algorithms

## SelectFromModel + LinearRegression

In [39]:
# Create the feature selection object
selection = SelectFromModel(estimator=linreg_selection,
                            threshold='median')

In [40]:
# Create the workflow object
pipe = make_pipeline(ct, selection, linreg)

In [41]:
# Determine the linear regression model
pipe.fit(X_train, y_train)

In [42]:
# Set the hyperparameters for optimization
hyperparams = {}
hyperparams['columntransformer__simpleimputer__strategy'] = \
    ['mean', 'median', 'most_frequent', 'constant']
hyperparams['selectfrommodel__threshold'] = [None, 'mean', 'median']
# hyperparams['linearregression__normalize'] = [False, True]

In [43]:
# Perform a grid search
grid = GridSearchCV(pipe, hyperparams, cv=5)
grid.fit(X_train, y_train);

In [44]:
# Present the results
pd.DataFrame(grid.cv_results_).sort_values('rank_test_score');

In [45]:
# Access the best score
grid.best_score_.round(3)

0.994

In [46]:
# Access the best hyperparameters
grid.best_params_

{'columntransformer__simpleimputer__strategy': 'mean',
 'selectfrommodel__threshold': 'median'}

In [47]:
# Display the regression coefficients of the features
pipe.named_steps.linearregression.coef_.round(3)

array([  6.967, -12.952,   6.34 ,  -9.068,   3.168,   5.155,  -8.034])

In [48]:
# Display the regression intercept
pipe.named_steps.linearregression.intercept_.round(3)

68.759

In [49]:
# Show the selected features 
X.columns[selection.get_support()]

Index(['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7'], dtype='object')

In [50]:
# Cross-validate the updated pipeline
cross_val_score(pipe, X_train, y_train, cv=5).mean().round(3)

0.994
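
`GridSearchCV` fits clones of the pipeline it is given, so the original `pipe` object keeps the settings it was created with. When the tuned model is wanted, one option is the `best_estimator_` attribute of the fitted search. The sketch below illustrates this on synthetic data; the data and the names `demo_pipe`, `demo_grid`, and `best_pipe` are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic data (assumed): columns 0 and 1 drive the target
rng = np.random.default_rng(seed=2)
X_demo = rng.normal(size=(150, 4))
y_demo = (3 * X_demo[:, 0] - 2 * X_demo[:, 1]
          + rng.normal(scale=0.1, size=150))

# Start with a deliberately poor setting, k=1
demo_pipe = make_pipeline(
    SelectKBest(score_func=f_regression, k=1),
    LinearRegression()
)
demo_grid = GridSearchCV(demo_pipe, {'selectkbest__k': [1, 2]}, cv=5)
demo_grid.fit(X_demo, y_demo)

# The search leaves demo_pipe unchanged; the tuned pipeline, refit on
# all of the data passed to fit(), is stored in best_estimator_
best_pipe = demo_grid.best_estimator_
print(demo_pipe.get_params()['selectkbest__k'])
print(demo_grid.best_params_)
print(best_pipe.named_steps.linearregression.coef_.round(1))
```

Reading coefficients, intercepts, or selected features from `best_estimator_` reports the model that achieved `best_score_`, rather than the pipeline fitted before the search.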

## SelectKBest + F test linear dependency

In [51]:
# Create a feature selection object
selection = SelectKBest(score_func=f_regression, k=10)

In [52]:
# Create the workflow object
pipe = make_pipeline(ct, selection, linreg)

In [53]:
# Determine the linear regression model for the training data set
pipe.fit(X_train, y_train)

In [54]:
# Set the hyperparameters for optimization
hyperparams = {}
hyperparams['columntransformer__simpleimputer__strategy'] = \
    ['mean', 'median', 'most_frequent', 'constant']
hyperparams['selectkbest__k'] = [7, 10, 'all']
# hyperparams['linearregression__normalize'] = [False, True]

In [55]:
# Perform a grid search
grid = GridSearchCV(pipe, hyperparams, cv=5)
grid.fit(X_train, y_train);

In [56]:
# Present the results
pd.DataFrame(grid.cv_results_).sort_values('rank_test_score');

In [57]:
# Access the best score
grid.best_score_.round(3)

0.994

In [58]:
# Access the best hyperparameters
grid.best_params_

{'columntransformer__simpleimputer__strategy': 'mean', 'selectkbest__k': 'all'}

In [59]:
# Display the regression coefficients of the features
pipe.named_steps.linearregression.coef_.round(3)

array([  6.972, -12.946,   6.372,  -9.053,   3.107,   5.197,  -8.02 ,
        -0.037,  -0.063,  -0.054])

In [60]:
# Display the regression intercept
pipe.named_steps.linearregression.intercept_.round(3)

68.749

In [61]:
# Show the selected features 
X.columns[selection.get_support()]

Index(['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X12', 'X13'], dtype='object')

In [62]:
# Cross-validate the updated pipeline
cross_val_score(pipe, X_train, y_train, cv=5).mean().round(3)

0.994

## SelectFromModel + Lasso

In [63]:
# Create the feature selection object
selection = SelectFromModel(estimator=lasso_selection, threshold=None)

In [64]:
# Create the workflow object
pipe = make_pipeline(ct, selection, lasso)

In [65]:
# Determine the lasso model
pipe.fit(X_train, y_train)

In [66]:
# Set the hyperparameters for optimization
hyperparams = {}
hyperparams['columntransformer__simpleimputer__strategy'] = \
    ['mean', 'median', 'most_frequent', 'constant']
hyperparams['selectfrommodel__threshold'] = [None, 0.25,
                                             'mean', 'median']
hyperparams['lasso__max_iter'] = [100, 1000]
hyperparams['lasso__selection'] = ['cyclic', 'random']

In [67]:
# Perform a grid search
grid = GridSearchCV(pipe, hyperparams, cv=5)
grid.fit(X_train, y_train);

In [68]:
# Present the results
pd.DataFrame(grid.cv_results_).sort_values('rank_test_score');

In [69]:
# Access the best score
grid.best_score_.round(3)

0.988

In [70]:
# Access the best hyperparameters
grid.best_params_

{'columntransformer__simpleimputer__strategy': 'most_frequent',
 'lasso__max_iter': 100,
 'lasso__selection': 'random',
 'selectfrommodel__threshold': 'median'}

In [71]:
# Display the regression coefficients of the features
pipe.named_steps.lasso.coef_.round(3)

array([ 6.9280e+00, -1.2803e+01,  2.0330e+00, -8.8620e+00,  1.5740e+00,
       -7.8160e+00, -3.0000e-03])

In [72]:
# Display the regression intercept
pipe.named_steps.lasso.intercept_.round(3)

69.089

In [73]:
# Show the selected features 
X.columns[selection.get_support()]

Index(['X1', 'X2', 'X3', 'X4', 'X6', 'X7', 'X13'], dtype='object')

In [74]:
# Cross-validate the updated pipeline
cross_val_score(pipe, X_train, y_train, cv=5).mean().round(3)

0.988

## SelectFromModel + DecisionTreeRegressor

In [75]:
# Create the feature selection object
selection = SelectFromModel(estimator=dtr_selection,
                            threshold='median')

In [76]:
# Create the workflow object
pipe = make_pipeline(ct, selection, dtr)

In [77]:
# Determine the decision tree regression model
pipe.fit(X_train, y_train)

In [78]:
# Set the hyperparameters for optimization
hyperparams = {}
hyperparams['columntransformer__simpleimputer__strategy'] = \
    ['mean', 'median', 'most_frequent', 'constant']
hyperparams['selectfrommodel__threshold'] = [None, 'mean', 'median']
# In scikit-learn 1.0 and later, 'mse' and 'mae' are named
# 'squared_error' and 'absolute_error'
hyperparams['decisiontreeregressor__criterion'] = \
    ['mse', 'friedman_mse', 'mae']
hyperparams['decisiontreeregressor__splitter'] = ['best', 'random']

In [79]:
# Perform a grid search
grid = GridSearchCV(pipe, hyperparams, cv=5)
grid.fit(X_train, y_train);

In [80]:
# Present the results
pd.DataFrame(grid.cv_results_).sort_values('rank_test_score');

In [81]:
# Access the best score
grid.best_score_.round(3)

0.711

In [82]:
# Access the best hyperparameters
grid.best_params_

{'columntransformer__simpleimputer__strategy': 'median',
 'decisiontreeregressor__criterion': 'mse',
 'decisiontreeregressor__splitter': 'random',
 'selectfrommodel__threshold': 'mean'}

In [83]:
# Display the feature importances of the selected features
pipe.named_steps.decisiontreeregressor.feature_importances_.round(3)

array([0.25 , 0.45 , 0.119, 0.136, 0.017, 0.019, 0.009])

In [84]:
# Show the selected features 
X.columns[selection.get_support()]

Index(['X1', 'X2', 'X4', 'X7', 'X8', 'X11', 'X12'], dtype='object')

In [85]:
# Cross-validate the updated pipeline
cross_val_score(pipe, X_train, y_train, cv=5).mean().round(3)

0.622

## SelectFromModel + DecisionTreeRegressor + LinearRegression

In [86]:
# Create the feature selection object
selection = SelectFromModel(estimator=dtr_selection,
                            threshold='median')

In [87]:
# Create the workflow object
pipe = make_pipeline(ct, selection, linreg)

In [88]:
# Determine the linear regression model
pipe.fit(X_train, y_train)

In [89]:
# Set the hyperparameters for optimization
hyperparams = {}
hyperparams['columntransformer__simpleimputer__strategy'] = \
    ['mean', 'median', 'most_frequent', 'constant']
hyperparams['selectfrommodel__threshold'] = [None, 'mean', 'median']
# hyperparams['linearregression__normalize'] = [False, True]

In [90]:
# Perform a grid search
grid = GridSearchCV(pipe, hyperparams, cv=5)
grid.fit(X_train, y_train);

In [91]:
# Present the results
pd.DataFrame(grid.cv_results_).sort_values('rank_test_score');

In [92]:
# Access the best score
grid.best_score_.round(3)

0.987

In [93]:
# Access the best hyperparameters
grid.best_params_

{'columntransformer__simpleimputer__strategy': 'constant',
 'selectfrommodel__threshold': 'median'}

In [94]:
# Display the regression coefficients of the features
pipe.named_steps.linearregression.coef_.round(3)

array([  6.975, -13.08 ,   6.106,  -9.286,  -8.139,  -0.044,   0.054])

In [95]:
# Display the regression intercept
pipe.named_steps.linearregression.intercept_.round(3)

68.87

In [96]:
# Display the estimator used for feature selection
pipe.named_steps.selectfrommodel.estimator

In [97]:
# Show the selected features 
X.columns[selection.get_support()]

Index(['X1', 'X2', 'X3', 'X4', 'X7', 'X8', 'X11'], dtype='object')

In [98]:
# Cross-validate the updated pipeline
cross_val_score(pipe, X_train, y_train, cv=5).mean().round(3)

0.986

# References

## numpy

- [API](https://numpy.org/devdocs/reference/index.html)

## pandas

- [API](https://pandas.pydata.org/docs/reference/index.html)

- [isna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html)

- [mask](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mask.html#pandas.DataFrame.mask)

- [options.display.max_rows, options.display.max_columns](https://pandas.pydata.org/docs/reference/api/pandas.set_option.html#pandas.set_option)

- [read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv)

- [shape](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html)

## python

- [library reference](https://docs.python.org/3/library/index.html)

- [datetime](https://docs.python.org/3/library/datetime.html)

## scikit-learn

- [API](https://scikit-learn.org/stable/modules/classes.html#)

- [compose module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.compose)

- [compose.make_column_transformer function](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html#sklearn.compose.make_column_transformer)

- [ensemble module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble)

- [ensemble.RandomForestRegressor class](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor)

- [feature_selection module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection)

- [feature_selection.SelectFromModel class](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel)

- [feature_selection.SelectFromModel.get_support() method](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel.get_support)

- [feature_selection.SelectKBest class](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest)

- [feature_selection.f_regression scoring function](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html#sklearn.feature_selection.f_regression)

- [impute module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute)

- [impute SimpleImputer class](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer)

- [linear_model module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)

- [linear_model.Lasso class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso)

- [linear_model.LassoCV class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html#sklearn.linear_model.LassoCV)

- [linear_model.LinearRegression class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)

- [linear models User Guide](https://scikit-learn.org/stable/modules/linear_model.html#linear-model)

- [linear regression example](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)

- [model_selection module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection)

- [model_selection cross_val_score function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score)

- [model_selection GridSearchCV class](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV)

- [model_selection train_test_split function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split)

- [pipeline module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline)

- [pipeline.make_pipeline function](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline)

- [tree module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree)

- [tree.DecisionTreeRegressor class](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor)

## XGBoost

- [API](https://xgboost.readthedocs.io/en/latest/python/python_api.html)

- [XGBRegressor class](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn)


## Machine learning

- [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))

- [feature selection](https://en.wikipedia.org/wiki/Feature_selection)

- [hyperparameter optimization](https://en.wikipedia.org/wiki/Hyperparameter_optimization)

- [imputation](https://en.wikipedia.org/wiki/Imputation_(statistics))

- [linear regression](https://en.wikipedia.org/wiki/Linear_regression)

# Glossary

**ColumnTransformer** It is a scikit-learn class that applies transformers to columns in a data set.

**Cross validation** It is a model validation technique for estimating how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a data set of known data on which training is performed (training data set) and a data set of unknown data (or first-seen data) against which the model is tested (testing data set). Cross validation tests the model's ability to predict new data that was not used in estimating it, in order to identify problems such as overfitting or selection bias and to give insight into how the model will generalize to an independent (unknown) data set. One round of cross validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (training set) and validating the analysis on the other subset (testing set). To reduce variability in the results, multiple rounds of cross validation are performed using different partitions, and the validation results are combined (averaged) over the rounds to give an estimate of the model's predictive performance.

**Data leakage** It is inadvertently including knowledge from the testing data when training a model. The resulting model is less reliable: leakage may lead to incorrect decisions when tuning hyperparameters and to overestimating how well the model will perform on new data.
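
One common way to guard against leakage is to keep preprocessing inside the pipeline, so that each cross-validation fold fits the imputer on its own training portion only. The sketch below assumes synthetic data; the names `X_demo`, `y_demo`, and `leak_free` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data (assumed) with some missing values in a noise column
rng = np.random.default_rng(seed=4)
X_demo = rng.normal(size=(120, 3))
y_demo = (2 * X_demo[:, 0] - X_demo[:, 1]
          + rng.normal(scale=0.1, size=120))
X_demo[::7, 2] = np.nan

# The imputer is part of the pipeline, so in every fold its mean is
# computed from that fold's training rows, never from the held-out rows
leak_free = make_pipeline(SimpleImputer(strategy='mean'),
                          LinearRegression())
scores = cross_val_score(leak_free, X_demo, y_demo, cv=5)
print(scores.round(3))
```

Imputing the whole data set before splitting would, by contrast, let statistics from the held-out rows influence training.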

**Decision tree** It is a non-parametric supervised machine learning method used for classification and regression. A fitted tree reports feature importances, which can be used to identify candidate features for a model.

**Feature** It is an independent variable that is controlled in order to cause an outcome in the dependent variable.

**Feature selection** It is the process of selecting a set of features for a model. The data set probably contains features that are redundant or irrelevant, and can be removed with little effect on a model.

**Hyperparameter** It is a value you set before the model fitting process; it controls how the model is fit and is not learned from the data.

**Linear regression** It is a linear approach to modeling the relationship between a target and one or more features, using linear predictor functions where unknown model parameters are estimated from the data.

**Machine Learning** Machine learning algorithms build a mathematical model on a training data subset in order to make predictions on a test data subset. Various measures are used to compare the actual and predicted data in the test data subset to estimate the performance of the model.

**Mask** It is a pandas method that replaces values where a condition is true with another value, NaN by default.

**Parameter** It is a value learned during the model fitting process.
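
The distinction between a hyperparameter and a parameter can be sketched with `Lasso`. The data and the value `alpha=0.5` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumed): only column 0 drives the target
rng = np.random.default_rng(seed=3)
X_demo = rng.normal(size=(100, 3))
y_demo = 5 * X_demo[:, 0] + rng.normal(scale=0.1, size=100)

# alpha is a hyperparameter: it is chosen before fitting
model = Lasso(alpha=0.5)
model.fit(X_demo, y_demo)

# coef_ and intercept_ are parameters: they are learned by fit()
print(model.get_params()['alpha'])
print(model.coef_.shape)
```

scikit-learn marks learned parameters with a trailing underscore, which is why they are only available after fitting.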

**Pipeline** A pipeline is a series of sequential steps. The output of each step is passed to the next step. It is a scikit-learn class that applies one or more column transformations and a final estimator. The final estimator only needs to implement fit. The purpose is to assemble several steps that can be cross-validated together while setting different hyperparameters.

**Target** It is a dependent variable that represents the outcome resulting from altering features (independent variables).

**Testing data set** It is the data set on which we evaluate the model: we supply the values of the features, predict the target, and compare the predicted values with the actual target values in order to evaluate the performance of the model.

**Training data set** It contains known values of the target. The model learns from these data, that is, we fit a model to estimate the relationship between the target and the features.

**Transformer** It is a scikit-learn object that transforms a column. For example, it can replace NaN with the average of all values in a column.

**Workflow** It is a repeatable pattern of activity, a sequence of operations. In scikit-learn, workflow is achieved through the Pipeline class.