<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction-to-scikit-learn's-pipelines" data-toc-modified-id="Introduction-to-scikit-learn's-pipelines-1">Introduction to scikit-learn's pipelines</a></span></li><li><span><a href="#-Brian-Spiering" data-toc-modified-id="-Brian-Spiering-2"> Brian Spiering</a></span></li><li><span><a href="#Who-am-I?" data-toc-modified-id="Who-am-I?-3">Who am I?</a></span></li><li><span><a href="#Motivating-Story" data-toc-modified-id="Motivating-Story-4">Motivating Story</a></span></li><li><span><a href="#Learning-Outcomes" data-toc-modified-id="Learning-Outcomes-5">Learning Outcomes</a></span></li><li><span><a href="#Scikit-learn-Review" data-toc-modified-id="Scikit-learn-Review-6">Scikit-learn Review</a></span></li><li><span><a href="#Scikit-learn's-Estimator" data-toc-modified-id="Scikit-learn's-Estimator-7">Scikit-learn's Estimator</a></span></li><li><span><a href="#What-if…" data-toc-modified-id="What-if…-8">What if…</a></span></li><li><span><a href="#Pipelines" data-toc-modified-id="Pipelines-9">Pipelines</a></span></li><li><span><a href="#&quot;Programming-is-all-about-managing-complexity&quot;" data-toc-modified-id="&quot;Programming-is-all-about-managing-complexity&quot;-10">"Programming is all about managing complexity"</a></span></li><li><span><a href="#Fit-many-models-automatically" data-toc-modified-id="Fit-many-models-automatically-11">Fit many models automatically</a></span></li><li><span><a href="#Transformations-based-on-column-type-" data-toc-modified-id="Transformations-based-on-column-type--12">Transformations based on column type </a></span></li><li><span><a href="#Takeaways" data-toc-modified-id="Takeaways-13">Takeaways</a></span></li><li><span><a href="#Bonus-Material" data-toc-modified-id="Bonus-Material-14">Bonus Material</a></span></li><li><span><a href="#Pipelines-for-cross-validation-" data-toc-modified-id="Pipelines-for-cross-validation--15">Pipelines for cross validation 
</a></span></li><li><span><a href="#Pipeline-advantages" data-toc-modified-id="Pipeline-advantages-16">Pipeline advantages</a></span></li><li><span><a href="#Sources-of-Inspiration" data-toc-modified-id="Sources-of-Inspiration-17">Sources of Inspiration</a></span></li><li><span><a href="#Check-for-understanding" data-toc-modified-id="Check-for-understanding-18">Check for understanding</a></span></li></ul></div>

<center><img src="images/pipeline_intro.png" width="100%"/></center>
<br>
<center><h2>Introduction to scikit-learn's pipelines</h2></center>
<center><h2> Brian Spiering</h2></center>

<center><h2>Who am I?</h2></center>
<center><img src="images/hi.png" width="40%"/></center>
<center>Brian Spiering</center>
<center>A Data Science Instructor at Metis</center>

<center><img src="images/olympic_rings.jpeg" width="95%"/></center>

[Image Source](https://compote.slate.com/images/005266c4-3c9f-416e-b2a4-f421e23ce879.jpeg?width=780&height=520&rect=1560x1040&offset=0x0)

<center><h2>Motivating Story</h2></center>

Imagine you are working as a junior data scientist at a media company related to the Olympics. You have just built your first model to predict which athletes will be the biggest influencers in the near future.

It took a while to find the right data, clean it and organize it, and fit that first algorithm. Now your boss says those dreaded words, "Does your code scale?"...

Today I'm going to show you how to start on the path towards scale and improve your machine learning programming skills.

<center><h2>Learning Outcomes</h2></center>

__By the end of this session, you should be able to__:

- Describe what a scikit-learn pipeline is.
- List the advantages of using scikit-learn's pipeline.
- Code scikit-learn pipelines to automate machine learning.

<center><h2>Scikit-learn Review</h2></center>

<br>
<center><img src="images/1200px-Scikit_learn_logo_small.svg.png" width="55%"/></center>

You should already be familiar with the fundamentals of scikit-learn.

Scikit-learn is a Python library for traditional machine learning.

Scikit-learn features:

- Open source
- Well-established
- Comprehensive
    - Handles preprocessing, cleaning, feature extraction and selection, and cross-validation
    - Most popular algorithms are included
- Consistent, easy-to-use interface
- Object-oriented programming (OOP) / class based.

In [94]:
# Scikit-learn Example - Build a machine learning model

In [95]:
reset -fs

In [96]:
# Create sample data 
import numpy as np

#             Number of followers, number of posts
X = np.array([[121_883, 2_921],    [192_981, 69_231]])

In [97]:
#             Number of likes
y = np.array([[10_342],            [17_841]])    

In [98]:
# Build a scikit-learn model 
from sklearn.linear_model  import LinearRegression

lr = LinearRegression()
lr.fit(X, y)

In [99]:
# For a new account, what is the predicted number of likes?
# Number of followers, number of posts
new_account = np.array([[143_231, 5_823]])
predicted_number_of_likes = lr.predict(new_account)[0][0]
print(f"{predicted_number_of_likes:,.0f}")

11,699


<center><h2>Scikit-learn's Estimator</h2></center>

The Estimator is a consistent interface shared by all scikit-learn algorithms.

All estimators have a fit method:   
`estimator.fit(X, [y])` 



__Estimators can have other methods__:

- `estimator.predict()`
    - Classification
    - Regression
    - Clustering

- `estimator.transform()` 
    - Preprocessing
    - Dimensionality reduction
    - Feature selection 

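When calling `fit`, `X` is required.  
`y` is optional: supervised estimators need it, while transformers and clustering estimators do not.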
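
As a minimal sketch of that shared interface, the example below calls `fit` on three kinds of estimators and then uses `transform` or `predict` as appropriate. The toy data and the names `X_demo`, `y_demo`, `scaler`, `reg`, and `km` are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration only
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y_demo = np.array([1.5, 2.5, 3.5, 4.5])

# Transformer: fit learns column means and variances, transform applies them
scaler = StandardScaler().fit(X_demo)      # y is not needed
X_scaled = scaler.transform(X_demo)

# Supervised predictor: fit needs both X and y
reg = LinearRegression().fit(X_scaled, y_demo)
y_hat = reg.predict(X_scaled)

# Unsupervised predictor: fit takes X only, predict returns cluster labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
labels = km.predict(X_scaled)
```

The three objects differ in what they learn, but the calling pattern is the same, which is what lets a pipeline chain them.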

<center><h2>What if…</h2></center>
<br>
You have built a baseline model. Now you wonder about possible improvements:


- What if I change the algorithm?

- What if I change a hyperparameter?

- What if I do different feature engineering?

With scikit-learn's Pipelines, it is easier:
- To explore different hyperparameter options.
- To reorder or add/remove steps.
- To write and __read__ more complex code.

<center><h2>Pipelines</h2></center>
<center><img src="images/pipeline_real_world.jpg" width="100%"/></center>

Pipelines are a very useful idea.

Pipelines allow things (oil or data) to move without losing any of the precious resource.

Pipelines have modular segments. If a section needs to be improved, only that section needs to be replaced.

In [100]:
# scikit-learn pipeline example
from sklearn.pipeline      import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X, y)

[Source](https://stackoverflow.com/questions/33091376/python-what-is-exactly-sklearn-pipeline-pipeline)

In [101]:
# Predict new, unlabeled data 
predicted_number_of_likes = pipe.predict(new_account)[0][0]
print(f"{predicted_number_of_likes:,.0f}")

11,632


<center><img src="images/pipeline-diagram-1.png" width="65%"/></center>
<center>Pipelines automatically apply the appropriate methods at the appropriate time</center>
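
To make "the appropriate methods at the appropriate time" concrete, here is a small sketch that runs the same two steps by hand and through a pipeline, then checks that the predictions agree. The synthetic data and the names `X_fit`, `y_fit`, `X_new`, and `pipe_demo` are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, invented for illustration only
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(20, 2))
y_fit = X_fit @ np.array([2.0, -1.0]) + 3.0
X_new = rng.normal(size=(5, 2))

# Manual version: every step is called by hand
scaler = StandardScaler()
model = LinearRegression()
model.fit(scaler.fit_transform(X_fit), y_fit)          # fit time: fit_transform, then fit
manual_pred = model.predict(scaler.transform(X_new))   # predict time: transform, then predict

# Pipeline version: the same calls happen inside fit and predict
pipe_demo = make_pipeline(StandardScaler(), LinearRegression())
pipe_demo.fit(X_fit, y_fit)
pipe_pred = pipe_demo.predict(X_new)

print(np.allclose(manual_pred, pipe_pred))
```

Note that the scaler is only ever fit on `X_fit`; at prediction time it is applied with `transform`, never refit.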

<center><h2>"Programming is all about managing complexity"</h2></center>

Abstractions help us do that. Pipelines are a great abstraction.

Instead of managing every step manually, we can create and use a single Pipeline object.

In [102]:
from sklearn.datasets        import load_diabetes 
from sklearn.linear_model    import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing   import StandardScaler

In [103]:
# Use Pipelines for diabetes dataset
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test= train_test_split(X, y)

In [104]:
# Create and use a pipeline class
from sklearn.pipeline        import Pipeline

pipe = Pipeline([('scaler', StandardScaler()), 
                 ('lr',     LinearRegression())])

In [105]:
pipe.fit(X_train, y_train)
pipe.predict(X_test);

<center><h2>Fit many models automatically</h2></center>


In [106]:
from sklearn.decomposition import PCA
from sklearn.linear_model  import Lasso, Ridge, ElasticNet, HuberRegressor, BayesianRidge
from sklearn.metrics       import mean_squared_error

In [107]:
X_train, X_validation, y_train, y_validation= train_test_split(X, y, random_state=42)

In [108]:
# Create a list of many algorithms
algorithms = [LinearRegression(), 
              Lasso(), 
              Ridge(), 
              ElasticNet(), 
              HuberRegressor(), 
              BayesianRidge()]


In [109]:
results = dict()

In [110]:
print("Mean squared error for diabetes dataset (regression): ")
for algo in algorithms:
    name = algo.__class__.__name__
    pipe = Pipeline([('scaler', StandardScaler()), 
                     ('pca',    PCA(n_components=5)),
                     ('lm',     algo)])

    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_validation)
    mse = mean_squared_error(y_validation, y_pred)
    results[name] = mse  # Keep each score so the models can be compared later
    print(f"    {name:<16}: {mse:,.1f}")

Mean squared error for diabetes dataset (regression): 
    LinearRegression: 2,770.3
    Lasso           : 2,753.3
    Ridge           : 2,770.2
    ElasticNet      : 2,868.0
    HuberRegressor  : 2,769.5
    BayesianRidge   : 2,769.9


Source: https://iaml.it/blog/optimizing-sklearn-pipelines/
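
The loop above builds a fresh pipeline for each algorithm. As an alternative sketch, a single step of an existing pipeline can be replaced with `set_params`, using the step's name as the keyword. The names `pipe_demo`, `X_demo`, and `y_demo` are invented for this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_diabetes(return_X_y=True)

pipe_demo = Pipeline([("scaler", StandardScaler()),
                      ("lm",     LinearRegression())])

# Replace the estimator registered under the step name "lm"
pipe_demo.set_params(lm=Ridge(alpha=1.0))
pipe_demo.fit(X_demo, y_demo)

# Fitted steps can be looked up by name
print(type(pipe_demo.named_steps["lm"]).__name__)    # Ridge
print(pipe_demo.named_steps["scaler"].mean_.shape)   # (10,)
```

Swapping a step this way leaves the rest of the pipeline definition untouched, which is the modularity the pipeline metaphor describes.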

In [111]:
# Visualize pipeline
from sklearn import set_config

set_config(display='diagram')
pipe

<center><img src="images/pipeline-diagram-1.png" width="75%"/></center>
<center>Pipelines automatically apply the correct steps</center>

<center><h2>Transformations based on column type </h2></center>

In [112]:
import pandas as pd

In [113]:
# Load income dataset  
data = pd.read_csv("adult.csv", index_col=0)
data.head() 

| | age | workclass | education | education-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [114]:
# Define target and features
target   = data.income
features = data.drop("income", axis=1)

In [115]:
# Find the categorical columns
categorical_columns = (features.dtypes==object)
categorical_columns

age               False
workclass          True
education          True
education-num     False
marital-status     True
occupation         True
relationship       True
race               True
gender             True
capital-gain      False
capital-loss      False
hours-per-week    False
native-country     True
dtype: bool

In [116]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, target)

In [117]:
from sklearn.pipeline        import Pipeline
from sklearn.preprocessing   import StandardScaler, OneHotEncoder
from sklearn.impute          import SimpleImputer

In [118]:
# Define a pipeline for continuous features
con_pipe = Pipeline([('scaler',   StandardScaler()),
                     ('imputer', SimpleImputer(strategy='median', add_indicator=True))])

In [119]:
# Define a pipeline for categorical features
# Impute before encoding so the encoder never sees missing values
cat_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent', add_indicator=True)),
                     ('ohe',     OneHotEncoder(handle_unknown='ignore'))])

In [120]:
# Put the two different pipelines together
from sklearn.compose import ColumnTransformer

preprocessing = ColumnTransformer([('categorical', cat_pipe,  categorical_columns),
                                   ('continuous',  con_pipe, ~categorical_columns),])

In [121]:
# Choose the algorithm
from sklearn.linear_model    import LogisticRegression

# Define the pipeline
pipe = Pipeline([('preprocessing', preprocessing), 
                 ('clf',           LogisticRegression(solver='liblinear'))])
pipe.fit(X_train, y_train)
pipe.predict(X_test)

array([' <=50K', ' <=50K', ' <=50K', ..., ' <=50K', ' <=50K', ' <=50K'],
      dtype=object)
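
The example above selects columns with a boolean mask computed ahead of time. As a hedged alternative, `make_column_selector` from `sklearn.compose` can pick columns by dtype when the transformer is fit. The toy frame `df_demo` and the name `preprocessing_demo` are invented for this sketch, and the imputation steps are left out to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame, invented for illustration only
df_demo = pd.DataFrame({"age":       [39, 50, 38, 53],
                        "workclass": ["State-gov", "Private", "Private", "Private"],
                        "hours":     [40, 13, 40, 40]})

# Columns are chosen by dtype at fit time rather than listed by hand
preprocessing_demo = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_exclude=np.number)),
    ("continuous",  StandardScaler(),
     make_column_selector(dtype_include=np.number)),
])

X_out = preprocessing_demo.fit_transform(df_demo)
print(X_out.shape)   # (4, 4): two one-hot columns plus two scaled numeric columns
```

Because the selection happens inside the transformer, the same object can be reused on a frame with a different column order.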

<center><h2>Takeaways</h2></center>

- Scikit-learn's Pipelines encapsulate all modeling steps.


- Pipelines act like a regular scikit-learn Estimator. Everything you would do with a regular algorithm, you can do with a Pipeline.



- Scikit-learn's Pipelines are helpful for writing more readable, robust, and maintainable code.


<center><img src="images/thank_you.png" width="50%"/></center>
<center>All these materials are in a GitHub repo:</center>   
<center><a href="https://bit.ly/pipeline-metis">bit.ly/pipeline-metis</a></center>   



<center><h2>Bonus Material</h2></center>

<center><h2>Pipelines for cross validation </h2></center>

In [122]:
# Load and split the data
from sklearn.datasets        import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, 
                                                    iris.target, 
                                                    test_size=0.2)

In [123]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline      import Pipeline
from sklearn.tree          import DecisionTreeClassifier

In [124]:
pipe_dt = Pipeline([('scl', StandardScaler()),          
                    ('pca', PCA(n_components=2)),       
                    ('clf', DecisionTreeClassifier())]) 

In [125]:
from sklearn.model_selection import cross_val_score, KFold

kfold = KFold(n_splits=10)
results = cross_val_score(pipe_dt, # Put your pipeline where an Estimator would go
                          X_train, 
                          y_train, 
                          cv=kfold)
print(f"{results.mean():.4f}")

0.8917
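
Because a pipeline can go wherever an Estimator would go, it can also be passed to a hyperparameter search. The sketch below uses `GridSearchCV` with the `<step name>__<parameter name>` naming convention; the grid values and the names `pipe_demo`, `param_grid`, and `search` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

pipe_demo = Pipeline([("scl", StandardScaler()),
                      ("pca", PCA()),
                      ("clf", DecisionTreeClassifier(random_state=0))])

# Parameters of a step are addressed as <step name>__<parameter name>
param_grid = {"pca__n_components": [1, 2, 3],
              "clf__max_depth":    [2, 3, 4]}

# Every candidate is refit on each training fold, preprocessing included
search = GridSearchCV(pipe_demo, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"{search.best_score_:.4f}")
```

After fitting, `search.best_estimator_` is itself a fitted pipeline and can be used for prediction directly.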


<center><h2>Pipeline advantages</h2></center>

- All steps in a single object (aka, encapsulation).

- Apply the appropriate steps separately to the training and test datasets.



- Help prevent data leakage, i.e., information from the test data influencing steps that are fit on the training data.


- Plays nicely with cross validation (CV) - feature selection within CV loops.
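
The data leakage point can be checked directly. In this minimal sketch, a pipeline is fit on a training split and the scaler inside it is inspected to confirm its statistics come from the training rows only. The split parameters and the names `X_tr`, `X_te`, and `pipe_demo` are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

pipe_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe_demo.fit(X_tr, y_tr)

# make_pipeline names each step after its lowercased class name
scaler = pipe_demo.named_steps["standardscaler"]

# The scaler's statistics come from the training split, not the full dataset
print(np.allclose(scaler.mean_, X_tr.mean(axis=0)))
print(np.allclose(scaler.mean_, X_demo.mean(axis=0)))

# At test time the pipeline reuses the training statistics
print(f"{pipe_demo.score(X_te, y_te):.3f}")
```

Scaling the full dataset before splitting would instead let the test rows shape the scaler, which is the leakage the pipeline avoids.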

<center><img src="https://imgs.xkcd.com/comics/data_pipeline.png" width="100%"/></center>


<center><h2>Sources of Inspiration</h2></center>

- [Advanced Machine Learning with Scikit-learn](https://www.youtube.com/watch?v=7l_WQO3JbWE&ab_channel=rakutentech)
- https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html
- [Kevin Goetsch | Deploying Machine Learning using sklearn pipelines](https://www.youtube.com/watch?v=URdnFlZnlaE&ab_channel=PyData)
- https://towardsdatascience.com/introduction-to-scikit-learns-pipelines-565cc549754a
- https://www.kdnuggets.com/2017/12/managing-machine-learning-workflows-scikit-learn-pipelines-part-1.html

<center><h2>Check for understanding</h2></center> 

Please order the lettered steps to build a valid pipeline:

```python
from sklearn.datasets        import load_diabetes  # load_boston was removed in scikit-learn 1.2
from sklearn.linear_model    import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline        import Pipeline
from sklearn.preprocessing   import StandardScaler


# A 
pipe.fit(X_train, y_train)

# B
X_train, X_test, y_train, y_test = train_test_split(X, y)

# C
X, y = load_diabetes(return_X_y=True)

# D
pipe.predict(X_test)

# E
pipe = Pipeline([('scaler', StandardScaler()), 
                 ('lr',     LinearRegression())])
```

<br>
<br> 
<br>

----