<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction-to-Scikit-Learn's-Pipelines" data-toc-modified-id="Introduction-to-Scikit-Learn's-Pipelines-1">Introduction to Scikit-Learn's Pipelines</a></span></li><li><span><a href="#Motivating-Story" data-toc-modified-id="Motivating-Story-2">Motivating Story</a></span></li><li><span><a href="#Learning-Outcomes" data-toc-modified-id="Learning-Outcomes-3">Learning Outcomes</a></span></li><li><span><a href="#Review-of-Scikit-learn-Package" data-toc-modified-id="Review-of-Scikit-learn-Package-4">Review of Scikit-learn Package</a></span></li><li><span><a href="#Scikit-learn's-Estimator" data-toc-modified-id="Scikit-learn's-Estimator-5">Scikit-learn's Estimator</a></span></li><li><span><a href="#Scikit-learn's-Pipeline" data-toc-modified-id="Scikit-learn's-Pipeline-6">Scikit-learn's Pipeline</a></span></li><li><span><a href="#&quot;Programming-is-all-about-managing-complexity&quot;" data-toc-modified-id="&quot;Programming-is-all-about-managing-complexity&quot;-7">"Programming is all about managing complexity"</a></span></li><li><span><a href="#Pipeline-advantages" data-toc-modified-id="Pipeline-advantages-8">Pipeline advantages</a></span></li><li><span><a href="#Check-for-understanding" data-toc-modified-id="Check-for-understanding-9">Check for understanding</a></span></li><li><span><a href="#Automate-fitting-many-models-with-a-pipeline" data-toc-modified-id="Automate-fitting-many-models-with-a-pipeline-10">Automate fitting many models with a pipeline</a></span></li><li><span><a href="#Pipelines-automatically-apply-the-correct-steps-" data-toc-modified-id="Pipelines-automatically-apply-the-correct-steps--11">Pipelines automatically apply the correct steps </a></span></li><li><span><a href="#Takeaways" data-toc-modified-id="Takeaways-12">Takeaways</a></span></li><li><span><a href="#Bonus-Material" data-toc-modified-id="Bonus-Material-13">Bonus Material</a></span></li><li><span><a 
href="#Transformations-based-on-column-type-" data-toc-modified-id="Transformations-based-on-column-type--14">Transformations based on column type </a></span></li><li><span><a href="#Sources-of-Inspiration" data-toc-modified-id="Sources-of-Inspiration-15">Sources of Inspiration</a></span></li></ul></div>

<center><h2>Introduction to Scikit-Learn's Pipelines</h2></center>
<br>
<br>
<center><img src="https://imgs.xkcd.com/comics/data_pipeline.png" width="75%"/></center>

<center><h2>Motivating Story</h2></center>

Imagine you are working as a junior data scientist at a media company related to the Olympics. You have just built your first model to predict which athletes will be the biggest influencers in the near future.

It took a while to find the right data, clean it, organize it, and fit that first algorithm. Now your boss says those dreaded words, "Does your code scale?"...

Today I'm going to show you how to start down the path toward code that scales, and how to improve your machine learning programming skills along the way.

<center><h2>Learning Outcomes</h2></center>

__By the end of this session, you should be able to__:

- Describe what a scikit-learn pipeline is.
- List the advantages of using scikit-learn's pipeline.
- Code scikit-learn pipelines to automate machine learning.

<center><h2>Review of Scikit-learn Package</h2></center>

<br>
<center><img src="images/1200px-Scikit_learn_logo_small.svg.png" width="35%"/></center>

You should already be familiar with the fundamentals of scikit-learn.

Scikit-learn is a Python library for traditional machine learning.

Scikit-learn features:

- Open source
- Well-established
- Comprehensive
    - Handles preprocessing, cleaning, feature extraction and selection, and cross-validation
    - Most popular algorithms are included
- Consistent, easy-to-use interface
- Object-oriented programming (OOP) / class based.

<center><h2>Scikit-learn's Estimator</h2></center>

The estimator is scikit-learn's consistent interface for all algorithms.

All estimators have a fit method:
`estimator.fit(X, [y])`

X (the features) is required.  
y (the target) is optional; unsupervised estimators do not use it.

Most estimators have either a `predict` or a `transform` method.

- `estimator.predict`
    - Classification
    - Regression
    - Clustering
- `estimator.transform` (estimators with this method are often called transformers)
    - Preprocessing
    - Dimensionality reduction
    - Feature selection
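
To make the two kinds of estimator concrete, here is a minimal sketch on made-up toy data (the numbers are an assumption chosen so that y = 2x exactly). It shows a transformer (`StandardScaler`) and a predictor (`LinearRegression`) both following the same `fit` convention.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: y = 2 * x exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# A transformer: fit learns statistics from X (no y needed), transform applies them.
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# A predictor: fit learns from X and y, predict returns estimates for new rows.
model = LinearRegression()
model.fit(X_scaled, y)

# New data must go through the same fitted transformer before predicting.
prediction = model.predict(scaler.transform(np.array([[5.0]])))
print(prediction)  # close to 10, because the toy relationship is exactly linear
```

Notice that we had to remember to call `scaler.transform` by hand before `predict`. That bookkeeping is what a pipeline takes over.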



<center><h2>Scikit-learn's Pipeline</h2></center>
<br>
<center><img src="images/Automate-Machine-Learning-Workflows-with-Pipelines-in-Python-and-scikit-learn.jpg" width="75%"/></center>

Pipelines are a very useful concept.

Pipelines allow things (oil or data) to move without losing any of the precious resource.

Pipelines have modular segments. If a section needs to be improved, only that section needs to be replaced.

<center><h2>"Programming is all about managing complexity"</h2></center>

Abstractions help us do that.

Pipelines are a great abstraction.

Instead of tracking every step ourselves, we create and pass around a single Pipeline object. This is an example of encapsulation.

[Source](https://stackoverflow.com/questions/33091376/python-what-is-exactly-sklearn-pipeline-pipeline)
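
To see what is being encapsulated, here is a deliberately simplified sketch in plain Python. `ToyPipeline`, `Centerer`, and `LineFitter` are hypothetical names invented for this illustration; this is not scikit-learn's actual implementation, only the general idea of chaining steps behind one object.

```python
class ToyPipeline:
    """Illustrative only: a simplified stand-in, not scikit-learn's Pipeline."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def fit(self, X, y=None):
        # Every step but the last is fitted, then used to transform the data.
        for _, transformer in self.steps[:-1]:
            X = transformer.fit(X, y).transform(X)
        # The final step is fitted on the fully transformed data.
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # At prediction time the transformers only transform; nothing is refitted.
        for _, transformer in self.steps[:-1]:
            X = transformer.transform(X)
        return self.steps[-1][1].predict(X)


class Centerer:
    """Hypothetical transformer: subtracts the mean learned during fit."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class LineFitter:
    """Hypothetical predictor: least-squares line, assuming centered inputs."""

    def fit(self, X, y):
        self.intercept_ = sum(y) / len(y)
        self.slope_ = (sum(x * (t - self.intercept_) for x, t in zip(X, y))
                       / sum(x * x for x in X))
        return self

    def predict(self, X):
        return [self.intercept_ + self.slope_ * x for x in X]


toy_pipe = ToyPipeline([("center", Centerer()), ("line", LineFitter())])
toy_pipe.fit([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
toy_predictions = toy_pipe.predict([4.0])
print(toy_predictions)  # [8.0], since the toy data follows y = 2 * x
```

The caller only ever touches `toy_pipe`; which steps run, and in what order, is hidden inside the object.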

<center><h2>Pipeline advantages</h2></center>

- Encapsulate all steps in a single object, so the pipeline can be evaluated and tuned as a single entity.
- Apply the appropriate steps to the training and test datasets.
- Help prevent data leakage, i.e., letting information from the test data influence training.
- Play nicely with cross-validation (CV): preprocessing and feature selection happen inside the CV loop.
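
The cross-validation point can be sketched as follows. This is a minimal example, assuming the built-in diabetes dataset and a simple scaler plus linear regression pipeline; the choice of 5 folds is arbitrary.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('lr',     LinearRegression())])

# For each of the 5 folds, a fresh copy of the whole pipeline is fitted on the
# training part only, so the scaler never sees the held-out rows.
scores = cross_val_score(pipe, X, y, cv=5)  # default scoring for regressors is R^2
print(scores.mean())
```

Had we scaled the full dataset before calling `cross_val_score`, every fold's held-out rows would have contributed to the scaling statistics, which is a mild form of leakage.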


In [45]:
reset -fs

In [46]:
# Create sample data 
import numpy as np

X = np.array([[1.0, 2],
              [5.6, 3.4]])
y = np.array([[.5], 
              [.4]])

In [47]:
from sklearn.pipeline      import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model  import LinearRegression

pipe = Pipeline([('scaler', StandardScaler()), 
                 ('lr',     LinearRegression())])

pipe.fit(X, y)

In [48]:
# Predict new, unlabeled data 
pipe.predict([[3.2, 5.4]])

array([[0.35465839]])
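
As a sanity check, the sketch below rebuilds the same prediction by hand, using the same toy data as above. It is only meant to show what the pipeline does on our behalf; the variable names are my own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2],
              [5.6, 3.4]])
y = np.array([[.5],
              [.4]])
X_new = np.array([[3.2, 5.4]])

# With a pipeline: one fit, one predict.
pipe = Pipeline([('scaler', StandardScaler()),
                 ('lr',     LinearRegression())])
pipe.fit(X, y)
pipe_pred = pipe.predict(X_new)

# By hand: fit and transform with the scaler, fit the model,
# then transform (not refit!) the new data before predicting.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
lr = LinearRegression().fit(X_scaled, y)
manual_pred = lr.predict(scaler.transform(X_new))

print(np.allclose(pipe_pred, manual_pred))  # the two approaches agree
```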

<center><h2>Check for understanding</h2></center> 

Please put the lettered steps in the correct order to build and use a valid pipeline:

```python
from sklearn.datasets        import load_diabetes
from sklearn.linear_model    import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline        import Pipeline
from sklearn.preprocessing   import StandardScaler


# A
pipe.fit(X_train, y_train)

# B
X_train, X_test, y_train, y_test = train_test_split(X, y)

# C
X, y = load_diabetes(return_X_y=True)

# D
pipe.predict(X_test)

# E
pipe = Pipeline([('scaler', StandardScaler()), 
                 ('lr',     LinearRegression())])
```

In [49]:
from sklearn.datasets        import load_diabetes 
from sklearn.linear_model    import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline        import Pipeline
from sklearn.preprocessing   import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipe = Pipeline([('scaler', StandardScaler()), 
                 ('lr',     LinearRegression())])
pipe.fit(X_train, y_train)
pipe.predict(X_test)

array([ 69.24755414, 245.95913   ,  44.65336419, 161.5511203 ,
       111.54762196, 104.64166865, 269.52158035, 163.13999079,
       151.65590642, 100.1180136 , 148.84519682, 122.82295722,
       198.08282835,  97.56478248, 107.00683262, 241.67743723,
        65.17053337, 185.04632695, 148.2868894 ,  56.31221281,
       172.85971386, 124.05689226,  53.22499208, 145.44512585,
       156.90729155, 157.20390548, 165.45484087, 106.60980046,
       199.14585387, 123.38210953,  87.86562253,  71.62794336,
       167.84424532,  69.56329691, 146.01147354, 213.70763172,
       119.92080345, 166.20299619, 191.22496344, 189.82231486,
       157.41481894,  97.60374732,  42.84949052, 133.61253311,
       156.15257676, 216.80962354,  53.75273145, 180.91143248,
       180.42989344, 166.4189052 , 150.85549002, 110.97862847,
        51.54180291, 109.03043226, 191.76337321, 278.27814946,
       104.48549053,  72.45736949, 199.11962228, 201.75291516,
       169.12463872,  71.30234905,  76.57457921, 170.63

<center><h2>Automate fitting many models with a pipeline</h2></center>


In [50]:
from sklearn.decomposition import PCA
from sklearn.linear_model  import Lasso, Ridge, ElasticNet, HuberRegressor, BayesianRidge
from sklearn.metrics       import mean_squared_error

X_train, X_validation, y_train, y_validation = train_test_split(X, y, random_state=42)

# Programmatically fit each algorithm inside the same preprocessing pipeline
algorithms = [LinearRegression(), Lasso(), Ridge(), ElasticNet(), HuberRegressor(), BayesianRidge()]
results = dict()  # algorithm name -> validation MSE

print("Mean squared error for diabetes dataset (regression): ")
for algo in algorithms:
    pipe = Pipeline([('scaler', StandardScaler()), 
                     ('pca',    PCA(n_components=5)),
                     ('lm',     algo)])

    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_validation)
    mse = mean_squared_error(y_validation, y_pred)
    results[algo.__class__.__name__] = mse
    print(f"    {algo.__class__.__name__:<16}: {mse:,.2f}")

Mean squared error for diabetes dataset (regression): 
    LinearRegression: 2,770.29
    Lasso           : 2,753.35
    Ridge           : 2,770.18
    ElasticNet      : 2,867.98
    HuberRegressor  : 2,769.49
    BayesianRidge   : 2,769.91
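
The loop above compares models on a single validation split. An alternative sketch, assuming we are happy to let `GridSearchCV` do the looping with cross-validation, treats the final step itself as something to search over. The grid values here are arbitrary choices for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('pca',    PCA()),
                 ('lm',     LinearRegression())])

# '<step>__<parameter>' reaches a parameter inside a step;
# a bare step name ('lm') swaps out the whole step.
param_grid = {'pca__n_components': [3, 5, 8],
              'lm': [LinearRegression(), Lasso(), Ridge()]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error')
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated mean squared error of the best combination
```

Because the whole pipeline is the thing being tuned, the scaler and PCA are refit inside every fold for every candidate.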


<center><h2>Pipelines automatically apply the correct steps </h2></center>
<br>
<center><img src="images/pipeline-diagram.png" width="80%"/></center>

Source: https://iaml.it/blog/optimizing-sklearn-pipelines/
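
One way to convince yourself of this is to inspect the fitted steps. In the sketch below (again assuming the diabetes dataset and an arbitrary `random_state`), the scaler inside the pipeline holds statistics computed from the training rows only.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('lr',     LinearRegression())])
pipe.fit(X_train, y_train)   # fit: the scaler is fitted, then transforms X_train
pipe.predict(X_test)         # predict: the scaler only transforms X_test

# named_steps gives access to the fitted objects inside the pipeline.
scaler = pipe.named_steps['scaler']
learned_from_train_only = np.allclose(scaler.mean_, X_train.mean(axis=0))
matches_full_data = np.allclose(scaler.mean_, X.mean(axis=0))
print(learned_from_train_only, matches_full_data)
```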

<center><h2>Takeaways</h2></center>

- Scikit-learn Pipelines are helpful for writing production-level code.
- Pipelines encapsulate all the modeling steps.
- Pipelines act like a regular estimator. Everything you would do with a regular algorithm, you can do with a Pipeline.
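
As one small illustration of the last point, a fitted pipeline can be scored and saved like any other estimator. This sketch uses Python's built-in `pickle` purely as an example of persistence; only unpickle files you trust, and ideally with the same scikit-learn version that created them.

```python
import pickle

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('lr',     LinearRegression())])
pipe.fit(X, y)

# score works as it does for a bare regressor (R^2 on the given data).
r2 = pipe.score(X, y)

# The whole pipeline, preprocessing included, round-trips as one object.
restored = pickle.loads(pickle.dumps(pipe))
same_predictions = np.allclose(pipe.predict(X[:5]), restored.predict(X[:5]))
print(r2, same_predictions)
```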

<center><h2>Bonus Material</h2></center>

In [51]:
# Visualize pipeline

from sklearn import set_config

set_config(display='diagram')

pipe

<center><h2>Transformations based on column type </h2></center>

In [53]:
import pandas as pd

# Load dataset of income 
data = pd.read_csv("adult.csv", index_col=0)
data.tail() # Data analysis protip - Always look at the last rows. It is sometimes the most recent data. Sometimes the first rows are mock data.


| Unnamed: 0 | age | workclass | education | education-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32556 | 27 | Private | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | <=50K |
| 32557 | 40 | Private | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 32558 | 58 | Private | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 32559 | 22 | Private | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |
| 32560 | 52 | Self-emp-inc | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States | >50K |


In [54]:
target   = data.income
features = data.drop("income", axis=1)

# Find the categorical columns
categorical_columns = (features.dtypes == object)
categorical_columns

age               False
workclass          True
education          True
education-num     False
marital-status     True
occupation         True
relationship       True
race               True
gender             True
capital-gain      False
capital-loss      False
hours-per-week    False
native-country     True
dtype: bool

In [59]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, target)

In [61]:
from sklearn.pipeline        import Pipeline
from sklearn.preprocessing   import StandardScaler, OneHotEncoder
from sklearn.impute          import SimpleImputer

# Set up two preprocessing pipelines.
# Impute first, so the scaler and the encoder receive complete data.
con_pipe = Pipeline([('imputer', SimpleImputer(strategy='median', add_indicator=True)),
                     ('scaler', StandardScaler())])

cat_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent', add_indicator=True)),
                     ('ohe', OneHotEncoder(handle_unknown='ignore'))])

# Put them together
from sklearn.compose         import ColumnTransformer

preprocessing = ColumnTransformer([('categorical', cat_pipe,  categorical_columns),
                                   ('continuous',  con_pipe, ~categorical_columns),
                                   ])

# Add the algorithm 
from sklearn.linear_model    import LogisticRegression

pipe = Pipeline([('preprocessing', preprocessing), 
                 ('clf', LogisticRegression(solver='liblinear'))])
pipe.fit(X_train, y_train)
pipe.predict(X_test)

array([' <=50K', ' <=50K', ' <=50K', ..., ' >50K', ' <=50K', ' <=50K'],
      dtype=object)
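
Since `adult.csv` may not be on your machine, here is a self-contained sketch of the same idea on a tiny, made-up DataFrame (the rows are hypothetical and only mimic a few of the dataset's columns). It also swaps the boolean mask for `make_column_selector`, which is one alternative way to pick columns by dtype.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows standing in for adult.csv, with a few missing values.
toy = pd.DataFrame({
    'age':            [25, 38, np.nan, 52, 46, 31],
    'hours-per-week': [40, 50, 40, 60, 45, 20],
    'workclass':      pd.Series(['Private', 'Private', 'Self-emp-inc',
                                 np.nan, 'Private', 'Self-emp-inc'], dtype=object),
    'income':         ['<=50K', '>50K', '<=50K', '>50K', '>50K', '<=50K'],
})
target   = toy['income']
features = toy.drop('income', axis=1)

con_pipe = Pipeline([('imputer', SimpleImputer(strategy='median')),
                     ('scaler',  StandardScaler())])
cat_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                     ('ohe',     OneHotEncoder(handle_unknown='ignore'))])

# Route columns to a sub-pipeline based on their dtype.
preprocessing = ColumnTransformer([
    ('categorical', cat_pipe, make_column_selector(dtype_include=object)),
    ('continuous',  con_pipe, make_column_selector(dtype_exclude=object)),
])

pipe = Pipeline([('preprocessing', preprocessing),
                 ('clf', LogisticRegression(solver='liblinear'))])
pipe.fit(features, target)
predictions = pipe.predict(features)

# A category never seen during fit is ignored by the encoder instead of raising.
unseen = pd.DataFrame({'age': [29.0],
                       'hours-per-week': [35],
                       'workclass': pd.Series(['Never-worked'], dtype=object)})
unseen_prediction = pipe.predict(unseen)
print(predictions, unseen_prediction)
```

With only six rows the predictions themselves mean nothing; the point is that mixed column types, missing values, and unseen categories all flow through a single `fit` and `predict`.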

<center><h2>Sources of Inspiration</h2></center>

- [Advanced Machine Learning with Scikit-learn](https://www.youtube.com/watch?v=7l_WQO3JbWE&ab_channel=rakutentech)
- [Introducing Scikit-Learn (Python Data Science Handbook)](https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html)