<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction-to-scikit-learn's-pipelines" data-toc-modified-id="Introduction-to-scikit-learn's-pipelines-1">Introduction to scikit-learn's pipelines</a></span></li><li><span><a href="#by-Brian-Spiering" data-toc-modified-id="by-Brian-Spiering-2">by Brian Spiering</a></span></li><li><span><a href="#Who-am-I?" data-toc-modified-id="Who-am-I?-3">Who am I?</a></span></li><li><span><a href="#Motivating-Story" data-toc-modified-id="Motivating-Story-4">Motivating Story</a></span></li><li><span><a href="#Learning-Outcomes" data-toc-modified-id="Learning-Outcomes-5">Learning Outcomes</a></span></li><li><span><a href="#Scikit-learn-Package-Review" data-toc-modified-id="Scikit-learn-Package-Review-6">Scikit-learn Package Review</a></span></li><li><span><a href="#Scikit-learn's-Estimator" data-toc-modified-id="Scikit-learn's-Estimator-7">Scikit-learn's Estimator</a></span></li><li><span><a href="#Pipelines" data-toc-modified-id="Pipelines-8">Pipelines</a></span></li><li><span><a href="#&quot;Programming-is-all-about-managing-complexity&quot;" data-toc-modified-id="&quot;Programming-is-all-about-managing-complexity&quot;-9">"Programming is all about managing complexity"</a></span></li><li><span><a href="#Pipeline-advantages" data-toc-modified-id="Pipeline-advantages-10">Pipeline advantages</a></span></li><li><span><a href="#Automate-fitting-many-models-with-a-pipeline" data-toc-modified-id="Automate-fitting-many-models-with-a-pipeline-11">Automate fitting many models with a pipeline</a></span></li><li><span><a href="#Pipelines-automatically-apply-the-correct-steps-" data-toc-modified-id="Pipelines-automatically-apply-the-correct-steps--12">Pipelines automatically apply the correct steps </a></span></li><li><span><a href="#Transformations-based-on-column-type-" data-toc-modified-id="Transformations-based-on-column-type--13">Transformations based on column type </a></span></li><li><span><a href="#Takeaways" 
data-toc-modified-id="Takeaways-14">Takeaways</a></span></li><li><span><a href="#Bonus-Material" data-toc-modified-id="Bonus-Material-15">Bonus Material</a></span></li><li><span><a href="#Sources-of-Inspiration" data-toc-modified-id="Sources-of-Inspiration-16">Sources of Inspiration</a></span></li><li><span><a href="#Check-for-understanding" data-toc-modified-id="Check-for-understanding-17">Check for understanding</a></span></li></ul></div>

<center><h2>Introduction to scikit-learn's pipelines</h2></center>
<center><h2>by Brian Spiering</h2></center>

<br>
<br>
<center><img src="images/pipeline_intro.png" width="100%"/></center>

<center><h2>Who am I?</h2></center>
<center><img src="images/hi.png" width="45%"/></center>
<center>Brian Spiering</center>
<center>A Data Science Instructor at Metis</center>

<br>
<center><img src="images/olympic_rings.jpeg" width="75%"/></center>

[Image Source](https://compote.slate.com/images/005266c4-3c9f-416e-b2a4-f421e23ce879.jpeg?width=780&height=520&rect=1560x1040&offset=0x0)

<center><h2>Motivating Story</h2></center>

Imagine you are working as a junior Data Scientist at a media company that covers the Olympics. You have just built your first model to predict which athletes will be the biggest influencers in the near future.

It took a while to find the right data, clean and organize it, and fit that first algorithm. Now your boss says those dreaded words, "Does your code scale?"...

Today I'm going to show you how to start on that path towards scale and improving your machine learning programming skills.

<center><h2>Learning Outcomes</h2></center>

__By the end of this session, you should be able to__:

- Describe what scikit-learn's pipeline is.
- List the advantages of using scikit-learn's pipeline.
- Code scikit-learn pipelines to automate machine learning.

<center><h2>Scikit-learn Package Review</h2></center>

<br>
<center><img src="images/1200px-Scikit_learn_logo_small.svg.png" width="35%"/></center>

You should already be familiar with the fundamentals of scikit-learn.

Scikit-learn is a Python library for traditional machine learning.

Scikit-learn features:

- Open source
- Well-established
- Comprehensive
    - Handles preprocessing, cleaning, feature extraction and selection, and cross-validation
    - Most popular algorithms are included
- Consistent, easy-to-use interface
- Object-oriented programming (OOP) / class based.

In [89]:
# Scikit-learn Example

In [90]:
reset -fs

In [91]:
# Create sample data 
import numpy as np

#             Number of followers, number of posts
X = np.array([[121_883, 2_921],    [192_981, 69_231]]) 
#             Number of likes
y = np.array([[10_342],            [17_841]])         

In [92]:
# Build a scikit-learn model 
from sklearn.linear_model  import LinearRegression

lr = LinearRegression()
lr.fit(X, y)

In [93]:
# For a new account, what is the predicted number of likes?
# Number of followers, number of posts
new_account = np.array([[143_231, 5_823]])
predicted_number_of_likes = lr.predict(new_account)[0][0]
print(f"{predicted_number_of_likes:,.0f}")

11,699


<center><h2>Scikit-learn's Estimator</h2></center>

The Estimator is the consistent interface shared by all scikit-learn algorithms. 

All estimators have a fit method:
`estimator.fit(X, [y])` 



__Estimators have other methods__:

- `estimator.predict`
    - Classification
    - Regression
    - Clustering

- `estimator.transform` (thus, called Transformers)
    - Preprocessing
    - Dimensionality reduction
    - Feature selection

X is required.  
y is optional.
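
As a rough illustration of this shared interface, the sketch below fits one predictor, one transformer, and one clustering estimator. The toy data and variable names are invented for this example and are not part of the original talk.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration only
X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y_toy = np.array([1.0, 2.0, 3.0, 4.0])

# Predictor: fit takes X and y, then predict works on new X
reg = LinearRegression().fit(X_toy, y_toy)
y_hat = reg.predict(X_toy)

# Transformer: fit takes only X, then transform returns a new X
scaler = StandardScaler().fit(X_toy)
X_scaled = scaler.transform(X_toy)

# Clustering: unsupervised, so y is left out entirely
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)
labels = km.predict(X_toy)
```

All three objects are created, fitted, and used in the same way, which is what lets a pipeline chain them together.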

<center><h2>Pipelines</h2></center>
<center><img src="images/pipeline_real_world.jpg" width="100%"/></center>

Pipelines are a very useful idea.

Pipelines allow things (oil or data) to move without losing any of the precious resource.

Pipelines have modular segments. If a section needs to be improved, only that section needs to be replaced.

In [94]:
# scikit-learn pipeline example
from sklearn.pipeline      import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X, y)

[Source](https://stackoverflow.com/questions/33091376/python-what-is-exactly-sklearn-pipeline-pipeline)

In [95]:
# Predict new, unlabeled data 
predicted_number_of_likes = pipe.predict(new_account)[0][0]
print(f"{predicted_number_of_likes:,.0f}")

11,632
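
To make the "modular segments" idea concrete, here is a minimal sketch of swapping a single step without rebuilding the rest of the pipeline. The data is randomly generated for illustration, and `demo_pipe` is a hypothetical name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small made-up dataset for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=50)

demo_pipe = make_pipeline(StandardScaler(), LinearRegression())
demo_pipe.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name
step_names = [name for name, _ in demo_pipe.steps]
print(step_names)

# Replace only the final segment; the scaler step is left as it was
demo_pipe.set_params(linearregression=Ridge(alpha=1.0))
demo_pipe.fit(X_demo, y_demo)
```

Note that the step keeps its old name after the swap. Naming steps explicitly with `Pipeline([...])` (for example `'model'`) avoids that mismatch.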


<center><h2>"Programming is all about managing complexity"</h2></center>

Abstractions help us do that.

Pipelines are a great abstraction.

Instead of tracking every step, we just create and pass around a single Pipeline object. This is an example of encapsulation.
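
One way to picture this encapsulation is a helper that accepts any estimator and never looks inside it. The sketch below assumes a hypothetical `fit_and_score` function and uses the diabetes dataset that ships with scikit-learn.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_and_score(model, X_train, X_test, y_train, y_test):
    """Hypothetical helper: fit any estimator and return its test MSE."""
    model.fit(X_train, y_train)
    return mean_squared_error(y_test, model.predict(X_test))


X_d, y_d = load_diabetes(return_X_y=True)
splits = train_test_split(X_d, y_d, random_state=0)  # X_train, X_test, y_train, y_test

# The helper never needs to know how many steps hide inside the model
mse_plain = fit_and_score(LinearRegression(), *splits)
mse_piped = fit_and_score(make_pipeline(StandardScaler(), LinearRegression()), *splits)
print(f"{mse_plain:,.2f} vs {mse_piped:,.2f}")
```

The calling code stays the same whether the model is a single algorithm or a chain of several steps.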


<center><h2>Pipeline advantages</h2></center>

- Encapsulate all steps in a single object. The pipeline can be evaluated and tuned as a single entity.


- Apply the appropriate steps to training and test datasets.



- Pipelines also help prevent data leakage, i.e., letting information from the test data influence what is learned from the training data.


- Plays nicely with cross validation (CV) - feature selection within CV loops.
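
The cross-validation and tuning points above can be sketched as follows. This is a minimal example under assumptions: the choice of `Ridge`, the step names, and the `alpha` grid are illustrative and not from the original talk.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_diabetes(return_X_y=True)

cv_pipe = Pipeline([('scaler', StandardScaler()),
                    ('ridge',  Ridge())])

# The scaler is refit on the training part of every fold,
# so the held-out part never influences the scaling statistics
scores = cross_val_score(cv_pipe, X_cv, y_cv, cv=5)
print(f"Mean R^2 across folds: {scores.mean():.3f}")

# Tune the whole pipeline as one object: <step name>__<parameter name>
search = GridSearchCV(cv_pipe, param_grid={'ridge__alpha': [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_cv, y_cv)
print(search.best_params_)
```

Scaling the full dataset before splitting would let the test rows shape the mean and standard deviation. Keeping the scaler inside the pipeline avoids that.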

In [96]:
from sklearn.datasets        import load_diabetes 
from sklearn.linear_model    import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline        import Pipeline
from sklearn.preprocessing   import StandardScaler

In [97]:
# More complex example
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test= train_test_split(X, y)

In [98]:
pipe = Pipeline([('scaler', StandardScaler()), 
                 ('lr',     LinearRegression())])
pipe.fit(X_train, y_train)
pipe.predict(X_test)

array([127.45794407, 200.00052924, 181.08097621, 126.15624501,
       180.06583649, 152.82885942,  87.24854341, 147.03459737,
        50.76464676, 189.99980731,  97.10458805, 200.17312234,
       144.00983657, 247.01588918, 168.5488558 , 189.4830258 ,
       253.11197931, 129.51407912, 190.24698676,  97.02849352,
       198.72668071, 104.93329416, 145.80610135, 255.36893366,
       134.99335461,  84.18428237, 144.33525853, 159.88486233,
       206.68293339,  76.28001216, 177.25101338, 142.01375307,
       170.46227529, 117.65078897, 218.50928565, 125.87047148,
       199.92012462, 250.11674501, 144.03588096,  77.32937469,
       198.3421902 ,  95.36871678, 148.0029773 , 152.36383179,
        68.78158525, 217.6206231 , 205.36218815, 268.59571017,
       103.24781079, 194.06071612,  70.21955291, 179.28652086,
       217.50188078, 199.19724839, 147.33720405, 204.8392049 ,
       149.42341798, 195.60768228,  68.65462989, 114.17727716,
       126.91315169, 113.12510826,  63.4345053 , 203.81

<center><h2>Automate fitting many models with a pipeline</h2></center>


In [99]:
from sklearn.decomposition import PCA
from sklearn.linear_model  import Lasso, Ridge, ElasticNet, HuberRegressor, BayesianRidge
from sklearn.metrics       import mean_squared_error

In [100]:
X_train, X_validation, y_train, y_validation= train_test_split(X, y, random_state=42)

# Programmatically fit 
algorithms = [LinearRegression(), Lasso(), Ridge(), ElasticNet(), HuberRegressor(), BayesianRidge()]
results = dict()

print("Mean squared error for diabetes dataset (regression): ")
for algo in algorithms:
    pipe = Pipeline([('scaler', StandardScaler()), 
                     ('pca',    PCA(n_components=5)),
                     ('lm',     algo)])

    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_validation)
    mse = mean_squared_error(y_validation, y_pred)
    results[algo.__class__.__name__] = mse  # Keep each score for later comparison
    print(f"    {algo.__class__.__name__:<16}: {mse:,.2f}")

Mean squared error for diabetes dataset (regression): 
    LinearRegression: 2,770.29
    Lasso           : 2,753.35
    Ridge           : 2,770.18
    ElasticNet      : 2,867.98
    HuberRegressor  : 2,769.49
    BayesianRidge   : 2,769.91


Source: https://iaml.it/blog/optimizing-sklearn-pipelines/
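
The loop above can also be handed to `GridSearchCV`, which accepts a list of estimators as candidate values for a pipeline step. The sketch below is one possible variation, not the original author's code. The candidate list, the PCA sizes, and the variable names are assumptions, and it scores with 5-fold cross-validation rather than a single validation split.

```python
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_gs, y_gs = load_diabetes(return_X_y=True)

gs_pipe = Pipeline([('scaler', StandardScaler()),
                    ('pca',    PCA(n_components=5)),
                    ('lm',     LinearRegression())])

# Each candidate replaces the 'lm' step; the PCA size is searched at the same time
param_grid = {'lm':                [LinearRegression(), Lasso(), Ridge()],
              'pca__n_components': [3, 5, 7]}

gs = GridSearchCV(gs_pipe, param_grid, scoring='neg_mean_squared_error', cv=5)
gs.fit(X_gs, y_gs)

best_algo = gs.best_params_['lm'].__class__.__name__
print(f"Best: {best_algo}, MSE: {-gs.best_score_:,.2f}")
```

scikit-learn scorers follow a "higher is better" convention, so the mean squared error is reported negated and flipped back for printing.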

In [101]:
# Visualize pipeline

from sklearn import set_config

set_config(display='diagram')

pipe

<center><h2>Pipelines automatically apply the correct steps </h2></center>
<br>
<center><img src="images/pipeline-diagram.png" width="65%"/></center>
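
As a sketch of what "the correct steps" means, the code below performs the scaling and fitting by hand and compares the result with a pipeline. Variable names are illustrative. The key detail is that the test data is only transformed, never used for fitting the scaler.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_m, y_m = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_m, y_m, random_state=0)

# By hand: fit_transform on training data, transform only on test data
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)      # Reuses the training mean and std
manual_pred = LinearRegression().fit(X_tr_scaled, y_tr).predict(X_te_scaled)

# With a pipeline: the same calls happen behind fit and predict
auto_pipe = make_pipeline(StandardScaler(), LinearRegression())
auto_pred = auto_pipe.fit(X_tr, y_tr).predict(X_te)

same_predictions = np.allclose(manual_pred, auto_pred)
print(same_predictions)
```

Both routes run the same operations, so the predictions should agree. The pipeline version simply removes the chance of calling `fit_transform` on the test set by mistake.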

<center><h2>Transformations based on column type </h2></center>

In [102]:
import pandas as pd

In [103]:


# Load dataset of income 
data = pd.read_csv("adult.csv", index_col=0)
data.tail() # Data analysis protip - Always look at the last rows. It is sometimes the most recent data. Sometimes the first rows are mock data.


Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
32556,27,Private,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K
32560,52,Self-emp-inc,HS-grad,9,Married-civ-spouse,Exec-managerial,Wife,White,Female,15024,0,40,United-States,>50K


In [104]:
target   = data.income
features = data.drop("income", axis=1)

# Find the categorical columns
categorical_columns = (features.dtypes == object)
categorical_columns

age               False
workclass          True
education          True
education-num     False
marital-status     True
occupation         True
relationship       True
race               True
gender             True
capital-gain      False
capital-loss      False
hours-per-week    False
native-country     True
dtype: bool

In [105]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, target)

In [106]:
from sklearn.pipeline        import Pipeline
from sklearn.preprocessing   import StandardScaler, OneHotEncoder
from sklearn.impute          import SimpleImputer

# Set up two preprocessing pipelines; impute missing values before scaling or encoding
con_pipe = Pipeline([('imputer', SimpleImputer(strategy='median', add_indicator=True)),
                     ('scaler', StandardScaler())])

cat_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent', add_indicator=True)),
                     ('ohe', OneHotEncoder(handle_unknown='ignore'))])

# Put them together
from sklearn.compose         import ColumnTransformer

preprocessing = ColumnTransformer([('categorical', cat_pipe,  categorical_columns),
                                   ('continuous',  con_pipe, ~categorical_columns),
                                   ])

# Add the algorithm 
from sklearn.linear_model    import LogisticRegression

pipe = Pipeline([('preprocessing', preprocessing), 
                 ('clf', LogisticRegression(solver='liblinear'))])
pipe.fit(X_train, y_train)
pipe.predict(X_test)

array([' <=50K', ' <=50K', ' <=50K', ..., ' <=50K', ' <=50K', ' <=50K'],
      dtype=object)
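
As an alternative to building a boolean mask by hand, scikit-learn offers `make_column_selector`, which picks columns by dtype at fit time. The sketch below uses a tiny invented DataFrame standing in for the income data. The column values and variable names are made up, and selecting on `'number'` rather than `object` is a choice made here for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny invented frame standing in for the income data
toy = pd.DataFrame({
    'age':       [25, 38, 52, 46, 31, 60],
    'hours':     [40, 50, 45, 60, 20, 35],
    'workclass': ['Private', 'Private', 'Self-emp', 'Self-emp', 'Private', 'Self-emp'],
    'income':    ['<=50K', '<=50K', '>50K', '>50K', '<=50K', '>50K'],
})
toy_X, toy_y = toy.drop('income', axis=1), toy['income']

# Selectors choose columns by dtype when fit is called, instead of a precomputed mask
toy_prep = ColumnTransformer([
    ('categorical', OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_exclude='number')),
    ('continuous',  StandardScaler(),                       make_column_selector(dtype_include='number')),
])

toy_pipe = make_pipeline(toy_prep, LogisticRegression(solver='liblinear'))
toy_pipe.fit(toy_X, toy_y)
toy_pred = toy_pipe.predict(toy_X)
print(toy_pred)
```

Because the selection happens inside the transformer, the same pipeline can be reused on a DataFrame with a different column order.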

<center><h2>Takeaways</h2></center>

- Scikit-learn's Pipelines can encapsulate all modeling steps.


- Pipelines act like a regular scikit-learn Estimator. Everything you would do with a regular algorithm, you can do with a Pipeline.



- Scikit-learn's Pipelines are helpful for writing more readable, robust, and maintainable code.


<br>
<center><img src="images/thank_you.png" width="75%"/></center>
<center>All these materials are in a GitHub repo:</center>   
<center><a href="https://bit.ly/pipeline-metis">bit.ly/pipeline-metis</a></center>   



<center><h2>Bonus Material</h2></center>

<center><img src="https://imgs.xkcd.com/comics/data_pipeline.png" width="75%"/></center>


<center><h2>Sources of Inspiration</h2></center>

- [Advanced Machine Learning with Scikit-learn](https://www.youtube.com/watch?v=7l_WQO3JbWE&ab_channel=rakutentech)
- https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html
- [Kevin Goetsch | Deploying Machine Learning using sklearn pipelines](https://www.youtube.com/watch?v=URdnFlZnlaE&ab_channel=PyData)
- https://towardsdatascience.com/introduction-to-scikit-learns-pipelines-565cc549754a
- https://www.kdnuggets.com/2017/12/managing-machine-learning-workflows-scikit-learn-pipelines-part-1.html

<center><h2>Check for understanding</h2></center> 

Please put the lettered steps in the order that builds and runs a valid pipeline:

```python
from sklearn.datasets        import load_diabetes
from sklearn.linear_model    import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline        import Pipeline
from sklearn.preprocessing   import StandardScaler


# A 
pipe.fit(X_train, y_train)

# B
X_train, X_test, y_train, y_test= train_test_split(X, y)

# C
X, y = load_diabetes(return_X_y=True)

# D
pipe.predict(X_test)

# E
pipe = Pipeline([('scaler', StandardScaler()), 
                 ('lr',     LinearRegression())])
```

<br>
<br> 
<br>

----