<a href="https://colab.research.google.com/github/stepthom/869_course/blob/main/pipeline/slides_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Playground for Pipeline Slides

- Stephen W. Thomas
- Used for MMA 869, MMAI 869, and GMMA 869

Pipelines are awesome and have many benefits. 

Pipelines let you put all your cleaning, preprocessing, and feature engineering steps into one nice little package. You can use your pipeline to easily transform training, testing, and any other data in the exact same way!
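To make "the exact same way" concrete, here is a minimal sketch on made-up toy arrays (`X_train_toy` and `X_test_toy` are hypothetical, not the course data). The pipeline learns its scaling statistics from the training data only, and then reuses them on the test data.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: a single numeric feature.
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[2.0], [10.0]])

prep = Pipeline(steps=[("scaler", StandardScaler())])

# fit() learns the mean and standard deviation from the training data only.
prep.fit(X_train_toy)

# transform() reuses those training statistics on any other data.
train_scaled = prep.transform(X_train_toy)
test_scaled = prep.transform(X_test_toy)

print(prep.named_steps["scaler"].mean_)  # mean learned from the training data
print(test_scaled.ravel())               # test values scaled with that mean
```

Note that the test value 2.0 maps to 0 because it equals the training mean, not the test mean.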

What you put into pipelines is up to you - they are very flexible. Think of pipelines as Legos, or building blocks. You have lots of small blocks that you can combine in many different ways to create almost anything you wish. You can add steps to scale numeric features, encode categorical features, derive new features, etc.
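As a small illustration of the building-block idea, the sketch below combines two blocks with a `ColumnTransformer`: one scales a numeric column and one encodes a categorical column. The toy DataFrame and its column names (`amount`, `city`, `id`) are invented for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "amount": [100.0, 200.0, 300.0, 400.0],
    "city": ["A", "B", "A", "C"],
    "id": ["u1", "u2", "u3", "u4"],
})

blocks = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["amount"]),                      # scale numeric
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),  # encode categorical
    ],
    remainder="drop",      # columns not listed above (here: id) are dropped
    sparse_threshold=0,    # always return a dense array
)

out = blocks.fit_transform(toy)
print(out.shape)  # one scaled column plus one column per city
```

Each tuple is one block: a name, a transformer, and the columns it applies to. Swapping a block in or out does not affect the others.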

Pipelines can be tricky to learn, since they have so much flexibility, but once you get the hang of the basics, you'll never go back to the old, "manual" way of transforming your data.

Great documentation can be found on scikit-learn's website:

https://scikit-learn.org/stable/modules/compose.html

In [1]:
import pandas as pd
import numpy as np

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

The scikit-learn version is 1.0.2.


In [3]:
import os
os.getcwd()

'c:\\Users\\james\\OneDrive\\869_course\\pipeline'

In [4]:
df = pd.read_csv('https://raw.githubusercontent.com/stepthom/869_course/main/data/generated_german.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 58 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   UserID                                  1000 non-null   object 
 1   FirstName                               1000 non-null   object 
 2   LastName                                1000 non-null   object 
 3   DateOfBirth                             1000 non-null   object 
 4   Sex                                     1000 non-null   object 
 5   Street                                  1000 non-null   object 
 6   City                                    1000 non-null   object 
 7   LicensePlate                            1000 non-null   object 
 8   Married                                 1000 non-null   float64
 9   NumberPets                              1000 non-null   float64
 10  Duration                                1000 non-null   int64

In [5]:
df.head()

| | UserID | FirstName | LastName | DateOfBirth | Sex | Street | City | LicensePlate | Married | NumberPets | ... | Housing.Rent | Housing.Own | Housing.ForFree | Job.UnemployedUnskilled | Job.UnskilledResident | Job.SkilledEmployee | Job.Management.SelfEmp.HighlyQualified | EmploymentDuration | SavingsAccountBonds | BadCredit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 216-88-4089 | Christopher | Ramos | 1953-09-02 | M | 011 Amy Village Suite 982 | North Judithbury | 0-E4505 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 7.0 | 281.0 | 0 |
| 1 | 880-61-3645 | Mary | Gomez | 1999-09-30 | F | 232 Nicole Burg Apt. 563 | East Jill | 160F | 1.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 2.0 | 0.0 | 1 |
| 2 | 183-75-4350 | Curtis | Sellers | 1973-03-01 | M | 3303 Murphy Way Apt. 458 | West Michael | VR1 K4A | 0.0 | 2.0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 4.0 | 0.0 | 0 |
| 3 | 681-43-5144 | Andrew | Carlson | 1975-10-17 | M | 569 Paul Ports Apt. 406 | North Judithbury | 8QZ3561 | 0.0 | 1.0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 4.0 | 0.0 | 0 |
| 4 | 416-71-0007 | Matthew | Martinez | 1969-05-15 | M | 465 Terry Plaza Apt. 366 | Ramirezstad | 299X4 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 2.0 | 0.0 | 1 |


In [6]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(df.describe().T)

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Married | 1000.0 | 0.402 | 0.490547 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| NumberPets | 1000.0 | 1.074 | 0.805713 | 0.0 | 0.0 | 1.0 | 2.0 | 2.0 |
| Duration | 1000.0 | 20.903 | 12.058814 | 4.0 | 12.0 | 18.0 | 24.0 | 72.0 |
| Amount | 1000.0 | 5888.252 | 5080.928343 | 450.0 | 2458.0 | 4175.0 | 7150.25 | 33163.0 |
| InstallmentRatePercentage | 1000.0 | 2.973 | 1.118715 | 1.0 | 2.0 | 3.0 | 4.0 | 4.0 |
| ResidenceDuration | 1000.0 | 2.845 | 1.103718 | 1.0 | 2.0 | 3.0 | 4.0 | 4.0 |
| NumberExistingCredits | 1000.0 | 1.407 | 0.577654 | 1.0 | 1.0 | 1.0 | 2.0 | 4.0 |
| NumberPeopleMaintenance | 1000.0 | 1.155 | 0.362086 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| OwnCar | 1000.0 | 0.596 | 0.490943 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| ForeignWorker | 1000.0 | 0.963 | 0.188856 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [7]:
from sklearn.model_selection import train_test_split

X = df.drop(['BadCredit'], axis=1)
y = df['BadCredit']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Simple Pipeline 1

In [8]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score

# For our very first pipeline, we're going to keep it very simple.
# We're going to define some features to keep, and we'll drop the rest.
# We won't do any preprocessing, and we'll just pass the features straight on
# to an estimator/classifier.  
# That's it!

keep_features = ['Amount', 'Duration']

clf = RandomForestClassifier(random_state=42)

preprocessor1 = ColumnTransformer(
    transformers=[
        ('keep_and_do_nothing', 'passthrough', keep_features),
        ],
        remainder = 'drop')

pipe1 = Pipeline(steps=[("preprocessor", preprocessor1), ("clf", clf)])

scores1 = cross_val_score(pipe1, X_train, y_train, 
                          scoring='f1_macro', cv=10, n_jobs=-1)
print(scores1)
print(np.mean(scores1))

[0.5        0.60286817 0.54403383 0.59272727 0.59329693 0.51566952
 0.47525343 0.43452381 0.66437834 0.5826087 ]
0.5505359994404786


In [9]:
X_train[keep_features]

| | Amount | Duration |
|---|---|---|
| 29 | 12305 | 60 |
| 535 | 4174 | 21 |
| 695 | 2225 | 6 |
| 557 | 9005 | 21 |
| 836 | 1595 | 12 |
| ... | ... | ... |
| 106 | 11624 | 18 |
| 270 | 4792 | 18 |
| 860 | 10447 | 24 |
| 435 | 2671 | 12 |


In [10]:
# What did the features look like after preprocessing?

# First, we have to fit the pipe! (cross_val_score above only fit internal
# copies of the pipeline, so pipe1 itself is still unfitted.)
pipe1 = pipe1.fit(X_train, y_train)

# Make a dataframe so it prints prettier
_tmp = pd.DataFrame(pipe1.named_steps['preprocessor'].transform(X_train))
_tmp.head()

| | 0 | 1 |
|---|---|---|
| 0 | 12305 | 60 |
| 1 | 4174 | 21 |
| 2 | 2225 | 6 |
| 3 | 9005 | 21 |
| 4 | 1595 | 12 |
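The reason the pipe has to be fitted again is that `cross_val_score` works on clones of the estimator it is given. A minimal sketch on synthetic data (the names `toy_pipe`, `X_toy`, and `y_toy` are made up here) shows that the original pipeline is still unfitted after cross-validation. Checking for the `classes_` attribute is just one simple way to test whether the classifier step has been fitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, made up for illustration.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(60, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])

# cross_val_score fits clones, one per fold, and then discards them.
cross_val_score(toy_pipe, X_toy, y_toy, cv=3)
fitted_after_cv = hasattr(toy_pipe.named_steps["clf"], "classes_")

# An explicit fit() is what actually fits the pipeline object we hold.
toy_pipe.fit(X_toy, y_toy)
fitted_after_fit = hasattr(toy_pipe.named_steps["clf"], "classes_")

print(fitted_after_cv, fitted_after_fit)
```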


In [11]:
# What were the feature importances (or whatever you're interested in) of the classifier?
print(pipe1.named_steps['clf'].feature_importances_)

[0.84207977 0.15792023]
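A bare array of importances is hard to read. When the preprocessor only passes columns through, the importances come out in the same order as the list of kept columns, so they can be labelled with a pandas Series. The sketch below uses synthetic data with invented column names (`signal`, `noise`, `ignored`) rather than the course data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Synthetic data, made up for illustration: only "signal" drives the target.
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
    "ignored": rng.normal(size=200),
})
target = (toy["signal"] > 0).astype(int)

kept = ["signal", "noise"]
toy_pipe = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(
        transformers=[("keep", "passthrough", kept)], remainder="drop")),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])
toy_pipe.fit(toy, target)

# Passthrough keeps the column order, so the kept list can label the values.
importances = pd.Series(
    toy_pipe.named_steps["clf"].feature_importances_, index=kept
).sort_values(ascending=False)
print(importances)
```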


# Simple Pipeline 2

In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score

numeric_features = ['Amount', 'Duration', 'NumberPets', 
                    'ResidenceDuration', 'Married', 'EmploymentDuration']

categorical_features = ['City', 'Sex']

drop_features = ['UserID', 'DateOfBirth', 'FirstName', "LastName", 
                 'Street', 'LicensePlate']

clf = RandomForestClassifier(random_state=42)

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ])

categorical_transformer = Pipeline(steps=[
      ('encoder', OneHotEncoder(handle_unknown='ignore')),
      ])

preprocessor2 = Pipeline(steps=[
      ('ct', ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features),
            ('drop', 'drop', drop_features)],
            remainder = 'passthrough', 
            sparse_threshold=0)),
    ])

pipe2 = Pipeline(steps=[('preprocessor', preprocessor2),  ('clf', clf)])

scores2 = cross_val_score(pipe2, X_train, y_train, 
                          scoring='f1_macro', cv=10, n_jobs=-1)
print(scores2)
print(np.mean(scores2))

[0.62036238 0.66600747 0.63085036 0.62036238 0.56989247 0.59914102
 0.74570884 0.70116458 0.62036238 0.63333333]
0.6407185216279683
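The `handle_unknown='ignore'` setting matters for a feature like `City`, where a validation fold or the test set can contain values that never appeared during fitting. A minimal sketch of the behaviour, using a tiny made-up DataFrame ("Atlantis" is an invented unseen value):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data, made up for illustration.
train_cities = pd.DataFrame({"City": ["East Jill", "West Michael", "East Jill"]})
new_cities = pd.DataFrame({"City": ["West Michael", "Atlantis"]})  # "Atlantis" is unseen

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cities)

# An unseen category becomes a row of all zeros instead of raising an error.
encoded = encoder.transform(new_cities).toarray()
print(encoder.categories_[0])
print(encoded)
```

With the default `handle_unknown='error'`, the same `transform` call would raise a `ValueError` on the unseen city.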


In [13]:
# What did the features look like after preprocessing?

# First, we have to fit the pipe! (cross_val_score above only fit internal
# copies of the pipeline, so pipe2 itself is still unfitted.)
pipe2 = pipe2.fit(X_train, y_train)

# Make a dataframe so it prints prettier
_tmp = pd.DataFrame(pipe2.named_steps['preprocessor'].transform(X_train))
_tmp.head()

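| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.199953 | 3.297082 | -0.103271 | 1.044509 | -0.829315 | 1.73915 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | -0.359666 | -0.008051 | -0.103271 | -1.67144 | -0.829315 | -1.142989 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | -0.733507 | -1.279256 | -0.103271 | 1.044509 | -0.829315 | -0.921286 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 852.0 |
| 3 | 0.566975 | -0.008051 | -0.103271 | 1.044509 | 1.205814 | -0.47788 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 130.0 |
| 4 | -0.854348 | -0.770774 | -1.355038 | -0.766124 | 1.205814 | -0.47788 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 474.0 |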


In [14]:
# What were the feature importances (or whatever you're interested in) of the classifier?
print(pipe2.named_steps['clf'].feature_importances_)

[0.09591925 0.07294257 0.0240989  0.03132846 0.01200917 0.06381552
 0.00271897 0.0079207  0.00895789 0.00422677 0.00556936 0.00148015
 0.0038158  0.0056202  0.00525146 0.00822997 0.00473592 0.00325496
 0.00848567 0.01676543 0.0031448  0.00329898 0.0035372  0.00226122
 0.00306096 0.00295967 0.01243699 0.01238957 0.03296967 0.01760942
 0.0095418  0.01519741 0.00273081 0.03995237 0.01517252 0.00651855
 0.04744008 0.0129897  0.01245579 0.01393637 0.00809382 0.01889883
 0.01823881 0.00878105 0.01308798 0.01327113 0.00187625 0.00298409
 0.00910227 0.         0.00104668 0.00968492 0.00236956 0.01071255
 0.00716616 0.00832383 0.01457032 0.01298999 0.01294557 0.01035928
 0.01389978 0.00634799 0.01368854 0.01074054 0.01674542 0.00727676
 0.00175692 0.01138105 0.01610366 0.01371005 0.04709522]


# Pipeline 3

In [15]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import KernelPCA, PCA, TruncatedSVD
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score

numeric_features = ['Amount', 'Duration', 'NumberPets', 
                    'ResidenceDuration', 'Married', 'EmploymentDuration']

categorical_features = ['City', 'Sex']

drop_features = ['UserID', 'FirstName', "LastName", 
                 'Street', 'LicensePlate']

# A custom transformer that takes in a date feature stored as strings (e.g.,
# date of birth as 'YYYY-MM-DD') and calculates the age in years as of 2021.
def get_age_years(feature):
  # The year is the first four characters of each date string.
  ages = [2021 - int(instance[0:4]) for instance in feature]
  return np.array(ages, dtype=float).reshape(-1, 1)

clf = RandomForestClassifier(random_state=42)

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ])

categorical_transformer = Pipeline(steps=[
      ('encoder', OneHotEncoder(handle_unknown='ignore')),
      ])

preprocessor3 = Pipeline(steps=[
      ('ct', ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('amount_log', FunctionTransformer(np.log10, validate=False), ['Amount']),
            ('cat', categorical_transformer, categorical_features),
            ('age', FunctionTransformer(get_age_years, validate=False), 'DateOfBirth'),
            ('drop', 'drop', drop_features)],
            remainder = 'passthrough', 
            sparse_threshold=0)),
    ('pca', PCA(n_components=10)),
    ])

pipe3 = Pipeline(steps=[('preprocessor', preprocessor3),  ('clf', clf)])

scores3 = cross_val_score(pipe3, X_train, y_train, 
                          scoring='f1_macro', cv=10, n_jobs=-1)
print(scores3)
print(np.mean(scores3))

[0.52537904 0.52278692 0.55133736 0.64399237 0.59914102 0.5703125
 0.60573477 0.616      0.75404796 0.62651727]
0.6015249210706621
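As an alternative to slicing the year out of each string, the age calculation can be vectorized with pandas. This is only a sketch of one possible variant, not the version used in the cell above. The function name `age_years_vectorized` and the constant `REFERENCE_YEAR` are invented here, and it assumes the dates are ISO-formatted strings.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

REFERENCE_YEAR = 2021  # same fixed reference year as get_age_years

def age_years_vectorized(dates):
    """Return ages in years as a column vector, given ISO date strings."""
    years = pd.to_datetime(pd.Series(dates)).dt.year
    return (REFERENCE_YEAR - years).to_numpy(dtype=float).reshape(-1, 1)

# Three dates taken from the df.head() output shown earlier.
dob = pd.Series(["1953-09-02", "1999-09-30", "1973-03-01"], name="DateOfBirth")

age_transformer = FunctionTransformer(age_years_vectorized, validate=False)
ages = age_transformer.fit_transform(dob)
print(ages.ravel())
```

Like `get_age_years`, this ignores the month and day, so the result is the difference in calendar years rather than an exact age.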


# Pipeline 4

In [16]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, GridSearchCV

numeric_features = ['Amount', 'Duration', 'NumberPets', 
                    'ResidenceDuration', 'Married', 'EmploymentDuration']

categorical_features = ['City', 'Sex']

drop_features = ['UserID', 'FirstName', "LastName", 
                 'Street', 'LicensePlate']

# A custom transformer that takes in a date feature stored as strings (e.g.,
# date of birth as 'YYYY-MM-DD') and calculates the age in years as of 2021.
def get_age_years(feature):
  # The year is the first four characters of each date string.
  ages = [2021 - int(instance[0:4]) for instance in feature]
  return np.array(ages, dtype=float).reshape(-1, 1)

clf = RandomForestClassifier(random_state=42)

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ])

categorical_transformer = Pipeline(steps=[
      ('encoder', OneHotEncoder(handle_unknown='ignore')),
      ])

preprocessor4 = Pipeline(steps=[
      ('ct', ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('amount_log', FunctionTransformer(np.log10, validate=False), ['Amount']),
            ('cat', categorical_transformer, categorical_features),
            ('age', FunctionTransformer(get_age_years, validate=False), 'DateOfBirth'),
            ('drop', 'drop', drop_features)],
            remainder = 'passthrough', 
            sparse_threshold=0)),
      ('feature_selector', SelectKBest(k=10)),
    ])

pipe4 = Pipeline(steps=[('preprocessor', preprocessor4),  ('clf', clf)])

param_grid = {
    'preprocessor__ct__num__scaler__with_mean': [True, False],
    'preprocessor__ct__num__scaler__with_std': [True, False],
    'preprocessor__feature_selector__k': [5, 10, 15],
    'clf__max_depth': [None, 3, 10],
    'clf__criterion': ['gini', 'entropy'], 
    'clf__class_weight':[None, 'balanced'],
}

pipe4 = GridSearchCV(pipe4, param_grid, cv=10, n_jobs=-1, 
                     scoring='f1_macro', return_train_score=True, verbose=2)

pipe4 = pipe4.fit(X_train, y_train)

print(pipe4.best_score_)
print(pipe4.best_params_)

Fitting 10 folds for each of 144 candidates, totalling 1440 fits
0.6736580120537148
{'clf__class_weight': 'balanced', 'clf__criterion': 'entropy', 'clf__max_depth': 10, 'preprocessor__ct__num__scaler__with_mean': True, 'preprocessor__ct__num__scaler__with_std': True, 'preprocessor__feature_selector__k': 15}
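The keys in `param_grid` follow the step names joined by double underscores, from the outermost pipeline down to the parameter itself. If a key is hard to guess, `get_params()` lists every valid one. The sketch below rebuilds a stripped-down pipeline with the same step names as `pipe4` (the objects `inner`, `ct`, `prep`, and `toy_pipe` are made up for this example).

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A stripped-down pipeline with the same nesting and step names as pipe4.
inner = Pipeline(steps=[("scaler", StandardScaler())])
ct = ColumnTransformer(transformers=[("num", inner, [0, 1])])
prep = Pipeline(steps=[("ct", ct)])
toy_pipe = Pipeline(steps=[("preprocessor", prep),
                           ("clf", RandomForestClassifier())])

# get_params() returns every tunable parameter keyed by its full path.
params = toy_pipe.get_params()
scaler_keys = sorted(k for k in params if "scaler__" in k)
print(scaler_keys)

# The same keys work with set_params(), which is what GridSearchCV uses.
toy_pipe.set_params(clf__max_depth=3)
print(toy_pipe.named_steps["clf"].max_depth)
```

No fitting is needed to list the parameter names, so this is a cheap check to run before starting a long grid search.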


