# Pipelines

## Agenda

- Why/how/when are pipelines useful?
- How to set up a simple preprocessing pipeline

## Why Pipeline?

Pipelines can keep our code neat and clean - from gathering & cleaning our data, to creating models & fine-tuning them!

**Advantages**: 
- Reduces complexity
- Convenient 
- Flexible 
- Can help prevent mistakes (like data leakage between train and test set - for example, during cross validation!) 

## Scenario

In [1]:
# Imports
import pandas as pd
pd.set_option("display.max_columns", 24)

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

### Read In and Explore the Data

In [2]:
ames = pd.read_csv("data/ames.csv")
ames.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,...,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,...,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,...,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,...,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,...,0,0,0,,,,0,12,2008,WD,Normal,250000


In [3]:
ames.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [4]:
# Explore continuous variables
ames.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,1460.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,46.549315,567.240411,...,1.767123,472.980137,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,161.319273,441.866955,...,0.747315,213.804841,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,0.0,223.0,...,1.0,334.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,0.0,477.5,...,2.0,480.0,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,0.0,808.0,...,2.0,576.0,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,1474.0,2336.0,...,4.0,1418.0,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


In [5]:
# Explore our categorical columns
# Using list comprehension to list only columns with 'object' dtype
ames[[c for c in ames.columns if ames[c].dtype == 'object']].describe()

Unnamed: 0,MSZoning,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,...,Functional,FireplaceQu,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,PoolQC,Fence,MiscFeature,SaleType,SaleCondition
count,1460,1460,91,1460,1460,1460,1460,1460,1460,1460,1460,1460,...,1460,770,1379,1379,1379,1379,1460,7,281,54,1460,1460
unique,5,2,2,4,4,2,5,3,25,9,8,5,...,7,5,6,3,5,5,3,3,4,4,9,6
top,RL,Pave,Grvl,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,...,Typ,Gd,Attchd,Unf,TA,TA,Y,Gd,MnPrv,Shed,WD,Normal
freq,1151,1454,50,925,1311,1459,1052,1382,225,1260,1445,1220,...,1360,380,870,605,1311,1326,1340,3,157,49,1267,1198


### Observations:

The data is a mixture of categorical and numeric columns, and some columns have null values.

Numeric data is all on different scales, and some categorical columns have many different options (for example, Neighborhood has 25 unique values - probably too many to one hot encode).

### Outline an Initial Approach

- take the numeric columns only (excluding years)
- impute missing values with the median using a SimpleImputer
- scale the data using a StandardScaler
- model using LinearRegression

### Process the Data

In [6]:
# Explore our target value
ames['SalePrice'].describe()

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

In [7]:
# Define which numeric columns we DON'T want to use
# Not including target-related cols, plus the Id and year-related cols
not_used_num_cols = ['SalePrice', 'Id', 'YearBuilt', 'YearRemodAdd', 'MoSold', 'YrSold']

# Define which columns to use
# Grabbing all numeric columns that aren't in the above list
used_cols = [c for c in ames.columns if 
             (ames[c].dtype in ['float64', 'int64']) and
             (c not in not_used_num_cols)]
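
An equivalent way to pick out numeric columns is pandas' `select_dtypes`. The sketch below runs on a small made-up DataFrame (`toy`), not the Ames data, so its values and the shortened `not_used` list are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Ames data
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, None, 68.0],
    "Street": ["Pave", "Pave", "Grvl"],
    "SalePrice": [208500, 181500, 223500],
})

# Columns we don't want as features (shortened for the toy example)
not_used = ["SalePrice", "Id"]

# select_dtypes(include="number") keeps every integer and float column
numeric_cols = toy.select_dtypes(include="number").columns
used = [c for c in numeric_cols if c not in not_used]
print(used)
```

Unlike the explicit `['float64', 'int64']` check, this would also catch other numeric widths such as `int32`, which may or may not be what you want.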

In [8]:
# Define our X and y
X = ames[used_cols]
y = ames['SalePrice']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=13)

In [9]:
# Check our work
X_train.head()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,...,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal
998,30,60.0,9786,3,4,0.0,0,0,1007,1007,1077,0,...,6,1,1922.0,1,210,0,100,48,0,0,0,0
883,75,60.0,6204,4,5,0.0,0,0,795,795,954,795,...,10,0,1997.0,1,440,0,188,0,0,0,0,0
792,60,92.0,9920,7,5,0.0,862,0,255,1117,1127,886,...,8,1,1997.0,2,455,180,130,0,0,0,0,0
617,45,59.0,7227,6,6,0.0,0,0,832,832,832,0,...,4,0,1962.0,2,528,0,0,0,0,0,0,0
1028,50,79.0,9492,5,5,0.0,368,41,359,768,968,408,...,6,1,1941.0,1,240,0,0,0,0,0,0,0


In [10]:
# Instantiate our imputer
imputer = SimpleImputer(strategy='median')

# Fit our imputer on the training data
imputer.fit(X_train)

# Create no-null versions of our train and test data
X_train_no_nulls = imputer.transform(X_train)
X_test_no_nulls = imputer.transform(X_test)

In [11]:
# Instantiate our scaler
scaler = StandardScaler()

# Fit our scaler on the no-null training data
scaler.fit(X_train_no_nulls)

# Create processed versions of our train and test data
X_train_processed = scaler.transform(X_train_no_nulls)
X_test_processed = scaler.transform(X_test_no_nulls)

In [12]:
# Explore the result
X_train_processed.shape

(1095, 32)

### Model the Data

In [13]:
# Instantiate our Linear Regression model
linreg = LinearRegression()

# Fit our model on our processed training data
linreg.fit(X_train_processed, y_train)

# Grab predictions out on our train and test sets, to evaluate
train_preds = linreg.predict(X_train_processed)
test_preds = linreg.predict(X_test_processed)

In [14]:
# Print out R2-Score and Root Mean Squared Error for our train and test data
# RMSE is computed as the square root of MSE, since the squared=False argument
# was deprecated in scikit-learn 1.4
print(f"Train Set R2-Score: {r2_score(y_train, train_preds)}")
print(f"Train Set RMSE: {mean_squared_error(y_train, train_preds) ** 0.5}")
print("*"*20)
print(f"Test Set R2-Score: {r2_score(y_test, test_preds)}")
print(f"Test Set RMSE: {mean_squared_error(y_test, test_preds) ** 0.5}")

Train Set R2-Score: 0.870722088080975
Train Set RMSE: 28303.645888817668
********************
Test Set R2-Score: 0.5100499070501185
Test Set RMSE: 56956.52249398129


### Evaluate

Scores on the training data show that about 87% of the variance in SalePrice is explained by our inputs, and our typical prediction error (RMSE) is about $28,300.

BUT on the test set, we're only explaining about 51% of the variance in SalePrice, and our typical prediction error (RMSE) is about $56,950!

This is a classic sign of overfitting - our model memorized some noise from the training data, rather than finding a useful pattern that allows the model to generalize to data it's never seen before.

### Next Steps

We'd like to cross validate, to see if our modeling approach needs work or if this is just a really bad split in the data.

BUT currently we're using the median, mean and standard deviation of the training data in our processing steps (median for imputation, mean/std for our scaler). If we did cross validation on already-processed data, there would be some test set leakage in each of our folds, since the test data in each fold would have affected our processing. Not good!

<img src="images/grid_search_cross_validation.png" alt="cross validation image from sklearn's documentation" width=500>
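
To make the leakage concern concrete, here is a rough sketch of the bookkeeping needed to cross validate without leakage by hand: the imputer and scaler are re-fit inside every fold, using only that fold's training rows. It uses synthetic data from `make_regression` with some values blanked out (an assumption for illustration, not the Ames data), and the names `X_demo`, `y_demo` and `fold_scores` are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Ames data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # add some missing values

fold_scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X_demo):
    X_tr, X_val = X_demo[train_idx], X_demo[val_idx]
    y_tr, y_val = y_demo[train_idx], y_demo[val_idx]

    # Fit the imputer and scaler on this fold's training rows ONLY
    imputer = SimpleImputer(strategy="median").fit(X_tr)
    scaler = StandardScaler().fit(imputer.transform(X_tr))

    # Apply the fold-specific medians/means/stds to both parts of the fold
    X_tr_proc = scaler.transform(imputer.transform(X_tr))
    X_val_proc = scaler.transform(imputer.transform(X_val))

    model = LinearRegression().fit(X_tr_proc, y_tr)
    fold_scores.append(model.score(X_val_proc, y_val))  # R2 on held-out rows

print(np.round(fold_scores, 3))
```

That is a lot of code to get right for every new processing step, which is the motivation for letting a tool handle it.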

## Enter: Pipelines

Pipelines are needed in this case because each split inside the cross validation should be processed using only parameters from the training data _for that split_. Pipelines make that effortless!

In [15]:
# Import pipeline
from sklearn.pipeline import Pipeline

**Recap our processing steps:**
- Imputed null values using SimpleImputer
- Scaled the data using StandardScaler

In [16]:
# Now define those steps for our pipeline
num_processor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

In [17]:
# We can go ahead and test this on our X_train from before, to make sure we get the same result
num_processor.fit_transform(X_train).shape # Same result as our shape we explored before!

(1095, 32)

In [18]:
# Now, add the model - using another Pipeline!
linreg = Pipeline(steps = [
    ('preprocessor', num_processor),
    ('linreg', LinearRegression())
])

linreg.fit(X_train, y_train) # don't use the processed data here - the pipeline does that for us

# Grab predictions out on our train and test sets, to evaluate
train_preds = linreg.predict(X_train)
test_preds = linreg.predict(X_test)
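
Once a pipeline is fit, each step can still be inspected by name through `named_steps`. The sketch below rebuilds the same nested structure on a tiny made-up array (`X_small`, `y_small` are assumptions, not the Ames data) and pulls out the medians the imputer learned.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up training set with one missing value (not the Ames data)
X_small = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 50.0]])
y_small = np.array([1.0, 2.0, 3.0, 4.0])

pipe = Pipeline(steps=[
    ("preprocessor", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ])),
    ("linreg", LinearRegression()),
])
pipe.fit(X_small, y_small)

# Reach into the nested pipeline by step name
fitted_imputer = pipe.named_steps["preprocessor"].named_steps["imputer"]
print(fitted_imputer.statistics_)  # per-column medians learned from X_small

# The final estimator is available the same way
fitted_model = pipe.named_steps["linreg"]
print(fitted_model.coef_.shape)  # one coefficient per input column
```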

In [19]:
# Print out R2-Score and Root Mean Squared Error for our train and test data
# RMSE is computed as the square root of MSE, since the squared=False argument
# was deprecated in scikit-learn 1.4
print(f"Train Set R2-Score: {r2_score(y_train, train_preds)}")
print(f"Train Set RMSE: {mean_squared_error(y_train, train_preds) ** 0.5}")
print("*"*20)
print(f"Test Set R2-Score: {r2_score(y_test, test_preds)}")
print(f"Test Set RMSE: {mean_squared_error(y_test, test_preds) ** 0.5}")
# Note - same scores as before!

Train Set R2-Score: 0.870722088080975
Train Set RMSE: 28303.645888817668
********************
Test Set R2-Score: 0.5100499070501185
Test Set RMSE: 56956.52249398129


### Cross Validate

In [20]:
# Import our cross validation
# Note that cross_val_score and cross_val_predict are variants of this function
from sklearn.model_selection import cross_validate
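
For comparison, here is a minimal sketch of the simpler `cross_val_score` variant, which returns only the per-fold test scores for a single metric. It uses synthetic data from `make_regression` and a shortened pipeline, both assumptions for illustration rather than the Ames setup.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Ames data)
X_demo, y_demo = make_regression(n_samples=150, n_features=4,
                                 noise=5.0, random_state=1)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("linreg", LinearRegression()),
])

# One array of test scores, one entry per fold
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")
print(scores.shape)
```

`cross_validate` is the better fit here because we want train scores and more than one metric at once.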

In [24]:
# Time to cross val!
# Pass in pipeline and training data, set cv=5, return train score, and set scoring
# Reference: https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
results = cross_validate(linreg, X_train, y_train,
                         cv=5, 
                         return_train_score=True,
                         scoring=['r2', 'neg_root_mean_squared_error'])
results

{'fit_time': array([0.012568  , 0.01167321, 0.00819564, 0.00764608, 0.00798488]),
 'score_time': array([0.00354099, 0.00263095, 0.0026083 , 0.00237894, 0.00240588]),
 'test_r2': array([0.85128524, 0.87875713, 0.84034008, 0.84331124, 0.88917455]),
 'train_r2': array([0.87468439, 0.86804971, 0.87488225, 0.87650839, 0.86494074]),
 'test_neg_root_mean_squared_error': array([-31959.35077226, -24903.53867147, -28322.78878081, -34522.81518905,
        -26482.56894404]),
 'train_neg_root_mean_squared_error': array([-27481.10664728, -29209.92959238, -28479.59841584, -26860.02182131,
        -28851.86943949])}

In [30]:
# Let's look at the average, plus a measure of variance, for train and test
print(f"Average Train Set R2-Score: {results['train_r2'].mean()} +/- { results['train_r2'].std()}")
print(f"Average Train Set RMSE: {results['train_neg_root_mean_squared_error'].mean()*-1} +/- { results['train_neg_root_mean_squared_error'].std()}")
print("*"*20)
print(f"Average Test Set R2-Score: {results['test_r2'].mean()} +/- { results['test_r2'].std()}")
print(f"Average Test Set RMSE: {results['test_neg_root_mean_squared_error'].mean()*-1} +/- {results['test_neg_root_mean_squared_error'].std()}")

Average Train Set R2-Score: 0.871813094847045 +/- 0.004496711785358259
Average Train Set RMSE: 28176.50518326214 +/- 875.5176710659689
********************
Average Test Set R2-Score: 0.8605736470029729 +/- 0.019709455337951112
Average Test Set RMSE: 29238.2124715266 +/- 3537.0210140971762


### Evaluate

These scores are MUCH better, and the train and test scores are much closer to each other (an average R2 of about 0.87 on train vs. 0.86 on test) - it looks likely that our initial train test split was just a really bad one!
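
One way to probe the "bad split" idea further is to repeat the train test split with several different `random_state` values and see how much the test score moves. This is only a sketch on synthetic data from `make_regression` (an assumption, not the Ames data), so the numbers it prints say nothing about the Ames results. On the real data, the same loop would reuse `X`, `y` and the full pipeline.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Ames data)
X_demo, y_demo = make_regression(n_samples=300, n_features=6,
                                 noise=20.0, random_state=2)

test_scores = []
for seed in range(10):
    # A different seed gives a different train/test partition
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.25, random_state=seed)

    pipe = Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("linreg", LinearRegression()),
    ])
    pipe.fit(X_tr, y_tr)
    test_scores.append(pipe.score(X_te, y_te))  # R2 on this split's test set

test_scores = np.array(test_scores)
print(f"Test R2 across splits: min={test_scores.min():.3f}, "
      f"max={test_scores.max():.3f}, std={test_scores.std():.3f}")
```

A wide spread between the minimum and maximum would suggest that a single split is not a reliable estimate of performance.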