# Scikit-Learn Pipelines
- A pipeline chains multiple steps together so that the output of each step becomes the input to the next step.

- A pipeline makes it easy to apply exactly the same preprocessing to the train and test data.
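As a minimal sketch of what "chaining" means (the toy numbers below are invented for illustration and are not the Titanic data): the imputer's output feeds the classifier, and a single `fit` or `predict` call runs both steps in order.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data (made up): one feature with a missing value, binary target
X_toy = np.array([[1.0], [2.0], [np.nan], [8.0], [9.0], [10.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_pipe = Pipeline([
    ("impute", SimpleImputer()),                       # step 1: fill NaN with the mean
    ("tree", DecisionTreeClassifier(random_state=0)),  # step 2: fit on the imputed data
])
toy_pipe.fit(X_toy, y_toy)

# predict() pushes new rows through the same fitted imputer before the tree sees them
toy_preds = toy_pipe.predict(np.array([[1.5], [np.nan], [9.5]]))
print(toy_preds)
```

The step names (`"impute"`, `"tree"`) are arbitrary labels chosen for this sketch.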

## Let's begin, but first I'll walk you through what happens if we don't use pipelines

In [139]:
# Let's gear up
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier

***I won't spend much time exploring the data here; the main goal is to see why pipelines matter, so please bear with me. If you want one, there's a step-by-step basic analysis as well. We're working with the very famous Titanic dataset.***

In [10]:
df = pd.read_csv('train.csv')

In [11]:
df.sample(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
373,374,0,1,"Ringhini, Mr. Sante",male,22.0,0,0,PC 17760,135.6333,,C
447,448,1,1,"Seward, Mr. Frederic Kimber",male,34.0,0,0,113794,26.55,,S


In [12]:
df.drop(columns = ['PassengerId' , 'Name' ,'Ticket' , 'Cabin'],inplace = True)

In [13]:
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


In [14]:
df.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

# Plan of action

### Split the data into train and test sets before any of these steps


>Now the worst part is that we have to repeat this whole process whenever we want to use the model. Sounds confusing, so let's demystify it.

In [15]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df.drop(columns=['Survived']),
                                                df['Survived'],
                                                test_size=0.2,
                                                random_state =42)
                                                

In [29]:
X_train['Embarked']

331    S
733    S
382    S
704    S
813    S
      ..
106    S
270    S
860    S
435    S
102    S
Name: Embarked, Length: 712, dtype: object

In [33]:
# Applying imputation

si_age = SimpleImputer()  # default strategy='mean': missing ages get the training mean
si_embarked = SimpleImputer(strategy='most_frequent')  # most frequent value, i.e. "S"

X_train_age = si_age.fit_transform(X_train[['Age']])
X_train_embarked = si_embarked.fit_transform(X_train[['Embarked']])

X_test_age = si_age.transform(X_test[['Age']])
X_test_embarked = si_embarked.transform(X_test[['Embarked']])
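The cell above calls `fit_transform` on the training data but only `transform` on the test data. A small sketch of why, using made-up ages rather than the Titanic column: the imputer learns its fill value from the training rows and reuses that same value on the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

train_age = np.array([[20.0], [30.0], [np.nan], [40.0]])  # made-up values
test_age = np.array([[np.nan], [25.0]])

imputer = SimpleImputer()              # strategy='mean' by default
imputer.fit(train_age)                 # learns the mean from the training rows only
learned_mean = imputer.statistics_[0]  # (20 + 30 + 40) / 3

# The missing test value is filled with the training mean, not a test-set mean
test_filled = imputer.transform(test_age)
print(learned_mean, test_filled.ravel())
```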

In [34]:
X_test_embarked

array([['C'],
       ['S'],
       ['S'],
       ['S'],
       ['C'],
       ['S'],
       ['Q'],
       ['S'],
       ['Q'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['C'],
       ['S'],
       ['S'],
       ['S'],
       ['Q'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['C'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['C'],
       ['S'],
       ['S'],
       ['C'],
       ['S'],
       ['Q'],
       ['C'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['Q'],
       ['C'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['C'],
       ['C'],
       ['S'],
       ['S'],
       ['S'],
       ['Q'],
       ['S'],
       ['S'],
       ['C'],
       ['S'],
       ['Q'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['Q'],
       ['S'],
       ['S'],
       ['S'],
       ['C'],
       ['C'],
       ['S'],
      

In [35]:
# One-hot encoding Sex and Embarked.
# I am not using drop='first' here: the dummy variable trap is a concern for
# linear models, and a decision tree is not one.
# Embarked is encoded from the imputed NumPy array (X_train_embarked), not from
# the original column in X_train, which still contains missing values. That is
# why Sex and Embarked get separate encoder objects.
# handle_unknown='ignore' is for the future: a category the encoder has never
# seen (e.g. 'M', which is not a port) is encoded as all zeros instead of
# raising an error.
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; on newer versions
# pass sparse_output=False instead.


ohe_sex = OneHotEncoder(sparse=False, handle_unknown='ignore')
ohe_embarked = OneHotEncoder(sparse=False, handle_unknown='ignore')

X_train_sex = ohe_sex.fit_transform(X_train[['Sex']])
X_train_embarked = ohe_embarked.fit_transform(X_train_embarked)

X_test_sex = ohe_sex.transform(X_test[['Sex']])
X_test_embarked = ohe_embarked.transform(X_test_embarked)
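To see what `handle_unknown='ignore'` does in practice, here is a small sketch on a hand-written port column (not the real data). It uses the encoder's default sparse output and `.toarray()`, which sidesteps the version-dependent `sparse` / `sparse_output` argument.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

ports = np.array([["S"], ["C"], ["Q"], ["S"]], dtype=object)

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(ports)  # learned categories are sorted: C, Q, S

known = enc.transform(np.array([["S"]], dtype=object)).toarray()
unknown = enc.transform(np.array([["M"]], dtype=object)).toarray()

print(known)    # one-hot row for a category seen during fit
print(unknown)  # a row of zeros for the unseen category 'M'
```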


In [38]:
X_train_embarked

array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       ...,
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]])

In [39]:
# The remaining columns look fine, so we'll use them as they are

X_train_rem = X_train.drop(columns = ['Sex','Age','Embarked'])

X_test_rem = X_test.drop(columns = ['Sex','Age','Embarked'])

In [40]:
# Concatenating everything
# Remember the order

X_train_transformed = np.concatenate((X_train_rem ,X_train_age,X_train_sex,X_train_embarked), axis =1)
X_test_transformed = np.concatenate((X_test_rem ,X_test_age,X_test_sex,X_test_embarked), axis =1)


In [44]:
# 4 untouched features [Pclass, SibSp, Parch, Fare] + 1 Age column + 2 Sex columns + 3 Embarked columns = 10
X_train_transformed.shape ,X_test_transformed.shape

((712, 10), (179, 10))

In [45]:
# Building the classifier
clf =DecisionTreeClassifier()
clf.fit(X_train_transformed,y_train)

DecisionTreeClassifier()

In [48]:
y_pred = clf.predict(X_test_transformed)
y_pred

array([0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 1, 1], dtype=int64)

In [49]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.776536312849162

In [50]:
# Let's say we want to deploy this model on a website: someone gives us
# the same kind of data and we predict whether he/she survives or not.
# Then what?
import pickle

In [51]:
# wb -> write binary, rb -> read binary
# I am not saving the imputers because, in this demo, I will always provide
# the age and the port myself (haha). This is just to get the gist of why
# pipelines matter, remember.
# The 'models/' directory must already exist.

with open('models/ohe_sex.pkl', 'wb') as f:
    pickle.dump(ohe_sex, f)
with open('models/ohe_embarked.pkl', 'wb') as f:
    pickle.dump(ohe_embarked, f)
with open('models/clf.pkl', 'wb') as f:
    pickle.dump(clf, f)

# Let's think of this part as the web page now

In [52]:
ohe_sex_web = pickle.load(open('models/ohe_sex.pkl' , 'rb'))
ohe_embarked_web = pickle.load(open('models/ohe_embarked.pkl' , 'rb'))
clf_web = pickle.load(open('models/clf.pkl' , 'rb'))

In [53]:
# A user input
# Pclass / Sex / Age / SibSp / Parch / Fare / Embarked
test_input = np.array([2,'male' ,31.0 , 0 ,0 ,10.5 ,'S'] ,dtype =object).reshape(1,7)

In [54]:
test_input

array([[2, 'male', 31.0, 0, 0, 10.5, 'S']], dtype=object)

In [58]:
test_input_sex = ohe_sex_web.transform(test_input[:,1].reshape(1,1))

  "X does not have valid feature names, but"


In [61]:
test_input_sex

array([[0., 1.]])

In [59]:
test_input_embardked = ohe_embarked_web.transform(test_input[:,-1].reshape(1,1))

In [62]:
test_input_embardked

array([[0., 0., 1.]])

In [63]:
test_input_age = test_input[:,2].reshape(1,1)

In [64]:
test_input_age

array([[31.0]], dtype=object)

In [65]:
# Maintain the same column order as in training
test_input_transformed =np.concatenate((test_input[:,[0,3,4,5]] ,test_input_age,
                                       test_input_sex,test_input_embardked) ,axis =1)

In [67]:
test_input_transformed.shape

(1, 10)

In [69]:
clf_web.predict(test_input_transformed)

array([0], dtype=int64)

- Now think about how much hard work this took.
- On top of that, the production code needs a lot of changes every time, which is considered very bad practice.
- If we make any change or update, we have to remember the column order of our transformed X and repeat the same steps in production as well.
- This makes the workflow difficult and confusing.

> In basic words, when we don't build pipelines we have to reproduce the same process/workflow by hand in both UAT and production. So what can we do?
Scikit-learn comes to the rescue here.

***That is why we should use pipelines, so let's understand them.***

# PIPELINE Workflow

In [70]:
# Let's have a clean start. I am importing everything again so that this part
# reads like a new notebook.

import pickle

import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline , make_pipeline
from sklearn.feature_selection import SelectKBest ,chi2
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder , MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

In [71]:
data = pd.read_csv('train.csv')

In [73]:
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


# Let's Plan Again

***Let's create a chain/pipeline.***
***For each preprocessing step we'll use a ColumnTransformer.***

1. Age and Embarked have missing values, so an imputer is needed: most scikit-learn estimators raise an error when they see missing values.

2. Sex and Embarked are nominal categorical values, so one-hot encoding is needed: scikit-learn estimators work with numbers.

3. We'll scale the data so that every feature is on the same range. A decision tree does not need this, but the chi2 feature selection in the next step requires non-negative inputs, and MinMaxScaler guarantees that.

4. I'll use feature selection as well, just to show how it works. It may affect model performance, but here I want this to serve as a reference for how to build an ML pipeline.

5. Finally, we'll use our decision tree algorithm and then save the model.
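The five steps above can be sketched end to end on a handful of invented rows. Note that this is an alternative layout, not the one built in the cells that follow: it is an assumption of this sketch that each column group gets its own small nested pipeline and that columns are selected by name, so no index bookkeeping is needed between steps.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Invented rows that only mimic the Titanic columns
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Sex": pd.Series(["male", "female", "female", "female", "male", "male"], dtype=object),
    "Age": [22.0, 38.0, np.nan, 35.0, 31.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.1, 10.5, 8.05],
    "Embarked": pd.Series(["S", "C", "S", np.nan, "S", "Q"], dtype=object),
})
y_toy = [0, 1, 1, 1, 0, 0]

numeric = Pipeline([("impute", SimpleImputer()), ("scale", MinMaxScaler())])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["Pclass", "Age", "Fare"]),  # steps 1 and 3
    ("cat", categorical, ["Sex", "Embarked"]),    # steps 1 and 2
])

model = Pipeline([
    ("prep", preprocess),
    ("select", SelectKBest(score_func=chi2, k=5)),     # step 4
    ("tree", DecisionTreeClassifier(random_state=0)),  # step 5
])
model.fit(toy, y_toy)
preds = model.predict(toy)
```

Selecting by name works here because the ColumnTransformer is the first step and receives the DataFrame directly.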


In [74]:
# dropping some columns
data.drop(columns = ['PassengerId' , 'Name' ,'Ticket' , 'Cabin'],inplace = True)

In [75]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(data.drop(columns=['Survived']),
                                                data['Survived'],
                                                test_size=0.2,
                                                random_state =42)

In [76]:
X_train.sample(5)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
387,2,female,36.0,0,0,13.0,S
859,3,male,,0,0,7.2292,C
166,1,female,,0,1,55.0,S
180,3,female,,8,2,69.55,S
856,1,female,45.0,1,1,164.8667,S


In [77]:
# Imputation transformer
# A ColumnTransformer returns a NumPy array rather than a DataFrame, so the
# later steps in the pipeline cannot refer to columns by name.
# That is why every transformer here selects its columns by index.

trf1 = ColumnTransformer([
    ('impute_age',SimpleImputer() , [2]),
    ('impute_embarked',SimpleImputer(strategy='most_frequent'),[6])
] , remainder = 'passthrough')
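One detail worth checking before choosing indices for the next step: a ColumnTransformer places the outputs of its listed transformers first, in the order they are listed, and appends the `remainder` columns afterwards. A small sketch with three invented rows (same column layout as `X_train`) shows the resulting order.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Three made-up rows with the same column layout as X_train
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": pd.Series(["male", "female", "male"], dtype=object),
    "Age": [22.0, np.nan, 31.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Fare": [7.25, 71.28, 10.5],
    "Embarked": pd.Series(["S", np.nan, "S"], dtype=object),
})

ct = ColumnTransformer([
    ("impute_age", SimpleImputer(), [2]),
    ("impute_embarked", SimpleImputer(strategy="most_frequent"), [6]),
], remainder="passthrough")

out = ct.fit_transform(toy)

# Column order is now: Age, Embarked, Pclass, Sex, SibSp, Parch, Fare
print(out[0])
```

So after this step Embarked sits at index 1 and Sex at index 3, which is what any later index-based transformer has to use.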

In [78]:
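# One-hot encoding transformer
# Careful: trf1 outputs [Age, Embarked, Pclass, Sex, SibSp, Parch, Fare], so
# after it Embarked sits at index 1 and Sex at index 3. The [1, 6] used in this
# run encodes Embarked and Fare; [1, 3] would encode Sex and Embarked as planned.
trf2 = ColumnTransformer([
    ('ohe_sex_embarked' , OneHotEncoder(sparse = False , handle_unknown = 'ignore'),[1,6])
],remainder = 'passthrough')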

In [79]:
# Scaling 

trf3 = ColumnTransformer([
    ('scale', MinMaxScaler() , slice(0,10))
])

In [80]:
# Feature selection: keep the 8 features with the highest chi2 scores
# (let's not dwell on this too much)
trf4 = SelectKBest(score_func= chi2 , k= 8)
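A short sketch of what this selector does, on synthetic data rather than the Titanic features. `chi2` rejects negative inputs, which is one reason to place MinMaxScaler before it. The variable names here are made up for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(40, 5))       # synthetic features, some of them negative
y_syn = (X_raw[:, 0] > 0).astype(int)  # target depends on column 0 only

X_scaled = MinMaxScaler().fit_transform(X_raw)  # every value now lies in [0, 1]

selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X_scaled, y_syn)

kept = selector.get_support(indices=True)  # indices of the columns that survived
print(X_selected.shape, kept)
```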

In [81]:
trf5 = DecisionTreeClassifier()

In [123]:
pipe = Pipeline([
    ('trf1',trf1),
    ('trf2',trf2),
    ('trf3',trf3),
    ('trf4',trf4),
    ('trf5',trf5),
])

In [124]:
from sklearn import set_config
set_config(display = 'diagram')

In [125]:
# Train 
pipe.fit(X_train,y_train )

>Isn't it beautiful?

# Exploring the pipeline

In [126]:
pipe.named_steps

{'trf1': ColumnTransformer(remainder='passthrough',
                   transformers=[('impute_age', SimpleImputer(), [2]),
                                 ('impute_embarked',
                                  SimpleImputer(strategy='most_frequent'),
                                  [6])]),
 'trf2': ColumnTransformer(remainder='passthrough',
                   transformers=[('ohe_sex_embarked',
                                  OneHotEncoder(handle_unknown='ignore',
                                                sparse=False),
                                  [1, 6])]),
 'trf3': ColumnTransformer(transformers=[('scale', MinMaxScaler(), slice(0, 10, None))]),
 'trf4': SelectKBest(k=8, score_func=<function chi2 at 0x0000029C33FBEAF8>),
 'trf5': DecisionTreeClassifier()}

In [127]:
# The mean age that the first imputer learned during fit
pipe.named_steps['trf1'].transformers_[0][1].statistics_

array([29.49884615])

In [128]:
pipe.named_steps['trf1'].transformers_[1][1].statistics_

array(['S'], dtype=object)
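The same kind of inspection works on any fitted pipeline. A minimal sketch on toy numbers (invented for illustration), showing that a step can be reached through `named_steps` or by indexing the pipeline with the step name:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X_toy = np.array([[10.0], [20.0], [np.nan], [30.0]])
y_toy = np.array([0, 0, 1, 1])

inspect_pipe = Pipeline([
    ("impute", SimpleImputer()),
    ("tree", DecisionTreeClassifier(random_state=0)),
]).fit(X_toy, y_toy)

fitted_imputer = inspect_pipe.named_steps["impute"]  # lookup by step name
also_imputer = inspect_pipe["impute"]                # equivalent shorthand

mean_used = fitted_imputer.statistics_[0]            # value learned during fit
tree_depth = inspect_pipe.named_steps["tree"].get_depth()
print(mean_used, tree_depth)
```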

In [130]:
X_test.shape

(179, 7)

In [131]:
y_pred = pipe.predict(X_test)

In [132]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test ,y_pred)

0.6256983240223464

# Cross Validation using Pipeline

In [97]:
from sklearn.model_selection import cross_val_score
cross_val_score(pipe ,X_train ,y_train ,scoring= 'accuracy').mean()


0.6391214419383433
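Passing the whole pipeline to `cross_val_score` means the preprocessing is re-fitted inside every fold, so the held-out part of each fold never influences the scaler or imputer. A sketch on synthetic data (generated with `make_classification`, so the scores say nothing about the Titanic model):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)

cv_pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# For each of the 5 folds, a fresh clone of the pipeline is fitted on the
# training part and scored on the held-out part
scores = cross_val_score(cv_pipe, X_syn, y_syn, cv=5, scoring="accuracy")
mean_accuracy = scores.mean()
print(scores, mean_accuracy)
```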

# GridSearch using Pipeline

In [99]:
params = {
    'trf5__max_depth' : [1,2,3,4,5,None]
}

In [100]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe ,params ,cv =5 ,scoring = 'accuracy')
grid.fit(X_train,y_train)

In [101]:
grid.best_score_

0.6391214419383433

In [102]:
grid.best_params_

{'trf5__max_depth': 2}
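The key `trf5__max_depth` follows the pattern `<step name>__<parameter name>`. A sketch on synthetic data showing how to list the available names with `get_params()` before writing a grid; the step names `scale` and `tree` are chosen just for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)

gs_pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Every tunable name, e.g. 'tree__max_depth' or 'scale__feature_range'
available = sorted(gs_pipe.get_params().keys())

grid_demo = GridSearchCV(
    gs_pipe,
    {"tree__max_depth": [1, 2, 3, None]},
    cv=5,
    scoring="accuracy",
)
grid_demo.fit(X_syn, y_syn)
best = grid_demo.best_params_
print(best)
```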

In [103]:
# Standard Scaling
from sklearn.preprocessing import StandardScaler
trf_ss = ColumnTransformer([
    ('standardscale' ,StandardScaler(),slice(0,10) )
])

In [114]:
pipe_without_fs = Pipeline([
    ('trf1',trf1),
    ('trf2',trf2),
    ('trf_ss',trf_ss),
    ('trf5',trf5),
])

In [115]:
pipe_without_fs.fit(X_train ,y_train)

In [116]:
y_pred_ = pipe_without_fs.predict(X_test)

In [117]:
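# Careful: this scores y_pred (the predictions of `pipe`), which is why the
# number matches the earlier result. Pass y_pred_ to score pipe_without_fs.
accuracy_score(y_test ,y_pred)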

0.6256983240223464

# Exporting the pipeline for production

In [133]:
with open('models/pipe_S.pkl', 'wb') as f:
    pickle.dump(pipe, f)

# Let's use this

In [134]:
with open('models/pipe_S.pkl', 'rb') as f:
    pipe_prod = pickle.load(f)

In [136]:
test_input_2  = np.array([2,'male',31.0 ,0,0,10.5,'S'] , dtype = object).reshape(1,7)

In [138]:
pipe_prod.predict(test_input_2)

  "X does not have valid feature names, but"
  "X does not have valid feature names, but"


array([0], dtype=int64)
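A self-contained sketch of the same save-and-load round trip, on toy numbers and a temporary directory (the file name `pipe_demo.pkl` is hypothetical). It checks that the restored pipeline predicts exactly like the original, including for a row with a missing value.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X_toy = np.array([[1.0], [2.0], [np.nan], [8.0], [9.0]])
y_toy = np.array([0, 0, 0, 1, 1])

trained = Pipeline([
    ("impute", SimpleImputer()),
    ("tree", DecisionTreeClassifier(random_state=0)),
]).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipe_demo.pkl")  # hypothetical file name
    with open(path, "wb") as f:   # the with-block closes the file for us
        pickle.dump(trained, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# Raw rows go straight into predict(); the imputer runs inside the pipeline
new_rows = np.array([[1.5], [np.nan], [8.5]])
same = np.array_equal(trained.predict(new_rows), restored.predict(new_rows))
print(same)
```

Only unpickle files you trust, and load them with the same scikit-learn version that produced them.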

- So no changes are required at the production level. Cool, right? Even if you change the pipeline by adding or modifying steps, production just loads the new pickled pipeline and carries on.
- Column transformers make the work clean and readable, with less confusion, which is a sign of good code.
- Production only has to call predict on raw inputs, so the code stays well managed and, I guess, well written (haha).
- Try to work with pipelines; it will help a lot in the future.


# Thanks for reading.