<h1 align='center'> Titanic using Pipeline</h1>

In [83]:
import pandas as pd 
import numpy as np


In [84]:
df = pd.read_csv('titanic_train.csv')

In [85]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [86]:
# Dropping unwanted columns
df.drop(columns=['PassengerId','Name','Ticket','Cabin'], inplace=True)

 What is the plan?
   1. Impute columns -- Age, Embarked
   2. One-hot encode -- Sex, Embarked
   3. Scaling
   4. Feature selection
   5. Model training using a decision tree (DT)

### Train Test Split 

In [87]:
from sklearn.model_selection import train_test_split

In [88]:
X= df.drop(columns='Survived')

In [89]:
X

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,male,22.0,1,0,7.2500,S
1,1,female,38.0,1,0,71.2833,C
2,3,female,26.0,0,0,7.9250,S
3,1,female,35.0,1,0,53.1000,S
4,3,male,35.0,0,0,8.0500,S
...,...,...,...,...,...,...,...
886,2,male,27.0,0,0,13.0000,S
887,1,female,19.0,0,0,30.0000,S
888,3,female,,1,2,23.4500,S
889,1,male,26.0,0,0,30.0000,C


In [90]:
y=df['Survived']

In [91]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [92]:
X_train.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
331,1,male,45.5,0,0,28.5,S
733,2,male,23.0,0,0,13.0,S
382,3,male,32.0,0,0,7.925,S
704,3,male,26.0,1,0,7.8542,S
813,3,female,6.0,4,2,31.275,S


### Column Transformer

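- Here we refer to columns by index rather than by name, because each transformer outputs a NumPy array, which is then passed to the next step of the pipeline.
- So using indices is a good strategy. Keep in mind that a ColumnTransformer puts the transformed columns first and the passthrough (remainder) columns after them, so the indices can shift from one step to the next.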
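
To make the index behaviour concrete, here is a minimal sketch on a hypothetical three-row frame (the `toy` data and its values are made up; only the column order matches X above). It suggests that the output is a plain NumPy array and that the imputed columns come first, followed by the passthrough columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with the same column order as X in this notebook
toy = pd.DataFrame({
    'Pclass': [3, 1, 3],
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, np.nan, 26.0],
    'SibSp': [1, 1, 0],
    'Parch': [0, 0, 0],
    'Fare': [7.25, 71.28, 7.92],
    'Embarked': ['S', np.nan, 'S'],
})

ct = ColumnTransformer([
    ('impute_age', SimpleImputer(), [2]),
    ('impute_embarked', SimpleImputer(strategy='most_frequent'), [6]),
], remainder='passthrough')

out = ct.fit_transform(toy)
print(type(out))  # a NumPy array, so the column names are gone
print(out[0])     # order: Age, Embarked, then Pclass, Sex, SibSp, Parch, Fare
```

In this output Sex ends up at index 3 and Embarked at index 1, which is worth checking before choosing indices for the next transformer.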

In [93]:
from sklearn.compose import ColumnTransformer

### Simple Imputer: Age, Embarked

In [94]:
from sklearn.impute import SimpleImputer

In [95]:
trf1 = ColumnTransformer([
    ('impute_age', SimpleImputer(), [2]),  # mean (the default strategy); index 2 is Age
    ('impute_embarked', SimpleImputer(strategy='most_frequent'), [6])  # most frequent value; index 6 is Embarked
], remainder='passthrough')

### One Hot Encoding: Sex, Embarked

In [96]:
from sklearn.preprocessing import OneHotEncoder

In [97]:
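# One-hot encoder
# Caution: trf1 outputs the columns as [Age, Embarked, Pclass, Sex, SibSp, Parch, Fare],
# so [1, 6] selects Embarked and Fare at this point; Sex sits at index 3.
# scikit-learn 1.2+ renames `sparse` to `sparse_output`.
trf2 = ColumnTransformer([
    ('ohe_sex_embarked', OneHotEncoder(sparse=False, handle_unknown='ignore'), [1, 6])
], remainder='passthrough')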

- After applying the two transformers above, we expect 10 columns in total: 3 one-hot columns for Embarked, 2 for Sex, and the 5 remaining columns
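
The column count can be checked with a small sketch. The `toy` frame below is hypothetical and contains only the two categorical columns; `.toarray()` is used so the example does not depend on the `sparse` / `sparse_output` argument name.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with the category values found in the Titanic columns
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'Q', 'S'],
})

ohe = OneHotEncoder(handle_unknown='ignore')
encoded = ohe.fit_transform(toy).toarray()

print(ohe.categories_)  # categories learned for each column
print(encoded.shape)    # (4, 5): 2 Sex columns + 3 Embarked columns

# With handle_unknown='ignore', an unseen category becomes an all-zero block
unseen = ohe.transform(pd.DataFrame({'Sex': ['male'], 'Embarked': ['X']})).toarray()
print(unseen)
```

Adding the 5 passthrough columns (Age, Pclass, SibSp, Parch, Fare) to these 5 encoded columns gives the 10 columns mentioned above.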

### Scaling: All the Features

In [98]:
from sklearn.preprocessing import MinMaxScaler

In [99]:
# Scaling
trf3 = ColumnTransformer([
    ('scaling', MinMaxScaler(), slice(0, 10))  # columns 0-9, i.e. all 10 columns
])

### Feature Selection

In [100]:
from sklearn.feature_selection import SelectKBest, chi2

- chi2 computes chi-squared statistics between each non-negative feature and the class label, so it suits classification tasks
- k=5 means the top 5 features are selected
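
As a rough illustration of how SelectKBest with chi2 behaves, the sketch below uses synthetic data (the arrays `informative` and `noise` are made up for this example). One feature tracks the label and three are random, and the selector is asked for the top 2.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
y = np.array([0, 1] * 20)

# Synthetic, non-negative features (chi2 rejects negative values)
informative = y + rng.random(40) * 0.1   # closely follows the label
noise = rng.random((40, 3))              # unrelated to the label
X = np.column_stack([informative, noise])

selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)         # one chi-squared score per feature
print(selector.get_support())   # boolean mask of the selected features
print(X_new.shape)              # only k columns remain
```

Because chi2 needs non-negative inputs, placing MinMaxScaler before the selection step, as this notebook does, is a convenient way to satisfy that requirement.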

In [101]:
trf4 = SelectKBest(score_func=chi2, k=5)

### Model Training : DT

In [102]:
from sklearn.tree import DecisionTreeClassifier

In [103]:
trf5 = DecisionTreeClassifier()

### Pipeline

In [112]:
from sklearn.pipeline import Pipeline, make_pipeline

In [113]:
pipe = Pipeline([
    ('trf1', trf1),  # (step name, transformer object)
    ('trf2',trf2),
    ('trf3',trf3),
    ('trf4',trf4),
    ('trf5',trf5)
])

### Pipeline vs make_pipeline

- Pipeline requires naming the steps; make_pipeline does not.
- (The same applies to ColumnTransformer vs make_column_transformer.)

In [106]:
# Alternative to Pipeline: step names are generated automatically (e.g. 'columntransformer-1')
pipe = make_pipeline(trf1, trf2, trf3, trf4, trf5)
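
The naming difference can be seen in a short sketch. The two-step pipelines and the step names 'scale' and 'tree' are illustrative only and are not the pipeline built in this notebook.

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Explicit names chosen by us
named = Pipeline([('scale', MinMaxScaler()), ('tree', DecisionTreeClassifier())])

# Names derived from the lowercased class names
auto = make_pipeline(MinMaxScaler(), DecisionTreeClassifier())

print(list(named.named_steps))  # ['scale', 'tree']
print(list(auto.named_steps))   # ['minmaxscaler', 'decisiontreeclassifier']
```

The step names matter later: lookups through named_steps and parameter names for grid search (for example 'tree__max_depth' versus 'decisiontreeclassifier__max_depth') both depend on them.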

### Train the model

- Note that if a pipeline contains only preprocessing steps, for example SimpleImputer -> OneHotEncoder -> scaling, we would call __fit_transform__ on it.
- This pipeline also ends with a model, so we call __fit__ on it and then __predict__.
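
The two usage patterns can be sketched side by side. The tiny arrays and the step names below are assumptions made for the example, not data from this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Made-up data: two numeric features, one missing value
X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0], [4.0, 40.0]])
y = np.array([0, 0, 1, 1])

# Preprocessing only: fit_transform returns the transformed array
prep = Pipeline([('impute', SimpleImputer()), ('scale', MinMaxScaler())])
X_prepared = prep.fit_transform(X)
print(X_prepared)

# Preprocessing + model: fit trains every step, predict runs them in order
model = Pipeline([
    ('impute', SimpleImputer()),
    ('scale', MinMaxScaler()),
    ('tree', DecisionTreeClassifier(random_state=0)),
])
model.fit(X, y)
print(model.predict(X))
```

During predict the preprocessing steps only call transform with the values learned in fit, so nothing is re-learned from the test data.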

In [107]:
pipe.fit(X_train,y_train)

Pipeline(steps=[('columntransformer-1',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('impute_age', SimpleImputer(),
                                                  [2]),
                                                 ('impute_embarked',
                                                  SimpleImputer(strategy='most_frequent'),
                                                  [6])])),
                ('columntransformer-2',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('ohe_sex_embarked',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse=False),
                                                  [1, 6])])),
                ('columntransformer-3',
                 ColumnTransformer(transformers=[('scaling', MinMaxScaler(),
                           

### To see a Visual Representation of the pipeline

In [114]:
## Display pipeline
from sklearn import set_config

In [115]:
set_config(display='diagram')

In [116]:
pipe.fit(X_train,y_train)

In [118]:
## Inspect the fitted transformers inside the trf1 step

pipe.named_steps['trf1'].transformers_

[('impute_age', SimpleImputer(), [2]),
 ('impute_embarked', SimpleImputer(strategy='most_frequent'), [6]),
 ('remainder', 'passthrough', [0, 1, 3, 4, 5])]

In [120]:
## To find the mean used for imputation, start from the impute_age entry
pipe.named_steps['trf1'].transformers_[0] 

('impute_age', SimpleImputer(), [2])

- This returns the impute_age tuple from trf1; index 1 of that tuple is the SimpleImputer() object

In [121]:
pipe.named_steps['trf1'].transformers_[0][1]

- SimpleImputer has a statistics_ attribute, which holds the fill value learned during fit (here, the mean of Age)

In [122]:
pipe.named_steps['trf1'].transformers_[0][1].statistics_

array([29.49884615])

In [123]:
pipe.named_steps['trf1'].transformers_

[('impute_age', SimpleImputer(), [2]),
 ('impute_embarked', SimpleImputer(strategy='most_frequent'), [6]),
 ('remainder', 'passthrough', [0, 1, 3, 4, 5])]

- Now we want to see the most frequent value learned by impute_embarked

- Take the tuple at index 1 of trf1's transformers_, then index 1 of that tuple (the SimpleImputer object), and read statistics_

In [126]:
pipe.named_steps['trf1'].transformers_[1][1].statistics_

array(['S'], dtype=object)
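
As an alternative to positional indexing, a fitted ColumnTransformer also exposes named_transformers_, which looks transformers up by name. The sketch below uses a hypothetical two-column frame; for the pipeline in this notebook the equivalent lookup would presumably be pipe.named_steps['trf1'].named_transformers_['impute_age'].statistics_.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing Age
toy = pd.DataFrame({'Age': [20.0, np.nan, 40.0], 'Embarked': ['S', 'C', 'S']})

ct = ColumnTransformer([
    ('impute_age', SimpleImputer(), [0]),
    ('impute_embarked', SimpleImputer(strategy='most_frequent'), [1]),
])
ct.fit(toy)

# Look up the fitted imputers by name instead of by position
age_mean = ct.named_transformers_['impute_age'].statistics_
embarked_mode = ct.named_transformers_['impute_embarked'].statistics_
print(age_mean)
print(embarked_mode)
```

Name-based access keeps working if the order of the transformers changes, which the [0][1] style does not guarantee.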

### Predict model

In [127]:
y_pred = pipe.predict(X_test)

In [128]:
y_pred

array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1,
       0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1,
       0, 0, 0], dtype=int64)

### Performance Metrics

In [129]:
from sklearn.metrics import accuracy_score

In [130]:
accuracy_score(y_test,y_pred)

0.6256983240223464

- Here feature selection keeps only 5 features, which likely contributes to the low accuracy
- So feature selection should be used with care

### Cross validation using pipeline

In [131]:
from sklearn.model_selection import cross_val_score

In [133]:
cross_val_score(pipe, X_train, y_train, cv=5, scoring='accuracy').mean()

- cross_val_score needs the labels as well as the features: if y_train is left out, all 5 fits fail and the mean score is nan
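
The call pattern for cross-validating a whole pipeline can be sketched on synthetic data. The dataset from make_classification and the two-step `demo_pipe` are assumptions for illustration, so the scores say nothing about the Titanic model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Titanic set
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_pipe = make_pipeline(MinMaxScaler(), DecisionTreeClassifier(random_state=0))

# Both features and labels are passed; the pipeline is refit on every fold
scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=5, scoring='accuracy')
print(scores)
print(scores.mean())
```

Because the whole pipeline is refit inside each fold, the scaler never sees the held-out part of the data, which is one of the main reasons to cross-validate the pipeline rather than a pre-transformed array.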

### Hyperparameter tuning using GridSearchCV

In [134]:
parm = {
    'trf5__max_depth':[1,2,3,4,5, None]
}

In [135]:
from sklearn.model_selection import GridSearchCV

In [136]:
grid = GridSearchCV(pipe, parm, cv=5, scoring='accuracy')

In [137]:
grid.fit(X_train, y_train)

In [138]:
grid.best_score_

0.6391214419383433

In [139]:
grid.best_params_

{'trf5__max_depth': 2}
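
The key 'trf5__max_depth' follows the step-name, double underscore, parameter-name convention. A minimal sketch on synthetic data follows; the step names 'scale' and 'tree' and the data are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Titanic set
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('tree', DecisionTreeClassifier(random_state=0)),
])

# '<step name>__<parameter name>' routes the value to that step
param_grid = {'tree__max_depth': [1, 2, 3, None]}

search = GridSearchCV(demo_pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

print(search.best_params_)
best_model = search.best_estimator_  # pipeline refit on all data with the best setting
```

With the default refit=True, best_estimator_ is a complete fitted pipeline, so it can be used for prediction or exported directly.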

### Exporting the Pipeline

In [140]:
# Export 
import pickle

In [141]:
with open('pipe.pkl', 'wb') as f:  # the with-block closes the file after writing
    pickle.dump(pipe, f)
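
Loading the exported file back is the mirror image of the dump. The sketch below round-trips a small stand-in pipeline through a temporary file; `demo_pipe`, the data, and the file name are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Small stand-in pipeline fitted on made-up data
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y_demo = np.array([0, 0, 1, 1])
demo_pipe = make_pipeline(MinMaxScaler(), DecisionTreeClassifier(random_state=0))
demo_pipe.fit(X_demo, y_demo)

# Write to a temporary location, then read the pipeline back
path = os.path.join(tempfile.mkdtemp(), 'demo_pipe.pkl')
with open(path, 'wb') as f:
    pickle.dump(demo_pipe, f)
with open(path, 'rb') as f:
    restored = pickle.load(f)

# The restored object predicts on raw inputs, preprocessing included
print(restored.predict(X_demo))
```

A pickled pipeline should generally be loaded with the same scikit-learn version that created it, and pickle files from untrusted sources should not be loaded.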