**Note:** The whole model-building process in this notebook runs through an sklearn `Pipeline`. It is a good starting point if you want to see how models are built in real-world projects. If you are not already familiar with `ColumnTransformer` and sklearn `Pipeline`, refer to the sklearn documentation while working through it.<br>
<br>
To keep this notebook clear, I have not included any exploratory data analysis or fancy plots. You can plot and explore the data as you wish.<br>

The preprocessed data is pickled and provided here.

Imports

In [126]:
import pandas as pd 
import numpy as np
import pickle
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline,make_pipeline
from sklearn.metrics import accuracy_score

Imports for model

In [127]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
DT = DecisionTreeClassifier()
RF = RandomForestClassifier()
LR = LogisticRegression()

Imports for feature scaling

In [128]:
from sklearn.preprocessing import MinMaxScaler
minmaxscaler = MinMaxScaler()

Imports for feature selection

In [130]:
from sklearn.feature_selection import SelectKBest,chi2

In [129]:
#Display Pipeline
from sklearn import set_config
set_config(display='diagram')

*IMPORT DATA*

In [131]:
with open("TRAIN.pkl","rb") as f:
    train = pickle.load(f)
with open("TEST.pkl","rb") as f:
    test = pickle.load(f)
with open("passengerId.pkl","rb") as f:
    passengerId = pickle.load(f)

*Handling Outliers*

Age Column

Conclusions after visualisation<br>
Age column - slightly skewed, approximately normally distributed<br>
Fare column - highly skewed distribution<br>
Both columns contain outliers


*Outlier handling using IQR technique*

In [106]:
# Finding the IQR
percentile25 = train['Age'].quantile(0.25)
percentile75 = train['Age'].quantile(0.75)
iqr = percentile75 - percentile25
#Limit for age
High_for_age = percentile75 + 1.5 * iqr
Low_for_age = percentile25 - 1.5 * iqr

In [107]:
#For train data
train1 = train.copy()

train1['Age'] = np.where(
    train1['Age'] > High_for_age,
    High_for_age,
    np.where(
        train1['Age'] < Low_for_age,
        Low_for_age,
        train1['Age']
    )
)

In [108]:
#For test data
test1 = test.copy()

test1['Age'] = np.where(
    test1['Age'] > High_for_age,
    High_for_age,
    np.where(
        test1['Age'] < Low_for_age,
        Low_for_age,
        test1['Age']
    )
)

Fare Column

In [109]:
# Finding the IQR
percentile25 = train['Fare'].quantile(0.25)
percentile75 = train['Fare'].quantile(0.75)
iqr = percentile75 - percentile25
#Limit for fare
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr

In [110]:
train2 = train1.copy()

train2['Fare'] = np.where(
    train2['Fare'] > upper_limit,
    upper_limit,
    np.where(
        train2['Fare'] < lower_limit,
        lower_limit,
        train2['Fare']
    )
)

In [111]:
test2 = test1.copy()

test2['Fare'] = np.where(
    test2['Fare'] > upper_limit,
    upper_limit,
    np.where(
        test2['Fare'] < lower_limit,
        lower_limit,
        test2['Fare']
    )
)
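The Age and Fare cells above repeat the same capping logic four times. As a minimal sketch, the same idea can be wrapped in two small helpers. `iqr_limits` and `cap_outliers` are hypothetical names that do not appear in this notebook, the demo frames are made-up, and `Series.clip` is used here in place of the nested `np.where` calls (both leave missing values untouched). As in the cells above, the limits are computed on the training data only and then reused for the test data.

```python
import numpy as np
import pandas as pd


def iqr_limits(series, k=1.5):
    """Return (lower, upper) limits at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(df, column, lower, upper):
    """Return a copy of df with `column` capped to [lower, upper]."""
    capped = df.copy()
    capped[column] = capped[column].clip(lower=lower, upper=upper)
    return capped


# Made-up data standing in for the pickled train/test frames
demo_train = pd.DataFrame({"Fare": [1.0, 2.0, 3.0, 4.0, 100.0]})
demo_test = pd.DataFrame({"Fare": [-50.0, 2.5, np.nan]})

# Limits come from the training data and are applied to both frames
low, high = iqr_limits(demo_train["Fare"])
demo_train_capped = cap_outliers(demo_train, "Fare", low, high)
demo_test_capped = cap_outliers(demo_test, "Fare", low, high)
```

The original frames are left unchanged because `cap_outliers` works on a copy, which mirrors the `train.copy()` / `test.copy()` pattern used above.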

Performing train test split

In [132]:
X_train,X_test,y_train,y_test = train_test_split(train.drop(columns=["Survived"]),train["Survived"],test_size=0.2,random_state=20)

In [133]:
X_train

Unnamed: 0,Pclass,Sex,Age,Fare,Embarked,FamilySize,NameTitle,Cabin
811,3,male,39.0,24.1500,S,1,Mr,0
29,3,male,,7.8958,S,1,Mr,0
49,3,female,18.0,17.8000,S,2,Mrs,0
105,3,male,28.0,7.8958,S,1,Mr,0
616,3,male,34.0,14.4000,S,3,Mr,0
...,...,...,...,...,...,...,...,...
218,1,female,32.0,76.2917,C,1,Miss,1
223,3,male,,7.8958,S,1,Mr,0
271,3,male,25.0,0.0000,S,1,Mr,0
474,3,female,22.0,9.8375,S,1,Miss,0


**Creating transformers using ColumnTransformer for different operations**

*ColumnTransformer for IMPUTATION*

In [116]:
transformer1=ColumnTransformer([
                          ("Mean Imputation",SimpleImputer(),[2]),
                          ("Most frequent imputation",SimpleImputer(strategy="most_frequent"),[4])
],remainder="passthrough") 
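One detail worth keeping in mind: `ColumnTransformer` places the columns it transforms first, in the order the transformers are listed, and appends the `remainder="passthrough"` columns after them. That is why the encoding step that follows addresses columns by their new positions rather than by their positions in `X_train`. A small sketch with made-up rows (the `demo` frame and the step names are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Made-up rows with the same leading columns as X_train
demo = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "male"],
    "Age": [39.0, np.nan, 18.0],
    "Fare": [24.15, 76.29, 17.80],
    "Embarked": ["S", np.nan, "S"],
})

demo_imputer = ColumnTransformer([
    ("mean_age", SimpleImputer(), [2]),                                  # Age
    ("mode_embarked", SimpleImputer(strategy="most_frequent"), [4]),     # Embarked
], remainder="passthrough")

out = demo_imputer.fit_transform(demo)

# Output column order: Age, Embarked, then the untouched Pclass, Sex, Fare
print(out[1])
```

With the full eight-column frame, the same rule puts Embarked, Sex and NameTitle at positions 1, 3 and 6 after imputation.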

*ColumnTransformer for FEATURE ENCODING*

In [117]:
#Transformer for encoding
transformer2 = ColumnTransformer([
                                  # scikit-learn >= 1.2 names this argument sparse_output; older versions use sparse=False
                                  ("OHE",OneHotEncoder(sparse_output=False),[1,3,6])
],remainder="passthrough")
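By default `OneHotEncoder` raises an error when `transform` meets a category it did not see during `fit`, which can happen if the test data contains a rare `NameTitle` value. One option, shown here as a hedged sketch rather than as part of this notebook's pipeline, is `handle_unknown="ignore"`, which encodes unseen categories as all zeros. The titles below are made-up, and `.toarray()` is used so the snippet does not depend on the version-specific argument that controls sparse output.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on the titles seen in "training"; categories are stored in sorted order
demo_encoder = OneHotEncoder(handle_unknown="ignore")
demo_encoder.fit(np.array([["Mr"], ["Mrs"], ["Miss"]]))

# A known title gets a single 1; an unseen title becomes a row of zeros
seen = demo_encoder.transform(np.array([["Mrs"]])).toarray()
unseen = demo_encoder.transform(np.array([["Dr"]])).toarray()

print(demo_encoder.categories_[0])
print(seen, unseen)
```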

*Creating an instance for FEATURE SCALING*

In [118]:
#Creating instance for different Scaling techniques
minmaxscaler = MinMaxScaler()

MODEL FITTING WITH PIPE CREATION

*PIPELINE FOR DECISION TREE*

In [134]:

pipe1 = Pipeline([
                 ('imputation',transformer1),
                 ("encoding",transformer2),
                 ("Scaling",minmaxscaler),
                 ('Feature selection',SelectKBest(k=3)),
                 ("model fitting",DT)
])
pipe1.fit(X_train,y_train)
y_pred1 = pipe1.predict(X_test)
accuracy_score(y_test,y_pred1)

0.8324022346368715

In [120]:
#To display pipeline visually
pipe1.fit(X_train,y_train)

*PIPELINE FOR RANDOM FOREST*

In [135]:
#PIPELINE FOR RANDOM FOREST
pipe2 = Pipeline([
                 ('imputation',transformer1),
                 ("encoding",transformer2),
                 ("Scaling",minmaxscaler),
                 ('Feature selection',SelectKBest(chi2,k=3)),
                 ("model fitting",RF)
])
pipe2.fit(X_train,y_train)
y_pred2 = pipe2.predict(X_test)
accuracy_score(y_test,y_pred2)

0.8324022346368715
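A note on step order: `chi2` only accepts non-negative feature values. Placing `MinMaxScaler` before `SelectKBest(chi2, ...)`, as in the pipeline above, guarantees this during fitting because every feature is mapped to the range 0 to 1. A minimal sketch on made-up numbers (the arrays below are illustrative, not taken from the Titanic data):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Made-up features; the first column contains negative values
X_demo = np.array([
    [-1.0, 5.0, 0.2],
    [2.0, 3.0, 0.1],
    [-3.0, 8.0, 0.9],
    [4.0, 1.0, 0.4],
])
y_demo = np.array([0, 1, 0, 1])

# chi2 rejects negative inputs with a ValueError
try:
    SelectKBest(chi2, k=2).fit(X_demo, y_demo)
    raised = False
except ValueError:
    raised = True

# After min-max scaling every value lies in [0, 1], so chi2 can be computed
X_scaled = MinMaxScaler().fit_transform(X_demo)
selector = SelectKBest(chi2, k=2).fit(X_scaled, y_demo)
X_selected = selector.transform(X_scaled)

print(raised, X_selected.shape)
```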

In [122]:
#To display pipeline visually
pipe2.fit(X_train,y_train)

*PIPELINE FOR LOGISTIC REGRESSION*

In [136]:
#PIPELINE FOR LOGISTIC REGRESSION
pipe3 = Pipeline([
                 ('imputation',transformer1),
                 ("encoding",transformer2),
                 ("Scaling",minmaxscaler),
                # ('Feature selection',SelectKBest(chi2,k=5)),
                 ("model fitting",LR)
])
pipe3.fit(X_train,y_train)
y_pred3 = pipe3.predict(X_test)
accuracy_score(y_test,y_pred3)

0.888268156424581

In [124]:
#To display pipeline visually
pipe3.fit(X_train,y_train)

**Submission dataframe creation**

Decision Tree Model

In [None]:
pipe1.fit(train2.drop(columns=["Survived"]),train2["Survived"])
final_predict = pipe1.predict(test2)
subdata = pd.DataFrame()
subdata["PassengerId"] = passengerId
subdata["Survived"] = final_predict
subdata.to_csv("submission_DT.csv",index=False)

Random Forest Model

In [None]:
pipe2.fit(train2.drop(columns=["Survived"]),train2["Survived"])
final_predict = pipe2.predict(test2)
subdata = pd.DataFrame()
subdata["PassengerId"] = passengerId
subdata["Survived"] = final_predict
subdata.to_csv("submission_RF.csv",index=False)

Logistic Regression Model

In [138]:
pipe3.fit(train2.drop(columns=["Survived"]),train2["Survived"])
final_predict = pipe3.predict(test2)
subdata = pd.DataFrame()
subdata["PassengerId"] = passengerId
subdata["Survived"] = final_predict
subdata.to_csv("submission_LR.csv",index=False)
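The three submission cells differ only in the pipeline and the file name, so a small helper reduces the chance of mixing them up. The sketch below is one possible way to do it: `make_submission` is a hypothetical name, and toy data with a plain `LogisticRegression` stand in for this notebook's pipelines and pickled frames.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression


def make_submission(model, X_fit, y_fit, X_new, ids):
    """Fit `model`, predict on `X_new`, and return a submission frame."""
    model.fit(X_fit, y_fit)
    return pd.DataFrame({
        "PassengerId": list(ids),
        "Survived": model.predict(X_new),
    })


# Toy stand-ins for train2, test2 and passengerId
toy_X = pd.DataFrame({"a": [0.0, 1.0, 0.0, 1.0], "b": [1.0, 0.0, 1.0, 0.0]})
toy_y = pd.Series([0, 1, 0, 1])
toy_new = pd.DataFrame({"a": [1.0, 0.0], "b": [0.0, 1.0]})

demo_submission = make_submission(
    LogisticRegression(), toy_X, toy_y, toy_new, ids=[892, 893]
)
print(demo_submission)
```

With the notebook's objects, the call would look like `make_submission(pipe3, train2.drop(columns=["Survived"]), train2["Survived"], test2, passengerId).to_csv("submission_LR.csv", index=False)`.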