
In [None]:
import pandas as pd
import numpy as np

## **Overview**
* `PassengerId` is the unique id of the row and it doesn't have any effect on target
* `Survived` is the target variable we are trying to predict (**0** or **1**):
    - **1 = Survived**
    - **0 = Not Survived**
* `Pclass` (Passenger Class) is the socio-economic status of the passenger and it is a categorical ordinal feature which has **3** unique values (**1, 2 or 3**):
    - **1 = Upper Class**
    - **2 = Middle Class**
    - **3 = Lower Class**
* `Name`, `Sex` and `Age` are self-explanatory
* `SibSp` is the number of the passenger's siblings and spouses aboard
* `Parch` is the number of the passenger's parents and children aboard
* `Ticket` is the ticket number of the passenger
* `Fare` is the passenger fare
* `Cabin` is the cabin number of the passenger
* `Embarked` is port of embarkation and it is a categorical feature which has **3** unique values (**C**, **Q** or **S**):
    - **C = Cherbourg**
    - **Q = Queenstown**
    - **S = Southampton**

In [None]:
train = pd.read_csv()
test = pd.read_csv()

The columns `SibSp` and `Parch` tell us whether a passenger was travelling with family. We will create a new column `Is_alone` that is **1** if the passenger was accompanied and **0** if not. Note that, despite its name, a value of **1** means the passenger was *not* alone.

In [None]:
def is_alone(x):
    if  (x['SibSp'] + x['Parch'])  > 0:
        return 1
    else:
        return 0

train['Is_alone'] = train.apply(is_alone, axis = 1)
test['Is_alone'] = test.apply(is_alone, axis = 1)
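
The row-wise `apply` above can also be written as a single vectorized expression, which is usually faster on larger frames. This is a minimal sketch on a hypothetical toy frame (`toy` is not part of the original notebook); it keeps the same convention as `is_alone()`, where 1 means accompanied.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data.
toy = pd.DataFrame({'SibSp': [1, 0, 0, 2], 'Parch': [0, 0, 3, 1]})

# 1 when the passenger travelled with at least one relative, 0 otherwise,
# i.e. the same values that is_alone() returns row by row.
toy['Is_alone'] = ((toy['SibSp'] + toy['Parch']) > 0).astype(int)
print(toy)
```

The boolean comparison is evaluated for the whole column at once, and `astype(int)` turns `True`/`False` into 1/0.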

## Inference

1. Column `PassengerId` won't help us.
2. I've seen people use the `Name` column cleverly, like [here](https://www.kaggle.com/gunesevitan/titanic-advanced-feature-engineering-tutorial), but I won't use it in this notebook because:
    - It is not important from the perspective of our main objective.
    - It requires extra effort.
    - It might not bring a huge change.
3. Now that we have created a new feature `Is_alone` using features `SibSp` and `Parch`, we can delete them from our dataset.

In [None]:
train = train.drop(['PassengerId','Name','SibSp','Parch'], axis = 1)
test = test.drop(['Name','SibSp','Parch'], axis = 1)

## Explore

In [None]:
print("Train columns:", ', '.join(map(str, train.columns))) 
display(train.head())
print("\nTest columns:",  ', '.join(map(str, test.columns)))
display(test.head())

Train columns: Survived, Pclass, Sex, Age, Ticket, Fare, Cabin, Embarked, Is_alone


| |Survived|Pclass|Sex|Age|Ticket|Fare|Cabin|Embarked|Is_alone|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|0|3|male|22.0|A/5 21171|7.25|NaN|S|1|
|1|1|1|female|38.0|PC 17599|71.2833|C85|C|1|
|2|1|3|female|26.0|STON/O2. 3101282|7.925|NaN|S|0|
|3|1|1|female|35.0|113803|53.1|C123|S|1|
|4|0|3|male|35.0|373450|8.05|NaN|S|0|



Test columns: PassengerId, Pclass, Sex, Age, Ticket, Fare, Cabin, Embarked, Is_alone


| |PassengerId|Pclass|Sex|Age|Ticket|Fare|Cabin|Embarked|Is_alone|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|892|3|male|34.5|330911|7.8292|NaN|Q|0|
|1|893|3|female|47.0|363272|7.0|NaN|S|1|
|2|894|2|male|62.0|240276|9.6875|NaN|Q|0|
|3|895|3|male|27.0|315154|8.6625|NaN|S|0|
|4|896|3|female|22.0|3101298|12.2875|NaN|S|1|


## Checking for missing values

In [None]:
print("TRAIN DATA:")
train.isnull().sum()

TRAIN DATA:


Survived      0
Pclass        0
Sex           0
Age         177
Ticket        0
Fare          0
Cabin       687
Embarked      2
Is_alone      0
dtype: int64

In [None]:
print("TEST DATA:")
test.isnull().sum()

TEST DATA:


PassengerId      0
Pclass           0
Sex              0
Age             86
Ticket           0
Fare             1
Cabin          327
Embarked         0
Is_alone         0
dtype: int64

#### Observations:
- **177** values are missing from `Age` in the training data.
- **687** values are missing from `Cabin` in the training data.
- **2** values are missing from `Embarked` in the training data.

- **86** values are missing from `Age` in the test data.
- **1** value is missing from `Fare` in the test data.
- **327** values are missing from `Cabin` in the test data.
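
Raw counts are easier to compare when shown next to the share of rows they represent. The sketch below is illustrative only: `toy` is a hypothetical frame, not the Titanic data, and the `summary` layout is an assumption rather than something the notebook computes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.1],
})

# Count missing values per column and express them as a percentage of rows.
missing = toy.isnull().sum()
summary = pd.DataFrame({
    'missing': missing,
    'percent': (missing / len(toy) * 100).round(1),
})

# Keep only columns that actually have gaps, worst first.
summary = summary[summary['missing'] > 0].sort_values('missing', ascending=False)
print(summary)
```

Applied to `train` and `test`, the same pattern would show at a glance that `Cabin` is missing for most passengers while `Embarked` and `Fare` have only a handful of gaps.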

## Dealing with missing values

- We have two types of missing values:
    - Integer/Float (int64/float64)
    - Text (object)

- We will use [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) to fill missing values in both numerical and categorical columns, and [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) to encode the categorical values.
- We will follow a pipeline which goes like this: 
            
            Impute Numerical values > Impute Categorical Values > Transform Columns > Define model
- Let's find out which are numerical and categorical columns in our dataset.

In [None]:
train.dtypes

Survived      int64
Pclass        int64
Sex          object
Age         float64
Ticket       object
Fare        float64
Cabin        object
Embarked     object
Is_alone      int64
dtype: object

#### Observation
`Pclass, Age, Is_alone, Fare` are numerical columns.

`Sex, Ticket, Cabin, Embarked` are categorical columns.
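
The two lists can also be derived from the dtypes instead of being typed by hand. This is a minimal sketch on a hypothetical frame (`toy`, `numerical_cols` and `categorical_cols` are illustrative names); on the real `train` frame the target `Survived` is also numeric, so it would have to be excluded from the feature list.

```python
import pandas as pd

# Hypothetical frame with a mix of numeric and text columns.
toy = pd.DataFrame({
    'Pclass': [3, 1],
    'Age': [22.0, 38.0],
    'Fare': [7.25, 71.2833],
    'Sex': ['male', 'female'],
    'Embarked': ['S', 'C'],
})

# Numeric dtypes (int64, float64, ...) versus everything else.
numerical_cols = toy.select_dtypes(include='number').columns.tolist()
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()

print('Numerical:', numerical_cols)
print('Categorical:', categorical_cols)
```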

In [None]:
numerical = ['Pclass','Age','Is_alone','Fare']
categorical = ['Sex','Ticket','Cabin','Embarked']

In [None]:
features = numerical + categorical
target = ['Survived']
print('Features:', features, '\nTarget:', target)

Features: ['Pclass', 'Age', 'Is_alone', 'Fare', 'Sex', 'Ticket', 'Cabin', 'Embarked'] 
Target: ['Survived']


In [None]:
from sklearn.model_selection import train_test_split

train_set, valid_set = train_test_split(train, test_size = 0.3, random_state = 0)

## Transforming the data
We will use a combination of [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) and [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to carry out the necessary transformations on our data.

Transformers we are going to use:

|Data type|Transformer|
|:---|:---|
|Numerical|[SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) & [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)|
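|Categorical|[SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) & [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)|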


We will use the **mean** strategy to fill missing values in numerical columns and the **most_frequent** strategy for categorical columns.
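
To make the two strategies concrete, here is a small sketch on hypothetical arrays (`ages` and `ports` are made-up values, not taken from the dataset). It only shows what `SimpleImputer` fills in for each strategy.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Numerical column: the gap is replaced by the mean of the observed values.
ages = np.array([[20.0], [np.nan], [40.0]])
age_filled = SimpleImputer(strategy='mean').fit_transform(ages)
print(age_filled.ravel())   # the missing entry becomes (20 + 40) / 2

# Categorical column: the gap is replaced by the most common category.
ports = np.array([['S'], ['S'], [np.nan], ['C']], dtype=object)
port_filled = SimpleImputer(strategy='most_frequent').fit_transform(ports)
print(port_filled.ravel())  # the missing entry becomes 'S'
```

Because the imputers sit inside a pipeline, the fill values are learned from the data passed to `fit` and then reused unchanged when transforming other data.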

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
numerical_transformer = Pipeline(steps=[
                        ('simple', SimpleImputer(strategy='mean')),
                        ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
               transformers=[
                    ('num', numerical_transformer, numerical),
                    ('cat', categorical_transformer, categorical)])

## Defining Models

Here, we are going to try two approaches:

1. Ensembling.
2. Random Forest Classifier (used for submission).

### 1.1 Ensembling
[Ensemble](https://scikit-learn.org/stable/modules/ensemble.html) methods are techniques that create multiple models and then combine them to produce improved results. Ensemble methods usually produce more accurate solutions than a single model would. The models used to create such ensemble models are called ‘base models’.

We will use [Linear SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC), [Radial SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC), [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) and [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), and use their results to predict.

We will do ensembling with the [Voting Ensemble](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html). Voting is one of the simplest ways of combining the predictions from multiple machine learning algorithms. It works by first creating two or more standalone models from your training dataset. A Voting Classifier can then be used to wrap your models and average the predictions of the sub-models when asked to make predictions for new data.

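The idea behind a weighted Voting Classifier is to assign each classifier a weight according to its accuracy, so that the classifier with the highest accuracy gets the highest weight, and so on. Note that the `VotingClassifier` built in section 1.2 does not pass a `weights` argument, so there all classifiers count equally.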
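
For reference, weights are passed to `VotingClassifier` through its `weights` parameter. The sketch below is an illustration on synthetic data from `make_classification`, not the Titanic pipeline; the base models and the weight values `[1, 2, 3]` are arbitrary assumptions chosen only to show the mechanics.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic binary classification problem (stand-in for the real data).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Illustrative weights: a larger weight gives a model more say in the vote.
weights = [1, 2, 3]
ensemble = VotingClassifier(
    estimators=[('logreg', LogisticRegression(max_iter=1000)),
                ('svc', SVC(probability=True, random_state=0)),
                ('forest', RandomForestClassifier(n_estimators=50, random_state=0))],
    voting='soft',
    weights=weights)
ensemble.fit(X, y)

# Soft voting is a weighted average of the fitted models' class probabilities.
manual = np.average([est.predict_proba(X) for est in ensemble.estimators_],
                    axis=0, weights=weights)
print(np.allclose(ensemble.predict_proba(X), manual))
```

In practice the weights could be derived from validation accuracy, but they are then tuned on the same data used for evaluation, so the resulting score should be read with some caution.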

Before moving on to the Voting Classifier, let's look at how each of the classification algorithms mentioned above performs on its own.

In [None]:
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn import metrics

accuracy = []
classifiers = ['Linear Svm', 'Radial Svm', 'Logistic Regression', 'Random Forest Classifier']

models = [svm.SVC(kernel='linear'),
          svm.SVC(kernel='rbf'),
          LogisticRegression(),
          RandomForestClassifier(n_estimators=200, random_state=0)]

for model in models:
    # Chain preprocessing and the classifier so both are fit on the training split only
    pipe = Pipeline(steps=[
                    ('preprocessor', preprocessor),
                    ('model', model)])

    pipe.fit(train_set[features], np.ravel(train_set[target]))
    # score() returns the mean accuracy on the validation split
    accuracy.append(pipe.score(valid_set[features], valid_set[target]))

observations = pd.DataFrame(accuracy, index=classifiers, columns=['Score'])
observations.sort_values(by = 'Score', ascending = False)

| |Score|
|:---|:---|
|Random Forest Classifier|0.839552|
|Linear Svm|0.828358|
|Radial Svm|0.820896|
|Logistic Regression|0.813433|


### 1.2 Voting Ensemble
We will select the top 3 models based on their scores i.e. Linear Svm, Radial Svm and Random Forest Classifier.
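
The selection can also be done programmatically rather than by reading the table. This is a minimal sketch that re-enters the validation scores reported above into a hypothetical `scores` Series and picks the three largest.

```python
import pandas as pd

# Validation scores copied from the comparison table above.
scores = pd.Series({
    'Linear Svm': 0.828358,
    'Radial Svm': 0.820896,
    'Logistic Regression': 0.813433,
    'Random Forest Classifier': 0.839552,
})

# nlargest returns the entries sorted from highest to lowest score.
top3 = scores.nlargest(3).index.tolist()
print(top3)
```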

In [None]:
from sklearn.ensemble import VotingClassifier

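# Note: gamma has no effect with the linear kernel; it only applies to 'rbf', 'poly' and 'sigmoid'
linear_svm = svm.SVC(kernel='linear', C=0.1, gamma=10, probability=True)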
pipe_linear = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', linear_svm)])

radial_svm = svm.SVC(kernel='rbf', C=0.1,gamma=10, probability=True)
pipe_radial = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', radial_svm)])

rand = RandomForestClassifier(n_estimators=200, random_state=0)
pipe_rand = Pipeline(steps=[('preprocessor', preprocessor),
                            ('model', rand)])


ensemble_all = VotingClassifier(estimators=[('Linear_svm', pipe_linear),
                                            ('Radial_svm', pipe_radial), 
                                            ('Random Forest Classifier', pipe_rand)],
                                voting='soft')

ensemble_all.fit(train_set[features], np.ravel(train_set[target]))
pred_valid = ensemble_all.predict(valid_set[features])

#### Evaluation of model with 3 classifiers.

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

acc_train = round(ensemble_all.score(train_set[features], train_set[target]) * 100, 2)
acc_valid = round(ensemble_all.score(valid_set[features], valid_set[target]) * 100, 2)

print("Train set Accuracy: ", acc_train, "%\nValidation set Accuracy: ", acc_valid, "%")

print("\nConfusion Matrix:\n", confusion_matrix(valid_set[target], pred_valid))
print("\nClassification Report:\n", classification_report(valid_set[target], pred_valid))

Train set Accuracy:  99.68 %
Validation set Accuracy:  83.21 %

Confusion Matrix:
 [[149  19]
 [ 26  74]]

Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.89      0.87       168
           1       0.80      0.74      0.77       100

    accuracy                           0.83       268
   macro avg       0.82      0.81      0.82       268
weighted avg       0.83      0.83      0.83       268
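
The figures in the report above follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy and the class-1 metrics from the four counts, using scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = np.array([[149, 19],
               [26, 74]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # (74 + 149) / 268
precision_1 = tp / (tp + fp)           # 74 / (74 + 19)
recall_1 = tp / (tp + fn)              # 74 / (74 + 26)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.4f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
```

The recall of 0.74 for class 1 means that 26 of the 100 survivors in the validation set were predicted as not surviving.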



### 2. Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200, random_state=0)

pipe = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)])

pipe.fit(train_set[features], np.ravel(train_set[target]))

pred_valid = pipe.predict(valid_set[features])

#### Evaluation of model with the best classifier.

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

acc_ran_train = round(pipe.score(train_set[features], train_set[target]) * 100, 2)
acc_ran_valid = round(pipe.score(valid_set[features], valid_set[target]) * 100, 2)

print("Train set Accuracy: ", acc_ran_train, "%\nValidation set Accuracy: ", acc_ran_valid, "%")

print("\nConfusion Matrix:\n", confusion_matrix(valid_set[target], pred_valid))
print("\nClassification Report:\n", classification_report(valid_set[target], pred_valid))

Train set Accuracy:  99.84 %
Validation set Accuracy:  83.96 %

Confusion Matrix:
 [[153  15]
 [ 28  72]]

Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.91      0.88       168
           1       0.83      0.72      0.77       100

    accuracy                           0.84       268
   macro avg       0.84      0.82      0.82       268
weighted avg       0.84      0.84      0.84       268



In [None]:
pred_test = pipe.predict(test[features])

output = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': pred_test})
output.to_csv('submission.csv', index=False)