## Task: Train a logistic regression classifier to predict the survival of passengers in the Titanic dataset

You are provided with code to download and load the Titanic dataset as a CSV file.

In the dataset, each row holds information about one Titanic passenger, such as name, gender, and class (see the dataframe below for more info).

The target column is 'Survived', which tells us whether a given passenger survived.

Use any or all of the other columns as input features (you can drop columns you think are not worth keeping).

Your task is to train a logistic regression model that takes the input features (make sure not to accidentally feed the 'Survived' column to the model as input) and predicts whether a passenger with these features would survive.

Make sure to put emphasis on code quality and to include a way to judge how well your model performs on **unseen data (data it was not trained on)**.

As a bonus, see if you can figure out which feature is most likely to affect the survivability of a passenger.

In [None]:
from IPython.display import clear_output

In [None]:
#Don't change this code

%pip install gdown==4.5

clear_output()

In [None]:
!gdown 18YfCgT3Rk7uYWrUzgjb2UR3Nyo9Z68bK  # Download the csv file.

Downloading...
From: https://drive.google.com/uc?id=18YfCgT3Rk7uYWrUzgjb2UR3Nyo9Z68bK
To: /content/titanic.csv
100% 60.3k/60.3k [00:00<00:00, 119MB/s]


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [None]:
titanic_data = pd.read_csv('titanic.csv')

In [None]:
titanic_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
data_y = titanic_data['Survived']
data_x = titanic_data.drop(columns=['Survived'])

In [None]:
data_x.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
data_y

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

In [None]:
data_x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Pclass       891 non-null    int64  
 2   Name         891 non-null    object 
 3   Sex          891 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Ticket       891 non-null    object 
 8   Fare         891 non-null    float64
 9   Cabin        204 non-null    object 
 10  Embarked     889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB


In [None]:
# Checking for NaN values
for column in data_x.columns:
    print(f'{column}: {data_x[column].isna().sum()}')

PassengerId: 0
Pclass: 0
Name: 0
Sex: 0
Age: 177
SibSp: 0
Parch: 0
Ticket: 0
Fare: 0
Cabin: 687
Embarked: 2


In [None]:
# Too many NaN values in Cabin, so drop it
data_x.drop('Cabin', inplace=True, axis=1)
# Impute Age with the mean (assign the result back; in-place fillna
# on a selected column is unreliable in recent pandas versions)
data_x['Age'] = data_x['Age'].fillna(data_x['Age'].mean())
# Impute Embarked with the mode
data_x['Embarked'] = data_x['Embarked'].fillna(data_x['Embarked'].mode()[0])
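
The fill values above are computed over all 891 rows, before any train/test split, so rows that later end up in the test set influence them. One possible alternative is sketched below: learn the fill values from the training rows only and reuse them on the held-out rows. The frame `toy`, the split point, and the names `age_fill` and `embarked_fill` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic frame.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0, None, 54.0],
    "Embarked": ["S", "C", None, "S", "S", "Q"],
})

# Assumption for the sketch: the first four rows are the training split.
train, test = toy.iloc[:4].copy(), toy.iloc[4:].copy()

# Learn the fill values from the training rows only.
age_fill = train["Age"].mean()
embarked_fill = train["Embarked"].mode()[0]

# Apply the same fill values to both splits.
for part in (train, test):
    part["Age"] = part["Age"].fillna(age_fill)
    part["Embarked"] = part["Embarked"].fillna(embarked_fill)

print(age_fill, embarked_fill)
```

With only 2 missing `Embarked` values and a mean-imputed `Age`, the difference is probably small on this dataset, but the habit keeps the test score honest.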

In [None]:
# Checking again
for column in data_x.columns:
    print(f'{column}: {data_x[column].isna().sum()}')

data_x.head()

PassengerId: 0
Pclass: 0
Name: 0
Sex: 0
Age: 0
SibSp: 0
Parch: 0
Ticket: 0
Fare: 0
Embarked: 0


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


In [None]:
data_x['Embarked'].unique()

array(['S', 'C', 'Q'], dtype=object)

In [None]:
data_x['Sex'].unique()

array(['male', 'female'], dtype=object)

In [None]:
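# Name and Ticket are free-text identifiers, so drop them
data_x.drop(['Name', 'Ticket'], inplace=True, axis=1)
# One-hot encode the categorical columns; drop_first removes the redundant dummy column
data_x = pd.get_dummies(data_x, columns=['Embarked', 'Sex'], drop_first=True)
data_x.head()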

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,Embarked_Q,Embarked_S,Sex_male
0,1,3,22.0,1,0,7.25,0,1,1
1,2,1,38.0,1,0,71.2833,0,0,0
2,3,3,26.0,0,0,7.925,0,1,0
3,4,1,35.0,1,0,53.1,0,1,0
4,5,3,35.0,0,0,8.05,0,1,1
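
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up frame (`toy` is hypothetical). The first category in sorted order of each column is dropped and becomes the reference level, so a row with all-zero `Embarked_*` columns means `Embarked == 'C'`, and `Sex_male == 0` means female.

```python
import pandas as pd

# Hypothetical toy frame with the same two categorical columns.
toy = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S"],
    "Sex": ["male", "female", "female", "male"],
})

# drop_first=True drops the first category (in sorted order) of each column.
encoded = pd.get_dummies(toy, columns=["Embarked", "Sex"], drop_first=True)

print(encoded.columns.tolist())
# Row 1 embarked at 'C', the dropped reference level, so both dummies are 0.
print(encoded.loc[1, ["Embarked_Q", "Embarked_S"]].tolist())
```

This matters when reading the model coefficients later: the weight on `Sex_male` is measured relative to female passengers, and the `Embarked_*` weights relative to port `C`.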


In [None]:
# Checking for NaN in y
data_y.isna().sum()

0

In [None]:
X = data_x.values
Y = data_y.values

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=20)
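
The split above is purely random. When the two classes are not equally common, passing `stratify` keeps the class ratio the same in both parts, which can make the test score a little more stable. A minimal sketch on made-up labels (`X_demo` and `y_demo` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 14 negatives and 6 positives (30% positive).
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 14 + [1] * 6)

# stratify=y_demo preserves the 70/30 class ratio in both halves.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=20, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```

In the notebook this would correspond to adding `stratify=Y` to the existing call; doing so changes which rows land in the test set, so the reported numbers would change too.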

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Training the model
model = LogisticRegression(max_iter=10000).fit(x_train, y_train)

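# Testing the model
y_pred = model.predict(x_test)

# Note: y_pred holds hard 0/1 labels, so this log loss is dominated by the
# clipping penalty on misclassified rows; pass model.predict_proba(x_test)
# instead for a probability-based value.
print(f'Log Loss={log_loss(y_test, y_pred)}')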

correct_label = (y_pred == y_test).sum()
print(f'Correctly Labeled: {correct_label}/{len(y_test)}. Accuracy = {correct_label/len(y_test) * 100}%')

Log Loss=6.846280532011079
Correctly Labeled: 145/179. Accuracy = 81.00558659217877%
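
Log loss is defined on predicted probabilities. When it is fed hard 0/1 labels, every misclassified row contributes the maximum clipped penalty, which is why the value above is so large even though accuracy is about 81%. The sketch below reproduces the effect with a hand-written binary log loss on made-up numbers; `binary_log_loss` and its `eps` clipping value are assumptions for illustration, not scikit-learn's implementation.

```python
import numpy as np

def binary_log_loss(y_true, p, eps=1e-15):
    """Mean negative log-likelihood for binary labels (illustrative helper)."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y_true = np.array([1, 0, 1, 0])
proba = np.array([0.9, 0.2, 0.4, 0.1])  # made-up predicted P(Survived = 1)
hard = (proba >= 0.5).astype(float)     # thresholded labels: [1, 0, 0, 0]

loss_from_proba = binary_log_loss(y_true, proba)
loss_from_hard = binary_log_loss(y_true, hard)

print(f"from probabilities: {loss_from_proba:.4f}")
print(f"from hard labels:   {loss_from_hard:.4f}")
```

For the notebook's model, the probability-based value would come from `log_loss(y_test, model.predict_proba(x_test))`.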


In [None]:
# Finding the most important feature: the coefficient with the largest
# magnitude (a large negative weight matters as much as a large positive one)
print(f'Model Coefficients: {model.coef_}')
index = np.abs(model.coef_).argmax()
print(f'Most important feature index: {index}')
print(f'Most important feature: {data_x.columns[index]}')

Model Coefficients: [[ 2.19896365e-04 -1.11589911e+00 -3.67510421e-02 -2.32193720e-01
  -1.00123001e-01  2.17566142e-03  2.21371040e-01 -2.63913584e-01
  -2.47795462e+00]]
Most important feature index: 8
Most important feature: Sex_male
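
Coefficient sizes are only comparable across features that share a scale. Here `PassengerId` and `Fare` span far wider ranges than the 0/1 dummy columns, so their raw weights look artificially small. One common approach, sketched below on synthetic data, is to standardize the features first and then rank them by absolute coefficient. The data, the feature names `signal` and `wide`, and the pipeline are assumptions for illustration, not the notebook's model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)              # unit-scale feature that drives the label
wide = rng.normal(scale=1000.0, size=n)  # large-scale feature with no effect
X_demo = np.column_stack([signal, wide])
y_demo = (signal + 0.3 * rng.normal(size=n) > 0).astype(int)

# Standardize, then fit, so each coefficient is per standard deviation.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_demo, y_demo)

coefs = pipe.named_steps["logisticregression"].coef_.ravel()
ranking = np.argsort(-np.abs(coefs))     # largest magnitude first
feature_names = ["signal", "wide"]
print([feature_names[i] for i in ranking])
```

Ranking by `np.abs(coefs)` rather than by the raw values treats a strongly negative weight, such as the one on `Sex_male` above, as just as influential as a strongly positive one.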
