# Project: Titanic Survival Prediction
In this project, we will create a Logistic Regression model that predicts which passengers survived the sinking of the Titanic, based on features like age and class.

The data is provided by Kaggle here: https://www.kaggle.com/c/titanic/data

## Part 1: Load and clean the data

In [24]:
import pandas as pd
import numpy as np

In [48]:
# Load the train and test files and stack them into one DataFrame.
# The Kaggle test file has no Survived column, so its rows get NaN labels after the concat.
passengers_tr = pd.read_csv('13 ML_LogesticRegression_KaggleChallengeTitanicSurvivalPredictionTrainDataset.csv')
passengers_te = pd.read_csv('13 ML_LogesticRegression_KaggleChallengeTitanicSurvivalPredictionTestDataset.csv')
passengers = pd.concat([passengers_tr, passengers_te], sort=False)
passengers.head(5)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
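
Stacking a labeled frame on top of an unlabeled one leaves missing labels in the rows that came from the unlabeled frame. The sketch below illustrates this on two tiny, made-up frames (the names `train`, `test`, `combined`, and `labeled` and all values are hypothetical stand-ins, not the real files):

```python
import pandas as pd

# Hypothetical miniature stand-ins for the train and test files.
train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Age": [22.0, 38.0]})
test = pd.DataFrame({"PassengerId": [892, 893], "Age": [34.5, 47.0]})

combined = pd.concat([train, test], sort=False)

# Rows that came from the unlabeled frame have no Survived value.
print(combined["Survived"].isna().tolist())  # [False, False, True, True]

# Only labeled rows can be used to fit or score a classifier.
labeled = combined.dropna(subset=["Survived"])
print(len(labeled))  # 2
```

Cleaning steps such as encoding or imputation can run on the combined frame, but the rows without a label have to be set aside before the model is trained.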


In [49]:
# Sex will be a feature. We transform it into numbers: female = 1, male = 0.
passengers['Sex'] = passengers['Sex'].map({'female': 1, 'male': 0})
passengers.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0.0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1.0,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S
4,5,0.0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S
5,6,0.0,3,"Moran, Mr. James",0,,0,0,330877,8.4583,,Q
6,7,0.0,1,"McCarthy, Mr. Timothy J",0,54.0,0,0,17463,51.8625,E46,S
7,8,0.0,3,"Palsson, Master. Gosta Leonard",0,2.0,3,1,349909,21.075,,S
8,9,1.0,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",1,27.0,0,2,347742,11.1333,,S
9,10,1.0,2,"Nasser, Mrs. Nicholas (Adele Achem)",1,14.0,1,0,237736,30.0708,,C


In [50]:
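# Age will be another feature. Missing values are replaced with an average age.

# Check for NaN values, then fill them with 0 as a placeholder.
passengers.Age.isna().any()
passengers.fillna({'Age': 0}, inplace=True)
# Replace the placeholder zeros with the column mean. The mean is taken after the
# zero fill, so the zeros pull it below the mean of the recorded ages. pandas skips
# NaN in .mean(), so passengers['Age'].fillna(passengers['Age'].mean()) would avoid that.
passengers.loc[passengers['Age'] == 0, 'Age'] = passengers['Age'].mean()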
passengers.head(10)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0.0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1.0,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S
4,5,0.0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S
5,6,0.0,3,"Moran, Mr. James",0,23.877517,0,0,330877,8.4583,,Q
6,7,0.0,1,"McCarthy, Mr. Timothy J",0,54.0,0,0,17463,51.8625,E46,S
7,8,0.0,3,"Palsson, Master. Gosta Leonard",0,2.0,3,1,349909,21.075,,S
8,9,1.0,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",1,27.0,0,2,347742,11.1333,,S
9,10,1.0,2,"Nasser, Mrs. Nicholas (Adele Achem)",1,14.0,1,0,237736,30.0708,,C
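
To see how a zero placeholder affects the imputed value, the sketch below compares the two ways of computing the mean on a small, made-up `ages` Series (the values are hypothetical). It relies on `Series.mean()` skipping NaN by default:

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, 38.0, np.nan, 26.0, np.nan])  # hypothetical sample

# Series.mean() skips NaN by default (skipna=True): (22 + 38 + 26) / 3
mean_observed = ages.mean()

# Filling with 0 first counts each missing row as age 0: (22 + 38 + 26) / 5
mean_zero_filled = ages.fillna(0).mean()

print(mean_observed, mean_zero_filled)

# Imputing directly with the mean of the recorded ages.
imputed = ages.fillna(mean_observed)
print(imputed.isna().any())  # False
```

The zero-filled mean is always lower whenever any value is missing, because the sum stays the same while the count grows.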


In [51]:
# Ticket class will be another feature. We add indicator columns for first and second class;
# third class is the baseline, where both columns are 0.

# 1 if 1st class, otherwise 0
passengers['FirstClass'] = (passengers['Pclass'] == 1).astype(int)
# 1 if 2nd class, otherwise 0
passengers['SecondClass'] = (passengers['Pclass'] == 2).astype(int)
passengers.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FirstClass,SecondClass
0,1,0.0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,0,0
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,1,0
2,3,1.0,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,0,0
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,1,0
4,5,0.0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,0,0
5,6,0.0,3,"Moran, Mr. James",0,23.877517,0,0,330877,8.4583,,Q,0,0
6,7,0.0,1,"McCarthy, Mr. Timothy J",0,54.0,0,0,17463,51.8625,E46,S,1,0
7,8,0.0,3,"Palsson, Master. Gosta Leonard",0,2.0,3,1,349909,21.075,,S,0,0
8,9,1.0,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",1,27.0,0,2,347742,11.1333,,S,0,0
9,10,1.0,2,"Nasser, Mrs. Nicholas (Adele Achem)",1,14.0,1,0,237736,30.0708,,C,0,1
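
The two indicator columns are a one-hot encoding of `Pclass` with third class left out as the baseline. As an alternative sketch, `pd.get_dummies` can build the same columns; the frame `df` and its values below are hypothetical, and the renaming step is only there to match the column names used in this notebook:

```python
import pandas as pd

df = pd.DataFrame({"Pclass": [3, 1, 3, 2]})  # hypothetical sample

# One indicator column per class: Class_1, Class_2, Class_3.
dummies = pd.get_dummies(df["Pclass"], prefix="Class", dtype=int)

# Keep first and second class; third class is the row where both are 0.
encoded = dummies[["Class_1", "Class_2"]].rename(
    columns={"Class_1": "FirstClass", "Class_2": "SecondClass"}
)
print(encoded)
```

Dropping one category avoids a redundant column, since the third indicator is fully determined by the other two.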


## Part 2: Feature Selection, Data Segregation and Normalization

In [34]:
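# Keep only rows with a known label; rows from the test file have no Survived value.
labeled = passengers.dropna(subset=['Survived'])
features = labeled[['Sex', 'Age', 'FirstClass', 'SecondClass']]
survival = labeled[['Survived']]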

In [53]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features, survival, train_size=0.8, test_size=0.2, random_state=1)


In [57]:
# sklearn's LogisticRegression applies regularization by default, so the features should be on a comparable scale.
# Create a StandardScaler, .fit_transform() it on the training features, and .transform() the test features.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_features = scaler.fit_transform(x_train)
test_features = scaler.transform(x_test)
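
As a rough illustration of what the scaler computes, the NumPy sketch below standardizes a made-up feature matrix by hand. The arrays `x_tr` and `x_te` and their values are hypothetical; the point is that the mean and standard deviation come from the training rows only and are then reused for the test rows:

```python
import numpy as np

# Hypothetical train/test feature matrices; columns are [Sex, Age].
x_tr = np.array([[0.0, 22.0], [1.0, 38.0], [1.0, 26.0], [0.0, 54.0]])
x_te = np.array([[1.0, 35.0]])

# Statistics are estimated on the training rows only.
mean = x_tr.mean(axis=0)
std = x_tr.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

tr_scaled = (x_tr - mean) / std
te_scaled = (x_te - mean) / std  # reuse the training statistics, never refit on test data

print(tr_scaled.mean(axis=0))  # approximately 0 for each column
print(tr_scaled.std(axis=0))   # approximately 1 for each column
print(te_scaled)
```

Fitting on the training rows alone keeps information about the test rows from leaking into the preprocessing step.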

  


## Part 3: Logistic Regression Model

In [61]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# The model is fit on the unscaled x_train; the scaled train_features above are not used.
# .ravel() turns the one-column label frame into the 1-D array sklearn expects.
model.fit(x_train, y_train.values.ravel())
# Scoring on the training data classifies each training passenger and returns the accuracy:
# the fraction of correct classifications.
model.score(x_train, y_train)

0.7808988764044944

In [62]:
# Scoring the model on the test data
model.score(x_test, y_test)

0.7821229050279329
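
The score is plain accuracy: the number of correct predictions divided by the number of rows. A minimal sketch with made-up labels and predictions (`y_true` and `y_pred` are hypothetical):

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])  # hypothetical actual labels
y_pred = np.array([0, 1, 0, 0, 1])  # hypothetical model predictions

# Accuracy is the share of positions where prediction and label agree.
accuracy = (y_true == y_pred).mean()
print(accuracy)  # 0.8

# The test score reported above is consistent with 140 correct out of 179 rows.
print(140 / 179)
```

Accuracy alone does not show which kind of mistake the model makes, so a confusion matrix is a common next step when the two classes are unbalanced.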

In [74]:
# Evaluate feature importance: print the coefficient learned for each feature
feature_names = ['Sex', 'Age', 'FirstClass', 'SecondClass']
coefficients = model.coef_
print(", ".join(f"{name} coefficient: {coef}" for name, coef in zip(feature_names, coefficients[0])))


Sex coefficient: 2.478758330512527, Age coefficient: -0.03226602199385562, FirstClass coefficient: 2.1994867670138807, SecondClass coefficient: 1.2418481910386083
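
One way to read these numbers: a logistic regression coefficient is a change in log-odds per unit of its feature, so exponentiating it gives an odds ratio. The sketch below applies that to the coefficients printed above. Because the model was fit on the unscaled features, the Age coefficient is per year of age while the other three are per 0/1 indicator, so their raw magnitudes are not directly comparable; the ten-year step is an illustrative choice:

```python
import math

# Coefficients copied from the output above.
coefficients = {
    "Sex": 2.478758330512527,
    "Age": -0.03226602199385562,
    "FirstClass": 2.1994867670138807,
    "SecondClass": 1.2418481910386083,
}

# exp(coefficient) is the factor by which the odds of survival are multiplied
# when the feature increases by one unit, holding the others fixed.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by {ratio:.2f}")

# Age is measured in years, so a larger step gives a more comparable effect.
ten_year_ratio = math.exp(10 * coefficients["Age"])
print(f"Ten additional years of age: odds multiplied by {ten_year_ratio:.2f}")
```

Odds ratios describe relative changes in odds, not survival probabilities; turning them into probabilities would also need the model intercept, which is not printed here.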


In [76]:
print("Sex and Ticket Class are the most important factors of survival.") 

Sex and Ticket Class are the most important factors of survival.
