## Predict Titanic Survival
The RMS Titanic set sail on its maiden voyage in 1912, crossing the Atlantic from Southampton, England to New York City. The ship never completed the voyage, sinking to the bottom of the Atlantic Ocean after hitting an iceberg, bringing down 1,502 of 2,224 passengers onboard.

In this project you will create a Logistic Regression model that predicts which passengers survived the sinking of the Titanic, based on features like age and class.

In [1]:
# Data manipulation tool
import pandas as pd
# Scientific computing 
import numpy as np
# Visualization
import matplotlib.pyplot as plt
# ------------------------------------ Machine Learning 
# Logistic Regression model
from sklearn.linear_model import LogisticRegression
# Split data
from sklearn.model_selection import train_test_split
# Coeff
from sklearn.preprocessing import StandardScaler

## Preprocessing the Data

## load the data

In [4]:
passengers = pd.read_csv('Titanic-Dataset.csv')
passengers

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


## Clean the Data
Given the saying, “women and children first,” Sex and Age seem like good features to predict survival. All values female are replaced with 1 and all values male are replaced with 0.

In [5]:
# Update sex column to numerical
passengers.Sex = passengers.Sex.map({'male':0,'female':1})
passengers

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",0,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",1,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",1,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",0,26.0,0,0,111369,30.0000,C148,C


In [6]:
passengers['Age']

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

Replacing multiple missing values, or nans, with the mean age.

In [7]:
passengers.Age.fillna(value=passengers.Age.mean(),inplace=True)
passengers['Age']

0      22.000000
1      38.000000
2      26.000000
3      35.000000
4      35.000000
         ...    
886    27.000000
887    19.000000
888    29.699118
889    26.000000
890    32.000000
Name: Age, Length: 891, dtype: float64

## Utilize The Feature Pclass
Given the strict class system onboard the Titanic, dividing the Pclass feature into the sub-features FirstClass and SecondClass will help modeling the prediction of who is more likely survive the sinking of the Titanic.

Create a new column named FirstClass that stores 1 for all passengers in first class and 0 for all other passengers.

In [8]:
passengers['FirstClass']  = passengers['Pclass'].apply(lambda x: 1 if x == 1 else 0) 
passengers[['Pclass', 'Name', 'FirstClass']]

Unnamed: 0,Pclass,Name,FirstClass
0,3,"Braund, Mr. Owen Harris",0
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1
2,3,"Heikkinen, Miss. Laina",0
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1
4,3,"Allen, Mr. William Henry",0
...,...,...,...
886,2,"Montvila, Rev. Juozas",0
887,1,"Graham, Miss. Margaret Edith",1
888,3,"Johnston, Miss. Catherine Helen ""Carrie""",0
889,1,"Behr, Mr. Karl Howell",1


Create a new column named SecondClass that stores 1 for all passengers in second class and 0 for all other passengers.

In [10]:
passengers['SecondClass']  = passengers['Pclass'].apply(lambda x: 1 if x == 2 else 0) 
passengers[['Pclass', 'Name', 'FirstClass', 'SecondClass']]

Unnamed: 0,Pclass,Name,FirstClass,SecondClass
0,3,"Braund, Mr. Owen Harris",0,0
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,0
2,3,"Heikkinen, Miss. Laina",0,0
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,0
4,3,"Allen, Mr. William Henry",0,0
...,...,...,...,...
886,2,"Montvila, Rev. Juozas",0,1
887,1,"Graham, Miss. Margaret Edith",1,0
888,3,"Johnston, Miss. Catherine Helen ""Carrie""",0,0
889,1,"Behr, Mr. Karl Howell",1,0


## Logistic regression
Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable.

## Select and Split the Data
Now that the data is clean, to build my model, I selected the columns Sex, Age, FirstClass, and SecondClass (independent variables) and store them in a variable named features. I selected the column Survived (binary dependent variable) and store it a variable named survival.

In [11]:
features = passengers[['Sex', 'Age', 'FirstClass', 'SecondClass']]
survival = passengers.Survived

In [12]:
# Perform train, test, split
features_train, features_test, labels_train,  labels_test = train_test_split(features,survival,test_size=0.25, random_state=42)

## Normalize the Data
Since I am using Logistic Regression models, I need to implement the Regularization method on the features, I need to scale the feature data, for that purpose, I created a StandardScaler object, .fit_transform() it on the training features, and .transform() the test features.

In [13]:
# Scale the feature data so it has mean = 0 and standard deviation = 1
scaler = StandardScaler()
norm_train_features = scaler.fit_transform(features_train)
norm_test_features = scaler.fit_transform(features_test)

## Create and Evaluate the Model

In [14]:
# Create and train the model
model = LogisticRegression()
model.fit(norm_train_features , labels_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Evaluate the Model

In [15]:
# Score the model on the train dataset
model.score(norm_train_features , labels_train)

0.7949101796407185

In [16]:
# Score the model on the test dataset
model.score(norm_test_features , labels_test)

0.7982062780269058

In [17]:
# Score the model on the test data
model.coef_

array([[ 1.21279289, -0.36626498,  0.87326543,  0.5229584 ]])

In [18]:
#To print each feature with its respectice coefficient value, you can use the following expression:

print(list(zip(['Sex','Age','FirstClass','SecondClass'],model.coef_[0])))

[('Sex', 1.2127928945459112), ('Age', -0.36626498420166315), ('FirstClass', 0.8732654291428308), ('SecondClass', 0.5229584041786292)]


## Predict with the Model
Let’s use the model to make predictions on the survival of a few fateful passengers. Provided in the code editor is information for 3rd class passenger Jack and 1st class passenger Rose, stored in NumPy arrays. The arrays store 4 feature values, in the following order:

Sex, represented by a 0 for male and 1 for female
Age, represented as an integer in years
FirstClass, with a 1 indicating the passenger is in first class
SecondClass, with a 1 indicating the passenger is in second class
The third array, John_Doe, is a random passenger array

In [19]:
# Sample passenger features
# Male, 20 years old, No-first class, No-Second class
Jack = np.array([0.0,20.0,0.0,0.0])
# Female, 17 years old, Yes-first class, No-Second class
Rose = np.array([1.0,17.0,1.0,0.0])
# Female, 49 years old, no-first class, No-Second class
John_Doe = np.array([1.0,49.0,0.0,0.0])

## Predict

In [20]:
# Combine passenger arrays
sample_passengers = np.array([Jack , Rose, John_Doe])
# Scale the sample passenger features
sample_passengers = scaler.transform(sample_passengers)
# Make survival predictions!
print(model.predict(sample_passengers))

[0 1 0]


Jack and John_Doe are predicted to Not survive

## Probability of Survival

In [21]:
# Probability 
prob = model.predict_proba(sample_passengers)
print(prob)

[[0.89430113 0.10569887]
 [0.08494566 0.91505434]
 [0.61258305 0.38741695]]


In [22]:
prob_df = pd.DataFrame({'Passenger':['Rose', 'Jack', 'John_Doe'], '% Likely To Survive':[ val[0]*100 for val in prob], '% Likely To Not-Survive':[ val[1]*100 for val in prob]})
prob_df

Unnamed: 0,Passenger,% Likely To Survive,% Likely To Not-Survive
0,Rose,89.430113,10.569887
1,Jack,8.494566,91.505434
2,John_Doe,61.258305,38.741695


Jack, a 20 years old male in 3rd class, is 91.50% likely to Not-survive the sinking of the Titanic.
John_Doe, a 49 years old female in 3rd class, is 38.74% likely to Not-survive the sinking of the Titanic.
Rose, a 17 years old female in 1st class, is 10.57% likely to Not-survive the sinking of the Titanic.