# Project: Titanic Survival Prediction
In this project, we will create a Logistic Regression model that predicts which passengers survived the sinking of the Titanic, based on features like age and class.

Logistic regression is the go-to method for binary classification problems (e.g., survived vs. died, male vs. female, spam vs. not spam). It models the probability of the default class, which in this case is survival.

Logistic regression is named for the function used at the core of the method, the logistic function, also called the sigmoid function. It was developed by statisticians to describe properties of population growth in ecology, rising quickly and maxing out at the carrying capacity of the environment. It’s an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.

sigmoid(value) = 1 / (1 + e^(-value))
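
As a quick illustration of the formula above, here is a minimal sketch of the sigmoid in NumPy. The function name `sigmoid` and the sample inputs are chosen for this example only.

```python
import numpy as np

def sigmoid(value):
    """Logistic (sigmoid) function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-value))

# Large negative inputs approach 0, zero maps to 0.5, large positive inputs approach 1
for v in (-6.0, 0.0, 6.0):
    print(v, sigmoid(v))
```

The outputs get close to 0 and 1 for inputs far from zero, but for moderate inputs like these they stay strictly inside the interval.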

The data for this code is provided by Kaggle here: https://www.kaggle.com/c/titanic/data

## Part 1: Load and clean the data

In [24]:
import pandas as pd
import numpy as np

In [48]:
# load the data
passengers_tr = pd.read_csv('13 ML_LogesticRegression_KaggleChallengeTitanicSurvivalPredictionTrainDataset.csv')
passengers_te = pd.read_csv('13 ML_LogesticRegression_KaggleChallengeTitanicSurvivalPredictionTestDataset.csv')
j = [passengers_tr, passengers_te]
passengers = pd.concat(j, sort=False)
passengers.head(5)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [49]:
# Sex will be a feature. We transform it into numbers (female=1, male=0)
passengers['Sex'] = passengers['Sex'].map({'female': 1, 'male': 0})
passengers.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0.0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1.0,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S
4,5,0.0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S
5,6,0.0,3,"Moran, Mr. James",0,,0,0,330877,8.4583,,Q
6,7,0.0,1,"McCarthy, Mr. Timothy J",0,54.0,0,0,17463,51.8625,E46,S
7,8,0.0,3,"Palsson, Master. Gosta Leonard",0,2.0,3,1,349909,21.075,,S
8,9,1.0,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",1,27.0,0,2,347742,11.1333,,S
9,10,1.0,2,"Nasser, Mrs. Nicholas (Adele Achem)",1,14.0,1,0,237736,30.0708,,C


In [50]:
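# Age will be another feature. We need to replace missing values with the average age.

# Check for NaN values, then fill them with 0 as a temporary placeholder.
# Note: pandas' mean() already skips NaN, so the placeholder step is not required.
# Because the mean below is computed after zero-filling, the imputed value
# is lower than the mean of the observed ages.
passengers.Age.isna().any()
passengers.fillna({'Age':0}, inplace=True)
# Replace the placeholder zeros with the average of the column
passengers.loc[passengers.Age == 0, 'Age'] = passengers['Age'].mean()
passengers.head(10)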


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0.0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1.0,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S
4,5,0.0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S
5,6,0.0,3,"Moran, Mr. James",0,23.877517,0,0,330877,8.4583,,Q
6,7,0.0,1,"McCarthy, Mr. Timothy J",0,54.0,0,0,17463,51.8625,E46,S
7,8,0.0,3,"Palsson, Master. Gosta Leonard",0,2.0,3,1,349909,21.075,,S
8,9,1.0,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",1,27.0,0,2,347742,11.1333,,S
9,10,1.0,2,"Nasser, Mrs. Nicholas (Adele Achem)",1,14.0,1,0,237736,30.0708,,C
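
For comparison, here is a minimal sketch of mean imputation without the zero placeholder. The `ages` Series is made-up toy data, not the Titanic column. It shows that `mean()` skips NaN by default and that filling with zeros first pulls the average down.

```python
import numpy as np
import pandas as pd

# Toy ages with two missing values (hypothetical data)
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan], name="Age")

# mean() ignores NaN by default: (22 + 38 + 35) / 3
mean_age = ages.mean()

# Fill the gaps directly with that mean
filled = ages.fillna(mean_age)

# Zero-filling first dilutes the average: (22 + 38 + 0 + 35 + 0) / 5
zero_filled_mean = ages.fillna(0).mean()

print(mean_age, zero_filled_mean)
print(filled.tolist())
```

Whether the mean, the median, or something group-specific is the better fill value is a modelling choice. This sketch only illustrates the mechanics.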


In [51]:
# Ticket class will be another feature. We add one indicator column each for first and second class.
# Third class is the baseline: both indicators are 0.

# If 1st class, value will be 1, otherwise 0
passengers['FirstClass'] = (passengers['Pclass'] == 1).astype(int)
# If 2nd class, value will be 1, otherwise 0
passengers['SecondClass'] = (passengers['Pclass'] == 2).astype(int)
passengers.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FirstClass,SecondClass
0,1,0.0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,0,0
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,1,0
2,3,1.0,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,0,0
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,1,0
4,5,0.0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,0,0
5,6,0.0,3,"Moran, Mr. James",0,23.877517,0,0,330877,8.4583,,Q,0,0
6,7,0.0,1,"McCarthy, Mr. Timothy J",0,54.0,0,0,17463,51.8625,E46,S,1,0
7,8,0.0,3,"Palsson, Master. Gosta Leonard",0,2.0,3,1,349909,21.075,,S,0,0
8,9,1.0,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",1,27.0,0,2,347742,11.1333,,S,0,0
9,10,1.0,2,"Nasser, Mrs. Nicholas (Adele Achem)",1,14.0,1,0,237736,30.0708,,C,0,1
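
The same indicator columns can also be produced with `pd.get_dummies`. The sketch below uses a made-up four-row frame, and the `Pclass_1` / `Pclass_2` column names come from the `prefix` argument rather than from the notebook above.

```python
import pandas as pd

# Toy ticket classes (hypothetical data)
toy = pd.DataFrame({"Pclass": [3, 1, 3, 2]})

# One indicator column per class value: Pclass_1, Pclass_2, Pclass_3
dummies = pd.get_dummies(toy["Pclass"], prefix="Pclass", dtype=int)

# Drop third class so it acts as the baseline, matching FirstClass/SecondClass above
encoded = dummies.drop(columns="Pclass_3")
print(encoded)
```

Leaving one class out avoids a redundant column: a passenger who is in neither first nor second class must be in third.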


## Part 2: Feature Selection, Data Segregation and Normalization

Logistic regression models the default class. The default class for this problem is survival, so the model estimates the probability of survival given a passenger's sex, age, and ticket class, or more formally:

- P(survival=yes|sex, age, & Ticket class)

Written another way, we are modeling the probability that an input (X) belongs to the default class (Y=1), which we can write formally as:

- p(X) = P(Y=1|X)

In [34]:
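# Keep only rows with a known outcome; rows without a Survived label (NaN) cannot be used for fitting or scoring
labelled = passengers.dropna(subset=['Survived'])
features = labelled[['Sex', 'Age', 'FirstClass', 'SecondClass']]
survival = labelled['Survived'].astype(int)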

In [53]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features, survival, train_size=0.8, test_size=0.2, random_state=1)
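
To make the 80/20 split concrete, here is a small sketch with placeholder arrays. It assumes 891 labelled rows, the size of Kaggle's Titanic training file. The row count is not shown in the notebook, so treat it as an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels standing in for 891 labelled passengers (assumed count)
X = np.arange(891).reshape(-1, 1)
y = np.arange(891) % 2

x_tr, x_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=1
)

# Number of rows that end up in each part
print(len(x_tr), len(x_te))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in each part on every run.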


In [57]:
# sklearn's LogisticRegression applies regularization by default, so the features should be on comparable scales.
# Create a StandardScaler object, .fit_transform() it on the training features, and .transform() the test features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_features = scaler.fit_transform(x_train)
test_features = scaler.transform(x_test)
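
What the scaler does can be checked by hand on toy numbers. The sketch below uses made-up two-column data and a separate `toy_scaler` so it does not interfere with the `scaler` above. It shows that the test rows are standardized with the mean and standard deviation learned from the training rows only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and test features (hypothetical values): columns are age and sex
train = np.array([[22.0, 0.0], [38.0, 1.0], [26.0, 1.0], [35.0, 0.0]])
test = np.array([[30.0, 1.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns mean/std from train, then scales it
test_scaled = toy_scaler.transform(test)        # reuses the training mean/std

# Manual equivalent: subtract the training mean, divide by the training std
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(test_scaled, manual)
```

Fitting the scaler on the training split only keeps information about the test rows out of the preprocessing step.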

  


## Part 3: Logistic Regression Model

Logistic regression uses an equation as the representation, very much like linear regression.

Input values (x) are combined linearly using weights (coefficient values) to predict an output value (y), and the prediction is then transformed using the logistic function (i.e., logistic regression is a linear algorithm with a non-linear transform on the output). A key difference from linear regression is that the output being modeled is a binary value (0 or 1) rather than a continuous numeric value. For a single input X with intercept b0 and coefficient b1, the model can be written as:

- p(X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X))

If we apply a natural logarithm (ln):

- ln(p(X) / (1 - p(X))) = b0 + b1 * X

This is useful because the calculation on the right is linear again (just like linear regression), and the quantity on the left is the log of the odds of the default class.

p(X) / (1 - p(X)) is called the odds of the default class. Odds are calculated as the probability of the event divided by the probability of not the event.

- ln(odds) = b0 + b1 * X

Exponentiating both sides moves the exponent back to the right:

- odds = e^(b0 + b1 * X)

All of this helps us understand that indeed the model is still a linear combination of the inputs, but that this linear combination relates to the log-odds of the default class.
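
A short worked example may help connect probability, odds, and log-odds. The probability `p = 0.8` is an arbitrary value picked for illustration.

```python
import math

p = 0.8                      # probability of the default class (arbitrary example)
odds = p / (1 - p)           # about 4: the event is four times as likely as not
log_odds = math.log(odds)    # the quantity the linear part b0 + b1 * X models

# Going back: the logistic function turns log-odds into a probability again
p_back = 1 / (1 + math.exp(-log_odds))
print(odds, log_odds, p_back)
```

The round trip recovers the original probability (up to floating-point rounding), which is the relationship between the two forms of the model above.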

More on Logistic Regression: https://machinelearningmastery.com/logistic-regression-for-machine-learning/

In [61]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
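# Note: the model is fit on the unscaled x_train, not on train_features from the previous cell,
# so the coefficients reported further down are per raw unit (e.g. per year of age).
model.fit(x_train, y_train)
# Scoring the model on the training data runs the data through the model and classifies each passenger
# in the training set as survived or not. The score returned is the fraction of correct classifications, i.e. the accuracy.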
model.score(x_train, y_train)

0.7808988764044944

In [62]:
# Scoring the model on the test data
model.score(x_test, y_test)

0.7821229050279329
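
One way to make sure the same scaling is applied both when fitting and when predicting is to bundle the scaler and the model in a scikit-learn pipeline. The sketch below is not the notebook's approach. It uses synthetic data and hypothetical names (`pipe`, `sex`, `age`), and the generating coefficients are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic passengers: sex (1 = female) and age, with invented survival odds
rng = np.random.default_rng(0)
n = 200
sex = rng.integers(0, 2, size=n)
age = rng.uniform(1, 70, size=n)
X = np.column_stack([sex, age])
logits = -1.0 + 2.5 * sex - 0.03 * age
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# The pipeline scales inside fit() and again, with the same statistics, inside predict_proba()
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)

# Raw (unscaled) inputs go in; the pipeline handles the transform
proba = pipe.predict_proba(np.array([[0, 20.0], [1, 17.0]]))
print(proba)
```

With this arrangement there is no separate scaled copy of the data to keep track of, so the model cannot accidentally be trained and queried on different scales.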

The coefficients (Beta values b) of the logistic regression algorithm must be estimated from your training data. This is done using maximum-likelihood estimation.

The best coefficients would result in a model that predicts a value very close to 1 for the default class and a value very close to 0 for the other class. The intuition for maximum likelihood in logistic regression is that a search procedure seeks coefficient values (Beta values) that make the observed outcomes as probable as possible under the model, which amounts to minimizing the error between the predicted probabilities and the labels in the data.
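
The quantity being minimized can be sketched as a negative log-likelihood. The helper name `negative_log_likelihood` and the toy labels and probabilities below are made up for illustration. scikit-learn additionally adds a regularization penalty by default, which this sketch leaves out.

```python
import numpy as np

def negative_log_likelihood(y_true, p_pred):
    """Sum of -log(probability assigned to the observed label) over all rows."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return -np.sum(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

y = [1, 0, 1, 0]                      # observed outcomes (toy data)
confident = [0.9, 0.1, 0.8, 0.2]      # probabilities close to the labels
hesitant = [0.6, 0.5, 0.4, 0.7]       # probabilities far from the labels

nll_confident = negative_log_likelihood(y, confident)
nll_hesitant = negative_log_likelihood(y, hesitant)
print(nll_confident, nll_hesitant)
```

Predictions that sit closer to the observed labels give a smaller value, so a search over coefficients that lowers this number is moving toward the maximum-likelihood fit.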


In [74]:
# Evaluate feature importance: print the coefficient of each feature, in training-column order
feature_names = ['Sex', 'Age', 'FirstClass', 'SecondClass']
coefficients = model.coef_
print(", ".join(name + " coefficient: " + str(coef) for name, coef in zip(feature_names, coefficients[0])))


Sex coefficient: 2.478758330512527, Age coefficient: -0.03226602199385562, FirstClass coefficient: 2.1994867670138807, SecondClass coefficient: 1.2418481910386083
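
Because each coefficient is a change in log-odds, exponentiating it gives an odds ratio, which is often easier to read. The sketch below copies the coefficient values printed above and assumes they come from the model fit on the unscaled features, as the code shows, so "one unit" means one year of age or a 0-to-1 switch of an indicator.

```python
import math

# Coefficients copied from the output above
coefficients = {
    "Sex": 2.478758330512527,
    "Age": -0.03226602199385562,
    "FirstClass": 2.1994867670138807,
    "SecondClass": 1.2418481910386083,
}

# exp(b) is the factor by which the odds of survival change per one-unit increase,
# holding the other features fixed
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}
for name, ratio in odds_ratios.items():
    print(name, round(ratio, 3))
```

A ratio above 1 means the feature raises the odds of survival and a ratio below 1 means it lowers them. The Age ratio is per single year, so it compounds over larger age differences.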


In [76]:
print("Sex and Ticket Class are the most important factors of survival.") 

Sex and Ticket Class are the most important factors of survival.


## Part 4: Test Sample Data

In [78]:
# Sample passenger features, in the order [Sex, Age, FirstClass, SecondClass]
Jack = np.array([0.0,20.0,0.0,0.0])
Rose = np.array([1.0,17.0,1.0,0.0])
me = np.array([0.0,33.0,0.0,1.0])

# Combine passenger arrays
sample_passengers = np.array([Jack, Rose, me])

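# Scale the sample passenger features with the scaler fitted on the training data.
# Caution: the model above was fit on the unscaled x_train, so these scaled inputs are on a
# different scale from its training data. For consistent predictions, fit the model on
# train_features or pass the unscaled samples instead.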
sample_passengers = scaler.transform(sample_passengers)
print(sample_passengers)

[[-0.73334642 -0.65757092 -0.56870034 -0.51662744]
 [ 1.36361202 -0.87975987  1.7583953  -0.51662744]
 [-0.73334642  0.30524786 -0.56870034  1.93563082]]


In [79]:
# Make survival predictions!
print(model.predict(sample_passengers))

[0 1 0]


In [80]:
# The 1st column is the probability of a passenger perishing on the Titanic, 
# and the 2nd column is the probability of a passenger surviving the sinking 
print(model.predict_proba(sample_passengers))

[[0.99388262 0.00611738]
 [0.0053094  0.9946906 ]
 [0.88857022 0.11142978]]
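
The second column of `predict_proba` can be reproduced by hand from a fitted binary model's `coef_` and `intercept_`, which ties the output back to the formula in Part 3. The sketch below trains a throwaway model (`clf`) on six made-up rows, so the numbers it prints are unrelated to the Titanic results above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy features [sex, age] and outcomes (hypothetical data)
X = np.array([[0, 22.0], [1, 38.0], [1, 26.0], [0, 35.0], [0, 54.0], [1, 14.0]])
y = np.array([0, 1, 1, 0, 0, 1])

clf = LogisticRegression()
clf.fit(X, y)

# Linear part b0 + b1*x1 + b2*x2, then the logistic function
log_odds = X @ clf.coef_[0] + clf.intercept_[0]
manual = 1 / (1 + np.exp(-log_odds))

# Compare with the library's probability of class 1
library = clf.predict_proba(X)[:, 1]
print(manual)
print(library)
```

`predict` then returns class 1 wherever that probability is above 0.5, which is the same as the log-odds being positive.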
