# Predict Titanic Survival

The RMS Titanic set sail on its maiden voyage in 1912, crossing the Atlantic from Southampton, England to New York City. The ship never completed the voyage, sinking to the bottom of the Atlantic Ocean after hitting an iceberg, bringing down 1,502 of 2,224 passengers onboard.

In this project you will create a Logistic Regression model that predicts which passengers survived the sinking of the Titanic, based on features like age and class.

The data we will be using for training our model is provided by Kaggle. Feel free to make the model better on your own and submit it to the Kaggle Titanic competition (https://www.kaggle.com/c/titanic)!

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## Load the Data

In [22]:
# Load the passenger data

passengers = pd.read_csv('passengers.csv')

passengers

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [5]:
passengers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


## Clean the Data

*Features that could be used to predict survival: `PClass`, `Sex`, `Age`, `Fare`, `Cabin`*

2. Given the saying, “women and children first,” `Sex` and `Age` seem like good features to predict survival. Let’s map the text values in the `Sex` column to a numerical value. Update `Sex` such that all values `female` are replaced with `1` and all values `male` are replaced with `0`.

In [23]:
# Update sex column to numerical
# passengers['Sex'] = passengers(['Sex']).map({'female':1,'male':0}) codecademy solution doesn't work
# passengers['Sex'] = np.where(passengers['Sex'] == 'female', 1,0) doesn't work

passengers['Sex'] = passengers['Sex'].apply(lambda x: 1 if x == 'female' else 0)
# passengers

In [31]:
# Fill the nan values in the age column with the mean age
# passengers['Age'].values
passengers['Age'].mean()
passengers['Age'].fillna(value = passengers['Age'].mean(), inplace=True)
passengers['Age']= round(passengers['Age'])

4. Given the strict class system onboard the Titanic, let’s utilize the `Pclass` column, or the passenger class, as another feature. Create a new column named `FirstClass` that stores `1` for all passengers in first class and `0` for all other passengers.

In [36]:
passengers['FirstClass'] = passengers['Pclass'].apply(lambda x: 1 if x == 1 else 0)

In [38]:
passengers['SecondClass'] = passengers['Pclass'].apply(lambda x: 1 if x == 2 else 0)

In [39]:
passengers

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FirstClass,SecondClass
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.2500,,S,0,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,1,0
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.9250,,S,0,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1000,C123,S,1,0
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.0500,,S,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",0,27.0,0,0,211536,13.0000,,S,0,1
887,888,1,1,"Graham, Miss. Margaret Edith",1,19.0,0,0,112053,30.0000,B42,S,1,0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",1,30.0,1,2,W./C. 6607,23.4500,,S,0,0
889,890,1,1,"Behr, Mr. Karl Howell",0,26.0,0,0,111369,30.0000,C148,C,1,0


## Select and Split the Data

6. Now that we have cleaned our data, let’s select the columns we want to build our model on. Select columns `Sex`, `Age`, `FirstClass`, and `SecondClass` and store them in a variable named `features`. Select column `Survived` and store it a variable named `survival`.

In [42]:
# Select the desired features
features = passengers[['Sex', 'Age', 'FirstClass', 'SecondClass']]
features

Unnamed: 0,Sex,Age,FirstClass,SecondClass
0,0,22.0,0,0
1,1,38.0,1,0
2,1,26.0,0,0
3,1,35.0,1,0
4,0,35.0,0,0
...,...,...,...,...
886,0,27.0,0,1
887,1,19.0,1,0
888,1,30.0,0,0
889,0,26.0,1,0


In [43]:
survival = passengers['Survived']
survival

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

7. Split the data into training and test sets using sklearn‘s `train_test_split()` method. We’ll use the training set to train the model and the test set to evaluate the model.

In [44]:
# Perform train, test, split
X_train, X_test, y_train, y_test = train_test_split(features, survival, test_size = 0.25)

## Normalize the Data

8. Since sklearn‘s Logistic Regression implementation uses Regularization, we need to scale our feature data. Create a `StandardScaler` object, `.fit_transform()` it on the training features, and `.transform()` the test features.


In [45]:
# Scale the feature data so it has mean = 0 and standard deviation = 1
scaler = StandardScaler()
scaler.fit_transform(X_train)
X_train = scaler.transform(X_train)

## Create and Evaluate the Model

9. Create a `LogisticRegression` model with sklearn and `.fit()` it on the training data.

    Fitting the model will perform gradient descent to find the feature coefficients that minimize the log-loss for the training data.


In [47]:
# Create and train the model
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()

10. `.score()` the model on the training data and print the training score.

    Scoring the model on the training data will run the data through the model and make final classifications on survival for each passenger in the training set. The score returned is the percentage of correct classifications, or the accuracy.


In [48]:
# Score the model on the train data
model.score(X_train, y_train)

0.781437125748503

In [49]:
# Score the model on the test data
model.score(X_test, y_test)

0.5919282511210763

How well did your model perform?  No idea.

12. Print the feature coefficients determined by the model. Which feature is most important in predicting survival on the sinking of the Titanic?

In [50]:
# Analyze the coefficients
print(model.coef_)

[[ 1.20415786 -0.35377803  1.10448047  0.53806481]]


In [51]:
print(list(zip(['Sex','Age','FirstClass','SecondClass'],model.coef_[0])))

[('Sex', 1.2041578555211552), ('Age', -0.35377802964096844), ('FirstClass', 1.1044804652673264), ('SecondClass', 0.5380648099884385)]


*The larger coefficients indicate a larger probability the feature will be strongly associated with survival.*

## Predict with the Model

13. Let’s use our model to make predictions on the survival of a few fateful passengers. Provided in the code editor is information for 3rd class passenger `Jack` and 1st class passenger `Rose`, stored in `NumPy` arrays. The arrays store 4 feature values, in the following order:

    -    `Sex`, represented by a `0` for male and `1` for female
    -    `Age`, represented as an integer in years
    -    `FirstClass`, with a `1` indicating the passenger is in first class
    -    `SecondClass`, with a `1` indicating the passenger is in second class

    A third array, `You`, is also provided in the code editor with empty feature values. Uncomment the line containing `You` and update the array with your information, or the information for some fictitious passenger. Make sure to enter all values as floats with a `.`!

In [52]:
# Sample passenger features
Jack = np.array([0.0,20.0,0.0,0.0])
Rose = np.array([1.0,17.0,1.0,0.0])
You = np.array([1.0,50.0,0.0,1.0])

In [56]:
# Combine passenger arrays
sample_passengers = np.array([Jack, Rose, You])
# sample_passengers

15. Since our Logistic Regression model was trained on scaled feature data, we must also scale the feature data we are making predictions on. Using the `StandardScaler` object created earlier, apply its `.transform()` method to `sample_passengers` and save the result to `sample_passengers`.

    Print `sample_passengers` to view the scaled features.


In [58]:
sample_passengers = scaler.transform(sample_passengers)
sample_passengers

array([[-2.27942401, -2.36657618, -1.89090279, -1.72854789],
       [ 2.10585436, -2.38472591,  3.530029  , -1.72854789],
       [ 2.10585436, -2.18507886, -1.89090279,  4.68907878]])

16. Who will survive, and who will sink? Use your model’s `.predict()` method on `sample_passengers` and print the result to find out.

    Want to see the probabilities that led to these predictions? Call your model’s `.predict_proba()` method on `sample_passengers` and print the result. The 1st column is the probability of a passenger perishing on the Titanic, and the 2nd column is the probability of a passenger surviving the sinking (which was calculated by our model to make the final classification decision).


In [60]:
# Make survival predictions!
model.predict(sample_passengers)

array([0, 1, 1], dtype=int64)

In [61]:
model.predict_proba(sample_passengers)

array([[0.99646664, 0.00353336],
       [0.00356744, 0.99643256],
       [0.04619932, 0.95380068]])

*Jack: perished, Rose: survived, You: survived.*