# Predict Titanic Survival
The RMS Titanic set sail on its maiden voyage in 1912, crossing the Atlantic from Southampton, England to New York City. The ship never completed the voyage: it sank to the bottom of the Atlantic Ocean after hitting an iceberg, killing 1,502 of the 2,224 passengers and crew on board.

In this project we will create a __Logistic Regression model__ that predicts which passengers survived the sinking of the Titanic, based on features like age and class.

The data we will be using for training our model is provided by __Kaggle.__ 

In [1]:
# important imports
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

### Load the Data
The file passengers.csv contains data for 891 passengers who were on board the Titanic when it sank that fateful day. Let's begin by loading the data into a pandas DataFrame named df.

Print the first rows of df and inspect the columns.

What features could we use to predict survival?

In [2]:
df = pd.read_csv('passengers.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
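
Before choosing features, it can help to count the missing values in each column. The snippet below is a minimal sketch on a small made-up DataFrame (the name `toy` and its values are assumptions, not rows from passengers.csv); the same `.isna().sum()` call should work on df.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df with a few missing entries.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, np.nan, 26.0],
    'Cabin': [None, 'C85', None],
})

# Count the missing values per column; columns with many gaps need
# cleaning (or dropping) before they can be used as features.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```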

### Clean the Data
Given the saying, "women and children first," Sex and Age seem like good features to predict survival. 

Let's map the text values in the Sex column to a numerical value. 

Update Sex such that all values female are replaced with 1 and all values male are replaced with 0.

In [4]:
df['Sex'] = df['Sex'].map({'male':0, 'female':1})

In [5]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S
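
One thing to watch with `.map()` and a dictionary: any value that is not a key in the dictionary silently becomes NaN. The sketch below uses a made-up Series (not the real Sex column) with a deliberately mis-capitalized entry to show how to check for this.

```python
import pandas as pd

# Toy column standing in for df['Sex']; 'Female' is a deliberate typo.
sex = pd.Series(['male', 'female', 'Female', 'male'])

encoded = sex.map({'male': 0, 'female': 1})

# Labels missing from the dictionary become NaN instead of raising an error,
# so counting NaNs after mapping is a cheap sanity check.
n_unmapped = int(encoded.isna().sum())
print(n_unmapped)
```

If the count is zero on the real column, every value was mapped as intended.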


Let's take a look at Age. Print df['Age']. We can see that there are multiple missing values, shown as NaN.

Fill all the missing Age values in df with the mean age, rounded to the nearest year.

In [6]:
df['Age']

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
5       NaN
6      54.0
7       2.0
8      27.0
9      14.0
10      4.0
11     58.0
12     20.0
13     39.0
14     14.0
15     55.0
16      2.0
17      NaN
18     31.0
19      NaN
20     35.0
21     34.0
22     15.0
23     28.0
24      8.0
25     38.0
26      NaN
27     19.0
28      NaN
29      NaN
       ... 
861    21.0
862    48.0
863     NaN
864    24.0
865    42.0
866    27.0
867    31.0
868     NaN
869     4.0
870    26.0
871    47.0
872    33.0
873    47.0
874    28.0
875    15.0
876    20.0
877    19.0
878     NaN
879    56.0
880    25.0
881    33.0
882    22.0
883    28.0
884    25.0
885    39.0
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [7]:
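# Assign the result back; calling fillna(inplace=True) on a column selection is unreliable in recent pandas
df['Age'] = df['Age'].fillna(round(df['Age'].mean()))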

In [8]:
df.Age


0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
5      30.0
6      54.0
7       2.0
8      27.0
9      14.0
10      4.0
11     58.0
12     20.0
13     39.0
14     14.0
15     55.0
16      2.0
17     30.0
18     31.0
19     30.0
20     35.0
21     34.0
22     15.0
23     28.0
24      8.0
25     38.0
26     30.0
27     19.0
28     30.0
29     30.0
       ... 
861    21.0
862    48.0
863    30.0
864    24.0
865    42.0
866    27.0
867    31.0
868    30.0
869     4.0
870    26.0
871    47.0
872    33.0
873    47.0
874    28.0
875    15.0
876    20.0
877    19.0
878    30.0
879    56.0
880    25.0
881    33.0
882    22.0
883    28.0
884    25.0
885    39.0
886    27.0
887    19.0
888    30.0
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64
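
To see exactly what the mean imputation does, here is a small sketch on a made-up Series (the ages are assumptions chosen so the arithmetic is easy to follow). `mean()` skips NaN values, so the fill value is computed only from the known ages.

```python
import numpy as np
import pandas as pd

# Toy ages with two gaps; the known values are 22, 26 and 36.
ages = pd.Series([22.0, np.nan, 26.0, np.nan, 36.0])

# mean() ignores NaN: (22 + 26 + 36) / 3 = 28
fill_value = round(ages.mean())

# fillna returns a new Series; the original is left unchanged.
filled = ages.fillna(fill_value)
print(filled.tolist())
```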


Given the strict class system onboard the Titanic, let's utilize the Pclass column, or the passenger class, as another feature. 

Create a new column named __FirstClass__ that stores 1 for all passengers in first class and 0 for all other passengers.

In [9]:
df['FirstClass'] = df['Pclass'].apply(lambda x:1 if x==1 else 0)

In [10]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FirstClass
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,0


Create a new column named __SecondClass__ that stores 1 for all passengers in second class and 0 for all other passengers.

Print df and inspect the DataFrame to ensure all the updates have been made.

In [11]:
df['SecondClass'] = df['Pclass'].apply(lambda x:1 if x==2 else 0)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FirstClass,SecondClass
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,0,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,1,0
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,0,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,1,0
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,0,0
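
Note that we only need two indicator columns for three classes: a passenger with 0 in both FirstClass and SecondClass must be in third class, which acts as the baseline. The sketch below shows this on a made-up Pclass Series, using a vectorized comparison that should be equivalent to the `apply` calls above.

```python
import pandas as pd

# Toy stand-in for df['Pclass'].
pclass = pd.Series([3, 1, 2, 1, 3], name='Pclass')

# Comparing the whole column at once gives booleans; astype(int) turns them into 0/1.
first_class = (pclass == 1).astype(int)
second_class = (pclass == 2).astype(int)

# Third class is implied: both indicators are 0.
third_class_rows = (first_class + second_class) == 0
print(third_class_rows.tolist())
```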



### Select and Split the Data

Now that we have cleaned our data, let's select the columns we want to build our model on. Select columns __Sex, Age, FirstClass, and SecondClass__ and store them in a variable named __features.__

Select column __Survived__ and store it in a variable named __survival.__

In [12]:
features = df[['Sex', 'Age', 'FirstClass', 'SecondClass']]
survival = df.Survived

Split the data into training and test sets using sklearn's train_test_split() method. We'll use the training set to train the model and the test set to evaluate the model.

In [13]:
train_features, test_features, train_labels, test_labels = train_test_split(features, survival, 
                                                                            test_size=0.2, random_state=100)
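
To get a feel for how many rows land in each set, the sketch below runs the same split on placeholder arrays with 891 rows (the arrays `X` and `y` are assumptions, not the real features). With test_size=0.2, sklearn rounds the test set size up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the passenger table.
X = np.arange(891 * 4).reshape(891, 4)
y = np.arange(891) % 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100
)

# 20% of 891 is 178.2, which is rounded up to 179 test rows.
print(X_train.shape, X_test.shape)
```

Fixing random_state makes the split reproducible, so the scores later in the notebook do not change between runs.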

### Normalize the Data
Since sklearn's LogisticRegression applies L2 regularization by default, features on larger scales would be penalized unevenly, so we need to scale our feature data.

Create a __StandardScaler__ object, __.fit_transform()__ it on the training features, and __.transform()__ the test features.

In [14]:
scaler = StandardScaler()
train_features = scaler.fit_transform(train_features)
test_features = scaler.transform(test_features)
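
Under the hood, StandardScaler subtracts each column's mean and divides by its standard deviation, both learned from the training data only. The sketch below checks this by hand on made-up arrays (the names `toy_scaler`, `train` and `test` are assumptions for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and test features: a 0/1 column and an age-like column.
train = np.array([[0.0, 20.0], [1.0, 30.0], [1.0, 40.0], [0.0, 50.0]])
test = np.array([[1.0, 35.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns mean and std from train
test_scaled = toy_scaler.transform(test)        # reuses the training statistics

# Manual version: (x - training mean) / training std (population std, ddof=0).
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, manual))
```

Fitting the scaler on the training set alone keeps information about the test set from leaking into the model.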



### Create and Evaluate the Model
Create a __LogisticRegression__ model with sklearn and __.fit()__ it on the training data.

Fitting the model runs an iterative solver to find the feature coefficients that minimize the regularized log-loss on the training data.

In [15]:
model = LogisticRegression()
model.fit(train_features, train_labels)




LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
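
For reference, the log-loss the model tries to make small can be written out in a few lines. This is a sketch with made-up data and hypothetical weights `w` and `b` (not the values our model learned); it computes the unregularized log-loss by hand and compares it to sklearn's `log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

def sigmoid(z):
    """Squash a real-valued score into a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: one scaled feature, three passengers.
X = np.array([[-1.0], [0.5], [2.0]])
y = np.array([0, 1, 1])

# Hypothetical coefficient and intercept, chosen only for illustration.
w, b = np.array([1.2]), -0.1

# Predicted probability of the positive class for each row.
p = sigmoid(X @ w + b)

# Average negative log-likelihood of the true labels.
manual_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(np.isclose(manual_loss, log_loss(y, p)))
```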

__.score()__ the model on the training data and print the training score.

Scoring the model on the training data will run the data through the model and make final classifications on survival for each passenger in the training set. 

The score returned is the fraction of correct classifications, or the accuracy.

In [16]:
model.score(train_features, train_labels)

0.7963483146067416
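
Accuracy is simple enough to compute by hand: compare the predictions to the true labels and take the mean. The sketch below does this for a small made-up dataset (`clf`, `X` and `y` are assumptions) and checks that it matches `.score()`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature dataset that is not perfectly separable.
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 1, 0, 1, 1])

clf = LogisticRegression().fit(X, y)

# Fraction of rows where the predicted class equals the true class.
manual_accuracy = np.mean(clf.predict(X) == y)
print(np.isclose(manual_accuracy, clf.score(X, y)))
```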

__.score()__ the model on the test data and print the test score.

Similarly, scoring the model on the testing data will run the data through the model and make final classifications on survival for each passenger in the test set.

How well did your model perform?

In [17]:
model.score(test_features, test_labels)

0.7932960893854749

Print the feature coefficients determined by the model. Which feature is most important in predicting survival on the sinking of the Titanic?

In [18]:
model.coef_

array([[ 1.23436123, -0.43387324,  1.00295156,  0.5353255 ]])

In [19]:
list(zip(['Sex','Age','FirstClass','SecondClass'],model.coef_[0]))

[('Sex', 1.2343612340850254),
 ('Age', -0.4338732418449023),
 ('FirstClass', 1.0029515645670446),
 ('SecondClass', 0.5353254952282516)]
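
Because the features were standardized, each coefficient is the change in the log-odds of survival for a one standard deviation increase in that feature. One way to read them, sketched below with the coefficients printed above, is to rank them by absolute value and exponentiate them into odds ratios.

```python
import numpy as np

# Coefficients copied from the model output above.
coefs = {
    'Sex': 1.2343612340850254,
    'Age': -0.4338732418449023,
    'FirstClass': 1.0029515645670446,
    'SecondClass': 0.5353254952282516,
}

# Rank features by the size of their effect, ignoring the sign.
ranked = sorted(coefs, key=lambda name: abs(coefs[name]), reverse=True)

# exp(coefficient) is the factor by which the odds of survival are multiplied
# per one standard deviation increase in the (scaled) feature.
odds_ratios = {name: float(np.exp(value)) for name, value in coefs.items()}

print(ranked)
print(odds_ratios)
```

An odds ratio above 1 means the feature pushes towards survival, and one below 1 pushes against it.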

### Predict with the Model
Let's use our model to make predictions on the survival of a few fateful passengers. Below is information for 3rd class passenger Jack and 1st class passenger Rose, stored in NumPy arrays. The arrays store 4 feature values, in the following order:

_Sex, represented by a 0 for male and 1 for female_

_Age, represented as an integer in years_

_FirstClass, with a 1 indicating the passenger is in first class_

_SecondClass, with a 1 indicating the passenger is in second class_

A third array, me, is also provided. Update the array me with your own information, or with that of some fictitious passenger. Make sure to enter all values as floats, with a decimal point!

In [20]:
Jack = np.array([0.0,20.0,0.0,0.0])
Rose = np.array([1.0,17.0,1.0,0.0])
me = np.array([0.0,45.0,0.0,1.0])

Combine Jack, Rose, and me into a single NumPy array named sample_passengers.

In [21]:
sample_passengers = np.array([Jack, Rose, me])

Since our Logistic Regression model was trained on scaled feature data, we must also scale the feature data we are making predictions on. 

Using the StandardScaler object created earlier, apply its .transform() method to sample_passengers and save the result to sample_passengers.

Print sample_passengers to view the scaled features.

In [22]:
sample_passengers = scaler.transform(sample_passengers)

In [23]:
sample_passengers

array([[-0.7243102 , -0.7769537 , -0.58383755, -0.5078883 ],
       [ 1.38062393, -1.00702656,  1.7128052 , -0.5078883 ],
       [-0.7243102 ,  1.14032018, -0.58383755,  1.96893685]])

Who will survive, and who will sink? Use your model's .predict() method on sample_passengers and print the result to find out.

Want to see the probabilities that led to these predictions? Call your model's .predict_proba() method on sample_passengers and print the result. 

The 1st column is the probability of a passenger perishing on the Titanic, and the 2nd column is the probability of a passenger surviving the sinking (which was calculated by our model to make the final classification decision).

In [24]:
model.predict(sample_passengers)

array([0, 1, 0])
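
To see how `.predict_proba()` relates to `.predict()`, the sketch below fits a model on a small made-up dataset (`clf`, `X` and `y` are assumptions, not our passenger data). For a binary model like this one, the second column should equal the sigmoid of the linear score, and the predicted class is 1 whenever that probability is above 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy two-feature dataset.
X = np.array([[-2.0, 0.0], [-1.0, 1.0], [-0.5, -1.0],
              [0.5, 1.0], [1.0, -1.0], [2.0, 0.0]])
y = np.array([0, 0, 1, 0, 1, 1])

clf = LogisticRegression().fit(X, y)

# Column 0: probability of class 0, column 1: probability of class 1.
proba = clf.predict_proba(X)

# Rebuild the class-1 probability from the coefficients and intercept.
z = X @ clf.coef_[0] + clf.intercept_[0]
p_class_1 = 1.0 / (1.0 + np.exp(-z))

print(np.allclose(proba[:, 1], p_class_1))
```

On the notebook's own data, the equivalent call would be model.predict_proba(sample_passengers).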

### Conclusion:
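From the model's predictions, Rose fares better than Jack: the model predicts that she survives and that he does not. Sex has the largest coefficient, so it influenced the predictions the most, followed by FirstClass.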