<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-the-Titanic-sampled-dataset" data-toc-modified-id="Import-the-Titanic-sampled-dataset-1">Import the Titanic sampled dataset</a></span><ul class="toc-item"><li><span><a href="#Data-Dictionary" data-toc-modified-id="Data-Dictionary-1.1">Data Dictionary</a></span></li><li><span><a href="#The-functional-form" data-toc-modified-id="The-functional-form-1.2">The functional form</a></span></li></ul></li><li><span><a href="#Cleaning-/-recoding-the-data" data-toc-modified-id="Cleaning-/-recoding-the-data-2">Cleaning / recoding the data</a></span></li><li><span><a href="#Keeping-it-oversimplyfied" data-toc-modified-id="Keeping-it-oversimplyfied-3">Keeping it oversimplified</a></span></li></ul></div>

In [2]:
# in case I want to move data between R and Python
%load_ext rpy2.ipython

In [1]:
# Import required modules
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import pickle

# display the result of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Import the Titanic sampled dataset

In [3]:
tc = pd.read_csv('titanic.csv')

### Data Dictionary


| Variable | Definition              | Key                                            |
| -------- | ----------------------- | ---------------------------------------------- |
| survival | Survival                | 0 = No, 1 = Yes                                |
| pclass   | Ticket class            | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex      | Sex                     |                                                |
| age      | Age in years            |                                                |
| sibsp    | # of siblings / spouses |                                                |
| parch    | # of parents / children |                                                |
| ticket   | Ticket number           |                                                |
| fare     | Passenger fare          |                                                |
| cabin    | Cabin number            |                                                |
| embarked | Port of Embarkation     | C = Cherbourg, Q = Queenstown, S = Southampton |

**Variable Notes**

**_pclass_**: A proxy for socio-economic status (SES)<br>
* 1st = Upper
* 2nd = Middle
* 3rd = Lower

**_age_**: Age is fractional if less than 1.

**_sibsp_**: The dataset defines family relations in this way<br>
* Sibling = brother, sister, stepbrother, stepsister
* Spouse = husband, wife (mistresses and fiancés were ignored)

**_parch_**: The dataset defines family relations in this way<br>
* Parent = mother, father
* Child = daughter, son, stepdaughter, stepson
* Some children travelled only with a nanny, therefore parch=0 for them.

### The functional form

* The logistic function can be written as:

$ F(x) = \cfrac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} $
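
As a minimal sketch, the formula above can be evaluated directly in Python. Here `beta_0` is the intercept and `beta_1` the coefficient of the single predictor `x`. The numeric values used below are made up for illustration and are not estimates from the Titanic data.

```python
import math

def logistic(x, beta_0, beta_1):
    """Evaluate F(x) = 1 / (1 + exp(-(beta_0 + beta_1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(beta_0 + beta_1 * x)))

# Hypothetical coefficients, chosen only to show the shape of the curve
beta_0, beta_1 = 0.0, 1.0

at_zero = logistic(0.0, beta_0, beta_1)      # linear term is 0
far_right = logistic(10.0, beta_0, beta_1)   # large positive linear term
far_left = logistic(-10.0, beta_0, beta_1)   # large negative linear term
print(at_zero, far_right, far_left)
```

The output is always strictly between 0 and 1, which is why it can be read as a probability. With the usual 0.5 cut-off, an observation is assigned to class 1 whenever the linear term is positive.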

In [4]:
# quick inspection
tc.sample(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
91,92,0,3,"Andreasson, Mr. Paul Edvin",male,20.0,0,0,347466,7.8542,,S
418,419,0,2,"Matthews, Mr. William John",male,30.0,0,0,28228,13.0,,S
268,269,1,1,"Graham, Mrs. William Thompson (Edith Junkins)",female,58.0,0,1,PC 17582,153.4625,C125,S
668,669,0,3,"Cook, Mr. Jacob",male,43.0,0,0,A/5 3536,8.05,,S
44,45,1,3,"Devaney, Miss. Margaret Delia",female,19.0,0,0,330958,7.8792,,Q


In [5]:
# share of passengers who died (0) and survived (1)
tc['Survived'].value_counts(normalize=True)

0    0.616162
1    0.383838
Name: Survived, dtype: float64

In [6]:
# column types
tc.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

## Cleaning / recoding the data

In [7]:
# let's not use the passenger names
del tc['Name']

In [8]:
# which character do tickets start with?
tc['Ticket'].str[0].unique()

array(['A', 'P', 'S', '1', '3', '2', 'C', '7', 'W', '4', 'F', 'L', '9',
       '6', '5', '8'], dtype=object)

In [9]:
# keep only the first character of the ticket; it might carry predictive information
tc['Ticket'] = 'Ticket_starts_with_' + tc['Ticket'].str[0]

In [10]:
# Same for cabin; missing cabins stay NaN
tc['Cabin'] = tc['Cabin'].apply(lambda x: 'Cabin_starts_with_' + x[0] if isinstance(x, str) else np.nan)

In [11]:
# recode sex as a binary indicator (female = 1, male = 0)
tc['Sex'] = tc['Sex'].map({'female': 1, 'male': 0})

In [12]:
# prefix the port of embarkation so the dummy columns get readable names
tc['Embarked'] = 'Embarked_from_' + tc['Embarked']

In [13]:
# inspect
tc.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,0,22.0,1,0,Ticket_starts_with_A,7.25,,Embarked_from_S
1,2,1,1,1,38.0,1,0,Ticket_starts_with_P,71.2833,Cabin_starts_with_C,Embarked_from_C
2,3,1,3,1,26.0,0,0,Ticket_starts_with_S,7.925,,Embarked_from_S
3,4,1,1,1,35.0,1,0,Ticket_starts_with_1,53.1,Cabin_starts_with_C,Embarked_from_S
4,5,0,3,0,35.0,0,0,Ticket_starts_with_3,8.05,,Embarked_from_S


In [14]:
# Convert categorical columns to binary indicator columns
tc = pd.concat([tc, pd.get_dummies(tc.Ticket), pd.get_dummies(tc.Cabin), pd.get_dummies(tc.Embarked)], axis=1)

# and drop the original categorical columns
del tc['Ticket']
del tc['Cabin']
del tc['Embarked']

# inspect
tc.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Ticket_starts_with_1,Ticket_starts_with_2,...,Cabin_starts_with_B,Cabin_starts_with_C,Cabin_starts_with_D,Cabin_starts_with_E,Cabin_starts_with_F,Cabin_starts_with_G,Cabin_starts_with_T,Embarked_from_C,Embarked_from_Q,Embarked_from_S
0,1,0,3,0,22.0,1,0,7.25,0,0,...,0,0,0,0,0,0,0,0,0,1
1,2,1,1,1,38.0,1,0,71.2833,0,0,...,0,1,0,0,0,0,0,1,0,0
2,3,1,3,1,26.0,0,0,7.925,0,0,...,0,0,0,0,0,0,0,0,0,1
3,4,1,1,1,35.0,1,0,53.1,1,0,...,0,1,0,0,0,0,0,0,0,1
4,5,0,3,0,35.0,0,0,8.05,0,0,...,0,0,0,0,0,0,0,0,0,1
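
The following is a small sketch of what `pd.get_dummies` does to a single categorical column. The toy values are invented for illustration and are not read from `titanic.csv`. It also shows why passengers without a recorded cabin end up with zeros in every cabin indicator column.

```python
import numpy as np
import pandas as pd

# Toy column (not taken from titanic.csv) with one missing value
cabin = pd.Series(['Cabin_C', np.nan, 'Cabin_E', 'Cabin_C'])

# One indicator column per distinct non-missing value; dtype=int gives 0/1
dummies = pd.get_dummies(cabin, dtype=int)
print(dummies)

# The missing entry (row 1) is encoded as a row of zeros
print(dummies.loc[1].sum())
```

By default `get_dummies` creates no column for missing values, so "cabin unknown" is represented implicitly by the all-zero row.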


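## Keeping it oversimplified

None of the recoded columns above are used in the model below. To keep the example simple, only ticket class, sex and age are kept as predictors.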

In [15]:
# Keep only the outcome, the ticket class, sex and age
# (.copy() so that later assignments modify this frame, not a view of the old one)
tc = tc[['Survived', 'Pclass', 'Sex', 'Age']].copy()

In [16]:
# inspect
tc.head()

Unnamed: 0,Survived,Pclass,Sex,Age
0,0,3,0,22.0
1,1,1,1,38.0
2,1,3,1,26.0
3,1,1,1,35.0
4,0,3,0,35.0


In [17]:
# do we have missing info?
tc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 4 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null int64
Age         714 non-null float64
dtypes: float64(1), int64(3)
memory usage: 27.9 KB


In [18]:
# mean imputation: 177 of the 891 ages are missing
tc['Age'] = tc['Age'].fillna(tc['Age'].mean())
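
To make the imputation step concrete, here is a minimal sketch on a toy series. The three ages are invented for illustration and are not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy ages (not taken from titanic.csv); one value is missing
age = pd.Series([22.0, np.nan, 38.0])

# Series.mean() skips NaN by default, so the mean is (22 + 38) / 2
age_filled = age.fillna(age.mean())
print(age_filled.tolist())
```

Mean imputation keeps the column mean unchanged but shrinks its variance, so it is a convenient shortcut rather than a careful treatment of missing data.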

In [19]:
# instantiate the model (using the default parameters)
logreg = LogisticRegression()
X = tc[['Pclass', 'Sex', 'Age']]
y = tc['Survived']
# fit the model with data
logreg.fit(X, y)
# mean accuracy on the training data (there is no held-out test set here)
logreg.score(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

0.79797979797979801

In [20]:
# let's see some outcomes based on input features
# (each position across the three lists describes one passenger)
v = {'class': [3,     1,    3,    1],
     'sex':   [0,     0,    1,    1],
     'age':   [22.0, 22.0, 60.0, 60.0]}

# transpose so that each row is one passenger: [class, sex, age]
result = logreg.predict(np.array(list(v.values())).T)


['Died' if outcome == 0 else 'Survived' for outcome in result]

['Died', 'Survived', 'Died', 'Survived']

In [21]:
# confusion matrix: counts first, then shares of all passengers
CM = pd.crosstab(y, logreg.predict(X), margins=True)
CM.index = ['Actual_Died', 'Actual_Survived', 'All']
CM.columns = ['Predicted_Died', 'Predicted_Survived', 'All']
CM
CM / CM.loc['All', 'All']

Unnamed: 0,Predicted_Died,Predicted_Survived,All
Actual_Died,472,77,549
Actual_Survived,103,239,342
All,575,316,891


Unnamed: 0,Predicted_Died,Predicted_Survived,All
Actual_Died,0.529742,0.08642,0.616162
Actual_Survived,0.1156,0.268238,0.383838
All,0.645342,0.354658,1.0
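
As a quick check, the usual summary metrics can be recomputed by hand from the four counts in the confusion matrix above. This is a plain-Python sketch; the variable names are chosen here for illustration. The accuracy it yields should agree with the value returned by `logreg.score` earlier.

```python
# Counts copied from the confusion matrix above
true_died, false_survived = 472, 77     # passengers who actually died
false_died, true_survived = 103, 239    # passengers who actually survived
total = true_died + false_survived + false_died + true_survived

# Share of all passengers classified correctly
accuracy = (true_died + true_survived) / total
# Share of actual survivors the model identifies (recall for class 1)
recall_survived = true_survived / (false_died + true_survived)
# Share of predicted survivors who actually survived (precision for class 1)
precision_survived = true_survived / (false_survived + true_survived)

print(round(accuracy, 4), round(recall_survived, 4), round(precision_survived, 4))
```

The model is noticeably better at recognising passengers who died than passengers who survived, which the single accuracy figure hides.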


In [22]:
# class 3, male and 25 years old
v = {'class': 3, 'sex': 0, 'age': 25}
# predict() returns an array, so take its single element
prediction = int(logreg.predict([list(v.values())])[0])
prediction
print('She' if v['sex'] == 1 else 'He', 'Died' if prediction == 0 else 'Survived')

0

He Died


In [23]:
# save the model to disk (the with-block closes the file afterwards)
with open('logreg.logisticRegression', 'wb') as f:
    pickle.dump(logreg, f)

In [24]:
# to reload the model later:
# del logreg
# with open('logreg.logisticRegression', 'rb') as f:
#     logreg = pickle.load(f)
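
The save and reload steps can be tried end to end with the sketch below. It uses a plain dictionary as a hypothetical stand-in for the fitted model and a made-up file name inside a temporary directory, so it runs without the dataset and leaves no files behind.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model; the numbers are invented
model = {'coef': [-1.0, 2.5, -0.03], 'intercept': 1.2}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pkl')  # hypothetical file name
    # write the object to disk
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    # read it back into a new object
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == model)
```

The restored object is an equal copy rather than the same object. Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.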