## Imports

In [9]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

In [2]:
titanic_data = pd.read_csv("titanic.csv")

In [3]:
# building out our dummy columns to assist in logistic regression
titanic_data = pd.get_dummies(titanic_data, columns=['Embarked', 'Sex', 'Pclass'], drop_first=True)
titanic_data = titanic_data.fillna({"Age":titanic_data.Age.mean(), "Fare":titanic_data.Fare.mean()})
titanic_data = sm.add_constant(titanic_data)

# get our target
y = titanic_data.Survived

titanic_data.sample(2)

Unnamed: 0,const,Survived,Name,Age,Sibsp,Parch,Ticket,Fare,Cabin,Boat,Body,home.dest,Embarked_Q,Embarked_S,Sex_male,Pclass_2,Pclass_3
1260,1.0,1,"Turja, Miss. Anna Sofia",18.0,0,0,4138,9.8417,,15.0,,,0,1,0,0,1
447,1.0,0,"Hocking, Mr. Richard George",23.0,2,1,29104,11.5,,,,"Cornwall / Akron, OH",0,1,1,1,0


## a)

In [4]:
model = sm.Logit(y, titanic_data[['const', 'Embarked_Q', 'Embarked_S']])
results = model.fit()
print(results.summary())

Optimization terminated successfully.
         Current function value: 0.647951
         Iterations 4
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                 1309
Model:                          Logit   Df Residuals:                     1306
Method:                           MLE   Df Model:                            2
Date:                Sat, 13 Jan 2018   Pseudo R-squ.:                 0.02567
Time:                        10:52:08   Log-Likelihood:                -848.17
converged:                       True   LL-Null:                       -870.51
                                        LLR p-value:                 1.977e-10
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.2364      0.122      1.936      0.053      -0.003       0.476
Embarked_Q    -0.8216      0.

Notice that both Embarked_Q and Embarked_S are statistically significant.
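Since the coefficients are on the log-odds scale, it can help to turn them into probabilities. The sketch below uses only the two coefficients that are visible in the summary above (`const` = 0.2364 and `Embarked_Q` = -0.8216). The helper name `sigmoid` is ours for illustration, and we assume the reference group is whatever `drop_first=True` left out of the `Embarked` dummies.

```python
import math

def sigmoid(log_odds):
    """Convert a log-odds value into a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Coefficients copied from the summary above
const = 0.2364
embarked_q = -0.8216

# Reference group: both embarkation dummies equal to 0
p_reference = sigmoid(const)

# Passengers with Embarked_Q = 1
p_embarked_q = sigmoid(const + embarked_q)

# exp(coef) is the odds ratio relative to the reference group
odds_ratio_q = math.exp(embarked_q)

print(f"reference group: {p_reference:.3f}")
print(f"Embarked_Q = 1:  {p_embarked_q:.3f}")
print(f"odds ratio for Q: {odds_ratio_q:.3f}")
```

An odds ratio below 1 means lower odds of survival than the reference group, which is what the negative coefficient on Embarked_Q is telling us.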

## b)

In [5]:
model = sm.Logit(y, titanic_data[['const', 'Embarked_Q', 'Embarked_S', 'Age', 'Fare', 'Sex_male', 'Pclass_2', 'Pclass_3']])
results = model.fit()
print(results.summary())

Optimization terminated successfully.
         Current function value: 0.463653
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                 1309
Model:                          Logit   Df Residuals:                     1301
Method:                           MLE   Df Model:                            7
Date:                Sat, 13 Jan 2018   Pseudo R-squ.:                  0.3028
Time:                        10:52:08   Log-Likelihood:                -606.92
converged:                       True   LL-Null:                       -870.51
                                        LLR p-value:                1.145e-109
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.7792      0.372     10.165      0.000       3.050       4.508
Embarked_Q    -0.4474      0.

In this case we see that Fare and Embarked_Q are not statistically significant, while all of the other predictors are. This suggests that the city of departure matters far less once the other variables are included. It is also interesting, though not surprising, that being male seems to drastically decrease survival rates.
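To put the improvement over the embarkation-only model from a) into numbers, the `Pseudo R-squ.` values in the two summaries can be reproduced from the printed log-likelihoods. This is a small sketch using McFadden's formula, which is what statsmodels reports for `Logit`; the function name `mcfadden_r2` is our own, and the inputs are the rounded values shown in the output, so the last digit may differ slightly.

```python
def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo R-squared: 1 - LL_model / LL_null."""
    return 1.0 - ll_model / ll_null

# Log-likelihoods copied from the two summaries (rounded as printed)
ll_null = -870.51      # LL-Null, identical in both summaries
ll_embarked = -848.17  # model a): embarkation dummies only
ll_full = -606.92      # model b): all predictors

r2_embarked = mcfadden_r2(ll_embarked, ll_null)
r2_full = mcfadden_r2(ll_full, ll_null)

print(f"a) pseudo R-squared: {r2_embarked:.4f}")
print(f"b) pseudo R-squared: {r2_full:.4f}")
```

The jump between the two values shows how little of the fit comes from the embarkation dummies alone.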

## Example of prediction

In [6]:
titanic_data.iloc[15]

const                           1
Survived                        0
Name          Baumann, Mr. John D
Age                       29.8811
Sibsp                           0
Parch                           0
Ticket                   PC 17318
Fare                       25.925
Cabin                         NaN
Boat                          NaN
Body                          NaN
home.dest            New York, NY
Embarked_Q                      0
Embarked_S                      1
Sex_male                        1
Pclass_2                        0
Pclass_3                        0
Name: 15, dtype: object

In [7]:
preds = results.predict(titanic_data.iloc[15][['const', 'Embarked_Q', 'Embarked_S', 'Age', 'Fare', 'Sex_male', 'Pclass_2', 'Pclass_3']].astype(float))

In [8]:
preds

const         0.405815
Embarked_Q    0.405815
Embarked_S    0.405815
Age           0.405815
Fare          0.405815
Sex_male      0.405815
Pclass_2      0.405815
Pclass_3      0.405815
dtype: float64

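The predicted survival probability is about 0.41. The same value is repeated once per column label because we passed a Series to `predict` rather than a single-row DataFrame. Since 0.41 is below 0.5, it looks like we would have properly predicted that Mr. Baumann would not have survived.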

## c)

No, it would be silly to assume it did. We have to assume that there is some lurking dependence. It is very likely that more people in a lower passenger class (Pclass) embarked at one city, while more rich people embarked at another. This would make much more sense than some weird direct correlation between city and survival rates.
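One way to check for this kind of lurking dependence is to tabulate passenger class against port of embarkation. The sketch below is only an illustration: the function name `class_mix_by_port` is ours, and the tiny frame is made up purely to show the shape of the output, not taken from the Titanic data.

```python
import pandas as pd

def class_mix_by_port(raw):
    """Share of each passenger class within each port of embarkation."""
    return pd.crosstab(raw["Embarked"], raw["Pclass"], normalize="index")

# Tiny made-up frame just to show the output shape (NOT real Titanic rows)
toy = pd.DataFrame({
    "Embarked": ["C", "C", "C", "Q", "Q", "Q", "S", "S"],
    "Pclass":   [1, 1, 3, 3, 3, 3, 2, 3],
})

mix = class_mix_by_port(toy)
print(mix)
```

On the real data this would need the frame as read by `pd.read_csv`, before `get_dummies` replaces the `Embarked` and `Pclass` columns. If the rows of the table differ a lot from port to port, class is a plausible explanation for the apparent effect of the city.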