This notebook interprets a logistic regression of passenger survival, then prepares a simple model and a prediction file for submission to Kaggle.

In [52]:
import pandas as pd
import numpy as np

train = pd.read_csv('train.csv', index_col='PassengerId')

In [53]:
# Drop the two columns that are not used anywhere below.
train = train.drop(columns=['Ticket', 'Cabin'])

train['first_class'] = np.where(train['Pclass']==1, 1, 0)
train['second_class'] = np.where(train['Pclass']==2, 1, 0)
train['is_female'] = np.where(train['Sex']=='female', 1, 0)

In [54]:
import statsmodels.api as sm
from statsmodels.formula.api import logit

model = logit('Survived ~ Pclass + Age + Sex + SibSp + Parch', train).fit()
display(model.summary())
display(model.pred_table())
display(model.get_margeff().summary())

Optimization terminated successfully.
         Current function value: 0.445814
         Iterations 6


Dep. Variable:,Survived,No. Observations:,714.0
Model:,Logit,Df Residuals:,708.0
Method:,MLE,Df Model:,5.0
Date:,"Mon, 15 Mar 2021",Pseudo R-squ.:,0.34
Time:,12:31:57,Log-Likelihood:,-318.31
converged:,True,LL-Null:,-482.26
Covariance Type:,nonrobust,LLR p-value:,1.003e-68

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,5.6197,0.547,10.279,0.000,4.548,6.691
Sex[T.male],-2.6374,0.219,-12.021,0.000,-3.067,-2.207
Pclass,-1.3160,0.141,-9.342,0.000,-1.592,-1.040
Age,-0.0445,0.008,-5.448,0.000,-0.060,-0.028
SibSp,-0.3646,0.126,-2.882,0.004,-0.613,-0.117
Parch,-0.0371,0.120,-0.311,0.756,-0.272,0.197


array([[365.,  59.],
       [ 78., 212.]])

Dep. Variable:,Survived
Method:,dydx
At:,overall

,dy/dx,std err,z,P>|z|,[0.025,0.975]
Sex[T.male],-0.3766,0.017,-21.753,0.0,-0.411,-0.343
Pclass,-0.1879,0.016,-11.698,0.0,-0.219,-0.156
Age,-0.0063,0.001,-5.844,0.0,-0.008,-0.004
SibSp,-0.0521,0.018,-2.935,0.003,-0.087,-0.017
Parch,-0.0053,0.017,-0.311,0.756,-0.039,0.028
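
The 2×2 array printed by `pred_table()` above is a confusion matrix: rows are observed outcomes, columns are predicted outcomes, and statsmodels classifies at a 0.5 probability threshold by default. As a minimal sketch (the variable names are illustrative, and the counts are copied from the output), the usual in-sample classification rates can be read off it like this:

```python
# Counts copied from the pred_table() output above.
# Row = observed class (0 = died, 1 = survived), column = predicted class.
pred_table = [[365.0, 59.0],
              [78.0, 212.0]]

(true_neg, false_pos), (false_neg, true_pos) = pred_table
total = true_neg + false_pos + false_neg + true_pos

accuracy = (true_neg + true_pos) / total         # share of correct predictions
sensitivity = true_pos / (true_pos + false_neg)  # survivors correctly identified
specificity = true_neg / (true_neg + false_pos)  # non-survivors correctly identified

print(f"n = {total:.0f}")
print(f"accuracy    = {accuracy:.3f}")
print(f"sensitivity = {sensitivity:.3f}")
print(f"specificity = {specificity:.3f}")
```

The total matches the 714 observations reported in the summary. These are in-sample rates, so they are likely to be somewhat optimistic compared to performance on unseen data.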


In [55]:
model = logit('Survived ~ first_class + second_class + Age + is_female + SibSp + Parch', train).fit()
display(model.summary())

Optimization terminated successfully.
         Current function value: 0.445700
         Iterations 6


Dep. Variable:,Survived,No. Observations:,714.0
Model:,Logit,Df Residuals:,707.0
Method:,MLE,Df Model:,6.0
Date:,"Mon, 15 Mar 2021",Pseudo R-squ.:,0.3401
Time:,12:31:57,Log-Likelihood:,-318.23
converged:,True,LL-Null:,-482.26
Covariance Type:,nonrobust,LLR p-value:,7.899e-68

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-0.9360,0.277,-3.381,0.001,-1.479,-0.393
first_class,2.6502,0.286,9.275,0.000,2.090,3.210
second_class,1.2348,0.246,5.029,0.000,0.754,1.716
Age,-0.0448,0.008,-5.448,0.000,-0.061,-0.029
is_female,2.6423,0.220,12.024,0.000,2.212,3.073
SibSp,-0.3683,0.127,-2.904,0.004,-0.617,-0.120
Parch,-0.0386,0.120,-0.323,0.747,-0.273,0.196


The log-odds of survival decrease by .045 with each additional year of age and by .368 with each additional sibling or spouse, holding the other predictors constant. I will not interpret `Parch`: it is not statistically significant, and its effect on the log-odds is near zero anyway.

The log-odds of survival are higher by 2.650 for first class and 1.235 for second class passengers, compared to third class. The log-odds of survival for females are higher by 2.642 compared to males.

(However, mostly what matters here is the direction of the relationship.)
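
Log-odds are hard to read on their own, so it can help to push a concrete passenger through the fitted equation. The sketch below is illustrative only: the coefficients are copied (rounded) from the summary table above, and the two passengers are hypothetical profiles, not rows from the data.

```python
import math

# Coefficients copied from the second model summary above (log-odds scale).
COEFS = {
    "Intercept": -0.9360,
    "first_class": 2.6502,
    "second_class": 1.2348,
    "Age": -0.0448,
    "is_female": 2.6423,
    "SibSp": -0.3683,
    "Parch": -0.0386,
}

def predicted_probability(passenger):
    """Sum the log-odds contributions, then apply the inverse logit."""
    log_odds = COEFS["Intercept"] + sum(
        COEFS[name] * value for name, value in passenger.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))

# Two hypothetical passengers, both aged 30 and traveling alone.
third_class_man = {"first_class": 0, "second_class": 0, "Age": 30,
                   "is_female": 0, "SibSp": 0, "Parch": 0}
first_class_woman = {**third_class_man, "first_class": 1, "is_female": 1}

p_man = predicted_probability(third_class_man)
p_woman = predicted_probability(first_class_woman)
print(f"third-class man:   {p_man:.3f}")
print(f"first-class woman: {p_woman:.3f}")
```

The additive shifts in log-odds (2.650 for first class plus 2.642 for female) translate into a very large gap once mapped back to probabilities.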

In [56]:
np.exp(model.params)

Intercept        0.392209
first_class     14.156427
second_class     3.437665
Age              0.956154
is_female       14.045001
SibSp            0.691923
Parch            0.962129
dtype: float64

These are now odds ratios rather than changes in the log-odds, and their effect is multiplicative rather than additive.

The odds of survival are multiplied by 0.956 with each additional year of age, and by 0.692 with each additional sibling or spouse. So the predicted odds of survival for someone aged 25 are 0.956 times those of an otherwise identical person aged 24, and the predicted odds for someone traveling with two siblings are 0.692 times those of someone traveling with one.

The odds of survival are multiplied by 14.156 for first class and by 3.438 for second class passengers, compared to third class. Being female multiplies the odds of survival by 14.045 compared to being male.
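
A useful property of these exponentiated coefficients is that the ratio is the same no matter what the other predictors are. The following sketch checks this for age, using rounded coefficients copied from the summary table; `odds_at_age` and its `other_terms` argument are hypothetical helpers introduced for illustration.

```python
import math

B_INTERCEPT = -0.9360  # copied from the summary table above
B_AGE = -0.0448

def odds_at_age(age, other_terms=0.0):
    """Predicted odds at a given age.

    `other_terms` lumps together the log-odds contribution of all
    remaining predictors (0.0 means a third-class man traveling alone).
    """
    return math.exp(B_INTERCEPT + B_AGE * age + other_terms)

# Same one-year comparison for two very different passengers.
ratio_baseline = odds_at_age(25) / odds_at_age(24)
first_class_woman = 2.6502 + 2.6423
ratio_first_class_woman = (
    odds_at_age(25, first_class_woman) / odds_at_age(24, first_class_woman)
)

print(f"odds ratio, third-class man:   {ratio_baseline:.4f}")
print(f"odds ratio, first-class woman: {ratio_first_class_woman:.4f}")
print(f"exp(B_AGE):                    {math.exp(B_AGE):.4f}")
```

Both ratios equal `exp(B_AGE)`, which is why a single number per predictor is enough on the odds scale. The value differs slightly in the last digits from the `np.exp(model.params)` output because the coefficient here is rounded.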

Making this even easier ...

In [57]:
(np.exp(model.params) - 1) * 100

Intercept        -60.779115
first_class     1315.642694
second_class     243.766544
Age               -4.384581
is_female       1304.500064
SibSp            -30.807668
Parch             -3.787129
dtype: float64

The odds of survival are lower by 4.38% with each additional year of age, and lower by 30.81% with each additional sibling or spouse. They are higher by 1,315.64% for first class passengers compared to third class, by 1,304.50% for women compared to men, and by 243.77% for second class compared to third class passengers. Wow! Being female or not in third class really pays off.
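
These percentages describe changes in the odds, not in the probability of survival, and the two can look very different. As a rough illustration, the sketch below applies the first-class odds ratio to a few made-up baseline probabilities; `apply_odds_ratio` is a hypothetical helper and the baselines are not taken from the data.

```python
def apply_odds_ratio(p_baseline, odds_ratio):
    """Convert a probability to odds, scale the odds, and convert back."""
    odds = p_baseline / (1.0 - p_baseline) * odds_ratio
    return odds / (1.0 + odds)

FIRST_CLASS_OR = 14.156  # exp(coef) for first_class, from the output above

# Made-up third-class baseline probabilities, for illustration only.
for p in (0.05, 0.10, 0.50):
    p_new = apply_odds_ratio(p, FIRST_CLASS_OR)
    print(f"baseline {p:.2f} -> first class {p_new:.3f} "
          f"(+{p_new - p:.3f} in probability)")
```

The same 1,315.64% increase in odds moves the probability by very different amounts depending on where the passenger starts, which is one reason to look at marginal effects as well.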

In [58]:
display(model.get_margeff(at='mean').summary())
display(f"average survival rate: {np.mean(train['Survived'])}")

Dep. Variable:,Survived
Method:,dydx
At:,mean

,dy/dx,std err,z,P>|z|,[0.025,0.975]
first_class,0.6201,0.067,9.287,0.0,0.489,0.751
second_class,0.2889,0.057,5.072,0.0,0.177,0.401
Age,-0.0105,0.002,-5.445,0.0,-0.014,-0.007
is_female,0.6182,0.053,11.768,0.0,0.515,0.721
SibSp,-0.0862,0.03,-2.903,0.004,-0.144,-0.028
Parch,-0.009,0.028,-0.323,0.747,-0.064,0.046


'average survival rate: 0.3838383838383838'

Interpretation of marginal effects:

For a passenger who is average on every predictor, the predicted probability of survival falls at a rate of .0105 per year of age and .086 per sibling or spouse. Strictly, these are slopes evaluated at the mean, so they describe an infinitely small increase in age or in the number of siblings or spouses (an infinitely small sibling?).

At the means, the marginal effect of being female raises the predicted probability of survival by 0.618. First class and second class passengers enjoy a marginal increase in the predicted probability of survival of 0.620 and 0.289, respectively, compared to third class passengers.
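
For a logit model, the `dydx` marginal effect of a predictor is its coefficient multiplied by p(1 − p), where p is the predicted probability at the evaluation point. That implies the ratio of marginal effect to coefficient should be the same for every predictor. The sketch below checks this against the rounded numbers reported in the two tables above; nothing here is refit.

```python
# (coefficient, marginal effect at the mean), copied from the tables above.
reported = {
    "first_class": (2.6502, 0.6201),
    "second_class": (1.2348, 0.2889),
    "Age": (-0.0448, -0.0105),
    "is_female": (2.6423, 0.6182),
    "SibSp": (-0.3683, -0.0862),
    "Parch": (-0.0386, -0.0090),
}

# If dy/dx = coef * p * (1 - p), every ratio estimates the same p * (1 - p).
ratios = {name: effect / coef for name, (coef, effect) in reported.items()}
for name, ratio in ratios.items():
    print(f"{name:<13} {ratio:.4f}")

mean_ratio = sum(ratios.values()) / len(ratios)
print(f"implied p * (1 - p) at the means: {mean_ratio:.3f}")
```

The ratios agree to within rounding, including those for the 0/1 indicators. That suggests the indicators were treated as continuous here, which is the default for `get_margeff` (`dummy=False`), so their marginal effects are slopes rather than the discrete change from 0 to 1.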

In [59]:
display(np.mean(train['Pclass']))
display(np.mean(train['Age']))
display(np.mean(train['is_female']))
display(np.mean(train['SibSp']))

2.308641975308642

29.69911764705882

0.35241301907968575

0.5230078563411896

By the way, our average passenger is a second-class male passenger aged 30 traveling with one sibling or spouse.

In [60]:
from sklearn.model_selection import train_test_split

y = train['Survived']
X = train[['first_class', 'second_class', 'Age', 'is_female', 'SibSp']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [61]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error

from sklearn.linear_model import LogisticRegression

imputer = SimpleImputer(strategy='mean')
param_grid = {
    'logreg__penalty': ['l1', 'l2'],
    'logreg__solver': ['liblinear'],
    'logreg__multi_class': ['ovr']
}

steps = [('imputation', imputer), ('logreg', LogisticRegression())]
pipeline = Pipeline(steps)

logreg_cv = GridSearchCV(pipeline, param_grid, cv=5)
logreg_cv.fit(X_train, y_train)
y_pred = logreg_cv.predict(X_test)

# With default scoring, GridSearchCV.score on a classifier returns mean accuracy, not R-squared.
accuracy = logreg_cv.score(X_test, y_test)
# With 0/1 labels, MAE is the misclassification rate and RMSE is its square root.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
print(f"Winning parameters: {logreg_cv.best_params_}")
print(f"Accuracy is: {accuracy}")
print(f"RMSE is: {rmse}, and MAE is: {mae}")

Winning parameters: {'logreg__multi_class': 'ovr', 'logreg__penalty': 'l2', 'logreg__solver': 'liblinear'}
Accuracy is: 0.8134328358208955
RMSE is: 0.43193421279068006, and MAE is: 0.1865671641791045
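
With default scoring, `score` on a classifier pipeline returns mean accuracy, so the three numbers printed above are really one number in three forms. A small sketch of why, using a made-up pair of label lists and then the value from the output:

```python
import math

# Toy labels, invented for illustration: two of five predictions are wrong.
y_true = [1, 0, 1, 1, 0]
y_hat = [1, 1, 1, 0, 0]

n = len(y_true)
error_rate = sum(t != p for t, p in zip(y_true, y_hat)) / n
toy_mae = sum(abs(t - p) for t, p in zip(y_true, y_hat)) / n
toy_mse = sum((t - p) ** 2 for t, p in zip(y_true, y_hat)) / n

# For 0/1 labels each error contributes exactly 1 to both sums.
print(error_rate, toy_mae, toy_mse)

# The same identities applied to the score printed above.
accuracy = 0.8134328358208955
mae = 1.0 - accuracy
rmse = math.sqrt(mae)
print(f"MAE  = {mae:.6f}")
print(f"RMSE = {rmse:.6f}")
```

Because MAE and RMSE carry no extra information here, metrics such as precision, recall, or log loss would probably be more informative companions to accuracy.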


In [62]:
imputer = SimpleImputer(strategy='mean')
param_grid = {
    # 'none' is the spelling this run used; newer scikit-learn releases expect penalty=None.
    'logreg__penalty': ['l2', 'none'],
    'logreg__solver': ['lbfgs'],
    'logreg__multi_class': ['ovr']
}

steps = [('imputation', imputer), ('logreg', LogisticRegression())]
pipeline = Pipeline(steps)

logreg_cv = GridSearchCV(pipeline, param_grid, cv=5)
logreg_cv.fit(X_train, y_train)
y_pred = logreg_cv.predict(X_test)

# With default scoring, GridSearchCV.score on a classifier returns mean accuracy, not R-squared.
accuracy = logreg_cv.score(X_test, y_test)
# With 0/1 labels, MAE is the misclassification rate and RMSE is its square root.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
print(f"Winning parameters: {logreg_cv.best_params_}")
print(f"Accuracy is: {accuracy}")
print(f"RMSE is: {rmse}, and MAE is: {mae}")

Winning parameters: {'logreg__multi_class': 'ovr', 'logreg__penalty': 'none', 'logreg__solver': 'lbfgs'}
Accuracy is: 0.8134328358208955
RMSE is: 0.43193421279068006, and MAE is: 0.1865671641791045


In [63]:
# prepare test data

X_holdout = pd.read_csv('test.csv', index_col='PassengerId')

X_holdout['first_class'] = np.where(X_holdout['Pclass']==1, 1, 0)
X_holdout['second_class'] = np.where(X_holdout['Pclass']==2, 1, 0)
X_holdout['is_female'] = np.where(X_holdout['Sex']=='female', 1, 0)

X_holdout_for_pred = X_holdout[['first_class', 'second_class', 'Age', 'is_female', 'SibSp']]

In [64]:
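# Final model: the l2 / liblinear settings selected by the first grid search.
imputer = SimpleImputer(strategy='mean')
logreg = LogisticRegression(multi_class='ovr', penalty='l2', solver='liblinear')

steps = [('imputation', imputer), ('model', logreg)]
pipeline = Pipeline(steps)

# Refit on all labeled rows (X, y), then predict the unlabeled holdout set.
pipeline.fit(X, y)
preds = pipeline.predict(X_holdout_for_pred)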

In [65]:
submission_df = pd.DataFrame({'PassengerId': X_holdout_for_pred.index, 'Survived': preds})
submission_df.to_csv('survival_predictions.csv', index=False)