In this exercise, we'll continue working with the titanic dataset and building logistic regression models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

from acquire import get_titanic_data
from prepare import prep_titanic_data

In [None]:
df = prep_titanic_data(get_titanic_data())
df.shape

In [None]:
df.dropna(inplace=True)
df.shape

In [None]:
x = df[['pclass', 'age', 'fare', 'sibsp', 'parch']]
y = df[['survived']]

x_train_validate, x_test, y_train_validate, y_test = train_test_split(x, y, test_size = .20, random_state = 123)

x_train, x_validate, y_train, y_validate = train_test_split(x_train_validate, y_train_validate, test_size = .30, random_state = 123)

print("train: ", x_train.shape, ", validate: ", x_validate.shape, ", test: ", x_test.shape)
print("train: ", y_train.shape, ", validate: ", y_validate.shape, ", test: ", y_test.shape)

In [None]:
logit = LogisticRegression()

In [None]:
logit.fit(x_train, y_train)

In [None]:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

In [None]:
y_pred = logit.predict(x_train)
y_pred_proba = logit.predict_proba(x_train)

In [None]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(x_train, y_train)))

In [None]:
print(confusion_matrix(y_train, y_pred))

In [None]:
print(classification_report(y_train, y_pred))

For all of the models you create, choose a threshold that optimizes for accuracy.
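
One way to pick such a threshold is to sweep a grid of cutoffs over the predicted probabilities and keep the one with the highest accuracy. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the Titanic features, and the names `X_demo`, `y_demo`, and `demo_model` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the Titanic features (assumption: not the real data).
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=123)

demo_model = LogisticRegression().fit(X_demo, y_demo)

# Probability of the positive class is column 1 of predict_proba.
proba_pos = demo_model.predict_proba(X_demo)[:, 1]

# Score every candidate threshold by the accuracy of the resulting labels.
thresholds = np.linspace(0.05, 0.95, 19)
accuracies = [((proba_pos >= t).astype(int) == y_demo).mean() for t in thresholds]

best_idx = int(np.argmax(accuracies))
best_threshold = float(thresholds[best_idx])
best_accuracy = float(accuracies[best_idx])
print(f"best threshold: {best_threshold:.2f}, accuracy: {best_accuracy:.3f}")
```

A threshold tuned on the training rows can overfit them, so it is worth checking that the chosen cutoff also does well on validate.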

In [None]:
# Baseline check: the share of each class in the training target.
# (survived lives in y_train, not x_train)
# y_train.survived.value_counts(normalize=True)

Do your work for these exercises in either a notebook or a Python script named model within your classification-exercises repository. Add, commit, and push your work.

### Exercise 1
Create another model that includes age in addition to fare and pclass. 

In [None]:
x1 = df[['age', 'fare', 'pclass']]
y1 = df[['survived']]

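# Split exactly like the baseline (the original `stratify = survived` referenced an
# undefined name) so both models are fit and scored on the same rows.
x1_train_validate, x1_test, y1_train_validate, y1_test = train_test_split(x1, y1, test_size = .20, random_state = 123)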

x1_train, x1_validate, y1_train, y1_validate = train_test_split(x1_train_validate, y1_train_validate, test_size = .30, random_state = 123)

print("train: ", x1_train.shape, ", validate: ", x1_validate.shape, ", test: ", x1_test.shape)
print("train: ", y1_train.shape, ", validate: ", y1_validate.shape, ", test: ", y1_test.shape)

In [None]:
logit.fit(x1_train, y1_train)

In [None]:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

In [None]:
y1_pred = logit.predict(x1_train)
y1_pred_proba = logit.predict_proba(x1_train)

In [None]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(x1_train, y1_train)))

In [None]:
print(confusion_matrix(y1_train, y1_pred))

In [None]:
print(classification_report(y1_train, y1_pred))

- Does this model perform better than your previous one?

No, it did not perform better than the previous model. The two models performed identically, which suggests that 'sibsp' and 'parch' contributed little to the baseline; their coefficients may have canceled each other out.

### Exercise 2
Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.

In [None]:
titanic_dummies = pd.get_dummies(df.sex, drop_first=True)
df = pd.concat([df, titanic_dummies], axis=1)
df

In [None]:
x2 = df[['age', 'fare', 'pclass', 'male']]
y2 = df[['survived']]

x2_train_validate, x2_test, y2_train_validate, y2_test = train_test_split(x2, y2, test_size = .20, random_state = 123)

x2_train, x2_validate, y2_train, y2_validate = train_test_split(x2_train_validate, y2_train_validate, test_size = .30, random_state = 123)

print("train: ", x2_train.shape, ", validate: ", x2_validate.shape, ", test: ", x2_test.shape)
print("train: ", y2_train.shape, ", validate: ", y2_validate.shape, ", test: ", y2_test.shape)

In [None]:
logit.fit(x2_train, y2_train)

In [None]:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

In [None]:
y2_pred = logit.predict(x2_train)
y2_pred_proba = logit.predict_proba(x2_train)

In [None]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(x2_train, y2_train)))

In [None]:
print(confusion_matrix(y2_train, y2_pred))

In [None]:
print(classification_report(y2_train, y2_pred))

### Exercise 3
Try out other combinations of features and models.

### Exercise 4
Use your best 3 models to predict and evaluate on your validate sample.
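
A minimal sketch of that comparison, assuming a synthetic DataFrame in place of the real Titanic data: each candidate feature set is fit on train only and then scored on both train and validate. The frame `demo`, the `feature_sets` dictionary, and the way the target is generated are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(123)
n = 600
demo = pd.DataFrame({
    "pclass": rng.integers(1, 4, n),
    "age": rng.uniform(1, 80, n),
    "fare": rng.uniform(5, 300, n),
    "male": rng.integers(0, 2, n),
})
# Synthetic target loosely tied to the features (assumption, not real data).
signal = 1.5 - 0.6 * demo.pclass - 1.2 * demo.male - 0.01 * demo.age
demo["survived"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-signal))).astype(int)

# Same 80/20 then 70/30 split pattern used earlier in the notebook.
train_val_df, test_df = train_test_split(demo, test_size=0.2, random_state=123)
train_df, validate_df = train_test_split(train_val_df, test_size=0.3, random_state=123)

feature_sets = {
    "age_fare_pclass": ["age", "fare", "pclass"],
    "plus_male": ["age", "fare", "pclass", "male"],
    "pclass_male": ["pclass", "male"],
}

rows = []
for name, cols in feature_sets.items():
    # Fit on train only; validate is used purely for comparison.
    model = LogisticRegression(max_iter=1000).fit(train_df[cols], train_df.survived)
    rows.append({
        "model": name,
        "train_acc": model.score(train_df[cols], train_df.survived),
        "validate_acc": model.score(validate_df[cols], validate_df.survived),
    })

results_df = pd.DataFrame(rows)
print(results_df)
```

A large gap between `train_acc` and `validate_acc` for a given row is a hint that the model is overfitting.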

### Exercise 5
Choose your best model based on its validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? To train?

Bonus1: How do different strategies for handling the missing values in the age column affect model performance?
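
As a rough illustration of how such a comparison could be set up, the sketch below contrasts mean imputation, median imputation, and dropping rows. It assumes synthetic data with roughly 20% of ages knocked out at random; the frame `toy` and its columns are hypothetical stand-ins, not the prepared Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(123)
n = 600
toy = pd.DataFrame({
    "age": rng.uniform(1, 80, n),
    "fare": rng.uniform(5, 300, n),
    "pclass": rng.integers(1, 4, n),
})
# Synthetic target (assumption, not real data).
signal = 1.0 - 0.5 * toy.pclass - 0.01 * toy.age + 0.005 * toy.fare
toy["survived"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-signal))).astype(int)
# Remove about 20% of the ages to mimic missing values.
toy.loc[rng.uniform(size=n) < 0.2, "age"] = np.nan

features = ["age", "fare", "pclass"]
toy_train, toy_validate = train_test_split(toy, test_size=0.3, random_state=123)

impute_scores = {}
for strategy in ["mean", "median"]:
    # The imputer learns its fill value from train only, then applies it to validate.
    pipe = make_pipeline(SimpleImputer(strategy=strategy), LogisticRegression(max_iter=1000))
    pipe.fit(toy_train[features], toy_train.survived)
    impute_scores[strategy] = pipe.score(toy_validate[features], toy_validate.survived)

# Alternative: drop rows with a missing age instead of filling them in.
train_drop = toy_train.dropna(subset=["age"])
validate_drop = toy_validate.dropna(subset=["age"])
drop_model = LogisticRegression(max_iter=1000).fit(train_drop[features], train_drop.survived)
impute_scores["drop_rows"] = drop_model.score(validate_drop[features], validate_drop.survived)

print(impute_scores)
```

Note that the `drop_rows` score is computed on a smaller validate set than the other two, so the numbers are not strictly like-for-like.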

Bonus2: How do different strategies for encoding sex affect model performance?

Bonus3: scikit-learn's LogisticRegression classifier applies a regularization penalty to the coefficients by default. This penalty causes the magnitude of the coefficients in the resulting model to be smaller than they otherwise would be. The strength of the penalty can be adjusted with the C hyperparameter. Small values of C correspond to a larger penalty, and large values of C correspond to a smaller penalty.
Try out the following values for C and note how the coefficients and the model's performance on both the dataset it was trained on and on the validate split are affected.


C= .01, .1, 1, 10, 100, 1000
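
The sweep could be set up along these lines. This is a sketch on synthetic data (an assumption, standing in for the Titanic splits), and it summarizes the coefficients by their overall L2 norm so the shrinkage is easy to see; `c_results` and the `X_c`/`y_c` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the train/validate data (assumption).
X_c, y_c = make_classification(n_samples=600, n_features=5, n_informative=3,
                               random_state=123)
X_c_train, X_c_val, y_c_train, y_c_val = train_test_split(
    X_c, y_c, test_size=0.3, random_state=123)

c_results = []
for C in [0.01, 0.1, 1, 10, 100, 1000]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_c_train, y_c_train)
    c_results.append({
        "C": C,
        # Overall size of the coefficient vector: smaller C should shrink it.
        "coef_norm": float(np.linalg.norm(model.coef_)),
        "train_acc": model.score(X_c_train, y_c_train),
        "validate_acc": model.score(X_c_val, y_c_val),
    })

for row in c_results:
    print(row)
```

With the default L2 penalty, the coefficient norm should grow as C increases, while accuracy often changes much less than the coefficients do.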

Bonus Bonus: How does scaling the data interact with your choice of C?
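
Because the penalty acts on the coefficients directly, features measured on large scales (small coefficients) are penalized less than features on small scales. One way to explore this, sketched here on synthetic data with one column deliberately inflated (an assumption meant to mimic something like fare next to pclass), is to fit the same C values with and without a `StandardScaler` in a pipeline. All variable names below are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_s, y_s = make_classification(n_samples=600, n_features=5, n_informative=3,
                               random_state=123)
X_s[:, 0] *= 100  # exaggerate one feature's scale (assumption for illustration)

X_s_train, X_s_val, y_s_train, y_s_val = train_test_split(
    X_s, y_s, test_size=0.3, random_state=123)

scaling_results = []
for C in [0.01, 1, 100]:
    raw = LogisticRegression(C=C, max_iter=5000).fit(X_s_train, y_s_train)
    # The scaler is fit on train only, inside the pipeline.
    scaled = make_pipeline(StandardScaler(),
                           LogisticRegression(C=C, max_iter=5000))
    scaled.fit(X_s_train, y_s_train)
    scaling_results.append({
        "C": C,
        "raw_validate_acc": raw.score(X_s_val, y_s_val),
        "scaled_validate_acc": scaled.score(X_s_val, y_s_val),
    })

for row in scaling_results:
    print(row)
```

Comparing the raw and scaled coefficients at small C is usually more revealing than comparing accuracy alone.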