## Mini Exercise
1. Load the titanic dataset that you've put together from previous lessons.
2. Split your data into training and test sets.
3. Fit a logistic regression model on your training data using sklearn's
   linear_model.LogisticRegression class. Use fare and pclass as the
   predictors.
4. Use the model's .predict method. What is the output?
5. Use the model's .predict_proba method. What is the output? Why do you
   think it is shaped like this?
6. Evaluate your model's predictions on the test data set. How accurate
   is the model? How does changing the threshold affect this?

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn import preprocessing

import acquire

df = acquire.get_titanic_data()
train, test = acquire.prep_titanic(df)
train = train.dropna()
test = test.dropna()
train.head()

Unnamed: 0,passenger_id,survived,pclass,sex,sibsp,parch,class,embark_town,alone,C,Q,S,age_mm,fare_mm
613,613,0,3,male,0,0,Third,Queenstown,1,0.0,1.0,0.0,0.293286,0.096618
130,130,0,3,male,0,0,Third,Cherbourg,1,1.0,0.0,0.0,0.565099,0.025374
220,220,1,3,male,0,0,Third,Southampton,1,0.0,0.0,1.0,0.456374,0.025374
225,225,0,3,male,0,0,Third,Southampton,1,0.0,0.0,1.0,0.048655,0.054457
209,209,1,1,male,0,0,First,Cherbourg,1,1.0,0.0,0.0,0.306877,0.513342


In [2]:
X_train = train[['fare_mm', 'pclass']]
y_train = train['survived']  # 1-D Series; sklearn expects a 1-D target

X_test = test[['fare_mm', 'pclass']]
y_test = test['survived']

In [3]:
logit = LogisticRegression().fit(X_train, y_train)
y_pred = logit.predict(X_train)
y_pred

array([0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [4]:
y_pred_proba = logit.predict_proba(X_train)
y_pred_proba

array([[0.70332668, 0.29667332],
       [0.70104299, 0.29895701],
       [0.70104299, 0.29895701],
       [0.70197645, 0.29802355],
       [0.36940447, 0.63059553],
       [0.36049449, 0.63950551],
       [0.35545254, 0.64454746],
       [0.54061097, 0.45938903],
       [0.7007325 , 0.2992675 ],
       [0.35180066, 0.64819934],
       [0.70072466, 0.29927534],
       [0.70104299, 0.29895701],
       [0.70068231, 0.29931769],
       [0.70022718, 0.29977282],
       [0.70022718, 0.29977282],
       [0.7020453 , 0.2979547 ],
       [0.35248494, 0.64751506],
       [0.35301691, 0.64698309],
       [0.35290844, 0.64709156],
       [0.35486183, 0.64513817],
       [0.70071368, 0.29928632],
       [0.70211231, 0.29788769],
       [0.70332694, 0.29667306],
       [0.53282575, 0.46717425],
       [0.35516641, 0.64483359],
       [0.35215723, 0.64784277],
       [0.53030022, 0.46969978],
       [0.70135642, 0.29864358],
       [0.70189194, 0.29810806],
       [0.35447464, 0.64552536],
       [0.
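
To see why `.predict_proba` returns two columns, a small sketch on synthetic data may help (the feature values below are made up and are not the Titanic data). Each row holds one probability per class, in the order given by `classes_`, so every row sums to 1, and `.predict` agrees with picking the larger column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): one feature, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)

print(proba.shape)     # one row per observation, one column per class
print(model.classes_)  # column order: P(class 0), then P(class 1)

# Each row is a probability distribution over the classes
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)

# .predict picks the class with the larger probability (a 0.5 cutoff for two classes)
matches_predict = np.array_equal(model.classes_[proba.argmax(axis=1)], model.predict(X))
print(rows_sum_to_one, matches_predict)
```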

In [5]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X_train, y_train)))

Accuracy of Logistic Regression classifier on training set: 0.63


In [6]:
print('Accuracy of Logistic Regression classifier on test set: {:.2f}'
     .format(logit.score(X_test, y_test)))

Accuracy of Logistic Regression classifier on test set: 0.72
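
Question 6 asks how the threshold affects accuracy. One way to explore that is to threshold the second column of `.predict_proba` by hand instead of relying on `.predict`, which uses 0.5. The sketch below runs on synthetic data, and `predict_with_threshold` is a hypothetical helper, not something from the lesson.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

def predict_with_threshold(model, X, threshold=0.5):
    """Hypothetical helper: label an observation 1 when P(class 1) >= threshold."""
    return (model.predict_proba(X)[:, 1] >= threshold).astype(int)

# Synthetic stand-in data (assumption): two features, binary target
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=300) > 0).astype(int)

model = LogisticRegression().fit(X, y)

for threshold in (0.3, 0.5, 0.7):
    y_hat = predict_with_threshold(model, X, threshold)
    # Rows are actual classes, columns are predicted classes
    tn, fp, fn, tp = confusion_matrix(y, y_hat, labels=[0, 1]).ravel()
    print(f"threshold={threshold}: accuracy={accuracy_score(y, y_hat):.2f}, "
          f"false positives={fp}, false negatives={fn}")
```

Lowering the threshold can only move predictions from 0 to 1, so false negatives never increase and false positives never decrease. Whether accuracy goes up or down depends on which effect dominates.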


# Model Exercises

In this exercise, we'll continue working with the titanic dataset and building logistic regression models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.

For all of the models you create, choose a threshold that optimizes for accuracy.
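
A minimal sketch of choosing an accuracy-maximizing threshold: sweep a grid of candidate cutoffs, score each one on the validate split, and keep the best. The data here is synthetic, and `best_threshold` is a hypothetical helper rather than part of the lesson's `acquire` module.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def best_threshold(model, X, y, candidates=np.linspace(0.05, 0.95, 19)):
    """Hypothetical helper: return the (threshold, accuracy) pair with the highest accuracy."""
    p1 = model.predict_proba(X)[:, 1]
    accuracies = [np.mean((p1 >= t).astype(int) == y) for t in candidates]
    best = int(np.argmax(accuracies))  # the first maximum wins on ties
    return float(candidates[best]), float(accuracies[best])

# Synthetic stand-in data (assumption): three features, binary target
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=500) > 0.3).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=123
)

model = LogisticRegression().fit(X_train, y_train)

# Pick the cutoff on validate, never on test
threshold, accuracy = best_threshold(model, X_valid, y_valid)
print(f"best threshold on validate: {threshold:.2f} (accuracy {accuracy:.2f})")
```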

In [7]:
df = acquire.get_titanic_data()
df = df.drop(columns='deck')
df.dropna(inplace=True)
df.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,Southampton,1


In [8]:
# label encode sex column

le = preprocessing.LabelEncoder()
df['sex_enc'] = le.fit_transform(df.sex)

# One-hot encode embarked column

encoded_values = sorted(list(df['embarked'].unique()))
le = preprocessing.LabelEncoder()
enc = le.fit_transform(df['embarked'])
ohe_array = np.array(enc).reshape(len(enc), 1)
# scikit-learn 1.2 renamed `sparse` to `sparse_output`; on older versions use sparse=False
ohe = preprocessing.OneHotEncoder(sparse_output=False, categories='auto')
df_ohe = ohe.fit_transform(ohe_array)
enc = pd.DataFrame(data=df_ohe, columns=encoded_values, index=df.index)
df = df.join(enc)

df.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,embark_town,alone,sex_enc,C,Q,S
0,0,0,3,male,22.0,1,0,7.25,S,Third,Southampton,0,1,0.0,0.0,1.0
1,1,1,1,female,38.0,1,0,71.2833,C,First,Cherbourg,0,0,1.0,0.0,0.0
2,2,1,3,female,26.0,0,0,7.925,S,Third,Southampton,1,0,0.0,0.0,1.0
3,3,1,1,female,35.0,1,0,53.1,S,First,Southampton,0,0,0.0,0.0,1.0
4,4,0,3,male,35.0,0,0,8.05,S,Third,Southampton,1,1,0.0,0.0,1.0
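
The encoding cell above can also be written more compactly with pandas alone. The following is a sketch on a made-up toy frame, assuming the columns have no missing values; it is an alternative, not the approach the lesson uses.

```python
import pandas as pd

# Toy frame standing in for the titanic data (assumption)
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "Q", "S"],
})

# Binary feature: female -> 0, male -> 1, the same coding LabelEncoder gives alphabetically
toy["sex_enc"] = (toy["sex"] == "male").astype(int)

# Multi-category feature: one indicator column per category, named after the category
dummies = pd.get_dummies(toy["embarked"], dtype=float)
toy = toy.join(dummies)

print(toy)
```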


In [9]:
train, test = acquire.split_my_data(df, 0.8)
train, valid = acquire.split_my_data(train, 0.8)

test.shape, train.shape, valid.shape

((143, 16), (455, 16), (114, 16))
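
`acquire.split_my_data` is a course helper whose implementation is not shown here. As a rough sketch of what a two-stage train/validate/test split could look like, the hypothetical `split_train_validate_test` below calls scikit-learn's `train_test_split` twice; the `random_state` value is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def split_train_validate_test(df, train_size=0.8, random_state=123):
    """Hypothetical helper: split df into train, validate, and test frames."""
    # First carve off the test set, then split the remainder into train and validate
    train_validate, test = train_test_split(
        df, train_size=train_size, random_state=random_state
    )
    train, validate = train_test_split(
        train_validate, train_size=train_size, random_state=random_state
    )
    return train, validate, test

# Toy frame with 712 rows, the row count implied by the shapes above (assumption)
toy = pd.DataFrame({"x": np.arange(712), "survived": np.arange(712) % 2})
train, validate, test = split_train_validate_test(toy)

print(test.shape, train.shape, validate.shape)
```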

Create another model that includes age in addition to fare and pclass. Does this model perform better than your previous one?

In [10]:
X_train = train[['age', 'fare', 'pclass']]
X_valid = valid[['age', 'fare', 'pclass']]

y_train = train.survived
y_valid = valid.survived
y_test = test.survived

In [11]:
logit2 = LogisticRegression().fit(X_train, y_train)

print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit2.score(X_train, y_train)))
print('Accuracy of Logistic Regression classifier on validation set: {:.2f}'
     .format(logit2.score(X_valid, y_valid)))

Accuracy of Logistic Regression classifier on training set: 0.72
Accuracy of Logistic Regression classifier on validation set: 0.67


Include sex in your model as well. Note that you'll need to encode this feature before including it in a model.

In [12]:
X_train = train[['age', 'fare', 'pclass', 'sex_enc']]
X_valid = valid[['age', 'fare', 'pclass', 'sex_enc']]

In [13]:
logit3 = LogisticRegression().fit(X_train, y_train)

print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit3.score(X_train, y_train)))
print('Accuracy of Logistic Regression classifier on validation set: {:.2f}'
     .format(logit3.score(X_valid, y_valid)))

Accuracy of Logistic Regression classifier on training set: 0.79
Accuracy of Logistic Regression classifier on validation set: 0.80


Try out other combinations of features and models.

In [14]:
X_train = train[['age', 'fare', 'pclass', 'sex_enc', 'C', 'Q', 'S']]
X_valid = valid[['age', 'fare', 'pclass', 'sex_enc', 'C', 'Q', 'S']]

In [15]:
logit4 = LogisticRegression().fit(X_train, y_train)

print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit4.score(X_train, y_train)))
print('Accuracy of Logistic Regression classifier on validation set: {:.2f}'
     .format(logit4.score(X_valid, y_valid)))

Accuracy of Logistic Regression classifier on training set: 0.79
Accuracy of Logistic Regression classifier on validation set: 0.82


In [16]:
X_train = train[['age', 'fare', 'pclass', 'sex_enc', 'C', 'Q', 'S', 'sibsp', 'parch']]
X_valid = valid[['age', 'fare', 'pclass', 'sex_enc', 'C', 'Q', 'S', 'sibsp', 'parch']]

In [17]:
logit5 = LogisticRegression().fit(X_train, y_train)

print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit5.score(X_train, y_train)))
print('Accuracy of Logistic Regression classifier on validation set: {:.2f}'
     .format(logit5.score(X_valid, y_valid)))

Accuracy of Logistic Regression classifier on training set: 0.81
Accuracy of Logistic Regression classifier on validation set: 0.80


Choose your best model and evaluate it on the test dataset. Is it overfit?

In [18]:
X_test = test[['age', 'fare', 'pclass', 'sex_enc', 'C', 'Q', 'S', 'sibsp', 'parch']]

print('Accuracy of Logistic Regression classifier on test set: {:.2f}'
     .format(logit5.score(X_test, y_test)))

Accuracy of Logistic Regression classifier on test set: 0.80
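
Accuracy alone can hide how a model performs on each class. `classification_report` and `confusion_matrix` are imported at the top of this notebook but never called; the sketch below shows one way they might be used alongside a train-versus-held-out comparison. It runs on synthetic data, not the Titanic splits.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): two features, binary target
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=400) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)
model = LogisticRegression().fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
# A training score far above the held-out score is one symptom of overfitting
print(f"train={train_accuracy:.2f} test={test_accuracy:.2f} "
      f"gap={train_accuracy - test_accuracy:+.2f}")

y_pred = model.predict(X_test)
matrix = confusion_matrix(y_test, y_pred)  # rows: actual, columns: predicted
print(matrix)
print(classification_report(y_test, y_pred))
```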


Bonus: How do different strategies for handling the missing values in the age column affect model performance?

In [19]:
df = acquire.get_titanic_data()
df = df.drop(columns='deck')
df.embarked = df.embarked.fillna('S')

df.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,Southampton,1


In [20]:
df[df.age.isna()].head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,embark_town,alone
5,5,0,3,male,,0,0,8.4583,Q,Third,Queenstown,1
17,17,1,2,male,,0,0,13.0,S,Second,Southampton,1
19,19,1,3,female,,0,0,7.225,C,Third,Cherbourg,1
26,26,0,3,male,,0,0,7.225,C,Third,Cherbourg,1
28,28,1,3,female,,0,0,7.8792,Q,Third,Queenstown,1


In [21]:
df['age_imp'] = df["age"].fillna(df.groupby("pclass")["age"].transform("mean"))

In [22]:
df[df.age.isna()].head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,embark_town,alone,age_imp
5,5,0,3,male,,0,0,8.4583,Q,Third,Queenstown,1,25.14062
17,17,1,2,male,,0,0,13.0,S,Second,Southampton,1,29.87763
19,19,1,3,female,,0,0,7.225,C,Third,Cherbourg,1,25.14062
26,26,0,3,male,,0,0,7.225,C,Third,Cherbourg,1,25.14062
28,28,1,3,female,,0,0,7.8792,Q,Third,Queenstown,1,25.14062
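
The cell above fills missing ages with the mean age of each passenger class, computed on the full frame before splitting. For comparison, here is a sketch of three strategies on a made-up toy frame, with the fill values learned from the training split only so that nothing leaks from validate into the imputation. The frames and values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the train and validate splits (assumption)
train = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3, 3],
    "age": [40.0, np.nan, 30.0, 28.0, 20.0, np.nan, 24.0],
})
valid = pd.DataFrame({"pclass": [1, 3], "age": [np.nan, np.nan]})

# Strategy 1: drop rows with a missing age (loses observations)
dropped = train.dropna(subset=["age"])

# Strategy 2: fill with a single value, here the training median
median_age = train["age"].median()
train_median = train["age"].fillna(median_age)
valid_median = valid["age"].fillna(median_age)

# Strategy 3: fill with the mean age of the passenger's class, learned from train only
class_means = train.groupby("pclass")["age"].mean()
train_group = train["age"].fillna(train["pclass"].map(class_means))
valid_group = valid["age"].fillna(valid["pclass"].map(class_means))

print(len(dropped), median_age)
print(valid_median.tolist(), valid_group.tolist())
```

Each strategy yields a different `age` column, so the fair comparison is to fit the same model on each version and compare validate accuracy.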


In [23]:
# label encode sex column

le = preprocessing.LabelEncoder()
df['sex_enc'] = le.fit_transform(df.sex)

# One-hot encode embarked column

encoded_values = sorted(list(df['embarked'].unique()))
le = preprocessing.LabelEncoder()
enc = le.fit_transform(df['embarked'])
ohe_array = np.array(enc).reshape(len(enc), 1)
# scikit-learn 1.2 renamed `sparse` to `sparse_output`; on older versions use sparse=False
ohe = preprocessing.OneHotEncoder(sparse_output=False, categories='auto')
df_ohe = ohe.fit_transform(ohe_array)
enc = pd.DataFrame(data=df_ohe, columns=encoded_values, index=df.index)
df = df.join(enc)

df.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,embark_town,alone,age_imp,sex_enc,C,Q,S
0,0,0,3,male,22.0,1,0,7.25,S,Third,Southampton,0,22.0,1,0.0,0.0,1.0
1,1,1,1,female,38.0,1,0,71.2833,C,First,Cherbourg,0,38.0,0,1.0,0.0,0.0
2,2,1,3,female,26.0,0,0,7.925,S,Third,Southampton,1,26.0,0,0.0,0.0,1.0
3,3,1,1,female,35.0,1,0,53.1,S,First,Southampton,0,35.0,0,0.0,0.0,1.0
4,4,0,3,male,35.0,0,0,8.05,S,Third,Southampton,1,35.0,1,0.0,0.0,1.0


In [24]:
train, test = acquire.split_my_data(df, 0.8)
train, valid = acquire.split_my_data(train, 0.8)

test.shape, train.shape, valid.shape

((179, 17), (569, 17), (143, 17))

In [25]:
X_train = train[['age_imp', 'fare', 'pclass', 'sex_enc', 'C', 'Q', 'S', 'sibsp', 'parch']]
X_valid = valid[['age_imp', 'fare', 'pclass', 'sex_enc', 'C', 'Q', 'S', 'sibsp', 'parch']]

y_train = train.survived
y_valid = valid.survived

In [26]:
logit6 = LogisticRegression().fit(X_train, y_train)

print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit6.score(X_train, y_train)))
print('Accuracy of Logistic Regression classifier on validation set: {:.2f}'
     .format(logit6.score(X_valid, y_valid)))

Accuracy of Logistic Regression classifier on training set: 0.82
Accuracy of Logistic Regression classifier on validation set: 0.78


In [27]:
X_test = test[['age_imp', 'fare', 'pclass', 'sex_enc', 'C', 'Q', 'S', 'sibsp', 'parch']]
y_test = test.survived

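# Score logit6, the model trained on age_imp. The 0.79 recorded below came from
# logit5 (trained on the raw age column), so rerun this cell to refresh it.
print('Accuracy of Logistic Regression classifier on test set: {:.2f}'
     .format(logit6.score(X_test, y_test)))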

Accuracy of Logistic Regression classifier on test set: 0.79


Bonus: How do different strategies for encoding sex affect model performance?
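
One way to approach this question is to fit the same model on several encodings of `sex` and compare scores. The sketch below does that on synthetic data; the column names mirror the Titanic frame, but the values and the survival rule are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): survival depends on sex and fare
rng = np.random.default_rng(3)
n = 400
df = pd.DataFrame({
    "sex": rng.choice(["female", "male"], size=n),
    "fare": rng.exponential(scale=30.0, size=n),
})
logit_true = 1.5 * (df["sex"] == "female") + 0.01 * df["fare"] - 1.0
df["survived"] = (rng.random(n) < 1 / (1 + np.exp(-logit_true))).astype(int)

# Three encodings of the same two-level feature
encodings = {
    "male=1": df[["fare"]].assign(sex_enc=(df["sex"] == "male").astype(int)),
    "female=1": df[["fare"]].assign(sex_enc=(df["sex"] == "female").astype(int)),
    "one-hot (both columns)": df[["fare"]].join(pd.get_dummies(df["sex"], dtype=float)),
}

scores = {}
for name, X in encodings.items():
    model = LogisticRegression(max_iter=1000).fit(X, df["survived"])
    scores[name] = model.score(X, df["survived"])
    print(f"{name}: training accuracy {scores[name]:.3f}")
```

For a binary feature these encodings carry the same information, so any differences in accuracy should be small and come from regularization and solver convergence, not from the encoding itself.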