# Exercise

- In these exercises, we'll continue working with the Titanic dataset and building logistic regression models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.

- For all of the models you create, choose a threshold that optimizes for accuracy.

- Create a new notebook, logistic_regression, and use it to answer the following questions:
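
One way to read the threshold instruction above: instead of accepting the default 0.5 cutoff, sweep candidate cutoffs over the predicted probabilities and keep the one with the highest accuracy. The sketch below is a minimal, library-free illustration. `accuracy_at` and `best_threshold` are hypothetical helpers, and the toy `probs` and `labels` stand in for `lr.predict_proba(X_train[features])[:, 1]` and `y_train`.

```python
def accuracy_at(probs, labels, threshold):
    """Accuracy when predicting class 1 for probabilities >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    correct = sum(int(pred == y) for pred, y in zip(preds, labels))
    return correct / len(labels)


def best_threshold(probs, labels, candidates=None):
    """Return (threshold, accuracy) with the highest accuracy; ties keep the lowest threshold."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    scored = [(accuracy_at(probs, labels, t), t) for t in candidates]
    best_acc, best_t = max(scored, key=lambda pair: (pair[0], -pair[1]))
    return best_t, best_acc


# Toy values standing in for predicted probabilities of class 1 and the true labels
probs = [0.10, 0.35, 0.40, 0.45, 0.60, 0.80, 0.90]
labels = [0, 0, 1, 1, 1, 1, 1]

threshold, acc = best_threshold(probs, labels)
print(threshold, acc)
```

To stay consistent with the train/validate discipline described above, the threshold would be chosen on train and then checked on validate.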

### Data Prep

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import sklearn.linear_model

# ignore warnings
import warnings
warnings.filterwarnings("ignore")


import matplotlib.pyplot as plt
import seaborn as sns

from acquire import get_titanic_data, split_data

In [2]:
df = get_titanic_data()

# Missing ages
avg_age = df.age.mean()
df.age = df.age.fillna(avg_age)

# Encode
df["is_female"] = (df.sex == "female").astype('int')

# More encode
dummy_df = pd.get_dummies(df[['embark_town']], dummy_na=False, drop_first=True)
df = pd.concat([df, dummy_df], axis=1)

# Drop unnecessary columns
df = df.drop(columns=["passenger_id", "deck", "class", "embarked", "sex", "embark_town"])

df.head(3)

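| | survived | pclass | age | sibsp | parch | fare | alone | is_female | embark_town_Queenstown | embark_town_Southampton |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | False | True |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 1 | False | False |
| 2 | 1 | 3 | 26.0 | 0 | 0 | 7.925 | 1 | 1 | False | True |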


In [3]:
df.isna().sum()

survived                   0
pclass                     0
age                        0
sibsp                      0
parch                      0
fare                       0
alone                      0
is_female                  0
embark_town_Queenstown     0
embark_town_Southampton    0
dtype: int64

In [4]:
# Split the datasets
train, validate, test = split_data(df, 'survived')

train: 498 (56.00000000000001% of 891)
validate: 214 (24.0% of 891)
test: 179 (20.0% of 891)
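
The source of `split_data` is not shown in this notebook. The sizes printed above are consistent with holding out 20% for test and then taking 30% of the remainder for validate, so a rough sketch might look like the following. The function name `split_data_sketch`, the seed, and the stratification are assumptions, not the actual `acquire` implementation.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data_sketch(df, target, seed=123):
    """Hypothetical stand-in for acquire.split_data: roughly 56/24/20, stratified."""
    # Hold out 20% of the rows as the test set
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    # Split the remaining 80% into 70% train and 30% validate
    train, validate = train_test_split(
        train_validate, test_size=0.3, random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test


# Synthetic frame with the same row count as the dataset above (891 rows)
toy = pd.DataFrame({
    "survived": [0] * 549 + [1] * 342,
    "fare": range(891),
})
train_s, validate_s, test_s = split_data_sketch(toy, "survived")
print(len(train_s), len(validate_s), len(test_s))
```

Stratifying on the target keeps the survival rate similar across the three splits, which makes accuracy comparisons between them more meaningful.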


In [5]:
# Make X and Y splits
X_train = train.drop(columns=["survived"])
y_train = train.survived

X_validate = validate.drop(columns=["survived"])
y_validate = validate.survived

X_test = test.drop(columns=["survived"])
y_test = test.survived

Insert Exploration Here

In [6]:
# Baseline
print(train['survived'].value_counts())

baseline_accuracy = (train.survived == 0).mean()
round(baseline_accuracy, 2)

survived
0    307
1    191
Name: count, dtype: int64


0.62
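
The baseline above hardcodes class 0 as the prediction. A slightly more general sketch, using only the standard library, picks whichever class is most frequent and reports the accuracy of always predicting it. `baseline_accuracy_from_labels` is a hypothetical helper, and the labels are rebuilt from the counts printed above.

```python
from collections import Counter


def baseline_accuracy_from_labels(labels):
    """Return the majority class and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)


# Class counts taken from the training output above: 307 zeros, 191 ones
labels = [0] * 307 + [1] * 191
majority_class, baseline = baseline_accuracy_from_labels(labels)
print(majority_class, round(baseline, 3))  # 0 0.616
```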

### 1. 
Create a model that includes only age, fare, and pclass. Does this model perform better than your baseline?



In [7]:
# Create the logistic regression
lr = LogisticRegression(random_state=123)

# specify the features we're using
features = ["age", "pclass", "fare"]

# Fit a model using only these specified features
# lr.fit(X_train[["age", "pclass", "fare"]], y_train)
lr.fit(X_train[features], y_train)

# Since we .fit on a subset, we .predict on that same subset of features
y_pred = lr.predict(X_train[features])

print("Baseline is", round(baseline_accuracy, 3))
print("Logistic Regression using", features)
print('Accuracy of Logistic Regression classifier on training set:', round((lr.score(X_train[features], y_train)),3))

Baseline is 0.616
Logistic Regression using ['age', 'pclass', 'fare']
Accuracy of Logistic Regression classifier on training set: 0.703


### 2. 
Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.



In [8]:
# Create
lr2 = LogisticRegression(random_state=123)

# Features
features = ['age', 'pclass', 'fare', 'is_female']

# Fit 
lr2.fit(X_train[features], y_train)

y_pred = lr2.predict(X_train[features])

# Score
print("Baseline is", round(baseline_accuracy, 3))
print('Logistic Regression using', features)
print('Accuracy of Logistic Regression classifier on training set:', round((lr2.score(X_train[features], y_train)), 3))

Baseline is 0.616
Logistic Regression using ['age', 'pclass', 'fare', 'is_female']
Accuracy of Logistic Regression classifier on training set: 0.813


### 3. 
Try out other combinations of features and models.



In [9]:
# All features, all default hyperparameters
lr3 = LogisticRegression(random_state=123)

lr3.fit(X_train, y_train)

y_pred = lr3.predict(X_train)

accuracy = lr3.score(X_train, y_train)

print("Baseline is", round(baseline_accuracy, 2))
print("Test on: All Features")
print('Accuracy on training set:', accuracy)

Baseline is 0.62
Test on: All Features
Accuracy on training set: 0.8152610441767069


In [10]:
# All features, with class_weight='balanced' to weight classes inversely to their frequencies
lr4 = LogisticRegression(random_state=123, class_weight='balanced')

lr4.fit(X_train, y_train)

y_pred = lr4.predict(X_train)

accuracy = lr4.score(X_train, y_train)

print("Baseline is", round(baseline_accuracy, 2))
print("Test on: All Features with class_weight='balanced'")
print('Accuracy on training set:', accuracy)

Baseline is 0.62
Test on: All Features with class_weight='balanced'
Accuracy on training set: 0.8032128514056225


In [11]:
# Only Age 
features = ["age"]

# Single feature, default hyperparameters
lr5 = LogisticRegression(random_state=123)

lr5.fit(X_train[features], y_train)

y_pred = lr5.predict(X_train[features])

accuracy = lr5.score(X_train[features], y_train)

print("Baseline is", round(baseline_accuracy, 2))
print("Test on:", features)
print('Accuracy on training set:', accuracy)

Baseline is 0.62
Test on: ['age']
Accuracy on training set: 0.6164658634538153


In [12]:
# Only pclass
features = ["pclass"]

# Single feature, default hyperparameters
lr6 = LogisticRegression(random_state=123)

lr6.fit(X_train[features], y_train)

y_pred = lr6.predict(X_train[features])

accuracy = lr6.score(X_train[features], y_train)

print("Baseline is", round(baseline_accuracy, 2))
print("Test on:", features)
print('Accuracy on training set:', accuracy)

Baseline is 0.62
Test on: ['pclass']
Accuracy on training set: 0.6666666666666666


In [13]:
# All features, C ~ 0 (very strong regularization)
lr7 = LogisticRegression(random_state=123, C=0.0001)

lr7.fit(X_train, y_train)

y_pred = lr7.predict(X_train)
accuracy = lr7.score(X_train, y_train)

print("Baseline is", round(baseline_accuracy, 2))
print("Test on: All Features, C=0.0001")
print('Accuracy on training set:', accuracy)

Baseline is 0.62
Test on: All Features, C=0.0001
Accuracy on training set: 0.6445783132530121


### 4. 
Use your best 3 models to predict and evaluate on your validate sample.



In [14]:
# Features
features = ['age', 'pclass', 'fare', 'is_female']

y_pred = lr2.predict(X_validate[features])

# Score
print("Baseline is", round(baseline_accuracy, 3))
print('Accuracy is:', round(lr2.score(X_validate[features], y_validate), 3))
print('Logistic Regression using', features)
print(classification_report(y_validate, y_pred))

Baseline is 0.616
Accuracy is: 0.776
Logistic Regression using ['age', 'pclass', 'fare', 'is_female']
              precision    recall  f1-score   support

           0       0.80      0.84      0.82       132
           1       0.72      0.67      0.70        82

    accuracy                           0.78       214
   macro avg       0.76      0.76      0.76       214
weighted avg       0.77      0.78      0.77       214



In [15]:
# All features, with class_weight='balanced'
y_pred = lr4.predict(X_validate)

# Score
print("Baseline is", round(baseline_accuracy, 2))
print('Accuracy is:', round(lr4.score(X_validate, y_validate), 3))
print("Test on: All Features with class_weight='balanced'")
print(classification_report(y_validate, y_pred))

Baseline is 0.62
Accuracy is: 0.776
Test on: All Features with class_weight='balanced'
              precision    recall  f1-score   support

           0       0.82      0.81      0.82       132
           1       0.70      0.72      0.71        82

    accuracy                           0.78       214
   macro avg       0.76      0.77      0.76       214
weighted avg       0.78      0.78      0.78       214



In [16]:
# All features, all default hyperparameters
y_pred = lr3.predict(X_validate)

print("Baseline is", round(baseline_accuracy, 2))
print('Accuracy is:', round(lr3.score(X_validate, y_validate), 3))
print("Test on: All Features, All Default")
print(classification_report(y_validate, y_pred))

Baseline is 0.62
Accuracy is: 0.776
Test on: All Features, All Default
              precision    recall  f1-score   support

           0       0.80      0.86      0.82       132
           1       0.74      0.65      0.69        82

    accuracy                           0.78       214
   macro avg       0.77      0.75      0.76       214
weighted avg       0.77      0.78      0.77       214



### 5.
Choose your best model based on validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? To train?



In [17]:
# All features, with class_weight='balanced'
# Predict on the test split so the report matches the accuracy score
y_pred = lr4.predict(X_test)

# Score
print("Baseline is", round(baseline_accuracy, 2))
print('Accuracy is:', round(lr4.score(X_test, y_test), 3))
print("Test on: All Features with class_weight='balanced'")
print(classification_report(y_test, y_pred))

Baseline is 0.62
Accuracy is: 0.771
Test on: All Features with class_weight='balanced'

Test accuracy (0.771) is close to validate accuracy (0.776) and slightly below train accuracy (0.803), so this model generalizes about as well as the validate results suggested.



## Bonus



### Bonus 1:
How do different strategies for handling the missing values in the age column affect model performance?
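
As a starting point for this comparison, the sketch below fills missing ages with either the mean or the median using scikit-learn's `SimpleImputer`. Note that the data prep above computes the mean over the whole dataset before splitting; fitting the imputer on train only, as done here, avoids leaking information from validate and test. The toy arrays are made up for illustration and stand in for the `age` column of each split.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy ages standing in for the train and validate splits (NaN marks a missing age)
age_train = np.array([[22.0], [38.0], [np.nan], [35.0], [54.0], [2.0]])
age_validate = np.array([[np.nan], [27.0], [14.0]])

filled = {}
for strategy in ["mean", "median"]:
    imputer = SimpleImputer(strategy=strategy)
    # Learn the fill value from train only, then reuse it on validate
    imputer.fit(age_train)
    filled[strategy] = imputer.transform(age_validate)
    print(strategy, imputer.statistics_[0], filled[strategy].ravel())
```

Each filled version of the data can then be passed through the same model and scored on validate to see whether the choice of strategy matters.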



### Bonus 2:
How do different strategies for encoding sex affect model performance?



### Bonus 3:
scikit-learn's LogisticRegression classifier applies a regularization penalty to the coefficients by default. This penalty causes the magnitude of the coefficients in the resulting model to be smaller than they otherwise would be. The strength of the penalty can be modified with the C hyperparameter. Small values of C correspond to a larger penalty, and large values of C correspond to a smaller penalty.

Try out the following values for C and note how the coefficients and the model's performance on both the dataset it was trained on and on the validate split are affected.

C = 0.01, 0.1, 1, 10, 100, 1000
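
A minimal sketch of that loop is shown below. It runs on synthetic data from `make_classification` so that it is self-contained; in the notebook, `X_tr`/`y_tr` and `X_val`/`y_val` would be replaced by the train and validate splits. The `max_iter=1000` setting is an assumption added to give the solver room to converge at large C.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in the notebook use the train and validate splits instead
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=2, random_state=123
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

results = []
for C in [0.01, 0.1, 1, 10, 100, 1000]:
    model = LogisticRegression(C=C, random_state=123, max_iter=1000)
    model.fit(X_tr, y_tr)
    results.append({
        "C": C,
        # Overall size of the coefficient vector; shrinks as the penalty grows
        "coef_norm": float(np.linalg.norm(model.coef_)),
        "train_acc": model.score(X_tr, y_tr),
        "validate_acc": model.score(X_val, y_val),
    })

for row in results:
    print(row)
```

Comparing `train_acc` against `validate_acc` across the rows is one way to see whether a weaker penalty is starting to overfit.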



### Bonus 4: 
How does scaling the data interact with your choice of C?
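
One way to explore this: the penalty acts on coefficient size, and coefficient size depends on the units of each feature, so the same C can penalize features unevenly when their scales differ. The sketch below, on synthetic data, multiplies one column by 1000 and compares fits with and without a `StandardScaler` step. The helper `fit_coefs` and the chosen C values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=123
)

# Same data, but with the first column multiplied by 1000 (as if recorded in a smaller unit)
X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000


def fit_coefs(features, C, scale):
    """Fit a logistic regression (optionally after standardizing) and return its coefficients."""
    model = LogisticRegression(C=C, random_state=123, max_iter=5000)
    if scale:
        model = make_pipeline(StandardScaler(), model)
    model.fit(features, y)
    fitted = model[-1] if scale else model
    return fitted.coef_.ravel()


for C in [0.01, 1, 100]:
    raw_a = fit_coefs(X, C, scale=False)
    raw_b = fit_coefs(X_rescaled, C, scale=False)
    scaled_a = fit_coefs(X, C, scale=True)
    scaled_b = fit_coefs(X_rescaled, C, scale=True)
    print("C =", C)
    print("  unscaled:", raw_a.round(3), "vs", raw_b.round(3))
    print("  scaled:  ", scaled_a.round(3), "vs", scaled_b.round(3))
```

With standardization, the fitted coefficients no longer depend on the original units, so a given C means the same thing for every feature.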

# Notes

### Logistic Regression Pros

- High interpretability. It's explainable to others, i.e. it's useful for understanding the influence of several independent variables on a single outcome variable.

- We can choose to ‘snap’ predictions to 0 and 1 via a rule (such as if < .5, 0 else 1) OR we can choose to use the output as is, which is a probability of being class 1.

- It’s a good place to start as a benchmark for comparing with other classification algorithms.

- It’s efficient to train and does not require many computational resources.

- Outputs clear predicted probabilities.
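
The interpretability and "snap" points above can be made concrete with a small standard-library sketch: a fitted logistic regression turns a weighted sum of the features into a probability with the sigmoid function, and a threshold rule converts that probability into a 0/1 prediction. The weights and intercept here are made-up values for illustration, not coefficients fitted to the Titanic data.

```python
import math


def sigmoid(z):
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba_one(x, weights, intercept):
    """Probability of class 1 for one observation under a logistic model."""
    z = intercept + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


def snap(probability, threshold=0.5):
    """Turn a probability into a hard 0/1 prediction."""
    return 0 if probability < threshold else 1


# Hypothetical coefficients for [is_female, pclass], not fitted to real data
weights = [2.5, -1.0]
intercept = 0.5

p = predict_proba_one([1, 3], weights, intercept)  # z = 0.5 + 2.5 - 3.0 = 0.0
print(p, snap(p))
```

Each weight is the change in log-odds for a one-unit change in its feature, which is what makes the model straightforward to explain.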

### Cons

- Assumes the predictors are not highly correlated with each other (little or no multicollinearity).

- Missing values must be dealt with prior to fitting the model.

- We can’t solve non-linear problems with logistic regression unless we engineer additional features, since its decision surface is linear in the features.

- Not always as accurate as other classification algorithms.
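
The linear-decision-surface limitation can be illustrated with an XOR-like toy problem, sketched below on synthetic data. A plain logistic regression on the two raw features does little better than guessing, while adding their product as a third feature lets the same model separate the classes. The data and the choice of interaction term are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(123)
X = rng.uniform(-1, 1, size=(400, 2))
# XOR-like target: class 1 when both features share the same sign
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Raw features only: no single straight line separates the classes
linear_acc = LogisticRegression(random_state=123).fit(X, y).score(X, y)

# Add the interaction term as a third column
X_inter = np.column_stack([X, X[:, 0] * X[:, 1]])
inter_acc = LogisticRegression(random_state=123).fit(X_inter, y).score(X_inter, y)

print(round(linear_acc, 3), round(inter_acc, 3))
```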