# Logistic Regression Exercises
1. Create a model that includes only age, fare, and pclass. Does this model perform better than your baseline?

2. Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.

3. Try out other combinations of features and models.

4. Use your best 3 models to predict and evaluate on your validate sample.

5. Choose your best model based on validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? To train?

#### Bonus1:
How do different strategies for handling the missing values in the age column affect model performance?

#### Bonus2:
How do different strategies for encoding sex affect model performance?

#### Bonus3:
scikit-learn's LogisticRegression classifier applies a regularization penalty to the coefficients by default. This penalty causes the magnitude of the coefficients in the resulting model to be smaller than they otherwise would be. The strength of the penalty can be modified with the C hyperparameter. Small values of C correspond to a larger penalty, and large values of C correspond to a smaller penalty.
Try out the following values for C and note how the coefficients and the model's performance on both the dataset it was trained on and on the validate split are affected.

#### Bonus Bonus:
How does scaling the data interact with your choice of C?
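
One way to approach Bonus3 and the scaling question is sketched below. It is only an illustration: the data is synthetic (from `make_classification`) rather than the Titanic splits, and the `C` values in `c_values` are assumptions chosen for the example, not the values the exercise refers to. The sketch fits the same model on raw and standardized features for each `C` and reports the coefficient norm and validate accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Titanic splits)
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=0, random_state=42)
X[:, 0] *= 100  # put one feature on a much larger scale than the rest

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the scaler on the training split only, then transform both splits
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_val_s = scaler.transform(X_tr), scaler.transform(X_val)

c_values = [0.01, 0.1, 1, 10, 100]  # illustrative values, not from the exercise
norms_raw, norms_scaled = {}, {}
for c in c_values:
    raw = LogisticRegression(C=c, max_iter=5000).fit(X_tr, y_tr)
    scaled = LogisticRegression(C=c, max_iter=5000).fit(X_tr_s, y_tr)
    # L2 norm of the coefficient vector: a single number for "how large"
    norms_raw[c] = float(np.linalg.norm(raw.coef_))
    norms_scaled[c] = float(np.linalg.norm(scaled.coef_))
    print(f'C={c:<5} raw: |coef|={norms_raw[c]:.3f} val={raw.score(X_val, y_val):.2f}  '
          f'scaled: |coef|={norms_scaled[c]:.3f} val={scaled.score(X_val_s, y_val):.2f}')
```

Because the penalty acts on coefficient magnitude, a feature measured in large units needs only a small coefficient and is penalized less than the others. Standardizing first puts every feature on the same footing.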



In [3]:
# imports

# DS libs - tab data
import numpy as np
import pandas as pd

# viz
import matplotlib.pyplot as plt
import seaborn as sns

# modeling
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# data
from acquire import get_titanic_data
from prepare import prep_titanic, split_data


In [88]:
# acquire and prep data
train, validate, test = split_data(prep_titanic(get_titanic_data()), target='survived')

> #### 1. Create a model that includes only age, fare, and pclass. Does this model perform better than your baseline?

In [52]:
# column names of interest
y_col = 'survived'
x_cols = ['age', 'fare','pclass']

In [53]:
# establish x and y train, validate and test with features of interest

X_train = train[x_cols]
y_train = train[y_col]

X_validate = validate[x_cols]
y_validate = validate[y_col]

X_test = test[x_cols]
y_test = test[y_col]

In [54]:
# create
lr = LogisticRegression()

# fit
lr.fit(X_train, y_train)

# predict
y_preds = lr.predict(X_train)

In [55]:
# creating baseline

# most common outcome
train.survived.mode()

# Creating an array of zeros the same length as survived
baseline_preds = np.zeros(len(train.survived))

# checking for correctness
len(baseline_preds), len(train.survived)

(498, 498)
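
The baseline cell hardcodes zeros after looking at the mode by eye. A minimal sketch of deriving the baseline directly from the mode follows. The toy series `y` is an assumption that only mirrors the class counts of the train split (307 and 191), not the real rows.

```python
import numpy as np
import pandas as pd

# Toy target with the same class counts as the train split (307 / 191)
y = pd.Series([0] * 307 + [1] * 191, name='survived')

# mode() returns a Series; its first element is the most common class
most_common = y.mode()[0]

# Repeat that class once per row instead of hardcoding zeros
baseline = np.full(len(y), most_common)

# Share of rows where the constant prediction matches the target
baseline_accuracy = (baseline == y).mean()
print(most_common, round(baseline_accuracy, 2))  # prints: 0 0.62
```

Written this way, the baseline stays correct even if the majority class changes after a different split or filter.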

In [56]:
# Comparing the scores of both on the train set
print(classification_report(y_train, y_preds))
print()
print(classification_report(y_train, baseline_preds))

              precision    recall  f1-score   support

           0       0.71      0.86      0.78       307
           1       0.66      0.43      0.52       191

    accuracy                           0.69       498
   macro avg       0.68      0.64      0.65       498
weighted avg       0.69      0.69      0.68       498


              precision    recall  f1-score   support

           0       0.62      1.00      0.76       307
           1       0.00      0.00      0.00       191

    accuracy                           0.62       498
   macro avg       0.31      0.50      0.38       498
weighted avg       0.38      0.62      0.47       498





>> #### Takeaways:
>>- The first model performs better than the baseline in all areas except class-0 recall, which is to be expected because the baseline predicts 0 for every passenger
>> #### Actions:
>>- Try on the validate dataset
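
The baseline rows of the report can be reproduced by hand, which shows why its class-0 recall is a perfect 1.00. The sketch below uses only the class counts from the support column (307 and 191); the variable names are chosen for illustration.

```python
# Class counts taken from the support column of the train report
n_died, n_survived = 307, 191
total = n_died + n_survived

# The baseline predicts 0 for every row, so for class 0:
tp0 = n_died      # every actual 0 is predicted 0
fp0 = n_survived  # every actual 1 is also predicted 0
fn0 = 0           # no actual 0 is ever predicted 1

precision0 = tp0 / (tp0 + fp0)
recall0 = tp0 / (tp0 + fn0)
f1_0 = 2 * precision0 * recall0 / (precision0 + recall0)

# Class 1 is never predicted, so its precision is undefined;
# scikit-learn reports it as 0.0, and recall and f1 are 0.0 as well
precision1 = recall1 = f1_1 = 0.0

# Macro average ignores class size; weighted average uses the support
macro_precision = (precision0 + precision1) / 2
weighted_precision = (precision0 * n_died + precision1 * n_survived) / total

print(round(precision0, 2), round(recall0, 2), round(f1_0, 2))  # prints: 0.62 1.0 0.76
print(round(macro_precision, 2), round(weighted_precision, 2))  # prints: 0.31 0.38
```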

In [64]:
# predict
y_preds2 = lr.predict(X_validate)

In [58]:
# creating baseline

# most common outcome
validate.survived.mode()

# Creating an array of zeros the same length as survived
baseline_preds2 = np.zeros(len(validate.survived))

# checking for correctness
len(baseline_preds2), len(validate.survived)

(214, 214)

In [59]:
print(classification_report(y_validate, y_preds2))
print()
print(classification_report(y_validate, baseline_preds2))

              precision    recall  f1-score   support

           0       0.69      0.94      0.80       132
           1       0.77      0.33      0.46        82

    accuracy                           0.71       214
   macro avg       0.73      0.63      0.63       214
weighted avg       0.72      0.71      0.67       214


              precision    recall  f1-score   support

           0       0.62      1.00      0.76       132
           1       0.00      0.00      0.00        82

    accuracy                           0.62       214
   macro avg       0.31      0.50      0.38       214
weighted avg       0.38      0.62      0.47       214





>> #### Takeaways:
>>- The model still outperformed the baseline with the validate set
>>- Only a slight reduction in class-0 precision (0.71 to 0.69), while class-0 recall, f1, and accuracy all increased when run on the validate set
>> #### Actions:
>>- No further actions. Next question.
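
Because the all-zero baseline never predicts class 1, that class's precision is undefined, and scikit-learn emits an undefined-metric warning when it fills in 0.00. A minimal sketch of making that choice explicit with the `zero_division` argument of `classification_report` follows. The arrays are toy data that only mirror the validate class counts (132 and 82).

```python
import numpy as np
from sklearn.metrics import classification_report

# Toy target with the same class counts as the validate split (132 / 82)
y_true = np.array([0] * 132 + [1] * 82)
baseline = np.zeros(len(y_true), dtype=int)

# zero_division=0 reports undefined metrics as 0 without a warning;
# output_dict=True returns the report as a nested dict instead of text
report = classification_report(y_true, baseline, zero_division=0, output_dict=True)

print(report['1']['precision'], round(report['accuracy'], 2))
```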

> #### 2. Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.

In [62]:
# column names of interest with dummy variable for sex included
y_col = 'survived'
x_cols = ['age', 'fare','pclass', 'sex_male']
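
The `sex_male` column is produced inside `prep_titanic`, whose code is not shown here. As a rough sketch of how such a dummy column can be created, assuming a raw `sex` column holding the strings `male` and `female` (the toy frame below is hypothetical):

```python
import pandas as pd

# Toy frame; the real one comes from prep_titanic, whose internals are not shown
df = pd.DataFrame({'sex': ['male', 'female', 'female', 'male'],
                   'age': [22.0, 38.0, 26.0, 35.0]})

# drop_first=True drops the first category alphabetically ('female'),
# leaving a single 0/1 column named 'sex_male'
dummies = pd.get_dummies(df[['sex']], drop_first=True, dtype=int)

# Replace the string column with its encoded version
encoded = pd.concat([df.drop(columns='sex'), dummies], axis=1)
print(encoded.columns.tolist())  # prints: ['age', 'sex_male']
```

Keeping one column instead of two avoids feeding the model a pair of perfectly redundant features.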

In [61]:
# establish x and y train, validate and test with features of interest

X_train = train[x_cols]
y_train = train[y_col]

X_validate = validate[x_cols]
y_validate = validate[y_col]

X_test = test[x_cols]
y_test = test[y_col]

In [65]:
# Create it
logit = LogisticRegression()

# fit it
logit.fit(X_train, y_train)

# predict
y_preds = logit.predict(X_train)

In [68]:
# get scores for both
print(classification_report(y_train, y_preds))

print(classification_report(y_train, baseline_preds))

              precision    recall  f1-score   support

           0       0.81      0.86      0.83       307
           1       0.75      0.69      0.72       191

    accuracy                           0.79       498
   macro avg       0.78      0.77      0.78       498
weighted avg       0.79      0.79      0.79       498

              precision    recall  f1-score   support

           0       0.62      1.00      0.76       307
           1       0.00      0.00      0.00       191

    accuracy                           0.62       498
   macro avg       0.31      0.50      0.38       498
weighted avg       0.38      0.62      0.47       498





>> #### Takeaways:
>>- Adding sex increases precision, f1, and accuracy on the train set (accuracy rises from 0.69 to 0.79); class-0 recall is unchanged at 0.86
>> #### Actions:
>>- Test on the validate set

In [69]:
# predicting on the validate set
y_preds_v = logit.predict(X_validate)

In [71]:
print(classification_report(y_validate, y_preds_v))

print(classification_report(y_validate, baseline_preds2))

              precision    recall  f1-score   support

           0       0.80      0.84      0.82       132
           1       0.72      0.67      0.70        82

    accuracy                           0.78       214
   macro avg       0.76      0.76      0.76       214
weighted avg       0.77      0.78      0.77       214

              precision    recall  f1-score   support

           0       0.62      1.00      0.76       132
           1       0.00      0.00      0.00        82

    accuracy                           0.62       214
   macro avg       0.31      0.50      0.38       214
weighted avg       0.38      0.62      0.47       214





>> #### Takeaways:
>>- This model works better than the first model and the baseline on both train and validate, with the exception of class-0 recall on the validate set (0.84, versus 0.94 for the first model and 1.00 for the baseline)
>> #### Actions:
>>- Create a list of all the columns that are not objects
>>- Create various combinations of the column names using itertools
>>- Create a for loop that will cycle through the different combinations of the column names and fit and score a model for each
>>- Create a dictionary to store all the values for comparison

In [231]:
X_cols = [col for col in train.columns if train[col].dtype != 'O']
X_cols.remove(y_col)
X_cols.remove('sibsp')
X_cols.remove('pclass')
print(X_cols)

['age', 'parch', 'fare', 'alone', 'embark_town_Queenstown', 'embark_town_Southampton', 'sex_male']


In [76]:
import itertools

In [233]:
col_combos = {}
for i in range(2, len(X_cols) +1):
    col_combos[f'{i} Combos'] = list(itertools.combinations(X_cols, i))
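
Before fitting a model per combination, it helps to know how many models that implies. The sketch below counts the subsets for the seven-column `X_cols` list printed by the cell that builds it; the `features` list is copied from that output.

```python
import itertools
from math import comb

# The seven feature names from the printed X_cols list
features = ['age', 'parch', 'fare', 'alone', 'embark_town_Queenstown',
            'embark_town_Southampton', 'sex_male']

# Same grouping as col_combos: one entry per subset size, sizes 2 through 7
sizes = {i: len(list(itertools.combinations(features, i)))
         for i in range(2, len(features) + 1)}
total_models = sum(sizes.values())

print(sizes)         # prints: {2: 21, 3: 35, 4: 35, 5: 21, 6: 7, 7: 1}
print(total_models)  # prints: 120
```

The total is 2^7 minus the empty set and the seven single-feature subsets, so each added feature roughly doubles the number of models to fit.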

In [226]:
col_combos

{'2 Combos': [('pclass', 'age'),
  ('pclass', 'parch'),
  ('pclass', 'fare'),
  ('pclass', 'alone'),
  ('pclass', 'embark_town_Queenstown'),
  ('pclass', 'embark_town_Southampton'),
  ('pclass', 'sex_male'),
  ('age', 'parch'),
  ('age', 'fare'),
  ('age', 'alone'),
  ('age', 'embark_town_Queenstown'),
  ('age', 'embark_town_Southampton'),
  ('age', 'sex_male'),
  ('parch', 'fare'),
  ('parch', 'alone'),
  ('parch', 'embark_town_Queenstown'),
  ('parch', 'embark_town_Southampton'),
  ('parch', 'sex_male'),
  ('fare', 'alone'),
  ('fare', 'embark_town_Queenstown'),
  ('fare', 'embark_town_Southampton'),
  ('fare', 'sex_male'),
  ('alone', 'embark_town_Queenstown'),
  ('alone', 'embark_town_Southampton'),
  ('alone', 'sex_male'),
  ('embark_town_Queenstown', 'embark_town_Southampton'),
  ('embark_town_Queenstown', 'sex_male'),
  ('embark_town_Southampton', 'sex_male')],
 '3 Combos': [('pclass', 'age', 'parch'),
  ('pclass', 'age', 'fare'),
  ('pclass', 'age', 'alone'),
  ('pclass', 'age'

In [234]:
logit_dict = {}
for k in col_combos:
    logit_dict[k] = {}
    # enumerate gives each combination its index within this group
    for count, combo in enumerate(col_combos[k]):
        x_cols = list(combo)
        y_col = 'survived'
        logit = LogisticRegression()
        logit.fit(train[x_cols], train[y_col])
        # score once per split and reuse the values for the difference
        train_score = logit.score(train[x_cols], train[y_col])
        validate_score = logit.score(validate[x_cols], validate[y_col])
        # key is '<combo size>.<index>', e.g. '3.4'
        logit_dict[k][f'{k[0]}.{count}'] = {
            'Train Score': round(train_score, 2),
            'Validate Score': round(validate_score, 2),
            'Difference': round(train_score - validate_score, 2)
        }

In [238]:
max_diff = .03
for k in logit_dict:
    for i in logit_dict[k]:
        if abs(logit_dict[k][i]['Difference']) < abs(max_diff) and logit_dict[k][i]['Train Score'] > .78:
            print(i, logit_dict[k][i])

3.4 {'Train Score': 0.79, 'Validate Score': 0.78, 'Difference': 0.01}
3.18 {'Train Score': 0.79, 'Validate Score': 0.77, 'Difference': 0.01}
3.21 {'Train Score': 0.79, 'Validate Score': 0.78, 'Difference': 0.01}
4.6 {'Train Score': 0.79, 'Validate Score': 0.78, 'Difference': 0.01}
4.8 {'Train Score': 0.79, 'Validate Score': 0.78, 'Difference': 0.01}
4.22 {'Train Score': 0.79, 'Validate Score': 0.77, 'Difference': 0.02}
4.24 {'Train Score': 0.79, 'Validate Score': 0.77, 'Difference': 0.01}
4.27 {'Train Score': 0.79, 'Validate Score': 0.78, 'Difference': 0.01}
5.4 {'Train Score': 0.79, 'Validate Score': 0.78, 'Difference': 0.01}
5.7 {'Train Score': 0.79, 'Validate Score': 0.78, 'Difference': 0.01}
5.8 {'Train Score': 0.79, 'Validate Score': 0.78, 'Difference': 0.01}
5.9 {'Train Score': 0.79, 'Validate Score': 0.78, 'Difference': 0.01}
5.16 {'Train Score': 0.79, 'Validate Score': 0.77, 'Difference': 0.02}
5.17 {'Train Score': 0.79, 'Validate Score': 0.77, 'Difference': 0.01}
5.19 {'Train 
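
A long printed list like this is hard to rank by eye. As a sketch of an alternative, the nested dictionary can be flattened into a DataFrame and sorted. The `logit_dict` below is a small stand-in holding three rows copied from the printed output, not the full results.

```python
import pandas as pd

# Three rows copied from the printed output; the real logit_dict is nested
# the same way, as {group: {model_id: scores}}
logit_dict = {
    '3 Combos': {
        '3.4': {'Train Score': 0.79, 'Validate Score': 0.78, 'Difference': 0.01},
        '3.18': {'Train Score': 0.79, 'Validate Score': 0.77, 'Difference': 0.01},
    },
    '4 Combos': {
        '4.22': {'Train Score': 0.79, 'Validate Score': 0.77, 'Difference': 0.02},
    },
}

# Flatten to {model_id: scores}, then build one row per model
flat = {model_id: scores
        for group in logit_dict.values()
        for model_id, scores in group.items()}
results = pd.DataFrame.from_dict(flat, orient='index')

# Same filter as the loop: small train/validate gap, train accuracy above 0.78
keep = results[(results['Difference'].abs() < 0.03) & (results['Train Score'] > 0.78)]

# Best validate score first; ties broken by the smaller gap
ranked = keep.sort_values(['Validate Score', 'Difference'], ascending=[False, True])
print(ranked)
```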

In [243]:
# the best performers
5.19, col_combos['5 Combos'][19]

(5.19,
 ('parch',
  'alone',
  'embark_town_Queenstown',
  'embark_town_Southampton',
  'sex_male'))