### Model exercises
Using the Titanic data, in your classification-exercises repository, create a notebook, model.ipynb, where you will do the following:
1. What is your baseline prediction? What is your baseline accuracy? Remember: your baseline prediction for a classification problem is the most prevalent class in the training dataset (the mode). When you make those predictions, what is your accuracy? This is your baseline accuracy.
2. Fit the decision tree classifier to your training sample and make predictions on the training sample.
3. Evaluate your in-sample results using the model score, confusion matrix, and classification report.
4. Compute: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.
5. Run through steps 2-4 using a different max_depth value.
6. Which model performs better on your in-sample data?
7. Which model performs best on your out-of-sample data, the validate set?
8. Work through these same exercises using the Telco dataset.

In [1]:
# Env Set up
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import acquire as acq
import prepare as pp
# Decision Tree and Model Evaluation Imports
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.model_selection import train_test_split
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

In [2]:
titanic = acq.get_titanic_data()

Using cached csv


In [3]:
titanic = pp.prep_titanic(titanic)

Using cached csv
Data cleaned for duplicates, columns dropped [deck, embarked, class, age], filled na, and added numerical versions of sex and embark


In [7]:
titanic.shape

(891, 12)

In [5]:
train, validate, test = pp.train_validate_test_split(titanic, target = 'survived')

In [6]:
train.shape, validate.shape, test.shape

((498, 12), (214, 12), (179, 12))
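The split helper lives in the author's `prepare` module and its code is not shown here. As a rough sketch of how such a helper could produce the 498 / 214 / 179 row counts above, the function below calls scikit-learn's `train_test_split` twice (20% test, then 30% of the remainder as validate), stratifying on the target. The function name, the proportions, and the seed are assumptions for illustration, not the contents of `prepare.py`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def train_validate_test_split_sketch(df, target, seed=123):
    """Hypothetical stand-in for pp.train_validate_test_split (assumed behavior)."""
    # Hold out 20% of all rows as the test set, stratified on the target.
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    # Split the remaining 80% into train (70%) and validate (30%).
    train, validate = train_test_split(
        train_validate,
        test_size=0.3,
        random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test


# Synthetic frame with the same row count as the prepared Titanic data.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "survived": rng.integers(0, 2, size=891),
    "fare": rng.random(891),
})

train_demo, validate_demo, test_demo = train_validate_test_split_sketch(demo, "survived")
print(len(train_demo), len(validate_demo), len(test_demo))  # 498 214 179
```

Stratifying keeps the share of survivors roughly equal across the three sets, which makes the baseline and the model scores comparable between them.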

In [8]:
# What is your baseline prediction? 
train.survived.value_counts()

0    307
1    191
Name: survived, dtype: int64

In [10]:
# baseline accuracy
train['baseline'] = 0
print(f' Baseline accuracy is: {(train.survived==train.baseline).mean():.2%}')

 Baseline accuracy is: 61.65%
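The cell above hard-codes the baseline class as 0 after reading it off `value_counts()`. A minimal sketch of deriving the mode programmatically is shown below; it uses only the standard library and rebuilds the labels from the class counts printed above (307 and 191). The helper name `baseline_accuracy` is made up for this example.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Return (most common class, share of rows belonging to that class)."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)


# Class counts taken from train.survived.value_counts() above.
labels = [0] * 307 + [1] * 191

majority, accuracy = baseline_accuracy(labels)
print(f"Baseline prediction: {majority}, baseline accuracy: {accuracy:.2%}")
# Baseline prediction: 0, baseline accuracy: 61.65%
```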


Fit the decision tree classifier to your training sample and make predictions on the training sample.


In [11]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 498 entries, 583 to 744
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   passenger_id             498 non-null    int64  
 1   survived                 498 non-null    int64  
 2   pclass                   498 non-null    int64  
 3   sex                      498 non-null    object 
 4   sibsp                    498 non-null    int64  
 5   parch                    498 non-null    int64  
 6   fare                     498 non-null    float64
 7   embark_town              498 non-null    object 
 8   alone                    498 non-null    int64  
 9   sex_male                 498 non-null    uint8  
 10  embark_town_Queenstown   498 non-null    uint8  
 11  embark_town_Southampton  498 non-null    uint8  
 12  baseline                 498 non-null    int64  
dtypes: float64(1), int64(7), object(2), uint8(3)
memory usage: 44.3+ KB


In [12]:
# Feature selection
features = ['pclass', 'embark_town_Queenstown', 'embark_town_Southampton', 'sex_male']

In [13]:
# Variables
x_train = train[features]
y_train = train[['survived']]

x_validate = validate[features]
y_validate = validate[['survived']]

x_test = test[features]
y_test = test[['survived']]

In [14]:
x_train.head()

     pclass  embark_town_Queenstown  embark_town_Southampton  sex_male
583       1                       0                        0         1
165       3                       0                        1         1
50        3                       0                        1         1
259       2                       0                        1         0
306       1                       0                        0         0


In [15]:
y_train[:5]

     survived
583         0
165         1
50          0
259         1
306         1


In [16]:
y_train.value_counts()

survived
0           307
1           191
dtype: int64

In [17]:
tree = DecisionTreeClassifier(max_depth = 3)

In [18]:
tree= tree.fit(x_train,y_train)

In [19]:
y_pred = tree.predict(x_train)

In [23]:
# Rudimentary visualization of model structure
print(export_text(tree, feature_names=x_train.columns.tolist()))

|--- sex_male <= 0.50
|   |--- pclass <= 2.50
|   |   |--- embark_town_Southampton <= 0.50
|   |   |   |--- class: 1
|   |   |--- embark_town_Southampton >  0.50
|   |   |   |--- class: 1
|   |--- pclass >  2.50
|   |   |--- embark_town_Southampton <= 0.50
|   |   |   |--- class: 1
|   |   |--- embark_town_Southampton >  0.50
|   |   |   |--- class: 0
|--- sex_male >  0.50
|   |--- pclass <= 1.50
|   |   |--- embark_town_Queenstown <= 0.50
|   |   |   |--- class: 0
|   |   |--- embark_town_Queenstown >  0.50
|   |   |   |--- class: 0
|   |--- pclass >  1.50
|   |   |--- pclass <= 2.50
|   |   |   |--- class: 0
|   |   |--- pclass >  2.50
|   |   |   |--- class: 0



In [35]:
# Evaluate your in-sample results using the model score, confusion matrix, and classification report.
print(f'Accuracy score on training set is: {tree.score(x_train,y_train):.2%}')

Accuracy score on training set is: 81.93%


In [37]:
labels = sorted(y_train.survived.unique())
pd.DataFrame(confusion_matrix(y_train, y_pred), index = labels, columns = labels)

     0    1
0  294   13
1   77  114


In [38]:
# rows are truth, columns are pred
tn, fp, fn, tp = confusion_matrix(y_train, y_pred).ravel()
(tn, fp, fn, tp)

(294, 13, 77, 114)

In [39]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.96      0.87       307
           1       0.90      0.60      0.72       191

    accuracy                           0.82       498
   macro avg       0.85      0.78      0.79       498
weighted avg       0.83      0.82      0.81       498



Compute: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [42]:
# Formulas
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp/(tp+fp) 
recall = tp/(tp+fn)
f1 = (2* precision * recall)/(precision + recall)

In [48]:
# Written out answers
print(f"False positive rate: {fp/(fp+tn):.2%}")
print(f"False negative rate: {fn/(fn+tp):.2%}")
print(f"True positive rate: {tp/(tp+fn):.2%}")
print(f"True negative rate: {tn/(fp+tn):.2%}")
print(f"Accuracy rate: {(tp + tn) / (tp + tn + fp + fn):.2%}")
print(f"Precision rate: {precision:.2%}")
print(f"Recall rate: {recall:.2%}")
print(f"F1 score: {f1: .2%}")


False positive rate: 4.23%
False negative rate: 40.31%
True positive rate: 59.69%
True negative rate: 95.77%
Accuracy rate: 81.93%
Precision rate: 89.76%
Recall rate: 59.69%
F1 score:  71.70%
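The exercise also asks for support, which the cells above do not compute directly. As a sketch, the helper below derives every requested metric from the four confusion-matrix counts, treating class 1 (survived) as the positive class. The function name `binary_metrics` and the dictionary keys are chosen for this example; the counts are the ones returned by `confusion_matrix(...).ravel()` above.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)   # of predicted positives, share that are correct
    recall = tp / (tp + fn)      # of actual positives, share that are found
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": recall,            # same quantity as recall
        "false_positive_rate": fp / (fp + tn),
        "true_negative_rate": tn / (tn + fp),
        "false_negative_rate": fn / (fn + tp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "support_negative": tn + fp,             # actual class-0 rows
        "support_positive": fn + tp,             # actual class-1 rows
    }


# Counts from the training-set confusion matrix above.
metrics = binary_metrics(tn=294, fp=13, fn=77, tp=114)

for name, value in metrics.items():
    if name.startswith("support"):
        print(f"{name}: {value}")
    else:
        print(f"{name}: {value:.2%}")
```

The two support values (307 and 191) match the `support` column of the classification report, and each rate pair sums to one: TPR + FNR and TNR + FPR.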


Run through steps 2-4 using a different max_depth value.

In [49]:
def decision_tree(train, d = 5, print_results = True):
    
    selected_features = ['pclass','embark_town_Queenstown','embark_town_Southampton','sex_male']
    X_train = train[selected_features]
    y_train = train[['survived']]
    ship = DecisionTreeClassifier(max_depth=d, random_state=123)
    ship = ship.fit(X_train, y_train)
    y_pred = ship.predict(X_train)
    if print_results:
        print("TRAINING RESULTS")
        print("----------------")
        print(f"Accuracy score on training set is: {ship.score(X_train, y_train):.2f}")
        print(classification_report(y_train, y_pred))

        tn, fp, fn, tp = confusion_matrix(y_train, y_pred).ravel()

        print(f"False positive rate: {fp/(fp+tn):.2%}")
        print(f"False negative rate: {fn/(fn+tp):.2%}")
        print(f"True positive rate: {tp/(tp+fn):.2%}")
        print(f"True negative rate: {tn/(fp+tn):.2%}")
        print("----------------")
    
    return ship

In [50]:
for i in[3,5]:
    print(f'For decision tree with depth{i}:')
    decision_tree(train, d=i)

For decision tree with depth3:
TRAINING RESULTS
----------------
Accuracy score on training set is: 0.82
              precision    recall  f1-score   support

           0       0.79      0.96      0.87       307
           1       0.90      0.60      0.72       191

    accuracy                           0.82       498
   macro avg       0.85      0.78      0.79       498
weighted avg       0.83      0.82      0.81       498

False positive rate: 4.23%
False negative rate: 40.31%
True positive rate: 59.69%
True negative rate: 95.77%
----------------
For decision tree with depth5:
TRAINING RESULTS
----------------
Accuracy score on training set is: 0.82
              precision    recall  f1-score   support

           0       0.79      0.96      0.87       307
           1       0.90      0.60      0.72       191

    accuracy                           0.82       498
   macro avg       0.85      0.78      0.79       498
weighted avg       0.83      0.82      0.81       498

False posi

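### Which model performs better on your in-sample data?
### Takeaway:
- The trees with max_depth 3 and 5 produce identical in-sample results (accuracy 0.82, with the same precision, recall, and f1-score), so neither performs better on the training data.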
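One plausible reason the two depths give the same results is that the four selected features can only take a small number of distinct combinations. The sketch below counts them, assuming `pclass` takes the values 1 to 3 and that the two embark-town dummies are mutually exclusive (both 0 for the remaining town). These value sets are assumptions about the prepared data, not something verified in this notebook.

```python
from itertools import product

# Assumed value sets for the four selected features.
pclass_values = [1, 2, 3]
# (embark_town_Queenstown, embark_town_Southampton); (0, 0) is the remaining town.
embark_dummies = [(0, 0), (1, 0), (0, 1)]
sex_male_values = [0, 1]

cells = [
    (pclass, queenstown, southampton, sex_male)
    for pclass, (queenstown, southampton), sex_male in product(
        pclass_values, embark_dummies, sex_male_values
    )
]

# A binary tree of depth d has at most 2**d leaves.
max_leaves = {depth: 2 ** depth for depth in (3, 5)}

print(f"Distinct feature combinations: {len(cells)}")
for depth, leaves in max_leaves.items():
    print(f"max_depth={depth}: at most {leaves} leaves")
```

Under these assumptions there are only 18 possible feature rows. A deeper tree can only subdivide groups the depth-3 tree already forms, and if the majority class inside each finer group matches the depth-3 leaf it came from, the predictions do not change.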

### Which model performs best on your out-of-sample data, the validate set?

In [56]:
def validate_results(d):
    ship = decision_tree(train, d = d, print_results = False)
    print('')
    print(f'For decision tree of depth: {ship.max_depth}')
    print('VALIDATE RESULTS')
    print('Accuracy of Decision Tree classifier on validate set: {:.2f}'
         .format(ship.score(x_validate, y_validate)))


    # Predict on the validate features (x_validate)
    y_pred = ship.predict(x_validate)

    # Compare the actual validate labels with the model's predictions
    print(classification_report(y_validate, y_pred))

In [57]:
for i in [3,5]:
    validate_results(i)


For decision tree of depth: 3
VALIDATE RESULTS
Accuracy of Decision Tree classifier on validate set: 0.79
              precision    recall  f1-score   support

           0       0.77      0.95      0.85       132
           1       0.88      0.54      0.67        82

    accuracy                           0.79       214
   macro avg       0.82      0.75      0.76       214
weighted avg       0.81      0.79      0.78       214


For decision tree of depth: 5
VALIDATE RESULTS
Accuracy of Decision Tree classifier on validate set: 0.79
              precision    recall  f1-score   support

           0       0.77      0.95      0.85       132
           1       0.88      0.54      0.67        82

    accuracy                           0.79       214
   macro avg       0.82      0.75      0.76       214
weighted avg       0.81      0.79      0.78       214



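### Takeaways:
- Depth 3 and 5 are identical on the validate set as well (accuracy 0.79 for both), so neither is preferable out of sample.
- Validate accuracy is about three points below training accuracy (0.82).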
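Comparing only two depths can hide where training and validate accuracy start to diverge. The sketch below sweeps a range of `max_depth` values and reports both scores and their gap. It runs on synthetic data from `make_classification` so that it is self-contained; the dataset, the depth range, and the variable names are assumptions for illustration, and the numbers it prints say nothing about the Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for a real train/validate split.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, random_state=123
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

results = []
for depth in range(1, 8):
    model = DecisionTreeClassifier(max_depth=depth, random_state=123)
    model.fit(X_train, y_train)
    # Record in-sample and out-of-sample accuracy for this depth.
    results.append((depth, model.score(X_train, y_train), model.score(X_val, y_val)))

for depth, train_acc, val_acc in results:
    gap = train_acc - val_acc
    print(f"max_depth={depth}: train={train_acc:.2f}, validate={val_acc:.2f}, gap={gap:+.2f}")
```

A growing gap between the two columns is the usual sign of overfitting; a sensible choice is often the smallest depth whose validate accuracy is close to the best observed.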

### Work through these same exercises using the Telco dataset.
- See Telco_modeling notebook