### Decision Tree Model Exercises

Using the titanic data, in your classification-exercises repository, create a notebook, `model.ipynb`, where you will do the following:

1. What is your baseline prediction? What is your baseline accuracy? Remember: your baseline prediction for a classification problem is the most prevalent class in the training dataset (the mode). When you make those predictions, what is your accuracy? This is your baseline accuracy.

In [165]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import acquire
import prepare
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [138]:
titanic = acquire.get_titanic_data()
titanic.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


What is your baseline prediction?

In [None]:
# baseline prediction is that passengers did not survive.

In [139]:
# prepare titanic
titanic = titanic.drop_duplicates()
cols_to_drop = ['deck', 'embarked', 'class']
titanic = titanic.drop(columns=cols_to_drop)
titanic['embark_town'] = titanic.embark_town.fillna(value='Southampton')
dummy_titanic = pd.get_dummies(titanic[['sex', 'embark_town']], dummy_na=False, drop_first=True)
titanic = titanic.drop(columns = ['sex', 'embark_town'])
titanic = pd.concat([titanic, dummy_titanic], axis=1)

In [163]:
titanic.head(3)

Unnamed: 0,passenger_id,survived,pclass,age,sibsp,parch,fare,alone,sex_male,embark_town_Queenstown,embark_town_Southampton
0,0,0,3,22.0,1,0,7.25,0,1,0,1
1,1,1,1,38.0,1,0,71.2833,0,0,0,0
2,2,1,3,26.0,0,0,7.925,1,0,0,1


In [140]:
# there are some nulls in the age column
titanic.isna().sum()

passenger_id                 0
survived                     0
pclass                       0
age                        177
sibsp                        0
parch                        0
fare                         0
alone                        0
sex_male                     0
embark_town_Queenstown       0
embark_town_Southampton      0
dtype: int64

In [141]:
# fill missing age values with the mean age of the entire dataset
titanic['age'] = titanic.age.fillna(value=titanic.age.mean())

In [142]:
titanic.isna().sum()

passenger_id               0
survived                   0
pclass                     0
age                        0
sibsp                      0
parch                      0
fare                       0
alone                      0
sex_male                   0
embark_town_Queenstown     0
embark_town_Southampton    0
dtype: int64

In [143]:
# split titanic: hold out 20% for test, then 30% of the remainder for validate
train, test = train_test_split(titanic, test_size = .2, random_state=123, stratify=titanic.survived)
train, validate = train_test_split(train, test_size=.3, random_state=123, stratify=train.survived)
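
The two calls above leave roughly 56% of the rows for train, 24% for validate, and 20% for test. The arithmetic can be sketched as below. Two assumptions are not shown in the notebook output: the dataset has 891 rows, and the held-out share is rounded up to a whole row. Both are consistent with the 498 training rows and 214 validate rows reported further down. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size, rounding the held-out share up."""
    n_held_out = math.ceil(test_size * n_rows)
    return n_rows - n_held_out, n_held_out


n_total = 891  # assumed row count; not shown in the notebook output

# First split: 20% held out as test.
n_train_validate, n_test = split_sizes(n_total, 0.2)
# Second split: 30% of the remainder held out as validate.
n_train, n_validate = split_sizes(n_train_validate, 0.3)

for name, n in [("train", n_train), ("validate", n_validate), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_total:.1%})")
```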

In [144]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 498 entries, 583 to 744
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   passenger_id             498 non-null    int64  
 1   survived                 498 non-null    int64  
 2   pclass                   498 non-null    int64  
 3   age                      498 non-null    float64
 4   sibsp                    498 non-null    int64  
 5   parch                    498 non-null    int64  
 6   fare                     498 non-null    float64
 7   alone                    498 non-null    int64  
 8   sex_male                 498 non-null    uint8  
 9   embark_town_Queenstown   498 non-null    uint8  
 10  embark_town_Southampton  498 non-null    uint8  
dtypes: float64(2), int64(6), uint8(3)
memory usage: 36.5 KB


In [145]:
train.survived.value_counts()

0    307
1    191
Name: survived, dtype: int64

What is your baseline accuracy?

In [146]:
baseline_accuracy = (0 == train.survived).mean()

print(f'baseline accuracy: {baseline_accuracy:.2%}')

baseline accuracy: 61.65%
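
The cell above hard-codes class 0 as the baseline prediction. A minimal sketch of the same calculation that picks the mode from the class counts instead, using only the counts printed by `train.survived.value_counts()`:

```python
from collections import Counter

# Class counts taken from train.survived.value_counts() above.
class_counts = Counter({0: 307, 1: 191})

# The baseline prediction is the most common class (the mode).
baseline_prediction, mode_count = class_counts.most_common(1)[0]

# Baseline accuracy is the share of rows that belong to that class.
baseline_accuracy = mode_count / sum(class_counts.values())

print(f"baseline prediction: {baseline_prediction}")
print(f"baseline accuracy: {baseline_accuracy:.2%}")
```

With the DataFrame at hand, `train.survived.mode()[0]` should give the same baseline prediction.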


2. Fit the decision tree classifier to your training sample and transform (i.e. make predictions on the training sample)

In [147]:
X_train = train.drop(columns=['survived'])
y_train = train.survived

X_validate = validate.drop(columns=['survived'])
y_validate = validate.survived

X_test = test.drop(columns=['survived'])
y_test = test.survived

In [148]:
# starting with a max_depth value of 5
titanic_clf = DecisionTreeClassifier(max_depth = 5, random_state = 123)

In [149]:
titanic_clf = titanic_clf.fit(X_train, y_train)

In [150]:
y_pred = titanic_clf.predict(X_train)
#y_pred  ==> array output

3. Evaluate your in-sample results using the model score, confusion matrix, and classification report.

In [151]:
print('Accuracy of Decision Tree classifier on training set: {:.2f}'.format(titanic_clf.score(X_train, y_train)))

Accuracy of Decision Tree classifier on training set: 0.86


In [152]:
c_m = confusion_matrix(y_train, y_pred)
c_m

array([[296,  11],
       [ 58, 133]])

In [154]:
labels = ['Did not survive', 'Survived']
print('Actual on the left, predicted on the top')
pd.DataFrame(c_m, index = labels, columns = labels)

Actual on the left, predicted on the top


Unnamed: 0,Did not survive,Survived
Did not survive,296,11
Survived,58,133


In [155]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.96      0.90       307
           1       0.92      0.70      0.79       191

    accuracy                           0.86       498
   macro avg       0.88      0.83      0.84       498
weighted avg       0.87      0.86      0.86       498
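
The classification report does not print the true/false positive and negative rates that these exercises ask for. They can be derived from the training confusion matrix shown above. This is a minimal sketch that treats class 1 (survived) as the positive class, which is a labeling choice rather than something the notebook states.

```python
# Training confusion matrix from above: rows are actual, columns are predicted.
confusion = [[296, 11],
             [58, 133]]

# Treat class 1 (survived) as the positive class.
(tn, fp), (fn, tp) = confusion

accuracy = (tp + tn) / (tp + tn + fp + fn)
true_positive_rate = tp / (tp + fn)    # same as recall for class 1
false_positive_rate = fp / (fp + tn)
true_negative_rate = tn / (tn + fp)    # same as recall for class 0
false_negative_rate = fn / (fn + tp)

rates = [
    ("accuracy", accuracy),
    ("true positive rate", true_positive_rate),
    ("false positive rate", false_positive_rate),
    ("true negative rate", true_negative_rate),
    ("false negative rate", false_negative_rate),
]
for name, value in rates:
    print(f"{name}: {value:.2%}")
```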



4. Compute: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [156]:
print('Accuracy of Decision Tree classifier on validate set: {:.2f}'.format(titanic_clf.score(X_validate, y_validate)))

Accuracy of Decision Tree classifier on validate set: 0.77


In [157]:
y_pred = titanic_clf.predict(X_validate)
print(classification_report(y_validate, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.86      0.82       132
           1       0.74      0.62      0.68        82

    accuracy                           0.77       214
   macro avg       0.76      0.74      0.75       214
weighted avg       0.77      0.77      0.77       214



5. Run through steps 2-4 using a different max_depth value.

In [162]:
for i in range(5, 20):
    print('max_depth value: {}'.format(i))
    titanic_clf = DecisionTreeClassifier(max_depth = i, random_state = 123)
    titanic_clf = titanic_clf.fit(X_train, y_train)
    y_pred = titanic_clf.predict(X_train)
    print('Accuracy of Decision Tree classifier on training set: {:.2f}'.format(titanic_clf.score(X_train, y_train)))
    print(classification_report(y_train, y_pred))
    print('Accuracy of Decision Tree classifier on validate set: {:.2f}'.format(titanic_clf.score(X_validate, y_validate)))
    y_pred = titanic_clf.predict(X_validate)
    print(classification_report(y_validate, y_pred))
    print('-------------------------------------------------')



max_depth value: 5
Accuracy of Decision Tree classifier on training set: 0.86
              precision    recall  f1-score   support

           0       0.84      0.96      0.90       307
           1       0.92      0.70      0.79       191

    accuracy                           0.86       498
   macro avg       0.88      0.83      0.84       498
weighted avg       0.87      0.86      0.86       498

Accuracy of Decision Tree classifier on validate set: 0.77
              precision    recall  f1-score   support

           0       0.79      0.86      0.82       132
           1       0.74      0.62      0.68        82

    accuracy                           0.77       214
   macro avg       0.76      0.74      0.75       214
weighted avg       0.77      0.77      0.77       214

---------------------------------------
max_depth value: 6
Accuracy of Decision Tree classifier on training set: 0.88
              precision    recall  f1-score   support

           0       0.87      0.94   

6. Which model performs better on your in-sample data?

In [None]:
# I've observed that the higher the depth, the better the performance on the in-sample
# data, but that may be due to overfitting.

7. Which model performs best on your out-of-sample data, the validate set?

In [None]:
# depth values 6 and 8 perform the best

### Random Forest Model Exercises

Continue working in your `model` file with titanic data to do the following: 

1. Fit the Random Forest classifier to your training sample and transform (i.e. make predictions on the training sample) setting the random_state accordingly and setting min_samples_leaf = 1 and max_depth = 10.

In [196]:
X_train = train.drop(columns=['survived'])
y_train = train.survived

X_validate = validate.drop(columns=['survived'])
y_validate = validate.survived

X_test = test.drop(columns=['survived'])
y_test = test.survived

In [197]:
rf = RandomForestClassifier(max_depth = 10,
                           random_state = 123, 
                           min_samples_leaf = 1)

In [198]:
rf.fit(X_train, y_train)

RandomForestClassifier(max_depth=10, random_state=123)

In [199]:
print(rf.feature_importances_)

[0.16359522 0.08740554 0.15093152 0.04708349 0.02672832 0.18238905
 0.01822827 0.28960813 0.01169783 0.02233262]


In [200]:
X_train.columns

Index(['passenger_id', 'pclass', 'age', 'sibsp', 'parch', 'fare', 'alone',
       'sex_male', 'embark_town_Queenstown', 'embark_town_Southampton'],
      dtype='object')
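
The importances are printed in the same order as `X_train.columns`, so they are easier to read when paired with the column names and sorted. A small sketch using the values copied from the two outputs above:

```python
# Column names and importances copied from the outputs above, in the same order.
columns = ['passenger_id', 'pclass', 'age', 'sibsp', 'parch', 'fare', 'alone',
           'sex_male', 'embark_town_Queenstown', 'embark_town_Southampton']
importances = [0.16359522, 0.08740554, 0.15093152, 0.04708349, 0.02672832,
               0.18238905, 0.01822827, 0.28960813, 0.01169783, 0.02233262]

# Pair each feature with its importance and sort from most to least important.
ranked = sorted(zip(columns, importances), key=lambda pair: pair[1], reverse=True)

for name, importance in ranked:
    print(f"{name:<25}{importance:.3f}")
```

In the notebook itself, `pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)` should give the same ranking. Note that `passenger_id` ranks third even though it is a row identifier, so it may be worth checking whether dropping it changes the validate accuracy.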

In [255]:
y_pred = rf.predict(X_train)
y_pred

array([0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,

In [236]:
print('Accuracy of random forest classifier on training set: {:.2f}'.format(rf.score(X_train, y_train)))

Accuracy of random forest classifier on training set: 0.98


In [237]:
# 0 = did not survive, 1 = survived
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98       307
           1       1.00      0.94      0.97       191

    accuracy                           0.98       498
   macro avg       0.98      0.97      0.98       498
weighted avg       0.98      0.98      0.98       498



2. Evaluate your results using the model score, confusion matrix, and classification report.

In [252]:
print('Accuracy of random forest classifier on training set: {:.2f}'.format(rf.score(X_train, y_train)))

Accuracy of random forest classifier on training set: 0.98


In [256]:
c_m = confusion_matrix(y_train, y_pred)
c_m

array([[307,   0],
       [ 11, 180]])

In [257]:
labels = ['Did not survive', 'Survived']
print('Actual on the left, predicted on the top')
pd.DataFrame(c_m, index = labels, columns = labels)

Actual on the left, predicted on the top


Unnamed: 0,Did not survive,Survived
Did not survive,307,0
Survived,11,180


In [258]:
y_pred = rf.predict(X_train)
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98       307
           1       1.00      0.94      0.97       191

    accuracy                           0.98       498
   macro avg       0.98      0.97      0.98       498
weighted avg       0.98      0.98      0.98       498



3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [259]:
print('Accuracy of random forest classifier on validate set: {:.2f}'.format(rf.score(X_validate, y_validate)))

Accuracy of random forest classifier on validate set: 0.80


In [261]:
y_pred = rf.predict(X_validate)
print(confusion_matrix(y_validate, y_pred))

[[116  16]
 [ 26  56]]


In [263]:
labels = ['Did not survive', 'Survived']
print('Actual on the left, predicted on the top')
pd.DataFrame(confusion_matrix(y_validate, y_pred), index = labels, columns = labels)

Actual on the left, predicted on the top


Unnamed: 0,Did not survive,Survived
Did not survive,116,16
Survived,26,56


In [264]:
print(classification_report(y_validate, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.88      0.85       132
           1       0.78      0.68      0.73        82

    accuracy                           0.80       214
   macro avg       0.80      0.78      0.79       214
weighted avg       0.80      0.80      0.80       214
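
As a cross-check, the per-class precision, recall, f1-score, and support in the report above can be recomputed from the validate confusion matrix alone. This is a sketch with a hypothetical helper, `class_metrics`. It is not how scikit-learn computes the report internally, only the same definitions applied by hand.

```python
# Validate confusion matrix from above: rows are actual, columns are predicted.
confusion = [[116, 16],
             [26, 56]]


def class_metrics(confusion, label):
    """Return (precision, recall, f1, support) for one class label (0 or 1)."""
    tp = confusion[label][label]
    support = sum(confusion[label])                    # actual rows of this class
    predicted = sum(row[label] for row in confusion)   # rows predicted as this class
    precision = tp / predicted
    recall = tp / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, support


for label in (0, 1):
    precision, recall, f1, support = class_metrics(confusion, label)
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f} "
          f"f1-score={f1:.2f} support={support}")
```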



4. Run through the steps again, increasing your min_samples_leaf and decreasing your max_depth.

In [270]:
max_depth = 20
for i in range(2, max_depth):
    # depth stays at 19 on every iteration; use max_depth - i to lower it as min_samples_leaf rises
    depth = max_depth - 1
    print('max_depth value: {}'.format(depth))
    print('min_samples_leaf value: {}'.format(i))
    rf = RandomForestClassifier(max_depth = depth,
                           random_state = 123, 
                           min_samples_leaf = i)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_train)
    print('Accuracy of Random Forest model on training set: {:.2f}'.format(rf.score(X_train, y_train)))
    print(classification_report(y_train, y_pred))
    print('Accuracy of Random Forest model on validate set: {:.2f}'.format(rf.score(X_validate, y_validate)))
    y_pred = rf.predict(X_validate)
    print(classification_report(y_validate, y_pred))
    print('-----------------------------------------------------')



max_depth value: 19
min_samples_leaf value: 2
Accuracy of Random Forest model on training set: 0.94
              precision    recall  f1-score   support

           0       0.92      0.98      0.95       307
           1       0.97      0.87      0.92       191

    accuracy                           0.94       498
   macro avg       0.95      0.93      0.93       498
weighted avg       0.94      0.94      0.94       498

Accuracy of Random Forest model on validate set: 0.82
              precision    recall  f1-score   support

           0       0.83      0.89      0.86       132
           1       0.81      0.71      0.75        82

    accuracy                           0.82       214
   macro avg       0.82      0.80      0.81       214
weighted avg       0.82      0.82      0.82       214

-----------------------------------------------------
max_depth value: 19
min_samples_leaf value: 3
Accuracy of Random Forest model on training set: 0.92
              precision    recall  f1-

Accuracy of Random Forest model on training set: 0.85
              precision    recall  f1-score   support

           0       0.83      0.95      0.89       307
           1       0.90      0.69      0.78       191

    accuracy                           0.85       498
   macro avg       0.86      0.82      0.83       498
weighted avg       0.86      0.85      0.85       498

Accuracy of Random Forest model on validate set: 0.79
              precision    recall  f1-score   support

           0       0.78      0.91      0.84       132
           1       0.80      0.60      0.69        82

    accuracy                           0.79       214
   macro avg       0.79      0.75      0.76       214
weighted avg       0.79      0.79      0.78       214

-----------------------------------------------------
max_depth value: 19
min_samples_leaf value: 13
Accuracy of Random Forest model on training set: 0.85
              precision    recall  f1-score   support

           0       0.83     
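
In the recorded run above, max_depth prints as 19 on every iteration while only min_samples_leaf changes. A minimal sketch of one schedule that moves both settings in opposite directions, as the exercise asks. Subtracting the leaf size from the starting depth is an assumption about the intent, not something the notebook confirms.

```python
max_depth = 20

# Pair each min_samples_leaf value with a depth that shrinks as the leaf size grows.
settings = [(leaf, max_depth - leaf) for leaf in range(2, max_depth)]

for leaf, depth in settings:
    print(f"min_samples_leaf={leaf:>2}  max_depth={depth:>2}")
```

Each pair could then be passed to `RandomForestClassifier(max_depth=depth, min_samples_leaf=leaf, random_state=123)` inside the loop.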

5. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

After making a few models, which one has the best performance (or closest metrics) on both train and validate?
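
One way to approach this question is to tabulate train and validate accuracy side by side and look at the gap between them. The sketch below uses only the accuracies that appear in full in the outputs above. Runs whose output was cut off are left out, so the comparison is incomplete.

```python
# (model description, train accuracy, validate accuracy) taken from the outputs above.
runs = [
    ("decision tree, max_depth=5", 0.86, 0.77),
    ("random forest, max_depth=10, min_samples_leaf=1", 0.98, 0.80),
    ("random forest, max_depth=19, min_samples_leaf=2", 0.94, 0.82),
]

for name, train_acc, validate_acc in runs:
    gap = train_acc - validate_acc
    print(f"{name}: train={train_acc:.2f} validate={validate_acc:.2f} gap={gap:.2f}")

# Highest validate accuracy, and the smallest drop from train to validate.
best_validate = max(runs, key=lambda run: run[2])
smallest_gap = min(runs, key=lambda run: run[1] - run[2])

print("best on validate:", best_validate[0])
print("smallest train/validate gap:", smallest_gap[0])
```

A large gap suggests overfitting, so a model with slightly lower training accuracy but a smaller gap may generalize better.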