# Decision Tree Exercises

Using the Titanic data, create a notebook named decision_tree.ipynb in your classification-exercises repository and complete the following exercises:

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

import acquire
import prepare
import env

In [2]:
df = acquire.get_titanic_data()

In [3]:
clean_df = prepare.prep_titanic(df)

In [4]:
train, validate, test = prepare.split_data(clean_df)

In [5]:
train.head()

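|     | passenger_id | survived | pclass | sex | sibsp | parch | fare | embark_town | alone | sex_male | embark_town_Queenstown | embark_town_Southampton |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 376 | 376 | 1 | 3 | female | 0 | 0 | 7.25 | Southampton | 1 | 0 | 0 | 1 |
| 457 | 457 | 1 | 1 | female | 1 | 0 | 51.8625 | Southampton | 0 | 0 | 0 | 1 |
| 77 | 77 | 0 | 3 | male | 0 | 0 | 8.05 | Southampton | 1 | 1 | 0 | 1 |
| 665 | 665 | 0 | 2 | male | 2 | 0 | 73.5 | Southampton | 0 | 1 | 0 | 1 |
| 29 | 29 | 0 | 3 | male | 0 | 0 | 7.8958 | Southampton | 1 | 1 | 0 | 1 |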


## 1. What is your baseline prediction? What is your baseline accuracy?

Remember: your baseline prediction for a classification problem is the most prevalent class in the training dataset (the mode). The accuracy you get when you predict that class for every observation is your baseline accuracy.

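The baseline prediction is 0 (did not survive), the most common class in the training set. The baseline accuracy is the share of training passengers who did not survive, about 0.61.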

In [18]:
#Instructor Answer:
baseline_accuracy = (train.survived == 0).mean()
baseline_accuracy

0.6067415730337079
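The same calculation can be written without hard-coding the majority class. The sketch below is illustrative only: `survived` is a hypothetical stand-in for `train.survived`, rebuilt from the class counts (324 and 210) that the training classification report in this notebook shows.

```python
import pandas as pd

# Hypothetical stand-in for train.survived: 324 zeros and 210 ones,
# matching the support counts in the training classification report.
survived = pd.Series([0] * 324 + [1] * 210, name="survived")

# Baseline prediction: the most frequent class (the mode).
baseline_prediction = survived.mode()[0]

# Baseline accuracy: share of observations equal to the baseline prediction.
baseline_accuracy = (survived == baseline_prediction).mean()

print(baseline_prediction)          # 0
print(round(baseline_accuracy, 4))  # 0.6067
```

Using the mode instead of a literal `0` keeps the baseline correct if the class balance changes.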

## 2. Fit the decision tree classifier to your training sample and transform (i.e. make predictions on the training sample)

In [6]:
x_train = train.drop(columns=['survived', 'passenger_id', 'sex', 'embark_town'])
y_train = train.survived

x_validate = validate.drop(columns=['survived', 'passenger_id', 'sex', 'embark_town'])
y_validate = validate.survived

x_test = test.drop(columns=['survived', 'passenger_id', 'sex', 'embark_town'])
y_test = test.survived

In [7]:
clf = DecisionTreeClassifier(max_depth=3, random_state=311)

In [8]:
clf = clf.fit(x_train, y_train)

In [9]:
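# Optional: visualize the fitted tree (feature_names and class_names should be lists of strings)
#plt.figure(figsize=(13, 7))
#plot_tree(clf, feature_names=x_train.columns.tolist(), class_names=clf.classes_.astype(str).tolist(), rounded=True)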

In [19]:
#Instructor Answer:
clf.score(x_train, y_train)

0.799625468164794

In [10]:
y_pred = clf.predict(x_train)

y_pred_proba = clf.predict_proba(x_train)

## 3. Evaluate your in-sample results using the model score, confusion matrix, and classification report.

In [11]:
print(f'Accuracy of Decision Tree classifier on training set: {clf.score(x_train, y_train):.2f}')

Accuracy of Decision Tree classifier on training set: 0.80


In [12]:
confusion_matrix(y_train, y_pred)

array([[291,  33],
       [ 74, 136]])

In [13]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.80      0.90      0.84       324
           1       0.80      0.65      0.72       210

    accuracy                           0.80       534
   macro avg       0.80      0.77      0.78       534
weighted avg       0.80      0.80      0.79       534



In [14]:
prepare.evaluate_clf(clf, x_train, y_train, y_pred)

(0.799625468164794,
           Pred 0  Pred 1
 Actual 0     291      33
 Actual 1      74     136,
                     0           1  accuracy   macro avg  weighted avg
 precision    0.797260    0.804734  0.799625    0.800997      0.800199
 recall       0.898148    0.647619  0.799625    0.772884      0.799625
 f1-score     0.844702    0.717678  0.799625    0.781190      0.794749
 support    324.000000  210.000000  0.799625  534.000000    534.000000,
                 metric       score
 0             accuracy    0.799625
 1   true_positive_rate    0.647619
 2  false_positive_rate    0.101852
 3   true_negative_rate    0.898148
 4  false_negative_rate    0.352381
 5            precision    0.804734
 6               recall    0.647619
 7             f1_score    0.717678
 8          support_pos  210.000000
 9          support_neg  324.000000)

## 4. Compute: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [15]:
cm = confusion_matrix(y_train, y_pred)

In [16]:
prepare.get_cm_metrics(cm)

|     | metric | score |
| --- | --- | --- |
| 0 | accuracy | 0.799625 |
| 1 | true_positive_rate | 0.647619 |
| 2 | false_positive_rate | 0.101852 |
| 3 | true_negative_rate | 0.898148 |
| 4 | false_negative_rate | 0.352381 |
| 5 | precision | 0.804734 |
| 6 | recall | 0.647619 |
| 7 | f1_score | 0.717678 |
| 8 | support_pos | 210.0 |
| 9 | support_neg | 324.0 |
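The `prepare.get_cm_metrics` helper is not shown in this notebook, so the following is only a sketch of how these numbers can be derived by hand from the four confusion-matrix cells. It assumes survived (1) is the positive class and uses scikit-learn's layout, where rows are actual classes and columns are predicted classes.

```python
# Cells of the training confusion matrix [[291, 33], [74, 136]],
# with survived (1) treated as the positive class.
tn, fp = 291, 33   # actual 0: predicted 0, predicted 1
fn, tp = 74, 136   # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp

precision = tp / (tp + fp)
recall = tp / (tp + fn)  # identical to the true positive rate

metrics = {
    "accuracy": (tp + tn) / total,
    "true_positive_rate": recall,
    "false_positive_rate": fp / (fp + tn),
    "true_negative_rate": tn / (tn + fp),
    "false_negative_rate": fn / (fn + tp),
    "precision": precision,
    "recall": recall,
    "f1_score": 2 * precision * recall / (precision + recall),
    "support_pos": tp + fn,
    "support_neg": tn + fp,
}

for name, score in metrics.items():
    print(f"{name}: {score:.6f}")
```

The printed values agree with the table above, for example an accuracy of 0.799625 and a false positive rate of 0.101852.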


## 5. Run through steps 2-4 using a different max_depth value.

In [17]:
prepare.clf_fit_predict(4, x_train, y_train, y_validate)



ValueError: Expected 2D array, got 1D array instead:
array=[1. 1. 1. 0. 0. 0. 0. 1. 1. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 1. 1. 0. 1. 0. 0. 1. 1. 1. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.
 0. 1. 1. 0. 0. 0. 1. 1. 1. 0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0.
 1. 1. 1. 0. 1. 1. 1. 1. 0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 0. 1. 1. 1. 0. 0.
 0. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 1. 1. 1.
 0. 0. 0. 0. 1. 1. 0. 0. 1. 0.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
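
The traceback shows a 1D array of zeros and ones, which looks like the survival labels in `y_validate` being passed where a 2D feature matrix was expected. The body of `prepare.clf_fit_predict` is not shown, so this is an inference and not a confirmed diagnosis. The sketch below repeats steps 2-4 for more than one `max_depth`, always passing feature matrices (`x_*`) to the model. It runs on synthetic stand-in data, not the Titanic data, and `fit_and_score`, `feature_a`, and `feature_b` are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Titanic data): two numeric features
# and a binary target that depends on them plus some noise.
rng = np.random.default_rng(311)
X = pd.DataFrame({
    "feature_a": rng.normal(size=400),
    "feature_b": rng.normal(size=400),
})
noise = rng.normal(scale=0.5, size=400)
y = ((X["feature_a"] + 0.5 * X["feature_b"] + noise) > 0).astype(int)

x_train, x_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.25, random_state=311, stratify=y
)


def fit_and_score(max_depth, x_train, y_train, x_validate, y_validate):
    """Hypothetical helper: fit a tree and summarize train/validate results."""
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=311)
    clf.fit(x_train, y_train)
    # predict() expects the 2D feature matrix (x_*), never the 1D target (y_*).
    y_pred = clf.predict(x_train)
    return {
        "max_depth": max_depth,
        "train_accuracy": clf.score(x_train, y_train),
        "validate_accuracy": clf.score(x_validate, y_validate),
        "confusion_matrix": confusion_matrix(y_train, y_pred),
    }


results = [
    fit_and_score(depth, x_train, y_train, x_validate, y_validate)
    for depth in (3, 4)
]

for r in results:
    print(r["max_depth"], round(r["train_accuracy"], 3), round(r["validate_accuracy"], 3))
```

Comparing train and validate accuracy side by side for each depth makes it easier to spot when a deeper tree starts to overfit.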