# Tutorial: Decision Tree

In this tutorial, we will build a decision tree model to predict whether a passenger on the Titanic survived.


**Data Features and meanings**
- survival
    - Survival - 0 = No, 1 = Yes
- pclass
    - Ticket class - 1 = 1st, 2 = 2nd, 3 = 3rd
- sex
    - Sex
- age
    - Age in years
- sibsp
    - Number of siblings / spouses aboard the Titanic
- parch
    - Number of parents / children aboard the Titanic
- ticket
    - Ticket number
- fare
    - Passenger fare
- cabin
    - Cabin number
- embarked
    - Port of Embarkation - C = Cherbourg, Q = Queenstown, S = Southampton

## Importing the libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns; sns.set()

## Part 1: Load data

In [None]:
# suppress warnings
import warnings
warnings.filterwarnings('ignore')

titanic_data = pd.read_csv('titanic.csv')

In [None]:
titanic_data.head(5)

In [None]:
titanic_data.info()

### Removing Columns With Too Much Missing Data

In [None]:
titanic_data.drop('Cabin', axis=1, inplace = True)

### Removing Null Data From Our Data Set

In [None]:
# Count the missing values in each column
titanic_data.isna().sum()

In [None]:
# Drop every row that still contains a missing value
titanic_data.dropna(inplace = True)

### Handling Categorical Data With Dummy Variables

In [None]:
# drop_first=True drops one dummy column per feature to avoid multicollinearity
titanic_data = pd.get_dummies(titanic_data, columns=['Sex', 'Embarked'], prefix='', prefix_sep='', drop_first=True)

In [None]:
# check dataframe columns
titanic_data.columns
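
To see what `drop_first=True` does, here is a minimal sketch on a hypothetical four-row frame (the `toy` data is made up for illustration and is not taken from `titanic.csv`). It compares the dummy columns produced with and without the option.

```python
import pandas as pd

# Hypothetical toy data that mimics the 'Sex' and 'Embarked' columns
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# One indicator column per category
full = pd.get_dummies(toy, columns=["Sex", "Embarked"], prefix="", prefix_sep="")

# The first category of each column (in sorted order) is dropped
reduced = pd.get_dummies(toy, columns=["Sex", "Embarked"], prefix="", prefix_sep="",
                         drop_first=True)

print(list(full.columns))     # ['female', 'male', 'C', 'Q', 'S']
print(list(reduced.columns))  # ['male', 'Q', 'S']
```

The dropped categories (`female` and `C` here) become the implicit baseline: a row with `male = 0` is female, and a row with `Q = 0` and `S = 0` embarked at Cherbourg.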

Remove the columns we have decided not to use as features

In [None]:
titanic_data.drop(['PassengerId','Name', 'Ticket'], axis = 1, inplace = True)

In [None]:
# check dataframe columns
titanic_data.columns

Correlation between the remaining columns

In [None]:
sns.heatmap(titanic_data.corr())

## Part 2: Train/Test separation

X/y separation

In [None]:
y_data = titanic_data['Survived']
x_data = titanic_data.drop('Survived', axis = 1)

Split the data with the hold-out method
- 70% training set
- 30% testing set

In [None]:
from sklearn.model_selection import train_test_split
x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x_data, y_data, test_size=0.3, random_state=0)

## Part 3: Train a decision tree model

In [None]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=0)
model

In [None]:
model.fit(x_training_data, y_training_data)

Making Predictions With Our Model

In [None]:
predictions = model.predict(x_test_data)

## Part 4: Model Evaluation

Evaluation metrics
- confusion matrix
- accuracy
- precision, recall, f1-score

Measuring the Performance

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

print("Accuracy:\t %.3f" %accuracy_score(y_test_data, predictions))
print(classification_report(y_test_data, predictions))
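
The accuracy, precision, recall and F1 figures can all be derived from the four counts of a binary confusion matrix. The sketch below works this out by hand for a small made-up pair of label lists (`y_true` and `y_pred` are illustrative, not Titanic predictions) and checks the result against scikit-learn.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
precision = tp / (tp + fp)                   # how many predicted positives are real
recall = tp / (tp + fn)                      # how many real positives were found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(accuracy, precision, recall, f1)

# The hand-computed values agree with the library functions
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```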

Confusion matrix

In [None]:
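from sklearn.metrics import ConfusionMatrixDisplay
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator replaces it
cm = ConfusionMatrixDisplay.from_estimator(model, x_test_data, y_test_data, cmap="Blues", values_format='.3g')
plt.grid(False)
plt.show()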

Visualizing the decision tree

In [None]:
import graphviz
from sklearn import tree
dot_data = tree.export_graphviz(model, out_file=None, 
                              feature_names=list(x_training_data.columns),
                              class_names=['0','1'],
                              filled=True, rounded=True,
                              special_characters=True, rotate=True)
graph = graphviz.Source(dot_data)
graph.render('dtree_render')
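
If Graphviz is not installed, the same tree structure can be inspected as plain text. The sketch below uses `sklearn.tree.export_text`, which is a different function from the `export_graphviz` call above, on a small synthetic dataset; the data, the feature names and the `small_tree` model are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data, not the Titanic data set
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text returns the split rules as an indented string
rules = export_text(small_tree, feature_names=feature_names)
print(rules)
```

With the tutorial's own objects, the equivalent call would presumably be `export_text(model, feature_names=list(x_training_data.columns))`.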

## Part 5: Model tuning

#### Try tuning the model to see if you can make it perform better.


The parameters and methods of `DecisionTreeClassifier` are documented at https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

## Try tuning hyperparameters
***Note: to do this properly, split the data into train/validation/test sets and tune the hyperparameters on the validation set, not the test set.***
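
The train/validation/test procedure described in the note can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`; the 60/20/20 split, the list of candidate depths and all variable names are assumptions, not part of the tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Titanic data set
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# First hold out 20% as the test set, then take 25% of the remainder
# as the validation set: 60% train, 20% validation, 20% test overall
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

# Choose max_depth using the validation set only
best_depth, best_val_acc = None, -1.0
for depth in [1, 2, 3, 4, 5, 6]:
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=0)
    candidate.fit(X_train, y_train)
    val_acc = accuracy_score(y_val, candidate.predict(X_val))
    if val_acc > best_val_acc:
        best_depth, best_val_acc = depth, val_acc

# Refit on train + validation, then touch the test set exactly once
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X_trainval, y_trainval)
test_acc = accuracy_score(y_test, final_model.predict(X_test))

print("best max_depth:", best_depth)
print("validation accuracy: %.3f" % best_val_acc)
print("test accuracy: %.3f" % test_acc)
```

Because the test set plays no part in choosing `max_depth`, the final test accuracy remains an unbiased estimate of performance on unseen data.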


In [None]:
# min_impurity_split and presort have been removed from scikit-learn, so they are omitted here
model = DecisionTreeClassifier(criterion='entropy', 
                               splitter='best', 
                               max_depth=3, 
                               min_samples_split=2, 
                               min_samples_leaf=1, 
                               min_weight_fraction_leaf=0.0, 
                               max_features=None, 
                               random_state=None, 
                               max_leaf_nodes=None, 
                               min_impurity_decrease=0.0, 
                               class_weight=None, 
                               ccp_alpha=0.0)
model.fit(x_training_data, y_training_data)
predictions = model.predict(x_test_data)
print("Accuracy:\t %.3f" %accuracy_score(y_test_data, predictions))
print(classification_report(y_test_data, predictions))

## Feature importance

In [None]:
importances = model.feature_importances_
print(importances)

# plot the importances, labelling each bar with its feature name
plt.bar(x_training_data.columns, importances)
plt.xticks(rotation=45)
plt.show()
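
A raw importance array is easier to read when each value is paired with its feature name and sorted. The sketch below shows one way to do that with a pandas `Series`, using synthetic data; the feature names and the `tree_model` variable are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with hypothetical column names
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])

tree_model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Pair each importance with its column name and rank from most to least important
ranked = pd.Series(tree_model.feature_importances_, index=X.columns)
ranked = ranked.sort_values(ascending=False)
print(ranked)
```

The importances of a fitted tree sum to 1, and a feature that is never used for a split gets an importance of 0.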