# Decision Trees Exercises

## Introduction

We will be using the wine quality data set for these exercises. This data set contains various chemical properties of wine, such as acidity, sugar, pH, and alcohol. It also contains a quality metric (3-9, with higher being better) and a color (red or white). The name of the file is `Wine_Quality_Data.csv`.

In [1]:
from __future__ import print_function
import os
data_path = ['..', '..', 'data']

## Question 1

* Import the data and examine the features.
* We will be using all of them to predict `color` (white or red), but the `color` feature will need to be integer encoded.
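
As a minimal sketch of the steps above, the cell below builds a tiny toy frame instead of reading the real file, so the values and the choice of three columns are assumptions made purely for illustration. With the actual data, the frame would presumably come from `pd.read_csv` on a path built from `data_path`.

```python
import pandas as pd

# Toy stand-in for Wine_Quality_Data.csv (hypothetical values, three columns only).
# With the real file, something like:
#   data = pd.read_csv(os.sep.join(data_path + ['Wine_Quality_Data.csv']))
data = pd.DataFrame({
    'alcohol': [9.4, 12.8, 10.1, 11.2],
    'pH': [3.51, 3.10, 3.26, 3.30],
    'color': ['red', 'white', 'white', 'red'],
})

# Examine the features: column types and class balance of the target
print(data.dtypes)
print(data['color'].value_counts())

# Integer-encode the target: white -> 0, red -> 1
data['color'] = data['color'].map({'white': 0, 'red': 1}).astype(int)
print(data['color'].tolist())
```

Using an explicit mapping makes the encoding easy to read back later, and any unexpected label would surface as a missing value rather than pass through silently.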

In [None]:
# Import Data


# Examine the features


# Convert the color feature to an integer. This is a quick way to do it using Pandas.
data['color'] = data['color'].map({'white': 0, 'red': 1}).astype(int)

## Question 2

* Use `StratifiedShuffleSplit` to split the data into train and test sets that are stratified by wine color, the target. If possible, preserve the indices of the split for question 5 below.
* Check the percent composition of each color class in both the train and test data sets.
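
One possible shape for this split is sketched below on synthetic data. The frame, the 200-row size, and the test size of 40 are assumptions for illustration (the exercise itself asks for 1000 test points), and the split is stratified on `color` here because that is the `y` whose proportions are inspected in the cells that follow.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the wine data: 150 white (0) and 50 red (1) rows
rng = np.random.RandomState(42)
data = pd.DataFrame({
    'alcohol': rng.normal(10.5, 1.0, 200),
    'pH': rng.normal(3.2, 0.15, 200),
    'color': np.repeat([0, 1], [150, 50]),
})
feature_cols = [x for x in data.columns if x != 'color']

# An integer test_size is an absolute number of rows
strat_shuff_split = StratifiedShuffleSplit(n_splits=1, test_size=40, random_state=42)

# split() returns a generator of (train_idx, test_idx) pairs; keep the indices for reuse
train_idx, test_idx = next(strat_shuff_split.split(data[feature_cols], data['color']))

X_train = data.loc[train_idx, feature_cols]
y_train = data.loc[train_idx, 'color']
X_test = data.loc[test_idx, feature_cols]
y_test = data.loc[test_idx, 'color']

# Class proportions should be close to identical in both parts
print(y_train.value_counts(normalize=True).sort_index())
print(y_test.value_counts(normalize=True).sort_index())
```

Keeping `train_idx` and `test_idx` around means the same rows can be reused later with a different target column.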

In [None]:
# All data columns except for color
feature_cols = [x for x in data.columns if x != 'color']

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

# Split the data into two parts with 1000 points in the test data
# This creates a generator


# Get the index values from the generator


# Create the data sets


Now check the percent composition of each color class in the train and test data sets. The data set is mostly white wine, as can be seen below.

In [None]:
y_train.value_counts(normalize=True).sort_index()

In [None]:
y_test.value_counts(normalize=True).sort_index()

## Question 3

* Fit a decision tree classifier with no set limits on maximum depth, features, or leaves.
* Determine how many nodes are present and what the depth of this (very large) tree is.
* Using this tree, measure the prediction error in the train and test data sets. What do you think is going on here based on the differences in prediction error?
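
The steps above might look roughly like the following sketch. It uses synthetic two-class data from `make_classification` rather than the wine data, so the sample size, feature count, and label-noise level are all assumptions; only the pattern of fitting, inspecting `tree_`, and comparing train and test metrics carries over.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data with some label noise (stand-in for the wine data)
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# No limits on depth, features, or leaves: the tree grows until leaves are pure
dt = DecisionTreeClassifier(random_state=42)
dt = dt.fit(X_train, y_train)

print(dt.tree_.node_count, dt.tree_.max_depth)

# Compare metrics on the data the tree has seen and on held-out data
for label, X_part, y_part in [('train', X_train, y_train), ('test', X_test, y_test)]:
    y_pred = dt.predict(X_part)
    print(label, accuracy_score(y_part, y_pred), f1_score(y_part, y_pred))
```

An unrestricted tree typically fits the training rows perfectly, so a gap between the train and test scores is the usual sign of overfitting.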

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Train Decision Tree Classifier


The number of nodes and the maximum actual depth.

In [None]:
dt.tree_.node_count, dt.tree_.max_depth

Performance metrics

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Refer to previous practicals to obtain the performance metrics for training set and testing set

## Question 4

* Using grid search with cross validation, find a decision tree that performs well on the test data set. Use a different variable name for this decision tree model than in question 3 so that both can be used in question 6.
* Determine the number of nodes and the depth of this tree.
* Measure the errors on the training and test sets as before and compare them to those from the tree in question 3.
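
A hedged sketch of the grid search follows, again on synthetic data so that it runs on its own. The name `best_tree` and the data settings are assumptions for illustration; the parameter grid mirrors the idea of searching depths up to that of the unrestricted tree.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data (stand-in for the wine data)
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Unrestricted tree, used only to bound the depth range of the search
dt = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

param_grid = {'max_depth': range(1, dt.tree_.max_depth + 1, 2),
              'max_features': range(1, X_train.shape[1] + 1)}

# 5-fold cross validation on the training data only; the test set stays untouched
GR = GridSearchCV(DecisionTreeClassifier(random_state=42),
                  param_grid=param_grid, scoring='accuracy', cv=5)
GR = GR.fit(X_train, y_train)

best_tree = GR.best_estimator_  # refit on all of X_train with the best parameters
print(GR.best_params_)
print(best_tree.tree_.node_count, best_tree.tree_.max_depth)
print('train', accuracy_score(y_train, best_tree.predict(X_train)))
print('test', accuracy_score(y_test, best_tree.predict(X_test)))
```

Because the search only ever sees the training folds, the test score remains an honest estimate for comparing the tuned tree with the unrestricted one.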

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth':range(1, dt.tree_.max_depth+1, 2),
              'max_features': range(1, len(dt.feature_importances_)+1)}

GR = GridSearchCV(DecisionTreeClassifier(random_state=42),
                  param_grid=param_grid,
                  scoring='accuracy',
                  n_jobs=-1)

GR = GR.fit(X_train, y_train)

The number of nodes and the maximum depth of the tree.

In [None]:
GR.best_estimator_.tree_.node_count, GR.best_estimator_.tree_.max_depth

These test errors are a little better than those from question 3, so the unconstrained tree appears to have overfit the data, but only slightly.

In [None]:
# Refer to previous practicals to obtain the performance metrics for training set and testing set



## Question 5

* Re-split the data into `X` and `y` parts, this time with `residual_sugar` being the predicted (`y`) data. *Note:* if the indices were preserved from the `StratifiedShuffleSplit` output in question 2, they can be used again to split the data.
* Using grid search with cross validation, find a decision tree **regression** model that performs well on the test data set.
* Measure the errors on the training and test sets using mean squared error.
* Make a plot of actual *vs* predicted residual sugar.
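
The regression variant could be sketched as below. The data come from `make_regression` rather than the wine file, and the name `GR_sugar`, the depth range, and the plot styling are assumptions for illustration. With the real data, `y` would be the `residual_sugar` column and the saved split indices could be reused.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (stand-in for predicting residual_sugar)
X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

param_grid = {'max_depth': range(1, 12, 2),
              'max_features': range(1, X_train.shape[1] + 1)}

# scikit-learn maximizes scores, so MSE is passed in negated form
GR_sugar = GridSearchCV(DecisionTreeRegressor(random_state=42),
                        param_grid=param_grid,
                        scoring='neg_mean_squared_error', cv=5)
GR_sugar = GR_sugar.fit(X_train, y_train)

y_train_pred = GR_sugar.predict(X_train)
y_test_pred = GR_sugar.predict(X_test)
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
print(train_mse, test_mse)

# Actual vs predicted: points near the diagonal are well predicted
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(y_test, y_test_pred, alpha=0.5)
ax.set(xlabel='Actual', ylabel='Predicted', title='Actual vs predicted (test set)')
```

A regression tree predicts the mean of each leaf, so the predicted values fall on a limited set of horizontal bands in this kind of plot.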