<h1>Regression with Decision Trees: Predicting Wine Quality</h1>

Decision tree regressors are used when the target variable is numeric and ordered, such as wine quality scored on a scale from 0 to 10.

#### Skills Used:
- Regression with Decision Trees
- Entropy minimizers
- Cross-validation

<h3>Import the data</h3>

In [1]:
import pandas as pd

# Red wine quality data from the UCI repository; fields are separated by semicolons
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
w_df = pd.read_csv(url, header=0, sep=';')
w_df.describe()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
std,1.741096,0.17906,0.194801,1.409928,0.047065,10.460157,32.895324,0.001887,0.154386,0.169507,1.065668,0.807569
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


<h4>Build train and test samples</h4>

In [2]:
from sklearn.model_selection import train_test_split

# Hold out 30% of the rows for testing
train, test = train_test_split(w_df, test_size=0.3)

# The first 11 columns are the features; 'quality' is the target
x_train = train.iloc[:, 0:11]
y_train = train[['quality']]
x_test = test.iloc[:, 0:11]
y_test = test[['quality']]

# Use all data for cross-validation
x_data = w_df.iloc[:, 0:11]
y_data = w_df[['quality']]
y_test

,quality
203,5
158,5
1077,5
1501,5
812,5
...,...
380,6
1413,5
917,6
1335,6


<h4>For wine quality, we need a regressor</h4>

In [3]:
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree

model = DecisionTreeRegressor(max_depth = 3)
model.fit(x_train,y_train)

DecisionTreeRegressor(max_depth=3)

Details: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html

In [4]:
# Get the R-Square of predicted vs. actual values on the training and test samples
print("Training R-Square",model.score(x_train,y_train))
print("Testing R-Square",model.score(x_test,y_test))

Training R-Square 0.33905528924639106
Testing R-Square 0.2508293083733515
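
For a regressor, scikit-learn's `score` method returns the coefficient of determination, R-Square = 1 - SS_res / SS_tot. As a minimal sketch of what that number means, the snippet below recomputes it by hand. The helper name `r_squared` and the sample values are made up for illustration and are not taken from the wine data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error left over after using the model's predictions
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Squared error of the baseline that always predicts the mean
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical quality scores and predictions, for illustration only
y_true = [5, 6, 5, 7, 6]
y_pred = [5.2, 5.8, 5.4, 6.3, 5.9]
r2 = r_squared(y_true, y_pred)
print("R-Square: %.3f" % r2)
```

A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean, so a testing R-Square of about 0.25 indicates that the depth-3 tree explains roughly a quarter of the variance in quality.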


<h3>View the tree</h3>

In [7]:
import pydotplus

# Feature names are all columns except the outcome variable, quality (the last column)
feature_names = list(w_df.columns[:-1])
dot_data = tree.export_graphviz(model, out_file=None, feature_names=feature_names)
graph = pydotplus.graph_from_dot_data(dot_data)
# The tree will be saved to wines.pdf in your current directory
graph.write_pdf("wines.pdf")

True

#### A screenshot of the Decision Tree is displayed below:

![](wines_tree.png)

<h3>Decision trees are Entropy minimizers</h3>
<li><b>Entropy</b>: a measure of uncertainty in the data<p>
What is the uncertainty in color when you draw a marble from a box of 100 blue marbles?<p>
What is the uncertainty when you draw a marble from a box with 50 blue and 50 red marbles?
<li>Entropy minimization: classification tree algorithms seek to partition the data on features in such a way that the total entropy is minimized. Regression trees apply the same idea with mean squared error in place of entropy.
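
The two marble questions can be answered numerically. The sketch below computes Shannon entropy in bits for each box; the function name `entropy` and the list-of-labels representation are assumptions chosen for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

pure_box = ["blue"] * 100
mixed_box = ["blue"] * 50 + ["red"] * 50

print(entropy(pure_box))   # 0.0: the color is certain
print(entropy(mixed_box))  # 1.0: the most uncertainty two colors can give
```

A split that sends all blue marbles one way and all red marbles the other turns one box with entropy 1 into two boxes with entropy 0, which is the kind of partition a classification tree looks for.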

<h3>Regression trees</h3>
<li>For each feature X, consider candidate split points along its range
<li>For each candidate split, calculate the Mean Squared Error of the target within each of the two halves
<li>Combine the two errors, weighting each half by its number of samples
<li>Pick the feature and split point that give the lowest combined MSE, then repeat within each half
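
As a rough sketch of the split search on a single feature, the code below tries every midpoint between neighboring values and keeps the one with the lowest combined MSE. The helper names `mse` and `best_split` and the sample data are hypothetical, and a real implementation such as scikit-learn's is far more optimized.

```python
def mse(values):
    """Mean squared error of values around their own mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def best_split(x, y):
    """Return (threshold, combined_mse) of the best binary split on one feature."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        # Weight each half's MSE by its share of the samples
        combined = (len(left) * mse(left) + len(right) * mse(right)) / n
        if combined < best[1]:
            best = (threshold, combined)
    return best

# Hypothetical alcohol levels and quality scores, for illustration only
alcohol = [9.0, 9.5, 10.0, 11.0, 12.0, 12.5]
quality = [5, 5, 5, 6, 7, 7]
threshold, combined_mse = best_split(alcohol, quality)
print("Split at alcohol <= %.2f, combined MSE %.3f" % (threshold, combined_mse))
```

Running the same search over every feature and keeping the overall winner gives one node of the tree; repeating it inside each half grows the tree until a limit such as `max_depth` is reached.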

<h2>Cross validation</h2>

<li>Split the training set into k smaller sets (aka folds)
<li>Train the model on k-1 folds
<li>Validate the results on the remaining fold
<li>Repeat this, holding out each of the k folds in turn
<li>Report the average of all tests as the performance metric
<li>Details: http://scikit-learn.org/stable/modules/cross_validation.html
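
The steps above can be written out by hand. The sketch below is not how scikit-learn implements them; it uses a stand-in "model" that simply predicts the mean of its training targets, so that the fold mechanics are visible without any library. The function names and the sample data are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=1):
    """Shuffle range(n) and deal it into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate_mean_model(y, k=5):
    """Average held-out MSE of a stand-in model that predicts the training mean."""
    folds = k_fold_indices(len(y), k)
    fold_mse = []
    for held_out in folds:
        held = set(held_out)
        # "Train" on the other k-1 folds
        train = [y[i] for i in range(len(y)) if i not in held]
        prediction = sum(train) / len(train)
        # Validate on the held-out fold
        errors = [(y[i] - prediction) ** 2 for i in held_out]
        fold_mse.append(sum(errors) / len(errors))
    # Report the average over all k folds
    return sum(fold_mse) / k, folds

# Hypothetical quality scores, for illustration only
y = [5, 5, 6, 7, 5, 6, 6, 5, 7, 4, 6, 5, 5, 6, 8]
avg_mse, folds = cross_validate_mean_model(y, k=5)
print("Average held-out MSE: %.3f" % avg_mse)
```

Every row is held out exactly once, so each prediction is scored on data the model did not see during training.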

In [8]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

In [9]:
crossvalidation = KFold(n_splits=5,shuffle=True, random_state=1)

In [10]:
from sklearn import tree
import numpy as np
for depth in range(1, 10):
    model = tree.DecisionTreeRegressor(max_depth=depth, random_state=0)
    # Stop once the tree cannot grow as deep as requested
    if model.fit(x_data, y_data).tree_.max_depth < depth:
        break
    # 'neg_mean_squared_error' returns the negated MSE, so values closer to zero are better
    score = np.mean(cross_val_score(model, x_data, y_data, scoring='neg_mean_squared_error', cv=crossvalidation, n_jobs=1))
    print('Depth: %i Neg. MSE: %.3f' % (depth, score))

Depth: 1 Neg. MSE: -0.548
Depth: 2 Neg. MSE: -0.512
Depth: 3 Neg. MSE: -0.482
Depth: 4 Neg. MSE: -0.482
Depth: 5 Neg. MSE: -0.480
Depth: 6 Neg. MSE: -0.493
Depth: 7 Neg. MSE: -0.535
Depth: 8 Neg. MSE: -0.573
Depth: 9 Neg. MSE: -0.599


<h3>Purpose of cross-validation</h3>
<li>Not to generate a tree (it generates many trees!)
<li>But to provide an estimate of the average error of the model on data it has not seen
<li>Roughly, the idea is to see how the model performance varies with different training sets
<li>To generate the final tree, fit on the entire training set as before
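
One common use of these estimates is to choose a tree depth. As a small sketch, the snippet below picks the depth with the best score from the cross-validation output reported above; the dictionary name `cv_scores` is an assumption for illustration.

```python
# Negative-MSE scores copied from the cross-validation output above
cv_scores = {1: -0.548, 2: -0.512, 3: -0.482, 4: -0.482, 5: -0.480,
             6: -0.493, 7: -0.535, 8: -0.573, 9: -0.599}

# Scores are negated errors, so the largest (closest to zero) is best
best_depth = max(cv_scores, key=cv_scores.get)
print("Best depth:", best_depth, "with MSE", -cv_scores[best_depth])
```

The final tree would then be fit once on the training set, for example with `DecisionTreeRegressor(max_depth=best_depth).fit(x_train, y_train)`. The scores for depths 3 to 5 differ only in the third decimal place, so a shallower and more readable tree may be a reasonable choice as well.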