# Analysis

In [3]:
# Imports
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from xgboost import plot_tree
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_graphviz
import subprocess
from sklearn.model_selection import GridSearchCV
import numpy as np

redwine = pd.read_csv('../data/winequality-red.csv', delimiter=';')
whitewine = pd.read_csv('../data/winequality-white.csv', delimiter=';')

# Note: since there are two datasets, everything will have to be done twice. 
# I have decided not to combine the two with an added categorical feature 
# for the colour of the wine. 

## Questions 1 & 2

The first two questions are about which are the most important factors for quality when making wine. Since this dataset is about physicochemical measurements, they mean which chemical attributes should you aim for. Since the initial EDA didn't turn up anything obvious or strongly linearly correlated, I'll use some sort of not-quite-so-linear model to see if anything correlates that way. And I'll have to use one that I can see the workings of. 

###### Plan: 
1. Pick a usable model that I can see the insides of. 
2. Train that model to some good level of accuracy. 
3. Have a look inside and see what features the model has learned can be used. 
4. Pick out the one's that are indicitive of either especially good or bad wine. 
5. Repeat with another type of model until I have my answers or I determine they can't be found. 

The obvious choice for 1 is a forest model, as they're basically auto-generated 20 questions trees. Also, they're quick to get going with because they don't require any much of any preprocessing. 

In [5]:
red_train = redwine.iloc[:,:-1]
red_target = redwine.iloc[:,11]
white_train = whitewine.iloc[:,:-1]
white_target = whitewine.iloc[:,11]

rx_train, rx_test, ry_train, ry_test = train_test_split(red_train, red_target, train_size=0.75, test_size=0.25)
wx_train, wx_test, wy_train, wy_test = train_test_split(red_train, red_target, train_size=0.75, test_size=0.25)

In [3]:
dtr_parameters = {'min_samples_split': np.arange(2, 12), 
                  'min_samples_leaf': np.arange(1, 250, 25), 
                  'max_depth': np.arange(1, 11), 
                  'max_features': np.arange(1, 11), 
                  'max_leaf_nodes': np.arange(5, 50, 5)}

red_grid_search_1 = GridSearchCV(DecisionTreeRegressor(), dtr_parameters, n_jobs=4)
red_grid_search_2 = GridSearchCV(DecisionTreeRegressor(), dtr_parameters, n_jobs=4)
red_grid_search_3 = GridSearchCV(DecisionTreeRegressor(), dtr_parameters, n_jobs=4)
red_grid_search_4 = GridSearchCV(DecisionTreeRegressor(), dtr_parameters, n_jobs=4)
white_grid_search_1 = GridSearchCV(DecisionTreeRegressor(), dtr_parameters, n_jobs=4)
white_grid_search_2 = GridSearchCV(DecisionTreeRegressor(), dtr_parameters, n_jobs=4)
white_grid_search_3 = GridSearchCV(DecisionTreeRegressor(), dtr_parameters, n_jobs=4)
white_grid_search_4 = GridSearchCV(DecisionTreeRegressor(), dtr_parameters, n_jobs=4)

red_grid_search_1.fit(rx_train, ry_train)
red_grid_search_2.fit(rx_train, ry_train)
red_grid_search_3.fit(rx_train, ry_train)
red_grid_search_4.fit(rx_train, ry_train)
white_grid_search_1.fit(wx_train, wy_train)
white_grid_search_2.fit(wx_train, wy_train)
white_grid_search_3.fit(wx_train, wy_train)
white_grid_search_4.fit(wx_train, wy_train)


Tried a bunch of other tree models, none were hugely better, so I'll stick 
with this one for it's ease of visualisation. 
dtr = DecisionTreeRegressor(min_samples_split=10, min_samples_leaf=50, max_depth=3, max_features=5, max_leaf_nodes=50)
dtr.fit(rx_train, ry_train)
dtr_pred = dtr.predict(rx_test)
print('mean_absolute_error: ' + str(mean_absolute_error(ry_test.values, dtr_pred)))


feature_names = redwine.columns[:11]
export_graphviz(dtr, out_file='graphviz_output.dot', feature_names=feature_names)
subprocess.run(['dot','-Tpng graphviz_output.dot -o tree.png'])
subprocess.run(['eog', 'tree.png'])

In [9]:
print("Red wines: ")
print()
print("Best score: " + str(red_grid_search_1.best_score_))
print(red_grid_search_1.best_estimator_)
print(red_grid_search_1.best_params_)
print()
print("Best score: " + str(red_grid_search_2.best_score_))
print(red_grid_search_2.best_estimator_)
print(red_grid_search_2.best_params_)
print()
print("Best score: " + str(red_grid_search_3.best_score_))
print(red_grid_search_3.best_estimator_)
print(red_grid_search_3.best_params_)
print()
print("Best score: " + str(red_grid_search_4.best_score_))
print(red_grid_search_4.best_estimator_)
print(red_grid_search_4.best_params_)
print()
print()
print("White wines: ")
print()
print("Best score: " + str(white_grid_search_1.best_score_))
print(white_grid_search_1.best_estimator_)
print(white_grid_search_1.best_params_)
print()
print("Best score: " + str(white_grid_search_1.best_score_))
print(white_grid_search_2.best_estimator_)
print(white_grid_search_2.best_params_)
print()
print("Best score: " + str(white_grid_search_1.best_score_))
print(white_grid_search_3.best_estimator_)
print(white_grid_search_3.best_params_)
print()
print("Best score: " + str(white_grid_search_1.best_score_))
print(white_grid_search_4.best_estimator_)
print(white_grid_search_4.best_params_)

Red wines: 

Best score: 0.3227331471311155
DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=8,
           max_leaf_nodes=10, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=19,
           min_samples_split=6, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
{'max_depth': 4, 'max_features': 8, 'max_leaf_nodes': 10, 'min_samples_leaf': 19, 'min_samples_split': 6}

Best score: 0.31870400463868126
DecisionTreeRegressor(criterion='mse', max_depth=5, max_features=5,
           max_leaf_nodes=10, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=13,
           min_samples_split=3, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
{'max_depth': 5, 'max_features': 5, 'max_leaf_nodes': 10, 'min_samples_leaf': 13, 'min_samples_split': 3}

Best score: 0.3233797697278856
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=6,

#### Frist run (to ensure it worked): 
##### Red

DecisionTreeRegressor(criterion='mse', max_depth=10, max_features=6,  
           max_leaf_nodes=20, min_impurity_decrease=0.0,  
           min_impurity_split=None, min_samples_leaf=26,  
           min_samples_split=11, min_weight_fraction_leaf=0.0,  
           presort=False, random_state=None, splitter='best')  

##### White

DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=10,  
           max_leaf_nodes=35, min_impurity_decrease=0.0,  
           min_impurity_split=None, min_samples_leaf=26,  
           min_samples_split=4, min_weight_fraction_leaf=0.0,  
           presort=False, random_state=None, splitter='best')  


#### Second run (to get results), limited to 4 per wine colour due to computation time: 
##### Red

DecisionTreeRegressor(criterion='mse', max_depth=9, max_features=8,  
           max_leaf_nodes=30, min_impurity_decrease=0.0,  
           min_impurity_split=None, min_samples_leaf=26,  
           min_samples_split=8, min_weight_fraction_leaf=0.0,  
           presort=False, random_state=None, splitter='best')  

DecisionTreeRegressor(criterion='mse', max_depth=9, max_features=9,  
           max_leaf_nodes=20, min_impurity_decrease=0.0,  
           min_impurity_split=None, min_samples_leaf=26,  
           min_samples_split=9, min_weight_fraction_leaf=0.0,  
           presort=False, random_state=None, splitter='best')  

DecisionTreeRegressor(criterion='mse', max_depth=9, max_features=8,  
           max_leaf_nodes=45, min_impurity_decrease=0.0,  
           min_impurity_split=None, min_samples_leaf=26,  
           min_samples_split=4, min_weight_fraction_leaf=0.0,  
           presort=False, random_state=None, splitter='best')  

DecisionTreeRegressor(criterion='mse', max_depth=9, max_features=7,  
           max_leaf_nodes=15, min_impurity_decrease=0.0,  
           min_impurity_split=None, min_samples_leaf=26,  
           min_samples_split=11, min_weight_fraction_leaf=0.0,  
           presort=False, random_state=None, splitter='best')  

#### White

DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=5,  
           max_leaf_nodes=15, min_impurity_decrease=0.0,  
           min_impurity_split=None, min_samples_leaf=26,  
           min_samples_split=6, min_weight_fraction_leaf=0.0,  
           presort=False, random_state=None, splitter='best')  

DecisionTreeRegressor(criterion='mse', max_depth=7, max_features=7,  
           max_leaf_nodes=45, min_impurity_decrease=0.0,  
           min_impurity_split=None, min_samples_leaf=26,  
           min_samples_split=11, min_weight_fraction_leaf=0.0,  
           presort=False, random_state=None, splitter='best')  

DecisionTreeRegressor(criterion='mse', max_depth=7, max_features=9,  
           max_leaf_nodes=25, min_impurity_decrease=0.0,  
           min_impurity_split=None, min_samples_leaf=26,  
           min_samples_split=7, min_weight_fraction_leaf=0.0,  
           presort=False, random_state=None, splitter='best')  

DecisionTreeRegressor(criterion='mse', max_depth=5, max_features=5,  
           max_leaf_nodes=20, min_impurity_decrease=0.0,  
           min_impurity_split=None, min_samples_leaf=26,  
           min_samples_split=7, min_weight_fraction_leaf=0.0,  
           presort=False, random_state=None, splitter='best')  

#### What comes of that: 
##### Best (red): 
min_samples_split: 11, 8, 9, 4, 11 (mean avg: 8.6, mode avg: 11)  
min_samples_leaf: 26, 26, 26, 26, 26 (so 26)  
max_depth: 10, 9, 9, 9, 9 (so 9)  
max_features: 6, 8, 9, 8, 7 (mean avg: 7.6, mode avg: 8)  
max_leaf_nodes: 20, 30, 20, 45, 15 (mean avg: 26, mode avg: 20)  

##### Best (white): 
min_samples_split: 4, 6, 11, 7, 7 (mean avg: 7, mode avg: 7)  
min_samples_leaf: 26, 26, 26, 26, 26 (so 26)  
max_depth: 4, 4, 7, 7, 5 (mean avg: 5.4, mode avg: n/a)  
max_features: 10, 9, 7, 9, 5 (mean avg: 8, mode avg: 9)  
max_leaf_nodes: 35, 15, 45, 25, 20 (mean avg: 28, mode avg: n/a)  

##### Chosen values:
###### Red: 
min_samples_split: 9  
min_samples_leaf: 26  
max_depth: 9  
max_features: 8  
max_leaf_nodes: 20  

##### White: 
min_samples_split: 7  
min_samples_leaf: 26  
max_depth: 5  
max_features: 9  
max_leaf_nodes: 30  

Just to check my chosen values haven't somehow ruined everything: 

In [5]:
# Best scores so far: 
red_dtr = DecisionTreeRegressor(min_samples_split = 9, 
                                min_samples_leaf = 26, 
                                max_depth = 9, 
                                max_features = 8, 
                                max_leaf_nodes = 20)
white_dtr = DecisionTreeRegressor(min_samples_split = 7, 
                                  min_samples_leaf = 26, 
                                  max_depth = 5, 
                                  max_features = 9, 
                                  max_leaf_nodes = 30)

red_dtr.fit(rx_train, ry_train)
white_dtr.fit(wx_train, wy_train)

red_dtr_pred = red_dtr.predict(rx_test)
white_dtr_pred = white_dtr.predict(wx_test)

print("red dtr mae: " + str(mean_absolute_error(ry_test, red_dtr_pred)))
print("white dtr mae: " + str(mean_absolute_error(wy_test, white_dtr_pred)))

red dtr mae: 0.5480628637506041
white dtr mae: 0.5586677133086352


Now to train the best model setup with all the data, so I can get the best graph: 

In [6]:
# With all data: 
red_dtr_all_data = DecisionTreeRegressor(min_samples_split = 9, 
                                min_samples_leaf = 26, 
                                max_depth = 9, 
                                max_features = 8, 
                                max_leaf_nodes = 20)
white_dtr_all_data = DecisionTreeRegressor(min_samples_split = 17, 
                                  min_samples_leaf = 26, 
                                  max_depth = 15, 
                                  max_features = 9, 
                                  max_leaf_nodes = 30)

red_dtr_all_data.fit(red_train, red_target)
white_dtr_all_data.fit(white_train, white_target)

DecisionTreeRegressor(criterion='mse', max_depth=15, max_features=9,
           max_leaf_nodes=30, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=26,
           min_samples_split=17, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [7]:
feature_names = redwine.columns[:11]

export_graphviz(red_dtr_all_data, out_file='red_graphviz_output.dot', 
                feature_names=feature_names, 
                label='all', 
                filled=True, 
                rounded=True,
                leaves_parallel=True, 
                impurity=True)
subprocess.run(['dot','-Tpng', 'red_graphviz_output.dot', '-o', 'red_tree.png'])

export_graphviz(white_dtr_all_data, out_file='white_graphviz_output.dot', 
                feature_names=feature_names, 
                label='all', 
                filled=True, 
                rounded=True,
                leaves_parallel=True, 
                impurity=True)
subprocess.run(['dot','-Tpng', 'white_graphviz_output.dot', '-o', 'white_tree.png'])

CompletedProcess(args=['dot', '-Tpng', 'white_graphviz_output.dot', '-o', 'white_tree.png'], returncode=0)

### The graphs: 
#### Red Wine Decision Tree
![Red Wine Decision Tree](red_tree.png)
#### White Wine Decision Tree
![White Wine Decision Tree](white_tree.png)

### Unintelligible
These graphs are huge and difficult to easily read on my screen, nevermind interpret for knowing the *most* important factors. I will retry the gridsearch with a narrower range of parameter values, especially of tree depth. 

In [8]:
dtr_parameters = {'min_samples_split': np.arange(2, 12), 
                  'min_samples_leaf': np.arange(1, 25), 
                  'max_depth': np.arange(3, 6), 
                  'max_features': np.arange(1, 11), 
                  'max_leaf_nodes': np.arange(2, 11)}

red_grid_search_1 = GridSearchCV(DecisionTreeRegressor(), dtr_parameters, n_jobs=4)
red_grid_search_2 = GridSearchCV(DecisionTreeRegressor(), dtr_parameters, n_jobs=4)
red_grid_search_3 = GridSearchCV(DecisionTreeRegressor(), dtr_parameters, n_jobs=4)
red_grid_search_4 = GridSearchCV(DecisionTreeRegressor(), dtr_parameters, n_jobs=4)
white_grid_search_1 = GridSearchCV(DecisionTreeRegressor(), dtr_parameters, n_jobs=4)
white_grid_search_2 = GridSearchCV(DecisionTreeRegressor(), dtr_parameters, n_jobs=4)
white_grid_search_3 = GridSearchCV(DecisionTreeRegressor(), dtr_parameters, n_jobs=4)
white_grid_search_4 = GridSearchCV(DecisionTreeRegressor(), dtr_parameters, n_jobs=4)

red_grid_search_1.fit(rx_train, ry_train)
red_grid_search_2.fit(rx_train, ry_train)
red_grid_search_3.fit(rx_train, ry_train)
red_grid_search_4.fit(rx_train, ry_train)
white_grid_search_1.fit(wx_train, wy_train)
white_grid_search_2.fit(wx_train, wy_train)
white_grid_search_3.fit(wx_train, wy_train)
white_grid_search_4.fit(wx_train, wy_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best'),
       fit_params=None, iid=True, n_jobs=4,
       param_grid={'min_samples_split': array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11]), 'min_samples_leaf': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24]), 'max_depth': array([3, 4, 5]), 'max_features': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10]), 'max_leaf_nodes': array([ 2,  3,  4,  5,  6,  7,  8,  9, 10])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [11]:
print("Red wines: ")
print()
print("Best score: " + str(red_grid_search_1.best_score_))
print(red_grid_search_1.best_estimator_)
print(red_grid_search_1.best_params_)
print()
print("Best score: " + str(red_grid_search_2.best_score_))
print(red_grid_search_2.best_estimator_)
print(red_grid_search_2.best_params_)
print()
print("Best score: " + str(red_grid_search_3.best_score_))
print(red_grid_search_3.best_estimator_)
print(red_grid_search_3.best_params_)
print()
print("Best score: " + str(red_grid_search_4.best_score_))
print(red_grid_search_4.best_estimator_)
print(red_grid_search_4.best_params_)
print()
print()
print("White wines: ")
print()
print("Best score: " + str(white_grid_search_1.best_score_))
print(white_grid_search_1.best_estimator_)
print(white_grid_search_1.best_params_)
print()
print("Best score: " + str(white_grid_search_1.best_score_))
print(white_grid_search_2.best_estimator_)
print(white_grid_search_2.best_params_)
print()
print("Best score: " + str(white_grid_search_1.best_score_))
print(white_grid_search_3.best_estimator_)
print(white_grid_search_3.best_params_)
print()
print("Best score: " + str(white_grid_search_1.best_score_))
print(white_grid_search_4.best_estimator_)
print(white_grid_search_4.best_params_)

Red wines: 

Best score: 0.3227331471311155
DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=8,
           max_leaf_nodes=10, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=19,
           min_samples_split=6, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
{'max_depth': 4, 'max_features': 8, 'max_leaf_nodes': 10, 'min_samples_leaf': 19, 'min_samples_split': 6}

Best score: 0.31870400463868126
DecisionTreeRegressor(criterion='mse', max_depth=5, max_features=5,
           max_leaf_nodes=10, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=13,
           min_samples_split=3, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
{'max_depth': 5, 'max_features': 5, 'max_leaf_nodes': 10, 'min_samples_leaf': 13, 'min_samples_split': 3}

Best score: 0.3233797697278856
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=6,

#### What comes of that: 
##### Best (red): 
min_samples_split: 6, 3, 10, 10 (mean avg: 7.25, mode avg: 10)  
min_samples_leaf: 19, 13, 19, 14 (mean avg: 16.25, mode avg: 19)  
max_depth: 4, 5, 3, 5 (mean avg: 4.25, mode avg: 5)  
max_features: 8, 5, 6, 6 (mean avg: 6.25, mode avg: 6)  
max_leaf_nodes: 10, 10, 10, 10 (mean avg: 10, mode avg: 10)  
mean mean_absolute_error: 0.321144912  

##### Best (white): 
min_samples_split: 4, 2, 5, 11 (mean avg: 5.5, mode avg: n/a)  
min_samples_leaf: 21, 4, 8, 6 (mean avg: 9.75, mode avg: n/a)  
max_depth: 3, 3, 5, 4 (mean avg: 3.75, mode avg: 3)  
max_features: 8, 8, 7, 8 (mean avg: 7.75, mode avg: 8)  
max_leaf_nodes: 10, 10, 9, 10 (mean avg: 9.75, mode avg: 10) 
mean mean_absolute_error: 0.337293917  

|GridSearchCV Run|Wine Colour|min_samples_split|min_samples_leaf|max_depth|max_features|max_leaf_nodes|mean_absolute_error|
|---|---|---|---|---|---|---|---|
|1|Red|6|19|4|8|10|0.32|
|2|Red|3|13|5|5|10|0.32|
|3|Red|10|19|3|6|10|0.32|
|4|Red|10|14|5|6|10|0.32|
|Mean|Red|7.25|16.25|4.25|6.25|10|0.32|
|Mode|Red|10|19|5|6|10|n/a|
|1|White|4|21|3|8|10|0.34|
|2|White|2|4|3|8|10|0.34|
|3|White|5|8|5|7|9|0.34|
|4|White|11|6|4|8|10|0.34|
|Mean|White|5.5|9.75|3.75|7.75|9.75|0.34|
|Mode|White|n/a|n/a|3|8|10|n/a|

##### Chosen values:
###### Red: 
min_samples_split: 7  
min_samples_leaf: 16  
max_depth: 5  
max_features: 6  
max_leaf_nodes: 10  

##### White: 
min_samples_split: 5  
min_samples_leaf: 10  
max_depth: 4  
max_features: 8  
max_leaf_nodes: 10  

In [6]:
# Best scores so far: 
red_dtr = DecisionTreeRegressor(min_samples_split = 7, 
                                min_samples_leaf = 16, 
                                max_depth = 5, 
                                max_features = 6, 
                                max_leaf_nodes = 10)
white_dtr = DecisionTreeRegressor(min_samples_split = 5, 
                                  min_samples_leaf = 10, 
                                  max_depth = 4, 
                                  max_features = 8, 
                                  max_leaf_nodes = 10)

red_dtr.fit(rx_train, ry_train)
white_dtr.fit(wx_train, wy_train)

red_dtr_pred = red_dtr.predict(rx_test)
white_dtr_pred = white_dtr.predict(wx_test)

print("red dtr mae: " + str(mean_absolute_error(ry_test, red_dtr_pred)))
print("white dtr mae: " + str(mean_absolute_error(wy_test, white_dtr_pred)))

red dtr mae: 0.5347849932257697
white dtr mae: 0.5163841420706428


In [7]:
# With all data: 
red_dtr_all_data_shorter_trees = DecisionTreeRegressor(min_samples_split = 7, 
                                         min_samples_leaf = 16, 
                                         max_depth = 5, 
                                         max_features = 6, 
                                         max_leaf_nodes = 10)
white_dtr_all_data_shorter_trees = DecisionTreeRegressor(min_samples_split = 5, 
                                           min_samples_leaf = 10, 
                                           max_depth = 4, 
                                           max_features = 8, 
                                           max_leaf_nodes = 10)

red_dtr_all_data_shorter_trees.fit(red_train, red_target)
white_dtr_all_data_shorter_trees.fit(white_train, white_target)

DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=8,
           max_leaf_nodes=10, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=10,
           min_samples_split=5, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [8]:
feature_names = redwine.columns[:11]

export_graphviz(red_dtr_all_data_shorter_trees, out_file='red_graphviz_output_shorter_trees.dot', 
                feature_names=feature_names, 
                label='all', 
                filled=True, 
                rounded=True,
                leaves_parallel=True, 
                impurity=True)
subprocess.run(['dot','-Tpng', 'red_graphviz_output_shorter_trees.dot', '-o', 'red_tree_shorter.png'])

export_graphviz(white_dtr_all_data_shorter_trees, out_file='white_graphviz_output_shorter_trees.dot', 
                feature_names=feature_names, 
                label='all', 
                filled=True, 
                rounded=True,
                leaves_parallel=True, 
                impurity=True)
subprocess.run(['dot','-Tpng', 'white_graphviz_output_shorter_trees.dot', '-o', 'white_tree_shorter.png'])

CompletedProcess(args=['dot', '-Tpng', 'white_graphviz_output_shorter_trees.dot', '-o', 'white_tree_shorter.png'], returncode=0)

## The (shorter) graphs: 
### (Shorter) Red Wine Decision Tree
![(Shorter) Red Wine Decision Tree](red_tree_shorter.png)
#### What answers to questions 1 and 2 does this graph give?
In the above graph there are 9 decisions; 3 alcohol, 3 sulphates, 2 volatile acidity, and one pH.

In all 3 decisions, a greater amount of alcohol lead toward higher predicted scores for the wine, with the highest amount of alcohol leading to the highest predicted score. This suggests that for making good quality wine, making it _strong_ (higher alcohol content compared to the average) is a good idea.

Similarly, in all 3 decisions about sulphates, the higher amount of sulphates lead to the higher predicted quality scores. Sulphates are used as preservatives in many foodstuffs, including wine. This suggest that using sulphates in wine improves the quality, perhaps by preserving it.

2 decisions were about volatile acidity, where in both cases a lower reading of volatile acidity lead to the higher predictions of quality. Volatile acidity is what increases as wine turns to vinegar. This suggests that it's a good idea to keep the wine from turning, either by some manufacturing process, or by adding perservatives.

1 decision was about pH, which is a measure of acidity, and can be interpreted as being similar to volatile acidity for the purposes of determining what factors determine the quality of wine.

From the above, I infer that the main indicators of high quality wine are that it is strong (high alcohol content), and fresh (or properly preserved). 
### (Shorter) White Wine Decision Tree
![(Shorter) White Wine Decision Tree](white_tree_shorter.png)
#### What answers to questions 1 and 2 does this graph give?
As for red wine, there were 9 decisions in the tree, with 3 of them were for alcohol. Again, more alcohol lead to higher quality score predictions. This supports the above inference of strong wine being higher quality.

There were 3 decisions about volatile acidity, and in all three cases, higher acidity (which comes from the wine turning to vinegar over time) lead to lower predictions of quality. This supports the above inference of fresher wine being of higher quality.

Free sulfur dioxide comes up in 2 decisions. In both cases higher amounts of free sulful dioxide lead to higher estimates of quality. This supports the above inference of fresher wine being of higher quality, in this case via the use of preservatives.

There was one decision based on density, where lower density lead to lower quality. I am unable to draw any useful inference from this. 