## Wine doesn't grow on trees!

Instead of predicting a continuous variable like price, I'm going to shift gears and see whether it is possible to predict if a wine is expensive based on its features. For the purposes of this project, I'll define expensive as a retail price of $100 or more, as this seems like a lot to spend on a bottle. I'll begin with a decision tree, since trees are intuitive to understand, and tweak it as necessary. I'll be looking at accuracy and recall in particular, and will run a 10-fold cross-validation to check the stability of the model's output.

In [7]:
#Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pylab as pl
import seaborn as sns
%matplotlib inline

cab_df2 = pd.read_csv('./cablist4.csv')

In [8]:
#create a new target variable based on price
#$100 seems like a lot to spend on a bottle of wine, so that will be my threshold

cab_df2['HighPrice_Ind'] = cab_df2['PriceRetail'].apply(lambda x: 1 if x >= 100.00 else 0)

In [9]:
features = ['Vintage', 'RatingScore', 'UnqWordInd', 'Attribute_94+ Rated Wine', \
            'Attribute_Boutique Wines', 'Attribute_Business Gifts',\
            'Attribute_Collectible Wines', 'Attribute_Earthy &amp; Spicy', 'Attribute_Great Bottles to Give',\
            'Attribute_Green Wines', 'Attribute_Kosher Wines', 'Attribute_Older Vintages', \
            'Attribute_Private Cellar List', 'Attribute_Rich &amp; Creamy', 'Attribute_Screw Cap Wines', \
            'Attribute_Smooth &amp; Supple', 'Region_California', 'Region_Italy', 'Region_South Africa',\
            'Region_South America', 'Region_Spain', 'Region_Washington']
target = ['HighPrice_Ind']

In [10]:
x = cab_df2[features]
y = cab_df2.HighPrice_Ind.values

In [12]:
# split into train/test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.33, random_state=39)

In [13]:
from sklearn.tree import DecisionTreeClassifier
treeclf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20, random_state=39)
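# Fit on the training rows only so the test split stays held out.
# The results recorded below came from a fit on all of x and y.
treeclf.fit(x_train, y_train)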

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=20,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=39, splitter='best')
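
The fitted tree reports `criterion='gini'`, so each split is chosen to lower Gini impurity. The following is a minimal sketch of that calculation, not scikit-learn's implementation. The parent counts (560 and 344) are the class counts of the test split reported in this notebook's classification report; the split itself is hypothetical and only there to show the arithmetic.

```python
def gini(counts):
    """Gini impurity of a node, given the number of rows in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_gini(left_counts, right_counts):
    """Impurity after a split: child impurities weighted by child size."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * gini(left_counts) + (n_right / n) * gini(right_counts)


# [not high priced, high priced] counts in the parent node
parent = [560, 344]

# Hypothetical split (an assumption for illustration, not taken from the fitted tree)
left, right = [470, 101], [90, 243]

print("parent impurity: {:.3f}".format(gini(parent)))
print("after the split: {:.3f}".format(split_gini(left, right)))
```

A split is worth making when the weighted impurity of the children is lower than the impurity of the parent; the tree greedily picks the candidate with the largest drop.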

In [14]:
pd.DataFrame({'feature':features, 'importance':treeclf.feature_importances_})

                           feature  importance
0                          Vintage    0.105209
1                      RatingScore    0.687225
2                       UnqWordInd    0.000000
3         Attribute_94+ Rated Wine    0.000000
4         Attribute_Boutique Wines    0.000000
5         Attribute_Business Gifts    0.000000
6      Attribute_Collectible Wines    0.000000
7     Attribute_Earthy &amp; Spicy    0.000000
8  Attribute_Great Bottles to Give    0.000000
9            Attribute_Green Wines    0.000000


### The model seems to do a decent job predicting which wines are high priced or not.

In [15]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
preds = treeclf.predict(x_test)
cm = confusion_matrix(y_test, preds)

print("Train Testing Results \n\n")

print(classification_report(y_test, preds,
                         target_names=['not high priced', 'high priced']))

print('Confusion Matrix is:')
cm

Train Testing Results 


                 precision    recall  f1-score   support

not high priced       0.82      0.84      0.83       560
    high priced       0.73      0.71      0.72       344

    avg / total       0.79      0.79      0.79       904

Confusion Matrix is:


array([[470,  90],
       [101, 243]])
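
The numbers in the classification report can be recovered directly from the confusion matrix. Here is a small worked example, assuming scikit-learn's layout (rows are true classes, columns are predicted classes, with "high priced" as the positive class):

```python
# Confusion matrix copied from the output above
cm = [[470, 90],
      [101, 243]]

# Row 0 = truly not high priced, row 1 = truly high priced
(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # of the bottles flagged expensive, how many are
recall = tp / (tp + fn)      # of the expensive bottles, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print("accuracy:  {:.2f}".format(accuracy))
print("precision: {:.2f}".format(precision))
print("recall:    {:.2f}".format(recall))
print("f1:        {:.2f}".format(f1))
```

Recall is the metric to watch here: 101 of the 344 expensive bottles in the test split were missed.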

### Unfortunately it isn't stable, as the cross-validation shows

In [16]:
# cross_val_score lives in sklearn.model_selection (same module as train_test_split)
from sklearn.model_selection import cross_val_score
recall_scores = cross_val_score(treeclf, x, y, cv=10, scoring='recall')
accuracy_scores = cross_val_score(treeclf, x, y, cv=10, scoring='accuracy')
print("Recall Score Summary:")
print(recall_scores)
print(recall_scores.mean())

print("Accuracy Score Summary:")
print(accuracy_scores)
print(accuracy_scores.mean())

Recall Score Summary:
[ 1.          0.92929293  0.92929293  0.91919192  0.          0.68686869
  0.26262626  0.          0.05050505  0.        ]
0.477777777778
Accuracy Score Summary:
[ 0.36131387  0.50364964  0.56569343  0.47810219  0.63868613  0.72627737
  0.66423358  0.63868613  0.65693431  0.63736264]
0.58709392797
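
The mean alone hides how unstable the folds are. A short sketch that summarizes the spread of the per-fold recall values printed above (the 0.1 cutoff for a "weak" fold is an arbitrary choice for illustration):

```python
import statistics

# Per-fold recall copied from the cross-validation output above
recall_scores = [1.0, 0.92929293, 0.92929293, 0.91919192, 0.0,
                 0.68686869, 0.26262626, 0.0, 0.05050505, 0.0]

mean_recall = statistics.mean(recall_scores)
spread = statistics.pstdev(recall_scores)          # population std. dev. across folds
fold_range = max(recall_scores) - min(recall_scores)
weak_folds = sum(1 for r in recall_scores if r < 0.1)

print("mean recall: {:.3f}".format(mean_recall))
print("std. dev.:   {:.3f}".format(spread))
print("range:       {:.3f}".format(fold_range))
print("folds with recall below 0.1: {}".format(weak_folds))
```

A range of 1.0 means at least one fold caught every expensive bottle and at least one caught none, which is what "not stable" looks like in numbers.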


### Let's restrict features in hopes of improving stability

In [36]:
features2 = ['RatingScore', 'Region_California', 'Vintage']
target = ['HighPrice_Ind']

In [37]:
x2 = cab_df2[features2]
y2 = cab_df2.HighPrice_Ind.values

x2_train, x2_test, y2_train, y2_test = train_test_split(x2, y2, test_size=.33, random_state=39)

from sklearn.tree import DecisionTreeClassifier
treeclf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20, random_state=39)
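# Fit on the training rows only so the test split stays held out.
# The results recorded below came from a fit on all of x2 and y2.
treeclf.fit(x2_train, y2_train)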

pred2 = treeclf.predict(x2_test)
cm = confusion_matrix(y2_test, pred2)

print("Train Testing Results \n\n")

print(classification_report(y2_test, pred2,
                         target_names=['not high priced', 'high priced']))

print('Confusion Matrix is:')
cm

Train Testing Results 


                 precision    recall  f1-score   support

not high priced       0.82      0.84      0.83       560
    high priced       0.73      0.71      0.72       344

    avg / total       0.79      0.79      0.79       904

Confusion Matrix is:


array([[470,  90],
       [101, 243]])

In [38]:
recall_scores = cross_val_score(treeclf, x2, y2, cv=10, scoring='recall')
accuracy_scores = cross_val_score(treeclf, x2, y2, cv=10, scoring='accuracy')
print("Recall Score Summary:")
print(recall_scores)
print(recall_scores.mean())

print("Accuracy Score Summary:")
print(accuracy_scores)
print(accuracy_scores.mean())

Recall Score Summary:
[ 1.          0.92929293  0.92929293  0.91919192  0.          0.62626263
  0.26262626  0.          0.05050505  0.        ]
0.471717171717
Accuracy Score Summary:
[ 0.36131387  0.50364964  0.56569343  0.47810219  0.63868613  0.76277372
  0.66423358  0.63868613  0.65693431  0.63736264]
0.590743563006
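
Restricting the features barely moved the fold scores, and recall still falls from about 1.0 in the first folds to about 0.0 in the last ones. One possible explanation, which this notebook does not check, is that the rows in the CSV are ordered in some meaningful way: passing `cv=10` for a classifier uses stratified folds taken in file order, without shuffling. The sketch below uses synthetic data (not the wine table) sorted by its only feature, and compares unshuffled folds against a `StratifiedKFold` with `shuffle=True`.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (an assumption, NOT the real table):
# one score-like feature, a noisy binary label, rows sorted by the feature.
rng = np.random.RandomState(39)
score = np.sort(rng.normal(90, 3, size=1000))
label = (score + rng.normal(0, 2, size=1000) > 92).astype(int)
X = score.reshape(-1, 1)

clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20, random_state=39)

# cv=10 on a classifier: stratified folds built in row order, no shuffling
ordered = cross_val_score(clf, X, label, cv=10, scoring='recall')

# Same model, but rows are shuffled before being assigned to folds
shuffled_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=39)
shuffled = cross_val_score(clf, X, label, cv=shuffled_cv, scoring='recall')

print("ordered folds:  std of recall = %.3f" % ordered.std())
print("shuffled folds: std of recall = %.3f" % shuffled.std())
```

If shuffling the real data tightened the fold scores in the same way, the instability would be a property of how the folds were cut rather than of the model.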


In [39]:
# Use features2 here: the model was fit on the three restricted features
pd.DataFrame({'feature':features2, 'importance':treeclf.feature_importances_})

             feature  importance
0        RatingScore    0.687225
1  Region_California    0.207565
2            Vintage    0.105209


## A forest may be better than the tree -- Trying a random forest to improve results
I'm not that satisfied with the classification tree output, so I'm going to try a random forest, which averages many trees instead of relying on a single one.

In [43]:
x3 = cab_df2[features]
y3 = cab_df2.HighPrice_Ind.values

x3_train, x3_test, y3_train, y3_test = train_test_split(x3, y3, test_size=.33, random_state=39)

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100, max_depth=3, min_samples_leaf=20, random_state=39)
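# Fit on the training rows only so the test split stays held out.
# The results recorded below came from a fit on all of x3 and y3.
rfc.fit(x3_train, y3_train)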

pred3 = rfc.predict(x3_test)
cm = confusion_matrix(y3_test, pred3)

print("Train Testing Results \n\n")

print(classification_report(y3_test, pred3,
                         target_names=['not high priced', 'high priced']))

print('Confusion Matrix is:')
cm

Train Testing Results 


                 precision    recall  f1-score   support

not high priced       0.79      0.91      0.85       560
    high priced       0.80      0.61      0.69       344

    avg / total       0.80      0.79      0.79       904

Confusion Matrix is:


array([[509,  51],
       [135, 209]])

In [44]:
recall_scores = cross_val_score(rfc, x3, y3, cv=10, scoring='recall')
accuracy_scores = cross_val_score(rfc, x3, y3, cv=10, scoring='accuracy')
print("Recall Score Summary:")
print(recall_scores)
print(recall_scores.mean())

print("Accuracy Score Summary:")
print(accuracy_scores)
print(accuracy_scores.mean())

Recall Score Summary:
[ 1.          0.92929293  0.92929293  0.91919192  0.19191919  0.54545455
  0.15151515  0.01010101  0.05050505  0.        ]
0.472727272727
Accuracy Score Summary:
[ 0.36131387  0.50364964  0.72627737  0.47810219  0.7080292   0.78832117
  0.67518248  0.64233577  0.65693431  0.63736264]
0.617750862276


## Let's Gradient Boost this wine!
The random forest barely improved the results, and I'd like to do better, so I'm going to give gradient boosting a try. The logic is that the algorithm builds a sequence of small trees, each one fit to the errors left by the trees before it, so the ensemble gradually corrects its own mistakes.
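
To make the mistake-correcting idea concrete, here is a minimal sketch of boosting with one-split "stumps" and squared error. It is a simplification: scikit-learn's `GradientBoostingClassifier` uses deeper trees and a log-loss objective by default. The ratings and labels are made-up values, and `fit_stump` and `boost` are hypothetical helper names.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split of a 1-D feature under squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:          # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current errors."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # what is still wrong
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Made-up rating scores and high-price labels, for illustration only
ratings = [88, 90, 91, 92, 93, 94, 95, 96, 97, 98]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

model = boost(ratings, labels)
mean_label = sum(labels) / len(labels)
baseline_mse = sum((y - mean_label) ** 2 for y in labels) / len(labels)
boosted_mse = sum((y - model(x)) ** 2 for x, y in zip(ratings, labels)) / len(labels)

print("MSE predicting the mean: {:.3f}".format(baseline_mse))
print("MSE after boosting:      {:.3f}".format(boosted_mse))
```

Each round only has to explain what the previous rounds got wrong, which is why the training error keeps shrinking as stumps are added.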

In [45]:
x3 = cab_df2[features]
y3 = cab_df2.HighPrice_Ind.values

x3_train, x3_test, y3_train, y3_test = train_test_split(x3, y3, test_size=.33, random_state=39)

from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators=100, max_depth=3, min_samples_leaf=20, random_state=39)
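# Fit on the training rows only so the test split stays held out.
# The results recorded below came from a fit on all of x3 and y3.
gbc.fit(x3_train, y3_train)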

pred3 = gbc.predict(x3_test)
cm = confusion_matrix(y3_test, pred3)

print("Train Testing Results \n\n")

print(classification_report(y3_test, pred3,
                         target_names=['not high priced', 'high priced']))

print('Confusion Matrix is:')
cm

Train Testing Results 


                 precision    recall  f1-score   support

not high priced       0.81      0.90      0.85       560
    high priced       0.80      0.66      0.72       344

    avg / total       0.81      0.81      0.80       904

Confusion Matrix is:


array([[505,  55],
       [118, 226]])

In [46]:
recall_scores = cross_val_score(gbc, x3, y3, cv=10, scoring='recall')
accuracy_scores = cross_val_score(gbc, x3, y3, cv=10, scoring='accuracy')
print("Recall Score Summary:")
print(recall_scores)
print(recall_scores.mean())

print("Accuracy Score Summary:")
print(accuracy_scores)
print(accuracy_scores.mean())

Recall Score Summary:
[ 1.          0.57575758  0.87878788  0.75757576  0.47474747  0.41414141
  0.15151515  0.07070707  0.06060606  0.        ]
0.438383838384
Accuracy Score Summary:
[ 0.4379562   0.37226277  0.74817518  0.51824818  0.80656934  0.69343066
  0.54014599  0.66423358  0.64233577  0.63736264]
0.60607203016
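
A useful yardstick for the accuracy numbers is the score you would get by always predicting "not high priced". The sketch below estimates that baseline from the class counts in the test split (560 vs. 344), assuming the split is representative of the full data, and compares it with the mean cross-validated accuracy of each model as printed above.

```python
# Class counts from the test split's classification report
support = {"not high priced": 560, "high priced": 344}
baseline = max(support.values()) / sum(support.values())

# Mean 10-fold accuracy copied from the outputs above
cv_mean_accuracy = {
    "decision tree (all features)": 0.58709392797,
    "decision tree (3 features)": 0.590743563006,
    "random forest": 0.617750862276,
    "gradient boosting": 0.60607203016,
}

print("majority-class baseline: {:.3f}".format(baseline))
for name, acc in cv_mean_accuracy.items():
    print("{}: {:.3f} ({:+.3f} vs. baseline)".format(name, acc, acc - baseline))
```

Under that assumption, none of the cross-validated means clears the baseline, even though each model looked reasonable on the single train/test split.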


## Results Discussion:

The tree-based models did a mediocre job of predicting whether a bottle of wine is priced at $100 or more. While recall and accuracy looked decent on a single train/test split, they were not stable across the 10 cross-validation folds: recall ranged from 0 to 1 for every model, and mean accuracy stayed between roughly 0.59 and 0.62. I think this can be improved, so I'm going to try some new modeling techniques.