## Starting Out 🍷

I'm interested in predicting price, and since that is a continuous variable I am going to start with a regression tree model. It's an intuitive model and rather forgiving of the underlying data distribution, i.e. the data not being normally distributed isn't a deal breaker. The main risk with trees is that they can grow too deep, learning too much about the training data and failing to generalize to new observations.
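To make the overfitting risk concrete, here is a minimal sketch on synthetic data (not the wine dataset; the feature, the target, and the `rmse` helper are all made up for illustration). It compares a shallow tree with one that is allowed to grow without a depth limit.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the wine data: one noisy feature, NOT the real dataset.
rng = np.random.RandomState(39)
X = rng.uniform(80, 100, size=(600, 1))             # a rating-like score
y = 5.0 * (X[:, 0] - 80) + rng.normal(0, 20, 600)   # a noisy price-like target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=39)

def rmse(model, X_part, y_part):
    """Root mean squared error of a fitted model on one data split."""
    return float(np.sqrt(np.mean((model.predict(X_part) - y_part) ** 2)))

results = {}
for depth in (2, None):  # None lets the tree grow until it cannot split further
    tree = DecisionTreeRegressor(max_depth=depth, random_state=39).fit(X_tr, y_tr)
    results[depth] = (rmse(tree, X_tr, y_tr), rmse(tree, X_te, y_te))
    print(depth, results[depth])  # (train RMSE, test RMSE)
```

The unrestricted tree should fit the training rows almost perfectly while doing worse on the held-out rows, which is why the models below cap `max_depth` and set `min_samples_leaf`.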

In [1]:
#Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pylab as pl
import seaborn as sns
%matplotlib inline

cab_df2 = pd.read_csv('./cablist4.csv')
cab_df2 = cab_df2[cab_df2.RatingScore != 0]
cab_df2 = cab_df2[cab_df2.PriceRetail < 1000.00]

In [2]:
#Set features & target for modeling
features = ['Vintage', 'RatingScore', 'UnqWordInd', 'Attribute_94+ Rated Wine', \
            'Attribute_Boutique Wines', 'Attribute_Business Gifts',\
            'Attribute_Collectible Wines', 'Attribute_Earthy &amp; Spicy', 'Attribute_Great Bottles to Give',\
            'Attribute_Green Wines', 'Attribute_Kosher Wines', 'Attribute_Older Vintages', \
            'Attribute_Private Cellar List', 'Attribute_Rich &amp; Creamy', 'Attribute_Screw Cap Wines', \
            'Attribute_Smooth &amp; Supple', 'Region_California', 'Region_Italy', 'Region_South Africa',\
            'Region_South America', 'Region_Spain', 'Region_Washington']
target = ['PriceRetail']

In [3]:
#Set x & y for model building
x = cab_df2[features]
y = cab_df2[target]

In [4]:
# split into train/test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.33, random_state=39)

In [6]:
#Is there an optimal tree depth that will minimize RMSE? Looping through max_depth values 1 through 6
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

leafs = range(1, 7)
v = []
for l in leafs:
    treereg = DecisionTreeRegressor(max_depth=l, min_samples_leaf=20, random_state=39)
    treereg.fit(x_train, y_train)
    scores = cross_val_score(treereg, x, y, cv=10, scoring='neg_mean_squared_error')
    v.append(np.mean(np.sqrt(-scores)))
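The loop above scores each depth on the full `x` and `y`, so the rows later used for testing also influence the choice of depth. One alternative, sketched here on synthetic placeholder data (the names `X_demo`, `y_demo`, and `search` are mine, not from the notebook), is to let `GridSearchCV` run the same search on the training split only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic placeholder data; in the notebook this would be x and y.
rng = np.random.RandomState(39)
X_demo = rng.uniform(0, 1, size=(500, 3))
y_demo = 100 * X_demo[:, 0] + 30 * (X_demo[:, 1] > 0.5) + rng.normal(0, 10, 500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=39)

search = GridSearchCV(
    DecisionTreeRegressor(min_samples_leaf=20, random_state=39),
    param_grid={'max_depth': list(range(1, 7))},  # same depths as the loop above
    scoring='neg_mean_squared_error',
    cv=10,
)
search.fit(X_tr, y_tr)  # only the training rows take part in the search

best_depth = search.best_params_['max_depth']
# Square root of the mean fold MSE; this differs slightly from the
# mean of per-fold RMSEs that the notebook computes.
cv_rmse = float(np.sqrt(-search.best_score_))
print(best_depth, cv_rmse)
```

Keeping the test split out of the search means the test RMSE stays an honest estimate of how the chosen depth performs on unseen wines.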

In [7]:
#Find the depth with the lowest cross-validated RMSE (5 here).
#v[0] corresponds to max_depth=1, so add 1 to the index of the minimum.
opt_leaf = int(np.argmin(v)) + 1
print(opt_leaf)

5


In [8]:
#Build 1st model using regression decision tree using the optimized depth input
from sklearn.tree import DecisionTreeRegressor
treereg = DecisionTreeRegressor(max_depth=5, min_samples_leaf=20, random_state=39)
treereg.fit(x_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=5, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=20, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=39,
           splitter='best')

In [9]:
#Use model to predict test values
preds = treereg.predict(x_test)

#Comparison of predictions and test values... Some rather large differences
print(preds[0:10])
print(y_test[0:10])

[  97.58757576   21.17792899  182.69392405   97.31268634   21.17792899
  129.30130435  129.30130435   50.80906404   44.75007519  180.93235772]
      PriceRetail
2522       159.00
2121        27.99
617         85.00
808         79.99
2111        18.00
235        189.00
299         79.99
2368        45.00
1680        49.99
336        149.00


In [10]:
#What are the important features?
pd.DataFrame({'feature':features, 'importance':treereg.feature_importances_})

                           feature  importance
0                          Vintage    0.062005
1                      RatingScore    0.823285
2                       UnqWordInd    0.000000
3         Attribute_94+ Rated Wine    0.000000
4         Attribute_Boutique Wines    0.000000
5         Attribute_Business Gifts    0.000000
6      Attribute_Collectible Wines    0.001342
7     Attribute_Earthy &amp; Spicy    0.000000
8  Attribute_Great Bottles to Give    0.000000
9            Attribute_Green Wines    0.000000


In [11]:
#RMSE is 108
from sklearn import metrics
import numpy as np
np.sqrt(metrics.mean_squared_error(y_test, preds))

107.96017152647437

In [12]:
#Point of comparison... The average wine price in the dataset is $114
cab_df2['PriceRetail'].mean()

114.29228550565573

In [13]:
metrics.mean_squared_error(y_test, preds)

11655.398636025768

In [14]:
#How do the results hold up in cross-validation?
#Note: min_samples_leaf is 30 in this cell, not the 20 used for the model above.
from sklearn.model_selection import cross_val_score

treereg = DecisionTreeRegressor(max_depth=5, min_samples_leaf=30, random_state=39)
treereg.fit(x_train, y_train)
scores = cross_val_score(treereg, x, y, cv=10, scoring='neg_mean_squared_error')
np.mean(np.sqrt(-scores))

94.826603456314544

### Not a great first pass -- Can this be improved?

I'm not happy with the RMSE of the first model. At $108, it's nearly as large as the average price of our wines ($114), which indicates a large error. Let's see if it's possible to improve this by first restricting the input variables.
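Another point of comparison, besides the average price, is the RMSE of a naive model that always predicts the training-set mean. The sketch below uses made-up prices (not values from `cablist4.csv`) and a hypothetical `rmse` helper; a useful model should beat this baseline.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up prices for illustration only.
train_prices = np.array([18.0, 27.99, 45.0, 49.99, 79.99, 85.0, 149.0, 189.0])
test_prices = np.array([21.0, 35.0, 60.0, 95.0, 400.0])

# Naive baseline: predict the training mean for every test wine.
baseline_pred = np.full(len(test_prices), train_prices.mean())
baseline_rmse = rmse(test_prices, baseline_pred)
print(baseline_rmse)
```

In the notebook, the equivalent would be comparing the tree's test RMSE with `rmse(y_test, ...)` for a constant prediction of `y_train.mean()`.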

In [15]:
features2 = ['RatingScore', 'UnqWordInd','Attribute_Boutique Wines',\
            'Attribute_Collectible Wines', 'Attribute_Smooth &amp; Supple',\
            'Attribute_Older Vintages','Region_California', 'Region_South America']
target = ['PriceRetail']

In [16]:
x = cab_df2[features2]
y = cab_df2[target]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.33, random_state=39)

treereg = DecisionTreeRegressor(max_depth=5, min_samples_leaf=20, random_state=39)
treereg.fit(x_train, y_train)

preds = treereg.predict(x_test)
np.sqrt(metrics.mean_squared_error(y_test, preds))


105.83346432841651

In [17]:
#Note: max_depth is 3 in this cell, not the 5 used for the model above.
treereg = DecisionTreeRegressor(max_depth=3, min_samples_leaf=20, random_state=39)
treereg.fit(x_train, y_train)
scores = cross_val_score(treereg, x, y, cv=10, scoring='neg_mean_squared_error')
np.mean(np.sqrt(-scores))

98.074115157610294

### Will a Forest be better than the tree?
Restricting the features didn't do much to improve the results, so I'm going to see if a random forest regression model does better. I'm going to run 100 trees and see if this improves the predictions.
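As a rough illustration of what the forest does, the sketch below fits a `RandomForestRegressor` on synthetic data (the arrays `X_demo` and `y_demo` are placeholders, not the wine features) and checks that its prediction is the average of its individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic placeholder data, NOT the wine dataset.
rng = np.random.RandomState(39)
X_demo = rng.uniform(0, 1, size=(400, 4))
y_demo = 80 * X_demo[:, 0] + rng.normal(0, 15, 400)

forest = RandomForestRegressor(
    n_estimators=100, max_depth=5, min_samples_leaf=20, random_state=39)
forest.fit(X_demo, y_demo)

# Each fitted tree is available in forest.estimators_.
sample = X_demo[:5]
per_tree = np.array([tree.predict(sample) for tree in forest.estimators_])

manual_average = per_tree.mean(axis=0)   # average over the 100 trees
print(per_tree.std(axis=0))              # how much the trees disagree
print(forest.predict(sample))            # the forest's own prediction
```

Because each tree sees a different bootstrap sample, the individual predictions vary; averaging them smooths out some of that variance, which is the hoped-for gain over a single tree.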

In [47]:
x = cab_df2[features]
y = cab_df2[target]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.33, random_state=39)

from sklearn.ensemble import RandomForestRegressor
regr = RandomForestRegressor(n_estimators = 100, max_depth=5, min_samples_leaf=20, random_state=39)
regr.fit(x_train, y_train)

preds = regr.predict(x_test)
np.sqrt(metrics.mean_squared_error(y_test, preds))



105.53493632018667

In [48]:
scores = cross_val_score(regr, x, y, cv=10, scoring='neg_mean_squared_error')
np.mean(np.sqrt(-scores))

91.840075438925169

### Should outliers be further scrubbed?

RMSE barely improves with either the restricted feature set or the random forest. There may be too many wines with high prices that are confusing the model. What happens if the dataset is further restricted to prices < $360? This will capture all wines priced within 2 standard deviations of the average price.
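For reference, a cutoff of this kind can be computed as the mean plus two standard deviations. The sketch below uses invented prices and a hypothetical helper `two_sigma_cutoff`; it assumes the sample standard deviation (`ddof=1`, which is also the pandas default), and I have not confirmed that the $360 figure was derived exactly this way.

```python
import numpy as np

def two_sigma_cutoff(prices):
    """Upper cutoff at mean + 2 sample standard deviations."""
    prices = np.asarray(prices, dtype=float)
    return float(prices.mean() + 2 * prices.std(ddof=1))

# Invented prices: mostly everyday bottles plus one very expensive outlier.
prices = np.array([20, 25, 30, 35, 40, 45, 50, 55, 60, 65,
                   70, 75, 80, 85, 90, 95, 100, 110, 120, 900], dtype=float)

cutoff = two_sigma_cutoff(prices)
kept = prices[prices < cutoff]   # same style of filter as cab_df2.PriceRetail < 360.00
print(cutoff, len(kept))
```

Note that the outlier itself inflates both the mean and the standard deviation, so a cutoff computed this way is more lenient than one based on the median or interquartile range.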

In [20]:
cab_df3 = cab_df2[cab_df2.PriceRetail < 360.00]

In [22]:
x = cab_df3[features]
y = cab_df3[target]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.33, random_state=39)

treereg = DecisionTreeRegressor(max_depth=5, min_samples_leaf=20, random_state=39)
treereg.fit(x_train, y_train)

preds = treereg.predict(x_test)
np.sqrt(metrics.mean_squared_error(y_test, preds))

52.198965202916682

In [23]:
treereg = DecisionTreeRegressor(max_depth=5, min_samples_leaf=20, random_state=39)
treereg.fit(x_train, y_train)
scores = cross_val_score(treereg, x, y, cv=10, scoring='neg_mean_squared_error')
np.mean(np.sqrt(-scores))

55.533368543596019

### Getting closer -- Trying the Forest model again.

RMSE has improved considerably, but 52 is still a large error. I'm going to rerun the random forest model and see if the error can be further minimized.

In [42]:
x = cab_df3[features]
y = cab_df3[target]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.33, random_state=39)

from sklearn.ensemble import RandomForestRegressor
regr = RandomForestRegressor(n_estimators = 100, max_depth=5, min_samples_leaf=20, random_state=39)
regr.fit(x_train, y_train)

preds = regr.predict(x_test)
rmse = np.sqrt(metrics.mean_squared_error(y_test, preds))



In [43]:
scores = cross_val_score(regr, x, y, cv=10, scoring='neg_mean_squared_error')
cv_rmse = np.mean(np.sqrt(-scores))

In [44]:
avgPrice = cab_df3['PriceRetail'].mean()
err = rmse/avgPrice

print('RMSE is:                      %r' % rmse)
print('Cross-Validation Avg RMSE is: %r' % cv_rmse)
print('Avg Price is:                 %r' % avgPrice)
print('Error to Price Ratio is:      %r' % err)

RMSE is:                      50.942754973664691
Cross-Validation Avg RMSE is: 55.206403711971802
Avg Price is:                 92.04570273392119
Error to Price Ratio is:      0.55345066049336578


## Discussion

The results fall short of what I was hoping for. The goal was to forecast the price of wine with an error of less than 30% of the average price, and the best model so far has an error of about 55%. As of yet, I have not been able to reject the null hypothesis, so I will try another approach using linear regression to see if this improves the accuracy.
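To put the gap in dollar terms, this small sketch takes the RMSE and average price reported above and works out the RMSE that a 30% target would require. The variable names are mine, and the 0.30 threshold is the target stated in this discussion.

```python
# Values copied from the output of the final random forest run above.
rmse_reported = 50.942754973664691
avg_price = 92.04570273392119
target = 0.30  # error should be under 30% of the average price

ratio = rmse_reported / avg_price    # observed error-to-price ratio
required_rmse = target * avg_price   # RMSE needed to meet the target

print('ratio %.3f, meets target: %s' % (ratio, ratio < target))
print('required RMSE: %.2f' % required_rmse)
```

The next model would need to bring RMSE down from roughly $51 to under about $28 to meet the target on this restricted dataset.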