# Models for Project 4

### Introduction
In this notebook, I hope to answer the following question:

Are a wine's points predictable based on its characteristics?

##### Step 1: Imports and Clean-up

Importing data and packages that will be used for analysis

In [1]:
import numpy as np
import pandas as pd
import scipy as sp
import seaborn as sns
import matplotlib.pyplot as plt

from scipy.stats import stats, mannwhitneyu
from sklearn.model_selection import train_test_split, KFold, cross_val_score, cross_val_predict
from sklearn import metrics, linear_model, model_selection
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
#from sklearn.naive_bayes import MultinomialNB    
from sklearn.metrics import r2_score
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_selection import SelectFromModel
#importing unicodedata to handle é in rosé wine variety
import unicodedata
%matplotlib inline 

After importing all dependencies, we load our dataset and save it as the 'wine' variable.

In [2]:
wine = pd.read_json('./wine-reviews/winemag-data-130k-v2.json', dtype='unicode');

Cleaning up accent marks within dataset by replacing Rosé with Rose.

In [3]:
# assign the result back instead of calling replace(inplace=True) on a slice
wine['variety'] = wine['variety'].replace('Rosé', 'Rose')
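The replacement above handles one spelling. As a more general sketch, the `unicodedata` module imported earlier can strip accent marks from any string. The helper name `strip_accents` is hypothetical and not part of the original notebook, and this approach only removes combining marks (it would leave letters such as 'ø' untouched).

```python
import unicodedata

def strip_accents(text):
    """Return text with combining accent marks removed."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    # Keep every character that is not a combining mark.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('Rosé'))
print(strip_accents('Grüner Veltliner'))
```

Applied to a column, this could be used as `wine['variety'].map(strip_accents)`, assuming every value is a string.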

Checking for null values and reviewing data information.

In [4]:
wine.isnull().sum()

country                  0
description              0
designation              0
points                   0
price                    0
province                 0
region_1                 0
region_2                 0
taster_name              0
taster_twitter_handle    0
title                    0
variety                  0
winery                   0
dtype: int64

In [5]:
wine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129971 entries, 0 to 129970
Data columns (total 13 columns):
country                  129971 non-null object
description              129971 non-null object
designation              129971 non-null object
points                   129971 non-null object
price                    129971 non-null object
province                 129971 non-null object
region_1                 129971 non-null object
region_2                 129971 non-null object
taster_name              129971 non-null object
taster_twitter_handle    129971 non-null object
title                    129971 non-null object
variety                  129971 non-null object
winery                   129971 non-null object
dtypes: object(13)
memory usage: 12.9+ MB


Changing points and price types to integer and float

In [6]:
#changing dtypes for points and price
wine['points'] = wine.points.astype(int)
wine['price'] = wine.price.astype(float)
wine.head(2)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos


In [7]:
#Dropping nan values
wine.dropna(inplace=True)
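Because every column was loaded as text, missing values may be stored as strings rather than real nulls (the `dum_country_None` dummy column further down suggests this). A minimal sketch, on made-up data, of converting such columns while turning unparseable entries into `NaN` so that `dropna` can see them; the toy values and the `'None'` placeholder are assumptions for illustration:

```python
import pandas as pd

# Hypothetical rows: everything is a string, one price is the text 'None'.
toy = pd.DataFrame({'points': ['87', '90', '85'],
                    'price': ['15.0', 'None', '30.0']})

# errors='coerce' turns anything that cannot be parsed into NaN.
toy['points'] = pd.to_numeric(toy['points'], errors='coerce').astype(int)
toy['price'] = pd.to_numeric(toy['price'], errors='coerce')

# Drop only the rows that are missing a price.
toy_clean = toy.dropna(subset=['price'])
print(toy_clean)
```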

##### Step 2: Dummy Data
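The next step in building our model is to dummy the data. I've dummied country, designation, province, region_1, region_2, and variety, which turns each category into a numeric (0/1) column we can use in our model. Note that each cell below reassigns `dummy`, so only the variety dummies from the last cell are carried into `dummy_wine`.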

In [8]:
dummy = pd.get_dummies(wine[['country']],prefix='dum_country', drop_first=True)
print(dummy.columns)

Index(['dum_country_Armenia', 'dum_country_Australia', 'dum_country_Austria',
       'dum_country_Bosnia and Herzegovina', 'dum_country_Brazil',
       'dum_country_Bulgaria', 'dum_country_Canada', 'dum_country_Chile',
       'dum_country_China', 'dum_country_Croatia', 'dum_country_Cyprus',
       'dum_country_Czech Republic', 'dum_country_England',
       'dum_country_France', 'dum_country_Georgia', 'dum_country_Germany',
       'dum_country_Greece', 'dum_country_Hungary', 'dum_country_India',
       'dum_country_Israel', 'dum_country_Italy', 'dum_country_Lebanon',
       'dum_country_Luxembourg', 'dum_country_Macedonia', 'dum_country_Mexico',
       'dum_country_Moldova', 'dum_country_Morocco', 'dum_country_New Zealand',
       'dum_country_None', 'dum_country_Peru', 'dum_country_Portugal',
       'dum_country_Romania', 'dum_country_Serbia', 'dum_country_Slovakia',
       'dum_country_Slovenia', 'dum_country_South Africa', 'dum_country_Spain',
       'dum_country_Switzerland', 'dum_c

In [9]:
dummy = pd.get_dummies(wine[['designation']],prefix='dum_designation', drop_first=True)
print(dummy.columns)
#lessons learned: more data cleaning is needed, e.g. for foreign languages, wine branding, etc.

Index(['dum_designation_#50 Mon Chou', 'dum_designation_#SocialSecret',
       'dum_designation_%@#$!', 'dum_designation_&',
       'dum_designation_'61 Rosé', 'dum_designation_'A Rina',
       'dum_designation_'Blend 105' Red Wine', 'dum_designation_'Na Vota',
       'dum_designation_'S'', 'dum_designation_'Unfiltered'',
       ...
       'dum_designation_Ürziger Würzgarten GG Réserve Alte Reben Erste Lage Dry',
       'dum_designation_Ürziger Würzgarten Kabinett',
       'dum_designation_Ürziger Würzgarten Spätlese',
       'dum_designation_étoile Brut', 'dum_designation_étoile Rosé',
       'dum_designation_Župska', 'dum_designation_‘Rough Justice' Red',
       'dum_designation_‘S'', 'dum_designation_“Champ” Jim the Gent',
       'dum_designation_“Champ” Lightnin' Lane'],
      dtype='object', length=35776)


In [10]:
dummy = pd.get_dummies(wine[['province']],prefix='dum_province', drop_first=True)
print(dummy.columns)

Index(['dum_province_Aconcagua Costa', 'dum_province_Aconcagua Valley',
       'dum_province_Aegean', 'dum_province_Agioritikos', 'dum_province_Ahr',
       'dum_province_Alenquer', 'dum_province_Alentejano',
       'dum_province_Alentejo', 'dum_province_Alenteo', 'dum_province_Algarve',
       ...
       'dum_province_Wellington', 'dum_province_Western Australia',
       'dum_province_Western Cape', 'dum_province_Wiener Gemischter Satz',
       'dum_province_Württemberg', 'dum_province_Zenata',
       'dum_province_Österreichischer Perlwein',
       'dum_province_Österreichischer Sekt', 'dum_province_Štajerska',
       'dum_province_Župa'],
      dtype='object', length=422)


In [11]:
dummy = pd.get_dummies(wine[['region_1']],prefix='dum_region_1', drop_first=True)
print(dummy.columns)

Index(['dum_region_1_Adelaida District', 'dum_region_1_Adelaide',
       'dum_region_1_Adelaide Hills', 'dum_region_1_Adelaide Plains',
       'dum_region_1_Aglianico d'Irpinia',
       'dum_region_1_Aglianico del Beneventano',
       'dum_region_1_Aglianico del Taburno',
       'dum_region_1_Aglianico del Vulture', 'dum_region_1_Agrelo',
       'dum_region_1_Albana di Romagna',
       ...
       'dum_region_1_Yadkin Valley', 'dum_region_1_Yakima Valley',
       'dum_region_1_Yamhill County', 'dum_region_1_Yarra Valley',
       'dum_region_1_Yecla', 'dum_region_1_Yolo County',
       'dum_region_1_York Mountain', 'dum_region_1_Yorkville Highlands',
       'dum_region_1_Yountville', 'dum_region_1_Zonda Valley'],
      dtype='object', length=1204)


In [12]:
dummy = pd.get_dummies(wine[['region_2']],prefix='dum_region_2', drop_first=True)
print(dummy.columns)

Index(['dum_region_2_Central Coast', 'dum_region_2_Central Valley',
       'dum_region_2_Columbia Valley', 'dum_region_2_Finger Lakes',
       'dum_region_2_Long Island', 'dum_region_2_Napa',
       'dum_region_2_Napa-Sonoma', 'dum_region_2_New York Other',
       'dum_region_2_None', 'dum_region_2_North Coast',
       'dum_region_2_Oregon Other', 'dum_region_2_Sierra Foothills',
       'dum_region_2_Sonoma', 'dum_region_2_South Coast',
       'dum_region_2_Southern Oregon', 'dum_region_2_Washington Other',
       'dum_region_2_Willamette Valley'],
      dtype='object')


In [13]:
dummy = pd.get_dummies(wine[['variety']],prefix='dum_variety', drop_first=True)
print(dummy.columns)
dummy.shape

Index(['dum_variety_Agiorgitiko', 'dum_variety_Aglianico',
       'dum_variety_Aidani', 'dum_variety_Airen', 'dum_variety_Albana',
       'dum_variety_Albanello', 'dum_variety_Albariño',
       'dum_variety_Albarossa', 'dum_variety_Aleatico',
       'dum_variety_Alfrocheiro',
       ...
       'dum_variety_Xynisteri', 'dum_variety_Yapincak', 'dum_variety_Zibibbo',
       'dum_variety_Zierfandler', 'dum_variety_Zierfandler-Rotgipfler',
       'dum_variety_Zinfandel', 'dum_variety_Zlahtina', 'dum_variety_Zweigelt',
       'dum_variety_Çalkarası', 'dum_variety_Žilavka'],
      dtype='object', length=697)


(120975, 697)

In [14]:
# Concatenate the original DataFrame and the dummy DataFrame (axis=0 means rows, axis=1 means columns).
dummy_wine = pd.concat([wine, dummy], axis=1)
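Since each `get_dummies` cell above assigns to the same `dummy` variable, one way to keep the dummies for several columns is to encode them in a single call. This is a sketch on a made-up three-column frame, not the notebook's actual data:

```python
import pandas as pd

# Hypothetical stand-in for the wine DataFrame.
toy = pd.DataFrame({
    'country': ['Italy', 'France', 'Italy', 'Spain'],
    'variety': ['Rose', 'Syrah', 'Syrah', 'Rose'],
    'points': [87, 90, 88, 85],
})

categorical = ['country', 'variety']

# One prefix per encoded column; drop_first drops the first level of each.
dummies = pd.get_dummies(toy[categorical],
                         prefix=['dum_country', 'dum_variety'],
                         drop_first=True)

# Concatenate column-wise so the originals and all dummies sit side by side.
toy_dummies = pd.concat([toy, dummies], axis=1)
print(list(dummies.columns))
```

For very high-cardinality columns such as designation, this produces tens of thousands of columns, so it may be worth reducing the categories first.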

##### Step 3: Feature Selection
The next step, before we start modeling our data, is to select our features ('X' variables). This process will help us determine which features have the greatest impact on predicting our outcome or 'y' variable. 

The code below is designed to find the correlation of points to each wine feature.

In [15]:
#save the correlation matrix in a variable
corr_dummy_wine = dummy_wine.corr()  

In [16]:
corr_points = corr_dummy_wine.loc[:, ['points']]
# find the 25 features most positively correlated with points
print(corr_points.nlargest(25, 'points'))             
print('------------------------------')
# find the 25 features most negatively correlated with points
print(corr_points.nsmallest(25, 'points'))             

                                        points
points                                1.000000
price                                 0.416167
dum_variety_Pinot Noir                0.111451
dum_variety_Nebbiolo                  0.087904
dum_variety_Riesling                  0.069101
dum_variety_Sangiovese Grosso         0.054668
dum_variety_Syrah                     0.053358
dum_variety_Grüner Veltliner          0.051174
dum_variety_Champagne Blend           0.040423
dum_variety_Port                      0.033572
dum_variety_Bordeaux-style Red Blend  0.026134
dum_variety_Rhône-style Red Blend     0.025424
dum_variety_Blaufränkisch             0.022330
dum_variety_Portuguese Red            0.019759
dum_variety_Cabernet Sauvignon        0.017977
dum_variety_Shiraz                    0.016929
dum_variety_Aglianico                 0.015824
dum_variety_Austrian white blend      0.015097
dum_variety_Tinta de Toro             0.014080
dum_variety_Sagrantino                0.013986
dum_variety_T
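Only the `points` column of the correlation matrix is used above. As a sketch under that assumption, `DataFrame.corrwith` computes the correlation of each feature with the target directly, without building the full feature-by-feature matrix. The toy frame below is made up:

```python
import pandas as pd

# Hypothetical data: price is an exact linear function of points.
toy = pd.DataFrame({
    'points': [85, 87, 88, 90, 92],
    'price': [10.0, 14.0, 16.0, 20.0, 24.0],
    'dum_variety_Syrah': [0, 1, 0, 1, 1],
})

features = toy.drop(columns='points')

# Pearson correlation of every feature column with the target column.
corr_with_points = features.corrwith(toy['points']).sort_values(ascending=False)
print(corr_with_points)
```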

##### Can we predict the points rating based on other characteristics of a wine? 

##### Linear Regression
Using linear regression I wanted to try to predict the outcome variable (in this case points rating) based on various X variables (or column features including dummy columns). I've fit three linear regression models, each with different 'X' variables, with a training set of 'X' features to try to predict 'y'.

Null Hypothesis Regression

In [17]:
y_null = np.full((len(dummy_wine.points), ), dummy_wine.points.mean())
print("RMSE",np.sqrt(metrics.mean_squared_error(y_null, wine['points'])))
print("R-Squared",metrics.r2_score(y_null, wine['points']))

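# R-squared is at most 1 and can be arbitrarily negative; higher means the model explains more of the variation in y.
# Note: r2_score expects (y_true, y_pred). Here the constant null prediction is passed first, so its
# near-zero variance lands in the denominator and the printed R-squared is a huge negative number.
# By definition, a model that always predicts the mean explains none of the variation in y.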

RMSE 3.0444953794531355
R-Squared -1.1474406946010606e+28


We'll use our null RMSE of 3.04 as a benchmark for our models.
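To make the benchmark concrete, here is a minimal sketch on made-up scores. It shows that the RMSE of always predicting the mean equals the standard deviation of the scores, and that the corresponding R-squared is 0 when the true values are used for the total sum of squares. The `points` array is hypothetical:

```python
import numpy as np

# Hypothetical point ratings.
points = np.array([85.0, 87.0, 88.0, 90.0, 92.0, 86.0])

# Null model: predict the mean for every wine.
null_pred = np.full(points.shape, points.mean())

residual_ss = np.sum((points - null_pred) ** 2)   # error of the null model
total_ss = np.sum((points - points.mean()) ** 2)  # variation in the true values

rmse = np.sqrt(residual_ss / len(points))
r2 = 1 - residual_ss / total_ss

print("RMSE", rmse)
print("R-squared", r2)
```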

Linear Regression with price feature only

In [18]:
features1 = ['price']

X_wine = wine[features1]                # used to make prediction                                                    
y_wine = wine['points']                # what we want to predict 

X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(X_wine, y_wine, random_state = 413)

# train data is what the model is fed
# testing data is new and shows how well the model can adjust
linreg = LinearRegression()
linreg.fit(X_train_wine, y_train_wine)
    
y_pred_wine = linreg.predict(X_test_wine)

print("Points value if all X variables were ZERO",linreg.intercept_) # print the intercept
print("R-sq train", linreg.score(X_train_wine,y_train_wine)) # R-squared on the training set
print("R-sq test", linreg.score(X_test_wine,y_test_wine)) # R-squared on the test set

print("RMSE",np.sqrt(metrics.mean_squared_error(y_test_wine, y_pred_wine)))  # RMSE on the test set

Points value if all X variables were ZERO 87.37127353791841
R-sq train 0.16754316064753183
R-sq test 0.1890867277908327
RMSE 2.7448290922077474


In [19]:
pd.DataFrame({'Features':X_wine.columns, 'Coefs':linreg.coef_[0]}).sort_values(by='Coefs')


Unnamed: 0,Features,Coefs
0,price,0.029706


Running our first linear regression model with price as the only 'X' variable, our R-squared values are 0.17 on the training set and 0.19 on the test set. R-squared measures the share of the variation in points that the model explains, so we want it to be as close to 1 as possible. These R-squared values are low, so price alone isn't a reliable predictor. That said, the RMSE of 2.74 is better than our null RMSE of 3.04.
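For reference, the two metrics used throughout can be written as follows, where $y_i$ is the actual points value, $\hat{y}_i$ the model's prediction, $\bar{y}$ the mean of the actual values and $n$ the number of wines. The symbols are notation introduced here, not taken from the notebook.

```latex
\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
\]
```

RMSE is in the same units as points, so lower is better, while $R^2$ compares the model's squared error with that of always predicting the mean.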

Linear Regression with price and variety dummy features

In [20]:
#linear regression with variety dummy and price features
features3 = dummy_wine.filter(regex='dum_variety(.*)|(price)')
#for dummies maybe use only top 10 countries, 20 regions etc -- drop ones I don't need to use in features

# used to make prediction
X_dummy_var = dummy_wine[features3.columns]  

# what we want to predict
y_dummy_var = dummy_wine.points                

#split train test
X_train_dummy_var, X_test_dummy_var, y_train_dummy_var, y_test_dummy_var = train_test_split(X_dummy_var, y_dummy_var, random_state = 413)

# train data is what the model is fed
# testing data is new and shows how well the model can adjust
linreg2 = LinearRegression()
linreg2.fit(X_train_dummy_var, y_train_dummy_var)
    
y_pred_dummy_var = linreg2.predict(X_test_dummy_var)

print("Points value if all X variables were ZERO",linreg2.intercept_) # print the intercept
print("R-sq train", linreg2.score(X_train_dummy_var,y_train_dummy_var)) # R-squared on the training set
print("R-sq test", linreg2.score(X_test_dummy_var,y_test_dummy_var)) # R-squared on the test set
print("RMSE",np.sqrt(metrics.mean_squared_error(y_test_dummy_var, y_pred_dummy_var)))  # RMSE on the test set

list(zip(features3, linreg2.coef_))

Points value if all X variables were ZERO 84.7726248267295
R-sq train 0.2270818925842809
R-sq test 0.23716428845750848
RMSE 2.6622179775641177


[('price', 0.027275003850463622),
 ('dum_variety_Agiorgitiko', 1.9794556957765956),
 ('dum_variety_Aglianico', 3.51186212970117),
 ('dum_variety_Aidani', -3.509049930692198),
 ('dum_variety_Airen', -5.018099861383481),
 ('dum_variety_Albana', 3.730737594336311),
 ('dum_variety_Albanello', 0.6818750962610705),
 ('dum_variety_Albariño', 2.434016938834217),
 ('dum_variety_Albarossa', 6.394884621840902e-14),
 ('dum_variety_Aleatico', -0.10680622978548024),
 ('dum_variety_Alfrocheiro', 3.5839519837953486),
 ('dum_variety_Alicante', 2.3568688356722887),
 ('dum_variety_Alicante Bouschet', 3.74963718818183),
 ('dum_variety_Aligoté', 1.0077132555744228),
 ('dum_variety_Alsace white blend', 4.865181995997813),
 ('dum_variety_Altesse', 3.740321533083804),
 ('dum_variety_Alvarelhão', 1.1546000924232491),
 ('dum_variety_Alvarinho', 3.3173966138887008),
 ('dum_variety_Alvarinho-Chardonnay', 1.954625134770715),
 ('dum_variety_Ansonica', 0.40912505775657515),
 ('dum_variety_Antão Vaz', 2.5364212512918

This second linear regression model performed better than our first, with an RMSE of 2.66. The main difference is that our 'X' variables are now price and wine variety instead of just price. That said, even with better R-squared values (0.23 on the training set and 0.24 on the test set), it's still an unreliable model because those values remain low.

Linear Regression with all column features

In [21]:
#linear regression with dummy features
features2 = dummy_wine.drop(columns=['country','province','region_1','region_2','variety','points', 'description', 'designation','taster_name', 'taster_twitter_handle','title','winery']) 

# used to make prediction
X_dummy_wine2 = dummy_wine[features2.columns]  

# what we want to predict
y_dummy_wine2 = dummy_wine.points                

#split train test
X_train_dummy_wine2, X_test_dummy_wine2, y_train_dummy_wine2, y_test_dummy_wine2 = train_test_split(X_dummy_wine2, y_dummy_wine2, random_state = 413)

# train data is what the model is fed
# testing data is new and shows how well the model can adjust
linreg3 = LinearRegression()
linreg3.fit(X_train_dummy_wine2, y_train_dummy_wine2)
    
y_pred_dummy_wine2 = linreg3.predict(X_test_dummy_wine2)

print("Points value if all X variables were ZERO",linreg3.intercept_) # print the intercept
print("R-sq train", linreg3.score(X_train_dummy_wine2,y_train_dummy_wine2)) # R-squared on the training set
print("R-sq test", linreg3.score(X_test_dummy_wine2,y_test_dummy_wine2)) # R-squared on the test set
print("RMSE",np.sqrt(metrics.mean_squared_error(y_test_dummy_wine2, y_pred_dummy_wine2)))  # RMSE on the test set

list(zip(features2, linreg3.coef_))

#throwing more data at the problem doesn't mean it's a better solution, e.g. when clients want to ask for more data.

Points value if all X variables were ZERO 84.7726248267295
R-sq train 0.2270818925842809
R-sq test 0.23716428845750848
RMSE 2.6622179775641177


[('price', 0.027275003850463622),
 ('dum_variety_Agiorgitiko', 1.9794556957765956),
 ('dum_variety_Aglianico', 3.51186212970117),
 ('dum_variety_Aidani', -3.509049930692198),
 ('dum_variety_Airen', -5.018099861383481),
 ('dum_variety_Albana', 3.730737594336311),
 ('dum_variety_Albanello', 0.6818750962610705),
 ('dum_variety_Albariño', 2.434016938834217),
 ('dum_variety_Albarossa', 6.394884621840902e-14),
 ('dum_variety_Aleatico', -0.10680622978548024),
 ('dum_variety_Alfrocheiro', 3.5839519837953486),
 ('dum_variety_Alicante', 2.3568688356722887),
 ('dum_variety_Alicante Bouschet', 3.74963718818183),
 ('dum_variety_Aligoté', 1.0077132555744228),
 ('dum_variety_Alsace white blend', 4.865181995997813),
 ('dum_variety_Altesse', 3.740321533083804),
 ('dum_variety_Alvarelhão', 1.1546000924232491),
 ('dum_variety_Alvarinho', 3.3173966138887008),
 ('dum_variety_Alvarinho-Chardonnay', 1.954625134770715),
 ('dum_variety_Ansonica', 0.40912505775657515),
 ('dum_variety_Antão Vaz', 2.5364212512918

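In our last linear model we're using all numeric columns (price plus the dummy columns) to predict our 'y', the points value. It performed exactly like our second linear model: because each cell in Step 2 reassigned `dummy`, `dummy_wine` only contains the variety dummies, so both models were fit on the same features. To actually test more features, each set of dummies would need to be kept before concatenating. Even then, I would suggest looking at the coefficients and keeping only the most strongly correlated features, since more data doesn't necessarily have more impact on our prediction.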
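One way to act on the idea of using fewer dummy columns (for example, only the top 10 countries or top 20 regions) is to keep the most frequent categories and group the rest under a single label before encoding. This is a sketch on made-up data; the helper name `keep_top_categories` and the `'Other'` label are hypothetical:

```python
import pandas as pd

def keep_top_categories(series, n, other_label='Other'):
    """Keep the n most frequent values and replace the rest with other_label."""
    top = series.value_counts().nlargest(n).index
    # where() keeps values that satisfy the condition and replaces the others.
    return series.where(series.isin(top), other_label)

# Hypothetical variety column.
varieties = pd.Series(['Pinot Noir', 'Pinot Noir', 'Pinot Noir',
                       'Syrah', 'Syrah', 'Rose', 'Airen'])

reduced = keep_top_categories(varieties, n=2)
dummies = pd.get_dummies(reduced, prefix='dum_variety')
print(list(dummies.columns))
```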

In [24]:
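# Caution: coef_[0] is a single scalar, which pandas broadcasts to every row, so each feature
# below shows the same value. Passing the full linreg3.coef_ array instead would list one
# coefficient per feature.
pd.DataFrame({'Features':features2.columns, 'Coefs':linreg.coef_[0]}).sort_values(by='Coefs')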


Unnamed: 0,Features,Coefs
0,price,0.027275
460,dum_variety_Prugnolo Gentile,0.027275
461,dum_variety_Prunelard,0.027275
462,dum_variety_Pugnitello,0.027275
463,dum_variety_Rabigato,0.027275
464,dum_variety_Raboso,0.027275
465,dum_variety_Ramisco,0.027275
466,dum_variety_Rara Neagra,0.027275
467,dum_variety_Rebo,0.027275
468,dum_variety_Rebula,0.027275


#### Logistic Regression
Logistic regression outputs the probability of each class, which can be used for classification. This model isn't ideal for predicting the actual point value; however, we can use it to determine whether a wine is good or bad. We do this by setting a threshold: any wine rated over 92 is 'good' (a cutoff above the 75th percentile). With that threshold we can estimate the probability that a wine is good or bad. 

Note: 

Since all wines in the dataset are rated from 80 to 100, on what is nominally a 1 to 100 scale, we need to decide where 'good' starts within that narrower range. For example, is a wine rated 87 good or bad when every wine scores between 80 and 100? 

Creating class variable for logistic regression model

In [22]:
dummy_wine['good_bad_rating'] = dummy_wine.points > 92
#anything rated above 92 is labeled good (True), otherwise bad (False)

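Null Hypothesis Logistic Regression

About 90% of wines fall in the 'bad' class (see the class balance below), so always predicting 'bad' would already be about 90% accurate. Our model needs to beat that.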

In [23]:
dummy_wine.good_bad_rating.value_counts(normalize=True)

False    0.904154
True     0.095846
Name: good_bad_rating, dtype: float64

In [24]:
#K Fold 
kf = model_selection.KFold(n_splits=5, shuffle=True)
#Log Reg init
log = LogisticRegression(C=1e9, multi_class='multinomial', solver='lbfgs')

#logistic regression with dummy variety and price features. 
#We determine the best features were variety and price due to the coefficient strength
features_log = dummy_wine.filter(regex='dum_variety(.*)|(price)')

#Features
A = dummy_wine[features_log.columns]

#y
dummy_wine['gb_rate'] = dummy_wine.good_bad_rating.map({True:1, False:0})
b = dummy_wine['gb_rate']
#Split train test
A_train, A_test, b_train, b_test = train_test_split(A,b, random_state=413, test_size=0.33)

#fitting our model  
log.fit(A_train,b_train)
#prediction
b_pred = log.predict(A_test)
b_pred_train = log.predict(A_train)

#scores
print('Accuracy Score: ')
print('test:', metrics.accuracy_score(b_test,b_pred))
print('train:', metrics.accuracy_score(b_train,b_pred_train))
print('Confusion Matrix: ')
print(metrics.confusion_matrix(b_test,b_pred))
print('Cross Val Score: ')
print(cross_val_score(log, A, b, cv=kf, scoring='accuracy').mean())
print('---')
print("Intercept:") #value of intercept
print(log.intercept_)
print("Predict:")
print(log.predict_proba(A_test))

Accuracy Score: 
test: 0.9120535043334502
train: 0.9111198845200055
Confusion Matrix: 
[[35745   402]
 [ 3109   666]]
Cross Val Score: 
0.911403182475718
---
Intercept:
[-2.04499092]
Predict:
[[0.81384643 0.18615357]
 [0.80770583 0.19229417]
 [0.86081568 0.13918432]
 ...
 [0.8305317  0.1694683 ]
 [0.81705299 0.18294701]
 [0.79755083 0.20244917]]


After we run our model we get an accuracy score of about 91%, which is the percentage of correct predictions. That said, compared to our null accuracy of about 90%, it only does slightly better, an increase of roughly 1 percentage point. The confusion matrix shows why: the model catches only 666 of the 3,775 good wines in the test set (about 18%). Lastly, the predicted probabilities aren't very decisive: in the rows shown, the 'bad' class gets roughly 80% to 86%. 
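Accuracy alone hides how the model treats each class. As a sketch, the confusion matrix printed above can be unpacked into precision and recall for the 'good' class, assuming scikit-learn's layout of true classes in rows and predicted classes in columns:

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = actual (bad, good), columns = predicted (bad, good).
confusion = np.array([[35745, 402],
                      [3109, 666]])

tn, fp = confusion[0]   # actual bad: predicted bad / predicted good
fn, tp = confusion[1]   # actual good: predicted bad / predicted good

accuracy = (tp + tn) / confusion.sum()
precision = tp / (tp + fp)   # share of predicted-good wines that are good
recall = tp / (tp + fn)      # share of good wines the model finds

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# prints: accuracy=0.912 precision=0.624 recall=0.176
```

The accuracy matches the test score reported above, while the low recall shows that most good wines are still predicted as bad.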

Note:
Raising the threshold for a good wine increases the accuracy of the logistic model, but it also makes good wines rarer, so the null accuracy rises along with it. 
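To see how the threshold affects the benchmark, here is a small sketch that computes the null accuracy (always predicting the majority class) at several cutoffs. The scores in `toy_points` are made up and the helper name `null_accuracy` is hypothetical:

```python
import numpy as np

def null_accuracy(points, threshold):
    """Accuracy of always predicting the majority class for a given cutoff."""
    good = np.asarray(points) > threshold
    share_good = good.mean()
    # The majority class is whichever share is larger.
    return max(share_good, 1 - share_good)

# Hypothetical point ratings.
toy_points = np.array([84, 86, 87, 88, 88, 89, 90, 91, 93, 96])

for threshold in (88, 92, 95):
    print(threshold, null_accuracy(toy_points, threshold))
```

On this toy data the benchmark climbs from 0.5 to 0.8 to 0.9 as the cutoff rises, so a higher model accuracy at a higher cutoff should be compared against the new null value.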

In [25]:
pd.DataFrame({'Features':features_log.columns, 'Coefs':log.coef_[0]}).sort_values(by='Coefs')

Unnamed: 0,Features,Coefs
486,dum_variety_Rose,-0.551344
682,dum_variety_White Blend,-0.345968
430,dum_variety_Pinot Grigio,-0.291027
578,dum_variety_Tempranillo,-0.290088
322,dum_variety_Merlot,-0.280735
69,dum_variety_Cabernet Franc,-0.246954
41,dum_variety_Barbera,-0.243144
152,"dum_variety_Corvina, Rondinella, Molinara",-0.234933
104,dum_variety_Carmenère,-0.234218
119,dum_variety_Champagne Blend,-0.217948


#### Random Forests

Our last model will be a Random Forest model to try to determine if wine description had any influence on points. A Random Forest Regression Algorithm predicts the outcome based on a series of Decision Trees. 

A decision tree is basically a flow chart that uses a branching method to illustrate possible outcomes of a decision. Each node in a tree represents a test on a specific variable and each branch is an outcome of said test.


In [26]:
from sklearn.ensemble import RandomForestRegressor

In [27]:
# max_features=50 is used here, and n_estimators=150 is sufficiently large.
rfreg = RandomForestRegressor(n_estimators=150,
                              max_features=50,
                              oob_score=True,
                              n_jobs = -1,
                              random_state=1)

In [28]:
# Define X and y.
A2 = dummy_wine.description
b2 = dummy_wine.points

In [29]:
# Split the new DataFrame into training and testing sets.
A2_train, A2_test, b2_train, b2_test = train_test_split\
(A2, b2, random_state=413, test_size=0.3)


# Use CountVectorizer to create document-term matrices from A2_train and A2_test.
vtr = CountVectorizer(lowercase=True, ngram_range=(1, 2))
A2_train_nlp = vtr.fit_transform(A2_train)
A2_test_nlp = vtr.transform(A2_test)

In [30]:
# Create object that selects features with importance greater than or equal to a threshold
selector = SelectFromModel(rfreg, threshold='mean')

In [31]:
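# Create new feature matrices using the selector
A2_train_important = selector.fit_transform(A2_train_nlp, b2_train)
# Caution: fit_transform refits the selector on the test set, so the columns kept here can differ
# from the training columns; selector.transform(A2_test_nlp) would reuse the training selection.
A2_test_important = selector.fit_transform(A2_test_nlp, b2_test)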

In [32]:
# Fit the model on only the train data
rfreg.fit(A2_train_important, b2_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=50, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=-1, oob_score=True, random_state=1,
           verbose=0, warm_start=False)

In [33]:
#Check the RMSE for a random forest that only 
#includes important features.
scores_test = cross_val_score(rfreg, A2_test_important, b2_test, cv=5, scoring='neg_mean_squared_error')
print('RMSE Test: ',np.mean(np.sqrt(-scores_test)))
print('out of bag: ',(rfreg.oob_score_))            

RMSE Test:  2.293318767424938
out of bag:  0.4905138473812264


In [34]:
# Check the RMSE for a random forest that only includes important features.
scores_train = cross_val_score(rfreg, A2_train_important, b2_train, cv=5, scoring='neg_mean_squared_error')
print('RMSE Train: ',np.mean(np.sqrt(-scores_train)))
print('out of bag: ',(rfreg.oob_score_))         

RMSE Train:  2.1895887151883824
out of bag:  0.4905138473812264


In [72]:
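# Caution: A2 holds the review texts (one per row), while feature_importances_ has one value per
# selected n-gram column, so this pairs each importance with an unrelated description.
list(zip(A2, rfreg.feature_importances_))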

#Test Feature Importance 
#pd.DataFrame({'feature':A2_test_important, 
 #             'importance':rfreg.feature_importances_}).sort_values(by='importance', ascending = False)

[("This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's  already drinkable, although it will certainly be better from 2016.",
  3.336686570192254e-05),
 ('Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented.',
  2.189806358597704e-06),
 ('Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opulent, with notes of honey-drizzled guava and mango giving way to a slightly astringent, semidry finish.',
  2.0614877707825232e-05),
 ("Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rustic, earthy, herbal characteristics. Nonetheless, if you think of it as a pleasantly unfussy country wine, it's a good companion to a hearty winter stew.",
  5.387535034132195e-06),
 ('Blackbe

The Random Forest Regression model had an RMSE of 2.3. It's the best regression model we've come across for predicting points; however, it could still be better.

##### Summary / Next Steps
When trying to predict wine point reviews, our linear regression does better than the null RMSE of 3.04, with an RMSE of 2.66. That said, it still isn't a reliable model due to its low R-squared values of 0.23 to 0.24. Our best regression model was the Random Forest, with an RMSE of 2.3. Tuning its parameters could arguably help it perform better; however, I wasn't able to do so because of the computing resources required. 

When trying to predict good or bad wine, our logistic regression model did a pretty good job on the surface. I classified wines rated over 92 as good, which is roughly 3 points higher than the mean and a cutoff that only about 9.6% of wines clear. After we ran the model our accuracy score was about 91%; however, that is only about 1 percentage point above the null. If we were to raise our threshold for a good wine, to say 95, we would expect a higher accuracy, although the null accuracy would rise as well.

In conclusion, while these models aren't good enough to use for a real-world problem, they are the best we have for predicting the point rating value or good/bad wine. Also, feature importance didn't change much from model to model. The models listed the most important characteristics of wine, according to this dataset, as price and the various varieties of wine. Lastly, in the future, we will aim to develop more reliable models to predict the point rating. If I could make a recommendation, I would ask for a full review of each wine to accompany its point rating. 