# Models for Project 4

### Introduction
In this notebook, I hope to answer the following questions:

1. Are a wine's points predictable from its characteristics?

##### Step 1: Imports and Clean-up

Importing data and packages that will be used for analysis

In [1]:
import numpy as np
import pandas as pd
import scipy as sp
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, KFold, cross_val_score, cross_val_predict
from sklearn import metrics, linear_model, model_selection
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB         
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer
#importing unicodedata to handle é in rosé wine variety
import unicodedata
%matplotlib inline 

After importing all dependencies, we load the dataset and save it as the 'wine' variable.

In [2]:
wine = pd.read_json('./wine-reviews/winemag-data-130k-v2.json', dtype='unicode');

Cleaning up accent marks within dataset by replacing Rosé with Rose.

In [3]:
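# assign the result back instead of calling replace(inplace=True) on a selection, which may act on a copy
wine['variety'] = wine['variety'].replace('Rosé', 'Rose')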
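The import cell brings in `unicodedata` for exactly this purpose, but the replacement above only handles the single value 'Rosé'. As a minimal sketch, assuming the goal is to strip accents from every variety name, a helper along these lines could be used. The name `strip_accents` and the toy values are hypothetical, not part of the original notebook.

```python
import unicodedata

import pandas as pd


def strip_accents(text):
    """Return text with combining accent marks removed (e.g. 'Rosé' -> 'Rose')."""
    # NFKD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


# Toy stand-in for wine['variety']; the values are illustrative only.
varieties = pd.Series(['Rosé', 'Albariño', 'Pinot Noir'])
cleaned = varieties.map(strip_accents)
print(cleaned.tolist())
```

Applied to the real data this would look like `wine['variety'].map(strip_accents)`. Letters that have no decomposed form would pass through unchanged.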

Checking for null values and reviewing data information.

In [4]:
wine.isnull().sum()

country                  0
description              0
designation              0
points                   0
price                    0
province                 0
region_1                 0
region_2                 0
taster_name              0
taster_twitter_handle    0
title                    0
variety                  0
winery                   0
dtype: int64

In [5]:
wine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129971 entries, 0 to 129970
Data columns (total 13 columns):
country                  129971 non-null object
description              129971 non-null object
designation              129971 non-null object
points                   129971 non-null object
price                    129971 non-null object
province                 129971 non-null object
region_1                 129971 non-null object
region_2                 129971 non-null object
taster_name              129971 non-null object
taster_twitter_handle    129971 non-null object
title                    129971 non-null object
variety                  129971 non-null object
winery                   129971 non-null object
dtypes: object(13)
memory usage: 12.9+ MB


Changing the points and price types to integer and float

In [6]:
#changing dtypes for points and price
wine['points'] = wine.points.astype(int)
wine['price'] = wine.price.astype(float)
wine.head(2)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos


In [7]:
#price was loaded as text, so missing prices only became real NaN values after the float cast above.
#dropna now removes every row that still contains a NaN.
wine.dropna(inplace=True)
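A possible alternative for this clean-up step is sketched below, assuming the missing prices arrive as text such as 'nan' or 'None'. `pd.to_numeric` with `errors='coerce'` converts anything unparseable to NaN, and `dropna(subset=...)` limits the row removal to the price column. The toy frame is hypothetical.

```python
import pandas as pd

# Toy stand-in for the text-typed columns; values are illustrative only.
toy = pd.DataFrame({'points': ['87', '90', '85'],
                    'price': ['15.0', 'nan', 'None']})

toy['points'] = toy['points'].astype(int)
# errors='coerce' turns any unparseable text into NaN instead of raising.
toy['price'] = pd.to_numeric(toy['price'], errors='coerce')

# Drop only the rows whose price is missing.
toy = toy.dropna(subset=['price'])
print(toy)
```

Restricting `dropna` to a subset keeps rows that are only missing values in columns the model does not use.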

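##### Step 2: Dummy Data
The next step in building our model is to dummy the categorical data. I've dummied country, designation, province, region_1, region_2, and variety. Each cell below overwrites the 'dummy' variable, so only the variety dummies from the last cell are carried into the model data.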

In [8]:
dummy = pd.get_dummies(wine[['country']],prefix='dum_country', drop_first=True)
print(dummy.columns)

Index(['dum_country_Armenia', 'dum_country_Australia', 'dum_country_Austria',
       'dum_country_Bosnia and Herzegovina', 'dum_country_Brazil',
       'dum_country_Bulgaria', 'dum_country_Canada', 'dum_country_Chile',
       'dum_country_China', 'dum_country_Croatia', 'dum_country_Cyprus',
       'dum_country_Czech Republic', 'dum_country_England',
       'dum_country_France', 'dum_country_Georgia', 'dum_country_Germany',
       'dum_country_Greece', 'dum_country_Hungary', 'dum_country_India',
       'dum_country_Israel', 'dum_country_Italy', 'dum_country_Lebanon',
       'dum_country_Luxembourg', 'dum_country_Macedonia', 'dum_country_Mexico',
       'dum_country_Moldova', 'dum_country_Morocco', 'dum_country_New Zealand',
       'dum_country_None', 'dum_country_Peru', 'dum_country_Portugal',
       'dum_country_Romania', 'dum_country_Serbia', 'dum_country_Slovakia',
       'dum_country_Slovenia', 'dum_country_South Africa', 'dum_country_Spain',
       'dum_country_Switzerland', 'dum_c

In [9]:
dummy = pd.get_dummies(wine[['designation']],prefix='dum_designation', drop_first=True)
print(dummy.columns)

Index(['dum_designation_#50 Mon Chou', 'dum_designation_#SocialSecret',
       'dum_designation_%@#$!', 'dum_designation_&',
       'dum_designation_'61 Rosé', 'dum_designation_'A Rina',
       'dum_designation_'Blend 105' Red Wine', 'dum_designation_'Na Vota',
       'dum_designation_'S'', 'dum_designation_'Unfiltered'',
       ...
       'dum_designation_Ürziger Würzgarten GG Réserve Alte Reben Erste Lage Dry',
       'dum_designation_Ürziger Würzgarten Kabinett',
       'dum_designation_Ürziger Würzgarten Spätlese',
       'dum_designation_étoile Brut', 'dum_designation_étoile Rosé',
       'dum_designation_Župska', 'dum_designation_‘Rough Justice' Red',
       'dum_designation_‘S'', 'dum_designation_“Champ” Jim the Gent',
       'dum_designation_“Champ” Lightnin' Lane'],
      dtype='object', length=35776)


In [10]:
dummy = pd.get_dummies(wine[['province']],prefix='dum_province', drop_first=True)
print(dummy.columns)

Index(['dum_province_Aconcagua Costa', 'dum_province_Aconcagua Valley',
       'dum_province_Aegean', 'dum_province_Agioritikos', 'dum_province_Ahr',
       'dum_province_Alenquer', 'dum_province_Alentejano',
       'dum_province_Alentejo', 'dum_province_Alenteo', 'dum_province_Algarve',
       ...
       'dum_province_Wellington', 'dum_province_Western Australia',
       'dum_province_Western Cape', 'dum_province_Wiener Gemischter Satz',
       'dum_province_Württemberg', 'dum_province_Zenata',
       'dum_province_Österreichischer Perlwein',
       'dum_province_Österreichischer Sekt', 'dum_province_Štajerska',
       'dum_province_Župa'],
      dtype='object', length=422)


In [11]:
dummy = pd.get_dummies(wine[['region_1']],prefix='dum_region_1', drop_first=True)
print(dummy.columns)

Index(['dum_region_1_Adelaida District', 'dum_region_1_Adelaide',
       'dum_region_1_Adelaide Hills', 'dum_region_1_Adelaide Plains',
       'dum_region_1_Aglianico d'Irpinia',
       'dum_region_1_Aglianico del Beneventano',
       'dum_region_1_Aglianico del Taburno',
       'dum_region_1_Aglianico del Vulture', 'dum_region_1_Agrelo',
       'dum_region_1_Albana di Romagna',
       ...
       'dum_region_1_Yadkin Valley', 'dum_region_1_Yakima Valley',
       'dum_region_1_Yamhill County', 'dum_region_1_Yarra Valley',
       'dum_region_1_Yecla', 'dum_region_1_Yolo County',
       'dum_region_1_York Mountain', 'dum_region_1_Yorkville Highlands',
       'dum_region_1_Yountville', 'dum_region_1_Zonda Valley'],
      dtype='object', length=1204)


In [12]:
dummy = pd.get_dummies(wine[['region_2']],prefix='dum_region_2', drop_first=True)
print(dummy.columns)

Index(['dum_region_2_Central Coast', 'dum_region_2_Central Valley',
       'dum_region_2_Columbia Valley', 'dum_region_2_Finger Lakes',
       'dum_region_2_Long Island', 'dum_region_2_Napa',
       'dum_region_2_Napa-Sonoma', 'dum_region_2_New York Other',
       'dum_region_2_None', 'dum_region_2_North Coast',
       'dum_region_2_Oregon Other', 'dum_region_2_Sierra Foothills',
       'dum_region_2_Sonoma', 'dum_region_2_South Coast',
       'dum_region_2_Southern Oregon', 'dum_region_2_Washington Other',
       'dum_region_2_Willamette Valley'],
      dtype='object')


In [13]:
dummy = pd.get_dummies(wine[['variety']],prefix='dum_variety', drop_first=True)
print(dummy.columns)
dummy.shape

Index(['dum_variety_Agiorgitiko', 'dum_variety_Aglianico',
       'dum_variety_Aidani', 'dum_variety_Airen', 'dum_variety_Albana',
       'dum_variety_Albanello', 'dum_variety_Albariño',
       'dum_variety_Albarossa', 'dum_variety_Aleatico',
       'dum_variety_Alfrocheiro',
       ...
       'dum_variety_Xynisteri', 'dum_variety_Yapincak', 'dum_variety_Zibibbo',
       'dum_variety_Zierfandler', 'dum_variety_Zierfandler-Rotgipfler',
       'dum_variety_Zinfandel', 'dum_variety_Zlahtina', 'dum_variety_Zweigelt',
       'dum_variety_Çalkarası', 'dum_variety_Žilavka'],
      dtype='object', length=697)


(120975, 697)

In [14]:
# Concatenate the original DataFrame and the dummy DataFrame (axis=0 means rows, axis=1 means columns).
dummy_wine = pd.concat([wine, dummy], axis=1)
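The concatenated frame has 710 columns (13 original plus 697 variety dummies), so the country, province, and region dummies are not in it. A minimal sketch of one way to keep dummies for several columns at once, while limiting each to its most frequent values, follows. The helper name `top_n_dummies` and the toy data are hypothetical.

```python
import pandas as pd


def top_n_dummies(df, column, n, prefix):
    """One-hot encode only the n most frequent values of df[column].

    Rows with any other value get zeros in every dummy column.
    """
    top_values = df[column].value_counts().nlargest(n).index
    # Values outside the top n become NaN, which get_dummies ignores.
    kept = df[column].where(df[column].isin(top_values))
    return pd.get_dummies(kept, prefix=prefix)


toy = pd.DataFrame({
    'country': ['US', 'US', 'France', 'Italy', 'France', 'Peru'],
    'variety': ['Syrah', 'Merlot', 'Syrah', 'Syrah', 'Merlot', 'Malbec'],
})

# Build one dummy frame per column, then join them all at once.
parts = [top_n_dummies(toy, 'country', 2, 'dum_country'),
         top_n_dummies(toy, 'variety', 2, 'dum_variety')]
toy_dummies = pd.concat([toy] + parts, axis=1)
print(toy_dummies.columns.tolist())
```

Capping the number of levels also avoids very wide frames such as the 35,776 designation dummies shown earlier.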

##### Step 3: Feature Selection

In [15]:
corr_dummy_wine = dummy_wine.corr()  

In [16]:
corr_points = corr_dummy_wine.loc[:, ['points']]
print(corr_points.nlargest(25, 'points'))              # find the 25 features most positively correlated with points
print('------------------------------')
print(corr_points.nsmallest(25, 'points'))             # find the 25 features most negatively correlated with points

                                        points
points                                1.000000
price                                 0.416167
dum_variety_Pinot Noir                0.111451
dum_variety_Nebbiolo                  0.087904
dum_variety_Riesling                  0.069101
dum_variety_Sangiovese Grosso         0.054668
dum_variety_Syrah                     0.053358
dum_variety_Grüner Veltliner          0.051174
dum_variety_Champagne Blend           0.040423
dum_variety_Port                      0.033572
dum_variety_Bordeaux-style Red Blend  0.026134
dum_variety_Rhône-style Red Blend     0.025424
dum_variety_Blaufränkisch             0.022330
dum_variety_Portuguese Red            0.019759
dum_variety_Cabernet Sauvignon        0.017977
dum_variety_Shiraz                    0.016929
dum_variety_Aglianico                 0.015824
dum_variety_Austrian white blend      0.015097
dum_variety_Tinta de Toro             0.014080
dum_variety_Sagrantino                0.013986
dum_variety_T

### Can we predict the points rating based on other characteristics of a wine? 

#### Linear Regression
Using linear regression, I wanted to try to predict the outcome variable (in this case points) based on all X variables (the column features, including dummies). When we run the model, it tries to fit one coefficient per feature, plus an intercept.

How big is the dummy wine dataset?

In [17]:
dummy_wine.shape

(120975, 710)

Linear Regression with price feature only

In [18]:
#check description summary stats to determine what's a good review, i.e. better than 95 or something else.

features = ['price']


X_wine = wine[features]                # used to make prediction                                                    
y_wine = wine['points']                # what we want to predict 

X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(X_wine, y_wine, random_state = 413)

# train data is what the model is fed
# testing data is new and shows how well the model can adjust
linreg = LinearRegression()
linreg.fit(X_train_wine, y_train_wine)
    
y_pred_wine = linreg.predict(X_test_wine)

print("Points value if all X variables were ZERO",linreg.intercept_) # print the intercept
print("R-sq train", linreg.score(X_train_wine,y_train_wine)) # R-squared on the training data
print("R-sq test", linreg.score(X_test_wine,y_test_wine)) # R-squared on the test data

print("RMSE",np.sqrt(metrics.mean_squared_error(y_test_wine, y_pred_wine)))  # root mean squared error on the test data

list(zip(features, linreg.coef_))

Points value if all X variables were ZERO 87.37127353791841
R-sq train 0.16754316064753183
R-sq test 0.1890867277908327
RMSE 2.7448290922077474


[('price', 0.029706128437876737)]
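To make the fitted line concrete, the reported intercept and price coefficient can be plugged into the model equation, points = intercept + coefficient × price. The short sketch below does this arithmetic for a few example prices. The function name and the chosen prices are illustrative.

```python
# Coefficients copied from the fitted model output above.
INTERCEPT = 87.37127353791841
PRICE_COEF = 0.029706128437876737


def predicted_points(price):
    """Points predicted by the price-only linear model for a given price."""
    return INTERCEPT + PRICE_COEF * price


for price in (15, 50, 100):
    print(price, round(predicted_points(price), 2))
```

Each extra dollar adds roughly 0.03 points, so a $100 wine is predicted to score only about 2.5 points above a $15 wine, which fits the low R-squared of about 0.19 on the test data.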

##### Concatenating top 15 for various categories

In [19]:
#Top 15 Provinces
dummy_province15 = dummy_wine[dummy_wine.province.isin(['California', 'Washington', 'Bordeaux', 
                                  'Tuscany', 'Oregon', 'Burgundy', 
                                  'Northern Spain', 'Piedmont', 'Mendoza Province',
                                  'Veneto', 'New York', 'Alsace',
                                  'Northeastern Italy', 'Loire Valley', 'Sicily & Sardinia'])]

In [20]:
#Top 15 Countries
dummy_country15 = dummy_wine[dummy_wine.country.isin(['US', 'France', 'Italy', 
                                  'Spain', 'Portugal', 'Chile', 
                                  'Argentina', 'Austria', 'Australia',
                                  'Germany', 'New Zealand', 'South Africa',
                                  'Israel', 'Greece', 'Canada'])]

In [21]:
#Top 15 Varieties
#dummy_variety15 = dummy_wine.variety.replace('é', 'e')
dummy_variety15 = dummy_wine[dummy_wine.variety.isin(['Pinot Noir', 'Chardonnay', 'Cabernet Sauvignon', 
                                  'Red Blend', 'Bordeaux-style Red Blend', 'Riesling', 
                                  'Sauvignon Blanc', 'Syrah', 'Rose',
                                  'Merlot', 'Nebbiolo', 'Zinfandel',
                                  'Sangiovese', 'Malbec', 'Portuguese Red'])]

In [22]:
#Top 15 Region 1
dummy_region_1_15 = dummy_wine[dummy_wine.region_1.isin(['Napa Valley', 'Columbia Valley (WA)', 'Russian River Valley', 
                                  'California', 'Paso Robles', 'Mendoza', 
                                  'Willamette Valley', 'Alsace', 'Champagne',
                                  'Barolo', 'Finger Lakes', 'Sonoma Coast',
                                  'Brunello di Montalcino', 'Rioja', 'Sonoma County'])]

In [23]:
#Top 15 Region 2
dummy_region_2_15 = dummy_wine[dummy_wine.region_2.isin(['Central Coast', 'Sonoma', 'Columbia Valley', 
                                  'Napa', 'Willamette Valley', 'California Other', 
                                  'Finger Lakes', 'Sierra Foothills', 'Napa-Sonoma',
                                  'Central Valley', 'Southern Oregon', 'Oregon Other',
                                  'Long Island', 'North Coast', 'Washington Other'])]
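The five cells above hard-code each top-15 list. As a sketch, assuming "top 15" means the 15 most frequent values in a column, the lists could be derived from the data with `value_counts()`. The helper `filter_top_n` and the toy frame are hypothetical.

```python
import pandas as pd


def filter_top_n(df, column, n=15):
    """Keep only rows whose df[column] is among its n most frequent values."""
    top_values = df[column].value_counts().nlargest(n).index
    return df[df[column].isin(top_values)]


toy = pd.DataFrame({'province': ['California', 'California', 'Oregon',
                                 'Oregon', 'Tuscany', 'Alsace']})
toy_top2 = filter_top_n(toy, 'province', n=2)
print(sorted(toy_top2['province'].unique()))
```

In this dataset missing values were loaded as the text 'None', so that value would need to be excluded before counting, otherwise it could appear among the top values.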

In [24]:
# Stack the five top-15 subsets row-wise (pd.concat defaults to axis=0); rows that match several filters appear more than once.
dummy_wine2 = pd.concat([dummy_province15, dummy_country15, dummy_variety15, dummy_region_1_15, dummy_region_2_15])

In [25]:
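# NOTE: both filters start from dummy_wine, so the concatenated top-15 frame and the
# region_1 filter are overwritten; only the region_2 filter takes effect.
dummy_wine2 = dummy_wine.loc[dummy_wine['region_1'] != 'None', :]
dummy_wine2 = dummy_wine.loc[dummy_wine['region_2'] != 'None', :]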

What's the shape of the new dummy wine 2 df?

In [26]:
dummy_wine2.shape

(50292, 710)

Linear Regression with all column features

In [27]:
#linear regression with dummy features
features = dummy_wine2.drop(columns=['country','province','region_1','region_2','variety','points', 'description', 'designation','taster_name', 'taster_twitter_handle','title','winery']) 
#for dummies maybe use only top 10 countries, 20 regions etc -- drop ones I don't need to use in features

# used to make prediction
X_dummy_wine2 = dummy_wine2[features.columns]  

# what we want to predict
y_dummy_wine2 = dummy_wine2.points                

#split train test
X_train_dummy_wine2, X_test_dummy_wine2, y_train_dummy_wine2, y_test_dummy_wine2 = train_test_split(X_dummy_wine2, y_dummy_wine2, random_state = 413)

# train data is what the model is fed
# testing data is new and shows how well the model can adjust
linreg = LinearRegression()
linreg.fit(X_train_dummy_wine2, y_train_dummy_wine2)
    
y_pred_dummy_wine2 = linreg.predict(X_test_dummy_wine2)

print("Points value if all X variables were ZERO",linreg.intercept_) # print the intercept
print("R-sq train", linreg.score(X_train_dummy_wine2,y_train_dummy_wine2)) # R-squared on the training data
print("R-sq test", linreg.score(X_test_dummy_wine2,y_test_dummy_wine2)) # R-squared on the test data
print("RMSE",np.sqrt(metrics.mean_squared_error(y_test_dummy_wine2, y_pred_dummy_wine2)))  # root mean squared error on the test data

list(zip(features, linreg.coef_))

Points value if all X variables were ZERO 81.3787399282289
R-sq train 0.2219951791928898
R-sq test -3.2246891224026016e+16
RMSE 558997293.2556045


[('price', 0.048284014350213234),
 ('dum_variety_Agiorgitiko', -1311124637712.314),
 ('dum_variety_Aglianico', -3.3562439598380074),
 ('dum_variety_Aidani', 658999140265.5297),
 ('dum_variety_Airen', 144817589893.66116),
 ('dum_variety_Albana', 112013898425.23624),
 ('dum_variety_Albanello', 206702375913.4556),
 ('dum_variety_Albariño', 6.152155006766568),
 ('dum_variety_Albarossa', 297619346910.24915),
 ('dum_variety_Aleatico', 172375939653.83035),
 ('dum_variety_Alfrocheiro', -48939408875.19247),
 ('dum_variety_Alicante', 47972677360.13799),
 ('dum_variety_Alicante Bouschet', 4.755751609802246),
 ('dum_variety_Aligoté', 6.592885971069336),
 ('dum_variety_Alsace white blend', -99124672512.32768),
 ('dum_variety_Altesse', -217128252485.0839),
 ('dum_variety_Alvarelhão', 2.75111985206604),
 ('dum_variety_Alvarinho', -62680025249.01658),
 ('dum_variety_Alvarinho-Chardonnay', -13991129443.224014),
 ('dum_variety_Ansonica', 27356058305.43039),
 ('dum_variety_Antão Vaz', -13068679314.838276

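Linear Regression with price and variety dummy features. Only the variety dummies were kept in the data, so this is the same feature set as the previous model and the results are identical.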

In [28]:
#linear regression with variety dummy and price features
features = dummy_wine2.filter(regex='dum_variety(.*)|(price)')
#for dummies maybe use only top 10 countries, 20 regions etc -- drop ones I don't need to use in features

# used to make prediction
X_dummy_var = dummy_wine2[features.columns]  

# what we want to predict
y_dummy_var = dummy_wine2.points                

#split train test
X_train_dummy_var, X_test_dummy_var, y_train_dummy_var, y_test_dummy_var = train_test_split(X_dummy_var, y_dummy_var, random_state = 413)

# train data is what the model is fed
# testing data is new and shows how well the model can adjust
linreg = LinearRegression()
linreg.fit(X_train_dummy_var, y_train_dummy_var)
    
y_pred_dummy_var = linreg.predict(X_test_dummy_var)

print("Points value if all X variables were ZERO",linreg.intercept_) # print the intercept
print("R-sq train", linreg.score(X_train_dummy_var,y_train_dummy_var)) # R-squared on the training data
print("R-sq test", linreg.score(X_test_dummy_var,y_test_dummy_var)) # R-squared on the test data
print("RMSE",np.sqrt(metrics.mean_squared_error(y_test_dummy_var, y_pred_dummy_var)))  # root mean squared error on the test data

list(zip(features, linreg.coef_))

Points value if all X variables were ZERO 81.3787399282289
R-sq train 0.2219951791928898
R-sq test -3.2246891224026016e+16
RMSE 558997293.2556045


[('price', 0.048284014350213234),
 ('dum_variety_Agiorgitiko', -1311124637712.314),
 ('dum_variety_Aglianico', -3.3562439598380074),
 ('dum_variety_Aidani', 658999140265.5297),
 ('dum_variety_Airen', 144817589893.66116),
 ('dum_variety_Albana', 112013898425.23624),
 ('dum_variety_Albanello', 206702375913.4556),
 ('dum_variety_Albariño', 6.152155006766568),
 ('dum_variety_Albarossa', 297619346910.24915),
 ('dum_variety_Aleatico', 172375939653.83035),
 ('dum_variety_Alfrocheiro', -48939408875.19247),
 ('dum_variety_Alicante', 47972677360.13799),
 ('dum_variety_Alicante Bouschet', 4.755751609802246),
 ('dum_variety_Aligoté', 6.592885971069336),
 ('dum_variety_Alsace white blend', -99124672512.32768),
 ('dum_variety_Altesse', -217128252485.0839),
 ('dum_variety_Alvarelhão', 2.75111985206604),
 ('dum_variety_Alvarinho', -62680025249.01658),
 ('dum_variety_Alvarinho-Chardonnay', -13991129443.224014),
 ('dum_variety_Ansonica', 27356058305.43039),
 ('dum_variety_Antão Vaz', -13068679314.838276
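The coefficients in the hundreds of billions and the hugely negative test R-squared suggest the fit is numerically unstable. One plausible cause, not verified here, is that rare or collinear dummy columns leave the least-squares problem ill-conditioned. A common remedy is to swap in ridge regression, which penalizes large coefficients. The sketch below uses synthetic stand-in data, not the wine dataset, and `alpha=1.0` is an arbitrary starting value.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(413)
n = 300
# Synthetic stand-in data: a price column plus one-hot variety columns.
price = rng.uniform(10, 100, size=n)
variety = rng.choice(['A', 'B', 'C', 'D'], size=n)
X = pd.get_dummies(pd.Series(variety), prefix='dum_variety').astype(float)
X.insert(0, 'price', price)
y = 85 + 0.03 * price + (variety == 'A') * 2.0 + rng.normal(0, 1, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=413)

# alpha=1.0 is an arbitrary starting point, not a tuned value.
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

print("R-sq test", ridge.score(X_test, y_test))
print(dict(zip(X.columns, ridge.coef_.round(3))))
```

Here the full set of variety dummies is kept on purpose, so the columns are perfectly collinear with the intercept, yet the penalty keeps every coefficient small. Whether this fixes the wine model would need to be checked on the real data.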

After running linear regression on...

#### Trying Logistic Regression

Based on Good/Bad categorical prediction

Every wine in the dataset scores between 80 and 100 points, so even the lowest-rated wine looks good on a 1 to 100 scale. To classify wines as good or bad, we therefore need a cutoff within that 80 to 100 range. For example, is a wine rated 89 good or bad relative to the other wines?

Should I classify ratings into categories instead, e.g. 80-85 = fair, 86-90 = good, etc.?
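The category idea above can be sketched with `pd.cut`. The 'fair' and 'good' ranges follow the text; the two upper bins and their labels are hypothetical and only there to complete the example.

```python
import pandas as pd

points = pd.Series([80, 85, 86, 90, 91, 96])

# Bin edges are right-inclusive: (79, 85], (85, 90], (90, 95], (95, 100].
# 'fair' and 'good' follow the text; the two upper labels are hypothetical.
rating = pd.cut(points,
                bins=[79, 85, 90, 95, 100],
                labels=['fair', 'good', 'very good', 'excellent'])
print(rating.tolist())
```

Binning would turn the task into a four-class problem rather than a good/bad split, which may be easier to balance than a single cutoff at 95.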

In [29]:
dummy_wine2['good_bad_rating'] = dummy_wine.points > 95
#wines scoring above 95 points are labeled good (True); all others are labeled bad (False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [30]:
dummy_wine2.head(2)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,...,dum_variety_Yapincak,dum_variety_Zibibbo,dum_variety_Zierfandler,dum_variety_Zierfandler-Rotgipfler,dum_variety_Zinfandel,dum_variety_Zlahtina,dum_variety_Zweigelt,dum_variety_Çalkarası,dum_variety_Žilavka,good_bad_rating
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,...,0,0,0,0,0,0,0,0,0,False
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,...,0,0,0,0,0,0,0,0,0,False


In [31]:
kf = model_selection.KFold(n_splits=5, shuffle=True)
log = LogisticRegression(C=1e9, multi_class='multinomial', solver='lbfgs')

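#logistic regression with all dummy features
#NOTE: good_bad_rating is not in the drop list below, so the label itself stays in A; this leak likely explains the perfect scores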
features_log = dummy_wine2.drop(columns=['country','province','region_1'
                                         ,'region_2','variety','points'
                                         , 'description', 'designation','taster_name'
                                         , 'taster_twitter_handle','title','winery']) 
#for dummies maybe use only top 10 countries, 20 regions etc -- drop ones I don't need to use in features

A = dummy_wine2[features_log.columns]


dummy_wine2['gb_rate'] = dummy_wine2.good_bad_rating.map({True:1, False:0})
b = dummy_wine2['gb_rate']

A_train, A_test, b_train, b_test = train_test_split(A,b, random_state =42)


scaler = StandardScaler()
A_train = scaler.fit_transform(A_train)
A_test = scaler.transform(A_test)
    
    
log.fit(A_train,b_train)
b_pred = log.predict(A_test)

print('Accuracy Score: ')
print(metrics.accuracy_score(b_test,b_pred))
print('Confusion Matrix: ')
print(metrics.confusion_matrix(b_test,b_pred))
print('Cross Val Score: ')
print(cross_val_score(log, A, b, cv=5, scoring='accuracy').mean())
print('---')
print(np.mean(-cross_val_score(log, A, b, cv=kf, scoring='neg_mean_squared_error')))
print(np.mean(cross_val_score(log, A, b, cv=kf)))


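print("Coefficient:") #weights of each feature
#NOTE: log.coef_ has shape (1, n_features), so zip pairs 'price' with the whole coefficient array; log.coef_[0] would give one weight per feature
print(list(zip(features_log, log.coef_)))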
print("Intercept:") #value of intercept
print(log.intercept_)
print("Predict:")
print(log.predict_proba(A_test))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Accuracy Score: 
1.0
Confusion Matrix: 
[[12488     0]
 [    0    85]]
Cross Val Score: 
1.0
---
0.0
1.0
Coefficient:
[('price', array([ 1.45397895e-02,  0.00000000e+00, -8.70753206e-05,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00, -9.44704520e-04,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
       -1.55633126e-04, -3.04728653e-04,  0.00000000e+00,  0.00000000e+00,
       -1.28157345e-04, -1.22366869e-04,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00, -1.85732570e-04,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00, -2.25237734e-04,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00, -2.59757462e-04,  0.00000000e+00,
        0.00000000e+00, -9.33162226e-04,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00
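A perfect score on every metric is usually a sign that the target has leaked into the features. Here the good_bad_rating column is still among the model inputs. The sketch below, on synthetic stand-in data, shows one way to exclude the target columns, avoid the SettingWithCopyWarning with `.copy()`, and scale inside a pipeline so cross-validation refits the scaler on each fold. All values and the name `target_cols` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

full = pd.DataFrame({
    'points': [88, 96, 90, 97, 85, 98, 87, 96, 91, 99],
    'price': [20.0, 150.0, 35.0, 200.0, 12.0, 90.0, 18.0, 40.0, 60.0, 300.0],
    'dum_variety_Syrah': [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})
# .copy() makes the filtered frame independent, so adding a column
# does not trigger a SettingWithCopyWarning.
toy = full.loc[full['price'] > 0].copy()
toy['good_bad_rating'] = toy['points'] > 95

# Drop the target and every column derived from it before building A.
target_cols = ['points', 'good_bad_rating']
A = toy.drop(columns=target_cols)
b = toy['good_bad_rating'].astype(int)

# Scaling inside the pipeline is refit on each training fold.
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, A, b, cv=2, scoring='accuracy')
print(scores.mean())

# Share of the majority class in this toy data, for comparison.
print(max(b.mean(), 1 - b.mean()))
```

Accuracy is also a weak metric for this target: in the test split above, 12,488 of 12,573 wines are in the bad class, so always predicting "bad" already scores about 99.3%.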

#### Attempting NLP

###### Naive Bayes: Predicting points from the review description
Instead of the dummy features used in the logistic regression above, let's use the text of each review description to predict points.


In [32]:
# Define X and y.
A2 = dummy_wine2.description
b2 = dummy_wine2.points

# Split the new DataFrame into training and testing sets.
A2_train, A2_test, b2_train, b2_test = train_test_split\
(A2, b2, random_state=413)

In [33]:
# Use CountVectorizer to create document-term matrices from A2_train and A2_test.
vtr = CountVectorizer(lowercase=True, ngram_range=(1, 2))
A2_train_nlp = vtr.fit_transform(A2_train)
A2_test_nlp = vtr.transform(A2_test)

# Use Naive Bayes to predict the points rating.
nb = MultinomialNB()
nb.fit(A2_train_nlp, b2_train)
b2_pred_class = nb.predict(A2_test_nlp)

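# Calculate accuracy.
print((metrics.accuracy_score(b2_test, b2_pred_class)))
# NOTE: the next two lines are the null-accuracy formula for a 0/1 target; with points as the
# target they print the mean score and 1 minus it, not a baseline accuracy.
print(b2_test.mean())
print(1 - b2_test.mean())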

0.2566610991807842
88.66070150322119
-87.66070150322119


In [34]:
b2_test.value_counts()

90    1640
88    1638
87    1464
91    1266
86    1152
92    1046
89    1000
85     780
93     680
84     596
94     451
83     265
82     226
95     170
81      68
96      50
80      44
97      25
98       7
99       5
Name: points, dtype: int64
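With 20 possible point values, a useful baseline is the null accuracy: the share of the most frequent class, which is what a model scores by always predicting that class. The sketch below computes it from the counts printed above.

```python
import pandas as pd

# Test-set class counts copied from the value_counts() output above.
counts = pd.Series({90: 1640, 88: 1638, 87: 1464, 91: 1266, 86: 1152,
                    92: 1046, 89: 1000, 85: 780, 93: 680, 84: 596,
                    94: 451, 83: 265, 82: 226, 95: 170, 81: 68,
                    96: 50, 80: 44, 97: 25, 98: 7, 99: 5})

# Null accuracy: always predict the most frequent class (90 points).
null_accuracy = counts.max() / counts.sum()
print(round(null_accuracy, 4))
```

The result is about 0.13, so the Naive Bayes accuracy of about 0.26 is roughly double the baseline. On the live data the same figure comes from `b2_test.value_counts(normalize=True).max()`.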

In [35]:
log.fit(A2_train_nlp, b2_train)
print(log.score(A2_test_nlp, b2_test))
#print(metrics.accuracy_score(b_test,b_pred))


0.30358705161854765
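Exact-match accuracy treats a prediction of 89 for a 90-point wine the same as a prediction of 80. Since points are ordered, a more forgiving measure is the share of predictions within one point of the true score. The function below is a hypothetical helper, and the sample values are illustrative, not model output.

```python
import numpy as np


def within_tolerance_accuracy(y_true, y_pred, tolerance=1):
    """Share of predictions within `tolerance` points of the true score."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred) <= tolerance))


# Illustrative values only, not model output.
true_points = [88, 90, 92, 85]
pred_points = [88, 91, 95, 84]
print(within_tolerance_accuracy(true_points, pred_points))
```

On the notebook's data this could be called as `within_tolerance_accuracy(b2_test, b2_pred_class)` to compare the Naive Bayes and logistic regression text models on a scale that respects the ordering of points.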
