# Machine learning on wine

**Topics:** Text analysis, linear regression, logistic regression, classification

**Datasets**

- **wine-reviews.csv** Wine reviews scraped from https://www.winemag.com/
- **Data dictionary:** just go [here](https://www.winemag.com/buying-guide/tenuta-dellornellaia-2007-masseto-merlot-toscana/) and look at the page

## The background

You work in the **worst newsroom in the world**, and you've had a hard few weeks at work - a couple stories killed, a few scoops stolen out from under you. It's not going well.

And because things just can't get any worse: your boss shows up, carrying a huge binder. She slams it down on your desk.

"You know some machine learning stuff, right?"

You say "no," but she isn't listening. She's giving you an assignment, the _worst assignment_:

> Machine learning is the new maps. Let's get some hits!
>
> **Do some machine learning on this stuff.**

"This stuff" is wine reviews.

## A tiny, meagre bit of help

You have a dataset. It has some stuff in it:

* **Numbers:**
    - Year published
    - Alcohol percentage
    - Price
    - Score
    - Bottle size
* **Categories:**
    - Red vs white
    - Different countries
    - Importer
    - Designation
    - Taster
    - Variety
    - Winery
* **Free text:**
    - Wine description

# Cleaning up your data

Many of these pieces - the alcohol, the year produced, the bottle size, the country the wine is from - aren't in a format you can use. Convert the numeric ones to actual numbers, and extract the others from the strings that contain them.

In [58]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
import re
import nltk
import fuzzy_pandas as fpd
from nltk.stem.porter import PorterStemmer
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

%matplotlib inline

In [59]:
df = pd.read_csv("wine-reviews.csv")
df.head(15)

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,alcohol,bottle size,category,importer,date published,user avg rating
0,https://www.winemag.com/buying-guide/artadi-20...,90.0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black pl...",Michael Schachner,"$25, Buy Now",Viñas de Gain,Tempranillo,"Rioja, Northern Spain, Spain",Artadi,14.5%,750 ml,Red,Folio Fine Wine Partners,12/1/2014,Not rated yet [Add Your Review]
1,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2012 Stoller Vineyard Chardonnay (Du...,"A tiny production wine, this is rich, tart and...",Paul Gregutt,"$65, Buy Now",Stoller Vineyard,Chardonnay,"Dundee Hills, Willamette Valley, Oregon, US",Adelsheim,13.5%,750 ml,White,,12/1/2014,Not rated yet [Add Your Review]
2,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2013 Ribbon Springs Vineyard Auxerro...,This is another fine vintage for this rare win...,Paul Gregutt,"$25, Buy Now",Ribbon Springs Vineyard,"Auxerrois, Other White","Ribbon Ridge, Willamette Valley, Oregon, US",Adelsheim,13.5%,750 ml,White,,12/1/2014,Not rated yet [Add Your Review]
3,https://www.winemag.com/buying-guide/jcb-2011-...,90.0,JCB 2011 No. 11 Pinot Noir (Sonoma Coast),Light in color and lilting floral aromas of ro...,Virginie Boone,"$65, Buy Now",No. 11,Pinot Noir,"Sonoma Coast, Sonoma, California, US",JCB,13%,750 ml,Red,,12/1/2014,Not rated yet [Add Your Review]
4,https://www.winemag.com/buying-guide/pazo-pond...,90.0,Pazo Pondal 2013 Albariño (Rías Baixas),"Alluring, inviting aromas of white flowers, me...",Michael Schachner,"$17, Buy Now",,Albariño,"Rías Baixas, Galicia, Spain",Pazo Pondal,13%,750 ml,White,Vinaio Imports,12/1/2014,Not rated yet [Add Your Review]
5,https://www.winemag.com/buying-guide/mumm-napa...,90.0,Mumm Napa 2008 DVX Rosé Sparkling (Napa Valley),"Pretty peach in color, this 50-50 sparkling bl...",Virginie Boone,"$70, Buy Now",DVX Rosé,"Sparkling Blend, Sparkling","Napa Valley, Napa, California, US",Mumm Napa,12.5%,750 ml,Sparkling,,12/1/2014,Not rated yet [Add Your Review]
6,https://www.winemag.com/buying-guide/nuiton-be...,90.0,Nuiton-Beaunoy 2011 Clos du Chapitre Premier C...,The two-acre Clos du Chapitre vineyard is in t...,Roger Voss,"N/A, Buy Now",Clos du Chapitre Premier Cru,Pinot Noir,"Gevrey-Chambertin, Burgundy, France",Nuiton-Beaunoy,13%,750 ml,Red,"Fruit of the Vines, Inc",12/1/2014,Not rated yet [Add Your Review]
7,https://www.winemag.com/buying-guide/trapiche-...,90.0,Trapiche 2012 Broquel Cabernet Sauvignon (Mend...,"Spice, licorice and herbal notes complement re...",Michael Schachner,"$15, Buy Now",Broquel,Cabernet Sauvignon,"Mendoza, Mendoza Province, Argentina",Trapiche,14%,750 ml,Red,The Wine Group,12/1/2014,Not rated yet [Add Your Review]
8,https://www.winemag.com/buying-guide/zonin-201...,90.0,Zonin 2010 Amarone della Valpolicella,"Full-bodied and fresh, this offfers attractive...",Kerin O’Keefe,"$50, Buy Now",,"Red Blends, Red Blends","Amarone della Valpolicella, Veneto, Italy",Zonin,15%,750 ml,Red,Zonin USA,12/1/2014,Not rated yet [Add Your Review]
9,https://www.winemag.com/buying-guide/pali-2012...,90.0,Pali 2012 Cargasacchi Vineyard Pinot Noir (Sta...,"Round, savory aromas of orange-cranberry with ...",Matt Kettmann,"$56, Buy Now",Cargasacchi Vineyard,Pinot Noir,"Sta. Rita Hills, Central Coast, California, US",Pali,13.8%,750 ml,Red,,12/1/2014,Not rated yet [Add Your Review]


In [60]:
df.shape

(21435, 16)

In [61]:
r = {
    '750 ml': 750, '750ML': 750,
    '375 ml': 375, '375ML': 375,
    '500 ml': 500, '500ML': 500,
    '187 ml': 187,
    '1 L': 1000, '1L': 1000,
    '1.5 L': 1500, '1.5L': 1500,
    '3 L': 3000, '3L': 3000,
}

df['bottle_size_ml'] = df['bottle size'].replace(r)
df.bottle_size_ml.value_counts() 

750     21055
375       237
500        94
3000       17
1500       16
1000       14
187         2
Name: bottle_size_ml, dtype: int64
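The lookup table above only covers the labels that happen to appear in this file. As a rough alternative, here is a minimal sketch that parses the number and unit with a regular expression instead. The helper name `bottle_size_to_ml` is hypothetical and not part of the notebook, and it assumes every label is a number followed by `ml` or `L`.

```python
import re

# Hypothetical helper (not in the original notebook): parse labels such as
# '750 ml', '375ML', '1.5 L' or '3L' into millilitres.
_SIZE_RE = re.compile(r'^\s*(\d+(?:\.\d+)?)\s*(ml|l)\s*$', re.IGNORECASE)


def bottle_size_to_ml(label):
    """Return the size in millilitres as an int, or None if the label is not recognised."""
    if not isinstance(label, str):
        return None
    match = _SIZE_RE.match(label)
    if match is None:
        return None
    amount = float(match.group(1))
    unit = match.group(2).lower()
    # Litres are converted to millilitres; millilitres are kept as they are.
    return int(round(amount * 1000)) if unit == 'l' else int(round(amount))


sizes = ['750 ml', '750ML', '1.5 L', '3L', '187 ml', 'magnum']
parsed = [bottle_size_to_ml(s) for s in sizes]
print(parsed)
```

On the real column this could be applied with `df['bottle size'].map(bottle_size_to_ml)`; unrecognised labels come back as missing values rather than being passed through unchanged.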

In [62]:
# '$' is a regex anchor, so ask pandas to treat it as a literal character
df['price_clean'] = df.price.str.split(',', expand=True)[0].str.replace('$', '', regex=False).replace('N/A', np.nan).astype(float)

In [63]:
df['avg_rating_clean'] = df['user avg rating'].str[:2].replace('No', np.nan)

In [64]:
df['alcohol_clean'] = df.alcohol.str.replace('%', '').astype(float)

In [65]:
df['year'] = df.wine_name.str.extract(r'.*?(\d\d\d\d)').astype(float)
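The pattern above grabs the first run of four digits, whatever it is. If some wine names contain other four-digit numbers, a stricter pattern might help. The sketch below assumes vintages fall between 1900 and 2099. The helper `extract_vintage` and the last wine name are made up for illustration.

```python
import re

# Assumption: vintages of interest fall between 1900 and 2099.
_YEAR_RE = re.compile(r'\b(19\d{2}|20\d{2})\b')


def extract_vintage(wine_name):
    """Return the first plausible vintage year in a wine name, or None."""
    match = _YEAR_RE.search(wine_name)
    return int(match.group(1)) if match else None


names = [
    'Artadi 2011 Viñas de Gain (Rioja)',
    'JCB 2011 No. 11 Pinot Noir (Sonoma Coast)',
    'Zonin 2010 Amarone della Valpolicella',
    'Hypothetical Cellars 1000 Acres NV Brut',  # made-up name with no vintage
]
years = [extract_vintage(n) for n in names]
print(years)
```

The same pattern can be passed to `df.wine_name.str.extract(...)` if you would rather stay inside pandas.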



In [66]:
# The country is the last comma-separated piece of the appellation string.
df['appellation_clean'] = df.appellation.str.split(',').str[-1].str.strip()


In [68]:
df.head()

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,...,category,importer,date published,user avg rating,bottle_size_ml,price_clean,avg_rating_clean,alcohol_clean,year,appellation_clean
0,https://www.winemag.com/buying-guide/artadi-20...,90.0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black pl...",Michael Schachner,"$25, Buy Now",Viñas de Gain,Tempranillo,"Rioja, Northern Spain, Spain",Artadi,...,Red,Folio Fine Wine Partners,12/1/2014,Not rated yet [Add Your Review],750,25.0,,14.5,2011.0,Spain
1,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2012 Stoller Vineyard Chardonnay (Du...,"A tiny production wine, this is rich, tart and...",Paul Gregutt,"$65, Buy Now",Stoller Vineyard,Chardonnay,"Dundee Hills, Willamette Valley, Oregon, US",Adelsheim,...,White,,12/1/2014,Not rated yet [Add Your Review],750,65.0,,13.5,2012.0,US
2,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2013 Ribbon Springs Vineyard Auxerro...,This is another fine vintage for this rare win...,Paul Gregutt,"$25, Buy Now",Ribbon Springs Vineyard,"Auxerrois, Other White","Ribbon Ridge, Willamette Valley, Oregon, US",Adelsheim,...,White,,12/1/2014,Not rated yet [Add Your Review],750,25.0,,13.5,2013.0,US
3,https://www.winemag.com/buying-guide/jcb-2011-...,90.0,JCB 2011 No. 11 Pinot Noir (Sonoma Coast),Light in color and lilting floral aromas of ro...,Virginie Boone,"$65, Buy Now",No. 11,Pinot Noir,"Sonoma Coast, Sonoma, California, US",JCB,...,Red,,12/1/2014,Not rated yet [Add Your Review],750,65.0,,13.0,2011.0,US
4,https://www.winemag.com/buying-guide/pazo-pond...,90.0,Pazo Pondal 2013 Albariño (Rías Baixas),"Alluring, inviting aromas of white flowers, me...",Michael Schachner,"$17, Buy Now",,Albariño,"Rías Baixas, Galicia, Spain",Pazo Pondal,...,White,Vinaio Imports,12/1/2014,Not rated yet [Add Your Review],750,17.0,,13.0,2013.0,Spain


In [None]:
#clean: the alcohol, the year produced, the bottle size, the country the wine is from.

## What might be interesting in this dataset?

Maybe start out playing around _without_ machine learning. Here are some thoughts to get you started:

* I've heard that since the 1990s wine has gone through [Parkerization](https://www.estatewinebrokers.com/blog/the-parkerization-of-wine-in-the-1990s-and-beyond/), an increase in production of high-alcohol, fruity red wines thanks to the influence of wine critic Robert Parker.
* Red and white wines taste different, obviously, but people always use [goofy words to describe them](https://winefolly.com/tutorial/40-wine-descriptions/)
* Once upon a time in 1976 [California wines proved themselves against France](https://en.wikipedia.org/wiki/Judgment_of_Paris_(wine)) and France got very angry about it
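The first idea can be poked at with a plain `groupby`, no machine learning required. The sketch below runs on a tiny invented table, so the numbers mean nothing. The column names `category`, `year` and `alcohol_clean` match the ones built in the cleaning section.

```python
import pandas as pd

# Toy stand-in for the cleaned frame; the values are invented for illustration only.
toy = pd.DataFrame({
    'category': ['Red', 'Red', 'Red', 'Red', 'White', 'White'],
    'year': [1995, 1995, 2010, 2010, 1995, 2010],
    'alcohol_clean': [12.5, 13.5, 14.5, 15.5, 12.0, 13.0],
})

# Average alcohol of red wines per vintage year.
reds = toy[toy['category'] == 'Red']
alcohol_by_year = reds.groupby('year')['alcohol_clean'].mean()
print(alcohol_by_year)
```

Swapping `toy` for the real `df` and plotting the result would be one way to see whether red wines in this dataset get boozier over time.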

In [69]:
df['variety'].value_counts()

Pinot Noir                                         2429
Chardonnay                                         1991
Cabernet Sauvignon                                 1847
Red Blends, Red Blends                             1226
Bordeaux-style Red Blend                           1035
Syrah                                               964
Riesling                                            849
Sauvignon Blanc                                     759
Merlot                                              643
Rosé                                                536
Zinfandel                                           498
Malbec                                              429
Champagne Blend, Sparkling                          386
Portuguese Red                                      363
Tempranillo                                         357
White Blend                                         342
Sparkling Blend, Sparkling                          338
Nebbiolo                                        

In [71]:
keep_cols=['variety','wine_points']
df1=df[keep_cols]
df1.head()

Unnamed: 0,variety,wine_points
0,Tempranillo,90.0
1,Chardonnay,90.0
2,"Auxerrois, Other White",90.0
3,Pinot Noir,90.0
4,Albariño,90.0


In [72]:
df1 = df1.copy()  # explicit copy avoids SettingWithCopyWarning
df1['grapes'] = df1.variety.str.replace(r',.*$', '', regex=True).astype(str)  # keep the text before the first comma


In [73]:
df1 = df1[~df1['grapes'].str.contains('Blend')]

In [74]:
df1 = df1[~df1['grapes'].str.contains('Portuguese')]

In [75]:
df1['grapes'].value_counts()


Pinot Noir                      2429
Chardonnay                      1991
Cabernet Sauvignon              1847
Syrah                            964
Riesling                         849
Sauvignon Blanc                  759
Merlot                           643
Rosé                             536
Zinfandel                        498
Malbec                           429
Tempranillo                      357
Nebbiolo                         300
Grüner Veltliner                 287
Shiraz                           235
Sangiovese                       229
Pinot Gris                       197
Viognier                         191
Cabernet Franc                   182
Chenin Blanc                     163
Gewürztraminer                   154
Barbera                          153
Pinot Grigio                     143
Petite Sirah                     137
Port                             134
Gamay                            123
Grenache                         110
Carmenère                         83
C



In [76]:
df.wine_points.value_counts()

90.0     2312
87.0     2280
92.0     1954
88.0     1839
93.0     1767
91.0     1746
86.0     1663
84.0     1436
94.0     1263
89.0     1139
85.0      943
83.0      929
82.0      680
95.0      635
81.0      257
96.0      231
80.0      151
97.0      136
98.0       46
99.0       17
100.0      11
Name: wine_points, dtype: int64

In [78]:
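# One indicator column per grape; the 'wine_points' prefix is only a label, each column is a grape variety.
# dtype=float keeps the design matrix numeric (newer pandas versions default to boolean columns).
df_grape = pd.get_dummies(df1.grapes, prefix='wine_points', dtype=float)
df_grape.shape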

(16880, 387)

In [81]:
y = df1.wine_points
X = df_grape

In [82]:
mod = sm.OLS(y, X)
res = mod.fit()
res.summary()

0,1,2,3
Dep. Variable:,wine_points,R-squared:,0.149
Model:,OLS,Adj. R-squared:,0.129
Method:,Least Squares,F-statistic:,7.473
Date:,"Tue, 06 Aug 2019",Prob (F-statistic):,0.0
Time:,16:45:11,Log-Likelihood:,-45062.0
No. Observations:,16880,AIC:,90900.0
Df Residuals:,16493,BIC:,93890.0
Df Model:,386,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
wine_points_Agiorgitiko,84.5000,0.833,101.466,0.000,82.868,86.132
wine_points_Aglianico,90.6842,0.811,111.876,0.000,89.095,92.273
wine_points_Airen,82.0000,3.533,23.208,0.000,75.075,88.925
wine_points_Albana,89.2000,1.580,56.452,0.000,86.103,92.297
wine_points_Albariño,88.1343,0.432,204.179,0.000,87.288,88.980
wine_points_Aleatico,83.5000,2.498,33.422,0.000,78.603,88.397
wine_points_Alfrocheiro,90.0000,2.040,44.120,0.000,86.002,93.998
wine_points_Alicante Bouschet,90.0000,1.065,84.483,0.000,87.912,92.088
wine_points_Aligoté,87.0000,3.533,24.623,0.000,80.075,93.925

0,1,2,3
Omnibus:,481.915,Durbin-Watson:,0.426
Prob(Omnibus):,0.0,Jarque-Bera (JB):,297.46
Skew:,-0.186,Prob(JB):,2.56e-65
Kurtosis:,2.467,Cond. No.,49.3
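A note on reading the table above: the model has no intercept and one indicator column per grape. In that setup each least-squares coefficient is simply the mean score of that grape. The sketch below checks this on invented scores, using `numpy.linalg.lstsq` as a stand-in for `sm.OLS`.

```python
import numpy as np
import pandas as pd

# Invented scores, for illustration only.
toy = pd.DataFrame({
    'grapes': ['Merlot', 'Merlot', 'Syrah', 'Syrah', 'Syrah'],
    'wine_points': [88.0, 90.0, 91.0, 92.0, 93.0],
})

# One indicator column per grape, no intercept column.
X_toy = pd.get_dummies(toy['grapes'], dtype=float)
coefs, *_ = np.linalg.lstsq(X_toy.to_numpy(), toy['wine_points'].to_numpy(), rcond=None)
fitted = pd.Series(coefs, index=X_toy.columns)

# Per-grape mean score, computed directly.
group_means = toy.groupby('grapes')['wine_points'].mean()

print(fitted)
print(group_means)
```

So `df1.groupby('grapes').wine_points.mean()` should give the same ranking as the `coef` column. The regression adds standard errors, which get very wide for grapes with only one or two reviews.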


## But machine learning?

Well, you can usually break machine learning down into a few different things. These aren't necessarily perfect ways of categorizing things, but eh, close enough.

* **Predicting a number**
    - Linear regression
    - How does a change in unemployment translate into a change in life expectancy?
* **Predicting a category** (aka classification)
    - Lots of algorithm options: logistic regression, random forest, etc
    - For example, predicting cuisines based on ingredients
* **Seeing what influences a numeric outcome**
    - Linear regression since the output is a number
    - For example, minority and poverty status on test scores 
* **Seeing what influences a categorical outcome**
    - Logistic regression since the output is a category
    - For example, race and car speed for whether you get a warning vs a ticket
    - Or wet/dry pavement and car weight for whether you survive a car crash
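To make the split concrete, here is a minimal sketch of both flavours in scikit-learn. The data is invented, and using price as the single feature is just an assumption for the example. Linear regression predicts a number, logistic regression predicts a category, and the fitted coefficient is where "what influences the outcome" shows up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Invented numbers, for illustration only.
price = np.array([[10.0], [20.0], [30.0], [40.0], [50.0], [60.0]])
score = np.array([82.0, 84.0, 86.0, 88.0, 90.0, 92.0])  # a number
is_red = np.array([0, 0, 0, 1, 1, 1])                   # a category

# Predicting a number: linear regression.
reg = LinearRegression().fit(price, score)
predicted_score = reg.predict([[35.0]])[0]
points_per_dollar = reg.coef_[0]  # how much the score moves per extra dollar

# Predicting a category: logistic regression.
clf = LogisticRegression().fit(price, is_red)
predicted_class = clf.predict([[55.0]])[0]

print(predicted_score, points_per_dollar, predicted_class)
```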

We have numbers, we have categories, we have all sorts of stuff. **What are some ways we can mash them together and use machine learning?**

### Brainstorm some ideas

Use the categories above to try to come up with some ideas. Be sure to scroll up to where I break down categories vs numbers vs text!

**I'll give you one idea for free:** if you don't have any ideas, start off by creating a classifier that determines whether a wine is white or red based on the wine's description.
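A minimal sketch of that free idea: count the words in each description and hand the counts to a logistic regression. The six descriptions below are invented. On the real data the inputs would presumably be `df.wine_desc` and `df.category`, filtered down to red and white wines first.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented descriptions, for illustration only.
descriptions = [
    'dark cherry and firm tannins',
    'blackberry, plum and smoky tannins',
    'black cherry with leather and tannins',
    'crisp apple and lemon citrus',
    'pear and citrus with bright acidity',
    'lemon zest and green apple',
]
colors = ['Red', 'Red', 'Red', 'White', 'White', 'White']

# Word counts feed straight into the classifier.
color_clf = make_pipeline(CountVectorizer(stop_words='english'), LogisticRegression())
color_clf.fit(descriptions, colors)

predictions = color_clf.predict(['ripe blackberry and tannins', 'zesty lemon and apple'])
print(predictions)
```

Wrapping the vectorizer and the model in one pipeline means new text is vectorized with the same vocabulary the model was trained on.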

You can also go to https://library.columbia.edu and see if you can find some academic papers about wine. I'm sure they'll inspire you! (and they might even have some ML ideas in them you can steal, too)

# Implement 2 of your machine learning ideas

In [128]:
df['is_spanish'] = 0


In [130]:
df.loc[df.appellation_clean=='Spain', 'is_spanish'] = 1

In [131]:
df

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,...,importer,date published,user avg rating,bottle_size_ml,price_clean,avg_rating_clean,alcohol_clean,year,appellation_clean,is_spanish
0,https://www.winemag.com/buying-guide/artadi-20...,90.0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black pl...",Michael Schachner,"$25, Buy Now",Viñas de Gain,Tempranillo,"Rioja, Northern Spain, Spain",Artadi,...,Folio Fine Wine Partners,12/1/2014,Not rated yet [Add Your Review],750,25.0,,14.5,2011.0,Spain,1
1,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2012 Stoller Vineyard Chardonnay (Du...,"A tiny production wine, this is rich, tart and...",Paul Gregutt,"$65, Buy Now",Stoller Vineyard,Chardonnay,"Dundee Hills, Willamette Valley, Oregon, US",Adelsheim,...,,12/1/2014,Not rated yet [Add Your Review],750,65.0,,13.5,2012.0,US,0
2,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2013 Ribbon Springs Vineyard Auxerro...,This is another fine vintage for this rare win...,Paul Gregutt,"$25, Buy Now",Ribbon Springs Vineyard,"Auxerrois, Other White","Ribbon Ridge, Willamette Valley, Oregon, US",Adelsheim,...,,12/1/2014,Not rated yet [Add Your Review],750,25.0,,13.5,2013.0,US,0
3,https://www.winemag.com/buying-guide/jcb-2011-...,90.0,JCB 2011 No. 11 Pinot Noir (Sonoma Coast),Light in color and lilting floral aromas of ro...,Virginie Boone,"$65, Buy Now",No. 11,Pinot Noir,"Sonoma Coast, Sonoma, California, US",JCB,...,,12/1/2014,Not rated yet [Add Your Review],750,65.0,,13.0,2011.0,US,0
4,https://www.winemag.com/buying-guide/pazo-pond...,90.0,Pazo Pondal 2013 Albariño (Rías Baixas),"Alluring, inviting aromas of white flowers, me...",Michael Schachner,"$17, Buy Now",,Albariño,"Rías Baixas, Galicia, Spain",Pazo Pondal,...,Vinaio Imports,12/1/2014,Not rated yet [Add Your Review],750,17.0,,13.0,2013.0,Spain,1
5,https://www.winemag.com/buying-guide/mumm-napa...,90.0,Mumm Napa 2008 DVX Rosé Sparkling (Napa Valley),"Pretty peach in color, this 50-50 sparkling bl...",Virginie Boone,"$70, Buy Now",DVX Rosé,"Sparkling Blend, Sparkling","Napa Valley, Napa, California, US",Mumm Napa,...,,12/1/2014,Not rated yet [Add Your Review],750,70.0,,12.5,2008.0,US,0
6,https://www.winemag.com/buying-guide/nuiton-be...,90.0,Nuiton-Beaunoy 2011 Clos du Chapitre Premier C...,The two-acre Clos du Chapitre vineyard is in t...,Roger Voss,"N/A, Buy Now",Clos du Chapitre Premier Cru,Pinot Noir,"Gevrey-Chambertin, Burgundy, France",Nuiton-Beaunoy,...,"Fruit of the Vines, Inc",12/1/2014,Not rated yet [Add Your Review],750,,,13.0,2011.0,France,0
7,https://www.winemag.com/buying-guide/trapiche-...,90.0,Trapiche 2012 Broquel Cabernet Sauvignon (Mend...,"Spice, licorice and herbal notes complement re...",Michael Schachner,"$15, Buy Now",Broquel,Cabernet Sauvignon,"Mendoza, Mendoza Province, Argentina",Trapiche,...,The Wine Group,12/1/2014,Not rated yet [Add Your Review],750,15.0,,14.0,2012.0,Argentina,0
8,https://www.winemag.com/buying-guide/zonin-201...,90.0,Zonin 2010 Amarone della Valpolicella,"Full-bodied and fresh, this offfers attractive...",Kerin O’Keefe,"$50, Buy Now",,"Red Blends, Red Blends","Amarone della Valpolicella, Veneto, Italy",Zonin,...,Zonin USA,12/1/2014,Not rated yet [Add Your Review],750,50.0,,15.0,2010.0,Italy,0
9,https://www.winemag.com/buying-guide/pali-2012...,90.0,Pali 2012 Cargasacchi Vineyard Pinot Noir (Sta...,"Round, savory aromas of orange-cranberry with ...",Matt Kettmann,"$56, Buy Now",Cargasacchi Vineyard,Pinot Noir,"Sta. Rita Hills, Central Coast, California, US",Pali,...,,12/1/2014,Not rated yet [Add Your Review],750,56.0,,13.8,2012.0,US,0


In [132]:
df = df[df.wine_desc.notnull()]

In [133]:
count_vectorizer = CountVectorizer(stop_words='english', max_features = 2000)
X = count_vectorizer.fit_transform(df.wine_desc)
print(count_vectorizer.get_feature_names_out().tolist())

['000', '04', '05', '10', '100', '11', '12', '13', '14', '15', '16', '17', '18', '20', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023', '2024', '2025', '2026', '2027', '2028', '2030', '21', '25', '30', '35', '40', '45', '50', '60', '70', '75', '80', '85', '90', '93', '94', '95', '96', '97', 'abound', 'absolutely', 'abundant', 'acacia', 'accent', 'accented', 'accents', 'acceptable', 'accessible', 'accompanied', 'acid', 'acidic', 'acidity', 'acids', 'acre', 'actually', 'add', 'added', 'adding', 'addition', 'additional', 'adds', 'adequate', 'affordable', 'aftertaste', 'age', 'ageability', 'ageable', 'aged', 'ager', 'ages', 'ageworthy', 'aggressive', 'aging', 'ago', 'ahead', 'air', 'airing', 'alcohol', 'alcoholic', 'alicante', 'allow', 'allowing', 'allspice', 'alluring', 'almond', 'almonds', 'alongside', 'amazing', 'amazingly', 'american', 'amidst',

In [134]:
word_df = pd.DataFrame(X.toarray(), columns=count_vectorizer.get_feature_names_out())
word_df.head()

Unnamed: 0,000,04,05,10,100,11,12,13,14,15,...,young,youth,youthful,zest,zestiness,zesty,zin,zinfandel,zingy,zippy
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [135]:
word_df.sum().sort_values(ascending=False).head(15)

wine       13886
flavors    10931
fruit       8510
palate      5389
finish      5379
aromas      5318
acidity     5240
tannins     4902
cherry      4522
ripe        4347
black       4332
drink       3958
rich        3355
red         3093
dry         3018
dtype: int64

In [136]:
porter_stemmer = PorterStemmer()
def stemming_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    words = [porter_stemmer.stem(word) for word in words]
    return words

In [148]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=stemming_tokenizer, use_idf=False, max_features = 200)
X = tfidf_vectorizer.fit_transform(df.wine_desc)
df_spain = pd.DataFrame(X.toarray(), columns=tfidf_vectorizer.get_feature_names_out())



In [149]:
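# df kept its original index after filtering, so assign by position rather than by index label.
df_spain['is spanish'] = df['is_spanish'].to_numpy()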
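One thing to watch when attaching a label column to a freshly built frame: pandas lines values up by index label, not by row position. After rows have been filtered out, those two can disagree. The toy frames below are invented to show the difference.

```python
import pandas as pd

# Invented data: one row has no text and gets filtered out.
original = pd.DataFrame({
    'text': ['a', None, 'c', 'd'],
    'label': [1, 0, 0, 1],
})
kept = original[original['text'].notnull()]    # index is now [0, 2, 3]
features = pd.DataFrame({'n_chars': [1, 1, 1]})  # fresh index [0, 1, 2]

# Aligned on index labels: row 1 gets NaN and row 2 gets the label of a different row.
by_label = features.assign(label=kept['label'])

# Aligned on row order: each feature row gets the label of the row it was built from.
by_position = features.assign(label=kept['label'].to_numpy())

print(by_label)
print(by_position)
```

Resetting the index right after filtering, with `reset_index(drop=True)`, is another way to keep labels and positions in step.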

In [150]:
df_spain

Unnamed: 0,accent,acid,add,age,alcohol,appl,apricot,aroma,aromat,bake,...,way,weight,white,wild,wine,wineri,wood,year,young,is spanish
0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.242536,0.000000,0.242536,...,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,1.0
1,0.000000,0.000000,0.000000,0.000000,0.000000,0.213201,0.000000,0.000000,0.000000,0.213201,...,0.000000,0.0,0.000000,0.0,0.213201,0.0,0.000000,0.000000,0.000000,0.0
2,0.000000,0.000000,0.258199,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.258199,0.0,0.258199,0.0,0.000000,0.000000,0.000000,0.0
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.333333,0.000000,0.000000,...,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.204124,0.000000,0.000000,...,0.000000,0.0,0.204124,0.0,0.000000,0.0,0.000000,0.000000,0.000000,1.0
5,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.223607,0.000000,0.000000,...,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0
6,0.000000,0.213201,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.0,0.213201,0.0,0.000000,0.000000,0.000000,0.0
7,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.218218,0.000000,0.000000,...,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0
8,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.213201,0.000000,0.213201,...,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0
9,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.267261,0.000000,0.000000,...,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0


In [151]:
df_spain = df_spain.dropna()


In [152]:
df_spain.head()

Unnamed: 0,accent,acid,add,age,alcohol,appl,apricot,aroma,aromat,bake,...,way,weight,white,wild,wine,wineri,wood,year,young,is spanish
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.242536,0.0,0.242536,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,0.0,0.0,0.0,0.213201,0.0,0.0,0.0,0.213201,...,0.0,0.0,0.0,0.0,0.213201,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.258199,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.258199,0.0,0.258199,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.204124,0.0,0.0,...,0.0,0.0,0.204124,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [154]:
X = df_spain.drop('is spanish', axis = 1)
y = df_spain['is spanish']


In [155]:
forest = RandomForestClassifier(max_depth=5, n_estimators=100)
forest.fit(X, y) 
print(forest.feature_importances_)

[0.         0.00974044 0.00167454 0.00567295 0.00103424 0.01081297
 0.         0.00579075 0.00094556 0.00060888 0.01116025 0.00261379
 0.         0.0105347  0.00430825 0.00048884 0.00396078 0.00572559
 0.00403225 0.00430509 0.00387287 0.0064249  0.00241996 0.00758271
 0.00250133 0.00238823 0.00092954 0.00301315 0.00241998 0.00446183
 0.00478926 0.00680836 0.02855262 0.01074961 0.00268612 0.0092665
 0.00414373 0.00100479 0.0139107  0.00285629 0.00268634 0.000418
 0.000625   0.00250227 0.00645674 0.00943825 0.00445411 0.00032271
 0.0040958  0.00040502 0.00771208 0.00260467 0.         0.00085523
 0.00084645 0.00255802 0.00144924 0.00437365 0.00847751 0.01640048
 0.0061884  0.00170762 0.00342482 0.00308039 0.00064659 0.0078308
 0.00875561 0.00929065 0.         0.01846524 0.0021895  0.00045354
 0.0005224  0.0082445  0.00540929 0.00585691 0.00271644 0.00275961
 0.00888574 0.0005421  0.00178042 0.00330079 0.01427344 0.00266172
 0.         0.01216835 0.00827522 0.00143344 0.00507442 0.0010092


In [156]:
feature_names = X.columns
importances = forest.feature_importances_

pd.DataFrame({
    'feature': feature_names,
    'feature importance': importances,
}).sort_values(by='feature importance', ascending=False).head(20)

Unnamed: 0,feature,feature importance
108,make,0.06357
159,smoki,0.03097
32,cellar,0.028553
134,pineappl,0.021657
171,sweet,0.019986
69,flavor,0.018465
181,time,0.017114
59,drink,0.0164
179,thi,0.016034
103,littl,0.015597
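The forest above is fit on every row, so the importances say which words the model leaned on but not how well it would do on reviews it has not seen. A minimal sketch of a held-out check follows. It uses synthetic data from `make_classification` as a stand-in for the word features and the `is spanish` label, and the 90/10 class balance is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the word features and the 'is spanish' label.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=0
)

# Hold out a quarter of the rows, keeping the class balance the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

forest_toy = RandomForestClassifier(max_depth=5, n_estimators=100, random_state=0)
forest_toy.fit(X_train, y_train)

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_test, forest_toy.predict(X_test))
print(cm)
```

With a rare class like Spanish wines, the confusion matrix is more informative than plain accuracy, since a model that always answers "not Spanish" would still score well on accuracy.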
