# Machine learning on wine

**Topics:** Text analysis, linear regression, logistic regression, classification

**Datasets**

- **wine-reviews.csv** Wine reviews scraped from https://www.winemag.com/
- **Data dictionary:** just go [here](https://www.winemag.com/buying-guide/tenuta-dellornellaia-2007-masseto-merlot-toscana/) and look at the page

## The background

You work in the **worst newsroom in the world**, and you've had a hard few weeks at work - a couple stories killed, a few scoops stolen out from under you. It's not going well.

And because things just can't get any worse: your boss shows up, carrying a huge binder. She slams it down on your desk.

"You know some machine learning stuff, right?"

You say "no," but she isn't listening. She's giving you an assignment, the _worst assignment_:

> Machine learning is the new maps. Let's get some hits!
>
> **Do some machine learning on this stuff.**

"This stuff" is wine reviews.

## A tiny, meagre bit of help

You have a dataset. It has some stuff in it:

* **Numbers:**
    - Year published
    - Alcohol percentage
    - Price
    - Score
    - Bottle size
* **Categories:**
    - Red vs white
    - Different countries
    - Importer
    - Designation
    - Taster
    - Variety
    - Winery
* **Free text:**
    - Wine description

# Cleaning up your data

Many of these pieces - the alcohol, the year produced, the bottle size, the country the wine is from - aren't in a format you can use. Convert the numeric ones into actual numbers, and extract the others from the strings they're buried in.

In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import re
import nltk
import fuzzy_pandas as fpd
import numpy as np
from nltk.stem.porter import PorterStemmer
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

In [5]:
df = pd.read_csv('wine-reviews.csv')

In [6]:
df.head()

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,alcohol,bottle size,category,importer,date published,user avg rating
0,https://www.winemag.com/buying-guide/artadi-20...,90.0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black pl...",Michael Schachner,"$25, Buy Now",Viñas de Gain,Tempranillo,"Rioja, Northern Spain, Spain",Artadi,14.5%,750 ml,Red,Folio Fine Wine Partners,12/1/2014,Not rated yet [Add Your Review]
1,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2012 Stoller Vineyard Chardonnay (Du...,"A tiny production wine, this is rich, tart and...",Paul Gregutt,"$65, Buy Now",Stoller Vineyard,Chardonnay,"Dundee Hills, Willamette Valley, Oregon, US",Adelsheim,13.5%,750 ml,White,,12/1/2014,Not rated yet [Add Your Review]
2,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2013 Ribbon Springs Vineyard Auxerro...,This is another fine vintage for this rare win...,Paul Gregutt,"$25, Buy Now",Ribbon Springs Vineyard,"Auxerrois, Other White","Ribbon Ridge, Willamette Valley, Oregon, US",Adelsheim,13.5%,750 ml,White,,12/1/2014,Not rated yet [Add Your Review]
3,https://www.winemag.com/buying-guide/jcb-2011-...,90.0,JCB 2011 No. 11 Pinot Noir (Sonoma Coast),Light in color and lilting floral aromas of ro...,Virginie Boone,"$65, Buy Now",No. 11,Pinot Noir,"Sonoma Coast, Sonoma, California, US",JCB,13%,750 ml,Red,,12/1/2014,Not rated yet [Add Your Review]
4,https://www.winemag.com/buying-guide/pazo-pond...,90.0,Pazo Pondal 2013 Albariño (Rías Baixas),"Alluring, inviting aromas of white flowers, me...",Michael Schachner,"$17, Buy Now",,Albariño,"Rías Baixas, Galicia, Spain",Pazo Pondal,13%,750 ml,White,Vinaio Imports,12/1/2014,Not rated yet [Add Your Review]


In [7]:
# Map each spelling of the bottle size to millilitres
r = {'750 ml': 750, '750ML': 750,
     '375 ml': 375, '375ML': 375,
     '500 ml': 500, '500ML': 500,
     '1 L': 1000, '1L': 1000,
     '1.5 L': 1500, '1.5L': 1500,
     '3 L': 3000, '3L': 3000,
     '187 ml': 187}

df['bottle_size_ml'] = df['bottle size'].replace(r)
df.bottle_size_ml.value_counts()

750     21055
375       237
500        94
3000       17
1500       16
1000       14
187         2
Name: bottle_size_ml, dtype: int64
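
The lookup table above needs a new entry for every spelling that turns up. A regex can do the same job more generally. This is only a minimal sketch: `parse_bottle_ml` is a hypothetical helper name, not something from the original notebook, and it assumes every size is written as a number followed by `ml` or `L`.

```python
import re

# Hypothetical helper: turn strings like '750 ml', '1.5L' or '375ML' into millilitres.
_SIZE_RE = re.compile(r'^\s*(\d+(?:\.\d+)?)\s*(ml|l)\s*$', re.IGNORECASE)

def parse_bottle_ml(text):
    """Return the size in millilitres, or None if the string doesn't look like a size."""
    if not isinstance(text, str):
        return None  # missing values (NaN) land here
    match = _SIZE_RE.match(text)
    if match is None:
        return None
    amount, unit = float(match.group(1)), match.group(2).lower()
    # Litres are converted to millilitres; millilitres pass through unchanged
    return amount * 1000 if unit == 'l' else amount
```

On the real data this could be applied with `df['bottle size'].map(parse_bottle_ml)`.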

In [8]:
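# Keep the part before the comma ("$25, Buy Now" -> "$25"), drop the literal dollar sign, convert to float
df['price_clean'] = df.price.str.split(',', expand=True)[0].str.replace('$', '', regex=False).replace('N/A', np.nan).astype(float)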

In [9]:
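# The first two characters hold the rating ("90 [Add Your Review]"); "Not rated yet" becomes NaN
df['avg_rating_clean'] = df['user avg rating'].str[:2].replace('No', np.nan).astype(float)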

In [15]:
df['user avg rating'].value_counts()

Not rated yet [Add Your Review]    21427
90 [Add Your Review]                   3
89 [Add Your Review]                   2
83 [Add Your Review]                   1
95 [Add Your Review]                   1
98 [Add Your Review]                   1
Name: user avg rating, dtype: int64

In [16]:
df['alcohol_clean'] = df.alcohol.str.replace('%', '').astype(float)

In [17]:
# The first four-digit run in the wine name is taken as the vintage year
df['year'] = df.wine_name.str.extract(r'(\d{4})', expand=False).astype(float)

In [24]:
df.wine_desc.value_counts()

86-88 This could work as a rich wine, because there is good structure and piles of botrytis. It could be delicious, with its lovely dry finish, but that's for the future.    3
Brilliant, pure fruit flavors light up this full-bodied but easy-to-drink wine. Cherry, strawberry and something deeper like black currant dance through the aromas and coat the tongue with their richness. The texture is smooth and plush due to fine tannins and mild acidity.

In [19]:
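# The country is the last comma-separated piece of the appellation ("Rioja, Northern Spain, Spain" -> "Spain")
df['appellation_clean'] = df.appellation.str.split(',').str[-1].str.strip()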

In [23]:
df.variety.value_counts()

Pinot Noir                                     2429
Chardonnay                                     1991
Cabernet Sauvignon                             1847
Red Blends, Red Blends                         1226
Bordeaux-style Red Blend                       1035
Syrah                                           964
Riesling                                        849
Sauvignon Blanc                                 759
Merlot                                          643
Rosé                                            536
Zinfandel                                       498
Malbec                                          429
Champagne Blend, Sparkling                      386
Portuguese Red                                  363
Tempranillo                                     357
White Blend                                     342
Sparkling Blend, Sparkling                      338
Nebbiolo                                        300
Grüner Veltliner                                287
Rhône-style 

## What might be interesting in this dataset?

Maybe start out playing around _without_ machine learning. Here are some thoughts to get you started:

* I've heard that since the 1990s wine has gone through [Parkerization](https://www.estatewinebrokers.com/blog/the-parkerization-of-wine-in-the-1990s-and-beyond/), an increase in production of high-alcohol, fruity red wines thanks to the influence of wine critic Robert Parker.
* Red and white wines taste different, obviously, but people always use [goofy words to describe them](https://winefolly.com/tutorial/40-wine-descriptions/).
* Once upon a time in 1976 [California wines proved themselves against France](https://en.wikipedia.org/wiki/Judgment_of_Paris_(wine)) and France got very angry about it.
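
One way to poke at the Parkerization idea without any machine learning is to look at average alcohol by vintage for red wines. A rough sketch, assuming the cleaned `alcohol_clean` and `year` columns built earlier and the `category` column from the raw data. The function name `mean_alcohol_by_year` and the tiny frame used to exercise it are made up for illustration.

```python
import pandas as pd

def mean_alcohol_by_year(wines, category='Red'):
    """Average alcohol percentage per vintage year for one wine category."""
    subset = wines[wines['category'] == category]
    # Rows with a missing year or alcohol value are skipped by groupby/mean
    return subset.groupby('year')['alcohol_clean'].mean().sort_index()

# Made-up rows, only to show the shape of the result
toy = pd.DataFrame({
    'category': ['Red', 'Red', 'White', 'Red'],
    'year': [2000.0, 2000.0, 2000.0, 2010.0],
    'alcohol_clean': [13.0, 14.0, 12.0, 14.5],
})
trend = mean_alcohol_by_year(toy)
```

On the real data, `mean_alcohol_by_year(df).plot()` would show whether the average creeps upward over the years.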

In [25]:
df.wine_points.value_counts()

90.0     2312
87.0     2280
92.0     1954
88.0     1839
93.0     1767
91.0     1746
86.0     1663
84.0     1436
94.0     1263
89.0     1139
85.0      943
83.0      929
82.0      680
95.0      635
81.0      257
96.0      231
80.0      151
97.0      136
98.0       46
99.0       17
100.0      11
Name: wine_points, dtype: int64

## But machine learning?

Well, you can usually break machine learning down into a few different things. These aren't necessarily perfect ways of categorizing things, but eh, close enough.

* **Predicting a number**
    - Linear regression
    - How does a change in unemployment translate into a change in life expectancy?
* **Predicting a category** (aka classification)
    - Lots of algorithm options: logistic regression, random forest, etc.
    - For example, predicting cuisines based on ingredients
* **Seeing what influences a numeric outcome**
    - Linear regression since the output is a number
    - For example, minority and poverty status on test scores 
* **Seeing what influences a categorical outcome**
    - Logistic regression since the output is a category
    - Race and car speed for whether you get a warning vs a ticket
    - Wet/dry pavement and car weight for whether you survive a car crash

We have numbers, we have categories, we have all sorts of stuff. **What are some ways we can mash them together and use machine learning?**

### Brainstorm some ideas

Use the categories above to try to come up with some ideas. Be sure to scroll up to where I break down categories vs numbers vs text!

**I'll give you one idea for free:** if you don't have any ideas, start off by creating a classifier that determines whether a wine is white or red based on the wine's description.
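
A minimal sketch of that free idea, assuming a bag-of-words model is good enough to start with. The helper name `train_color_classifier` and the six descriptions are made up; on the real data you would pass `wine_desc` and `category` after keeping only red and white wines with a non-null description.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_color_classifier(descriptions, labels):
    """Fit a bag-of-words logistic regression that predicts the wine category."""
    model = make_pipeline(
        CountVectorizer(stop_words='english'),  # word counts, common English words dropped
        LogisticRegression(max_iter=1000),
    )
    model.fit(descriptions, labels)
    return model

# Made-up descriptions, only to show the call pattern
toy_desc = [
    'dark cherry and firm tannins',
    'blackberry, leather and tannins',
    'plum and cherry with chewy tannins',
    'crisp citrus and green apple',
    'zesty lemon and crisp pear',
    'citrus, apple and fresh pear',
]
toy_labels = ['Red', 'Red', 'Red', 'White', 'White', 'White']

clf = train_color_classifier(toy_desc, toy_labels)
guess = clf.predict(['blackberry and tannins', 'crisp lemon and pear'])
```

Wrapping the vectorizer and the model in one pipeline means new descriptions go through exactly the same word-counting step as the training text.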

In [None]:
# does the grape affect the points (people like the wine)? 

You can also go to https://library.columbia.edu and see if you can find some academic papers about wine. I'm sure they'll inspire you! (and they might even have some ML ideas in them you can steal, too)

# Implement 2 of your machine learning ideas

In [26]:
# Keep only the first name in the variety ("Auxerrois, Other White" -> "Auxerrois")
df['grapes'] = df.variety.str.replace(r',.*$', '', regex=True).astype(str)

In [37]:
df = df[~df['grapes'].str.contains('Blend')]

In [39]:
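# .copy() makes df an independent frame, so adding columns later doesn't raise SettingWithCopyWarning
df = df[~df['grapes'].str.contains('Portuguese')].copy()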

In [52]:
df.columns

Index(['url', 'wine_points', 'wine_name', 'wine_desc', 'taster', 'price',
       'designation', 'variety', 'appellation', 'winery', 'alcohol',
       'bottle size', 'category', 'importer', 'date published',
       'user avg rating', 'bottle_size_ml', 'price_clean', 'avg_rating_clean',
       'alcohol_clean', 'year', 'appellation_clean', 'grapes'],
      dtype='object')

In [53]:
df.head()

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,...,importer,date published,user avg rating,bottle_size_ml,price_clean,avg_rating_clean,alcohol_clean,year,appellation_clean,grapes
0,https://www.winemag.com/buying-guide/artadi-20...,90.0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black pl...",Michael Schachner,"$25, Buy Now",Viñas de Gain,Tempranillo,"Rioja, Northern Spain, Spain",Artadi,...,Folio Fine Wine Partners,12/1/2014,Not rated yet [Add Your Review],750,25.0,,14.5,2011.0,Spain,Tempranillo
1,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2012 Stoller Vineyard Chardonnay (Du...,"A tiny production wine, this is rich, tart and...",Paul Gregutt,"$65, Buy Now",Stoller Vineyard,Chardonnay,"Dundee Hills, Willamette Valley, Oregon, US",Adelsheim,...,,12/1/2014,Not rated yet [Add Your Review],750,65.0,,13.5,2012.0,US,Chardonnay
2,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2013 Ribbon Springs Vineyard Auxerro...,This is another fine vintage for this rare win...,Paul Gregutt,"$25, Buy Now",Ribbon Springs Vineyard,"Auxerrois, Other White","Ribbon Ridge, Willamette Valley, Oregon, US",Adelsheim,...,,12/1/2014,Not rated yet [Add Your Review],750,25.0,,13.5,2013.0,US,Auxerrois
3,https://www.winemag.com/buying-guide/jcb-2011-...,90.0,JCB 2011 No. 11 Pinot Noir (Sonoma Coast),Light in color and lilting floral aromas of ro...,Virginie Boone,"$65, Buy Now",No. 11,Pinot Noir,"Sonoma Coast, Sonoma, California, US",JCB,...,,12/1/2014,Not rated yet [Add Your Review],750,65.0,,13.0,2011.0,US,Pinot Noir
4,https://www.winemag.com/buying-guide/pazo-pond...,90.0,Pazo Pondal 2013 Albariño (Rías Baixas),"Alluring, inviting aromas of white flowers, me...",Michael Schachner,"$17, Buy Now",,Albariño,"Rías Baixas, Galicia, Spain",Pazo Pondal,...,Vinaio Imports,12/1/2014,Not rated yet [Add Your Review],750,17.0,,13.0,2013.0,Spain,Albariño


In [51]:
df.grapes.value_counts()

Pinot Noir                 2429
Chardonnay                 1991
Cabernet Sauvignon         1847
Syrah                       964
Riesling                    849
Sauvignon Blanc             759
Merlot                      643
Rosé                        536
Zinfandel                   498
Malbec                      429
Tempranillo                 357
Nebbiolo                    300
Grüner Veltliner            287
Shiraz                      235
Sangiovese                  229
Pinot Gris                  197
Viognier                    191
Cabernet Franc              182
Chenin Blanc                163
Gewürztraminer              154
Barbera                     153
Pinot Grigio                143
Petite Sirah                137
Port                        134
Gamay                       123
Grenache                    110
Carmenère                    83
Corvina                      70
Glera                        70
Albariño                     67
                           ... 
Groppell

## Remove every grape variety with fewer than 10 entries

In [94]:
df.sort_values(by='grapes')

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,...,importer,date published,user avg rating,bottle_size_ml,price_clean,avg_rating_clean,alcohol_clean,year,appellation_clean,grapes
929,https://www.winemag.com/buying-guide/zacharias...,85.0,Zacharias 2012 Agiorgitiko (Nemea),"Plum, spice and red berries on the nose and pa...",Susan Kostrzewa,"$13, Buy Now",,"Agiorgitiko, Greek Red","Nemea, Greece",Zacharias,...,"Stellar Importing Company, LLC",8/1/2015,Not rated yet [Add Your Review],750,13.0,,12.5,2012.0,Greece,Agiorgitiko
15266,https://www.winemag.com/buying-guide/kouros-20...,80.0,Kouros 2000 Agiorgitiko (Corinth),,Joe Czerwinski,"$9, Buy Now",,"Agiorgitiko, Greek Red","Corinth, Greece",Kouros,...,Nestor Imports,9/1/2004,Not rated yet [Add Your Review],750,9.0,,12.0,2000.0,Greece,Agiorgitiko
15846,https://www.winemag.com/buying-guide/ktima-dri...,85.0,Ktima Driopi 2005 Agiorgitiko (Nemea),"This earthy red starts with aromas of leather,...",Susan Kostrzewa,"$25, Buy Now",,"Agiorgitiko, Greek Red","Nemea, Greece",Ktima Driopi,...,Angel's Share Wines,4/1/2010,Not rated yet [Add Your Review],750,25.0,,13.0,2005.0,Greece,Agiorgitiko
15267,https://www.winemag.com/buying-guide/spiropoul...,80.0,Spiropoulos 2000 Red Stag Agiorgitiko (Pelopon...,,Joe Czerwinski,"$15, Buy Now",Red Stag,"Agiorgitiko, Greek Red","Peloponnese, Greece",Spiropoulos,...,Athenee Importers,9/1/2004,Not rated yet [Add Your Review],750,15.0,,12.5,2000.0,Greece,Agiorgitiko
13074,https://www.winemag.com/buying-guide/achaia-cl...,83.0,Achaia Clauss 2000 Agiorgitiko (Corinth),,Joe Czerwinski,"$10, Buy Now",,"Agiorgitiko, Greek Red","Corinth, Greece",Achaia Clauss,...,"Stellar Importing Company, LLC",9/1/2004,Not rated yet [Add Your Review],750,10.0,,12.0,2000.0,Greece,Agiorgitiko
8963,https://www.winemag.com/buying-guide/alexandro...,86.0,Alexandros Megapanos 2005 Megapanos Agiorgitik...,Sweet cedar and cinnamon aromas and flavors of...,Susan Kostrzewa,"$28, Buy Now",Megapanos,"Agiorgitiko, Greek Red","Nemea, Greece",Alexandros Megapanos,...,Wonderful Ethnic Imports,4/1/2010,Not rated yet [Add Your Review],750,28.0,,12.5,2005.0,Greece,Agiorgitiko
10414,https://www.winemag.com/buying-guide/spiropoul...,86.0,Spiropoulos 2006 Red Stag Agiorgitiko (Nemea),"Ripe cherry, cedar, vanilla and spice aromas a...",Susan Kostrzewa,"$15, Buy Now",Red Stag,"Agiorgitiko, Greek Red","Nemea, Greece",Spiropoulos,...,Athenee Importers,4/1/2010,Not rated yet [Add Your Review],750,15.0,,13.0,2006.0,Greece,Agiorgitiko
3946,https://www.winemag.com/buying-guide/tsantali-...,87.0,Tsantali 2003 Réserve Agiorgitiko (Nemea),"Agiorgitiko, famed in Nemea and popular for it...",Susan Kostrzewa,"$16, Buy Now",Réserve,"Agiorgitiko, Greek Red","Nemea, Greece",Tsantali,...,"Fantis Imports, Inc",8/1/2008,Not rated yet [Add Your Review],750,16.0,,13.5,2003.0,Greece,Agiorgitiko
7719,https://www.winemag.com/buying-guide/estate-bi...,88.0,Estate Biblia Chora 2008 Areti Agiorgitiko (Pa...,This expressive red starts with aromas of rasp...,Susan Kostrzewa,"$29, Buy Now",Areti,"Agiorgitiko, Greek Red","Pangeon, Greece",Estate Biblia Chora,...,Cava Spiliadis,9/1/2012,Not rated yet [Add Your Review],750,29.0,,13.5,2008.0,Greece,Agiorgitiko
6988,https://www.winemag.com/buying-guide/estate-co...,83.0,Estate Constantin Gofas 2005 Mythic River Agio...,This rustic red starts with a nose of red berr...,Susan Kostrzewa,"$13, Buy Now",Mythic River,"Agiorgitiko, Greek Red","Nemea, Greece",Estate Constantin Gofas,...,Athena Importing Co,8/1/2009,Not rated yet [Add Your Review],750,13.0,,13.5,2005.0,Greece,Agiorgitiko
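
The cell above only sorts the rows; it doesn't drop anything. One possible way to do what the heading asks, sketched on a made-up frame (the helper name `drop_rare` and its `min_count` argument are assumptions, not part of the notebook):

```python
import pandas as pd

def drop_rare(frame, column, min_count=10):
    """Keep only rows whose value in `column` appears at least `min_count` times."""
    counts = frame[column].value_counts()
    common = counts[counts >= min_count].index
    # .copy() so later column assignments work on an independent frame
    return frame[frame[column].isin(common)].copy()

# Made-up data: 12 Merlots and 2 Airens
toy = pd.DataFrame({'grapes': ['Merlot'] * 12 + ['Airen'] * 2})
kept = drop_rare(toy, 'grapes', min_count=10)
```

On the wine data the call would be `df = drop_rare(df, 'grapes')`.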


In [46]:
# One 0/1 column per grape; the 'wine_points' prefix is only a label on the column names
df_grape = pd.get_dummies(df.grapes, prefix='wine_points', dtype=int)

In [56]:
df_grape.head()

Unnamed: 0,wine_points_Agiorgitiko,wine_points_Aglianico,wine_points_Airen,wine_points_Albana,wine_points_Albariño,wine_points_Aleatico,wine_points_Alfrocheiro,wine_points_Alicante Bouschet,wine_points_Aligoté,wine_points_Alsace white blend,...,wine_points_Weissburgunder,wine_points_Welschriesling,wine_points_White Riesling,wine_points_Xarel-lo,wine_points_Xinomavro,wine_points_Zibibbo,wine_points_Zierfandler,wine_points_Zierfandler-Rotgipfler,wine_points_Zinfandel,wine_points_Zweigelt
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [59]:
df.wine_points.shape

(16880,)

In [62]:
df.wine_points.isnull().sum()

0

In [68]:
X = df_grape  
y = df.wine_points

In [67]:
X.shape

(16880, 387)

In [69]:
import statsmodels.api as sm
# No constant is added, so each dummy's coefficient is that grape's mean score
mod = sm.OLS(y, X)
res = mod.fit()

In [75]:
res.summary2().tables[1].sort_values("Coef.", ascending=False)

Unnamed: 0,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
wine_points_Muscat Ottonel,100.000000,3.533221,28.302790,4.043059e-172,93.074507,106.925493
wine_points_Rosenmuskateller,98.000000,3.533221,27.736734,1.549790e-165,91.074507,104.925493
wine_points_Scheurebe,97.250000,1.766610,55.048926,0.000000e+00,93.787253,100.712747
wine_points_Grenache-Mourvèdre,96.000000,3.533221,27.170678,4.496963e-159,89.074507,102.925493
wine_points_Welschriesling,95.000000,1.117302,85.026216,0.000000e+00,92.809967,97.190033
wine_points_Tokaji,94.500000,1.249182,75.649498,0.000000e+00,92.051468,96.948532
wine_points_Bual,94.500000,2.498364,37.824749,2.570289e-300,89.602937,99.397063
wine_points_Nosiola,94.000000,3.533221,26.604623,9.862334e-153,87.074507,100.925493
wine_points_Syrah-Petit Verdot,94.000000,3.533221,26.604623,9.862334e-153,87.074507,100.925493
wine_points_Riesling-Chardonnay,94.000000,3.533221,26.604623,9.862334e-153,87.074507,100.925493
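
With one dummy column per grape and no intercept, the least-squares coefficient for each grape is simply that grape's mean score. A coefficient of exactly 100 with a large standard error therefore most likely belongs to a grape with a single review. The sketch below checks the equivalence on made-up numbers, using NumPy's least-squares solver in place of statsmodels. A `groupby` with counts gives the same coefficients and shows how many wines sit behind each one.

```python
import numpy as np
import pandas as pd

# Made-up scores for three grapes
toy = pd.DataFrame({
    'grapes': ['Merlot', 'Merlot', 'Syrah', 'Syrah', 'Syrah', 'Airen'],
    'wine_points': [88.0, 90.0, 91.0, 92.0, 93.0, 100.0],
})

# One 0/1 column per grape, no intercept column
X_toy = pd.get_dummies(toy['grapes'], dtype=float)
coefs, *_ = np.linalg.lstsq(X_toy.to_numpy(), toy['wine_points'].to_numpy(), rcond=None)
ols_coefs = pd.Series(coefs, index=X_toy.columns)

# The same numbers from a plain groupby, plus the number of reviews per grape
summary = toy.groupby('grapes')['wine_points'].agg(['mean', 'count'])
```

Sorting `summary` by `mean` while keeping an eye on `count` makes it easy to spot top-ranked grapes that rest on one or two bottles.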


## Second Analysis: Spanish Wine Detector 

In [110]:
df.head()

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,...,importer,date published,user avg rating,bottle_size_ml,price_clean,avg_rating_clean,alcohol_clean,year,appellation_clean,grapes
0,https://www.winemag.com/buying-guide/artadi-20...,90.0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black pl...",Michael Schachner,"$25, Buy Now",Viñas de Gain,Tempranillo,"Rioja, Northern Spain, Spain",Artadi,...,Folio Fine Wine Partners,12/1/2014,Not rated yet [Add Your Review],750,25.0,,14.5,2011.0,Spain,Tempranillo
1,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2012 Stoller Vineyard Chardonnay (Du...,"A tiny production wine, this is rich, tart and...",Paul Gregutt,"$65, Buy Now",Stoller Vineyard,Chardonnay,"Dundee Hills, Willamette Valley, Oregon, US",Adelsheim,...,,12/1/2014,Not rated yet [Add Your Review],750,65.0,,13.5,2012.0,US,Chardonnay
2,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2013 Ribbon Springs Vineyard Auxerro...,This is another fine vintage for this rare win...,Paul Gregutt,"$25, Buy Now",Ribbon Springs Vineyard,"Auxerrois, Other White","Ribbon Ridge, Willamette Valley, Oregon, US",Adelsheim,...,,12/1/2014,Not rated yet [Add Your Review],750,25.0,,13.5,2013.0,US,Auxerrois
3,https://www.winemag.com/buying-guide/jcb-2011-...,90.0,JCB 2011 No. 11 Pinot Noir (Sonoma Coast),Light in color and lilting floral aromas of ro...,Virginie Boone,"$65, Buy Now",No. 11,Pinot Noir,"Sonoma Coast, Sonoma, California, US",JCB,...,,12/1/2014,Not rated yet [Add Your Review],750,65.0,,13.0,2011.0,US,Pinot Noir
4,https://www.winemag.com/buying-guide/pazo-pond...,90.0,Pazo Pondal 2013 Albariño (Rías Baixas),"Alluring, inviting aromas of white flowers, me...",Michael Schachner,"$17, Buy Now",,Albariño,"Rías Baixas, Galicia, Spain",Pazo Pondal,...,Vinaio Imports,12/1/2014,Not rated yet [Add Your Review],750,17.0,,13.0,2013.0,Spain,Albariño


In [112]:
df['is_spanish'] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [113]:
df.loc[df.appellation_clean == 'Spain', 'is_spanish'] = 1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [114]:
df.head()

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,...,date published,user avg rating,bottle_size_ml,price_clean,avg_rating_clean,alcohol_clean,year,appellation_clean,grapes,is_spanish
0,https://www.winemag.com/buying-guide/artadi-20...,90.0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black pl...",Michael Schachner,"$25, Buy Now",Viñas de Gain,Tempranillo,"Rioja, Northern Spain, Spain",Artadi,...,12/1/2014,Not rated yet [Add Your Review],750,25.0,,14.5,2011.0,Spain,Tempranillo,1
1,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2012 Stoller Vineyard Chardonnay (Du...,"A tiny production wine, this is rich, tart and...",Paul Gregutt,"$65, Buy Now",Stoller Vineyard,Chardonnay,"Dundee Hills, Willamette Valley, Oregon, US",Adelsheim,...,12/1/2014,Not rated yet [Add Your Review],750,65.0,,13.5,2012.0,US,Chardonnay,0
2,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2013 Ribbon Springs Vineyard Auxerro...,This is another fine vintage for this rare win...,Paul Gregutt,"$25, Buy Now",Ribbon Springs Vineyard,"Auxerrois, Other White","Ribbon Ridge, Willamette Valley, Oregon, US",Adelsheim,...,12/1/2014,Not rated yet [Add Your Review],750,25.0,,13.5,2013.0,US,Auxerrois,0
3,https://www.winemag.com/buying-guide/jcb-2011-...,90.0,JCB 2011 No. 11 Pinot Noir (Sonoma Coast),Light in color and lilting floral aromas of ro...,Virginie Boone,"$65, Buy Now",No. 11,Pinot Noir,"Sonoma Coast, Sonoma, California, US",JCB,...,12/1/2014,Not rated yet [Add Your Review],750,65.0,,13.0,2011.0,US,Pinot Noir,0
4,https://www.winemag.com/buying-guide/pazo-pond...,90.0,Pazo Pondal 2013 Albariño (Rías Baixas),"Alluring, inviting aromas of white flowers, me...",Michael Schachner,"$17, Buy Now",,Albariño,"Rías Baixas, Galicia, Spain",Pazo Pondal,...,12/1/2014,Not rated yet [Add Your Review],750,17.0,,13.0,2013.0,Spain,Albariño,1
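
The two `SettingWithCopyWarning` messages above appear because `df` was produced by filtering another frame. A sketch of one way to sidestep them: take an explicit copy, then build the flag in a single comparison. The frame here is made up; only the `appellation_clean` and `is_spanish` column names come from the notebook.

```python
import pandas as pd

# Made-up stand-in for the filtered wine frame
wines = pd.DataFrame({'appellation_clean': ['Spain', 'US', 'US', 'Spain', 'Greece']})
filtered = wines[wines['appellation_clean'] != 'Greece'].copy()  # explicit copy of the slice

# True/False comparison converted to 1/0 in one step
filtered['is_spanish'] = (filtered['appellation_clean'] == 'Spain').astype(int)
```

Building the column from the comparison also removes the need to set everything to 0 first and then overwrite the Spanish rows.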


In [121]:
df = df[df.wine_desc.notnull()]

In [101]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer(stop_words='english', max_features=200)
X = count_vectorizer.fit_transform(df.wine_desc)
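# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(count_vectorizer.get_feature_names_out().tolist())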

['acidity', 'acids', 'age', 'aging', 'alcohol', 'apple', 'apricot', 'aromas', 'baked', 'balance', 'balanced', 'barrel', 'berry', 'best', 'big', 'bit', 'black', 'blackberries', 'blackberry', 'blend', 'bodied', 'body', 'bottle', 'bottling', 'bright', 'cabernet', 'caramel', 'cassis', 'cedar', 'character', 'chardonnay', 'cherries', 'cherry', 'chocolate', 'cinnamon', 'citrus', 'clean', 'coffee', 'cola', 'color', 'come', 'comes', 'complex', 'complexity', 'concentrated', 'concentration', 'core', 'creamy', 'crisp', 'currant', 'dark', 'deep', 'delicious', 'delivers', 'dense', 'depth', 'dried', 'drink', 'dry', 'earth', 'earthy', 'easy', 'edge', 'elegant', 'feel', 'feels', 'fine', 'finish', 'finishes', 'firm', 'flavor', 'flavors', 'floral', 'fresh', 'freshness', 'fruit', 'fruits', 'fruity', 'glass', 'good', 'grapefruit', 'great', 'green', 'hard', 'heavy', 'herb', 'herbal', 'herbs', 'high', 'hint', 'hints', 'honey', 'intense', 'juicy', 'just', 'lead', 'leather', 'lemon', 'licorice', 'light', 'like

In [102]:
words_df = pd.DataFrame(X.toarray(), columns=count_vectorizer.get_feature_names_out())

In [103]:
words_df.head()

Unnamed: 0,acidity,acids,age,aging,alcohol,apple,apricot,aromas,baked,balance,...,weight,white,wild,wine,winery,wines,wood,years,yellow,young
0,0,0,0,0,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,0


In [106]:
words_df.sum().sort_values(ascending=False)

wine          10311
flavors        8760
fruit          6628
finish         4442
palate         4379
aromas         4205
acidity        4020
cherry         3703
tannins        3462
ripe           3258
black          3138
drink          2944
oak            2508
rich           2482
dry            2397
sweet          2302
red            2269
nose           2244
notes          2192
spice          2174
fresh          1985
berry          1854
soft           1632
blackberry     1627
plum           1593
dark           1539
shows          1522
good           1504
fruits         1500
white          1499
              ...  
toasty          447
smoke           442
feels           441
round           440
plenty          433
power           430
nice            429
mocha           428
right           423
sweetness       421
yellow          420
strong          417
refreshing      416
barrel          413
comes           408
way             406
baked           403
heavy           398
make            398


In [116]:
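porter_stemmer = PorterStemmer()

def stemming_tokenizer(str_input):
    # Replace everything except letters, digits and hyphens with spaces, lowercase, split on whitespace
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    # Reduce each word to its Porter stem ("coffee" -> "coffe")
    words = [porter_stemmer.stem(word) for word in words]
    return words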

In [117]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=stemming_tokenizer, use_idf=False, norm='l1', max_features=200)
X = tfidf_vectorizer.fit_transform(df.wine_desc)
df_spain = pd.DataFrame(X.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

  'stop_words.' % sorted(inconsistent))


Unnamed: 0,accent,acid,add,age,alcohol,appl,apricot,aroma,aromat,bake,...,way,white,wild,wine,wineri,wood,year,yellow,young,zesti
0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.058824,0.000000,0.058824,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
1,0.000000,0.000000,0.000000,0.000000,0.000000,0.050000,0.000000,0.000000,0.000000,0.050000,...,0.000000,0.000000,0.000000,0.050000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
2,0.000000,0.000000,0.076923,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.076923,0.000000,0.076923,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.111111,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.050000,0.000000,0.000000,...,0.000000,0.050000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
5,0.000000,0.050000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.050000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
6,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.052632,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
7,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.083333,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
8,0.000000,0.090909,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
9,0.000000,0.058824,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.117647,0.0,0.000000,0.000000,0.058824,0.000000,0.000000


In [118]:
df_spain = pd.DataFrame(X.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

In [119]:
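# Caution: pandas aligns this assignment on index labels, not row position.
# df's index has gaps after the filtering above, so rows whose label is missing get NaN.
# Assigning df['is_spanish'].values would copy the labels by position instead.
df_spain['is_spanish'] = df['is_spanish']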

In [120]:
df_spain

Unnamed: 0,accent,acid,add,age,alcohol,appl,apricot,aroma,aromat,bake,...,white,wild,wine,wineri,wood,year,yellow,young,zesti,is_spanish
0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.058824,0.000000,0.058824,...,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,1.0
1,0.000000,0.000000,0.000000,0.000000,0.000000,0.050000,0.000000,0.000000,0.000000,0.050000,...,0.000000,0.000000,0.050000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
2,0.000000,0.000000,0.076923,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.076923,0.000000,0.076923,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.111111,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.050000,0.000000,0.000000,...,0.050000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,1.0
5,0.000000,0.050000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.050000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,
6,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.052632,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
7,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.083333,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
8,0.000000,0.090909,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,
9,0.000000,0.058824,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.117647,0.0,0.000000,0.000000,0.058824,0.000000,0.000000,0.0
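
The empty `is_spanish` cells in rows 5 and 8 above point to an index mismatch: `df_spain` is numbered 0, 1, 2, ... while `df` kept its original row labels after filtering, and pandas lines the two up by label. Rows whose label was filtered out of `df` get `NaN`, and later rows can pick up the flag of a different wine. A minimal illustration on made-up frames, with positional assignment via `.to_numpy()` as one possible remedy:

```python
import pandas as pd

# Made-up frame whose index has a gap after filtering (label 1 is gone)
source = pd.DataFrame({'is_spanish': [1, 0, 0, 1]}).drop(index=1)

# A freshly built feature frame is numbered 0..n-1
features = pd.DataFrame({'aroma': [0.1, 0.2, 0.3]})

# Label-based alignment: row 1 finds no match and becomes NaN, row 2 gets label 2's value
aligned = features.assign(is_spanish=source['is_spanish'])

# Position-based assignment: the k-th feature row gets the k-th remaining flag
positional = features.assign(is_spanish=source['is_spanish'].to_numpy())
```

Calling `df.reset_index(drop=True)` before vectorizing would be another way to keep the two frames in step.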


In [131]:
df_spain = df_spain.dropna()

In [122]:
# Masking with a boolean frame keeps every row (NaN cells stay NaN); dropna() is what removes rows
df_spain = df_spain[df_spain.notnull()]

In [123]:
df_spain.head()

Unnamed: 0,accent,acid,add,age,alcohol,appl,apricot,aroma,aromat,bake,...,white,wild,wine,wineri,wood,year,yellow,young,zesti,is_spanish
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.058824,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.05,...,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.076923,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,...,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [132]:
X = df_spain.drop('is_spanish', axis=1)
y = df_spain['is_spanish']

In [126]:
y.shape

(16779,)

In [127]:
X.shape

(16779, 200)

## Random Forest 

In [134]:
forest = RandomForestClassifier(max_depth=5, n_estimators=100)
forest.fit(X, y) 
print(forest.feature_importances_)

[0.00000000e+00 8.87716365e-03 1.11718274e-02 9.93348227e-04
 1.35552409e-03 3.44485953e-03 0.00000000e+00 9.94464243e-03
 4.78825726e-03 2.53100762e-03 3.59233439e-03 1.63859173e-02
 2.59993635e-03 1.27524432e-02 2.49359597e-02 1.20839319e-03
 4.74142587e-03 0.00000000e+00 4.38846867e-03 5.53995384e-03
 3.53684774e-03 0.00000000e+00 2.38436309e-03 9.69854540e-04
 2.29526524e-03 3.99175013e-03 1.23363292e-03 1.61674731e-04
 9.08932368e-03 8.96545568e-04 9.50182070e-03 1.00802747e-02
 1.49509636e-02 3.48305079e-03 1.23929713e-02 3.10174773e-03
 8.11305781e-03 3.22692224e-03 6.03112840e-03 1.99048442e-02
 4.01918903e-04 7.84017458e-03 2.02503584e-02 2.07961763e-03
 4.22418192e-04 9.93284346e-03 1.18278365e-03 8.27423264e-03
 2.45685969e-03 4.56576441e-03 0.00000000e+00 1.95556272e-05
 1.05668666e-03 2.50489613e-03 6.17207469e-03 0.00000000e+00
 2.27637279e-03 3.41785747e-03 7.34168843e-03 4.86169696e-04
 4.89477795e-03 3.80300728e-03 8.38813724e-03 6.52840363e-03
 7.73622692e-03 4.294780

In [137]:
feature_names = X.columns
importances = forest.feature_importances_

pd.DataFrame({
    'feature': feature_names,
    'feature importance': importances,
}).sort_values(by='feature importance', ascending=False).head(20)

Unnamed: 0,feature,feature importance
14,best,0.024936
42,come,0.02025
106,lush,0.020085
39,coffe,0.019905
119,noir,0.019154
155,slightli,0.018435
11,barrel,0.016386
143,rich,0.016313
108,meat,0.015808
123,offer,0.015615
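
The forest above is fit on every row, so there is no way to tell how well it separates Spanish wines from the rest on descriptions it hasn't seen. A hedged sketch of a held-out evaluation follows, using synthetic features in place of the real word matrix. `make_classification` and all parameter values here are illustrative choices, not the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X (word features) and y (is_spanish), with a minority positive class
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)

# Hold out a quarter of the rows; stratify keeps the class balance similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0,
)

forest_demo = RandomForestClassifier(max_depth=5, n_estimators=100, random_state=0)
forest_demo.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, forest_demo.predict(X_test), labels=[0, 1])
accuracy = forest_demo.score(X_test, y_test)
```

With an imbalanced target, accuracy alone can look good while the minority class is mostly missed, so the confusion matrix is the more useful readout. Feature importances also say nothing about direction: a high-ranked word may point toward Spain or away from it.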
