# sommelier.ai
#### Practical Machine Learning Workshop

### Agenda
- Data Exploration with pandas
- Modeling with scikit-learn

### Tools and Documentation
- [pandas](https://pandas.pydata.org/pandas-docs/stable/api.html)
- [scikit-learn](http://scikit-learn.org/stable/index.html)
- [matplotlib](https://matplotlib.org/api/api_overview.html)
- [Jupyter Lab](https://jupyterlab.readthedocs.io/en/stable/getting_started/overview.html)


## Data Exploration

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from workshop import boxplot_sorted

sns.set(style="darkgrid")

## Modeling

In [50]:
from sklearn import metrics
from sklearn import tree
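# Explicit imports replace the wildcard imports so it is clear where each name comes from.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder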

from workshop import show_most_informative_features

### Carmen's wine

In [150]:
df[df.points == 100].sort_values("price").head(10)

| Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 58548 | US | Initially a rather subdued Frog; as if it has ... | Bionic Frog | 100 | 80.0 | Washington | Walla Walla Valley (WA) | Columbia Valley | Paul Gregutt | @paulgwine | Cayuse 2008 Bionic Frog Syrah (Walla Walla Val... | Syrah | Cayuse |
| 94349 | US | In 2005 Charles Smith introduced three high-en... | Royal City | 100 | 80.0 | Washington | Columbia Valley (WA) | Columbia Valley | Paul Gregutt | @paulgwine | Charles Smith 2006 Royal City Syrah (Columbia ... | Syrah | Charles Smith |
| 52675 | France | This is a magnificently solid wine, initially ... |  | 100 | 150.0 | Bordeaux | Saint-Julien |  | Roger Voss | @vossroger | Château Léoville Barton 2010 Saint-Julien | Bordeaux-style Red Blend | Château Léoville Barton |
| 38560 | US | Tasted in a flight of great and famous Napa wi... |  | 100 | 200.0 | California | Napa Valley | Napa |  |  | Cardinale 2006 Cabernet Sauvignon (Napa Valley) | Cabernet Sauvignon | Cardinale |
| 102445 | Italy | Thick as molasses and dark as caramelized brow... | Occhio di Pernice | 100 | 210.0 | Tuscany | Vin Santo di Montepulciano |  |  |  | Avignonesi 1995 Occhio di Pernice (Vin Santo ... | Prugnolo Gentile | Avignonesi |
| 44510 | France | This latest incarnation of the famous brand is... | Cristal Vintage Brut | 100 | 250.0 | Champagne | Champagne |  | Roger Voss | @vossroger | Louis Roederer 2008 Cristal Vintage Brut (Cha... | Champagne Blend | Louis Roederer |
| 91058 | France | This is a fabulous wine from the greatest Cham... | Brut | 100 | 259.0 | Champagne | Champagne |  | Roger Voss | @vossroger | Krug 2002 Brut (Champagne) | Champagne Blend | Krug |
| 102474 | Australia | This wine contains some material over 100 year... | Rare | 100 | 350.0 | Victoria | Rutherglen |  | Joe Czerwinski | @JoeCz | Chambers Rosewood Vineyards NV Rare Muscat (Ru... | Muscat | Chambers Rosewood Vineyards |
| 99513 | France | A hugely powerful wine, full of dark, brooding... |  | 100 | 359.0 | Bordeaux | Saint-Julien |  | Roger Voss | @vossroger | Château Léoville Las Cases 2010 Saint-Julien | Bordeaux-style Red Blend | Château Léoville Las Cases |
| 52554 | US | This wine dazzles with perfection. Sourced fro... | La Muse | 100 | 450.0 | California | Sonoma County | Sonoma |  |  | Verité 2007 La Muse Red (Sonoma County) | Bordeaux-style Red Blend | Verité |


In [5]:
df['is_good'] = df.points > 88
df.is_good.value_counts()

False    61869
True     55104
Name: is_good, dtype: int64
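
With the cut-off at 88 points the two classes are close to balanced, so a classifier that always predicts the majority class gives a useful floor for any accuracy reported later. A minimal sketch of that baseline, using only the counts printed above (the variable names are illustrative and not part of the workshop code):

```python
# Class counts copied from the value_counts() output above.
counts = {False: 61869, True: 55104}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = counts[majority_label] / total

print(f"{total} reviews, majority class = {majority_label}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model worth keeping should clear this figure by a comfortable margin.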

In [6]:
train = df.drop(['is_good', 'points', 'price'], axis=1)
train.head()

| Unnamed: 0 | country | description | designation | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | US | This wine's aromas are vibrant and fruit forwa... |  | Washington | Columbia Valley (WA) | Columbia Valley | Sean P. Sullivan | @wawinereport | Baer 2014 Malbec (Columbia Valley (WA)) | Malbec | Baer |
| 1 | US | If you're curious about California Grenache Bl... | Tourmaline | California | Santa Ynez Valley | Central Coast |  |  | Coghlan 2010 Tourmaline Grenache Blanc (Santa ... | Grenache Blanc | Coghlan |
| 2 | France | While the acidity is intense, it is balanced b... |  | Beaujolais | Beaujolais-Villages |  | Roger Voss | @vossroger | Domaine de Roche Guillon 2013 Beaujolais-Vill... | Gamay | Domaine de Roche Guillon |
| 3 | France | Red fruits and a soft tannic profile give a re... |  | Southwest France | Cahors |  | Roger Voss | @vossroger | Domaine de Cause 2011 Malbec (Cahors) | Malbec | Domaine de Cause |
| 4 | Spain | Shows true Priorat depth and minerality while ... | Balcons | Catalonia | Priorat |  | Michael Schachner | @wineschach | Pinord 2004 Balcons Red (Priorat) | Red Blend | Pinord |


In [7]:
train_df, test_df, train_labels, test_labels = train_test_split(
    train,
    df.is_good,
    random_state=3)

In [8]:
train_df.shape

(87729, 11)

In [9]:
test_df.shape

(29244, 11)

In [10]:
train_labels.shape

(87729,)
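
The shapes above are consistent with `train_test_split` holding out 25% of the rows, which is its default when neither `test_size` nor `train_size` is passed. As a rough check, assuming the test count is rounded up and the remainder goes to training, the sizes can be reproduced from the class counts shown earlier (the variable names are illustrative):

```python
import math

# Total rows: the two class counts from the is_good value_counts() output.
n_samples = 61869 + 55104

# Default hold-out fraction used by train_test_split when no size is given.
test_size = 0.25

n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Passing `test_size` explicitly in the split call makes the hold-out fraction visible to readers who do not know the default.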

In [12]:
%%time
 
model = make_pipeline(
            CountVectorizer(),
            MultinomialNB())
 
model.fit(train_df.description, train_labels);

Wall time: 3.57 s
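
The cell above only fits the bag-of-words Naive Bayes model. One way to sanity-check such a pipeline before scoring it on `test_df.description` is to run the same two steps on a handful of made-up sentences. The corpus, labels, and `toy_` names below are hypothetical stand-ins, not workshop data:

```python
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus standing in for train_df.description / train_labels.
toy_train_text = [
    "rich complex elegant long finish",
    "elegant structured complex tannins",
    "thin simple watery short finish",
    "simple dull watery flat",
]
toy_train_labels = [True, True, False, False]

# Hypothetical held-out examples standing in for test_df.description / test_labels.
toy_test_text = ["complex elegant finish", "watery dull simple"]
toy_test_labels = [True, False]

# Same two steps as the workshop model: word counts, then multinomial Naive Bayes.
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(toy_train_text, toy_train_labels)

toy_predicted = toy_model.predict(toy_test_text)
toy_accuracy = metrics.accuracy_score(toy_test_labels, toy_predicted)
print(toy_predicted, toy_accuracy)
```

The same `predict` and `accuracy_score` calls apply unchanged to the fitted workshop model and the real held-out descriptions.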


In [53]:
%%time
categorical_features = ['country', 'winery', 'region_1', 'region_2', 'variety', 'taster_twitter_handle', 'designation']
categorical_transformer = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),
    OneHotEncoder(handle_unknown='ignore'))
 
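# Current scikit-learn releases expect each tuple as (transformer, columns);
# the reversed (columns, transformer) order is only accepted by older releases.
model = make_pipeline(
            make_column_transformer(
                (CountVectorizer(ngram_range=(1,3), min_df=2, stop_words='english'), 'description'),
                (TfidfVectorizer(ngram_range=(1,3), min_df=2, stop_words='english'), 'description'),
                (categorical_transformer, categorical_features)),
            LogisticRegression())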
 
model.fit(train_df, train_labels)
 
predicted = model.predict(test_df)
 
score = metrics.accuracy_score(test_labels, predicted)
print('\nAccuracy: %0.3f' % score)
 
print(metrics.classification_report(test_labels, predicted))




Accuracy: 0.864
              precision    recall  f1-score   support

       False       0.87      0.88      0.87     15450
        True       0.86      0.85      0.85     13794

   micro avg       0.86      0.86      0.86     29244
   macro avg       0.86      0.86      0.86     29244
weighted avg       0.86      0.86      0.86     29244

Wall time: 3min 2s
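
Two details of the column transformer are easy to miss. Text vectorizers need a one-dimensional column of strings, so the text column is selected with a plain string, while the categorical columns are selected with a list. Recent scikit-learn releases also expect each tuple in `(transformer, columns)` order. A minimal sketch on a hypothetical three-row frame (the data and the `toy_` names are made up for illustration):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame; only the column names mirror the workshop data.
toy_df = pd.DataFrame({
    "description": ["ripe cherry fruit", "crisp citrus fruit", "ripe plum"],
    "country": ["US", "France", "US"],
})

toy_transformer = make_column_transformer(
    # A scalar column name passes a 1-D series of strings to the vectorizer.
    (CountVectorizer(), "description"),
    # A list of names passes a 2-D frame, which is what the encoder expects.
    (OneHotEncoder(handle_unknown="ignore"), ["country"]),
)

# One feature per distinct word plus one per distinct country.
features = toy_transformer.fit_transform(toy_df)
print(features.shape)
```

Selecting the text column with `["description"]` instead would hand the vectorizer a two-dimensional frame, which it is not designed to tokenize.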
