# Blind Tasting (Grape Variety Guessing): Red Wine

The goal of this work is to build a model that **predicts a wine's grape variety based on the tasting description** - this is what sommeliers call "**blind tasting**". In other words, we want to build a "sommelier-model". Besides making the prediction, we want the model to display the **main descriptors**, or **key words**, characteristic of the variety. In this notebook we will focus on red wines only.

## Preparing the data

First, we import the necessary **libraries and read the data**. For this task we will use the column *description* (our predictor variable) and the columns *province* and *variety* (our target variables). These variables are explained in more detail below.

We already explored the data in the data exploration notebook (same repo) and know that it has a few full duplicates and a handful of missing values in the columns of interest (*province* and *variety*). At this point we deal with them by simply dropping the affected rows.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier

pd.set_option('display.max_colwidth', None)

In [2]:
df = pd.read_csv('winemag-data-130k-v2.csv')
df.rename(columns={"Unnamed: 0": "id"}, inplace=True)

df.drop_duplicates(subset = df.columns.difference(['id']), inplace=True)
df.dropna(subset=['province', 'variety'], inplace = True)

Now we need to **prepare our predictor variable** - column *description*. 

First, we want to remove from this column any words that appear in the target columns (*province* and *variety*). Such words sometimes appear in the description because the tasters (who wrote these descriptions) didn't taste the wines blind, so they could mention the wine's variety or origin in the description. In our case, we want the descriptions to be as "blind" as possible and to focus only on their sensory parts.

Since words are often surrounded by commas, dots and dashes, we first replace these punctuation marks with spaces. Then we exclude *variety* and *province* words from *description*. Note that the replacement is applied to the whole DataFrame, so the same punctuation is also replaced in *variety* and *province* (for example, "Bordeaux-style Red Blend" becomes "Bordeaux style Red Blend").

In [3]:
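# Raw string avoids invalid-escape warnings. Note: this replaces punctuation
# in every string column of df, not only in 'description'.
df.replace(r'[.,\-]', ' ', regex = True, inplace = True)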

df['description'] = [' '.join(x for x in a.split(' ') if x not in set(b.split(' ')))
                            for a, b in zip(df['description'], df['variety'])]

df['description'] = [' '.join(x for x in a.split(' ') if x not in set(b.split(' ')))
                            for a, b in zip(df['description'], df['province'])]
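The filtering step above can be sketched on a single description. This is a minimal illustration, not the notebook's exact code: `blind_description` is a hypothetical helper, and it uses `split()` (which collapses repeated spaces) rather than `split(' ')`. It also shows a limitation worth knowing: the match is case-sensitive, so a lowercase "pinot" survives.

```python
import re

def blind_description(description, variety, province):
    """Drop tokens that reveal the variety or province (hypothetical helper)."""
    def clean(text):
        # Replace dots, commas and dashes with spaces, as in the cell above.
        return re.sub(r"[.,\-]", " ", text)

    # Words to hide: every token of the variety and of the province.
    banned = set(clean(variety).split()) | set(clean(province).split())
    tokens = clean(description).split()
    return " ".join(t for t in tokens if t not in banned)

example = blind_description(
    "This Oregon Pinot shows tart cherry, light-bodied and fresh.",
    variety="Pinot Noir",
    province="Oregon",
)
print(example)  # This shows tart cherry light bodied and fresh

# The comparison is case-sensitive, so lowercase mentions are kept.
print(blind_description("a pinot wine", "Pinot Noir", "Oregon"))  # a pinot wine
```

Lowercasing both sides before comparing would remove such leftovers, at the cost of also lowercasing the description.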

Now we **prepare our target variable** - a combination of the columns *province* and *variety*. 

Predicting this combination makes the prediction more specific: for example, we'll predict that the wine is an "Oregon Syrah". It's important to specify the wine's origin (not just the variety) because wines of the same variety can taste very different when they come from different regions. 

We call this combination *prov_var* (new column) and display the top 60 potential classes by the number of entries.

In [4]:
df['prov_var'] = df['province'] + '_' + df['variety']
df['prov_var'].value_counts()[:60]

California_Pinot Noir                   6418
California_Cabernet Sauvignon           5328
California_Chardonnay                   4784
Bordeaux_Bordeaux style Red Blend       4336
Oregon_Pinot Noir                       2560
Piedmont_Nebbiolo                       2478
California_Zinfandel                    2458
Burgundy_Chardonnay                     2130
Tuscany_Red Blend                       1988
Tuscany_Sangiovese                      1949
California_Syrah                        1749
California_Sauvignon Blanc              1675
California_Red Blend                    1667
Burgundy_Pinot Noir                     1443
California_Merlot                       1296
Mendoza Province_Malbec                 1265
Washington_Cabernet Sauvignon           1255
Champagne_Champagne Blend               1136
Northern Spain_Tempranillo              1110
Washington_Syrah                        1043
Provence_Rosé                            947
Mosel_Riesling                           921
Beaujolais

Theoretically, we could try to predict every single class present in the data, but let's focus on a limited number of the most populous ones. 

The code below creates **curated lists of red, white and sparkling** wines of different varieties from different regions. These lists were manually selected from the top 60 classes above to ensure we focus on the most popular varieties and regions. 

Later in this notebook we will focus only on red wines, but we'll also prepare the white and sparkling lists for future use.

In [5]:
selected_red = ['California_Pinot Noir',
                'California_Cabernet Sauvignon',
                'Oregon_Pinot Noir',  
                'Washington_Cabernet Sauvignon',
                'Washington_Syrah',                             
                'Piedmont_Nebbiolo',
                'Piedmont_Barbera',
                'Tuscany_Sangiovese',
                'Veneto_Corvina, Rondinella, Molinara',              
                'Bordeaux_Bordeaux-style Red Blend',
                'Burgundy_Pinot Noir',
                'Beaujolais_Gamay',
                'Rhône Valley_Rhône-style Red Blend',           
                'Douro_Portuguese Red',
                'Northern Spain_Tempranillo',                
                'Mendoza Province_Malbec', 
                'South Australia_Shiraz']

selected_white = ['California_Chardonnay', 
                  'Oregon_Chardonnay',
                  'Oregon_Pinot Gris',
                  'Washington_Riesling',
                  'Mosel_Riesling',      
                  'Burgundy_Chardonnay', 
                  'Bordeaux_Bordeaux-style White Blend',
                  'Loire Valley_Sauvignon Blanc',
                  'Alsace_Riesling',
                  'Alsace_Gewürztraminer',
                  'Alsace_Pinot Gris',
                  'Northeastern Italy_Pinot Grigio',
                  'Marlborough_Sauvignon Blanc']

                  
selected_spark = ['Champagne_Champagne Blend',
                  'California_Sparkling Blend',
                  'Veneto_Glera',
                  'Catalonia_Sparkling Blend']

Finally, we prepare **three training datasets** - with red, white and sparkling wines separately.

Why separately? Mainly because when you taste wine "blind", you can usually tell right away if it's red, white or sparkling. It's also easier to explore and interpret the results when you separate these three.

The code below creates the three datasets with only the columns we may need for training: *id*, *description* and *prov_var*.

In [6]:
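def df_select(lst):
    # .copy() returns an independent DataFrame, so adding columns later
    # does not trigger pandas' SettingWithCopyWarning
    return df[df.prov_var.isin(lst)][['id', 'description', 'prov_var']].copy()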

df_red = df_select(selected_red)
df_white = df_select(selected_white)
df_spark = df_select(selected_spark)

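## Red wines: further data exploration and preparation

The code below shows the number of entries in each class. We can see that the **classes are imbalanced**; therefore, we use the ***f1_macro*** score to measure the performance of our models. Note that only 14 of the 17 classes in *selected_red* appear: the three names that contain a comma or a hyphen (the Veneto, Bordeaux and Rhône Valley blends) find no match, most likely because the punctuation replacement above was applied to the *variety* column as well.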

In [7]:
df_red['prov_var'].value_counts()

California_Pinot Noir            6418
California_Cabernet Sauvignon    5328
Oregon_Pinot Noir                2560
Piedmont_Nebbiolo                2478
Tuscany_Sangiovese               1949
Burgundy_Pinot Noir              1443
Mendoza Province_Malbec          1265
Washington_Cabernet Sauvignon    1255
Northern Spain_Tempranillo       1110
Washington_Syrah                 1043
Beaujolais_Gamay                  892
Douro_Portuguese Red              819
Piedmont_Barbera                  428
South Australia_Shiraz            416
Name: prov_var, dtype: int64
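To see why macro-averaged F1 suits imbalanced classes better than accuracy, here is a small pure-Python sketch on toy labels (the data and the helper names `f1_per_class` and `macro_f1` are made up for illustration; scikit-learn's `f1_macro` scorer is what the notebook actually uses). A model that always predicts the large class gets high accuracy but a poor macro F1, because every class counts equally in the average.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 score of a single class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true))
    return sum(f1_per_class(y_true, y_pred, l) for l in labels) / len(labels)

# Toy data: 9 wines of a large class and 1 of a small class.
y_true = ["big"] * 9 + ["small"]
# A model that always predicts the large class.
y_pred = ["big"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(accuracy)         # 0.9
print(round(score, 3))  # 0.474
```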

We will use *TfidfVectorizer* to learn vocabulary and idf from the predictor column (*description*). Note that this piece of code is just for exploration; it's not used in the actual training. 

We experiment with different *max_df* and *min_df* thresholds and **explore the vocabulary and stop words**. Using *stop_words = 'english'* and limiting *max_df* to *0.16* keeps the most frequent but uninformative descriptors out of the results (see the stop words list below, which contains the terms dropped because they appear in more than 16% of the descriptions). We'll use this observation when we train the model.

In [8]:
vectorizer = TfidfVectorizer(stop_words = 'english', ngram_range = (1, 2), max_df = 0.16, min_df = 1)
vectorizer.fit(df_red['description'])

#print('stop words count:', len(vectorizer.stop_words_))
print('stop words:', vectorizer.stop_words_)
#print('vocabulary size:', len(vectorizer.vocabulary_))
#vectorizer.vocabulary_

stop words: {'ripe', 'cherry', 'flavors', 'wine', 'palate', 'drink', 'aromas', 'black', 'oak', 'berry', 'finish', 'red', 'fruit', 'tannins', 'acidity'}
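The effect of *max_df* can be sketched without scikit-learn: a term is dropped when the fraction of documents containing it is strictly higher than the threshold. The function `too_frequent_terms` and the toy descriptions are hypothetical, and the whitespace tokenizer is a simplification of what *TfidfVectorizer* does (no n-grams, no English stop word list).

```python
def too_frequent_terms(documents, max_df):
    """Return terms whose document frequency (as a fraction) exceeds max_df."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # set() makes sure each term is counted once per document
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t for t, count in doc_freq.items() if count / n_docs > max_df}

docs = [
    "ripe cherry and cola",
    "ripe plum and leather",
    "ripe blackberry with cedar",
    "tart cranberry with spice",
]

# 'ripe' appears in 3 of 4 documents (0.75), so it is cut at max_df = 0.5;
# 'and' and 'with' sit exactly at 0.5 and are kept.
print(sorted(too_frequent_terms(docs, max_df=0.5)))  # ['ripe']
```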


Finally, for convenience, we assign a **numerical code to each class** and create the *X* and *y* variables. We are ready for training!

In [9]:
df_red['prov_var_code'] = pd.Categorical(df_red.prov_var).codes
class_map = df_red.drop_duplicates(['prov_var', 'prov_var_code'])[['prov_var', 'prov_var_code']]
class_map = class_map.values.tolist()
print(class_map)

X = df_red['description']
y = df_red['prov_var_code']
#print(X.shape)
#print(y.shape)

[['Douro_Portuguese Red', 4], ['Oregon_Pinot Noir', 7], ['California_Cabernet Sauvignon', 2], ['Mendoza Province_Malbec', 5], ['California_Pinot Noir', 3], ['Beaujolais_Gamay', 0], ['Tuscany_Sangiovese', 11], ['Piedmont_Nebbiolo', 9], ['Piedmont_Barbera', 8], ['South Australia_Shiraz', 10], ['Burgundy_Pinot Noir', 1], ['Northern Spain_Tempranillo', 6], ['Washington_Syrah', 13], ['Washington_Cabernet Sauvignon', 12]]


## MultinomialNB: training and results

First, we train a **multinomial Naive Bayes classifier** as a baseline for our model. Using *GridSearchCV*, we try different parameter values for both *TfidfVectorizer* and *MultinomialNB* and display the best parameters and score.

In [10]:
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

parameters = {
    'tfidf__stop_words': ['english'],
    'tfidf__ngram_range': [(1, 2), (2, 2)],
    'tfidf__max_df': [0.1, 0.13, 0.16],
    'tfidf__min_df': [0.0, 0.001, 1],
    'clf__alpha': np.arange(0.03, 0.05, 0.01)
}

gs_clf = GridSearchCV(pipe, parameters, scoring = "f1_macro", cv = 3, verbose = 1)
gs_clf.fit(X, y)

print('\nBEST PARAMETERS AND BEST SCORE:\n', gs_clf.best_params_, gs_clf.best_score_)

Fitting 3 folds for each of 54 candidates, totalling 162 fits

BEST PARAMETERS AND BEST SCORE:
 {'clf__alpha': 0.04, 'tfidf__max_df': 0.16, 'tfidf__min_df': 0.001, 'tfidf__ngram_range': (1, 2), 'tfidf__stop_words': 'english'} 0.6882762678527033
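The "54 candidates, 162 fits" line can be checked by hand: the number of candidates is the product of the number of options per parameter, and each candidate is fitted once per fold. The sketch below assumes that `clf__alpha` contributes three values, which is what the reported total implies.

```python
from math import prod

# Number of options per parameter in the grid above.
# Assumption: 'clf__alpha' yields 3 values, as implied by the 54 candidates.
options = {
    "tfidf__stop_words": 1,
    "tfidf__ngram_range": 2,
    "tfidf__max_df": 3,
    "tfidf__min_df": 3,
    "clf__alpha": 3,
}
cv_folds = 3

candidates = prod(options.values())
total_fits = candidates * cv_folds
print(candidates, total_fits)  # 54 162
```

Because the cost grows multiplicatively, adding one more value to any parameter here adds at least 18 candidates (54 more fits).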


Now we **retrain the model with the best parameters** from the grid search above. We will use this model to display the main descriptors for each class.

In [11]:
vectorizer = TfidfVectorizer(stop_words = 'english', ngram_range = (1, 2), max_df = 0.16, min_df = 0.001)
X_vect = vectorizer.fit_transform(X)
clf = MultinomialNB(alpha = 0.04).fit(X_vect, y)

The code below prints the **top 10 descriptors for each class**. Top descriptors are selected using *feature_log_prob_* (the empirical log probability of a descriptor given a class). 

This is the main result of our work - let's review it carefully! Not all of the descriptors are useful and make sense, but if we filter them out, we get quite satisfying results. For example, when you have Oregon Pinot Noir in your hand, you most likely experience *cherry fruit, strawberry, chocolate* and the wine might be *light and tart*.

In [12]:
feature_names = vectorizer.get_feature_names_out()

for cl in class_map:
    print(cl[0])
    
    feature_map = list(zip(feature_names, clf.feature_log_prob_[cl[1]]))
    
    for f in sorted(feature_map, key = lambda t: t[1], reverse = True)[:10]:
        print(f[0], end = ', ')
    print('\n')

Douro_Portuguese Red
fruits, wood, aging, rich, structure, dense, black fruits, ready, firm, dark, 

Oregon_Pinot Noir
cherry fruit, light, tart, vineyard, strawberry, new, raspberry, chocolate, bit, pretty, 

California_Cabernet Sauvignon
blackberry, currant, dry, soft, blackberries, cab, cedar, chocolate, rich, cassis, 

Mendoza Province_Malbec
plum, blackberry, herbal, feels, notes, nose, berry aromas, jammy, oaky, berry flavors, 

California_Pinot Noir
cola, raspberry, silky, cherries, dry, texture, light, nose, vineyard, cranberry, 

Beaujolais_Gamay
fruits, fruity, structure, firm, rich, structured, ready, ready drink, juicy, character, 

Tuscany_Sangiovese
alongside, offers, palate offers, black cherry, spice, leather, underbrush, opens, delivers, espresso, 

Piedmont_Nebbiolo
spice, licorice, offers, alongside, barolo, firm, leather, palate offers, opens, rose, 

Piedmont_Barbera
spice, bright, asti, alongside, fresh, offers, black cherry, blackberry, skinned, alba, 

South Aus
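As a rough illustration of what *feature_log_prob_* holds, the sketch below computes the smoothed log probability of each term given one class, log((weight + alpha) / (total weight + alpha * number of terms)). The helper `feature_log_prob` and the term weights are made up; only the smoothing formula mirrors multinomial Naive Bayes.

```python
import math

def feature_log_prob(weights, alpha):
    """Smoothed log P(term | class) from per-class term weights (toy version)."""
    total = sum(weights.values()) + alpha * len(weights)
    return {term: math.log((w + alpha) / total) for term, w in weights.items()}

# Hypothetical summed tf-idf weights of three terms within one class.
class_weights = {"offers": 12.0, "cola": 6.0, "tar": 0.0}
log_probs = feature_log_prob(class_weights, alpha=0.04)

# Rank terms from most to least probable within the class.
ranked = sorted(log_probs, key=log_probs.get, reverse=True)
print(ranked)  # ['offers', 'cola', 'tar']
```

This ranking rewards terms that are frequent within a class, whether or not they are rare elsewhere, which may be why generic words such as *offers* or *alongside* show up in several lists above.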

Here is how we can **use the model for prediction**. In other words, here is how we "blind taste" with it.

In [13]:
test_input = ['Bright strawberry and cherry aromas and flavors. Light wine with a tart finish.',
              'Dark fruit flavors with notes of smoked meat, dried hers and earth.']

result = list(clf.predict(vectorizer.transform(test_input)))
print(list(c[0] for c in class_map if c[1] in result))

['California_Pinot Noir', 'Washington_Syrah']
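One caveat about the lookup in the cell above: it iterates over *class_map*, so the labels come out in *class_map* order rather than input order, and a class predicted for several inputs is printed only once. A dictionary lookup keeps one label per input. The sketch below uses a shortened, hypothetical excerpt of *class_map* and a hard-coded list standing in for the output of *clf.predict*.

```python
# Shortened, hypothetical excerpt of class_map: [name, code] pairs.
class_map = [["Oregon_Pinot Noir", 7],
             ["California_Pinot Noir", 3],
             ["Washington_Syrah", 13]]

# Build a code -> name lookup once.
code_to_name = {code: name for name, code in class_map}

# Stand-in for clf.predict(...): one code per input text, in input order.
predicted_codes = [13, 3, 13]

# One label per input, in the same order as the inputs.
predicted_names = [code_to_name[code] for code in predicted_codes]
print(predicted_names)
# ['Washington_Syrah', 'California_Pinot Noir', 'Washington_Syrah']
```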


## SGDClassifier: training and results

We will train a few more models using **SGDClassifier** and compare them with the baseline above. Similarly, we use *GridSearchCV* to go over different parameter values for both *TfidfVectorizer* and *SGDClassifier* and display the best parameters and score.

In [14]:
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', SGDClassifier())])

parameters = {
    'tfidf__stop_words': ['english', None],
    'tfidf__ngram_range': [(1, 2), (2, 2)],
    'tfidf__max_df': [0.1, 0.13, 0.16],
    'tfidf__min_df': [0.001],
    'clf__loss': ['hinge', 'modified_huber'],
    'clf__penalty': ['l2', 'l1', 'elasticnet'],
    'clf__max_iter': [5],
    'clf__tol': [None]
}

gs_clf = GridSearchCV(pipe, parameters, scoring = "f1_macro", cv = 3, verbose = 1)
gs_clf.fit(X, y)

print('\nBEST PARAMETERS AND BEST SCORE:\n', gs_clf.best_params_, gs_clf.best_score_)

Fitting 3 folds for each of 72 candidates, totalling 216 fits

BEST PARAMETERS AND BEST SCORE:
 {'clf__loss': 'modified_huber', 'clf__max_iter': 5, 'clf__penalty': 'l2', 'clf__tol': None, 'tfidf__max_df': 0.1, 'tfidf__min_df': 0.001, 'tfidf__ngram_range': (1, 2), 'tfidf__stop_words': None} 0.7281575261864273


Note that this model gives a somewhat higher score: about 0.73 macro F1, compared with about 0.69 for the Naive Bayes baseline.

As before, we **retrain the model with the best parameters** from the grid search above. We will use this model to display the main descriptors for each class. 

In [15]:
vectorizer = TfidfVectorizer(stop_words = None, ngram_range = (1, 2), max_df = 0.1, min_df = 0.001)
X_vect = vectorizer.fit_transform(X)
clf = SGDClassifier(loss = 'modified_huber', penalty = 'l2', max_iter = 5, tol = None).fit(X_vect, y)

As before, we now print the **top 10 descriptors for each class**. Top descriptors are selected using *coef_* (the weights assigned to the descriptors). 

Here we see more descriptors that are not useful (such as *now through* or *palate is*) than in the previous model. For example, for Oregon Pinot Noir only *cherry fruit* is a meaningful descriptor. For this reason, let's use the previous model as our final model for making predictions and generating descriptors, even though this one gives a slightly higher score.

In [16]:
feature_names = vectorizer.get_feature_names_out()

for cl in class_map:
    print(cl[0])
    
    feature_map = list(zip(feature_names, clf.coef_[cl[1]]))
    
    for f in sorted(feature_map, key = lambda t: t[1], reverse = True)[:10]:
        print(f[0], end = ', ')
    print('\n')

Douro_Portuguese Red
quinta, superior, touriga, river, black currant, second wine, smooth, mineral, black plum, port, 

Oregon_Pinot Noir
cherry fruit, cuvée, ava, now through, pinots, willamette, reserve, streaked, highlights, gracefully, 

California_Cabernet Sauvignon
cab, blackberries, napa, cedar, currant, mountain, blackberry cherry, blackberry and, cabs, black currants, 

Mendoza Province_Malbec
malbecs, tartaric, herbal, sticky, the bouquet, salty, palate is, saturated, generic, weedy, 

California_Pinot Noir
silky, cola, tea, pomegranate, cherries, raspberries, earthy, pink, raspberry and, sagebrush, 

Beaujolais_Gamay
banana, cru, cherry fruits, cru wine, cherry flavors, granite, months, red cherry, cherry fruit, the cru, 

Tuscany_Sangiovese
chianti, brunello, riserva, mediterranean, tuscan, rosso, wild cherry, underbrush, savory herb, palate offers, 

Piedmont_Nebbiolo
barolo, barbaresco, rose, drink after, cru, camphor, tar, ginger, licorice, hazelnut, 

Piedmont_Barbera
a

## Next steps and improvements

To improve the models (especially the second one) we can better clean and normalize the data before training (using NLTK, for example) to avoid cases like "the cru"/"this cru" and "blackberries"/"blackberry". 
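As a rough idea of what such normalization could look like, here is a standard-library sketch. It is a crude stand-in for a real lemmatizer such as the ones in NLTK: the function `normalize`, the determiner list and the suffix rules are all illustrative assumptions and will mishandle many words.

```python
import re

def normalize(text):
    """Lowercase, drop determiners, and crudely singularize tokens."""
    determiners = {"the", "this", "a", "an"}
    # Keep runs of letters only (Unicode-aware, so 'cuvée' stays intact).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    out = []
    for tok in tokens:
        if tok in determiners:
            continue
        if tok.endswith("ies") and len(tok) > 4:
            tok = tok[:-3] + "y"          # blackberries -> blackberry
        elif tok.endswith("s") and not tok.endswith("ss") and len(tok) > 3:
            tok = tok[:-1]                # tannins -> tannin
        out.append(tok)
    return " ".join(out)

print(normalize("the cru"), "|", normalize("this cru"))  # cru | cru
print(normalize("Blackberries and tannins"))             # blackberry and tannin
```

Applied to *description* before vectorizing, a step like this would merge the variants into a single feature.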

Cheers!