# Intro
Prepare the *Wine Spectator* Top 100 review set for training a text classifier model.
Reference: [Dataset Splitting Best Practices in Python](https://www.kdnuggets.com/2020/05/dataset-splitting-best-practices-python.html)
Train and test a Support Vector Machine (SVM) model to predict wine style from the text of a wine review.
References:
* [scikit-learn: Working with Text Data](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)
* [datacamp: Support Vector Machines with Scikit-learn](https://www.datacamp.com/community/tutorials/svm-classification-scikit-learn-python)
* [Python : How to Save and Load ML Models](https://www.kaggle.com/prmohanty/python-how-to-save-and-load-ml-models)

# Load the *Wine Spectator* Top 100 dataset

## File setup

In [3]:
# import and initialize main python libraries
import numpy as np
import pandas as pd

# import libraries for file navigation
import os
import shutil
import glob
from pandas_ods_reader import read_ods
import pickle

# import ML libraries
import sklearn
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier

In [4]:
print(sklearn.__version__)

0.24.1


## Load dataframe and reformat

In [5]:
# Note: save CSV files in UTF-8 format to preserve special characters.
df_Wine = pd.read_csv('./CSV_Wines.csv')

In [6]:
# CSV of wines is retaining a blank row at the end of the dataset. Remove the last row to prevent data type errors.

# number of rows to drop
n = 1

df_Wine.drop(df_Wine.tail(n).index, inplace = True)

In [7]:
df_Wine.shape

(3300, 18)
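An alternative to dropping the last row by position is to drop any row that is entirely empty. The sketch below assumes the trailing row in the CSV is all blank cells; the three-column sample data is hypothetical and only mimics the problem described above.

```python
import io

import pandas as pd

# Hypothetical CSV whose final row contains only empty cells.
csv_text = (
    "Wine_Style,Score,Review\n"
    "Red,95,Rich and ripe.\n"
    "White,92,Crisp and fresh.\n"
    ",,\n"
)

df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo.shape)   # the empty row is read as a row of NaN values

# Drop only rows in which every column is missing.
df_clean = df_demo.dropna(how='all')
print(df_clean.shape)
```

This does not depend on where the blank row sits, so it stays correct if the export is later fixed or gains more than one empty row.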

In [8]:
df_Wine.dtypes

Review_Year           float64
Rank                   object
Vintage                object
Score                 float64
Price                  object
Winemaker              object
Wine                   object
Wine_Style             object
Grape_Blend            object
Blend_List             object
Geography              object
Cases_Made            float64
Cases_Imported        float64
Reviewer               object
Drink_now             float64
Best_Drink_from       float64
Best_Drink_Through    float64
Review                 object
dtype: object

In [9]:
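# limit dataset to the label (Wine_Style) and text (Review) columns;
# .copy() avoids a SettingWithCopyWarning when Wine_Style is reassigned later
df_Wine = df_Wine[['Wine_Style', 'Review']].copy()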
df_Wine.head()

|   | Wine_Style | Review |
|---|------------|--------|
| 0 | Red | Maturing well, this round red is a lovely exam... |
| 1 | Red | Powerful and structured, with minerally richne... |
| 2 | Red | Effusive aromas of black currant, blueberry, v... |
| 3 | Red | This distinctive red throws a lot of wild sage... |
| 4 | Red | A lush, ripe style, with açaí berry, blueberry... |


In [10]:
# convert the Wine_Style class label from string to int for easier analysis.

# set up dictionary
style = {
    'Dessert & Fortified': 0,
    'Red': 1,
    'Rosé | Rosado': 2,
    'Sparkling': 3,
    'White': 4
}

# apply dictionary to Wine_Style column:
df_Wine.Wine_Style = [style[item] for item in df_Wine.Wine_Style]
df_Wine.head()

|   | Wine_Style | Review |
|---|------------|--------|
| 0 | 1 | Maturing well, this round red is a lovely exam... |
| 1 | 1 | Powerful and structured, with minerally richne... |
| 2 | 1 | Effusive aromas of black currant, blueberry, v... |
| 3 | 1 | This distinctive red throws a lot of wild sage... |
| 4 | 1 | A lush, ripe style, with açaí berry, blueberry... |
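The same encoding can be sketched with `Series.map`, which also makes it easy to check for labels missing from the dictionary before casting to integers. The short `labels` Series below is hypothetical sample data; the `style` dictionary is the one defined above.

```python
import pandas as pd

style = {
    'Dessert & Fortified': 0,
    'Red': 1,
    'Rosé | Rosado': 2,
    'Sparkling': 3,
    'White': 4
}

# Hypothetical sample of Wine_Style values.
labels = pd.Series(['Red', 'White', 'Sparkling', 'Rosé | Rosado', 'Red'])

# map() leaves NaN where a label has no dictionary entry.
encoded = labels.map(style)
unmapped = labels[encoded.isna()].unique()
assert len(unmapped) == 0, f'Labels missing from dictionary: {unmapped}'

encoded = encoded.astype(int)
print(encoded.tolist())  # [1, 4, 3, 2, 1]
```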


# Explore dataset

## Randomly shuffle instances
Shuffle the records before splitting so that the order of the file does not bias the split. Without shuffling, a class could end up in the training set but not in the test set, or vice versa.
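A small sketch of the problem, using hypothetical labels that are sorted by class: with `shuffle=False`, `train_test_split` simply cuts the data in order, so each class lands entirely on one side of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels sorted by class: six of class 0, then two of class 1.
y_sorted = np.array([0] * 6 + [1] * 2)
X_dummy = np.arange(len(y_sorted)).reshape(-1, 1)

# Without shuffling, the first 75% become train and the rest become test.
_, _, y_tr_ordered, y_te_ordered = train_test_split(
    X_dummy, y_sorted, train_size=0.75, shuffle=False)
print(np.unique(y_tr_ordered))  # [0]
print(np.unique(y_te_ordered))  # [1]

# With shuffling (the default), instances are drawn in random order instead.
_, _, y_tr_shuffled, y_te_shuffled = train_test_split(
    X_dummy, y_sorted, train_size=0.75, random_state=42)
print(y_tr_shuffled, y_te_shuffled)
```

Shuffling alone does not guarantee that a rare class reaches both sets; that is what stratification addresses.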

In [11]:
X, y = df_Wine.Review, df_Wine.Wine_Style
print(f'Dataset labels: {df_Wine.Wine_Style}')

Dataset labels: 0       1
1       1
2       1
3       1
4       1
5       4
6       1
7       1
8       1
9       3
10      1
11      1
12      1
13      1
14      1
15      1
16      4
17      4
18      1
19      1
20      4
21      1
22      4
23      1
24      1
25      4
26      1
27      1
28      1
29      1
       ..
3270    1
3271    4
3272    1
3273    1
3274    1
3275    1
3276    4
3277    1
3278    1
3279    4
3280    4
3281    1
3282    1
3283    4
3284    4
3285    1
3286    1
3287    1
3288    4
3289    4
3290    4
3291    1
3292    1
3293    4
3294    4
3295    4
3296    1
3297    1
3298    1
3299    3
Name: Wine_Style, Length: 3300, dtype: int64


In [12]:
# Split the dataset and shuffle the instances
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    train_size = .75, 
                                                    random_state = 42)

print(f'Train labels:\n{y_train}')
print(f'Test labels:\n{y_test}')

Train labels:
650     1
227     1
365     1
3140    1
862     4
276     1
2196    1
1613    4
2216    1
1204    1
2415    4
407     1
2526    4
2322    1
2793    4
3115    1
847     4
2645    4
3225    1
844     1
7       1
1192    4
1477    1
1061    4
415     0
1159    1
990     1
1953    1
1018    1
199     1
       ..
747     1
2300    0
21      1
459     4
1184    1
2324    1
955     1
1215    1
2433    1
2853    4
1515    3
2391    1
769     1
1685    1
130     4
2919    1
3171    1
2135    3
1482    1
330     4
1238    1
466     1
2169    1
1638    1
3092    1
1095    1
1130    1
1294    1
860     1
3174    1
Name: Wine_Style, Length: 2475, dtype: int64
Test labels:
52      1
679     1
1253    4
2130    1
203     1
2451    1
2073    1
1488    1
1665    4
485     1
1511    4
511     1
1703    1
734     1
70      1
1812    1
2213    4
2780    1
2857    1
1233    1
2987    4
1411    1
2004    1
1178    4
1525    1
1590    1
3266    1
864     4
80      4
2899    1
       ..
1146    

## Stratify classes
Ensure that each class appears in the same proportion in the training and test sets.

In [13]:
# Split the dataset, shuffle the instances, and stratify on the class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    train_size = .75, 
                                                    random_state = 42,
                                                    stratify = y)

print(f'Train labels:\n{y_train}')
print(f'Test labels:\n{y_test}')

Train labels:
2742    1
1810    1
2685    4
1550    4
1086    1
1711    4
2104    1
1612    3
1072    1
1875    1
971     4
96      1
2863    4
3258    3
2240    4
883     3
2267    4
100     1
2653    4
518     1
2398    4
1608    0
2257    1
415     0
1722    1
2232    1
2052    1
2567    1
3149    4
1534    1
       ..
2404    4
409     1
1172    4
2647    4
2505    1
1437    1
821     1
2842    4
2794    3
2829    4
559     3
2860    1
1473    1
2155    4
2273    1
134     1
2235    4
1264    1
486     1
1252    1
579     4
2393    1
2719    1
2386    1
2942    1
3073    1
2790    1
774     4
726     1
1216    3
Name: Wine_Style, Length: 2475, dtype: int64
Test labels:
1454    1
89      4
3059    0
2448    1
2776    4
3253    1
1223    1
689     4
249     1
977     4
3228    1
993     4
1791    1
414     1
371     1
258     4
1700    1
3026    1
1000    1
3009    1
152     0
2394    3
1056    4
1643    4
1       1
3271    4
1026    4
853     1
3011    4
1841    4
       ..
865     

In [14]:
print(f'Number of train instances by class: {np.bincount(y_train)}')
print(f'Number of test instances by class: {np.bincount(y_test)}')

Number of train instances by class: [  59 1718    7   67  624]
Number of test instances by class: [ 20 573   2  22 208]
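To confirm that stratification worked, the counts printed above can be turned into class shares. This is a minimal sketch that copies the two count arrays from the output rather than recomputing them from the data.

```python
import numpy as np

# Counts copied from the np.bincount output above (classes 0 to 4).
train_counts = np.array([59, 1718, 7, 67, 624])
test_counts = np.array([20, 573, 2, 22, 208])

train_share = train_counts / train_counts.sum()
test_share = test_counts / test_counts.sum()

for label, (tr, te) in enumerate(zip(train_share, test_share)):
    print(f'class {label}: train {tr:.3f}, test {te:.3f}')
```

The shares should agree closely between the two sets. They also show how skewed the dataset is: class 2 (Rosé | Rosado) has only 7 training and 2 test instances.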


## Extract features from the review text

In [15]:
X_train.head()

2742    Smooth, ripe and exotic. This fat, opulent win...
1810    A wine of great intensity, depth and complexit...
2685    Fresh as a sea breeze. Intense and vivid, offe...
1550    Bright and tangy. Generous with its creamy gra...
1086    This is mouthfilling, but stylish and focused,...
Name: Review, dtype: object

In [16]:
X_train.shape

(2475,)

In [17]:
# Convert X_train to unicode text so that CountVectorizer will understand it

X_train_U = X_train.values.astype('U')
X_train_U

array(['Smooth, ripe and exotic. This fat, opulent wine displays spectacular blackberry, black cherry, currant and plum aromas and flavors that all come together in a silky, supple package. Delicious now, but should keep growing through 1998. 531 cases made. ',
       'A wine of great intensity, depth and complexity. Beautifully crafted, ripe, rich and creamy, showing off pretty vanilla and mocha-scented oak before a deep core of currant, blackberry, black cherry and minerally flavors fold in. Firm tannins add structure on the finish. Cabernet Sauvignon, Cabernet Franc, Merlot and Petit Verdot. Best from 2004 through 2014. 7,000 cases made. ',
       "Fresh as a sea breeze. Intense and vivid, offering crisp lemon-lime and light herbal notes with a nervy acidity that gives it lift and elegance. It's fresh, clean and maintains good focus through the long finish. Drink now. 10,000 cases made. ",
       ...,
       'Delivers toasted hazelnut and Brazil nut aromatics, followed by a ripe, we

In [18]:
# Tokenize text
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train_U)
X_train_counts.shape

(2475, 4397)

In [19]:
# Convert occurrences to frequencies
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2475, 4397)
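The two steps above are fitted on the training text only. The sketch below, on a hypothetical three-document corpus, shows how unseen text would then be passed through `transform` (not `fit_transform`) so that it reuses the training vocabulary, and checks that `TfidfVectorizer` gives the same result as the two-step version.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Hypothetical toy corpus standing in for the reviews.
train_docs = ['ripe black cherry and plum',
              'crisp lemon and lime',
              'toasty brioche and lemon']
test_docs = ['ripe plum and unknownword']

# Two-step version, as in the cells above.
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
train_tfidf = tfidf_transformer.fit_transform(
    count_vect.fit_transform(train_docs))

# Unseen text: transform only, so the vocabulary and IDF weights stay fixed.
# Words not seen during fitting ('unknownword') are ignored.
test_tfidf = tfidf_transformer.transform(count_vect.transform(test_docs))

# One-step equivalent.
one_step = TfidfVectorizer().fit(train_docs)
same = np.allclose(one_step.transform(test_docs).toarray(),
                   test_tfidf.toarray())
print(train_tfidf.shape, test_tfidf.shape, same)
```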

## Train the SVM Classifier Model

## Pipeline alternative

In [20]:
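pipe_model = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    # hinge loss with an L2 penalty makes SGDClassifier a linear SVM
    # trained by stochastic gradient descent
    ('clf', SGDClassifier(loss = 'hinge', penalty = 'l2', alpha = 1e-3, 
                          random_state = 42, max_iter = 5, tol = None))
])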

In [21]:
pipe_model.fit(X_train.values.astype('U'), y_train)

Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf',
                 SGDClassifier(alpha=0.001, max_iter=5, random_state=42,
                               tol=None))])
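The class counts printed earlier are heavily skewed toward Red and White. One option that might help the rarer styles is the `class_weight='balanced'` setting of `SGDClassifier`, which weights each class inversely to its frequency. The sketch below uses a hypothetical toy corpus, not the notebook's data, and whether it improves results on the real reviews is untested here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical, deliberately imbalanced toy corpus (1 = Red, 3 = Sparkling, 4 = White).
docs = ['ripe black cherry and firm tannins',
        'dense plum and currant with oak',
        'blackberry and spice with supple tannins',
        'dark cherry and mocha on the finish',
        'crisp lemon and lime with fresh acidity',
        'green apple and citrus with mineral notes',
        'fine bubbles with brioche and toast']
labels = [1, 1, 1, 1, 4, 4, 3]

balanced_model = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    # Same settings as the pipeline above, plus class weighting.
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                          random_state=42, max_iter=5, tol=None,
                          class_weight='balanced'))
])
balanced_model.fit(docs, labels)
print(balanced_model.predict(['toast and brioche with fine bubbles']))
```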

In [22]:
# Predict the response for the test dataset
y_pred = pipe_model.predict(X_test.values.astype('U'))

In [23]:
# Convert y_test from a pandas Series to a NumPy array for comparison to y_pred
y_test_arr = np.asarray(y_test)

In [24]:
# Calculate accuracy
np.mean(y_pred == y_test_arr)

0.9345454545454546
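Accuracy on an imbalanced dataset is easier to judge against a majority-class baseline. A minimal sketch, using the test-set class counts printed earlier, computes the accuracy of always predicting the most common class (Red).

```python
import numpy as np

# Test-set counts per class, copied from the np.bincount output earlier.
test_counts = np.array([20, 573, 2, 22, 208])

# Accuracy of a model that always predicts the largest class.
baseline_accuracy = test_counts.max() / test_counts.sum()
print(f'{baseline_accuracy:.4f}')  # 0.6945
```

The model's accuracy is well above this baseline, but accuracy alone says little about the three small classes.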

## Evaluate the Model

In [25]:
class_descriptions = ['Dessert & Fortified', 'Red', 'Rosé | Rosado', 'Sparkling', 'White']

print(metrics.classification_report(y_test_arr, y_pred, target_names = class_descriptions))

                     precision    recall  f1-score   support

Dessert & Fortified       0.00      0.00      0.00        20
                Red       0.95      1.00      0.97       573
      Rosé | Rosado       0.00      0.00      0.00         2
          Sparkling       1.00      0.05      0.09        22
              White       0.89      0.96      0.92       208

           accuracy                           0.93       825
          macro avg       0.57      0.40      0.40       825
       weighted avg       0.91      0.93      0.91       825



  _warn_prf(average, modifier, msg_start, len(result))
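The `_warn_prf` residue above appears to be the tail of scikit-learn's undefined-metric warning, which is raised when a class is never predicted and its precision is therefore 0/0. A hedged sketch on hypothetical toy labels shows the `zero_division` parameter of `classification_report`, which sets the value reported in that case and suppresses the warning.

```python
from sklearn import metrics

# Hypothetical toy labels: class 0 is present but never predicted.
y_true = [0, 1, 1, 1, 4, 4]
y_hat = [1, 1, 1, 1, 4, 4]

report = metrics.classification_report(
    y_true, y_hat,
    labels=[0, 1, 4],
    target_names=['Dessert & Fortified', 'Red', 'White'],
    zero_division=0)  # report 0.00 instead of warning on 0/0
print(report)
```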


In [26]:
metrics.confusion_matrix(y_test_arr, y_pred)

array([[  0,  10,   0,   0,  10],
       [  0, 571,   0,   0,   2],
       [  0,   1,   0,   0,   1],
       [  0,   9,   0,   1,  12],
       [  0,   9,   0,   0, 199]])
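The figures in the classification report can be recovered from this matrix, where rows are true classes and columns are predicted classes. The sketch below copies the matrix from the output above and derives accuracy, recall, and precision with NumPy.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[  0,  10,   0,   0,  10],
               [  0, 571,   0,   0,   2],
               [  0,   1,   0,   0,   1],
               [  0,   9,   0,   1,  12],
               [  0,   9,   0,   0, 199]])

correct = np.diag(cm)

# Accuracy: correct predictions over all predictions.
accuracy = correct.sum() / cm.sum()

# Recall: correct predictions over the true count of each class (row sums).
recall = correct / cm.sum(axis=1)

# Precision: correct predictions over the predicted count of each class
# (column sums); classes that were never predicted are reported as 0.
predicted_totals = cm.sum(axis=0)
precision = np.divide(correct, predicted_totals,
                      out=np.zeros(len(cm)), where=predicted_totals > 0)

print(f'accuracy: {accuracy:.4f}')
print('recall:   ', np.round(recall, 2))
print('precision:', np.round(precision, 2))
```

The zero columns for classes 0 and 2 show that the model never predicts Dessert & Fortified or Rosé | Rosado, and most Sparkling reviews are assigned to Red or White.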

## Inference on New Data

In [27]:
examples = [
    # https://www.winemag.com/buying-guide/far-niente-2018-cabernet-sauvignon-napa-valley/
    'This memorable Cabernet Sauvignon has small amounts of Petit Verdot, Cabernet Franc, Merlot and Malbec blended in. Aged 17 months in a majority of new French oak, it unfurls flavors of red fruit, dried herb and clove over a core of youthful tannin and spicy oak. Best from 2028–2038.',
    # https://www.winemag.com/buying-guide/wohlmuth-2018-ried-hochsteinriegl-sauvignon-blanc-sudsteiermark/
    'An initial hint of crushed ivy and citrus leaf peeks through on the nose. The palate then shows green-tinged ripeness, as if a juicy Mirabelle were spritzed with lime. All is bedded on a light-footed yet profound palate. It offers a gorgeous combination of smoothness and freshness. Drink by 2040. ANNE KREBIEHL MW',
    # https://www.winemag.com/buying-guide/g-h-mumm-2013-brut-millesime-champagne/
    'A Pinot Noir-dominated Champagne, this is richly textured and showing attractive signs of maturity. The toastiness is balanced by crispness with a tangy lemon flavor that broadens into ripe apples. Drink this seductive wine now. ROGER VOSS'
]

examples = np.asarray(examples)

In [28]:
ex_pred = pipe_model.predict(examples.astype('U'))

In [29]:
ex_pred

array([1, 4, 4])
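The integer predictions are easier to read once mapped back to style names. A minimal sketch, reusing the `style` dictionary from earlier and the prediction array shown above:

```python
import numpy as np

style = {
    'Dessert & Fortified': 0,
    'Red': 1,
    'Rosé | Rosado': 2,
    'Sparkling': 3,
    'White': 4
}

# Invert the dictionary: integer label -> style name.
id_to_style = {idx: name for name, idx in style.items()}

# Predictions copied from the output above.
ex_pred = np.array([1, 4, 4])
print([id_to_style[i] for i in ex_pred])  # ['Red', 'White', 'White']
```

The third example is a Champagne review, so the 'White' prediction is a miss, consistent with the low recall for Sparkling in the classification report.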

## Save the Model

In [30]:
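filename = './model/SVMClassifier.pkl'

# create the output directory if it does not exist yet
os.makedirs(os.path.dirname(filename), exist_ok = True)

with open(filename, 'wb') as file:
    pickle.dump(pipe_model, file)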
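Loading the model back is the mirror image of saving it. The sketch below is self-contained, so it fits a small hypothetical pipeline and writes it to a temporary directory instead of `./model/`; in the notebook, `pipe_model` and `filename` would take their place.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for pipe_model, fitted on toy data (1 = Red, 4 = White).
docs = ['ripe black cherry and tannins', 'crisp lemon and lime',
        'dense plum and oak', 'fresh citrus and green apple']
labels = [1, 4, 1, 4]
toy_model = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', random_state=42, max_iter=5, tol=None))
]).fit(docs, labels)

new_reviews = ['black cherry and oak', 'lemon and green apple']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'SVMClassifier.pkl')

    # Save, as in the cell above.
    with open(path, 'wb') as file:
        pickle.dump(toy_model, file)

    # Load the whole pipeline (vectorizer, TF-IDF weights, and classifier).
    with open(path, 'rb') as file:
        restored_model = pickle.load(file)

# The restored pipeline should reproduce the original predictions.
print((restored_model.predict(new_reviews) == toy_model.predict(new_reviews)).all())
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them (0.24.1 in this notebook).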