# Context

Build a machine learning model to determine the variety of the wine being reviewed based on the review text. 

## Data

We will be using the wine magazine dataset at https://www.kaggle.com/zynicide/wine-reviews which is provided by Kaggle user zackthoutt.

winemag-data-130k-v2.csv contains roughly 130k rows of wine reviews. As loaded below, it has 129,971 rows and 14 columns, two of which are leftover index columns.

## Acknowledgements

The data was scraped from WineEnthusiast during the week of November 22nd, 2017.

## Import Necessary Libraries

In [0]:
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer

## Read Data

In [2]:
from google.colab import files
uploaded = files.upload()

In [3]:
import io
df = pd.read_csv(io.BytesIO(uploaded['winemag-data-130k-v2.csv']))
# Dataset is now stored in a Pandas Dataframe


In [0]:
# df = pd.read_csv('../DATA/wine-reviews/winemag-data-130k-v2.csv')

## Explore Data

In [0]:
df.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


In [0]:
df.shape

(129971, 14)

## Find Top 10 Varieties

Find the 10 most frequent varieties and build a dictionary that maps each one to an integer label (0 = most frequent).

    {
        'Pinot Noir': 0,
        'Chardonnay': 1,
        'Cabernet Sauvignon': 2,
        'Red Blend': 3,
        'Bordeaux-style Red Blend': 4,
        'Riesling': 5,
        'Sauvignon Blanc': 6,
        'Syrah': 7,
        'Rosé': 8,
        'Merlot': 9
    }

In [0]:
df['variety'].value_counts()

Pinot Noir                    13272
Chardonnay                    11753
Cabernet Sauvignon             9472
Red Blend                      8946
Bordeaux-style Red Blend       6915
Riesling                       5189
Sauvignon Blanc                4967
Syrah                          4142
Rosé                           3564
Merlot                         3102
Nebbiolo                       2804
Zinfandel                      2714
Sangiovese                     2707
Malbec                         2652
Portuguese Red                 2466
White Blend                    2360
Sparkling Blend                2153
Tempranillo                    1810
Rhône-style Red Blend          1471
Pinot Gris                     1455
Champagne Blend                1396
Cabernet Franc                 1353
Grüner Veltliner               1345
Portuguese White               1159
Bordeaux-style White Blend     1066
Pinot Grigio                   1052
Gamay                          1025
Gewürztraminer              

In [0]:
counter = Counter(df['variety'].tolist())
top_10_varieties = {i[0]: idx for idx, i in enumerate(counter.most_common(10))}
top_10_varieties

{'Bordeaux-style Red Blend': 4,
 'Cabernet Sauvignon': 2,
 'Chardonnay': 1,
 'Merlot': 9,
 'Pinot Noir': 0,
 'Red Blend': 3,
 'Riesling': 5,
 'Rosé': 8,
 'Sauvignon Blanc': 6,
 'Syrah': 7}
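As a small illustration of the mapping above, the sketch below applies the same `Counter.most_common` idiom to a made-up list of labels. The helper name `top_n_label_map` and the toy data are assumptions for demonstration and are not part of the notebook.

```python
from collections import Counter

# Toy labels standing in for df['variety'].tolist() (made-up data).
toy_varieties = [
    "Pinot Noir", "Chardonnay", "Pinot Noir",
    "Merlot", "Chardonnay", "Pinot Noir",
]

def top_n_label_map(labels, n):
    """Map the n most frequent labels to 0..n-1, most frequent first."""
    counts = Counter(labels)
    return {label: idx for idx, (label, _) in enumerate(counts.most_common(n))}

toy_map = top_n_label_map(toy_varieties, 2)
print(toy_map)  # {'Pinot Noir': 0, 'Chardonnay': 1}
```

Labels outside the top `n` (here `Merlot`) get no entry, which is why the next step filters the dataframe by membership in the dictionary.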

## Create a dataframe with only the top 10 varieties

The dataset has about 130K rows covering 707 wine varieties. Keep only the rows whose variety is among the top 10.

In [0]:
df = df[df['variety'].map(lambda x: x in top_10_varieties)]
df

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
10,10,US,"Soft, supple plum envelopes an oaky structure ...",Mountain Cuvée,87,19.0,California,Napa Valley,Napa,Virginie Boone,@vboone,Kirkland Signature 2011 Mountain Cuvée Caberne...,Cabernet Sauvignon,Kirkland Signature
12,12,US,"Slightly reduced, this wine offers a chalky, t...",,87,34.0,California,Alexander Valley,Sonoma,Virginie Boone,@vboone,Louis M. Martini 2012 Cabernet Sauvignon (Alex...,Cabernet Sauvignon,Louis M. Martini
14,14,US,Building on 150 years and six generations of w...,,87,12.0,California,Central Coast,Central Coast,Matt Kettmann,@mattkettmann,Mirassou 2012 Chardonnay (Central Coast),Chardonnay,Mirassou
15,15,Germany,Zesty orange peels and apple notes abound in t...,Devon,87,24.0,Mosel,,,Anna Lee C. Iijima,,Richard Böcking 2013 Devon Riesling (Mosel),Riesling,Richard Böcking
20,20,US,Ripe aromas of dark berries mingle with ample ...,Vin de Maison,87,23.0,Virginia,Virginia,,Alexander Peartree,,Quiévremont 2012 Vin de Maison Red (Virginia),Red Blend,Quiévremont
21,21,US,"A sleek mix of tart berry, stem and herb, alon...",,87,20.0,Oregon,Oregon,Oregon Other,Paul Gregutt,@paulgwine,Acrobat 2013 Pinot Noir (Oregon),Pinot Noir,Acrobat
23,23,US,This wine from the Geneseo district offers aro...,Signature Selection,87,22.0,California,Paso Robles,Central Coast,Matt Kettmann,@mattkettmann,Bianchi 2011 Signature Selection Merlot (Paso ...,Merlot,Bianchi
25,25,US,Oak and earth intermingle around robust aromas...,King Ridge Vineyard,87,69.0,California,Sonoma Coast,Sonoma,Virginie Boone,@vboone,Castello di Amorosa 2011 King Ridge Vineyard P...,Pinot Noir,Castello di Amorosa


In [0]:
df.shape

(71322, 14)

## Create a list of descriptions

Collect the description of every remaining row in a list.

In [0]:
description_list = df['description'].tolist()
description_list

['Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opulent, with notes of honey-drizzled guava and mango giving way to a slightly astringent, semidry finish.',
 "Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rustic, earthy, herbal characteristics. Nonetheless, if you think of it as a pleasantly unfussy country wine, it's a good companion to a hearty winter stew.",
 'Soft, supple plum envelopes an oaky structure in this Cabernet, supported by 15% Merlot. Coffee and chocolate complete the picture, finishing strong at the end, resulting in a value-priced wine of attractive flavor and immediate accessibility.',
 'Slightly reduced, this wine offers a chalky, tannic backbone to an otherwise juicy explosion of rich black cherry, the whole accented throughout by firm oak and cigar box.',
 'Building on 150 years and six generations of winemaking tradition, the winery trends toward a leaner style, with the

## Create the target variable from `variety`

Create an array that holds the integer label of each row's variety, using the dictionary created earlier.

In [0]:
varietal_list = [top_10_varieties[i] for i in df['variety'].tolist()]
varietal_list = np.array(varietal_list)
varietal_list

array([5, 0, 2, ..., 2, 5, 0])

## Count Vectorizer

Fit a count vectorizer on the list of descriptions to get a matrix of word counts.

In [0]:
count_vect = CountVectorizer()
x_train_counts = count_vect.fit_transform(description_list)
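To make the bag-of-words step concrete, here is a rough plain-Python sketch of what a count vectorizer produces: a sorted vocabulary and one row of term counts per document. It lowercases the text and keeps tokens of two or more word characters, which mirrors `CountVectorizer`'s defaults. The function `toy_count_vectorize` and the two sample sentences are hypothetical, and the real class returns a sparse matrix rather than lists.

```python
import re
from collections import Counter

def toy_count_vectorize(docs):
    """Return (vocabulary, count rows) for a list of strings."""
    # Lowercase and keep tokens with at least two word characters.
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    # Column indices follow the alphabetical order of the terms.
    terms = sorted({t for doc in tokenized for t in doc})
    vocab = {term: i for i, term in enumerate(terms)}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts.get(term, 0) for term in terms])
    return vocab, rows

docs = ["Ripe cherry and plum", "Crisp apple, ripe apple"]
vocab, rows = toy_count_vectorize(docs)
print(vocab)  # {'and': 0, 'apple': 1, 'cherry': 2, 'crisp': 3, 'plum': 4, 'ripe': 5}
print(rows)   # [[1, 0, 1, 0, 1, 1], [0, 2, 0, 1, 0, 1]]
```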

## Tfidf Transformer

Convert the raw word counts to TF-IDF weights with TfidfTransformer.

In [0]:
tfidf_transformer = TfidfTransformer()
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)
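The following sketch shows how counts are reweighted, assuming `TfidfTransformer` is used with its default settings (smoothed IDF and L2 row normalization) as in the cell above. `toy_tfidf` and the small count matrix are illustrative only.

```python
import math

def toy_tfidf(count_rows):
    """Reweight term counts with smoothed IDF, then L2-normalize each row."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents each term appears.
    doc_freq = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = [math.log((1 + n_docs) / (1 + doc_freq[j])) + 1 for j in range(n_terms)]
    weighted_rows = []
    for row in count_rows:
        weighted = [row[j] * idf[j] for j in range(n_terms)]
        norm = math.sqrt(sum(w * w for w in weighted))
        weighted_rows.append([w / norm if norm else 0.0 for w in weighted])
    return idf, weighted_rows

# Two toy documents, three terms; the last term occurs in both documents.
toy_counts = [[1, 0, 1], [0, 2, 1]]
toy_idf, toy_weights = toy_tfidf(toy_counts)
```

A term that appears in every document gets the minimum IDF of 1, so words shared by all reviews contribute less than words that are specific to a few of them.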

## Split the TF-IDF matrix into train and test sets

In [0]:
train_x, test_x, train_y, test_y = train_test_split(x_train_tfidf, varietal_list, test_size=0.3)
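`test_size=0.3` holds out 30% of the rows for evaluation. The plain-Python sketch below illustrates the idea of a shuffled holdout split with a fixed seed so that the split is repeatable. `toy_holdout_split` is a hypothetical helper, the rounding of the test count is an assumption of this sketch, and it does not reproduce scikit-learn's exact shuffling.

```python
import random

def toy_holdout_split(items, labels, test_size=0.3, seed=0):
    """Shuffle row indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    # Same return order as in the cell above: train_x, test_x, train_y, test_y.
    return pick(items, train_idx), pick(items, test_idx), pick(labels, train_idx), pick(labels, test_idx)

toy_items = [f"review {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]
tr_x, te_x, tr_y, te_y = toy_holdout_split(toy_items, toy_labels)
```

Fixing the seed makes the reported accuracy comparable between runs; the cell above does not set one, so its score will vary slightly from run to run.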

## Naive Bayes

Build a Naive Bayes model and compute its accuracy on the test set.

In [0]:
clf = MultinomialNB().fit(train_x, train_y)
clf_score = clf.score(test_x, test_y)
print("Accuracy: %.2f%%" % ((clf_score*100)))

Accuracy: 63.32%
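For intuition about what the classifier does, the sketch below implements a bare-bones multinomial Naive Bayes with additive smoothing (`alpha=1.0`, which is also `MultinomialNB`'s default) on a four-row toy count matrix. The function names and data are made up, and the sketch uses raw counts where the notebook feeds TF-IDF weights.

```python
import math

def train_multinomial_nb(rows, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log term probabilities."""
    classes = sorted(set(labels))
    n_terms = len(rows[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        class_rows = [r for r, y in zip(rows, labels) if y == c]
        log_prior[c] = math.log(len(class_rows) / len(rows))
        # Add alpha to every term total so unseen terms keep a nonzero probability.
        totals = [sum(r[j] for r in class_rows) + alpha for j in range(n_terms)]
        grand_total = sum(totals)
        log_likelihood[c] = [math.log(t / grand_total) for t in totals]
    return log_prior, log_likelihood

def predict_multinomial_nb(model, row):
    """Return the class with the highest log posterior score."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(x * ll for x, ll in zip(row, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Columns: count of "cherry", count of "citrus"; label 0 = red, 1 = white.
toy_rows = [[3, 0], [2, 1], [0, 3], [1, 2]]
toy_labels = [0, 0, 1, 1]
toy_model = train_multinomial_nb(toy_rows, toy_labels)
pred_red = predict_multinomial_nb(toy_model, [2, 0])
pred_white = predict_multinomial_nb(toy_model, [0, 2])
```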


## SVC

Build an SVC model and compute its accuracy on the test set. The cells are left commented out.

In [0]:
# from sklearn.svm import SVC

In [0]:
# clf = SVC(kernel='linear').fit(train_x, train_y)
# clf_score = clf.score(test_x, test_y)
# print("Accuracy: %.2f%%" % ((clf_score*100)))

## Keras with TensorFlow

In [0]:
from keras.models import Sequential
from keras.layers import Dense, Conv1D, Flatten, Embedding
from keras.preprocessing import sequence
from keras.utils import to_categorical

import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split

from google.colab import files
import io

Using TensorFlow backend.


## Read Data

In [0]:
from google.colab import files
uploaded = files.upload()

In [0]:
df = pd.read_csv(io.BytesIO(uploaded['winemag-data-130k-v2.csv']))
# Dataset is now stored in a Pandas Dataframe


In [0]:
# df = pd.read_csv('../DATA/wine-reviews/winemag-data-130k-v2.csv')

## Create a dictionary of the top 10 varieties

In [0]:
counter = Counter(df['variety'].tolist())
top_10_varieties = {i[0]: idx for idx, i in enumerate(counter.most_common(10))}

## Filter the dataset to the top 10 varieties

In [0]:
df = df[df['variety'].map(lambda x: x in top_10_varieties)]

In [0]:
# import nltk
# nltk.download('punkt')

## Data Processing

First, we have to restructure the data so that it can be processed by our neural network. We do this by replacing each word with a uniquely identifying integer. Combined with an embedding layer, this lets us represent words in a manner that is both flexible and semantically sensitive.

In practice, we want to be a little smarter about this preprocessing. It makes sense to keep only frequently used words while skipping the very most frequent ones (e.g., the, this, a), which carry little information about the variety.

In [0]:
from nltk import word_tokenize
from collections import defaultdict

def count_top_x_words(corpus, top_x, skip_top_n):
    count = defaultdict(lambda: 0)
    for c in corpus:
        for w in word_tokenize(c):
            count[w] += 1
    count_tuples = sorted([(w, c) for w, c in count.items()], key=lambda x: x[1], reverse=True)
    return [i[0] for i in count_tuples[skip_top_n: skip_top_n + top_x]]


def replace_top_x_words_with_vectors(corpus, top_x):
    topx_dict = {top_x[i]: i for i in range(len(top_x))}

    return [
        [topx_dict[w] for w in word_tokenize(s) if w in topx_dict]
        for s in corpus
    ], topx_dict


def filter_to_top_x(corpus, n_top, skip_n_top=0):
    top_x = count_top_x_words(corpus, n_top, skip_n_top)
    return replace_top_x_words_with_vectors(corpus, top_x)

In [0]:
description_list = df['description'].tolist()
mapped_list, word_list = filter_to_top_x(description_list, 2500, 10)
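The call above skips the 10 most frequent tokens and keeps the next 2,500. The sketch below reproduces the same idea on a three-sentence toy corpus, using `str.split` in place of `word_tokenize` so that it runs without NLTK. `toy_filter_to_top_x` and the corpus are assumptions for illustration only.

```python
from collections import Counter

def toy_filter_to_top_x(corpus, n_top, skip_n_top=0, tokenize=str.split):
    """Map each text to indices of the kept words; drop all other words."""
    counts = Counter(w for text in corpus for w in tokenize(text))
    ranked = [w for w, _ in counts.most_common()]
    # Skip the most frequent words, then keep the next n_top.
    kept = ranked[skip_n_top: skip_n_top + n_top]
    index = {w: i for i, w in enumerate(kept)}
    mapped = [[index[w] for w in tokenize(text) if w in index] for text in corpus]
    return mapped, index

toy_corpus = ["the cherry plum the", "the the cherry oak", "cherry plum"]
toy_mapped, toy_index = toy_filter_to_top_x(toy_corpus, n_top=2, skip_n_top=1)
print(toy_index)   # {'cherry': 0, 'plum': 1}
print(toy_mapped)  # [[0, 1], [0], [0, 1]]
```

Here the most frequent word (`the`) is skipped and the rare word (`oak`) falls outside the kept range, so both vanish from the encoded sequences.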

## Pad Sequence 

In [0]:
max_review_length = 150
mapped_list = sequence.pad_sequences(mapped_list, maxlen=max_review_length)

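Padding makes every encoded review the same length so the reviews can be stacked into one array. The sketch below imitates the default behaviour of `pad_sequences` (pad with zeros at the front, truncate from the front) in plain Python. `toy_pad_sequences` is a hypothetical stand-in, not the Keras function, and it assumes `maxlen` is at least 1.

```python
def toy_pad_sequences(seqs, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` items of long ones."""
    padded = []
    for seq in seqs:
        tail = list(seq)[-maxlen:]
        padded.append([value] * (maxlen - len(tail)) + tail)
    return padded

toy_padded = toy_pad_sequences([[5, 2], [1, 2, 3, 4, 5]], maxlen=4)
print(toy_padded)  # [[0, 0, 5, 2], [2, 3, 4, 5]]
```

Note that 0 is also the index of a real word in the mapping built earlier, so padding and that word become indistinguishable. Shifting the word indices by one would avoid this, though the notebook does not do so.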

## Split Data to Test and Train

In [0]:
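varietal_list = to_categorical([top_10_varieties[v] for v in df['variety']])  # one-hot targets for categorical_crossentropy
train_x, test_x, train_y, test_y = train_test_split(mapped_list, varietal_list, test_size=0.3)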
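The model below is compiled with `categorical_crossentropy`, and the scoring loop indexes `test_y[i][j]`, so the targets need to be one-hot rows rather than plain integers. The sketch below shows what that encoding looks like. `toy_to_categorical` is a hypothetical plain-Python stand-in for `keras.utils.to_categorical`.

```python
def toy_to_categorical(labels, num_classes=None):
    """Turn integer class labels into one-hot rows."""
    if num_classes is None:
        num_classes = max(labels) + 1
    return [[1.0 if j == y else 0.0 for j in range(num_classes)] for y in labels]

toy_onehot = toy_to_categorical([2, 0, 1])
print(toy_onehot)  # [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
```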

## Build Sequential Model

In [0]:
embedding_vector_length = 64  # assumed value; the original notebook never defines it
num_classes = len(top_10_varieties)  # one output unit per variety

model = Sequential()
model.add(Embedding(2500, embedding_vector_length, input_length=max_review_length))
model.add(Conv1D(50, 5))
model.add(Flatten())
model.add(Dense(100, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_x, train_y, epochs=3, batch_size=64)

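To see how the layers fit together, the sketch below works out the shapes and parameter counts of this architecture by hand. It assumes the Keras defaults for `Conv1D` (`'valid'` padding, stride 1). The embedding size of 64 is an assumed value chosen for illustration, and `toy_cnn_shapes` is a hypothetical helper.

```python
def toy_cnn_shapes(vocab_size, embed_dim, seq_len, filters, kernel_size, hidden, n_classes):
    """Return conv output length, flattened size, and per-layer parameter counts."""
    conv_len = seq_len - kernel_size + 1  # 'valid' padding, stride 1
    flat = conv_len * filters             # size after Flatten
    params = {
        "embedding": vocab_size * embed_dim,
        "conv1d": kernel_size * embed_dim * filters + filters,
        "dense_hidden": flat * hidden + hidden,
        "dense_out": hidden * n_classes + n_classes,
    }
    return conv_len, flat, params

conv_len, flat, params = toy_cnn_shapes(
    vocab_size=2500, embed_dim=64, seq_len=150,
    filters=50, kernel_size=5, hidden=100, n_classes=10,
)
print(conv_len, flat)          # 146 7300
print(params["dense_hidden"])  # 730100
```

Under these assumptions most of the weights sit in the first dense layer, because `Flatten` feeds it all 146 positions of all 50 filters.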

## Find Score

In [0]:
y_score = model.predict(test_x)
y_score = [[1 if i == max(sc) else 0 for i in sc] for sc in y_score]
n_right = 0
for i in range(len(y_score)):
    if all(y_score[i][j] == test_y[i][j] for j in range(len(y_score[i]))):
        n_right += 1

print("Accuracy: %.2f%%" % ((n_right/float(len(test_y)) * 100)))

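The scoring loop above turns each prediction into a one-hot row and compares it with the target element by element. Barring exact ties in the predicted scores, this is the same as comparing the position of the largest score with the position of the 1 in the target. The sketch below shows that shorter form on made-up scores. `toy_argmax_accuracy` is a hypothetical helper.

```python
def toy_argmax_accuracy(score_rows, onehot_rows):
    """Fraction of rows where the top-scoring class matches the one-hot target."""
    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)

    hits = sum(argmax(s) == argmax(t) for s, t in zip(score_rows, onehot_rows))
    return hits / len(onehot_rows)

toy_scores = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
toy_targets = [[0, 1, 0], [0, 1, 0], [0, 0, 1]]
toy_acc = toy_argmax_accuracy(toy_scores, toy_targets)
print("Accuracy: %.2f%%" % (toy_acc * 100))  # Accuracy: 66.67%
```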