## Wine2Vec Exploration

By Zack Thoutt

Here is a short exploration of my new wine review dataset using word2vec. My theory is that the words a sommelier uses to describe a wine (oaky, tannic, acidic, berry, etc.) can be used to predict the type of wine (Pinot Noir, Cabernet Sauvignon, etc.). Let's see if we can extract some interesting relationships from the data and at least partially validate this theory.

https://www.kaggle.com/zynicide/word2vec

In [1]:
import pandas as pd
import numpy as np
import nltk
import re
import multiprocessing
import gensim.models.word2vec as w2v



In [2]:
data = pd.read_csv('winemag-data_first150k.csv')

In [3]:
labels = data['variety']
descriptions = data['description']

In [4]:
# Print the variety and description for a few sample reviews
for i in (0, 56, 93):
    print('{}   :   {}'.format(labels.iloc[i], descriptions.iloc[i]))

Cabernet Sauvignon   :   This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030.
Sauvignon Blanc   :   Delicious while also young and textured, this wine comes from biodynamically grown grapes. It has a strong sense of minerality as well as intense citrus and green fruits. It's tight at the moment and needs to round out, so drink from 2018.
Chardonnay   :   A smoky scent and earthy, crisp-apple flavors make this medium-bodied wine a change of pace from the average butterball Chardonnay. It has welcome acidity and a nicely smooth texture.


In [5]:
varietal_counts = labels.value_counts()
print(varietal_counts[:5])

Chardonnay                  14482
Pinot Noir                  14291
Cabernet Sauvignon          12800
Red Blend                   10062
Bordeaux-style Red Blend     7347
Name: variety, dtype: int64


In [6]:
corpus_raw = ""
for description in descriptions[:10000]:
    corpus_raw += description
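
The loop above concatenates descriptions with no separator, so the last sentence of one review can fuse with the first sentence of the next (the "finish.A deeper" sample printed further down looks like an instance of this). A minimal sketch of one possible alternative, using a hypothetical `sample_descriptions` list in place of the real column:

```python
# Hypothetical stand-ins for two consecutive entries of `descriptions`.
sample_descriptions = [
    "Tart cherry lingers on the finish.",
    "A deeper salmon color with elegantly lacy bubbles.",
]

# Concatenating without a separator fuses the sentence boundary.
fused = ""
for description in sample_descriptions:
    fused += description

# Joining with a space keeps a boundary the sentence tokenizer can see.
spaced = " ".join(sample_descriptions)

print(fused)
print(spaced)
```

Applied to the notebook, this would be `" ".join(descriptions[:10000])`. The sentence and token counts reported below come from the original loop and would likely differ.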

In [7]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [8]:
raw_sentences = tokenizer.tokenize(corpus_raw)

In [9]:
def sentence_to_wordlist(raw):
    clean = re.sub("[^a-zA-Z]"," ", raw) #[^a-zA-Z] means any character that IS NOT a-z OR A-Z
    words = clean.split()
    return words

In [10]:
sentences = []
for raw_sentence in raw_sentences:
    if len(raw_sentence) > 0:
        sentences.append(sentence_to_wordlist(raw_sentence))

In [11]:
print(raw_sentences[234])
print(sentence_to_wordlist(raw_sentences[234]))

Tart cherry lingers on the finish.A deeper salmon color with elegantly lacy bubbles and a slight cloudy appearance, this sparkler by Norm Yost offers dessicated watermelon, dried orange blossoms, yeast, citrus rinds and fresher strawberry notes on the nose.
['Tart', 'cherry', 'lingers', 'on', 'the', 'finish', 'A', 'deeper', 'salmon', 'color', 'with', 'elegantly', 'lacy', 'bubbles', 'and', 'a', 'slight', 'cloudy', 'appearance', 'this', 'sparkler', 'by', 'Norm', 'Yost', 'offers', 'dessicated', 'watermelon', 'dried', 'orange', 'blossoms', 'yeast', 'citrus', 'rinds', 'and', 'fresher', 'strawberry', 'notes', 'on', 'the', 'nose']
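
Because `[^a-zA-Z]` treats every non-ASCII character as a separator, accented names get cut: the token 'Albari' in the Chardonnay results at the end of this notebook is presumably a truncated "Albariño". A minimal sketch of a Unicode-aware variant follows; the function name `sentence_to_wordlist_unicode` and the sample sentence are hypothetical, not part of the original notebook.

```python
import re

def sentence_to_wordlist_unicode(raw):
    # [^\W\d_]+ matches runs of word characters that are neither
    # digits nor underscores, which keeps accented letters such as "ñ".
    return re.findall(r"[^\W\d_]+", raw)

print(sentence_to_wordlist_unicode("A crisp Albariño, aged 12 months."))
```

Like the original function, this still splits on apostrophes and hyphens; only the handling of non-ASCII letters changes.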


In [12]:
token_count = sum(len(sentence) for sentence in sentences)
print('The wine corpus contains {0:,} tokens'.format(token_count))

The wine corpus contains 408,741 tokens


In [13]:
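num_features = 300                         # dimensionality of the word vectors
min_word_count = 10                        # ignore words seen fewer than 10 times
num_workers = multiprocessing.cpu_count()  # number of training threads
context_size = 10                          # max distance between target and context word
downsampling = 1e-3                        # threshold for downsampling frequent words
seed = 1993                                # seed for the random number generator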

In [14]:
wine2vec = w2v.Word2Vec(
    sg=1,                      # 1 = skip-gram, 0 = CBOW
    seed=seed,
    workers=num_workers,
    size=num_features,         # gensim 3.x name; gensim 4 renamed it to vector_size
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
)

In [15]:
wine2vec.build_vocab(sentences)

In [16]:
# gensim 3.x API; gensim 4 removed `wv.vocab` in favor of `wv.key_to_index`
print('Word2Vec vocabulary length:', len(wine2vec.wv.vocab))

Word2Vec vocabulary length: 2612
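
The vocabulary is much smaller than the token count largely because `min_count` discards every word that appears fewer than 10 times in the corpus. A rough sketch of that pruning step, on hypothetical toy data and with a lower threshold (`toy_sentences` and `toy_min_count` are illustration-only names; gensim's own vocabulary building does more than this):

```python
from collections import Counter

# Hypothetical tokenized sentences standing in for `sentences`.
toy_sentences = [
    ["ripe", "cherry", "and", "oak"],
    ["cherry", "and", "plum"],
    ["oak", "and", "cherry"],
]
toy_min_count = 2  # the notebook uses min_word_count = 10

# Count every token, then keep only words that meet the threshold.
counts = Counter(word for sentence in toy_sentences for word in sentence)
kept = sorted(word for word, n in counts.items() if n >= toy_min_count)

print(kept)
```

Here "ripe" and "plum" each occur once, so they are dropped and never receive a vector.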


In [17]:
print(wine2vec.corpus_count)

17323


In [18]:
wine2vec.train(sentences, total_examples=wine2vec.corpus_count, epochs=wine2vec.epochs)

(1354045, 2043705)

In [19]:
wine2vec.wv.most_similar('melon')

[('papaya', 0.9007611870765686),
 ('pineapple', 0.8695796132087708),
 ('banana', 0.8555118441581726),
 ('honeydew', 0.8542354106903076),
 ('pit', 0.8285791873931885),
 ('kiwi', 0.8108007907867432),
 ('cantaloupe', 0.8044856786727905),
 ('mango', 0.7983492016792297),
 ('tropical', 0.7967276573181152),
 ('guava', 0.7951662540435791)]
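
The scores returned by `most_similar` are cosine similarities between word vectors, so values near 1 mean the two words appear in very similar contexts. A small stdlib-only sketch of the measure; the three-dimensional vectors are made up for illustration, while the real model uses 300 dimensions:

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, not taken from the trained model.
melon = [0.9, 0.1, 0.3]
papaya = [0.8, 0.2, 0.4]
tannic = [-0.2, 0.9, -0.1]

print(cosine_similarity(melon, papaya))
print(cosine_similarity(melon, tannic))
```

With these toy values the melon/papaya pair scores much higher than melon/tannic, mirroring how the fruit descriptors cluster together in the output above.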

In [20]:
wine2vec.wv.most_similar('acidic')

[('tartness', 0.8265663981437683),
 ('tad', 0.8175104856491089),
 ('flat', 0.8051095008850098),
 ('punchy', 0.8033303022384644),
 ('cloying', 0.7999146580696106),
 ('energy', 0.7998477816581726),
 ('watery', 0.7965128421783447),
 ('lacking', 0.7943735718727112),
 ('decent', 0.7884608507156372),
 ('honest', 0.7818407416343689)]

In [22]:
wine2vec.wv.most_similar('Chardonnay')

[('Gris', 0.8395384550094604),
 ('Roussanne', 0.7962852120399475),
 ('Chenin', 0.7916781902313232),
 ('Blanc', 0.7905148863792419),
 ('Marsanne', 0.7744864225387573),
 ('Grigio', 0.7682349681854248),
 ('Albari', 0.7595458626747131),
 ('Viognier', 0.7554578185081482),
 ('done', 0.7521730661392212),
 ('Muscat', 0.7467808127403259)]
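
Several of these neighbors ('Gris', 'Chenin', 'Blanc', 'Grigio') are fragments of two-word variety names, because the tokenizer splits on spaces. One possible way to keep such names together, sketched under the assumption that the variety names are known in advance, is to join them with underscores before tokenizing. The helper names and the sample sentence below are hypothetical:

```python
# Hypothetical subset of the values in the `variety` column.
variety_names = ["Pinot Gris", "Pinot Grigio", "Chenin Blanc", "Sauvignon Blanc"]

def merge_variety_names(text, names):
    # Replace longer names first so a shorter name never splits a longer one.
    for name in sorted(names, key=len, reverse=True):
        text = text.replace(name, name.replace(" ", "_"))
    return text

def sentence_to_wordlist_keep_underscore(raw):
    # Same idea as sentence_to_wordlist, but underscores survive.
    import re
    return re.sub("[^a-zA-Z_]", " ", raw).split()

merged = merge_variety_names("Leaner than most Pinot Gris or Chenin Blanc.", variety_names)
print(sentence_to_wordlist_keep_underscore(merged))
```

Each variety then maps to a single token such as `Pinot_Gris`, which could be queried directly with `most_similar` after retraining.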