# Varietal by Description Document Embedding using DocMAP

In this notebook we'll explore wine varietals based on the reviews themselves. To do this, we'll embed the reviews (as documents) and then compare them. 

Documents are an example of **variable width categorical data** where **counts matter**. Documents can be thought of as **bags of words** or **bags of ngrams** which can be translated into a **bag of probabilities** (e.g. IDF).

Either way, a document becomes a multinomial distribution over our vocabulary. This is the initial vectorization (via `DocVectorizer`), which produces a document-by-vocabulary matrix. We then use UMAP with the Hellinger distance as the metric to reduce that matrix to a 2-dimensional embedding.
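
To make the Hellinger distance concrete, here is a minimal pure-Python sketch for two bag-of-words count vectors. The function name `hellinger` and the toy vectors are illustrative assumptions, not part of `textmap` or `umap`. Each count vector is first normalized to a probability distribution, and the distance is $\sqrt{1 - \sum_i \sqrt{p_i q_i}}$.

```python
import math

def hellinger(x, y):
    """Hellinger distance between two non-negative count vectors of equal length."""
    total_x, total_y = sum(x), sum(y)
    # Bhattacharyya coefficient of the two normalized distributions
    bc = sum(math.sqrt((a / total_x) * (b / total_y)) for a, b in zip(x, y))
    # Clamp to guard against tiny negative values from floating-point rounding
    return math.sqrt(max(0.0, 1.0 - bc))

# Toy word counts over a 4-word vocabulary (made-up numbers)
doc_a = [3, 1, 0, 0]
doc_b = [6, 2, 0, 0]   # same proportions as doc_a, twice as long
doc_c = [0, 0, 2, 2]   # shares no words with doc_a

same_mix = hellinger(doc_a, doc_b)
disjoint = hellinger(doc_a, doc_c)
```

Documents with the same word proportions are at distance 0 regardless of length, and documents with no words in common are at the maximum distance of 1.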

In [None]:
#Quick cell to make jupyter notebook use the full screen width
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [None]:
#Some plotting libraries
import matplotlib.pyplot as plt
%matplotlib inline

from bokeh.plotting import show, save, output_notebook, output_file
from bokeh.resources import INLINE 
output_notebook(resources=INLINE)

In [None]:
import umap
import umap.plot
from textmap.vectorizers import DocVectorizer
import hdbscan
import numpy as np
import scipy.stats  # scipy.stats.describe is used below; a bare "import scipy" may not load the submodule
import random

In [None]:
from src.data import Dataset

## Get Data

In [None]:
ds = Dataset.load('wine_reviews_130k')

In [None]:
ds.data.shape

Only take reviews that have a variety.

In [None]:
df_variety = ds.data.dropna(axis=0, subset=['variety']).copy()

In [None]:
df_variety.head(2)

It doesn't make sense to include varieties that don't appear in enough reviews.

In [None]:
varietal_counts = df_variety.variety.value_counts()
varietal_counts[varietal_counts < 100].plot.hist(bins=20);

In [None]:
scipy.stats.describe(varietal_counts)

Let's require a varietal to appear in more than `min_reviews` reviews to be included. Dropping the rare varietals also reduces the number of reviews, which makes the computation easier overall.

In [None]:
min_reviews = 75

In [None]:
# Keep only reviews whose varietal appears in more than min_reviews reviews
df_variety['common_varietal'] = df_variety.variety.map(varietal_counts) > min_reviews
df_common_variety = df_variety[df_variety.common_varietal].copy()
df_common_variety.reset_index(drop=True, inplace=True)
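
Note that the comparison above is strict: a varietal with exactly `min_reviews` reviews is dropped. A toy sketch of the same filter in plain Python, with made-up varietal names and a hypothetical threshold of 2:

```python
from collections import Counter

# Made-up review labels; the threshold is a stand-in for min_reviews
varieties = ["Pinot Noir", "Pinot Noir", "Pinot Noir", "Riesling", "Riesling", "Malbec"]
toy_min_reviews = 2

counts = Counter(varieties)
# Strict inequality, mirroring the `> min_reviews` comparison above
kept = [v for v in varieties if counts[v] > toy_min_reviews]
kept_varieties = sorted(set(kept))
```

Here only the varietal with three reviews survives; the one with exactly two is filtered out.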

In [None]:
df_common_variety.head()

In [None]:
common_varietal_counts = df_common_variety.variety.value_counts()
common_varietal_counts.plot.hist(bins=20);

In [None]:
len(df_common_variety)

In [None]:
len(df_common_variety.variety.value_counts())

This is saved off for future use as a dataset:

In [None]:
ds_common_variety = Dataset.load(f'wine_reviews_130k_varietals_{min_reviews}')

In [None]:
ds_common_variety.data.shape

## Vectorize the documents using DocVectorizer

In [None]:
%%time
description_vectorizer = DocVectorizer(tokenizer='tweet', token_contractor=None)
description_matrix = description_vectorizer.fit_transform(df_common_variety.description.astype(str))

## Dimension reduce to embed in a 2D space

In [None]:
%%time
description_model = umap.UMAP(n_neighbors=15, n_components=2, metric='hellinger',
                              unique=True, random_state=42).fit(description_matrix)

In [None]:
umap_plot = umap.plot.points(description_model,
                             labels=df_common_variety.variety, theme='fire', show_legend=False);

Shorten descriptions for display

In [None]:
# Cast to str first (as in the vectorization step) so non-string entries cannot break the slice
df_common_variety['short_description'] = df_common_variety.description.astype(str).str[:140]

Sample down to 30000 points for interactive display (too many points otherwise)

In [None]:
N = 30000
sample = random.sample(range(len(df_common_variety)), N)
df_filter = np.zeros(len(df_common_variety), dtype=bool)
df_filter[sample] = True  # mark the sampled rows
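
The sample above changes on every run because `random` is not seeded. If a reproducible subset is wanted, one option is a dedicated seeded generator. In this sketch the helper name `sample_mask`, the seed, and the sizes are all illustrative assumptions.

```python
import random

def sample_mask(n_rows, n_sample, seed=42):
    """Return a list of booleans with exactly min(n_sample, n_rows) True entries."""
    rng = random.Random(seed)            # local generator, leaves global state alone
    n_sample = min(n_sample, n_rows)     # avoid ValueError when the data is small
    chosen = set(rng.sample(range(n_rows), n_sample))
    return [i in chosen for i in range(n_rows)]

mask = sample_mask(n_rows=10, n_sample=4)
```

Wrapping the result in `np.array(mask)` gives a boolean array that can be passed as `subset_points`.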

In [None]:
hover_df = df_common_variety[['short_description', 'points', 'title',
                       'variety', 'winery', 'country']].copy()
f = umap.plot.interactive(description_model, labels=df_common_variety.variety, 
                          hover_data=hover_df, theme='fire', point_size=5, subset_points=df_filter);
#save(f, filename=outfile_html)
show(f)

## Now let's cluster to better understand the embedding

In [None]:
%%time
clusterer = hdbscan.HDBSCAN(min_cluster_size=20)
labels = clusterer.fit_predict(description_model.embedding_)

In [None]:
hover_df['cluster'] = labels
hover_cols = ['short_description', 'points', 'title',
              'variety', 'winery', 'cluster']

In [None]:
hover_df.cluster.value_counts()

In [None]:
umap_plot = umap.plot.points(description_model, labels=hover_df['cluster'],
                             theme='fire', show_legend=False);

In [None]:
f = umap.plot.interactive(description_model, labels=hover_df['cluster'], 
                          hover_data=hover_df, theme='fire', point_size=5,
                          subset_points=df_filter);
show(f)

In [None]:
# largest clusters
hover_df.cluster.value_counts()[:10]
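
Keep in mind that HDBSCAN assigns the label `-1` to points it treats as noise, so `-1` can show up among the largest "clusters" in the counts above. A minimal sketch of ranking the real clusters with noise set aside, using a made-up label list:

```python
from collections import Counter

# Made-up cluster labels; -1 marks noise in HDBSCAN output
toy_labels = [-1, -1, -1, -1, 0, 0, 0, 1, 1, 2]

sizes = Counter(toy_labels)
noise_count = sizes.pop(-1, 0)                  # set noise aside
noise_fraction = noise_count / len(toy_labels)
ranked = [label for label, _ in sizes.most_common()]  # largest real cluster first
```

With the notebook's data, the equivalent would be to drop the `-1` entry from `hover_df.cluster.value_counts()` before indexing into it.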

### Let's look at a single large cluster now.

In [None]:
cluster_id = hover_df.cluster.value_counts().index[3]

In [None]:
cluster_filter = (hover_df.cluster==cluster_id)

In [None]:
f = umap.plot.interactive(description_model, labels=hover_df.variety, 
                          hover_data=hover_df, theme='fire', point_size=5,
                          subset_points=cluster_filter);
show(f)

Which varietals contribute more than 5% towards the cluster?

In [None]:
cluster_variety_df = hover_df[cluster_filter].value_counts('variety')

In [None]:
cluster_variety_df[(cluster_variety_df / sum(cluster_filter) * 100) > 5]
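
Since the same share calculation is repeated for each cluster below, it could be wrapped in a small helper. This is only a sketch: the name `variety_shares` and the toy frame are hypothetical, and it assumes a DataFrame with `variety` and `cluster` columns like `hover_df`.

```python
import pandas as pd

def variety_shares(df, cluster_id, min_share=0.05):
    """Share of each variety within one cluster, keeping shares above min_share."""
    members = df[df["cluster"] == cluster_id]
    shares = members["variety"].value_counts(normalize=True)
    return shares[shares > min_share]

# Made-up stand-in for hover_df
toy = pd.DataFrame({
    "variety": ["Riesling"] * 6 + ["Malbec"] * 3 + ["Syrah"] + ["Merlot"] * 2,
    "cluster": [0] * 10 + [1] * 2,
})
top = variety_shares(toy, cluster_id=0, min_share=0.2)
```

Using `value_counts(normalize=True)` returns fractions directly, so the result reads as shares (0.05 for 5%) rather than raw counts.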

Looks like they might be mostly from the same country

In [None]:
hover_df[cluster_filter].value_counts('country')[:10]

### Now a single medium cluster

In [None]:
cluster_id = hover_df.cluster.value_counts().index[len(hover_df.cluster.value_counts()) // 4]
cluster_id

In [None]:
cluster_filter = (hover_df.cluster==cluster_id)

In [None]:
f = umap.plot.interactive(description_model, labels=hover_df.variety, 
                          hover_data=hover_df, theme='fire', point_size=5,
                          subset_points=cluster_filter);
show(f)

Which varietals contribute more than 5% towards this cluster?

In [None]:
cluster_variety_df = hover_df[cluster_filter].value_counts('variety')

In [None]:
cluster_variety_df[(cluster_variety_df / sum(cluster_filter) * 100) > 5]

In [None]:
hover_df[cluster_filter].value_counts('country')[:10]

### And a small cluster

In [None]:
cluster_id = hover_df.cluster.value_counts().index[-2]
cluster_id

In [None]:
cluster_filter = (hover_df.cluster==cluster_id)

In [None]:
f = umap.plot.interactive(description_model, labels=hover_df.variety, 
                          hover_data=hover_df, theme='fire', point_size=5,
                          subset_points=cluster_filter);
show(f)

Which varietals contribute more than 5% towards this cluster?

In [None]:
cluster_variety_df = hover_df[cluster_filter].value_counts('variety')

In [None]:
cluster_variety_df[(cluster_variety_df / sum(cluster_filter) * 100) > 5]

In [None]:
hover_df[cluster_filter].value_counts('country')[:10]