In [None]:
#Quick cell to make jupyter notebook use the full screen width
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [None]:
#Some plotting libraries
import matplotlib.pyplot as plt
%matplotlib notebook
from bokeh.plotting import show, save, output_notebook, output_file
from bokeh.resources import INLINE 
output_notebook(resources=INLINE)

In [None]:
from src import workflow
from src.data import Dataset

In [None]:
import umap
import umap.plot
import numpy as np

## First step of the embedding: CountVectorizer

We are going to vectorize our data and look at the number of categorical values the rows have in common.  A useful thing to do here is to require each row to have a minimum support before being included.  Filtering this early ensures that indices line up later on.


In [None]:
beer_style_dset = Dataset.load("beer_style_reviewers")
beer_style_dset.metadata

In [None]:
beer_style = beer_style_dset.data
beer_style.review_profilename_list


This step turns a sequence of space-separated text into a sparse matrix of counts, with one row per row of our data frame and one column per unique token that appeared in our categorical field of interest.

If we want to deal with sets (i.e. just presence or absence of a category) use:<BR>
`style_by_authors_vectorizer = CountVectorizer(binary=True)`<BR>
If we think counts should matter we might use:<BR>
`style_by_authors_vectorizer = CountVectorizer()`<BR>
or if we want to correct for very unbalanced column frequencies:<BR>
`style_by_authors_vectorizer = TfidfVectorizer()`<BR>
    
We use `min_df=10` in our `CountVectorizer` to keep only reviewers who appear in at least 10 rows, that is, reviewers who have reviewed at least 10 beer styles.
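
To make the effect of `binary=True` and `min_df` concrete, here is a minimal pure-Python sketch of the same bookkeeping on toy data. The style and reviewer names are made up, and the helper `binary_rows_with_min_df` is hypothetical: it mimics the idea of a document-frequency filter followed by a presence/absence encoding, and is not scikit-learn's implementation.

```python
from collections import Counter

# Toy data (made up): one space-separated reviewer list per beer style.
reviewer_lists = {
    "ipa": "alice bob carol alice",
    "stout": "alice bob",
    "lager": "alice dave",
}

def binary_rows_with_min_df(docs, min_df):
    """Keep tokens seen in at least `min_df` rows and encode presence/absence."""
    token_sets = {name: set(text.split()) for name, text in docs.items()}
    # Document frequency: number of rows containing the token
    # (repeats inside a single row count only once).
    doc_freq = Counter(tok for toks in token_sets.values() for tok in toks)
    vocabulary = sorted(tok for tok, df in doc_freq.items() if df >= min_df)
    # Binary encoding: 1 if the token is present in the row, else 0.
    rows = {
        name: [int(tok in toks) for tok in vocabulary]
        for name, toks in token_sets.items()
    }
    return vocabulary, rows

vocabulary, rows = binary_rows_with_min_df(reviewer_lists, min_df=2)
print(vocabulary)
for name, row in rows.items():
    print(name, row)
```

In this toy case `carol` and `dave` each appear in only one row, so they are dropped from the columns, while every row is still kept.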


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

style_by_authors_vectorizer = CountVectorizer(binary=True, min_df=10)
style_by_authors = style_by_authors_vectorizer.fit_transform(beer_style.review_profilename_list)
style_by_authors

This results in an impractically large vector space. We want an embedding into a much smaller space. 

Now we reduce the dimension of this data.

If we are dealing with sets (i.e. just presence or absence of a category) use:<BR>
`metric='jaccard'`<BR>
If we think counts should matter we might use:<BR>
`metric='hellinger'`<BR>
or if we want to correct for very unbalanced column frequencies:<BR>
`metric='hellinger'`<BR>
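
As a rough illustration of why the choice of metric follows the choice of vectorizer, the sketch below computes both distances by hand on two toy rows. These are the textbook definitions written in plain Python for readability, not UMAP's own implementations, and the function names and toy rows are made up for the example.

```python
import math

def jaccard_distance(x, y):
    """1 - |intersection| / |union| over the nonzero positions of two rows."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return 0.0 if either == 0 else 1.0 - both / either

def hellinger_distance(x, y):
    """Hellinger distance between two non-negative rows, each normalized to sum to 1."""
    sx, sy = sum(x), sum(y)
    if sx == 0 and sy == 0:
        return 0.0
    if sx == 0 or sy == 0:
        return 1.0
    # Bhattacharyya coefficient of the two normalized rows.
    bc = sum(math.sqrt((a / sx) * (b / sy)) for a, b in zip(x, y))
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 1.0 - bc))

# Two toy rows with the same nonzero columns but very different counts.
row_a = [3, 1, 0]
row_b = [1, 5, 0]

print("jaccard:", jaccard_distance(row_a, row_b))
print("hellinger:", hellinger_distance(row_a, row_b))
```

Jaccard only looks at which columns are nonzero, so it treats these two rows as identical; Hellinger compares the normalized count profiles and therefore separates them.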
    
As you get more and more points I'd recommend increasing the `n_neighbors` parameter to compensate.  Think of this as a resolution parameter.

`n_components` controls the dimension you will be embedding your data into (2-dimensions for easy visualization).  Feel free to embed into higher dimensions for clustering if you'd like.

`unique=True` says that if you have two identical points you want to map them to the exact same coordinates in your low-dimensional space.  This becomes especially important if you have more exact dupes than your `n_neighbors` parameter.  That is the problem case, where exact dupes can be pushed into very different regions of your space.
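
One quick way to see whether the problem case applies is to count exact duplicate rows and compare the largest group against `n_neighbors`. The sketch below does this on made-up toy rows; the helper `largest_duplicate_group` is hypothetical and not part of UMAP.

```python
from collections import Counter

def largest_duplicate_group(rows):
    """Return the size of the largest group of exactly identical rows."""
    counts = Counter(tuple(row) for row in rows)
    return max(counts.values()) if counts else 0

# Toy binary rows (made up): the first three are exact duplicates.
toy_rows = [
    [1, 0, 1],
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
]
n_neighbors = 2

largest = largest_duplicate_group(toy_rows)
# More exact dupes than neighbors is the case where unique=True matters most.
needs_unique = largest > n_neighbors
print(largest, needs_unique)
```

On the real data the same check could be run over the rows of the dense matrix, assuming it fits in memory.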

In [None]:
%%time
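# .toarray() returns a plain ndarray; .todense() returns np.matrix, which newer scikit-learn versions reject
style_by_authors_model = umap.UMAP(n_neighbors=5, n_components=2, metric='jaccard', min_dist=0.1,
                                  unique=True, random_state=42).fit(style_by_authors.toarray())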

In [None]:
np.log(beer_style.num_reviewers).describe()

In [None]:
#outfile ='results/beer_style_by_reviewer_jaccard'

In [None]:
umap_plot = umap.plot.points(style_by_authors_model, labels=np.log(beer_style.num_reviewers), theme='fire');
#umap_plot.figure.savefig(outfile+'.png', dpi=300, bbox_inches='tight')

... and now for an interactive plot with mouseover.

In [None]:
hover_df = beer_style['beer_style beer_abv num_reviewers review_overall brewery_name'.split()]
f = umap.plot.interactive(style_by_authors_model, labels=np.log(beer_style.num_reviewers), 
                          hover_data=hover_df, theme='fire', point_size=5);
#save(f,outfile+'.html')
show(f)

### What if we wanted to only group beer by the users that liked them?

Are two beer styles similar just because the same reviewers tried them?  Perhaps not. Instead, let's filter to only the reviewers who enjoyed the beer.

Because this filter is about individual reviews rather than the beers themselves, we need to filter our initial data frame and re-run the whole process.

## XXX Link to the Positive Reviewer Notebook: EmbedAllTheThings_Beer_by_reviewer_positive