In [None]:
#Quick cell to make jupyter notebook use the full screen width
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [None]:
#Some plotting libraries
import matplotlib.pyplot as plt
%matplotlib notebook
from bokeh.plotting import show, save, output_notebook, output_file
from bokeh.resources import INLINE 
output_notebook(resources=INLINE)

In [None]:
from src import workflow
from src.data import Dataset

## First step of the embedding: CountVectorize

We are going to vectorize our data and look at the number of categorical values the rows have in common.  A useful thing to do here is to require each row to have a minimum support before being included.  Filtering this early will ensure indices line up later on.


In [None]:
beer_style_dset = Dataset.load("beer_style_reviewers")
beer_style_dset.metadata

In [None]:
beer_style = beer_style_dset.data
beer_style.review_profilename_list


This step turns a sequence of space-separated text into a sparse matrix of counts, with one row per row of our data frame and one column per unique token that appeared in our categorical field of interest.
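
As a minimal sketch of that transformation, the cell below runs `CountVectorizer` on three made-up rows. The reviewer names and the `toy_*` variables are hypothetical and are not part of the beer dataset; they only illustrate the shape and sparsity of the result.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data: each string stands for one row of the data frame,
# and each space-separated token stands for one reviewer name.
toy_rows = [
    "alice bob alice",
    "bob carol",
    "alice carol dave",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_rows)

# The result is a SciPy sparse matrix: one row per input row,
# one column per unique token, with only non-zero counts stored.
print(sparse.issparse(toy_counts))
print(toy_counts.shape)
print(toy_counts.nnz)

# vocabulary_ maps each token to its column index.
print(toy_vectorizer.vocabulary_)
print(toy_counts.toarray())
```

Note that the default tokenizer lowercases text and keeps only tokens of two or more word characters, so names containing punctuation may need a custom `token_pattern`.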

If we want to deal with sets (i.e. just presence or absence of a category) use:<BR>
`beer_by_authors_vectorizer = CountVectorizer(binary=True)`<BR>
If we think counts should matter we might use:<BR>
`beer_by_authors_vectorizer = CountVectorizer()`<BR>
or if we want to correct for very unbalanced column frequencies:<BR>
`beer_by_authors_vectorizer = TfidfVectorizer()`<BR>
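
To make the difference between these three options concrete, here is a small comparison on made-up data. The rows and names are hypothetical, and all vectorizers use their default settings apart from `binary=True`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy data; "alice" appears twice in the first row.
toy_rows = ["alice bob alice", "bob carol", "alice carol dave"]

presence = CountVectorizer(binary=True).fit_transform(toy_rows).toarray()
counts = CountVectorizer().fit_transform(toy_rows).toarray()
weighted = TfidfVectorizer().fit_transform(toy_rows).toarray()

# Columns are in alphabetical order: alice, bob, carol, dave.
print(presence[0])  # presence/absence: repeated tokens count once
print(counts[0])    # raw counts: repeated tokens add up
print(weighted[2])  # TF-IDF: the rarer token "dave" gets the largest weight

# With the default norm="l2", every TF-IDF row has unit length.
row_norms = np.linalg.norm(weighted, axis=1)
print(row_norms)
```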
    
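We use `min_df=10` in our `CountVectorizer` to keep only reviewers who appear in at least 10 rows of the data frame.  Since `min_df` is a document-frequency threshold and each row here is a beer style, this means reviewers who have reviewed at least 10 different styles.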
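
The effect of `min_df` can be checked on a toy example. This is only a sketch: the rows are made up and the threshold is lowered to `min_df=2` so that the filtering is visible on three rows.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data: "dave" appears in only one row.
toy_rows = [
    "alice bob alice",
    "bob carol",
    "alice carol dave",
]

# min_df counts the number of rows containing a token, not total occurrences.
toy_vectorizer = CountVectorizer(binary=True, min_df=2)
toy_matrix = toy_vectorizer.fit_transform(toy_rows)

kept = sorted(toy_vectorizer.vocabulary_)
print(kept)                  # "dave" is dropped, the other three remain
print(toy_matrix.toarray())  # still one row per input row
```

Because `min_df` removes columns rather than rows, the matrix keeps the same row order as the input.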


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

style_by_authors_vectorizer = CountVectorizer(binary=True, min_df=10)
style_by_authors = style_by_authors_vectorizer.fit_transform(beer_style.review_profilename_list)
style_by_authors