This model uses [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) as its metric for content-based filtering: beers are compared directly to each other based on their attributes.

We'll collect the attributes of all the beers into a [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) and calculate the similarity between those attributes.
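To make the metric concrete, here is a minimal sketch of cosine similarity on word counts, written with only the standard library. The attribute strings are hypothetical and not taken from the dataset; the notebook itself relies on scikit-learn for this step.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    ca, cb = Counter(a.split()), Counter(b.split())
    # Dot product only needs the words that appear in both strings
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical attribute strings (assumptions for illustration only)
ipa_one = "hoppy brewing american ipa"
ipa_two = "citrus brewing american ipa"
stout = "dark cellar imperial stout"

sim_ipas = cosine(ipa_one, ipa_two)   # 3 shared words, 4 words each -> 0.75
sim_mixed = cosine(ipa_one, stout)    # no shared words -> 0.0
```

Two beers that share most of their attribute words score close to 1, while beers with no words in common score 0.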

In [1]:
import pandas as pd
import pickle 

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

First, we reduce the data to two columns: the beer name and a single string holding the attributes that describe the beer.

In [2]:
#Read in data from pickle

df = pd.read_pickle("beer_data.pickle")

df = df.drop_duplicates(subset=['beer_name'])

cols = ['brewery_name', 'beer_style']
df['key_words'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)

dfbag = df[['beer_name', 'key_words']].copy()

dfbag["key_words"] = dfbag["key_words"].str.lower()
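# Use .str.replace to strip '/' inside each string (Series.replace only matches whole values)
dfbag["key_words"] = dfbag["key_words"].str.replace('/', '', regex=False)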

dfbag = dfbag.reset_index(drop=True)

We will create a matrix using scikit-learn's CountVectorizer. This class lets us use textual data for predictive modeling. To do so, the text is first split into individual words (tokens), a step known as tokenization. Those tokens are then encoded as integer counts for use as inputs to ML algorithms. This entire process is called feature extraction.

In [3]:
count = CountVectorizer()
count_matrix = count.fit_transform(dfbag['key_words'])
count_matrix.shape

(17578, 4882)
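The vocabulary size depends on how the text is split into tokens. CountVectorizer's documented default `token_pattern` is `(?u)\b\w\w+\b`, and the sketch below applies that same regular expression with the standard `re` module instead of calling scikit-learn. The strings are hypothetical examples built in the same style as `key_words`. Because `_` counts as a word character, joining fields with `'_'` appears to fuse the last word of one field with the first word of the next, and single-character tokens are dropped.

```python
import re

# Same regular expression as CountVectorizer's default token_pattern:
# runs of two or more word characters (letters, digits, underscore).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokens(text):
    """Lowercase the text and split it the way the default pattern would."""
    return TOKEN_PATTERN.findall(text.lower())

# Hypothetical key_words strings (assumptions, not rows from the dataset)
with_underscore = tokens("Some Brewing Company_American IPA")
with_space = tokens("Some Brewing Company American IPA")
with_abv = tokens("Some Brewing Company_American IPA_6.5")

print(with_underscore)  # ['some', 'brewing', 'company_american', 'ipa']
print(with_space)       # ['some', 'brewing', 'company', 'american', 'ipa']
print(with_abv)         # ['some', 'brewing', 'company_american', 'ipa_6']
```

If separate tokens per field are wanted, joining with a space rather than an underscore is one option; whether that improves the recommendations here would need to be checked against the data.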

In [4]:
#Generate the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [5]:
# Create a Series for the beers so they are associated to an ordered numerical list
indices = pd.Series(dfbag['beer_name'])
indices[indices == 'Coors']

12607    Coors
Name: beer_name, dtype: object

In [6]:
#Takes in the name of the beer and returns the top n recommended beers

def beer_recs(string, n, cosine_sim = cosine_sim):
    
    recommended_beers = []
    
    #Get the index of the beer that matches the beer name
    idx = indices[indices == string].index[0]
    
    #Creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)

    #Get the indices of the n most similar unique beers
    n = n + 1
    top_n_indexes = list(score_series.iloc[1:n].index)
    
    #Populating the list with the names of the n most similar beers
    for i in top_n_indexes:
        recommended_beers.append(dfbag.iloc[i]['beer_name'])
        
    return recommended_beers

In [7]:
beer_recs('Cauldron DIPA', 5)

['Imperial IPA',
 'Hopknocker Imperial IPA',
 'Loopy Lupulin Double IPA',
 'Arbor Brewing Aurora Arborealis',
 '724.0']

In [8]:
beer_recs('Sausa Weizen', 5)

['Sausa Pils',
 'Red Moon',
 'Black Horse Black Beer',
 'Tadarida IPA',
 'Saison 2.1']

In [9]:
beer_recs('Coors', 5)

['Keystone Ice',
 'Coors',
 'Batch 19',
 'Keystone Premium',
 'Olympia Genuine Draft']
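The result for 'Coors' contains 'Coors' itself. A plausible explanation is that `iloc[1:n]` drops whichever row sorts first, and when another beer has identical `key_words` both rows score 1.0, so the query beer is not guaranteed to be the one dropped. The sketch below shows one way to exclude the query by its index instead of by position. It uses a toy similarity matrix and hypothetical beer names in plain Python; with pandas, the same idea could be expressed by dropping `idx` from `score_series` before slicing.

```python
# Toy similarity matrix for four hypothetical beers. Rows 0 and 1 describe
# beers with identical attributes, so they tie at 1.0 with each other.
names = ["Beer A", "Beer B", "Beer C", "Beer D"]
sim = [
    [1.0, 1.0, 0.5, 0.0],
    [1.0, 1.0, 0.5, 0.0],
    [0.5, 0.5, 1.0, 0.2],
    [0.0, 0.0, 0.2, 1.0],
]

def recs_excluding_self(name, n):
    """Return the n most similar beers, never including the query beer."""
    idx = names.index(name)
    # Pair each other beer's score with its index, skipping the query row
    scored = [(score, i) for i, score in enumerate(sim[idx]) if i != idx]
    # Highest score first; ties are broken by original position
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [names[i] for _, i in scored[:n]]

result = recs_excluding_self("Beer A", 2)
print(result)  # ['Beer B', 'Beer C']
```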

In [10]:
df = df.drop_duplicates(subset=['beer_name'])

# Shorten the dataset for testing purposes
df = df.head(9000)
# df = df.head(40000)

cols = ['brewery_name', 'beer_style', 'beer_abv']
df['key_words'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)

dfbag = df[['beer_name', 'key_words']].copy()

dfbag["key_words"] = dfbag["key_words"].str.lower()
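# Use .str.replace to strip '/' inside each string (Series.replace only matches whole values)
dfbag["key_words"] = dfbag["key_words"].str.replace('/', '', regex=False)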

dfbag = dfbag.reset_index(drop=True)

count = CountVectorizer()
count_matrix = count.fit_transform(dfbag['key_words'])

# Generate the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)

# Create a Series for the beers so they are associated to an ordered numerical list
indices = pd.Series(dfbag['beer_name'])
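When choosing how many rows to keep, it may help to estimate the size of the dense similarity matrix, which grows with the square of the number of beers. The sketch below is plain arithmetic and assumes 8-byte (float64) entries, which is an assumption about the array that `cosine_similarity` returns rather than something measured in this notebook.

```python
def dense_matrix_bytes(n_items, bytes_per_value=8):
    """Approximate size of an n_items x n_items dense matrix."""
    return n_items * n_items * bytes_per_value

full = dense_matrix_bytes(17578)   # all de-duplicated beers
subset = dense_matrix_bytes(9000)  # the shortened dataset

print(f"{full / 1e9:.2f} GB")    # 2.47 GB
print(f"{subset / 1e9:.2f} GB")  # 0.65 GB
```

The pickled file written afterwards would be roughly the same size, which matters for both upload time and the memory of whatever process loads it.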

In [None]:
import awswrangler as wr

# Persist the similarity matrix, the name lookup, and the bag-of-words frame
with open('cosine_sim.pickle', 'wb') as f:
    pickle.dump(cosine_sim, f)
with open('indices.pickle', 'wb') as f:
    pickle.dump(indices, f)
with open('dfbag.pickle', 'wb') as f:
    pickle.dump(dfbag, f)

df = wr.s3.read_csv("s3://geoallen-model-serving/beer-recs/")

wr.s3.upload(local_file='./cosine_sim.pickle', path='s3://geoallen-model-serving/beer-recs/cosine_sim.pickle')

# with open(file='./cosine_sim.pickle', mode='wb') as local_f:
# wr.s3.upload(local_file=local_f, path='s3://geoallen-model-serving/beer-recs/')
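For completeness, here is a minimal sketch of the round trip a serving process might perform: write several objects to pickle files, then load them back. The helper names `save_all` and `load_all` and the stand-in objects are hypothetical, and the example writes to a temporary directory rather than to S3.

```python
import os
import pickle
import tempfile

# Small stand-ins for the real objects; keys mirror the file names used above
artifacts = {
    "cosine_sim": [[1.0, 0.5], [0.5, 1.0]],
    "indices": ["Beer A", "Beer B"],
}

def save_all(objects, directory):
    """Write each object to <directory>/<name>.pickle."""
    for name, obj in objects.items():
        with open(os.path.join(directory, f"{name}.pickle"), "wb") as f:
            pickle.dump(obj, f)

def load_all(names, directory):
    """Read <directory>/<name>.pickle for each name into a dict."""
    loaded = {}
    for name in names:
        with open(os.path.join(directory, f"{name}.pickle"), "rb") as f:
            loaded[name] = pickle.load(f)
    return loaded

with tempfile.TemporaryDirectory() as tmp:
    save_all(artifacts, tmp)
    restored = load_all(artifacts.keys(), tmp)
```

Since unpickling can execute arbitrary code, pickle files should only be loaded from a location you trust.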
