# Building an NMF Model

In [1]:
from sklearn.decomposition import NMF
import pandas as pd
import numpy as np
import feather

Load the `reviews` and `beer-info` feather files, then merge them into a review table called `review_db` containing each beer's name, the `user_id` of the reviewer, and the rating that user gave the beer.

In [2]:
reviews = feather.read_dataframe('../data/reviews.feather')
beer_info = feather.read_dataframe('../data/beer-info.feather')
review_db = reviews.merge(beer_info[['id','name']], left_on='beer_id', right_on='id')[['name','user_id','rating']]
review_db.head()

,name,user_id,rating
0,Surf Wax DIPA,Vasen_pakki,3.75
1,Surf Wax DIPA,Dave-Hill,3.5
2,Surf Wax DIPA,jsapas,3.75
3,Surf Wax DIPA,vanatyhi1,3.25
4,Surf Wax DIPA,stennibal,3.75
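
The merge step can be tried on a toy dataset. The sketch below uses two small, hypothetical frames in place of the feather files (the beer names, user names, and ratings are made up for illustration) and shows that `merge` defaults to an inner join, so reviews without a matching row in `beer_info` are silently dropped.

```python
import pandas as pd

# Hypothetical stand-ins for the reviews and beer-info tables.
reviews = pd.DataFrame({
    'beer_id': [1, 1, 2, 3],
    'user_id': ['ann', 'bob', 'ann', 'cy'],
    'rating': [3.75, 3.5, 4.0, 2.5],
})
beer_info = pd.DataFrame({
    'id': [1, 2],
    'name': ['Beer A', 'Beer B'],
})

# merge() defaults to an inner join, so the review of beer_id 3,
# which has no row in beer_info, is dropped.
review_db = reviews.merge(
    beer_info[['id', 'name']], left_on='beer_id', right_on='id'
)[['name', 'user_id', 'rating']]

print(review_db)
print(len(reviews) - len(review_db), 'review(s) had no matching beer')
```

Comparing the row counts before and after the merge is a cheap way to check whether any reviews were lost.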


Convert the "tidy" `review_db` into a wide `ratings` matrix by pivoting on `name` and `user_id`, so that each cell holds the rating a user gave a beer. Fill all missing values with zeroes. The result is sparse in the sense that most cells are zero, although pandas still stores it as a dense DataFrame.

In [3]:
ratings = review_db.pivot_table(index='name', columns='user_id', values='rating', fill_value=0)
print('ratings is an M x N matrix, where M={0} and N={1}'.format(ratings.shape[0], ratings.shape[1]))
ratings.head()

ratings is an M x N matrix, where M=1220 and N=138329


name,--------,--JFG--,-1X,-Alix-,-Beer,-C-,-Chubbs-,-GMS-,-Hammer-,-Jamin,...,zyphus,zysurge,zytle,zzDebra,zzandman,zzavilla,zzzigga,zzzirk,zzzzbeer,zzzzbeerzzzz
(New) English Bulldog Hazy IPA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0,0.0,0.0,0,0.0,0,0,0.0
(SIPAS) Hazy Session IPA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0,0.0,0.0,0,0.0,0,0,0.0
01 18 Off-Tempo DIPA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0,0.0,0.0,0,0.0,0,0,0.0
04609 Double IPA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0,0.0,0.0,0,0.0,0,0,0.0
06 18 Off Tempo DIPA 2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0,0.0,0.0,0,0.0,0,0,0.0
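
A small example may make the pivot easier to follow. The frame below is hypothetical (made-up names and ratings). It illustrates two behaviours of `pivot_table`: duplicate beer/user pairs are averaged, because the default `aggfunc` is the mean, and pairs with no review receive `fill_value`.

```python
import pandas as pd

# Hypothetical tidy reviews; 'ann' rated 'Beer A' twice.
review_db = pd.DataFrame({
    'name': ['Beer A', 'Beer A', 'Beer A', 'Beer B'],
    'user_id': ['ann', 'ann', 'bob', 'ann'],
    'rating': [4.0, 3.0, 3.5, 4.5],
})

ratings = review_db.pivot_table(
    index='name', columns='user_id', values='rating', fill_value=0
)
print(ratings)

# Share of cells that hold an actual rating rather than a filled-in zero.
density = (ratings.values != 0).mean()
print('density: {:.2f}'.format(density))
```

Running the same density calculation on the full `ratings` matrix gives a quick measure of how many cells contain real ratings.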


Select the number of components, `k`, and initialize the model. Fit the model to the `ratings` matrix and transform it in a single step.

In [4]:
%%time
k = 5
model = NMF(n_components=k)
nmf_features = model.fit_transform(ratings)

Wall time: 1min 19s
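
To give a feel for what the factorization does, here is a minimal NumPy sketch of NMF using the classic multiplicative update rules. This is not the solver scikit-learn uses by default (that is coordinate descent); the function name `nmf_multiplicative`, the iteration count, and the toy matrix are all assumptions made for illustration.

```python
import numpy as np

def nmf_multiplicative(V, k, n_iter=200, seed=0, eps=1e-10):
    """Factor a non-negative matrix V (M x N) into W (M x k) and H (k x N)."""
    rng = np.random.default_rng(seed)
    M, N = V.shape
    W = rng.random((M, k))
    H = rng.random((k, N))
    for _ in range(n_iter):
        # Each update multiplies entries by a non-negative ratio,
        # so W and H stay non-negative throughout.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy matrix: 4 items x 6 users, built to have rank 2.
rng = np.random.default_rng(1)
V = rng.random((4, 2)) @ rng.random((2, 6))

W, H = nmf_multiplicative(V, k=2)
error = np.linalg.norm(V - W @ H)  # Frobenius norm of the residual
print(W.shape, H.shape, error)
```

Because the starting point is random, different seeds can produce different factors. Passing `random_state` to scikit-learn's `NMF` makes a run reproducible.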


Fitting and transforming the model returns the matrix stored in `beer_feat`, which has one row for each beer and one column for each latent variable.

In [5]:
beer_feat = pd.DataFrame(nmf_features, index=ratings.index)
print('beer_feat is an M x k matrix, where M={0} and k={1}'.format(beer_feat.shape[0], beer_feat.shape[1]))
beer_feat.head()

beer_feat is an M x k matrix, where M=1220 and k=5


name,0,1,2,3,4
(New) English Bulldog Hazy IPA,0.048447,0.031167,0.010494,0.051159,0.080932
(SIPAS) Hazy Session IPA,0.0,0.008912,0.008071,0.017882,0.002347
01 18 Off-Tempo DIPA,0.005913,0.00922,0.019533,0.0,0.092596
04609 Double IPA,0.003631,0.0,0.009278,0.031429,0.030516
06 18 Off Tempo DIPA 2,0.0,0.0,0.009912,0.0,0.00253


The model's `components_` attribute becomes the `user_feat` table, which has one column for each user and one row for each latent variable.

In [6]:
user_feat = pd.DataFrame(model.components_, columns=ratings.columns)
print('user_feat is a k x N matrix, where k={0} and N={1}'.format(user_feat.shape[0], user_feat.shape[1]))
user_feat.head()

user_feat is a k x N matrix, where k=5 and N=138329


user_id,--------,--JFG--,-1X,-Alix-,-Beer,-C-,-Chubbs-,-GMS-,-Hammer-,-Jamin,...,zyphus,zysurge,zytle,zzDebra,zzandman,zzavilla,zzzigga,zzzirk,zzzzbeer,zzzzbeerzzzz
0,0.000726,0.000284,0.0,2e-05,0.248966,0.0,0.000515,0.0,0.000181,0.000293,...,0.0,0.0,0.0004,0.000435,0.000108,1.2e-05,0.000857,5.9e-05,0.000183,0.000854
1,0.00048,7.9e-05,0.278645,0.0,1.3e-05,0.000619,0.000123,7.8e-05,0.0,0.000116,...,0.27736,0.0,0.001352,0.0,0.000315,0.000219,0.000752,0.0,0.000141,0.00034
2,0.000114,7.1e-05,0.0,0.0,0.000276,0.00299,0.000389,0.0,0.0,0.000363,...,0.0,0.0,0.003491,0.0,0.000752,0.000768,0.330865,0.0,0.000236,0.000395
3,0.000445,0.000209,0.000994,0.00028,0.000528,0.0,0.000247,0.000292,0.000374,0.002715,...,0.0,7e-06,0.006924,0.002247,0.0,0.0,0.010298,0.0,0.000408,0.000395
4,0.005438,0.006328,0.004648,0.001,0.0,0.045463,0.003123,0.001443,0.002472,0.0037,...,0.0,2.9e-05,0.003144,0.00226,0.008758,0.006336,0.004982,0.001597,0.003682,0.005508


Since `beer_feat` is M x k and `user_feat` is k x N, their matrix product is M x N: a matrix of predicted ratings called `predictions`. Each predicted value is rounded to the nearest 0.25.

In [7]:
predictions = beer_feat.dot(user_feat).apply(lambda x: 0.25 * np.round(x/0.25))
predictions.head()

name,--------,--JFG--,-1X,-Alix-,-Beer,-C-,-Chubbs-,-GMS-,-Hammer-,-Jamin,...,zyphus,zysurge,zytle,zzDebra,zzandman,zzavilla,zzzigga,zzzirk,zzzzbeer,zzzzbeerzzzz
(New) English Bulldog Hazy IPA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
(SIPAS) Hazy Session IPA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
01 18 Off-Tempo DIPA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
04609 Double IPA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
06 18 Off Tempo DIPA 2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
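
The product and the rounding step can be checked on a tiny example. The factor matrices below are hypothetical values chosen by hand; only the expression `0.25 * np.round(x / 0.25)` comes from the cell above.

```python
import numpy as np

# Hypothetical factors: 3 beers x 2 latent features, 2 features x 4 users.
W = np.array([[1.0, 0.5],
              [0.2, 2.0],
              [0.0, 0.1]])
H = np.array([[3.6, 0.1, 0.0, 1.0],
              [0.0, 1.9, 0.4, 0.2]])

raw = W @ H                             # shape (3, 4): beers x users
rounded = 0.25 * np.round(raw / 0.25)   # snap to the nearest quarter point

print(raw.shape)
print(rounded)

# Values below 0.125 snap to 0, so weak signals vanish after rounding.
samples = 0.25 * np.round(np.array([0.1, 0.13, 3.6, 3.9]) / 0.25)
print(samples)
```

Note that any raw prediction smaller than 0.125 becomes exactly 0 after rounding, which matters when most of the reconstructed values are small.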


Compare the predicted and actual ratings for a single user, `LetThereBeR0ck`.

In [8]:
predictions.LetThereBeR0ck.describe()

count    1220.0
mean        0.0
std         0.0
min         0.0
25%         0.0
50%         0.0
75%         0.0
max         0.0
Name: LetThereBeR0ck, dtype: float64

In [9]:
ratings.LetThereBeR0ck.describe()

count    1220.000000
mean        0.007172
std         0.177138
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         4.500000
Name: LetThereBeR0ck, dtype: float64
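
The two summaries above show that the model predicts 0 for every beer for this user, even though the user rated at least one beer 4.5. One plausible contributing factor, offered here as an assumption rather than a verified diagnosis, is that the filled-in zeroes are fitted as if they were real ratings, so a low-rank approximation of a mostly-zero matrix stays close to zero and then rounds to 0. The sketch below uses a hypothetical helper, `observed_rmse`, and a made-up rating column to show how an error measured over all cells can hide this.

```python
import numpy as np

def observed_rmse(actual, predicted):
    """RMSE over cells with a real rating; zeros are treated as 'not rated'."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mask = actual != 0
    return float(np.sqrt(np.mean((actual[mask] - predicted[mask]) ** 2)))

# Hypothetical column for one user: a single 4.5 rating among 100 beers.
actual = np.zeros(100)
actual[42] = 4.5
all_zero = np.zeros_like(actual)

# Over every cell, the all-zero prediction looks fairly accurate...
overall = float(np.sqrt(np.mean((actual - all_zero) ** 2)))
# ...but on the one beer the user actually rated, it is off by the full rating.
observed = observed_rmse(actual, all_zero)

print(overall, observed)
```

Evaluating only on observed ratings, as this helper does, is one way to judge whether a recommender's predictions are useful for the beers a user has actually tried.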