# Building an NMF Model

In [1]:
from sklearn.decomposition import NMF
import pandas as pd
import numpy as np
import feather

Load the `reviews` and `beer-info` feather files, then merge them into a review table called `review_db` that holds the beer name, the `user_id` of the reviewer, and the rating that user gave the beer.

In [2]:
reviews = feather.read_dataframe('../data/reviews.feather')
beer_info = feather.read_dataframe('../data/beer-info.feather')
# Attach each beer's name to its reviews, then keep only the columns the model needs.
review_db = reviews.merge(
    beer_info[['id', 'name']], left_on='beer_id', right_on='id'
)[['name', 'user_id', 'rating']]
review_db.head()

,name,user_id,rating
0,Surf Wax DIPA,Vasen_pakki,3.75
1,Surf Wax DIPA,Dave-Hill,3.5
2,Surf Wax DIPA,jsapas,3.75
3,Surf Wax DIPA,vanatyhi1,3.25
4,Surf Wax DIPA,stennibal,3.75
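
As a rough illustration of what the merge above does, here is a minimal sketch on toy data. The frames `toy_reviews` and `toy_beer_info` and their contents are hypothetical stand-ins for the feather files; only the column names (`beer_id`, `id`, `name`, `user_id`, `rating`) come from the cell above.

```python
import pandas as pd

# Hypothetical toy data standing in for the two feather files.
toy_reviews = pd.DataFrame({
    'beer_id': [1, 1, 2],
    'user_id': ['ann', 'bob', 'ann'],
    'rating': [3.75, 3.5, 4.25],
})
toy_beer_info = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Beer A', 'Beer B', 'Beer C'],
})

# merge() defaults to an inner join, so a beer with no reviews (id 3)
# does not appear in the result.
toy_review_db = toy_reviews.merge(
    toy_beer_info[['id', 'name']], left_on='beer_id', right_on='id'
)[['name', 'user_id', 'rating']]

print(toy_review_db)
```

Because the join is an inner join, any review whose `beer_id` has no match in the beer table would also be dropped silently, which may be worth checking on the real data.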


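Convert the "tidy" `review_db` into a sparsely populated `ratings` matrix by pivoting on `name` and `user_id`, so that each cell holds the rating that user gave that beer. Beer/user pairs with no review are filled with zero. Because `pivot_table` aggregates with the mean by default, repeated reviews of the same beer by the same user are averaged. Note that the result is an ordinary dense DataFrame that is mostly zeros, not a sparse data structure.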

In [3]:
ratings = review_db.pivot_table(index='name', columns='user_id', values='rating', fill_value=0)
print('ratings is an M x N matrix, where M={0} and N={1}'.format(ratings.shape[0], ratings.shape[1]))
ratings.head()

ratings is an M x N matrix, where M=708 and N=78862


user_id,--------,--JFG--,-Alix-,-Chubbs-,-Hammer-,-Jamin,-MOTA-,-Piels-,-TheDude-,-Z-inNYC,...,zyankali7,zychr,zygspytz,zymman,zymurgeek,zymurgenius,zysurge,zytle,zzzigga,zzzzbeer
name
(New) English Bulldog Hazy IPA,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,...,0,0,0,0.0,0.0,0,0.0,0,0.0,0
01 18 Off-Tempo DIPA,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,...,0,0,0,0.0,0.0,0,0.0,0,0.0,0
04609 Double IPA,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,...,0,0,0,0.0,0.0,0,0.0,0,0.0,0
06 18 Off Tempo DIPA 2,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,...,0,0,0,0.0,0.0,0,0.0,0,0.0,0
077XX,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0.0,...,0,0,0,0.0,0.0,0,0.0,0,0.0,0
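
The pivot step can be sketched on a few made-up rows. The data below is hypothetical; it is chosen so that one user reviews the same beer twice and one beer/user pair has no review at all, which shows both the default averaging and the zero fill.

```python
import pandas as pd

# Hypothetical tidy reviews: 'ann' reviewed 'Beer B' twice,
# and 'bob' never reviewed 'Beer B'.
toy_review_db = pd.DataFrame({
    'name': ['Beer A', 'Beer A', 'Beer B', 'Beer B'],
    'user_id': ['ann', 'bob', 'ann', 'ann'],
    'rating': [3.75, 3.5, 4.0, 4.5],
})

toy_ratings = toy_review_db.pivot_table(
    index='name', columns='user_id', values='rating', fill_value=0
)
print(toy_ratings)

# Fraction of cells that hold an actual rating rather than a filled-in zero.
density = (toy_ratings.to_numpy() != 0).mean()
print('density:', density)
```

The same `density` calculation applied to the full `ratings` matrix would show how few of its cells are real ratings.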


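Choose the number of latent components, `k`, and initialize the model. Then call `fit_transform` on `ratings`, which fits the factorization and returns the beer-side factor matrix. Keep in mind that NMF treats the zero-filled cells as actual ratings of zero, not as missing values.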

In [4]:
%%time
k = 5
model = NMF(n_components=k)
nmf_features = model.fit_transform(ratings)

Wall time: 42.6 s
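
To give a feel for what a non-negative factorization computes, here is a minimal NumPy sketch using the classic multiplicative update rules for the Frobenius-norm objective. This is not scikit-learn's implementation (its default solver is coordinate descent); the function `nmf_multiplicative`, the matrix `V`, and all parameter values are illustrative assumptions.

```python
import numpy as np

def nmf_multiplicative(V, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative V (M x N) as W (M x k) @ H (k x N)."""
    rng = np.random.default_rng(seed)
    M, N = V.shape
    W = rng.random((M, k))
    H = rng.random((k, N))
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative;
        # eps guards against division by zero.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical 4 x 3 ratings matrix with zeros for missing reviews.
V = np.array([[4.0, 0.0, 3.5],
              [0.0, 4.25, 0.0],
              [3.75, 0.0, 4.0],
              [0.0, 3.5, 0.0]])

W, H = nmf_multiplicative(V, k=2)
error = np.linalg.norm(V - W @ H)  # Frobenius norm of the residual
print(W.shape, H.shape, error)
```

Here `W` plays the role of the beer-side factors and `H` the user-side factors; the zeros in `V` are fitted like any other value, exactly as in the cell above.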


`fit_transform` returns the beer-side factors, stored here as the `beer_feat` table: one row per beer and one column per latent variable.

In [5]:
beer_feat = pd.DataFrame(nmf_features, index=ratings.index)
print('beer_feat is an M x k matrix, where M={0} and k={1}'.format(beer_feat.shape[0], beer_feat.shape[1]))
beer_feat.head()

beer_feat is an M x k matrix, where M=708 and k=5


,0,1,2,3,4
name
(New) English Bulldog Hazy IPA,0.067242,0.044476,0.029721,0.002607,0.044691
01 18 Off-Tempo DIPA,0.011612,0.033121,0.0,0.059883,0.025388
04609 Double IPA,9e-05,0.002727,0.020308,0.038381,0.048067
06 18 Off Tempo DIPA 2,5.1e-05,0.000187,0.0,0.000671,0.000602
077XX,0.0,0.020628,0.0,0.0,0.192855


The fitted model's `components_` attribute holds the user-side factors, stored here as the `user_feat` table: one row per latent variable and one column per user.

In [6]:
user_feat = pd.DataFrame(model.components_, columns=ratings.columns)
print('user_feat is a k x N matrix, where k={0} and N={1}'.format(user_feat.shape[0], user_feat.shape[1]))
user_feat.head()

user_feat is a k x N matrix, where k=5 and N=78862


user_id,--------,--JFG--,-Alix-,-Chubbs-,-Hammer-,-Jamin,-MOTA-,-Piels-,-TheDude-,-Z-inNYC,...,zyankali7,zychr,zygspytz,zymman,zymurgeek,zymurgenius,zysurge,zytle,zzzigga,zzzzbeer
0,0.0,0.0,6e-05,0.000332,0.000219,0.000176,4e-06,0.000654,0.000136,0.000143,...,0.0,0.000191,0.0,0.00013,0.0,0.0,0.0,0.000593,0.0,0.000152
1,0.000741,0.000865,2e-05,0.0,0.0,0.000894,4.5e-05,2e-06,0.000543,0.002476,...,0.0,0.001794,0.001165,4.7e-05,0.001498,0.000698,4.157008e-07,0.000669,0.0,0.0
2,4.9e-05,0.000564,0.0,4.3e-05,0.0,0.000216,0.000136,1.8e-05,7.7e-05,0.001369,...,0.0,0.001033,0.001443,8e-05,1e-05,4.7e-05,0.0,0.009034,0.370389,0.0
3,2e-05,0.0,0.002451,0.0,2.5e-05,0.002428,0.000235,0.000143,7e-05,0.000604,...,0.0,0.0,0.005096,0.00011,0.0,1.9e-05,0.0,0.000509,0.0,0.0
4,0.003694,0.006571,0.000286,0.004598,0.006935,0.00236,0.00137,0.00019,0.000407,0.005697,...,0.048457,0.006876,0.006879,0.000851,0.002201,0.003476,5.802541e-05,0.003352,0.0,0.008542


Since `beer_feat` is M x k and `user_feat` is k x N, their matrix product is M x N: a reconstruction of `ratings`, called `predictions`. Each entry is rounded to the nearest 0.25, the step size of the ratings shown earlier.

In [7]:
# Multiply the factor matrices, then round every entry to the nearest 0.25.
predictions = beer_feat.dot(user_feat).apply(lambda x: 0.25 * np.round(x/0.25))
predictions.head()

user_id,--------,--JFG--,-Alix-,-Chubbs-,-Hammer-,-Jamin,-MOTA-,-Piels-,-TheDude-,-Z-inNYC,...,zyankali7,zychr,zygspytz,zymman,zymurgeek,zymurgenius,zysurge,zytle,zzzigga,zzzzbeer
name
(New) English Bulldog Hazy IPA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
01 18 Off-Tempo DIPA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
04609 Double IPA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
06 18 Off Tempo DIPA 2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
077XX,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
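
The product-and-round step can be checked by hand on small matrices. The values of `W` and `H` below are made up for illustration; the rounding expression is the same one used in the cell above. Any raw value below 0.125 rounds to 0, which would be consistent with the all-zero rows shown in the output.

```python
import numpy as np

# Hypothetical factor matrices: W is M x k (2 x 2), H is k x N (2 x 3).
W = np.array([[0.5, 1.0],
              [2.0, 0.0]])
H = np.array([[1.0, 0.2, 0.1],
              [3.1, 0.0, 0.05]])

raw = W @ H  # M x N reconstruction, here 2 x 3

# Divide by the step, round to the nearest integer, scale back.
rounded = 0.25 * np.round(raw / 0.25)

print(raw)
print(rounded)
```

One detail to be aware of: `np.round` rounds exact halves to the nearest even integer, so a raw value sitting exactly halfway between two quarter-point steps may round down rather than up.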


In [8]:
# Summary statistics of the predicted ratings for a single user.
predictions.LetThereBeR0ck.describe()

count    708.0
mean       0.0
std        0.0
min        0.0
25%        0.0
50%        0.0
75%        0.0
max        0.0
Name: LetThereBeR0ck, dtype: float64

In [9]:
# Summary statistics of the same user's actual (zero-filled) ratings.
ratings.LetThereBeR0ck.describe()

count    708.000000
mean       0.006003
std        0.159725
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        4.250000
Name: LetThereBeR0ck, dtype: float64