# Collaborative Filtering with Deep Learning (Matrix Factorization)

A popular collaborative filtering method that leverages deep learning is Matrix Factorization. The idea is that the matrix holding each user's rating (or interaction score) for each item can be factored into two matrices whose dimensions satisfy (n x l) * (l x m) = (n x m). Here n and m are the number of users and items respectively, and l is the size of the latent feature space, which we are free to choose. Because this is plain linear algebra, it maps naturally onto a trainable deep learning model. What's more, additional hidden layers can be added, which takes the model from Generalized Matrix Factorization to Neural Matrix Factorization.
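
To make the shape relationship concrete, here is a minimal NumPy sketch. The sizes and random factor values are toy assumptions for illustration only and do not come from the Steam data.

```python
import numpy as np

rng = np.random.default_rng(0)

# n users, m items, l latent features (toy sizes chosen for illustration)
n_users, n_items, latent_dim = 4, 6, 3

# Factor matrices: one row of latent features per user, one column per item
user_factors = rng.normal(size=(n_users, latent_dim))   # (n x l)
item_factors = rng.normal(size=(latent_dim, n_items))   # (l x m)

# Their product is a full user-by-item score matrix
scores = user_factors @ item_factors                    # (n x m)
print(scores.shape)

# A single predicted score is the dot product of one user row and one item column
u, i = 2, 5
single = user_factors[u] @ item_factors[:, i]
print(np.isclose(single, scores[u, i]))
```

In a trained model the two factor matrices are learned so that the product approximates the observed interactions. Here they are random, so only the shapes are meaningful.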

Using the matrix built in the third notebook, a deep learning model will be trained as a proof of concept. Since this will not be fully productionized, it will be built with Keras, the convenient Python deep learning framework that can run on a TensorFlow backend.

Sources

[Functional API example](https://keras.io/guides/functional_api/)

[Article example](https://towardsdatascience.com/neural-collaborative-filtering-96cef1009401)

[Second simple example with PCA Viz](https://towardsdatascience.com/building-a-book-recommendation-system-using-keras-1fba34180699)

In [8]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

print("numpy version: ", np.__version__)
print("tensorflow version: ", tf.__version__)
print("keras version: ", keras.__version__)

numpy version:  1.19.1
tensorflow version:  1.14.0
keras version:  2.2.4-tf


The sparse matrix from the third notebook will be used for the Matrix Factorization (MF).

In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import sparse
from scipy.special import expit
from timeit import default_timer

dtype_dict_games = {'buyer-count': 'int64',
 'player-count': 'int64',
 'accumulated-hours-played': 'float64',
 'player-frac-of-buyer': 'float64',
 'avg-hours-played': 'float64',
 'std-hours-played': 'float64'}

dtype_dict_users = {'purchased-game-count': 'int64',
 'played-game-count': 'int32',
 'played-hours-count': 'float64',
 'purhased-gametitles-list': 'object',
 'played-gametitles-list': 'object',
 'percent-library-played': 'float64',
 'played-hours-avg': 'float64',
 'played-hours-std': 'float64',
 'played-hours-max': 'float64',
 'most-played-game': 'object'}

games_stats_df = pd.read_csv('./steam_game_aggregate_data.CSV', index_col= 0,dtype=dtype_dict_games)
users_stats_df = pd.read_csv('./steam_user_aggregate_data.CSV', index_col= 0,dtype=dtype_dict_users)


#names to assign to columns
column_names = ['user-id','game-title','behavior-name','value']

#dtypes to assign
dtypes = {'user-id':int, 'game-title':str, 'behavior-name':'category', 'value':np.float64}

#read in data from csv (adjust the path if you want to run this); the unused last column is dropped
df = pd.read_csv('./steam-200k.csv', 
                usecols=[0,1,2,3],
                names=column_names,
                dtype=dtypes)

play_mask = df['behavior-name'] == 'play'
play_df = df[play_mask]

full_game_list = games_stats_df.index.to_list()
full_user_list = users_stats_df.index.to_list()

games_played_matrix = pd.DataFrame(columns=full_game_list, index=full_user_list, dtype=np.float64)

for ind, row in play_df.iterrows():
    games_played_matrix.at[int(row['user-id']), row['game-title']] = 1
    
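#cells that were never set are NaN; set them to 0 so only played games are stored as nonzero entries
games_played_matrix = games_played_matrix.fillna(0)

#transpose so that rows are games and columns are users
games_played_matrix_sparse = sparse.csr_matrix(games_played_matrix.values.T)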
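
The cell above fills a dense users-by-games DataFrame first and converts it afterwards. As an alternative sketch, the same games-by-users matrix could be assembled directly in sparse form from the (game, user) pairs, which avoids allocating the dense table. The toy frame and the `toy_*` names below are hypothetical stand-ins for `play_df`, `full_game_list` and `full_user_list`.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical toy play log standing in for play_df
toy_play_df = pd.DataFrame({
    "user-id": [10, 10, 20, 30],
    "game-title": ["Dota 2", "Portal", "Portal", "Dota 2"],
})
toy_game_list = ["Dota 2", "Portal", "Terraria"]  # stands in for full_game_list
toy_user_list = [10, 20, 30, 40]                  # stands in for full_user_list

# Map each label to its row (game) or column (user) position
game_pos = pd.Index(toy_game_list).get_indexer(toy_play_df["game-title"])
user_pos = pd.Index(toy_user_list).get_indexer(toy_play_df["user-id"])

# Collapse repeated (game, user) pairs so each played cell is exactly 1
pairs = np.unique(np.column_stack([game_pos, user_pos]), axis=0)

# Build the games-by-users matrix from coordinates only
toy_sparse = sparse.csr_matrix(
    (np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])),
    shape=(len(toy_game_list), len(toy_user_list)),
)
print(toy_sparse.shape)
print(toy_sparse.toarray())
```

Note that `get_indexer` returns -1 for labels missing from the index, so real data would need a check that every title and user id is present before building the matrix.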

In [4]:
games_played_matrix_sparse.shape

(5155, 12393)
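
The shape is (games, users), so row indices are game ids and column indices are user ids. The model defined further down takes one user id and one game id per example, so the matrix has to be unrolled into index pairs at some point. The sketch below shows one way this might be done on a toy matrix. Sampling empty cells as negative examples is an assumption on my part and is not shown in this notebook.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Toy games-by-users matrix standing in for games_played_matrix_sparse
toy = sparse.csr_matrix(np.array([[1, 0, 1, 0],
                                  [1, 1, 0, 0],
                                  [0, 0, 0, 1]], dtype=np.float64))
num_games, num_users = toy.shape

# Positive examples: every stored (game, user) interaction
pos_games, pos_users = toy.nonzero()

# Negative examples (assumption): random cells, keeping only the empty ones
n_candidates = 4 * len(pos_games)
cand_games = rng.integers(0, num_games, size=n_candidates)
cand_users = rng.integers(0, num_users, size=n_candidates)
is_empty = np.asarray(toy[cand_games, cand_users]).ravel() == 0
neg_games, neg_users = cand_games[is_empty], cand_users[is_empty]

# Stack into the id arrays and 0/1 labels a two-input model would consume
game_ids = np.concatenate([pos_games, neg_games])
user_ids = np.concatenate([pos_users, neg_users])
labels = np.concatenate([np.ones(len(pos_games)), np.zeros(len(neg_games))])
print(len(pos_games), len(neg_games))
```

The ratio of negatives to positives (here up to four candidates per positive) is a tunable choice rather than a fixed rule.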

In [5]:
#Constant parameters for model
num_games, num_users = games_played_matrix_sparse.shape
latent_dim = 100
fc_dim = 50

In [6]:
#design input layer
user_input = keras.Input(shape = (1,), name="user")
game_input = keras.Input(shape = (1,), name="game")

#create an embedding (latent vector) for each user and for each game
user_embeddings = layers.Embedding(input_dim = num_users, output_dim = latent_dim, name = 'user_embedding',
                                  embeddings_initializer="uniform", input_length=1)(user_input)
game_embeddings = layers.Embedding(input_dim = num_games, output_dim = latent_dim, name = 'game_embedding',
                                  embeddings_initializer="uniform", input_length=1)(game_input)

#add dropout to both embedding layers
user_dropout = layers.Dropout(rate=.5, name='user_dropout')(user_embeddings)
game_dropout = layers.Dropout(rate=.5, name='game_dropout')(game_embeddings)

#flatten each embedding output into a one-dimensional latent vector
user_latent = layers.Flatten(name = "user_flatten")(user_dropout)
game_latent = layers.Flatten(name = "game_flatten")(game_dropout)

#element-wise product of these latent vectors (summing it would give the dot product)
predict_vector = layers.Multiply(name = "elementwise_product")([user_latent, game_latent])

#one hidden fully connected layer after dot product
dense_layer = layers.Dense(fc_dim, name = "fully_connected")(predict_vector)

#dropout after fully connected to help against overfitting
dense_dropout = layers.Dropout(rate=.5, name = "fc_dropout")(dense_layer)

#final dense layer: a single sigmoid unit giving the predicted interaction probability
prediction = layers.Dense(1, activation= 'sigmoid', kernel_initializer= 'lecun_uniform', name = 'prediction')(dense_dropout)

#Build the model
model = keras.Model(inputs = [user_input, game_input], outputs = prediction, name='matrix_factorization')

#print summary
model.summary()

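
The model combines the two latent vectors with `layers.Multiply`, which keeps one value per latent dimension instead of collapsing them into a single number. The NumPy sketch below illustrates how that relates to an ordinary dot product. It is a simplified illustration with hypothetical random weights, and it leaves out the dropout and the intermediate fully connected layer.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 4  # toy size for illustration

user_vec = rng.normal(size=latent_dim)
game_vec = rng.normal(size=latent_dim)

# What an element-wise Multiply computes: one value per latent dimension
elementwise = user_vec * game_vec

# Plain matrix factorization sums those values with equal weights (a dot product)
dot_score = elementwise.sum()
print(np.isclose(dot_score, user_vec @ game_vec))

# A dense output unit instead applies a weight per dimension plus a bias
weights = rng.normal(size=latent_dim)  # hypothetical learned weights
bias = 0.1                             # hypothetical learned bias
logit = elementwise @ weights + bias
prob = 1.0 / (1.0 + np.exp(-logit))    # sigmoid, as in the final layer
print(prob)

# With all weights equal to 1 and no bias, the dense unit reduces to the dot product
print(np.isclose(elementwise @ np.ones(latent_dim), dot_score))
```

This is why the element-wise version is described as a generalization: the dot product is the special case in which every latent dimension is weighted equally.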
