# Dependencies 

In [1]:
import numpy as np
import pandas as pd
import scipy
import scipy.optimize
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error

%matplotlib inline

# Dataset 

I downloaded the [Steam Video Games Dataset](https://www.kaggle.com/tamber/steam-video-games/downloads/steam-video-games.zip/3) from Kaggle, which contains data about users and games from Steam. **The dataset is already clean**, so no further preprocessing is required. The dataset has 200 thousand rows. Each row has 5 columns, of which we load the first 4:

- `user_id`: identifies the user
- `game`: identifies the game
- `behavior`: indicates whether the user purchased or played the game
- `hours_or_behavior`: number of hours played if behavior is `play`; `1.0` if behavior is `purchase`

A full description of the dataset is available on the [dataset page on Kaggle](https://www.kaggle.com/tamber/steam-video-games/downloads/steam-video-games.zip/3).

In [2]:
df = pd.read_csv('data/steam-200k.csv', sep=',', usecols=range(4))
print(df.shape)

df.head(10)

(200000, 4)


Unnamed: 0,user_id,game,behavior,hours_or_behavior
0,151603712,The Elder Scrolls V Skyrim,purchase,1.0
1,151603712,The Elder Scrolls V Skyrim,play,273.0
2,151603712,Fallout 4,purchase,1.0
3,151603712,Fallout 4,play,87.0
4,151603712,Spore,purchase,1.0
5,151603712,Spore,play,14.9
6,151603712,Fallout New Vegas,purchase,1.0
7,151603712,Fallout New Vegas,play,12.1
8,151603712,Left 4 Dead 2,purchase,1.0
9,151603712,Left 4 Dead 2,play,8.9


Since the goal of this project is to predict a rating given a user and an item (in my case, I'll predict the number of hours a user may play a game), we will consider only the rows where `behavior == play`.

In [3]:
df = df[df.behavior == 'play']
print(df.shape)

df.head(10)

(70489, 4)


Unnamed: 0,user_id,game,behavior,hours_or_behavior
1,151603712,The Elder Scrolls V Skyrim,play,273.0
3,151603712,Fallout 4,play,87.0
5,151603712,Spore,play,14.9
7,151603712,Fallout New Vegas,play,12.1
9,151603712,Left 4 Dead 2,play,8.9
11,151603712,HuniePop,play,8.5
13,151603712,Path of Exile,play,8.1
15,151603712,Poly Bridge,play,7.5
17,151603712,Left 4 Dead,play,3.3
19,151603712,Team Fortress 2,play,2.8


Now, let's see how many unique users and games we have:

In [4]:
users = df.user_id.unique()
games = df.game.unique()

n_users = len(users)
n_games = len(games)

print('Unique users:', n_users)
print('Unique games:', n_games)

Unique users: 11350
Unique games: 3600


# Model 

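It's time to build the model. We're going to develop a **Latent Factor-Based Recommender**, here restricted to a global mean plus one bias term per user and per game, and fit it with a **gradient-based optimizer** (L-BFGS from *scipy*).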

So, we'll start by defining the parameters of our model:

In [5]:
alpha = df.hours_or_behavior.mean() # Mean of Playing time
userBiases = defaultdict(float)
gameBiases = defaultdict(float)

y_true = df.hours_or_behavior.values

and the function that predicts the number of hours a user plays a specific game:

In [6]:
def predict(user, game):
    return alpha + userBiases[user] + gameBiases[game]
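
Written out, `predict` is a bias-only baseline. As a sketch in notation of my own (the symbols $\beta_u$ and $\beta_i$ stand for `userBiases` and `gameBiases`, $h_{u,i}$ for the observed hours, $\lambda$ for the regularization weight `lamb`, and $N$ for the number of play rows), the prediction, the regularized objective, and its gradient are:

$$\hat{h}_{u,i} = \alpha + \beta_u + \beta_i$$

$$J(\alpha, \beta) = \frac{1}{N} \sum_{(u,i)} \left(\hat{h}_{u,i} - h_{u,i}\right)^2 + \lambda \left( \sum_u \beta_u^2 + \sum_i \beta_i^2 \right)$$

$$\frac{\partial J}{\partial \alpha} = \frac{2}{N} \sum_{(u,i)} \left(\hat{h}_{u,i} - h_{u,i}\right), \qquad \frac{\partial J}{\partial \beta_u} = \frac{2}{N} \sum_{i \,:\, (u,i)} \left(\hat{h}_{u,i} - h_{u,i}\right) + 2 \lambda \beta_u$$

The derivative with respect to $\beta_i$ has the same form, summing over the users who played game $i$. Note that $\alpha$ is not regularized.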

The methods below are necessary to optimize our model using *scipy*:

In [7]:
def unpack(theta):
    global alpha
    global userBiases
    global gameBiases
    alpha = theta[0]
    userBiases = dict(zip(users, theta[1:n_users + 1]))
    gameBiases = dict(zip(games, theta[n_users + 1:]))
    return alpha, userBiases, gameBiases

In [8]:
def cost(theta, labels, lamb):
    # Regularized objective: MSE plus an L2 penalty on the user and game biases
    alpha, userBiases, gameBiases = unpack(theta)
    y_pred = np.array([predict(row.user_id, row.game) for _, row in df.iterrows()])
    mse = mean_squared_error(labels, y_pred)
    print('MSE =', mse)

    penalty = lamb * sum(b ** 2 for b in userBiases.values())
    penalty += lamb * sum(b ** 2 for b in gameBiases.values())
    return mse + penalty
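
Looping over `df.iterrows()` on every call is slow, since the optimizer calls `cost` many times. Below is a minimal sketch of a vectorized alternative, assuming users and games are first mapped to integer indices with `LabelEncoder` (already imported above). The toy arrays and the names `cost_vectorized`, `u_idx`, and `g_idx` are hypothetical and not part of the notebook. Note that `LabelEncoder` sorts the labels, so the order of the biases inside `theta` would differ from the `unique()` order used by `unpack`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the play rows (hypothetical values, not from the dataset)
toy_users = np.array([10, 10, 20, 30])
toy_games = np.array(['A', 'B', 'A', 'C'])
toy_hours = np.array([5.0, 1.0, 3.0, 2.0])

# Map raw ids to 0..n-1 so the biases can live in plain arrays
u_idx = LabelEncoder().fit_transform(toy_users)
g_idx = LabelEncoder().fit_transform(toy_games)
n_u, n_g = u_idx.max() + 1, g_idx.max() + 1

def cost_vectorized(theta, lamb):
    # theta layout: [alpha, user biases..., game biases...]
    a = theta[0]
    bu = theta[1:n_u + 1]
    bg = theta[n_u + 1:]
    y_pred = a + bu[u_idx] + bg[g_idx]  # one prediction per row, no Python loop
    mse = np.mean((y_pred - toy_hours) ** 2)
    return mse + lamb * (np.sum(bu ** 2) + np.sum(bg ** 2))

# With zero biases the cost reduces to the MSE of always predicting the mean
theta0 = np.concatenate(([toy_hours.mean()], np.zeros(n_u + n_g)))
baseline_cost = cost_vectorized(theta0, 0.001)
```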

In [9]:
def derivative(theta, labels, lamb):
    alpha, userBiases, gameBiases = unpack(theta)

    dalpha = 0
    dUserBiases = defaultdict(float)
    dGameBiases = defaultdict(float)
    
    N = df.shape[0]
    for _, row in df.iterrows():
        u, i = row.user_id, row.game
        
        pred = predict(u, i)
        diff = pred - row.hours_or_behavior
        
        dalpha += (2 / N) * diff
        dUserBiases[u] += (2 / N) * diff
        dGameBiases[i] += (2 / N) * diff
        
    for u in userBiases:
        dUserBiases[u] += 2*lamb*userBiases[u]
        
    for i in gameBiases:
        dGameBiases[i] += 2*lamb*gameBiases[i]
        
    dtheta = [dalpha] + [dUserBiases[u] for u in users] + [dGameBiases[i] for i in games]
    return np.array(dtheta)
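
A hand-written gradient is easy to get subtly wrong, so it can be worth comparing it against finite differences before running the optimizer. The sketch below does that on toy data with hypothetical names (`cost_fn`, `grad_fn`, `split`), and accumulates the per-user and per-game sums with `np.bincount` instead of a row loop. It assumes users and games are already encoded as integer indices.

```python
import numpy as np

# Toy data (hypothetical values); users and games are integer indices
u_idx = np.array([0, 0, 1, 2])
g_idx = np.array([0, 1, 0, 2])
hours = np.array([5.0, 1.0, 3.0, 2.0])
n_u, n_g, N = 3, 3, len(hours)

def split(theta):
    # theta layout: [alpha, user biases..., game biases...]
    return theta[0], theta[1:n_u + 1], theta[n_u + 1:]

def cost_fn(theta, lamb):
    a, bu, bg = split(theta)
    diff = a + bu[u_idx] + bg[g_idx] - hours
    return np.mean(diff ** 2) + lamb * (np.sum(bu ** 2) + np.sum(bg ** 2))

def grad_fn(theta, lamb):
    a, bu, bg = split(theta)
    diff = a + bu[u_idx] + bg[g_idx] - hours
    d_alpha = (2 / N) * diff.sum()
    # bincount sums the residuals of all rows that share a user (or game)
    d_bu = (2 / N) * np.bincount(u_idx, weights=diff, minlength=n_u) + 2 * lamb * bu
    d_bg = (2 / N) * np.bincount(g_idx, weights=diff, minlength=n_g) + 2 * lamb * bg
    return np.concatenate(([d_alpha], d_bu, d_bg))

# Central finite differences, one coordinate at a time
rng = np.random.default_rng(0)
theta = rng.normal(size=1 + n_u + n_g)
eps = 1e-6
numeric = np.array([
    (cost_fn(theta + eps * e, 0.1) - cost_fn(theta - eps * e, 0.1)) / (2 * eps)
    for e in np.eye(len(theta))
])
max_gap = np.max(np.abs(numeric - grad_fn(theta, 0.1)))
```

A small `max_gap` suggests that the analytic gradient matches the objective.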

Before we optimize our model, let's see how a model which always predicts the mean would perform in terms of **MSE**:

In [10]:
alwaysMean = np.full(df.shape[0], alpha)

mean_squared_error(y_true, alwaysMean)

52593.90432988401

Now, it's time to optimize our model:

In [11]:
scipy.optimize.fmin_l_bfgs_b(cost, [alpha] + [0.0]*(n_users + n_games), derivative, args=(y_true, 0.001))

MSE = 52593.90432988401
MSE = 52570.829221584085
MSE = 52479.725023696454
MSE = 52134.44799713846
MSE = 55061.684182216275
MSE = 51102.321611387495
MSE = 50016.470698536745
MSE = 49949.69311110058
MSE = 49690.25116865041
MSE = 49354.29032047057
MSE = 48974.94424641083
MSE = 48728.59649079929
MSE = 48546.09727736406
MSE = 48415.117314709394
MSE = 48222.783510080524
MSE = 47943.49784074479
MSE = 47696.67828279402
MSE = 47585.32956433661
MSE = 47440.8876610729
MSE = 47381.185300932964
MSE = 47372.562229249896
MSE = 47346.29757679069
MSE = 47326.24117828224
MSE = 47271.678983623824
MSE = 46652.451326072645
MSE = 47166.127454864036
MSE = 47077.484497642545
MSE = 47103.43028586453
MSE = 47118.647376891946
MSE = 47099.6646700884
MSE = 47087.31993645023
MSE = 47073.88656742347
MSE = 47058.94071183196
MSE = 47046.466454380614
MSE = 47052.259577975565
MSE = 47053.24210076634
MSE = 47056.19028950625
MSE = 47055.280019451275
MSE = 47054.069453605305
MSE = 47052.62871160874
MSE = 47052.33943181173


(array([24.37812654, -6.19801325, -2.7685535 , ..., -0.18402408,
        -0.21197276, -0.2133702 ]),
 48069.18134934699,
 {'grad': array([-8.05700790e-03, -7.26075680e-06,  6.14640380e-06, ...,
         -5.88787711e-07, -5.32704199e-07, -5.29900023e-07]),
  'task': b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH',
  'funcalls': 61,
  'nit': 51,
  'warnflag': 0})
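
`fmin_l_bfgs_b` returns a tuple of the optimal parameters, the objective value at that point, and an information dictionary. The second element is the regularized cost, so it includes the penalty terms and is higher than the printed MSE values. Below is a minimal sketch of capturing the tuple on a toy quadratic (`toy_cost` and `toy_grad` are hypothetical). In the notebook, the first element could be passed to `unpack` so that `predict` uses the returned optimum rather than whichever `theta` was evaluated last.

```python
import numpy as np
import scipy.optimize

def toy_cost(theta):
    # Simple bowl with its minimum at theta = [3, 3]
    return np.sum((theta - 3.0) ** 2)

def toy_grad(theta):
    return 2.0 * (theta - 3.0)

theta_opt, final_cost, info = scipy.optimize.fmin_l_bfgs_b(
    toy_cost, np.zeros(2), toy_grad
)

# warnflag == 0 means the optimizer reported convergence
converged = info['warnflag'] == 0
```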

# Analysis 

Let's compute the prediction for each user and game in our dataset. Then, we'll visualize the first 50 rows:

In [12]:
df_pred = df.copy()
df_pred['pred'] = [predict(row.user_id, row.game) for _, row in df.iterrows()]
df_pred.head(50)

Unnamed: 0,user_id,game,behavior,hours_or_behavior,pred
1,151603712,The Elder Scrolls V Skyrim,play,273.0,91.627999
3,151603712,Fallout 4,play,87.0,46.880483
5,151603712,Spore,play,14.9,21.788242
7,151603712,Fallout New Vegas,play,12.1,40.384515
9,151603712,Left 4 Dead 2,play,8.9,33.238064
11,151603712,HuniePop,play,8.5,17.319254
13,151603712,Path of Exile,play,8.1,39.401232
15,151603712,Poly Bridge,play,7.5,15.878643
17,151603712,Left 4 Dead,play,3.3,31.21592
19,151603712,Team Fortress 2,play,2.8,66.967032


Let's compare the mean of hours in the dataset with the mean of predicted hours by our model: 

In [13]:
print('Mean of playing hours in original dataset:', df.hours_or_behavior.mean())
print('Mean of playing hours by our model:', df_pred.pred.mean())

Mean of playing hours in original dataset: 48.8780632439104
Mean of playing hours by our model: 48.87403473995844


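The two means are nearly identical, and the MSE only dropped from about 52,594 (always predicting the mean) to about 47,052, so our model did not improve much over the mean. **To improve it, we could use a Complete Latent Factor Model**.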
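
A complete latent factor model adds an interaction term to the biases: each user and each game gets a small vector, and their dot product captures how well they match. The sketch below shows only the prediction step, under assumptions of my own: the names `predict_latent`, `user_factors`, and `game_factors`, the dimension `K`, and all values are hypothetical, and training the factors is not shown.

```python
import numpy as np

K = 2  # number of latent dimensions (hypothetical choice)
rng = np.random.default_rng(0)

alpha_toy = 48.9  # hypothetical global mean
user_bias = {'u1': -6.0}
game_bias = {'g1': 10.0}
# Small random vectors stand in for factors that would be learned
user_factors = {'u1': rng.normal(scale=0.1, size=K)}
game_factors = {'g1': rng.normal(scale=0.1, size=K)}

def predict_latent(user, game):
    # Bias-only baseline plus the user-game interaction term
    interaction = np.dot(user_factors[user], game_factors[game])
    return alpha_toy + user_bias[user] + game_bias[game] + interaction

baseline = alpha_toy + user_bias['u1'] + game_bias['g1']
with_factors = predict_latent('u1', 'g1')
```

Unlike the bias-only model, the interaction term lets the predicted ordering of games differ from one user to another.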

Finally, it's time to see a recommendation from our model. We'll use `user_id = 151603712` as an example:

In [14]:
df[df.user_id == 151603712]

Unnamed: 0,user_id,game,behavior,hours_or_behavior
1,151603712,The Elder Scrolls V Skyrim,play,273.0
3,151603712,Fallout 4,play,87.0
5,151603712,Spore,play,14.9
7,151603712,Fallout New Vegas,play,12.1
9,151603712,Left 4 Dead 2,play,8.9
11,151603712,HuniePop,play,8.5
13,151603712,Path of Exile,play,8.1
15,151603712,Poly Bridge,play,7.5
17,151603712,Left 4 Dead,play,3.3
19,151603712,Team Fortress 2,play,2.8


Using `game = 'Grand Theft Auto IV'` as an example, our model predicts the following:

In [23]:
print('Original value:', df[(df.user_id == 151603712) & (df.game == 'Grand Theft Auto IV')].hours_or_behavior.values)
print('Predicted value: ', predict(151603712, 'Grand Theft Auto IV'))

Original value: [0.6]
Predicted value:  15.610165420238825


Let's test our model on **GTA V**, a game this user has not played:

In [24]:
predict(151603712, 'Grand Theft Auto V')

65.63684380112471

If we consider, for example, that the user played **Fallout 3** for 0.8 hours and **Fallout 4** for 87 hours, our predictions for GTA IV and GTA V make some sense.
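
To turn single predictions into a recommendation list, one option is to score every game the user has not played and keep the highest scores. The sketch below assumes the same `alpha + user bias + game bias` form as `predict`. The function `recommend` and the toy values are hypothetical.

```python
def recommend(user, played, alpha, user_bias, game_bias, top_n=3):
    # Score only the games the user has not played yet
    candidates = [g for g in game_bias if g not in played]
    scored = [
        (g, alpha + user_bias.get(user, 0.0) + game_bias[g])
        for g in candidates
    ]
    # Highest predicted hours first
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Toy parameters (hypothetical values, not the fitted ones)
toy_alpha = 10.0
toy_user_bias = {'u1': -2.0}
toy_game_bias = {'A': 5.0, 'B': -1.0, 'C': 3.0, 'D': 0.5}

top = recommend('u1', played={'A'}, alpha=toy_alpha,
                user_bias=toy_user_bias, game_bias=toy_game_bias, top_n=2)
```

With a bias-only model the user bias shifts every score by the same amount, so the ranking of unplayed games is the same for all users and is driven entirely by the game biases.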

I hope you enjoyed it!