In the previous experiment, we used only the purchase information. Things get a bit more complicated when we factor in the playtime data. Why? Because in the binary treatment, we were only interested in whether the user owns a game (1) or not (0). Now, however, we are interested in how much the player liked a game, and putting a 0 would mean the user hated it. So we have to treat the ratings of unplayed games as missing values that we want to fill.
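
To make the difference concrete, here is a minimal sketch with made-up ratings for a single user (the numbers are hypothetical and not taken from the dataset). It compares the user's average rating when unplayed games are filled with 0 against the average when they are treated as missing.

```python
import numpy as np

# Hypothetical ratings of one user over five games; only two were played.
ratings_zero_filled = np.array([8.0, 0.0, 0.0, 6.0, 0.0])
ratings_missing = np.array([8.0, np.nan, np.nan, 6.0, np.nan])

# Zero-filling drags the average down, as if the unplayed games were disliked.
mean_zero_filled = ratings_zero_filled.mean()

# Treating unplayed games as missing averages only over the observed ratings.
mean_missing = np.nanmean(ratings_missing)

print(mean_zero_filled, mean_missing)
```

The zero-filled average is far lower than the average over the two games the user actually played, which is the distortion we want to avoid.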

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from scipy.sparse import coo_matrix

Fortunately, we already converted the implicit ratings (play times) into explicit ratings between 1 and 10 in the exp002 notebook. Let's just load the data.

In [2]:
data = pd.read_parquet("../dat/play_with_ratings.parquet")

In [4]:
data

Unnamed: 0,user_enc,game_enc,userId,game,actionValue,rating
0,2166,3036,151603712,The Elder Scrolls V Skyrim,273.0,10.00
1,2166,1154,151603712,Fallout 4,87.0,9.64
2,2166,2785,151603712,Spore,14.9,8.92
3,2166,1155,151603712,Fallout New Vegas,12.1,8.20
4,2166,1719,151603712,Left 4 Dead 2,8.9,7.48
...,...,...,...,...,...,...
61263,1906,1147,128470551,Fallen Earth,2.4,1.00
61264,1906,1818,128470551,Magic Duels,2.2,2.80
61265,1906,3189,128470551,Titan Souls,1.5,6.40
61266,1906,1365,128470551,Grand Theft Auto Vice City,1.5,4.00


It still has the user and game encodings we used previously, but I will run the encoders again and overwrite them.

In [5]:
user_encoder = OrdinalEncoder(dtype=np.int64)
game_encoder = OrdinalEncoder(dtype=np.int64)

In [6]:
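# OrdinalEncoder takes a 2D input and returns a 2D column, so flatten it before assigning.
data["user_enc"] = user_encoder.fit_transform(np.array(data.userId)[:, np.newaxis]).ravel()
data["game_enc"] = game_encoder.fit_transform(np.array(data.game)[:, np.newaxis]).ravel()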
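
What the encoders do here is map each distinct id to a contiguous integer index that can be used as a row or column position. As a rough illustration, the same kind of mapping can be reproduced with plain NumPy; this sketch swaps in `np.unique` instead of `OrdinalEncoder`, and only the first two user ids come from the table above (the third is made up).

```python
import numpy as np

# User ids in the order they appear in the data (the last one is hypothetical).
user_ids = np.array([151603712, 128470551, 151603712, 59945701])

# np.unique returns the sorted distinct ids and, for every original entry,
# the position of its id in that sorted list.
unique_ids, user_codes = np.unique(user_ids, return_inverse=True)

print(unique_ids)
print(user_codes)

# The codes can be mapped back to the original ids at any time.
recovered = unique_ids[user_codes]
```

Because the codes run from 0 to the number of distinct ids minus one, `max() + 1` gives the matrix dimension along that axis.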

Now, let's create the ratings matrix.

In [13]:
# Build the user x game matrix from the ratings; cells without a rating come out as 0.
rating_matrix = coo_matrix(
    (np.array(data.rating), (np.array(data.user_enc), np.array(data.game_enc))),
    shape=(data.user_enc.max() + 1, data.game_enc.max() + 1),
    dtype=np.float64,
).toarray()

In [18]:
rating_matrix[rating_matrix == 0] = np.nan
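
The zero-to-NaN replacement is safe here only because the ratings lie between 1 and 10, so a 0 can never be a real rating. As an alternative sketch (not what this notebook does), the matrix can be filled starting from NaN, which skips the intermediate zeros. The indices and ratings below are hypothetical, and the sketch assumes each (user, game) pair appears at most once.

```python
import numpy as np

# Hypothetical encoded triples: user index, game index, rating.
user_idx = np.array([0, 0, 1, 2])
game_idx = np.array([0, 2, 1, 2])
ratings = np.array([10.0, 7.5, 4.0, 1.0])

# Start from an all-NaN matrix so unrated pairs are missing from the outset.
toy_matrix = np.full((user_idx.max() + 1, game_idx.max() + 1), np.nan)

# Write the observed ratings into their (user, game) cells.
toy_matrix[user_idx, game_idx] = ratings

print(toy_matrix)
```

One difference to keep in mind: `coo_matrix` sums duplicate entries when it is densified, while this index assignment keeps only one of them.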

In [26]:
np.isnan(rating_matrix).sum() / (rating_matrix.shape[0] * rating_matrix.shape[1])

0.9950401624762079

As expected, the matrix is very sparse: about 99.5% of the entries are missing. To make things easier, we could drop the games and users that don't have enough ratings, but I don't want to change the ratings again. It would be as easy as rerunning experiment 2 after changing the threshold, though.
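
For reference, a minimal sketch of what such a filter could look like on a toy frame with the same column names. The threshold `min_ratings` and the data are hypothetical, and this is not the filtering used in experiment 2; it also leaves the existing ratings untouched rather than recomputing them.

```python
import pandas as pd

# Hypothetical toy data using the same column names as `data`.
toy = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 3],
    "game": ["A", "B", "C", "A", "B", "C"],
    "rating": [10.0, 6.4, 2.8, 8.2, 4.0, 1.0],
})

min_ratings = 2  # hypothetical threshold

# Number of ratings per user and per game, aligned with the rows of `toy`.
user_counts = toy.groupby("userId")["rating"].transform("count")
game_counts = toy.groupby("game")["rating"].transform("count")

# Keep only rows whose user and game both meet the threshold.
filtered = toy[(user_counts >= min_ratings) & (game_counts >= min_ratings)]

print(filtered)
```

Note that a single pass does not guarantee the threshold still holds afterwards: in this toy example, game C is left with one rating once user 3 is dropped, so the filter may need to be repeated until nothing changes.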