Matrix Factorization would be a nice complement to the similarity/dissimilarity analysis of users and games. In this notebook, I will see if it is viable to develop a matrix factorization based recommendation model.

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from scipy.sparse import coo_matrix

There are various ways to measure similarity. Each of them needs a function that maps a pair of objects to a similarity score.

The first similarity measure that comes to mind is also popular in the literature: the Jaccard index. It measures the similarity of two sets and is calculated as follows:

Let $S_1$ and $S_2$ be two sets drawn from the same universe of elements.

$$
Sim(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}
$$

To use this for the players and games, the idea is simple. We just treat them as sets.

Let's say we are interested in a game called XYZ. If it is owned by Alice, Bob, and Charlie, we say

$$
S_{XYZ} = \{ Alice, Bob, Charlie \}
$$
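As a minimal sketch of the definition, Python's built-in sets are enough. The second game ABC and its owners are made up for illustration, and returning 0.0 for two empty sets is a convention chosen here, not part of the definition.

```python
def jaccard_sets(s1, s2):
    """Jaccard index of two sets: |intersection| / |union|."""
    union = s1 | s2
    if not union:
        # Both sets are empty, so the index is undefined; 0.0 is an assumed convention.
        return 0.0
    return len(s1 & s2) / len(union)


owners_xyz = {"Alice", "Bob", "Charlie"}
owners_abc = {"Bob", "Charlie", "Dave"}  # hypothetical second game

# Two shared owners (Bob, Charlie) out of four distinct owners.
similarity = jaccard_sets(owners_xyz, owners_abc)
print(similarity)
```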

We will use this on the "purchase" set. However, the additional information we have, the play times, lets us use other similarity functions, such as cosine similarity. For cosine similarity, we treat the "ratings" of a game as a vector with one entry per user, and the cosine similarity tells us how similar two games are. (You can also go the other way and measure how similar two users are.)
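To make the cosine idea concrete, here is a small sketch with NumPy. The play times are invented, and the helper name cosine_similarity is my own; returning 0.0 for an all-zero vector is an assumption rather than a standard rule.

```python
import numpy as np


def cosine_similarity(v1, v2):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    norm = np.linalg.norm(v1) * np.linalg.norm(v2)
    if norm == 0:
        return 0.0
    return float(np.dot(v1, v2) / norm)


# Hypothetical play times (hours) of the same four users for three games.
game_a = np.array([10.0, 0.0, 5.0, 1.0])
game_b = np.array([20.0, 0.0, 10.0, 2.0])  # same profile as game_a, twice the hours
game_c = np.array([0.0, 7.0, 0.0, 0.0])    # played only by the user who skipped game_a

sim_ab = cosine_similarity(game_a, game_b)
sim_ac = cosine_similarity(game_a, game_c)
print(sim_ab, sim_ac)
```

Because cosine similarity only looks at the direction of the vectors, game_a and game_b come out as (numerically) identical even though the absolute hours differ.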

One possible concern with user-user similarity is privacy: how will you explain the reason for a recommendation? For example, some posts on Instagram's discovery tab come with "we recommend this because you follow xyz". So, in that sense, we can conclude that Instagram treats us as the user and the accounts of others as "items". This is kind of creepy, but it would be a lot creepier if it said "you and some random user you don't know have similar tastes".

In [2]:
purchase_data = pd.read_parquet("../dat/steam_purchase.parquet")

In [3]:
purchase_data

Unnamed: 0,userId,game
0,151603712,The Elder Scrolls V Skyrim
1,151603712,Fallout 4
2,151603712,Spore
3,151603712,Fallout New Vegas
4,151603712,Left 4 Dead 2
...,...,...
129506,128470551,Fallen Earth
129507,128470551,Magic Duels
129508,128470551,Titan Souls
129509,128470551,Grand Theft Auto Vice City


In [4]:
user_encoder = OrdinalEncoder(dtype=np.int64)
game_encoder = OrdinalEncoder(dtype=np.int64)

In [5]:
purchase_data["row"] = user_encoder.fit_transform(np.array(purchase_data.userId)[:, np.newaxis])
purchase_data["column"] = game_encoder.fit_transform(np.array(purchase_data.game)[:, np.newaxis])

In [6]:
len(purchase_data)

129511

In [7]:
np.ones(shape=(1, purchase_data.shape[0]))

array([[1., 1., 1., ..., 1., 1., 1.]])

In [8]:
purchase_data.row.max()

12392

In [9]:
purchase_data.column.max()

5154

In [10]:
purchase_data.column.shape

(129511,)

In [11]:
rating_matrix = coo_matrix((np.ones(shape=(len(purchase_data),)), (np.array(purchase_data.row), np.array(purchase_data.column))),
           shape=((purchase_data.row.max() + 1, purchase_data.column.max() + 1)),
           dtype=np.int8).toarray()

In [12]:
rating_matrix

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int8)

In [13]:
purchase_data

Unnamed: 0,userId,game,row,column
0,151603712,The Elder Scrolls V Skyrim,5494,4364
1,151603712,Fallout 4,5494,1678
2,151603712,Spore,5494,3997
3,151603712,Fallout New Vegas,5494,1679
4,151603712,Left 4 Dead 2,5494,2475
...,...,...,...,...
129506,128470551,Fallen Earth,4447,1662
129507,128470551,Magic Duels,4447,2602
129508,128470551,Titan Souls,4447,4585
129509,128470551,Grand Theft Auto Vice City,4447,1979


In [14]:
purchase_data[purchase_data.column == 25]

Unnamed: 0,userId,game,row,column
2657,9823354,3DMark,171,25
6750,1950243,3DMark,47,25
7700,64787956,3DMark,1796,25
10187,78341587,3DMark,2291,25
17839,24469287,3DMark,473,25
31787,11978743,3DMark,227,25
38261,3449240,3DMark,68,25
40704,37422528,3DMark,893,25
52427,8585433,3DMark,148,25
55871,22301321,3DMark,427,25


In [15]:
purchase_data.groupby("game").count().sort_values("userId", ascending=False)

Unnamed: 0_level_0,userId,row,column
game,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Dota 2,4841,4841,4841
Team Fortress 2,2323,2323,2323
Unturned,1563,1563,1563
Counter-Strike Global Offensive,1412,1412,1412
Half-Life 2 Lost Coast,981,981,981
...,...,...,...
EverQuest Seeds of Destruction,1,1,1
Everlasting Summer DLC One pioneer's story,1,1,1
Everyday Genius SquareLogic,1,1,1
EvilQuest,1,1,1


In [16]:
purchase_data[purchase_data.game == "Dota 2"]

Unnamed: 0,userId,game,row,column
21,151603712,Dota 2,5494,1336
40,187131847,Dota 2,7429,1336
601,176410694,Dota 2,6814,1336
602,197278511,Dota 2,8018,1336
604,197455089,Dota 2,8032,1336
...,...,...,...,...
129224,295386628,Dota 2,11766,1336
129225,300991661,Dota 2,12054,1336
129455,99096740,Dota 2,3171,1336
129483,176449171,Dota 2,6820,1336


In [17]:
rating_matrix[:, 10].sum()

13

In [18]:
rating_matrix[:, 2595].sum()

12

In [19]:
rating_matrix[:, 1336].sum()

4841

In [20]:
rating_matrix.shape

(12393, 5155)

Let's take one of the games.

In [21]:
game = rating_matrix[:, 4937]  # games are columns; rows are users

Calculating its similarity with another game:

In [22]:
game2 = rating_matrix[:, 2652]

In [23]:
def jaccard(x1, x2):
    """
    Given two numpy arrays, returns a similarity between 0 and 1. (Float)
    """
    intersection = (x1 & x2).sum()
    union = (x1 | x2).sum()
    similarity = intersection/union
    return similarity

In [24]:
sims = []

for col in range(rating_matrix.shape[1]):
    sims.append(jaccard(rating_matrix[:, 15], rating_matrix[:, col]))
    
sims = np.array(sims)

In [25]:
rating_matrix[:, 15].sum()

1

In [26]:
sims[sims > 0.3]

array([1.        , 1.        , 0.33333333, 0.5       , 0.5       ,
       0.5       , 0.5       , 0.5       , 0.5       , 0.5       ,
       0.5       , 1.        , 0.33333333, 0.33333333, 0.5       ,
       1.        , 0.5       , 0.5       , 0.5       , 0.5       ,
       0.5       , 0.5       , 0.33333333, 0.33333333, 0.33333333,
       0.5       , 1.        , 0.33333333, 1.        , 0.5       ,
       0.33333333, 0.5       , 0.33333333, 0.33333333])
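The values 1, 0.5 and 0.333 are what one would expect when the reference game has a single owner. A toy sketch with invented ownership columns (the names are only for illustration) reproduces the pattern:

```python
import numpy as np


def jaccard_binary(x1, x2):
    """Jaccard similarity of two 0/1 integer arrays of equal length."""
    intersection = (x1 & x2).sum()
    union = (x1 | x2).sum()
    return intersection / union


# Four users; only the first one owns the reference game.
reference = np.array([1, 0, 0, 0], dtype=np.int8)
others = {
    "one owner": np.array([1, 0, 0, 0], dtype=np.int8),
    "two owners": np.array([1, 1, 0, 0], dtype=np.int8),
    "three owners": np.array([1, 1, 1, 0], dtype=np.int8),
}

# The intersection can be at most 1, so the score is 1 / (number of owners of the other game).
toy_sims = {name: jaccard_binary(reference, col) for name, col in others.items()}
print(toy_sims)
```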

So, here we see that if a game has only one buyer, the results can be misleading. Let's drop the games owned by fewer than 20 people:

In [27]:
purchase_data = pd.read_parquet("../dat/steam_purchase.parquet")

In [28]:
owners_per_game = purchase_data.groupby("game").userId.transform("count")  # owner count of each row's game
cleaned_set = purchase_data[owners_per_game >= 20].reset_index(drop=True)


In [29]:
cleaned_set["row"] = user_encoder.fit_transform(np.array(cleaned_set.userId)[:, np.newaxis])
cleaned_set["column"] = game_encoder.fit_transform(np.array(cleaned_set.game)[:, np.newaxis])

In [30]:
rating_matrix = coo_matrix((np.ones(shape=(len(cleaned_set),)), (np.array(cleaned_set.row), np.array(cleaned_set.column))),
           shape=((cleaned_set.row.max() + 1, cleaned_set.column.max() + 1)),
           dtype=np.int8).toarray()

In [31]:
rating_matrix

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int8)

In [32]:
sims = []

for col in range(rating_matrix.shape[1]):
    sims.append(jaccard(rating_matrix[:, 15], rating_matrix[:, col]))
    
sims = np.array(sims)

And for users, you have to iterate through the rows:

In [33]:
sims = []

for row in range(rating_matrix.shape[0]):
    sims.append(jaccard(rating_matrix[15, :], rating_matrix[row, :]))
    
sims = np.array(sims)

Now, how do we apply this to every pair of columns?

In [36]:
df = pd.DataFrame(rating_matrix)

In [59]:
def jaccard(x1, x2):
    """
    Given two numpy arrays, returns a similarity between 0 and 1. (Float)
    """
    x1 = x1.astype(np.int8)
    x2 = x2.astype(np.int8)
    intersection = (x1 & x2).sum()
    union = (x1 | x2).sum()
    similarity = intersection/union
    return similarity

You can pass a custom pairwise function to the pandas correlation method, DataFrame.corr, through its method argument. I don't know how this is implemented behind the scenes, but in my experiments it performed better than explicit for loops.

In [61]:
similarity_matrix = df.corr(method=jaccard)
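As an alternative to calling a Python function for every pair of columns, the same pairwise Jaccard values can be obtained with one matrix product. This is a sketch, not what the notebook ran: the function name pairwise_jaccard and the toy matrix are my own, and assigning 0.0 to pairs with an empty union is an assumption.

```python
import numpy as np


def pairwise_jaccard(binary_matrix):
    """Jaccard similarity between all pairs of columns of a 0/1 matrix."""
    # Widen the dtype so the owner counts cannot overflow int8.
    m = np.asarray(binary_matrix, dtype=np.int64)
    intersection = m.T @ m          # [i, j] = number of users owning both i and j
    owners = m.sum(axis=0)          # number of owners of each column
    union = owners[:, None] + owners[None, :] - intersection
    # Pairs with an empty union get 0.0 instead of a division by zero.
    return np.divide(intersection, union,
                     out=np.zeros(union.shape, dtype=float), where=union > 0)


# Toy user x game matrix: 4 users (rows), 3 games (columns).
toy = np.array([[1, 1, 0],
                [1, 0, 0],
                [0, 1, 1],
                [1, 1, 0]], dtype=np.int8)

toy_similarity = pairwise_jaccard(toy)
print(toy_similarity)
```

On the real data, something like pd.DataFrame(pairwise_jaccard(rating_matrix)) should give a frame comparable to the one produced by df.corr, as long as every game has at least one owner.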

In [62]:
similarity_matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1187,1188,1189,1190,1191,1192,1193,1194,1195,1196
0,1.000000,0.015748,0.000000,0.015625,0.012195,0.007299,0.024194,0.019553,0.085714,0.053254,...,0.007937,0.014815,0.000000,0.023810,0.032787,0.006452,0.008197,0.007937,0.035354,0.032353
1,0.015748,1.000000,0.027397,0.060000,0.022989,0.089286,0.041667,0.014085,0.005319,0.010000,...,0.041667,0.034483,0.045455,0.083333,0.027523,0.025974,0.121951,0.020408,0.024000,0.014815
2,0.000000,0.027397,1.000000,0.013333,0.037037,0.050000,0.028169,0.040134,0.019231,0.016393,...,0.028169,0.050633,0.057471,0.027397,0.015038,0.051546,0.014706,0.013889,0.041379,0.031250
3,0.015625,0.060000,0.013333,1.000000,0.000000,0.050847,0.062500,0.003472,0.010638,0.009901,...,0.020000,0.033898,0.029412,0.127660,0.000000,0.012658,0.021739,0.040816,0.000000,0.003650
4,0.012195,0.022989,0.037037,0.000000,1.000000,0.042553,0.023529,0.090604,0.000000,0.014706,...,0.225352,0.000000,0.019231,0.011364,0.020548,0.017544,0.012195,0.000000,0.044304,0.023026
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1192,0.006452,0.025974,0.051546,0.012658,0.017544,0.060241,0.013158,0.032787,0.023697,0.040650,...,0.013158,0.023529,0.021277,0.000000,0.045113,1.000000,0.013889,0.000000,0.068966,0.052448
1193,0.008197,0.121951,0.014706,0.021739,0.012195,0.057692,0.047619,0.007143,0.005495,0.000000,...,0.000000,0.058824,0.000000,0.045455,0.000000,0.013889,1.000000,0.023256,0.000000,0.003745
1194,0.007937,0.020408,0.013889,0.040816,0.000000,0.017241,0.043478,0.003509,0.010811,0.000000,...,0.000000,0.035714,0.015152,0.086957,0.009174,0.000000,0.023256,1.000000,0.008000,0.003690
1195,0.035354,0.024000,0.041379,0.000000,0.044304,0.030075,0.008000,0.083333,0.019231,0.029070,...,0.041322,0.014925,0.006944,0.007874,0.032967,0.068966,0.000000,0.008000,1.000000,0.093750


In [83]:
similarity_matrix.index = game_encoder.inverse_transform(np.array(similarity_matrix.index)[:, np.newaxis])[:,0]

In [87]:
similarity_matrix.columns = similarity_matrix.index

In [88]:
similarity_matrix

Unnamed: 0,7 Days to Die,8BitBoy,8BitMMO,A Virus Named TOM,A.V.A - Alliance of Valiant Arms,ACE - Arena Cyber Evolution,AI War Fleet Command,APB Reloaded,ARK Survival Evolved,Ace of Spades,...,Xam,Yet Another Zombie Defense,You Have to Win the Game,Zeno Clash,Zombie Panic Source,Zombies Monsters Robots,iBomber Defense Pacific,resident evil 4 / biohazard 4,sZone-Online,theHunter
7 Days to Die,1.000000,0.015748,0.000000,0.015625,0.012195,0.007299,0.024194,0.019553,0.085714,0.053254,...,0.007937,0.014815,0.000000,0.023810,0.032787,0.006452,0.008197,0.007937,0.035354,0.032353
8BitBoy,0.015748,1.000000,0.027397,0.060000,0.022989,0.089286,0.041667,0.014085,0.005319,0.010000,...,0.041667,0.034483,0.045455,0.083333,0.027523,0.025974,0.121951,0.020408,0.024000,0.014815
8BitMMO,0.000000,0.027397,1.000000,0.013333,0.037037,0.050000,0.028169,0.040134,0.019231,0.016393,...,0.028169,0.050633,0.057471,0.027397,0.015038,0.051546,0.014706,0.013889,0.041379,0.031250
A Virus Named TOM,0.015625,0.060000,0.013333,1.000000,0.000000,0.050847,0.062500,0.003472,0.010638,0.009901,...,0.020000,0.033898,0.029412,0.127660,0.000000,0.012658,0.021739,0.040816,0.000000,0.003650
A.V.A - Alliance of Valiant Arms,0.012195,0.022989,0.037037,0.000000,1.000000,0.042553,0.023529,0.090604,0.000000,0.014706,...,0.225352,0.000000,0.019231,0.011364,0.020548,0.017544,0.012195,0.000000,0.044304,0.023026
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Zombies Monsters Robots,0.006452,0.025974,0.051546,0.012658,0.017544,0.060241,0.013158,0.032787,0.023697,0.040650,...,0.013158,0.023529,0.021277,0.000000,0.045113,1.000000,0.013889,0.000000,0.068966,0.052448
iBomber Defense Pacific,0.008197,0.121951,0.014706,0.021739,0.012195,0.057692,0.047619,0.007143,0.005495,0.000000,...,0.000000,0.058824,0.000000,0.045455,0.000000,0.013889,1.000000,0.023256,0.000000,0.003745
resident evil 4 / biohazard 4,0.007937,0.020408,0.013889,0.040816,0.000000,0.017241,0.043478,0.003509,0.010811,0.000000,...,0.000000,0.035714,0.015152,0.086957,0.009174,0.000000,0.023256,1.000000,0.008000,0.003690
sZone-Online,0.035354,0.024000,0.041379,0.000000,0.044304,0.030075,0.008000,0.083333,0.019231,0.029070,...,0.041322,0.014925,0.006944,0.007874,0.032967,0.068966,0.000000,0.008000,1.000000,0.093750


In [89]:
similarity_matrix.to_parquet("../dat/purchase_similarities.parquet")

Now, using the similarity matrix we can get the most similar games.

In [90]:
similarity_matrix["Mass Effect"].sort_values(ascending=False)[:20]

Mass Effect                                           1.000000
Mass Effect 2                                         0.537313
BioShock                                              0.207547
Borderlands                                           0.182203
Far Cry 2                                             0.181250
Far Cry 2 Fortunes Pack                               0.181250
Borderlands DLC The Secret Armory of General Knoxx    0.175879
Borderlands DLC The Zombie Island of Dr. Ned          0.175000
Batman Arkham Asylum GOTY Edition                     0.172996
Borderlands DLC Mad Moxxi's Underdome Riot            0.172589
Alan Wake's American Nightmare                        0.167742
The Walking Dead                                      0.165775
Alan Wake                                             0.164948
Red Faction Guerrilla Steam Edition                   0.163121
Borderlands DLC Claptraps New Robot Revolution        0.160622
The Elder Scrolls IV Oblivion                         0

In [91]:
similarity_matrix["Dota 2"].sort_values(ascending=False)[:20]

Dota 2                                          1.000000
Counter-Strike Global Offensive                 0.112831
Team Fortress 2                                 0.112250
Unturned                                        0.096012
Warframe                                        0.089029
Left 4 Dead 2                                   0.076780
Garry's Mod                                     0.061333
Robocraft                                       0.057361
Heroes & Generals                               0.055876
War Thunder                                     0.055588
Counter-Strike Source                           0.052451
The Elder Scrolls V Skyrim                      0.050265
Counter-Strike                                  0.044746
Terraria                                        0.043402
PAYDAY 2                                        0.042962
Portal 2                                        0.042594
Dead Island Epidemic                            0.041717
Borderlands 2                  

In [92]:
similarity_matrix["Counter-Strike"].sort_values(ascending=False)[:20]

Counter-Strike                                  1.000000
Counter-Strike Condition Zero                   0.793224
Counter-Strike Condition Zero Deleted Scenes    0.793224
Deathmatch Classic                              0.612150
Ricochet                                        0.612150
Day of Defeat                                   0.606936
Half-Life                                       0.341963
Half-Life Blue Shift                            0.335159
Half-Life Opposing Force                        0.332606
Team Fortress Classic                           0.332606
Counter-Strike Source                           0.273611
Half-Life 2 Deathmatch                          0.194168
Half-Life 2 Lost Coast                          0.188228
Counter-Strike Global Offensive                 0.182482
Half-Life 2                                     0.171630
Day of Defeat Source                            0.148955
Portal                                          0.136113
Left 4 Dead 2                  

In [93]:
similarity_matrix["Grand Theft Auto San Andreas"].sort_values(ascending=False)[:20]

Grand Theft Auto San Andreas                                1.000000
Grand Theft Auto Vice City                                  0.524590
Grand Theft Auto III                                        0.495775
Sid Meier's Civilization IV                                 0.083168
Sid Meier's Civilization IV Colonization                    0.072805
Sid Meier's Civilization IV Warlords                        0.071579
Sid Meier's Civilization IV Beyond the Sword                0.071279
Rocket League                                               0.007937
ORION Prelude                                               0.006369
Hazard Ops                                                  0.005305
South Park The Stick of Truth - Ultimate Fellowship Pack    0.005291
Echo of Soul                                                0.005168
NARUTO SHIPPUDEN Ultimate Ninja STORM 3 Full Burst          0.005115
Killing Floor 2                                             0.005025
Assassin's Creed Revelations      

In [94]:
similarity_matrix["South Park The Stick of Truth"].sort_values(ascending=False)[:20]

South Park The Stick of Truth                               1.000000
South Park The Stick of Truth - Ultimate Fellowship Pack    0.271028
Dishonored                                                  0.176471
The Walking Dead Season Two                                 0.159236
The Wolf Among Us                                           0.158621
The Walking Dead                                            0.150754
Prison Architect                                            0.143590
The Stanley Parable                                         0.142105
BioShock Infinite                                           0.138418
Dark Souls Prepare to Die Edition                           0.135514
Fallout New Vegas Honest Hearts                             0.131868
Fallout New Vegas Dead Money                                0.131868
Dungeon Defenders                                           0.130233
Metro Last Light                                            0.129730
The Binding of Isaac              

In [95]:
similarity_matrix["Deus Ex Human Revolution"].sort_values(ascending=False)[:20]

Deus Ex Human Revolution                              1.000000
Deus Ex Human Revolution - The Missing Link           0.337423
Dishonored                                            0.292308
Mark of the Ninja                                     0.264423
Alan Wake                                             0.252137
Mirror's Edge                                         0.248227
FTL Faster Than Light                                 0.245136
Deus Ex Game of the Year Edition                      0.239796
Dead Space                                            0.230769
Metro 2033                                            0.227390
BioShock Infinite                                     0.224000
The Walking Dead                                      0.223176
LIMBO                                                 0.223140
Batman Arkham Asylum GOTY Edition                     0.219081
Borderlands DLC The Secret Armory of General Knoxx    0.218623
Borderlands                                           0

In [96]:
similarity_matrix["Dishonored"].sort_values(ascending=False)[:20]

Dishonored                           1.000000
BioShock Infinite                    0.295580
Deus Ex Human Revolution             0.292308
BioShock                             0.272436
Fallout New Vegas Honest Hearts      0.258389
Fallout New Vegas Dead Money         0.258389
Batman Arkham City GOTY              0.254072
BioShock 2                           0.243902
The Walking Dead                     0.239496
Borderlands 2                        0.238683
Fallout New Vegas Courier's Stash    0.227891
Dead Space                           0.226766
Fallout New Vegas                    0.225962
Mark of the Ninja                    0.224215
Metro Last Light                     0.222222
FTL Faster Than Light                0.222222
Mirror's Edge                        0.214765
Tomb Raider                          0.214689
Max Payne 3                          0.212500
Alan Wake                            0.212000
Name: Dishonored, dtype: float64

In [97]:
similarity_matrix["The Elder Scrolls V Skyrim"].sort_values(ascending=False)[:20]

The Elder Scrolls V Skyrim                           1.000000
The Elder Scrolls V Skyrim - Dawnguard               0.527197
The Elder Scrolls V Skyrim - Hearthfire              0.511855
The Elder Scrolls V Skyrim - Dragonborn              0.509066
Skyrim High Resolution Texture Pack                  0.327755
Borderlands 2                                        0.270510
Fallout New Vegas                                    0.253270
Portal 2                                             0.244817
BioShock Infinite                                    0.227879
Left 4 Dead 2                                        0.213091
The Witcher 2 Assassins of Kings Enhanced Edition    0.212951
Sid Meier's Civilization V                           0.212373
Terraria                                             0.210579
Fallout New Vegas Dead Money                         0.193506
Fallout New Vegas Honest Hearts                      0.193506
Portal                                               0.192870
Garry's 

In [98]:
similarity_matrix["The Witcher 2 Assassins of Kings Enhanced Edition"].sort_values(ascending=False)[:20]

The Witcher 2 Assassins of Kings Enhanced Edition    1.000000
The Witcher Enhanced Edition                         0.408638
BioShock                                             0.252604
BioShock Infinite                                    0.251131
Batman Arkham City GOTY                              0.250667
Metro 2033                                           0.239651
Batman Arkham Asylum GOTY Edition                    0.236620
BioShock 2                                           0.231844
Borderlands 2                                        0.231598
Saints Row The Third                                 0.228571
Fallout New Vegas                                    0.224742
The Elder Scrolls V Skyrim - Dragonborn              0.224409
Tomb Raider                                          0.223810
Torchlight II                                        0.220812
The Elder Scrolls V Skyrim - Dawnguard               0.218810
Trine 2                                              0.217631
XCOM Ene

In [99]:
similarity_matrix["XCOM Enemy Unknown"].sort_values(ascending=False)[:20]

XCOM Enemy Unknown                                   1.000000
XCOM Enemy Within                                    0.432990
FTL Faster Than Light                                0.290441
BioShock                                             0.251497
BioShock Infinite                                    0.221945
The Witcher 2 Assassins of Kings Enhanced Edition    0.215633
Dishonored                                           0.211221
The Walking Dead                                     0.206107
Batman Arkham City GOTY                              0.197640
Trine 2                                              0.195584
Sid Meier's Civilization V Brave New World           0.195055
Bastion                                              0.192171
Mark of the Ninja                                    0.190283
Magicka                                              0.188525
Prison Architect                                     0.187739
Dungeon Defenders                                    0.187050
Borderla
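In each of the listings above, the first entry is the queried game itself with a similarity of 1. A small helper can leave it out. The name most_similar and the toy matrix with games A, B, C are hypothetical; only drop, sort_values and head from pandas are used.

```python
import pandas as pd


def most_similar(similarity_matrix, game, k=5):
    """Return the k games most similar to `game`, leaving out the game itself."""
    scores = similarity_matrix[game].drop(labels=game)
    return scores.sort_values(ascending=False).head(k)


# Toy symmetric similarity matrix with invented values.
games = ["A", "B", "C"]
toy_matrix = pd.DataFrame(
    [[1.0, 0.5, 0.2],
     [0.5, 1.0, 0.1],
     [0.2, 0.1, 1.0]],
    index=games, columns=games,
)

top = most_similar(toy_matrix, "A", k=2)
print(top)
```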

So, that is it for Jaccard. It came down to counting the users that two games have in common. Things get a bit more complicated when we factor in the play-time data. Why? In the binary treatment, we were only interested in whether a user owns a game (1) or not (0). Now we are interested in how much the player liked a game, and putting a 0 there would mean the user hated it. So, we have to treat the entries for games a user does not own as missing values that we want to fill.
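One way to keep "not owned" separate from "disliked" is to store missing entries as NaN instead of 0. The sketch below uses invented records, and the column name hours is an assumption, since the play-time data is not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical play-time records; `hours` is an assumed column name.
play_data = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "game": ["Dota 2", "Portal", "Dota 2", "Portal"],
    "hours": [120.0, 3.0, 45.0, 10.0],
})

# pivot leaves NaN wherever a (user, game) pair has no record,
# so unowned games are marked as missing rather than rated 0.
playtime_matrix = play_data.pivot(index="userId", columns="game", values="hours")

# Boolean mask of the entries that were actually observed.
observed = playtime_matrix.notna()
print(playtime_matrix)
```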