# SD201 project

# Recommendations based on hours played : k-nn on players
The aim of the algorithm is to predict a list of games someone may like knowing how much he played to other games.
It means that the feature we want to predict is a list of games, and the features used to do so are, for each game in the database, the amount of hours spent playing this game.

To do so, we will use our own dataset created from scratch. If you are interested by our progression and the justification of the choices we made to create our model, please consult the knn_progress.ipynb notebook in the kaggle_dataset section.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

Firstly, let's gather the dataset we created.

In [2]:
#Get the data
#read data
data = pd.read_json('SteamGameData.json')
#clean data
data.columns = ['id','game','hours_played']
played_games = data.loc[data['hours_played'] > 0]
played_games.head()

Unnamed: 0,id,game,hours_played
0,76561198006667424,240,2677
1,76561198006667424,4000,57279
2,76561198006667424,4760,19541
3,76561198006667424,4770,1720
4,76561198006667424,10500,35248


As we only got the appid and not the game name, we will use another dataset which will gives us the name of the games. However, as our dataset is huge, we will only use it when making predictions and not on the entire dataset.

In [3]:
steam_games_info = pd.read_csv('steam.csv')

def game_name(appid):
    try:
        return steam_games_info.loc[steam_games_info['appid']==appid]['name'].values[0]
    except:
        return 'Unknown game'

def appid(game_name):
    try:
        return steam_games_info.loc[steam_games_info['name']==game_name]['appid'].values[0]
    except:
        return 0

game_name(240), appid('Counter-Strike: Source')

('Counter-Strike: Source', 240)

Now let's reorder the dataset by creating the dictionnaries of games played for each player

In [4]:
#Get a dict of games and hours played for each id
played_dict = played_games.groupby('id').apply(lambda g : dict(zip(g['game'], g['hours_played'])))

played_dict_3 = played_dict.loc[played_dict.map(len)>=3]
played_dict_100 = played_dict.loc[played_dict.map(len)>=100]
len(played_dict_3),len(played_dict_100)

(8061, 2842)

We also will encode the hours played and compute the games list

In [5]:
#Create vectors of hours played 
hours_encoded = played_dict.apply(pd.Series)
#Replace NaN values by 0 : a game not in the dict has never been played
hours_encoded = hours_encoded.fillna(0)
#We drop the ids because they are not useful anymore
hours_encoded = hours_encoded.reset_index(drop=True)
#Sort by name of games
hours_encoded = hours_encoded.reindex(sorted(hours_encoded.columns),axis=1)

games_list = list(hours_encoded.columns)

We then need to get the model of predictions we created

In [6]:
class SteamPredictionModel():
    
    def __init__(self, k_neighbors = 5):
        self.neigh = NearestNeighbors(n_neighbors=k_neighbors, metric='euclidean')
        self.games_list = []
        self.likeness = None
        self.mean_dict = {}
        self.std_dict = {}
        self.average_played = {}

    def dict_to_likeness(self, dicti):
        d = dicti.copy()
        for game in d.keys():
            if d[game] <= self.average_played[game]:
                d[game]=0
        return d
    
    #We fit the model on a dataset containing ids and dictionnaries of games associated with time played
    def fit(self, data):

        #Firstly we encode the hours played
        hours_encoded = data.apply(pd.Series)
        #Replace NaN values by 0 : a game not in the dict has never been played
        hours_encoded = hours_encoded.fillna(0)
        hours_encoded = hours_encoded.reindex(sorted(hours_encoded.columns),axis=1)
        

        #For each player, we compute the list of game he likes with the time he has played aboved average time played
        non_zero_dict = hours_encoded.replace(0, np.NaN)
        self.average_played = non_zero_dict.mean(axis=0)
        likeness_games = data.map(self.dict_to_likeness)
        #And encode them
        likeness_games_encoded = likeness_games.apply(pd.Series)
        #Replace NaN values by 0 : a game not in the dict has never been played
        likeness_games_encoded = likeness_games_encoded.fillna(0)
        likeness_games_encoded = likeness_games_encoded.reindex(sorted(likeness_games_encoded.columns),axis=1)

        #standardization
        #We standardize each column separately

        def standardize(c):
            m = c.mean()
            if c.std() > 0:
                std = c.std()
            else:
                std = 1e-8
            return (c-m)/std

        hours_encoded = hours_encoded.apply(lambda column : standardize(column),axis=0)

        #we store the mean and std to standardize vectors on which we want to make predictions
        self.mean_dict = hours_encoded.mean(axis=0)
        self.std_dict = hours_encoded.std(axis=0)
        self.std_dict.replace(to_replace=0,value=1e-8,inplace=True)

        #we also standardize the likeness
        likeness_games_encoded = likeness_games_encoded.apply(lambda column : standardize(column),axis=0)
        self.likeness = likeness_games_encoded



        self.neigh.fit(hours_encoded.values)
        
        #Get the list of games
        games_list = list(hours_encoded.columns)
        games_list.sort()
        self.games_list = games_list
        
        
    
    #We predict a certain number of games (maximum) using a dictionnary of games associated with time played
    def predict(self, X_init, recommendations_number_max, predict_names=False):
        #One-hot-encode X
        X = pd.Series(X_init,index=self.games_list).fillna(0)
        
        #Create a vector with all games and a null score
        score = pd.Series({self.games_list[0]:0.0},index=self.games_list).fillna(0)

        #Get the list of games played by X
        already_owned = [self.games_list[index] for index in np.asarray(X).nonzero()[0]]

        
        stand_vector = (X-self.mean_dict)/self.std_dict
        #Get the neighbors of X
        kneighbors = self.neigh.kneighbors([stand_vector])
        kneighbors_distances = kneighbors[0][0]
        kneighbors_indices = kneighbors[1][0]

        for i in range(len(kneighbors_indices)):
            neighbor = kneighbors_indices[i]
            #get the list of liked games
            liked = [self.games_list[index] for index in np.asarray(self.likeness.iloc[neighbor]).nonzero()[0]]
            #Add to each game score (1/d)*l with d the distance between X and the neighbor
            #and l the amount of time played above the average
            for liked_game in liked:
                if liked_game not in already_owned:
                    score[liked_game] = score[liked_game] + 1/kneighbors_distances[i] + self.likeness.iloc[neighbor][liked_game]


        score = score.sort_values(ascending=False)
        return score.iloc[:recommendations_number_max]

Before evaluating, we will create stereotyped players (focused on a single genre of games for instance) in order to manually test our results.  
Note : the game names are not the same as in the kaggle dataset, and the time played is not in hours.

In [7]:
#Create and fit the model
SPM = SteamPredictionModel(100)
SPM.fit(played_dict_3)

In [8]:
rpg_player = {appid('The Witcher® 3: Wild Hunt'):700,appid('The Elder Scrolls V: Skyrim'):800, appid('Far Cry® 4'):500, appid('Fallout 3'):500}
encoded_rpg_player = pd.Series(rpg_player,index=games_list).fillna(0)

In [9]:
predictions = SPM.predict(encoded_rpg_player,10)

In [10]:
for game in predictions.to_dict().keys():
    print(game_name(game))

Skylar & Plux: Adventure On Clover Island
Unknown game
The Elder Scrolls V: Skyrim Special Edition
RUNNING WITH RIFLES
Metro: Last Light Redux
Garbage Day
Fallout 4
Kingdom Come: Deliverance
Far Cry 3
ARMA: Cold War Assault


The results we obtain are coherent : mostly adventure games and rpgs.  
However, we notice that the calculations are way longer (about 1min there) than in the kaggle dataset, which is logical because this one is much more dense.

In [11]:
fps_player = {appid('Call of Duty® 4: Modern Warfare®'):700,appid('Counter-Strike: Global Offensive'):800}
encoded_fps_player = pd.Series(fps_player,index=games_list).fillna(0)

In [12]:
predictions = SPM.predict(encoded_fps_player,10)

In [13]:
for game in predictions.to_dict().keys():
    print(game_name(game))

Collider
Unknown game
Unknown game
Unknown game
Easy™ eSports
Counter-Strike
J.A.W.S
Counter-Strike: Source
Day of Defeat: Source
Defiance


Once again, the results seem coherent with the type of player.

# Knn evaluation

As the dataset contains player that played a lot of games compared to the Kaggle dataset (150 games played on average in this dataset compared to 10 in the Kaggle dataset), we will reduce the evaluation to players that played at least 100 games which still represents 2842 players, and use 20 neighbors (because we have about the same number of players as in the kaggle dataset's case), otherwise the calculations are too long.

In [14]:
def train_validation_test_split(data, validation_size = 0.1, test_size=0.1):
    
    #shuffle
    shuffled = data.sample(frac=1, random_state=42) #set seed for reproducability
    
    #split train and test
    separator1 = len(shuffled) - int(len(shuffled)*(test_size+validation_size))
    separator2 = len(shuffled) - int(len(shuffled)*(test_size))
    X_train = shuffled.iloc[:separator1]
    X_validation = shuffled.iloc[separator1:separator2]
    X_test = shuffled.iloc[separator2:]
    
    return (X_train,X_validation,X_test)

X_train, X_validation, X_test = train_validation_test_split(played_dict_100)

In [15]:
#Create and fit the model
SPM = SteamPredictionModel(20)
SPM.fit(X_train)

In [16]:
non_zero_dict = hours_encoded.replace(0, np.NaN)
average_played = non_zero_dict.mean(axis=0)

In [17]:
from tqdm import tqdm

#Evaluate the model
true_positives = 0
false_positives = 0

for player in tqdm(X_test):
    
    player_games = len(player)
    
    games_to_predict = int(len(player)*5/100)
    if games_to_predict == 0:
        games_to_predict = 1 #we remove at least one game
    
    games_for_prediction = {}
    games_removed = []
    n = 0

    for (name,time) in list(player.items()):
        if n < games_to_predict and time >average_played[name]: #the game is liked
            games_removed.append(name)
            n += 1
        else:
            games_for_prediction[name]=time
    
    prediction = []
    if len(games_removed) > 0:
        prediction = list(SPM.predict(games_for_prediction,games_to_predict).index)
    
    for predicted_game in prediction:
        if predicted_game in games_removed:
            true_positives += 1
        else:
            false_positives += 1

precision = true_positives/(true_positives+false_positives)

100%|██████████| 284/284 [51:21<00:00, 10.85s/it]


In [23]:
precision

0.024196597353497166

In [19]:
precision/(1/len(list(hours_encoded.columns)))

456.0332703213611

Even if the precision is lower than with the Kaggle's dataset, the algorithm seems to perform better as it performs 450x better than random compared to 200x for the Kaggle's dataset (indeed the precision is lower but there are much more games in this dataset !).  
However, we can think that this result may be a bit biased due to the method we used to create the dataset : as we took friends of friends of friends..., it is possible that we gathered the same types of players, and thus the recommendations on others may be easier.  
We're still satisfied with the results, but the calculation times are too high for this dataset to be used in a real model. A perspective to improve could be to try to clean the dataset but it would necessite a very meticulous process to determine which games are negligeable, how to reduce the calculation times, etc...