![](images/steam_bg.jpg)

# Steam Rec System

#### Author: Chris O'Malley

## Overview

The goal of this project was to build a collaborative recommendation system for Steam using user game libraries. A profile was built for each user from the hours they spent playing different kinds of games, and recommendations are made based on users with similar profiles. By basing recommendations on hours played rather than games purchased, the aim is to find more engaging games that keep users interested in gaming rather than other pastimes.

## Business Understanding

Steam already has its own recommendation system, the Discovery Queue. It is based on previously purchased games and incorporates new and popular games. The Steam Interactive Recommender was also released in 2020; it uses machine learning to provide recommendations based on games played and on other users who have played those same games. It has a tag filtering system and can be filtered by popularity and release date. My recommendation system isn't meant to replace those systems so much as to show an alternative process that I believe would be a good addition to them. Though purchases are understandably a high priority for any company, I think making them the only focus is short-sighted. I believe keeping users engaged will ultimately keep them gaming, on the platform, and purchasing more games.

## Data Understanding

There were two main datasets used for this project, a game dataset and a user dataset, scraped or requested from various sources. It should be noted that I used https://nik-davis.github.io/posts/2019/steam-data-collection/, a post from Nik Davis, as a basis for much of my scraping methods. While I could mostly copy, paste, and run code for the game dataset with minor adjustments, I had to make major changes to adapt it for my user library requests. I will go into more detail below.
#### Game dataset:
Comprised of app ids, titles, genres, and tags.

Appids: The appids of the 1,000 most owned games were requested from the SteamSpy.com API. Unfortunately, the API has changed in recent years and I was not able to request every game title at once, which limited me to the 1,000 most owned games. My attempts to run multiple requests per the SteamSpy API documentation did not work, so I must acknowledge that this recommendation system is biased towards games with more owners. Nik Davis's code was used for this portion; however, it was somewhat dated because SteamSpy has since started limiting the 'all' request for its API, and I could not adapt it to gather the full Steam library. Still, a decent system can be built from the 1,000 most owned games, with the understanding that it won't be perfect.
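The SteamSpy documentation describes a `page` parameter for the `all` request, with each page returning 1,000 entries. As a rough sketch only (paged requests did not work in this project, and this has not been checked against the live API), the page URLs could be built as follows. `steamspy_page_urls` is a hypothetical helper name, and no request is actually sent here.

```python
from urllib.parse import urlencode

STEAMSPY_API = "https://steamspy.com/api.php"


def steamspy_page_urls(num_pages):
    """Build one URL per page of the 'all' request (hypothetical helper)."""
    return [
        f"{STEAMSPY_API}?{urlencode({'request': 'all', 'page': page})}"
        for page in range(num_pages)
    ]


urls = steamspy_page_urls(3)
for url in urls:
    print(url)
```

Each URL would still need to be fetched with a pause between requests, since the documentation mentions a stricter poll rate for `all` requests than for other request types.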

Titles, genres, tags: The appids acquired from the SteamSpy API were used to request additional data from the Steam API, consisting of game titles, genres, and tags. This was again copied from Nik Davis's code and worked smoothly with minor adjustments.

The API requests and scraping for the game dataset and user libraries can be found in data/data_harvest. The Python files for scraping user ids with Selenium can be found in the data folder as well.

#### User dataset
Comprised of steamids and game libraries.

Steamids: First, the appids of the 1,000 top-rated and 1,000 most-followed games were scraped from Steamdb.com using Selenium. Those appids were then used to scrape up to 100 steamids from the most recent reviews of each of those games on Steampowered.com. This was done both to get relevant, active users and to gather a sample from different types of games. This ultimately yielded around 60,000 steamids.

Game libraries: Game libraries were requested from the Steam API. I was able to adapt Nik Davis's code for this purpose, with major changes. The Steam API has a rate limit that had to be accounted for, and extra helper functions were needed to deal with errors. Request retries were removed because a failed request most likely meant that a steamid was no longer valid. The return format also proved to be an issue. However, all issues were eventually resolved with adjustments.
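The request loop described above (pause between calls, no retries, failures recorded rather than raised) might look roughly like this. It is a sketch, not the project's harvesting code: `collect_libraries` and `fetch_library` are hypothetical names, the fetcher is injected so no real Steam API call is made, and storing `'hidden'` for every failed request is an assumption based on how hidden libraries appear in the data.

```python
import time


def collect_libraries(steamids, fetch_library, pause_seconds=1.0):
    """Fetch each user's library once, pausing between calls.

    fetch_library is a hypothetical callable that returns a list of games
    or raises an exception when the request fails.
    """
    libraries = {}
    for steamid in steamids:
        try:
            libraries[steamid] = fetch_library(steamid)
        except Exception:
            # No retry: a failure most likely means the steamid is invalid.
            # Marking it 'hidden' is an assumption made for this sketch.
            libraries[steamid] = "hidden"
        # Stay under the API's rate limit.
        time.sleep(pause_seconds)
    return libraries


def fake_fetch(steamid):
    """Stand-in for a real API call, used only to exercise the loop."""
    if steamid == 2:
        raise ValueError("no such user")
    return [{"appid": 220, "name": "Half-Life 2", "hours": 12}]


result = collect_libraries([1, 2], fake_fetch, pause_seconds=0)
print(result[1][0]["name"])  # Half-Life 2
print(result[2])             # hidden
```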

The above data was combined into the library_df dataset.

### Technical Understanding

Of the 57,000 user libraries, around 50,000 were hidden. After dropping duplicates and users with 0 hours played there were 5,500 users left. Though at least 10,000 users would have been preferable, 5,500 was enough to build a relatively accurate model.

The recommendation system created here uses an hours-by-tag personalized rating system. Each game in a user's library has a community-voted list of tags associated with it. A profile was built for each user based on the total number of hours played by tag. By normalizing these hours, we get a value between 0 and 1 for each tag that reflects the share of the user's tagged play time attributed to that tag. To score a game in the user's library, we add up these values for every tag associated with it. Every game in the user's library ends up with a score between 0 and 1, which is then used in the collaborative recommendation system to find users with similar rating profiles.
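As a small worked illustration of the scoring described above, here is a self-contained sketch on made-up data. The appids, tags, and hours are hypothetical, and `build_profile` and `score_library` are illustrative names rather than the functions used later in this notebook.

```python
from collections import defaultdict

# Hypothetical toy data: appid -> top tags, and one user's library.
GAME_TAGS = {1: ["Action", "FPS"], 2: ["RPG", "Action"], 3: ["Puzzle"]}
library = [
    {"appid": 1, "hours": 30},
    {"appid": 2, "hours": 10},
    {"appid": 3, "hours": 0},
]


def build_profile(library, game_tags):
    """Sum hours per tag, then normalize so the shares add up to 1."""
    hours_by_tag = defaultdict(float)
    for game in library:
        if game["hours"] == 0:
            continue  # unplayed games do not shape the profile
        for tag in game_tags.get(game["appid"], []):
            hours_by_tag[tag] += game["hours"]
    total = sum(hours_by_tag.values())
    return {tag: hours / total for tag, hours in hours_by_tag.items()}


def score_library(library, game_tags, profile):
    """Score each game as the sum of the profile shares of its tags."""
    return {
        game["appid"]: sum(
            profile.get(tag, 0.0) for tag in game_tags.get(game["appid"], [])
        )
        for game in library
    }


profile = build_profile(library, GAME_TAGS)
scores = score_library(library, GAME_TAGS, profile)
print(profile)  # {'Action': 0.5, 'FPS': 0.375, 'RPG': 0.125}
print(scores)   # {1: 0.875, 2: 0.625, 3: 0.0}
```

Here the tag hours are Action 40, FPS 30, and RPG 10, for a total of 80, so the shooter scores 0.5 + 0.375 and the unplayed puzzle game, whose tag never appears in the profile, scores 0.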

From the Python Surprise library I used SVD, KNNWithMeans, KNNBasic, and KNNBaseline to model my data. They are standard starting points and are not extremely computationally expensive. RMSE was used as the metric to score the models because it is easy to interpret (it is on the same scale as the ratings) and is standard for collaborative recommendation systems. KNNBaseline scored best of the four with default parameters and cross-validation, though KNNBasic was close. After running grid searches with SVD, KNNBasic, and KNNBaseline, the chosen model was KNNBasic. The hyperparameters used were k=10 with similarity options of MSD, min_support=5, and user_based=True. The RMSE of the final model was .037 for a rating scale of 0 to 1.
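To make the metric concrete, here is a minimal sketch of RMSE and MAE computed by hand on a few made-up ratings on the 0 to 1 scale. The numbers are illustrative only and are not taken from the model. Surprise's `accuracy.rmse` reports the same kind of quantity for its predictions.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: the average size of a miss."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up ratings and predictions on the 0-1 scale.
actual = [0.20, 0.50, 0.90, 0.00]
predicted = [0.25, 0.45, 0.80, 0.10]

print(round(rmse(actual, predicted), 4))  # 0.0791
print(round(mae(actual, predicted), 4))   # 0.075
```

Because both values are on the same scale as the ratings, an RMSE of 0.0791 here means a typical prediction is off by roughly 0.08 rating points.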

In [201]:
# Import relevant modules
import pandas as pd
import pickle
import json
import ast
from surprise import Dataset, Reader,accuracy
from surprise.model_selection import train_test_split, cross_validate, GridSearchCV
from surprise.prediction_algorithms import knns, SVD, KNNWithMeans, KNNBasic, KNNBaseline
from surprise.similarities import cosine, msd, pearson
import matplotlib.pyplot as plt
%matplotlib inline

### Game dataset

In [152]:
game_data = pd.read_csv('data/steamspy_data.csv')
game_data.head()

Unnamed: 0,appid,name,developer,publisher,score_rank,positive,negative,userscore,owners,average_forever,average_2weeks,median_forever,median_2weeks,price,initialprice,discount,languages,genre,ccu,tags
0,10,Counter-Strike,Valve,Valve,,185686,4807,0,"10,000,000 .. 20,000,000",9363,426,262,323,199,999,80,"English, French, German, Italian, Spanish - Sp...",Action,11955,"{'Action': 5372, 'FPS': 4796, 'Multiplayer': 3..."
1,20,Team Fortress Classic,Valve,Valve,,5235,874,0,"2,000,000 .. 5,000,000",852,3,27,3,99,499,80,"English, French, German, Italian, Spanish - Sp...",Action,94,"{'Action': 745, 'FPS': 306, 'Multiplayer': 258..."
2,30,Day of Defeat,Valve,Valve,,4885,541,0,"5,000,000 .. 10,000,000",811,0,16,0,99,499,80,"English, French, German, Italian, Spanish - Spain",Action,119,"{'FPS': 785, 'World War II': 246, 'Multiplayer..."
3,40,Deathmatch Classic,Valve,Valve,,1791,403,0,"5,000,000 .. 10,000,000",271,0,12,0,99,499,80,"English, French, German, Italian, Spanish - Sp...",Action,10,"{'Action': 628, 'FPS': 138, 'Classic': 106, 'M..."
4,50,Half-Life: Opposing Force,Gearbox Software,Valve,,12501,638,0,"5,000,000 .. 10,000,000",1919,3,171,5,99,499,80,"English, French, German, Korean",Action,122,"{'FPS': 879, 'Action': 321, 'Classic': 250, 'S..."


In [153]:
game_data.set_index('appid', inplace=True)

Here I create a function to return a list of the top ten tags by user vote.

In [154]:
def get_app_tags(tag_dict):
    # Parse the stringified tag dictionary ({tag: votes}) into a real dict
    tags = ast.literal_eval(tag_dict)
    # Keep at most the first ten tags; this relies on the tags already
    # being ordered from most to fewest votes in the source data
    return list(tags)[:10]

I apply that function to the 'tags' column as I don't need the number of votes.

In [155]:
game_data['tags'] = game_data['tags'].apply(get_app_tags)
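As a quick illustration of what this step does, the sketch below parses a made-up tag string (the vote counts are hypothetical) and compares taking the first tags in stored order with explicitly sorting by votes. The two only agree when the stored dictionary is already ordered by votes, which appears to be the case in the sample rows shown above.

```python
import ast

# Hypothetical tag string, deliberately NOT ordered by votes.
raw_tags = "{'FPS': 10, 'Action': 25, 'Classic': 5}"
tags = ast.literal_eval(raw_tags)

# Taking tags in stored order.
in_stored_order = list(tags)[:2]

# Sorting explicitly by vote count, highest first.
by_votes = sorted(tags, key=tags.get, reverse=True)[:2]

print(in_stored_order)  # ['FPS', 'Action']
print(by_votes)         # ['Action', 'FPS']
```

Sorting explicitly costs little and removes the dependence on the order in which the tags were stored.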

In [156]:
game_data.head()

appid,name,developer,publisher,score_rank,positive,negative,userscore,owners,average_forever,average_2weeks,median_forever,median_2weeks,price,initialprice,discount,languages,genre,ccu,tags
10,Counter-Strike,Valve,Valve,,185686,4807,0,"10,000,000 .. 20,000,000",9363,426,262,323,199,999,80,"English, French, German, Italian, Spanish - Sp...",Action,11955,"[Action, FPS, Multiplayer, Shooter, Classic, T..."
20,Team Fortress Classic,Valve,Valve,,5235,874,0,"2,000,000 .. 5,000,000",852,3,27,3,99,499,80,"English, French, German, Italian, Spanish - Sp...",Action,94,"[Action, FPS, Multiplayer, Classic, Hero Shoot..."
30,Day of Defeat,Valve,Valve,,4885,541,0,"5,000,000 .. 10,000,000",811,0,16,0,99,499,80,"English, French, German, Italian, Spanish - Spain",Action,119,"[FPS, World War II, Multiplayer, Shooter, Acti..."
40,Deathmatch Classic,Valve,Valve,,1791,403,0,"5,000,000 .. 10,000,000",271,0,12,0,99,499,80,"English, French, German, Italian, Spanish - Sp...",Action,10,"[Action, FPS, Classic, Multiplayer, Shooter, F..."
50,Half-Life: Opposing Force,Gearbox Software,Valve,,12501,638,0,"5,000,000 .. 10,000,000",1919,3,171,5,99,499,80,"English, French, German, Korean",Action,122,"[FPS, Action, Classic, Sci-fi, Singleplayer, S..."


### User dataset

In [157]:
library_df = pd.read_csv('data/library_data.csv')
library_df

Unnamed: 0,steamid,library
0,76561198219067393,"[{'appid': 220, 'name': 'Half-Life 2', 'hours'..."
1,76561198148157441,"[{'appid': 17390, 'name': 'Spore', 'hours': 26..."
2,76561198993539076,hidden
3,76561198247182340,hidden
4,76561198278705159,hidden
...,...,...
56973,76561197990543347,hidden
56974,76561199206760437,hidden
56975,76561198324908021,hidden
56976,76561198253735927,hidden


In [158]:
# Setting index to steamid for future usage.
library_df.set_index('steamid', inplace=True)

Here I drop hidden libraries, bringing the total down to around 6,600.

In [159]:
hidden_libraries = library_df[library_df['library'] == 'hidden'].index
library_df = library_df.drop(hidden_libraries)
len(library_df)

6600

In [None]:
# Dropping Duplicates
library_df.drop_duplicates(inplace=True)

In [161]:
len(library_df)

6340

A helper function to get the tags from game_data with an appid.

In [162]:
def get_app_tags(appid):
    return game_data.loc[appid]['tags']

Here I create a function to go through every game in a library, retrieve its tags, and add the game's hours to a running total for each tag. It returns a dictionary of the user's total hours by tag, sorted from most to fewest hours.

In [163]:
def tag_hours(games_list):
    # reading as a dictionary instead of a string
    games_list = ast.literal_eval(games_list)
    tag_dict = {}
    for game_dict in games_list:
        # checks if we can pull tags for it and that the hours for the game aren't 0
        if game_dict['appid'] in game_data.index and game_dict['hours'] != 0:            
            tags = get_app_tags(game_dict['appid'])
            for tag in tags:
                if tag in tag_dict.keys():
                    tag_dict[tag] += game_dict['hours']
                else:
                    tag_dict[tag] = game_dict['hours']
    return dict(sorted(tag_dict.items(), key=lambda item: item[1], reverse=True))

Applying the tag_hours function to the user's game library and saving tag_hours as a new column.

In [None]:
library_df['tag_hours'] = library_df['library'].apply(tag_hours)

In [165]:
library_df

steamid,library,tag_hours
76561198148157441,"[{'appid': 17390, 'name': 'Spore', 'hours': 26...","{'Action': 115061, 'Multiplayer': 112293, 'Fre..."
76561198170079242,"[{'appid': 3830, 'name': 'Psychonauts', 'hours...","{'Multiplayer': 281821, 'Action': 213414, 'Fir..."
76561198088650778,"[{'appid': 4000, 'name': ""Garry's Mod"", 'hours...",{}
76561198886682654,"[{'appid': 4000, 'name': ""Garry's Mod"", 'hours...","{'Multiplayer': 109690, 'Singleplayer': 95062,..."
76561198311899167,"[{'appid': 10, 'name': 'Counter-Strike', 'hour...","{'Sandbox': 76219, 'Multiplayer': 74026, 'Firs..."
...,...,...
76561198208253879,"[{'appid': 70, 'name': 'Half-Life', 'hours': 3...","{'Multiplayer': 209252, 'Pixel Graphics': 1578..."
76561199104131020,"[{'appid': 230410, 'name': 'Warframe', 'hours'...","{'Multiplayer': 18877, 'Action': 18558, 'Shoot..."
76561198012694491,"[{'appid': 240, 'name': 'Counter-Strike: Sourc...","{'Multiplayer': 375807, 'Free to Play': 310345..."
76561198393589724,"[{'appid': 2600, 'name': 'Vampire: The Masquer...","{'Action': 43418, 'Multiplayer': 35866, 'RPG':..."


Dropping users with no hours played.

In [166]:
no_hours = library_df[library_df['tag_hours'] == {}].index
library_df = library_df.drop(no_hours)
len(library_df)

5738

This function normalizes all the hours in the library to values between 0 and 1, effectively giving us the share of hours for each tag, which we will use to rate the games.

In [167]:
def normalize_hours(tag_dict):
    tag_labels = tag_dict.keys()
    tag_hours = tag_dict.values()
    sum_hours = sum(tag_hours)
    norm_hours = [float(i)/sum_hours for i in tag_hours]
    return dict(zip(tag_labels, norm_hours))

Applying to the tag_hours column.

In [None]:
library_df['tag_hours'] = library_df['tag_hours'].apply(normalize_hours)

In [169]:
library_df

steamid,library,tag_hours
76561198148157441,"[{'appid': 17390, 'name': 'Spore', 'hours': 26...","{'Action': 0.06109671527032911, 'Multiplayer':..."
76561198170079242,"[{'appid': 3830, 'name': 'Psychonauts', 'hours...","{'Multiplayer': 0.06902065567186039, 'Action':..."
76561198886682654,"[{'appid': 4000, 'name': ""Garry's Mod"", 'hours...","{'Multiplayer': 0.05836965139977544, 'Singlepl..."
76561198311899167,"[{'appid': 10, 'name': 'Counter-Strike', 'hour...","{'Sandbox': 0.06608431018935979, 'Multiplayer'..."
76561199063236653,"[{'appid': 500, 'name': 'Left 4 Dead', 'hours'...","{'Multiplayer': 0.08737094891821529, 'Action':..."
...,...,...
76561198208253879,"[{'appid': 70, 'name': 'Half-Life', 'hours': 3...","{'Multiplayer': 0.08415388451422459, 'Pixel Gr..."
76561199104131020,"[{'appid': 230410, 'name': 'Warframe', 'hours'...","{'Multiplayer': 0.09915953143877712, 'Action':..."
76561198012694491,"[{'appid': 240, 'name': 'Counter-Strike: Sourc...","{'Multiplayer': 0.09924078778500166, 'Free to ..."
76561198393589724,"[{'appid': 2600, 'name': 'Vampire: The Masquer...","{'Action': 0.06086919949530352, 'Multiplayer':..."


Here I noticed that some steamids still appeared more than once in the index, so the duplicates were dropped, keeping the first occurrence.

In [170]:
library_df = library_df[~library_df.index.duplicated(keep='first')]

In [None]:
library_df

This function applies a rating to every game in a user's library based on its tag scores. The more significant tags a game has, the higher it will score, whereas games whose tags are associated with low hours will get a low rating.

In [172]:
def rate_game_library(steamid):
    # here we try to read as a dictionary, and use indexing in case the first try doesn't work
    try:
        library = ast.literal_eval(library_df.loc[steamid]['library'])
    except ValueError:
        library = ast.literal_eval(library_df.loc[steamid]['library'].iloc[0])
    tag_hours = library_df.loc[steamid]['tag_hours']
    scores = {}
    for game in library:
        score = 0
        appid = game['appid']
        if appid in game_data.index:
            game_tags = game_data.loc[appid]['tags']
            # adding to score for each tag
            for tag in game_tags:
                if tag in tag_hours:
                    score += tag_hours[tag]
        if appid in scores:
            scores[appid] += score
        else:
            scores[appid] = score
    return scores

This function takes in the whole dataframe and applies the rate_game_library function to it.

In [173]:
def game_ratings(df):
    index = df.index.values
    library_ratings = []
    for steamid in index:
        rated_library = rate_game_library(steamid)
        library_ratings.append(rated_library)
    return library_ratings

Saving the ratings from rate_game_library to a new game_ratings column.

In [None]:
library_df['game_ratings'] = library_df.index.to_series().apply(rate_game_library)

In [175]:
library_df

steamid,library,tag_hours,game_ratings
76561198148157441,"[{'appid': 17390, 'name': 'Spore', 'hours': 26...","{'Action': 0.06109671527032911, 'Multiplayer':...","{17390: 0.055111349468474874, 17440: 0, 550: 0..."
76561198170079242,"[{'appid': 3830, 'name': 'Psychonauts', 'hours...","{'Multiplayer': 0.06902065567186039, 'Action':...","{3830: 0.14023839496074098, 4000: 0.2444579416..."
76561198886682654,"[{'appid': 4000, 'name': ""Garry's Mod"", 'hours...","{'Multiplayer': 0.05836965139977544, 'Singlepl...","{4000: 0.21460970716729724, 400: 0.11458948611..."
76561198311899167,"[{'appid': 10, 'name': 'Counter-Strike', 'hour...","{'Sandbox': 0.06608431018935979, 'Multiplayer'...","{10: 0.2035782409655268, 80: 0.234358743150447..."
76561199063236653,"[{'appid': 500, 'name': 'Left 4 Dead', 'hours'...","{'Multiplayer': 0.08737094891821529, 'Action':...","{500: 0.3969835712361972, 3590: 0.114772421222..."
...,...,...,...
76561198054375336,"[{'appid': 220, 'name': 'Half-Life 2', 'hours'...","{'Open World': 0.08426903711625006, 'Adventure...","{220: 0.258722714828182, 320: 0.15850648804130..."
76561199104131020,"[{'appid': 230410, 'name': 'Warframe', 'hours'...","{'Multiplayer': 0.09915953143877712, 'Action':...","{230410: 0.5403950202237747, 238960: 0.1565162..."
76561198012694491,"[{'appid': 240, 'name': 'Counter-Strike: Sourc...","{'Multiplayer': 0.09924078778500166, 'Free to ...","{240: 0.22484247997000123, 300: 0.174308786792..."
76561198393589724,"[{'appid': 2600, 'name': 'Vampire: The Masquer...","{'Action': 0.06086919949530352, 'Multiplayer':...","{2600: 0, 6980: 0, 1700: 0, 22330: 0.368598065..."


This function takes a steamid (user) and creates a new dataframe for that user from library_df, with every game and its rating as its own row.

In [176]:
def user_rating_df(steamid):
    rating_dict = library_df.loc[steamid]['game_ratings']
    # Build all rows first and create the dataframe once;
    # DataFrame.append was removed in pandas 2.0 and is slow inside a loop
    rows = [{'steamid': str(steamid), 'appid': str(app), 'rating': rating}
            for app, rating in rating_dict.items()]
    return pd.DataFrame(rows, columns=['steamid', 'appid', 'rating'])

Here we create the dataframe that will contain every user-library dataframe.

In [177]:
user_rec_df = pd.DataFrame(columns=['steamid', 'appid', 'rating'])

This for loop will go through every user in library_df, create a user library rating dataframe, and then append it to the user_rec_df dataframe.

In [None]:
# for num, steamid in enumerate(library_df.index):
#     total = len(library_df)
#     print(f"{num} out of {total}")
#     user_df = user_rating_df(steamid)
#     user_rec_df = pd.concat([user_rec_df, user_df], ignore_index = True, axis = 0)
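Much of the cost of a loop like this likely comes from growing a dataframe one piece at a time, since each concatenation copies everything accumulated so far. One possible alternative, sketched here with made-up steamids and a hypothetical helper name `flatten_ratings`, is to collect plain rows first and build the dataframe in a single step at the end.

```python
def flatten_ratings(ratings_by_user):
    """Turn {steamid: {appid: rating}} into (steamid, appid, rating) rows."""
    rows = []
    for steamid, ratings in ratings_by_user.items():
        for appid, rating in ratings.items():
            rows.append((str(steamid), str(appid), rating))
    return rows


# Made-up sample in the same shape as the game_ratings column.
sample = {
    76561198000000001: {10: 0.2, 80: 0.0},
    76561198000000002: {10: 0.5},
}

rows = flatten_ratings(sample)
print(len(rows))  # 3
print(rows[0])    # ('76561198000000001', '10', 0.2)
```

The resulting list can then be passed to `pd.DataFrame(rows, columns=['steamid', 'appid', 'rating'])` in one call.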

As the above loop takes a long time to run, here I'll load a completed one from the data folder.

In [None]:
with open('data/big_rating_df.pickle', 'rb') as handle:
    big_rating_df = pickle.load(handle)

In [208]:
big_rating_df

Unnamed: 0,steamid,appid,rating
0,76561198148157441,17390,0.055111
1,76561198148157441,17440,0.0
2,76561198148157441,550,0.215663
3,76561198148157441,47870,0.083723
4,76561198148157441,65600,0.0
...,...,...,...
901104,76561198008893422,1546540,0.0
901105,76561198008893422,582660,0.228932
901106,76561198008893422,1551360,0.0
901107,76561198008893422,46500,0.0


# Modeling

Creating the reader and data for working with Surprise.

In [179]:
# instantiate a reader and read in our rating data
reader = Reader(rating_scale=(0, 1))
data = Dataset.load_from_df(big_rating_df[['steamid','appid','rating']], reader)

Splitting the data into train and test sets for modeling.

In [180]:
# train on 75% of known ratings
trainset, testset = train_test_split(data, test_size=.25)

## First Simple Model - SVD

We start with an SVD model with default parameters, as it is simple to run and easy to understand.

In [181]:
svd = SVD()
svd.fit(trainset)
predictions = svd.test(testset)

In [182]:
# check the accuracy using Root Mean Square Error
accuracy.rmse(predictions)

RMSE: 0.0623


0.06232429769673886

In [183]:
# Run 5-fold cross-validation and then print results
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.0614  0.0619  0.0617  0.0617  0.0616  0.0616  0.0002  
MAE (testset)     0.0370  0.0371  0.0370  0.0371  0.0371  0.0370  0.0001  
Fit time          36.98   37.55   37.43   37.47   37.30   37.34   0.20    
Test time         1.77    1.81    1.75    1.22    1.65    1.64    0.22    


{'test_rmse': array([0.06136392, 0.06186382, 0.06170037, 0.06168942, 0.06160522]),
 'test_mae': array([0.03697267, 0.03710449, 0.03695218, 0.03706001, 0.03706741]),
 'fit_time': (36.975181579589844,
  37.54749536514282,
  37.426324129104614,
  37.46649169921875,
  37.30397152900696),
 'test_time': (1.7737417221069336,
  1.810443639755249,
  1.749300241470337,
  1.220996379852295,
  1.6500017642974854)}

While the .06 RMSE is not necessarily bad, I will try some other models to see if I can work from another starting point.

## KNNWithMeans

In [184]:
sim_cos = {'name':'cosine', 'user_based':True}

In [None]:
knnwm = KNNWithMeans(sim_options=sim_cos)
knnwm.fit(trainset)
predictionswm = knnwm.test(testset)

In [186]:
# check the accuracy using Root Mean Square Error
accuracy.rmse(predictionswm)

RMSE: 0.0587


0.0587471492667859

In [187]:
# Run 5-fold cross-validation and then print results
cross_validate(knnwm, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.0582  0.0581  0.0581  0.0583  0.0584  0.0582  0.0001  
MAE (testset)     0.0334  0.0335  0.0334  0.0334  0.0335  0.0334  0.0000  
Fit time          42.73   41.88   42.98   42.74   43.05   42.67   0.42    
Test time         57.87   60.98   56.57   56.02   55.50   57.39   1.96    


{'test_rmse': array([0.05816917, 0.05814788, 0.05814837, 0.0582556 , 0.05838775]),
 'test_mae': array([0.03341842, 0.0334976 , 0.03343016, 0.03341302, 0.03345722]),
 'fit_time': (42.73301815986633,
  41.876495599746704,
  42.97618126869202,
  42.737990617752075,
  43.048348903656006),
 'test_time': (57.867210388183594,
  60.982911586761475,
  56.572514057159424,
  56.018495321273804,
  55.50116586685181)}

With an RMSE of around .058 this is an improvement on the SVD model, but I will continue onto the next model in case it can be improved on.

## KNNBasic

In [188]:
basic = knns.KNNBasic(sim_options=sim_cos)
basic.fit(trainset)
pred_basic = basic.test(testset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


In [189]:
accuracy.rmse(pred_basic)

RMSE: 0.0557


0.05565945927847275

In [190]:
# Run 5-fold cross-validation and then print results
cross_validate(basic, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.0549  0.0553  0.0553  0.0556  0.0552  0.0553  0.0002  
MAE (testset)     0.0260  0.0262  0.0262  0.0261  0.0260  0.0261  0.0001  
Fit time          42.82   42.36   44.38   42.91   42.67   43.03   0.70    
Test time         56.36   54.70   55.44   54.02   54.60   55.02   0.81    


{'test_rmse': array([0.0548661 , 0.05525115, 0.05533076, 0.05564632, 0.05521624]),
 'test_mae': array([0.02599303, 0.02615748, 0.02619979, 0.02610955, 0.02598134]),
 'fit_time': (42.81795001029968,
  42.364338397979736,
  44.37890362739563,
  42.90917110443115,
  42.66652846336365),
 'test_time': (56.36422348022461,
  54.69707226753235,
  55.443408250808716,
  54.01895880699158,
  54.59734320640564)}

This model performed the best out of the three, averaging around .055 RMSE.

## KNNBaseline

In [None]:
# train KNNBaseline on 75% of known ratings
knnbl = KNNBaseline(sim_options=sim_cos)
knnbl.fit(trainset)
predictionsbl = knnbl.test(testset)

In [203]:
# check the accuracy using Root Mean Square Error
accuracy.rmse(predictionsbl)

RMSE: 0.0499


0.04990853482629698

In [204]:
# Run 5-fold cross-validation and then print results
cross_validate(knnbl, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBaseline on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.0498  0.0488  0.0496  0.0499  0.0489  0.0494  0.0005  
MAE (testset)     0.0251  0.0248  0.0250  0.0250  0.0248  0.0249  0.0001  
Fit time          43.90   45.83   45.44   44.95   44.59   44.94   0.67    
Test time         59.61   61.32   62.40   61.92   59.84   61.02   1.11    


{'test_rmse': array([0.04976837, 0.04876642, 0.0496154 , 0.04990214, 0.0489024 ]),
 'test_mae': array([0.0250824 , 0.02479574, 0.02501899, 0.02496406, 0.02475209]),
 'fit_time': (43.89653968811035,
  45.83095097541809,
  45.43925452232361,
  44.950703859329224,
  44.58678960800171),
 'test_time': (59.6131854057312,
  61.32313895225525,
  62.400230884552,
  61.91628360748291,
  59.83726191520691)}

KNNBaseline performed the best out of all the initial models with a mean RMSE of .049; however, it takes a lot of processing power and time. I will run SVD, KNNBasic, and KNNBaseline through grid search and determine the most efficient model.
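To see the accuracy and run-time trade-off side by side, the sketch below collects the mean 5-fold RMSE and the mean fit and test times (in seconds) from the cross-validation tables above and ranks the four models. The numbers are copied from those tables; the `results` dictionary is just an illustrative way to arrange them.

```python
# Mean RMSE, fit time, and test time per fold, copied from the tables above.
results = {
    "SVD": {"rmse": 0.0616, "fit": 37.34, "test": 1.64},
    "KNNWithMeans": {"rmse": 0.0582, "fit": 42.67, "test": 57.39},
    "KNNBasic": {"rmse": 0.0553, "fit": 43.03, "test": 55.02},
    "KNNBaseline": {"rmse": 0.0494, "fit": 44.94, "test": 61.02},
}

# Rank models from lowest (best) to highest RMSE.
ranked = sorted(results, key=lambda name: results[name]["rmse"])
for name in ranked:
    r = results[name]
    total_time = r["fit"] + r["test"]
    print(f"{name:<13} RMSE={r['rmse']:.4f}  time per fold={total_time:.2f}s")

best = ranked[0]
fastest = min(results, key=lambda name: results[name]["fit"] + results[name]["test"])
print(best)     # KNNBaseline
print(fastest)  # SVD
```

The KNN models are all slow at test time compared with SVD, which is why run time is weighed alongside RMSE in the grid search.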

## GridSearch

### SVD

In [191]:
## Perform a gridsearch with SVD
params = {'n_factors': [20, 50, 100],
         'reg_all': [0.02, 0.05, 0.1]}
g_s_svd = GridSearchCV(SVD,param_grid=params,cv=3)
g_s_svd.fit(data)

In [192]:
svd_results_df = pd.DataFrame.from_dict(g_s_svd.cv_results)
svd_results_df.sort_values('rank_test_rmse', axis=0, ascending=True)

Unnamed: 0,split0_test_rmse,split1_test_rmse,split2_test_rmse,mean_test_rmse,std_test_rmse,rank_test_rmse,split0_test_mae,split1_test_mae,split2_test_mae,mean_test_mae,std_test_mae,rank_test_mae,mean_fit_time,std_fit_time,mean_test_time,std_test_time,params,param_n_factors,param_reg_all
1,0.055569,0.055898,0.056427,0.055965,0.000353,1,0.032815,0.032862,0.032935,0.03287,4.9e-05,1,12.367687,0.065654,2.630999,0.640686,"{'n_factors': 20, 'reg_all': 0.05}",20,0.05
0,0.056004,0.056357,0.05689,0.056417,0.000364,2,0.032794,0.032878,0.032959,0.032877,6.7e-05,2,12.72168,0.298643,2.647029,0.061911,"{'n_factors': 20, 'reg_all': 0.02}",20,0.02
2,0.056452,0.056871,0.057352,0.056892,0.000368,3,0.034489,0.034588,0.034642,0.034573,6.3e-05,4,12.352256,0.142177,2.34682,0.741288,"{'n_factors': 20, 'reg_all': 0.1}",20,0.1
4,0.056565,0.056912,0.057403,0.05696,0.000344,4,0.03365,0.033743,0.033784,0.033726,5.6e-05,3,19.110698,0.236326,2.601053,0.59254,"{'n_factors': 50, 'reg_all': 0.05}",50,0.05
5,0.056845,0.057256,0.057649,0.05725,0.000328,5,0.034762,0.034843,0.034855,0.03482,4.1e-05,5,18.6012,0.041395,2.816255,0.693492,"{'n_factors': 50, 'reg_all': 0.1}",50,0.1
8,0.057404,0.057813,0.05826,0.057826,0.00035,6,0.035134,0.03523,0.035279,0.035214,6e-05,7,29.429306,0.42509,2.35406,0.677032,"{'n_factors': 100, 'reg_all': 0.1}",100,0.1
7,0.058161,0.058608,0.059131,0.058633,0.000396,7,0.034867,0.035034,0.035139,0.035013,0.000112,6,29.298174,0.009767,2.871948,0.67893,"{'n_factors': 100, 'reg_all': 0.05}",100,0.05
3,0.058848,0.059112,0.059645,0.059202,0.000331,8,0.035304,0.035357,0.035446,0.035369,5.9e-05,8,18.611601,0.11015,2.578182,0.603821,"{'n_factors': 50, 'reg_all': 0.02}",50,0.02
6,0.063113,0.063563,0.064012,0.063563,0.000367,9,0.038578,0.038767,0.038855,0.038733,0.000115,9,29.444763,0.204757,2.413737,0.741982,"{'n_factors': 100, 'reg_all': 0.02}",100,0.02


### KNNBasic

In [193]:
## Perform a gridsearch with KNNBasic
params = {'k': [3, 5, 10, 20],
              'sim_options': {'name': ['msd', 'cosine', 'pearson'],
                              'min_support': [1, 5],
                              'user_based': [True]}
              }

g_s_knnb = GridSearchCV(knns.KNNBasic,param_grid=params, measures=['rmse', 'mae'], cv=3)
g_s_knnb.fit(data)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...


Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...


Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity

In [194]:
knnb_results_df = pd.DataFrame.from_dict(g_s_knnb.cv_results)
knnb_results_df.sort_values('rank_test_rmse', axis=0, ascending=True)

Unnamed: 0,split0_test_rmse,split1_test_rmse,split2_test_rmse,mean_test_rmse,std_test_rmse,rank_test_rmse,split0_test_mae,split1_test_mae,split2_test_mae,mean_test_mae,std_test_mae,rank_test_mae,mean_fit_time,std_fit_time,mean_test_time,std_test_time,params,param_k,param_sim_options
13,0.040218,0.039928,0.039447,0.039864,0.000318,1,0.016662,0.016656,0.016533,0.016617,5.9e-05,1,13.097523,0.057223,62.239576,1.328243,"{'k': 10, 'sim_options': {'name': 'msd', 'min_...",10,"{'name': 'msd', 'min_support': 5, 'user_based'..."
19,0.040899,0.040611,0.040089,0.040533,0.000335,2,0.016981,0.016986,0.016885,0.016951,4.6e-05,2,13.517723,0.076073,69.103243,0.766686,"{'k': 20, 'sim_options': {'name': 'msd', 'min_...",20,"{'name': 'msd', 'min_support': 5, 'user_based'..."
7,0.040973,0.040647,0.040113,0.040578,0.000354,3,0.01715,0.017134,0.016954,0.017079,8.9e-05,3,12.887776,0.116211,55.316473,0.571564,"{'k': 5, 'sim_options': {'name': 'msd', 'min_s...",5,"{'name': 'msd', 'min_support': 5, 'user_based'..."
1,0.042884,0.042483,0.041961,0.042443,0.000378,4,0.018165,0.018107,0.017919,0.018064,0.000105,4,13.616208,0.108903,60.367647,0.258867,"{'k': 3, 'sim_options': {'name': 'msd', 'min_s...",3,"{'name': 'msd', 'min_support': 5, 'user_based'..."
23,0.046136,0.045926,0.045512,0.045858,0.000259,5,0.021183,0.0212,0.021111,0.021165,3.9e-05,6,33.518307,0.421631,69.754551,0.75247,"{'k': 20, 'sim_options': {'name': 'pearson', '...",20,"{'name': 'pearson', 'min_support': 5, 'user_ba..."
21,0.04666,0.046506,0.046167,0.046444,0.000206,6,0.021578,0.021636,0.021589,0.021601,2.5e-05,7,25.664355,0.152207,63.942294,0.712367,"{'k': 20, 'sim_options': {'name': 'cosine', 'm...",20,"{'name': 'cosine', 'min_support': 5, 'user_bas..."
17,0.047065,0.046845,0.046496,0.046802,0.000234,7,0.02179,0.021789,0.021736,0.021772,2.5e-05,8,32.435657,0.152704,62.080152,1.940532,"{'k': 10, 'sim_options': {'name': 'pearson', '...",10,"{'name': 'pearson', 'min_support': 5, 'user_ba..."
15,0.047579,0.047453,0.047194,0.047408,0.00016,8,0.022199,0.022235,0.022237,0.022224,1.8e-05,9,26.201667,0.082977,62.035468,0.155131,"{'k': 10, 'sim_options': {'name': 'cosine', 'm...",10,"{'name': 'cosine', 'min_support': 5, 'user_bas..."
18,0.047587,0.047646,0.04721,0.047481,0.000193,9,0.020842,0.020904,0.020815,0.020854,3.7e-05,5,13.657705,0.218125,68.829284,1.261472,"{'k': 20, 'sim_options': {'name': 'msd', 'min_...",20,"{'name': 'msd', 'min_support': 1, 'user_based'..."
11,0.049904,0.049861,0.049401,0.049722,0.000228,10,0.023337,0.023411,0.023333,0.02336,3.6e-05,11,31.707924,0.156391,55.110501,0.534974,"{'k': 5, 'sim_options': {'name': 'pearson', 'm...",5,"{'name': 'pearson', 'min_support': 5, 'user_ba..."

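To make the size of the search concrete, the sketch below enumerates the parameter grid in plain Python. It is only an illustration and not Surprise's own code: the helper name `expand_grid` is hypothetical, and the ordering (nested `sim_options` expanded first, `k` varying slowest) is an assumption that happens to line up with the row indices in the results tables in this section.

```python
from itertools import product

# Same grid as the KNN grid searches in this section.
params = {
    'k': [3, 5, 10, 20],
    'sim_options': {
        'name': ['msd', 'cosine', 'pearson'],
        'min_support': [1, 5],
        'user_based': [True],
    },
}

def expand_grid(grid):
    """List every parameter combination, expanding the nested sim_options dict first."""
    sim = grid['sim_options']
    sim_combos = [dict(zip(sim, values)) for values in product(*sim.values())]
    flat = {**grid, 'sim_options': sim_combos}
    return [dict(zip(flat, values)) for values in product(*flat.values())]

combos = expand_grid(params)
n_folds = 3  # cv=3 in the grid searches

print(len(combos), 'combinations,', len(combos) * n_folds, 'fits')
print(combos[13])  # the row index of the best-ranked result
```

With 4 values of k, 3 similarity measures, and 2 values of min_support, the grid has 24 combinations, so a 3-fold search fits 72 models, which is why the similarity-matrix log lines repeat so many times.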

# Final Model - KNNBasic

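Despite running a KNNBaseline gridsearch below, KNNBasic will still be the final model. The two models score similarly: in the 3-fold grid searches the best mean RMSE was about 0.040 for KNNBasic and 0.038 for KNNBaseline, and KNNBasic reaches about 0.037 in the 5-fold cross-validation below. KNNBasic also skips the baseline-estimation step, so it takes less processing and fits faster (about 13 s versus 15 s mean fit time in the grid searches).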

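For intuition about what the final model's settings mean (k=10, MSD similarity, min_support=5, user-based), here is a simplified plain-Python sketch of a KNNBasic-style prediction. It is an illustration under stated assumptions, not Surprise's implementation: the function names and the toy ratings are hypothetical, similarity is taken as 1 / (MSD + 1) over games both users have rated, and pairs with fewer than `min_support` games in common get a similarity of 0.

```python
from heapq import nlargest

def msd_similarity(ratings_u, ratings_v, min_support=5):
    """MSD-based similarity between two users given {appid: rating} dicts."""
    common = ratings_u.keys() & ratings_v.keys()
    if len(common) < min_support:
        return 0.0  # too few shared games to trust the comparison
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)

def knn_basic_predict(user, item, ratings, k=10, min_support=5):
    """Similarity-weighted average of the k most similar users' ratings of item."""
    neighbours = []
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = msd_similarity(ratings[user], other_ratings, min_support)
        if sim > 0:
            neighbours.append((sim, other_ratings[item]))
    top_k = nlargest(k, neighbours, key=lambda pair: pair[0])
    if not top_k:
        return None  # no usable neighbours; a real system would fall back to a default
    total_sim = sum(sim for sim, _ in top_k)
    return sum(sim * rating for sim, rating in top_k) / total_sim

# Hypothetical toy data: three users, ratings on the 0-1 scale.
toy_ratings = {
    'a': {1: 0.8, 2: 0.2},
    'b': {1: 0.8, 2: 0.2, 3: 0.6},
    'c': {1: 0.2, 2: 0.8, 3: 0.1},
}
print(knn_basic_predict('a', 3, toy_ratings, k=2, min_support=2))
```

In the toy data, user b agrees with user a exactly and user c disagrees, so the prediction for user a lands closer to b's rating of game 3 than to c's. With the default min_support=5 none of the toy users share enough games, and the function returns None.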
In [195]:
sim_msd = {'name':'msd', 'min_support': 5, 'user_based':True}
basic = knns.KNNBasic(k=10, sim_options=sim_msd)
basic.fit(trainset)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x22f06e5f130>

In [196]:
pred_basic = basic.test(testset)
accuracy.rmse(pred_basic)

RMSE: 0.0379


0.0378652009869832

In [197]:
# Run 5-fold cross-validation and then print results
cross_validate(basic, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.0369  0.0372  0.0371  0.0368  0.0366  0.0369  0.0002  
MAE (testset)     0.0153  0.0153  0.0153  0.0153  0.0152  0.0153  0.0000  
Fit time          19.54   19.91   19.84   19.90   19.85   19.81   0.14    
Test time         46.21   48.11   49.17   46.36   48.45   47.66   1.17    


{'test_rmse': array([0.036866  , 0.03717723, 0.03710078, 0.03678117, 0.03662263]),
 'test_mae': array([0.0153145 , 0.01526091, 0.01533245, 0.01531441, 0.01522627]),
 'fit_time': (19.535670518875122,
  19.908380270004272,
  19.838443279266357,
  19.903308153152466,
  19.85189437866211),
 'test_time': (46.21174931526184,
  48.10717749595642,
  49.16745567321777,
  46.358625650405884,
  48.45045065879822)}

## KNNBaseline

In [205]:
## Perform a gridsearch with KNNBaseline
params = {'k': [3, 5, 10, 20],
          'sim_options': {'name': ['msd', 'cosine', 'pearson'],
                          'min_support': [1, 5],
                          'user_based': [True]}
          }

g_s_knnbl = GridSearchCV(knns.KNNBaseline, param_grid=params, measures=['rmse', 'mae'], cv=3)
g_s_knnbl.fit(data)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.

In [206]:
knnbl_results_df = pd.DataFrame.from_dict(g_s_knnbl.cv_results)
knnbl_results_df.sort_values('rank_test_rmse', axis=0, ascending=True)

Unnamed: 0,split0_test_rmse,split1_test_rmse,split2_test_rmse,mean_test_rmse,std_test_rmse,rank_test_rmse,split0_test_mae,split1_test_mae,split2_test_mae,mean_test_mae,std_test_mae,rank_test_mae,mean_fit_time,std_fit_time,mean_test_time,std_test_time,params,param_k,param_sim_options
13,0.038181,0.038321,0.038512,0.038338,0.000136,1,0.020089,0.020092,0.020096,0.020092,3e-06,1,15.240307,0.103147,66.675976,0.408292,"{'k': 10, 'sim_options': {'name': 'msd', 'min_...",10,"{'name': 'msd', 'min_support': 5, 'user_based'..."
19,0.038814,0.038972,0.039104,0.038963,0.000118,2,0.020375,0.020386,0.020364,0.020375,9e-06,2,15.182183,0.098108,72.066206,1.752682,"{'k': 20, 'sim_options': {'name': 'msd', 'min_...",20,"{'name': 'msd', 'min_support': 5, 'user_based'..."
7,0.038955,0.03897,0.039307,0.039077,0.000162,3,0.020593,0.020553,0.020643,0.020596,3.7e-05,3,15.153389,0.140495,61.809252,1.192889,"{'k': 5, 'sim_options': {'name': 'msd', 'min_s...",5,"{'name': 'msd', 'min_support': 5, 'user_based'..."
1,0.040819,0.040825,0.041375,0.041006,0.000261,4,0.021623,0.021605,0.021757,0.021662,6.7e-05,4,15.370632,0.0829,62.922294,1.19082,"{'k': 3, 'sim_options': {'name': 'msd', 'min_s...",3,"{'name': 'msd', 'min_support': 5, 'user_based'..."
23,0.041083,0.041158,0.041207,0.041149,5.1e-05,5,0.021978,0.021926,0.021876,0.021927,4.2e-05,7,33.393756,0.103795,70.936835,2.903967,"{'k': 20, 'sim_options': {'name': 'pearson', '...",20,"{'name': 'pearson', 'min_support': 5, 'user_ba..."
21,0.041133,0.041215,0.041324,0.041224,7.8e-05,6,0.021786,0.021718,0.021684,0.021729,4.2e-05,5,27.421967,0.133225,67.987219,0.245597,"{'k': 20, 'sim_options': {'name': 'cosine', 'm...",20,"{'name': 'cosine', 'min_support': 5, 'user_bas..."
15,0.041533,0.041637,0.041705,0.041625,7.1e-05,7,0.02196,0.021902,0.021869,0.02191,3.8e-05,6,28.400524,0.214658,65.075172,0.194819,"{'k': 10, 'sim_options': {'name': 'cosine', 'm...",10,"{'name': 'cosine', 'min_support': 5, 'user_bas..."
17,0.041661,0.041736,0.041695,0.041698,3.1e-05,8,0.022282,0.022252,0.022151,0.022228,5.6e-05,8,34.523313,0.257811,67.45361,1.780019,"{'k': 10, 'sim_options': {'name': 'pearson', '...",10,"{'name': 'pearson', 'min_support': 5, 'user_ba..."
9,0.043783,0.043793,0.043836,0.043804,2.3e-05,9,0.023151,0.023093,0.02303,0.023091,5e-05,10,28.844024,0.175574,63.195681,0.799548,"{'k': 5, 'sim_options': {'name': 'cosine', 'mi...",5,"{'name': 'cosine', 'min_support': 5, 'user_bas..."
11,0.044118,0.044144,0.044127,0.04413,1.1e-05,10,0.023633,0.023585,0.023509,0.023576,5.1e-05,11,34.943372,0.195255,62.147529,1.467407,"{'k': 5, 'sim_options': {'name': 'pearson', 'm...",5,"{'name': 'pearson', 'min_support': 5, 'user_ba..."


## Testing

Below we test the recommendation system. For a given user, it prints a requested number of games ordered from highest to lowest predicted rating, skipping any game already in that user's Steam library. After spot-checking a few users, the recommendations seem reasonably on point.

In [198]:
# print the top n recommendations
def recommended_games(user_ratings, game_title_df, n):
    # user_ratings holds (appid, predicted rating) pairs, already sorted best-first;
    # slicing also handles n <= 0 and lists shorter than n
    for idx, rec in enumerate(user_ratings[:n], start=1):
        title = game_title_df.loc[int(rec[0])]['name']
        print('Recommendation # ', idx, ': ', title, '\n')

In [209]:
def recs_for_user(steamid, num_games):
    # games the user already owns are excluded from the candidates
    owned_games = set(big_rating_df[big_rating_df['steamid'] == steamid]['appid'])
    list_of_games = []
    for appid in big_rating_df['appid'].unique():
        if appid not in owned_games:
            # .est is the predicted rating (index 3 of the Prediction tuple)
            list_of_games.append((appid, basic.predict(steamid, appid).est))
    # order the predictions from highest to lowest rated
    ranked_games = sorted(list_of_games, key=lambda x: x[1], reverse=True)
    recommended_games(ranked_games, game_data, num_games)

In [211]:
recs_for_user('76561197963796380', 10)

Recommendation #  1 :  Tom Clancy's Rainbow Six Siege 

Recommendation #  2 :  Paladins 

Recommendation #  3 :  Insurgency 

Recommendation #  4 :  Black Squad 

Recommendation #  5 :  Splitgate: Arena Warfare 

Recommendation #  6 :  PUBG: BATTLEGROUNDS 

Recommendation #  7 :  Dirty Bomb 

Recommendation #  8 :  Warface 

Recommendation #  9 :  Halo Infinite 

Recommendation #  10 :  Day of Defeat: Source 



# Evaluation

As mentioned above, the final model is KNNBasic. The hyperparameters used were k=10 with similarity options of MSD, min_support=5, and user_based=True. With an RMSE of about 0.037 on a 0-1 rating scale, a typical prediction error is roughly 4% of the rating range, so we can be reasonably confident in the predictions. RMSE is on the same scale as the ratings, which makes the result easy to interpret.

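To show why RMSE and MAE read directly on the rating scale, here is a minimal plain-Python sketch of both metrics. The function names and the sample (predicted, actual) pairs are hypothetical and are not taken from the model's output.

```python
import math

def rmse(pairs):
    """Root mean squared error over (predicted, actual) pairs."""
    return math.sqrt(sum((pred - actual) ** 2 for pred, actual in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (predicted, actual) pairs."""
    return sum(abs(pred - actual) for pred, actual in pairs) / len(pairs)

# Hypothetical predictions on a 0-1 rating scale.
sample = [(0.10, 0.12), (0.40, 0.35), (0.05, 0.05), (0.90, 0.80)]
print('RMSE:', round(rmse(sample), 4))
print('MAE: ', round(mae(sample), 4))
```

Both numbers are in rating units, so they can be compared against the 0-1 range directly. RMSE squares the errors before averaging, so it is never smaller than MAE and is pulled up more by a few large misses.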
# Conclusion

All in all, the recommendation system created here isn't bad, though there are some issues that would prevent me from deploying it. First, it currently works off the 1,000 most owned games on Steam because of a scraping limitation. Though this covers the majority of user game libraries, it biases the system toward more popular games. There is also the cold-start issue inherent in collaborative filtering: users with few or no hours played may not get proper recommendations and would be better served by something like a popularity-based recommender. Finally, I found out pretty late in the process that there is a defined difference between 'explicit' and 'implicit' rating systems. Explicit systems use ratings that users give directly, while implicit systems infer ratings from other signals, as I did here with the user-tag hours. Unfortunately, that means there are other metrics I should be using to judge my models, such as Mean Average Precision at K (MAPK, MAP@K). At this stage I will have to settle for the results here, and perhaps get more reliable results later. I think there is value in the rating system I implemented here, and I will likely use it again in similar projects.
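
As a rough illustration of the MAP@K metric mentioned above, here is a minimal sketch in plain Python. It treats a user's "relevant" games as held-out games that the user actually played, which is an assumption for illustration rather than something this project defines. The function names and toy ids are hypothetical, and normalizing by min(number of relevant games, K) is one common convention among several.

```python
def average_precision_at_k(recommended, relevant, k):
    """AP@K for one user: recommended is a ranked list of unique ids, relevant a collection of ids."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    score = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(len(relevant), k)

def map_at_k(recs_by_user, relevant_by_user, k):
    """Mean of AP@K over every user that has a recommendation list."""
    if not recs_by_user:
        return 0.0
    total = sum(
        average_precision_at_k(recs, relevant_by_user.get(user, ()), k)
        for user, recs in recs_by_user.items()
    )
    return total / len(recs_by_user)

# Hypothetical toy data: ranked recommendations and held-out played games.
recs = {'u1': [10, 20, 30], 'u2': [40, 50, 60]}
played = {'u1': {10, 30}, 'u2': {60}}
print(map_at_k(recs, played, k=3))
```

Unlike RMSE, this metric only rewards putting games the user actually engaged with near the top of the list, which is closer to how the recommendations are used.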