# Steam Rec System

#### Author: Chris O'Malley

## Overview

The goal of this project was to build a collaborative recommendation system for Steam using user game libraries. A profile was built for each user from the hours they spent playing different kinds of games, and recommendations are given to a user based on users with similar profiles. By basing recommendations on hours played rather than games purchased, the aim is to find more engaging games for users, keeping them playing rather than losing interest in gaming in favor of other pastimes.

## Business Understanding

Steam already has its own recommendation system, the Discovery Queue. It is based on previously purchased games and incorporates new and popular games. The Steam Interactive Recommender was also released in 2020; it uses machine learning to provide recommendations based on games played and on users who have played those same games. It has a tag filtering system and can be filtered by popularity and release date. My recommendation system isn't meant to replace those systems so much as to show an alternative process that I believe would be a good addition to them. Though game purchases are understandably a high priority for any company, I think that making them the only focus is short-sighted. Helping users find games they will enjoy will keep them on the Steam platform playing games, as opposed to losing interest and finding other ways to spend their time. I believe keeping users engaged will ultimately keep them gaming, on the platform, and purchasing more games.

## Data Understanding

There were two main datasets used for this project, a game dataset and a user dataset, scraped or requested from various sources. It should be noted that I used https://nik-davis.github.io/posts/2019/steam-data-collection/, a post from Nik Davis, as a basis for much of my scraping methods. While I could mostly copy, paste, and run code for the game dataset with minor adjustments, I had to make major changes to adapt it for my user library requests. I will go into more detail below.
#### Game dataset:
Comprised of app ids, titles, genres, and tags.

Appids: The appids of the 1,000 most-owned games were requested from the SteamSpy.com API. Unfortunately, changes to the API in recent years meant I was not able to request every game title at once, limiting me to the 1,000 most-owned games. My attempts to run multiple requests per the SteamSpy API documentation did not work, so I must acknowledge that this recommendation system is biased towards games with more owners. Nik Davis's code was used for this portion; however, it was somewhat dated because SteamSpy has since started limiting the 'all' request for its API, and I could not adapt it to gather the full Steam library. A decent system can still be built from the 1,000 most-owned games, with the understanding that it won't be perfect.

Titles, genres, tags: The appids acquired from SteamSpy API were used to request additional data from the Steam API, consisting of game titles, genres and tags. This was again copied line by line from Nik Davis and worked smoothly with minor adjustment.

The above data was combined into the game_data dataset.

#### User dataset
Comprised of steamids and game libraries.

Steamids: First, the appids of the 1,000 top-rated and 1,000 most-followed games were scraped from Steamdb.com using Selenium. Those appids were then used to scrape up to 100 steamids from the most recent reviews of each of those games on Steampowered.com. This was done both to have relevant/active users and to gather a sample from different types of games. This ultimately yielded around 60,000 steamids.

Game libraries: Game libraries were requested from the Steam API. I was able to adapt the code from Nik for this purpose with major changes. The Steam API has a rate limit that had to be accounted for, and extra helper functions were needed to deal with errors. Request retries needed to be removed because if a request failed, it was likely because a steamid was no longer valid. The return format also proved to be an issue. However, all issues were eventually resolved with adjustments.
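The request loop described above (throttling, no retries, and handling of the return format) can be sketched roughly as follows. This is a minimal illustration and not the code in data/data_harvest: `fetch` is a hypothetical callable that performs the HTTP request for one steamid and returns the decoded JSON, `delay_seconds` is an assumed throttle value, and using `None` for failed requests is my own convention. The payload shape follows Steam's GetOwnedGames response, where `playtime_forever` is reported in minutes; converting it to hours here is an assumption about how the `hours` field was produced.

```python
import time


def parse_owned_games(payload):
    """Turn a GetOwnedGames-style payload into a library list, or 'hidden'."""
    games = payload.get('response', {}).get('games')
    if not games:
        # Private profiles come back without a 'games' list
        return 'hidden'
    return [
        {
            'appid': game['appid'],
            'name': game.get('name', ''),
            # Assumption: convert Steam's minutes to hours
            'hours': game.get('playtime_forever', 0) / 60,
        }
        for game in games
    ]


def collect_libraries(steamids, fetch, delay_seconds=1.0):
    """Request each library once, pausing between calls and never retrying."""
    libraries = {}
    for steamid in steamids:
        try:
            libraries[steamid] = parse_owned_games(fetch(steamid))
        except Exception:
            # No retry: a failed request most likely means an invalid steamid
            libraries[steamid] = None
        time.sleep(delay_seconds)  # stay under the API rate limit
    return libraries
```

Passing the request function in as an argument keeps the throttling and parsing logic testable without touching the network.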

The above data was combined into the library_df dataset.

This work can be found in data/data_harvest.

## Technical Understanding

Of the 57,000 user libraries, around 50,000 were hidden. After dropping duplicates and users with 0 hours played there were 5,000 users left. Though at least 10,000 users would have been preferable, 5,000 was enough to build a relatively accurate model.

The recommendation system created here uses an hours-by-tag personalized rating system. Each game in a user's library has a community-voted list of tags associated with it. A profile was built for each user based on the total number of hours played per tag. Normalizing these hours gives a value between 0 and 1 for each tag, corresponding to the share of time spent playing games with that tag. Each game in the user's library was then scored by summing these values over the tags associated with it. Every game in the user's library ended up with a score between 0 and 1, which was then used in the collaborative recommendation system to find users with similar rating profiles.
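As a worked illustration of this scoring scheme, here is a small self-contained sketch on made-up data. The appids, tags, hours, and the function names `tag_profile` and `rate_library` are all hypothetical; the notebook's own functions operate on the real dataframes instead.

```python
def tag_profile(library, game_tags):
    """Sum hours per tag, then normalize so the tag shares add up to 1."""
    totals = {}
    for game in library:
        if game['hours'] == 0:
            continue  # unplayed games do not shape the profile
        for tag in game_tags.get(game['appid'], []):
            totals[tag] = totals.get(tag, 0) + game['hours']
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {tag: hours / grand_total for tag, hours in totals.items()}


def rate_library(library, game_tags, profile):
    """Score each owned game by summing the profile shares of its tags."""
    return {
        game['appid']: sum(profile.get(tag, 0.0)
                           for tag in game_tags.get(game['appid'], []))
        for game in library
    }


# Toy data: three games, one of them never played
game_tags = {1: ['FPS', 'Action'], 2: ['Action', 'RPG'], 3: ['Puzzle']}
library = [
    {'appid': 1, 'hours': 30},
    {'appid': 2, 'hours': 10},
    {'appid': 3, 'hours': 0},
]

profile = tag_profile(library, game_tags)
ratings = rate_library(library, game_tags, profile)
print(profile)  # {'FPS': 0.375, 'Action': 0.5, 'RPG': 0.125}
print(ratings)  # {1: 0.875, 2: 0.625, 3: 0.0}
```

Because the tag shares sum to 1 and each game contributes each of its tags at most once, a game's score cannot exceed 1.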

From the Python Surprise library I used SVD, KNNWithMeans, and KNNBasic to model my data. They are standard starting points and are not extremely computationally expensive. RMSE was used as the metric to score the models, as it is easy to interpret (being on the same scale as the ratings) and is in standard use for collaborative recommendation systems. KNNBasic scored best out of the three with default parameters and cross-validation. After running grid searches with both SVD and KNNBasic, the best performing model was KNNBasic with k=10 and similarity options of MSD, min_support=5, and user_based=True. The RMSE of the final model was .037 for a rating scale of 0 to 1.
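For reference, RMSE and MAE can be computed by hand as in the sketch below. The rating values are invented for illustration; Surprise's `accuracy.rmse` is what the notebook actually calls.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up ratings on the same 0-to-1 scale used in this project
actual = [0.0, 0.2, 0.5, 1.0]
predicted = [0.1, 0.2, 0.4, 0.8]

print(round(rmse(actual, predicted), 4))  # 0.1225
print(round(mae(actual, predicted), 4))   # 0.1
```

RMSE penalizes large misses more heavily than MAE, so it is never smaller than MAE on the same predictions, which is consistent with the cross-validation tables later in the notebook.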

In [48]:
# Import relevant modules
import pandas as pd
import pickle
import json
import ast
from surprise import Dataset, Reader, accuracy
from surprise.model_selection import train_test_split, cross_validate, GridSearchCV
from surprise.prediction_algorithms import knns, SVD, KNNWithMeans, KNNBasic
from surprise.similarities import cosine, msd, pearson
import matplotlib.pyplot as plt
%matplotlib inline

### Game dataset

In [49]:
game_data = pd.read_csv('data/steamspy_data.csv')
game_data.head()

Unnamed: 0,appid,name,developer,publisher,score_rank,positive,negative,userscore,owners,average_forever,average_2weeks,median_forever,median_2weeks,price,initialprice,discount,languages,genre,ccu,tags
0,10,Counter-Strike,Valve,Valve,,185686,4807,0,"10,000,000 .. 20,000,000",9363,426,262,323,199,999,80,"English, French, German, Italian, Spanish - Sp...",Action,11955,"{'Action': 5372, 'FPS': 4796, 'Multiplayer': 3..."
1,20,Team Fortress Classic,Valve,Valve,,5235,874,0,"2,000,000 .. 5,000,000",852,3,27,3,99,499,80,"English, French, German, Italian, Spanish - Sp...",Action,94,"{'Action': 745, 'FPS': 306, 'Multiplayer': 258..."
2,30,Day of Defeat,Valve,Valve,,4885,541,0,"5,000,000 .. 10,000,000",811,0,16,0,99,499,80,"English, French, German, Italian, Spanish - Spain",Action,119,"{'FPS': 785, 'World War II': 246, 'Multiplayer..."
3,40,Deathmatch Classic,Valve,Valve,,1791,403,0,"5,000,000 .. 10,000,000",271,0,12,0,99,499,80,"English, French, German, Italian, Spanish - Sp...",Action,10,"{'Action': 628, 'FPS': 138, 'Classic': 106, 'M..."
4,50,Half-Life: Opposing Force,Gearbox Software,Valve,,12501,638,0,"5,000,000 .. 10,000,000",1919,3,171,5,99,499,80,"English, French, German, Korean",Action,122,"{'FPS': 879, 'Action': 321, 'Classic': 250, 'S..."


Here I create a function to return a list of the top ten tags by user vote.

In [50]:
def get_app_tags(tag_dict):
    # Parse the stringified {tag: votes} dict; iterating it yields the tag names
    tags = ast.literal_eval(tag_dict)
    # Keep at most the first ten tags, in the order they appear in the data
    return list(tags)[:10]

I apply that function to the 'tags' column as I don't need the number of votes.

In [51]:
game_data['tags'] = game_data['tags'].apply(get_app_tags)
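The function above keeps the first ten keys, which gives the top ten by votes only if the source dictionary is already ordered by vote count. A variant that sorts explicitly could look like the sketch below; `top_tags` is a hypothetical name and the sample string is made up.

```python
import ast


def top_tags(tag_string, n=10):
    """Return up to n tag names, ordered by descending vote count."""
    tags = ast.literal_eval(tag_string)
    if not isinstance(tags, dict):
        # Games without votes may be stored as an empty list
        return list(tags)[:n]
    return sorted(tags, key=tags.get, reverse=True)[:n]


sample = "{'FPS': 10, 'Action': 50, 'Classic': 5}"
print(top_tags(sample, n=2))  # ['Action', 'FPS']
```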

In [52]:
game_data.head()

Unnamed: 0,appid,name,developer,publisher,score_rank,positive,negative,userscore,owners,average_forever,average_2weeks,median_forever,median_2weeks,price,initialprice,discount,languages,genre,ccu,tags
0,10,Counter-Strike,Valve,Valve,,185686,4807,0,"10,000,000 .. 20,000,000",9363,426,262,323,199,999,80,"English, French, German, Italian, Spanish - Sp...",Action,11955,"[Action, FPS, Multiplayer, Shooter, Classic, T..."
1,20,Team Fortress Classic,Valve,Valve,,5235,874,0,"2,000,000 .. 5,000,000",852,3,27,3,99,499,80,"English, French, German, Italian, Spanish - Sp...",Action,94,"[Action, FPS, Multiplayer, Classic, Hero Shoot..."
2,30,Day of Defeat,Valve,Valve,,4885,541,0,"5,000,000 .. 10,000,000",811,0,16,0,99,499,80,"English, French, German, Italian, Spanish - Spain",Action,119,"[FPS, World War II, Multiplayer, Shooter, Acti..."
3,40,Deathmatch Classic,Valve,Valve,,1791,403,0,"5,000,000 .. 10,000,000",271,0,12,0,99,499,80,"English, French, German, Italian, Spanish - Sp...",Action,10,"[Action, FPS, Classic, Multiplayer, Shooter, F..."
4,50,Half-Life: Opposing Force,Gearbox Software,Valve,,12501,638,0,"5,000,000 .. 10,000,000",1919,3,171,5,99,499,80,"English, French, German, Korean",Action,122,"[FPS, Action, Classic, Sci-fi, Singleplayer, S..."


### User dataset

In [53]:
library_df = pd.read_csv('data/library_data.csv')
library_df

Unnamed: 0,steamid,library
0,76561198219067393,"[{'appid': 220, 'name': 'Half-Life 2', 'hours'..."
1,76561198148157441,"[{'appid': 17390, 'name': 'Spore', 'hours': 26..."
2,76561198993539076,hidden
3,76561198247182340,hidden
4,76561198278705159,hidden
...,...,...
56973,76561197990543347,hidden
56974,76561199206760437,hidden
56975,76561198324908021,hidden
56976,76561198253735927,hidden


In [54]:
# Setting index to steamid for future usage.
library_df.set_index('steamid', inplace=True)

Here I drop hidden libraries, bringing the total down to around 6,600.

In [55]:
hidden_libraries = library_df[library_df['library'] == 'hidden'].index
library_df = library_df.drop(hidden_libraries)
len(library_df)

6600

In [None]:
# Dropping Duplicates
library_df.drop_duplicates(inplace=True)

In [57]:
len(library_df)

6340

A helper function to get the tags from game_data with an appid.

In [58]:
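def get_app_tags(appid):
    # Note: this reuses the name of the earlier tag-parsing function and replaces it.
    # The lookup is label-based, so game_data must be indexed by appid for it to work.
    return game_data.loc[appid]['tags']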
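Since `.loc` and `in game_data.index` both work on index labels, it is worth checking what the index holds before relying on this helper. The sketch below uses a made-up three-row frame (`toy_games` and `by_appid` are hypothetical names) to show the difference between the default integer index and an appid index. Whether the original notebook re-indexed game_data in a cell not shown here is unknown.

```python
import pandas as pd

# Toy stand-in for game_data with the default 0..n-1 index
toy_games = pd.DataFrame({
    'appid': [10, 20, 30],
    'name': ['Counter-Strike', 'Team Fortress Classic', 'Day of Defeat'],
    'tags': [['Action', 'FPS'], ['Action', 'Classic'], ['FPS', 'World War II']],
})

# With the default index, membership tests check row positions, not appids
print(1 in toy_games.index)   # True, although no game has appid 1
print(10 in toy_games.index)  # False, although appid 10 exists

# Indexing by appid makes label lookups mean what the helper expects
by_appid = toy_games.set_index('appid')
print(10 in by_appid.index)        # True
print(by_appid.loc[10, 'tags'])    # ['Action', 'FPS']
```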

Here I create a function that goes through every game in a library, retrieves its tags, and adds the hours played to a running total for each tag. It then returns a dictionary of the user's total hours by tag, sorted from most to fewest hours.

In [59]:
def tag_hours(games_list):
    games_list = ast.literal_eval(games_list)
    tag_dict = {}
    for game_dict in games_list:
        if game_dict['appid'] in game_data.index and game_dict['hours'] != 0:            
            tags = get_app_tags(game_dict['appid'])
            for tag in tags:
                if tag in tag_dict.keys():
                    tag_dict[tag] += game_dict['hours']
                else:
                    tag_dict[tag] = game_dict['hours']
    return dict(sorted(tag_dict.items(), key=lambda item: item[1], reverse=True))

Applying the tag_hours function to the user's game library and saving tag_hours as a new column.

In [None]:
library_df['tag_hours'] = library_df['library'].apply(tag_hours)

In [61]:
library_df

steamid,library,tag_hours
76561198148157441,"[{'appid': 17390, 'name': 'Spore', 'hours': 26...","{'Automobile Sim': 7053, 'Simulation': 7053, '..."
76561198170079242,"[{'appid': 3830, 'name': 'Psychonauts', 'hours...","{'Singleplayer': 13753, 'Automobile Sim': 8111..."
76561198088650778,"[{'appid': 4000, 'name': ""Garry's Mod"", 'hours...",{}
76561198886682654,"[{'appid': 4000, 'name': ""Garry's Mod"", 'hours...","{'Singleplayer': 4960, 'Automobile Sim': 3978,..."
76561198311899167,"[{'appid': 10, 'name': 'Counter-Strike', 'hour...","{'Singleplayer': 7217, 'Story Rich': 5740, 'Gr..."
...,...,...
76561198208253879,"[{'appid': 70, 'name': 'Half-Life', 'hours': 3...","{'Story Rich': 25593, 'Singleplayer': 24882, '..."
76561199104131020,"[{'appid': 230410, 'name': 'Warframe', 'hours'...","{'Automobile Sim': 9108, 'Simulation': 9108, '..."
76561198012694491,"[{'appid': 240, 'name': 'Counter-Strike: Sourc...","{'Singleplayer': 43238, 'Adventure': 41911, 'E..."
76561198393589724,"[{'appid': 2600, 'name': 'Vampire: The Masquer...",{}


Dropping users with no hours played.

In [62]:
no_hours = library_df[library_df['tag_hours'] == {}].index
library_df = library_df.drop(no_hours)
len(library_df)

4933

This function normalizes all the hours in the library to values between 0 and 1, effectively giving us each tag's share of total hours, which we will use to rate the games.

In [63]:
def normalize_hours(tag_dict):
    tag_labels = tag_dict.keys()
    tag_hours = tag_dict.values()
    sum_hours = sum(tag_hours)
    norm_hours = [float(i)/sum_hours for i in tag_hours]
    return dict(zip(tag_labels, norm_hours))

Applying to the tag_hours column.

In [None]:
library_df['tag_hours'] = library_df['tag_hours'].apply(normalize_hours)

In [65]:
library_df

steamid,library,tag_hours
76561198148157441,"[{'appid': 17390, 'name': 'Spore', 'hours': 26...","{'Automobile Sim': 0.1, 'Simulation': 0.1, 'Dr..."
76561198170079242,"[{'appid': 3830, 'name': 'Psychonauts', 'hours...","{'Singleplayer': 0.1, 'Automobile Sim': 0.0589..."
76561198886682654,"[{'appid': 4000, 'name': ""Garry's Mod"", 'hours...","{'Singleplayer': 0.09614266330684242, 'Automob..."
76561198311899167,"[{'appid': 10, 'name': 'Counter-Strike', 'hour...","{'Singleplayer': 0.08663865546218487, 'Story R..."
76561199063236653,"[{'appid': 500, 'name': 'Left 4 Dead', 'hours'...","{'Singleplayer': 0.09933591588267847, 'Buildin..."
...,...,...
76561199085649821,"[{'appid': 4000, 'name': ""Garry's Mod"", 'hours...","{'Visual Novel': 0.1, 'Story Rich': 0.1, 'Free..."
76561198054375336,"[{'appid': 220, 'name': 'Half-Life 2', 'hours'...","{'Story Rich': 0.07863520408163266, 'Adventure..."
76561198208253879,"[{'appid': 70, 'name': 'Half-Life', 'hours': 3...","{'Story Rich': 0.09840433712703783, 'Singlepla..."
76561199104131020,"[{'appid': 230410, 'name': 'Warframe', 'hours'...","{'Automobile Sim': 0.1, 'Simulation': 0.1, 'Dr..."


Here I noticed that some duplicate steamids remained in the index, so they were dropped, keeping the first occurrence of each.

In [66]:
library_df = library_df[~library_df.index.duplicated(keep='first')]

In [67]:
library_df

steamid,library,tag_hours
76561198148157441,"[{'appid': 17390, 'name': 'Spore', 'hours': 26...","{'Automobile Sim': 0.1, 'Simulation': 0.1, 'Dr..."
76561198170079242,"[{'appid': 3830, 'name': 'Psychonauts', 'hours...","{'Singleplayer': 0.1, 'Automobile Sim': 0.0589..."
76561198886682654,"[{'appid': 4000, 'name': ""Garry's Mod"", 'hours...","{'Singleplayer': 0.09614266330684242, 'Automob..."
76561198311899167,"[{'appid': 10, 'name': 'Counter-Strike', 'hour...","{'Singleplayer': 0.08663865546218487, 'Story R..."
76561199063236653,"[{'appid': 500, 'name': 'Left 4 Dead', 'hours'...","{'Singleplayer': 0.09933591588267847, 'Buildin..."
...,...,...
76561198809481114,"[{'appid': 208500, 'name': 'F1 2012', 'hours':...","{'Automobile Sim': 0.1, 'Simulation': 0.1, 'Dr..."
76561199085649821,"[{'appid': 4000, 'name': ""Garry's Mod"", 'hours...","{'Visual Novel': 0.1, 'Story Rich': 0.1, 'Free..."
76561198054375336,"[{'appid': 220, 'name': 'Half-Life 2', 'hours'...","{'Story Rich': 0.07863520408163266, 'Adventure..."
76561199104131020,"[{'appid': 230410, 'name': 'Warframe', 'hours'...","{'Automobile Sim': 0.1, 'Simulation': 0.1, 'Dr..."


This function applies a rating to every game in a user's library based on its tag scores. The more heavily played tags a game has, the higher it will score, whereas games whose tags are associated with few hours will get a low rating.

In [68]:
def rate_game_library(steamid):
    try:
        library = ast.literal_eval(library_df.loc[steamid]['library'])
    except ValueError:
        library = ast.literal_eval(library_df.loc[steamid]['library'].iloc[0])
    tag_hours = library_df.loc[steamid]['tag_hours']
    scores = {}
    for game in library:
        score = 0
        appid = game['appid']
        if appid in game_data.index:
            game_tags = game_data.loc[appid]['tags']
            for tag in game_tags:
                if tag in tag_hours:
                    score += tag_hours[tag]
        if appid in scores:
            scores[appid] += score
        else:
            scores[appid] = score
    return scores

This function takes in the whole dataframe and applies the rate_game_library function to it.

In [69]:
def game_ratings(df):
    index = df.index.values
    library_ratings = []
    for steamid in index:
        rated_library = rate_game_library(steamid)
        library_ratings.append(rated_library)
    return library_ratings

Saving the ratings to a new column by applying rate_game_library to each steamid.

In [None]:
library_df['game_ratings'] = library_df.index.to_series().apply(rate_game_library)

In [71]:
library_df

steamid,library,tag_hours,game_ratings
76561198148157441,"[{'appid': 17390, 'name': 'Spore', 'hours': 26...","{'Automobile Sim': 0.1, 'Simulation': 0.1, 'Dr...","{17390: 0, 17440: 0, 550: 0.1, 47870: 0, 65600..."
76561198170079242,"[{'appid': 3830, 'name': 'Psychonauts', 'hours...","{'Singleplayer': 0.1, 'Automobile Sim': 0.0589...","{3830: 0, 4000: 0, 12120: 0, 12250: 0, 19900: ..."
76561198886682654,"[{'appid': 4000, 'name': ""Garry's Mod"", 'hours...","{'Singleplayer': 0.09614266330684242, 'Automob...","{4000: 0, 400: 0.05760806357821284, 17390: 0, ..."
76561198311899167,"[{'appid': 10, 'name': 'Counter-Strike', 'hour...","{'Singleplayer': 0.08663865546218487, 'Story R...","{10: 0.06967587034813927, 80: 0.17858343337334..."
76561199063236653,"[{'appid': 500, 'name': 'Left 4 Dead', 'hours'...","{'Singleplayer': 0.09933591588267847, 'Buildin...","{500: 0.19407858328721636, 3590: 0, 550: 0.201..."
...,...,...,...
76561198809481114,"[{'appid': 208500, 'name': 'F1 2012', 'hours':...","{'Automobile Sim': 0.1, 'Simulation': 0.1, 'Dr...","{208500: 0, 215930: 0, 253920: 0, 253960: 0, 2..."
76561199085649821,"[{'appid': 4000, 'name': ""Garry's Mod"", 'hours...","{'Visual Novel': 0.1, 'Story Rich': 0.1, 'Free...","{4000: 0, 550: 0.9999999999999999, 22380: 0, 1..."
76561198054375336,"[{'appid': 220, 'name': 'Half-Life 2', 'hours'...","{'Story Rich': 0.07863520408163266, 'Adventure...","{220: 0.6276147959183673, 320: 0.1724489795918..."
76561199104131020,"[{'appid': 230410, 'name': 'Warframe', 'hours'...","{'Automobile Sim': 0.1, 'Simulation': 0.1, 'Dr...","{230410: 0, 238960: 0, 242720: 0, 291480: 0, 2..."


This function takes a steamid (user) and creates a new dataframe for that user from library_df, with each game and its rating as its own row.

In [73]:
def user_rating_df(steamid):
    # Collect all rows first and build the frame once;
    # DataFrame.append was removed in pandas 2.0
    rating_dict = library_df.loc[steamid]['game_ratings']
    rows = [{'steamid': str(steamid), 'appid': str(app), 'rating': rating}
            for app, rating in rating_dict.items()]
    return pd.DataFrame(rows, columns=['steamid', 'appid', 'rating'])

Here we create the dataframe that will contain every user-library dataframe.

In [None]:
user_rec_df = pd.DataFrame(columns=['steamid', 'appid', 'rating'])

This for loop will go through every user in library_df, create a user library rating dataframe, and then append it to the user_rec_df dataframe.

In [None]:
for num, steamid in enumerate(library_df.index):
    total = len(library_df)
    print(f"{num} out of {total}")
    user_df = user_rating_df(steamid)
    user_rec_df = pd.concat([user_rec_df, user_df], ignore_index = True, axis = 0)
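Growing a dataframe with one `pd.concat` per user copies the accumulated data on every iteration, which is one likely reason this loop is slow. A possible alternative, sketched here on made-up data, is to flatten all ratings into a list of rows first and build the dataframe once at the end. `flatten_ratings` is a hypothetical helper, not part of the original notebook.

```python
def flatten_ratings(game_ratings_by_user):
    """Turn {steamid: {appid: rating}} into (steamid, appid, rating) rows."""
    return [
        (str(steamid), str(appid), rating)
        for steamid, ratings in game_ratings_by_user.items()
        for appid, rating in ratings.items()
    ]


# Made-up ratings for two users
toy_ratings = {
    76561190000000001: {10: 0.25, 20: 0.0},
    76561190000000002: {10: 0.5},
}

rows = flatten_ratings(toy_ratings)
print(rows)
```

The resulting rows can then be passed to `pd.DataFrame(rows, columns=['steamid', 'appid', 'rating'])` in a single call, for example with `library_df['game_ratings'].to_dict()` as the input mapping.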

As the above loop takes a long time to run, here I'll load a completed one from the data folder.

In [76]:
with open('data/big_rating_df.pickle', 'rb') as handle:
    big_rating_df = pickle.load(handle)
big_rating_df

Unnamed: 0,steamid,appid,rating
0,76561198148157441,17390,0.055111
1,76561198148157441,17440,0.0
2,76561198148157441,550,0.215663
3,76561198148157441,47870,0.083723
4,76561198148157441,65600,0.0
...,...,...,...
901104,76561198008893422,1546540,0.0
901105,76561198008893422,582660,0.228932
901106,76561198008893422,1551360,0.0
901107,76561198008893422,46500,0.0


# Modeling

Creating the reader and data for working with Surprise.

In [77]:
# instantiate a reader and read in our rating data
reader = Reader(rating_scale=(0, 1))
data = Dataset.load_from_df(big_rating_df[['steamid','appid','rating']], reader)

Creating the train/test split for modeling.

In [78]:
# train on 75% of the known ratings, hold out 25% for testing
trainset, testset = train_test_split(data, test_size=.25)

## First Simple Model - SVD

We start with an SVD model with default parameters, as it is simple to run and easy to understand.

In [79]:
svd = SVD()
svd.fit(trainset)
predictions = svd.test(testset)

In [80]:
# check the accuracy using Root Mean Square Error
accuracy.rmse(predictions)

RMSE: 0.0624


0.062396558579031654

In [81]:
# Run 5-fold cross-validation and then print results
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.0619  0.0615  0.0613  0.0617  0.0619  0.0617  0.0002  
MAE (testset)     0.0372  0.0368  0.0371  0.0370  0.0370  0.0370  0.0001  
Fit time          34.84   34.74   35.09   35.02   34.82   34.90   0.13    
Test time         1.36    1.35    1.66    1.34    1.60    1.46    0.14    


{'test_rmse': array([0.06185194, 0.06151163, 0.06129947, 0.06169305, 0.06190668]),
 'test_mae': array([0.03720747, 0.03682958, 0.03708214, 0.03696086, 0.03701883]),
 'fit_time': (34.840415477752686,
  34.737008810043335,
  35.09151482582092,
  35.02129507064819,
  34.82382035255432),
 'test_time': (1.3645951747894287,
  1.3487894535064697,
  1.660546064376831,
  1.3368985652923584,
  1.5971174240112305)}

While the .06 RMSE is not necessarily bad, I will try some other models to see if I can work from another starting point.

## KNNWithMeans

In [82]:
sim_cos = {'name':'cosine', 'user_based':True}

In [83]:
knnwm = KNNWithMeans(sim_options=sim_cos)
knnwm.fit(trainset)
predictionswm = knnwm.test(testset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


In [84]:
# check the accuracy using Root Mean Square Error
accuracy.rmse(predictionswm)

RMSE: 0.0591


0.059078190080661064

In [85]:
# Run 5-fold cross-validation and then print results
cross_validate(knnwm, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.0583  0.0579  0.0580  0.0581  0.0583  0.0581  0.0002  
MAE (testset)     0.0335  0.0332  0.0334  0.0334  0.0336  0.0334  0.0001  
Fit time          40.86   41.25   41.27   41.07   41.29   41.15   0.16    
Test time         54.52   53.77   54.65   54.35   54.88   54.43   0.38    


{'test_rmse': array([0.0583365 , 0.0578687 , 0.05798811, 0.05812293, 0.05833687]),
 'test_mae': array([0.03346728, 0.03323179, 0.03338957, 0.03342956, 0.03357789]),
 'fit_time': (40.86032032966614,
  41.252482175827026,
  41.26520490646362,
  41.065455198287964,
  41.290992736816406),
 'test_time': (54.515414237976074,
  53.766077756881714,
  54.65489387512207,
  54.35278916358948,
  54.88260793685913)}

With an RMSE of around .058 this is an improvement on the SVD model, but I will continue on to the next model to see whether it can be improved further.

## KNNBasic

In [86]:
basic = knns.KNNBasic(sim_options=sim_cos)
basic.fit(trainset)
pred_basic = basic.test(testset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


In [87]:
accuracy.rmse(pred_basic)

RMSE: 0.0559


0.05593304507366181

In [88]:
# Run 5-fold cross-validation and then print results
cross_validate(basic, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.0548  0.0555  0.0550  0.0561  0.0550  0.0553  0.0005  
MAE (testset)     0.0259  0.0262  0.0261  0.0263  0.0260  0.0261  0.0001  
Fit time          40.87   40.94   41.54   41.26   41.20   41.16   0.24    
Test time         52.13   51.92   52.23   51.89   52.06   52.05   0.13    


{'test_rmse': array([0.05481555, 0.05545832, 0.05495332, 0.05612396, 0.05499054]),
 'test_mae': array([0.02591008, 0.02617641, 0.02608698, 0.02631249, 0.02596347]),
 'fit_time': (40.86977672576904,
  40.93999481201172,
  41.535391330718994,
  41.258631467819214,
  41.20149326324463),
 'test_time': (52.12744951248169,
  51.920119285583496,
  52.233962059020996,
  51.88645648956299,
  52.06074857711792)}

This model performed the best out of the three, averaging around .055 RMSE. I will try a grid search with both SVD and KNNBasic to be sure, but it is likely KNNBasic is the one I will go with.

## GridSearch

### SVD

In [89]:
## Perform a gridsearch with SVD
params = {'n_factors': [20, 50, 100],
         'reg_all': [0.02, 0.05, 0.1]}
g_s_svd = GridSearchCV(SVD,param_grid=params,cv=3)
g_s_svd.fit(data)

In [90]:
svd_results_df = pd.DataFrame.from_dict(g_s_svd.cv_results)
svd_results_df.sort_values('rank_test_rmse', axis=0, ascending=True)

Unnamed: 0,split0_test_rmse,split1_test_rmse,split2_test_rmse,mean_test_rmse,std_test_rmse,rank_test_rmse,split0_test_mae,split1_test_mae,split2_test_mae,mean_test_mae,std_test_mae,rank_test_mae,mean_fit_time,std_fit_time,mean_test_time,std_test_time,params,param_n_factors,param_reg_all
1,0.056202,0.055734,0.056073,0.056003,0.000197,1,0.032972,0.032816,0.032821,0.032869,7.2e-05,1,11.801399,0.020585,2.479613,0.599125,"{'n_factors': 20, 'reg_all': 0.05}",20,0.05
0,0.05668,0.056199,0.056439,0.056439,0.000196,2,0.032987,0.032851,0.032801,0.03288,7.9e-05,2,11.961742,0.038357,2.443443,0.015969,"{'n_factors': 20, 'reg_all': 0.02}",20,0.02
2,0.057127,0.056648,0.05704,0.056938,0.000208,3,0.03468,0.034517,0.034543,0.03458,7.1e-05,4,11.97661,0.343157,2.193552,0.609683,"{'n_factors': 20, 'reg_all': 0.1}",20,0.1
4,0.057205,0.056767,0.057065,0.057012,0.000183,4,0.033869,0.03373,0.033701,0.033767,7.3e-05,3,18.188906,0.012433,2.473456,0.540087,"{'n_factors': 50, 'reg_all': 0.05}",50,0.05
5,0.057462,0.057,0.057411,0.057291,0.000207,5,0.034917,0.034741,0.034792,0.034817,7.4e-05,5,18.123458,0.136109,2.465995,0.539385,"{'n_factors': 50, 'reg_all': 0.1}",50,0.1
8,0.058083,0.057589,0.057955,0.057875,0.000209,6,0.035351,0.035168,0.035189,0.035236,8.1e-05,7,28.437077,0.288441,2.659884,0.302475,"{'n_factors': 100, 'reg_all': 0.1}",100,0.1
7,0.058899,0.058374,0.058655,0.058643,0.000215,7,0.03514,0.034933,0.034908,0.034993,0.000104,6,28.487778,0.062555,2.256464,0.610556,"{'n_factors': 100, 'reg_all': 0.05}",100,0.05
3,0.059392,0.059017,0.059229,0.059213,0.000153,8,0.035445,0.035319,0.035286,0.03535,6.9e-05,8,18.122782,0.095994,2.668297,0.633091,"{'n_factors': 50, 'reg_all': 0.02}",50,0.02
6,0.063696,0.063339,0.063599,0.063545,0.000151,9,0.03878,0.03864,0.038628,0.038683,6.9e-05,9,28.622764,0.059291,2.525569,0.512285,"{'n_factors': 100, 'reg_all': 0.02}",100,0.02
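To make the size of these searches concrete, the sketch below expands a parameter grid into its individual combinations with `itertools.product`. `expand_grid` is a hypothetical helper written for illustration; it is not how Surprise's `GridSearchCV` is implemented, and it only handles flat grids (a nested `sim_options` dictionary would need to be expanded separately).

```python
from itertools import product


def expand_grid(param_grid):
    """List every combination of values in a flat {name: [values]} grid."""
    names = list(param_grid)
    return [dict(zip(names, combo))
            for combo in product(*(param_grid[name] for name in names))]


svd_grid = {'n_factors': [20, 50, 100], 'reg_all': [0.02, 0.05, 0.1]}
combos = expand_grid(svd_grid)

print(len(combos))      # 9 combinations
print(len(combos) * 3)  # 27 model fits with cv=3
print(combos[1])        # {'n_factors': 20, 'reg_all': 0.05}
```

The nine rows in the table above correspond to these nine combinations, each evaluated on three folds.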


### KNNBasic

In [91]:
## Perform a gridsearch with KNNBasic
params = {'k': [3, 5, 10, 20],
              'sim_options': {'name': ['msd', 'cosine', 'pearson'],
                              'min_support': [1, 5],
                              'user_based': [True]}
              }

g_s_knnb = GridSearchCV(knns.KNNBasic,param_grid=params, measures=['rmse', 'mae'], cv=3)
g_s_knnb.fit(data)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity

In [92]:
knnb_results_df = pd.DataFrame.from_dict(g_s_knnb.cv_results)
knnb_results_df.sort_values('rank_test_rmse', axis=0, ascending=True)

| | split0_test_rmse | split1_test_rmse | split2_test_rmse | mean_test_rmse | std_test_rmse | rank_test_rmse | split0_test_mae | split1_test_mae | split2_test_mae | mean_test_mae | std_test_mae | rank_test_mae | mean_fit_time | std_fit_time | mean_test_time | std_test_time | params | param_k | param_sim_options |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | 0.040099 | 0.03988 | 0.039831 | 0.039937 | 0.000117 | 1 | 0.016577 | 0.016763 | 0.016581 | 0.016641 | 8.7e-05 | 1 | 13.928336 | 0.094272 | 65.955569 | 0.566199 | `{'k': 10, 'sim_options': {'name': 'msd', 'min_...` | 10 | `{'name': 'msd', 'min_support': 5, 'user_based'...` |
| 19 | 0.040746 | 0.040581 | 0.04047 | 0.040599 | 0.000113 | 2 | 0.016911 | 0.017102 | 0.016897 | 0.01697 | 9.4e-05 | 2 | 13.510538 | 0.077747 | 68.94064 | 0.565731 | `{'k': 20, 'sim_options': {'name': 'msd', 'min_...` | 20 | `{'name': 'msd', 'min_support': 5, 'user_based'...` |
| 7 | 0.040878 | 0.040602 | 0.040593 | 0.040691 | 0.000132 | 3 | 0.017051 | 0.017218 | 0.017049 | 0.017106 | 7.9e-05 | 3 | 13.898682 | 0.143145 | 62.890889 | 0.397589 | `{'k': 5, 'sim_options': {'name': 'msd', 'min_s...` | 5 | `{'name': 'msd', 'min_support': 5, 'user_based'...` |
| 1 | 0.042752 | 0.042466 | 0.042474 | 0.042564 | 0.000133 | 4 | 0.01806 | 0.018214 | 0.018042 | 0.018105 | 7.7e-05 | 4 | 13.4036 | 0.034965 | 59.301139 | 1.438894 | `{'k': 3, 'sim_options': {'name': 'msd', 'min_s...` | 3 | `{'name': 'msd', 'min_support': 5, 'user_based'...` |
| 23 | 0.046054 | 0.045877 | 0.045857 | 0.045929 | 8.9e-05 | 5 | 0.021125 | 0.021285 | 0.021143 | 0.021184 | 7.2e-05 | 6 | 32.383954 | 0.560997 | 65.568168 | 1.416959 | `{'k': 20, 'sim_options': {'name': 'pearson', '...` | 20 | `{'name': 'pearson', 'min_support': 5, 'user_ba...` |
| 21 | 0.046643 | 0.046394 | 0.046489 | 0.046509 | 0.000103 | 6 | 0.02157 | 0.021684 | 0.021597 | 0.021617 | 4.8e-05 | 7 | 26.496758 | 0.066065 | 68.04439 | 1.226501 | `{'k': 20, 'sim_options': {'name': 'cosine', 'm...` | 20 | `{'name': 'cosine', 'min_support': 5, 'user_bas...` |
| 17 | 0.047001 | 0.04675 | 0.046818 | 0.046856 | 0.000106 | 7 | 0.021727 | 0.021855 | 0.021751 | 0.021778 | 5.5e-05 | 8 | 33.579396 | 0.429756 | 65.334736 | 0.19356 | `{'k': 10, 'sim_options': {'name': 'pearson', '...` | 10 | `{'name': 'pearson', 'min_support': 5, 'user_ba...` |
| 18 | 0.047576 | 0.047429 | 0.047171 | 0.047392 | 0.000168 | 8 | 0.020788 | 0.020979 | 0.02071 | 0.020826 | 0.000113 | 5 | 13.552616 | 0.070013 | 69.251948 | 1.089072 | `{'k': 20, 'sim_options': {'name': 'msd', 'min_...` | 20 | `{'name': 'msd', 'min_support': 1, 'user_based'...` |
| 15 | 0.047578 | 0.047342 | 0.047507 | 0.047476 | 9.9e-05 | 9 | 0.022189 | 0.022287 | 0.022262 | 0.022246 | 4.1e-05 | 9 | 29.169943 | 0.89343 | 70.694307 | 0.931321 | `{'k': 10, 'sim_options': {'name': 'cosine', 'm...` | 10 | `{'name': 'cosine', 'min_support': 5, 'user_bas...` |
| 11 | 0.049892 | 0.049579 | 0.049677 | 0.049716 | 0.000131 | 10 | 0.023295 | 0.023405 | 0.023315 | 0.023338 | 4.8e-05 | 11 | 36.064228 | 1.526686 | 68.070813 | 1.038171 | `{'k': 5, 'sim_options': {'name': 'pearson', 'm...` | 5 | `{'name': 'pearson', 'min_support': 5, 'user_ba...` |


In [93]:
sim_msd = {'name':'msd', 'min_support': 5, 'user_based':True}
basic = knns.KNNBasic(k=10, sim_options=sim_msd)
basic.fit(trainset)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x22f13846dc0>
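
For context on the `msd` option selected in `sim_msd`: Surprise documents its mean squared difference similarity as the inverse of one plus the mean squared difference over the items two users have both rated, and it sets the similarity to zero when fewer than `min_support` items are shared. The sketch below illustrates that formula in plain Python. The `msd_similarity` helper and the toy ratings are hypothetical and are not part of the notebook or of Surprise itself.

```python
def msd_similarity(ratings_u, ratings_v, min_support=5):
    """Mean-squared-difference similarity between two users.

    ratings_u and ratings_v map item id -> rating.
    Hypothetical helper written for illustration only.
    """
    common = set(ratings_u) & set(ratings_v)
    # Too few shared items (or none at all): treat the users as unrelated.
    if not common or len(common) < min_support:
        return 0.0
    # Average squared rating gap over the shared items.
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    # Identical ratings give msd = 0 and a similarity of 1.
    return 1.0 / (msd + 1.0)


# Toy ratings on a 0-1 scale (made up for this example).
user_a = {10: 0.2, 20: 0.4, 30: 0.0}
user_b = {10: 0.2, 20: 0.0, 30: 0.0, 40: 1.0}

sim_loose = msd_similarity(user_a, user_b, min_support=2)   # 3 shared items, counted
sim_strict = msd_similarity(user_a, user_b, min_support=5)  # below threshold, zero
```

With `min_support=5`, two users need at least five games in common before they can count as neighbours, which filters out similarities based on very little overlap.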

In [94]:
pred_basic = basic.test(testset)
accuracy.rmse(pred_basic)

RMSE: 0.0382


0.038209607319484536
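
As a rough illustration of what the RMSE and MAE figures in this notebook measure, here is a plain-Python sketch of both metrics computed over (true rating, estimated rating) pairs. The `rmse` and `mae` helpers and the sample pairs are made up for the example and are not taken from the notebook's data.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return math.sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)


# Made-up (true, estimated) pairs on a 0-1 rating scale.
sample = [(0.10, 0.12), (0.00, 0.01), (0.50, 0.42)]

sample_rmse = rmse(sample)
sample_mae = mae(sample)
```

Because RMSE squares each error before averaging, a few large misses raise it more than they raise MAE, which is why the RMSE values reported here are consistently higher than the MAE values.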

In [95]:
# Run 5-fold cross-validation and then print results
cross_validate(basic, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.0377  0.0374  0.0370  0.0367  0.0361  0.0370  0.0005  
MAE (testset)     0.0154  0.0154  0.0153  0.0152  0.0152  0.0153  0.0001  
Fit time          18.49   18.75   19.07   19.03   18.93   18.85   0.21    
Test time         42.60   43.99   43.27   46.69   46.64   44.64   1.71    


{'test_rmse': array([0.03769004, 0.03738831, 0.03704637, 0.03667532, 0.0361411 ]),
 'test_mae': array([0.01541574, 0.01537122, 0.01532123, 0.01522475, 0.01515508]),
 'fit_time': (18.48554491996765,
  18.750130891799927,
  19.066022634506226,
  19.02609634399414,
  18.933117628097534),
 'test_time': (42.596980571746826,
  43.9891574382782,
  43.27440786361694,
  46.693233013153076,
  46.643250703811646)}
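
The Mean and Std columns in the summary table can be reproduced from the per-fold arrays in the returned dictionary. The sketch below does this for the `test_rmse` values printed above, assuming the Std column is a population standard deviation. That assumption is consistent with the printed 0.0005 but is not stated in the output itself.

```python
from statistics import mean, pstdev

# Per-fold RMSE values copied from the 'test_rmse' array in the output above.
fold_rmse = [0.03769004, 0.03738831, 0.03704637, 0.03667532, 0.0361411]

rmse_mean = mean(fold_rmse)
rmse_std = pstdev(fold_rmse)  # population standard deviation (assumed)

print(f"RMSE mean: {rmse_mean:.4f}, std: {rmse_std:.4f}")
# prints: RMSE mean: 0.0370, std: 0.0005
```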

In [97]:
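# print the titles of the top n recommendations
def recommended_games(user_ratings, game_title_df, n):
    # user_ratings is a list of (appid, estimated rating) pairs, best first
    for idx, rec in enumerate(user_ratings[:n]):
        # look up the game's name by its appid label in game_title_df
        title = game_title_df.loc[int(rec[0])]['name']
        print('Recommendation # ', idx+1, ': ', title, '\n')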

In [98]:
def recs_for_user(steamid, num_games):
    # estimate a rating for every game in the dataset
    # (games already in the user's library are not filtered out)
    list_of_games = []
    for appid in big_rating_df['appid'].unique():
        list_of_games.append((appid, basic.predict(steamid, appid).est))
    # order the predictions from highest to lowest rated
    ranked_games = sorted(list_of_games, key=lambda x: x[1], reverse=True)
    recommended_games(ranked_games, game_data, num_games)

In [104]:
recs_for_user('76561197963796380', 3)

Recommendation #  1 :  Tomb Raider 

Recommendation #  2 :  Counter-Strike: Source 

Recommendation #  3 :  Grand Theft Auto III 
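
`recs_for_user` scores every `appid` in `big_rating_df`, so games the user has already played can appear in the output. One possible refinement, sketched below under stated assumptions, is to skip those games and keep only the n highest estimates. The `top_n_unplayed` helper, its `predict_fn` argument, and the toy scores are hypothetical stand-ins. In the notebook, `predict_fn` would presumably wrap the `est` field of `basic.predict(steamid, appid)`.

```python
import heapq


def top_n_unplayed(all_appids, played_appids, predict_fn, n):
    """Return the n highest-scoring (appid, estimate) pairs the user has not played.

    predict_fn is a hypothetical callable mapping an appid to an estimated rating.
    """
    played = set(played_appids)
    # Lazily score only the games that are not already in the user's library.
    candidates = (
        (appid, predict_fn(appid)) for appid in all_appids if appid not in played
    )
    # nlargest returns the results sorted from highest to lowest estimate.
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])


# Toy estimates keyed by appid (made up for this example).
toy_scores = {10: 0.30, 20: 0.90, 30: 0.50, 40: 0.70}

top = top_n_unplayed(toy_scores, played_appids=[20], predict_fn=toy_scores.get, n=2)
# top == [(40, 0.7), (30, 0.5)]
```

Using `heapq.nlargest` avoids sorting the full list of games when only a handful of recommendations are needed.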

