# <a id='toc1_'></a>[Content-Based Filtering](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Content-Based Filtering](#toc1_)    
  - [Outline of the Process](#toc1_1_)    
  - [Importing Libraries and Datasets](#toc1_2_)    
  - [Content-Based Filtering](#toc1_3_)    
    - [Preprocessing](#toc1_3_1_)    
    - [Accuracy Concerns](#toc1_3_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Outline of the Process](#toc0_)

1. Use game details to find the similarity between game and user vectors
    - Use distance metrics (cosine similarity) to determine their similarity score
2. Rank games by their similarity score with the user
3. Remove games already owned in the user's library
4. Present top 6 games from the remaining list of ranked games.
    - To prevent recommending the same games every time, we can recommend the top 6 from a list of top games.

## <a id='toc1_2_'></a>[Importing Libraries and Datasets](#toc0_)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# import scraping_tools for testing purposes
from scraping_tools import steam_library

from warnings import filterwarnings
filterwarnings('ignore')

In [2]:
# reading in required datasets
games_df = pd.read_csv('data/clean_game_data.csv')
users_df = pd.read_csv('data/clean_user_data.csv')
recs_df = pd.read_csv('data/clean_recommendations.csv')

In [3]:
games_df.head(3)

Unnamed: 0,app_id,title,date_release,win,mac,linux,rating,positive_ratio,user_reviews,price_final,...,t_Well-Written,t_Werewolves,t_Western,t_Wholesome,t_Word Game,t_World War I,t_World War II,t_Wrestling,t_Zombies,t_eSports
0,10090,Call of Duty: World at War,2008-11-18,1,0,0,1,92,37039,19.99,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,13500,Prince of Persia: Warrior Within™,2008-11-21,1,0,0,1,84,2199,9.99,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,22364,BRINK: Agents of Change,2011-08-03,1,0,0,1,85,21,2.99,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [4]:
users_df.head(3)

Unnamed: 0,user_id,products,reviews
0,6924278,156,1
1,4358034,329,4
2,2340634,176,2


In [5]:
recs_df.head(3)

Unnamed: 0,app_id,helpful,funny,review_date,is_recommended,hours,user_id,review_id
0,975370,0,0,2022-12-12,1,36.3,24170,0
1,304390,4,0,2017-02-17,0,11.5,1339,1
2,1085660,2,0,2019-11-17,1,336.5,110271,2


In [6]:
# find number of unique games in recs_df
recs_df['app_id'].nunique()

2266

In [7]:
recs_df['user_id'].nunique()

6903784

Note that we only 2266 have unique games with reviews out of a possible ~55,000 total games on Steam.  So, there is a good chance that a new user will have played a game that is not represented in this dataset.  Similarity scores found from only recs_df, then, are unable to include games that were not reviewed by users.

## <a id='toc1_3_'></a>[Content-Based Filtering](#toc0_)

I will start by determining games that are similar to each other using their cosine similarities (i.e., by vectorizing each game using its details and finding the dot product between each game and every other game).  By creating this, we can then determine which games are most alike, which allows us to recommend games based on what we know about the user.  For instance, if we know a user liked God of War, and we know that The Witcher 3 is its "nearest neighbour" of games that user has not played, then the assumption is that the user would also like The Witcher 3, so we can recommend that game.

### <a id='toc1_3_1_'></a>[Preprocessing](#toc0_)

To start, we need to ensure our dataset is in the correct format to perform the cosine similarity.

In [8]:
# looking at the first few columns (excluding tags, which were already one-hot encoded in the Data Cleaning step.)
games_df.columns[0:15]

Index(['app_id', 'title', 'date_release', 'win', 'mac', 'linux', 'rating',
       'positive_ratio', 'user_reviews', 'price_final', 'price_original',
       'discount', 'description', 't_1980s', 't_1990's'],
      dtype='object')

In [9]:
# we can drop certain columns that are redundant or cannot be processed
games_sim = games_df.drop(columns=['app_id', 'title', 'date_release', 'price_original', 'description'])

In [10]:
# check for NaNs in our dataset
games_sim.isna().sum()

win                 0
mac                 0
linux               0
rating              0
positive_ratio      0
                 ... 
t_World War I     238
t_World War II    238
t_Wrestling       238
t_Zombies         238
t_eSports         238
Length: 449, dtype: int64

It looks like some of the games did not have tags when they were one-hot encoded.  Those games have NaNs in the tags columns.  Since they amount to only a very small proportion of total games, I'll drop them from this set.

In [11]:
# dropping games without tags
games_sim.dropna(inplace=True)

In [12]:
games_sim.shape

(48606, 449)

We are left with 48,606 unique games and 449 columns (features) for each game.

Before we can perform a cosine similarity, though, we need to ensure that our features are scaled together.  This is important to ensure that certain features don't outweigh other features unfairly.

In [13]:
# scaling the game_data
scaler = StandardScaler()
scaled_game_data = scaler.fit_transform(games_sim)

In [14]:
# perform cosine similarity in batches
def cosine_similarity_n_space(data, batch_size=25):
    assert isinstance(data, np.ndarray), "Input data must be a numpy array"
    ret = np.ndarray((data.shape[0], data.shape[0]))
    for row_i in range(0, int(data.shape[0] / batch_size) + 1):
        start = row_i * batch_size
        end = min([(row_i + 1) * batch_size, data.shape[0]])
        if end <= start:
            break
        rows = data[start: end]
        sim = cosine_similarity(rows, data)  # rows is O(1) size
        ret[start: end] = sim
    return ret

game_similarity = cosine_similarity_n_space(scaled_game_data)

In [16]:
game_similarity.shape

(48606, 48606)

Now that we have the similarity score, we can find recommendations for any user where we know the games they like.  This can even be done for users not in our dataset by getting their information through the Steam API.

Let's get a user's details to see how this would work.

In [17]:
# we need the user's id.
steam_id = '76561198120441502'

# in scraping_tools.py, I've already created functions to get the user's library details
user_library = steam_library(steam_id)

# look at the steam_library dataframe, sorted by number of hours played
sorted_library = user_library.sort_values(by='hours', ascending=False).head(10)

In [18]:
# top 10 most played games for this user
sorted_library

Unnamed: 0,app_id,hours,user_id
60,363970,21339,76561198120441502
75,413150,5963,76561198120441502
242,1432860,4334,76561198120441502
223,1072420,3037,76561198120441502
122,589360,2915,76561198120441502
126,599140,2444,76561198120441502
191,894940,2252,76561198120441502
245,1458100,1737,76561198120441502
257,1970460,1638,76561198120441502
200,972660,1578,76561198120441502


We can see that this user played game '363970' for 21,339 hours.  It's probably safe to say that they enjoyed it at least a little bit.  The hope, then, is that they would also enjoy similar games.

One problem with this assumption is that it assumes with more hours played are preferred more by users.  This is not necessarily the case since some games simply have less content than others.  That doesn't mean those games are less enjoyable, just that a user is likely to play it for less time.  Still, we will work on this assumption for now.

In [19]:
# Since the app only displayes 6 games at a time, let's look at the top 6
top6_games = sorted_library['app_id'].head(6).tolist()

top6_games

[363970, 413150, 1432860, 1072420, 589360, 599140]

In [20]:
# find the index of those games in the games_df
top6_indices = games_df[games_df['app_id'].isin(top6_games)].index
top6_indices

Int64Index([468, 3787, 5173, 9391, 11526, 11978], dtype='int64')

When finding similar games, we want to make sure that we aren't simply finding the top games for that user's most played game.  We want an assortment of games that cover most of the user's intersts.  For that reason, let's take the top 5 similar games for each of that user's top 6 most played games.

In [22]:
# get similar games for top6 most played
sim_games = []
game_recs = 0

for i in top6_indices:
    # find the row for that game
    row = game_similarity[i]
    
    # sort the row
    sorted_sims = np.argsort(row)[::-1]

    # get 5 game indices
    game_rec_indices = sorted_sims[1:6]
    sim_games.extend(game_rec_indices)

# get games from those indices
top30_recs = games_df.iloc[sim_games]

Lastly, we want to make sure we aren't considering games that the user has already played.  So, let's remove those games from the recommendations.

In [23]:
# get all appid values from user_library
excluded_apps = user_library['app_id'].tolist()

# mask those games from my top30 recs
mask = ~top30_recs['app_id'].isin(excluded_apps)

# we will take the top 6 from games that the user has not played
top6picks = top30_recs[mask].head(6)

display(top6picks)

Unnamed: 0,app_id,title,date_release,win,mac,linux,rating,positive_ratio,user_reviews,price_final,...,t_Well-Written,t_Werewolves,t_Western,t_Wholesome,t_Word Game,t_World War I,t_World War II,t_Wrestling,t_Zombies,t_eSports
8961,319320,Time Mysteries 3: The Final Enigma,2014-09-11,1,1,1,1,80,208,9.99,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
30460,509880,The End oo,2017-05-11,1,0,0,0,65,183,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10226,1897380,Zemblanity,2022-04-03,1,0,0,1,89,55,2.99,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17692,1000440,东方雪莲华 ～ Abyss Soul Lotus.,2023-02-02,1,0,0,1,91,267,12.99,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18032,1590640,Savior of the Abyss,2021-08-12,1,0,0,0,77,189,6.99,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5932,965220,Total War: WARHAMMER II - The Prophet & The Wa...,2019-04-17,1,1,1,1,84,1065,9.99,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now that we know which games to recommend, we need to get the game details.

In [24]:
# read in the dataset that provides game details
game_images_df = pd.read_csv('data/steam_games.csv', sep=';')

In [25]:
# get game details for top 6 recommendations
game_details = game_images_df.loc[game_images_df['App ID'].isin(top6picks['app_id']), ['Name', 'Price', 'Header Image']]

### <a id='toc1_3_2_'></a>[Accuracy Concerns](#toc0_)

The model's accuracy can be a valuable metric to determine if the recommendations being given to a user are appropriate.  One of the problems with this sort of recommendation system is that it isn't easy to evaluate.  Other than by eye, or else by gathering further user metrics such as click-through rate, we won't ever know if these recommendations are good or not.  All we have to go off of is the similarity score that was given to us by calculating the cosine similarity.  Since we can't be sure that these scores are actually capturing important details about the games, though, the scores are not necessarily reliable.  

For this reason, in the next notebook I will move on to a Funk Singular Value Decomposition Model that provides at least one metric by which we can evaluate the recommendations that are produced.