# Modelling

## Outline of the Process

1. Build two embedding models that represent items (games) and users (Steam users) as embedding vectors.
2. Find the similarity between game and user vectors.
    - Use a distance metric (cosine similarity, Euclidean distance, etc.) to compute a similarity score.
3. Rank games by their similarity score with the user.
4. Remove games that are already in the user's library.
5. Present the top 6 games from the remaining ranked list.
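
The steps above can be sketched on toy data. This is a minimal illustration, not the final model: the embedding values, the item ids, and the helper names `cosine_scores` and `recommend` are all made up for the example, and it assumes the user and game vectors already live in the same space.

```python
import numpy as np

def cosine_scores(user_vec, item_matrix):
    """Cosine similarity between one user vector and every row of item_matrix."""
    user_norm = np.linalg.norm(user_vec)
    item_norms = np.linalg.norm(item_matrix, axis=1)
    denom = item_norms * user_norm
    # Guard against division by zero for all-zero vectors.
    denom = np.where(denom == 0, 1.0, denom)
    return (item_matrix @ user_vec) / denom

def recommend(user_vec, item_matrix, item_ids, owned_ids, top_n=6):
    """Rank items by similarity, drop owned items, return the top_n ids."""
    scores = cosine_scores(user_vec, item_matrix)
    ranked = np.argsort(scores)[::-1]  # highest similarity first
    owned = set(owned_ids)
    picks = [item_ids[i] for i in ranked if item_ids[i] not in owned]
    return picks[:top_n]

# Toy data: four hypothetical games embedded in a 3-dimensional space.
item_ids = [101, 102, 103, 104]
item_matrix = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
user_vec = np.array([1.0, 0.2, 0.0])

# The user already owns game 101, so it is filtered out of the ranking.
top_picks = recommend(user_vec, item_matrix, item_ids, owned_ids=[101], top_n=2)
print(top_picks)
```

Filtering owned games after ranking (rather than before) keeps the scoring step independent of any single user's library, which is the same order used in the outline.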

## Benefits (Business worth)

This method of finding recommendations helps overcome the so-called "Cold Start Problem": it is hard to predict a user's interests when they are new to the platform. By learning the embeddings of existing users and items, the system can make initial recommendations for new users, or recommend new items, based on their similarity to existing items.

## Considerations for Modelling

There are many types of recommender systems. The candidates considered here are listed below.


### Types of Models


#### 1. Simple KNN

#### 2. Item-Item Filtering

#### 3. Item-User Filtering

#### 4. Others?

#### 5. NN

### Accuracy Concerns

The model's accuracy can be a valuable metric to determine if the recommendations being given to a user are appropriate.  An important consideration, however, is our tolerance for inaccurate recommendations making it into our top recommended list.  For instance, how much should we care if the system provides 5 good recommendations but 1 very bad recommendation compared to 6 mediocre recommendations?  In other words, how important is the system's ability to prevent false positives (i.e., its precision)?

For my recommender system, I will prioritize precision over overall accuracy, since I want to prevent false positives whenever possible.
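
One common way to put a number on this is precision at k: the fraction of the top k recommendations that turn out to be relevant. The sketch below is only an illustration; the function name `precision_at_k`, the app ids, and the definition of "relevant" (games the user liked) are assumptions made for the example.

```python
def precision_at_k(recommended, relevant, k=6):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant)
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(top_k)

# Hypothetical example: six recommended app ids, five of which the user liked.
recommended = [10, 20, 30, 40, 50, 60]
liked = {10, 20, 30, 40, 50}

score = precision_at_k(recommended, liked, k=6)
print(round(score, 3))
```

Note that this metric treats every miss the same way, so it cannot tell one very bad recommendation apart from one mediocre recommendation; capturing that difference would need graded relevance labels.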

### 0. Preparation

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import coo_matrix, vstack

# import scraping_tools for testing purposes
from scraping_tools import steam_library

from warnings import filterwarnings
filterwarnings('ignore')

In [22]:
games_df = pd.read_csv('data/clean_game_data.csv')
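# users_df is not loaded in this run, so the users_df.head(3) cell below raises a NameError
#users_df = pd.read_csv('data/clean_user_data.csv')
# recs_df is used by the cells below, so it needs to be loaded
recs_df = pd.read_csv('data/clean_recommendations.csv')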

In [23]:
games_df.head(3)

Unnamed: 0,app_id,title,date_release,win,mac,linux,rating,positive_ratio,user_reviews,price_final,...,t_Well-Written,t_Werewolves,t_Western,t_Wholesome,t_Word Game,t_World War I,t_World War II,t_Wrestling,t_Zombies,t_eSports
0,10090,Call of Duty: World at War,2008-11-18,1,0,0,1,92,37039,19.99,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,13500,Prince of Persia: Warrior Within™,2008-11-21,1,0,0,1,84,2199,9.99,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,22364,BRINK: Agents of Change,2011-08-03,1,0,0,1,85,21,2.99,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [4]:
users_df.head(3)

NameError: name 'users_df' is not defined

In [6]:
recs_df.head(3)

Unnamed: 0,app_id,helpful,funny,review_date,is_recommended,hours,user_id,review_id
0,975370,0,0,2022-12-12,1,36.3,24170,0
1,304390,4,0,2017-02-17,0,11.5,1339,1
2,1085660,2,0,2019-11-17,1,336.5,110271,2


In [7]:
# find number of unique games in recs_df
recs_df['app_id'].nunique()

2266

Note that the model is trained on only 2,266 unique games out of roughly 55,000 games available on Steam, so there is a good chance that a new user will have played a game that is not represented in this dataset. Similarity scores computed from recs_df alone therefore cannot cover games that are missing from recs_df.
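
A quick way to see how much this matters for a given user is to measure how much of their library is covered by the training data. The sketch below uses made-up app ids and a hypothetical helper `library_coverage`; in the notebook the two inputs would plausibly be `user_library['app_id']` and `recs_df['app_id'].unique()`.

```python
def library_coverage(library_app_ids, known_app_ids):
    """Return (share of the library present in the dataset, set of missing ids)."""
    library = set(library_app_ids)
    if not library:
        return 0.0, set()
    missing = library - set(known_app_ids)
    coverage = 1 - len(missing) / len(library)
    return coverage, missing

# Hypothetical ids: the user owns four games, one of which is not in the dataset.
user_games = [1, 2, 3, 4]
dataset_games = [1, 2, 3, 5, 6]

coverage, missing = library_coverage(user_games, dataset_games)
print(f"coverage: {coverage:.0%}, missing: {sorted(missing)}")
```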

### 1. KNN

Let's do three different KNN models: Collaborative Filtering (finding similar users), Content Filtering (finding similar games), and a hybrid of the two.

#### Content-Based Filtering (Similar Games)

In [5]:
games_df.columns[0:15]

Index(['app_id', 'title', 'date_release', 'win', 'mac', 'linux', 'rating',
       'positive_ratio', 'user_reviews', 'price_final', 'price_original',
       'discount', 'description', 't_1980s', 't_1990's'],
      dtype='object')

In [6]:
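# drop identifier and free-text columns; without inplace=True, drop() returns a new
# DataFrame (inplace=True returns None) and games_df keeps app_id for later lookups
games_sim = games_df.drop(columns=['app_id', 'title', 'date_release', 'price_original', 'description'])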

In [7]:
games_sim = games_sim.fillna(0)

In [8]:
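# cosine similarity between every pair of games, computed on the numeric feature columns
# (dense_output only affects sparse input, so the result here is a dense array)
games_similarity = cosine_similarity(games_sim, dense_output=False)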

In [10]:
games_similarity.shape

(48844, 48844)
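
A dense 48844 x 48844 matrix of 64-bit floats takes roughly 19 GB of memory, which may not fit on every machine. One possible alternative, sketched below on random toy features, is to ask `NearestNeighbors` for only the closest few games per query instead of materialising every pairwise score. The feature matrix, the query rows, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Memory needed for the full dense similarity matrix (8 bytes per float64).
n_games = 48844
dense_bytes = n_games * n_games * 8
print(f"{dense_bytes / 1e9:.1f} GB")

# Toy stand-in for the game feature matrix: 50 games, 8 features each.
rng = np.random.default_rng(0)
features = rng.random((50, 8))

# Brute-force cosine search: distances are computed only for the queried rows.
nn = NearestNeighbors(n_neighbors=6, metric="cosine", algorithm="brute")
nn.fit(features)

# Query two games (rows 3 and 7); the nearest hit for each is the game itself.
distances, indices = nn.kneighbors(features[[3, 7]])
similarities = 1.0 - distances   # cosine similarity = 1 - cosine distance
neighbour_ids = indices[:, 1:]   # drop the self-match in column 0
print(neighbour_ids)
```

The trade-off is that scores are recomputed per query rather than looked up, which is usually acceptable when only a handful of games per user need neighbours.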

In [26]:
steam_id = '76561198120441502'  # random user's Steam ID
user_library = steam_library(steam_id)

# keep the six games with the most hours played
sorted_library = user_library.sort_values(by='hours', ascending=False).head(6)

In [30]:
sorted_library

Unnamed: 0,app_id,hours,user_id
60,363970,21339,76561198120441502
75,413150,5963,76561198120441502
242,1432860,4334,76561198120441502
223,1072420,3037,76561198120441502
122,589360,2915,76561198120441502
126,599140,2444,76561198120441502


In [27]:
# grab the top 6 games
top6_games = sorted_library['app_id'].head(6).tolist()

In [28]:
top6_games

[363970, 413150, 1432860, 1072420, 589360, 599140]

In [33]:
top6_indices = games_df[games_df['app_id'].isin(top6_games)].index
top6_indices

Int64Index([468, 3787, 5173, 9391, 11526, 11978], dtype='int64')
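
Two details of this lookup are worth keeping in mind: `isin` returns rows in catalogue order rather than in most-played order, and the resulting index labels only match row positions when the DataFrame has a default index. A possible alternative, sketched on a made-up catalogue, is `pd.Index.get_indexer`, which returns row positions in the order of the requested ids and marks ids that are absent with -1. It assumes the app ids in the catalogue are unique.

```python
import pandas as pd

# Hypothetical catalogue with a default RangeIndex.
games = pd.DataFrame({"app_id": [10, 20, 30, 40], "title": ["A", "B", "C", "D"]})

# Most-played games first; 99 is not in the catalogue.
top_games = [30, 10, 99]

# Row positions in the same order as top_games, with -1 for missing ids.
positions = pd.Index(games["app_id"]).get_indexer(top_games)
found = positions[positions >= 0]

print(positions.tolist())
print(games.iloc[found]["title"].tolist())
```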

In [34]:
sim_games = []

for i in top6_indices:
    # similarity of this game to every other game
    row = games_similarity[i]

    # game indices ordered from most to least similar
    sorted_sims = np.argsort(row)[::-1]

    # skip position 0 (normally the game itself) and keep the next 5 indices
    game_rec_indices = sorted_sims[1:6]
    sim_games.extend(game_rec_indices)

In [37]:
top30_recs = games_df.iloc[sim_games]

In [38]:
# get all appid values from user_library
excluded_apps = user_library['app_id'].tolist()

# mask those games from my top30 recs
mask = ~top30_recs['app_id'].isin(excluded_apps)

top6picks = top30_recs[mask].head(6)

In [39]:
top6picks

Unnamed: 0,app_id,title,date_release,win,mac,linux,rating,positive_ratio,user_reviews,price_final,...,t_Well-Written,t_Werewolves,t_Western,t_Wholesome,t_Word Game,t_World War I,t_World War II,t_Wrestling,t_Zombies,t_eSports
3360,694280,Zombie Army 4: Dead War,2021-02-18,1,0,0,1,85,4709,49.99,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
5215,1372110,JoJo's Bizarre Adventure: All-Star Battle R,2022-09-01,1,0,0,1,86,4719,49.99,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22576,840140,武侠乂 The Swordsmen X,2020-04-27,1,0,0,0,46,2412,24.99,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8734,1732190,FATAL FRAME / PROJECT ZERO: Maiden of Black Water,2021-10-27,1,0,0,0,79,4060,39.99,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1184,10180,Call of Duty®: Modern Warfare® 2 (2009),2009-11-11,1,1,0,1,93,26786,19.99,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10611,751780,Forager,2019-04-18,1,0,0,1,91,27094,19.99,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [41]:
game_images_df = pd.read_csv('data/steam_games.csv', sep=';')

In [46]:
game_details = game_images_df.loc[game_images_df['App ID'].isin(top6picks['app_id']), ['Name', 'Price', 'Header Image']]

In [48]:
header_img = game_details['Header Image'].to_list()
header_img

['https://cdn.akamai.steamstatic.com/steam/apps/10180/header.jpg?t=1654809646',
 'https://cdn.akamai.steamstatic.com/steam/apps/1372110/header.jpg?t=1666908137',
 'https://cdn.akamai.steamstatic.com/steam/apps/1732190/header.jpg?t=1663081460',
 'https://cdn.akamai.steamstatic.com/steam/apps/694280/header.jpg?t=1652368094',
 'https://cdn.akamai.steamstatic.com/steam/apps/751780/header.jpg?t=1667237775',
 'https://cdn.akamai.steamstatic.com/steam/apps/840140/header.jpg?t=1588046169']

In [None]:
def game_recommendations(game_id, games_similarity, user_library, min_reviews):
    pass
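
The stub above is left unimplemented. One way such a function could look is sketched below on toy data; it is not the notebook's implementation. The name `similar_games`, its arguments, and the use of the `user_reviews` column as the `min_reviews` filter are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def similar_games(game_index, similarity, games, owned_app_ids, min_reviews=0, top_n=5):
    """Return up to top_n rows of `games` most similar to the game at game_index.

    Rows of `games` must be in the same order as the rows of `similarity`.
    """
    scores = np.asarray(similarity[game_index]).ravel()
    order = np.argsort(scores)[::-1]       # most similar first
    order = order[order != game_index]     # drop the query game explicitly
    candidates = games.iloc[order]
    keep = (~candidates["app_id"].isin(owned_app_ids)) & (
        candidates["user_reviews"] >= min_reviews
    )
    return candidates[keep].head(top_n)

# Toy catalogue and a hand-written symmetric similarity matrix.
games = pd.DataFrame({"app_id": [1, 2, 3, 4], "user_reviews": [500, 20, 900, 300]})
similarity = np.array([
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.2, 0.3],
    [0.8, 0.2, 1.0, 0.4],
    [0.1, 0.3, 0.4, 1.0],
])

# Game 2 is the closest match but has too few reviews; game 4 is already owned.
picks = similar_games(0, similarity, games, owned_app_ids=[4], min_reviews=100)
print(picks["app_id"].tolist())
```

Removing the query game by index, instead of slicing off the first sorted position, avoids returning the game itself when another game ties with it at a similarity of 1.0.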

In [18]:
new_recs_df = recs_df.drop(columns=['helpful', 'funny', 'review_date', 'is_recommended', 'review_id'])

In [8]:
steam_id = '76561198120441502' #random user's steam id
user_library = steam_library(steam_id)

# look at the steam_library dataframe
user_library.head()

Unnamed: 0,app_id,hours,user_id
0,400,209,76561198120441502
1,34190,0,76561198120441502
2,70400,393,76561198120441502
3,620,0,76561198120441502
4,105600,0,76561198120441502


In [19]:
new_recs_df.head()

Unnamed: 0,app_id,hours,user_id
0,975370,36.3,24170
1,304390,11.5,1339
2,1085660,336.5,110271
3,703080,27.4,112510
4,526870,7.9,11046


In [20]:
merged_df = pd.concat([user_library, new_recs_df], ignore_index=True)

In [21]:
print(new_recs_df.shape, user_library.shape, merged_df.shape)

(14585291, 3) (261, 3) (14585552, 3)


In [22]:
# map each user and item to a unique numeric code
user_cat = merged_df['user_id'].astype('category')
item_cat = merged_df['app_id'].astype('category')
user_ids = user_cat.cat.codes
item_ids = item_cat.cat.codes

# keep the original user and game ids so codes can be mapped back later
unique_user_ids = user_cat.cat.categories
unique_item_ids = item_cat.cat.categories

# create a sparse user-by-game matrix of hours played
user_game_matrix = coo_matrix((merged_df['hours'], (user_ids, item_ids)))

# fit the model
model_knn = NearestNeighbors(metric='cosine', algorithm='brute')
model_knn.fit(user_game_matrix)
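
Once a model like this is fitted, the nearest users can be queried with `kneighbors` and their libraries used as a source of candidate games. The sketch below shows the idea on a tiny hand-made matrix; the user and game codes, the hours, and the "recommend what the nearest neighbour played" rule are assumptions for illustration, not the notebook's final logic.

```python
import numpy as np
from scipy.sparse import coo_matrix
from sklearn.neighbors import NearestNeighbors

# Toy interactions: (user code, game code, hours played).
user_codes = np.array([0, 0, 1, 1, 1, 2, 2, 3, 3])
game_codes = np.array([0, 1, 0, 1, 2, 2, 3, 1, 3])
hours = np.array([10.0, 5.0, 8.0, 4.0, 6.0, 7.0, 1.0, 2.0, 9.0])

# CSR format supports the row slicing used below.
matrix = coo_matrix((hours, (user_codes, game_codes)), shape=(4, 4)).tocsr()

knn = NearestNeighbors(metric="cosine", algorithm="brute")
knn.fit(matrix)

# Ask for 2 neighbours of user 0: the user itself plus the closest other user.
target = 0
distances, indices = knn.kneighbors(matrix[target], n_neighbors=2)
neighbours = [int(u) for u in indices.ravel() if u != target]

# Candidate games: played by the neighbours but not by the target user.
played = {int(g) for g in matrix[target].indices}
neighbour_hours = np.asarray(matrix[neighbours].sum(axis=0)).ravel()
candidates = [
    int(g)
    for g in np.argsort(neighbour_hours)[::-1]
    if neighbour_hours[g] > 0 and int(g) not in played
]
print(neighbours, candidates)
```

To report results, the numeric codes would still need to be mapped back to real ids, which is what the `unique_user_ids` and `unique_item_ids` categories are for.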

In [8]:
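# recs_matrix is assumed to be the user-by-game sparse matrix built from recs_df;
# its construction is not shown in this notebook
print(f'Shape of recs_matrix: {recs_matrix.shape}')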
print('Number of unique users:', recs_df['user_id'].nunique())
print('Number of unique games:', recs_df['app_id'].nunique())

Shape of recs_matrix: (6903784, 2266)
Number of unique users: 6903784
Number of unique games: 2266


In [9]:
knn_model = NearestNeighbors(metric='cosine')
knn_model.fit(recs_matrix)