# Content-Based Video Game Recommendation 
Brendan Haniff

Have you ever enjoyed playing a video game so much that after completing it you're left looking for more?
<br><br>
Finding games similar to the one you completed can be challenging, especially given the number of video games available. This is where a recommendation system comes into play. Much like the systems used by movie streaming services, a recommendation system can suggest the best video game to play next, given a game you already enjoy.

## Goal
The goal of this project is to develop a content-based video game recommendation system using data from the 'Popular Video Games 1980 - 2023' dataset on Kaggle. Content-based recommendation systems use information about games a user enjoys and aim to find similar games based on selected categories of data about each game. This project accomplishes that with cosine similarity. Each video game in the dataset is assigned a vector based on the data from the chosen categories. Cosine similarity then measures how closely each pair of vectors points in the same direction (a score near 1 means very similar, a score near 0 means unrelated), which determines the games closest to the one suggested by the user.
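
To make the idea concrete, here is a minimal sketch of cosine similarity in plain Python. The three vectors are made-up term weights for hypothetical games, not values from the dataset, and the helper name cosine_sim is only illustrative.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)

# Hypothetical term-weight vectors (assumed for illustration only)
game_a = [1.0, 2.0, 0.0]
game_b = [2.0, 4.0, 0.0]  # same direction as game_a, just longer
game_c = [0.0, 0.0, 3.0]  # shares no terms with game_a

sim_ab = cosine_sim(game_a, game_b)  # close to 1: same direction
sim_ac = cosine_sim(game_a, game_c)  # 0: nothing in common
```

Because only the direction of the vectors matters, a game with a long description and a game with a short one can still score as highly similar.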

<div>
    <img src="https://cdn.sanity.io/images/oaglaatp/production/a2fc251dcb1ad9ce9b8a82b182c6186d5caba036-1200x800.png?w=1200&h=800&auto=format" width="200"/>
</div>

## Dataset

In this project I will be using the 'Popular Video Games 1980 - 2023' dataset on Kaggle. This dataset includes 1099 unique video games, each with data in the categories: Title, Release Date, Team, Rating, Times Listed, Number of Reviews, Genres, Summary, Reviews, Plays, Playing, Backlogs, and Wishlist. The data is collected from 'Backloggd', a video game collection website that allows users to interact and share experiences and opinions on different games.

## Data Cleaning

Import pandas and numpy for data cleaning, and import TfidfVectorizer and linear_kernel for building the recommendation model.

In [5]:
#libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

df.head() shows the first 5 rows of the dataset, and df.info() lists every column along with its data type. Every column is of type 'object' except the index (int64) and 'Rating' (float64). Note that the raw file has 1512 rows, which includes duplicate entries.

In [6]:
#read csv
df = pd.read_csv("games.csv")
df = df.rename(columns={df.columns[0]: 'index'})
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1512 entries, 0 to 1511
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   index              1512 non-null   int64  
 1   Title              1512 non-null   object 
 2   Release Date       1512 non-null   object 
 3   Team               1511 non-null   object 
 4   Rating             1499 non-null   float64
 5   Times Listed       1512 non-null   object 
 6   Number of Reviews  1512 non-null   object 
 7   Genres             1512 non-null   object 
 8   Summary            1511 non-null   object 
 9   Reviews            1512 non-null   object 
 10  Plays              1512 non-null   object 
 11  Playing            1512 non-null   object 
 12  Backlogs           1512 non-null   object 
 13  Wishlist           1512 non-null   object 
dtypes: float64(1), int64(1), object(12)
memory usage: 165.5+ KB
None


|  | index | Title | Release Date | Team | Rating | Times Listed | Number of Reviews | Genres | Summary | Reviews | Plays | Playing | Backlogs | Wishlist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Elden Ring | Feb 25, 2022 | ['Bandai Namco Entertainment', 'FromSoftware'] | 4.5 | 3.9K | 3.9K | ['Adventure', 'RPG'] | Elden Ring is a fantasy, action and open world... | ["The first playthrough of elden ring is one o... | 17K | 3.8K | 4.6K | 4.8K |
| 1 | 1 | Hades | Dec 10, 2019 | ['Supergiant Games'] | 4.3 | 2.9K | 2.9K | ['Adventure', 'Brawler', 'Indie', 'RPG'] | A rogue-lite hack and slash dungeon crawler in... | ['convinced this is a roguelike for people who... | 21K | 3.2K | 6.3K | 3.6K |
| 2 | 2 | The Legend of Zelda: Breath of the Wild | Mar 03, 2017 | ['Nintendo', 'Nintendo EPD Production Group No... | 4.4 | 4.3K | 4.3K | ['Adventure', 'RPG'] | The Legend of Zelda: Breath of the Wild is the... | ['This game is the game (that is not CS:GO) th... | 30K | 2.5K | 5K | 2.6K |
| 3 | 3 | Undertale | Sep 15, 2015 | ['tobyfox', '8-4'] | 4.2 | 3.5K | 3.5K | ['Adventure', 'Indie', 'RPG', 'Turn Based Stra... | A small child falls into the Underground, wher... | ['soundtrack is tied for #1 with nier automata... | 28K | 679 | 4.9K | 1.8K |
| 4 | 4 | Hollow Knight | Feb 24, 2017 | ['Team Cherry'] | 4.4 | 3K | 3K | ['Adventure', 'Indie', 'Platform'] | A 2D metroidvania with an emphasis on close co... | ["this games worldbuilding is incredible, with... | 21K | 2.4K | 8.3K | 2.3K |


Drop duplicate video games from the dataset while keeping the row index contiguous by calling df.drop_duplicates with these parameters: subset='Title', keep='first', ignore_index=True. <br>
<br> I chose to isolate the Title, Team, Genres, Summary, and Reviews columns because these seemed most relevant to determining similarity between video games.

In [7]:
#drop duplicate titles, keeping the first occurrence and renumbering the index
df = df.drop_duplicates(subset='Title', keep='first', ignore_index=True)

#isolate columns needed
df = df[['Title', 'Team', 'Genres', 'Summary', 'Reviews']]
df.head()

|  | Title | Team | Genres | Summary | Reviews |
|---|---|---|---|---|---|
| 0 | Elden Ring | ['Bandai Namco Entertainment', 'FromSoftware'] | ['Adventure', 'RPG'] | Elden Ring is a fantasy, action and open world... | ["The first playthrough of elden ring is one o... |
| 1 | Hades | ['Supergiant Games'] | ['Adventure', 'Brawler', 'Indie', 'RPG'] | A rogue-lite hack and slash dungeon crawler in... | ['convinced this is a roguelike for people who... |
| 2 | The Legend of Zelda: Breath of the Wild | ['Nintendo', 'Nintendo EPD Production Group No... | ['Adventure', 'RPG'] | The Legend of Zelda: Breath of the Wild is the... | ['This game is the game (that is not CS:GO) th... |
| 3 | Undertale | ['tobyfox', '8-4'] | ['Adventure', 'Indie', 'RPG', 'Turn Based Stra... | A small child falls into the Underground, wher... | ['soundtrack is tied for #1 with nier automata... |
| 4 | Hollow Knight | ['Team Cherry'] | ['Adventure', 'Indie', 'Platform'] | A 2D metroidvania with an emphasis on close co... | ["this games worldbuilding is incredible, with... |


After dropping the rows that contain missing values, we can see the shape of the filtered data: 1097 rows and the 5 columns we isolated.

In [8]:
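#drop rows with missing values, then reset the index so row labels match row positions
df = df.dropna().reset_index(drop=True)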
print(f"shape of df: {df.shape}")
df.head()


shape of df: (1097, 5)


|  | Title | Team | Genres | Summary | Reviews |
|---|---|---|---|---|---|
| 0 | Elden Ring | ['Bandai Namco Entertainment', 'FromSoftware'] | ['Adventure', 'RPG'] | Elden Ring is a fantasy, action and open world... | ["The first playthrough of elden ring is one o... |
| 1 | Hades | ['Supergiant Games'] | ['Adventure', 'Brawler', 'Indie', 'RPG'] | A rogue-lite hack and slash dungeon crawler in... | ['convinced this is a roguelike for people who... |
| 2 | The Legend of Zelda: Breath of the Wild | ['Nintendo', 'Nintendo EPD Production Group No... | ['Adventure', 'RPG'] | The Legend of Zelda: Breath of the Wild is the... | ['This game is the game (that is not CS:GO) th... |
| 3 | Undertale | ['tobyfox', '8-4'] | ['Adventure', 'Indie', 'RPG', 'Turn Based Stra... | A small child falls into the Underground, wher... | ['soundtrack is tied for #1 with nier automata... |
| 4 | Hollow Knight | ['Team Cherry'] | ['Adventure', 'Indie', 'Platform'] | A 2D metroidvania with an emphasis on close co... | ["this games worldbuilding is incredible, with... |
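
One detail of this cleaning step is worth illustrating: dropna removes rows but keeps the original index labels, so labels and row positions can drift apart. That matters whenever a label is later used as a row position, for example when indexing into a similarity matrix. The sketch below uses a small made-up frame (the toy titles and summaries are assumptions, not dataset rows) to show the gap and how reset_index(drop=True) closes it.

```python
import pandas as pd

# Toy data: one duplicate title ("A") and one missing summary ("B")
toy = pd.DataFrame({
    "Title": ["A", "B", "A", "C", "D"],
    "Summary": ["first", None, "duplicate of A", "third", "fourth"],
})

# ignore_index=True renumbers the surviving rows 0..3
deduped = toy.drop_duplicates(subset="Title", keep="first", ignore_index=True)

# dropna removes row "B" but keeps the old labels, leaving a gap at 1
cleaned = deduped.dropna()

# reset_index(drop=True) makes labels equal to positions again
realigned = cleaned.reset_index(drop=True)

label_of_c = cleaned.index[cleaned["Title"] == "C"][0]        # label 2, but "C" sits at position 1
position_of_c = realigned.index[realigned["Title"] == "C"][0]  # label 1 == position 1
```

For a game near the top of the dataset the two numbers usually coincide, so a mismatch like this can go unnoticed until a lookup for a later row returns recommendations for the wrong game.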


## Build Model

Next, combine the text from the useful columns into a single string for each game.
<br>In this project I chose to test 2 different combinations of data: one including the "Reviews" data and one without it.

In [9]:
def combine_feats(row):
    return row['Team'] + " " + row['Genres'] + " " + row["Summary"] + " " + row["Reviews"]

def no_reviews(row):
    return row['Team'] + " " + row['Genres'] + " " + row["Summary"]

df['all_feats'] = df.apply(combine_feats, axis=1)
df['no_reviews'] = df.apply(no_reviews, axis=1)
print(df['all_feats'].head())
print(df['no_reviews'].head())

0    ['Bandai Namco Entertainment', 'FromSoftware']...
1    ['Supergiant Games'] ['Adventure', 'Brawler', ...
2    ['Nintendo', 'Nintendo EPD Production Group No...
3    ['tobyfox', '8-4'] ['Adventure', 'Indie', 'RPG...
4    ['Team Cherry'] ['Adventure', 'Indie', 'Platfo...
Name: all_feats, dtype: object
0    ['Bandai Namco Entertainment', 'FromSoftware']...
1    ['Supergiant Games'] ['Adventure', 'Brawler', ...
2    ['Nintendo', 'Nintendo EPD Production Group No...
3    ['tobyfox', '8-4'] ['Adventure', 'Indie', 'RPG...
4    ['Team Cherry'] ['Adventure', 'Indie', 'Platfo...
Name: no_reviews, dtype: object
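
As the output shows, the Team and Genres values are stored as stringified Python lists, so the brackets and quotes end up in the combined text. The sketch below is one optional way to tidy such a field with ast.literal_eval from the standard library. The helper name list_field_to_text and its as_single_tokens flag are hypothetical and not part of the notebook. Joining multi-word names with underscores is a design assumption intended to keep a name such as a studio together as one term.

```python
import ast

def list_field_to_text(raw, as_single_tokens=False):
    """Turn "['Adventure', 'RPG']" into 'Adventure RPG'.

    Text that cannot be parsed as a list is returned unchanged.
    """
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return str(raw)
    if not isinstance(items, (list, tuple)):
        return str(raw)
    if as_single_tokens:
        # Assumption: underscores keep multi-word names together as one term
        items = [str(item).replace(" ", "_") for item in items]
    return " ".join(str(item) for item in items)

genres_text = list_field_to_text("['Adventure', 'RPG']")
team_text = list_field_to_text(
    "['Bandai Namco Entertainment', 'FromSoftware']", as_single_tokens=True
)
fallback_text = list_field_to_text("not a list")
```

This step may not change the results much, since a word-level tokenizer typically treats punctuation as a separator anyway. The main effect would come from the underscore option, which prevents a studio name from being split into unrelated words.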


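I used TfidfVectorizer to weight the terms in each combined string by how often they appear in that game's text and how rare they are across all games. With ngram_range=(1,2), both single words and two-word phrases count as terms, and stop_words='english' removes common English words.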

In [10]:
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,2), min_df=0.0, stop_words='english')
tfMatrix = tf.fit_transform(df['all_feats'])
tfMatrix2 = tf.fit_transform(df['no_reviews'])
print(tfMatrix.shape)
print("\n")
print(tfMatrix2.shape)

(1097, 160627)


(1097, 53947)
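
To show roughly what the numbers in these matrices mean, here is a from-scratch sketch of TF-IDF in plain Python. It is a simplification under stated assumptions: it splits on whitespace, uses single words only, and skips stop-word removal, so it will not reproduce the notebook's matrices. The smoothed IDF formula and the L2 row normalisation follow what I understand to be scikit-learn's defaults. The three documents are made up.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalised rows."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: how many documents contain each term
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed IDF: rarer terms get a larger weight
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0  # guard empty documents
        rows.append([w / norm for w in row])
    return vocab, rows

# Hypothetical combined strings for three games
docs = ["souls rpg dungeon", "souls rpg open world", "racing cars"]
vocab, rows = tfidf_rows(docs)

dungeon_weight = rows[0][vocab.index("dungeon")]  # appears in 1 document
souls_weight = rows[0][vocab.index("souls")]      # appears in 2 documents
```

In the first document, "dungeon" receives a higher weight than "souls" because it occurs in fewer documents. This is the property that lets distinctive words drive the recommendations.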


I compute cosine similarity for the data with and without reviews. This tells us how related each game is to every other game based on the numerical values generated by TfidfVectorizer, and these scores are used to find game recommendations. I used linear_kernel for this project because TfidfVectorizer L2-normalizes each row by default, so the plain dot product computed by linear_kernel equals the cosine similarity and is faster to compute.

In [11]:
cosine_similarity = linear_kernel(tfMatrix, tfMatrix)
no_review_CS = linear_kernel(tfMatrix2, tfMatrix2)
print(cosine_similarity[:4, :4])
print("\n")
print(no_review_CS[:4, :4])

[[1.         0.01584567 0.02306498 0.01541052]
 [0.01584567 1.         0.03037132 0.03394257]
 [0.02306498 0.03037132 1.         0.02749863]
 [0.01541052 0.03394257 0.02749863 1.        ]]


[[1.         0.00422194 0.0408111  0.01103808]
 [0.00422194 1.         0.00870011 0.01223707]
 [0.0408111  0.00870011 1.         0.00334356]
 [0.01103808 0.01223707 0.00334356 1.        ]]
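
The 1.0 values on the diagonal and the use of a plain dot product both follow from the rows having unit length. The short check below illustrates this with two made-up vectors in plain Python. The helper names dot and l2_normalise are illustrative only.

```python
import math

def dot(a, b):
    """Plain dot product, which is what a linear kernel computes per pair."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalise(vec):
    """Scale a vector to unit length (all-zero vectors are returned as-is)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)

# Hypothetical unnormalised term weights for two games
raw_a = [3.0, 4.0, 0.0]
raw_b = [1.0, 0.0, 2.0]

# Cosine similarity computed the long way on the raw vectors
cosine = dot(raw_a, raw_b) / (
    math.sqrt(dot(raw_a, raw_a)) * math.sqrt(dot(raw_b, raw_b))
)

# Dot product on unit-length vectors gives the same score
unit_a, unit_b = l2_normalise(raw_a), l2_normalise(raw_b)
linear = dot(unit_a, unit_b)

# A unit vector compared with itself scores 1, hence the diagonal of ones
self_score = dot(unit_a, unit_a)
```

If the vectors were not normalised first, the dot product would favour games with longer text, so this shortcut is only safe when the normalisation step is known to have happened.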


## Results

Next, I run both models and print the top 10 recommended games, both with and without reviews.

In [12]:
titles = df['Title']

#get the index of the inputted video game
def get_idx(value):
    index = df.index[df['Title']==value].tolist()
    return index

idx = get_idx('Elden Ring')[0]

#run model with reviews
similar_games = list(enumerate(cosine_similarity[idx]))
sim_games_sorted = sorted(similar_games, key=lambda x:x[1], reverse=True)

print("Result with reviews")
for i in range(1, 11):
    print(f"{i}. {titles.iloc[sim_games_sorted[i][0]]}")

print("\n")
print("Results without reviews")

#run model without reviews
no_review_similar = list(enumerate(no_review_CS[idx]))
no_rev_sorted = sorted(no_review_similar, key=lambda x:x[1], reverse=True)

for i in range(1, 11):
    print(f"{i}. {titles.iloc[no_rev_sorted[i][0]]}")


Result with reviews
1. Elden Ring: Shadow of the Erdtree
2. Sekiro: Shadows Die Twice
3. Bayonetta 3
4. Dark Souls
5. Dark Souls III
6. Bloodborne: The Old Hunters
7. Kingdoms of Amalur: Reckoning
8. Dark Souls II: Scholar of the First Sin
9. Outer Wilds
10. Metal Gear Solid V: The Phantom Pain


Results without reviews
1. Elden Ring: Shadow of the Erdtree
2. Dark Souls III
3. Dark Souls II: Scholar of the First Sin
4. Dark Souls
5. The Lord of the Rings: The Two Towers
6. Code Vein
7. Ni no Kuni II: Revenant Kingdom
8. JoJo’s Bizarre Adventure: All-Star Battle R
9. Tekken 6
10. Dark Souls: Remastered


Given the input game 'Elden Ring', both models return the same #1 recommendation: 'Elden Ring: Shadow of the Erdtree', which is an expansion of the original game. The two lists also overlap in spots #2 through #10: 'Dark Souls' appears at #4 in both, and various other 'Dark Souls' games appear at other spots. One of the main differences between the model that incorporates reviews and the model that doesn't is 'Sekiro: Shadows Die Twice' at spot #2 in the model with reviews. This is due to reviews in this dataset directly mentioning 'sekiro' as a game related to 'Elden Ring'. Including reviews seems to surface additional games similar to the input game that are not found without them.
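
The lookup and ranking steps can also be wrapped in one reusable function. The sketch below is a hypothetical helper, top_n_similar, demonstrated on a tiny made-up similarity matrix rather than the notebook's data. It assumes the titles list is in the same order as the matrix rows.

```python
def top_n_similar(title, titles, sim_matrix, n=10):
    """Return up to n (title, score) pairs most similar to `title`.

    Assumes titles[i] corresponds to row i of sim_matrix.
    """
    if title not in titles:
        raise KeyError(f"{title!r} is not in the title list")
    query = titles.index(title)
    scored = [
        (titles[i], score)
        for i, score in enumerate(sim_matrix[query])
        if i != query  # skip the query game itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Toy symmetric similarity matrix (assumed values for illustration)
toy_titles = ["Game A", "Game B", "Game C", "Game D"]
toy_sim = [
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.4],
    [0.7, 0.3, 1.0, 0.5],
    [0.1, 0.4, 0.5, 1.0],
]

recs = top_n_similar("Game A", toy_titles, toy_sim, n=2)
```

Filtering out the query by index, instead of skipping the first sorted entry, avoids relying on the input game always ranking first. That assumption could fail if another entry ties with it at a score of 1.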



<div>
    <img src="/Users/brendanhaniff/Desktop/pyProj/Video_Game_Rec/Google Recs.png" width="700"/>
</div>