**`Exercise 2`**. Build a recommender system in a new notebook for recommending books from Goodreads, using the following datasets:


- `5.4.1.books-ratings.csv`
- `5.4.1.books-info.csv`


Test the recommender with `book_id = 10`.

# Load the data

In [1]:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
ratings = pd.read_csv("C:/Users/linco/Desktop/datastudy/DScourse/_module5_/_module5_/notebooks/5.4. Unsupervised_Association/GBB/bgg-15m-reviews.csv")
ratings

Unnamed: 0.1,Unnamed: 0,user,rating,comment,ID,name
0,0,Torsten,10.0,,30549,Pandemic
1,1,mitnachtKAUBO-I,10.0,Hands down my favorite new game of BGG CON 200...,30549,Pandemic
2,2,avlawn,10.0,I tend to either love or easily tire of co-op ...,30549,Pandemic
3,3,Mike Mayer,10.0,,30549,Pandemic
4,4,Mease19,10.0,This is an amazing co-op game. I play mostly ...,30549,Pandemic
...,...,...,...,...,...,...
15823264,15823264,Fafhrd65,8.0,Turn based preview looks very promising. The g...,281515,Company of Heroes
15823265,15823265,PlatinumOh,8.0,KS,281515,Company of Heroes
15823266,15823266,BunkerBill,7.0,,281515,Company of Heroes
15823267,15823267,Hattori Hanzo,6.0,,281515,Company of Heroes
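
With roughly 15 million rows, it may be worth loading only the columns that are kept later on. The sketch below is one possible way to do that with `usecols` and explicit dtypes. The two CSV rows are made up, `io.StringIO` merely stands in for the real file path, and the `float32`/`int32` choices are assumptions rather than something this notebook uses.

```python
import io
import pandas as pd

# Two made-up rows in the same column layout as the reviews file.
csv_text = """,user,rating,comment,ID,name
0,Torsten,10.0,,30549,Pandemic
1,avlawn,10.0,Great co-op game,30549,Pandemic
"""

toy = pd.read_csv(
    io.StringIO(csv_text),                       # stands in for the real file path
    usecols=["user", "rating", "ID", "name"],    # skips the index and comment columns
    dtype={"rating": "float32", "ID": "int32"},  # assumed dtypes, chosen to save memory
).rename(columns={"user": "user_id", "ID": "game_id", "name": "title"})

print(toy.dtypes)
```

Loading this way would make the separate `drop` and `rename` cells unnecessary, at the cost of a longer `read_csv` call.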


In [3]:
ratings.columns

Index(['Unnamed: 0', 'user', 'rating', 'comment', 'ID', 'name'], dtype='object')

In [4]:
ratings.drop(["Unnamed: 0", "comment"], axis=1, inplace=True)

In [5]:
ratings.head()

Unnamed: 0,user,rating,ID,name
0,Torsten,10.0,30549,Pandemic
1,mitnachtKAUBO-I,10.0,30549,Pandemic
2,avlawn,10.0,30549,Pandemic
3,Mike Mayer,10.0,30549,Pandemic
4,Mease19,10.0,30549,Pandemic


In [6]:
ratings.rename(mapper={"user":"user_id", "ID":"game_id", "name":"title"}, axis=1, inplace=True)
ratings.head()

Unnamed: 0,user_id,rating,game_id,title
0,Torsten,10.0,30549,Pandemic
1,mitnachtKAUBO-I,10.0,30549,Pandemic
2,avlawn,10.0,30549,Pandemic
3,Mike Mayer,10.0,30549,Pandemic
4,Mease19,10.0,30549,Pandemic


In [7]:
ratings.columns

Index(['user_id', 'rating', 'game_id', 'title'], dtype='object')

In [8]:
info = ratings[["game_id","title"]]
info

Unnamed: 0,game_id,title
0,30549,Pandemic
1,30549,Pandemic
2,30549,Pandemic
3,30549,Pandemic
4,30549,Pandemic
...,...,...
15823264,281515,Company of Heroes
15823265,281515,Company of Heroes
15823266,281515,Company of Heroes
15823267,281515,Company of Heroes


In [9]:
info = info.drop_duplicates()
info


Unnamed: 0,game_id,title
0,30549,Pandemic
100,822,Carcassonne
200,13,Catan
300,68448,7 Wonders
400,36218,Dominion
...,...,...
15823119,246345,Ninja Rush
15823149,195623,Politricks: Dirty Card Game
15823179,235943,Doppel X
15823209,284862,Beasty Borders


In [10]:
ratings.game_id.nunique()

19330

# Exploratory Analysis

In [11]:
# number of users

ratings.user_id.nunique()

351048

In [12]:
# number of games
ratings.game_id.nunique()

19330

In [13]:
# number of ratings per user
userfreq = ratings[["user_id", "game_id"]].groupby("user_id").count().reset_index()
userfreq.columns = ["user_id","no_ratings"]
userfreq.sort_values(by="no_ratings")

Unnamed: 0,user_id,no_ratings
84277,Khaisor,1
204393,christophersisto,1
204392,christophermrau,1
204391,christophermorrison2,1
204389,christophergodfrey,1
...,...,...
66952,Hessu68,4602
342435,warta,4811
41655,Doel,4876
161600,TomVasel,4950
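
The same per-user count can also be obtained with `value_counts`, which returns the users already sorted by activity. This is only an alternative sketch on a made-up toy frame; the name `userfreq_toy` is hypothetical.

```python
import pandas as pd

# Made-up ratings: user "a" rated three games, "b" two, "c" one.
toy = pd.DataFrame({
    "user_id": ["a", "b", "a", "c", "a", "b"],
    "game_id": [1, 1, 2, 2, 3, 3],
})

userfreq_toy = (
    toy["user_id"]
    .value_counts()                    # ratings per user, most active first
    .rename_axis("user_id")
    .reset_index(name="no_ratings")
)

print(userfreq_toy)
print(userfreq_toy["no_ratings"].describe())  # quick summary of the distribution
```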


# Bayesian stats

In [14]:
gamestats = ratings.groupby("game_id")["rating"].agg(["count","mean"])
gamestats.head(10)

game_id,count,mean
1,5089,7.630658
2,544,6.613411
3,14393,7.445493
4,337,6.612821
5,17838,7.344081
6,81,6.47963
7,3150,6.520788
8,197,6.147411
9,1358,6.467161
10,7855,6.71789


In [15]:
C=gamestats["count"].mean()
m=gamestats["mean"].mean()

In [16]:
def bayesian_avg(ratings):
    # shrink each game's mean towards the global mean m, using C pseudo-ratings as prior weight
    return (C*m+ratings.sum()) / (C+ratings.count())

In [17]:
bayesavg_ratings = ratings.groupby("game_id")["rating"].agg(bayesian_avg).reset_index()
bayesavg_ratings.columns = ["game_id","bayesian_avg"]
bayesavg_ratings.head(5)

Unnamed: 0,game_id,bayesian_avg
0,1,7.459099
1,2,6.480725
2,3,7.388831
3,4,6.456785
4,5,7.302331
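
To see what the Bayesian average does, the following sketch applies the same formula to a tiny made-up table. The data and the names `toy`, `C_toy`, `m_toy` and `bayesian_avg_toy` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Made-up data: game 1 has four ratings, game 2 has a single perfect score.
toy = pd.DataFrame({
    "game_id": [1, 1, 1, 1, 2],
    "rating":  [8.0, 7.0, 9.0, 8.0, 10.0],
})

stats = toy.groupby("game_id")["rating"].agg(["count", "mean"])
C_toy = stats["count"].mean()  # typical number of ratings per game (prior weight)
m_toy = stats["mean"].mean()   # average of the per-game means (prior mean)

def bayesian_avg_toy(r):
    # Same formula as in the notebook: the fewer ratings, the stronger the pull towards m_toy.
    return (C_toy * m_toy + r.sum()) / (C_toy + r.count())

shrunk = toy.groupby("game_id")["rating"].agg(bayesian_avg_toy)
print(stats.join(shrunk.rename("bayesian_avg")))
```

Here `C_toy` is 2.5 and `m_toy` is 9.0, so the single 10 for game 2 is pulled down to about 9.29, while game 1 moves from 8.0 to about 8.38.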


In [18]:
gamestats = gamestats.merge(bayesavg_ratings, on="game_id")
gamestats.head(10)

Unnamed: 0,game_id,count,mean,bayesian_avg
0,1,5089,7.630658,7.459099
1,2,544,6.613411,6.480725
2,3,14393,7.445493,7.388831
3,4,337,6.612821,6.456785
4,5,17838,7.344081,7.302331
5,6,81,6.47963,6.400389
6,7,3150,6.520788,6.494336
7,8,197,6.147411,6.344997
8,9,1358,6.467161,6.4391
9,10,7855,6.71789,6.687185


In [19]:
gamestats = gamestats.merge(info[["game_id","title"]], on="game_id")
gamestats.head(10)

Unnamed: 0,game_id,count,mean,bayesian_avg,title
0,1,5089,7.630658,7.459099,Die Macher
1,2,544,6.613411,6.480725,Dragonmaster
2,3,14393,7.445493,7.388831,Samurai
3,4,337,6.612821,6.456785,Tal der Könige
4,5,17838,7.344081,7.302331,Acquire
5,6,81,6.47963,6.400389,Mare Mediterraneum
6,7,3150,6.520788,6.494336,Cathedral
7,8,197,6.147411,6.344997,Lords of Creation
8,9,1358,6.467161,6.4391,El Caballero
9,10,7855,6.71789,6.687185,Elfenland


# Building the recommender system

## dictionaries

In [20]:
gametitles = dict(zip(info["game_id"], info["title"]))
gametitles

{30549: 'Pandemic',
 822: 'Carcassonne',
 13: 'Catan',
 68448: '7 Wonders',
 36218: 'Dominion',
 9209: 'Ticket to Ride',
 178900: 'Codenames',
 31260: 'Agricola',
 3076: 'Puerto Rico',
 40692: 'Small World',
 167791: 'Terraforming Mars',
 70323: 'King of Tokyo',
 14996: 'Ticket to Ride: Europe',
 148228: 'Splendor',
 2651: 'Power Grid',
 173346: '7 Wonders Duel',
 129622: 'Love Letter',
 169786: 'Scythe',
 39856: 'Dixit',
 478: 'Citadels',
 230802: 'Azul',
 28143: 'Race for the Galaxy',
 110327: 'Lords of Waterdeep',
 84876: 'The Castles of Burgundy',
 34635: 'Stone Age',
 163412: 'Patchwork',
 1927: 'Munchkin',
 65244: 'Forbidden Island',
 12333: 'Twilight Struggle',
 161936: 'Pandemic Legacy: Season 1',
 150376: 'Dead of Winter: A Crossroads Game',
 120677: 'Terra Mystica',
 174430: 'Gloomhaven',
 11: 'Bohnanza',
 98778: 'Hanabi',
 15987: 'Arkham Horror',
 50: 'Lost Cities',
 10547: 'Betrayal at House on the Hill',
 41114: 'The Resistance',
 54043: 'Jaipur',
 131357: 'Coup',
 147020:

## make the matrix

In [21]:
from scipy.sparse import csr_matrix

def create_X(df):
    """
    Generates a sparse game-by-user matrix from the ratings dataframe.
    
    Args:
        df: pandas dataframe with user_id, game_id and rating columns
    
    Returns:
        X: sparse matrix with one row per game and one column per user
        user_mapper: dict that maps user ids to user indices
        game_mapper: dict that maps game ids to game indices
        user_inv_mapper: dict that maps user indices to user ids
        game_inv_mapper: dict that maps game indices to game ids
    """
    
    N = df["user_id"].nunique() # number of users
    M = df["game_id"].nunique() # number of games
    
    user_mapper = dict(zip(np.unique(df["user_id"]), list(range(N))))
    game_mapper = dict(zip(np.unique(df["game_id"]), list(range(M))))
    
    user_inv_mapper = dict(zip(list(range(N)), np.unique(df["user_id"])))
    game_inv_mapper = dict(zip(list(range(M)), np.unique(df["game_id"])))
    
    user_index = [user_mapper[i] for i in df["user_id"]]
    game_index = [game_mapper[i] for i in df["game_id"]]
    
    X = csr_matrix((df["rating"], (game_index, user_index)), shape=(M,N))
    
    return X, user_mapper, game_mapper, user_inv_mapper, game_inv_mapper
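
A toy run makes the layout of the matrix easier to picture. The sketch below builds the same kind of game-by-user matrix from four made-up ratings, using `pd.factorize` in place of the hand-written mappers. That swap is an assumption made for brevity, not what `create_X` does.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Made-up ratings: 3 users, 2 games.
toy = pd.DataFrame({
    "user_id": ["ann", "bob", "ann", "cy"],
    "game_id": [13, 13, 822, 822],
    "rating":  [9.0, 7.0, 8.0, 6.0],
})

# factorize returns integer codes plus the sorted unique labels;
# labels[i] plays the role of the inverse mapper.
user_codes, user_labels = pd.factorize(toy["user_id"], sort=True)
game_codes, game_labels = pd.factorize(toy["game_id"], sort=True)

X_toy = csr_matrix(
    (toy["rating"], (game_codes, user_codes)),   # rows = games, columns = users
    shape=(len(game_labels), len(user_labels)),
)

print(X_toy.toarray())
```

Each row is one game and each column one user; a zero means the user has not rated that game.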

In [22]:
ratings.user_id.unique()

array(['Torsten', 'mitnachtKAUBO-I', 'avlawn', ..., 'rielnano',
       'Lost_Titan_Legacy', 'smedzz31'], dtype=object)

In [23]:
# encode the user ids
from sklearn.preprocessing import LabelEncoder

# Step 1. Instantiate the label encoder
le = LabelEncoder() 

# Step 2. Fit the encoder on the user names and replace them with integer codes
ratings['user_id'] = le.fit_transform(ratings['user_id'])
ratings.head(10)

Unnamed: 0,user_id,rating,game_id,title
0,162393,10.0,30549,Pandemic
1,282186,10.0,30549,Pandemic
2,188488,10.0,30549,Pandemic
3,104525,10.0,30549,Pandemic
4,102294,10.0,30549,Pandemic
5,202289,10.0,30549,Pandemic
6,259457,10.0,30549,Pandemic
7,34278,10.0,30549,Pandemic
8,237240,10.0,30549,Pandemic
9,199851,10.0,30549,Pandemic
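
The encoder keeps the original names, so the integer codes can be turned back into user names when needed. A minimal sketch on four made-up entries (the names `enc`, `codes` and `names` are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

users = pd.Series(["Torsten", "avlawn", "Torsten", "Mease19"])

enc = LabelEncoder()
codes = enc.fit_transform(users)       # integer code per row
names = enc.inverse_transform(codes)   # back to the original strings

print(enc.classes_)  # sorted unique labels; the position in this array is the code
print(codes)
print(names)
```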


In [24]:
ratings.user_id.nunique()

351049

In [25]:
ratings.tail(10)

Unnamed: 0,user_id,rating,game_id,title
15823259,94173,9.0,281515,Company of Heroes
15823260,180855,9.0,281515,Company of Heroes
15823261,320655,9.0,281515,Company of Heroes
15823262,61909,9.0,281515,Company of Heroes
15823263,129799,9.0,281515,Company of Heroes
15823264,50913,8.0,281515,Company of Heroes
15823265,124124,8.0,281515,Company of Heroes
15823266,22867,7.0,281515,Company of Heroes
15823267,65652,6.0,281515,Company of Heroes
15823268,132441,1.0,281515,Company of Heroes


In [26]:
ratings.sort_values(by=["rating"],ascending=True)

Unnamed: 0,user_id,rating,game_id,title
2439405,59027,1.401300e-45,42,Tigris & Euphrates
9924871,240149,1.000000e-30,5432,Chutes and Ladders
8654503,182310,1.000000e-04,3699,Killer Bunnies and the Quest for the Magic Carrot
15015812,182310,1.000000e-03,9782,Battle of the Bands: Encore Edition
15250333,147014,1.000000e-03,18041,Bunco Party
...,...,...,...,...
3726695,327193,1.000000e+01,237182,Root
3726694,288547,1.000000e+01,237182,Root
3726693,139590,1.000000e+01,237182,Root
3726691,331038,1.000000e+01,237182,Root
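
The sorted view shows a handful of ratings far below 1 (for example `1.401300e-45`), which look like data-entry artifacts. One possible clean-up, assuming ratings are meant to lie between 1 and 10 (an assumption the notebook itself does not state), is sketched here on made-up rows:

```python
import pandas as pd

# Made-up rows mixing plausible ratings with near-zero outliers.
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "game_id": [42, 42, 5432, 237182],
    "rating":  [1.4e-45, 8.0, 1e-30, 10.0],
})

valid = toy["rating"].between(1, 10)   # inclusive on both ends; assumed valid range
clean = toy[valid]

print(f"dropped {(~valid).sum()} of {len(toy)} ratings")
```

The notebook keeps these rows as they are, so the results further down are based on the unfiltered data.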


In [27]:
# number of ratings per user
userfreq = ratings[["user_id", "game_id"]].groupby("user_id").count().reset_index()
userfreq.columns = ["user_id","no_ratings"]
userfreq.sort_values(by="no_ratings")

Unnamed: 0,user_id,no_ratings
245030,245030,1
169471,169471,1
147351,147351,1
49107,49107,1
95683,95683,1
...,...,...
66952,66952,4602
342435,342435,4811
41655,41655,4876
161600,161600,4950


In [28]:
ratings = ratings.merge(userfreq, on="user_id")
ratings

Unnamed: 0,user_id,rating,game_id,title,no_ratings
0,162393,10.0,30549,Pandemic,1358
1,162393,10.0,68448,7 Wonders,1358
2,162393,10.0,178900,Codenames,1358
3,162393,10.0,31260,Agricola,1358
4,162393,10.0,148228,Splendor,1358
...,...,...,...,...,...
15823264,811,10.0,281515,Company of Heroes,1
15823265,146512,10.0,281515,Company of Heroes,1
15823266,307953,10.0,281515,Company of Heroes,1
15823267,94173,9.0,281515,Company of Heroes,1


In [29]:
# actually creating the matrix

X, user_mapper, game_mapper, user_inv_mapper, game_inv_mapper = create_X(ratings)

In [30]:
X

<19330x351049 sparse matrix of type '<class 'numpy.float64'>'
	with 15823269 stored elements in Compressed Sparse Row format>

In [31]:
# calculate the density of our matrix (share of game-user cells that hold a rating)

density = X.count_nonzero()/(X.shape[0]*X.shape[1])

print("The density of this matrix is ", round(density*100,2), "%")

The density of this matrix is  0.23 %
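
The ratio of non-zero entries to all cells is the share of filled cells, usually called density; sparsity is commonly defined as its complement. A small sketch on a made-up 2x4 matrix shows both numbers side by side:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up matrix with 3 ratings in 8 cells.
X_toy = csr_matrix(np.array([
    [9.0, 0.0, 0.0, 7.0],
    [0.0, 0.0, 8.0, 0.0],
]))

n_cells = X_toy.shape[0] * X_toy.shape[1]
density = X_toy.count_nonzero() / n_cells   # share of cells with a rating
sparsity = 1 - density                      # share of empty cells

print(f"density: {density:.2%}, sparsity: {sparsity:.2%}")
```

By the same arithmetic, a filled share of 0.23 % for the full matrix means about 99.77 % of the game-user pairs have no rating.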


## k-nearest neighbors

In [32]:
from sklearn.neighbors import NearestNeighbors

KNN = NearestNeighbors(n_neighbors=10, metric="cosine") # cosine distance compares the direction of the rating vectors, not their length
KNN.fit(X)
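
One way to build intuition for the choice of metric is to compare neighbour rankings on a tiny made-up matrix. In the sketch below, game 1 has the same raters as the query game but far lower scores, while game 2 has similar scores plus an extra rater. All values are invented for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# rows = games, columns = users (made-up ratings, 0 = not rated)
X_toy = csr_matrix(np.array([
    [10.0, 10.0, 0.0, 0.0],  # game 0: the query
    [ 2.0,  3.0, 0.0, 0.0],  # game 1: same raters, much lower scores
    [10.0,  8.0, 5.0, 0.0],  # game 2: similar scores, one extra rater
    [ 0.0,  0.0, 7.0, 9.0],  # game 3: no raters in common
]))

orders = {}
for metric in ("cosine", "euclidean"):
    knn = NearestNeighbors(n_neighbors=4, metric=metric, algorithm="brute")
    knn.fit(X_toy)
    idx = knn.kneighbors(X_toy[0], return_distance=False)
    orders[metric] = idx[0].tolist()

print(orders)
```

Cosine distance ranks game 1 ahead of game 2 because it ignores the length of the rating vectors, whereas Euclidean distance prefers game 2 because its raw scores are numerically closer.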

In [33]:
# use the mapper to find the index of the game we want to assess

gameindex = game_mapper[10]
gameindex

9

In [34]:
game_toassess = X[gameindex]
game_toassess

<1x351049 sparse matrix of type '<class 'numpy.float64'>'
	with 7855 stored elements in Compressed Sparse Row format>

In [35]:
# find the nearest neighbors to this game
neig = KNN.kneighbors(game_toassess, return_distance=False)
neig

array([[  9, 673,  84,  80,  51, 109, 429, 406, 218, 311]], dtype=int64)

In [36]:
# use the inverse mapper to find the ids of the neighbours

neig_ids = []
for i in range(1,10): # skip position 0, which is the query game itself
    n = neig.item(i)
    neig_ids.append(game_inv_mapper[n])

In [37]:
neig_ids

[826, 93, 88, 54, 120, 503, 475, 256, 361]

In [38]:
# use the ids to find the titles

print("because you enjoyed playing:",gametitles[10])
for i in neig_ids:
    print(gametitles[i])

because you enjoyed playing: Elfenland
Cartagena
El Grande
Torres
Tikal
Hoity Toity
Through the Desert
Taj Mahal
Mississippi Queen
Hare & Tortoise


## make a function to do it

In [39]:
def rec_games(game_id, df, X, k, metric="cosine"):
    """
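    Prints the titles of the games most similar to a given game id.
    
    Args:
        game_id: id of the game of interest
        df: dataframe with game_id and title columns, used to look up titles
        X: game-by-user utility matrix
        k: number of neighbours to query; the first one is the game itself, so k-1 titles are printed
        metric: distance metric for kNN calculations
    
    Returns:
        None; the recommendations are printed. Relies on the global game_mapper and game_inv_mapper.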
    """
        
    neighbour_ids = []
        
    game_ind = game_mapper[game_id]
    game_to_assess = X[game_ind]
    KNN = NearestNeighbors(n_neighbors=k, metric=metric)
        
    KNN.fit(X)
        
    neighbour = KNN.kneighbors(game_to_assess, return_distance=False)
        
    for i in range(1,k):
        n = neighbour.item(i)
        neighbour_ids.append(game_inv_mapper[n])
            
    game_titles = dict(zip(df["game_id"], df["title"]))
    game_title = game_titles[game_id]
        
    print("Because you enjoyed playing:", game_title, ", I think you would enjoy these:")
    for i in neighbour_ids:
        print(game_titles[i])
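
A variant that returns the ids instead of printing them can be easier to reuse and test. The sketch below is one possible shape for it. The function name `similar_games`, its parameters and the toy data are hypothetical, and the mappers are passed in explicitly rather than read from globals.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def similar_games(game_id, X, game_mapper, game_inv_mapper, k=3, metric="cosine"):
    """Return the ids of up to k games closest to game_id, excluding the game itself."""
    game_ind = game_mapper[game_id]
    # Ask for one extra neighbour, because the query game is part of the fitted data.
    n = min(k + 1, X.shape[0])
    knn = NearestNeighbors(n_neighbors=n, metric=metric, algorithm="brute").fit(X)
    idx = knn.kneighbors(X[game_ind], return_distance=False)[0]
    return [game_inv_mapper[i] for i in idx if i != game_ind][:k]

# Made-up data: four games (ids 10, 20, 30, 40) rated by four users.
ids = [10, 20, 30, 40]
toy_mapper = {g: i for i, g in enumerate(ids)}
toy_inv_mapper = {i: g for g, i in toy_mapper.items()}
X_toy = csr_matrix(np.array([
    [10.0, 10.0, 0.0, 0.0],
    [ 2.0,  3.0, 0.0, 0.0],
    [10.0,  8.0, 5.0, 0.0],
    [ 0.0,  0.0, 7.0, 9.0],
]))

recs = similar_games(10, X_toy, toy_mapper, toy_inv_mapper, k=2)
print(recs)
```

Requesting `k + 1` neighbours means the caller gets exactly `k` recommendations back. For repeated queries on the full matrix, fitting the model once outside the function would likely be faster.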

In [40]:
rec_games(10, info, X, k=10)

Because you enjoyed playing: Elfenland , I think you would enjoy these:
Cartagena
El Grande
Torres
Tikal
Hoity Toity
Through the Desert
Taj Mahal
Mississippi Queen
Hare & Tortoise


In [41]:
info[info.title.str.contains('Pandemic')]

Unnamed: 0,game_id,title
0,30549,Pandemic
2893,161936,Pandemic Legacy: Season 1
5362289,221107,Pandemic Legacy: Season 2
6539377,150658,Pandemic: The Cure
6545946,198928,Pandemic: Iberia
7406011,192153,Pandemic: Reign of Cthulhu
9929732,260428,Pandemic: Fall of Rome
9936696,157789,Pandemic: Contagion
11432557,234671,Pandemic: Rising Tide
12224758,280789,Pandemic: Rapid Response
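
When searching for titles with punctuation (such as "Clank!") or with unknown capitalisation, the extra arguments of `str.contains` can help. A small sketch on a made-up lookup table, where `info_toy` is a hypothetical stand-in for `info`:

```python
import pandas as pd

info_toy = pd.DataFrame({
    "game_id": [201808, 233371, 30549, 1],
    "title": [
        "Clank!: A Deck-Building Adventure",
        "Clank! In! Space!: A Deck-Building Adventure",
        "Pandemic",
        None,  # a missing title, to show the effect of na=False
    ],
})

mask = info_toy["title"].str.contains(
    "clank!",
    case=False,   # ignore capitalisation
    regex=False,  # treat the pattern as plain text, not as a regular expression
    na=False,     # rows without a title count as no match
)
hits = info_toy[mask]
print(hits)
```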


In [42]:
rec_games(30549, info, X, k=10)

Because you enjoyed playing: Pandemic , I think you would enjoy these:
7 Wonders
Carcassonne
Dominion
Catan
Ticket to Ride
Codenames
Small World
King of Tokyo
Love Letter


In [43]:
info[info.title.str.contains('Nemesis')]

Unnamed: 0,game_id,title
6538181,167355,Nemesis
15467076,197076,Nemesis: Burma 1944
15671818,310100,Nemesis: Lockdown
15780467,200959,1813: Napoleon's Nemesis


In [44]:
rec_games(167355, info, X, k=10)

Because you enjoyed playing: Nemesis , I think you would enjoy these:
Lords of Hellas
Rising Sun
Blood Rage
Mansions of Madness: Second Edition
Scythe
Terraforming Mars
Root
Star Wars: Rebellion
Tainted Grail: The Fall of Avalon


In [45]:
info[info.title.str.contains('Clank')]

Unnamed: 0,game_id,title
7980,201808,Clank!: A Deck-Building Adventure
6544255,233371,Clank! In! Space!: A Deck-Building Adventure
11027330,266507,Clank! Legacy: Acquisitions Incorporated


In [46]:
rec_games(266507, info, X, k=10)

Because you enjoyed playing: Clank! Legacy: Acquisitions Incorporated , I think you would enjoy these:
Clank! In! Space!: A Deck-Building Adventure
Clank!: A Deck-Building Adventure
The Quacks of Quedlinburg
Space Base
Everdell
Roll Player
Wingspan
Tapestry
Charterstone


In [47]:
info[info.title.str.contains('Everrain')]

Unnamed: 0,game_id,title
15797255,252315,The Everrain


In [48]:
rec_games(252315, info, X, k=10)

Because you enjoyed playing: The Everrain , I think you would enjoy these:
Oathsworn: Into the Deepwood
Solomon Kane
Warhammer Underworlds: Dreadfane
Limbo: Eternal War
Mighty Morphin Power Rangers Game
Waste Knights: Second Edition
The Great Wall
Krosmaster: Blast
Village Attacks


In [49]:
info[info.title.str.contains('Twilight')]

Unnamed: 0,game_id,title
2793,12333,Twilight Struggle
3658174,12493,Twilight Imperium (Third Edition)
5359694,233078,Twilight Imperium (Fourth Edition)
13318833,180199,"Colonial Twilight: The French-Algerian War, 19..."
13383735,24,Twilight Imperium
13756126,26055,Twilight Imperium (Second Edition)
14506148,191364,Twilight Squabble
14844496,63385,Anima: Twilight of the Gods
14932425,21779,1914: Twilight in the East
15298572,206904,Twilight of the Gods


In [50]:
rec_games(233078, info, X, k=10)

Because you enjoyed playing: Twilight Imperium (Fourth Edition) , I think you would enjoy these:
Star Wars: Rebellion
Scythe
Terraforming Mars
Blood Rage
Rising Sun
Root
Twilight Imperium (Third Edition)
Gloomhaven
Through the Ages: A New Story of Civilization


In [52]:
info[info.title.str.contains('Flatline')]

Unnamed: 0,game_id,title
12481842,216597,Flatline


In [53]:
rec_games(216597, info, X, k=10)

Because you enjoyed playing: Flatline , I think you would enjoy these:
FUSE
Flip Ships
Sentient
Covert
Ex Libris
Pandemic: The Cure
Magic Maze
Quadropolis
Potion Explosion
