# Game Clustering

I want to develop a metric that measures how similar the games in a user's collection are to each other.

Boardgames are as listed here: https://boardgamegeek.com/browse/boardgame

Let's use Pandemic Legacy: Season 1 as an example: https://boardgamegeek.com/boardgame/161936/pandemic-legacy-season-1

From the page, here are parameters that could be useful for clustering: <br>
#players: We'll use MAX player count <br>
playtime: MAX minutes <br>
Weight: A value between 0-5 <br>
Category: https://boardgamegeek.com/browse/boardgamecategory : total of 84 possible categories<br>
Mechanisms: https://boardgamegeek.com/browse/boardgamemechanic, https://boardgamegeek.com/wiki/page/mechanism : total of 51 bgg-recognized mechanisms

So we'll use these as our parameters.
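As a rough illustration of where these parameters are headed, one possible similarity measure for the binary category and mechanism flags is the Jaccard index: the number of shared tags divided by the number of tags found in either game. This is only a sketch under that assumption, not the metric this notebook settles on. The function name is made up, and the second game's tags are invented for the example.

```python
def jaccard_similarity(tags_a, tags_b):
    """Share of tags two games have in common (1.0 = identical tag sets)."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0  # no tags on either side: treat as not comparable
    return len(a & b) / len(a | b)

# Mechanics scraped for Gloomhaven later in this notebook
gloomhaven = ['Action / Movement Programming', 'Co-operative Play', 'Grid Movement',
              'Hand Management', 'Modular Board', 'Role Playing']
# Hypothetical game, invented for illustration only
other_game = ['Co-operative Play', 'Hand Management', 'Set Collection']

similarity = jaccard_similarity(gloomhaven, other_game)
print(round(similarity, 3))
```

The numeric parameters (#players, playtime, weight) would need their own treatment, such as scaling, before they could be combined with a tag-based score like this.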

Additional info to have: <br>
Game Name <br>
Game ID <br>
Game rank

### Lists:

Already have the following lists:

bgg id output.csv: Master list of all boardgames on bgg

BGG categories.csv: Master list of all possible boardgame categories

BGG mechanics.csv: Master list of all possible boardgame mechanics

In [2]:
#General libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline

#For webscraping
from bs4 import BeautifulSoup
import requests

#Regular expression
import re

In [3]:
# Get the games list we'll use
games_list = pd.read_csv('bgg id output.csv')

#Remove all NaN rows
games_list.dropna(axis=0,how='any',inplace=True)
games_list.reset_index(drop=True,inplace=True)

#There are repeat titles in the list. Keep the first appearance of each and drop the rest.
games_list.drop_duplicates(subset='Game', keep='first', inplace=True)
games_list.reset_index(drop=True,inplace=True)

#Convert GameID to int
games_list['GameID'] = games_list['GameID'].astype(int)

#Convert BGG Rank to int
games_list['BGG Rank'] = games_list['BGG Rank'].astype(int)

## Create dataframe

Columns: <br>
Game name <br>
Game rank <br>
Game ID <br>

#players <br>
playtime <br>
Weight <br>
84 categories (each a unique column) <br>
51 mechanisms (each a unique column)<br>

total categories <br>
total mechanics
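The layout above amounts to a one-hot encoding: each category and mechanism gets its own 0/1 column. A minimal sketch of how one game's tags map onto such a row, using a shortened stand-in for the full category list (`one_hot_row` and `demo_categories` are illustrative names, not part of the notebook):

```python
def one_hot_row(all_tags, game_tags):
    """Return a 0/1 list aligned with all_tags; 1 marks a tag the game has."""
    present = set(game_tags)
    return [1 if tag in present else 0 for tag in all_tags]

# Shortened stand-in for the full category list (the real one has 84 entries)
demo_categories = ['Abstract Strategy', 'Action / Dexterity', 'Adventure',
                   'Age of Reason', 'Fantasy']
game_categories = ['Adventure', 'Fantasy']

row = one_hot_row(demo_categories, game_categories)
print(row)  # [0, 0, 1, 0, 1]
```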

In [4]:
#Load the categories and mechanisms pulled from bgg
cat_list = pd.read_csv('BGG categories.csv',sep='\t')
mech_list = pd.read_csv('BGG mechanics.csv',sep='\t')

#Convert the dataframes into lists
cat_list = cat_list['Categories'].tolist()
mech_list = mech_list['Mechanics'].tolist()

In [5]:
#Establish column headings
columns = ['Game name', 'Game rank', 'Game ID', '#players', 'playtime', 'weight'] + cat_list + mech_list + ['total categories', 'total mechanics']

#Create dataframe filled with 0's
game_attributes = pd.DataFrame(0, index=np.arange(games_list.shape[0]), columns=columns)

In [6]:
game_attributes.head()

Unnamed: 0,Game name,Game rank,Game ID,#players,playtime,weight,Abstract Strategy,Action / Dexterity,Adventure,Age of Reason,...,Tile Placement,Time Track,Trading,Trick-taking,Variable Phase Order,Variable Player Powers,Voting,Worker Placement,total categories,total mechanics
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [79]:
## In case dataframes need to be created with specific datatypes preset:

# To create the dataframe, we need Game name column to be str and the other columns to be int
#Create Game name df
#g1 = pd.DataFrame('', index=np.arange(games_list.shape[0]), columns=[columns[0]])
#Create other columns df
#g2 = pd.DataFrame(0, index=np.arange(games_list.shape[0]), columns=columns[1:])

#concatenate
#game_attributes = pd.concat([g1,g2],axis=1)


## Get Individual Game Attributes

1) Get #players, playtime, Weight from the game's page

2) Get game categories

3) Get game mechanics

#### Get General Attributes

In [361]:
# Fetch the game's credits page and parse it with BeautifulSoup
from bs4 import BeautifulSoup
import requests

# Build the credits URL from the game ID and name
game_id = 174430
game_name = 'gloomhaven'
url = "https://boardgamegeek.com/boardgame/" + str(game_id) + "/" + game_name + '/credits'
#url = "https://boardgamegeek.com/boardgame/161936/pandemic-legacy-season-1/credits"

# Download the page HTML
r = requests.get(url)
page = r.text

soup = BeautifulSoup(page, "lxml")

In [362]:
# Collect every <script> tag on the page
script = soup.find_all('script')

In [478]:
#To get a URL
soup.find('head').find('link')['href']

'https://boardgamegeek.com/boardgame/174430/gloomhaven'

In [364]:
current_game = script[0]

In [365]:
current_game

<script>
	var GEEK = {};
	GEEK.adBlock = [];
	GEEK.adConfig = {"hideleaderboard":false,"hideskyscraper":false,"noadsense":false,"showbggstorewidget":false};
	GEEK.googleTargets = {"gameid":["174430"],"companyid":["27425","18852"],"temp_co":["hot"],"familyid":["45000","46988","42165","46075","24281","25158","8374","45610","25404","5666","5497","5496"],"propertyid":["2689","1022","2023","1020","1010","1046","2676","2040","1047","2011","2028","2020","2027","2015"],"personid":["69802","77084","78961","84269"],"temp_ppl":["hot"],"temp_game":["hot"]};
	GEEK.userid = 0;
	GEEK.domainname = 'boardgamegeek.com';
	GEEK.domain = 'boardgame';
	GEEK.geekitemPreload = {"item":{"itemdata":[{"datatype":"geekitem_fielddata","fieldname":"name","title":"Primary Name","primaryname":true,"required":true,"unclickable":true,"fullcredits":true,"subtype":"boardgame","keyname":"name"},{"datatype":"geekitem_fielddata","fieldname":"alternatename","title":"Alternate Names","alternate":true,"unclickable":true,"fullc

In [366]:
game = current_game.text

In [368]:
# Max Player
player_pos = [m.start() for m in re.finditer('maxplayer',game)]
maxplayer = game[player_pos[2]:game.find(',',player_pos[2])]
maxplayer = maxplayer[13:-1]
print(maxplayer)

4


In [369]:
# Max Playtime
time_pos = [m.start() for m in re.finditer('maxplaytime',game)]
maxtime = game[time_pos[2]:game.find(',',time_pos[2])]
maxtime = maxtime[14:-1]
print(maxtime)

120


In [370]:
# Weight
weight_pos = [m.start() for m in re.finditer('averageweight',game)]
weight = game[weight_pos[0]:game.find(',',weight_pos[0])]
weight = weight[15:]
print(weight)

3.7841726618705
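The three cells above cut each value out by fixed character offsets, which breaks if a key name or its quoting changes. A sketch of the same extraction with a regular expression that captures the value directly is shown below. The `sample` string is a hand-written stand-in modeled on the offsets used above (quoted values for `maxplayers` and `maxplaytime`, a bare number for `averageweight`), not real page output, and `extract_field` is a hypothetical helper.

```python
import re

def extract_field(text, key, occurrence=0):
    """Return the value stored under "key" in JSON-like text, as a string.

    Handles both quoted values ("maxplayers":"4") and bare numbers
    ("averageweight":3.78). `occurrence` picks which match to use.
    Returns None when there are not enough matches.
    """
    pattern = r'"' + re.escape(key) + r'":"?([^",}]*)"?'
    matches = re.findall(pattern, text)
    return matches[occurrence] if len(matches) > occurrence else None

# Hand-written stand-in for the page script, not real BGG output
sample = ('{"minplayers":"1","maxplayers":"4","maxplaytime":"120",'
          '"averageweight":3.7841726618705,"x":1}')

max_players = int(extract_field(sample, 'maxplayers'))
max_time = int(extract_field(sample, 'maxplaytime'))
avg_weight = float(extract_field(sample, 'averageweight'))
print(max_players, max_time, avg_weight)
```

The notebook reads the third occurrence of `maxplayer` and `maxplaytime` on the real page, so `occurrence=2` would be the matching choice there.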


#### Get categories

In [384]:
#Get the position of every 'boardgamecategory' occurrence
boardgamecategory_pos = [m.start() for m in re.finditer('boardgamecategory',game)]

#From inspecting the script, the fifth match (index 4) starts the segment that lists the actual categories
#Get the string that contains all the categories
categories = game[boardgamecategory_pos[4]:game.find(']',boardgamecategory_pos[4])]

#Split by '{' character
categories = categories.split('{')

In [385]:
Categories = [] #Empty list to store the categories

#First value is just the heading. Start with second
for i in range(1,len(categories)):
    #Each entry starts with the pair "name":"category", so we pull the category out as follows:
    #split by ',' and take the first piece, then split by '"' and take the fourth piece (index 3)
    #The script escapes forward slashes with a backslash: e.g. "Action \/ Movement Programming"
    #Use a regular expression (re) to strip the backslashes: the pattern '\\\\' matches one literal '\'
    cat = re.sub('\\\\','',categories[i].split(',')[0].split('"')[3])
    Categories.append(cat)

In [386]:
Categories

['Adventure', 'Exploration', 'Fantasy', 'Fighting', 'Miniatures']

In [429]:
#Get total categories for given boardgame
#Get the string that contains the number
total_cat = game[boardgamecategory_pos[-1]:game.find(',',boardgamecategory_pos[-1])]
#Get the number and convert to int
total_cat = int(total_cat.split(':')[1])
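The split-based parsing above depends on `"name"` being the first field of each entry. Since the segment looks like a JSON array of objects, another option is to let the `json` module decode it, which also turns the escaped `\/` into a plain `/`. The fragment below is a hand-written stand-in: only the `"name"` key comes from the notebook, and `"hypothetical_key"` is invented to show that field order no longer matters.

```python
import json

# Hand-written stand-in for the category segment of the page script.
# "hypothetical_key" is invented; only "name" is taken from the notebook.
fragment = (r'[{"name":"Adventure","hypothetical_key":"1"},'
            r'{"hypothetical_key":"2","name":"Action \/ Dexterity"}]')

entries = json.loads(fragment)  # JSON decoding turns \/ into /
category_names = [entry["name"] for entry in entries]
print(category_names)  # ['Adventure', 'Action / Dexterity']
```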

#### Get mechanisms

In [387]:
#Get the position of every 'boardgamemechanic' occurrence
mechanic_pos = [m.start() for m in re.finditer('boardgamemechanic',game)]

#Same as Categories; the fifth match (index 4) starts the segment we want
mechanic = game[mechanic_pos[4]:game.find(']',mechanic_pos[4])]

#Split by '{' character
mechanic = mechanic.split('{')

In [388]:
Mechanics = [] #Empty list to store the mechanics

#First value is just the heading. Start with the second
for i in range(1,len(mechanic)):
    #The script escapes forward slashes with a backslash: e.g. "Action \/ Movement Programming"
    #Use a regular expression (re) to strip the backslashes
    #1) From mechanic list, take the current string
    #2) Split string by ','; the mechanics name is in the first item (index 0)
    #3) Split by '"'; the actual mechanics name is the fourth item (index 3)
    #4) Remove every backslash (the pattern '\\\\' matches one literal '\')
    mech = re.sub('\\\\','',mechanic[i].split(',')[0].split('"')[3])
    Mechanics.append(mech)

In [390]:
Mechanics

['Action / Movement Programming',
 'Co-operative Play',
 'Grid Movement',
 'Hand Management',
 'Modular Board',
 'Role Playing']

In [435]:
#Get total mechanics for given boardgame
#Get the string that contains the number
total_mech = game[mechanic_pos[-1]:game.find(',',mechanic_pos[-1])]
#Get the number and convert to int
total_mech = int(total_mech.split(':')[1])

Noticed that when a game has more than 6 mechanics, only the first six get pulled. Gloomhaven, for example, lists more mechanics on bgg than the six shown above.
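One way to flag affected games automatically is to compare the number of names actually scraped with the total the page reports. A small sketch, with `is_incomplete` as a hypothetical helper name:

```python
def is_incomplete(names_found, total_reported):
    """True when fewer names were scraped than the page says exist."""
    return len(names_found) < total_reported

# Gloomhaven: six mechanics were pulled above, while the cleaned data
# further down reports 9 mechanics in total for this game
mechanics_found = ['Action / Movement Programming', 'Co-operative Play',
                   'Grid Movement', 'Hand Management', 'Modular Board',
                   'Role Playing']
needs_manual_check = is_incomplete(mechanics_found, 9)
print(needs_manual_check)  # True
```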

## Create the script to automate getting all game attributes

In [173]:
from bs4 import BeautifulSoup
import requests
from time import time, sleep

for i in range(13812,len(games_list)):
      
    print(i)
    game_id = games_list['GameID'][i]
#    print(game_id)
    game_rank = games_list['BGG Rank'][i]
#    print(game_rank)
    game_name = games_list['Game'][i]
#    print(game_name)
    
    #The full URL needs the game's name slug, so request the page by GameID alone first
    url = "https://boardgamegeek.com/boardgame/"+str(game_id)
    r = requests.get(url)
    page = r.text

    soup = BeautifulSoup(page, "lxml")

    #Read the full game URL from the first <link> in the page head, then append '/credits'
    url = soup.find('head').find('link')['href']
    url = url+'/credits'
    r = requests.get(url)
    page = r.text

    soup = BeautifulSoup(page, "lxml")
    
    #Get the game script that contains all the relevant info
    script = soup.find_all(lambda tag: tag.name=='script')
 
    script_num = 0
    script_get = 0

    #Check each script on the page; find the script that contains 'maxplayers' which will include all other relevant info
    while (script_num < len(script)) and script_get == 0:
        if 'maxplayers' in script[script_num].text:
            script_get = 1
        else:
            script_num += 1
    
    if script_num < len(script):
        current_game = script[script_num]
        game = current_game.text


        #----#PLAYERS----#
        player_pos = [m.start() for m in re.finditer('maxplayer',game)]
        maxplayer = game[player_pos[2]:game.find(',',player_pos[2])]
        maxplayer = maxplayer[13:-1]
    #    print(maxplayer)

        #----MAX PLAYTIME----#
        time_pos = [m.start() for m in re.finditer('maxplaytime',game)]
        maxtime = game[time_pos[2]:game.find(',',time_pos[2])]
        maxtime = maxtime[14:-1]
    #    print(maxtime)

        #----WEIGHT----#
        weight_pos = [m.start() for m in re.finditer('averageweight',game)]
        weight = game[weight_pos[0]:game.find(',',weight_pos[0])]
        weight = weight[15:]
    #    print(weight)


        #----CATEGORIES----#
        #Get the position of every 'boardgamecategory' occurrence
        boardgamecategory_pos = [m.start() for m in re.finditer('boardgamecategory',game)]

        #The fifth match (index 4) starts the segment that lists the actual categories
        #Get the string that contains all the categories
        categories = game[boardgamecategory_pos[4]:game.find(']',boardgamecategory_pos[4])]

        #Split by '{' character
        categories = categories.split('{')

        Categories = [] #Empty list to store the categories

        #First value is just the heading. Start with second
        for k in range(1,len(categories)):
            #Each entry starts with the pair "name":"category"
            #Split by ',' and take the first piece, then split by '"' and take the fourth piece (index 3)
            #The script escapes forward slashes with a backslash: e.g. "Action \/ Movement Programming"
            #Strip the backslashes with re.sub (the pattern '\\\\' matches one literal '\')
            cat = re.sub('\\\\','',categories[k].split(',')[0].split('"')[3])
            Categories.append(cat)
    #    print(Categories)

        #Get total categories for given boardgame
        #Get the string that contains the number
        total_cat = game[boardgamecategory_pos[-1]:game.find(',',boardgamecategory_pos[-1])]
        #Get the number and convert to int
        total_cat = int(total_cat.split(':')[1])

        #----MECHANICS----#
        #Get the position of every 'boardgamemechanic' occurrence
        mechanic_pos = [m.start() for m in re.finditer('boardgamemechanic',game)]

        #Same as Categories; the fifth match (index 4) starts the segment we want
        mechanic = game[mechanic_pos[4]:game.find(']',mechanic_pos[4])]

        #Split by '{' character
        mechanic = mechanic.split('{')

        Mechanics = [] #Empty list to store the mechanics

        #First value is just the heading. Start with the second
        for k in range(1,len(mechanic)):
            #The script escapes forward slashes with a backslash: e.g. "Action \/ Movement Programming"
            #Use a regular expression (re) to strip the backslashes
            #1) From mechanic list, take the current string
            #2) Split string by ','; the mechanics name is in the first item (index 0)
            #3) Split by '"'; the actual mechanics name is the fourth item (index 3)
            #4) Remove every backslash (the pattern '\\\\' matches one literal '\')
            mech = re.sub('\\\\','',mechanic[k].split(',')[0].split('"')[3])
            Mechanics.append(mech)
    #    print(Mechanics)

        #Get total mechanics for given boardgame
        #Get the string that contains the number
        total_mech = game[mechanic_pos[-1]:game.find(',',mechanic_pos[-1])]
        #Get the number and convert to int
        total_mech = int(total_mech.split(':')[1])


        #----Add data to game_attributes df----#
        game_attributes.loc[i, 'Game name'] = game_name
        game_attributes.loc[i, 'Game rank'] = game_rank
        game_attributes.loc[i, 'Game ID'] = game_id
        game_attributes.loc[i, '#players'] = maxplayer
        game_attributes.loc[i, 'playtime'] = maxtime
        game_attributes.loc[i, 'weight'] = weight

        for cat in Categories:
            game_attributes.loc[i, cat] = 1

        for mech in Mechanics:
            game_attributes.loc[i, mech] = 1

        game_attributes.loc[i, 'total categories'] = total_cat
        game_attributes.loc[i, 'total mechanics'] = total_mech

    #Set a time limit between each loop to reduce bgg load
    sleep(2)

13812
13813
...
13954
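A run this long can stop partway when a single page has an unexpected layout: for example, `player_pos[2]` raises an `IndexError` if fewer than three matches are found. A sketch of a loop that records such failures and keeps going is shown below. `scrape_all`, `fetch_page` and `parse_page` are hypothetical names standing in for the request and extraction steps of the loop above, and the fake functions exist only so the sketch runs without touching the network.

```python
def scrape_all(game_ids, fetch_page, parse_page):
    """Parse every game, collecting failures instead of stopping the run.

    fetch_page and parse_page are hypothetical callables standing in for
    the requests/BeautifulSoup steps and the attribute extraction.
    """
    results, failures = {}, {}
    for game_id in game_ids:
        try:
            results[game_id] = parse_page(fetch_page(game_id))
        except (IndexError, KeyError, ValueError) as err:
            failures[game_id] = repr(err)  # remember why, then move on
    return results, failures

# Fake stand-ins so the sketch runs offline
fake_pages = {1: '"maxplayers":"4"', 2: 'page with no game data'}

def fake_fetch(game_id):
    return fake_pages[game_id]

def fake_parse(page):
    return int(page.split('"')[3])  # IndexError when the field is missing

results, failures = scrape_all([1, 2], fake_fetch, fake_parse)
print(results)         # {1: 4}
print(list(failures))  # [2]
```

In a real run the `sleep(2)` pause between requests would stay inside the loop, and the failed IDs could be retried or inspected by hand afterwards.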


In [176]:
# Save the game attribute dataframe
#Took hours to create, so save as csv so we can load it quickly
game_attributes.to_csv('bgg game attributes.csv',sep='\t')

# Cleaning game_attributes dataset

1) Any game with no categories or no mechanics can be removed.

2) When a game has more than 6 categories or more than 6 mechanics, the script only "sees" the first six. The binary category/mechanic columns for those games need to be adjusted manually, so we collect them in a separate list and go through it by hand.

In [2]:
# Load csv we had saved
ga = pd.read_csv('bgg game attributes.csv')

#Drop the first column (just indices)
ga.drop('Column1', inplace=True, axis=1)

#Rename headers with the actual headings (stored in first row)
ga.rename(columns=ga.iloc[0], inplace=True)
ga.drop(0, inplace=True, axis=0)
ga.reset_index(drop=True, inplace=True)
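The attributes file was written with `sep='\t'`, while the load above uses the default separator and then repairs the header, which suggests the file was edited outside the notebook in between (an assumption; the notebook does not say). If the CSV is read back untouched, passing the same separator and `index_col=0` restores the frame without any header clean-up. A sketch on a tiny invented frame, using an in-memory buffer in place of the real file:

```python
import io
import pandas as pd

# Tiny invented frame standing in for game_attributes
demo = pd.DataFrame({'Game name': ['A', 'B'], 'total mechanics': [7, 0]})

buffer = io.StringIO()          # stands in for the file on disk
demo.to_csv(buffer, sep='\t')   # same call style as the save cell
buffer.seek(0)

# Same separator on the way back in; first column is the saved index
restored = pd.read_csv(buffer, sep='\t', index_col=0)
print(list(restored.columns))   # ['Game name', 'total mechanics']
```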



In [3]:
ga.head()

Unnamed: 0,Game name,Game rank,Game ID,#players,playtime,weight,Abstract Strategy,Action / Dexterity,Adventure,Age of Reason,...,Tile Placement,Time Track,Trading,Trick-taking,Variable Phase Order,Variable Player Powers,Voting,Worker Placement,total categories,total mechanics
0,Pandemic Legacy: Season 1,1,161936,4,60,2.810298102981,0,0,0,0,...,0,0,1,0,0,0,0,0,2,7
1,Through the Ages: A New Story of Civilization,2,182028,4,240,4.3604436229205,0,0,0,0,...,0,0,0,0,0,0,0,0,3,3
2,Twilight Struggle,3,12333,2,180,3.5463510848126,0,0,0,0,...,0,0,0,0,0,0,0,0,3,5
3,Gloomhaven,4,174430,4,120,3.7831541218638,0,0,1,0,...,0,0,0,0,0,0,0,0,5,9
4,Star Wars: Rebellion,5,187645,4,240,3.6142506142506,0,0,0,0,...,0,0,0,0,0,1,0,0,5,6


In [223]:
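#Find the rows that have 0 categories or 0 mechanics
#The counts may have been read in as strings, so cast to int before comparing
to_remove = ga[(ga['total categories'].astype(int) == 0) | (ga['total mechanics'].astype(int) == 0)]

#Remove the rows from the ga df
ga.drop(to_remove.index, inplace=True)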

In [245]:
manual = ga[(ga['total mechanics'].astype(int) > 6) | (ga['total categories'].astype(int) > 6)]
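The two cleaning steps can be sketched on a small invented frame. The counts are stored as strings here on purpose, to mimic columns that come back from a CSV as text. The column names match the notebook; the rows and the `ga_demo` name are made up.

```python
import pandas as pd

# Invented rows; counts are strings to mimic text columns from a CSV
ga_demo = pd.DataFrame({
    'Game name': ['A', 'B', 'C'],
    'total categories': ['2', '0', '3'],
    'total mechanics': ['7', '3', '5'],
})

count_cols = ['total categories', 'total mechanics']
counts = ga_demo[count_cols].apply(pd.to_numeric)  # text -> numbers

# Step 1: keep only games with at least one category and one mechanic
keep = (counts > 0).all(axis=1)
cleaned = ga_demo[keep]

# Step 2: flag games with more than 6 of either for manual review
needs_review = cleaned[(counts.loc[keep] > 6).any(axis=1)]
print(cleaned['Game name'].tolist())       # ['A', 'C']
print(needs_review['Game name'].tolist())  # ['A']
```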

In [246]:
manual

Unnamed: 0,Game name,Game rank,Game ID,#players,playtime,weight,Abstract Strategy,Action / Dexterity,Adventure,Age of Reason,...,Tile Placement,Time Track,Trading,Trick-taking,Variable Phase Order,Variable Player Powers,Voting,Worker Placement,total categories,total mechanics
0,Pandemic Legacy: Season 1,1,161936,4,60,2.810298102981,0,0,0,0,...,0,0,1,0,0,0,0,0,2,7
3,Gloomhaven,4,174430,4,120,3.7831541218638,0,0,1,0,...,0,0,0,0,0,0,0,0,5,9
12,War of the Ring (Second Edition),13,115746,4,180,4.0176600441501,0,0,1,0,...,0,0,0,0,0,0,0,0,8,5
15,Mage Knight Board Game,16,96848,4,150,4.2316883116883,0,0,1,0,...,0,0,0,0,0,0,0,0,4,10
16,Blood Rage,17,170216,4,90,2.8922610015175,0,0,0,0,...,0,0,0,0,0,0,0,0,4,7
20,Mansions of Madness: Second Edition,21,205059,5,180,2.6446700507614,0,0,1,0,...,0,0,0,0,0,0,0,0,8,8
23,Eclipse,24,72125,6,200,3.683997689197,0,0,0,0,...,1,0,0,0,0,0,0,0,5,8
26,Robinson Crusoe: Adventures on the Cursed Island,27,121921,4,120,3.722927557879,0,0,1,0,...,1,0,0,0,0,0,0,0,4,8
36,Keyflower,37,122515,6,120,3.3424657534247,0,0,0,0,...,0,0,0,0,0,0,0,0,5,8
40,Twilight Imperium (Third Edition),41,12493,6,240,4.2396149949341,0,0,0,0,...,1,0,0,0,0,0,0,0,6,10


In [247]:
manual.shape

(352, 143)

Go through all the games we identified with more than 6 categories or mechanics and make sure all the bgg-listed categories/mechanics are properly accounted for in the dataframe.

In [248]:
manual.to_csv('bgg game attribute manual-adjustment.csv')