# Scraping BGG - Part 1

This notebook scrapes the [browse page](https://boardgamegeek.com/browse/boardgame) of BoardGameGeek (BGG).

**Note:** `browse_df` is a dataframe that holds the following info for each game on the first page of the BGG browse listing:
- Board game name
- Relative link (path) to that board game's main page
- BGG rating; the formula is roughly `(voter_count*avg_rating + (total_bgg_users - voter_count)*5.5) / total_bgg_users`
- Average rating
- Number of voters (`voter_count`)
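
To make the rough formula above concrete, here is a minimal sketch. The function name `approx_bgg_rating` and the sample numbers are hypothetical. BGG's exact formula and the value of `total_bgg_users` are not known here, so this only illustrates the shape of the calculation.

```python
def approx_bgg_rating(voter_count, avg_rating, total_bgg_users, prior=5.5):
    """Rough approximation from the note above: every user who has not
    rated the game is counted as a vote of `prior` (5.5)."""
    non_voters = total_bgg_users - voter_count
    return (voter_count * avg_rating + non_voters * prior) / total_bgg_users


# Hypothetical numbers, chosen only to show the pull toward 5.5.
few_votes = approx_bgg_rating(voter_count=100, avg_rating=9.0, total_bgg_users=1_000)
many_votes = approx_bgg_rating(voter_count=900, avg_rating=9.0, total_bgg_users=1_000)
print(round(few_votes, 2), round(many_votes, 2))  # 5.85 8.65
```

With the same average rating, the game with fewer voters ends up much closer to 5.5, which is why the BGG rating column is always lower than the average rating column for highly rated games.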

In [1]:
import re
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [2]:
link = 'https://boardgamegeek.com/browse/boardgame'
response = requests.get(link)
response.status_code

200

In [3]:
html = response.text
html[:100]

'\n<!DOCTYPE html>\n<html ng-app="GeekApp" lang="en-US" ng-cloak>\n<head>\n\t<meta charset=\'utf-8\'>\n\t<meta'

In [4]:
soup = BeautifulSoup(html, 'html5lib')

# use soup to find all hyperlinks, with a regex to keep only the ones that point to a board game page
top100_times3 = soup.find_all('a', attrs={ 'href' : re.compile("^/boardgame/.")})

In [5]:
# there are 3 links per game; we only want the 2nd one in each group of three
top100 = top100_times3[1::3]

# use .text on these elements to get the text portion
# use .get('href') to get the value of the element's href attribute
print(top100[0].text)
print(top100[0].get('href'))

Gloomhaven
/boardgame/174430/gloomhaven
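
The filter-then-slice step can be tried offline on plain data. This is a sketch under assumptions: the `links` list is made up, and which of the three links per game carries the title text is taken from the behaviour observed above, not from any BGG documentation.

```python
import re

# Same pattern as the notebook: href must start with "/boardgame/" plus one more character.
game_link = re.compile(r"^/boardgame/.")

# Hypothetical (href, text) pairs in page order; real pages have many more links.
links = [
    ("/browse/boardgame/page/2", "Next"),
    ("/boardgame/174430/gloomhaven", ""),
    ("/boardgame/174430/gloomhaven", "Gloomhaven"),
    ("/boardgame/174430/gloomhaven", ""),
    ("/boardgame/167791/terraforming-mars", ""),
    ("/boardgame/167791/terraforming-mars", "Terraforming Mars"),
    ("/boardgame/167791/terraforming-mars", ""),
    ("/boardgameartist/1", "Artist"),
]

# Keep only board game links, then take the 2nd of every group of three.
matches = [(href, text) for href, text in links if game_link.search(href)]
titles = matches[1::3]
print(titles)
```

The slice `[1::3]` starts at index 1 and steps by 3, so it only works while every game contributes exactly three matching links.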


In [6]:
# this will be the dictionary that gets turned into a df
browse_stats_dict = {
    'Name': [],
    'Link': [],
    'BGG_Rating': [],
    'AVG_Rating': [],
    'Voter_Count': []
}

# storing the name and link of each game (at most 100 per page)
for game in top100[:100]:
    browse_stats_dict['Name'].append(game.text)
    browse_stats_dict['Link'].append(game.get('href'))

browse_stats_dict['Name'][0]

'Gloomhaven'

In [7]:
# get_num uses regex to grab the first run of non-whitespace characters
# in a string and converts it to a float
def get_num(string):
    return float(re.search(r'\S+', string).group(0))


# All the table cells (<td>) holding the stats on the browse page, 3 per game
browse_stats = soup.find_all('td', attrs={'class': 'collection_bggrating' })

# step through the cells three at a time (100 games x 3 cells = indices 0-299)
# and append each stat to the corresponding dictionary list
for idx in range(0, len(browse_stats), 3):
    browse_stats_dict['BGG_Rating'].append(get_num(browse_stats[idx].text))
    browse_stats_dict['AVG_Rating'].append(get_num(browse_stats[idx+1].text))
    browse_stats_dict['Voter_Count'].append(get_num(browse_stats[idx+2].text))

browse_stats_dict['AVG_Rating'][0]

8.83
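
The grouping of a flat list of cells into one row per game can be sketched without the live page. The `cell_texts` values below are hypothetical stand-ins for the whitespace-padded text of the `<td>` cells.

```python
import re


def get_num(string):
    """Return the first run of non-whitespace characters as a float."""
    return float(re.search(r"\S+", string).group(0))


# Hypothetical cell texts, padded with whitespace like scraped table cells.
cell_texts = [
    "\n\t8.572\t", "\n\t8.83\t", "\n\t36176\t",
    "\n\t8.472\t", "\n\t8.62\t", "\n\t37794\t",
]

values = [get_num(text) for text in cell_texts]

# Three offset slices zipped together give one (bgg, avg, voters) tuple per game.
rows = list(zip(values[0::3], values[1::3], values[2::3]))
print(rows)
```

Note that `get_num` raises an error if a cell holds something that is not a number (for example a placeholder such as `N/A`), so unrated games would need separate handling.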

In [8]:
#turning the dictionary into a df
browse_df = pd.DataFrame.from_dict(browse_stats_dict)
browse_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Name         100 non-null    object 
 1   Link         100 non-null    object 
 2   BGG_Rating   100 non-null    float64
 3   AVG_Rating   100 non-null    float64
 4   Voter_Count  100 non-null    float64
dtypes: float64(3), object(2)
memory usage: 4.0+ KB


In [9]:
browse_df.head()

| | Name | Link | BGG_Rating | AVG_Rating | Voter_Count |
|---|---|---|---|---|---|
| 0 | Gloomhaven | /boardgame/174430/gloomhaven | 8.572 | 8.83 | 36176.0 |
| 1 | Pandemic Legacy: Season 1 | /boardgame/161936/pandemic-legacy-season-1 | 8.472 | 8.62 | 37794.0 |
| 2 | Brass: Birmingham | /boardgame/224517/brass-birmingham | 8.291 | 8.64 | 13724.0 |
| 3 | Terraforming Mars | /boardgame/167791/terraforming-mars | 8.28 | 8.43 | 55986.0 |
| 4 | Through the Ages: A New Story of Civilization | /boardgame/182028/through-ages-new-story-civil... | 8.216 | 8.46 | 20626.0 |


In [10]:
browse_df.iloc[0]

Name                             Gloomhaven
Link           /boardgame/174430/gloomhaven
BGG_Rating                            8.572
AVG_Rating                             8.83
Voter_Count                           36176
Name: 0, dtype: object
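
The `Link` column holds relative paths, so they need the site's base URL before they can be requested. A minimal sketch using the standard library follows. The helper name `stats_url` is hypothetical, and appending `/stats` is based only on the Terraforming Mars URL used in Part 2.

```python
from urllib.parse import urljoin

BASE_URL = "https://boardgamegeek.com"


def stats_url(relative_link):
    """Turn a relative game path into the absolute URL of its stats page."""
    return urljoin(BASE_URL, relative_link.rstrip("/") + "/stats")


print(stats_url("/boardgame/167791/terraforming-mars"))
# https://boardgamegeek.com/boardgame/167791/terraforming-mars/stats
```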

# Scraping BGG - Part 2

I picked one board game's stats page, "[Terraforming Mars](https://boardgamegeek.com/boardgame/167791/terraforming-mars/stats)" (tm), and used `str.find()` and `json.loads()` to extract its info into a dictionary, `tm_stats`.

In [11]:
# requests.get(link_string) requests the page's html
# .status_code gives the HTTP status code (200 means the request succeeded)

tm_link = 'https://boardgamegeek.com/boardgame/167791/terraforming-mars/stats'
link_response = requests.get(tm_link)
link_response.status_code

200

In [12]:
# We'll save the html in a variable
# .text returns the response body (the html) as a string
tm_html = link_response.text
tm_html[:100]

'\n<!DOCTYPE html>\n<html ng-app="GeekApp" lang="en-US" ng-cloak>\n<head>\n\t<meta charset=\'utf-8\'>\n\t<meta'

In [13]:
# BeautifulSoup parses the html into a tree that can be searched
soup_tm = BeautifulSoup(tm_html, 'html5lib')

In [14]:
soup_tm.find_all('div', class_='outline-item-title') # Not finding something that exists

[]

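This should return something. I can see that it exists in the html, but soup can't find it. One possible explanation: the page is an Angular app (`ng-app="GeekApp"` in the html above), so the element may be rendered by JavaScript in the browser, or sit inside a script template, and not be an actual tag in the parsed tree.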
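
One way a class name can appear in the raw html without being found as a tag is when the markup sits inside a `<script>` block, which HTML parsers treat as plain text. The sketch below demonstrates that effect with the standard library's `html.parser`. The `sample_html` string is made up, and whether BGG's page actually does this is an assumption, not something verified here.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Counts real start tags that carry a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.hits = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.hits += 1


# Hypothetical page: the div only exists as text inside a script template.
sample_html = """
<html><body>
<script type="text/ng-template">
  <div class="outline-item-title">{{ title }}</div>
</script>
<div class="other">static content</div>
</body></html>
"""

finder = ClassFinder("outline-item-title")
finder.feed(sample_html)

in_raw_text = "outline-item-title" in sample_html
print(in_raw_text, finder.hits)
```

If a check like this shows the string is present but never parsed as a tag, pulling the data out of the embedded text (as done below with `str.find()` and `json.loads()`) is a reasonable workaround.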

In [38]:
import json

In [47]:
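# finding the stats for the game and putting them into a dictionary
stats_idx = tm_html.find('"stats"')        # start of the '"stats":' key
stats_end = tm_html[stats_idx:].find('}')  # offset of the first '}' after the key
# skip the 8 characters of '"stats":' and keep everything through the closing '}'
stats_str = tm_html[stats_idx + 8 : stats_idx + stats_end + 1]
stats_dict = json.loads(stats_str)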
stats_dict

{'usersrated': '55986',
 'average': '8.42959',
 'baverage': '8.28038',
 'stddev': '1.37965',
 'avgweight': '3.2358',
 'numweights': '2248',
 'numgeeklists': '6455',
 'numtrading': '345',
 'numwanting': '2090',
 'numwish': '15901',
 'numowned': '73824',
 'numprevowned': '2294',
 'numcomments': '8407',
 'numwishlistcomments': '1561',
 'numhasparts': '24',
 'numwantparts': '9',
 'views': '5249201',
 'playmonth': '2020-07',
 'numplays': '301378',
 'numplays_month': '3065',
 'numfans': 4855}
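
Searching for the first `}` works here because the stats object is flat. It would cut the string short if the object contained a nested object. As a hedged alternative, `json.JSONDecoder().raw_decode` can parse one complete JSON value starting at a given index. The helper name `extract_json_after` and the `sample` string are hypothetical.

```python
import json


def extract_json_after(text, key):
    """Parse the JSON value that follows '"key":' in a larger string."""
    marker = f'"{key}"'
    start = text.find(marker)
    if start == -1:
        raise KeyError(key)
    # Move past the colon and any whitespace to the first character of the value.
    value_start = text.index(":", start + len(marker)) + 1
    while text[value_start].isspace():
        value_start += 1
    # raw_decode stops at the end of the value, even when objects are nested.
    value, _end = json.JSONDecoder().raw_decode(text, value_start)
    return value


# Hypothetical snippet with a nested object inside "stats".
sample = 'var x = {"item": {"stats": {"usersrated": "55986", "nested": {"a": 1}, "numfans": 4855}, "id": 1}};'
stats = extract_json_after(sample, "stats")
print(stats)
```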

In [49]:
#Extracting the important keys
important_stats_keys = ['avgweight', 'numowned', 'numplays']
stats = {key: stats_dict[key] for key in important_stats_keys}

#changing the values to floats
for key in stats:
    stats[key] = float(stats[key])

stats

{'avgweight': 3.2358, 'numowned': 73824.0, 'numplays': 301378.0}

In [41]:
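# Same approach for the basic info: slice from '"minplayers":' up to the comma
# that follows the '"minage":' entry, then wrap the slice in braces to make valid JSON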
info_idx = tm_html.find('"minplayers":')
info_end = tm_html.find(',', tm_html.find('"minage":') )
info_str = '{' + tm_html[info_idx:info_end] + '}'
info = json.loads(info_str)
info

{'minplayers': '1',
 'maxplayers': '5',
 'minplaytime': '120',
 'maxplaytime': '120',
 'minage': '12'}

In [51]:
#changing the values to ints
for key in info:
    info[key] = int(info[key])

info

{'minplayers': 1,
 'maxplayers': 5,
 'minplaytime': 120,
 'maxplaytime': 120,
 'minage': 12}

In [52]:
# merges 2 dictionaries into 1
tm_stats = {**info, **stats}
tm_stats

{'minplayers': 1,
 'maxplayers': 5,
 'minplaytime': 120,
 'maxplaytime': 120,
 'minage': 12,
 'avgweight': 3.2358,
 'numowned': 73824.0,
 'numplays': 301378.0}
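
The two conversion loops and the merge can also be expressed as a single lookup table of target types. This is a sketch of an alternative, not the approach used above: the names `FIELD_TYPES` and `cast_fields` are hypothetical, and treating `numowned` and `numplays` as ints instead of floats is a design choice made here for illustration.

```python
# Target type for each field of interest.
FIELD_TYPES = {
    "minplayers": int,
    "maxplayers": int,
    "minplaytime": int,
    "maxplaytime": int,
    "minage": int,
    "avgweight": float,
    "numowned": int,
    "numplays": int,
}


def cast_fields(raw, field_types=FIELD_TYPES):
    """Keep only the known fields and convert each string to its target type."""
    return {key: cast(raw[key]) for key, cast in field_types.items() if key in raw}


# String values copied from the outputs above.
raw = {
    "minplayers": "1", "maxplayers": "5", "minplaytime": "120",
    "maxplaytime": "120", "minage": "12",
    "avgweight": "3.2358", "numowned": "73824", "numplays": "301378",
    "views": "5249201",  # not in FIELD_TYPES, so it is dropped
}
print(cast_fields(raw))
```

Keeping the types in one place makes it easier to apply the same conversion to every game once the scrape is extended beyond a single page.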