### General Instructions
BoardGameGeek is a website that publishes rankings of board games. The complete listing of ranked games is here: https://boardgamegeek.com/browse/boardgame

The data on this page is organized in a simple table format. The Title field is divided into the game's title and the year it was published. The title itself provides a link to more information on the game.
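
To make the title/year split concrete, here is a minimal sketch in plain Python. It assumes the scraped title cell has the shape `Title(Year)Description`, as in the sample rows shown in this notebook's output. The function name `split_title_cell` and the regular expression are illustrative choices, not part of the assignment.

```python
import re

# Assumed cell shape: "<title>(<year>)<optional description>".
# The optional minus sign is an assumption, in case a year is listed as negative.
_TITLE_YEAR = re.compile(r"^(?P<title>.*?)\((?P<year>-?\d{1,4})\)")

def split_title_cell(text):
    """Return (title, year) from a scraped title cell; year is None if absent."""
    match = _TITLE_YEAR.match(text)
    if match is None:
        # no "(digits)" group found: keep the text as the title
        return text.strip(), None
    return match.group("title").strip(), int(match.group("year"))

print(split_title_cell("Brass: Birmingham(2018)Build networks, grow industries."))
# prints ('Brass: Birmingham', 2018)
```

Because the pattern looks for the first parenthesized group made only of digits, a title that itself contains parentheses should still split correctly, which a plain split on the first `(` would not guarantee.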

Write a routine to scrape this data from the page and persist it as a partitioned Parquet table named games.board_games.  Use the current date as the partitioning field.

The fields required in the final table are:

* Rank - the integer value from the Board Game Rank column
* ImageURL - the URL (as provided) of the image presented in the second column 
* Title - the title of the game
* YearPublished - the year the game was published
* GeekRating - the float value presented in the Geek Rating field
* AvgRating - the float value presented in the Avg Rating field
* NumVoters - the integer value presented in the Num Voters field

(You can ignore the Shop field)
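
As a rough illustration of the field list above, the following sketch checks that a single record carries every required field with the expected Python type. The names `REQUIRED_FIELDS` and `find_problems` are hypothetical helpers introduced here, and the image URL in the sample is a placeholder rather than a real thumbnail address.

```python
# Required fields and the Python type each one should hold (taken from the list above).
REQUIRED_FIELDS = {
    "Rank": int,
    "ImageURL": str,
    "Title": str,
    "YearPublished": int,
    "GeekRating": float,
    "AvgRating": float,
    "NumVoters": int,
}

def find_problems(record):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            problems.append(f"{name} should be {expected.__name__}, got {actual}")
    return problems

sample = {
    "Rank": 1,
    "ImageURL": "https://example.invalid/thumb.jpg",  # placeholder URL
    "Title": "Brass: Birmingham",
    "YearPublished": 2018,
    "GeekRating": 8.427,
    "AvgRating": 8.62,
    "NumVoters": 37065,
}
print(find_problems(sample))  # prints []
```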

While the Board Game Geek site has multiple pages of rankings, limit your scraping to the games presented on the page referenced above.  (In other words, don't hit the other 1000+ pages of rankings.)

Once you've populated your table, run the two SQL statements presented at the bottom of this notebook.

In [0]:
# run this code to make sure beautiful soup is installed
%pip install beautifulsoup4

Python interpreter will be restarted.
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.12.0-py3-none-any.whl (132 kB)
Collecting soupsieve>1.2
  Downloading soupsieve-2.4-py3-none-any.whl (37 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.12.0 soupsieve-2.4
Python interpreter will be restarted.


In [0]:
# retrieve the web page

from bs4 import BeautifulSoup
from urllib import request as url

# formulate a request for the page
request = url.Request('https://boardgamegeek.com/browse/boardgame')

# submit this request and retrieve a result
response = url.urlopen(request)

# read the page's HTML from the result
html = response.read()

# load raw html into beautiful soup, naming the parser explicitly
soup = BeautifulSoup(html, 'html.parser')

# print beautiful soup html
print(soup)

<!DOCTYPE html>

<html lang="en-US" ng-app="GeekApp" ng-cloak="">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" id="vp" name="viewport"/>
<script>
			window.addEventListener( 'DOMContentLoaded',  function() {
				var width = document.documentElement.clientWidth || window.innerWidth;
				if (width < 960) {
					var mvp = document.getElementById('vp');
					// android debugging
					mvp.setAttribute('content','width=960');
				}
			});
		</script>
<meta content="yes" name="apple-mobile-web-app-capable"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<title>Browse Board Games | BoardGameGeek</title>
<link href="https://cf.geekdo-static.com/icons/touch-icon180.png" rel="apple-touch-icon"/>
<link href="https://cf.geekdo-static.com/icons/favicon2.ico" rel="shortcut icon" type="image/ico"/>
<link href="https://cf.geekdo-static.com/icons/favicon2.ico" rel="icon" type="image/ico"/>
<link href="/game-opensearch.xml" rel="search" title="BGG
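
The retrieval step above could also be wrapped in small helpers that set a `User-Agent` header and a timeout. This is only a sketch: the helper names, the user-agent string, and the 30-second timeout are assumptions, and the notebook does not say whether the site requires any particular header.

```python
from urllib import request as url

PAGE_URL = "https://boardgamegeek.com/browse/boardgame"

def build_request(page_url=PAGE_URL, user_agent="board-games-notebook/0.1"):
    """Build (but do not send) a request that identifies the client."""
    # the user-agent value is a made-up example string
    return url.Request(page_url, headers={"User-Agent": user_agent})

def fetch_html(req, timeout_seconds=30):
    """Send the request and return the raw response body as bytes."""
    with url.urlopen(req, timeout=timeout_seconds) as response:
        return response.read()

req = build_request()
print(req.full_url)
```

Calling `fetch_html(req)` performs the actual download; it is kept separate so the request can be inspected without touching the network.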

In [0]:
# extract the data from the web page

# retrieve the body in the document
body = soup.body

# retrieve all tables in the document
tables = body.find_all('table')

# grab the first table
table = tables[0]

# extract header cells
header = []
for th in table.find('tr').find_all('th'):
  header += [th.text.replace('\n','').replace('\t','')]
  
# remove the shop column
header = header[:6]


table_data = []

# for each row in the table body
for tr in table.find_all('tr')[1:]:
  
  row_data = []
  
  # extract each cell in the row
  for td in tr.find_all('td'):
    # if the cell contains an image
    try: cell = td.find('img')['src']
    # else
    except: cell = td.text.replace('\n','').replace('\t','')
    
    row_data += [cell]

  print("Row Data:", row_data)

  # append row data to table data
  table_data += [row_data[:6]]

Row Data: ['1', 'https://cf.geekdo-images.com/x3zxjr-Vw5iU4yDPg70Jgw__micro/img/4Od3GYCiqptga0VbmyumPbJlBsU=/fit-in/64x64/filters:strip_icc()/pic3490053.jpg', 'Brass: Birmingham(2018)Build networks, grow industries, and navigate the world of the Industrial Revolution.', '8.427', '8.62', '37065', '']
Row Data: ['2', 'https://cf.geekdo-images.com/-Qer2BBPG7qGGDu6KcVDIw__micro/img/n6-sXYD6XXZoqIxq4P6AG7VPCuA=/fit-in/64x64/filters:strip_icc()/pic2452831.png', 'Pandemic Legacy: Season 1(2015)Mutating diseases are spreading around the world - can your team save humanity?', '8.397', '8.54', '49959', '']
Row Data: ['3', 'https://cf.geekdo-images.com/sZYp_3BTDGjh2unaZfZmuA__micro/img/sQyh47ClBO3d5sxPm73hMvM-JV4=/fit-in/64x64/filters:strip_icc()/pic2437871.jpg', 'Gloomhaven(2017)Vanquish monsters with strategic cardplay. Fulfill your quest to leave your legacy!', '8.395', '8.63', '57353', '']
Row Data: ['4', 'https://cf.geekdo-images.com/SoU8p28Sk1s8MSvoM4N8pQ__micro/img/LSAi1pmhbTbWwtBrziutDOXz

In [0]:
# drop filler rows that contain only a single empty cell
all_data = [row for row in table_data if row != ['']]
print(header)
print(all_data)

['Board Game Rank', 'Thumbnail image', 'Title', 'Geek Rating', 'Avg Rating', 'Num Voters']
[['1', 'https://cf.geekdo-images.com/x3zxjr-Vw5iU4yDPg70Jgw__micro/img/4Od3GYCiqptga0VbmyumPbJlBsU=/fit-in/64x64/filters:strip_icc()/pic3490053.jpg', 'Brass: Birmingham(2018)Build networks, grow industries, and navigate the world of the Industrial Revolution.', '8.427', '8.62', '37065'], ['2', 'https://cf.geekdo-images.com/-Qer2BBPG7qGGDu6KcVDIw__micro/img/n6-sXYD6XXZoqIxq4P6AG7VPCuA=/fit-in/64x64/filters:strip_icc()/pic2452831.png', 'Pandemic Legacy: Season 1(2015)Mutating diseases are spreading around the world - can your team save humanity?', '8.397', '8.54', '49959'], ['3', 'https://cf.geekdo-images.com/sZYp_3BTDGjh2unaZfZmuA__micro/img/sQyh47ClBO3d5sxPm73hMvM-JV4=/fit-in/64x64/filters:strip_icc()/pic2437871.jpg', 'Gloomhaven(2017)Vanquish monsters with strategic cardplay. Fulfill your quest to leave your legacy!', '8.395', '8.63', '57353'], ['4', 'https://cf.geekdo-images.com/SoU8p28Sk1s8MSv
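
An alternative to comparing each row against `['']` is to keep only rows whose width matches the header. The sketch below shows that idea in plain Python; `keep_complete_rows` and `rows_to_dicts` are hypothetical helper names, and the sample rows use a placeholder image URL.

```python
def keep_complete_rows(rows, expected_width):
    """Keep only rows that have exactly `expected_width` cells."""
    return [row for row in rows if len(row) == expected_width]

def rows_to_dicts(header, rows):
    """Pair each row's cells with the header names."""
    return [dict(zip(header, row)) for row in rows]

header = ['Board Game Rank', 'Thumbnail image', 'Title',
          'Geek Rating', 'Avg Rating', 'Num Voters']
rows = [
    ['1', 'https://example.invalid/thumb.jpg', 'Brass: Birmingham(2018)',
     '8.427', '8.62', '37065'],
    [''],  # filler row with a single empty cell
]

complete = keep_complete_rows(rows, len(header))
records = rows_to_dicts(header, complete)
print(len(records), records[0]['Title'])
```

Filtering on width also discards any other malformed rows, so every remaining row lines up with the header before it reaches Spark.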

In [0]:
# convert the data to a Spark DataFrame
boardgames = spark.createDataFrame(all_data, schema=header)

display(boardgames)

Board Game Rank,Thumbnail image,Title,Geek Rating,Avg Rating,Num Voters
1,https://cf.geekdo-images.com/x3zxjr-Vw5iU4yDPg70Jgw__micro/img/4Od3GYCiqptga0VbmyumPbJlBsU=/fit-in/64x64/filters:strip_icc()/pic3490053.jpg,"Brass: Birmingham(2018)Build networks, grow industries, and navigate the world of the Industrial Revolution.",8.427,8.62,37065
2,https://cf.geekdo-images.com/-Qer2BBPG7qGGDu6KcVDIw__micro/img/n6-sXYD6XXZoqIxq4P6AG7VPCuA=/fit-in/64x64/filters:strip_icc()/pic2452831.png,Pandemic Legacy: Season 1(2015)Mutating diseases are spreading around the world - can your team save humanity?,8.397,8.54,49959
3,https://cf.geekdo-images.com/sZYp_3BTDGjh2unaZfZmuA__micro/img/sQyh47ClBO3d5sxPm73hMvM-JV4=/fit-in/64x64/filters:strip_icc()/pic2437871.jpg,Gloomhaven(2017)Vanquish monsters with strategic cardplay. Fulfill your quest to leave your legacy!,8.395,8.63,57353
4,https://cf.geekdo-images.com/SoU8p28Sk1s8MSvoM4N8pQ__micro/img/LSAi1pmhbTbWwtBrziutDOXzfdY=/fit-in/64x64/filters:strip_icc()/pic6293412.jpg,"Ark Nova(2021)Plan and build a modern, scientifically managed zoo to support conservation projects.",8.293,8.54,27929
5,https://cf.geekdo-images.com/_Ppn5lssO5OaildSE-FgFA__micro/img/2gymaKs35_2yj7eyyA6cYyVmd9c=/fit-in/64x64/filters:strip_icc()/pic3727516.jpg,"Twilight Imperium: Fourth Edition(2017)Build an intergalactic empire through trade, research, conquest and grand politics.",8.243,8.63,20310
6,https://cf.geekdo-images.com/wg9oOLcsKvDesSUdZQ4rxw__micro/img/LUkXZhd1TO80eCiXMD3-KfnzA6k=/fit-in/64x64/filters:strip_icc()/pic3536616.jpg,Terraforming Mars(2016)Compete with rival CEOs to make Mars habitable and build your corporate empire.,8.238,8.38,88187
7,https://cf.geekdo-images.com/_HhIdavYW-hid20Iq3hhmg__micro/img/OdKjWiFsNvQAfJfXXSITttiozWE=/fit-in/64x64/filters:strip_icc()/pic5055631.jpg,Gloomhaven: Jaws of the Lion(2020)Vanquish monsters with strategic cardplay in a 25-scenario Gloomhaven campaign.,8.215,8.52,27398
8,https://cf.geekdo-images.com/ImPgGag98W6gpV1KV812aA__micro/img/NT-Av_3kdYUcwuti5ocmIQXow3g=/fit-in/64x64/filters:strip_icc()/pic1215633.jpg,War of the Ring: Second Edition(2011)The Fellowship and the Free Peoples clash with Sauron over the fate of Middle-Earth.,8.167,8.53,18665
9,https://cf.geekdo-images.com/7SrPNGBKg9IIsP4UQpOi8g__micro/img/nEvTiCkWpT-ymH4bstc9c335TtQ=/fit-in/64x64/filters:strip_icc()/pic4325841.jpg,Star Wars: Rebellion(2016)Strike from your hidden base as the Rebels—or find and destroy it as the Empire.,8.165,8.42,29424
10,https://cf.geekdo-images.com/kjCm4ZvPjIZxS-mYgSPy1g__micro/img/sRPer5FmBxyV527MY67P-gM7ukg=/fit-in/64x64/filters:strip_icc()/pic7013651.jpg,Spirit Island(2017)Island Spirits join forces using elemental powers to defend their home from invaders.,8.154,8.36,42887


In [0]:
import pyspark.sql.functions as f

# clean up data in the dataframe
boardgames = (
  boardgames
    .withColumnRenamed("Board Game Rank", "Rank")
    .withColumnRenamed("Thumbnail image", "ImageURL")
    .withColumnRenamed("Geek Rating", "GeekRating")
    .withColumnRenamed("Avg Rating", "AvgRating")
    .withColumnRenamed("Num Voters", "NumVoters")
  )

clean_boardgames = (
  boardgames
    .withColumn("Rank", boardgames['Rank'].cast("Integer"))
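    # the year is the text after the last "(" and before the next ")"
    .withColumn("YearPublished", f.substring_index('Title', "(", -1))
    .withColumn("YearPublished", f.substring_index('YearPublished', ")", 1).cast("Integer"))
    # the title is everything before the first "("
    .withColumn("Title", f.substring_index('Title', "(", 1))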
    .withColumn("GeekRating", boardgames['GeekRating'].cast("Float"))
    .withColumn("AvgRating", boardgames['AvgRating'].cast("Float"))
    .withColumn("NumVoters", boardgames['NumVoters'].cast("Integer"))
  )

display(clean_boardgames)

Rank,ImageURL,Title,GeekRating,AvgRating,NumVoters,YearPublished
1,https://cf.geekdo-images.com/x3zxjr-Vw5iU4yDPg70Jgw__micro/img/4Od3GYCiqptga0VbmyumPbJlBsU=/fit-in/64x64/filters:strip_icc()/pic3490053.jpg,Brass: Birmingham,8.427,8.62,37065,2018
2,https://cf.geekdo-images.com/-Qer2BBPG7qGGDu6KcVDIw__micro/img/n6-sXYD6XXZoqIxq4P6AG7VPCuA=/fit-in/64x64/filters:strip_icc()/pic2452831.png,Pandemic Legacy: Season 1,8.397,8.54,49959,2015
3,https://cf.geekdo-images.com/sZYp_3BTDGjh2unaZfZmuA__micro/img/sQyh47ClBO3d5sxPm73hMvM-JV4=/fit-in/64x64/filters:strip_icc()/pic2437871.jpg,Gloomhaven,8.395,8.63,57353,2017
4,https://cf.geekdo-images.com/SoU8p28Sk1s8MSvoM4N8pQ__micro/img/LSAi1pmhbTbWwtBrziutDOXzfdY=/fit-in/64x64/filters:strip_icc()/pic6293412.jpg,Ark Nova,8.293,8.54,27929,2021
5,https://cf.geekdo-images.com/_Ppn5lssO5OaildSE-FgFA__micro/img/2gymaKs35_2yj7eyyA6cYyVmd9c=/fit-in/64x64/filters:strip_icc()/pic3727516.jpg,Twilight Imperium: Fourth Edition,8.243,8.63,20310,2017
6,https://cf.geekdo-images.com/wg9oOLcsKvDesSUdZQ4rxw__micro/img/LUkXZhd1TO80eCiXMD3-KfnzA6k=/fit-in/64x64/filters:strip_icc()/pic3536616.jpg,Terraforming Mars,8.238,8.38,88187,2016
7,https://cf.geekdo-images.com/_HhIdavYW-hid20Iq3hhmg__micro/img/OdKjWiFsNvQAfJfXXSITttiozWE=/fit-in/64x64/filters:strip_icc()/pic5055631.jpg,Gloomhaven: Jaws of the Lion,8.215,8.52,27398,2020
8,https://cf.geekdo-images.com/ImPgGag98W6gpV1KV812aA__micro/img/NT-Av_3kdYUcwuti5ocmIQXow3g=/fit-in/64x64/filters:strip_icc()/pic1215633.jpg,War of the Ring: Second Edition,8.167,8.53,18665,2011
9,https://cf.geekdo-images.com/7SrPNGBKg9IIsP4UQpOi8g__micro/img/nEvTiCkWpT-ymH4bstc9c335TtQ=/fit-in/64x64/filters:strip_icc()/pic4325841.jpg,Star Wars: Rebellion,8.165,8.42,29424,2016
10,https://cf.geekdo-images.com/kjCm4ZvPjIZxS-mYgSPy1g__micro/img/sRPer5FmBxyV527MY67P-gM7ukg=/fit-in/64x64/filters:strip_icc()/pic7013651.jpg,Spirit Island,8.154,8.36,42887,2017


In [0]:
%sql 

CREATE DATABASE IF NOT EXISTS games;
 
DROP TABLE IF EXISTS games.board_games;

In [0]:
# write dataframe to be queried
(
  clean_boardgames
    .write
    .format('delta')
    .mode('overwrite')
    .option('overwriteSchema','true')
    .saveAsTable('games.board_games')
  )
 
# read data from table
display(
  spark.table('games.board_games')
  )

Rank,ImageURL,Title,GeekRating,AvgRating,NumVoters,YearPublished
85,https://cf.geekdo-images.com/oeygRZntjNUJWvc8SxDfww__micro/img/gXi23Nbfem6XTfRlV2eHEHB_cHY=/fit-in/64x64/filters:strip_icc()/pic784193.jpg,Dominant Species,7.595,7.82,20066,2010
86,https://cf.geekdo-images.com/OYne8uBCHv5oEgRfpOrV0A__micro/img/8IGVzvGADTyV8kjaMNw-VCBL4TY=/fit-in/64x64/filters:strip_icc()/pic2648303.jpg,The 7th Continent,7.595,7.9,21039,2017
87,https://cf.geekdo-images.com/BfEHqHQAvZLbRX7y7e9TWg__micro/img/RgHc2PFvpz3xxRPRg5v7d46JQNc=/fit-in/64x64/filters:strip_icc()/pic5617866.jpg,Beyond the Sun,7.582,7.97,10928,2020
88,https://cf.geekdo-images.com/aAwBzPzta4joKfFZt05hCw__micro/img/28sp86TqPbMqn9M_qHWfS9j44-w=/fit-in/64x64/filters:strip_icc()/pic4385726.jpg,Tainted Grail: The Fall of Avalon,7.579,8.15,10767,2019
89,https://cf.geekdo-images.com/sy89BiuZXfbSnG7Cag9tBQ__micro/img/0VYxwvacU7ADCGyJt5iLMa5WQoA=/fit-in/64x64/filters:strip_icc()/pic5902073.png,Obsession,7.574,8.2,7248,2018
90,https://cf.geekdo-images.com/OAX7HfOz-9N60StgADzd0g__micro/img/5OEOXKjvrEMh2aT5I-FrAn5-9jo=/fit-in/64x64/filters:strip_icc()/pic3781944.png,Architects of the West Kingdom,7.569,7.77,25375,2018
91,https://cf.geekdo-images.com/GO282hlXR3RiknU5W4GZjg__micro/img/Ya0GwTP0sOChJ3b0E3Ufmua39gw=/fit-in/64x64/filters:strip_icc()/pic5373572.png,El Grande,7.559,7.73,26368,1995
92,https://cf.geekdo-images.com/HbfgxDJZQnNEZQKvpmJxQg__micro/img/L6e_jaWDU2M5XPhV0lMhPPo26EU=/fit-in/64x64/filters:strip_icc()/pic2278942.jpg,Keyflower,7.548,7.74,21834,2012
93,https://cf.geekdo-images.com/5Q2w2rFJiFI_uV89KP6ECg__micro/img/KGgAsbHUPaEjsl-r05JCH2ixP1o=/fit-in/64x64/filters:strip_icc()/pic354500.jpg,Battlestar Galactica: The Board Game,7.548,7.73,35492,2008
94,https://cf.geekdo-images.com/yC7nOSc1x5PT-oNnh6TEcQ__micro/img/R7ZK4KtIIQNR2CJHZ6HtUxEY8ik=/fit-in/64x64/filters:strip_icc()/pic1638795.jpg,Caylus,7.543,7.73,28478,2005
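
The write cell above saves an unpartitioned Delta table, whereas the instructions call for a Parquet table partitioned by the current date. One possible way to do that is sketched below. The helper name `write_partitioned` and the partition column name `LoadDate` are assumptions, since the instructions do not name the column.

```python
def write_partitioned(df, table_name="games.board_games", partition_col="LoadDate"):
    """Add the current date as a column and save `df` as a Parquet table partitioned by it."""
    (
        df.selectExpr("*", f"current_date() AS {partition_col}")  # append the date column
          .write
          .format("parquet")
          .mode("overwrite")
          .partitionBy(partition_col)
          .saveAsTable(table_name)
    )

# In the notebook, this would be called as: write_partitioned(clean_boardgames)
```

With a partition column in place, the `DESCRIBE EXTENDED` output would be expected to list it under partition information.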


Execute the following SQL statements to validate your results:

In [0]:
%sql -- verify row count

SELECT COUNT(*) FROM games.board_games;

count(1)
100


In [0]:
%sql -- verify values

SELECT * FROM games.board_games ORDER BY rank ASC;

Rank,ImageURL,Title,GeekRating,AvgRating,NumVoters,YearPublished
1,https://cf.geekdo-images.com/x3zxjr-Vw5iU4yDPg70Jgw__micro/img/4Od3GYCiqptga0VbmyumPbJlBsU=/fit-in/64x64/filters:strip_icc()/pic3490053.jpg,Brass: Birmingham,8.427,8.62,37065,2018
2,https://cf.geekdo-images.com/-Qer2BBPG7qGGDu6KcVDIw__micro/img/n6-sXYD6XXZoqIxq4P6AG7VPCuA=/fit-in/64x64/filters:strip_icc()/pic2452831.png,Pandemic Legacy: Season 1,8.397,8.54,49959,2015
3,https://cf.geekdo-images.com/sZYp_3BTDGjh2unaZfZmuA__micro/img/sQyh47ClBO3d5sxPm73hMvM-JV4=/fit-in/64x64/filters:strip_icc()/pic2437871.jpg,Gloomhaven,8.395,8.63,57353,2017
4,https://cf.geekdo-images.com/SoU8p28Sk1s8MSvoM4N8pQ__micro/img/LSAi1pmhbTbWwtBrziutDOXzfdY=/fit-in/64x64/filters:strip_icc()/pic6293412.jpg,Ark Nova,8.293,8.54,27929,2021
5,https://cf.geekdo-images.com/_Ppn5lssO5OaildSE-FgFA__micro/img/2gymaKs35_2yj7eyyA6cYyVmd9c=/fit-in/64x64/filters:strip_icc()/pic3727516.jpg,Twilight Imperium: Fourth Edition,8.243,8.63,20310,2017
6,https://cf.geekdo-images.com/wg9oOLcsKvDesSUdZQ4rxw__micro/img/LUkXZhd1TO80eCiXMD3-KfnzA6k=/fit-in/64x64/filters:strip_icc()/pic3536616.jpg,Terraforming Mars,8.238,8.38,88187,2016
7,https://cf.geekdo-images.com/_HhIdavYW-hid20Iq3hhmg__micro/img/OdKjWiFsNvQAfJfXXSITttiozWE=/fit-in/64x64/filters:strip_icc()/pic5055631.jpg,Gloomhaven: Jaws of the Lion,8.215,8.52,27398,2020
8,https://cf.geekdo-images.com/ImPgGag98W6gpV1KV812aA__micro/img/NT-Av_3kdYUcwuti5ocmIQXow3g=/fit-in/64x64/filters:strip_icc()/pic1215633.jpg,War of the Ring: Second Edition,8.167,8.53,18665,2011
9,https://cf.geekdo-images.com/7SrPNGBKg9IIsP4UQpOi8g__micro/img/nEvTiCkWpT-ymH4bstc9c335TtQ=/fit-in/64x64/filters:strip_icc()/pic4325841.jpg,Star Wars: Rebellion,8.165,8.42,29424,2016
10,https://cf.geekdo-images.com/kjCm4ZvPjIZxS-mYgSPy1g__micro/img/sRPer5FmBxyV527MY67P-gM7ukg=/fit-in/64x64/filters:strip_icc()/pic7013651.jpg,Spirit Island,8.154,8.36,42887,2017


In [0]:
%sql -- verify partitioning
DESCRIBE EXTENDED games.board_games;

col_name,data_type,comment
Rank,int,
ImageURL,string,
Title,string,
GeekRating,float,
AvgRating,float,
NumVoters,int,
YearPublished,int,
,,
# Detailed Table Information,,
Catalog,spark_catalog,
