# Prague Marathon Results
## Dynamic JavaScript Scraping
#### David Koubek, Jiri Zelenka

### Import required packages

In [1]:
import requests # for robots check
from bs4 import BeautifulSoup # prettify HTML
from selenium import webdriver # scraping JS dynamic elements
from time import sleep # for sleeping (slowing down) inside a function
import random # for random number sleeping
import pandas as pd # for dataframe
import numpy as np # for arrays

### Robots.txt

Are we allowed to scrape?

In [2]:
requests.get('https://www.runczech.com/robots.txt')

<Response [200]>

A 200 response means the request succeeded. Let's print the robots.txt file itself to see what is allowed and what is not.

In [3]:
print(requests.get('https://www.runczech.com/robots.txt').text)

#
# robots.txt
#

# exclude these directories
User-agent: *
Disallow: /srv/
Disallow: /cgi/
Allow: /srv/www/qf/*/ramjet/eventList
Allow: /srv/www/qf/*/ramjet/eventVoucherList
Allow: /srv/www/qf/*/ramjet/contactPage
Allow: /srv/www/qf/*/ramjet/raceDetail
Allow: /srv/www/qf/*/ramjet/leagueDetail
Allow: /srv/www/qf/*/ramjet/results/list
Allow: /srv/www/qf/*/ramjet/results/league
Allow: /srv/www/qf/*/ramjet/results/league/detail
Allow: /srv/www/qf/*/ramjet/resultsEventDetail
Allow: /srv/www/qf/*/ramjet/resultsSubEventUserDetail
Allow: /srv/www/qf/*/ramjet/resultsSubEventGroupDetail
Allow: /srv/www/qf/*/ramjet/event/runnerList

Sitemap: https://www.runczech.com/sitemap-cs.xml
Sitemap: https://www.runczech.com/sitemap-en.xml
Sitemap: https://www.runczech.com/sitemap-de.xml
Sitemap: https://www.runczech.com/sitemap-it.xml
Sitemap: https://www.runczech.com/sitemap-fr.xml
Sitemap: https://www.runczech.com/sitemap-es.xml
Sitemap: https://www.runczech.com/sitemap-pl.xml
Sitemap: https://www.runcz

The "resultsEventDetail" pages that we want to scrape are explicitly allowed (as is "results/list", which we use to find them), so we can proceed.
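Reading the file by eye can be cross-checked programmatically. The sketch below is a minimal, hypothetical matcher (the helper names `parse_rules`, `pattern_to_regex` and `is_allowed` are ours, not part of any library). It applies the longest-match convention with `*` wildcards to an excerpt of the file printed above. It assumes the rules are meant to be read this way, so that the more specific Allow lines override `Disallow: /srv/`. It ignores User-agent grouping and is not a complete robots.txt implementation.

```python
import re

# Excerpt of the robots.txt printed above (only the rules relevant to this project)
ROBOTS_EXCERPT = """\
User-agent: *
Disallow: /srv/
Disallow: /cgi/
Allow: /srv/www/qf/*/ramjet/results/list
Allow: /srv/www/qf/*/ramjet/resultsEventDetail
"""

def parse_rules(robots_text):
    """Return (directive, path_pattern) pairs, ignoring User-agent grouping."""
    rules = []
    for line in robots_text.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments
        if ':' not in line:
            continue
        field, value = (part.strip() for part in line.split(':', 1))
        if field.lower() in ('allow', 'disallow') and value:
            rules.append((field.lower(), value))
    return rules

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern: '*' is a wildcard, a trailing '$' anchors the end."""
    anchored = pattern.endswith('$')
    if anchored:
        pattern = pattern[:-1]
    regex = '.*'.join(re.escape(piece) for piece in pattern.split('*'))
    return re.compile(regex + ('$' if anchored else ''))

def is_allowed(path, rules):
    """Longest matching rule wins; Allow wins a tie; no matching rule means allowed."""
    best = ('allow', '')
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            longer = len(pattern) > len(best[1])
            tie_allow = len(pattern) == len(best[1]) and directive == 'allow'
            if longer or tie_allow:
                best = (directive, pattern)
    return best[0] == 'allow'

rules = parse_rules(ROBOTS_EXCERPT)
print(is_allowed('/srv/www/qf/en/ramjet/resultsEventDetail?eventId=22166', rules))
print(is_allowed('/srv/private/page', rules))
```

Under these assumptions the first path is reported as allowed and the second as disallowed.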

# Scraping a dynamic JavaScript website

We use a browser driver that loads the page fully, including the content rendered by JavaScript, and then scrape the resulting HTML. Selenium is one way to do this. There are other methods as well, but this one worked nicely in our setting.

## Selenium

First make sure chromedriver is installed and available in the environment (download from https://sites.google.com/a/chromium.org/chromedriver/ ); otherwise starting the webdriver fails with an error. Other drivers, e.g. for Firefox, can be used, but the code would have to be altered to load the appropriate driver.
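A quick way to check the setup before opening a browser is to look for the driver executable on the PATH. This is a small sketch using only the standard library. The helper name `find_driver` is ours, and it assumes the executable is called `chromedriver`.

```python
import shutil

def find_driver(name='chromedriver'):
    """Return the full path of the driver executable if it is on the PATH, else None."""
    return shutil.which(name)

driver_path = find_driver()
if driver_path is None:
    print('chromedriver not found on PATH; webdriver.Chrome() may fail to start')
else:
    print('chromedriver found at ' + driver_path)
```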

## 1) Find all marathon links

The table in the middle of the Results page is not static HTML; it is loaded dynamically by JavaScript only after the page opens in the browser. So we have to use dynamic scraping methods, e.g. Selenium. Once the dynamic content has been scraped, we collect the "a" tags of class "indexList_link", whose "href" attributes contain the URLs of the desired marathon events.

The result of this chapter is one variable containing 24 URLs that point to the 24 marathon years/events, where the data tables we eventually desire to scrape reside.

We need to slow down the scraping inside the *get_soup* function so that the page is fully loaded in the browser before it is scraped (the JS table takes about 1-2 s to pull data from the servers and render). Otherwise the soup object will contain only the static parts of the page and not the dynamic ones, which are the parts of interest.

In [4]:
# Loads a dynamic page in the Selenium browser and returns its rendered HTML as a BeautifulSoup object
def get_soup(url):
    # Navigate the shared Selenium browser (the global "browser", opened before this function is called) to the url
    browser.get(url) # navigate to the page

    # Wait a random 1.0-1.5 seconds: long enough for the JS table to load (1.0s should be enough), and less bot-like than a fixed delay
    sleep(random.uniform(1.0, 1.5)) # time in seconds, sleep can take a float value
    
    # Take all the inner code of the displayed webpage
    innerHTML = browser.execute_script("return document.body.innerHTML") #returns the inner HTML as a string
    
    # Parse the HTML with BeautifulSoup (lxml parser):
    soup = BeautifulSoup(innerHTML,'lxml')
    return soup
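The fixed pause in *get_soup* is simple, but it either wastes time or, on a slow connection, returns before the table has loaded. One alternative is to poll until the content of interest appears. The sketch below shows the idea with a hypothetical `wait_until` helper and a stand-in `FakePage` class (both are ours, so that the example runs without a browser). Selenium offers a comparable built-in mechanism in `WebDriverWait`.

```python
import time

def wait_until(condition, timeout=5.0, poll_interval=0.1):
    """Call condition() repeatedly until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within ' + str(timeout) + ' seconds')
        time.sleep(poll_interval)

class FakePage:
    """Stand-in for a browser page whose table appears only after a few polls."""
    def __init__(self, ready_after_calls):
        self.ready_after_calls = ready_after_calls
        self.calls = 0

    def source(self):
        self.calls += 1
        if self.calls >= self.ready_after_calls:
            return '<table><tr><td>1</td></tr></table>'
        return '<table></table>'

page = FakePage(ready_after_calls=3)

def loaded_source():
    # Return the HTML once it contains at least one table cell, otherwise None
    html = page.source()
    return html if '<td' in html else None

html = wait_until(loaded_source, timeout=2.0, poll_interval=0.01)
print('table loaded after ' + str(page.calls) + ' polls')
```

With a real browser, the condition would inspect the page source or look up an element instead of calling `FakePage.source`.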

In [5]:
# For a given RunCzech Results URL, returns a list of events' URLs (marathons)
def get_all_links(url):
    soup = get_soup(url) # call get_soup function on the desired url and get back the soup from bs (of the dynamic HTML with JS elements loaded)
    a_elements = soup.find_all('a',{'class':'indexList_link'}) # class "indexList_link" contains the href link we desire

    marathon_patterns = ['prague international marathon', 'volkswagen prague marathon', 'volkswagen marathon weekend'] # patterns of Prague marathon wording in Results

    # Keep only the links whose text contains one of the marathon patterns (so no other events)
    a_elements_marathons = [a for a in a_elements
                            if any(pattern in a.contents[0].lower() for pattern in marathon_patterns)]

    urls_events = ['https://www.runczech.com' + a['href'] for a in a_elements_marathons] # join the runczech url with the relative href of each event
    return urls_events
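The filtering logic of *get_all_links* can be tried out without a browser. The sketch below uses the standard-library `html.parser` instead of BeautifulSoup so that it runs without extra packages. The `LinkCollector` class, the sample HTML, the event names and the `eventId` values are all invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect (href, text) pairs of <a> tags that carry a given CSS class."""
    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if tag == 'a' and self.css_class in classes:
            self._href = attrs.get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None

# Invented sample markup; only the class name and the URL shape follow the notebook
SAMPLE_HTML = """
<a class="indexList_link" href="/srv/www/qf/en/ramjet/resultsEventDetail?eventId=1">Volkswagen Prague Marathon 2019</a>
<a class="indexList_link" href="/srv/www/qf/en/ramjet/resultsEventDetail?eventId=2">Example Half Marathon 2019</a>
<a class="otherLink" href="/about">Prague International Marathon history</a>
<a class="indexList_link" href="/srv/www/qf/en/ramjet/resultsEventDetail?eventId=3">Prague International Marathon 1995</a>
"""

MARATHON_PATTERNS = ['prague international marathon', 'volkswagen prague marathon', 'volkswagen marathon weekend']

collector = LinkCollector('indexList_link')
collector.feed(SAMPLE_HTML)

# Keep links whose text matches a marathon pattern and make the URLs absolute
marathon_urls = [urljoin('https://www.runczech.com', href)
                 for href, text in collector.links
                 if any(pattern in text.lower() for pattern in MARATHON_PATTERNS)]
print(marathon_urls)
```

The half marathon is dropped because its name matches none of the patterns, and the third link is dropped because it lacks the "indexList_link" class.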

In [6]:
# URL of Results webpage which contains links to marathons
url_results = 'https://www.runczech.com/srv/www/qf/en/ramjet/results/list'

In [7]:
url_results_initial = url_results # base URL of the Results pages (defined in the previous cell)
url_results_page = '?&page=' # query-string part that selects the page number in Results
num_of_pages = 7 # number of pages in Results to be scraped, just set manually
results_pages = np.empty(num_of_pages, dtype = object) # initialise an empty array of length num_of_pages

for i in range(0, num_of_pages):
    results_pages[i] = url_results_initial + url_results_page + str(i + 1) # add the whole link together with page number at the end, for each page
    
results_pages

array(['https://www.runczech.com/srv/www/qf/en/ramjet/results/list?&page=1',
       'https://www.runczech.com/srv/www/qf/en/ramjet/results/list?&page=2',
       'https://www.runczech.com/srv/www/qf/en/ramjet/results/list?&page=3',
       'https://www.runczech.com/srv/www/qf/en/ramjet/results/list?&page=4',
       'https://www.runczech.com/srv/www/qf/en/ramjet/results/list?&page=5',
       'https://www.runczech.com/srv/www/qf/en/ramjet/results/list?&page=6',
       'https://www.runczech.com/srv/www/qf/en/ramjet/results/list?&page=7'],
      dtype=object)
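The same list of page URLs can also be built with a list comprehension and `urllib.parse.urlencode`. This is only a sketch: the helper name `build_page_urls` is ours, and it produces `?page=1` rather than the `?&page=1` form used above, on the assumption that the server treats both the same way.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.runczech.com/srv/www/qf/en/ramjet/results/list'

def build_page_urls(base_url, num_pages):
    """Return the URLs of pages 1..num_pages of a paginated listing."""
    return [base_url + '?' + urlencode({'page': page}) for page in range(1, num_pages + 1)]

results_pages = build_page_urls(BASE_URL, 7)
for url in results_pages:
    print(url)
```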

In [8]:
# Selenium scraping

# Working with chrome, first open one Chrome window that'll be displaying our URLs
browser = webdriver.Chrome()

print("Waiting 3sec for browser window to open")
sleep(3) # time in seconds, wait before Chrome window opens

urls_marathons = [] # initialise empty list
position = 0 # initialise a page counter

# For loop that scrapes each page of the Results and concatenates the marathon links into one list variable
for page in results_pages:
    position += 1 # increment a page counter
    print('Scraping page ' + str(position) + '/' + str(len(results_pages)))
    
    # Scrape the links on this page
    urls_marathons_add = get_all_links(page)
    
    # Concatenating the lists
    urls_marathons = urls_marathons + urls_marathons_add

print('Done scraping')

Scraping page 1/7
Scraping page 2/7
Scraping page 3/7
Scraping page 4/7
Scraping page 5/7
Scraping page 6/7
Scraping page 7/7
Done scraping


In [9]:
urls_marathons

['https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=22166',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=21429',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=20618',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=11874',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=20012',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=20002',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=11328',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=11569',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=11504',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=11490',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=11479',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=11465',
 'ht

The variable above holds the final list of marathon links, one for each year, where the data tables reside. The following chapter uses these 24 links to scrape the tables from them.

## 2) Get data tables from marathon links

In this chapter (the second, main part of the scraping project), we scrape the data tables containing marathon results for each of the 24 years. For each year, the results are spread across many pages (from 35 to 487, depending on the year), and each page contains only 15 rows of data. There is no way to display more rows on the website, so this is a very useful application of automated scraping. The rows from all pages are merged, and the resulting dataset of marathon results is saved separately for each year.

In [10]:
# Get the whole table of data that interests us
def get_table(url):
    soup = get_soup(url) # gets the bs HTML code of the url

    trs = soup.find_all('tr') # "tr" table-row element tag
    
    # Starting column 1
    tds_col_1 = [tr.find('td',{'class':'hidden980'}) for tr in trs] # hidden980 is the class of the first column, "Rank in filter"
    tds_col_1 = [x for x in tds_col_1 if x is not None] # drop the None entries (rows where no such td was found)
    # filter(None, tds_col_1) would also work, but it drops every falsy value (e.g. 0), which is more dangerous in certain situations
    
    # Column 2 - "Rank"
    tds_col_2 = [td.find_next('td') for td in tds_col_1] # next 'td' tag in the document, i.e. the second column "Rank"
    contents_col_2 = [td.contents[0] for td in tds_col_2] # returns just the text inside tags
    
    # Column 3 - "Name"
    tds_col_3 = [td.find_next('td') for td in tds_col_2]
    contents_col_3 = [td.contents[0] for td in tds_col_3]
    
    # Column 5 - "Chip time" (column 4 is skipped, hence the two find_next calls)
    tds_col_5 = [td.find_next('td').find_next('td') for td in tds_col_3]
    contents_col_5 = [td.contents[0] for td in tds_col_5]
    
    # Column 6 - "St. number"
    tds_col_6 = [td.find_next('td') for td in tds_col_5]
    contents_col_6 = [td.contents[0] for td in tds_col_6]
    
    # Column 7 - "Nationality"
    tds_col_7 = [td.find_next('td') for td in tds_col_6]
    contents_col_7 = [td.contents[0] for td in tds_col_7]
    
    # Column 8 - "Age cat."
    tds_col_8 = [td.find_next('td') for td in tds_col_7]
    contents_col_8 = [td.contents[0] for td in tds_col_8]
    
    # Merge data columns
    # https://cmdlinetips.com/2018/01/how-to-create-pandas-dataframe-from-multiple-lists/
    # zip function to merge lists
    table = list(zip(contents_col_2, contents_col_3, contents_col_5, contents_col_6, contents_col_7, contents_col_8))
    
    # Create pandas dataframe of the data
    labels = ['Rank', 'Name', 'Chip time', 'St. number', 'Nationality', 'Age cat.']
    df = pd.DataFrame(table, columns = labels)
    
    return df
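The column selection in *get_table* can be illustrated on a small offline example. The sketch below uses the standard-library `html.parser` and picks columns by position rather than chaining `find_next` calls, so it is a different technique from the one above. The `RowCollector` class and the sample markup are invented; the first data row reuses values from the 2019 results, and the content of the skipped fourth column is a placeholder.

```python
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            if self._row:  # header rows contain no <td> cells and are skipped
                self.rows.append(self._row)
            self._row = None

# Invented markup with eight columns, mirroring the layout described in get_table
SAMPLE_HTML = """
<table>
  <tr><th>Rank in filter</th><th>Rank</th><th>Name</th><th>(skipped)</th>
      <th>Chip time</th><th>St. number</th><th>Nationality</th><th>Age cat.</th></tr>
  <tr><td class="hidden980">1</td><td>1</td><td>Almahjoub DAZZA</td><td>(skipped)</td>
      <td>2:05:58</td><td>1</td><td>BHR</td><td>MAM</td></tr>
  <tr><td class="hidden980">2</td><td>2</td><td>Dawit WOLDE</td><td>(skipped)</td>
      <td>2:06:18</td><td>12</td><td>ETH</td><td>MAM</td></tr>
</table>
"""

LABELS = ['Rank', 'Name', 'Chip time', 'St. number', 'Nationality', 'Age cat.']
KEEP = [1, 2, 4, 5, 6, 7]  # zero-based positions of columns 2, 3, 5, 6, 7 and 8

parser = RowCollector()
parser.feed(SAMPLE_HTML)
records = [dict(zip(LABELS, (row[i] for i in KEEP)))
           for row in parser.rows if len(row) >= 8]
print(records[0])
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame` to obtain the same kind of table that *get_table* returns.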

#### Automating all years

We scrape all the available years of the Prague Marathon: 1995-2011 and 2013-2019. The year 2012 is for some reason missing from the website. Its page can be found through Google, but it is unusable: it sits in an odd place on the website and cannot load the script that pulls the data from the servers. Our guess is that 2012 was a special year in which the organisers tried something new with a different company and the data is now inaccessible, or that they simply never finished integrating this year into the website correctly.

In [11]:
# The marathon years run from 2019 back to 1995, excluding 2012
years = list(range(2019, 2012, -1)) + list(range(2011, 1994, -1))
print(years)

[2019, 2018, 2017, 2016, 2015, 2014, 2013, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995]


In [12]:
# Total number of pages for each marathon, entered manually in the same order as years; could be scraped as well but didn't seem worth the time savings and clutter
num_of_pages = [487, 464, 434, 386, 392, 390, 385, 354, 324, 266, 247, 213, 200, 227, 229, 174, 188, 177, 186, 186, 168, 110, 78, 35] # THIS RUNS ABOUT 5 HOURS
# num_of_pages = [2] * 24 # for testing, just 2 pages scraping, takes just about a minute for all the 24 years of marathons - RUN THIS VERSION WHEN TESTING

In [13]:
# Gets the URLs of all data table pages for marathon year
def get_pages_urls(idx):
    url_marathon_part_1 = urls_marathons[idx] + '&page=' # first part of the urls before inputting the page number
    url_marathon_part_2 = '&per_page=15&sort=finishTime' # the end of the urls
    marathon_pages = np.empty(num_of_pages[idx], dtype=object) # initialise an empty array of length depending on how many pages the marathon table has

    for i in range(0, num_of_pages[idx]): # link together URL parts
        marathon_pages[i] = url_marathon_part_1 + str(i + 1) + url_marathon_part_2

    return marathon_pages # returns URLs for pages of the marathon

In [14]:
# Scrapes the whole one marathon year, across its data table pages
def scrape(index):
    frames = [] # per-page dataframes, concatenated once at the end
    position = 0 # initialise a page counter

    marathon_pages = get_pages_urls(index) # get the urls for index-marathon table pages
    
    # For loop that scrapes each page of the marathon year and collects the data tables
    for page in marathon_pages:
        position += 1 # increment a page counter
        print('Scraping year ' + str(years[index]) + ', page ' + str(position) + '/' + str(len(marathon_pages))) # feedback output to console

        # Scrape the data table on one page
        frames.append(get_table(page))

    # Concatenate the tables from all pages into one dataframe
    # http://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
    df = pd.concat(frames, ignore_index = True) # ignore_index = True ignores original 0-14 indices and makes a new one

    print('Done scraping year ' + str(years[index]))
    return df # returns pandas dataframe containing the whole data table for a given year (index)

## The final scraping loop

The following cell scrapes the data tables for all years, calling most of the functions defined above.

**Takes a few hours to scrape all the years.**
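A back-of-the-envelope estimate shows where the time goes. The sketch below uses the page counts entered manually above, the 15 rows per page, and the mean of the 1.0-1.5 s pause in *get_soup*. It counts only the deliberate pauses, so it is a lower bound rather than a measurement.

```python
# Page counts per marathon year, copied from the cell above (2019 down to 1995, without 2012)
num_of_pages = [487, 464, 434, 386, 392, 390, 385, 354, 324, 266, 247, 213,
                200, 227, 229, 174, 188, 177, 186, 186, 168, 110, 78, 35]

ROWS_PER_PAGE = 15
MEAN_SLEEP_SECONDS = (1.0 + 1.5) / 2  # mean of random.uniform(1.0, 1.5)

total_pages = sum(num_of_pages)
max_rows = total_pages * ROWS_PER_PAGE  # upper bound: the last page of a year may be partly filled
sleep_hours = total_pages * MEAN_SLEEP_SECONDS / 3600

print('pages to load:', total_pages)
print('rows at most:', max_rows)
print('hours spent in sleep alone:', round(sleep_hours, 2))
```

The pauses alone account for roughly 2.2 hours over 6300 pages. Page loading and parsing come on top of that, which is consistent with the total of about 5 hours noted in the code.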

In [15]:
# Selenium scraping

# Working with chrome, first open one Chrome window that'll be displaying our URLs
browser = webdriver.Chrome()

print("Waiting 3sec for browser window to open")
sleep(3) # time in seconds, wait before Chrome window opens

# MAIN SCRAPING LOOP/CALL
for i in range(0, len(years)):
    # Scrape the data table for the i-th marathon year
    data = scrape(i)
    
    # Save the data (the Data_Marathons_Prague folder must already exist, to_csv does not create it)
    data.to_csv('Data_Marathons_Prague/data_' + str(years[i]) + '.csv', index = False) # "index = False" avoids saving the index column which is duplicated once loaded
    
# The scraping status is gradually printed to the screen for feedback. If you're viewing this on GitHub or some other HTML form, this is long so scroll down.
# Best to open this Jupyter file in Jupyter lab where the view state has been saved and the output of this cell should be truncated.

Scraping year 2019, page 1/487
Scraping year 2019, page 2/487
Scraping year 2019, page 3/487
Scraping year 2019, page 4/487
Scraping year 2019, page 5/487
Scraping year 2019, page 6/487
Scraping year 2019, page 7/487
Scraping year 2019, page 8/487
Scraping year 2019, page 9/487
Scraping year 2019, page 10/487
Scraping year 2019, page 11/487
Scraping year 2019, page 12/487
Scraping year 2019, page 13/487
Scraping year 2019, page 14/487
Scraping year 2019, page 15/487
Scraping year 2019, page 16/487
Scraping year 2019, page 17/487
Scraping year 2019, page 18/487
Scraping year 2019, page 19/487
Scraping year 2019, page 20/487
Scraping year 2019, page 21/487
Scraping year 2019, page 22/487
Scraping year 2019, page 23/487
Scraping year 2019, page 24/487
Scraping year 2019, page 25/487
Scraping year 2019, page 26/487
Scraping year 2019, page 27/487
Scraping year 2019, page 28/487
Scraping year 2019, page 29/487
Scraping year 2019, page 30/487
Scraping year 2019, page 31/487
Scraping year 201

### Test load data files

We load some of the saved datasets to check that they look correct. On visual inspection, all seems good. More preprocessing is done in the Data Analysis part of the project, where these saved datasets are loaded and analysed.

In [19]:
# Test load a saved data file
# 2019, last year
data_loaded = pd.read_csv('Data_Marathons_Prague/data_2019.csv')
data_loaded

Unnamed: 0,Rank,Name,Chip time,St. number,Nationality,Age cat.
0,1,Almahjoub DAZZA,2:05:58,1,BHR,MAM
1,2,Dawit WOLDE,2:06:18,12,ETH,MAM
2,3,Aychew BANTIE,2:06:23,7,ETH,MAM
3,4,Amos KIPRUTO,2:06:46,2,KEN,MAM
4,5,Solomon Kirwa YEGO,2:07:30,3,KEN,MAM
5,6,Hamid Ben DAOUD,2:08:14,9,ESP,MAM
6,7,Paul Muchemi MAINA,2:09:17,4,KEN,MAM
7,8,Girmaw AMARE,2:09:54,19,ISR,MAM
8,9,Nicodemus Kipkurui KIMUTAI,2:10:00,17,KEN,MAM
9,10,Goitom KIFLE,2:10:18,15,ERI,MAM


In [20]:
# Test load a saved data file
# 1995, first year
data_loaded = pd.read_csv('Data_Marathons_Prague/data_1995.csv')
data_loaded

Unnamed: 0,Rank,Name,Chip time,St. number,Nationality,Age cat.
0,1,Turbo Tummo,2:12:44,3,-,-
1,2,Andrzej Krzyscin,2:16:53,8,-,-
2,3,Pavel Klimes,2:16:56,7,-,-
3,4,Miriusz Kaminski,2:17:06,28,-,-
4,5,Jackson Kipngok,2:17:13,1,-,-
5,6,Pavel Kryška,2:17:31,286,-,-
6,7,Vladimír Plykine,2:17:47,5,-,-
7,8,Marek Adamski,2:19:14,17,-,-
8,9,Petro Meta,2:20:25,10,-,-
9,10,Alexander Erchov,2:20:30,776,-,-
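The visual inspection can be complemented by a small automated check. The sketch below uses only the standard library and an inline sample of three rows taken from the 2019 output above; the helper names `parse_chip_time` and `check_rows` are ours. It verifies only that ranks are numeric and that chip times parse as H:MM:SS, which is an assumption about what a valid row looks like.

```python
import csv
import io
from datetime import timedelta

# Three rows copied from the 2019 output above; a real check would open the saved CSV file instead
SAMPLE_CSV = """\
Rank,Name,Chip time,St. number,Nationality,Age cat.
1,Almahjoub DAZZA,2:05:58,1,BHR,MAM
2,Dawit WOLDE,2:06:18,12,ETH,MAM
3,Aychew BANTIE,2:06:23,7,ETH,MAM
"""

def parse_chip_time(text):
    """Convert an 'H:MM:SS' string to a timedelta; raises ValueError if malformed."""
    hours, minutes, seconds = (int(part) for part in text.split(':'))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

def check_rows(rows):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    for line_no, row in enumerate(rows, start=1):
        if not row['Rank'].isdigit():
            problems.append(f'row {line_no}: rank {row["Rank"]!r} is not a number')
        try:
            parse_chip_time(row['Chip time'])
        except ValueError:
            problems.append(f'row {line_no}: chip time {row["Chip time"]!r} is not H:MM:SS')
    return problems

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
problems = check_rows(rows)
print('rows checked:', len(rows), '| problems found:', len(problems))
```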


### Sources:

A couple of useful websites for reading up on the methods.

 - a Google search for relevant terms:
     - https://www.google.com/search?q=python+scrape+website+that+has+script+inside+html&oq=python+scrape+website+that+has+script+inside+html&aqs=chrome..69i57.14882j0j7&sourceid=chrome&ie=UTF-8
 - a couple of approaches compared:
     - https://stackoverflow.com/questions/26680590/how-to-scrape-imbeded-script-on-webpage-in-python
 - Selenium tutorial, especially useful for setting up the environment (the webdriver):
     - https://stanford.edu/~mgorkove/cgi-bin/rpython_tutorials/Scraping_a_Webpage_Rendered_by_Javascript_Using_Python.php
 - JS scraping using BeautifulSoup:
     - https://www.youtube.com/watch?v=FSH77vnOGqU
 - modern websites scraping:
     - https://www.youtube.com/watch?v=vsmxMLmroyQ
 - BeautifulSoup documentation:
     - https://www.crummy.com/software/BeautifulSoup/bs4/doc/
 - Pandas documentation:
     - https://pandas.pydata.org/pandas-docs/stable/