# Web scraping the Oddschecker outright World Cup winner page

This script uses Selenium and ChromeDriver to scrape Oddschecker's outright World Cup winner page and calculate the "indicative probabilities" of each team winning the tournament.

The script was developed for use as part of scheduled workflows using GitHub Actions. An initial version of the script worked effectively on my desktop but was blocked when executed through GitHub Actions, which led to the inclusion of a user agent and additional chromedriver options.

<div>
<img src="screenshot.png" width="500"/>
</div>

In [1]:
from bs4 import BeautifulSoup
import numpy
import pandas
from datetime import datetime

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver import ActionChains
from selenium.webdriver.support.ui import WebDriverWait

In [45]:
chrome_options = Options()
# Send a regular browser user agent; without it the request was blocked on GitHub Actions
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.517 Safari/537.36'
chrome_options.add_argument('user-agent={0}'.format(user_agent))
# Run Chrome without opening a browser window
chrome_options.add_argument('--headless')

driver = webdriver.Chrome(service=Service('/usr/local/bin/chromedriver.exe'), options=chrome_options)

wait = WebDriverWait(driver, 20)
action = ActionChains(driver)

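driver.get('https://www.oddschecker.com/football/world-cup/winner')
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(driver.page_source, 'html.parser')
# The page source has been captured, so the browser can be closed
driver.quit()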

Define the list of participating countries. The list was originally extracted from the page's HTML before the tournament began, then hardcoded so that countries would still be assigned a probability (of 0) after being eliminated, at which point they disappear from the webpage.

In [57]:
#listOfCountries = soup.find_all( class_ = "popup selTxt" )

#for i in range(len(listOfCountries)):
    #listOfCountries[i] = listOfCountries[i]["data-name"]
    
#print(listOfCountries)

listOfCountries = ['Brazil', 'Argentina', 'France', 'Spain', 'England', 'Germany', 'Netherlands', 'Portugal', 'Belgium', 'Denmark', 'Uruguay', 'Croatia', 'Serbia', 'Switzerland', 'Senegal', 'Mexico', 'USA', 'Poland', 'Ecuador', 'Wales', 'Morocco', 'Japan', 'Ghana', 'Canada', 'Cameroon', 'Iran', 'South Korea', 'Australia', 'Qatar', 'Tunisia', 'Saudi Arabia', 'Costa Rica']

For each country, grab the odds offered by each betting company/exchange and convert each one to an implied probability (1 divided by the odds). The initial probability for the country is the average of these values. That probability is then normalised to account for the fact that betting companies do not provide fair odds (i.e. without normalisation the probabilities across all teams would sum to more than 1).
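As a minimal worked illustration of the averaging and normalisation described above, the sketch below uses made-up decimal odds for three hypothetical teams and two hypothetical bookmakers. None of these figures come from Oddschecker, and the names `example_odds`, `raw_probs` and `normalised` are only for illustration.

```python
# Hypothetical decimal odds from two bookmakers for three made-up teams
example_odds = {
    "Team A": [1.6, 2.0],
    "Team B": [4.0, 5.0],
    "Team C": [2.5, 4.0],
}

# Implied probability from each bookmaker is 1 / odds; average them per team
raw_probs = {
    team: sum(1 / o for o in odds) / len(odds)
    for team, odds in example_odds.items()
}

# The raw probabilities sum to more than 1 because of the bookmakers' margin
total = sum(raw_probs.values())

# Dividing by the total rescales the probabilities so that they sum to 1
normalised = {team: p / total for team, p in raw_probs.items()}

print(total)
print(normalised)
```

Here the raw probabilities add up to more than 1, and dividing each by that total gives values that can be read as each team's share of the overall probability.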

In [58]:
oddsData = []
totalProb = 0

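for country in listOfCountries:
    
    # Table row holding every bookmaker's odds for this country
    countryData = soup.find( class_ = "diff-row evTabRow bc", attrs={"data-bname" : country} )
    
    if countryData is None:
        
        # Eliminated countries no longer appear on the page, so assign a probability of 0
        oddsArray = [0]
        
    else:
    
        oddsCells = countryData.find_all("td", class_ = "bc", recursive=False)

        # Convert each bookmaker's odds (data-odig) to an implied probability
        oddsArray = [1/float(cell["data-odig"]) for cell in oddsCells]

    # Average across bookmakers and keep a running total for the normalisation below
    oddsMean = numpy.mean(oddsArray)
    totalProb += oddsMean
    
    oddsData.append({"country": country, "prob": oddsMean, "currDateTime": datetime.now()})
    
# Rescale so that the probabilities across all countries sum to 1
for obj in oddsData:
    
    obj["prob"] = obj["prob"]/totalProb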

In [59]:
pandas.DataFrame(oddsData).to_csv("historicalOutputs.csv", sep=',', encoding='utf-8', index=False, mode='a', header=False)
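Because rows are appended with `header=False`, the output file has no header row, so column names need to be supplied when reading it back. The sketch below shows one way this might be done. It assumes the column order matches the dictionary keys used above (`country`, `prob`, `currDateTime`) and reads from an in-memory sample with made-up values rather than the real file.

```python
import io
import pandas

# Made-up rows in the same column order the script writes: country, prob, currDateTime
sample_csv = io.StringIO(
    "Brazil,0.25,2022-11-20 09:00:00.000000\n"
    "Qatar,0.0,2022-11-20 09:00:00.000000\n"
)

# The file has no header row, so the column names are supplied explicitly
history = pandas.read_csv(
    sample_csv,
    header=None,
    names=["country", "prob", "currDateTime"],
    parse_dates=["currDateTime"],
)

print(history)
```

To read the real output, `sample_csv` would be replaced with the path `"historicalOutputs.csv"`.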