# Is There a Home Course Advantage in UCI Cyclocross Worlds?

We are getting to the heart of the Euro cyclocross season, and I'm getting excited for this year's Worlds in Heusden-Zolder, Belgium.  That got me wondering whether racers have historically had an advantage when the event is hosted by their home country.  It obviously wasn't strong enough for JPow to win last year, but a crowd cheering you on, a possibly familiar course, or at the very least not having to travel a long way and get off schedule could give a rider an extra edge.  Conveniently, the data I need to answer this question is already tabulated on this [Cyclocross Magazine page](http://www.cxmagazine.com/past-and-present-cyclocross-world-champions-world-championship-winners).  It would also be cool to look at the women's and U23 races, but at this point there isn't much history to analyze.  One obvious caveat is that there have been a lot of races in Belgium, and there are also a lot of Belgian racers.  With that in mind, let's have a look.

In [1]:
from bs4 import BeautifulSoup
import urllib2
import pandas as pd
from ggplot import *
import re

In [2]:
url = 'http://www.cxmagazine.com/past-and-present-cyclocross-world-champions-world-championship-winners'

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) \
Chrome/47.0.2526.106 Safari/537.36' #must specify User-Agent because default gives 403 forbidden error

req = urllib2.Request(url, headers = {'User-Agent' : user_agent})

try:
    page_html = urllib2.urlopen(req)
except urllib2.URLError as err: #covers HTTP errors (e.g. 403) as well as connection failures
    print 'Unable to complete request:', err
    exit()

soup = BeautifulSoup(page_html)

tables = soup.find_all('table')

#print off the first few rows to see which table contains the Men's Elite records
for table in tables:
    rows = table.find_all('tr')
    print rows[:2]


[<tr>
<td height="17" width="52"><strong>Year</strong></td>
<td width="164"><strong>City</strong></td>
<td width="142"><strong>Gold</strong></td>
<td width="147"><strong>Silver</strong></td>
<td width="145"><strong>Bronze</strong></td>
</tr>, <tr>
<td height="17" width="52">1950</td>
<td width="164">Paris, France</td>
<td width="142">Jean Robic (FRA)</td>
<td width="147">Roger Rondeaux (FRA)</td>
<td width="145">Pierre Jodet (FRA)</td>
</tr>]
[<tr>
<td height="17" width="52"><strong>Year</strong></td>
<td width="164"><strong>City</strong></td>
<td width="142"><strong>Gold</strong></td>
<td width="147"><strong>Silver</strong></td>
<td width="145"><strong>Bronze</strong></td>
</tr>, <tr>
<td height="17" width="52">2000</td>
<td width="164">Sint-Michielsgestel, Netherlands</td>
<td width="142">Hanka Kupfernagel (GER)</td>
<td width="147">Louise Robinson (GBR)</td>
<td width="145">Daphny van den Brand (NED)</td>
</tr>]
[<tr>
<td height="13" width="35"><strong>Year</strong></td>
... (output truncated)

In [3]:
#the first table is the one we are looking for
mens_worlds = tables[0]

#define new arrays to house the scraped data
year = []
city = []
gold = []
silver = []
bronze = []

for row in mens_worlds.find_all('tr')[1:-1]: #each row is a <tr> element; skip the header row and the final row
    cols = row.find_all('td')
    year.append(cols[0].string.strip())
    city.append(cols[1].string.strip())
    gold.append(cols[2].string.strip())
    silver.append(cols[3].string.strip())
    bronze.append(cols[4].string.strip())

In [4]:
worlds = pd.DataFrame({'Year' : year, 'City' : city, 'Gold' : gold, 'Silver' : silver, 'Bronze' : bronze})
worlds.head()

| | Bronze | City | Gold | Silver | Year |
|---|---|---|---|---|---|
| 0 | Pierre Jodet (FRA) | Paris, France | Jean Robic (FRA) | Roger Rondeaux (FRA) | 1950 |
| 1 | Pierre Jodet (FRA) | Luxembourg, Luxembourg | Roger Rondeaux (FRA) | André Dufraisse (FRA) | 1951 |
| 2 | Albert Meier (SUI) | Geneva, Switzerland | Roger Rondeaux (FRA) | André Dufraisse (FRA) | 1952 |
| 3 | André Dufraisse (FRA) | Quato, Spain | Roger Rondeaux (FRA) | Gilbert Bauvin (FRA) | 1953 |
| 4 | Hans Bieri (SUI) | Crenna, Italy | André Dufraisse (FRA) | Pierre Jodet (FRA) | 1954 |


In [5]:
worlds = worlds.loc[:,['Year', 'City', 'Gold', 'Silver', 'Bronze']]
worlds.head()

| | Year | City | Gold | Silver | Bronze |
|---|---|---|---|---|---|
| 0 | 1950 | Paris, France | Jean Robic (FRA) | Roger Rondeaux (FRA) | Pierre Jodet (FRA) |
| 1 | 1951 | Luxembourg, Luxembourg | Roger Rondeaux (FRA) | André Dufraisse (FRA) | Pierre Jodet (FRA) |
| 2 | 1952 | Geneva, Switzerland | Roger Rondeaux (FRA) | André Dufraisse (FRA) | Albert Meier (SUI) |
| 3 | 1953 | Quato, Spain | Roger Rondeaux (FRA) | Gilbert Bauvin (FRA) | André Dufraisse (FRA) |
| 4 | 1954 | Crenna, Italy | André Dufraisse (FRA) | Pierre Jodet (FRA) | Hans Bieri (SUI) |


In [6]:
#now to extract the three letter country abbreviation from each element
cntry = re.compile(r'\(([A-Z]{3})\)')
worlds['Gold_country'] = [cntry.findall(winner)[0] for winner in worlds['Gold']]
worlds['Silver_country'] = [cntry.findall(winner)[0] for winner in worlds['Silver']]
worlds['Bronze_country'] = [cntry.findall(winner)[0] for winner in worlds['Bronze']]

In [7]:
worlds.head()

| | Year | City | Gold | Silver | Bronze | Gold_country | Silver_country | Bronze_country |
|---|---|---|---|---|---|---|---|---|
| 0 | 1950 | Paris, France | Jean Robic (FRA) | Roger Rondeaux (FRA) | Pierre Jodet (FRA) | FRA | FRA | FRA |
| 1 | 1951 | Luxembourg, Luxembourg | Roger Rondeaux (FRA) | André Dufraisse (FRA) | Pierre Jodet (FRA) | FRA | FRA | FRA |
| 2 | 1952 | Geneva, Switzerland | Roger Rondeaux (FRA) | André Dufraisse (FRA) | Albert Meier (SUI) | FRA | FRA | SUI |
| 3 | 1953 | Quato, Spain | Roger Rondeaux (FRA) | Gilbert Bauvin (FRA) | André Dufraisse (FRA) | FRA | FRA | FRA |
| 4 | 1954 | Crenna, Italy | André Dufraisse (FRA) | Pierre Jodet (FRA) | Hans Bieri (SUI) | FRA | FRA | SUI |
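
The list comprehensions in cell 6 take `cntry.findall(...)[0]`, which raises an `IndexError` if an entry has no three-letter code in parentheses. That evidently did not happen with this table, but here is a minimal sketch of a more forgiving version, assuming that returning `None` for such entries is acceptable. The helper name `extract_country` and the malformed sample string are made up for illustration.

```python
import re

# Same pattern as in the notebook: three capital letters inside parentheses.
cntry = re.compile(r'\(([A-Z]{3})\)')

def extract_country(entry):
    """Return the three-letter code found in `entry`, or None if there is none."""
    match = cntry.search(entry)
    return match.group(1) if match else None

# Two entries copied from the rows shown above, plus one made-up malformed entry.
samples = ['Jean Robic (FRA)', 'Albert Meier (SUI)', 'No medal awarded']
codes = [extract_country(s) for s in samples]
print(codes)  # ['FRA', 'SUI', None]
```

Any `None` values in the resulting column would then flag rows that need a manual look before the country lookup.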


In [8]:
#We can now pull a table of three letter country codes from wikipedia to convert the winner's country and compare
#it against the host country

url = 'https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3'

page_html = urllib2.urlopen(url)

soup = BeautifulSoup(page_html)

tables = soup.find_all('table')

#From looking at the page, the correct table should be the first one.
#Now to capture the relevant information in lists.
abbrev = []
country = []

#this table is formatted in a kind of tricky way, three tables nested in <td> tags under the main table

column_tables = tables[0].find_all('table')

len(column_tables)

3

In [10]:
for column_table in column_tables: #one nested table per column of the page layout
    for row in column_table.find_all('tr'):
        cols = row.find_all('td')
        abbrev.append(cols[0].string.strip())
        country.append(cols[1].string.strip())

In [11]:
print abbrev[:10], country[:10]

[u'ABW', u'AFG', u'AGO', u'AIA', u'ALA', u'ALB', u'AND', u'ARE', u'ARG', u'ARM'] [u'Aruba', u'Afghanistan', u'Angola', u'Anguilla', u'\xc5land Islands', u'Albania', u'Andorra', u'United Arab Emirates', u'Argentina', u'Armenia']


In [12]:
#great!  Now to define a dict with the above data as key:value pairs
cntry_lookup = dict(zip(abbrev, country))

In [13]:
#it appears that there are a few non-standard 3 letter abbreviations in the race results table.  
#Let's add those to the dict manually.

cntry_lookup['GER'] = 'Germany'
cntry_lookup['SUI'] = 'Switzerland'
cntry_lookup['NED'] = 'Netherlands'

worlds['Gold_country_full'] = [cntry_lookup[code] for code in worlds['Gold_country']]

In [14]:
worlds.head()

| | Year | City | Gold | Silver | Bronze | Gold_country | Silver_country | Bronze_country | Gold_country_full |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1950 | Paris, France | Jean Robic (FRA) | Roger Rondeaux (FRA) | Pierre Jodet (FRA) | FRA | FRA | FRA | France |
| 1 | 1951 | Luxembourg, Luxembourg | Roger Rondeaux (FRA) | André Dufraisse (FRA) | Pierre Jodet (FRA) | FRA | FRA | FRA | France |
| 2 | 1952 | Geneva, Switzerland | Roger Rondeaux (FRA) | André Dufraisse (FRA) | Albert Meier (SUI) | FRA | FRA | SUI | France |
| 3 | 1953 | Quato, Spain | Roger Rondeaux (FRA) | Gilbert Bauvin (FRA) | André Dufraisse (FRA) | FRA | FRA | FRA | France |
| 4 | 1954 | Crenna, Italy | André Dufraisse (FRA) | Pierre Jodet (FRA) | Hans Bieri (SUI) | FRA | FRA | SUI | France |
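
One way to find the non-standard abbreviations before the lookup fails with a `KeyError` is a set difference between the codes in the results and the keys of the lookup dict. The sketch below uses a small hand-typed subset of the ISO table and an illustrative list of result codes, not the scraped data, so the names `iso_lookup` and `result_codes` are assumptions.

```python
# Hand-typed subset of the ISO 3166-1 alpha-3 table (illustrative, not scraped).
iso_lookup = {
    'FRA': 'France',
    'BEL': 'Belgium',
    'CHE': 'Switzerland',
    'DEU': 'Germany',
    'NLD': 'Netherlands',
}

# Codes of the kind that appear in the race results (illustrative list).
result_codes = ['FRA', 'SUI', 'GER', 'NED', 'BEL', 'FRA']

# Codes used in the results that have no entry in the ISO lookup.
missing = sorted(set(result_codes) - set(iso_lookup))
print(missing)  # ['GER', 'NED', 'SUI']
```

Each code in `missing` then needs a manual entry like the three added in cell 13.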


In [15]:
#ok, now we can finally calculate the proportion of races won by a rider from the host country

#True when the winner's full country name appears in the host 'City, Country' string
home_win = [winner_country in host for winner_country, host in zip(worlds.Gold_country_full, worlds.City)]

float(sum(home_win)) / len(home_win)

0.2153846153846154
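
The membership test in cell 15 checks whether the winner's country name appears anywhere in the `City` string, so a country whose name is contained in a longer place name would also count as a match. A slightly stricter sketch, assuming every `City` value has the form `'City, Country'`, compares against only the text after the last comma. The helper name `host_country` is hypothetical, and the sample pairs are taken from the first rows shown earlier.

```python
def host_country(city):
    """Return the text after the last comma in a 'City, Country' string."""
    return city.rsplit(',', 1)[-1].strip()

# (host city, winner's full country name) pairs from the first rows of the table.
rows = [
    ('Paris, France', 'France'),
    ('Luxembourg, Luxembourg', 'France'),
    ('Geneva, Switzerland', 'France'),
]

# Exact comparison against the parsed host country instead of a substring test.
home_win_strict = [host_country(city) == winner for city, winner in rows]
print(home_win_strict)  # [True, False, False]
```

Neither version handles a host country that is spelled differently on the results page than in the ISO table, so it may be worth eyeballing the unmatched rows.
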

A little over 20% of the time (14 of the 65 races in the table), the race winner hailed from the host country.  At first glance, that could indicate some advantage of racing on home dirt, although the number means little without a baseline: a country that both hosts often and wins often, like Belgium, will rack up home wins regardless.  It would be interesting to investigate further whether in these cases the winner was one of the dominant riders that season, and thus would be expected to win anywhere, or whether there were some upsets spurred on by a small home advantage.
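
One possible baseline for that further investigation is a permutation test: shuffle the winners' countries across the races many times and see how often the shuffled home-win rate is at least as high as the observed one. This is only a sketch, not something run on the full table. It uses just the five rows displayed earlier, where every winner is French, so the result is trivial, and the function names are made up for illustration.

```python
import random

def home_win_rate(hosts, winners):
    """Fraction of races where the winner's country equals the host country."""
    wins = sum(1 for h, w in zip(hosts, winners) if h == w)
    return float(wins) / len(hosts)

def permutation_null(hosts, winners, n_iter=2000, seed=0):
    """Home-win rates after randomly reassigning winners to races."""
    rng = random.Random(seed)
    shuffled = list(winners)
    rates = []
    for _ in range(n_iter):
        rng.shuffle(shuffled)
        rates.append(home_win_rate(hosts, shuffled))
    return rates

# Host and winner countries for 1950-1954, the five rows shown by worlds.head().
hosts = ['France', 'Luxembourg', 'Switzerland', 'Spain', 'Italy']
winners = ['France'] * 5

observed = home_win_rate(hosts, winners)
null_rates = permutation_null(hosts, winners)

# Share of shuffles with a home-win rate at least as high as the observed one.
p_value = float(sum(1 for r in null_rates if r >= observed)) / len(null_rates)
print('observed=%.2f, p=%.2f' % (observed, p_value))  # observed=0.20, p=1.00
```

On the full 65-race table, a small `p_value` would suggest more home wins than chance pairing of hosts and winners produces. Shuffling across all years ignores eras of national dominance, so shuffling within decades might be a fairer comparison.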