# UCI MTB DH Data Retrieval

## Setup
#### Import Libraries

If you do not have these libraries available, install them using `pip`:

```
pip install requests
pip install bs4
pip install pandas
```

In [76]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import datetime as dt
import os

In [52]:
def calculate_age(born):
    today = dt.date.today()
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))
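
The subtraction in `calculate_age` relies on a tuple comparison that evaluates to `True` (counted as 1) when this year's birthday has not happened yet. A minimal sketch of the same logic follows, with the reference date passed in so the result is reproducible; the name `calculate_age_on` and the `today` parameter are illustrative additions, not part of the notebook.

```python
import datetime as dt

def calculate_age_on(born, today):
    # (month, day) tuples compare element by element; the result is True (1)
    # when the birthday has not yet occurred in the reference year.
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

born = dt.date(1987, 12, 24)
print(calculate_age_on(born, dt.date(2018, 6, 3)))    # 30: birthday still to come in 2018
print(calculate_age_on(born, dt.date(2018, 12, 24)))  # 31: birthday reached
```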

Widen the display area to prevent column wrapping, and always show all columns for debugging.

In [53]:
pd.set_option('display.width', 2000)
pd.set_option('display.max_columns', None)

## Config

Which race data are we collecting?

1. Losinj
1. Fort William
1. Leogang
1. Val di Sole
1. Vallnord
1. Mont-Sainte-Anne
1. La Bresse

In [54]:
race = 2
gender = 'm'
event = 'dh'

#### Data Sources

The UCI Live Timing API contains a lot of data points, but not all the ones we want: speed is the main omission and, frustratingly, the API lacks some values that the UCI includes in its own PDF.

Similarly, Roots & Rain also has a lot of the data points, but again not all of them; most notably it's missing timing splits 4 and 5.

Therefore we need to pull from both sources and combine the sets.

Here we specify the URLs for both sources from which we will extract our data. The UCI API URL can be found by loading the Live Timing page and then using the Network tab of your browser's inspector (in Chrome at least) to see the data feed. As the UCI seems to be using a Single Page Application (SPA) here, it's not straightforward to extract this link automagically.

*Links will be added as data sources become available.*

In [78]:
races = [
    [
        'losinj',
        'http://prod.chronorace.be/api/results/uci/dh/race/20180421_dh/3',
        'https://www.rootsandrain.com/race5897/2018-apr-22-mercedes-benz-uci-world-cup-1-losinj/results/filters/m/',
        'http://prod.chronorace.be/api/results/uci/dh/race/20180421_dh/6',
        'https://www.rootsandrain.com/race5897/2018-apr-22-mercedes-benz-uci-world-cup-1-losinj/results/filters/f/'
    ]
    , [
        'fortbill',
        'http://prod.chronorace.be/api/results/uci/dh/race/20180602_dh/3',
        'https://www.rootsandrain.com/race5898/2018-jun-3-mercedes-benz-uci-world-cup-2-fort-william/results/filters/m',
        'http://prod.chronorace.be/api/results/uci/dh/race/20180602_dh/6',
        'https://www.rootsandrain.com/race5898/2018-jun-3-mercedes-benz-uci-world-cup-2-fort-william/results/filters/f'
    ]
    , [ 'leogang', '', '', '', '' ]
    , [ 'valdisole', '', '', '', '' ]
    , [ 'vallnord', '', '', '', '' ]
    , [ 'msa', '', '', '', '' ]
    , [ 'labresse', '', '', '', '' ]
]
key = 0 if gender == 'm' else 2
raceName = races[race-1][0]
urlUci = races[race-1][key+1]
urlRoots = races[race-1][key+2]
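
The positional lookup above depends on every entry having the same layout: name, men's UCI URL, men's Roots & Rain URL, then the two women's URLs. As a hedged sketch, a small helper can make that layout explicit and fail with a readable message when a race has no sources yet. The function name `source_urls` and the sample strings are hypothetical and stand in for the real URLs.

```python
def source_urls(races, race, gender):
    """Return (uci_url, roots_url) for a 1-based race number and 'm' or 'f'."""
    entry = races[race - 1]
    offset = 1 if gender == 'm' else 3
    urls = entry[offset:offset + 2]  # slicing never raises IndexError
    if len(urls) < 2 or not all(urls):
        raise ValueError('No data sources configured for ' + entry[0] + ' (' + gender + ')')
    return tuple(urls)

# Placeholder strings, not real URLs
sample = [
    ['losinj', 'uci-men', 'roots-men', 'uci-women', 'roots-women'],
    ['leogang', '', ''],
]
print(source_urls(sample, 1, 'f'))  # ('uci-women', 'roots-women')
```

Calling `source_urls(sample, 2, 'm')` raises a `ValueError` instead of silently returning an empty URL.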

File handling setup

In [87]:
directory = event + str(race) + '_' + raceName
if not os.path.exists(directory):
    os.makedirs(directory)

file_prefix = event + str(race) + '_' + raceName + '_' + gender
file_prefix = os.path.join( directory, file_prefix )

# UCI API
### Load Data

These two lines make the request to the server and then convert the JSON response into a Python dictionary (deserialization).

In [56]:
r = requests.get( urlUci )
d = r.json()
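
The request above assumes the server answers promptly and successfully. A slightly more defensive sketch is shown here, assuming a 10 second timeout is acceptable; `fetch_json` and its `get` parameter are hypothetical names introduced for illustration and to make the function testable without network access.

```python
import requests

def fetch_json(url, get=requests.get, timeout=10):
    """Fetch a URL and return the decoded JSON body, raising on HTTP errors."""
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.json()

# Usage in this notebook would be: d = fetch_json(urlUci)
```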

The API response has three main sections:

1. `Last Finisher`
 - Racers in order of start time
2. `Results`
 - Racers in finishing rank order
3. `Riders`
 - Personal details on all racers
 
Each contains many data points. To see all the contained data, you can un-comment and execute any of the lines in the next section to explore more.

In [57]:
# display( d )
# display( d['Results'][7] )
# display( d['Riders']['1001'] )
# display( d['Results'][61] )

### Extract Data

Here we iterate over the `Results` sub-set of data to extract the information we care about: basically those that finished the race, some identifying info, and their splits.

If you looked at the returned data in detail in the last step, you might have noticed that the rider's name is not stored next to their result; riders are identified only by a reference number. To make our analysis easier later on, we import each rider's name at this stage by cross-referencing the `Riders` sub-set.
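To make the cross-reference concrete, here is a small hand-built stand-in for the response. It uses only the keys this notebook reads and is trimmed to one rider and two splits; the real feed contains many more fields. Note that `Results` stores `RaceNr` as a number while `Riders` is keyed by strings, hence the `str()` call.

```python
# Hand-built sample mimicking the shape of the API response (not a real download)
d = {
    'Results': [
        {'RaceNr': 1016, 'Position': 1, 'Status': 'Finished',
         'Times': [{'RaceTime': 60602}, {'RaceTime': 185501}]},
    ],
    'Riders': {
        '1016': {'PrintName': 'PIERRON Amaury', 'StartOrder': 54},
    },
}

row = d['Results'][0]
rider = d['Riders'][str(row['RaceNr'])]               # look the rider up by reference number
splits = [t['RaceTime'] / 1000 for t in row['Times']]  # milliseconds to seconds
print(rider['PrintName'], splits)  # PIERRON Amaury [60.602, 185.501]
```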

First we look up the start number of the last rider to drop in, so we can use it to add a reverse start order column. The line below assumes that the last entry in `Riders` belongs to the last starter.

In [58]:
lastStart = d['Riders'][list(d['Riders'].keys())[-1]]['StartOrder']
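
If the last key in `Riders` is not guaranteed to be the last starter, taking the maximum `StartOrder` is a safer way to get the same number. The sketch below uses a three-rider sample; `last_start` and `start_rev` are illustrative names.

```python
# Three riders in key order; the last key is not the last starter
riders = {
    '1001': {'StartOrder': 52},
    '1002': {'StartOrder': 60},
    '1004': {'StartOrder': 56},
}

last_start = max(r['StartOrder'] for r in riders.values())
start_rev = {rid: last_start - r['StartOrder'] + 1 for rid, r in riders.items()}
print(last_start, start_rev)  # 60 {'1001': 9, '1002': 1, '1004': 5}
```

Reading the last key here would give 56 rather than 60, which would shift every reverse start position.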

We start with an empty list `lst` and in each loop iteration add an entry (actually a dict) to that list for each rider.

In [59]:
splits = len(d['Results'][0]['Times'] )
lst = []
for idx, row in enumerate( d['Results'] ):
    fin = "Finished" == row['Status']
    res = {
        'rank': row['Position'] if fin else idx+1,
        'name': d['Riders'][str(row['RaceNr'])]['PrintName'],
        'id': row['RaceNr'],
        'uci': d['Riders'][str(row['RaceNr'])]['UciRiderId'],
        'bib': d['Riders'][str(row['RaceNr'])]['RaceNr'],
        'status': row['Status'],
        'speed': np.nan,
        'start': d['Riders'][str(row['RaceNr'])]['StartOrder'],
        'start_rev': lastStart - d['Riders'][str(row['RaceNr'])]['StartOrder'] +1
    }

    # Add all splits to result set
    for split in range( 0, splits ):
        res['split' + str(split+1)] = row['Times'][split]['RaceTime']/1000 if fin else np.nan

    # Append result set to list
    lst.append(res)

This line loads the completed list into a Pandas DataFrame so that we can easily write it out to CSV later on.

In [60]:
df = pd.DataFrame( lst )

#### Expand Dataset

Calculate and add all the extra columns we need for split and sector differences and their rankings

In [61]:
for i in range( 1, splits+1 ):
    split = 'split' + str(i)
    sector = split + '_sector'
    df[split + '_rank'] = df[split].rank(method='dense')
    df[split + '_vs_best'] = (df[split] - df[split].min())
    df[split + '_vs_winner'] = (df[split] - df[split][0])

    if i > 1:
        df[split + '_sector'] = df[split] - df['split' + str(i-1)]
        df[split + '_sector_rank'] = df[sector].rank(method='dense')
        df[split + '_sector_vs_best'] = (df[sector] - df[sector].min())
        df[split + '_sector_vs_winner'] = (df[sector] - df[sector][0])
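
To see what these derived columns contain, here is a minimal sketch on a three-rider sample, with the winner in the first row as in the finishing-order data. The ranks are relative to the sample only, so they differ from the full-field ranks; `demo` is an illustrative name.

```python
import pandas as pd

demo = pd.DataFrame({
    'name': ['PIERRON Amaury', 'VERGIER Loris', 'BROSNAN Troy'],
    'split1': [60.602, 60.790, 60.101],
    'split2': [185.501, 186.001, 186.402],
})

demo['split1_rank'] = demo['split1'].rank(method='dense')            # 1 = fastest, ties share a rank
demo['split1_vs_best'] = demo['split1'] - demo['split1'].min()       # gap to the fastest split
demo['split1_vs_winner'] = demo['split1'] - demo['split1'].iloc[0]   # gap to the race winner (row 0)
demo['split2_sector'] = demo['split2'] - demo['split1']              # time between split 1 and split 2
print(demo.round(3))
```

A negative `_vs_winner` value means the rider was ahead of the eventual winner at that point.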

We can take a peek at our data at this point to make sure it looks how we expect.

At this point the `speed` column is NaN (Not a Number) for all racers. This will be filled in below.

In [62]:
display( df['split1'][0], df['split1'].min() )

60.601999999999997

59.142000000000003

In [63]:
display( df.head(10) )

Unnamed: 0,bib,id,name,rank,speed,split1,split2,split3,split4,split5,start,start_rev,status,uci,split1_rank,split1_vs_best,split1_vs_winner,split2_rank,split2_vs_best,split2_vs_winner,split2_sector,split2_sector_rank,split2_sector_vs_best,split2_sector_vs_winner,split3_rank,split3_vs_best,split3_vs_winner,split3_sector,split3_sector_rank,split3_sector_vs_best,split3_sector_vs_winner,split4_rank,split4_vs_best,split4_vs_winner,split4_sector,split4_sector_rank,split4_sector_vs_best,split4_sector_vs_winner,split5_rank,split5_vs_best,split5_vs_winner,split5_sector,split5_sector_rank,split5_sector_vs_best,split5_sector_vs_winner
0,16,1016,PIERRON Amaury,1,,60.602,185.501,211.297,247.279,274.452,54,7,Finished,10008827283,8.0,1.46,0.0,1.0,0.0,0.0,124.899,1.0,0.0,0.0,1.0,0.0,0.0,25.796,1.0,0.0,0.0,1.0,0.0,0.0,35.982,5.0,0.461,0.0,1.0,0.0,0.0,27.173,26.0,1.215,0.0
1,10,1010,VERGIER Loris,2,,60.79,186.001,211.93,248.287,274.722,58,3,Finished,10008723112,14.0,1.648,0.188,2.0,0.5,0.5,125.211,2.0,0.312,0.312,2.0,0.633,0.633,25.929,8.0,0.133,0.133,2.0,1.008,1.008,36.357,9.0,0.836,0.375,2.0,0.27,0.27,26.435,7.0,0.477,-0.738
2,8,1008,BROSNAN Troy,3,,60.101,186.402,212.226,248.301,274.763,57,4,Finished,10007307417,4.0,0.959,-0.501,4.0,0.901,0.901,126.301,5.0,1.402,1.402,3.0,0.929,0.929,25.824,2.0,0.028,0.028,3.0,1.022,1.022,36.075,7.0,0.554,0.093,3.0,0.311,0.311,26.462,8.0,0.504,-0.711
3,81,1081,WILSON Reece,4,,61.39,187.453,213.456,249.286,275.775,30,31,Finished,10009563271,25.0,2.248,0.788,6.0,1.952,1.952,126.063,4.0,1.164,1.164,6.0,2.159,2.159,26.003,10.0,0.207,0.207,4.0,2.007,2.007,35.83,2.0,0.309,-0.152,4.0,1.323,1.323,26.489,9.0,0.531,-0.684
4,61,1061,BRUNI Loic,5,,60.521,187.518,213.998,250.78,277.039,46,15,Finished,10007544358,7.0,1.379,-0.081,7.0,2.017,2.017,126.997,7.0,2.098,2.098,7.0,2.701,2.701,26.48,25.0,0.684,0.684,9.0,3.501,3.501,36.782,15.0,1.261,0.8,5.0,2.587,2.587,26.259,4.0,0.301,-0.914
5,9,1009,HART Danny,6,,60.734,186.386,212.55,250.051,277.209,59,2,Finished,10005470073,12.0,1.592,0.132,3.0,0.885,0.885,125.652,3.0,0.753,0.753,4.0,1.253,1.253,26.164,15.0,0.368,0.368,5.0,2.772,2.772,37.501,36.0,1.98,1.519,6.0,2.757,2.757,27.158,25.0,1.2,-0.015
6,71,1071,WALKER Matt,7,,60.504,187.293,213.215,250.435,277.612,29,32,Finished,10011016756,6.0,1.362,-0.098,5.0,1.792,1.792,126.789,6.0,1.89,1.89,5.0,1.918,1.918,25.922,7.0,0.126,0.126,6.0,3.156,3.156,37.22,27.0,1.699,1.238,7.0,3.16,3.16,27.177,27.0,1.219,0.004
7,18,1018,GUTIERREZ VILLEGAS Marcelo,8,,61.269,189.254,215.855,251.704,277.698,39,22,Finished,10005855649,21.0,2.127,0.667,15.0,3.753,3.753,127.985,15.0,3.086,3.086,17.0,4.558,4.558,26.601,29.0,0.805,0.805,14.0,4.425,4.425,35.849,3.0,0.328,-0.133,8.0,3.246,3.246,25.994,2.0,0.036,-1.179
8,4,1004,BLENKINSOP Samuel,9,,61.452,189.187,215.381,251.41,277.72,56,5,Finished,10004485929,27.0,2.31,0.85,14.0,3.686,3.686,127.735,12.0,2.836,2.836,14.0,4.084,4.084,26.194,18.0,0.398,0.398,11.0,4.131,4.131,36.029,6.0,0.508,0.047,9.0,3.268,3.268,26.31,5.0,0.352,-0.863
9,138,1138,MAES Martin,10,,61.545,188.737,214.603,250.706,277.774,41,20,Finished,10009453945,30.0,2.403,0.943,12.0,3.236,3.236,127.192,9.0,2.293,2.293,10.0,3.306,3.306,25.866,4.0,0.07,0.07,8.0,3.427,3.427,36.103,8.0,0.582,0.121,10.0,3.322,3.322,27.068,20.0,1.11,-0.105


#### Rider Data

Saving the personal information about each racer is much easier, as we can export the entire `Riders` dataset. However, the rows and columns are the wrong way round, so we use the `.T` attribute to *transpose* the table, which flips its axes. We also derive each rider's age from their `BirthDate`.

In [64]:
df2 = pd.DataFrame( d['Riders'] )
df2 = df2.T
df2['Age'] = [ calculate_age( dt.datetime.strptime( dob[:10], "%Y-%m-%d" ) ) for dob in df2['BirthDate'] ]
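
The effect of the transpose is easiest to see on a small sample. This sketch builds a frame from a two-rider dictionary trimmed to three fields; `wide` and `tall` are illustrative names.

```python
import pandas as pd

riders = {
    '1001': {'PrintName': 'GWIN Aaron', 'Nation': 'USA', 'StartOrder': 52},
    '1002': {'PrintName': 'SHAW Luca', 'Nation': 'USA', 'StartOrder': 60},
}

wide = pd.DataFrame(riders)  # one column per rider id, one row per field
tall = wide.T                # one row per rider id, one column per field
print(wide.shape, tall.shape)  # (3, 2) (2, 3)
print(tall.loc['1002', 'PrintName'])  # SHAW Luca
```

`pd.DataFrame.from_dict(riders, orient='index')` is an alternative that builds the row-per-rider layout directly.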

Here we can glimpse the first few rows of our `DataFrame` and can check the data looks as we expect

In [65]:
display( df2.head() )

Unnamed: 0,BirthDate,CategoryCode,FamilyName,GivenName,Id,Nation,Outfit,PrintName,RaceId,RaceNr,ScoreboardName,StartOrder,StartTime,UciCode,UciRank,UciRiderId,UciTeamCode,UciTeamId,UciTeamName,WorldCupRank,Age
1001,1987-12-24T00:00:00,ME,GWIN,Aaron,101001,USA,WCL,GWIN Aaron,0,1,GWIN A,52,54150000,USA19871224,1,10006516663,YTM,1531,THE YT MOB,1,30
1002,1996-12-25T00:00:00,ME,SHAW,Luca,101002,USA,,SHAW Luca,0,2,SHAW L,60,55830000,USA19961225,10,10008813442,SCB,1307,SANTA CRUZ SYNDICATE,2,21
1004,1988-10-28T00:00:00,ME,BLENKINSOP,Samuel,101004,NZL,CCh,BLENKINSOP Samuel,0,4,BLENKINSOP S,56,54990000,NZL19881028,8,10004485929,NFR,2013,NORCO FACTORY RACING,4,29
1005,1992-04-30T00:00:00,ME,NORTON,Dakotah,101005,USA,,NORTON Dakotah,0,5,NORTON D,55,54780000,USA19920430,19,10010038167,UDR,1961,UNIOR/DEVINCI FACTORY RACING,5,26
1007,1997-02-18T00:00:00,ME,GREENLAND,Laurie,101007,GBR,,GREENLAND Laurie,0,7,GREENLAND L,50,53580000,GBR19970218,12,10009404738,MSM,1009,MS MONDRAKER TEAM,7,21


# Roots and Rain
### Load Data

Similar to the UCI API, we make a request to the server using the previously declared `urlRoots` variable. This time, however, we load the content of the response as raw HTML, the source code of the web page. There is no JSON API to read here, so there is nothing to deserialize.

Next we invoke a utility called `BeautifulSoup` to help us extract the data from this messy HTML code.

In [66]:
r = requests.post( urlRoots )
c = r.content
soup = BeautifulSoup( c, "html.parser" )

### Extract Data

If you look at the Roots and Rain page you'll see it listed in a tabular format. What we do here is find all the rows of that table so we can extract the information we need.

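Specifically we are looking for instances of `tr` (table row) whose class contains `c-`, a common denominator I discovered when looking through the code with the browser inspector. Note that the filter below tests for `c-` anywhere in the class name, which is slightly looser than a strict prefix match.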

In [67]:
rows = soup.find_all( "tr", class_=lambda x: x and 'c-' in x )
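
The row filter can be tried out on a small inline table. The markup and the class name `c-elite` below are made up for illustration and are not copied from the Roots and Rain page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only rows whose class contains "c-" should be selected
html = """
<table>
  <tr class="header"><td>Pos</td><td>Bib</td></tr>
  <tr class="c-elite"><td>1</td><td>16</td></tr>
  <tr class="odd c-elite"><td>2</td><td>10</td></tr>
  <tr><td>no class</td><td></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# The function receives each class value, or None when the tag has no class,
# which is why the "x and" guard is needed.
rows = soup.find_all("tr", class_=lambda x: x and "c-" in x)
bibs = [int(row.find_all("td")[1].text) for row in rows]
print(bibs)  # [16, 10]
```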

Similar to the UCI data set, here we will iterate over each row in our data set--basically each table row from the web page--and extract the bits we need.

Racer speed is the metric we're interested in, but to match it to our existing data set we need a corresponding identifier. We therefore also extract the racer's licence number, which exists in both sets and acts as the join key between them, along with the bib number as a fallback.

To summarise:
1. Extract licence number and corresponding speed
2. Import speed to existing DataFrame matching racers by licence

The `if` condition exits the loop once we reach the end of the Elite finishers; they are the only riders in our existing data set, so nobody beyond that point could be matched.

In [68]:
for row in rows:
    cells = row.find_all( "td" )

    # Stop at the end of the Elite finishers (rows without a position)
    pos = cells[0].text[8:]
    if "" == pos: break

    speed = float(cells[7].text[:5])
    licence = cells[4].text
    bib = int( cells[1].text )

    # Match rider by UCI licence if present, otherwise fallback to bib
    match = df2.loc[df2['UciRiderId'] == licence].index.values
    if not len(match):
        match = df2.loc[df2['RaceNr'] == bib].index.values
    rid = int( match[0] )

    # Add speed and its m/s equivalent (the factor assumes speed is in km/h)
    df.loc[df['id'] == rid, 'speed'] = speed
    df.loc[df['id'] == rid, 'speed_ms'] = speed*(1000/60/60)

# Derived metrics need every rider's speed, so compute them once after the loop
df['speed_ms_vs_best'] = df['speed_ms'].max() - df['speed_ms']
df['speed_rank'] = df['speed'].rank(method='dense', ascending=False)
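
The matching and conversion steps can be sketched without pandas. This is an illustration under stated assumptions rather than the notebook's method: `find_rider_id` and `kmh_to_ms` are hypothetical helpers, and both licence values are converted to strings before comparison so that a number from the API can match text scraped from the page.

```python
def kmh_to_ms(speed_kmh):
    """Convert km/h to m/s (1 km = 1000 m, 1 h = 3600 s)."""
    return speed_kmh * 1000 / 3600

def find_rider_id(riders, licence, bib):
    """Return the rider key matched by licence, else by bib, else None."""
    for rid, rider in riders.items():
        if str(rider['UciRiderId']) == str(licence):
            return rid
    for rid, rider in riders.items():
        if rider['RaceNr'] == bib:
            return rid
    return None

riders = {
    '1001': {'UciRiderId': 10006516663, 'RaceNr': 1},
    '1002': {'UciRiderId': 10008813442, 'RaceNr': 2},
}
print(find_rider_id(riders, '10008813442', 99))  # 1002 (matched by licence)
print(find_rider_id(riders, '', 1))              # 1001 (fell back to bib)
print(kmh_to_ms(36.0))                           # 10.0
```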

As before, we can take another look at how our data is looking, with the `speed` column now containing data 

In [69]:
display( df.head() )

Unnamed: 0,bib,id,name,rank,speed,split1,split2,split3,split4,split5,start,start_rev,status,uci,split1_rank,split1_vs_best,split1_vs_winner,split2_rank,split2_vs_best,split2_vs_winner,split2_sector,split2_sector_rank,split2_sector_vs_best,split2_sector_vs_winner,split3_rank,split3_vs_best,split3_vs_winner,split3_sector,split3_sector_rank,split3_sector_vs_best,split3_sector_vs_winner,split4_rank,split4_vs_best,split4_vs_winner,split4_sector,split4_sector_rank,split4_sector_vs_best,split4_sector_vs_winner,split5_rank,split5_vs_best,split5_vs_winner,split5_sector,split5_sector_rank,split5_sector_vs_best,split5_sector_vs_winner
0,16,1016,PIERRON Amaury,1,,60.602,185.501,211.297,247.279,274.452,54,7,Finished,10008827283,8.0,1.46,0.0,1.0,0.0,0.0,124.899,1.0,0.0,0.0,1.0,0.0,0.0,25.796,1.0,0.0,0.0,1.0,0.0,0.0,35.982,5.0,0.461,0.0,1.0,0.0,0.0,27.173,26.0,1.215,0.0
1,10,1010,VERGIER Loris,2,,60.79,186.001,211.93,248.287,274.722,58,3,Finished,10008723112,14.0,1.648,0.188,2.0,0.5,0.5,125.211,2.0,0.312,0.312,2.0,0.633,0.633,25.929,8.0,0.133,0.133,2.0,1.008,1.008,36.357,9.0,0.836,0.375,2.0,0.27,0.27,26.435,7.0,0.477,-0.738
2,8,1008,BROSNAN Troy,3,,60.101,186.402,212.226,248.301,274.763,57,4,Finished,10007307417,4.0,0.959,-0.501,4.0,0.901,0.901,126.301,5.0,1.402,1.402,3.0,0.929,0.929,25.824,2.0,0.028,0.028,3.0,1.022,1.022,36.075,7.0,0.554,0.093,3.0,0.311,0.311,26.462,8.0,0.504,-0.711
3,81,1081,WILSON Reece,4,,61.39,187.453,213.456,249.286,275.775,30,31,Finished,10009563271,25.0,2.248,0.788,6.0,1.952,1.952,126.063,4.0,1.164,1.164,6.0,2.159,2.159,26.003,10.0,0.207,0.207,4.0,2.007,2.007,35.83,2.0,0.309,-0.152,4.0,1.323,1.323,26.489,9.0,0.531,-0.684
4,61,1061,BRUNI Loic,5,,60.521,187.518,213.998,250.78,277.039,46,15,Finished,10007544358,7.0,1.379,-0.081,7.0,2.017,2.017,126.997,7.0,2.098,2.098,7.0,2.701,2.701,26.48,25.0,0.684,0.684,9.0,3.501,3.501,36.782,15.0,1.261,0.8,5.0,2.587,2.587,26.259,4.0,0.301,-0.914


# Points

Neither data set contains the points awarded, so we load them from a reference file and merge them in by row position.

The merge type here must be `outer` so that riders who finished outside the top 60, or did not finish (DNF), don't get trimmed from the dataset.

In [70]:
dfp = pd.read_csv( event + '_points_' + gender + '.csv', index_col=0 )
dfp = dfp.reset_index(drop=False)
df = df.merge( dfp, left_index=True, right_index=True, how="outer")
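
The effect of the merge type can be checked on a small sample. In this sketch the third rider is hypothetical and has no row in the points table; `results` and `points` are illustrative names.

```python
import pandas as pd

results = pd.DataFrame({'name': ['SHAW Luca', 'SUAREZ ALONSO Angel', 'Hypothetical 61st finisher']})
points = pd.DataFrame({'points': [2, 1]})  # one row per scoring position

outer = results.merge(points, left_index=True, right_index=True, how='outer')
inner = results.merge(points, left_index=True, right_index=True, how='inner')
print(len(outer), len(inner))  # 3 2
print(outer)
```

With `outer` the unmatched rider is kept with `NaN` points, whereas `inner` drops that row entirely.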

In [71]:
df.tail()

Unnamed: 0,bib,id,name,rank,speed,split1,split2,split3,split4,split5,start,start_rev,status,uci,split1_rank,split1_vs_best,split1_vs_winner,split2_rank,split2_vs_best,split2_vs_winner,split2_sector,split2_sector_rank,split2_sector_vs_best,split2_sector_vs_winner,split3_rank,split3_vs_best,split3_vs_winner,split3_sector,split3_sector_rank,split3_sector_vs_best,split3_sector_vs_winner,split4_rank,split4_vs_best,split4_vs_winner,split4_sector,split4_sector_rank,split4_sector_vs_best,split4_sector_vs_winner,split5_rank,split5_vs_best,split5_vs_winner,split5_sector,split5_sector_rank,split5_sector_vs_best,split5_sector_vs_winner,points
55,88,1088,MACKINNON Kiran,56,,60.36,204.067,231.469,268.752,296.058,32,29,Finished,10007888104,5.0,1.218,-0.242,57.0,18.566,18.566,143.707,57.0,18.808,18.808,57.0,20.172,20.172,27.402,50.0,1.606,1.606,56.0,21.473,21.473,37.283,29.0,1.762,1.301,56.0,21.606,21.606,27.306,30.0,1.348,0.133,5
56,86,1086,POTGIETER Johann,57,,64.013,213.39,242.02,281.849,310.577,3,58,Finished,10004065900,59.0,4.871,3.411,59.0,27.889,27.889,149.377,59.0,24.478,24.478,58.0,30.723,30.723,28.63,55.0,2.834,2.834,57.0,34.57,34.57,39.829,57.0,4.308,3.847,57.0,36.125,36.125,28.728,55.0,2.77,1.555,4
57,22,1022,WILLIAMSON Greg,58,,61.707,208.734,245.156,287.915,323.479,43,18,Finished,10006909111,34.0,2.565,1.105,58.0,23.233,23.233,147.027,58.0,22.128,22.128,59.0,33.859,33.859,36.422,58.0,10.626,10.626,59.0,40.636,40.636,42.759,59.0,7.238,6.777,58.0,49.027,49.027,35.564,59.0,9.606,8.391,3
58,2,1002,SHAW Luca,59,,59.142,198.497,230.95,283.025,330.07,60,1,Finished,10008813442,1.0,0.0,-1.46,54.0,12.996,12.996,139.355,56.0,14.456,14.456,56.0,19.653,19.653,32.453,57.0,6.657,6.657,58.0,35.746,35.746,52.075,60.0,16.554,16.093,59.0,55.618,55.618,47.045,60.0,21.087,19.872,2
59,106,1106,SUAREZ ALONSO Angel,60,,59.941,266.703,376.757,418.044,450.873,15,46,Finished,10008831529,2.0,0.799,-0.661,60.0,81.202,81.202,206.762,60.0,81.863,81.863,60.0,165.46,165.46,110.054,59.0,84.258,84.258,60.0,170.765,170.765,41.287,58.0,5.766,5.305,60.0,176.421,176.421,32.829,58.0,6.871,5.656,1


# Data Export

All that's left is to save our data to CSV files so we can quickly import it again for analysis and visualization without making repeated requests to the online servers. This reduces load on the services providing the data, lets us work on our analysis offline, and gives us a local copy in case the results are ever taken down.

In [72]:
df.id = df.id.astype(str)
dfm = df.merge( df2, left_on='id', right_index=True, how='inner' )

In [88]:
df.to_csv( file_prefix + '.results.csv' )
df2.to_csv( file_prefix + '.racers.csv' )
dfm.to_csv( file_prefix + '.merged.csv' )
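
When the files are read back, pandas infers column types again, so an `id` column saved as text would come back as integers unless told otherwise. A minimal round-trip sketch follows, using an in-memory buffer in place of a file; `sample` and `loaded` are illustrative names.

```python
import io
import pandas as pd

sample = pd.DataFrame({'id': ['1016', '1010'], 'split1': [60.602, 60.79]})

buf = io.StringIO()   # stands in for a file on disk
sample.to_csv(buf)
buf.seek(0)

# index_col=0 restores the saved index; dtype keeps id as text
loaded = pd.read_csv(buf, index_col=0, dtype={'id': str})
print(loaded['id'].tolist())  # ['1016', '1010']
```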

--- 

## Credits

### Author: Dominic Wrapson


> **@domwrap**
<br>
<img src="https://png.icons8.com/material/24/000000/github-2.png">
<img src="https://png.icons8.com/material/24/000000/stackoverflow.png">
<img src="https://png.icons8.com/material/24/000000/linkedin.png">
<img src="https://png.icons8.com/material/24/000000/windows8.png">
<img src="https://png.icons8.com/ios-glyphs/24/000000/instagram-new.png">
<img src="https://png.icons8.com/material/24/000000/twitter.png">
<a href="https://medium.com/@domwrap"><img src="https://png.icons8.com/material/24/000000/medium-logo.png"></a>
>
> <img src="https://png.icons8.com/material/24/000000/home.png"> http://domwrap.me
>
><img src="https://png.icons8.com/material/24/000000/cycling-mountain-bike.png"> [Hwulex](https://www.pinkbike.com/u/Hwulex/)


---

#### Special Thanks

Mark Shilton for the inspiration
- http://lookatthestats.blogspot.ca
- https://plus.google.com/+MarkShilton
- https://dirtmountainbike.com/author/mrgeekstats


<a href="https://icons8.com">Icon pack by Icons8</a>