# UCI MTB DH Data Retrieval

## Setup
#### Import Libraries

If you do not have these libraries available, install them using `pip` (`numpy` is installed automatically as a dependency of `pandas`):

```
pip install requests
pip install bs4
pip install pandas
```

In [42]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

Widen the display area to prevent column wrapping, and always show all columns for debugging.

In [43]:
pd.set_option('display.width', 2000)
pd.set_option('display.max_columns', None)

## Config

Which race data are we collecting?

1. Losinj
1. Fort William
1. Leogang
1. Val di Sole
1. Vallnord
1. Mont-Sainte-Anne
1. La Bresse

In [59]:
race = 1
gender = 'f'

#### Data Sources

The UCI Live Timing API contains a lot of data points, but not all the ones we want (speed being the main one missing). Frustratingly, it does not even include everything the UCI publishes in its own PDF.

Similarly, Roots & Rain also has a lot of the data points, but again not all of them; most notably it's missing timing splits 4 and 5.

Therefore we need to pull from both sources and combine the sets.

Here we specify the URLs for both sources from which we will extract our data. The UCI API URL can be found by loading the Live Timing page, then using the Network tab of your browser's inspector (in Chrome at least) to see the data feed. As the UCI seems to be using a Single Page Application (SPA) here, it's not straightforward to extract this link automatically.

*Links will be added as data sources become available.*

In [60]:
races = [
    [
        'losinj',
         'http://prod.chronorace.be/api/results/uci/dh/race/20180421_dh/3',
         'https://www.rootsandrain.com/race5897/2018-apr-22-mercedes-benz-uci-world-cup-1-losinj/results/filters/m/',
         'http://prod.chronorace.be/api/results/uci/dh/race/20180421_dh/6',
         'https://www.rootsandrain.com/race5897/2018-apr-22-mercedes-benz-uci-world-cup-1-losinj/results/filters/f/'
    ]
    , [ 'fortbill', '', '', '', '' ]
    , [ 'leogang', '', '', '', '' ]
    , [ 'valdisole', '', '', '', '' ]
    , [ 'vallnord', '', '', '', '' ]
    , [ 'msa', '', '', '', '' ]
    , [ 'labresse', '', '', '', '' ]
]
key = 0 if gender == 'm' else 2
raceName = races[race-1][0]
urlUci = races[race-1][key+1]
urlRoots = races[race-1][key+2]
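
The positional lookup above relies on every entry listing its URLs in the same order. As an alternative sketch (not the notebook's own structure), the same configuration could be keyed by name instead of by position. The names `sources` and `get_urls` are hypothetical and introduced here only for illustration; the URLs are the Losinj ones from the list above.

```python
# Hypothetical alternative layout: one dict per race, keyed by gender and source.
sources = {
    1: {
        'name': 'losinj',
        'm': {
            'uci': 'http://prod.chronorace.be/api/results/uci/dh/race/20180421_dh/3',
            'roots': 'https://www.rootsandrain.com/race5897/2018-apr-22-mercedes-benz-uci-world-cup-1-losinj/results/filters/m/',
        },
        'f': {
            'uci': 'http://prod.chronorace.be/api/results/uci/dh/race/20180421_dh/6',
            'roots': 'https://www.rootsandrain.com/race5897/2018-apr-22-mercedes-benz-uci-world-cup-1-losinj/results/filters/f/',
        },
    },
}

def get_urls(race, gender):
    """Return (name, uci_url, roots_url), or raise KeyError if the race has no sources yet."""
    if race not in sources:
        raise KeyError('No data sources configured for race %d' % race)
    entry = sources[race]
    return entry['name'], entry[gender]['uci'], entry[gender]['roots']

race_name, url_uci, url_roots = get_urls(1, 'f')
```

With this layout a race whose links are not yet available fails with an explicit message rather than silently producing an empty URL.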

# UCI API
### Load Data

These two lines make the actual request to the server, then convert the JSON string response into a usable Python dict (deserialization).

In [61]:
r = requests.get( urlUci )
d = r.json()
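
The request above assumes the server answers promptly and successfully. A slightly more defensive version might look like the sketch below. The function name `fetch_json`, the 10 second timeout and the injectable `get` parameter are assumptions made for illustration; `timeout`, `raise_for_status()` and `json()` are standard parts of `requests`.

```python
import requests

def fetch_json(url, timeout=10, get=requests.get):
    """Fetch a URL and return the decoded JSON body.

    `get` is injectable so the function can be exercised without network access.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.json()
```

In the notebook this would be called as `d = fetch_json( urlUci )`.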

The API response has three main sections:

1. `Last Finisher`
 - Racers in order of start time
2. `Results`
 - Racers in finishing rank order
3. `Riders`
 - Personal details on all racers
 
Each contains many data points. To see all the contained data, you can un-comment and execute any of the lines in the next section to explore more.

In [47]:
# display( d )
# display( d['Results'][7] )
# display( d['Riders']['1034'] )
# display( d['Results'][61] )
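
Before displaying whole sections, it can help to get an overview of what each one holds. The sketch below uses a hypothetical helper, `summarise`, and a made-up miniature stand-in for `d`; the real response has many more fields and entries.

```python
def summarise(payload):
    """Return {section name: (type name, number of entries)} for a dict payload."""
    summary = {}
    for key, value in payload.items():
        size = len(value) if isinstance(value, (list, dict)) else None
        summary[key] = (type(value).__name__, size)
    return summary

# Made-up stand-in for `d`: Results is a list, Riders is a dict keyed by strings.
sample = {
    'Results': [{'RaceNr': 2001, 'Status': 'Finished'}],
    'Riders': {'2001': {'PrintName': 'NICOLE Myriam'}},
}
overview = summarise(sample)
print(overview)
```

Calling `summarise( d )` on the real response would list each section with its type and size.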

### Extract Data

Here we iterate over the `Results` sub-set of data to extract the information we care about: basically those that finished the race, some identifying info, and their splits.

If you looked at the returned data set in detail in the last step, you might have noticed that the rider's name is not stored next to their result; riders are only identified by a reference number. To facilitate our analysis later on, it is useful to import each rider's name at this stage by cross-referencing the `Riders` sub-set.

We start with an empty list `lst` and in each loop iteration add an entry (actually a dict) to that list for each rider.

In [63]:
splits = len(d['Results'][0]['Times'] )
lst = []
for idx, row in enumerate( d['Results'] ):
    fin = "Finished" == row['Status']
    res = {
        'rank': row['Position']  if fin else idx+1,
        'name': d['Riders'][str(row['RaceNr'])]['PrintName'],
        'id': row['RaceNr'],
        'uci': d['Riders'][str(row['RaceNr'])]['UciRiderId'],
        'bib': d['Riders'][str(row['RaceNr'])]['RaceNr'],
        'status': row['Status'],
        'speed': np.nan
    }

    # Add all splits to result set
    for split in range( 0, splits ):
        res['split' + str(split+1)] = row['Times'][split]['RaceTime']/1000 if fin else np.nan

    # Append result set to list
    lst.append(res)
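
To see the cross-referencing in isolation, here is a minimal sketch that runs the same logic on a made-up two-rider payload. The field names match those used in the loop above, but the values for the second rider (including the `'DNF'` status and `Position` of 0) are invented for illustration.

```python
import math

# Made-up miniature of the API response, using the field names the loop relies on.
sample = {
    'Results': [
        {'RaceNr': 2001, 'Position': 1, 'Status': 'Finished',
         'Times': [{'RaceTime': 23472}, {'RaceTime': 60224}]},
        {'RaceNr': 2002, 'Position': 0, 'Status': 'DNF',
         'Times': [{'RaceTime': 22902}, {'RaceTime': 0}]},
    ],
    'Riders': {
        '2001': {'PrintName': 'NICOLE Myriam'},
        '2002': {'PrintName': 'SEAGRAVE Tahnee'},
    },
}

records = []
for idx, row in enumerate(sample['Results']):
    finished = row['Status'] == 'Finished'
    # Riders is keyed by the reference number as a string, hence str()
    rider = sample['Riders'][str(row['RaceNr'])]
    record = {'rank': row['Position'] if finished else idx + 1,
              'name': rider['PrintName']}
    # Times are in milliseconds; non-finishers get NaN for every split
    for n, t in enumerate(row['Times'], start=1):
        record['split%d' % n] = t['RaceTime'] / 1000 if finished else math.nan
    records.append(record)
```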

This line loads the completed list into a Pandas DataFrame so that we can easily write it out to CSV later on.

In [64]:
df = pd.DataFrame( lst )

#### Expand Dataset

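Calculate and add all the extra columns we need for split and sector differences and their rankings. A sector is the time between two consecutive splits. The `_vs_winner` columns compare against the first row, which is the race winner because `Results` is in finishing rank order.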

In [65]:
for i in range( 1, splits+1 ):
    split = 'split' + str(i)
    sector = split + '_sector'
    df[split + '_rank'] = df[split].rank(method='dense')
    df[split + '_vs_best'] = (df[split] - df[split].min())
    df[split + '_vs_winner'] = (df[split] - df[split][0])
    
    if i > 1:
        df[split + '_sector'] = df[split] - df['split' + str(i-1)]
        df[split + '_sector_rank'] = df[sector].rank(method='dense')
        df[split + '_sector_vs_best'] = (df[sector] - df[sector].min())
        df[split + '_sector_vs_winner'] = (df[sector] - df[sector][0])
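
As a worked example of these calculations, the sketch below applies them to the first two splits of three riders, with times taken from the output shown in this notebook. The name `demo` is introduced only for illustration, and the ranks are relative to these three riders rather than the full field.

```python
import pandas as pd

demo = pd.DataFrame({
    'name': ['NICOLE Myriam', 'ATHERTON Rachel', 'SEAGRAVE Tahnee'],
    'split1': [23.472, 22.847, 22.902],
    'split2': [60.224, 61.748, 61.248],
})

# Dense ranking: tied times share a rank and no rank numbers are skipped
demo['split1_rank'] = demo['split1'].rank(method='dense')
# Gap to the fastest time at this split
demo['split1_vs_best'] = demo['split1'] - demo['split1'].min()
# Gap to the race winner (first row); negative means faster than the winner here
demo['split1_vs_winner'] = demo['split1'] - demo['split1'].iloc[0]
# Sector time: elapsed time between split 1 and split 2
demo['split2_sector'] = demo['split2'] - demo['split1']
demo['split2_sector_rank'] = demo['split2_sector'].rank(method='dense')

print(demo.round(3))
```

The winner was only third fastest to the first split, which is why her `split1_vs_best` is 0.625 while the other two riders show negative `split1_vs_winner` values.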

We can take a peek at our data at this point to make sure it looks how we expect.

At this point the `speed` column is NaN (Not a Number) for all racers. This will be filled in below.

In [66]:
display( df.head() )

Unnamed: 0,bib,id,name,rank,speed,split1,split2,split3,split4,split5,status,uci,split1_rank,split1_vs_best,split1_vs_winner,split2_rank,split2_vs_best,split2_vs_winner,split2_sector,split2_sector_rank,split2_sector_vs_best,split2_sector_vs_winner,split3_rank,split3_vs_best,split3_vs_winner,split3_sector,split3_sector_rank,split3_sector_vs_best,split3_sector_vs_winner,split4_rank,split4_vs_best,split4_vs_winner,split4_sector,split4_sector_rank,split4_sector_vs_best,split4_sector_vs_winner,split5_rank,split5_vs_best,split5_vs_winner,split5_sector,split5_sector_rank,split5_sector_vs_best,split5_sector_vs_winner
0,1,2001,NICOLE Myriam,1,,23.472,60.224,92.252,135.352,160.706,Finished,10004535237,3.0,0.625,0.0,1.0,0.0,0.0,36.752,1.0,0.0,0.0,1.0,0.0,0.0,32.028,1.0,0.0,0.0,1.0,0.0,0.0,43.1,2.0,0.129,0.0,1.0,0.0,0.0,25.354,4.0,0.566,0.0
1,4,2004,ATHERTON Rachel,2,,22.847,61.748,95.397,139.415,164.265,Finished,10003434487,1.0,0.0,-0.625,3.0,1.524,1.524,38.901,4.0,2.149,2.149,2.0,3.145,3.145,33.649,2.0,1.621,1.621,2.0,4.063,4.063,44.018,4.0,1.047,0.918,2.0,3.559,3.559,24.85,2.0,0.062,-0.504
2,2,2002,SEAGRAVE Tahnee,3,,22.902,61.248,95.686,139.696,164.484,Finished,10007414016,2.0,0.055,-0.57,2.0,1.024,1.024,38.346,2.0,1.594,1.594,3.0,3.434,3.434,34.438,7.0,2.41,2.41,3.0,4.344,4.344,44.01,3.0,1.039,0.91,3.0,3.778,3.778,24.788,1.0,0.0,-0.566
3,8,2008,CABIROU Marine,4,,24.295,63.484,97.391,140.362,165.935,Finished,10009563069,6.0,1.448,0.823,5.0,3.26,3.26,39.189,5.0,2.437,2.437,5.0,5.139,5.139,33.907,4.0,1.879,1.879,4.0,5.01,5.01,42.971,1.0,0.0,-0.129,4.0,5.229,5.229,25.573,7.0,0.785,0.219
4,28,2028,RAVANEL Cecile,5,,24.986,63.606,97.407,142.042,168.416,Finished,10002816317,9.0,2.139,1.514,6.0,3.382,3.382,38.62,3.0,1.868,1.868,6.0,5.155,5.155,33.801,3.0,1.773,1.773,5.0,6.69,6.69,44.635,5.0,1.664,1.535,5.0,7.71,7.71,26.374,12.0,1.586,1.02


#### Rider Data

Saving the personal information about each racer is much easier as we can just export the entire `Riders` dataset. However, the rows and columns are the wrong way round, so we use the `.T` attribute to *transpose* the DataFrame, which flips its axes.

In [67]:
df2 = pd.DataFrame( d['Riders'] )
df2 = df2.T
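
The effect of the transpose is easiest to see on a small example. The sketch below uses a cut-down `riders` dict with three fields per rider (values taken from the output in this notebook); `wide`, `tall` and `same` are names introduced for illustration. `pd.DataFrame.from_dict` with `orient='index'` is shown as an equivalent way to get one row per rider directly.

```python
import pandas as pd

riders = {
    '2001': {'PrintName': 'NICOLE Myriam', 'Nation': 'FRA', 'RaceNr': 1},
    '2002': {'PrintName': 'SEAGRAVE Tahnee', 'Nation': 'GBR', 'RaceNr': 2},
}

wide = pd.DataFrame(riders)  # one column per rider, one row per field
tall = wide.T                # one row per rider, one column per field
same = pd.DataFrame.from_dict(riders, orient='index')  # builds the tall shape directly

print(wide.shape, tall.shape)
```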

Here we can glimpse the first few rows of our `DataFrame` and check the data looks as we expect.

In [68]:
display( df2.head() )

Unnamed: 0,BirthDate,CategoryCode,FamilyName,GivenName,Id,Nation,Outfit,PrintName,RaceId,RaceNr,ScoreboardName,StartOrder,StartTime,UciCode,UciRank,UciRiderId,UciTeamCode,UciTeamId,UciTeamName,WorldCupRank
2001,1990-02-08T00:00:00,WE,NICOLE,Myriam,102001,FRA,NCh,NICOLE Myriam,0,1,NICOLE M,14,46710000,FRA19900208,1,10004535237,CVN,1590,COMMENCAL / VALLNORD,1
2002,1995-06-15T00:00:00,WE,SEAGRAVE,Tahnee,102002,GBR,,SEAGRAVE Tahnee,0,2,SEAGRAVE T,12,46290000,GBR19950615,3,10007414016,FMD,1863,TRANSITION BIKES / MUC-OFF FACTORY RACING,2
2003,1988-06-13T00:00:00,WE,HANNAH,Tracey,102003,AUS,NCh,HANNAH Tracey,0,3,HANNAH T,15,46920000,AUS19880613,2,10003732258,URT,1608,POLYGON UR,3
2004,1987-12-06T00:00:00,WE,ATHERTON,Rachel,102004,GBR,NCh,ATHERTON Rachel,0,4,ATHERTON R,16,47130000,GBR19871206,7,10003434487,TDH,1598,TREK FACTORY RACING DH,4
2005,1986-09-19T00:00:00,WE,SIEGENTHALER,Emilie,102005,SUI,,SIEGENTHALER Emilie,0,5,SIEGENTHALER,11,46080000,SUI19860919,6,10004167243,PFR,1864,PIVOT FACTORY RACING,5


# Roots and Rain
### Load Data

Similar to the UCI API, we make a request to the server with the previously declared `urlRoots` variable. This time, however, we simply load the raw content of the response, which is the HTML code of the web page. There is no JSON API to read here, so there is nothing to deserialize.

Next we invoke a utility called `BeautifulSoup` to help us extract the data from this messy HTML code.

In [69]:
r = requests.post( urlRoots )
c = r.content
soup = BeautifulSoup( c, "html.parser" )

### Extract Data

If you look at the Roots and Rain page you'll see the results listed in a table. What we do here is find all the rows of that table so we can extract the information we need.

Specifically we are looking for instances of `tr` (table row) with a class that contains `c-`, as this is a common denominator I discovered when looking through the code with the browser inspector. The rows of interest have a class beginning with `c-`; note that the filter below tests for `c-` anywhere in the class name, which is slightly broader.

In [70]:
rows = soup.find_all( "tr", class_=lambda x: x and 'c-' in x )
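
To illustrate how the class filter behaves, here is a small sketch on made-up HTML. The class names `c-elite`, `odd`, `misc-row` and `header` are invented and are not taken from the real page. When `class_` is given a function, BeautifulSoup calls it with each class of a tag, or with `None` if the tag has no class, which is why the lambda checks `x` first.

```python
from bs4 import BeautifulSoup

# Made-up table: class names here are illustrative only.
html = """
<table>
  <tr class="header"><th>Pos</th><th>Name</th></tr>
  <tr class="c-elite"><td>1</td><td>Rider A</td></tr>
  <tr class="c-elite odd"><td>2</td><td>Rider B</td></tr>
  <tr class="misc-row"><td>note</td><td></td></tr>
  <tr><td>no class</td><td></td></tr>
</table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Substring test, as used in the notebook: also matches "misc-row"
contains = demo_soup.find_all("tr", class_=lambda x: x and 'c-' in x)
# Stricter prefix test: only classes that begin with "c-"
starts = demo_soup.find_all("tr", class_=lambda x: x and x.startswith('c-'))

print(len(contains), len(starts))
```

If the substring test ever picks up unwanted rows on the real page, the prefix version would be a drop-in replacement.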

Similar to the UCI data set, here we will iterate over each row in our data set--basically each table row from the web page--and extract the bits we need.

Racer speed is the metric we're interested in, but in order to match it to our existing data set we need a corresponding identifier. We therefore also extract the racer licence number, which exists in both sets and lets us match them together: it is the common key shared by both sets of data.

To summarise:
1. Extract licence number and corresponding speed
2. Import speed to existing DataFrame matching racers by licence

The `if` condition in the middle breaks out of the loop once the position extracted from a row is empty, which marks the end of the Elite finishers. Those are the only riders in our existing data set, so nobody further down the page could be matched.

In [71]:
for row in rows:
    cells = row.find_all( "td" )

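    speed = cells[7].text[:5]   # first five characters of the speed cell
    licence = cells[4].text     # UCI licence number, used to match riders
    bib = int( cells[1].text )  # race number, used as a fallback match
    pos = cells[0].text[8:]     # position cell with its first eight characters dropped
    if "" == pos: break         # empty position: end of the Elite finishers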
    # Match rider by UCI licence if present, otherwise fallback to bib
    if len(df2.loc[df2['UciRiderId'] == licence].index.values ):
        rid = int(df2.loc[df2['UciRiderId'] == licence].index.values[0])
    else:
        rid = int( df2.loc[df2['RaceNr'] == bib].index.values[0] )
    df.loc[df['id'] == rid, 'speed'] = speed
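
The two summary steps can also be expressed as a vectorised lookup instead of a row-by-row assignment. The sketch below is an alternative illustration, not the notebook's method: it assumes the licence numbers are strings in both sets, omits the bib fallback, and uses made-up frame names (`results`, `scraped`). The licence numbers and speeds are taken from the output in this notebook.

```python
import pandas as pd

# Existing results, one row per rider (cut down to two columns)
results = pd.DataFrame({
    'id': [2001, 2004],
    'uci': ['10004535237', '10003434487'],
})
# Values scraped from the web page, in page order, still as text
scraped = pd.DataFrame({
    'licence': ['10003434487', '10004535237'],
    'speed': ['43.03', '41.5'],
})

# Convert speed text to numbers; anything unparseable becomes NaN
scraped['speed'] = pd.to_numeric(scraped['speed'], errors='coerce')
# Build a licence -> speed lookup and apply it to every row at once
speed_by_licence = scraped.set_index('licence')['speed']
results['speed'] = results['uci'].map(speed_by_licence)

print(results)
```

Riders with no matching licence are left as NaN, which makes unmatched rows easy to spot afterwards.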

As before, we can take another look at our data, with the `speed` column now filled in.

In [72]:
display( df.head() )

Unnamed: 0,bib,id,name,rank,speed,split1,split2,split3,split4,split5,status,uci,split1_rank,split1_vs_best,split1_vs_winner,split2_rank,split2_vs_best,split2_vs_winner,split2_sector,split2_sector_rank,split2_sector_vs_best,split2_sector_vs_winner,split3_rank,split3_vs_best,split3_vs_winner,split3_sector,split3_sector_rank,split3_sector_vs_best,split3_sector_vs_winner,split4_rank,split4_vs_best,split4_vs_winner,split4_sector,split4_sector_rank,split4_sector_vs_best,split4_sector_vs_winner,split5_rank,split5_vs_best,split5_vs_winner,split5_sector,split5_sector_rank,split5_sector_vs_best,split5_sector_vs_winner
0,1,2001,NICOLE Myriam,1,41.5,23.472,60.224,92.252,135.352,160.706,Finished,10004535237,3.0,0.625,0.0,1.0,0.0,0.0,36.752,1.0,0.0,0.0,1.0,0.0,0.0,32.028,1.0,0.0,0.0,1.0,0.0,0.0,43.1,2.0,0.129,0.0,1.0,0.0,0.0,25.354,4.0,0.566,0.0
1,4,2004,ATHERTON Rachel,2,43.03,22.847,61.748,95.397,139.415,164.265,Finished,10003434487,1.0,0.0,-0.625,3.0,1.524,1.524,38.901,4.0,2.149,2.149,2.0,3.145,3.145,33.649,2.0,1.621,1.621,2.0,4.063,4.063,44.018,4.0,1.047,0.918,2.0,3.559,3.559,24.85,2.0,0.062,-0.504
2,2,2002,SEAGRAVE Tahnee,3,42.89,22.902,61.248,95.686,139.696,164.484,Finished,10007414016,2.0,0.055,-0.57,2.0,1.024,1.024,38.346,2.0,1.594,1.594,3.0,3.434,3.434,34.438,7.0,2.41,2.41,3.0,4.344,4.344,44.01,3.0,1.039,0.91,3.0,3.778,3.778,24.788,1.0,0.0,-0.566
3,8,2008,CABIROU Marine,4,42.03,24.295,63.484,97.391,140.362,165.935,Finished,10009563069,6.0,1.448,0.823,5.0,3.26,3.26,39.189,5.0,2.437,2.437,5.0,5.139,5.139,33.907,4.0,1.879,1.879,4.0,5.01,5.01,42.971,1.0,0.0,-0.129,4.0,5.229,5.229,25.573,7.0,0.785,0.219
4,28,2028,RAVANEL Cecile,5,42.03,24.986,63.606,97.407,142.042,168.416,Finished,10002816317,9.0,2.139,1.514,6.0,3.382,3.382,38.62,3.0,1.868,1.868,6.0,5.155,5.155,33.801,3.0,1.773,1.773,5.0,6.69,6.69,44.635,5.0,1.664,1.535,5.0,7.71,7.71,26.374,12.0,1.586,1.02


# Data Export

All that's left is to save our data to CSV files so we can quickly import it again for analysis and visualization without making repeated requests to the online servers. This reduces load on the services providing the data, is much quicker than fetching it again each time, lets us work on our analysis offline, and gives us a local copy in case the results are ever taken down.

In [77]:
event = 'dh'
filePrefix = event + '_' + str(race) + '_' + raceName + '_' + gender
df.to_csv( filePrefix + '.results.csv' )
df2.to_csv( filePrefix + '.racers.csv' )
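
When loading these files again, note that `to_csv` writes the DataFrame index as an unnamed first column. The sketch below shows the round trip using an in-memory buffer in place of a file; the names `frame`, `naive` and `restored` are introduced for illustration.

```python
import io
import pandas as pd

frame = pd.DataFrame({'name': ['NICOLE Myriam'], 'split1': [23.472]})

buffer = io.StringIO()
frame.to_csv(buffer)  # same call as above, writing to memory instead of a file

buffer.seek(0)
naive = pd.read_csv(buffer)  # the index comes back as a column named 'Unnamed: 0'

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # treat the first column as the index again

print(list(naive.columns))
print(list(restored.columns))
```

For the files written above, that would be `pd.read_csv( filePrefix + '.results.csv', index_col=0 )`.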

--- 

## Credits

### Author: Dominic Wrapson


> **@domwrap**
<br>
<img src="https://png.icons8.com/material/24/000000/github-2.png">
<img src="https://png.icons8.com/material/24/000000/stackoverflow.png">
<img src="https://png.icons8.com/material/24/000000/linkedin.png">
<img src="https://png.icons8.com/material/24/000000/windows8.png">
<img src="https://png.icons8.com/ios-glyphs/24/000000/instagram-new.png">
<img src="https://png.icons8.com/material/24/000000/twitter.png">
<a href="https://medium.com/@domwrap"><img src="https://png.icons8.com/material/24/000000/medium-logo.png"></a>
>
> <img src="https://png.icons8.com/material/24/000000/home.png"> http://domwrap.me
>
><img src="https://png.icons8.com/material/24/000000/cycling-mountain-bike.png"> [Hwulex](https://www.pinkbike.com/u/Hwulex/)


---

#### Special Thanks

Mark Shilton for the inspiration
- http://lookatthestats.blogspot.ca
- https://plus.google.com/+MarkShilton
- https://dirtmountainbike.com/author/mrgeekstats


<a href="https://icons8.com">Icon pack by Icons8</a>