# Enduro World Series (EWS) web scraping and analysis

First off, what is enduro? Basically, it's downhill mountain biking where you have to pedal your way to each stage. Racers are timed on the downhill portion, and then have to pedal their way to the next stage (instead of taking a chair lift, etc.). It looks like:

![https://images.app.goo.gl/64AV4ZtHASXin8Ru9](img/muddy_enduro.gif)

But it is also a long day in the saddle. For example, here's the Strava summary of a race by pro enduro racer [Jesse Melamed](https://www.strava.com/activities/7260508291), who took 2nd place (less than half a second behind first!). On the clock, his time was 03:00.67, whereas the total pedaling time was over three and a half hours!

![](img/example_ews.png)

Enduro racing on the world stage happens in the Enduro World Series, where the best of the best earn points by winning stages and races. At the end of the season, a victor is crowned based on the number of points earned. We're going to look at the results of these races for trends that identify the types of performances that can crown a winner.

## Gather the data - web scraping
The following cells of this notebook download the EWS results for 2022, the only season used here. Like any good data science project, data wrangling takes 80% of the time... in this case, it was much more. I initially attempted to download the PDFs from multiple prior years, but building regular expressions to parse them was just not worth the time. The newest data can be pulled in JSON format, which is much easier to transform into a usable format.

First, we begin by downloading results and scraping the files from https://www.enduroworldseries.com/

In [229]:
import bs4
import requests
import typing_extensions
import re
import copy
import csv
import os
import traceback
import json

import pandas as pd

from PyPDF2 import PdfWriter, PdfReader

Some helper functions to make the web scraping tidier. We're using the `requests` package to make requests to the server.

In [None]:
# TODO: find out the file structure for races - it seems that each result is sorted by class in the form //race_results/class/class#
# TODO: determine the classes and class numbers present

import requests

base_url = "https://a23ea854a37f.arangodb.cloud:8529/_db/EWSDB/api_production//"

payload = ""
headers = {
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.9",
    "Authorization": "Basic QVBJX0VXUzpJRG9BUElUaGluZ3NGb3JQZW9wbGUuITI=",
    "Connection": "keep-alive",
    "DNT": "1",
    "Origin": "https://www.enduroworldseries.com",
    "Referer": "https://www.enduroworldseries.com/",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "cross-site",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "macOS",
    "sec-gpc": "1"
}

In [None]:
def url_to_json_dict(url, payload=payload, headers=headers, save=False, folder="", filename="", year="2022"):
	# fetch the URL and parse the JSON body into Python objects
	results = requests.request("GET", url, data=payload, headers=headers)
	data = json.loads(results.text)

	if save:
		# dump the parsed data; the Response object itself is not JSON serializable
		with open(year+folder+filename+".json", 'w') as f:
			json.dump(data, f)

	return data

In [None]:
def url_to_json_string(url, payload=payload, headers=headers, save=False, folder="", filename="", year="2022"):
	# fetch the URL and return the raw JSON text
	results = requests.request("GET", url, data=payload, headers=headers)

	if save:
		# the response body is already a JSON string, so write it as-is
		with open(year+folder+filename+".json", 'w') as f:
			f.write(results.text)

	return results.text
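
The scraping loop later in this notebook makes one request per rider, so re-running it hits the server many times. A minimal sketch of an on-disk cache is shown here; `cached_json` and its `fetch` argument are hypothetical names that are not part of the notebook, and `fetch` is assumed to be any zero-argument callable returning JSON-serializable data.

```python
import json
from pathlib import Path


def cached_json(path, fetch):
    """Return the JSON stored at `path`, calling `fetch()` only on a cache miss.

    `fetch` is a hypothetical zero-argument callable (for example a lambda
    wrapping one of the notebook's download helpers).
    """
    path = Path(path)
    if path.exists():
        # cache hit: read the saved copy instead of requesting it again
        return json.loads(path.read_text(encoding="utf-8"))

    data = fetch()
    # cache miss: create the folder if needed and save the parsed data
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

It could be called as `cached_json("cache/race_names_2022.json", lambda: url_to_json_dict(base_url + "race_names/2022"))`, where the cache path is only an example.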

In [None]:
url_races_2022 = "race_names/2022"
race_information = url_to_json_dict(base_url+url_races_2022)

In [None]:
race_names_2022 = [race['description'] for race in race_information]

race_url_strings_2022 = {race:race.replace(' ', '%20') for race in race_names_2022}
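
Replacing spaces with `%20` handles race names that contain only spaces as special characters. As a hedged alternative, `urllib.parse.quote` from the standard library percent-encodes every character that is not URL-safe. `race_url_segment` is a hypothetical helper name, and whether the API needs anything beyond spaces encoded is an assumption.

```python
from urllib.parse import quote


def race_url_segment(race_name):
    """Percent-encode a race name so it can be used as one URL path segment.

    safe="" also encodes "/", so a name can never split into two segments.
    """
    return quote(race_name, safe="")


# example with a race name that appears in this notebook
print(race_url_segment("EWS Burke"))
```

For names made of letters, digits, hyphens, and spaces, the result is identical to the `replace(' ', '%20')` approach.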

In [None]:
race_classes_2022 = {race:url_to_json_dict(base_url+"race_classes/2022/"+race_string) for race, race_string in race_url_strings_2022.items()}

In [None]:
# Individual rider query: race_results/rider/[rider class]/[rider #]

rider_result_test = url_to_json_dict(base_url+"race_results/rider/80467121/22930")

rider_result_test = rider_result_test[0]

results_format = ['time', 'stage_result', 'cumulative_result', 'cumulative_behind', 'overall_time']

In [None]:
print(rider_result_test[0]['stage'][:7].lower().replace(" ", "_"))

In [None]:
# create custom unpacking of data - convert data to columns
def unpack_stage_results(rider_results, rider_id, results_format=results_format, save=False):
	header = []
	results = []
	for stage in rider_results:
		# trim 'Stage 1PRO' to 'Stage 1' in case of a pro/queen stage
		# (keeps the first 7 characters, so this assumes single-digit stage numbers)
		stage_data = stage['stage'][:7]
		# convert 'Stage 1' to the column prefix 'stage_1_'
		stage_info = stage_data.lower().replace(" ", "_") + "_"
		for result in results_format:
			header.append(stage_info + result)
			results.append(stage[result])

	return ['rider_id']+header, [rider_id]+results
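
Slicing the label to seven characters works for `'Stage 1'` and `'Stage 1PRO'`, but it would cut `'Stage 10'` down to `'Stage 1'`. Whether the 2022 data contains two-digit stages is not checked here, so the following is only a sketch of a regex-based alternative. `stage_prefix` and `unpack_stage_results_as_dict` are hypothetical names, and returning a dict instead of two lists is a design choice, not the notebook's method.

```python
import re

# same field names as the notebook's `results_format` list
RESULTS_FORMAT = ["time", "stage_result", "cumulative_result",
                  "cumulative_behind", "overall_time"]

# matches 'Stage 1', 'Stage 1PRO', 'Stage 10' and captures the stage number
_STAGE_PATTERN = re.compile(r"stage\s*(\d+)", re.IGNORECASE)


def stage_prefix(label):
    """Turn a stage label into a column prefix such as 'stage_1'."""
    match = _STAGE_PATTERN.match(label.strip())
    if match is None:
        # unexpected label shape: fall back to lowercase with underscores
        return label.strip().lower().replace(" ", "_")
    return f"stage_{int(match.group(1))}"


def unpack_stage_results_as_dict(rider_results, rider_id,
                                 results_format=RESULTS_FORMAT):
    """Flatten one rider's per-stage results into a single dict (one row)."""
    row = {"rider_id": rider_id}
    for stage in rider_results:
        prefix = stage_prefix(stage["stage"])
        for field in results_format:
            # .get() leaves None where a field is missing instead of raising
            row[f"{prefix}_{field}"] = stage.get(field)
    return row
```

Keying each value by its column name means a rider with a missing stage simply lacks those keys, rather than shifting every later value one column to the left.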

In [None]:
# race_class = "80467121"
# TODO adjust for riders not having results in all categories

#for race in race_classes_2022['EWS Burke']:

results_dict = dict()

for race_name in race_classes_2022.keys():

	for race_class in race_classes_2022[race_name]:
	#for race_class in [{'name': 'EWS80 | MEN', '_key': '80470280'},{'name': 'MEN', '_key': '80467139'}]:

		# race information for a specific race class
		race_class_key = race_class['_key']
		race_class_desc = race_class['name']

		# download race results for a race class
		rider_class_results = url_to_json_dict(base_url+"race_results/class/"+race_class_key+"/1000/0")
		rider_class_df = pd.json_normalize(rider_class_results, 'results')

		# the ID for riders in each class (used to download specific results)
		rider_id_list = rider_class_df['rider_id']

		stage_class_results = []
		

		for rider_id in rider_id_list:
			individual_results = url_to_json_dict(f"{base_url}race_results/rider/{race_class_key}/{rider_id}")
			individual_results = individual_results[0]

			if len(stage_class_results) == 0:
				header, results = unpack_stage_results(individual_results, rider_id)
				stage_class_results = [header, results]

			else:
				_, results = unpack_stage_results(individual_results, rider_id)
				stage_class_results.append(results)

		stage_class_df = pd.DataFrame(stage_class_results[1:], columns=stage_class_results[0])

		full_rider_results = pd.merge(rider_class_df, stage_class_df, how='left', on='rider_id')

		# remove the '_key' column
		full_rider_results.drop('_key', inplace=True, axis=1)

		# add in the race class
		full_rider_results.insert(0, 'race_class', value=race_class_desc)

		# add in the race name at the beginning of the dataframe
		full_rider_results.insert(0, 'race_name', value=race_name)

		results_dict.update({race_name + "_" + race_class_desc : full_rider_results})

		full_rider_results.to_csv(race_name + "_2022_" + race_class_desc + ".csv", index=False)
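
The TODO at the top of this cell notes that riders may not have results in all categories. The loop takes its header from the first rider, so a rider with fewer stage results produces a shorter row that no longer lines up with the columns. One possible fix, sketched here with a hypothetical `align_rows` helper and made-up sample rows, is to keep each rider's results as a dict keyed by column name and build the column list from the union of all keys.

```python
def align_rows(rows, fill=None):
    """Align a list of dict rows to one shared column list.

    Columns keep first-seen order; values a row lacks are set to `fill`.
    """
    # dict.fromkeys de-duplicates while preserving insertion order
    columns = list(dict.fromkeys(key for row in rows for key in row))
    table = [[row.get(column, fill) for column in columns] for row in rows]
    return columns, table


# made-up rows: the second rider has no stage 2 result
rows = [
    {"rider_id": 1, "stage_1_time": "3:10.2", "stage_2_time": "4:01.9"},
    {"rider_id": 2, "stage_1_time": "3:15.7"},
]
columns, table = align_rows(rows)
print(columns)
print(table)
```

`pd.DataFrame(rows)` performs the same key-based alignment when given a list of dicts, filling missing entries with `NaN`, so the helper is mainly useful for seeing what happens.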

In [None]:
# combine every table whose key (race name + "_" + class) contains 'EWS ' into one dataframe
EWS_2022_results = pd.concat([results for key, results in results_dict.items() if 'EWS ' in key])
EWS_2022_results.to_csv('EWS_2022_results_by_race.csv',index=False)

# EWSE_2022_results = pd.concat([results for race_class, results in results_dict.items() if 'EWS-E' in race_class])
# EWSE_2022_results.to_csv('EWS-E_2022_results_by_race.csv',index=False)

In [None]:
results_dict

## Analysis of results
Now that the data is present in the 