This repository extracts historical short track speed skating data and analyzes it to reveal trends.
Data for each International Skating Union (ISU) short track race since 1994 is scraped from the ISU results website: https://shorttrack.sportresult.com/.
The scraped data is combined into rounds_with_splits.csv, which contains one row for each athlete in each race. The fields contain information about the athlete (or relay team), including name, nationality, gender, starting position, finishing position, and finishing time. Where available, the athlete's lap split times and position at the end of each lap are also listed.
The lap split data is further broken down to create individual_athlete_lap_data.pk. Each row in this file shows how many positions the athlete (or relay team) gained or lost during the course of one lap of one race.
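As a rough illustration of that breakdown, the sketch below derives the positions gained or lost on each lap from a list of end-of-lap positions. The function name and input layout are hypothetical and are not taken from this repository's code; it assumes the first lap is compared against the starting position.

```python
def positions_gained_per_lap(start_position, lap_positions):
    """Return the positions gained (+) or lost (-) on each lap.

    start_position: position on the start line (1 = front).
    lap_positions: position at the end of each lap, in lap order.
    Both inputs are an assumed layout, not the repository's actual schema.
    """
    changes = []
    previous = start_position
    for position in lap_positions:
        # Moving from 4th to 3rd is a gain of one position.
        changes.append(previous - position)
        previous = position
    return changes


# A skater who starts 4th, then sits 4th, 3rd, 3rd, and finishes the last lap 1st.
print(positions_gained_per_lap(4, [4, 3, 3, 1]))  # [0, 1, 0, 2]
```

A positive value means the athlete moved toward the front during that lap, and a negative value means they were passed.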
To collect race data:
pip install -r shorttrack_scrapy/requirements-scrapy.txt
scrapy crawl shorttrack_spider
The full scraping operation reads about 57,000 pages and takes approximately 2 hours, depending on the execution environment and connection speeds.
The CSV version of the lap data is too large to commit directly, so it has been saved as a Pickle file with .zip compression (individual_athlete_lap_data.pk). The dashboard takes care of loading this file, but if you wish to load it directly:
# in a Python shell
import pandas as pd
laptimes_df = pd.read_pickle('data/full/individual_athlete_lap_data.pk')
The ISU's terms of use forbid the "permanent copying or storage" of their data. Whether storage on GitHub constitutes "permanence" is unclear - I will take down the data portion of this repository upon request.
Athlete trends are extracted from the dataset and displayed in an interactive dashboard.
Kwak Yoon-Gy is known for starting the 1500m distance near the back of the pack and making a big pass near the end of the race. The data supports this, showing a low tendency to start in the first two positions, a skew of two-position passes toward the later laps, and no tendency to set the pace early on. He doesn't have the fastest top speed, as shown by his following speed being faster than his leading speed. Finally, even in races where multiple skaters advance, he still often goes for the top spot.
The dashboard is deployed at shorttrack.herokuapp.com. Due to memory limitations with the free tier of Heroku, this deployment is for demo purposes only - a limited number of athletes are included. To deploy with all athlete data, construct your own deployment with the instructions below.
The dashboard is generated by shorttrack_ui.py. To run the dashboard locally:
# step 1: clone the repository to your environment of choice
git clone https://github.com/alexanderhale/shorttrack.git
# step 2: install requirements
pip install -r requirements.txt
# step 3: start Panel dashboard
panel serve athlete_profile/shorttrack_ui.py
The dashboard will run on localhost with Panel's default settings (port 5006 unless changed), or you can pass any `panel serve` command-line arguments you wish, such as `--port`.
- Many more athlete trends could be extracted - suggestions are welcome!
- Athletes are currently only being compared to their own results - extracting some global trends would allow comparison of an athlete to other athletes (e.g. is this athlete's average start time fast, slow, or average?).
- On which lap does the eventual winner of a race most often make their pass to the front?
- The dashboard could do with some beautifying.
- Some machine learning could be applied to learn deeper trends - for example, is there a pattern of positions within the pack that the winner often follows?
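One of the ideas above, finding the lap on which race winners most often move to the front, could be approached along the following lines. This is only a sketch under stated assumptions: the function names and the list-of-positions input are hypothetical, and "the pass to the front" is interpreted here as the lap on which the skater moves into first place for the final time.

```python
from collections import Counter


def final_lead_lap(lap_positions):
    """Return the 1-indexed lap on which the skater took the lead for good.

    lap_positions: position at the end of each lap (assumed layout).
    Returns None if the skater is not in first place after the last lap.
    """
    if not lap_positions or lap_positions[-1] != 1:
        return None
    lap = len(lap_positions)
    # Walk backwards while the previous lap was also finished in first place.
    while lap > 1 and lap_positions[lap - 2] == 1:
        lap -= 1
    return lap


def most_common_lead_lap(winner_lap_positions):
    """Return the most frequent final-lead lap across many winners, or None."""
    laps = [final_lead_lap(positions) for positions in winner_lap_positions]
    counts = Counter(lap for lap in laps if lap is not None)
    return counts.most_common(1)[0][0] if counts else None


# Made-up end-of-lap positions for four race winners, for illustration only.
winners = [[3, 2, 1, 1], [1, 1, 1, 1], [2, 2, 1, 1], [4, 3, 2, 1]]
print(most_common_lead_lap(winners))  # 3
```

A skater who is first at the end of every lap is counted as leading from lap 1, since the end-of-lap positions alone cannot show whether a pass happened during that lap.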