ProCyclingStats Points Prediction

Objective

The goal of this project was to determine how accurately I could predict ProCyclingStats points with linear regression using features scraped from the race startlist, rider resume, and individual rankings pages of "the single most useful cycling database on the worldwide web" (Rouleur, 2017).

Tools

Requests for pulling content
BeautifulSoup for saving and parsing HTML
Numpy and Pandas for data manipulation
Statsmodels and Scikit-learn for modeling
Matplotlib and Seaborn for plotting

Scraping & Parsing

I used the following races' startlists to generate a list of riders whose profile pages I would scrape for PCS points by season as well as a few other characteristics for my model.

Startlists - 85 total

Since 2007: Giro d'Italia, Tour de France
Since 2010: Vuelta a España
Since 2013: Amstel Gold Race, Gent–Wevelgem, Il Lombardia, Liège–Bastogne–Liège, Milan–San Remo, Paris–Nice, Paris–Roubaix, Ronde van Vlaanderen, Strade Bianche, Tour of California, Tour de Suisse

This produced 1,730 unique riders and 17,806 rider pages, or about 10 seasons of data per rider. I also scraped the PCS individual rankings for each year since 2005, totaling 292 pages; however, this ranking data has not yet been incorporated into my modeling.
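
As a rough illustration of the scraping step, the sketch below downloads a page with Requests and collects unique rider links with BeautifulSoup. It is not the notebook's actual code: the helper names (`fetch_html`, `extract_rider_paths`), the assumption that rider links have an `href` starting with `rider/`, and the sample HTML are all hypothetical.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=30):
    """Download one page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_rider_paths(html):
    """Return the unique rider links found in a startlist page, in page order.

    Assumption: rider profile links have an href that starts with "rider/".
    """
    soup = BeautifulSoup(html, "html.parser")
    seen = []
    for link in soup.find_all("a", href=True):
        href = link["href"].lstrip("/")
        if href.startswith("rider/") and href not in seen:
            seen.append(href)
    return seen


# Invented sample markup, used here instead of a live request.
sample_html = """
<ul>
  <li><a href="rider/example-rider-one">Example Rider One</a></li>
  <li><a href="rider/example-rider-two">Example Rider Two</a></li>
  <li><a href="rider/example-rider-one">Example Rider One</a></li>
  <li><a href="team/example-team">Example Team</a></li>
</ul>
"""
rider_paths = extract_rider_paths(sample_html)
print(rider_paths)
```

Deduplicating at this stage matters because the same rider appears on many startlists, while each profile page only needs to be scraped once.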

Features per Rider - those in italics were parsed but not loaded into model

rider name
team name
nationality
date of birth (converted to age by year)
height
weight
lifetime points by category: one day, general classification, time trial, sprint
total race seasons
dates raced by year
race names by year
distance raced by year
number of race days by year
number of stage races by year
UCI points won by year
PCS points won by year

Modeling & Evaluation

The target variable (S0) for each rider was the PCS points from their most recently completed full racing season (i.e. 2018 was excluded). Prior seasons were indexed relative to that one as S1, S2, etc., which allowed retired riders to be included alongside currently active ones even though their careers span different, non-overlapping years. Besides points, age was the only variable used on a per-season basis, with the same Sn nomenclature.
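
The relative season indexing can be sketched with Pandas as follows. The riders, years, and column names are made up for illustration, and the lag is computed from calendar years, which is an assumption about how gaps in a career would be handled.

```python
import pandas as pd

# Hypothetical long-format data: one row per rider per completed season.
points = pd.DataFrame(
    {
        "rider": ["A", "A", "A", "B", "B", "B"],
        "year": [2015, 2016, 2017, 2009, 2010, 2011],
        "pcs_points": [100, 250, 400, 30, 60, 20],
    }
)

# Lag 0 is each rider's most recent season, lag 1 the one before, and so on.
points["lag"] = points.groupby("rider")["year"].transform("max") - points["year"]

# Pivot to one row per rider with columns S0, S1, S2, ...
wide = points.pivot(index="rider", columns="lag", values="pcs_points")
wide.columns = [f"S{lag}" for lag in wide.columns]
print(wide)
```

Rider A (active through 2017) and rider B (retired after 2011) end up in the same S0/S1/S2 columns even though their careers do not overlap.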

Data was split into 80% training and 20% testing (holdout) sets. All model evaluation and selection was performed with 5-fold cross-validation on the training set. A single final test was done on the 20% out of sample data.
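
A minimal sketch of this evaluation protocol with Scikit-learn, using synthetic stand-in data rather than the scraped dataset. The random seeds, the shuffled `KFold`, and the single synthetic feature are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate, train_test_split

# Synthetic stand-ins: one feature (like S1 points) and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.gamma(shape=1.0, scale=200.0, size=(500, 1))
y = 0.8 * X[:, 0] + rng.normal(0.0, 50.0, size=500)

# 80% training, 20% holdout; the holdout is not touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5-fold cross-validation on the training set only.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    LinearRegression(),
    X_train,
    y_train,
    cv=cv,
    scoring=("r2", "neg_root_mean_squared_error"),
)
cv_r2 = scores["test_r2"].mean()
cv_rmse = -scores["test_neg_root_mean_squared_error"].mean()

# Single final check on the holdout set after a model has been chosen.
final_model = LinearRegression().fit(X_train, y_train)
test_r2 = final_model.score(X_test, y_test)
print(f"avg. CV R²: {cv_r2:.3f}, avg. CV RMSE: {cv_rmse:.1f}, test R²: {test_r2:.3f}")
```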

Baseline

Ordinary least squares linear regression with a single feature, S1 points
0.642 R²
0.578 avg. CV R²
183 avg. CV RMSE

OLS with 7 features

Addition of selected features above
0.708 R²
0.654 avg. CV R²
166 avg. CV RMSE

OLS with 74 features

All non-italics features above with nationality as dummy variables
0.760 R²
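
Encoding nationality as dummy variables could look like the sketch below. The rows are invented, and dropping the first category to avoid perfect collinearity with the intercept is an assumption; the notebook may have kept every category.

```python
import pandas as pd

# Invented example rows; the real feature table has many more columns.
riders = pd.DataFrame(
    {
        "S1": [400, 60, 120, 15],
        "height": [1.80, 1.75, 1.86, 1.70],
        "nationality": ["Italy", "Spain", "Italy", "Belgium"],
    }
)

# One 0/1 column per nationality, dropping the first (alphabetical) category.
features = pd.get_dummies(riders, columns=["nationality"], drop_first=True, dtype=int)
print(features)
```

With many nationalities in the data, this one categorical column expands into dozens of features, which accounts for most of the jump from 7 to 74.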

Lasso with scaled features

Regularization with optimal alpha selected 21 of 74 features
0.749 R²
0.715 avg. CV R²
151 avg. CV RMSE
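
Feature selection with Lasso can be sketched as below, on synthetic data where only three of ten features matter. Using `LassoCV` inside a pipeline with `StandardScaler` is one reasonable way to pick alpha; it is an assumption, not necessarily how the notebook searched for it.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first three of ten features drive the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = 5.0 * X[:, 0] - 3.0 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(0.0, 1.0, size=300)

# Scale features so the L1 penalty treats them comparably, then let
# cross-validation choose the regularization strength alpha.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X, y)

lasso = model.named_steps["lassocv"]
selected = np.flatnonzero(lasso.coef_)  # indices of features with non-zero weight
print(f"alpha: {lasso.alpha_:.4f}, selected features: {selected.tolist()}")
```

The features with non-zero coefficients are the ones Lasso "selected"; the rest have been shrunk exactly to zero.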

OLS with scaled features

The 21 Lasso-selected features, refit with OLS instead of using Lasso's shrunken coefficients
0.756 R²
0.710 avg. CV R²
149 avg. CV RMSE

Out of sample OLS test

21 features
0.687 R²
137 RMSE

Observations & Next Steps

The target distribution is highly right-skewed, so predictive accuracy may benefit from a log or Box-Cox transformation. The residual plots also exhibited heteroskedasticity. It is worth exploring higher-degree polynomial terms to evaluate their impact on regression fit. Given the different classes of riders (domestiques, sprinters, climbers, general classification contenders, time trial specialists, and one-day race specialists), there is likely a classification component to points prediction. Non-linear models such as random forests and gradient-boosted trees may perform better at handling these distinct groups while also avoiding issues arising from violations of OLS assumptions. Incorporating some of the unused features above and new ones such as watts/kilogram, team performance, rankings information, and data from Dopeology would be important to consider regardless of the algorithm chosen.
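
As a starting point for the target transformation idea, the sketch below fits a linear model on log1p of a skewed synthetic target and maps predictions back to the original scale. This has not been run on the PCS data. The data-generating process is invented, and `log1p` is chosen here only because it tolerates seasons with zero points.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic right-skewed, strictly positive target.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 1))
y = np.exp(3.0 + 0.8 * X[:, 0] + rng.normal(0.0, 0.3, size=400))

# Fit on log1p(y); predictions are mapped back to the original scale with expm1.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, y)
pred = model.predict(X)
print(pred[:5])
```

Because errors are minimized on the log scale, R² and RMSE should still be computed on the back-transformed predictions so they stay comparable with the results above.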
