# The 80-Minute Journey: A Data-Driven Analysis of Marathon Progression

**Author:** Aditya Padmarajan  
**Dataset:** Strava activity export (Feb 2022 – Oct 2025)  

---

## Project Overview

This notebook analyzes **347 running activities over 3.5 years**, documenting the progression from a first-time marathoner (4:46:07 at Royal Victoria Marathon 2022) to a sub-3:30 finisher (3:26:00 at Royal Victoria Marathon 2025) — an improvement of **80 minutes** across 6 marathon races.

The analysis explores how training patterns, physiological adaptations, and race execution evolved to produce consistent performance gains at two recurring races: the **Royal Victoria Marathon** (4 finishes) and **BMO Vancouver Marathon** (2 finishes).

---

## Datasets

| File | Description | Records |
|------|-------------|---------|
| `activities_dataset.csv` | Complete Strava activity export with pace, HR, elevation, and metadata | 347 activities |
| `global_challenges.csv` | Strava challenge participation history | 699 challenges |

---

## Marathon Race Summary (Official Times)

| Race | Date | Finish Time | Pace | Avg HR |
|------|------|-------------|------|--------|
| Royal Victoria Marathon 2022 | Oct 9, 2022 | 4:46:07 | 6:30/km | 147 bpm |
| BMO Vancouver Marathon 2023 | May 7, 2023 | 4:25:48 | 6:13/km | 160 bpm |
| Royal Victoria Marathon 2023 | Oct 8, 2023 | 4:16:58 | 6:04/km | 161 bpm |
| Royal Victoria Marathon 2024 | Oct 13, 2024 | 3:47:47 | 5:22/km | 162 bpm |
| BMO Vancouver Marathon 2025 | May 4, 2025 | 3:37:23 | 5:07/km | 164 bpm |
| Royal Victoria Marathon 2025 | Oct 12, 2025 | 3:26:00 | 4:50/km | 170 bpm |
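
The headline improvement can be checked directly from the official finish times. The sketch below is a minimal example, with `to_seconds` and `implied_pace` as hypothetical helper names; it converts the first and latest finish times to seconds and derives the pace each one implies over the standard 42.195 km distance.

```python
MARATHON_KM = 42.195  # standard marathon distance


def to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' string to total seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def implied_pace(hms: str, distance_km: float = MARATHON_KM) -> str:
    """Pace as 'M:SS' per km implied by a finish time over distance_km."""
    sec_per_km = round(to_seconds(hms) / distance_km)
    return f"{sec_per_km // 60}:{sec_per_km % 60:02d}"


first, latest = "4:46:07", "3:26:00"  # official times from the table above
improvement_min = (to_seconds(first) - to_seconds(latest)) / 60

print(f"Improvement: {improvement_min:.1f} min")  # Improvement: 80.1 min
print(implied_pace(first), implied_pace(latest))  # 6:47 4:53
```

The Pace column in the table is somewhat faster than these implied values. One possible explanation (an assumption, not confirmed by the source) is that the column comes from Strava's recorded distance and moving time rather than from the official time over 42.195 km.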

---

## Planned Visualizations

### 1. Marathon Progression Timeline
A bar or line chart displaying finish times across all 6 marathons, highlighting the downward trend from 4:46 to 3:26. This serves as the anchor visualization for the entire analysis.

### 2. Pace Evolution Curve
Line chart tracking average pace (min/km) for each marathon with a trend line, demonstrating the progression from 6:30/km to 4:50/km.

### 3. Heart Rate Efficiency Analysis
Scatterplot comparing pace vs. average heart rate across marathons. This visualization is meant to reveal aerobic efficiency gains: running faster at a similar heart rate indicates improved fitness.
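
One way to put a number on this relationship is distance covered per heartbeat (speed divided by heart rate). The sketch below applies that ratio to the pace and average HR values from the race summary table. The metric itself and the helper name `metres_per_beat` are illustrative assumptions, not something the notebook defines.

```python
def metres_per_beat(pace: str, avg_hr: float) -> float:
    """Metres covered per heartbeat, from an 'M:SS' per-km pace and average HR (bpm)."""
    minutes, seconds = (int(part) for part in pace.split(":"))
    speed_m_per_min = 1000 / (minutes + seconds / 60)
    return speed_m_per_min / avg_hr


# Pace and average HR taken from the race summary table
first_race = metres_per_beat("6:30", 147)   # Royal Victoria Marathon 2022
latest_race = metres_per_beat("4:50", 170)  # Royal Victoria Marathon 2025

print(f"{first_race:.3f} m/beat -> {latest_race:.3f} m/beat")  # 1.047 m/beat -> 1.217 m/beat
```

Because the ratio normalizes by heart rate, it stays comparable even though average HR rose from 147 to 170 bpm between the two races.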

### 4. Training Volume by Marathon Block
Stacked area or bar chart showing weekly mileage in the 12–16 weeks preceding each marathon. Correlates training load with race-day performance.
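
A minimal sketch of how the per-block weekly totals could be computed, assuming a DataFrame with the `Activity Date` and `Distance (km)` columns used later in this notebook. The function name `weekly_block_volume` and the daily 5 km log are hypothetical and only exercise the logic.

```python
import pandas as pd


def weekly_block_volume(runs, race_date, weeks=16):
    """Sum distance per week over the `weeks` weeks before race_date.

    The result is indexed by weeks-to-race (1 = final week before the race).
    """
    race_date = pd.Timestamp(race_date).normalize()
    start = race_date - pd.Timedelta(weeks=weeks)
    day = runs["Activity Date"].dt.normalize()  # drop the time of day

    # Keep runs inside the block; the race itself is excluded
    block = runs[(day >= start) & (day < race_date)]
    days_out = (race_date - day.loc[block.index]).dt.days
    weeks_out = ((days_out - 1) // 7 + 1).rename("Weeks to race")

    return block.groupby(weeks_out)["Distance (km)"].sum().sort_index(ascending=False)


# Hypothetical log: a 5 km run every day (not real data)
runs = pd.DataFrame({"Activity Date": pd.date_range("2025-06-01", "2025-10-11", freq="D")})
runs["Distance (km)"] = 5.0

volume = weekly_block_volume(runs, "2025-10-12")
print(volume)
```

Indexing by weeks-to-race rather than calendar week lets blocks from different years be overlaid on the same axis.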

### 5. Monthly Running Volume Heatmap
Calendar-style heatmap (similar to GitHub contributions) showing daily/weekly running activity, revealing consistency patterns and training periodization.

### 6. Long Run Progression
Tracks the longest training runs before each marathon, showing how peak long run distance evolved across training cycles.

### 7. Course Comparison: Victoria vs Vancouver
Side-by-side comparison of the two race courses, analyzing elevation profiles, pace distribution, and heart rate response to control for course difficulty.

### 8. Cumulative Distance Over Time
Running total of kilometers logged since February 2022, with marathon race days marked as milestones.
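
The running total is a cumulative sum over the date-sorted log. The sketch below uses a small hypothetical log; sorting first matters if the export is not already in chronological order, which is an assumption about the file rather than something the source states.

```python
import pandas as pd

# Hypothetical three-run log, deliberately out of order
log = pd.DataFrame({
    "Activity Date": pd.to_datetime(["2022-03-05", "2022-02-12", "2022-02-20"]),
    "Distance (km)": [10.0, 5.0, 8.0],
})

# Sort chronologically, then accumulate distance
log = log.sort_values("Activity Date").reset_index(drop=True)
log["Cumulative Distance (km)"] = log["Distance (km)"].cumsum()

print(log["Cumulative Distance (km)"].tolist())  # [5.0, 13.0, 23.0]
```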

### 9. Elevation vs Pace Relationship
Scatterplot examining how elevation gain impacts average pace across all training runs, useful for understanding performance on hilly courses.

### 10. Challenge Engagement Timeline
Bar chart or heatmap showing monthly Strava challenge completions, illustrating engagement and motivation patterns throughout the training journey.

---

## Key Questions This Analysis Will Answer

1. **What training volume correlates with marathon performance?**  
   Is there a weekly mileage threshold that predicts sub-4:00 or sub-3:30 performance?

2. **How did aerobic efficiency improve?**  
   Can we quantify the pace/HR relationship improvement over time?

3. **What distinguishes Victoria vs Vancouver performances?**  
   Are course-specific factors (elevation, weather) affecting results?

4. **What does the optimal taper look like?**  
   How did training volume change in the final 2–3 weeks before each race?

5. **Is there a long run distance that predicts race success?**  
   What was the longest run before each PR?
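
Question 4 can be framed as each taper week's volume relative to the peak week of the preceding build. The sketch below shows one way to express that. The helper name `taper_profile` and the weekly totals are hypothetical, and defining the peak as the largest pre-taper week is an assumption.

```python
def taper_profile(weekly_km, taper_weeks=3):
    """Return each taper week's volume as a fraction of the peak build week.

    weekly_km is a chronological list of weekly totals ending with the
    final week before the race.
    """
    build, taper = weekly_km[:-taper_weeks], weekly_km[-taper_weeks:]
    peak = max(build)
    return [round(km / peak, 2) for km in taper]


# Hypothetical weekly totals in km (not taken from the dataset)
weekly_km = [40, 48, 55, 60, 64, 50, 38, 25]

print(taper_profile(weekly_km))  # [0.78, 0.59, 0.39]
```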

---

## Technical Notes

- **Distance** is measured in kilometers
- **Speed** is measured in meters per second (converted to min/km pace for analysis)
- **Time** values are in seconds
- **Heart Rate** data is unavailable for the first 20 runs (Feb–Jun 2022) because no HR monitor was used
- **Elevation** is measured in meters

---

In [43]:
import numpy as np
import pandas as pd

# Data Collection and Pre-Processing

In [44]:
df = pd.read_csv("activities_dataset.csv")
# Filter to running activities FIRST
df2 = df.query("`Activity Type` == 'Run'").reset_index(drop=True)

# Rename columns
df2 = df2.rename(columns={
    "Distance": "Distance (km)",
    "Average Speed": "Avg Speed (m/s)",
    "Max Speed": "Max Speed (m/s)",
    "Moving Time": "Moving Time (s)",
    "Elapsed Time": "Elapsed Time (s)",
    "Average Heart Rate": "Avg HR (bpm)",
    "Max Heart Rate": "Max HR (bpm)",
    "Elevation Gain": "Elevation Gain (m)",
    "Elevation Loss": "Elevation Loss (m)",
    "Elevation High": "Elevation High (m)",
    "Elevation Low": "Elevation Low (m)"
})

# Feature Engineering

In [45]:
# Parse dates
# %b -> Abbreviated Month Name (Feb, May, Oct)
# %d -> Day of Month (01-31)
# %Y -> 4-digit year (2022, 2025)
# %I -> Hour (12-hour clock)
# %M -> Minutes (00-59)
# %S -> Seconds (00-59)
# %p -> AM or PM
df2["Activity Date"] = pd.to_datetime(df2["Activity Date"], format="%b %d, %Y, %I:%M:%S %p")

# Time-Based Columns for grouping 
df2["Year"] = df2["Activity Date"].dt.year
df2["Month"] = df2["Activity Date"].dt.month
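# ISO week number (1-53). It belongs to the ISO year, which can differ from
# the calendar "Year" for dates close to January 1.
df2["Week"] = df2["Activity Date"].dt.isocalendar().week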
df2["Day"] = df2["Activity Date"].dt.day_name()

# Calculate pace
df2["Pace (min/km)"] = 1000 / (df2["Avg Speed (m/s)"] * 60)
df2["Max Pace (min/km)"] = 1000 / (df2["Max Speed (m/s)"] * 60)

# Formatted pace column (M:SS per km); missing or zero speeds yield None
df2["Pace (min:sec/km)"] = df2["Pace (min/km)"].apply(
    lambda x: f"{int(x)}:{int((x % 1) * 60):02d}" if np.isfinite(x) else None
)

# Format seconds as H:MM:SS, leaving missing values as None
def format_hms(seconds):
    if pd.isna(seconds):
        return None
    return f"{int(seconds // 3600)}:{int((seconds % 3600) // 60):02d}:{int(seconds % 60):02d}"

# Formatted Moving Time and Elapsed Time (seconds -> H:MM:SS)
df2["Moving Time (H:M:S)"] = df2["Moving Time (s)"].apply(format_hms)
df2["Elapsed Time (H:M:S)"] = df2["Elapsed Time (s)"].apply(format_hms)
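
One caveat when grouping by the `Year` and `Week` columns together: ISO week numbers follow the ISO year, so dates around January 1 can land in a week that belongs to the neighboring year. The short sketch below illustrates this on three dates; the `iso_year` column is a suggestion, not something the notebook currently computes.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2022-12-31", "2023-01-01", "2023-01-02"]))
iso = dates.dt.isocalendar()  # DataFrame with year, week, day columns

comparison = pd.DataFrame({
    "date": dates,
    "calendar_year": dates.dt.year,
    "iso_year": iso["year"],
    "iso_week": iso["week"],
})
print(comparison)
# 2023-01-01 is a Sunday, so it falls in ISO week 52 of ISO year 2022
```

Grouping weekly totals by ISO year and ISO week (rather than calendar year and ISO week) keeps each week in a single group.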

In [47]:
# Select the columns used in the analysis

cols_to_keep = [
    "Activity ID",
    "Activity Date",
    "Year",
    "Month",
    "Week",
    "Day",
    "Activity Name",
    "Activity Type",
    "Distance (km)",
    "Pace (min/km)",
    "Pace (min:sec/km)",
    "Moving Time (H:M:S)",
    "Elapsed Time (H:M:S)",
    "Avg HR (bpm)",
    "Max HR (bpm)",
    "Elevation Gain (m)",
    "Elevation Loss (m)",
    "Elevation High (m)",
    "Elevation Low (m)",
    "Calories",
    "Relative Effort"
]

running_df = df2[cols_to_keep].reset_index(drop=True)

# Flag HR Availability
running_df["HR Available"] = running_df["Avg HR (bpm)"].notna()

# Extract marathons: any run longer than 40 km
# (a training run over 40 km would also match this filter)
marathon_df = running_df[running_df["Distance (km)"] > 40].reset_index(drop=True)
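
A distance threshold alone cannot tell a race from a very long training run. A possible cross-check, sketched below under the assumption that each race was recorded on its race date, is to also require that the activity falls on one of the six dates from the race summary table. The `runs` frame here is hypothetical sample data.

```python
import pandas as pd

# Race dates from the race summary table
race_dates = pd.to_datetime([
    "2022-10-09", "2023-05-07", "2023-10-08",
    "2024-10-13", "2025-05-04", "2025-10-12",
])

# Hypothetical activities: a long training run, a race, and a recovery jog
runs = pd.DataFrame({
    "Activity Date": pd.to_datetime(
        ["2025-09-14 07:30", "2025-10-12 08:00", "2025-10-14 18:00"]
    ),
    "Distance (km)": [41.0, 42.4, 6.0],
})

is_race_day = runs["Activity Date"].dt.normalize().isin(race_dates)
is_marathon_distance = runs["Distance (km)"] > 40

races = runs[is_race_day & is_marathon_distance].reset_index(drop=True)
print(races)
```

With the sample data, the 41 km training run passes the distance filter but is excluded by the date check, leaving only the race-day activity.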