# 🚣 Data Science Tutorial: Training Heatmap, Trend Regression & Workout Clustering

In this notebook, you'll learn how to implement 3 data science features using **your real Concept2 workout data**:

1. **Training Heatmap**: a GitHub-style calendar showing your rowing activity
2. **Trend Regression**: fit a trendline to see if your pace is improving
3. **Workout Clustering**: use K-Means to auto-categorise your workouts

We'll go step-by-step: **WHAT** each technique does, **WHY** we use it, and **HOW** the code works.

Run each cell in order, pressing `Shift+Enter` to execute.

 # ───────────────────────────────────────────────
 # Step 0: Setup - Load Your Data
 # ───────────────────────────────────────────────
 Before any analysis, we need data. Your web app is running on `localhost:8000` and has a JSON API endpoint. We'll pull your workouts through it.
 
 **Why `requests`?** It's the simplest HTTP library for Python. We're making
a quick GET request to our own API. (We use `httpx` in the async web app,
but `requests` is simpler for synchronous notebook work.)

In [3]:
# ── CELL 1: Import Libraries ────────────────────────────────
# We import everything upfront so you can see all our dependencies.

import requests          # HTTP client: talks to our local API
import pandas as pd      # DataFrame library: the backbone of data analysis in Python
import numpy as np       # Numerical computing: we'll use it for math operations
import plotly.express as px          # High-level charting: quick, beautiful plots
import plotly.graph_objects as go    # Low-level charting: for more control
from plotly.subplots import make_subplots   # Side-by-side charts

# These two are for Machine Learning (Section 3):
from sklearn.preprocessing import StandardScaler  # Standardizes features (mean 0, std 1)
from sklearn.cluster import KMeans                 # The clustering algorithm

print("✅ All libraries loaded successfully!")

✅ All libraries loaded successfully!


#### How we get the data
 
 Your web app has a new `/export/csv` endpoint. Visit it while logged in
 and it saves your workouts to `workouts.csv` in the project folder.
 
 In this notebook we simply load that CSV. This is a very common pattern
 in data science: **separate data collection from data analysis**.
 
 Your web app handles auth + API calls → exports clean CSV.  
 Your notebook loads CSV → does analysis. Clean separation of concerns.

In [5]:
# ── CELL 1: Load your workout data ─────────────────────────────
#
# pd.read_csv() reads a CSV file into a DataFrame, a table-like data 
# structure with rows and columns.  Think of it as an Excel spreadsheet
# in Python.
#
# 🔸 FIRST: Visit http://localhost:8000/export/csv while logged in. 
#    This saves the file workouts.csv to your project folder.

import pandas as pd
import numpy as np

df = pd.read_csv("workouts.csv", parse_dates=["date"])  
#                                 ↑ tells pandas to treat the "date" 
#                                   column as datetime objects instead 
#                                   of plain strings

# Let's see what we're working with:
print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns\n")
print("Columns:")
print(df.dtypes)     # shows each column's data type
print("\n--- First 5 rows ---")
df.head()            # displays the first 5 rows as a nice table

Shape: 57 rows × 13 columns

Columns:
id                         int64
date              datetime64[us]
distance_m                 int64
time_seconds             float64
type                         str
workout_type                 str
pace_500m                float64
stroke_rate              float64
calories                 float64
heart_rate_avg           float64
drag_factor              float64
weight_class                 str
verified                    bool
dtype: object

--- First 5 rows ---


,id,date,distance_m,time_seconds,type,workout_type,pace_500m,stroke_rate,calories,heart_rate_avg,drag_factor,weight_class,verified
0,99664115,2025-03-20 08:36:00,5000,1621.4,rower,FixedDistanceSplits,162.14,29.0,260.0,,152.0,H,True
1,99706540,2025-03-21 08:47:00,5000,1595.1,rower,FixedDistanceSplits,159.51,28.0,262.0,,151.0,H,True
2,99751796,2025-03-22 13:41:00,5000,1535.4,rower,FixedDistanceSplits,153.54,30.0,267.0,,148.0,H,True
3,99861141,2025-03-25 07:58:00,5000,1529.3,rower,FixedDistanceSplits,152.93,29.0,269.0,,152.0,H,True
4,99908703,2025-03-26 09:19:00,5000,1647.0,rower,FixedDistanceSplits,164.7,29.0,263.0,,152.0,H,True


In [6]:
# ── CELL 2: Quick Data Exploration ─────────────────────────────
#
# Before diving into analysis, ALWAYS explore your data first.
# This is a core data science habit: understand what you have 
# before you build on it.

print("=== Basic Statistics ===")
print(df.describe())          # count, mean, std, min, quartiles, max for numeric columns

print("\n=== Missing Values ===")
print(df.isnull().sum())      # how many NaN/missing per column
#  ↑ .isnull() returns True/False for each cell
#    .sum() counts how many Trues per column

print(f"\n=== Date Range ===")
print(f"From: {df['date'].min()}")
print(f"To:   {df['date'].max()}")

print(f"\n=== Machine Types ===")
print(df["type"].value_counts())  # count of workouts per machine type

=== Basic Statistics ===
                 id                        date    distance_m  time_seconds  \
count  5.700000e+01                          57     57.000000     57.000000   
mean   1.037562e+08  2025-07-05 08:28:31.578947   7595.456140   2530.545614   
min    9.966412e+07         2025-03-20 08:36:00    291.000000     60.000000   
25%    1.006424e+08         2025-04-14 08:17:00   5000.000000   1595.100000   
50%    1.035508e+08         2025-06-26 08:18:00   6000.000000   1993.200000   
75%    1.051730e+08         2025-08-17 12:39:00  10000.000000   3383.700000   
max    1.125051e+08         2026-02-07 14:40:00  15000.000000   5480.900000   
std    3.037838e+06                         NaN   3105.329371   1121.041972   

        pace_500m  stroke_rate    calories  heart_rate_avg  drag_factor  
count   57.000000    56.000000   56.000000             0.0    56.000000  
mean   163.691237    29.089286  394.107143             NaN   184.642857  
min    103.092784    25.000000   23.00000

# ═══════════════════════════════════════════════
# SECTION 1: Training Heatmap (GitHub-style calendar)
# ═══════════════════════════════════════════════
 
 ## WHAT is it? 
 A calendar-like grid where each cell is a day, colored by how much you
 rowed. Similar to GitHub's contribution graph: dark green = heavy training
 day, light gray = rest day.

 ## WHY build it?
 - Instantly see training consistency and patterns  
 - Spot gaps (rest periods, injuries, holidays)  
 - Identify if you train more on certain days of the week  
 - Motivates streaks ("don't break the chain!")
 
 ## HOW will we build it?
 1. Group workouts by date → sum the distance for each day
 2. Fill in ALL days (including rest days with 0)
 3. Reshape into a week × weekday matrix  
 4. Plot with Plotly's `go.Heatmap`
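
 The first three steps can be tried on a tiny made-up log before touching the real data. This is only a sketch: the `toy` frame and its three sessions are invented for illustration, and it stops before the plotting step.

```python
import pandas as pd

# Hypothetical toy log: three sessions, two of them on the same day.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2025-03-20 08:00", "2025-03-20 18:00", "2025-03-23 09:00"]
    ),
    "distance_m": [5000, 2000, 6000],
})

# 1. One value per calendar day (normalize() drops the time of day).
per_day = toy.groupby(toy["date"].dt.normalize())["distance_m"].sum()

# 2. Insert the missing rest days with 0 meters.
all_days = pd.date_range(per_day.index.min(), per_day.index.max(), freq="D")
per_day = per_day.reindex(all_days, fill_value=0)

# 3. Reshape into a week x weekday grid using ISO calendar fields.
iso = per_day.index.isocalendar()
grid = pd.DataFrame({
    "week": iso["week"].astype(int).to_numpy(),
    "weekday": iso["day"].astype(int).to_numpy(),   # 1=Mon ... 7=Sun
    "meters": per_day.to_numpy(),
}).pivot_table(
    index="week", columns="weekday", values="meters",
    aggfunc="sum", fill_value=0,
)

print(grid)
```

 The two sessions on 20 March collapse into a single 7000 m cell, and the two rest days in between show up as explicit zeros rather than missing rows.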

In [7]:
# ── STEP 1.1: Aggregate distance per day ───────────────────────
#
# WHAT:  Group all workouts that happened on the same date and sum 
#        their distance. If you rowed twice on Jan 5, we combine them.
#
# WHY:   Each cell in our heatmap = 1 day. We need one value per day.
#
# HOW:   
#   - df["date"].dt.date → strips the time part, keeps just the date
#   - .groupby("day")    → groups rows that share the same day
#   - .agg(...)          → applies an aggregation function to each group
#   - "sum" on distance  → total meters rowed that day

daily = df.copy()
daily["day"] = daily["date"].dt.date   # extract just the date (no time)

daily_agg = (
    daily
    .groupby("day")                    # group by calendar day
    .agg(
        total_meters=("distance_m", "sum"),  # sum all distances for that day
        num_workouts=("id", "count"),        # count how many sessions
    )
    .reset_index()                     # turn the groupby index back into a column
)

daily_agg["day"] = pd.to_datetime(daily_agg["day"])  # convert back to datetime

print(f"You rowed on {len(daily_agg)} distinct days")
print(f"Average distance on active days: {daily_agg['total_meters'].mean():.0f}m")
daily_agg.head(10)

You rowed on 56 distinct days
Average distance on active days: 7731m


,day,total_meters,num_workouts
0,2025-03-20,5000,1
1,2025-03-21,5000,1
2,2025-03-22,5000,1
3,2025-03-25,5000,1
4,2025-03-26,5000,1
5,2025-03-27,5000,1
6,2025-03-29,5000,1
7,2025-03-31,5500,1
8,2025-04-01,6000,1
9,2025-04-02,7000,1


In [8]:
# ── STEP 1.2: Fill in rest days (days with 0 meters) ───────────
#
# WHAT:  Create a continuous range of dates from your first to last workout, 
#        and fill days where you didn't row with 0.
#
# WHY:   The heatmap needs EVERY day in the calendar, not just active days.
#        Without this, rest days would be invisible.
#
# HOW:
#   - pd.date_range()  → generates every date between start and end
#   - .reindex()        → aligns our data with this full range
#   - .fillna(0)        → replaces NaN (missing = rest day) with 0

# Create a continuous date range covering your entire training history
all_days = pd.date_range(
    start=daily_agg["day"].min(), 
    end=daily_agg["day"].max(), 
    freq="D"   # "D" = daily frequency
)

# Set the date as the index so we can reindex
daily_full = daily_agg.set_index("day").reindex(all_days)
#                      ↑ set_index: make day the row label
#                                   ↑ reindex: align to the full date range
#                                     (days without data become NaN)

daily_full["total_meters"] = daily_full["total_meters"].fillna(0)  # rest days → 0
daily_full["num_workouts"] = daily_full["num_workouts"].fillna(0)
daily_full.index.name = "day"
daily_full = daily_full.reset_index()

print(f"Total calendar days: {len(daily_full)}")
print(f"Rest days (0m): {(daily_full['total_meters'] == 0).sum()}")
print(f"Active days:    {(daily_full['total_meters'] > 0).sum()}")
daily_full.head(10)

Total calendar days: 325
Rest days (0m): 269
Active days:    56


,day,total_meters,num_workouts
0,2025-03-20,5000.0,1.0
1,2025-03-21,5000.0,1.0
2,2025-03-22,5000.0,1.0
3,2025-03-23,0.0,0.0
4,2025-03-24,0.0,0.0
5,2025-03-25,5000.0,1.0
6,2025-03-26,5000.0,1.0
7,2025-03-27,5000.0,1.0
8,2025-03-28,0.0,0.0
9,2025-03-29,5000.0,1.0


In [9]:
# ── STEP 1.3: Build the calendar matrix ────────────────────────
#
# WHAT:  Reshape our flat list of dates into a 2D grid:
#        Rows = weeks,  Columns = days of the week (Mon–Sun)
#
# WHY:   A heatmap needs a 2D matrix.  Each cell in the grid maps to 
#        one specific day. This is how GitHub's contribution chart works too.
#
# HOW:
#   - .dt.isocalendar() → gives ISO year, week number, weekday (1=Mon, 7=Sun)
#   - .pivot_table()    → reshapes from long → wide format
#        Think of it like a spreadsheet pivot: 
#        rows=week_label, columns=weekday, values=meters

daily_full["week_num"]  = daily_full["day"].dt.isocalendar().week.astype(int)
daily_full["year"]      = daily_full["day"].dt.isocalendar().year.astype(int)
daily_full["weekday"]   = daily_full["day"].dt.isocalendar().day.astype(int)  # 1=Mon, 7=Sun

# Create a label like "2025-W03" for each week (for the Y-axis)
daily_full["week_label"] = (
    daily_full["year"].astype(str) + "-W" + 
    daily_full["week_num"].astype(str).str.zfill(2)
)

# Pivot: one row per week, one column per weekday
#   pivot_table works like a spreadsheet pivot table:
#     index   = what becomes the rows    (each week)
#     columns = what becomes the columns (Mon through Sun)
#     values  = what fills the cells     (total meters)
#     aggfunc = how to combine if duplicates exist (sum, but usually 1:1 here)

heatmap_matrix = daily_full.pivot_table(
    index="week_label",      # rows = weeks
    columns="weekday",       # columns = Mon(1) to Sun(7)
    values="total_meters",   # cell values = meters rowed
    aggfunc="sum",           # if somehow two entries for same day, sum them
    fill_value=0,            # fill any missing cells with 0
)

# Rename columns from numbers to day names for readability
day_names = {1: "Mon", 2: "Tue", 3: "Wed", 4: "Thu", 5: "Fri", 6: "Sat", 7: "Sun"}
heatmap_matrix.rename(columns=day_names, inplace=True)

print(f"Matrix shape: {heatmap_matrix.shape} (weeks × days)")
print("\nFirst 5 weeks:")
heatmap_matrix.head()

Matrix shape: (47, 7) (weeks × days)

First 5 weeks:


week_label,Mon,Tue,Wed,Thu,Fri,Sat,Sun
2025-W12,0.0,0.0,0.0,5000.0,5000.0,5000.0,0.0
2025-W13,0.0,5000.0,5000.0,5000.0,0.0,5000.0,0.0
2025-W14,5500.0,6000.0,7000.0,0.0,10000.0,8000.0,0.0
2025-W15,5000.0,0.0,0.0,0.0,0.0,5000.0,0.0
2025-W16,5000.0,0.0,0.0,0.0,0.0,0.0,0.0
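
The week labels above pair the ISO week with the ISO year rather than the calendar year. A short standard-library sketch shows why that matters around New Year; `week_label` is a hypothetical helper written for this illustration, not part of the notebook.

```python
from datetime import date


def week_label(d: date) -> str:
    """Build a 'YYYY-Www' label from the ISO year and ISO week of a date."""
    iso_year, iso_week, _ = d.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"


# A Monday in late December that already belongs to week 1 of the next ISO year.
print(week_label(date(2024, 12, 30)))

# A Friday on 1 January that still belongs to the last week of the previous ISO year.
print(week_label(date(2021, 1, 1)))

# The first workout in this notebook's data.
print(week_label(date(2025, 3, 20)))
```

Mixing the calendar year with the ISO week number would label 30 December 2024 as "2024-W01" and sort it before the rest of that year, so keeping both fields from the same calendar system avoids misplaced rows in the heatmap.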


In [10]:
# ── STEP 1.4: Plot the Heatmap! ────────────────────────────────
#
# WHAT:  Render the matrix as a color-coded heatmap.
#
# WHY we use go.Heatmap (not px.imshow):
#   - More control over hover text, color scales, and axis labels
#   - go = "graph objects": Plotly's low-level API for fine-grained control
#   - px = "plotly express": quick charts, less customization
#
# KEY CONCEPTS:
#   - z = the 2D matrix of values (meters) → determines color intensity
#   - colorscale = maps values to colors (0m = light gray, max = dark green)
#   - hovertemplate = what appears when you hover over a cell
#   - We reverse the Y-axis so the first week sits at the TOP and the calendar
#     reads top to bottom (go.Heatmap draws the first row at the bottom by default)

import plotly.graph_objects as go

# Convert the matrix to a numpy array for Plotly
z_values = heatmap_matrix.values          # the 2D grid of meter values
weeks    = heatmap_matrix.index.tolist()   # Y-axis labels (week names)
days     = heatmap_matrix.columns.tolist() # X-axis labels (Mon–Sun)

fig_heatmap = go.Figure(data=go.Heatmap(
    z=z_values,              # 2D matrix: color intensity
    x=days,                  # X-axis: day of week
    y=weeks,                 # Y-axis: week label
    
    # Color scale: light gray for rest, light → dark green for more meters
    colorscale=[
        [0.0, "#ebedf0"],   # 0 meters = light gray (rest day)
        [0.001, "#9be9a8"], # tiny activity = light green (the 0.001 stop makes
        [0.25, "#40c463"],  #   0 clearly different from any activity)
        [0.5, "#30a14e"],
        [1.0, "#216e39"],   # max meters = darkest green
    ],
    
    # Hover text: shows the actual meters and km when you mouse over
    hovertemplate=(
        "Week: %{y}<br>"
        "Day: %{x}<br>"
        "Distance: %{z:,.0f}m (%{customdata:.1f} km)"
        "<extra></extra>"   # removes trace name from hover box
    ),
    customdata=z_values / 1000,  # also pass km values for hover display
    
    colorbar=dict(title="Meters", thickness=15),  # legend showing the color scale
))

fig_heatmap.update_layout(
    title="🚣 Training Heatmap: Distance per Day",
    xaxis_title="Day of Week",
    yaxis_title="Week",
    yaxis=dict(autorange="reversed"),  # flip the Y-axis direction
    height=max(300, len(weeks) * 22),  # auto-size height to fit all weeks
    template="plotly_white",
)

fig_heatmap.show()

# ═══════════════════════════════════════════════
# SECTION 2: Trend Regression (Is your pace improving?)
# ═══════════════════════════════════════════════
 
 ## WHAT is Linear Regression?
 Linear regression fits a straight line (y = mx + b) through noisy data to
 find the **overall trend**. Individual workouts bounce around, but the line
 tells you the average direction: are you getting faster or slower?

 - **m** (slope) = how much pace changes per day → your improvement rate
 - **b** (intercept) = the starting baseline
 - **R²** (R-squared) = how well the line fits (0 = the line explains nothing, 1 = perfect fit)

 ## WHY use it?
 - Quantifies improvement: "You're improving by 0.5 seconds/500m per month"
 - Filters out noise: one bad workout doesn't mean you're getting worse
 - Predicts where you'll be in the future (extrapolation)
 - It's the simplest, most interpretable model, so start here before anything fancier

 ## WHY numpy `polyfit` instead of scikit-learn?
 For simple 1-variable regression, `np.polyfit()` is the quickest to write: it's
 literally one line of code. Scikit-learn's `LinearRegression` does the same
 math but with more ceremony (fit/predict pattern). We'll see both approaches.

 ## HOW will we build it?
 1. Filter to workouts that have a pace value
 2. Convert dates to numbers (days since first workout)
 3. Fit a line with `np.polyfit()`
 4. Calculate R² to see how good the fit is
 5. Plot data points + trendline with Plotly
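
 Steps 3 and 4 can be rehearsed on synthetic data where the true slope is known. The sketch below assumes a made-up rower who improves by 0.05 s/500m per day with 3 s of random noise; none of these numbers come from the real workouts. It also checks a handy identity: for a one-variable least-squares line, R² equals the squared Pearson correlation.

```python
import numpy as np

# Synthetic data (assumption): true slope -0.05 s/500m per day, noise sd 3 s.
rng = np.random.default_rng(0)
days = np.arange(0, 300, 5, dtype=float)
pace = 165.0 - 0.05 * days + rng.normal(0.0, 3.0, size=days.size)

# Step 3: fit a straight line.
slope, intercept = np.polyfit(days, pace, deg=1)
fitted = slope * days + intercept

# Step 4: R² = 1 - SS_residual / SS_total.
ss_res = np.sum((pace - fitted) ** 2)
ss_tot = np.sum((pace - pace.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Cross-check against the squared Pearson correlation.
r = np.corrcoef(days, pace)[0, 1]

print(f"slope={slope:.4f}  R²={r_squared:.3f}  r²={r ** 2:.3f}")
```

 Because the synthetic trend is strong relative to the noise, the recovered slope lands close to -0.05 and R² is high; real workout logs are usually far noisier.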

In [11]:
# ── STEP 2.1: Prepare the data for regression ──────────────────
#
# WHAT:  Filter to rows with valid pace, then convert dates → numbers.
#
# WHY convert dates to numbers?
#   Regression needs numbers on both axes (y = mx + b).
#   Dates aren't numbers, so we convert to "days since first workout".
#   Day 0 = your first workout,  Day 30 = one month later, etc.
#
# WHY filter out NaN pace?
#   Some workouts (e.g. just-time or just-distance) may have no pace.
#   Regression can't handle NaN values, so it needs clean data.

pace_df = df[df["pace_500m"].notna()].copy()
#          ↑ .notna() returns True where pace exists
#            df[...] filters to keep only those rows
#            .copy() prevents pandas SettingWithCopyWarning

# Convert dates to numbers: "days since first workout"
first_day = pace_df["date"].min()
pace_df["days_since_start"] = (pace_df["date"] - first_day).dt.days
#                               ↑ subtracting datetimes gives a timedelta
#                                  .dt.days extracts just the day count
# Note: .dt.days counts whole 24-hour periods from the first workout's
# timestamp (08:36), so 2025-03-25 07:58 comes out as day 4, not day 5.

print(f"Workouts with pace data: {len(pace_df)}")
print(f"Training span: {pace_df['days_since_start'].max()} days")
print(f"Pace range: {pace_df['pace_500m'].min():.1f}s – {pace_df['pace_500m'].max():.1f}s per 500m")
pace_df[["date", "days_since_start", "pace_500m", "distance_m"]].head(10)

Workouts with pace data: 57
Training span: 324 days
Pace range: 103.1s – 196.6s per 500m


,date,days_since_start,pace_500m,distance_m
0,2025-03-20 08:36:00,0,162.14,5000
1,2025-03-21 08:47:00,1,159.51,5000
2,2025-03-22 13:41:00,2,153.54,5000
3,2025-03-25 07:58:00,4,152.93,5000
4,2025-03-26 09:19:00,6,164.7,5000
5,2025-03-27 09:04:00,7,168.07,5000
6,2025-03-29 13:35:00,9,156.3,5000
7,2025-03-31 09:16:00,11,169.627273,5500
8,2025-04-01 08:09:00,11,166.1,6000
9,2025-04-02 08:44:00,13,165.907143,7000


In [12]:
# ── STEP 2.2: Fit the regression line with numpy ───────────────
#
# WHAT: np.polyfit(x, y, deg) fits a polynomial of the given degree.
#       deg=1 → straight line (linear regression): y = slope*x + intercept
#
# HOW it works under the hood:
#   polyfit uses "least squares": it finds the line that minimizes the 
#   sum of squared differences between actual points and the line.
#   Imagine stretching a rubber band between points: the line sits 
#   where the total stretch is minimized.
#
# WHAT is R² (R-squared)?
#   - Measures how much of the variation in pace is explained by time
#   - R²=1.0: perfect fit (all points on the line)  
#   - R²=0.0: the line explains nothing (no linear trend over time)
#   - R²=0.3: time explains 30% of pace variation, typical for noisy data!
#   Formula: R² = 1 - (sum of squared residuals) / (sum of squared deviations from mean)

import numpy as np

x = pace_df["days_since_start"].values  # independent variable (time)
y = pace_df["pace_500m"].values         # dependent variable (pace)

# ─── Method 1: numpy polyfit (the simple way) ───
coefficients = np.polyfit(x, y, deg=1)  
#                                  ↑ deg=1 means fit a straight line
# Returns [slope, intercept]

slope     = coefficients[0]   # change in pace per day
intercept = coefficients[1]   # pace at day 0

# Generate the trendline y-values
trend_y = np.polyval(coefficients, x)  # evaluate the polynomial at each x
#          ↑ polyval = "polynomial evaluate": plugs each x into slope*x + intercept

# Calculate R² (coefficient of determination)
ss_residuals = np.sum((y - trend_y) ** 2)     # sum of squared errors
ss_total     = np.sum((y - np.mean(y)) ** 2)  # total sum of squares around the mean
r_squared    = 1 - (ss_residuals / ss_total)

# ─── Interpret the results ───
pace_change_per_month = slope * 30  # convert per-day to per-month
direction = "improving ✅" if slope < 0 else "getting slower ❌"
#            ↑ LOWER pace = FASTER (fewer seconds per 500m)

print(f"═══ REGRESSION RESULTS ═══")
print(f"Slope:     {slope:.4f} seconds/500m per day")
print(f"           = {pace_change_per_month:.2f} seconds/500m per month")
print(f"Direction: You're {direction}")
print(f"R²:        {r_squared:.3f} ({r_squared*100:.1f}% of variation explained)")
print(f"Intercept: {intercept:.1f} seconds/500m at day 0")

═══ REGRESSION RESULTS ═══
Slope:     0.0185 seconds/500m per day
           = 0.56 seconds/500m per month
Direction: You're getting slower ❌
R²:        0.015 (1.5% of variation explained)
Intercept: 161.7 seconds/500m at day 0
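
With R² this low, the sign of the slope alone says little. One way to gauge how uncertain the slope is, offered here as an addition that the notebook itself does not use, is a bootstrap: refit the line on many resampled copies of the data and look at the spread of slopes. The sketch runs on synthetic data loosely shaped like the results above (57 workouts, about 12 s of scatter); the numbers are assumptions, not the real log.

```python
import numpy as np

# Synthetic stand-in (assumption): 57 workouts over ~325 days, noisy pace.
rng = np.random.default_rng(42)
days = np.sort(rng.integers(0, 325, size=57)).astype(float)
pace = 162.0 + 0.02 * days + rng.normal(0.0, 12.0, size=days.size)

# Slope fitted on the full sample.
slope_hat = np.polyfit(days, pace, deg=1)[0]

# Bootstrap: resample workouts with replacement and refit each time.
n = days.size
boot_slopes = np.empty(2000)
for i in range(boot_slopes.size):
    idx = rng.integers(0, n, size=n)
    boot_slopes[i] = np.polyfit(days[idx], pace[idx], deg=1)[0]

# Percentile interval covering the middle 95% of resampled slopes.
low, high = np.percentile(boot_slopes, [2.5, 97.5])

print(f"slope = {slope_hat:.4f} s/500m per day")
print(f"95% bootstrap interval: [{low:.4f}, {high:.4f}]")
print("Interval includes 0:", bool(low <= 0.0 <= high))
```

If the interval straddles zero, the data cannot distinguish "getting slower" from "no change", which is a more honest reading than the direction label on its own.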


In [13]:
# ── STEP 2.3: Now with scikit-learn (the "proper" ML way) ──────
#
# WHY show both? np.polyfit is quick and easy, but scikit-learn's 
# LinearRegression follows the standard ML workflow that you'll use 
# for most other models:
#
#   1. Create the model          → model = LinearRegression()
#   2. Prepare the features      → X must be 2D: [[1], [2], [3], ...]
#   3. Fit (train) the model     → model.fit(X, y)
#   4. Predict                   → model.predict(X_new)
#   5. Evaluate                  → model.score(X, y) = R²
#
# This fit/predict/score pattern is THE pattern in scikit-learn.
# Regressors and classifiers follow it directly; clustering models share
# fit() but not always the rest (DBSCAN, for example, has no predict()).

from sklearn.linear_model import LinearRegression

# sklearn requires X as a 2D array: shape (n_samples, n_features)
# Our x is 1D: [0, 1, 2, ...] → reshape to [[0], [1], [2], ...]
X = x.reshape(-1, 1)  
#       ↑ -1 means "figure out this dimension automatically"
#         1 means "1 column" (one feature: days_since_start)

# Create and train the model
model = LinearRegression()    # instantiate the model object
model.fit(X, y)               # fit = "learn the best line from data"

# Extract learned parameters
print(f"═══ SCIKIT-LEARN RESULTS ═══")
print(f"Slope (coef_):     {model.coef_[0]:.4f}")        # same as np.polyfit slope
print(f"Intercept:         {model.intercept_:.1f}")       # same as np.polyfit intercept
print(f"R² (score):        {model.score(X, y):.3f}")      # same as our manual R²

# Predict pace at arbitrary future dates
future_day = pace_df["days_since_start"].max() + 90  # ~3 months after the last workout
predicted_pace = model.predict([[future_day]])[0]
mins = int(predicted_pace // 60)
secs = predicted_pace % 60
print(f"\nPredicted pace in 3 months: {mins}:{secs:04.1f} /500m")
print("(⚠️  Extrapolation: take with a grain of salt!)")

═══ SCIKIT-LEARN RESULTS ═══
Slope (coef_):     0.0185
Intercept:         161.7
R² (score):        0.015

Predicted pace in 3 months: 2:49.4 /500m
(⚠️  Extrapolation: take with a grain of salt!)


In [14]:
# ── STEP 2.4: Visualize the trend ──────────────────────────────
#
# WHAT: Plot the actual pace data + the regression trendline + a 
#       rolling average for additional context.
#
# WHY add a rolling average?
#   The regression line shows the OVERALL trend (static, one line).
#   A rolling average shows HOW the trend changes: maybe you improved
#   fast at first, then plateaued. The rolling average captures that.
#
# HOW rolling average works:
#   .rolling(window=10) → for each point, look at the last 10 workouts
#   .mean()             → average those 10 paces
#   This smooths out noise, revealing the underlying pattern.

import plotly.graph_objects as go

# Compute 10-workout rolling average
pace_df = pace_df.sort_values("date")   # make sure it's in chronological order
pace_df["rolling_avg"] = pace_df["pace_500m"].rolling(
    window=10,        # average over last 10 workouts
    min_periods=3,    # need at least 3 data points to start (otherwise NaN)
).mean()

# Helper: format seconds → "M:SS.s"
def fmt_pace(s):
    return f"{int(s // 60)}:{s % 60:04.1f}"

fig_trend = go.Figure()

# 1) Scatter plot of actual pace data (individual workouts)
fig_trend.add_trace(go.Scatter(
    x=pace_df["date"],
    y=pace_df["pace_500m"],
    mode="markers",
    name="Actual Pace",
    marker=dict(size=6, color="#2196F3", opacity=0.6),
    hovertemplate="Date: %{x}<br>Pace: %{text}<extra></extra>",
    text=pace_df["pace_500m"].apply(fmt_pace),
))

# 2) Regression trendline
fig_trend.add_trace(go.Scatter(
    x=pace_df["date"],
    # Re-evaluate the fit on the sorted frame so x and y stay aligned
    y=np.polyval(coefficients, pace_df["days_since_start"].to_numpy()),
    mode="lines",
    name=f"Trend (R²={r_squared:.2f})",
    line=dict(color="red", width=2, dash="dash"),
))

# 3) Rolling average (smoothed curve)
fig_trend.add_trace(go.Scatter(
    x=pace_df["date"],
    y=pace_df["rolling_avg"],
    mode="lines",
    name="10-workout Rolling Avg",
    line=dict(color="#4CAF50", width=2),
))

# Format Y-axis ticks as M:SS
min_pace = (int(pace_df["pace_500m"].min()) // 5) * 5
max_pace = ((int(pace_df["pace_500m"].max()) // 5) + 1) * 5
tickvals = list(range(min_pace, max_pace + 1, 5))
ticktext = [f"{v // 60}:{v % 60:02d}" for v in tickvals]

# Add annotation showing the improvement rate
fig_trend.add_annotation(
    x=0.02, y=0.98,
    xref="paper", yref="paper",   # position relative to the plot area
    text=(
        f"<b>Rate of Change:</b> {abs(pace_change_per_month):.1f}s /500m per month<br>"
        f"<b>Direction:</b> {'↓ Getting Faster' if slope < 0 else '↑ Getting Slower'}"
    ),
    showarrow=False,
    bgcolor="rgba(255,255,255,0.8)",
    bordercolor="#ccc",
    font=dict(size=12),
    align="left",
)

fig_trend.update_layout(
    title="📈 Pace Trend Analysis with Linear Regression",
    xaxis_title="Date",
    yaxis_title="Pace /500m",
    yaxis=dict(tickvals=tickvals, ticktext=ticktext),
    template="plotly_white",
    height=500,
    legend=dict(x=0.02, y=0.02, bgcolor="rgba(255,255,255,0.8)"),
)

fig_trend.show()
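
One edge case in the pace formatting is worth knowing about. Splitting seconds into minutes first and rounding afterwards can print "60.0" in the seconds field when a pace sits just under a whole minute. The sketch below reproduces the notebook's `fmt_pace` next to a hypothetical `fmt_pace_safe` that rounds to tenths of a second before splitting; the second function is a suggestion, not something the notebook defines.

```python
def fmt_pace(s):
    """The notebook's helper: split first, round while formatting."""
    return f"{int(s // 60)}:{s % 60:04.1f}"


def fmt_pace_safe(seconds: float) -> str:
    """Hypothetical variant: round to tenths first, then split into M:SS.s."""
    tenths = round(seconds * 10)
    minutes, rem_tenths = divmod(tenths, 600)   # 600 tenths = 1 minute
    return f"{minutes}:{rem_tenths / 10:04.1f}"


# 119.97 s is a hair under two minutes.
print(fmt_pace(119.97))        # seconds field rounds up to 60.0
print(fmt_pace_safe(119.97))   # carries the rounding into the minutes

# An ordinary value formats the same either way.
print(fmt_pace(169.36), fmt_pace_safe(169.36))
```

For typical paces the two helpers agree, so the difference only shows up for values within 0.05 s of a minute boundary.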

# ═══════════════════════════════════════════════
# SECTION 3: Workout Clustering with K-Means
# ═══════════════════════════════════════════════

 ## WHAT is Clustering?
 Clustering is an **unsupervised** machine learning technique. "Unsupervised"
 means we DON'T tell the model what the categories are. It discovers them
 on its own by finding groups of similar workouts.

 ## WHAT is K-Means specifically?
 K-Means is one of the most popular clustering algorithms. Here's the intuition:  
 1. Pick K random "center" points in your data space  
 2. Assign each workout to its nearest center  
 3. Recalculate centers as the average of their assigned workouts  
 4. Repeat steps 2-3 until centers stop moving  
 
 It's like throwing K pins onto a dartboard, then repeatedly adjusting them
 until each pin is at the center of its nearest group of darts.
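
 The four steps translate almost line for line into NumPy. The function below is a bare-bones teaching sketch (`kmeans_sketch` and the two synthetic blobs are made up for this example); scikit-learn's `KMeans` does the same loop with smarter starting points (k-means++ by default) and several restarts.

```python
import numpy as np


def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Plain K-Means loop: assign to nearest center, re-average, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct data points as the starting centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: distance from every point to every center, keep the nearest.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned points
        # (a center with no points stays where it is).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Two well-separated synthetic blobs of 20 points each.
rng = np.random.default_rng(1)
blob_a = rng.normal([0.0, 0.0], 0.3, size=(20, 2))
blob_b = rng.normal([5.0, 5.0], 0.3, size=(20, 2))
points = np.vstack([blob_a, blob_b])

labels, centers = kmeans_sketch(points, k=2)
print("labels:", labels)
print("centers:\n", centers.round(2))
```

 On data this cleanly separated the loop settles in a handful of iterations, with each blob receiving its own label.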

 ## WHY use it on workout data?
 - Auto-discovers your workout categories: sprints, steady-state, long rows
 - No manual labeling needed: the algorithm finds patterns you might miss
 - Helps you see if your training is balanced or if you're missing types

 ## WHY K-Means over other clustering methods?
 - Simple to understand and implement
 - Fast (works well even with thousands of rows)
 - The main downside: you must choose K (number of clusters) in advance  
   → We'll use the **Elbow Method** to find the best K!
 
 ## CRITICAL STEP: Feature Scaling
 K-Means uses distance between points. Distance in meters (0–42195) would 
 dwarf pace in seconds (90–300). We MUST **scale** features to a comparable range
 so each feature contributes equally. StandardScaler transforms each feature  
 to have mean=0, std=1. This is called **standardization**.
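
 A three-workout example makes the problem concrete. The values below are invented for illustration: two rows differ only in pace (by 50 s), two differ only in distance (by 500 m). The manual z-score uses the population standard deviation, which is also what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical workouts: [distance_m, pace_500m in seconds]
workouts = np.array([
    [5000.0, 120.0],   # hard 5k
    [5000.0, 170.0],   # easy 5k: same distance, 50 s slower
    [5500.0, 120.0],   # hard 5.5k: same pace, 500 m longer
])


def dist(a, b):
    """Euclidean distance, the measure K-Means relies on."""
    return float(np.linalg.norm(a - b))


# Raw units: the 500 m gap swamps the 50 s gap.
raw_pace_gap = dist(workouts[0], workouts[1])
raw_distance_gap = dist(workouts[0], workouts[2])

# Standardize each column: (value - mean) / std, with population std.
scaled = (workouts - workouts.mean(axis=0)) / workouts.std(axis=0)
scaled_pace_gap = dist(scaled[0], scaled[1])
scaled_distance_gap = dist(scaled[0], scaled[2])

print(f"raw:    pace gap = {raw_pace_gap:.1f}, distance gap = {raw_distance_gap:.1f}")
print(f"scaled: pace gap = {scaled_pace_gap:.3f}, distance gap = {scaled_distance_gap:.3f}")
```

 In raw units the hard and easy 5k look nearly identical to the algorithm, even though a rower would call them very different sessions. After standardization the two gaps carry equal weight in this toy set.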

 ## HOW will we build it?
 1. Select features (distance, pace, duration)
 2. Scale them with StandardScaler
 3. Use the Elbow Method to find optimal K
 4. Run K-Means clustering
 5. Visualize and interpret the clusters

In [15]:
# ── STEP 3.1: Select and prepare features for clustering ───────
#
# WHAT: Pick which columns (features) describe a workout's "character".
#       We want features that distinguish sprint vs. steady-state vs. long row.
#
# WHY these 3 features?
#   - distance_m:   Separates short sprints from long endurance rows
#   - pace_500m:    Separates intense from easy efforts  
#   - time_seconds: Separates quick sessions from extended ones
#
# WHY drop NaN rows?
#   K-Means cannot handle missing values. We need complete rows only.
#   dropna() removes any row that has at least one NaN in our selected columns.

features = ["distance_m", "pace_500m", "time_seconds"]

cluster_df = df[features].dropna().copy()
#               ↑ select only the 3 columns we care about
#                          ↑ remove rows with any NaN
#                                  ↑ copy so edits don't affect the original df

print(f"Workouts available for clustering: {len(cluster_df)} / {len(df)}")
print(f"\n=== Feature Statistics (BEFORE scaling) ===")
print(cluster_df.describe().round(1))
print(f"\nNotice the range differences:")
print(f"  distance_m:   {cluster_df['distance_m'].min():.0f} – {cluster_df['distance_m'].max():.0f}")
print(f"  pace_500m:    {cluster_df['pace_500m'].min():.1f} – {cluster_df['pace_500m'].max():.1f}")
print(f"  time_seconds: {cluster_df['time_seconds'].min():.0f} – {cluster_df['time_seconds'].max():.0f}")
# Report the maxima from the data itself instead of hard-coded figures
print(
    f"\n⚠️  distance goes up to ~{cluster_df['distance_m'].max():.0f}, "
    f"but pace only to ~{cluster_df['pace_500m'].max():.0f}."
)
print(f"    Without scaling, distance would DOMINATE the clustering!")

Workouts available for clustering: 57 / 57

=== Feature Statistics (BEFORE scaling) ===
       distance_m  pace_500m  time_seconds
count        57.0       57.0          57.0
mean       7595.5      163.7        2530.5
std        3105.3       12.3        1121.0
min         291.0      103.1          60.0
25%        5000.0      156.8        1595.1
50%        6000.0      164.7        1993.2
75%       10000.0      169.6        3383.7
max       15000.0      196.6        5480.9

Notice the range differences:
  distance_m:   291 – 15000
  pace_500m:    103.1 – 196.6
  time_seconds: 60 – 5481

⚠️  distance goes up to ~15000, but pace only to ~197.
    Without scaling, distance would DOMINATE the clustering!


In [16]:
# ── STEP 3.2: Scale the features (StandardScaler) ──────────────
#
# WHAT: Transform each feature so it has mean=0 and standard deviation=1.
#       This is called "standardization" or "z-score normalization".
#
# HOW the math works:
#       scaled_value = (original_value - mean) / standard_deviation
#
#   Example: If mean distance = 5000m and std = 3000m:
#     - 2000m → (2000-5000)/3000 = -1.0  (below average)
#     - 5000m → (5000-5000)/3000 =  0.0  (at average)
#     - 8000m → (8000-5000)/3000 = +1.0  (above average)
#
# WHY:
#   After scaling, ALL features are on a comparable scale (mostly within -3 to +3).
#   Now distance, pace, and time contribute EQUALLY to the clustering.
#
# WHY StandardScaler (not MinMaxScaler)?
#   MinMaxScaler squishes everything to [0,1] using only the min and max,
#   so one extreme workout would compress all others. StandardScaler uses
#   the mean and std of all rows, which a single extreme value distorts less.
#   Neither is truly outlier-robust; RobustScaler (median/IQR) is the
#   scikit-learn option when outliers are severe.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# fit_transform() does two things in one call:
#   1. fit():      Calculate the mean and std of each column
#   2. transform(): Apply the formula (value - mean) / std
X_scaled = scaler.fit_transform(cluster_df)

# Let's verify the scaling worked:
print("=== Feature Statistics (AFTER scaling) ===")
print(f"{'Feature':<15} {'Mean':>8} {'Std':>8} {'Min':>8} {'Max':>8}")
print("-" * 47)
for i, feat in enumerate(features):
    col = X_scaled[:, i]   # extract column i from the 2D array
    print(f"{feat:<15} {col.mean():>8.2f} {col.std():>8.2f} {col.min():>8.2f} {col.max():>8.2f}")

print(f"\n✅ All features now centered at ~0 with similar ranges!")

=== Feature Statistics (AFTER scaling) ===
Feature             Mean      Std      Min      Max
-----------------------------------------------
distance_m          0.00     1.00    -2.37     2.41
pace_500m           0.00     1.00    -4.98     2.70
time_seconds        0.00     1.00    -2.22     2.66

✅ All features now centered at ~0 with similar ranges!
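
As a sanity check on the formula, the sketch below standardizes a tiny made-up array by hand and compares the result with `StandardScaler`. The array `X_demo` is invented for illustration. One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `describe()` reports the sample standard deviation (`ddof=1`), so the two `std` figures differ slightly on small datasets.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows of [distance_m, pace_500m].
X_demo = np.array([
    [2000.0, 150.0],
    [5000.0, 165.0],
    [8000.0, 180.0],
    [15000.0, 170.0],
])

# By hand: (value - mean) / std, column by column. np.std defaults to ddof=0.
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# The same transformation via scikit-learn.
from_sklearn = StandardScaler().fit_transform(X_demo)

print("max absolute difference:", np.abs(manual - from_sklearn).max())
```

The two arrays should agree to floating-point precision, which confirms that `fit_transform()` is doing nothing more mysterious than the z-score formula.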


In [18]:
# ── STEP 3.3: The Elbow Method: Finding the best K ───────────
#
# WHAT: K-Means requires you to choose K (number of clusters) upfront.
#       The Elbow Method helps you pick a reasonable K.
#
# HOW it works:
#   Run K-Means for K=2, K=3, K=4, ..., K=10 and record the "inertia"
#   (also called "Within-Cluster Sum of Squares", or WCSS).
#
#   Inertia = sum of SQUARED distances from each point to its cluster center.
#   Lower inertia = tighter clusters.
#
#   But adding clusters keeps lowering inertia (K=N gives inertia=0!), so
#   the lowest inertia is not the goal. The trick: look for the "elbow",
#   the point where adding more clusters gives DIMINISHING returns.
#   That's your sweet spot.
#
#   A textbook curve looks like this (illustrative, not your data):
#     K=2 → huge drop in inertia
#     K=3 → big drop
#     K=4 → moderate drop  ← This is probably the elbow!
#     K=5 → tiny drop
#     K=6 → tiny drop
#
#   Real data often bends gradually, so treat the elbow as a judgment call.
#
# WHY random_state=42?
#   K-Means starts from randomly chosen initial centers. Setting random_state
#   makes it reproducible: you'll get the same results every time you run it.
#   42 is just a convention (answer to life, the universe, and everything 🙂).

from sklearn.cluster import KMeans

k_range = range(2, 11)      # test K from 2 to 10
inertias = []                # store inertia for each K

for k in k_range:
    kmeans = KMeans(
        n_clusters=k,         # number of clusters to create
        random_state=42,      # reproducibility
        n_init=10,            # run 10 times with different initial centers, pick best
    )
    kmeans.fit(X_scaled)      # run the algorithm on our scaled data
    inertias.append(kmeans.inertia_)  # .inertia_ = the WCSS for this K
    print(f"K={k:2d}  →  Inertia: {kmeans.inertia_:>10.1f}")

# Plot the Elbow curve
import plotly.express as px

fig_elbow = px.line(
    x=list(k_range), y=inertias,
    markers=True,
    title="🦾 Elbow Method: Finding the Optimal K",
    labels={"x": "Number of Clusters (K)", "y": "Inertia (WCSS)"},
)
fig_elbow.update_layout(template="plotly_white")
fig_elbow.show()

print("\n🔍 Look at the chart: Where does the curve 'bend' like an elbow?")
print("   That K value is your optimal number of clusters.")

K= 2  →  Inertia:       67.8
K= 3  →  Inertia:       41.8
K= 4  →  Inertia:       26.7
K= 5  →  Inertia:       15.8
K= 6  →  Inertia:        9.6
K= 7  →  Inertia:        6.4
K= 8  →  Inertia:        5.0
K= 9  →  Inertia:        4.0
K=10  →  Inertia:        3.3



🔍 Look at the chart: Where does the curve 'bend' like an elbow?
   That K value is your optimal number of clusters.
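
Inertia is easier to trust once you have computed it yourself. The sketch below fits K-Means on synthetic blobs (the data and the name `X_demo` are made up, not your workouts) and recomputes inertia by hand as the sum of squared distances from each point to its assigned center.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic blobs (made-up data).
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(20, 2)),
])

km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_demo)

# For every point, look up the center of the cluster it was assigned to,
# then sum the squared differences over all points and all features.
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = float(np.sum((X_demo - assigned_centers) ** 2))

print(f"manual: {manual_inertia:.4f}   sklearn: {km.inertia_:.4f}")
```

The two numbers should match. Because the distances are squared, a single far-away workout contributes disproportionately, which is one reason an outlier can end up in a cluster of its own.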


In [19]:
# ── STEP 3.4: Run K-Means with the chosen K ───────────────────
#
# WHAT: Run the final clustering with our chosen K.
#       (Change OPTIMAL_K below based on what your Elbow chart showed!)
#
# WHAT happens inside .fit()?
#   1. Pick K initial cluster centers in the feature space (scikit-learn
#      defaults to "k-means++", a randomized scheme that spreads them out)
#   2. ASSIGN each workout to the nearest center (using Euclidean distance)
#   3. RECALCULATE each center as the mean of all assigned workouts
#   4. Repeat steps 2-3 until assignments stop changing (convergence)
#      or max_iter is reached
#
# WHAT we get back:
#   - .labels_           → array of cluster IDs (0, 1, 2, ...) for each workout
#   - .cluster_centers_  → the K centroid coordinates in scaled space
#   - .inertia_          → how tight the clusters are (sum of squared distances)

# ⬇️ CHANGE THIS based on your Elbow chart! ⬇️
OPTIMAL_K = 4
#           ↑ If your elbow was at 3, change to 3. If at 5, use 5.

kmeans_final = KMeans(n_clusters=OPTIMAL_K, random_state=42, n_init=10)
kmeans_final.fit(X_scaled)

# Add cluster labels back to our DataFrame
cluster_df["cluster"] = kmeans_final.labels_
#                       ↑ labels_ is an array like [0, 2, 1, 0, 1, 2, ...]
#                         each number = which cluster that workout belongs to

print(f"✅ K-Means completed with K={OPTIMAL_K}")
print(f"\n=== Cluster Distribution ===")
print(cluster_df["cluster"].value_counts().sort_index())
print(f"\nInertia: {kmeans_final.inertia_:.1f}")

✅ K-Means completed with K=4

=== Cluster Distribution ===
cluster
0    29
1    23
2     4
3     1
Name: count, dtype: int64

Inertia: 26.7
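
The assign/recalculate loop described in the comments above can be sketched in a few lines of NumPy. This is a teaching version under simplifying assumptions, not scikit-learn's implementation: `lloyd_kmeans` is a hypothetical name, it uses plain random initialization rather than k-means++, and it runs only once instead of `n_init` times. The demo data is synthetic.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Bare-bones K-Means (Lloyd's algorithm) with plain random initialization."""
    rng = np.random.default_rng(seed)
    # Step 1: use k distinct data points as the starting centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move, which also means
        # the assignments have stopped changing.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated synthetic blobs (made-up data).
rng = np.random.default_rng(1)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=10.0, scale=0.3, size=(20, 2)),
])
demo_labels, demo_centers = lloyd_kmeans(X_demo, k=2)
print(demo_labels)
print(demo_centers.round(2))
```

On blobs this well separated, the loop will usually recover the two groups within a few iterations. Plain random initialization can still settle on a poor split, which is why scikit-learn defaults to k-means++ and why the cells above pass `n_init=10` to keep the best of several runs.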


In [20]:
# ── STEP 3.5: Interpret the clusters ───────────────────────────
#
# WHAT: Look at the average characteristics of each cluster to understand
#       what KIND of workout each cluster represents.
#
# WHY: K-Means gives us numbers (cluster 0, 1, 2...) that have no
#      inherent meaning. WE have to interpret them by examining the
#      averages. This is the "data science" part: turning numbers
#      into insights!
#
# HOW:
#   .groupby("cluster").agg(...) → average of each feature per cluster
#   Then we label each cluster based on its characteristics:
#     - Short distance + fast pace → "Sprint / Test"
#     - Long distance + moderate pace → "Endurance"
#     - Medium everything → "Steady-State"

cluster_stats = cluster_df.groupby("cluster").agg(
    avg_distance=("distance_m", "mean"),
    avg_pace=("pace_500m", "mean"),
    avg_duration_min=("time_seconds", lambda x: x.mean() / 60),  # convert to minutes
    count=("distance_m", "count"),
).round(1)

# Format pace as M:SS for readability
cluster_stats["avg_pace_fmt"] = cluster_stats["avg_pace"].apply(
    lambda s: f"{int(s // 60)}:{s % 60:04.1f}"
)

print("═══ CLUSTER PROFILES ═══\n")
for idx, row in cluster_stats.iterrows():
    print(f"Cluster {idx}:  {row['count']:.0f} workouts")
    print(f"  Avg Distance:  {row['avg_distance']:,.0f}m ({row['avg_distance']/1000:.1f}km)")
    print(f"  Avg Pace:      {row['avg_pace_fmt']} /500m")
    print(f"  Avg Duration:  {row['avg_duration_min']:.0f} minutes")

    # Auto-label based on characteristics
    if row["avg_distance"] < 3000 and row["avg_pace"] < 130:
        label = "⚡ Sprint / Speed Test"
    elif row["avg_distance"] < 3000:
        label = "🏃 Short Workout"
    elif row["avg_distance"] > 10000:
        label = "🚣 Long Endurance"
    elif row["avg_pace"] < 130:
        label = "🔥 Fast Medium-Distance"
    else:
        label = "💪 Steady-State"
    print(f"  → Auto-label:  {label}\n")

cluster_stats

═══ CLUSTER PROFILES ═══

Cluster 0:  29 workouts
  Avg Distance:  5,121m (5.1km)
  Avg Pace:      2:40.5 /500m
  Avg Duration:  27 minutes
  → Auto-label:  💪 Steady-State

Cluster 1:  23 workouts
  Avg Distance:  9,962m (10.0km)
  Avg Pace:      2:47.4 /500m
  Avg Duration:  56 minutes
  → Auto-label:  💪 Steady-State

Cluster 2:  4 workouts
  Avg Distance:  13,750m (13.8km)
  Avg Pace:      3:00.4 /500m
  Avg Duration:  82 minutes
  → Auto-label:  🚣 Long Endurance

Cluster 3:  1 workouts
  Avg Distance:  291m (0.3km)
  Avg Pace:      1:43.1 /500m
  Avg Duration:  1 minutes
  → Auto-label:  ⚡ Sprint / Speed Test



         avg_distance  avg_pace  avg_duration_min  count avg_pace_fmt
cluster
0              5121.3     160.5              27.4     29       2:40.5
1              9962.3     167.4              55.6     23       2:47.4
2             13750.0     180.4              82.0      4       3:00.4
3               291.0     103.1               1.0      1       1:43.1
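
The `.cluster_centers_` attribute stores centroids in scaled (z-score) units, which are hard to read. A sketch of converting them back with the scaler's `inverse_transform`, using made-up data; `demo`, `scaler_demo`, and `km` are hypothetical names. Because a centroid is the mean of its members and standardization is a linear rescaling, the converted centroids should match the per-cluster `groupby` means.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up workouts: a short/fast group and a long/slow group.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "distance_m": np.concatenate([rng.normal(2000, 100, 15), rng.normal(10000, 300, 15)]),
    "pace_500m": np.concatenate([rng.normal(115, 2, 15), rng.normal(165, 3, 15)]),
})

scaler_demo = StandardScaler()
X_demo = scaler_demo.fit_transform(demo)
km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_demo)

# Centroids are stored in scaled units; undo the scaling to get metres and seconds.
centers_original = pd.DataFrame(
    scaler_demo.inverse_transform(km.cluster_centers_),
    columns=demo.columns,
)

# Per-cluster means computed directly in original units.
group_means = demo.assign(cluster=km.labels_).groupby("cluster").mean()

print(centers_original.round(1))
print(group_means.round(1))
```

Both tables should show the same numbers, so either route works for profiling clusters. The `groupby` approach used in Step 3.5 has the advantage that you can add other aggregations, such as counts, in the same call.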


In [21]:
# ── STEP 3.6: Visualize the clusters (2D scatter) ─────────────
#
# WHAT: Plot each workout as a dot, colored by its cluster.
#       We'll show two views: Distance vs Pace, and Distance vs Duration.
#
# WHY two views?
#   Our data lives in 3D (distance, pace, time) but screens are 2D.
#   So we project it onto two 2D "views", like looking at a box from
#   the front and from the side. Each view reveals different cluster structure.
#
# WHY graph_objects here (not plotly express)?
#   px.scatter handles the color legend and hover text automatically when
#   you pass a "color" column, and is the right tool for a single chart.
#   Here we want two panels that share one legend entry per cluster, so we
#   add go.Scatter traces to make_subplots ourselves and tie them together
#   with legendgroup.

from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Color map for clusters
colors = ["#2196F3", "#FF5722", "#4CAF50", "#FFC107", "#9C27B0", "#00BCD4"]

fig_clusters = make_subplots(
    rows=1, cols=2,
    subplot_titles=["Distance vs Pace (colored by cluster)", 
                    "Distance vs Duration (colored by cluster)"],
    horizontal_spacing=0.12,
)

for cluster_id in sorted(cluster_df["cluster"].unique()):
    mask = cluster_df["cluster"] == cluster_id
    subset = cluster_df[mask]
    color = colors[cluster_id % len(colors)]
    
    # Format pace for hover
    pace_fmt = subset["pace_500m"].apply(lambda s: f"{int(s // 60)}:{s % 60:04.1f}")
    
    # Left plot: Distance vs Pace  
    fig_clusters.add_trace(go.Scatter(
        x=subset["distance_m"],
        y=subset["pace_500m"],
        mode="markers",
        name=f"Cluster {cluster_id}",
        marker=dict(size=8, color=color, opacity=0.7),
        hovertemplate=(
            f"Cluster {cluster_id}<br>"
            "Distance: %{x:,}m<br>"
            "Pace: %{text}<extra></extra>"
        ),
        text=pace_fmt,
        legendgroup=f"c{cluster_id}",  # group both plots under same legend entry
    ), row=1, col=1)
    
    # Right plot: Distance vs Duration
    fig_clusters.add_trace(go.Scatter(
        x=subset["distance_m"],
        y=subset["time_seconds"] / 60,   # convert to minutes for readability
        mode="markers",
        name=f"Cluster {cluster_id}",
        marker=dict(size=8, color=color, opacity=0.7),
        hovertemplate=(
            f"Cluster {cluster_id}<br>"
            "Distance: %{x:,}m<br>"
            "Duration: %{y:.0f} min<extra></extra>"
        ),
        legendgroup=f"c{cluster_id}",
        showlegend=False,   # don't duplicate legend entries
    ), row=1, col=2)

# Format y-axis on left plot as M:SS
min_pace = (int(cluster_df["pace_500m"].min()) // 5) * 5
max_pace = ((int(cluster_df["pace_500m"].max()) // 5) + 1) * 5
tickvals = list(range(min_pace, max_pace + 1, 10))
ticktext = [f"{v // 60}:{v % 60:02d}" for v in tickvals]

fig_clusters.update_xaxes(title_text="Distance (m)", row=1, col=1)
fig_clusters.update_xaxes(title_text="Distance (m)", row=1, col=2)
fig_clusters.update_yaxes(title_text="Pace /500m", tickvals=tickvals, ticktext=ticktext, row=1, col=1)
fig_clusters.update_yaxes(title_text="Duration (minutes)", row=1, col=2)

fig_clusters.update_layout(
    title="🎯 Workout Clusters: K-Means Results",
    template="plotly_white",
    height=500,
    width=1000,
)

fig_clusters.show()
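
The seconds-to-M:SS conversion appears in several places in this section (cluster table, hover text, axis ticks). A small helper keeps them consistent. `format_pace` is a hypothetical name, not something defined earlier in the notebook. It rounds before splitting into minutes and seconds, which avoids an edge case in the inline lambdas: a pace of 119.97 seconds would otherwise print as `1:60.0`.

```python
def format_pace(seconds, decimals=1):
    """Format a pace in seconds per 500 m as M:SS (e.g. 163.7 -> '2:43.7')."""
    # Round first so 119.97 becomes 120.0 -> '2:00.0' rather than '1:60.0'.
    total = round(float(seconds), decimals)
    minutes, secs = divmod(total, 60)
    # Zero-pad the seconds: 'SS.s' is 4 characters wide, 'SS' is 2.
    width = 3 + decimals if decimals > 0 else 2
    return f"{int(minutes)}:{secs:0{width}.{decimals}f}"

print(format_pace(163.7))              # 2:43.7
print(format_pace(119.97))             # 2:00.0
print(format_pace(150, decimals=0))    # 2:30
```

With this in place, the hover text could use `subset["pace_500m"].apply(format_pace)` and the axis labels could use `[format_pace(v, decimals=0) for v in tickvals]`.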

In [22]:
# ── STEP 3.7: Training Balance Pie Chart ───────────────────────
#
# WHAT: A pie chart showing what percentage of workouts fall into
#       each cluster, revealing your training balance.
#
# WHY: If 80% of workouts are steady-state, you might be missing speed work.
#      If 90% are sprints, you might need more endurance. This visualization
#      makes imbalances OBVIOUS at a glance.

cluster_counts = cluster_df["cluster"].value_counts().sort_index()

# Create labels (you can customize these after seeing Step 3.5 output!)
labels = [f"Cluster {i}" for i in cluster_counts.index]

fig_pie = go.Figure(data=[go.Pie(
    labels=labels,
    values=cluster_counts.values,
    marker=dict(colors=[colors[i % len(colors)] for i in cluster_counts.index]),  # same color per cluster ID as the scatter plots
    textinfo="label+percent",       # show both label and percentage on slices
    hovertemplate="%{label}<br>Count: %{value}<br>Share: %{percent}<extra></extra>",
)])

fig_pie.update_layout(
    title="📊 Training Balance: Workout Distribution by Cluster",
    template="plotly_white",
)

fig_pie.show()
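
Once you have read the profiles from Step 3.5, you can swap the generic "Cluster N" labels for names of your own. A sketch, assuming a hand-written dictionary: `cluster_names` and the names in it are hypothetical and should come from your own interpretation. The counts are the ones printed in Step 3.4.

```python
import pandas as pd

# Cluster sizes as printed in Step 3.4 (index = cluster ID).
cluster_counts_demo = pd.Series([29, 23, 4, 1], index=[0, 1, 2, 3], name="count")

# Hypothetical names chosen by reading the cluster profiles.
cluster_names = {
    0: "Steady 5k",
    1: "Steady 10k",
    2: "Long endurance",
}

# Fall back to the generic label for any cluster you have not named yet.
labels_demo = [cluster_names.get(i, f"Cluster {i}") for i in cluster_counts_demo.index]

# Percentage share of each cluster, rounded to one decimal place.
shares = (cluster_counts_demo / cluster_counts_demo.sum() * 100).round(1)

for name, count, share in zip(labels_demo, cluster_counts_demo.tolist(), shares.tolist()):
    print(f"{name:<16} {count:>3}  {share:>5.1f}%")
```

Passing a list built this way as `labels=` to `go.Pie` gives readable slices while still working if you later change K and forget to name a new cluster.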

# ═══════════════════════════════════════════════
# 🎓 Recap: What You Learned
# ═══════════════════════════════════════════════

 ### Section 1: Training Heatmap
 - **groupby + agg**: how to aggregate data by a category (date)
 - **reindex + fillna**: how to fill missing dates with zero values
 - **pivot_table**: how to reshape long data into a 2D matrix
 - **go.Heatmap**: how to render a matrix as a color-coded calendar

 ### Section 2: Linear Regression
 - **np.polyfit**: the quick way to fit a straight line (y = mx + b)
 - **sklearn.LinearRegression**: the standard ML fit/predict/score pattern
 - **R²**: how to measure how good a model's fit is (0–1)
 - **Rolling averages**: how to smooth noisy time-series data
 - **Interpretation**: slope < 0 means pace is decreasing, so you are getting faster!

 ### Section 3: K-Means Clustering
 - **StandardScaler**: why and how to normalize features before distance-based ML
 - **Elbow Method**: how to choose the optimal number of clusters
 - **KMeans.fit()**: the unsupervised learning workflow
 - **Cluster interpretation**: turning numeric labels into meaningful categories
 - **make_subplots**: how to build multi-panel visualizations

 ### Libraries Cheat Sheet
 | Library | Used For | Key Functions |
 |---------|----------|---------------|
 | **pandas** | Data manipulation | `read_csv`, `groupby`, `pivot_table`, `rolling` |
 | **numpy** | Math & arrays | `polyfit`, `polyval`, `reshape` |
 | **plotly** | Interactive charts | `go.Heatmap`, `go.Scatter`, `px.line` |
 | **scikit-learn** | Machine Learning | `StandardScaler`, `KMeans`, `LinearRegression` |

 ### Next Steps
 - Try changing K in the clustering section. How do the clusters change?
 - Add stroke_rate or calories as additional clustering features
 - Try polynomial regression (degree=2) to capture non-linear trends
 - Filter by machine type and compare clusters for RowErg vs SkiErg