# 02 — EDA & Two-way ANOVA
- visuals for distributions and group differences
- ANOVA for Session_Type, Time_of_Day, and interaction

# 🚴 Cycling Effectiveness Analysis

This project explores the question:

“Which is more effective: indoor cycling in the mornings or afternoons versus outdoor cycling in the mornings or afternoons?”

I use a cycling activity dataset (from Kaggle) containing detailed session logs, including date/time, activity type (indoor vs outdoor), duration, distance, calories burned, heart rate, speed, cadence, power, and training stress score.

The goal is to define and measure effectiveness of a cycling session, compare performance across different conditions (indoor/outdoor × morning/afternoon), and build models that can predict workout effectiveness based on session features.



### Objectives
1. Data Cleaning & Preparation
 - Parse timestamps into separate date and time columns.
 - Engineer categorical features for Session Type (Indoor vs Outdoor) and Time of Day (Morning vs Afternoon).
 - Handle missing values (--) and normalize numeric columns.
 
2. Exploratory Data Analysis (EDA)
 - Summarize metrics (calories per minute, average HR, speed) across categories.
 - Visualize distributions with boxplots/violin plots.
 - Test for statistical significance (two-way ANOVA).
 
3. Effectiveness Definition
 - Primary: Calories per minute (caloric efficiency).
 - Alternative: Power per minute or Training Stress Score per minute.
 
4. Machine Learning Modeling
 - Regression (TensorFlow): Predict calories per minute from ride metrics.
 - Classification (TensorFlow): Predict whether a ride will be “high effectiveness” vs “low effectiveness.”
 
5. Insights & Conclusions
 - Identify which combination (indoor/outdoor × morning/afternoon) yields the most effective workouts.
 - Evaluate whether time of day or ride type is a stronger predictor of effectiveness.
 - Provide evidence-based recommendations for training optimization.
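
The preparation steps in objective 1 and the primary effectiveness metric in objective 3 can be sketched as follows. This is a minimal illustration, not the cleaning code actually used to produce `cleaned_cycling.csv`: the raw column names (`Date`, `Activity Type`, `Time`), the sample rows, the `prepare` helper, and the noon cutoff between Morning and Afternoon are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical raw export: column names and values are illustrative only.
raw = pd.DataFrame({
    "Date": ["2023-05-01 07:15:00", "2023-05-01 14:30:00", "2023-05-02 06:45:00"],
    "Activity Type": ["Indoor Cycling", "Cycling", "Cycling"],
    "Time": ["00:45:00", "01:30:00", "01:00:00"],   # ride duration as HH:MM:SS
    "Calories": ["450", "--", "600"],               # "--" marks a missing value
})

def prepare(raw):
    """Return a copy of `raw` with engineered features (hypothetical helper)."""
    out = raw.copy()

    # Parse the timestamp and split it into separate date and time columns.
    ts = pd.to_datetime(out["Date"])
    out["Ride_Date"] = ts.dt.date
    out["Ride_Time"] = ts.dt.time

    # Categorical features. The noon cutoff is an assumption for this sketch.
    is_indoor = out["Activity Type"].str.contains("Indoor")
    out["Session_Type"] = np.where(is_indoor, "Indoor", "Outdoor")
    out["Time_of_Day"] = np.where(ts.dt.hour < 12, "Morning", "Afternoon")

    # Non-numeric placeholders such as "--" are coerced to NaN.
    out["Calories"] = pd.to_numeric(out["Calories"], errors="coerce")

    # Primary effectiveness metric: calories per minute.
    out["Duration_Min"] = pd.to_timedelta(out["Time"]).dt.total_seconds() / 60
    out["Calories_per_Min"] = out["Calories"] / out["Duration_Min"]
    return out

prepared = prepare(raw)
print(prepared[["Session_Type", "Time_of_Day", "Calories_per_Min"]])
```

Rows whose calories were logged as `--` keep a `NaN` effectiveness value, so they can be dropped or imputed explicitly rather than silently treated as zero.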


### Tools & Libraries
 - Data Wrangling: pandas, numpy
 - Visualization: matplotlib, seaborn
 - Statistics: scipy, statsmodels (ANOVA)
 - Machine Learning: scikit-learn (utilities), TensorFlow/Keras (models)
 - Notebook Environment: JupyterLab

✨ Expected Outcome:
This project delivers both descriptive insights (e.g., “Outdoor morning rides burn more calories per minute than indoor afternoons”) and predictive models that help forecast session effectiveness under different conditions.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, r2_score, classification_report
from sklearn.utils import shuffle

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

df = pd.read_csv("/Users/amlim/cycling-effectiveness/data/cleaned_cycling.csv")


## Visualization

In [None]:
df = df.rename(columns={
    "Training Stress Score": "Training_Stress_Score",
    "Session Type": "Session_Type",
    "Time of Day": "Time_of_Day"
})
# focusing on columns: Session Type, Time of Day, Calories, Distance, Training Stress Score
sns.set(style="whitegrid", palette="muted", font_scale=1.2)

sns.boxplot(data=df, x="Time_of_Day", y="Calories", hue="Session_Type")
plt.title("Calories by Session Type and Time of Day")
plt.show()

In [None]:
sns.boxplot(data=df, x="Time_of_Day", y="Training_Stress_Score", hue="Session_Type")
plt.title("Training Stress Score by Session Type and Time of Day")
plt.show()

In [None]:
sns.violinplot(data=df, x="Time_of_Day", y="Calories", hue="Session_Type", split=True)
plt.title("Calories Distribution by Session Type & Time of Day")
plt.show()

In [None]:
sns.scatterplot(data=df, x="Distance", y="Calories", hue="Session_Type", style="Time_of_Day")
plt.title("Calories vs. Distance by Session Type & Time of Day")
plt.show()

sns.pairplot(df, vars=["Calories", "Distance", "Training_Stress_Score"], hue="Session_Type")
plt.show()

In [None]:
df.groupby(["Session_Type","Time_of_Day"])[["Calories","Training_Stress_Score"]].mean()

In [None]:

model_cal = smf.ols("Calories ~ C(Session_Type) * C(Time_of_Day)", data=df).fit()
anova_cal = sm.stats.anova_lm(model_cal, typ=2)
print("ANOVA for Calories\n", anova_cal)



Session Type is not significant (p=0.14). Total calories burned do not differ detectably between indoor and outdoor rides.

Time of Day is highly significant (p<0.001). Calories burned vary across Morning, Afternoon, and Evening.

Interaction is not significant. The effect of Time of Day on calories does not depend on whether the ride is indoors or outdoors.
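
A p-value says whether an effect is detectable, not how large it is. One way to gauge size is partial eta-squared, computed per term as SS_effect / (SS_effect + SS_residual). The sketch below assumes a table shaped like the output of `anova_lm` (a `sum_sq` column and a `Residual` row); the `add_partial_eta_squared` helper is hypothetical, and the sums of squares are made-up numbers, not results from this dataset.

```python
import pandas as pd

def add_partial_eta_squared(anova_table):
    """Append partial eta-squared to an anova_lm-style table (hypothetical helper)."""
    table = anova_table.copy()
    ss_resid = table.loc["Residual", "sum_sq"]
    # Share of (effect + error) variance attributable to each term.
    table["partial_eta_sq"] = table["sum_sq"] / (table["sum_sq"] + ss_resid)
    # The statistic is not defined for the residual row itself.
    table.loc["Residual", "partial_eta_sq"] = float("nan")
    return table

# Made-up sums of squares, for illustration only.
demo_table = pd.DataFrame(
    {"sum_sq": [10.0, 60.0, 5.0, 25.0], "df": [1, 2, 2, 94]},
    index=[
        "C(Session_Type)",
        "C(Time_of_Day)",
        "C(Session_Type):C(Time_of_Day)",
        "Residual",
    ],
)
effect_sizes = add_partial_eta_squared(demo_table)
print(effect_sizes)
```

In the notebook, the same helper could be applied to `anova_cal` or `anova_tss` directly.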

In [None]:
model_tss = smf.ols("Training_Stress_Score ~ C(Session_Type) * C(Time_of_Day)", data=df).fit()
anova_tss = sm.stats.anova_lm(model_tss, typ=2)
print("ANOVA for TSS\n", anova_tss)

Session Type is not significant. TSS does not differ between indoor and outdoor rides.

Time of Day is highly significant.

Interaction is not significant.
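
A significant Time of Day effect with more than two levels does not say which times differ from each other. A post hoc comparison such as Tukey's HSD is one common follow-up. The sketch below runs it on synthetic data (the group means, spread, and sample sizes are invented for illustration), so its output says nothing about the real rides; in the notebook the call would take `df["Calories"]` or `df["Training_Stress_Score"]` with `df["Time_of_Day"]` as the grouping variable.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic data: three time-of-day groups with invented means.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Time_of_Day": np.repeat(["Morning", "Afternoon", "Evening"], 30),
    "Calories": np.concatenate([
        rng.normal(500, 50, 30),   # Morning
        rng.normal(650, 50, 30),   # Afternoon
        rng.normal(510, 50, 30),   # Evening
    ]),
})

# All pairwise comparisons with family-wise error rate held at 5%.
tukey = pairwise_tukeyhsd(
    endog=demo["Calories"],
    groups=demo["Time_of_Day"],
    alpha=0.05,
)
print(tukey.summary())
```

Each row of the summary gives the mean difference for one pair of groups, its adjusted p-value, and whether the null hypothesis of equal means is rejected.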