# Project Title
*all language subject to change*

**Author:** Axel Christian Cabato

**Date:** [Date]

## 1. Introduction
The goal of this project is to use a [Kaggle](https://www.kaggle.com) dataset to perform data analysis and generate a report. I document my processes, insights, and conclusions in this Jupyter Notebook.

This analysis...

## 2. Dataset Loading & Exploratory Data Analysis

- Data Source: https://www.kaggle.com/datasets/valakhorasani/gym-members-exercise-dataset?select=gym_members_exercise_tracking.csv
- Data Format: Comma-separated values (CSV)
- [Kaggle](https://www.kaggle.com/datasets/valakhorasani/gym-members-exercise-dataset?select=gym_members_exercise_tracking.csv) Description: This dataset provides a detailed overview of gym members' exercise routines, physical attributes, and fitness metrics, including key performance indicators such as heart rate, calories burned, and workout duration.

In [None]:
# Import pathlib and pandas libraries
from pathlib import Path
import pandas as pd

# Define the base directory
BASE_DIR = Path.cwd()

# Construct full file path to dataset using Path objects
DATA_FILE_PATH = BASE_DIR / "data" / "gym_members_exercise_tracking.csv"

# Use the path object in the read_csv function
df = pd.read_csv(DATA_FILE_PATH)


# Confirm successful load of the dataset by previewing the last 5 observations
df.tail()

### Exploratory Data Analysis

#### Data Profiling

In [None]:
# Output a concise summary of the DataFrame
print("DATAFRAME SUMMARY")
df.info()

print("\n")

# Check for any missing values
print("MISSING VALUES CHECK")
print(df.isnull().sum())

Through our profiling of the dataset, we can confirm its structural integrity. It consists of 973 observations and 15 columns, with each column appropriately named and typed according to its quantitative or qualitative nature. A complete check for missing values across all fields revealed none.
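
The same structural checks can be captured in a small reusable helper. This is a minimal sketch, not part of the original workflow: the three-row `sample` frame and the `profile_summary` function are hypothetical stand-ins used only for illustration.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real dataset (values are illustrative only)
sample = pd.DataFrame({
    "Age": [56, 46, 32],
    "Gender": ["Male", "Female", "Female"],
    "BMI": [30.2, 32.0, 24.7],
})


def profile_summary(frame):
    """Return the row count, column count, and total number of missing cells."""
    n_rows, n_cols = frame.shape
    n_missing = int(frame.isnull().sum().sum())
    return {"rows": n_rows, "columns": n_cols, "missing": n_missing}


summary = profile_summary(sample)
print(summary)
```

Applied to the full `df`, the helper would be expected to report the 973 observations, 15 columns, and zero missing values described above.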

In [None]:
# Generate descriptive statistics for numerical columns
df.describe()

This initial inspection showed no obvious data entry errors; however, further univariate analysis revealed a critical positive skew and extreme outliers in **BMI** and **Weight** which will be analyzed as a high-risk cohort.
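
Skew and outliers can also be quantified rather than read off a plot. The sketch below assumes two common conventions that the original analysis does not necessarily use: sample skewness from `Series.skew()` and Tukey's 1.5 × IQR rule for flagging outliers. The `values` series is hypothetical.

```python
import pandas as pd

# Hypothetical BMI-like values with one large value on the right (illustrative only)
values = pd.Series([20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 45.0], name="BMI")

# Sample skewness: a positive value indicates a longer right tail
skewness = values.skew()

# Tukey's rule: flag points more than 1.5 * IQR beyond the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"skewness = {skewness:.2f}")
print(outliers.tolist())
```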

In [None]:
# Count the observations for each combination of "Gender" and "Workout_Type"
df[["Gender", "Workout_Type"]].value_counts()

All categorical features contain a small and consistent set of unique values. Particularly for **Gender** and **Workout_Type**:
- **Gender** is a binary categorical variable with only two distinct classes ("Male" and "Female"). The absence of additional unique values, such as inconsistent spellings, abbreviations, or missing value placeholders, confirms the high degree of data consistency for this feature.
- Similarly, **Workout_Type** has a small number of distinct and consistently labeled classes: "Cardio", "Strength", "HIIT" and "Yoga". This categorical integrity ensures that the variable is ready for direct use in analysis or for a simple transformation into a quantitative format, such as one-hot encoding, without requiring a separate data cleaning stage.
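
One way to make this consistency check explicit is to compare each column's labels against the set of labels it is expected to contain. This is a sketch under assumptions: the `sample` rows are hypothetical, and the `expected` sets simply restate the classes listed above.

```python
import pandas as pd

# Hypothetical rows; the expected labels come from the text above
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Workout_Type": ["Cardio", "Strength", "HIIT", "Yoga"],
})

expected = {
    "Gender": {"Male", "Female"},
    "Workout_Type": {"Cardio", "Strength", "HIIT", "Yoga"},
}

# Collect any label that falls outside the expected set for its column
unexpected = {
    col: sorted(set(sample[col].dropna().unique()) - allowed)
    for col, allowed in expected.items()
}
print(unexpected)
```

An empty list for every column means no stray spellings or placeholder values were found.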

---

#### Recognizing the Data Source & Context

While clean in structure, the dataset contains several potential biases, limitations, and quirks that a data analyst must consider. The primary bias is that the dataset is simulated and was generated using averages from publicly available studies and industry reports. This means the data may under- or over-represent certain behaviors or characteristics.
- For instance, the randomization of **Experience_Level** and **Workout_Frequency** might not perfectly reflect the actual distribution of gym members, where, for example, a large number might be beginners who work out less frequently. This synthetic nature is *the most significant limitation*, as it lacks the unpredictable and messy nuances of real human behavior.
- Any insights or models derived from this dataset would need to be validated with actual, real-world data before being applied to a genuine scenario.

The dataset also has a few quirks that are uncommon in real-world data. It has **no missing values** and all categorical values are perfectly consistent, *which is highly unusual*. 

Furthermore, the data is simplified and contains only the variables that were explicitly defined in the generation process. 
- For example, the **Workout_Type** column is limited to a small, consistent set of categories (*Cardio*, *Strength*, *Yoga*, *HIIT*), and does not reflect the full range of possible exercises performed by gym members.

> This foundational understanding will serve as a solid basis for our deeper exploratory data analysis.

---

To prepare the data for any audience who might be more familiar with imperial units, I will perform some feature engineering by constructing new attributes from the existing dataset.
- Specifically, I will convert the **Weight (kg)** and **Height (m)** variables from their current metric system to their imperial counterparts. This will be done to ensure the data is standardized for any subsequent statistical analysis and for enhanced data visualization tailored to our target audience.

In [None]:
# Meter to feet conversion (1 m ≈ 3.28084 ft)
df["Height (ft)"] = (df["Height (m)"] * 3.28084).round(2)

# Kilogram to pound conversion (1 kg ≈ 2.20462 lb), rounded to the nearest pound
df["Weight (lb)"] = (df["Weight (kg)"] * 2.20462).round()

# Verify post-conversion values are correct
df[["Height (m)", "Height (ft)", "Weight (kg)", "Weight (lb)"]]

In [None]:
df[["Height (ft)", "Weight (lb)"]].describe()

I use the `.describe()` method to validate the newly engineered **Weight (lb)** and **Height (ft)** features, confirming that the new columns have a reasonable range of values and are correctly populated. This ensures the integrity of our dataset for subsequent analysis.
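
A complementary check is a round trip: convert the imperial values back to metric and confirm the difference stays within the rounding error. The sketch below is illustrative only. The `sample` rows are hypothetical, the constants `M_TO_FT` and `KG_TO_LB` are standard conversion factors, and the tolerances are assumptions chosen to match rounding to 0.01 ft and to the nearest pound.

```python
import pandas as pd

M_TO_FT = 3.28084    # feet per metre
KG_TO_LB = 2.20462   # pounds per kilogram

# Hypothetical rows (illustrative values only)
sample = pd.DataFrame({
    "Height (m)": [1.50, 1.71, 2.00],
    "Weight (kg)": [40.0, 88.3, 129.9],
})

sample["Height (ft)"] = (sample["Height (m)"] * M_TO_FT).round(2)
sample["Weight (lb)"] = (sample["Weight (kg)"] * KG_TO_LB).round()

# Convert back and measure the worst-case rounding error in metric units
height_error = (sample["Height (ft)"] / M_TO_FT - sample["Height (m)"]).abs().max()
weight_error = (sample["Weight (lb)"] / KG_TO_LB - sample["Weight (kg)"]).abs().max()

print(height_error < 0.01, weight_error < 0.25)
```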

> I will be using imperial units in my analyses going forward.

---

#### Data Visualization

Having performed the necessary data profiling and cleaning, I can now move on to Data Visualization. By visually exploring the dataset, I'll gain a deeper understanding of the health metrics and workout habits of the simulated gym members.

I have determined that the first step of my visual analysis should be to examine the distribution of our numerical features individually, using *Univariate Analysis*.

##### Univariate Analysis

This type of analysis will allow me to understand the central tendency and the spread of the data, and to easily spot any potential outliers. To accomplish this, I will generate histograms for each of the key numerical columns.

In [None]:
# Import Matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Define the numerical features for univariate analysis
numerical_features = ["Age", "Weight (lb)", "Height (ft)", "Calories_Burned",
                      "Session_Duration (hours)", "Fat_Percentage", "BMI"]

# Set the style for the plots
sns.set_style("whitegrid")

# Create a figure and a set of subplots (3x3 grid for 7 plots)
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(18, 15))

# Flatten the axes array to easily iterate through it
axes = axes.flatten()

# Iterate through the numerical features and create a histogram for each
for i, feature in enumerate(numerical_features):
    ax = axes[i]
    # Use seaborn's histplot to create a histogram with a KDE curve
    sns.histplot(data=df, x=feature, kde=True, ax=ax, color="skyblue")
    ax.set_title(f"Distribution of {feature}", fontsize=14)
    ax.set_xlabel(feature, fontsize=12)
    ax.set_ylabel("Count", fontsize=12)

# Hide any unused subplots
for j in range(len(numerical_features), len(axes)):
    axes[j].axis("off")

# Set a main title for the entire figure
fig.suptitle(
    "Univariate Analysis: Histograms of Key Numerical Features", fontsize=20, y=1.02)

# Adjust layout to prevent titles from overlapping
plt.tight_layout()

# Display the plots
plt.show()

##### Some Key Insights:

1. **Age**
    - The symmetrical distribution, centered around 39 years old, confirms the fitness market's primary demographic appeal lies consistently within the active adult range (18-59), with minimal concentration at the extremes.
2. **Workout_Frequency (days/week)** (Histogram not shown above)
    - The high concentration of members engaging in exercise 3 to 4 days per week (nearly 70%) reveals a general population trend toward a sustainable, moderate habit rather than extreme commitment.
3. **Calories_Burned**
    - The high standard deviation in calorie expenditure, despite a moderate average session duration of only 1.26 hours, implies that intensity, personal physiology, and efficiency are the primary drivers of performance variation, not simply workout time.
4. **BMI & Fat_Percentage**
    - While Fat_Percentage exhibits a balanced, near-symmetrical distribution indicative of a typical gym population, the BMI feature displays a critical positive skew that, when combined with high-value outliers, mandates a strategic intervention for a vulnerable, high-risk cohort of members.
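
Because calories and hours are measured on very different scales, their standard deviations are not directly comparable. One option, which the original analysis does not necessarily use, is the coefficient of variation (standard deviation divided by the mean). A minimal sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical session records (illustrative values only)
sample = pd.DataFrame({
    "Session_Duration (hours)": [1.0, 1.2, 1.3, 1.4, 1.5],
    "Calories_Burned": [500.0, 800.0, 900.0, 1100.0, 1400.0],
})

# Coefficient of variation: standard deviation relative to the mean (unitless)
cv = sample.std() / sample.mean()
print(cv.round(2))
```

A larger coefficient for one feature indicates more relative spread in that feature, independent of its units.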

In conclusion, the typical gym member is a moderately aged adult, centered around 39 years old, who achieves a consistent performance baseline during workouts, characterized by a steady session duration averaging 1.26 hours and a predictable energy expenditure of approximately 905 calories. This population exhibits a healthy cardiovascular profile, with Resting BPM tightly distributed around 62 beats per minute, confirming the overall fitness level; however, the significant positive skew and high range observed in Weight (from 40 kg to 129.9 kg) isolate a vital minority cohort whose specialized physical needs deviate sharply from the average, requiring tailored high-impact or low-impact programming. Ultimately, the tight clustering of core metrics (Age, Heart Rate, Session Duration) suggests that intensity and individual physiology are the primary drivers of performance variation, not simply time spent exercising.

##### Categorical Analysis

Next, I move on to Categorical Analysis. I will use bar charts to analyze the counts and proportions of each categorical variable to help me understand the composition of the dataset and how different groups behave.
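
The numbers behind such bar charts come from two basic operations: counting rows per category and averaging a numerical column per category. The following sketch shows both on a hypothetical six-row `sample` frame.

```python
import pandas as pd

# Hypothetical rows (illustrative values only)
sample = pd.DataFrame({
    "Workout_Type": ["Cardio", "HIIT", "Cardio", "Yoga", "HIIT", "Strength"],
    "Calories_Burned": [800.0, 1000.0, 900.0, 600.0, 1100.0, 950.0],
})

# Number of rows per category (what a count plot displays)
counts = sample["Workout_Type"].value_counts()

# Mean of a numerical column per category (what a bar plot of averages displays)
mean_calories = sample.groupby("Workout_Type")["Calories_Burned"].mean()

print(counts)
print(mean_calories)
```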

In [None]:
def add_value_labels(ax, position="on_top", fmt='{:.0f}'):
    """
    Adds value labels to each bar in a plot.

    Args:
        ax (plt.Axes): The axes object to add labels to.
        position (str): The position of the labels.
                        'on_top' (default) places labels above the bars.
                        'within' places labels inside the bars.
        fmt (str): The format string for the labels (e.g., '{:.0f}' for integers,
                   '{:.1f}' for one decimal place).
    """
    # Loop over each bar (patch) in the axes
    for p in ax.patches:
        # Get the height of the bar
        height = p.get_height()
        # Define the x-coordinate for the text (center of the bar)
        x = p.get_x() + p.get_width() / 2.

        if position == 'on_top':
            # Position the text slightly above the bar
            y = height + 1
            # Add the text label
            ax.text(x, y, fmt.format(height),
                    ha='center', va='bottom', fontsize=10)
        elif position == 'within':
            # Position the text in the middle of the bar
            y = height / 2
            # Add the text label with white color for contrast
            ax.text(x, y, fmt.format(height), ha='center',
                    va='center', color='white', fontsize=10)


def create_count_and_bar_charts(df):
    """
    Generates a set of categorical charts with labels placed on top of or within the bars.

    Args:
        df (pd.DataFrame): The DataFrame containing the data for the plots.
    """
    # Set the visual style for the plots using seaborn
    sns.set_style("whitegrid")

    # Create a figure with a 2x2 grid of subplots
    fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(18, 12))
    # Flatten the axes array to easily iterate through them
    axes = axes.flatten()

    # --- Plot 1: Count of Gender (Labels within) ---
    # Add hue=x to avoid Seaborn deprecation warning.
    sns.countplot(x='Gender', hue='Gender', data=df,
                  ax=axes[0], palette='viridis')
    axes[0].set_title('Distribution of Members by Gender', fontsize=14)
    axes[0].set_xlabel('Gender', fontsize=12)
    axes[0].set_ylabel('Count', fontsize=12)
    # Add labels using the utility function, placed within the bars
    add_value_labels(axes[0], position='within')

    # --- Plot 2: Count of Workout Type (Labels within) ---
    # Add hue=x to avoid Seaborn deprecation warning.
    sns.countplot(x='Workout_Type', hue='Workout_Type',
                  data=df, ax=axes[1], palette='plasma')
    axes[1].set_title('Frequency of Different Workout Types', fontsize=14)
    axes[1].set_xlabel('Workout Type', fontsize=12)
    axes[1].set_ylabel('Count', fontsize=12)
    # Rotate x-axis labels for readability
    axes[1].tick_params(axis='x', rotation=45)
    # Add labels using the utility function, placed within the bars
    add_value_labels(axes[1], position='within')

    # --- Plot 3: Average Calories Burned by Workout Type (Labels within) ---
    # Add hue=x to avoid Seaborn deprecation warning.
    sns.barplot(x='Workout_Type', y='Calories_Burned', hue='Workout_Type',
                data=df, ax=axes[2], palette='cividis', errorbar=None)
    axes[2].set_title('Average Calories Burned by Workout Type', fontsize=14)
    axes[2].set_xlabel('Workout Type', fontsize=12)
    axes[2].set_ylabel('Average Calories Burned', fontsize=12)
    # Rotate x-axis labels for readability
    axes[2].tick_params(axis='x', rotation=45)
    # Add labels using the utility function, placed within, formatted to one decimal place
    add_value_labels(axes[2], position='within', fmt='{:.1f}')

    # --- Plot 4: Average Session Duration by Experience Level (no value labels) ---
    # Add hue=x to avoid Seaborn deprecation warning.
    sns.barplot(x='Experience_Level', y='Session_Duration (hours)', hue='Experience_Level',
                data=df, ax=axes[3], palette='magma', errorbar=None, legend=False)
    axes[3].set_title(
        'Average Session Duration by Experience Level', fontsize=14)
    axes[3].set_xlabel('Experience Level', fontsize=12)
    axes[3].set_ylabel('Average Session Duration (hours)', fontsize=12)
    # Ensure x-axis labels are integers
    axes[3].xaxis.set_major_locator(plt.MaxNLocator(integer=True))

    # Add a main title for the entire figure
    fig.suptitle('Categorical Analysis: Enhanced Bar Charts',
                 fontsize=20, y=1.02)

    # Automatically adjust subplot parameters to give a tight layout
    plt.tight_layout()

    # Display the plots
    plt.show()


create_count_and_bar_charts(df)

##### Insights

1. **Gender**
    - The near-perfect gender parity (Male 52.5%, Female 47.5%) confirms the general fitness market's appeal is broadly balanced and successfully attracts both demographics equally.
2. **Workout_Type**
    - The remarkably even distribution across all four primary exercise types (ranging from 22.7% to 26.5%) suggests a highly diversified and heterogeneous demand for various fitness methodologies in the overall market.
3. **Experience_Level**
    - The overwhelming concentration of members at the Beginner (Level 1) and Intermediate (Level 2) stages (≈80%) signifies a clear market-wide imperative to focus on retention and guided training pathways for novice users.
4. **Workout_Frequency (days/week)**
    - The vast majority of members (nearly 70%) commit to exercising 3 or 4 days per week, indicating that sustainable, moderate attendance is the dominant commitment pattern across the population.
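
Percentages like those above can be derived with `value_counts(normalize=True)`. This is a minimal sketch on a hypothetical series of eight workout frequencies, so the resulting share is illustrative and is not the dataset's figure.

```python
import pandas as pd

# Hypothetical workout frequencies for eight members (illustrative only)
freq = pd.Series([2, 3, 3, 4, 4, 4, 5, 5], name="Workout_Frequency (days/week)")

# Share of members at each frequency, as percentages
shares = freq.value_counts(normalize=True).sort_index() * 100

# Combined share of members training 3 or 4 days per week
moderate_share = shares.loc[[3, 4]].sum()

print(shares)
print(f"{moderate_share:.1f}%")
```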

Based on the data, I can conclude that the overall fitness market demonstrates balanced appeal and diverse interests, evidenced by near-perfect gender parity and an even distribution of demand across all four major workout categories. Furthermore, the commitment profile is strongly anchored in sustainability, with almost 70% of members consistently engaging in a moderate schedule of three to four workout days per week. This widespread moderate commitment, combined with the fact that 80% of the population is classified as Beginner or Intermediate, establishes a clear, unified market vulnerability that necessitates immediate investment in standardized, supportive programming for novice retention.

##### Bivariate Analysis

*intro text*

In [None]:
# --- Simplified helper for adding labels above bars (redefines the earlier add_value_labels) ---
def add_value_labels(ax, fmt='{:.0f}'):
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x() + p.get_width() / 2., height + 10,
                fmt.format(height), ha="center", va="bottom", fontsize=9)

# --- Bivariate Analysis Setup ---
numerical_cols = ['Age', 'Weight (kg)', 'Height (m)', 'Max_BPM', 'Avg_BPM', 'Resting_BPM', 'Session_Duration (hours)', 'Calories_Burned', 'Water_Intake (liters)', 'BMI', 'Fat_Percentage']
correlation_matrix = df[numerical_cols].corr()
workout_type_summary = df.groupby('Workout_Type')['Calories_Burned'].mean().sort_values(ascending=False).reset_index()

# --- Create a 2x2 figure for the four key plots ---
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
plt.suptitle('Key Bivariate Relationships in Gym Member Data', fontsize=18, y=1.05)

# --- Plot 1: Correlation Heatmap ---
sns.heatmap(
    correlation_matrix[['Calories_Burned', 'Session_Duration (hours)', 'Avg_BPM', 'Weight (kg)']].T,
    annot=True, cmap='viridis', fmt=".2f", linewidths=.5, linecolor='black',
    ax=axes[0, 0]
)
axes[0, 0].set_title('1. Correlation with Key Performance Metrics', fontsize=14)
axes[0, 0].tick_params(axis='y', rotation=0)
axes[0, 0].tick_params(axis='x', rotation=45)

# --- Plot 2: Bar plot: Avg Calories Burned by Workout Type ---
sns.barplot(
    x='Workout_Type', y='Calories_Burned', data=workout_type_summary,
    hue='Workout_Type', ax=axes[0, 1], palette='flare'
)
axes[0, 1].set_title('2. Average Calories Burned by Workout Type', fontsize=14)
axes[0, 1].set_xlabel('Workout Type')
axes[0, 1].set_ylabel('Average Calories Burned')
axes[0, 1].tick_params(axis='x', rotation=45)
# Remove the redundant legend, but only if seaborn actually created one
if axes[0, 1].legend_ is not None:
    axes[0, 1].legend_.remove()
add_value_labels(axes[0, 1])

# --- Plot 3: Box Plot: Calories Burned by Experience Level ---
sns.boxplot(
    x='Experience_Level', y='Calories_Burned', data=df,
    order=[3, 2, 1], hue='Experience_Level', ax=axes[1, 0], palette='magma'
)
axes[1, 0].set_title('3. Calories Burned Distribution by Experience Level', fontsize=14)
axes[1, 0].set_xlabel('Experience Level (3=Advanced, 1=Beginner)')
axes[1, 0].set_ylabel('Calories Burned')
# Remove the redundant legend, but only if seaborn actually created one
if axes[1, 0].legend_ is not None:
    axes[1, 0].legend_.remove()

# --- Plot 4: Scatter Plot: Avg_BPM vs. Calories_Burned ---
sns.scatterplot(
    x='Avg_BPM', y='Calories_Burned', data=df,
    hue='Session_Duration (hours)', size='Session_Duration (hours)',
    sizes=(20, 200), palette='crest', ax=axes[1, 1], legend='full'
)
axes[1, 1].set_title('4. Relationship between Average BPM and Calories Burned', fontsize=14)
axes[1, 1].set_xlabel('Average BPM')
axes[1, 1].set_ylabel('Calories Burned')
axes[1, 1].legend(title='Duration (hrs)', loc='upper left')

# Adjust the layout, save the figure (including the raised suptitle), and display it
plt.tight_layout()
plt.savefig('4_key_bivariate_plots_fixed.png', bbox_inches='tight')
plt.show()

##### Insights

1. **Session Duration vs. Calories Burned**
    - The exceptionally high positive correlation (0.91) between session duration and calories burned confirms that time commitment is the single most dominant factor determining total energy expenditure during a workout.
2. **Workout Type vs. Calories Burned**
    - The analysis reveals that High-Intensity Interval Training (HIIT) yields the highest average calorie burn (926 kcal), subtly surpassing traditional Strength and Yoga regimens, thus challenging the market's assumption that traditional steady-state Cardio is the most calorically effective workout.
3. **Experience Level vs. Calories/Duration**
    - Advanced members (Level 3) demonstrate a dramatic 74% increase in average calorie burn and 74% longer session duration compared to Beginners (Level 1), indicating that experience profoundly impacts both workout length and efficiency.
4. **Calories Burned vs. Fat Percentage**
    - The strong negative correlation (−0.60) between calories burned per session and body fat percentage indicates that higher energy expenditure is associated with lower body fat composition across the simulated population.
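
For reference, the correlation coefficients quoted above are Pearson coefficients, which is what `DataFrame.corr()` computes by default. The sketch below computes one from its definition and with pandas on a hypothetical four-row `sample`; the values are illustrative and do not reproduce the reported figures.

```python
import numpy as np
import pandas as pd

# Hypothetical rows (illustrative values only)
sample = pd.DataFrame({
    "Session_Duration (hours)": [0.5, 1.0, 1.5, 2.0],
    "Calories_Burned": [400.0, 700.0, 1100.0, 1300.0],
})

x = sample["Session_Duration (hours)"]
y = sample["Calories_Burned"]

# Pearson r from its definition: co-deviation divided by the product of the spreads
x_dev, y_dev = x - x.mean(), y - y.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# pandas computes the same coefficient with Series.corr (Pearson by default)
r_pandas = x.corr(y)

print(round(float(r_manual), 4), round(float(r_pandas), 4))
```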

The bivariate relationships conclusively demonstrate that workout output is governed by a simple Duration-Intensity-Result model, where time commitment is the highest correlator of calories burned, while a strong negative correlation links high energy expenditure to lower body fat. The data further reveals that market demand is optimized by higher-burn workouts like HIIT and Strength training, signaling a shift away from traditional Cardio as the presumed calorie king, and that Experience Level serves as the most pronounced differentiator in both duration and resulting energy output. This disparity presents a unified and lucrative market opportunity to design progressive training programs that systematically bridge the 74% gap between Beginner performance and Advanced member retention.
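
A relative gap between two groups, such as the one quoted between Level 1 and Level 3, is the difference of the group means divided by the baseline mean. A minimal sketch, using hypothetical rows whose result is illustrative and not the dataset's figure:

```python
import pandas as pd

# Hypothetical rows (illustrative values only)
sample = pd.DataFrame({
    "Experience_Level": [1, 1, 2, 3, 3],
    "Calories_Burned": [600.0, 800.0, 900.0, 1200.0, 1300.0],
})

means = sample.groupby("Experience_Level")["Calories_Burned"].mean()

# Relative gap: how much higher the Level 3 mean is than the Level 1 mean
pct_gap = (means.loc[3] - means.loc[1]) / means.loc[1] * 100

print(f"{pct_gap:.1f}%")
```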

___

*any blocks below this text are meant to be added back into the final arrangement of the report at a later date*

## 3. Data Cleaning & Transformation

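Since the dataset has no missing values or inconsistent formats, we conclude it is "clean" and does not need to undergo any data cleansing. The high-value **BMI** and **Weight** observations noted during profiling are retained for analysis rather than removed.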

### Data Transformation

This section outlines the process of transforming the raw dataset to ensure it is in the optimal format for analysis and model building. While the dataset has high structural integrity with no missing values, several features require transformation to be used effectively. Specifically, we will address the conversion of categorical data into a numerical format, and the scaling of numerical features to standardize their range. This process ensures all variables are ready for potential use in machine learning models, preventing potential issues with feature bias and performance.

#### Feature Scaling

We standardize the numerical columns using `StandardScaler` from the `scikit-learn` library, transforming their values so that each column has a **mean of 0** and a **standard deviation of 1**.
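
The transformation itself is the z-score, $z = (x - \mu) / \sigma$, applied column by column. The sketch below reproduces it by hand with NumPy on a hypothetical `ages` array. It uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import numpy as np

# Hypothetical ages (illustrative values only)
ages = np.array([20.0, 30.0, 40.0, 50.0])

# z-score: subtract the mean, then divide by the population standard deviation
z = (ages - ages.mean()) / ages.std(ddof=0)

print(z.round(3))
print(z.mean(), z.std(ddof=0))
```

After the transformation, the mean is 0 and the standard deviation is 1, up to floating-point error.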

In [None]:
from pathlib import Path

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Reload the raw dataset from the same location used in Section 2
df = pd.read_csv(Path.cwd() / "data" / "gym_members_exercise_tracking.csv")

# Identify numerical columns to scale
numerical_cols = ['Age', 'Weight (kg)', 'Height (m)', 'Max_BPM', 'Avg_BPM', 'Resting_BPM', 'Session_Duration (hours)',
                  'Calories_Burned', 'Fat_Percentage', 'Water_Intake (liters)', 'Workout_Frequency (days/week)', 'Experience_Level', 'BMI']

# Create a StandardScaler object
scaler = StandardScaler()

# Fit and transform the numerical data
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

# Display the scaled data (first 5 rows)
print(df[numerical_cols].head())

#### Encoding Categorical Variables

We use `get_dummies()` from `pandas` to perform one-hot encoding.

In [None]:
# Identify categorical columns to encode
categorical_cols = ['Gender', 'Workout_Type']

# Perform one-hot encoding
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Display the transformed DataFrame with the new columns
print(df.head())
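
To see what `drop_first=True` does, it helps to encode a tiny example. In this sketch the one-column `sample` frame is hypothetical; the first category in sorted order ("Cardio") becomes the implicit baseline and gets no column of its own.

```python
import pandas as pd

# Hypothetical rows covering the four workout types
sample = pd.DataFrame({"Workout_Type": ["Cardio", "HIIT", "Strength", "Yoga"]})

# One-hot encode, dropping the first category to avoid a redundant column
encoded = pd.get_dummies(sample, columns=["Workout_Type"], drop_first=True)

print(list(encoded.columns))
print(encoded)
```

A "Cardio" row is therefore represented by zeros (or `False`) in all three remaining indicator columns.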

#### Binning/Discretization

#### Feature Engineering

#### Skewed Data Handling

#### Dimensionality Reduction

---