# **Fitness patterns and performance analysis**

## Objectives of the Project

- Classify gym members into **fitness levels** (Low, Medium, High) based on their biometric and workout behavior.
- Understand the **key drivers** of calorie burn and fitness level (e.g., BMI, heart rate, session duration).
- Build a model that supports **personalized fitness guidance** and **member segmentation**.
- Communicate insights visually for stakeholders using feature importance and performance metrics.

## Inputs
In this notebook, Exploratory Data Analysis (EDA) and visualization are performed on the "Gym Members Exercise Dataset" from [Kaggle](https://www.kaggle.com/datasets/valakhorasani/gym-members-exercise-dataset/data).
The dataset contains the following columns:
* Age: Age of the gym member.
* Gender: Gender of the gym member (Male or Female).
* Weight (kg): Member’s weight in kilograms.
* Height (m): Member’s height in meters.
* Max_BPM: Maximum heart rate (beats per minute) during workout sessions.
* Avg_BPM: Average heart rate during workout sessions.
* Resting_BPM: Heart rate at rest before workout.
* Session_Duration (hours): Duration of each workout session in hours.
* Calories_Burned: Total calories burned during each session.
* Workout_Type: Type of workout performed (e.g., Cardio, Strength, Yoga, HIIT).
* Fat_Percentage: Body fat percentage of the member.
* Water_Intake (liters): Daily water intake during workouts.
* Workout_Frequency (days/week): Number of workout sessions per week.
* Experience_Level: Level of experience, from beginner (1) to expert (3).
* BMI: Body Mass Index, calculated from height and weight.

## Outputs
**Fitness_Level** (Target Variable):  
  * `"Low"`: Low effort or irregular activity  
  * `"Medium"`: Moderate consistency and calorie burn  
  * `"High"`: High engagement and performance  

**Exploratory Data Analysis (EDA)**:  
  * Descriptive statistics  
  * Correlation heatmaps  
  * Pairplots  
  * Fitness level distributions  
  * Other performance-related insights via plots  

**Machine Learning Model Output**:  
  * Trained classification model predicting fitness level  
  * Evaluation metrics: Accuracy, Classification Report, Confusion Matrix  
  * Feature importance plot to highlight influential variables





---

For this project, several Python libraries are used for analysis and visualization. The libraries are imported before any further work on the project.

In [41]:
import pandas as pd                 #import Pandas for data manipulation
import numpy as np                  #import Numpy for numerical operations
import matplotlib.pyplot as plt     #import Matplotlib for data visualization
import seaborn as sns               #import Seaborn for statistical data visualization
from plotly.subplots import make_subplots  #import Plotly subplots for creating complex figures
import plotly.express as px         #import Plotly Express for interactive visualizations
from sklearn.model_selection import train_test_split                   #import train_test_split for splitting data into training and testing sets
from sklearn.preprocessing import LabelEncoder, StandardScaler         #import LabelEncoder and StandardScaler for data preprocessing
from sklearn.ensemble import RandomForestClassifier                    #import RandomForestClassifier for classification tasks
from sklearn.metrics import classification_report, confusion_matrix    #import metrics for model evaluation

Style and plot size

In [42]:
sns.set_style("whitegrid")      #set Seaborn style to whitegrid for better aesthetics
plt.rcParams['figure.figsize'] = (12, 6)    #set default figure size for Matplotlib plots

# 1. Exploratory Data Analysis

##### In this section, EDA is performed, including data loading and cleaning.

As a first step, the "gym_members_exercise_tracking.csv" dataset is loaded into a DataFrame.

In [None]:
df = pd.read_csv('../data/gym_members_exercise_tracking.csv')  # Load the gym members exercise dataset
df.head()                                                      # Display the first few rows of the dataset

| | Age | Gender | Weight (kg) | Height (m) | Max_BPM | Avg_BPM | Resting_BPM | Session_Duration (hours) | Calories_Burned | Workout_Type | Fat_Percentage | Water_Intake (liters) | Workout_Frequency (days/week) | Experience_Level | BMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | Male | 88.3 | 1.71 | 180 | 157 | 60 | 1.69 | 1313.0 | Yoga | 12.6 | 3.5 | 4 | 3 | 30.2 |
| 1 | 46 | Female | 74.9 | 1.53 | 179 | 151 | 66 | 1.3 | 883.0 | HIIT | 33.9 | 2.1 | 4 | 2 | 32.0 |
| 2 | 32 | Female | 68.1 | 1.66 | 167 | 122 | 54 | 1.11 | 677.0 | Cardio | 33.4 | 2.3 | 4 | 2 | 24.71 |
| 3 | 25 | Male | 53.2 | 1.7 | 190 | 164 | 56 | 0.59 | 532.0 | Strength | 28.8 | 2.1 | 3 | 1 | 18.41 |
| 4 | 38 | Male | 46.1 | 1.79 | 188 | 158 | 68 | 0.64 | 556.0 | Strength | 29.2 | 2.8 | 3 | 1 | 14.39 |


The shape, summary info, and column types of the dataset are shown below.

In [44]:
print(df.shape)    # Print the shape of the DataFrame
df.info()          # Print a concise summary (info() prints directly and returns None)
print(df.dtypes)   # Print the data type of each column

(973, 15)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 973 entries, 0 to 972
Data columns (total 15 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Age                            973 non-null    int64  
 1   Gender                         973 non-null    object 
 2   Weight (kg)                    973 non-null    float64
 3   Height (m)                     973 non-null    float64
 4   Max_BPM                        973 non-null    int64  
 5   Avg_BPM                        973 non-null    int64  
 6   Resting_BPM                    973 non-null    int64  
 7   Session_Duration (hours)       973 non-null    float64
 8   Calories_Burned                973 non-null    float64
 9   Workout_Type                   973 non-null    object 
 10  Fat_Percentage                 973 non-null    float64
 11  Water_Intake (liters)          973 non-null    float64
 12  Workout_Frequency (days/week)  973 non-n

As shown above, gym_members_exercise_tracking.csv (hereafter df) consists of 973 entries in 15 columns with the following types:
* float64 - 7
* int64 - 6
* object - 2
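
The per-type counts listed above do not have to be tallied by hand. A minimal sketch, assuming a hypothetical three-column `sample` frame in place of the notebook's `df`, counts columns per dtype with `dtypes.value_counts()`:

```python
import pandas as pd

# Hypothetical three-row sample with one column of each kind seen above
sample = pd.DataFrame({
    "Age": [56, 46, 32],                      # integer column
    "Gender": ["Male", "Female", "Female"],   # text column
    "BMI": [30.2, 32.0, 24.71],               # float column
})

# Convert dtypes to their string names, then count columns per dtype
dtype_counts = sample.dtypes.astype(str).value_counts()
print(dtype_counts)
```

On the notebook's DataFrame the same idea would be `df.dtypes.value_counts()`.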

In the next steps, df is checked for inconsistencies (duplicates, missing values, etc.).

In [45]:
df.isnull().sum()                 # Check for missing values in each column

Age                              0
Gender                           0
Weight (kg)                      0
Height (m)                       0
Max_BPM                          0
Avg_BPM                          0
Resting_BPM                      0
Session_Duration (hours)         0
Calories_Burned                  0
Workout_Type                     0
Fat_Percentage                   0
Water_Intake (liters)            0
Workout_Frequency (days/week)    0
Experience_Level                 0
BMI                              0
dtype: int64

As shown above, df has no missing values. Let's check for duplicates:

In [46]:
df.duplicated() # Check for duplicate rows in the DataFrame


0      False
1      False
2      False
3      False
4      False
       ...  
968    False
969    False
970    False
971    False
972    False
Length: 973, dtype: bool

There are no duplicates either. Initial data inspection shows that df has neither missing values nor duplicate rows, which simplifies further work with the dataset.
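
Because the printed boolean Series is truncated in the middle, a single count is easier to verify than scanning the rows. A minimal sketch, assuming a hypothetical `sample` frame in which one row is deliberately repeated:

```python
import pandas as pd

# Hypothetical sample: the third row repeats the first one
sample = pd.DataFrame({
    "Age": [56, 46, 56],
    "Workout_Type": ["Yoga", "HIIT", "Yoga"],
})

# duplicated() flags every repeat of an earlier row; True counts as 1 when summed
n_duplicates = int(sample.duplicated().sum())
has_duplicates = bool(sample.duplicated().any())
print(n_duplicates, has_duplicates)
```

Applied to the notebook's data, `df.duplicated().sum()` should return 0 if the dataset is indeed free of duplicates.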

Fetching all column names:

In [47]:
df.columns          # Display the column names of the DataFrame

Index(['Age', 'Gender', 'Weight (kg)', 'Height (m)', 'Max_BPM', 'Avg_BPM',
       'Resting_BPM', 'Session_Duration (hours)', 'Calories_Burned',
       'Workout_Type', 'Fat_Percentage', 'Water_Intake (liters)',
       'Workout_Frequency (days/week)', 'Experience_Level', 'BMI'],
      dtype='object')

BMI (Body Mass Index) is calculated as Weight (kg) / Height (m) ** 2. Because the dataset already contains the "BMI" column, which combines both measurements, the "Weight (kg)" and "Height (m)" columns can be dropped for further analysis.

In [48]:
df.drop(columns=['Weight (kg)', 'Height (m)'], inplace=True)  # Drop 'Weight(kg)' and 'Height(m)' columns as BMI is already present
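
The BMI formula can be sanity-checked against the values printed by `df.head()` earlier. A minimal sketch, assuming only the first three rows of that output (the frame name `sample` is illustrative), recomputes BMI and compares it with the stored column:

```python
import pandas as pd

# First three rows copied from the df.head() output shown earlier
sample = pd.DataFrame({
    "Weight (kg)": [88.3, 74.9, 68.1],
    "Height (m)": [1.71, 1.53, 1.66],
    "BMI": [30.2, 32.0, 24.71],
})

# BMI = weight in kilograms divided by height in meters squared
recomputed = sample["Weight (kg)"] / sample["Height (m)"] ** 2

# Largest gap between the recomputed and the stored value
max_abs_diff = (recomputed - sample["BMI"]).abs().max()
print(recomputed.round(2).tolist(), max_abs_diff)
```

For these rows the recomputed values agree with the stored BMI up to rounding, which supports treating the two source columns as redundant.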

Let's check which unique values the object columns "Gender" and "Workout_Type" contain.

In [49]:
print(df["Gender"].unique())        # Check unique values in the "Gender" column
print(df["Workout_Type"].unique())  # Check unique values in the "Workout_Type" column

['Male' 'Female']
['Yoga' 'HIIT' 'Cardio' 'Strength']


For further visualizations and classification, a new variable "Fitness_Level" is created based on how many calories a person burns per session: the more calories burned, the higher the assigned fitness level.

In [50]:
# Create fitness classification based on calories burned
def classify_fitness(calories):
    if calories < 300:
        return "Low"
    elif calories < 600:
        return "Medium"
    else:
        return "High"

df["Fitness_Level"] = df["Calories_Burned"].apply(classify_fitness)  # Apply the classification function to create a new column
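
The same thresholds can also be applied without a row-wise `apply`. The sketch below is an alternative, not the notebook's method: it uses `pd.cut` on a hypothetical `calories` Series chosen to sit on the bin edges, with `right=False` so that each bin includes its lower bound, matching the `<` comparisons in `classify_fitness`.

```python
import numpy as np
import pandas as pd

# Hypothetical values placed on and around the 300 and 600 thresholds
calories = pd.Series([299.0, 300.0, 599.0, 600.0, 1313.0])

# right=False gives left-closed bins: [-inf, 300), [300, 600), [600, inf)
levels = pd.cut(
    calories,
    bins=[-np.inf, 300, 600, np.inf],
    labels=["Low", "Medium", "High"],
    right=False,
)
print(levels.tolist())
```

Unlike the string column produced by `apply`, `pd.cut` returns an ordered categorical, which keeps "Low" as a known category even when no row falls into it.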

Below, the unique categories of the "Fitness_Level" column and their counts are displayed.

In [51]:
print(df["Fitness_Level"].unique())        # Display the unique fitness level categories
print(df["Fitness_Level"].value_counts())  # Display the count of each fitness level category

['High' 'Medium']
Fitness_Level
High      841
Medium    132
Name: count, dtype: int64


As shown above, "Fitness_Level" contains only 2 of the 3 possible categories: "High" (841 entries) and "Medium" (132 entries). Together they account for all 973 entries of the dataset; "Low" never occurs because no session falls below the 300-calorie threshold.
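
Class shares make the imbalance between the two categories explicit. A minimal sketch, assuming a label Series rebuilt from the counts printed above rather than the notebook's actual `df`:

```python
import pandas as pd

# Labels rebuilt from the counts printed above (841 "High", 132 "Medium")
labels = pd.Series(["High"] * 841 + ["Medium"] * 132, name="Fitness_Level")

# normalize=True returns the fraction of rows per class instead of raw counts
shares = labels.value_counts(normalize=True)
print(shares.round(3))
```

Roughly 86% of rows are "High", so a model that always predicts "High" would already reach about that accuracy; the per-class figures in a classification report are likely to be more informative than accuracy alone.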

`For the data visualization part of the project, it is not necessary to encode the categorical variables "Gender" and "Workout_Type". For simplicity, encoding is done directly in the Machine Learning Model part.`
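
For orientation, the encoding deferred to the modeling part could look like the sketch below. It assumes `LabelEncoder` is the encoder used (it is imported at the top of the notebook, but the actual modeling code is not shown here) and works on a hypothetical list of values.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the "Gender" column
genders = ["Male", "Female", "Female", "Male"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(genders)   # classes are sorted, then numbered from 0

print(list(encoder.classes_))   # class order learned by the encoder
print(encoded.tolist())         # integer code for each input value
```

The integer codes follow alphabetical order of the class names, so for a multi-valued column such as "Workout_Type" they imply an ordering that has no real meaning; tree-based models are usually tolerant of this, while one-hot encoding is a common alternative.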

The next step is to look at the descriptive statistics summary for each numeric column in df. The summary includes the following metrics:
* count: Number of non-missing (non-NaN) values. Helps check for missing data.
* mean: Average (sum / count). Central tendency of numeric data.
* std: Standard deviation. How spread out the values are around the mean.
* min: Minimum value. The smallest observed value.
* 25%: 25th percentile (Q1). 25% of the data is below this value.
* 50% (median): 50th percentile. Half the data is below this value.
* 75%: 75th percentile (Q3). 75% of the data is below this value.
* max: Maximum value. The largest observed value.

In the following cell, descriptive statistics of the numeric columns are generated:

In [52]:
df.describe() # Generate descriptive statistics of numerical columns

| | Age | Max_BPM | Avg_BPM | Resting_BPM | Session_Duration (hours) | Calories_Burned | Fat_Percentage | Water_Intake (liters) | Workout_Frequency (days/week) | Experience_Level | BMI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 973.0 | 973.0 | 973.0 | 973.0 | 973.0 | 973.0 | 973.0 | 973.0 | 973.0 | 973.0 | 973.0 |
| mean | 38.683453 | 179.883864 | 143.766701 | 62.223022 | 1.256423 | 905.422405 | 24.976773 | 2.626619 | 3.321686 | 1.809866 | 24.912127 |
| std | 12.180928 | 11.525686 | 14.345101 | 7.32706 | 0.343033 | 272.641516 | 6.259419 | 0.600172 | 0.913047 | 0.739693 | 6.660879 |
| min | 18.0 | 160.0 | 120.0 | 50.0 | 0.5 | 303.0 | 10.0 | 1.5 | 2.0 | 1.0 | 12.32 |
| 25% | 28.0 | 170.0 | 131.0 | 56.0 | 1.04 | 720.0 | 21.3 | 2.2 | 3.0 | 1.0 | 20.11 |
| 50% | 40.0 | 180.0 | 143.0 | 62.0 | 1.26 | 893.0 | 26.2 | 2.6 | 3.0 | 2.0 | 24.16 |
| 75% | 49.0 | 190.0 | 156.0 | 68.0 | 1.46 | 1076.0 | 29.3 | 3.1 | 4.0 | 2.0 | 28.56 |
| max | 59.0 | 199.0 | 169.0 | 74.0 | 2.0 | 1783.0 | 35.0 | 3.7 | 5.0 | 3.0 | 49.84 |
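
By default `describe()` covers only the numeric columns of a mixed DataFrame, so "Gender" and "Workout_Type" are absent from the summary above. A minimal sketch, assuming a hypothetical `sample` frame built from the five `df.head()` rows, shows that calling `describe()` on the text columns alone returns count, unique, top, and freq instead:

```python
import pandas as pd

# Categorical values copied from the five df.head() rows shown earlier
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Male"],
    "Workout_Type": ["Yoga", "HIIT", "Cardio", "Strength", "Strength"],
})

# With no numeric columns present, describe() summarizes the text columns
summary = sample[["Gender", "Workout_Type"]].describe()
print(summary)
```

Here `top` is the most frequent value and `freq` its number of occurrences, which gives a quick view of how the categories are distributed.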


Key Insights:
* Age:
    * Members range from 18 to 59 years old.
    * Median age is 40, and the middle 50% of members are between 28 and 49, indicating a predominantly adult fitness population.
* Heart Rate Data (BPM):
    * Max_BPM ranges from 160 to 199 bpm.
    * Mean Avg_BPM is ~144 and mean Resting_BPM is ~62.
    * Values fall within expected ranges: Resting_BPM spans 50 to 74 bpm and Avg_BPM spans 120 to 169 bpm.
* Session Duration & Calories Burned:
    * Sessions vary from 0.5 to 2 hours.
    * Calories burned range from 303 to 1783 kcal, with a median around 893. This is a wide spread, although no session falls below the 300 kcal threshold used for the "Low" fitness level.
* Fat % & Water Intake:
    * Fat percentage ranges from 10% to 35%.
    * Water intake ranges from 1.5 to 3.7 liters, and it is recorded for every member.
* Workout Frequency & Experience:
    * Members work out 2 to 5 times per week, with a median of 3.
    * Experience level ranges from 1 to 3 (beginner to expert).
* BMI:
    * Median BMI is ~24.16; 25% of users are above ~28.6, pointing to a sizeable overweight group (BMI of 25 or more).

Conclusion:
The dataset captures a diverse gym population, covering all experience levels and a wide range of health profiles. With no missing values and a mix of numeric and categorical features, it is well suited for classification and clustering tasks, as well as for dashboards and insight generation.


---

At this stage of the project, dataset inspection and preparation are complete. The cleaned dataset can therefore be saved to the dedicated GitHub repository for further use in Tableau or Power BI.

In [None]:
df.to_csv('../data/cleaned_gym_members_exercise_tracking.csv', index=False)

---


# Push files to Repo