# Data Cleaning and Visualization

This notebook focuses on preparing the 'Gym Members Exercise Tracking' dataset for analysis and predictive modeling. The primary tasks include:

1. Loading and inspecting the data.
2. Cleaning and standardizing columns.
3. Handling missing values.
4. Performing exploratory visualizations to understand feature distributions and relationships.

The final cleaned dataset will be used to build machine learning models in a separate notebook.

### Data cleaning:

## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from sklearn.metrics import classification_report, mean_squared_error, accuracy_score
import warnings
warnings.filterwarnings('ignore')

## 2. Loading and Inspecting the Dataset

In [None]:
df = pd.read_csv("gym_members_exercise_tracking.csv")

print("Dataset Overview:")
print(df.head())
print("\nColumn Information:")
df.info()  # info() prints its report directly and returns None
print("\nMissing Values:")
print(df.isnull().sum())

## 3. Identify Missing Values

In [None]:
# Identify missing values
missing_counts = df.isnull().sum()
print("Missing values per column:\n", missing_counts)

## 4. Rename columns for easier use

In [None]:
# Rename columns in a single call. 'Weight (kg)' keeps its original name,
# which the plotting cells below reference as-is.
df.rename(columns={
    "Water_Intake (liters)": "Water Intake",
    "Height (m)": "Height",
    "Calories_Burned": "Calories Burned",
    "Fat_Percentage": "Fat Percentage",
    "Workout_Type": "Workout Type",
    "Experience_Level": "Experience Level",
}, inplace=True)

## 5. Handling Missing Values

In [None]:
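# Impute missing numerical data with the column mean (the median is an alternative)
numerical_cols = ['Fat Percentage', 'Water Intake']  # Example columns with missing values
for col in numerical_cols:
    # Assign the result back: fillna(inplace=True) on a column selection may not update df
    df[col] = df[col].fillna(df[col].mean())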

In [None]:
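# Impute missing categorical data with the mode (most frequent value)
categorical_cols = ['Workout Type']  # Example categorical column
for col in categorical_cols:
    # Assign the result back: fillna(inplace=True) on a column selection may not update df
    df[col] = df[col].fillna(df[col].mode()[0])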
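
The numeric imputation step mentions the median as an alternative to the mean. A minimal sketch of how that choice could be automated is shown below, on toy data. The helper name `impute_numeric` and the `skew_threshold` of 1.0 are assumptions made for illustration; they are not part of this notebook.

```python
import numpy as np
import pandas as pd

def impute_numeric(frame, cols, skew_threshold=1.0):
    """Fill NaNs with the median for strongly skewed columns, otherwise the mean.

    Hypothetical helper: the threshold of 1.0 is an illustrative assumption.
    """
    out = frame.copy()
    for col in cols:
        use_median = abs(out[col].skew()) > skew_threshold
        fill_value = out[col].median() if use_median else out[col].mean()
        out[col] = out[col].fillna(fill_value)
    return out

# Toy data: 'Water Intake' contains one extreme value, so it is strongly skewed
toy = pd.DataFrame({
    "Fat Percentage": [20.0, 22.0, np.nan, 24.0, 26.0],
    "Water Intake": [2.0, 2.1, 2.2, np.nan, 9.0],
})
toy_filled = impute_numeric(toy, ["Fat Percentage", "Water Intake"])
print(toy_filled)
```

The median is less sensitive to extreme values than the mean, so it tends to be the safer fill value when a column has a long tail.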

## 6. Standardize and Validate Data

In [None]:
# Recalculate BMI from weight (kg) and height (m): BMI = weight / height^2
if 'Weight (kg)' in df.columns and 'Height' in df.columns:
    df['BMI_Calculated'] = df['Weight (kg)'] / df['Height'] ** 2
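
The height column is named `Height (m)` before renaming, so heights appear to be recorded in meters, in which case BMI is weight in kilograms divided by height in meters squared. The sketch below shows one way a recalculated BMI could be compared against a recorded `BMI` column. The toy values and the tolerance of 0.05 are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy rows: the second recorded BMI is deliberately inconsistent
toy = pd.DataFrame({
    "Weight (kg)": [70.0, 90.0],
    "Height": [1.75, 1.80],      # meters
    "BMI": [22.86, 30.00],       # recorded values
})

# BMI = weight (kg) / height (m)^2
toy["BMI_Calculated"] = toy["Weight (kg)"] / toy["Height"] ** 2

# Flag rows where recorded and recalculated BMI differ by more than the tolerance
mismatch = ~np.isclose(toy["BMI"], toy["BMI_Calculated"], atol=0.05)
print(toy[mismatch])
```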

In [None]:
# Check for unrealistic values and correct them
df['Age'] = df['Age'].clip(lower=18, upper=65)  # Age should be between 18 and 65
df['Calories Burned'] = df['Calories Burned'].clip(lower=0)  # Calories burned should be non-negative
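
Clipping silently overwrites out-of-range values, so it can help to count how many rows would be affected before applying it. A small sketch on toy data follows; the helper `count_out_of_range` is hypothetical and not part of this notebook.

```python
import pandas as pd

def count_out_of_range(series, lower=None, upper=None):
    """Return (number below lower, number above upper) for a numeric Series."""
    below = int((series < lower).sum()) if lower is not None else 0
    above = int((series > upper).sum()) if upper is not None else 0
    return below, above

toy_age = pd.Series([17, 25, 40, 70, 65])
below, above = count_out_of_range(toy_age, lower=18, upper=65)
print(f"{below} value(s) below 18, {above} value(s) above 65")

# Same bounds as the clipping step for 'Age'
clipped = toy_age.clip(lower=18, upper=65)
```

If many rows fall outside the bounds, that may point to a data-entry or unit problem worth investigating rather than clipping away.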

## 7. Handle Outliers

In [None]:
from scipy.stats import zscore
z_scores = np.abs(zscore(df.select_dtypes(include=np.number)))
threshold = 3  # Define a threshold for z-score
outlier_indices = np.where(z_scores > threshold)
print("Outlier indices:\n", outlier_indices)

# Optional: Remove outliers (if necessary)
df_cleaned = df[(z_scores < threshold).all(axis=1)]
# Save the cleaned data
output_path = "gym_members_exercise_cleaned.csv"
df_cleaned.to_csv(output_path, index=False)
print(f"Cleaned dataset saved to {output_path}")
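
The z-score rule relies on the mean and standard deviation, which extreme values themselves inflate. An interquartile-range (IQR) rule is a common alternative. The sketch below is not what this notebook uses; the helper `iqr_mask` and the multiplier `k=1.5` are assumptions shown on toy data.

```python
import pandas as pd

def iqr_mask(frame, k=1.5):
    """Return a boolean Series that is True for rows where every numeric
    column lies within [Q1 - k*IQR, Q3 + k*IQR].

    Rows with NaN in a numeric column compare as False and are excluded.
    """
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    within = (numeric >= q1 - k * iqr) & (numeric <= q3 + k * iqr)
    return within.all(axis=1)

toy = pd.DataFrame({
    "Calories Burned": [500, 520, 540, 560, 580, 5000],
    "Gender": ["M", "F", "M", "F", "M", "F"],  # non-numeric columns are ignored
})
kept = toy[iqr_mask(toy)]
print(kept)
```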

In [None]:
df1 = pd.read_csv("gym_members_exercise_cleaned.csv")

In [None]:
# Dataset Dimensions Overview
num_records = len(df1)
num_columns = len(df1.columns)
print(f"Columns: {num_columns}, Records: {num_records}")

## Dataset Summary and Insights:

In [None]:
def summary(data):
    """Build a per-column overview: dtype, missing and unique counts, descriptive statistics."""
    summ = pd.DataFrame(data.dtypes, columns=['data type'])
    summ['#missing'] = data.isnull().sum().values
    summ['Duplicate'] = data.duplicated().sum()  # total duplicated rows, repeated on every line
    summ['#unique'] = data.nunique().values
    # Describe the frame that was passed in, so the rows line up with its columns
    desc = data.describe(include='all').transpose()
    summ['min'] = desc['min'].values
    summ['max'] = desc['max'].values
    summ['avg'] = desc['mean'].values
    summ['std dev'] = desc['std'].values
    summ['top value'] = desc['top'].values
    summ['Freq'] = desc['freq'].values

    return summ

# Summarize the cleaned dataset
summary(df1).style.background_gradient()

In [None]:
# Inspect Data
print(df1.head())
df1.info()  # info() prints its report directly and returns None
df1.nunique()

## Data Visualization:

### Data visualization for distribution analysis

In [None]:
# Distribution of Gender
plt.figure(figsize=(6, 4))
sns.countplot(data=df1, x="Gender", hue="Gender", palette="Set2", legend=False)
plt.title("Distribution of Gender")
plt.xlabel("Gender")
plt.ylabel("Count")
plt.show()
palette = "Set2"  # reused by the count plots below

# Distribution of Workout Type
plt.figure(figsize=(8, 4))
sns.countplot(data=df1, x="Workout Type", hue="Workout Type", palette=palette)
plt.title("Distribution of Workout Type")
plt.xlabel("Workout Type")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()

# Distribution of Experience Level
plt.figure(figsize=(6, 4))
sns.countplot(data=df1, x="Experience Level", hue="Experience Level", palette=palette)
plt.title("Distribution of Experience Level")
plt.xlabel("Experience Level")
plt.ylabel("Count")
plt.show()


# Histogram: Distribution of BMI
plt.figure(figsize=(10, 6))
sns.histplot(df1["BMI"], kde=False, bins=20, edgecolor="black", alpha=0.7)  # default seaborn color
plt.title("Distribution of BMI")
plt.xlabel("BMI")
plt.ylabel("Frequency")
plt.grid(axis="y")
plt.show()

# Histogram for Workout Frequency Distribution
plt.figure(figsize=(10, 6))
sns.histplot(df1['Workout_Frequency (days/week)'], kde=True, color='blue', bins=7)
plt.title("Distribution of Workout Frequency")
plt.xlabel("Workout Frequency (Days/Week)")
plt.ylabel("Frequency")
plt.show()

# Plotting Distribution of Session Duration
plt.figure(figsize=(10, 6))
sns.histplot(df1['Session_Duration (hours)'], bins=20, kde=True, color='skyblue', edgecolor='black')
plt.title("Session Duration Distribution")
plt.xlabel("Session Duration (hours)")
plt.ylabel("Frequency")
plt.grid(axis='y')
plt.show()

# Histogram - Distribution of Calories Burned
plt.figure(figsize=(10, 6))
plt.hist(df1["Calories Burned"], bins=20, edgecolor="black", alpha=0.7, color='red')
plt.title("Distribution of Calories Burned")
plt.xlabel("Calories Burned")
plt.ylabel("Frequency")
plt.show()

# Weight Distribution (Histogram + KDE)
plt.figure(figsize=(10, 6))
sns.histplot(df1['Weight (kg)'], bins=15, kde=True, color='skyblue', edgecolor='black')
plt.title("Weight Distribution (kg)")
plt.xlabel("Weight (kg)")
plt.ylabel("Frequency")
plt.show()

In [None]:
# Pie chart: share of members by gender
gender_counts = df1['Gender'].value_counts()
plt.figure(figsize=(6, 6))
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Gender Distribution')
plt.show()

# Distribution of Age
plt.figure(figsize=(8, 6))
sns.histplot(df1['Age'], kde=True, color='purple', bins=20)
plt.title("Distribution of Age")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

### Experience Level Distribution by Workout Type

In [None]:
# Plotting Experience Level by Workout Type
plt.figure(figsize=(12, 6))
sns.countplot(x='Workout Type', hue='Experience Level', data=df1)
plt.title("Experience Level Distribution by Workout Type", fontsize=16, fontweight='bold')
plt.xlabel("Workout Type")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

### Average BPM by Workout Frequency

In [None]:
# Calculate Average BPM by Workout Frequency
bpm_by_workout = df1.groupby('Workout_Frequency (days/week)')[['Max_BPM', 'Avg_BPM', 'Resting_BPM']].mean()
# Plotting the Average BPM for each workout frequency
bpm_by_workout.plot(kind='bar', figsize=(12, 6), color=['lightblue', 'lightgreen', 'lightcoral'])
plt.title("Average BPM by Workout Frequency")
plt.xlabel("Workout Frequency (Days/Week)")
plt.ylabel("BPM")
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

### Categorical Distributions by Group
1. Gender Distribution by Workout Type
2. Workout Type Distribution by Experience Level
3. Gender Distribution by Experience Level

These plots help in understanding the relationships between categorical variables, such as gender, workout type, and experience level, and provide insights into the demographic distribution of gym members.

In [None]:
plt.figure(figsize=(6, 6))
sns.countplot(x='Gender', hue='Workout Type', data=df1)
plt.title("Gender Distribution by Workout Type")
plt.show()

plt.figure(figsize=(6, 6))
sns.countplot(x='Workout Type', hue='Experience Level', data=df1)
plt.title("Workout Type Distribution by Experience Level")
plt.show()

plt.figure(figsize=(6, 6))
sns.countplot(x='Gender', hue='Experience Level', data=df1)
plt.title("Gender Distribution by Experience Level")
plt.show()
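
Grouped count plots like these are easier to read alongside the underlying numbers. As a sketch, `pd.crosstab` can tabulate the same counts, and `normalize="index"` converts each row to shares. The toy rows below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Synthetic toy rows, for illustration only
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Workout Type": ["Strength", "Cardio", "Cardio", "Yoga", "Strength"],
})

# Raw counts per (Gender, Workout Type) combination
counts = pd.crosstab(toy["Gender"], toy["Workout Type"])

# Row-wise shares: each gender's workout types sum to 1
shares = pd.crosstab(toy["Gender"], toy["Workout Type"], normalize="index")

print(counts)
print(shares.round(2))
```

Row-wise shares make groups of different sizes comparable, which raw counts alone do not.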

### Scatter Plot Visualizations for Relationships Between Variables

In [None]:
# Scatter Plot: BMI vs. Weight
plt.figure(figsize=(10, 6))
plt.scatter(df1["Weight (kg)"], df1["BMI"], alpha=0.5)
plt.title("BMI vs. Weight")
plt.xlabel("Weight (kg)")
plt.ylabel("BMI")
plt.grid(True)
plt.show()

# Scatter Plot: Calories Burned vs. Session Duration
plt.figure(figsize=(10, 6))
plt.scatter(df1["Session_Duration (hours)"], df1["Calories Burned"], alpha=0.5, color='orange')
plt.title("Calories Burned vs. Session Duration")
plt.xlabel("Session Duration (hours)")
plt.ylabel("Calories Burned")
plt.grid(True)
plt.show()

### Scatter Plot: BMI vs. Weight by Gender

In [None]:
sns.set(style="whitegrid")

plt.figure(figsize=(12, 5))
sns.scatterplot(data=df1, x='Weight (kg)', y='BMI', hue='Gender', palette='Set1')

plt.title('BMI vs. Weight by Gender')
plt.xlabel('Weight (kg)')
plt.ylabel('BMI')
plt.legend(title='Gender')
plt.show()

### Scatter Plot: BMI vs. Calories Burned by Gender

In [None]:
sns.set(style="whitegrid")

plt.figure(figsize=(12, 5))
sns.scatterplot(data=df1, x='Calories Burned', y='BMI', hue='Gender', palette='Set1')

plt.title('BMI vs. Calories Burned by Gender')
plt.xlabel('Calories Burned')
plt.ylabel('BMI')
plt.legend(title='Gender')
plt.show()

### Scatter Plot: BMI vs Fat Percentage

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='BMI', y='Fat Percentage', data=df1, alpha=0.7)
plt.title("BMI vs Fat Percentage")
plt.xlabel("BMI")
plt.ylabel("Fat Percentage (%)")
plt.grid(True)
plt.show()

### Scatter Plot: Calories Burned per Hour vs. Session Duration by Workout Type

In [None]:
# Check how session duration relates to calories burned per hour, by workout type
df1['Calories per Hour'] = df1['Calories Burned'] / df1['Session_Duration (hours)']
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Session_Duration (hours)', y='Calories per Hour', data=df1, hue='Workout Type')
plt.title("Calories Burned per Hour vs. Session Duration")
plt.xlabel("Session Duration (hours)")
plt.ylabel("Calories Burned per Hour")
plt.show()
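
Dividing by session duration produces infinite values if any duration is zero. Whether the dataset contains such rows is not checked here, so the sketch below is only a precaution: it replaces zero durations with NaN before dividing. The toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic toy rows; the second one has a zero-length session
toy = pd.DataFrame({
    "Calories Burned": [600.0, 450.0, 300.0],
    "Session_Duration (hours)": [1.5, 0.0, 1.0],
})

# Treat zero durations as missing so the ratio becomes NaN instead of inf
duration = toy["Session_Duration (hours)"].replace(0, np.nan)
toy["Calories per Hour"] = toy["Calories Burned"] / duration
print(toy)
```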

In [None]:
# Check calories burned by Experience Level
experience_calories = df1.groupby('Experience Level')['Calories Burned'].mean()
print("Average Calories Burned by Experience Level:")
print(experience_calories)
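
A group mean on its own hides group size and spread. As a sketch, `agg` can report several statistics per experience level in one table. The toy values below are synthetic and only illustrate the call.

```python
import pandas as pd

# Synthetic toy rows, for illustration only
toy = pd.DataFrame({
    "Experience Level": [1, 1, 2, 2, 3],
    "Calories Burned": [700.0, 800.0, 900.0, 1000.0, 1200.0],
})

# Count, mean and median of calories burned per experience level
stats = toy.groupby("Experience Level")["Calories Burned"].agg(["count", "mean", "median"])
print(stats)
```

A large gap between mean and median, or a very small count, suggests the group average should be interpreted with care.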

### Summary of Key Insights
Based on the analysis of the gym members' exercise data:

1. Exercise Preferences:
   - Gender Differences: There is a notable distinction in workout preferences based on gender. Males generally prefer strength training, while females are more inclined towards cardio-based exercises.
2. Age and Fitness Correlations:
   - Younger Members Burn More Calories: Age has a moderate correlation with both calories burned and workout duration. Younger members tend to have higher metabolism rates, leading to greater calorie expenditure and longer, more intense workouts. In contrast, older members tend to have shorter, less intense sessions, which may be attributed to fitness levels or joint concerns.
3. BMI and Calories Burned:
   - Higher BMI Leads to Lower Caloric Burn: A negative correlation exists between BMI and calories burned. Members with higher BMI often burn fewer calories, indicating that personalized workout routines for individuals with higher body fat percentages may be necessary to help them achieve higher caloric expenditure and improve their fitness outcomes.
4. Workout Duration and Effectiveness:
   - Workout Duration Doesn't Always Equal Better Results: Longer workout sessions don’t always lead to better fitness outcomes. While the duration does impact total calories burned, the type of exercise performed plays a more significant role in fitness improvement. For example, strength training, despite being performed for shorter durations, may result in higher fitness benefits compared to longer cardio sessions.
5. Experience Level’s Impact on Performance:
   - Experienced Members Achieve Better Results: More experienced gym members tend to perform better in terms of calories burned and workout intensity. This is likely due to the effectiveness of their routines and higher fitness levels. However, beginners, while they may not burn as many calories or work out with as high intensity, show notable progress as they increase both their workout duration and calories burned over time.
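
Correlation claims such as those in items 2 and 3 can be checked directly by computing Pearson coefficients for the relevant column pairs. The sketch below uses a hypothetical helper, `pairwise_corr`, on synthetic toy values; the numbers it prints say nothing about the actual dataset.

```python
import pandas as pd

def pairwise_corr(frame, pairs):
    """Return the Pearson correlation for each (x, y) column pair."""
    return {(x, y): frame[x].corr(frame[y]) for x, y in pairs}

# Synthetic toy values for illustration only; not taken from the dataset
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50, 60],
    "BMI": [22.0, 27.0, 24.0, 30.0, 26.0],
    "Calories Burned": [900.0, 850.0, 800.0, 700.0, 650.0],
})

pairs = [("Age", "Calories Burned"), ("BMI", "Calories Burned")]
result = pairwise_corr(toy, pairs)
for (x, y), r in result.items():
    print(f"{x} vs {y}: r = {r:.2f}")
```

Passing the cleaned DataFrame instead of `toy` would give the coefficients for the real data. A correlation coefficient describes association only, so it cannot by itself confirm the causal explanations offered above.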

### Next step: train the models in the next notebook, models.ipynb

In [None]:
# Demonstrate before-and-after missing value imputation
print('Before Imputation:')
print(df['Fat Percentage'].isnull().sum())

# Imputation
df['Fat Percentage'] = df['Fat Percentage'].fillna(df['Fat Percentage'].mean())

print('After Imputation:')
print(df['Fat Percentage'].isnull().sum())

In [None]:
# Correlation heatmap (numeric columns only; corr() cannot handle text columns such as Gender)
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Feature Correlation')
plt.show()

## Conclusion

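The dataset has been cleaned and saved to gym_members_exercise_cleaned.csv. Key actions included:

- Renaming columns for consistency.
- Handling missing values in 'Fat Percentage' and 'Water Intake' using mean imputation, and in 'Workout Type' using the mode.
- Clipping 'Age' to the range 18 to 65 and 'Calories Burned' to non-negative values.
- Removing rows in which any numeric column has an absolute z-score of 3 or more.
- Performing basic exploratory analysis.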

This dataset will now be used in the modeling phase to predict gym members' exercise outcomes.
