# **Mega GYM Dataset:** 
### **Dated:** June 3rd, 2024
### **Dataset Link:** https://www.kaggle.com/datasets/niharika41298/gym-exercise-data/data?select=megaGymDataset.csv
### Done by **Faizan Ahmad** (ma143faizan@gmail.com)

## **Main Goals:**
1. **To find top 5 rated exercises for each body part.** ✔
2. **To find top 5 rated exercises for each Type / Category of Workouts.** ✔
3. **To find top 5 rated exercises for each BodyPart at each Level of exercises.** ✔
4. **To ask the user for the Type, BodyPart, available Equipment, and their Level of fitness, and recommend top exercises on demand.** ✔

## **About Dataset:**
**Context:**
This dataset was created for analyzing and evaluating workouts that one can do at the gym (or at home) to stay healthy. Exercising and staying fit is becoming an important, almost daily routine for many people, and a data-driven approach is a good way to meet one's fitness goals.

**Inspiration:**
If you go to a gym, the first thing you notice is the myriad of exercises available. They range from bodyweight to machine-based to dumbbell/barbell-based. With so many options, beginners and even professionals can wonder which exercise best targets a specific muscle, and that is where this analysis can be useful. I also thought it would be fun to visualize the exercise details.

**Content:**
There is one file with 9 columns for each exercise. Columns may contain null values as the data is raw and scraped from various internet sources.


# **Data Cleaning and Pre-processing:**

In [34]:
# Importing the libraries and dataset
import pandas as pd
import numpy as np

# Importing the dataset without Unnamed column
df = pd.read_csv('megaGymDataset.csv', index_col=0)
df.head()

Unnamed: 0,Title,Desc,Type,BodyPart,Equipment,Level,Rating,RatingDesc
0,Partner plank band row,The partner plank band row is an abdominal exe...,Strength,Abdominals,Bands,Intermediate,0.0,
1,Banded crunch isometric hold,The banded crunch isometric hold is an exercis...,Strength,Abdominals,Bands,Intermediate,,
2,FYR Banded Plank Jack,The banded plank jack is a variation on the pl...,Strength,Abdominals,Bands,Intermediate,,
3,Banded crunch,The banded crunch is an exercise targeting the...,Strength,Abdominals,Bands,Intermediate,,
4,Crunch,The crunch is a popular core exercise targetin...,Strength,Abdominals,Bands,Intermediate,,


In [35]:
# Checking the unique values of the column
df.Type.unique()

array(['Strength', 'Plyometrics', 'Cardio', 'Stretching', 'Powerlifting',
       'Strongman', 'Olympic Weightlifting'], dtype=object)

In [36]:
# Taking a look at the data: the last 5 rows
df.tail()

Unnamed: 0,Title,Desc,Type,BodyPart,Equipment,Level,Rating,RatingDesc
2913,EZ-bar skullcrusher-,The EZ-bar skullcrusher is a popular exercise ...,Strength,Triceps,E-Z Curl Bar,Intermediate,8.1,Average
2914,Lying Close-Grip Barbell Triceps Press To Chin,,Strength,Triceps,E-Z Curl Bar,Beginner,8.1,Average
2915,EZ-Bar Skullcrusher - Gethin Variation,The EZ-bar skullcrusher is a popular exercise ...,Strength,Triceps,E-Z Curl Bar,Intermediate,,
2916,TBS Skullcrusher,The EZ-bar skullcrusher is a popular exercise ...,Strength,Triceps,E-Z Curl Bar,Intermediate,,
2917,30 Arms EZ-Bar Skullcrusher,,Strength,Triceps,E-Z Curl Bar,Intermediate,,


**Let's use ydata_profiling to do a detailed EDA and see if we can find anything interesting to work with and plan our next moves ...**

In [37]:
# importing ydata_profiling
from ydata_profiling import ProfileReport

# instantiating the report
profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)

# displaying the report
profile.to_notebook_iframe()

# # exporting the report
# profile.to_file("report.html")



## **Here's what we got:** 
- **Dataset Total Rows:** 2918
- **Duplicate Rows:** 7
- **Total size in Memory:** 1.7MB
- **Missing Values:**
  - Desc Column: 1550 (53.1%)
  - Equipment Column: 32 (1.1%)
  - Rating Column: 1887 (64.7%)
  - Rating Desc: 2056 (70.5%)
- Rating of an exercise doesn't depend much on BodyPart and Equipment.
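
The duplicate and missing-value figures in this list can also be computed with plain pandas, without the profiling report. The following is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical stand-ins, not rows from megaGymDataset.csv); the same calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for megaGymDataset.csv (not real data)
toy = pd.DataFrame({
    "Title": ["Crunch", "Crunch", "Plank", "Row"],
    "Desc": ["A core exercise", "A core exercise", None, "A back exercise"],
    "Equipment": ["Bands", "Bands", None, "Cable"],
    "Rating": [8.1, 8.1, np.nan, np.nan],
})

# Rows that repeat an earlier row exactly
n_duplicates = int(toy.duplicated().sum())

# Missing values per column, as counts and as percentages of all rows
missing = toy.isnull().sum()
missing_pct = (toy.isnull().mean() * 100).round(1)

summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
print(f"Rows: {len(toy)}, duplicate rows: {n_duplicates}")
print(summary)
```

Note that `duplicated()` counts every repeated occurrence after the first, so its total may differ from the way a profiling tool groups duplicates.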

In [38]:
# let's start by taking a look at the duplicate rows
df[df.duplicated()]

Unnamed: 0,Title,Desc,Type,BodyPart,Equipment,Level,Rating,RatingDesc
97,Decline bar press sit-up,The decline bar press sit-up is a weighted cor...,Strength,Abdominals,Barbell,Intermediate,8.5,Average
645,Exercise Ball Cable Crunch - Gethin Variation,The exercise ball crunch is a popular gym exer...,Strength,Abdominals,Cable,Intermediate,,
939,Band-suspended kettlebell bench press,The band-suspended kettlebell bench press is a...,Strength,Chest,Bands,Intermediate,,
958,Band-suspended kettlebell bench press,The band-suspended kettlebell bench press is a...,Strength,Chest,Bands,Intermediate,,
1709,Seated Cable Rows,The cable seated row is a popular exercise to ...,Strength,Middle Back,Cable,Intermediate,8.8,Average
1730,Seated Cable Rows,The cable seated row is a popular exercise to ...,Strength,Middle Back,Cable,Intermediate,8.8,Average
2004,Dumbbell step-up,The dumbbell step-up is a great exercise for b...,Strength,Quadriceps,Dumbbell,Intermediate,8.2,Average
2655,Arnold press,Named after the iconic bodybuilder and movie s...,Strength,Shoulders,Dumbbell,Intermediate,8.9,Average
2658,Seated rear delt fly,The seated rear delt fly is an upper-body exer...,Strength,Shoulders,Dumbbell,Intermediate,8.4,Average


In [39]:
# Dropping the duplicates
df.drop_duplicates(inplace=True, keep='first')

In [40]:
# info about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2909 entries, 0 to 2917
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Title       2909 non-null   object 
 1   Desc        1359 non-null   object 
 2   Type        2909 non-null   object 
 3   BodyPart    2909 non-null   object 
 4   Equipment   2877 non-null   object 
 5   Level       2909 non-null   object 
 6   Rating      1025 non-null   float64
 7   RatingDesc  856 non-null    object 
dtypes: float64(1), object(7)
memory usage: 204.5+ KB


In [41]:
# Checking for unique values in the Rating and RatingDesc columns: 
# We can impute the missing values in the RatingDesc column based on the Rating column
print(df.Rating.unique())
print(df.RatingDesc.unique())

[0.  nan 8.9 8.5 8.3 7.  4.7 7.7 7.3 9.3 8.6 9.5 9.2 9.  8.8 8.4 8.  9.1
 8.2 8.1 7.9 5.  8.7 7.8 7.5 7.4 6.9 6.5 3.9 6.4 4.  2.8 6.7 3.8 2.4 1.6
 7.1 3.6 3.2 5.8 7.6 7.2 4.8 3.3 1.  6.  5.3 2.7 6.3 5.6 4.1 4.9 4.2 5.5
 5.9 3.  9.4 6.2 9.6 2.5 5.2 6.6 3.5 3.1 4.4 4.3 5.4 4.5 5.1 5.7 6.8 3.4
 6.1]
[nan 'Average']


**Dropping the null values will remove many rows, mostly because of the Desc column. But if we are building a gym app, it is better to exclude incomplete records and keep the app credible and reliable ...**

In [44]:
# Dropping all the NaN values
df.dropna(inplace=True)
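
To make the trade-off concrete, here is a small sketch comparing a full `dropna()` with one restricted to the Rating column. The frame `toy` is hypothetical illustration data, and restricting to `Rating` is only an alternative to consider, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not real data from the dataset)
toy = pd.DataFrame({
    "Title": ["Crunch", "Plank", "Row", "Curl"],
    "Desc": ["A core exercise", None, "A back exercise", None],
    "Rating": [8.1, 7.5, np.nan, np.nan],
})

# Drop a row if ANY column is missing (the default behaviour of dropna)
strict = toy.dropna()

# Drop a row only if Rating is missing; rows without Desc are kept
rating_only = toy.dropna(subset=["Rating"])

print(len(toy), len(strict), len(rating_only))
```

The strict variant keeps only fully described, rated exercises, which matches the credibility argument above at the cost of a much smaller table.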

**Initially I planned to impute the missing RatingDesc values from the Rating column. But since RatingDesc contains only one distinct value ('Average'), I'll drop the column and create a new one that describes each exercise based on its Rating ...**

In [10]:
# Dropping the RatingDesc column because every value is either NaN or 'Average'
df.drop(columns=["RatingDesc"], inplace=True)

In [45]:
# Range of values in the Rating column
df1 = df.Rating.sort_values()
df1.unique()

array([0. , 1.6, 3. , 3.2, 3.4, 3.6, 4. , 4.4, 4.5, 4.7, 4.8, 4.9, 5. ,
       5.1, 5.3, 5.4, 5.5, 5.7, 5.9, 6. , 6.2, 6.5, 6.6, 6.7, 6.8, 6.9,
       7. , 7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 8. , 8.1, 8.2,
       8.3, 8.4, 8.5, 8.6, 8.7, 8.8, 8.9, 9. , 9.1, 9.2, 9.3, 9.4, 9.5,
       9.6])

## **Creating a new RatingDesc column based on the Rating column:**

In [12]:
# Creating a new RatingDesc column derived from the Rating column
def impute_rating(row):
    """Map a numeric Rating to a descriptive label."""
    if row['Rating'] == 0.0:
        return 'No Rating'
    elif row['Rating'] <= 4.0:
        return 'Below Average'
    elif row['Rating'] <= 7.0:
        return 'Average'
    else:
        return 'Above Average'

# Pass the function directly; no lambda wrapper is needed
df['RatingDesc'] = df.apply(impute_rating, axis=1)

df.head()

Unnamed: 0,Title,Desc,Type,BodyPart,Equipment,Level,Rating,RatingDesc
8,Barbell roll-out,The barbell roll-out is an abdominal exercise ...,Strength,Abdominals,Barbell,Intermediate,8.9,Above Average
9,Barbell Ab Rollout - On Knees,The barbell roll-out is an abdominal exercise ...,Strength,Abdominals,Barbell,Intermediate,8.9,Above Average
10,Decline bar press sit-up,The decline bar press sit-up is a weighted cor...,Strength,Abdominals,Barbell,Intermediate,8.5,Above Average
11,Bench barbell roll-out,The bench barbell roll-out is a challenging ex...,Strength,Abdominals,Barbell,Beginner,8.3,Above Average
13,Seated bar twist,The seated bar twist is a core exercise meant ...,Strength,Abdominals,Barbell,Intermediate,4.7,Average
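
As an alternative to the row-wise `impute_rating` function, the same labels can be produced in one vectorized step with `pd.cut`. This is a sketch under the assumption that ratings are never negative (so the first bin only ever catches 0.0); `toy`, `bins`, and `labels` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings chosen to hit each boundary of impute_rating
toy = pd.DataFrame({"Rating": [0.0, 1.6, 4.0, 4.7, 7.0, 8.9]})

# Right-closed intervals: (-inf, 0], (0, 4], (4, 7], (7, inf)
bins = [-np.inf, 0.0, 4.0, 7.0, np.inf]
labels = ["No Rating", "Below Average", "Average", "Above Average"]

# pd.cut returns a categorical; convert to plain strings for display
toy["RatingDesc"] = pd.cut(toy["Rating"], bins=bins, labels=labels).astype(str)
print(toy)
```

On a frame of this size the difference is negligible, but the vectorized form avoids calling a Python function once per row.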


In [46]:
df.isnull().sum()

Title         0
Desc          0
Type          0
BodyPart      0
Equipment     0
Level         0
Rating        0
RatingDesc    0
dtype: int64

### Now that our data is clean, we can move forward in life and do other important things like doing EDA:

# **Main Goals:**

## 1. **Top 5 Rated Exercises for each body part:**

In [14]:
# to show all rows in jupyter output
pd.set_option('display.max_rows', None)

In [None]:
# Group by 'BodyPart' and then apply the top 5 highest rated exercises for each group
df1 = df.groupby('BodyPart').apply(lambda x: x.nlargest(5, 'Rating')).reset_index(drop=True)

# Sort by 'BodyPart' and 'Rating'
df1 = df1.sort_values(by=['BodyPart', 'Rating'], ascending=[True, False])
# Show only the columns 'BodyPart', 'Title', and 'Rating'
df1[['BodyPart','Title','Rating']]


Unnamed: 0,BodyPart,Title,Rating
0,Abdominals,Landmine twist,9.5
1,Abdominals,Dumbbell V-Sit Cross Jab,9.3
2,Abdominals,Dumbbell spell caster,9.3
3,Abdominals,Suspended ab fall-out,9.3
4,Abdominals,Standing cable low-to-high twist,9.3
5,Abductors,Iliotibial band SMR,8.2
6,Abductors,Thigh abductor,8.2
7,Abductors,Single-leg lying cross-over stretch,1.6
8,Abductors,Standing hip circle,0.0
9,Adductors,Thigh adductor,9.0
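
The same "top N per group" result can also be obtained by sorting once and taking the head of each group, which avoids `groupby(...).apply(...)`. Below is a minimal sketch on made-up data; `top_n_per_group` and `toy` are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy data (not real rows from the dataset)
toy = pd.DataFrame({
    "BodyPart": ["Biceps", "Biceps", "Biceps", "Chest", "Chest"],
    "Title": ["Curl A", "Curl B", "Curl C", "Press A", "Press B"],
    "Rating": [8.0, 9.1, 7.5, 6.0, 8.8],
})

def top_n_per_group(frame, group_col, n):
    """Return the n highest-rated rows within each value of group_col."""
    ordered = frame.sort_values([group_col, "Rating"], ascending=[True, False])
    return ordered.groupby(group_col).head(n).reset_index(drop=True)

top2 = top_n_per_group(toy, "BodyPart", 2)
print(top2[["BodyPart", "Title", "Rating"]])
```

Groups with fewer than `n` rows simply return all of their rows, which is why a body part such as Abductors can show fewer than five exercises, including low-rated ones.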


## 2. **Top 5 rated exercises for each Type / Category of Workouts:**

In [None]:
# Group by 'Type' and then apply the top 5 highest rated exercises for each group
df2 = df.groupby('Type').apply(lambda x: x.nlargest(5, 'Rating')).reset_index(drop=True)

# Sort by 'Type' and 'Rating'
df2 = df2.sort_values(by=['Type', 'Rating'], ascending=[True, False])
# Show only the columns 'Type', 'BodyPart', 'Title', and 'Rating'
df2[['Type','BodyPart','Title','Rating']]


Unnamed: 0,Type,BodyPart,Title,Rating
0,Cardio,Quadriceps,Jumping rope,9.2
1,Cardio,Quadriceps,Stair climber,9.1
2,Cardio,Middle Back,Rower,8.9
3,Cardio,Quadriceps,Elliptical trainer,8.8
4,Cardio,Quadriceps,Stairmaster,8.8
5,Olympic Weightlifting,Quadriceps,Push-press,9.3
6,Olympic Weightlifting,Quadriceps,Power snatch-,9.3
7,Olympic Weightlifting,Shoulders,Clean and jerk,9.3
8,Olympic Weightlifting,Quadriceps,Push-jerk,8.2
9,Olympic Weightlifting,Shoulders,Behind-the-head push-press,6.5


## 3. **Top 5 rated exercises for each BodyPart and each Level of exercises:**

In [None]:
# Group by 'Level' and 'BodyPart' and then apply the top 5 highest rated exercises for each group
df2 = df.groupby(['Level', 'BodyPart']).apply(lambda x: x.nlargest(5, 'Rating')).reset_index(drop=True)

# Sort by 'Level', 'BodyPart', and 'Rating'
df2 = df2.sort_values(by=['Level', 'BodyPart', 'Rating'], ascending=[True, True, False])
df2[['Level','BodyPart','Title','Rating']]


Unnamed: 0,Level,BodyPart,Title,Rating
0,Beginner,Abdominals,Dumbbell spell caster,9.3
1,Beginner,Abdominals,Bench barbell roll-out,8.3
2,Beginner,Abdominals,Cable reverse crunch,8.2
3,Beginner,Abdominals,Leg Pull-In,7.4
4,Beginner,Abdominals,Partner front Russian twist and pass,0.0
5,Beginner,Abductors,Single-leg lying cross-over stretch,1.6
6,Beginner,Adductors,Adductor SMR,4.0
7,Beginner,Biceps,Wide-grip barbell curl,9.3
8,Beginner,Biceps,Biceps curl to shoulder press,9.1
9,Beginner,Biceps,Close-grip EZ-bar curl,8.9


## 4. **To ask the user for the Type, BodyPart, available Equipment, and their Level of fitness, and recommend top exercises on demand:**

In [24]:
columns = ['Type', 'BodyPart', 'Equipment', 'Level']
for col in columns:
    unique_values = df[col].unique()
    print(f"Unique values in '{col}':\n {unique_values}")


Unique values in 'Type':
 ['Strength' 'Plyometrics' 'Stretching' 'Powerlifting' 'Strongman' 'Cardio'
 'Olympic Weightlifting']
Unique values in 'BodyPart':
 ['Abdominals' 'Abductors' 'Adductors' 'Biceps' 'Calves' 'Chest' 'Forearms'
 'Glutes' 'Hamstrings' 'Lats' 'Lower Back' 'Middle Back' 'Traps'
 'Quadriceps' 'Shoulders' 'Triceps']
Unique values in 'Equipment':
 ['Barbell' 'Kettlebells' 'Dumbbell' 'Other' 'Cable' 'Machine' 'Body Only'
 'Medicine Ball' 'Exercise Ball' 'Foam Roll' 'E-Z Curl Bar' 'Bands']
Unique values in 'Level':
 ['Intermediate' 'Beginner' 'Expert']


In [30]:
filtered_df = df.loc[
    (df['Type'] == 'Strength') & 
    (df['BodyPart'] == 'Biceps') & 
    (df['Equipment'] == 'Dumbbell') & 
    (df['Level'] == 'Beginner')
]

filtered_df


Unnamed: 0,Title,Desc,Type,BodyPart,Equipment,Level,Rating,RatingDesc
733,Biceps curl to shoulder press,The biceps curl to shoulder press is a dumbbel...,Strength,Biceps,Dumbbell,Beginner,9.1,Above Average
737,Cross-body hammer curl,The cross-body hammer curl is a dumbbell exerc...,Strength,Biceps,Dumbbell,Beginner,8.9,Above Average
741,Standing concentration curl,The standing concentration curl is a variation...,Strength,Biceps,Dumbbell,Beginner,8.7,Above Average
748,Standing Dumbbell Reverse Curl,"With the reverse-grip dumbbell curl, the palms...",Strength,Biceps,Dumbbell,Beginner,8.1,Above Average
749,Palms-out incline biceps curl,The palms-out incline biceps curl is an exerci...,Strength,Biceps,Dumbbell,Beginner,8.0,Above Average
750,Straight-arm plank with biceps curl,The straight-arm plank with biceps curl is a h...,Strength,Biceps,Dumbbell,Beginner,7.7,Above Average
751,Face-down incline dumbbell biceps curl,The face-down incline dumbbell biceps curl is ...,Strength,Biceps,Dumbbell,Beginner,7.6,Above Average
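
The hard-coded filter above can be wrapped in a small reusable function so that the four criteria come from the user. This is only a sketch: `recommend`, its `top_n` parameter, and the `toy` frame are hypothetical names and data introduced for illustration, and matching is exact (case-sensitive) on the column values.

```python
import pandas as pd

def recommend(frame, top_n=5, **criteria):
    """Filter frame on exact column matches, then return the top_n rows by Rating."""
    mask = pd.Series(True, index=frame.index)
    for column, wanted in criteria.items():
        mask &= frame[column] == wanted
    return frame[mask].nlargest(top_n, "Rating")

# Hypothetical toy data (not real rows from the dataset)
toy = pd.DataFrame({
    "Title": ["Curl A", "Curl B", "Curl C", "Press A"],
    "Type": ["Strength", "Strength", "Strength", "Strength"],
    "BodyPart": ["Biceps", "Biceps", "Biceps", "Chest"],
    "Equipment": ["Dumbbell", "Dumbbell", "Barbell", "Dumbbell"],
    "Level": ["Beginner", "Beginner", "Beginner", "Beginner"],
    "Rating": [8.0, 9.1, 9.5, 8.8],
})

picks = recommend(
    toy, top_n=2,
    Type="Strength", BodyPart="Biceps", Equipment="Dumbbell", Level="Beginner",
)
print(picks[["Title", "Rating"]])
```

In the notebook, the keyword values could be collected with Python's built-in `input()` and passed to `recommend(df, ...)`; any criterion that is left out is simply not filtered on.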
