## Table of Contents
1. Data preparation
2. Exploratory data analysis
3. Model training
4. Conclusion

### Analysis of Exercise and Fitness Metrics and Exercise Intensity Prediction
Engaging in regular physical activity is essential for individuals to maintain optimal health and well-being. 
The benefits of physical activity extend far beyond just physical fitness. They encompass mental, emotional, and 
social aspects of our lives, making it an integral part of human existence. 
Likewise, the topic of health is one of the most critical subjects for humanity to prioritize due to its profound 
impact on individuals and society as a whole.

### Gathering individual health data when working out at a gym is crucial for several reasons

Firstly, tracking personal health data allows individuals to monitor their progress and make informed decisions about their fitness goals. By collecting data, individuals can tailor their workout routines, adjust intensity levels, and make necessary modifications to achieve desired results effectively.
Secondly, individual health data provides insights into overall health and potential risk factors. By regularly monitoring metrics such as heart rate variability, resting heart rate, and blood pressure, individuals can identify any abnormalities or potential health issues.

Thirdly, collecting health data promotes accountability and motivation. When individuals track their progress and see tangible results, it serves as a powerful motivator to continue their fitness journey.

Lastly, the aggregation of health data from gym-goers can contribute to research and the development of evidence-based practices. With consent and proper anonymization, aggregated health data can be used to identify trends, patterns, and correlations that can benefit the larger population.

This project is mostly EDA practice: we take a closer look at the data and look for patterns in it. We will also do a small data preparation step, since it's good practice overall, and try to predict Exercise Intensity as our ML task, but there won't be any meaningful tuning and I don't expect good results from it. (Don't mind the sklearn pipeline, I'm just probing its possibilities.)

## Description of columns:

- ID - A unique identifier for each sample in the dataset.
- Exercise - The type of exercise performed during the session
- Calories Burn - The estimated number of calories burned during the exercise session.
- Dream Weight - The desired weight of the individual.
- Actual Weight - The measured weight of the individual, including natural variation.
- Age - The age of the individual performing the exercise.
- Gender - The gender of the individual (Male or Female)
- Duration - The duration of each exercise session in minutes.
- Heart Rate - The average heart rate during the exercise session.
- BMI - The body mass index of the individual, indicating body composition.
- Weather Conditions
- Exercise Intensity

## Data preparation
Let's start by loading all the necessary libraries and the dataframe:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import phik
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OrdinalEncoder, Normalizer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# RandomState
state = np.random.RandomState(12345)

In [None]:
df = pd.read_csv('dataset.csv')

Let's peek into our data:

In [None]:
df.head()

In [None]:
df.info()

The data is easy to interpret.

Let's remove the ID column, since the stock pandas index is enough for us, and proceed further:

In [None]:
df.drop('ID', axis=1, inplace=True)

And let's keep only the numeric part of the `Exercise` column (the actual cast to int happens in the next step):

In [None]:
df['Exercise'] = df['Exercise'].map(lambda x: ''.join([i for i in x if i.isdigit()]))
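
The same digit extraction can also be written with pandas string methods. The snippet below is only a sketch on made-up values: it assumes the raw column looks like `Exercise N`, and the names `exercise` and `exercise_id` are hypothetical.

```python
import pandas as pd

# Hypothetical sample values; the real column is assumed to look like "Exercise N".
exercise = pd.Series(['Exercise 1', 'Exercise 10', 'Exercise 5'])

# Drop every non-digit character, then cast the remaining digits to a small int dtype.
exercise_id = exercise.str.replace(r'\D', '', regex=True).astype('int16')

print(exercise_id.tolist())
```

Like the lambda version, this keeps all digits found in the string, so it behaves the same way as long as each value contains a single number.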

And give some columns more suitable dtypes:

In [None]:
ints = [
    'Exercise', 
    'Age', 
    'Duration', 
    'Heart Rate', 
    'Exercise Intensity'
]

categories = [
    'Gender', 
    'Weather Conditions'
]

for col in ints:
    df[col] = df[col].astype('int16')

for col in categories:
    df[col] = df[col].astype('category')
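
One practical reason to pick narrower dtypes is memory usage. Here is a minimal sketch on a made-up frame (the `toy` dataframe and its values are assumptions, not the real dataset) that compares the footprint before and after the same kind of conversion:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, standing in for two of the real columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Age': rng.integers(18, 60, size=1000),                # int64 by default
    'Gender': rng.choice(['Male', 'Female'], size=1000),   # plain strings
})

before = toy.memory_usage(deep=True).sum()

# Same conversions as above: small ints and categoricals.
toy['Age'] = toy['Age'].astype('int16')
toy['Gender'] = toy['Gender'].astype('category')

after = toy.memory_usage(deep=True).sum()
print(f'before: {before} bytes, after: {after} bytes')
```

Note that `int16` only holds values up to 32767, which should be plenty for columns like age or heart rate.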

As we can see, the data may vary greatly from one individual to another.

P.S. It's always pleasant to work with clean and prepared data. The synthetic origin is rather obvious, though.

## Exploratory data analysis

Let's start with checking descriptive statistics:

In [None]:
df.describe()

`Calories Burn`, `Duration` and `Age` differ greatly among respondents.
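
To put a number on "differ greatly", one option is the coefficient of variation (std divided by mean), which can be read straight off the `describe()` output. This is just a sketch with made-up numbers; `toy` and `cv` are hypothetical names.

```python
import pandas as pd

# Made-up values, only to illustrate the calculation.
toy = pd.DataFrame({
    'Calories Burn': [120.0, 310.0, 480.0, 250.0],
    'Heart Rate': [140, 150, 145, 155],
})

stats = toy.describe()

# Coefficient of variation: relative spread, comparable across columns with different units.
cv = stats.loc['std'] / stats.loc['mean']
print(cv.sort_values(ascending=False))
```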

Now let's check our data distributions:

In [None]:
_, axs = plt.subplots(3, 2, figsize=[10,15])
cols = [
    (axs[0,0], 'Calories Burn'),
    (axs[0,1], 'Dream Weight'),
    (axs[1,0], 'Actual Weight'),
    (axs[1,1], 'Age'),
    (axs[2,0], 'Duration'),
    (axs[2,1], 'Heart Rate'),
]
for ax, col in cols:
    ax.set(title=col)
    ax.hist(df[col], bins=40)
plt.figure(figsize=[10,4])
plt.title('BMI')
plt.hist(df['BMI'], bins=40)

_, axs = plt.subplots(2, 2, figsize=[11,11])
cols = [
    (axs[0,0], 'Exercise'),
    (axs[0,1], 'Exercise Intensity'),
    (axs[1,0], 'Gender'),
    (axs[1,1], 'Weather Conditions')
]

cmap = plt.get_cmap('Blues')

for ax, col in cols:
    ax.set(title=col)
    values = df[col].value_counts().sort_index()
    ind = values.index
    colors = list(cmap(np.linspace(0.1, 0.8, len(values))))
    ax.pie(
        values, 
        labels=ind, 
        autopct='%.1f%%', 
        colors=colors
    )

The most common `Exercise` group (the fifth one) and the rarest (the second) differ by only 50 entries.
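
A gap like this can be checked directly with `value_counts` instead of reading it off the pie chart. A small sketch on made-up labels (the `exercise` series here is hypothetical):

```python
import pandas as pd

# Made-up exercise labels, only to show the calculation.
exercise = pd.Series([1, 1, 1, 2, 3, 3])

counts = exercise.value_counts()

most_common = counts.idxmax()      # label with the most entries
rarest = counts.idxmin()           # label with the fewest entries
gap = counts.max() - counts.min()  # difference in number of entries

print(most_common, rarest, gap)
```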

`Actual Weight` is roughly normally distributed, that's nice.

If the data is a random sample, then seeing so many people over 50 exercising is a good sign. Engaging in activities that promote physical health and maintaining a healthy lifestyle can indeed reduce the risk of developing certain age-related diseases.

It's strange to see almost the same number of `Duration == 20` and `Duration == 60` entries. It's hard to imagine that as many people come to the gym for 20-minute workouts as for 60-minute ones, given that 60-minute workouts are considered optimal and are more popular. (The dataset author really should consider changing this distribution to an exponential one.)

Men and women are almost equally represented among the respondents. People also don't skip their workouts on rainy days; I expected fewer entries on rainy days.

P.S. Let's try to guess which region the statistics could have been collected from (if they weren't synthetic). Personally, I think it could be the UK, Ireland, the Pacific Northwest of the USA (including Seattle), Belgium, the Netherlands, Germany, Vancouver (Canada) or Scotland.

Let's continue with a correlation matrix. We use phik here, which also picks up non-linear dependencies and not only linear ones:

In [None]:
cols = [
    'Calories Burn', 
    'Dream Weight',
    'Actual Weight',
    'Age',
    'Duration',
    'Heart Rate',
    'BMI',
    'Exercise Intensity'
]

sns.heatmap(df.phik_matrix(interval_cols=cols), cmap ='seismic')
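
The heatmap above is built with `phik_matrix`. For a strictly linear view, plain Pearson correlation from pandas can serve as a cross-check. The sketch below uses a made-up frame where the dream weight is deliberately generated from the actual weight, so it only illustrates the call and is not a result from the real data:

```python
import numpy as np
import pandas as pd

# Made-up data: 'Dream Weight' is built from 'Actual Weight' plus noise,
# 'Age' is drawn independently.
rng = np.random.default_rng(12345)
actual = rng.normal(75, 10, size=500)
toy = pd.DataFrame({
    'Actual Weight': actual,
    'Dream Weight': actual + rng.normal(0, 5, size=500),
    'Age': rng.integers(18, 60, size=500),
})

# Pearson measures linear association only, with values in [-1, 1].
pearson = toy.corr(method='pearson')
print(pearson.round(2))
```

If a pair looks strong in phik but weak in Pearson, that hints at a non-linear relationship.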

As we can see, the columns don't correlate with each other, except for the `Actual Weight`/`Dream Weight` pair. This could be because people take their current weight as a reference point when envisioning their preferred weight.

Now it's time for a little feature engineering. Let's add a `Weight Difference` column holding the absolute value of `Actual Weight` - `Dream Weight`, and a `Gain` column describing whether the respondent wants to gain or lose weight:

In [None]:
def gain(x):
    # Negative difference: actual weight is below dream weight, so the goal is to gain
    return 'Gain' if x < 0 else 'Lose'

df['Weight Difference'] = df['Actual Weight'] - df['Dream Weight']
df['Gain'] = df['Weight Difference'].apply(gain).astype('category')
df['Weight Difference'] = abs(df['Weight Difference'])
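
The same two features can also be built without `apply`, using `np.where` on the signed difference. A sketch on a made-up frame (`toy` and `signed` are hypothetical names); as in the cell above, a zero difference ends up labelled `'Lose'`:

```python
import numpy as np
import pandas as pd

# Made-up weights, only to illustrate the transformation.
toy = pd.DataFrame({
    'Actual Weight': [80.0, 60.0, 70.0],
    'Dream Weight': [75.0, 65.0, 70.0],
})

# Signed difference: negative means the person is below their dream weight.
signed = toy['Actual Weight'] - toy['Dream Weight']

# Vectorised labelling instead of a row-wise apply.
toy['Gain'] = pd.Series(
    np.where(signed < 0, 'Gain', 'Lose'), index=toy.index
).astype('category')

# Keep only the magnitude of the difference.
toy['Weight Difference'] = signed.abs()

print(toy)
```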

Now let's check the distributions of the new columns:

In [None]:
_, axs = plt.subplots(1, 2, figsize=[11,4])
cols = [
    (axs[0], 'Weight Difference'),
    (axs[1], 'Gain')
]
for ax, col in cols:
    ax.set(title=col)
    ax.hist(df[col], bins=40)

`Weight Difference` looks fairly chaotic, and `Gain` is split almost equally, with a slight preference for weight loss.

Let's see if there are weight gain/loss preferences among men and women:

In [None]:
colors = cmap(np.linspace(0.1, 0.8, 2))
df.pivot_table(
    columns='Gain',
    index='Gender',
    values='Exercise',
    aggfunc='count'
).plot(
    kind='pie', 
    autopct='%.1f%%',
    colors=colors, 
    subplots=True, 
    figsize=[11,11], 
    legend=False
)
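
If exact shares are more useful than pie charts, `pd.crosstab` with `normalize='index'` gives the gain/lose proportions within each gender. A sketch on made-up rows (the `toy` frame is hypothetical):

```python
import pandas as pd

# Made-up rows, only to illustrate the table layout.
toy = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Female', 'Male'],
    'Gain':   ['Gain', 'Lose', 'Lose', 'Gain', 'Lose', 'Lose'],
})

# normalize='index' turns counts into row-wise shares, so each gender sums to 1.
shares = pd.crosstab(toy['Gender'], toy['Gain'], normalize='index')
print(shares.round(3))
```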

Almost equal.

Let's continue with the median actual/dream weights: