# WHAT IMPACTS BODY PERFORMANCE?

![Body Performance](https://images.unsplash.com/photo-1502904550040-7534597429ae?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1700&q=80)

In this notebook, I analyze the features that make some individuals perform better than others and check for possible correlations among them. The dataset contains more than 13,000 observations of individuals aged 20 to 64, with their physical characteristics and their results in several tests.

The objective of this notebook is to use unsupervised learning to discover possible clusters of individuals with common characteristics. Along the way, I will also explore the data visually.

## TABLE OF CONTENTS


1. [LIBRARIES AND PARAMETERS](#1)
2. [DATA LOADING](#2)
3. [DATA CLEANING](#3)
4. [DATA PREPROCESSING](#4)
5. [DIMENSIONALITY REDUCTION](#5)
6. [CLUSTERING](#6)
7. [EVALUATING MODELS](#7)
8. [PROFILING](#8)
9. [LINEAR REGRESSION](#9)
10. [CONCLUSIONS](#10)


<a id="1"></a>
## LIBRARIES AND PARAMETERS

In [1]:
# Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from yellowbrick.cluster import KElbowVisualizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [2]:
# Parameters
sns.set_theme(style=None, palette='pastel')

<a id="2"></a>
## DATA LOADING

In [3]:
df = pd.read_csv('/kaggle/input/body-performance-data/bodyPerformance.csv')

df.head()

<a id="3"></a>
## DATA CLEANING

In this section, I will look for values that make the data untidy. Let's start by looking at the features' datatypes and the most common statistics.

In [4]:
df.info()

The dataset information shows that there aren't any missing values. The datatypes of the features seem to be correct as well. However, before continuing, I will set the correct order for the **class** feature.

In [5]:
df['class'] = pd.Categorical(df['class'], categories=['A', 'B', 'C', 'D'], ordered=True)
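To illustrate what the ordered categorical buys us, here is a minimal sketch on a made-up series (the values are hypothetical, not taken from the dataset). It shows that sorting and comparisons follow the declared order A < B < C < D rather than relying on plain string comparison.

```python
import pandas as pd

# Hypothetical grades, only for illustration
grades = pd.Series(
    pd.Categorical(["B", "D", "A", "C"], categories=["A", "B", "C", "D"], ordered=True)
)

# Sorting follows the declared category order
sorted_grades = grades.sort_values().tolist()

# Comparisons against a category are allowed because the categorical is ordered
top_two = (grades <= "B").tolist()

# min() returns the first category in the declared order that is present
best_grade = grades.min()

print(sorted_grades, top_two, best_grade)
```

With `ordered=True`, plots and group operations that respect category order will also list the classes from A to D.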

Once the order is set up, let's continue by analyzing its most significant statistics.

In [6]:
df.describe()

The description of the dataset shows some impossible values. In the **sit and bend forward_cm** test, one or more individuals scored less than 0, and at least one individual scored 213 cm. That is impossible! Let's take care of it.

In [7]:
df.sort_values('sit and bend forward_cm', ascending=False).head(3)

As we can see, two individuals scored an incredible 185 and 213 cm. This is impossible, so I will filter the dataframe to remove these outliers.

In [8]:
df = df[df['sit and bend forward_cm'] < 50]
df.sort_values('sit and bend forward_cm', ascending=False).head(3)

Once that's taken care of, let's analyze the values where the test result was below 0.

In [9]:
df[df['sit and bend forward_cm'] < 0].head()

It is impossible to score less than 0 in that test so I'm going to filter the dataframe to remove those values as well.


In [10]:
df = df[df['sit and bend forward_cm'] >= 0]
df[df['sit and bend forward_cm'] < 0].head()
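The two filters above (keep values below 50 and at or above 0) can also be written as a single boolean mask. The following is a minimal sketch on a made-up toy column, assuming the same thresholds; the numbers are illustrative only.

```python
import pandas as pd

col = "sit and bend forward_cm"

# Hypothetical values, including the kinds of outliers discussed above
toy = pd.DataFrame({col: [-25.0, 0.0, 16.3, 49.9, 185.0, 213.0]})

# inclusive="left" keeps 0 <= value < 50, matching the two separate filters
mask = toy[col].between(0, 50, inclusive="left")

# .copy() makes the result an independent dataframe, so later column
# assignments do not act on a view of the original
filtered = toy[mask].copy()

print(filtered[col].tolist())
```

Taking an explicit copy after filtering is a small design choice that avoids pandas chained-assignment warnings when new columns are added later.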

Now, let's check for any remaining outliers by plotting histograms.

In [11]:
plt.figure(figsize=(16, 16))

for i, column in enumerate(df.columns, 1):
    plt.subplot(4,3,i)
    sns.histplot(df[column])

The data looks fairly normally distributed. However, it seems that some features contain zeros that stand in for missing values. I will filter the dataframe to remove all observations where one of these values is 0, except for **sit and bend forward_cm**, because it's possible to score 0 in this test.

In [12]:
cols_check = ['systolic', 'diastolic', 'gripForce', 'sit-ups counts', 'broad jump_cm']

for i in cols_check:
    df = df[df[i] != 0]

Before proceeding, I'm going to create a variable that combines height, weight and body fat. The purpose of this variable is to take all three parameters into account when checking the test results. For example, an individual with high weight but low body fat can perform well in the sit-ups test, while a short individual may perform worse in the broad jump test.

The new variable will be called **lean_BMI**. It is calculated by dividing the lean body mass by the squared height, analogous to the BMI formula but using lean mass instead of total weight.

In [13]:
body_fat = df['body fat_%'] / 100
height_m = df['height_cm'] / 100
lean_mass = df['weight_kg'] * (1-body_fat)

df['lean_BMI'] = lean_mass / height_m**2
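As a worked example of the formula, here is a small sketch for a single hypothetical individual (75 kg, 20% body fat, 175 cm). The helper name `lean_bmi` is my own and is not part of the notebook's code.

```python
def lean_bmi(weight_kg, body_fat_pct, height_cm):
    """Lean body mass (kg) divided by squared height (m^2)."""
    lean_mass_kg = weight_kg * (1 - body_fat_pct / 100)
    height_m = height_cm / 100
    return lean_mass_kg / height_m ** 2

# 75 kg at 20% body fat -> 60 kg lean mass; 1.75 m squared -> 3.0625 m^2
example = lean_bmi(weight_kg=75.0, body_fat_pct=20.0, height_cm=175.0)
print(round(example, 2))  # 60 / 3.0625 ≈ 19.59
```

For comparison, the regular BMI of the same individual would be 75 / 3.0625 ≈ 24.49, so the lean version is always lower by the body fat fraction.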

The data is now clean. Let's plot the **correlation** between the variables in a heatmap.

In [14]:
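# Compute the correlation matrix on numeric columns only
# (recent pandas versions raise an error if text columns such as gender are included)
corr = df.corr(numeric_only=True)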

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
fig, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

plt.show()

From the above chart we can see that some variables are strongly negatively correlated, like age and test results or body fat and test results. This means that the older an individual is, or the higher their body fat, the worse they tend to perform in the tests.

<a id="4"></a>
## DATA PREPROCESSING

In this section, I will preprocess the data to perform clustering operations.

The following steps are applied to preprocess the data:

* Label encoding for categorical features.
* Scale the features using the standard scaler.
* Create a subset dataframe for dimensionality reduction.

In [15]:
# Create a dataset copy for ML purposes
df_ml = df.copy()

df_ml['class'] = df_ml['class'].astype(str)

In [16]:
# Get the list of categorical variables
c = (df_ml.dtypes == 'object')
object_cols = (list(c[c].index))

print('Categorical variables in the dataset:', object_cols)

In [17]:
# Label Encoding the object dtypes
le = LabelEncoder()
for i in object_cols:
    df_ml[i] = df_ml[[i]].apply(le.fit_transform)
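To make the encoding step concrete, here is a minimal sketch of what `LabelEncoder` does on a made-up gender column. It assigns integer codes following the sorted order of the labels, so `F` becomes 0 and `M` becomes 1. The variable names are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# Hypothetical labels, only for illustration
codes = encoder.fit_transform(["M", "F", "F", "M"])

# classes_ holds the sorted unique labels; their position is the integer code
mapping = {
    str(label): int(code)
    for label, code in zip(encoder.classes_, encoder.transform(encoder.classes_))
}

print(codes.tolist(), mapping)
```

Note that every call to `fit_transform` refits the encoder, so when one encoder object is reused in a loop it only remembers the classes of the last column. Keeping one encoder per column is a possible alternative if the codes need to be mapped back later.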

In [18]:
# Scaling
scaler = StandardScaler()
scaler.fit(df_ml)
scaled_df = pd.DataFrame(scaler.transform(df_ml),columns=df_ml.columns )

After label encoding the categorical variables and scaling the features, the dataframe that I will use for dimensionality reduction is the following:

In [19]:
scaled_df.head()
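As a quick sanity check of what standard scaling does, the sketch below scales a tiny made-up dataframe and verifies that each column ends up with mean 0 and unit standard deviation. Passing `index=toy.index` is an optional choice of mine to keep the original row labels; the values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical measurements with a non-default index
toy = pd.DataFrame(
    {"height_cm": [160.0, 170.0, 180.0], "weight_kg": [55.0, 70.0, 85.0]},
    index=[10, 11, 12],
)

scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy),
    columns=toy.columns,
    index=toy.index,  # keep row labels aligned with the source dataframe
)

col_means = scaled.mean()
# StandardScaler divides by the population standard deviation (ddof=0)
col_stds = scaled.std(ddof=0)

print(col_means.round(6).tolist(), col_stds.round(6).tolist())
```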

<a id="5"></a>
## DIMENSIONALITY REDUCTION

In this dataset, there are a lot of factors that influence the final classification. These factors are features. The more features there are, the harder the data is to work with. As we saw before, many of these features are correlated and therefore redundant. This is why I will perform dimensionality reduction before clustering.

For this process, I will use **principal component analysis (PCA)**. This technique allows us to reduce the dimensionality of the dataset and increase interpretability while minimizing information loss.

In [20]:
pca = PCA().fit(scaled_df)

plt.rcParams["figure.figsize"] = (12,6)

fig, ax = plt.subplots()
y = np.cumsum(pca.explained_variance_ratio_)
xi = np.arange(1, len(y) + 1)  # 1-based component index

plt.ylim(0.0,1.1)
plt.plot(xi, y, marker='o', linestyle='--', color='b')

plt.xlabel('Number of Components')
plt.xticks(np.arange(0, len(y) + 1, step=1))  # tick labels match the 1-based component index
plt.ylabel('Cumulative explained variance (fraction)')
plt.title('The number of components needed to explain variance')

plt.axhline(y=0.95, color='r', linestyle='-')
plt.text(0.5, 0.85, '95% cut-off threshold', color = 'red', fontsize=16)

ax.grid(axis='x')
plt.show()

_Source: https://www.mikulskibartosz.name/pca-how-to-choose-the-number-of-components/_

The above chart shows that at least 8 components are needed to explain 95% of the variance, so **we will select 8 principal components** to carry out the dimensionality reduction.



In [21]:
# Initiating PCA to reduce dimensions to 8
pca = PCA(n_components = 8)
pca.fit(scaled_df)
PCA_df = pd.DataFrame(pca.transform(scaled_df))
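Instead of reading the number of components off the chart, scikit-learn can also pick it from a variance target: when `n_components` is a float between 0 and 1, `PCA` keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below shows this on synthetic data (three latent factors mixed into ten features), not on the body performance dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 observed features driven by 3 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

# A float n_components asks for enough components to pass that variance fraction
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X)

cumulative = np.cumsum(pca_95.explained_variance_ratio_)
print(pca_95.n_components_, cumulative[-1])
```

The selected count is available afterwards as `n_components_`, which avoids hard-coding a value if the features change.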

<a id="6"></a>
## CLUSTERING

After reducing the attributes to eight dimensions, I will perform clustering using Agglomerative clustering. This type of clustering is a hierarchical clustering method that involves merging examples until the desired number of clusters is achieved.

In [22]:
Elbow_M = KElbowVisualizer(KMeans(), k=10)
Elbow_M.fit(PCA_df)
Elbow_M.show()

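The above chart indicates that the optimal number of clusters for this dataset is five. Note that the elbow is computed with K-Means; I reuse the resulting number of clusters for the agglomerative model. Let's fit the agglomerative clustering model to get the final clusters.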

In [23]:
# Agglomerative Clustering model 
AC = AgglomerativeClustering(n_clusters=5)
# fit model and predict clusters
yhat_AC = AC.fit_predict(PCA_df)
PCA_df['clusters'] = yhat_AC
# Adding the clusters feature to the original dataframe.
df['clusters'] = yhat_AC

After the model is fitted and the cluster labels are added to the original dataset, let's take a look at the result:

In [24]:
df.head()

<a id="7"></a>
## EVALUATING MODELS

Since this is an unsupervised clustering, we don't have a tagged feature to evaluate or score our model. The purpose of this section is to study the patterns in the clusters formed and determine the nature of the clusters' patterns.
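Even without labels, an internal metric can give a rough sense of cluster quality. As one possible complement (not part of the original analysis), the sketch below computes the silhouette score for several cluster counts on synthetic, well-separated blobs. The data and the range of `k` are assumptions for illustration.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three clearly separated groups
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Silhouette ranges from -1 to 1; higher means tighter, better separated clusters
scores = {}
for k in range(2, 7):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real data the differences between values of `k` are usually much smaller, so the score is best read alongside the visual profiling that follows.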

In [25]:
# Plot countplot of clusters
ax = sns.countplot(data=df, x='clusters')
ax.set_title('Distribution Of The Clusters')
plt.show()

As we can see in the above chart, the clusters are not equally distributed. Let's try to figure out what each one can mean.

The first thing I'm going to do is check whether height and weight have something to do with the score. I will split the plot by class and color the points by cluster.

In [26]:
# Plot scatterplots of height and weight divided by class and cluster type
ax = sns.relplot(data=df, x='height_cm', y='weight_kg', hue='clusters', col='class')
plt.show()

As we can see in the charts above, there is a clear relationship between height, weight and score. **On average, the individuals who weighed more scored worse in the tests.** We can also see that the majority of cluster 4 observations are in class D. Let's check the other clusters.

In [27]:
# Plot distribution of clusters by score
ax = sns.countplot(data=df, x='class', hue='clusters')
ax.set_title('Distribution of clusters per scoring')
plt.show()

The chart above doesn't tell us much that we didn't already know. Cluster 4 is associated with the worst score. However, the other clusters are distributed fairly evenly across the other three classes, so they must capture something else.

<a id="8"></a>
## PROFILING

In this section, I will try to deduce which individuals are in which clusters. To decide that, I will be plotting some of the features present in the dataset.

Let's first analyze the relation between the weight and the score on different tests.

In [28]:
tests = ['gripForce', 'sit and bend forward_cm', 'sit-ups counts', 'broad jump_cm']

# Weight comparison plot
for i in tests:
    plt.figure()
    sns.scatterplot(data=df, x=i, y='weight_kg', hue='clusters')
    plt.show()

Looking at the charts, we can say that cluster 4 individuals are, on average, heavier than the rest. It is also worth noticing that **the heavier the individual, the better the gripForce score**. In the sit-ups and broad jump tests, there also seems to be a positive correlation between weight and results, but it is weaker than the previous one.

Let's now see how the height compares to the score.

In [29]:
# Height comparison plot
for i in tests:
    plt.figure()
    sns.scatterplot(data=df, x=i, y='height_cm', hue='clusters')
    plt.show()

With this plot we can get some interesting insights! 

1. It seems that taller people perform better in broad jump and grip force tests. However, in the other two, height doesn't seem to provide any advantage at all. 
2. **Individuals from cluster 0 are taller than the average while individuals from cluster 3 are shorter**.

Now, let's compare the results with the age.

In [30]:
# Age comparison plot
for i in tests:
    plt.figure()
    sns.scatterplot(data=df, x=i, y='age', hue='clusters')
    plt.show()

From the above charts, we can also get two important insights:

1. The younger the individual, the better the score.
2. **Individuals from clusters 2 and 3 are, on average, older than the rest of the dataset**.

Let's compare it now with the lean BMI.

In [31]:
# Lean_BMI comparison plot
for i in tests:
    plt.figure()
    sns.scatterplot(data=df, x=i, y='lean_BMI', hue='clusters')
    plt.show()

From the charts above, we can also get two important insights:

1. Individuals with higher lean BMI performed better in strength tests (grip force and broad jump).
2. **Individuals from cluster 0 have the highest lean BMI while individuals from cluster 1 have the lowest**.

Before trying to get more detailed information to profile each cluster, let's get the gender distribution for each one.

In [32]:
# Plot distribution of gender within the clusters
ax = sns.countplot(data=df, x='clusters', hue='gender')
ax.set_title('Cluster gender distribution')
plt.show()

Now, let's group all observations by cluster type and get the mean of all the features to get more details regarding each cluster.

In [33]:
df.groupby('clusters').mean(numeric_only=True)
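Since gender is a text column, it is left out of the numeric means. A possible way to profile it alongside them is a normalized cross-tabulation. Here is a minimal sketch on a made-up dataframe (all values are hypothetical):

```python
import pandas as pd

# Hypothetical cluster assignments, only for illustration
toy = pd.DataFrame(
    {
        "clusters": [0, 0, 1, 1, 1],
        "gender": ["M", "M", "F", "F", "M"],
        "age": [25, 27, 30, 34, 32],
    }
)

# Mean of a numeric feature per cluster
mean_age = toy.groupby("clusters")["age"].mean()

# Share of each gender within every cluster (each row sums to 1)
gender_share = pd.crosstab(toy["clusters"], toy["gender"], normalize="index")

print(mean_age.to_dict())
print(gender_share.round(2))
```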

Now we can draw some conclusions to determine which type of people form each cluster:

* **Cluster 0**: young tall males with high lean BMI.
* **Cluster 1**: young small females with low lean BMI.
* **Cluster 2**: old tall males with high lean BMI.
* **Cluster 3**: old small females with low lean BMI.
* **Cluster 4**: mostly overweight males.

<a id="9"></a>
## LINEAR REGRESSION

At this point of the analysis, I thought it would be good to know which features determine the test results. For example, is a heavier person going to perform better in a strength test? To answer these kinds of questions, I'm going to fit a linear regression model.

In [34]:
# Define independent variables (predictors)
X = df.drop(['gripForce', 'sit and bend forward_cm', 'sit-ups counts', 'broad jump_cm', 'class', 'clusters'], axis=1)

In [35]:
# Label encode variables
le = LabelEncoder()
X['gender'] = le.fit_transform(X['gender'])

In [36]:
# Create empty dictionary for coefs
coefs = {}

# Initialize linear regression
lr = LinearRegression()

# Iterate through all tests 
for i in tests:
    X_train, X_test, y_train, y_test = train_test_split(X, df[i], test_size=0.3, random_state=42)
    lr.fit(X_train, y_train)
    coefs[i] = lr.coef_
    print('The score for model', i, 'is: ', lr.score(X_test, y_test))

In [37]:
# Plot coefficients
fig, ax = plt.subplots(figsize=(12,8))

names = X.columns

for i in tests:
    ax.plot(range(len(names)), np.reshape(coefs[i], (8)), label=i)

ax.set_xticks(range(len(names)), names)

plt.legend()
plt.show()

As we can see in the above graph, **the most influential parameter in all tests is gender**. It looks like males perform better than females in strength tests, while females do better in flexibility ones. The second most important parameter is the **lean BMI**, which makes sense: the higher the lean BMI, the lower the body fat and the higher the muscle mass, so the better an individual can perform in physical tests. **Height and weight** are somewhat important and can make a difference depending on the test. For example, a heavier individual is probably going to perform worse in the broad jump and better in the grip force test than a lighter one.
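When reading coefficients like these, it helps to remember that their size depends on the units of each feature. One common way to put them on a comparable footing is to standardize the predictors before fitting. The sketch below illustrates the effect on synthetic data with made-up effect sizes; it is not a re-analysis of the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 1000

# Synthetic predictors on very different scales
gender = rng.integers(0, 2, size=n).astype(float)  # 0/1, spread about 0.5
height_cm = rng.normal(170, 10, size=n)            # spread about 10

# Made-up relationship: 5 units per gender step, 1 unit per cm
y = 5.0 * gender + 1.0 * height_cm + rng.normal(0, 1, size=n)
X = np.column_stack([gender, height_cm])

# Raw coefficients are expressed per original unit of each feature
raw_coefs = LinearRegression().fit(X, y).coef_

# Standardized coefficients are expressed per standard deviation of each feature
X_scaled = StandardScaler().fit_transform(X)
std_coefs = LinearRegression().fit(X_scaled, y).coef_

print(raw_coefs.round(2), std_coefs.round(2))
```

In this toy setup the raw coefficient of gender is the larger one, while after standardization height carries the larger coefficient, because one standard deviation of height covers many centimeters.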

At this point, we know that gender plays a huge role in explaining the test results. Let's remove this effect and try again.

In [38]:
# Divide dataframes by gender
df_male = df[df['gender'] == 'M']
df_female = df[df['gender'] == 'F']

In [39]:
# Define predictor variables for males (gender column dropped)
X_male = df_male.drop(['gripForce', 'sit and bend forward_cm', 'sit-ups counts', 'broad jump_cm', 'class', 'clusters', 'gender'], axis=1)
# Define predictor variables for females (gender column dropped)
X_female = df_female.drop(['gripForce', 'sit and bend forward_cm', 'sit-ups counts', 'broad jump_cm', 'class', 'clusters', 'gender'], axis=1)

Once the predictor variables are defined, let's build the models:

In [40]:
# Initialize LinearRegression
lr_males = LinearRegression()
# Create dictionary to append coefficients
coefs_males = {}
# Iterate through all tests 
for i in tests:
    X_train, X_test, y_train, y_test = train_test_split(X_male, df_male[i], test_size=0.3, random_state=42)
    lr_males.fit(X_train, y_train)
    coefs_males[i] = lr_males.coef_
    print('The score for model', i, 'on males is: ', lr_males.score(X_test, y_test))

In [41]:
# Initialize LinearRegression
lr_females = LinearRegression()
# Create dictionary to append coefficients
coefs_females = {}
# Iterate through all tests 
for i in tests:
    X_train, X_test, y_train, y_test = train_test_split(X_female, df_female[i], test_size=0.3, random_state=42)
    lr_females.fit(X_train, y_train)
    coefs_females[i] = lr_females.coef_
    print('The score for model', i, 'on females is: ', lr_females.score(X_test, y_test))

As we can see, the scores have dropped a lot. This is because we have removed the most important variable from the model: gender. However, we just want to know how all the other features behave when gender is not taken into account. Let's plot all the coefficients to check out the results!

In [42]:
# Plot coefficients
fig, axs = plt.subplots(1, 2, figsize=(24,8), sharey=True)

names = X_male.columns

for i in tests:
    axs[0].plot(range(len(names)), np.reshape(coefs_males[i], (7)), label=i)
    axs[1].plot(range(len(names)), np.reshape(coefs_females[i], (7)), label=i)


axs[0].set_xticks(range(len(names)), names)
axs[1].set_xticks(range(len(names)), names)

axs[0].set_title('Males')
axs[1].set_title('Females')

handles, labels = axs[0].get_legend_handles_labels()
fig.legend(handles, labels, loc='upper center')

plt.show()

Both plots show similar results. **Height** influences all results positively, while **weight** influences them negatively. This means that the taller an individual is, the better they are going to perform and, the heavier, the worse. **Lean BMI** seems to be the most influential variable of all, which makes sense. This variable measures the amount of muscle an individual has relative to their height. The higher the lean BMI, the more athletic the individual is likely to be and, therefore, the better they are going to perform in physical tests.

<a id="10"></a>
## CONCLUSIONS

The unsupervised clustering provided a good segmentation of the data. Although this technique wasn't strictly needed to carry out the analysis, it revealed five different groups with common characteristics.

From the data analyzed, we can conclude that body performance decreases with age and that males and females have different strengths. In this case, females were more flexible while males did better in strength tests.

Also, higher body fat is associated with worse body performance and higher blood pressure.