# Principal Component Analysis

In [None]:
# import packages
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# load in data
df = pd.read_csv("unsupervised_ObesityDataSet_raw_and_data_sinthetic.csv")

In [None]:
# preprocessing
# Prepare features
X = get_features(df)

# Feature types
num_features, cat_features = get_feature_types()

# Preprocessing
preprocessor = create_preprocessor(num_features, cat_features)
X_processed = preprocessor.fit_transform(X)

Next, we fit the PCA model. Standardization happens in the preprocessing step above; PCA itself uses linear algebra to find uncorrelated directions of greatest variance, so that correlated variables can be consolidated into a smaller set of components.

In [None]:
# Train PCA
pca, X_pca = train_pca(X_processed)
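
`train_pca` is also defined outside this notebook. A minimal sketch, assuming it simply wraps scikit-learn's `PCA`, is shown below on synthetic data. The name `train_pca_sketch` and the demo arrays are assumptions for illustration. The two printed checks confirm the properties the text relies on: the component directions are orthonormal, and the explained variance ratios are sorted from largest to smallest.

```python
import numpy as np
from sklearn.decomposition import PCA


def train_pca_sketch(X, n_components=None):
    """Hypothetical stand-in for train_pca: fit PCA, return the model and the scores."""
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X)
    return pca, X_pca


rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
# Six correlated columns built from two underlying signals plus a little noise
X_demo = base @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))

pca_demo, X_demo_pca = train_pca_sketch(X_demo)

# The component directions are orthonormal
print(np.allclose(pca_demo.components_ @ pca_demo.components_.T, np.eye(6)))
# The variance ratios are sorted from largest to smallest
print(bool(np.all(np.diff(pca_demo.explained_variance_ratio_) <= 0)))
```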

## Visualizations

This first graph helps us decide how many principal components we could reasonably keep in a simplified version of the data. It shows that 80% of the variance is explained by the first 5 components, which indicates substantial overlap among the original variables and room to consolidate them into fewer components. Given these results we can reduce the number of components, and the next steps look at just the first two.

In [None]:
plot_explained_variance(pca)
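
The same reading can be taken numerically from the fitted model. The sketch below, using a hypothetical helper `components_for_variance` and made-up demo data, returns the smallest number of leading components whose cumulative explained variance ratio reaches a threshold such as 80%.

```python
import numpy as np
from sklearn.decomposition import PCA


def components_for_variance(pca, threshold=0.80):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # Index of the first cumulative value >= threshold, converted to a count
    n = int(np.searchsorted(cumulative, threshold)) + 1
    return min(n, len(cumulative))


rng = np.random.default_rng(0)
# Three independent columns with standard deviations 3, 2 and 1 (made-up demo data)
X_demo = rng.normal(size=(2000, 3)) * np.array([3.0, 2.0, 1.0])
pca_demo = PCA().fit(X_demo)

print(np.cumsum(pca_demo.explained_variance_ratio_).round(3))
print(components_for_variance(pca_demo, threshold=0.80))
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components`, which performs a similar selection at fit time.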

Next, we need to determine which variables drive each of the two components. The feature loadings table below shows how strongly each variable contributes to the first two principal components (PC1 and PC2).

PC1, the component that explains the most variance, seems to measure lifestyle and body mass, and is most influenced by weight, height, frequency of physical activity, and dietary behaviors.

PC2 seems to be centered around consumption habits, and is most influenced by hydration habits, transportation, and time spent sitting.

In [None]:
# PCA feature loadings
feature_names = (
    num_features +
    list(preprocessor.named_transformers_["cat"].get_feature_names_out(cat_features))
)

loadings = get_pca_loadings(pca, feature_names)
print(loadings.sort_values("PC1", key=abs, ascending=False).head(10))
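
`get_pca_loadings` is not defined in this notebook. One plausible sketch, assuming "loadings" here means the raw component weights stored in `pca.components_`, builds a table with one row per feature and one column per component. Some texts instead scale these weights by the square root of each component's variance, so treat this as one convention among several. The function name and demo data are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def get_pca_loadings_sketch(pca, feature_names):
    """Hypothetical stand-in for get_pca_loadings: rows are features, columns are PCs."""
    columns = [f"PC{i + 1}" for i in range(pca.components_.shape[0])]
    # components_ has shape (n_components, n_features), so transpose it
    return pd.DataFrame(pca.components_.T, index=feature_names, columns=columns)


rng = np.random.default_rng(0)
# Feature "a" has by far the largest variance, so it should dominate PC1
X_demo = rng.normal(size=(500, 3)) * np.array([5.0, 1.0, 0.5])
demo_names = ["a", "b", "c"]
pca_demo = PCA(n_components=2).fit(X_demo)

loadings_demo = get_pca_loadings_sketch(pca_demo, demo_names)
print(loadings_demo.sort_values("PC1", key=abs, ascending=False))
```

Sorting by absolute value matters because the sign of a loading only indicates direction; a large negative weight contributes as much to a component as a large positive one.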

We plot the data on these two principal components in a two-dimensional graph. The data is spread further along the x-axis (PC1) than along the y-axis (PC2), which reflects that PC1 captures more of the variance in the data than PC2.

In [None]:
plot_pca_2d(X_pca)

Finally, we color the same plot by obesity category to see how well the simplified representation aligns with the labels. The different obesity categories are loosely clustered along the x-axis, that is, according to PC1. This suggests that the model simplified the variables into components that still retain much of the structure of the data.

In [None]:
plot_pca_by_label(X_pca, df["NObeyesdad"])
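
A numeric companion to the plot is to summarize the PC1 scores per category. The sketch below uses a hypothetical helper `pc_summary_by_label` and made-up scores and labels; applied to the notebook's data, it would be called with `X_pca` and `df["NObeyesdad"]`, assuming `X_pca` is an array with one column per component. Well-separated group means with small spreads would support the visual impression of clustering along PC1.

```python
import numpy as np
import pandas as pd


def pc_summary_by_label(X_pca, labels, component=0):
    """Mean and standard deviation of one component's scores for each label."""
    scores = pd.DataFrame({
        "score": np.asarray(X_pca)[:, component],
        "label": np.asarray(labels),
    })
    return scores.groupby("label")["score"].agg(["mean", "std"]).sort_values("mean")


# Made-up scores and labels, for illustration only
X_toy = np.array([
    [-2.0, 0.0], [-1.8, 0.1],
    [0.0, 0.0], [0.2, -0.1],
    [2.0, 0.0], [2.1, 0.3],
])
labels_toy = ["A", "A", "B", "B", "C", "C"]

summary_toy = pc_summary_by_label(X_toy, labels_toy)
print(summary_toy)
```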