In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

**Unsupervised Learning and PCA**

In machine learning, as we've learned previously, unsupervised learning is a category of algorithms used to discover patterns and relationships within data without explicitly labeled outcomes (targets). Unlike supervised learning, where the algorithm learns from labeled data to predict outcomes, unsupervised learning deals with raw, unlabeled data and aims to extract meaningful insights or representations from it.

Previously, we learned about KMeans, which is a form of unsupervised learning that's used for finding clusters within your data.

Principal Component Analysis (PCA) is another unsupervised learning technique, but its application is *dimensionality reduction*. We talked a lot earlier about the importance of not having too many features in our data, since that can lead to the curse of dimensionality. In our previous labs, we actually did a lot of manual dimensionality reduction by iteratively removing features and re-training our models, trying to keep as few features as possible.

PCA can automatically reduce dimensions in our data. It identifies the directions of maximum variance in high-dimensional data and projects it onto a lower-dimensional subspace while preserving the essential structure of the data. By reducing the number of features or dimensions, PCA can help in simplifying the data and improving computational efficiency while retaining most of the information.
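To make "directions of maximum variance" concrete, here is a minimal sketch of what PCA does under the hood, assuming the textbook eigen-decomposition route (scikit-learn itself uses an SVD internally, so this is an illustration rather than its actual implementation). The data and the names `X_demo`, `latent` and `mixing` are made up for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic, correlated 3-feature data (illustrative only, not the Iris data)
latent = rng.normal(size=(200, 2))
mixing = np.array([[2.0, 0.5, 1.0],
                   [0.0, 1.0, -0.5]])
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 3))

# 1) Center each feature around zero
X_centered = X_demo - X_demo.mean(axis=0)

# 2) Eigen-decompose the covariance matrix of the features
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigenvalues come back in ascending order

# 3) Keep the 2 eigenvectors with the largest eigenvalues (most variance)
order = np.argsort(eigenvalues)[::-1]
top_directions = eigenvectors[:, order[:2]]  # shape (3, 2)

# 4) Project the centered data onto those directions
X_projected = X_centered @ top_directions

# Compare with scikit-learn; each component is only defined up to its sign
sk_pca = PCA(n_components=2).fit(X_demo)
X_sklearn = sk_pca.transform(X_demo)
same_up_to_sign = np.allclose(np.abs(X_projected), np.abs(X_sklearn), atol=1e-6)
print("Matches sklearn up to sign:", same_up_to_sign)
```

The eigenvalues are the variances along each principal direction, which is why sorting by them gives the "most important" directions first.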


**Objective**

Principal Component Analysis (PCA) is a dimensionality reduction technique commonly used in machine learning and data analysis. Its primary goal is to simplify the complexity of high-dimensional data by transforming it into a lower-dimensional space while retaining most of the relevant information. PCA achieves this by identifying the directions (or principal components) that capture the maximum variance in the data.

In this context, we're once again working with the Iris dataset, a popular dataset in machine learning. It consists of measurements of various features of iris flowers, such as sepal length, sepal width, petal length, and petal width. Our goal is to visualize the data and apply PCA to reduce its dimensionality while preserving its underlying structure.

*Note*: PCA is usually used on datasets with many dimensions (features) in order to reduce the number of features to something much lower, but we choose to apply it to the Iris dataset here so that we can clearly see what's going on.

---

In [None]:
import numpy as np
import pandas as pd  # needed for pd.read_csv below
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

iris_df = pd.read_csv('../data/IRIS.csv')

iris_df.head()



In [None]:
# Note: the Iris dataset has 4 features and 1 target, but for simplicity and easier visualization we drop one of the features (sepal_width) right away

iris_df = iris_df[['sepal_length', 'petal_length', 'petal_width', 'species']]

iris_df.head()

Split the features from the targets

In [None]:
X, y = iris_df[['sepal_length', 'petal_length', 'petal_width']], iris_df['species']

Plot the 3 features, and the target as the color

In [None]:
import plotly.express as px

# 3D scatter plot of the three features, colored by species
fig = px.scatter_3d(iris_df, x='sepal_length', y='petal_length', z='petal_width', color='species')
fig.show()

Now we perform PCA on the **features** to find the 2 principal components (the most important directions).

In [None]:
# Perform PCA on the features
# Note: we DON'T use the targets here

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

iris_reduced_df = pd.DataFrame(X_pca)

iris_reduced_df

In [None]:
# Let's also add the targets from earlier to our newly reduced data, in order to be able to plot
iris_reduced_df['species'] = y

iris_reduced_df

As you can see, the dimensionality of our features has decreased from 3 to 2.

**But**, which features are they?!

In [None]:
# Plot the results with colors representing different classes
fig = px.scatter(iris_reduced_df, x=0, y=1, color='species', title='PCA of Iris Dataset (2D)')
fig.update_xaxes(title_text='Principal Component 1')
fig.update_yaxes(title_text='Principal Component 2')
fig.show()

As you can see, we've reduced the feature dimensions from 3 to 2. PCA has mathematically identified the 2 axes that account for the most variance in our original data, thus preserving as much information as possible.

However, we have lost *some* information!

**Importantly**, the 2 principal component axes we found are none of our original sepal_length, petal_length or petal_width! Rather, each one is a linear combination of all of them, mathematically chosen so that it keeps as much of the inherent structure of the data as possible!

A downside of PCA is thus that it makes our newly found features *less* interpretable, since we can't immediately say what they actually mean.
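One way to see what each principal component is made of is to inspect the fitted `components_` attribute, where each row holds the weights of the original features. The sketch below is illustrative and assumes scikit-learn's bundled copy of Iris (via `load_iris`) rather than the `IRIS.csv` file used above, so the column names differ; `X_iris`, `pca_demo` and `loadings` are names made up for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Assumption: sklearn's bundled Iris data stands in for ../data/IRIS.csv
iris = load_iris(as_frame=True)
cols = ["sepal length (cm)", "petal length (cm)", "petal width (cm)"]
X_iris = iris.data[cols]

pca_demo = PCA(n_components=2).fit(X_iris)

# Each row is one principal component, each column the weight of an original feature
loadings = pd.DataFrame(pca_demo.components_, columns=cols, index=["PC1", "PC2"])
print(loadings.round(3))
```

Every component mixes all three original features, which is exactly why the new axes are hard to name.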

---

We are free to choose the number of principal components we'd like to use. Let's repeat the experiment above, but this time only keep the single most important dimension in the data.

In [None]:
X

In [None]:
# Reduce the feature space to a single dimension

pca = PCA(n_components=1)
X_pca = pca.fit_transform(X)

iris_further_reduced_df = pd.DataFrame(X_pca)

iris_further_reduced_df

In [None]:
iris_further_reduced_df['species'] = y

iris_further_reduced_df

In [None]:
plt.figure(figsize=(8, 6))
for species in iris_further_reduced_df['species'].unique():
    # PC1 values for this species, drawn along a horizontal line at y=0
    pc1_values = iris_further_reduced_df[iris_further_reduced_df['species'] == species][0]
    plt.scatter(pc1_values, [0] * len(pc1_values), label=species)

plt.xlabel('PC1')
plt.title('PCA of Iris Dataset (1D)')
plt.legend()

plt.show()

We have now reduced our feature space to the single dimension that captures the most variance in the data. We have indeed lost information along the way, though - which is natural, just as when you manually drop features during feature engineering.

However, as mentioned many times before, it is sometimes important to reduce the dimensionality of your feature space, especially if you have *a lot* of features.
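How much information is lost can be quantified with the fitted `explained_variance_ratio_` attribute, which gives the share of the total variance captured by each component. This is a small sketch, again assuming scikit-learn's bundled Iris data instead of the CSV file used in this notebook; `pca_full`, `ratios` and `cumulative` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Assumption: sklearn's bundled Iris data stands in for ../data/IRIS.csv
iris = load_iris(as_frame=True)
X_iris = iris.data[["sepal length (cm)", "petal length (cm)", "petal width (cm)"]]

# With no n_components given, PCA keeps all 3 components
pca_full = PCA().fit(X_iris)

ratios = pca_full.explained_variance_ratio_  # share of variance per component
cumulative = ratios.cumsum()                 # share kept when using the first k components

for i, (ratio, total) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {ratio:.1%} of the variance, {total:.1%} cumulative")
```

The cumulative column shows what you keep when truncating to the first k components; whatever is missing up to 100% is the variance that was dropped.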

---

**In conclusion**, PCA is a great way to reduce the dimensions of your data. However, if your stated goal is also to be able to *interpret* the inner workings of your data, then PCA is a poor choice: the dimensions it identifies are combinations of all your original features, which makes it very hard to understand what a model trained on these new dimensions actually bases its predictions on.

**As a second point**, PCA is another one of those algorithms that are very sensitive to the scales of your features, because features with large numeric ranges dominate the variance. Therefore, make sure to scale your features before you apply PCA if they differ a lot in scale.
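The effect of scale can be illustrated with a small sketch on synthetic data, where one feature has a far larger numeric range than the other. Chaining `StandardScaler` and `PCA` in a pipeline is one common way to handle this; the data and variable names below are made up for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two independent synthetic features on very different scales
X_scale_demo = np.column_stack([
    rng.normal(scale=1.0, size=300),
    rng.normal(scale=1000.0, size=300),
])

# PCA on the raw data: the large-scale feature dominates the first component
raw_pca = PCA(n_components=1).fit(X_scale_demo)

# PCA after standardization: both features get a comparable say
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=1)).fit(X_scale_demo)

raw_direction = raw_pca.components_[0]
raw_ratio = raw_pca.explained_variance_ratio_[0]
scaled_direction = scaled_pca.named_steps["pca"].components_[0]
scaled_ratio = scaled_pca.named_steps["pca"].explained_variance_ratio_[0]

print("PC1 weights without scaling:", raw_direction.round(3), "variance share:", round(raw_ratio, 3))
print("PC1 weights with scaling:   ", scaled_direction.round(3), "variance share:", round(scaled_ratio, 3))
```

Without scaling, the first component is essentially just the large-scale feature, even though neither feature is more informative than the other.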

---

## Challenges

Let's use PCA in practice for a slightly larger dataset. We'll use the student performance dataset we worked with last week.

*Note*: We won't focus on EDA or any rigorous evaluation here. Rather, the task is just to showcase how PCA can be applied.

**Task 1**

Import the dataset, drop G1 & G2 and one-hot-encode categorical variables

In [None]:
student_df = pd.read_csv("../lab/student-por.csv", delimiter=";")

student_df.drop(columns=["G2", "G1"], inplace=True)

X, y = student_df.drop(columns=['G3']), student_df['G3']

X.head()

As mentioned, we're just going to do a very quick and dirty preparation of our data (super sloppy, not recommended)

In [None]:
# Get only the string/categorical columns
categorical_columns = X.select_dtypes(include=['object']).columns

# One-hot encode the categorical columns
X = pd.get_dummies(X, columns=categorical_columns, dtype=int)

X.head()

In [None]:
X.info()

All features are also approximately on the same scale (except maybe age, but the difference isn't that significant), so we'll omit feature scaling here as well

**Task 2**

GridSearch a RandomForest to find a good performing set of hyperparameters on this dataset

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, mean_absolute_error

In [None]:
param_grid = {
                "n_estimators": [10, 20, 50, 100], 
                "max_depth": [10, 20, 30],  
                "min_samples_split": [2, 5, 10],
                "min_samples_leaf": [1, 2, 4]
             }

rf_regressor = RandomForestRegressor(random_state=42)

# MAE is an error (lower is better), so greater_is_better=False makes the scorer return the negative MAE
score = make_scorer(mean_absolute_error, greater_is_better=False)

grid_search = GridSearchCV(
    estimator=rf_regressor, 
    param_grid=param_grid,
    cv=3, 
    n_jobs=-1, 
    verbose=2,  
    scoring=score,
)

# Run the grid search over all possible combinations of the hyperparameters
grid_search.fit(X, y)

In [None]:
grid_search.best_score_ * -1

**Task 3**

Ok, now let's try to reduce our feature space dimension by applying PCA

Assume we want to reduce to, say 50 dimensions

In [None]:
X_reduced = 

Note that we now only have 50 dimensions in our feature space. 

That's great, but at the same time we've lost all common sense of what these columns actually represent!

All we know is that they represent the 50 directions in which we have the most variance, in our feature space.

In [None]:
# Let's do the same grid search as above, but using these 50 principal components as our new features

grid_search = 

First, notice that training takes significantly longer here because the trees have many more candidate questions (splits) to evaluate! Why is this?

Look at the features again: they now have continuous values instead of discrete ones, so each feature offers far more possible split thresholds (up to one between every pair of neighbouring values, instead of a single split for a 0/1 column).

Keep this increase in training time in mind, especially when training trees. Other families of models are typically less affected by it.
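The jump in the number of possible split points can be checked directly by counting distinct values per column before and after PCA. The sketch below uses synthetic 0/1 columns as a stand-in for the one-hot-encoded student data; the array names and sizes are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic binary features, standing in for one-hot-encoded columns
X_binary = rng.integers(0, 2, size=(500, 20)).astype(float)

# Project onto 10 principal components
X_binary_pca = PCA(n_components=10).fit_transform(X_binary)

# Number of distinct values per column, before and after PCA
unique_before = [np.unique(column).size for column in X_binary.T]
unique_after = [np.unique(column).size for column in X_binary_pca.T]

print("Most distinct values in any original column:", max(unique_before))
print("Fewest distinct values in any PCA column:   ", min(unique_after))
```

A column with k distinct values offers at most k - 1 thresholds to try, so going from a handful of values to hundreds per column multiplies the work of every split search.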

In [None]:
grid_search.best_score_ * -1

We got results comparable to those above.

**Task 5**

In [None]:
grid_best_score = []

Now write a for-loop in which you, on each iteration, reduce the number of principal components (the dimensionality of the feature space) to

50, 45, 40, 35, 30, 25, 15, 5, 2

Do a grid search for each of those cases.

For each iteration of the loop, append the best score to grid_best_score list defined above  
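The general shape of such a loop might look like the sketch below. It is deliberately not the solution to the task: it runs on synthetic data from `make_regression`, with a much smaller hyperparameter grid and fewer component counts so it finishes quickly. All names (`X_toy`, `small_grid`, `toy_best_scores`) are made up for this illustration, and the built-in scoring string `"neg_mean_absolute_error"` is used in place of the custom scorer defined earlier.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (illustrative stand-in for the student dataset)
X_toy, y_toy = make_regression(
    n_samples=150, n_features=20, n_informative=5, noise=10.0, random_state=0
)

small_grid = {"max_depth": [3, 6], "min_samples_leaf": [1, 4]}
component_counts = [15, 10, 5, 2]
toy_best_scores = []

for n_components in component_counts:
    # Reduce the feature space to the current number of principal components
    X_toy_reduced = PCA(n_components=n_components).fit_transform(X_toy)

    search = GridSearchCV(
        estimator=RandomForestRegressor(n_estimators=20, random_state=42),
        param_grid=small_grid,
        cv=3,
        scoring="neg_mean_absolute_error",
    )
    search.fit(X_toy_reduced, y_toy)

    # best_score_ is the negative MAE, so flip the sign before storing it
    toy_best_scores.append(-search.best_score_)

for n_components, mae in zip(component_counts, toy_best_scores):
    print(f"{n_components:>2} components: best CV MAE = {mae:.2f}")
```

Note that fitting PCA on the full data before cross-validation, as done here for brevity, lets the validation folds influence the projection; putting PCA and the model in a `Pipeline` inside `GridSearchCV` would avoid that.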

**Task 6**

Plot the list grid_best_score (y-axis) against the number of principal components (x-axis) so that you can compare the results of the grid searches above.

Does the resulting plot look similar to what you found during the previous week's lab?