# Visualizing High-Dimensional Data with Python

Instructor: [Jeroen Janssens](https://jeroenjanssens.com)

## PCA Exercise

### Exercise details

In this exercise, you'll apply PCA to the Luv dataset, which contains 657 colors in Luv color space. The code below is missing some pieces, denoted by three dots (`...`). Your job is to fill in the correct values.

In [None]:
# Load the Luv dataset from the plotnine package
from plotnine.data import luv_colours
luv_colours

In [None]:
# Remove all columns that shouldn't be used as features
luv = luv_colours.drop(..., axis=1)

In [None]:
# Import the PCA class from scikit-learn
from sklearn... import ...

In [None]:
# Apply PCA
pca = PCA(n_components=2)
luv_mapped = pca.fit_transform(luv)

luv_mapped

In [None]:
# Turn the result into a DataFrame and plot it.
# Remember: you're free to use any plotting library you like.
from helpers import *
df = ...(...)

# Plot result
...(df)

### Bonus challenge 1: Use the actual colors in the scatter plot

#### Install the package colormath

In [None]:
! pip install colormath

#### Convert Luv colors to RGB

In [None]:
# This piece of code converts Luv to RGB in hex format
from colormath.color_objects import LuvColor, sRGBColor
from colormath.color_conversions import convert_color

def luv2hex(row):
    rgb = convert_color(LuvColor(*(row/100)), sRGBColor)
    rgb.rgb_r = rgb.clamped_rgb_r
    rgb.rgb_g = rgb.clamped_rgb_g
    rgb.rgb_b = rgb.clamped_rgb_b
    return rgb.get_rgb_hex()

# Get the hex color for all rows
df["..."] = luv_colours.drop("col", axis=1).apply(luv2hex, axis=1)
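
The `clamped_rgb_*` attributes are needed because some Luv colors fall outside the sRGB gamut, so a converted channel can end up below 0 or above 1. As a rough illustration of the idea, here is a plain-Python sketch that clamps each channel before formatting it as hex. The helpers `clamp01` and `rgb_to_hex` are hypothetical names made up for this example; they are not part of colormath and may differ from it in rounding details.

```python
def clamp01(value):
    """Limit a channel value to the displayable range [0, 1]."""
    return min(1.0, max(0.0, value))


def rgb_to_hex(r, g, b):
    """Format three channels as '#rrggbb' (hypothetical helper, not colormath)."""
    # Clamp first, then scale each channel to an integer in 0..255
    channels = [round(clamp01(c) * 255) for c in (r, g, b)]
    return "#{:02x}{:02x}{:02x}".format(*channels)


# A color that is already inside the gamut is left unchanged by clamping
in_gamut = rgb_to_hex(1.0, 0.2, 0.0)

# Out-of-range channels are pulled back to the nearest valid value
out_of_gamut = rgb_to_hex(1.2, -0.1, 0.0)

print(in_gamut, out_of_gamut)
```

Without clamping, an out-of-range channel would produce a hex string that no plotting library can interpret as a color.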

#### Plot the result

In [None]:
# We cannot use the helper function plot here because we need to assign a unique
# color to each point
ggplot(df, aes(x="...", y="...", color="...")) +\
geom_...() +\
scale_color_identity() +\
theme_void()

### Bonus challenge 2: Implement a biplot

Draw the axes of the original features as lines, with a label, similar to:

![](https://blog.bioturing.com/wp-content/uploads/2018/11/PCA-bi-plot.png)

Hint: Think of the starts and ends of each line as data points and use the fitted PCA object to transform them to the lower-dimensional space.
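
The hint works because a fitted PCA object can project any point that has the right number of features, not only the rows it was fitted on. The sketch below illustrates this on synthetic stand-in data rather than the Luv dataset; the names `X_demo`, `pca_demo`, and `segment` are assumptions made up for this example, and the segment shown is arbitrary, not the answer to the challenge.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data (an assumption, not the Luv dataset): 200 points, 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])

pca_demo = PCA(n_components=2).fit(X_demo)

# A hypothetical line segment in the original 3-D space, given by its two endpoints
segment = np.array([
    [0.0, 0.0, 0.0],   # start
    [10.0, 0.0, 0.0],  # end, along the first feature
])

# Project both endpoints into the 2-D PCA space
segment_2d = pca_demo.transform(segment)

# The same projection by hand: center on the fitted mean, then project onto the components
manual = (segment - pca_demo.mean_) @ pca_demo.components_.T

print(segment_2d)
```

Note that the origin of the original space generally does not land on (0, 0) after the transformation, because PCA centers the data on its mean first.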

In [None]:
import numpy as np
import pandas as pd

In [None]:
# This code plots both the data points and the original axes as labeled lines.
# The ggplot call below expects a DataFrame called df_axes with three rows, one for each
# original feature, and the columns x, y, xo, yo, and label. Again, feel free to use any other plotting library.

# Functions that might come in handy: np.zeros, np.diag, np.concatenate, luv.max

df_axes = pd.DataFrame(...)
df_axes["label"] = luv.columns.values

ggplot(df, aes(x="x", y="y", color="target")) +\
geom_segment(aes(xend="xo", yend="yo"), data=df_axes, color="black") +\
geom_point() +\
geom_label(aes(label="label"), data=df_axes, color="black") +\
scale_color_identity() +\
theme_void()