# Exercises on PCA

## Exercise 2: Data scaling

The goal of this exercise is to explore how data scaling affects the PCA projection.

In general, the data matrix, $\mathbf{X}$, can be centered and scaled as follows:

$$
\tilde{\mathbf{X}} = (\mathbf{X} - \mathbf{C}) \mathbf{D}^{-1}
$$

where $\mathbf{C}$ is a matrix of centers and $\mathbf{D}$ is a diagonal matrix of scales. We will denote the $i$th variable in a dataset as $X_i$.

The matrix of centers is populated by the mean values of each variable:

$$
\begin{gather}
\mathbf{C} = 
\begin{bmatrix}
c_1 & c_2 &  \dots & c_n \\
c_1 & c_2 &  \dots & c_n \\
c_1 & c_2 & \dots & c_n \\
\vdots \\
c_1 & c_2 & \dots & c_n \\
\end{bmatrix}
\end{gather}
$$

where $c_i = mean(X_i)$.

In contrast, the diagonal of the matrix of scales can be populated in many different ways. In general, we can write:

$$
\begin{gather}
\mathbf{D} = 
\begin{bmatrix}
d_1 & 0 & 0 & 0 & 0 & \dots & 0 \\
0 & d_2 & 0 & 0 & 0 & \dots & 0 \\
0 & 0 & d_3 & 0 & 0 & \dots & 0 \\
\vdots \\
0 & 0 & 0 & 0 & 0 & \dots & d_n \\
\end{bmatrix}
\end{gather}
$$

where $d_i$ is the scaling factor applied to the $i$th variable $X_i$.
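
As a rough illustration of the formula above, the sketch below builds $\mathbf{C}$ and $\mathbf{D}$ explicitly for a small made-up matrix and checks that the result matches the shorter `numpy` broadcasting form. The toy matrix `X_toy` and the scales `d_toy` are arbitrary values assumed for this example only; they are not part of the exercise.

```python
import numpy as np

# Toy data matrix (assumed for illustration): 4 observations, 3 variables
X_toy = np.array([[1.0, 10.0, 100.0],
                  [2.0, 20.0, 300.0],
                  [3.0, 30.0, 200.0],
                  [4.0, 40.0, 400.0]])

c_toy = X_toy.mean(axis=0)             # centers c_i, one per variable
d_toy = np.array([1.0, 10.0, 100.0])   # arbitrary example scales d_i

# Explicit matrix form: C repeats the row of centers, D is diagonal
C = np.tile(c_toy, (X_toy.shape[0], 1))
D = np.diag(d_toy)
X_tilde_matrix = (X_toy - C) @ np.linalg.inv(D)

# Equivalent broadcasting form: subtract and divide column by column
X_tilde_broadcast = (X_toy - c_toy) / d_toy

forms_agree = np.allclose(X_tilde_matrix, X_tilde_broadcast)
```

In practice the broadcasting form is usually preferred, since it avoids building and inverting $\mathbf{D}$.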

In this exercise, we will explore three ways of scaling the data:

- No scaling, where:

$$
d_i = 1
$$

- Standard (auto) scaling, where:

$$
d_i = std(X_i)
$$

- VAriable STability (VAST) scaling, where:

$$
d_i = std(X_i)^2 / mean(X_i)
$$
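
The three choices of $d_i$ listed above could be computed along the following lines. This is only a sketch on synthetic data: the array `X_demo` and all variable names are assumptions made for this example, and `np.std` is used with its default population normalization (`ddof=0`). Note that VAST scaling divides by the mean, so it is only well defined for variables with a nonzero mean.

```python
import numpy as np

# Synthetic data (assumed for illustration): 100 observations, 3 variables
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[5.0, 50.0, 500.0], scale=[1.0, 5.0, 20.0], size=(100, 3))

mean_demo = X_demo.mean(axis=0)   # mean of each variable (column)
std_demo = X_demo.std(axis=0)     # standard deviation of each variable

d_none = np.ones(X_demo.shape[1])     # no scaling: d_i = 1
d_auto = std_demo                     # standard (auto) scaling: d_i = std(X_i)
d_vast = std_demo**2 / mean_demo      # VAST scaling: d_i = std(X_i)^2 / mean(X_i)

# Center and scale with each option
X_demo_none = (X_demo - mean_demo) / d_none
X_demo_auto = (X_demo - mean_demo) / d_auto
X_demo_vast = (X_demo - mean_demo) / d_vast
```

After standard scaling every variable has unit standard deviation, whereas after VAST scaling the standard deviation of each variable equals `mean / std` of the original variable.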



**Note that `sklearn`'s PCA always centers the data but does not scale it! You have to handle the scaling yourself.**
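
One way to convince yourself of this behavior is the small check below, run on synthetic data (the array `X_check` and the variable names are assumptions for this sketch). Shifting the data by a constant should leave the eigenvalues unchanged, because the centering is done internally, while rescaling the variables should change them.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumed for illustration) with very different variable ranges
rng = np.random.default_rng(0)
X_check = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])

pca_raw = PCA(n_components=3).fit(X_check)
pca_shifted = PCA(n_components=3).fit(X_check + 1000.0)             # shifted data
pca_scaled = PCA(n_components=3).fit(X_check / X_check.std(axis=0)) # rescaled data

# Centering is handled by PCA: a constant shift does not change the eigenvalues
shift_invariant = np.allclose(pca_raw.explained_variance_,
                              pca_shifted.explained_variance_)

# Scaling is not handled by PCA: rescaling the variables changes the eigenvalues
scale_sensitive = not np.allclose(pca_raw.explained_variance_,
                                  pca_scaled.explained_variance_)
```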

***

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA

%matplotlib inline

We will use the wine dataset, which has 178 observations and 13 variables. 

In [None]:
from sklearn.datasets import load_wine

# Load the dataset once and reuse the returned object
wine = load_wine()
wine_data = wine.data
wine_names = wine.feature_names

In [None]:
wine_data.shape

There are three distinct wine classes in the data, and we can access the classification of each wine observation:

In [None]:
wine_target = load_wine().target

We will compute PCA using all 13 principal components:

In [None]:
n_components = 13

## PCA with no data scaling

We can inspect the wine data set before any scaling is applied. The `pandas` library is very useful for this purpose. Below, we show the first 10 rows of the data matrix.

You can see that different wine parameters have different numerical ranges:

In [None]:
df = pd.DataFrame(wine_data, columns=wine_names)
df.head(10)

Perform the PCA transformation of the `wine_data` when no scaling is applied to the data and compute the eigenvectors and eigenvalues:
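
As a reminder of the relevant `sklearn` attributes, here is a minimal sketch on random data. The array `X_rand` and the names `scores`, `eigenvectors` and `eigenvalues` are assumptions for this example; adapt them to `wine_data` in the cell below.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random data (assumed for illustration): 50 observations, 4 variables
rng = np.random.default_rng(0)
X_rand = rng.normal(size=(50, 4))

pca_rand = PCA(n_components=4)
scores = pca_rand.fit_transform(X_rand)       # projected data, shape (50, 4)
eigenvectors = pca_rand.components_           # one eigenvector per row, shape (4, 4)
eigenvalues = pca_rand.explained_variance_    # eigenvalues, sorted in descending order
```

The values in `explained_variance_` correspond to the eigenvalues of the sample covariance matrix of the data, and the first two columns of `scores` give the two-dimensional projection.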

In [None]:
#pca = PCA(n_components=n_components)

# Complete the PCA steps...

Visualize the two-dimensional PCA projection of the data set. Color the projection by the three classes as per `wine_target`:
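
A scatter plot colored by class could look roughly like the sketch below. The random `points` and `labels` are placeholders assumed for this example, standing in for your PCA projection and `wine_target`. The handle returned by `plt.scatter` is what `plt.colorbar` expects as its first argument.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data (assumed for illustration): 30 points in three classes
rng = np.random.default_rng(0)
points = rng.normal(size=(30, 2))
labels = np.repeat([0, 1, 2], 10)

fig = plt.figure(figsize=(10, 5))

# Keep the returned handle so that the colorbar can be attached to it
scat = plt.scatter(points[:, 0], points[:, 1], c=labels, cmap='viridis')
plt.xlabel('First coordinate')
plt.ylabel('Second coordinate')
plt.colorbar(scat, ticks=[0, 1, 2])
```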

In [None]:
fig = plt.figure(figsize=(10,5))

# Create a scatter plot of the projection and assign it to `scat`

plt.colorbar(scat, ticks=[0,1,2])

## PCA with standard (auto) scaling

Center and scale the data using standard (auto) scaling, according to the formula given at the top of the notebook.

Hint: Python's `numpy` library allows you to subtract, add, multiply or divide each column of a data matrix simply by specifying a vector of values.

For instance, if you want to compute the mean value of each data **column**, you can write:

```
np.mean(X, axis=0)
```

If you want to compute the mean value of each data **row**, you can write:

```
np.mean(X, axis=1)
```

Both `np.mean(X, axis=0)` and `np.mean(X, axis=1)` are vectors, not matrices! The parameter `axis` controls along which axis of the data the value is computed.
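
The small example below illustrates the difference on a made-up 2 x 3 matrix (`X_small` and the other names are assumptions for this sketch). A vector of column values can be subtracted from the matrix directly, whereas a vector of row values first has to be reshaped into a column.

```python
import numpy as np

# Made-up matrix (assumed for illustration): 2 rows, 3 columns
X_small = np.array([[1.0, 2.0, 3.0],
                    [5.0, 6.0, 7.0]])

col_means = np.mean(X_small, axis=0)   # one value per column, shape (3,)
row_means = np.mean(X_small, axis=1)   # one value per row, shape (2,)

# Subtract the column means from every row (broadcasting over rows)
X_col_centered = X_small - col_means

# Subtract the row means from every column (reshape to a column vector first)
X_row_centered = X_small - row_means[:, np.newaxis]
```

Here `col_means` is `[3., 4., 5.]` and `row_means` is `[2., 6.]`.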

In the cell below, remember to use the `axis` parameter correctly!

In [None]:
# wine_data_std = 

Inspect the first 10 rows of the wine data matrix after scaling. What change do you see?

In [None]:
# Inspect the scaled data set

Perform the PCA transformation of the `wine_data` when standard (auto) scaling is applied to the data and compute the new eigenvectors and eigenvalues:

In [None]:
# pca_std = PCA(n_components=n_components)

# Complete the PCA steps...

Visualize the two-dimensional PCA projection of the data set. Color the projection by the three classes as per `wine_target`. What changes?

In [None]:
fig = plt.figure(figsize=(10,5))

# Create a scatter plot of the projection and assign it to `scat`

plt.colorbar(scat, ticks=[0,1,2])

## PCA with VAST scaling

Center and scale the data using VAST scaling, according to the formula given at the top of the notebook.

In [None]:
# wine_data_vast = 

Inspect the first 10 rows of the wine data matrix after scaling. What change do you see?

In [None]:
# Inspect the scaled data set

Perform the PCA transformation of the `wine_data` when VAST scaling is applied to the data and compute the new eigenvectors and eigenvalues:

In [None]:
# pca_vast = PCA(n_components=n_components)

# Complete the PCA steps...

Visualize the two-dimensional PCA projection of the data set. Color the projection by the three classes as per `wine_target`. What changes?

In [None]:
fig = plt.figure(figsize=(10,5))

# Create a scatter plot of the projection and assign it to `scat`

plt.colorbar(scat, ticks=[0,1,2])

## Compare the eigenvectors

Now, we are going to visualize the eigenvectors resulting from no scaling, standard (auto) scaling and VAST scaling.

We are going to use `pyplot`'s `bar` function, which generates bar plots. You can use this template to plot the eigenvectors:

```
plt.bar(x_ticks, height, width=bar_width, color='k', label='Scaling')
```

Here, `x_ticks` specifies the locations on the x-axis where the bars will be plotted, and `height` specifies the height of each bar.

Also plot the legend that explains which eigenvector results from which data scaling option.
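
To show how several sets of bars can share one plot, here is a sketch with made-up heights. All names and values in it (`positions`, `width`, `shift`, the `heights_*` arrays and the group labels) are assumptions for this example. Each set of bars is moved left or right by a small shift so that the bars do not overlap.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up values (assumed for illustration): three groups over four variables
positions = np.arange(4)
width = 0.2
shift = 0.2
heights_a = np.array([0.5, -0.2, 0.1, 0.7])
heights_b = np.array([0.4, 0.3, -0.5, 0.2])
heights_c = np.array([-0.1, 0.6, 0.3, -0.4])

fig = plt.figure(figsize=(10, 3))

# Shift each group along the x-axis so that the bars sit side by side
bars_a = plt.bar(positions - shift, heights_a, width=width, color='k', label='Group A')
bars_b = plt.bar(positions, heights_b, width=width, color='gray', label='Group B')
bars_c = plt.bar(positions + shift, heights_c, width=width, color='lightgray', label='Group C')

plt.xticks(positions, ['v1', 'v2', 'v3', 'v4'], rotation=90)
plt.legend()
```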

In [None]:
# The x_ticks parameter will help you label the x-axis:
x_ticks = np.arange(13)

# The bar_width parameter controls the width of each bar on the bar plot:
bar_width = 0.2

# Since we are going to draw three sets of bars on one figure, we will offset them so that they are easier to tell apart:
offset = 0.2

Plot the comparison of the **first** eigenvector from the three data scaling options:

In [None]:
fig = plt.figure(figsize=(10,3))

# Create a bar plot for no scaling
# Create a bar plot for standard (auto) scaling
# Create a bar plot for VAST scaling

plt.xticks(x_ticks, wine_names, rotation=90)
plt.ylabel('First eigenvector weight')
plt.grid(alpha=0.2)
plt.legend()

Plot the comparison of the **second** eigenvector from the three data scaling options:

In [None]:
fig = plt.figure(figsize=(10,3))

# Create a bar plot for no scaling
# Create a bar plot for standard (auto) scaling
# Create a bar plot for VAST scaling

plt.xticks(x_ticks, wine_names, rotation=90)
plt.ylabel('Second eigenvector weight')
plt.grid(alpha=0.2)
plt.legend()

## Compare the eigenvalues

Finally, we are going to visualize the scree plot of eigenvalues resulting from no scaling, standard (auto) scaling and VAST scaling.

Remember to also plot the legend that explains which eigenvalues result from which data scaling option.

What differences do you observe?
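
A scree plot with several curves might be drawn along the lines of the sketch below, which uses synthetic data and only two of the scaling options. The array `X_syn` and all other names are assumptions for this example; in the cell below you would plot the eigenvalues you computed for the wine data instead.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Synthetic data (assumed for illustration) with very different variable ranges
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(100, 5)) * np.array([1.0, 2.0, 5.0, 10.0, 20.0])
X_syn_auto = (X_syn - X_syn.mean(axis=0)) / X_syn.std(axis=0)

eig_raw = PCA(n_components=5).fit(X_syn).explained_variance_
eig_auto = PCA(n_components=5).fit(X_syn_auto).explained_variance_

component_index = np.arange(5)
fig = plt.figure(figsize=(10, 3))

# One line per scaling option, with a label for the legend
plt.plot(component_index, eig_raw, 'o-', color='k', label='No scaling')
plt.plot(component_index, eig_auto, 's--', color='gray', label='Standard (auto) scaling')

plt.xticks(component_index)
plt.xlabel('Component index')
plt.ylabel('Eigenvalue')
plt.legend()
```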

In [None]:
fig = plt.figure(figsize=(10,3))

# Create a line plot for no scaling
# Create a line plot for standard (auto) scaling
# Create a line plot for VAST scaling

plt.xticks(x_ticks)
plt.xlabel('Number of components')
plt.ylabel('Eigenvalue')
plt.legend()

Final tip: if you'd like to save the plots that you produced, you can add the following command after plotting:

```
plt.savefig('figure.png', dpi=200, bbox_inches='tight')
```

You can also control the file extension, e.g. you can save the plot to `.pdf` format instead of `.png`:

```
plt.savefig('figure.pdf', dpi=200, bbox_inches='tight')
```

***