# Principal component analysis

We use PCA here to perform dimensionality reduction: first to explore how the method works and how well it compresses the data at hand, and then to assess whether it is a useful preprocessing step for other learning algorithms (here, logistic regression).

**There are 5 questions to answer.**

In [None]:
import os
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# set up the random number generator: given seed for reproducibility, None otherwise
# (see https://numpy.org/doc/stable/reference/random/generator.html#numpy.random.default_rng)
my_seed = 1
rng = np.random.default_rng(seed=my_seed) 

## Loading the MNIST dataset

We perform PCA on the standard MNIST dataset, already encountered a few times in the hands-on sessions.

In [None]:
# load data from Keras, values between 0 and 255 initially
(x_train_full, y_train_full), (x_test_full, y_test_full) = tf.keras.datasets.mnist.load_data()
print('initial data type for images = ',x_train_full.dtype,', initial data shape = ',x_train_full.shape)
print('initial data type for labels = ',y_train_full.dtype,', initial label shape = ',y_train_full.shape,'\n')
    
# renormalize to have data between 0 and 1; could alternatively use built-in rescaling function
# such as https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
x_train_full = x_train_full/255. 
x_test_full = x_test_full/255.
print('Train set: data set size =',x_train_full.shape[0])
print('Test set:  data set size =',x_test_full.shape[0])

# reshape the data points, which are 28x28 tensors, into a single vector of size 28x28=784
x_train_full = x_train_full.reshape((x_train_full.shape[0], 784))
x_test_full = x_test_full.reshape((x_test_full.shape[0], 784))

# shuffle data with the seeded generator rng, so that the permutations are reproducible
indices = rng.permutation(x_train_full.shape[0])
x_train_full = x_train_full[indices]
y_train_full = y_train_full[indices]
indices = rng.permutation(x_test_full.shape[0])
x_test_full = x_test_full[indices]
y_test_full = y_test_full[indices]

We can plot the first elements of the resulting data set to see what they look like (here displayed with the `binary` color map).

In [None]:
plt.figure(figsize=(10, 10))
for i in range(25):
    plt.subplot(5, 5, i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    # color map = binary, other choices here https://matplotlib.org/stable/tutorials/colors/colormaps.html
    plt.imshow(x_train_full[i].reshape(28,28), cmap=plt.cm.binary)     
    plt.title(y_train_full[i])
plt.show()

## Performing PCA

We use PCA as implemented in scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html; see also https://scikit-learn.org/stable/modules/decomposition.html#pca

In [None]:
# importing the required class
from sklearn import decomposition

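# initializing the PCA, keeping all 784 components
pca = decomposition.PCA(n_components=784)
# fit on the training set and compute the scores along the principal directions
pca_data = pca.fit_transform(x_train_full)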

We can then plot the various eigenvalues of the empirical covariance, and the percentage of the explained variance as a function of the eigenvalue number.
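To make the link between eigenvalues and explained variance concrete, here is a minimal sketch on hypothetical toy data (the array `X` and the generator `rng_demo` are illustrative and not part of the hands-on). It computes the eigenvalues of the empirical covariance in two ways and normalizes them into fractions of the total variance. The scikit-learn documentation describes `explained_variance_` as the largest eigenvalues of the covariance matrix of the data, estimated with `n_samples - 1` degrees of freedom, which is the convention assumed here.

```python
import numpy as np

# Hypothetical toy data: 200 samples in 5 dimensions with unequal variances
rng_demo = np.random.default_rng(0)
X = rng_demo.normal(size=(200, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])

# Empirical covariance of the centered data (normalized by n - 1)
X_centered = X - X.mean(axis=0)
cov = X_centered.T @ X_centered / (X.shape[0] - 1)

# Eigenvalues of the symmetric covariance matrix, sorted in decreasing order
eigenvalues = np.linalg.eigh(cov)[0][::-1]

# The same values obtained from the singular values of the centered data
singular_values = np.linalg.svd(X_centered, compute_uv=False)
eigenvalues_from_svd = singular_values**2 / (X.shape[0] - 1)

# Fraction of the total variance carried by each component
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(np.allclose(eigenvalues, eigenvalues_from_svd))
print(explained_variance_ratio)
```

Both routes give the same spectrum; the SVD route avoids forming the covariance matrix explicitly.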

**Question 1.** Compute the percentage of explained variance.

In [None]:
percentage_var_explained = pca.explained_variance_ / np.sum(pca.explained_variance_)
cum_var_explained = np.cumsum(percentage_var_explained)
# alternatively, one could directly use 
# cum_var_explained = np.cumsum(pca.explained_variance_ratio_)

In [None]:
# Plot the PCA spectrum
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(12, 4))
# plot the eigenvalues themselves, consistently with the label of the y-axis
ax1.plot(pca.explained_variance_, linewidth=2)
#ax1.semilogy(pca.explained_variance_, linewidth=2)
ax1.grid()
ax1.set_xlabel('Index of component')
ax1.set_ylabel('Covariance eigenvalue')
ax1.set_xlim(0,100)
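# abscissa starts at 1, so that it corresponds to the number of components retained
ax2.plot(np.arange(1, len(cum_var_explained)+1), cum_var_explained, linewidth=2)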
ax2.grid()
ax2.set_xlabel('Number of components')
ax2.set_ylabel('Fraction of cumulative explained variance')
plt.show()

**Question 2.** Write a function which gives the number of components required to attain a certain percentage of the total variance. How many components are needed to retrieve 50% of the total variance? 80%?

In [None]:
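def n_components_for_fraction(fraction, cum_var):
    # index of the first cumulative value reaching `fraction`, converted to a count
    return int(np.argmax(cum_var >= fraction)) + 1

for fraction in (0.5, 0.8):
    index = n_components_for_fraction(fraction, cum_var_explained)
    print("For a fraction",fraction,"of the total variance, one needs",index,"components")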

We next study what images look like when dimensionality reduction is performed.

**Question 3.** Complete the code below.

In [None]:
# Try PCA on first images
n_images = 8 
vector_of_index_values = [700,200,120,80,30,10]

n_index_values = len(vector_of_index_values)
plt.figure(figsize=(n_images*2, n_index_values*2))

current_index_value = 0
for i in range(n_images):
    plt.subplot(n_index_values+1, n_images, current_index_value*n_images+i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    if i == 0:
        plt.ylabel('reference')
    plt.imshow(x_test_full[i].reshape(28,28), cmap=plt.cm.binary)     

for current_index_value in range(1,n_index_values+1):
    pca.n_components = vector_of_index_values[current_index_value-1]
    pca.fit(x_train_full)
    # pca.transform gives the score along the principal directions
    # pca.inverse_transform maps back to the original space
    outputs = pca.inverse_transform(pca.transform(x_test_full))
    for i in range(n_images):
        plt.subplot(n_index_values+1, n_images, current_index_value*n_images+i+1)
        plt.xticks([])
        plt.yticks([])
        plt.grid(False)
        if i == 0:
            plt.ylabel('n_comp = '+str(pca.n_components))
        plt.imshow(outputs[i].reshape(28,28), cmap=plt.cm.binary)
    
plt.show()
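
The reconstruction used above can be sketched with NumPy alone, to show what the two steps do under the default setting without whitening: the data are centered, projected onto the first `k` principal directions, then mapped back and shifted by the mean. The toy array `X` and the helper `reconstruct` are hypothetical names introduced for illustration only.

```python
import numpy as np

# Hypothetical toy data standing in for the flattened images
rng_demo = np.random.default_rng(0)
X = rng_demo.random((300, 20))

mean = X.mean(axis=0)
# Rows of Vt are the principal directions, sorted by decreasing singular value
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)

def reconstruct(X, k):
    """Project onto the first k principal directions and map back."""
    V_k = Vt[:k]                     # shape (k, n_features)
    scores = (X - mean) @ V_k.T      # analogue of pca.transform
    return mean + scores @ V_k       # analogue of pca.inverse_transform

# Mean squared reconstruction error for an increasing number of components
errors = [np.mean((X - reconstruct(X, k))**2) for k in (2, 5, 10, 20)]
print(errors)
```

The error decreases as `k` grows and vanishes, up to rounding, when all directions are kept.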

**Question 4.** In which range of values are the pixels for the reconstructed images when 50 components are retained? What do you think of this result?

In [None]:
pca.n_components = 50
pca.fit(x_train_full)
outputs = pca.inverse_transform(pca.transform(x_test_full))
print("The minimal pixel value over all test images is",outputs.min(),"while the maximum is",outputs.max())

PCA has no mechanism ensuring that the pixel values of the reconstructed images remain between 0 and 1. However, as the number of components increases, the minimal pixel value increases towards 0 while the maximal one decreases towards 1, although rather slowly. A more informative diagnostic would probably be to plot the distribution of the minimum and maximum pixel values over the images.
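
A minimal sketch of such a per-image diagnostic is given below. It runs on a hypothetical random array `outputs_demo` standing in for the reconstructed images `outputs`, so the printed numbers say nothing about MNIST; only the computations carry over.

```python
import numpy as np

# Hypothetical stand-in for `outputs`: reconstructed images, one per row
rng_demo = np.random.default_rng(0)
outputs_demo = rng_demo.normal(loc=0.5, scale=0.4, size=(1000, 784))

# Extreme pixel values of each image, rather than over the whole test set
per_image_min = outputs_demo.min(axis=1)
per_image_max = outputs_demo.max(axis=1)

# Summaries of the two distributions
print("median of per-image minima:", np.median(per_image_min))
print("median of per-image maxima:", np.median(per_image_max))

# Fraction of pixels falling outside the admissible range [0, 1]
outside = np.mean((outputs_demo < 0) | (outputs_demo > 1))
print("fraction of pixels outside [0, 1]:", outside)

# One simple remedy is to clip the reconstructions before displaying them
clipped = np.clip(outputs_demo, 0.0, 1.0)
```

With the actual reconstructions, `per_image_min` and `per_image_max` can be passed to `plt.hist` to visualize the two distributions.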

**Question 5.** Complete the code below to plot the projections of the test images onto the first factorial plane, by coloring each point according to its label. Can you separate the images in this representation?

In [None]:
# perform PCA, the first 2 components are sufficient
pca.n_components = 2
pca.fit(x_train_full)
test_scores = pca.transform(x_test_full)

In [None]:
digits = np.arange(0,10)
plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
plt.scatter(test_scores[:,0],test_scores[:,1],s=0.1,color='black')
plt.title('all digits')
plt.subplot(1, 2, 2)
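for i in digits:
    indices = np.where((y_test_full == i))
    test_scores_extracted = test_scores[indices]
    plt.scatter(test_scores_extracted[:,0],test_scores_extracted[:,1],s=0.5,label=str(i))
# build the legend once, after all digits have been plotted (markers enlarged for readability)
plt.legend(markerscale=10)
#plt.title('subset of the digits')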
plt.show()

The images cannot be separated in this representation; it is even difficult to identify groups. Keeping only two components reduces the dimension too drastically.

## Additional topics for project

One can revisit logistic regression after performing dimensionality reduction, and determine how the classification performance and the computational cost of training are impacted by the dimensionality reduction.
One can also check the impact of various alternative normalizations of the data (such as MinMaxScaler(), StandardScaler(), or other options described in the page https://scikit-learn.org/stable/modules/preprocessing.html).
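
As a possible starting point, the sketch below chains a scaler, PCA and logistic regression in a scikit-learn pipeline, and records the test accuracy and the training time for a few numbers of components. It is only an illustration under stated assumptions: the small 8x8 digits set bundled with scikit-learn replaces MNIST so that the example runs quickly, and the component counts, the split and `max_iter` are arbitrary choices.

```python
import time
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled 8x8 digits data set (64 features) stands in for MNIST
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

results = {}
for n_components in (5, 20, 64):
    # scaling and PCA are fitted on the training set only, inside the pipeline
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components),
        LogisticRegression(max_iter=2000),
    )
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    results[n_components] = (model.score(X_test, y_test), elapsed)
    print(n_components, "components: accuracy =", results[n_components][0],
          ", fit time =", elapsed, "s")
```

Swapping `StandardScaler()` for `MinMaxScaler()` in the pipeline is one way to compare normalizations on the same footing.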