# DSCI 563 - Unsupervised Learning

# Lab 2: Dimensionality Reduction

## Submission instructions <a name="si"></a>
<hr>
rubric={mechanics:2}

You will receive marks for correctly submitting this assignment. To submit this assignment, follow the instructions below:

- **Please add a link to your GitHub repository here: LINK TO YOUR  REPO**
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).
- Make at least three commits in your lab's GitHub repository.
- Push the final .ipynb file with your solutions to your GitHub repository for this lab.
- Upload the .ipynb file to Gradescope.
- If the .ipynb file is too big or doesn't render on Gradescope for some reason, also upload a pdf or html in addition to the .ipynb. 
- Make sure that your plots/output are rendered properly in Gradescope.

> [Here](https://github.com/UBC-MDS/public/tree/master/rubric) you will find the description of each rubric used in MDS.

> As usual, do not push the data to the repository. 

<br><br><br><br>

## Imports <a name="im"></a>

In [None]:
import pickle

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image
from sklearn import datasets
from sklearn.decomposition import NMF, PCA
from sklearn.preprocessing import StandardScaler

<br><br><br><br>

## Exercise 1: Warm-up 
<hr>

### 1.1 Dimensionality reduction notation 
rubric={reasoning:4}

**Your tasks:**

Based on the notation we saw in class, what does each of the following matrices stand for in the context of Principal Component Analysis (PCA): $X, W, Z, \hat{X}$? What would be the shape of each of these matrices in terms of the variables below? 
    
- $n \rightarrow $ number of examples 
- $d \rightarrow $ number of features
- $k \rightarrow $ number of components         

<div class="alert alert-warning">

Solution_1_1
    
</div>

<br><br>

### 1.2 Picking the number of components $k$
rubric={reasoning:2}

**Your tasks:**
    
In PCA, why don't we simply pick the number of components $k$ (`n_components` in `sklearn`) that gives the lowest reconstruction error?     

<div class="alert alert-warning">

Solution_1_2
    
</div>


<br><br>

### 1.3 PCA by hand

Suppose you train the standard PCA algorithm on an already centered dataset `X` (not shown) and you get `Z` and `W` shown below.    

In [None]:
Z = np.array([[10, 10], [5, 2], [4, 3], [4, 3]])
W = np.array([[0.5, 0.5, -0.5, -0.5], [0.7, 0.1, 0.7, 0.1]])

#### 1.3.1 
rubric={accuracy:1}

**Your tasks:**

What would be the shape of the training data matrix `X`? 

<div class="alert alert-warning">

Solution_1_3_1
    
</div>

#### 1.3.2 
rubric={accuracy:1}

**Your tasks:**
    
Fill in the blanks: 
- Here we are reducing dimensionality from **__** dimensions to **__** dimensions. 


<div class="alert alert-warning">

Solution_1_3_2
    
</div>

#### 1.3.3
rubric={accuracy:1}

Find the low-dimensional representation of the already centered `X_new` below.    

In [None]:
X_new = np.array([[1, 1, 1, 1]])

<div class="alert alert-warning">

Solution_1_3_3
    
</div>

#### 1.3.4
rubric={reasoning:1}
    
The third and fourth rows of the transformed matrix `Z` are identical. Does this mean that the third and fourth rows of the training data matrix `X` also have to be identical?     

<div class="alert alert-warning">

Solution_1_3_4
    
</div>

<br><br>

### 1.4 Reconstruction error 
rubric={accuracy:3,reasoning:2}


Let's get an intuition for reconstruction error on a toy dataset.

The code below creates a toy dataset with a few outliers. The function `get_recon_error_df` calculates normalized reconstruction errors between the original and reconstructed data points. Run the code and answer the following questions. 
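
To make "normalized reconstruction error" concrete, here is a minimal sketch on a hand-made array. The values of `orig_toy` and `recon_toy` are made up for illustration and are not part of the lab data. The sketch computes the Euclidean distance between each original row and its reconstruction, then min-max scales those distances to the range $[0, 1]$.

```python
import numpy as np

# Hypothetical toy values, chosen only so the arithmetic is easy to follow
orig_toy = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]])
recon_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]])

# Row-wise Euclidean distance between original and reconstruction
dist = np.sqrt(np.sum((orig_toy - recon_toy) ** 2, axis=1))  # distances: 0, 1, 5

# Min-max scaling: the smallest error maps to 0 and the largest to 1
scaled = (dist - dist.min()) / (dist.max() - dist.min())  # scaled: 0, 0.2, 1

print(dist)
print(scaled)
```

Because of the scaling, each value is relative to the best- and worst-reconstructed examples in the same dataset, so the numbers are not comparable across datasets.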

**Your tasks:**    
1. Write a docstring for the function `get_recon_error_df`.
2. Fit `PCA` with `n_components=1` on `X_noise_scaled` and examine the reconstruction errors for different examples. You can follow the steps below to get reconstruction errors. 
    - create transformed data `Z` and reconstructed data `X_hat`
    - call `get_recon_error_df()` with `X_noise_scaled` and `X_hat`
3. The last 3 rows (rows from indices 100 to 102) in `X_noise_scaled` are outliers. Do you see any striking difference between the reconstruction errors for outliers and those for other data points (i.e., non-outliers)?      

In [None]:
import mglearn

np.random.seed(42)
outliers = np.array([[4, -3], [2.8, -3], [-3, 3]])
x1 = np.random.randn(100)
x2 = x1 + np.random.randn(100) / 2
X = np.stack([x1, x2]).T
X_noise = np.vstack([X, outliers])
X_noise_scaled = StandardScaler().fit_transform(X_noise)
mglearn.discrete_scatter(X_noise_scaled[:, 0], X_noise_scaled[:, 1]);

<div class="alert alert-warning">

Solution_1_4_1

</div>


In [None]:
def get_recon_error_df(orig, recon):
    """
    
    """
    loss = np.sqrt(np.sum((orig - recon) ** 2, axis=1))
    loss = (loss - np.min(loss)) / (np.max(loss) - np.min(loss))  # normalization
    loss_df = pd.DataFrame(data=loss, columns=["recon_error"])
    return loss_df

<div class="alert alert-warning">

Solution_1_4_2
    
</div>

<div class="alert alert-warning">

Solution_1_4_3
    
</div>

<br><br>

### (optional) 1.5 Reconstruction error for anomaly detection
rubric={reasoning:1}

**Your tasks:**

1. Write a paragraph on how you might use PCA and reconstruction errors for anomaly detection. 

<div class="alert alert-info">
    
You might want to look up robust PCA for additional information on this topic.    
    
</div>    

<div class="alert alert-warning">

Solution_1_5_1
    
</div>

<br><br><br><br>

## Exercise 2: Implementing PCA using SVD
<hr>
rubric={accuracy:7}

In this exercise, you'll implement your own version of PCA using SVD. The class `MyPCA` below already implements the `__init__` and `fit` methods. 
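
As background for the `fit` method, the following sketch (on random data that is not part of the lab) shows what `np.linalg.svd` returns: singular values in descending order and a third matrix whose rows are orthonormal directions in feature space. The names `X_demo`, `U`, `S`, and `Vt` are chosen here for illustration, and `full_matrices=False` is an optional argument used only to keep the shapes small.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 4))           # 6 examples, 4 features (made-up data)
X_centered = X_demo - X_demo.mean(axis=0)  # PCA assumes centered data

# Thin SVD: X_centered = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

print(U.shape, S.shape, Vt.shape)  # (6, 4) (4,) (4, 4)

# Singular values come back sorted from largest to smallest
assert np.all(np.diff(S) <= 0)

# Rows of Vt are orthonormal
assert np.allclose(Vt @ Vt.T, np.eye(4))

# The three factors multiply back to the centered data
assert np.allclose(U @ np.diag(S) @ Vt, X_centered)
```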

**Your tasks:** 
    
1. Complete the `get_components` method of the class which returns the learned components. 
2. Complete the `transform` method of the class which returns transformed `Z` given `X`. 
> *Before applying transformation, center the data by subtracting the mean.*
3. Complete `reconstruct` method of the class which returns reconstructed `X_hat` given transformed `Z`.
> *Do not forget to add the mean back after reconstruction.*
4. Run your code and compare results of your PCA and `sklearn` PCA using the code below.  
   

<div class="alert alert-warning">

Solution_2
    
</div>

In [None]:
class MyPCA:
    """
    Solves the PCA problem min_Z,W (Z*W-X)^2 using SVD
    """

    def __init__(self, k):
        self.k = k

    def fit(self, X):
        self.mean = np.mean(X, 0)
        X = X - self.mean  # Center the data
        U, S, Vt = np.linalg.svd(
            X
        )  # SVD; the rows of Vt are the principal components
        self.W = Vt[: self.k, :]  # store only first k components in self.W

    def get_components(self):
        """
        Returns principal components.

        Parameters
        -----------
        None

        Returns
        -----------
        np.ndarray: an array containing k principal components.
        """
        ### Solution_2_1
        

    def transform(self, X):
        """
        Transforms X to Z and return Z.

        Parameters
        -----------
        X : np.ndarray
            Data to be transformed

        Returns
        -----------
        np.ndarray: transformed data
        """
        ### Solution_2_2

        

    def reconstruct(self, Z):
        """
        Given transformed data Z, returns reconstructed X_hat.

        Parameters
        -----------
        Z : np.ndarray
            PCA transformed data

        Returns
        -----------
        np.ndarray: X_hat which has dimensions of original X
        """
        ### Solution_2_3

        

Let's compare our implementation with `sklearn`'s PCA implementation on a toy dataset. 

In [None]:
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=3, n_features=20)  ## Generating toy data

In [None]:
for i in range(1, X.shape[1] + 1):
    pca = PCA(n_components=i)
    pca.fit(X)

    mypca = MyPCA(k=i)
    mypca.fit(X)

    # Component signs are arbitrary, so compare absolute values
    assert np.allclose(
        np.abs(pca.components_), np.abs(mypca.get_components())
    ), "W values do not match"

    Z = pca.transform(X)
    Z_prime = mypca.transform(X)

    assert np.allclose(np.abs(Z), np.abs(Z_prime)), "Z values do not match"

    X_hat = pca.inverse_transform(Z)
    X_hat_prime = mypca.reconstruct(Z_prime)
    assert np.allclose(
        np.abs(X_hat), np.abs(X_hat_prime)
    ), "reconstructed X_hat values do not match"

    # Report success only after all checks for this k have passed
    print("PCA implementation with {} components: OK".format(i))

<br><br><br><br>

## Dimensionality reduction on animal faces 
<hr>
In the next few exercises, you'll explore dimensionality reduction on a subset of the [Animal Faces dataset](https://www.kaggle.com/andrewmvd/animal-faces) from Kaggle. I have created a subset of this dataset and preprocessed it a bit. 
    
**Your tasks:**    
    
- Download animal_faces.pkl from [here](https://github.ubc.ca/mds-2021-22/datasets/blob/master/data/animal_faces.pkl) and save it locally under the lab directory. 
- Run the code below to unpickle the data and display the first few images.   

In [None]:
import pickle

# Use a context manager so the file handle is closed after loading
with open("animal_faces.pkl", "rb") as f:
    animals = pickle.load(f)

In [None]:
import matplotlib as mpl

mpl.rcParams.update(mpl.rcParamsDefault)
plt.rcParams["image.cmap"] = "gray"

In [None]:
animals.shape

In [None]:
fig, axes = plt.subplots(2, 5, figsize=(12, 5), subplot_kw={"xticks": (), "yticks": ()})
for image, ax in zip(animals, axes.ravel()):
    ax.imshow(image)
plt.show()

Let's flatten the images. 

In [None]:
X_anims = animals.reshape(len(animals), -1)
X_anims.shape

The flattened representation of the data is of shape $1500 \times 10000$; each image is represented with 10,000 pixel features. Let's define an `image_shape` variable, which will be handy later when we want to display images. 

In [None]:
image_shape = (100, 100)
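
Flattening with `reshape` only rearranges the pixel values, so it can be undone using the stored image shape. Here is a minimal sketch on a made-up stack of tiny images (the array `imgs_demo` is hypothetical and is not the lab data):

```python
import numpy as np

# Three made-up 4x4 "images"
imgs_demo = np.arange(3 * 4 * 4).reshape(3, 4, 4)

# Flatten: one row per image, one column per pixel
flat = imgs_demo.reshape(len(imgs_demo), -1)
print(flat.shape)  # (3, 16)

# Reshape one flattened row back into an image
restored = flat[1].reshape((4, 4))
assert np.array_equal(restored, imgs_demo[1])
```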

<br><br>

## Exercise 3: Dimensionality reduction with PCA
<hr>

In this exercise, you will explore dimensionality reduction on the animal faces dataset using PCA. 

### 3.1 Choosing the number of components 
rubric={accuracy:3,reasoning:2}   

First, let's pick an appropriate value for the number of components. Recall that PCA finds principal components such that the first principal component has the highest variance, the second component has the next highest variance, and so on. We can choose the number of principal components ($k$ or `n_components`) based on the proportion of variance explained by the first $k$ components. 
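
To illustrate "variance explained by the first $k$ components", here is a sketch on synthetic data using only NumPy. It relies on the fact that, for centered data, the variance along each principal component is proportional to the corresponding squared singular value. The data and the names `X_demo`, `var_ratio`, and `cum_ratio` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 200 examples, 5 features with very different spreads
X_demo = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])
X_centered = X_demo - X_demo.mean(axis=0)

# Squared singular values are proportional to per-component variance
S = np.linalg.svd(X_centered, compute_uv=False)
var_ratio = S**2 / np.sum(S**2)

# Proportion of total variance explained by the first k components
cum_ratio = np.cumsum(var_ratio)
for k, r in enumerate(cum_ratio, start=1):
    print(f"k={k}: {r:.3f}")
```

With scikit-learn's `PCA`, the fitted attribute `explained_variance_ratio_` holds the per-component proportions, so a cumulative sum of that attribute gives the same kind of curve.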

**Your tasks:**

1. Using [scikit-learn's `PCA`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) implementation and the following hyperparameter values, fit a PCA model, and plot $k$ (for $k=1,\ldots, 300$) vs. the proportion of total variance explained by the first $k$ components. 
    - `n_components=300`
    - `random_state=42`
2. What range of values for $k$ (`n_components` in `scikit-learn`) seems reasonable based on this plot? Briefly explain.         

<div class="alert alert-info">

_Note that scikit-learn's PCA does the data centering for you. **Do not scale the data in this exercise and the next exercises**._ 

</div>

<div class="alert alert-warning">

Solution_3_1_1
    
</div>

<div class="alert alert-warning">

Solution_3_1_2
    
</div>

<br><br>

### 3.2 Data transformation
rubric={accuracy:3}   

**Your tasks:**

1. Train a PCA model with the number of components you chose in 3.1 and `random_state=42`.    
2. Get $Z$ (transformed data), $W$ (principal components), and $\hat{X}$ (reconstructions).     

<div class="alert alert-warning">

Solution_3_2_1
    
</div>

<div class="alert alert-warning">

Solution_3_2_2
    
</div>

<br><br><br><br>

## Exercise 4: Interpreting PCA components 
<hr>

### 4.1 Visualizing PCA components
rubric={viz:4,reasoning:2}

The principal component matrix returned by PCA (the $W$ matrix) has shape (number of components) $\times$ (number of features). So you can reshape each component and visualize it as an image (in our case, a $100 \times 100$ image). 

**Your tasks:**
1. Visualize the first 15 components as $100 \times 100$ images in a grid of 3 rows and 5 columns.
2. Observe the components and comment on the possible semantic themes captured by at least a couple of components. 

<div class="alert alert-info">
    
_Feel free to use code from lecture notes with appropriate attributions._
    
</div>    

<div class="alert alert-warning">

Solution_4_1_1
    
</div>

<div class="alert alert-warning">

Solution_4_1_2
    
</div>

<br><br>

### 4.2 Plotting strong component images
rubric={reasoning:4}

**Your tasks:**

1. Pick 2 to 4 principal components from 4.1 where you observed some possible semantic themes. For these principal components, display a few example images with strong principal component values for those principal components. You can write your own code for this or use the function `plot_strong_comp_images`, which I am providing below. 
2. Briefly comment on your results. Do the example images agree with your semantic interpretation from 4.1?  

In [None]:
def plot_strong_comp_images(Z, compn=1):
    """
    Display the 10 images with the largest values for a given component.

    Parameters
    ----
    Z : numpy array
        PCA Transformed data

    compn : int
        component to examine

    Returns
    ----
    None
    """
    inds = np.argsort(Z[:, compn])[::-1]
    fig, axes = plt.subplots(
        2, 5, figsize=(10, 5), subplot_kw={"xticks": (), "yticks": ()}
    )
    fig.suptitle("Large component %d" % (compn))
    for i, (ind, ax) in enumerate(zip(inds, axes.ravel())):
        ax.imshow(X_anims[ind].reshape(image_shape))

<div class="alert alert-warning">

Solution_4_2_1
    
</div>

<div class="alert alert-warning">

Solution_4_2_2
    
</div>

<br><br>

### (optional) 4.3 Reconstructions
rubric={reasoning:1}   
   
**Your tasks:**
        
1. Get reconstruction errors for images using the function `get_recon_error_df` from Exercise 1. 
2. Write code to display a few original and reconstructed image pairs, where the reconstruction error is very high and where it is very low.
3. Briefly comment on the reconstructions.  

<div class="alert alert-warning">

Solution_4_3_1
    
</div>

<div class="alert alert-warning">

Solution_4_3_2
    
</div>

<div class="alert alert-warning">
    
Solution_4_3_3
    
</div>

<br><br><br><br>

## Exercise 5: More interpretation of PCA components
<hr>

### 5.1 Visualization
rubric={viz:2}    

One of the use cases of PCA is visualization. It's not possible to visualize high-dimensional data directly. But since the first couple of PCA components usually capture a lot of information from the data, we can plot the first two components, for example, to get an intuition for the patterns in the data. 
    
**Your tasks:**       

1. Make a scatterplot of the first two dimensions in the transformed data $Z$ (from 3.2).       

<div class="alert alert-warning">
    
Solution_5_1_1
    
</div>

<br><br>

### 5.2 Image tiling 
rubric={accuracy:5,reasoning:3}    

In this exercise you will attempt to interpret PCA components. One way to interpret the components is as follows:
- Create an $m \times m$ grid which roughly spans the scatterplot region from the previous exercise. For example, if `Z` is your transformed data, the following range of values will get you five points spanning the first principal component. 
```
np.linspace(np.min(Z[:, 0]), np.max(Z[:, 0]), 5)
```
- Once you have representative points which span the first principal component, get the indices of the data points closest to these points. 
- Plot the images corresponding to these indices as a grid and observe whether you see any pattern as the first principal component increases (left to right of the grid) and as the second principal component increases (bottom to top of the grid). 
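
The first two steps above can be sketched for a single component as follows. The array `z1` stands in for the first column of `Z` and is made up for illustration. The approach shown (absolute differences plus `argmin`) is one possible way to find the closest points, not the only one.

```python
import numpy as np

# Made-up values standing in for Z[:, 0]
z1 = np.array([-4.0, -1.5, 0.2, 0.9, 2.6, 4.0])

# Five evenly spaced representative points spanning the component
grid = np.linspace(z1.min(), z1.max(), 5)  # -4, -2, 0, 2, 4

# For each grid point, the index of the data point at the smallest distance
closest = np.abs(z1[:, None] - grid[None, :]).argmin(axis=0)
print(closest)
```

For a two-dimensional grid, the same idea applies with Euclidean distance over the first two columns of `Z`.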

Let's try this out with $m=5$, i.e., a $5 \times 5$ grid. 
    
**Your tasks:**
    
1. Make a $5 \times 5$ grid that roughly spans the scatterplot region from the previous exercise (but stays within it). For each point on that grid, select the animal face whose first two principal components are closest to the grid point. Plot a $5 \times 5$ tiling of these animal faces, corresponding to the $5 \times 5$ principal component grid using the `img_tiling` function below. 
    
2. What happens to the images as the first principal component increases (i.e., go from left to right of the grid)? What about the second principal component?    

<div class="alert alert-info">
    
Hint: The function to make the $5 \times 5$ tiling plot is provided below   
    
</div>


In [None]:
def img_tiling(idx, size=10, image_shape=(100, 100)):
    """
    Plots a 5x5 tiling of faces.

    Parameters:
    -----------
    idx: the indexes of the faces to be plotted. This should be a 5x5 matrix, where each
         element is an index corresponding to the closest animal face of that grid point.

    size: the desired size of the plot;
    """
    idx = np.array(idx, dtype="int32")  # Just making sure the indexes are int

    plt.figure(figsize=(size, size))  # Creating the image with the desired size

    tile_size = 5
    # Plotting the 5x5 tiling
    for i in range(tile_size):
        for j in range(tile_size):
            face = np.reshape(
                animals[idx[i, j]], (image_shape)
            )  # Obtain the closest face
            plt.imshow(
                face, extent=(i * 32, (i + 1) * 32, j * 32, (j + 1) * 32)
            )  # Plot the closest animal face

    plt.xlim((0, 160))
    plt.ylim((0, 160))
    plt.xticks([])
    plt.yticks([])

<div class="alert alert-warning">
    
Solution_5_2_1
    
</div>

<br><br><br><br>

## Exercise 6: Non-negative Matrix Factorization (NMF)
<hr>

### 6.1 Dimensionality reduction with NMF
rubric={accuracy:2}

**Your tasks:**

1. Carry out dimensionality reduction with [NMF](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html) with `n_components=15` and `random_state=42` and create transformed data `Z_nmf`. 

<div class="alert alert-warning">
    
Solution_6_1_1
    
</div>

<br><br>

### 6.2 Examining NMF components 
rubric={accuracy:3,reasoning:3}

**Your tasks:**

1. Write code to display NMF components and show images corresponding to some of the interesting components using the function `plot_strong_comp_images` from 4.2.  
2. Compare the components and representative images with the ones you got with PCA in Exercises 4.1 and 4.2. Briefly discuss your observations. 

<div class="alert alert-warning">
    
Solution_6_2_1
    
</div>

<div class="alert alert-warning">
    
Solution_6_2_2
    
</div>

<br><br><br><br>

## (optional) Exercise 7: Clustering animal faces   
<hr>

rubric={reasoning:1}

**Your tasks:**

1. Reduce dimensionality of the data using PCA with `n_components=15` and `random_state=123` and 
explore clustering animal faces with KMeans or other clustering methods of your choice.  
2. Comment on the clustering results. 

<div class="alert alert-warning">
    
Solution_7_1
    
</div>

<div class="alert alert-warning">
    
Solution_7_2
    
</div>

<br><br><br><br>

**PLEASE READ BEFORE YOU SUBMIT:** 

When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from "1" will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Push all your work to your GitHub lab repository. 
4. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission. 
5. Make sure that the plots and output are rendered properly in your submitted file. If the .ipynb file is too big and doesn't render on Gradescope, also upload a pdf or html in addition to the .ipynb so that the TAs can view your submission on Gradescope. 

Well done!! Have a great weekend! 

In [None]:
from IPython.display import Image

Image("eva-happy-caturday.png")