# Assignment 3

This assignment has one main part:

**PCA**: In this part, the goal is to implement the dimensionality reduction technique *Principal Component Analysis (PCA)*, apply it to very high-dimensional data, and visualize the result. Note that you are not allowed to use the built-in PCA API provided by the sklearn library; instead, you will implement it from scratch. Use the data in data/train.csv to compute the PCA. See the detailed instructions below.


For this task we use the MovieLens dataset. The data is in train.csv.


In [145]:
import numpy as np
import pandas as pd
from scipy.linalg import sqrtm

# Part-1a: Convert data to user-movie rating matrix (10 points)
- Read the train.csv file and the movies.dat file, and use user_id and movie_id to create the user-movie rating matrix.


In [146]:
def readMovieRatingData():
    df = pd.read_csv('./train.csv')

    # Pivot the long-format ratings into a user x movie matrix;
    # user-movie pairs without a rating are filled with 0.
    return df.pivot_table(index="user_id",columns="movie_id",values="rating",fill_value=0)



rating_data=readMovieRatingData()
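
The shape produced by this kind of pivot can be checked on a tiny hand-made frame. This is only a sketch: the `toy` data below is invented for illustration and is not part of train.csv.

```python
import pandas as pd

# Toy long-format ratings (made up for illustration): one row per (user, movie) rating.
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 3],
    "movie_id": [10, 20, 10, 30],
    "rating":   [5, 3, 4, 1],
})

# Rows become users, columns become movies; missing pairs are filled with 0.
matrix = toy.pivot_table(index="user_id", columns="movie_id",
                         values="rating", fill_value=0)
print(matrix)
```

Transposing `matrix` gives one row per movie, which is the orientation used when computing PCA for movies.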

In [147]:
def split_categories(categories):
    return categories.split('|')

def readMovieData():
    # Read the movie data from movies.dat ('::'-delimited, no header row)
    # and split the last column (genres) into a list
    df = pd.read_csv('./movies.dat',delimiter="::", encoding="latin_1", engine="python",header=None)
    df.iloc[:, -1] = df.iloc[:, -1].apply(split_categories)
    return df

In [148]:
movie_data = readMovieData()

movie_data

Unnamed: 0,0,1,2
0,1,Toy Story (1995),"[Animation, Children's, Comedy]"
1,2,Jumanji (1995),"[Adventure, Children's, Fantasy]"
2,3,Grumpier Old Men (1995),"[Comedy, Romance]"
3,4,Waiting to Exhale (1995),"[Comedy, Drama]"
4,5,Father of the Bride Part II (1995),[Comedy]
...,...,...,...
3878,3948,Meet the Parents (2000),[Comedy]
3879,3949,Requiem for a Dream (2000),[Drama]
3880,3950,Tigerland (2000),[Drama]
3881,3951,Two Family House (2000),[Drama]


## We are going to compute PCA for movies, so transpose the matrix using X=readMovieRatingData().T


# Part-1b: Preprocessing  (10 points)
Before implementing PCA you are required to perform some preprocessing steps:
1. Mean normalization: Replace each feature value $x_{ji}$ with $x_{ji} - \mu_j$. In other words, compute the mean $\mu_j$ of each feature and subtract it from every value of that feature, so that each feature has mean 0.
2. Feature scaling: If features have very different scales, rescale them so that they have comparable ranges of values, e.g. set $x_{ji}$ to $(x_{ji} - \mu_j) / s_j$, where $s_j$ is some measure of the range, such as $\max(x_j) - \min(x_j)$ or the standard deviation $stddev(x_j)$.
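
The two preprocessing steps can be sketched with NumPy as follows. The array `X_toy` is made-up data, and the layout (samples in rows, features in columns) is an assumption chosen for this illustration.

```python
import numpy as np

# Toy data (an assumption for illustration): 4 samples (rows) x 3 features (columns).
X_toy = np.array([[1.0, 200.0, 0.5],
                  [2.0, 180.0, 0.7],
                  [3.0, 220.0, 0.2],
                  [4.0, 240.0, 0.6]])

mu = X_toy.mean(axis=0)      # per-feature mean, mu_j
s = X_toy.std(axis=0)        # per-feature standard deviation, s_j
s[s == 0] = 1.0              # guard: leave constant features unscaled

X_scaled = (X_toy - mu) / s  # mean normalization followed by feature scaling

print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

After this step every feature has mean 0 and, when scaled by the standard deviation, unit spread.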

In [149]:
# Mean normalization: subtract each row's mean from that row
# (no standard-deviation scaling is applied here)

X=rating_data.T
X_orig=rating_data.T

def subtract_mean(row):
    row_mean = row.mean()
    return row - row_mean
# Apply to each row of the DataFrame; note that iloc[:, 1:] drops the first column
X = X.iloc[:, 1:].apply(subtract_mean, axis=1)


# Part-2: Covariance matrix  (15 points)
Now the preprocessing is finished. Next, as explained in the lecture, you need to compute the covariance matrix (https://en.wikipedia.org/wiki/Covariance_matrix). Given a mean-normalized $n \times m$ matrix ($n$ rows and $m$ columns) whose columns are the samples $x^{(i)}$, the covariance matrix $\Sigma$ is the $n \times n$ matrix given below:
$\Sigma = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} \left(x^{(i)}\right)^{T}$
You may use the "numpy.cov" function in the numpy library.
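
As a quick check of the formula, the sketch below builds $\Sigma$ by hand on random toy data and compares it with `numpy.cov`. The matrix `A` and its size are made up for illustration. Note that `numpy.cov` divides by $m - 1$ by default; `bias=True` selects the $1/m$ normalization used in the formula.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 50))            # n = 3 features (rows), m = 50 samples (columns)
A = A - A.mean(axis=1, keepdims=True)   # mean-normalize each feature

m = A.shape[1]
sigma = (A @ A.T) / m                   # (1/m) * sum_i x^(i) (x^(i))^T, an n x n matrix

# np.cov treats rows as variables by default (rowvar=True);
# bias=True uses 1/m instead of the default 1/(m-1).
sigma_np = np.cov(A, bias=True)
print(np.allclose(sigma, sigma_np))
```

If the samples are stored in rows instead of columns, passing `rowvar=False` tells `numpy.cov` to treat columns as the variables.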

In [150]:
# Compute the covariance matrix of X; rowvar=False treats each column of X as a variable.
cov_matrix = np.cov(X.to_numpy(),rowvar=False)

print(cov_matrix)



[[ 3.74023577e-01 -7.65463008e-05  3.44030740e-03 ... -2.61420216e-03
  -1.41125695e-02  1.34225740e-02]
 [-7.65463008e-05  1.61473221e-01  2.36579409e-02 ...  1.90611688e-02
   2.53589216e-03 -3.84333210e-02]
 [ 3.44030740e-03  2.36579409e-02  9.06833526e-02 ...  2.73558228e-02
   1.21319114e-02 -2.98406591e-02]
 ...
 [-2.61420216e-03  1.90611688e-02  2.73558228e-02 ...  8.78953387e-02
   2.92481781e-02 -3.52144546e-02]
 [-1.41125695e-02  2.53589216e-03  1.21319114e-02 ...  2.92481781e-02
   3.65219387e-01 -3.54441253e-03]
 [ 1.34225740e-02 -3.84333210e-02 -2.98406591e-02 ... -3.52144546e-02
  -3.54441253e-03  8.16282421e-01]]


# Instructions for part 3, 4, and 5
- getSVD() function is expected to return 3 values. For example: ```U, S, V = getSVD(cov_matrix)```
- You can follow the skeleton below to have an idea on how the autograder's test calls your functions:
```
U, S, V = getSVD(cov_matrix)
z = getKComponents(U, X, k)
ratio = getVarianceRatio(z, U, X, k)
```
- Using the built-in PCA implementation in sklearn, the approximate X matrix can be obtained by function ```inverse_transform```

# Part-3: SVD computation  (10 points)
Now compute the SVD on the covariance matrix $SVD(\Sigma)$. You may use the svd implementation in numpy.linalg.svd
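
A minimal sketch of this step with `numpy.linalg.svd`, run on made-up random data (the array `data` and its size are assumptions for illustration). For a covariance matrix, which is symmetric and positive semi-definite, the singular values returned in `S` coincide with its eigenvalues and come back sorted in descending order.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(100, 5))        # toy data: 100 samples x 5 features
data = data - data.mean(axis=0)         # mean-normalize each feature
sigma = np.cov(data, rowvar=False)      # 5 x 5 covariance matrix

# U: principal directions (columns), S: singular values, Vt: transpose of V
U, S, Vt = np.linalg.svd(sigma)

print(U.shape, S.shape, Vt.shape)
print(S)
```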

In [151]:
from scipy.sparse.linalg import svds

def getSVD(cov_matrix,k=2):
    # Truncated SVD: returns (U, S, Vh) for the k largest singular values of cov_matrix
    return svds(cov_matrix, k=k)

In [152]:
def getKComponents(U,X):
    return np.dot(X,U)

# Part-4: Compute PCA matrix (K dimensional)  (10 points)
Now select the first $k$ columns from the matrix $U$ and multiply with $X$ to get $k$ dimensional representation.
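
The projection can be sketched as below on toy data. The helper name `project_top_k` is hypothetical, and the sketch assumes that the columns of `U` are ordered by decreasing singular value, as returned by `numpy.linalg.svd`.

```python
import numpy as np

def project_top_k(U, X, k):
    """Hypothetical helper: project the rows of X onto the first k columns of U."""
    return X @ U[:, :k]

rng = np.random.default_rng(2)
data = rng.normal(size=(100, 5))            # toy data: 100 samples x 5 features
data = data - data.mean(axis=0)             # mean-normalize each feature

U, S, Vt = np.linalg.svd(np.cov(data, rowvar=False))

k = 2
Z = project_top_k(U, data, k)               # 100 x 2 representation
X_approx = Z @ U[:, :k].T                   # mapped back to the original 5 dimensions

print(Z.shape, X_approx.shape)
```

Multiplying by the transpose of the same $k$ columns maps the reduced data back to the original space, which is what the reconstruction error is measured against.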

In [163]:
k=1000

# Truncated SVD of the covariance matrix with k components
U,S,V = getSVD(cov_matrix,k)

# k-dimensional representation of X
Z = getKComponents(U,X)

# Reconstruction of X from the k-dimensional representation
r_X = np.dot(Z, U.T)


# Part-5: Compute Reconstruction Error  (15 points)
Implement a function to compute the variance ratio (from reconstruction error)
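
One common way to define this ratio is the squared reconstruction error divided by the total variation of the mean-normalized data. The sketch below uses that definition on toy data; the helper name `variance_ratio` and the data are assumptions for illustration. With mean-normalized data, the same number can also be read off the singular values of the covariance matrix.

```python
import numpy as np

def variance_ratio(X, X_approx):
    """Hypothetical helper: squared reconstruction error over total variation of X."""
    return np.sum((X - X_approx) ** 2) / np.sum(X ** 2)

rng = np.random.default_rng(3)
data = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
data = data - data.mean(axis=0)               # X must be mean-normalized

U, S, Vt = np.linalg.svd(np.cov(data, rowvar=False))
k = 2
X_approx = data @ U[:, :k] @ U[:, :k].T       # project onto k components, then map back

ratio = variance_ratio(data, X_approx)
ratio_from_S = 1.0 - S[:k].sum() / S.sum()    # same quantity from the singular values
print(ratio, ratio_from_S)
```

A ratio of 0.01 means that 1% of the variance is lost, i.e. 99% is retained.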

In [164]:
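def getVarianceRatio(X_approx, X):
    # X_approx is the reconstruction of X from its k-dimensional projection
    err = np.sum((X-X_approx)**2)       # squared reconstruction error
    X_var = np.var(X) * X.shape[0]      # total variation of X
    ratios= err/X_var
    return np.mean(ratios)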

In [165]:
ratio = getVarianceRatio(r_X, X)

ratio

0.11670552036285839

Compare the variance ratio to the built-in PCA implementation in sklearn https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html (this step is optional)

In [156]:
from sklearn.decomposition import PCA
pca = PCA(n_components=k)
z_pca = pca.fit_transform(X)
X_approx_pca = pca.inverse_transform(z_pca)
ratio_pca = np.mean((X-X_approx_pca).T.dot(X-X_approx_pca))/np.mean(X.T.dot(X))
ratio_pca


user_id
2       0.465742
3      -0.740479
4       0.126669
5      -1.654703
6       0.036748
          ...   
6036    0.128054
6037    0.348989
6038   -0.070896
6039    0.240885
6040    0.106858
Length: 6039, dtype: float64

# Part-6: Scatter plot 2-dimensional PCA  (10 points)
Using matplotlib, draw a 2-dimensional scatter plot of the first 2 components with y (movie genre from the movies.dat file) as labels. Remember that you are plotting movies in 2 dimensions, so you can label them with movie genres.

In [157]:
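import matplotlib.pyplot as plt

def plotFunction(PCA, movie_data):
    # Scatter the first two components, colored by each movie's first listed genre.
    # Rows of PCA follow X_orig.index (movie_id); column 0 of movie_data is the movie id.
    genres = movie_data.set_index(0)[2].reindex(X_orig.index).str[0].fillna("Unknown")
    points = np.asarray(PCA)[:, :2]
    for genre in genres.unique():
        mask = (genres == genre).to_numpy()
        plt.scatter(points[mask, 0], points[mask, 1], s=5, label=genre)
    plt.legend(fontsize="small")
    plt.show()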

In [None]:
# Plot the 2-component projection
plotFunction(getKComponents(getSVD(cov_matrix, 2)[0], X), movie_data)

# Part-7 Find best $K$  (10 points)
Find the minimum value of $K$ for which the ratio of the average squared projection error to the total variation in the data is less than 0.1%; in other words, we retain 99.9% of the variance. You can achieve this by repeating getKComponents, starting with $K=1$ and increasing $K$, until the variance ratio is <= 0.1%.
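
When the full set of singular values of the covariance matrix is available, sorted in descending order as `numpy.linalg.svd` returns them, the search can also be done without recomputing the projection for every $K$. The sketch below is one such shortcut, not the required approach; the helper name `smallest_k` and the values in `S_toy` are made up for illustration.

```python
import numpy as np

def smallest_k(S, threshold=0.001):
    """Hypothetical helper: smallest k with 1 - sum(S[:k]) / sum(S) <= threshold.

    Assumes S is sorted in descending order and threshold > 0.
    """
    lost = 1.0 - np.cumsum(S) / np.sum(S)   # fraction of variance lost for k = 1, 2, ...
    return int(np.argmax(lost <= threshold)) + 1

S_toy = np.array([5.0, 3.0, 1.0, 0.9, 0.1])   # toy singular values
print(smallest_k(S_toy, threshold=0.15))
print(smallest_k(S_toy))                      # default threshold of 0.1%
```

Here `lost` is 0.5, 0.2, 0.1, 0.01, 0 for $K = 1, \dots, 5$, so a 15% threshold is first met at $K=3$ and the 0.1% threshold only at $K=5$.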

In [None]:
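def findBestK(initial, step):
    # Increase K by `step` (K must stay below the matrix dimension for svds) until the ratio is <= 0.1%
    for K in range(initial, cov_matrix.shape[0], step):
        U, S, V = getSVD(cov_matrix, K)
        if getVarianceRatio(np.dot(getKComponents(U, X), U.T), X) <= 0.001:
            return K
    return cov_matrix.shape[0]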

# Part-8: TSNE visualization (10 points)
Finally, having found an optimal $K$, use these components as input data to another dimensionality reduction method called tSNE (https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) and reduce them to 2 dimensions.

In [None]:
from sklearn.manifold import TSNE
# Reduce the K-dimensional PCA representation Z to 2 dimensions
tsne_pca_results = TSNE(n_components=2).fit_transform(Z)

Finally, draw a scatter plot of the components given by tSNE using matplotlib and compare it to the earlier scatter plot.

In [None]:
# Scatter plot the 2-dimensional tSNE components with target as labels
plotFunction(tsne_pca_results, movie_data)