# PCA Implementation in NumPy

## Objectives
* Implement PCA in NumPy step by step
* Compare the results with Scikit-learn

## Implementation

In [1]:
# Use black formatter
# %load_ext lab_black

import numpy as np
from sklearn.decomposition import PCA as SklearnPCA

1. Given a dataset $X \in \mathbb{R}^{n \times d}$ with $n$ samples and $d$ features, we want to reduce its dimensionality to $m$. The first step is to center the dataset (Hint: use np.mean).

In [2]:
n = 10
d = 5
m = 3
X = np.random.uniform(size=(n, d))
norm_X = X - np.mean(X, axis=0)
norm_X

array([[-0.12719469,  0.08470232, -0.44863779,  0.00920519, -0.40952666],
       [-0.07620155,  0.27900397, -0.02912941, -0.24172075, -0.08545116],
       [-0.18128711, -0.53777687,  0.13518053, -0.12975136, -0.38734422],
       [ 0.38223733,  0.04800193,  0.27454986, -0.15366669,  0.18206238],
       [-0.06659283, -0.14569582,  0.04592173, -0.27459537,  0.45049785],
       [-0.28021031,  0.30820375,  0.18497272,  0.03703955,  0.40990637],
       [ 0.10580334,  0.27887353, -0.33716434, -0.02364514, -0.11007374],
       [-0.33973419, -0.01287795,  0.24588706, -0.19469135, -0.49407078],
       [ 0.20017935, -0.29258074, -0.240896  ,  0.55194595,  0.13695499],
       [ 0.38300067, -0.00985412,  0.16931563,  0.41987996,  0.30704498]])
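
As a quick sanity check on the centering step, the sketch below verifies that every feature (column) has a mean of approximately zero after subtracting the column means. The seed and the name `X_centered` are assumptions made for this illustration only; the cell above uses unseeded data and the name `norm_X`.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only so the example is reproducible
X = rng.uniform(size=(10, 5))

# Subtract the per-feature (column) mean; broadcasting applies it to every row.
X_centered = X - np.mean(X, axis=0)

# After centering, each column should average to zero up to floating-point error.
col_means = X_centered.mean(axis=0)
print(np.allclose(col_means, 0.0))
```

Note that `axis=0` averages over samples, which is what PCA needs; `axis=1` would instead center each sample across its features.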

2. Compute the covariance matrix of $X^T$; review the theory to see why we use the transpose. Look in the NumPy documentation for functions that can be used.

In [3]:
cov_mat = np.cov(norm_X.T)
cov_mat

array([[ 0.06636477, -0.0023684 , -0.00197248,  0.03609834,  0.03941926],
       [-0.0023684 ,  0.07293106, -0.00930077, -0.0136167 ,  0.01686489],
       [-0.00197248, -0.00930077,  0.06588205, -0.01826059,  0.02388602],
       [ 0.03609834, -0.0136167 , -0.01826059,  0.07723827,  0.02599478],
       [ 0.03941926,  0.01686489,  0.02388602,  0.02599478,  0.12204703]])
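
One way to see why the transpose appears: by default `np.cov` treats each row as a variable and each column as an observation, while our dataset stores samples in rows. The sketch below, on hypothetical seeded data, compares three equivalent ways of obtaining the $d \times d$ covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative data, not the notebook's values
X = rng.uniform(size=(10, 5))
n = X.shape[0]
X_centered = X - X.mean(axis=0)

# Option 1: transpose so that rows become features (what the cell above does).
cov_transposed = np.cov(X_centered.T)

# Option 2: keep the layout and tell np.cov that columns are the variables.
cov_rowvar = np.cov(X_centered, rowvar=False)

# Option 3: explicit formula with the unbiased (n - 1) normalization.
cov_manual = X_centered.T @ X_centered / (n - 1)

all_equal = np.allclose(cov_transposed, cov_rowvar) and np.allclose(
    cov_transposed, cov_manual
)
print(cov_transposed.shape, all_equal)
```

Without the transpose (and with the default `rowvar=True`), `np.cov` would return an $n \times n$ matrix describing covariance between samples rather than between features.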

3. Compute the eigenvalues and eigenvectors of the covariance matrix. Check the NumPy documentation.

In [4]:
eig_values, eig_vectors = np.linalg.eig(cov_mat)
eig_values, eig_vectors

(array([0.16318074, 0.0342833 , 0.02720429, 0.10046836, 0.07932649]),
 array([[-0.46365911,  0.80317257, -0.25834784,  0.27049314,  0.00489022],
        [-0.06315572,  0.2160573 ,  0.38902658, -0.3927399 ,  0.80233677],
        [-0.10146272,  0.28960477,  0.55001334, -0.49774253, -0.59629866],
        [-0.39962133, -0.234209  ,  0.63563339,  0.61706613,  0.02546611],
        [-0.78169028, -0.41171384, -0.2745364 , -0.37956633, -0.00334438]]))
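
Because a covariance matrix is symmetric, `np.linalg.eigh` is a possible alternative to `np.linalg.eig`: it returns real eigenvalues in ascending order. The sketch below, on hypothetical seeded data, also checks the defining relation $C v = \lambda v$ for each eigenvector column. It is a suggestion, not a required change to the cell above.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative data only
X = rng.uniform(size=(10, 5))
cov_mat = np.cov(X - X.mean(axis=0), rowvar=False)

# eigh assumes a symmetric (Hermitian) input; eigenvalues come back ascending.
eig_values, eig_vectors = np.linalg.eigh(cov_mat)

# Column j of eig_vectors pairs with eig_values[j]:
# cov_mat @ v_j should equal eig_values[j] * v_j.
relation_holds = np.allclose(cov_mat @ eig_vectors, eig_vectors * eig_values)

# Eigenvectors of a symmetric matrix returned by eigh are orthonormal.
orthonormal = np.allclose(eig_vectors.T @ eig_vectors, np.eye(5))
print(relation_holds, orthonormal)
```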

4. Sort the eigenvectors in order of decreasing eigenvalue; review the theory if necessary.

In [5]:
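# argsort returns indices, not values; [::-1] reverses them into descending order
sorted_idx = np.argsort(eig_values)[::-1]
# eigenvectors are stored as columns, so reorder the columns
sorted_eig_vectors = eig_vectors[:, sorted_idx]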
sorted_eig_vectors

array([[-0.46365911,  0.27049314,  0.00489022,  0.80317257, -0.25834784],
       [-0.06315572, -0.3927399 ,  0.80233677,  0.2160573 ,  0.38902658],
       [-0.10146272, -0.49774253, -0.59629866,  0.28960477,  0.55001334],
       [-0.39962133,  0.61706613,  0.02546611, -0.234209  ,  0.63563339],
       [-0.78169028, -0.37956633, -0.00334438, -0.41171384, -0.2745364 ]])
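
The sorting step is easier to follow on a toy case. The values below are made up for illustration: the identity matrix stands in for the eigenvectors so that the column reordering is visible at a glance.

```python
import numpy as np

eig_values = np.array([0.5, 2.0, 1.0])
eig_vectors = np.eye(3)  # toy eigenvectors: column i pairs with eig_values[i]

# Indices that sort the eigenvalues from largest to smallest.
order = np.argsort(eig_values)[::-1]

sorted_values = eig_values[order]
# Reorder columns (not rows), since each eigenvector is a column.
sorted_vectors = eig_vectors[:, order]

print(order)          # [1 2 0]
print(sorted_values)  # [2.  1.  0.5]
```

Indexing rows instead of columns (`eig_vectors[order]`) would silently scramble the components of each vector, so the `[:, order]` form matters.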

5. Project the centered dataset onto the $m$ most relevant eigenvectors (Hint: use np.dot).

In [6]:
m_eig_vectors = sorted_eig_vectors[:, :m]
proy_ds = np.dot(norm_X, m_eig_vectors)
proy_ds

array([[ 0.41558997,  0.31675758,  0.33646392],
       [ 0.18405951, -0.23241234,  0.23498242],
       [ 0.45893794,  0.16184208, -0.5149815 ],
       [-0.2890239 , -0.2160421 , -0.12785296],
       [-0.20699707, -0.32408686, -0.15310534],
       [-0.24353222, -0.42163847,  0.1351863 ],
       [ 0.06303315,  0.11410502,  0.42508451],
       [ 0.59739894, -0.14183133, -0.16192163],
       [-0.37752062,  0.5775631 , -0.07652552],
       [-0.60194569,  0.16574333, -0.09733021]])
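
A useful property to check after projecting: the covariance matrix of the projected data should be diagonal, with the $m$ largest eigenvalues on the diagonal. The sketch below verifies this on hypothetical seeded data, using `np.linalg.eigh` and the `@` operator (equivalent to `np.dot` for 2-D arrays).

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative data only
X = rng.uniform(size=(10, 5))
m = 3

X_centered = X - X.mean(axis=0)
eig_values, eig_vectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))

# Keep the m eigenvectors with the largest eigenvalues.
order = np.argsort(eig_values)[::-1]
top_values = eig_values[order[:m]]
top_vectors = eig_vectors[:, order[:m]]

# Project: (n, d) @ (d, m) -> (n, m)
projected = X_centered @ top_vectors

# The projected features are uncorrelated, with variances equal to the
# selected eigenvalues.
proj_cov = np.cov(projected, rowvar=False)
is_diagonal = np.allclose(proj_cov, np.diag(top_values))
print(projected.shape, is_diagonal)
```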

6. Consolidate the previous steps into a PCA function or class.

In [7]:
class PCA:
    """
    Principal Component Analysis
    """

    def __call__(self, X, m):
        """
        Reduce the dimensionality of the data X from (n)x(d) to (n)x(m) using PCA
        """
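        # Center each feature (column) at zero
        norm_X = X - np.mean(X, axis=0)
        # d x d covariance matrix between features
        cov_mat = np.cov(norm_X.T)
        eig_values, eig_vectors = np.linalg.eig(cov_mat)
        # Indices of the eigenvalues in descending order
        sorted_idx = np.argsort(eig_values)[::-1]
        sorted_eig_vectors = eig_vectors[:, sorted_idx]
        # Project onto the m leading eigenvectors
        return np.dot(norm_X, sorted_eig_vectors[:, :m])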


X = np.random.uniform(size=(8, 4))

pca = PCA()
pca(X, 2)

array([[ 0.20410476, -0.26977774],
       [ 0.09866819,  0.24047168],
       [-0.11402813,  0.42601736],
       [-0.81694402, -0.20032535],
       [ 0.18196192,  0.17046892],
       [ 0.06203172,  0.19147816],
       [-0.06620393, -0.24165323],
       [ 0.45040949, -0.3166798 ]])

7. Compare the results with the PCA model implemented in Scikit-learn ([see documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)). Use the following dataset:

$X=\begin{bmatrix}
0.8 & 0.7\\
0.1 & -0.1
\end{bmatrix}$

Reduce it to one component. Verify the results with np.testing.assert_allclose.

In [8]:
X = np.array([[0.8, 0.7], [0.1, -0.1]])
n, d = X.shape
m = 1

pca = PCA()
pca_result = pca(X, m)

sklearn_pca = SklearnPCA(n_components=m)
sklearn_pca_result = sklearn_pca.fit_transform(X)

pca_result, sklearn_pca_result

print("My PCA\n", pca_result)
print()
print("Sklearn PCA\n", sklearn_pca_result)
print()

np.testing.assert_allclose(pca_result, sklearn_pca_result, atol=1e-5)
print("\x1b[32mTest passed\x1b[0m")

My PCA
 [[-0.53150729]
 [ 0.53150729]]

Sklearn PCA
 [[-0.53150729]
 [ 0.53150729]]

Test passed
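
One caveat about this comparison: principal components are only defined up to sign, so two correct implementations may return projections that differ by a factor of $-1$ per column. The test above happens to agree in sign, but that is not guaranteed in general. The sketch below shows one possible way to make the comparison sign-invariant. It uses NumPy only, with an SVD-based projection standing in for a second implementation (Scikit-learn's PCA is also built on an SVD of the centered data). The helper names `pca_eig`, `pca_svd`, and `align_signs` are hypothetical and not part of any library.

```python
import numpy as np


def pca_eig(X, m):
    """Eigendecomposition-based PCA, mirroring the steps in this notebook."""
    X_centered = X - X.mean(axis=0)
    eig_values, eig_vectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
    order = np.argsort(eig_values)[::-1]
    return X_centered @ eig_vectors[:, order[:m]]


def pca_svd(X, m):
    """SVD-based PCA: rows of Vt are the principal directions."""
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:m].T


def align_signs(A, B):
    """Flip columns of A so each correlates non-negatively with B's column."""
    signs = np.sign(np.sum(A * B, axis=0))
    signs[signs == 0] = 1.0  # leave columns with zero overlap untouched
    return A * signs


X = np.array([[0.8, 0.7], [0.1, -0.1]])
result_eig = pca_eig(X, 1)
result_svd = pca_svd(X, 1)

# Compare after removing the sign ambiguity.
np.testing.assert_allclose(align_signs(result_eig, result_svd), result_svd, atol=1e-8)
print("Sign-invariant test passed")
```

The same `align_signs` idea could be applied before calling `np.testing.assert_allclose` on the Scikit-learn output, under the assumption that both results have the same shape.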
