# Principal Components Analysis

This file details the steps we applied to perform principal component analysis, following the document *A tutorial on Principal Components Analysis* [[1]](#referencia-1):

- [Step 1: getting the data](#primer-paso)
- [Step 2: subtracting the means](#segundo-paso)
- [Step 3: computing the covariance matrix](#tercer-paso)
- [Step 4: computing the eigenvalues and eigenvectors of the covariance matrix](#cuarto-paso)
- [Step 5: choosing components to form the feature vector](#quinto-paso)
- [Step 6: deriving the new dataset](#sexto-paso)

## Imports and configuration

In [1]:
import sys
import matplotlib.pyplot as plt
import numpy as np
from pandas import options, DataFrame
from sklearn.utils.extmath import randomized_svd

In [2]:
sys.path.append('../../../lab3/ej7/src')  # allows importing modules from other directories

In [3]:
from ds_preprocessing import DataSetPreprocessor
from arff_helper import DataSet

In [4]:
options.display.max_rows = None  # show every row of a DataFrame

## Step 1: getting the data <a id='primer-paso'></a>
We load the full dataset (unlike the previous lab, where it was split into two subsets, one for training and one for testing):

In [5]:
ds = DataSet()
ds.load_from_arff('../../../lab2/ej5/datasets/Autism-Adult-Data.arff')

We remove the `result` attribute, which earlier assignments already showed not to be useful:

In [6]:
ds.remove_attribute('result')

We transform the dataset into one with numeric attributes only:

In [7]:
target_attribute = 'Class/ASD'
preprocessor = DataSetPreprocessor(ds, target_attribute)
df = preprocessor.transform_to_rn()

We remove the target attribute:

In [8]:
df = df.drop(columns=target_attribute)

## Step 2: subtracting the means <a id='segundo-paso'></a>
First we compute the mean of each attribute to check whether it is already centered at 0. Recall that after one-hot encoding, each discrete attribute is represented by several binary attributes.

In [9]:
means = []
for a in df.columns.values:
    means.append([a, df[a].mean()])
means = DataFrame(means, columns=['attribute', 'mean'])
means

|   | attribute | mean |
|---|---|---|
| 0 | A1_Score | 0.721591 |
| 1 | A2_Score | 0.453125 |
| 2 | A3_Score | 0.457386 |
| 3 | A4_Score | 0.495739 |
| 4 | A5_Score | 0.49858 |
| 5 | A6_Score | 0.284091 |
| 6 | A7_Score | 0.417614 |
| 7 | A8_Score | 0.649148 |
| 8 | A9_Score | 0.323864 |
| 9 | A10_Score | 0.573864 |


We see that the mean of the 'age' attribute is already 0. This is because the call to ```preprocessor.transform_to_rn()``` already standardized this attribute: the formula $\frac{v-\mu}{\sigma}$ was applied to each value $v$ (with $\mu$ the mean and $\sigma$ the standard deviation).

For every attribute except 'age', we subtract its mean:

In [10]:
for a in [a for a in df.columns.values if a != 'age']:
    df[a] = df[a] - means[means['attribute'] == a]['mean'].values[0]
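
The loop above subtracts each column's mean one attribute at a time. As a minimal sketch of the same idea on made-up data (plain NumPy rather than the project's DataFrame), the subtraction can be done for all columns at once with broadcasting:

```python
import numpy as np

# Toy data: 4 samples (rows) x 3 attributes (columns); the values are made up.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 4.0],
              [1.0, 1.0, 6.0],
              [0.0, 0.0, 8.0]])

# Mean of each attribute (column), subtracted from every row via broadcasting.
column_means = X.mean(axis=0)
X_centered = X - column_means

# After centering, every column mean is 0 up to floating-point error.
print(np.allclose(X_centered.mean(axis=0), 0.0))
```

With a pandas DataFrame, the expression `df - df.mean()` behaves the same way, since the means are aligned with the column labels.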

## Step 3: computing the covariance matrix <a id='tercer-paso'></a>
We compute the covariance matrix $C^{103 \times 103} = (c_{i,j})$, with $c_{i,j} = cov(Dim_i, Dim_j)$, where:

- $Dim_i$ is the $i$-th attribute in the DataFrame
- $cov(X,Y)=\frac{\sum_{k=1}^{n}(X_k-\bar{X})(Y_k-\bar{Y})}{n-1}$, where $n$ is the number of data points for $X$ and $Y$

In [11]:
C = df.cov()
C

Unnamed: 0,A1_Score,A2_Score,A3_Score,A4_Score,A5_Score,A6_Score,A7_Score,A8_Score,A9_Score,A10_Score,...,contry_of_res_Uruguay,contry_of_res_Viet Nam,used_app_before_no,used_app_before_yes,age_desc_18 and more,relation_Health care professional,relation_Others,relation_Parent,relation_Relative,relation_Self
A1_Score,0.201183,0.002578,0.016569,0.028684,0.038011,0.022307,0.048154,0.031626,0.030551,0.026283,...,0.000396,-0.002287,0.00236,-0.00236,0.0,0.000162,-0.000865,-0.000113,-0.000291,0.017538
A2_Score,0.002578,0.248155,0.05561,0.039629,0.03834,0.041785,-0.010268,0.008424,0.04792,0.016981,...,0.000778,0.001045,0.000622,-0.000622,0.0,0.000267,0.002467,0.000489,0.006134,0.004934
A3_Score,0.016569,0.05561,0.248537,0.102948,0.066084,0.060488,0.019244,0.004231,0.073565,0.041559,...,0.000772,0.001014,-0.003572,0.003572,0.0,0.000242,-0.000408,-0.009771,0.010232,0.020262
A4_Score,0.028684,0.039629,0.102948,0.250337,0.076808,0.066646,0.037344,0.002059,0.076773,0.052236,...,0.000717,0.000742,-0.001495,0.001495,0.0,2.4e-05,0.000742,0.000303,0.001592,0.018811
A5_Score,0.038011,0.03834,0.066084,0.076808,0.250354,0.088598,0.058916,0.024394,0.092922,0.06625,...,0.000713,-0.000701,-2.4e-05,2.4e-05,0.0,-0.001414,-0.002124,-0.004166,0.004324,0.026659
A6_Score,0.022307,0.041785,0.060488,0.066646,0.088598,0.203673,0.039086,0.02158,0.101319,0.065757,...,0.001018,-0.002021,-0.005108,0.005108,0.0,0.001229,0.000824,0.005399,0.0086,0.003847
A7_Score,0.048154,-0.010268,0.019244,0.037344,0.058916,0.039086,0.243558,0.020129,0.04379,0.061571,...,0.000828,-0.001548,0.001439,-0.001439,0.0,0.000469,-0.001548,-0.002675,-0.002409,0.015655
A8_Score,0.031626,0.008424,0.004231,0.002059,0.024394,0.02158,0.020129,0.228079,0.022752,0.023818,...,0.000499,-0.00035,0.002546,-0.002546,0.0,-0.000849,-0.00035,-0.004918,-0.001673,0.021543
A9_Score,0.030551,0.04792,0.073565,0.076773,0.092922,0.101319,0.04379,0.022752,0.219287,0.06566,...,0.000962,-0.002303,-0.003007,0.003007,0.0,0.001002,0.000542,0.003993,0.00417,0.005609
A10_Score,0.026283,0.016981,0.041559,0.052236,0.06625,0.065757,0.061571,0.023818,0.06566,0.244892,...,0.000606,0.000186,0.002683,-0.002683,0.0,-0.001843,0.000186,0.000436,-0.002942,0.024812
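
As a sanity check on the covariance formula used in this step, the sketch below implements it directly and compares it against NumPy's `np.cov`, which also divides by $n-1$ by default (as `DataFrame.cov()` does). The function name `sample_covariance` and the data are illustrative assumptions, not part of the project code.

```python
import numpy as np

def sample_covariance(x, y):
    """cov(X, Y) with the n - 1 denominator (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.shape[0]
    return np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Made-up data for two attributes.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 9.0])

manual = sample_covariance(x, y)
# np.cov(x, y) returns a 2 x 2 matrix; entry [0, 1] is cov(x, y).
reference = np.cov(x, y)[0, 1]
print(np.isclose(manual, reference))
```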


## Step 4: computing the eigenvalues and eigenvectors of the covariance matrix <a id='cuarto-paso'></a>

Using the SVD decomposition [[2]](#referencia-2) we look for three matrices $U$, $\Sigma$ and $V$ such that $C = U \Sigma V^T$, where $C$ is our covariance matrix. In the general case, the diagonal of $\Sigma$ holds the singular values of $C$. Moreover, for the vector $u_i$ given by column $i$ of $U$, the vector $v_i$ given by column $i$ of $V$ (that is, row $i$ of $V^T$) and the singular value $\sigma_i$, SVD guarantees the following equalities:

- $Cv_i = \sigma_iu_i$
- $C^Tu_i = \sigma_iv_i$

In our case, since **$C$ is symmetric** and, being a covariance matrix, positive semidefinite, we have $u_i = v_i$ for every nonzero singular value, so $U = V$ can be chosen and $Cu_i = \sigma_iu_i$. In other words, **the eigenvectors of $C$ can be read either from the columns of $U$ or from the rows of $V^T$, and their eigenvalues coincide with the singular values**.
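
The relationship between the SVD and the eigendecomposition of a covariance matrix can be checked numerically. The following is a minimal sketch on random data, using NumPy's exact `np.linalg.svd` and `np.linalg.eigh` rather than the randomized routine used in this notebook; the data and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))      # made-up data: 50 samples x 4 attributes
C = np.cov(X, rowvar=False)       # 4 x 4, symmetric positive semidefinite

# np.linalg.svd returns U, the singular values, and V^T (not V).
U, sigma, Vt = np.linalg.svd(C)

# np.linalg.eigh is meant for symmetric matrices; eigenvalues come back ascending.
eigenvalues, eigenvectors = np.linalg.eigh(C)

# The singular values equal the eigenvalues once both are in descending order.
print(np.allclose(sigma, eigenvalues[::-1]))
# Each column u_i of U satisfies C u_i = sigma_i u_i.
print(np.allclose(C @ U, U * sigma))
# The rows of V^T are the columns of U.
print(np.allclose(U, Vt.T))
```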

In [18]:
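# as_matrix() was removed in pandas 1.0; to_numpy() is its replacement.
# The third value returned by randomized_svd is V^T (one right singular vector per row).
U, Sigma, V = randomized_svd(C.to_numpy(), n_components=C.shape[0], n_iter=5, random_state=None)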
Sigma

array([1.04165433e+00, 7.57102217e-01, 5.04719841e-01, 3.35674069e-01,
       2.66411006e-01, 2.55283093e-01, 2.37837675e-01, 2.18997858e-01,
       2.06315311e-01, 2.02154466e-01, 1.78471961e-01, 1.73674207e-01,
       1.61103238e-01, 1.49604025e-01, 1.42325886e-01, 1.34227986e-01,
       1.28132120e-01, 1.09692130e-01, 1.07433050e-01, 9.16652159e-02,
       7.60793364e-02, 7.51918238e-02, 6.01558482e-02, 5.61046435e-02,
       4.65421460e-02, 3.92638363e-02, 3.75561390e-02, 3.22441302e-02,
       3.09363158e-02, 2.82839497e-02, 2.17324054e-02, 1.89127670e-02,
       1.74828097e-02, 1.61779117e-02, 1.51845201e-02, 1.36310033e-02,
       1.22251704e-02, 1.11976769e-02, 1.03052577e-02, 9.57023390e-03,
       9.42936071e-03, 8.02543221e-03, 7.06273507e-03, 6.93659457e-03,
       6.67412895e-03, 6.57812398e-03, 6.14742552e-03, 6.05148005e-03,
       5.65556071e-03, 5.61166481e-03, 5.22331809e-03, 4.96026652e-03,
       4.73171169e-03, 4.18618275e-03, 4.11098799e-03, 4.01256386e-03,
      

What we actually obtain here is an approximation of the diagonal of $\Sigma$, that is, of the eigenvalues of $C$ [[3]](#referencia-3)[[4]](#referencia-4). Note that the ```n_components``` parameter was set to the number of rows (equal to the number of columns) of $C$. This is because a symmetric $n \times n$ matrix has $n$ real eigenvalues (counted with multiplicity), and for a covariance matrix they are all non-negative.

## Step 5: choosing components to form the feature vector <a id='quinto-paso'></a>

The goal is to choose the eigenvectors associated with the largest eigenvalues. To do so we first have to sort the eigenvalues in descending order and rearrange the rows of $V$ as returned by ```randomized_svd``` (or the columns of $U$) to match that order. Judging by the contents of $\Sigma$, the eigenvalues appear to be sorted already, but we check anyway:

In [39]:
order = (-Sigma).argsort()
print(np.array_equal(Sigma, Sigma[order]))
print(np.array_equal(V, V[order]))

True
True
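
Here the check confirms that nothing needs to be reordered. For the case where the eigenvalues do arrive unsorted, the sketch below shows how the same `argsort` permutation reorders both the values and their vectors. The numbers are made up, and eigenvectors are stored one per row, as in the third value returned by `randomized_svd`.

```python
import numpy as np

# Made-up eigenvalues in arbitrary order, with one eigenvector per ROW.
eigenvalues = np.array([0.2, 1.5, 0.7])
eigenvectors = np.array([[0.0, 0.0, 1.0],   # paired with 0.2
                         [1.0, 0.0, 0.0],   # paired with 1.5
                         [0.0, 1.0, 0.0]])  # paired with 0.7

# argsort on the negated values yields the indices for descending order.
order = (-eigenvalues).argsort()
sorted_values = eigenvalues[order]
sorted_vectors = eigenvectors[order]   # same permutation applied to the rows

print(sorted_values)
print(sorted_vectors)
```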


The ```pca``` function in the ```pca.py``` module bundles all the steps of the principal component analysis. One of its keyword arguments, ```eigenvalues_condition```, is a boolean function used to select eigenvalues together with their corresponding eigenvectors. As an example, a condition could be: keep every eigenvalue except those with exponent -17, which are numerically zero. A simple threshold written as a Python lambda (not to be confused with the eigenvalues $\lambda$) that discards them, along with any other eigenvalue not exceeding $10^{-4}$, is:

In [1]:
cond = lambda x: x > 1e-4
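
The project's `pca` function is not shown in this notebook, so the following is only a sketch of how such a condition might be applied. The helper `select_components` is hypothetical, and the eigenvalues and eigenvectors are made up.

```python
import numpy as np

def select_components(eigenvalues, eigenvectors, condition):
    """Keep the (eigenvalue, eigenvector) pairs whose eigenvalue satisfies `condition`.

    `eigenvectors` must hold one eigenvector per row.
    Hypothetical helper for illustration; this is not the project's `pca` function.
    """
    mask = np.array([bool(condition(value)) for value in eigenvalues])
    return eigenvalues[mask], eigenvectors[mask]

cond = lambda x: x > 1e-4

eigenvalues = np.array([1.04, 0.76, 3.2e-17])   # made-up values, last one numerically zero
eigenvectors = np.eye(3)                        # made-up vectors, one per row

kept_values, kept_vectors = select_components(eigenvalues, eigenvectors, cond)
print(kept_values)
print(kept_vectors.shape)
```

A boolean mask keeps values and vectors paired, because the same rows are dropped from both arrays.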

## Step 6: deriving the new dataset <a id='sexto-paso'></a>

Once the subset of eigenvalues and eigenvectors has been chosen, the goal is to generate a new dataset using the formula:

$FinalData = RowFeatureVector \times RowDataAdjust$

where $RowFeatureVector$ is the matrix of selected eigenvectors **laid out as rows**, and $RowDataAdjust$ is the dataset after subtracting the means in [step 2](#segundo-paso), **transposed**. Keep in mind that **$FinalData$ does not contain the target attribute column, which has to be added back after all the PCA steps have been applied**.
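
To make the shapes in this formula concrete, here is a minimal sketch on random data. It uses `np.linalg.eigh` instead of the randomized SVD from step 4, and all variable names and data are illustrative assumptions rather than the project's `pca` implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(6, 3))            # made-up data: 6 samples x 3 attributes
data_adjust = data - data.mean(axis=0)    # mean-subtracted data (step 2)

# Eigendecomposition of the covariance matrix.
# eigh returns eigenvectors as COLUMNS, in ascending eigenvalue order.
C = np.cov(data_adjust, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]     # descending order
top2 = eigenvectors[:, order[:2]]         # keep the two leading eigenvectors

row_feature_vector = top2.T               # shape (2, 3): eigenvectors as rows
row_data_adjust = data_adjust.T           # shape (3, 6): attributes as rows

final_data = row_feature_vector @ row_data_adjust   # shape (2, 6)
print(final_data.shape)

# Equivalent form with one sample per row: data_adjust @ top2.
print(np.allclose(final_data.T, data_adjust @ top2))
```

Transposing `final_data` gives one row per sample again, which is the layout to use when adding the target attribute column back.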

With the ```pca``` function mentioned in the previous step we generate the transformed dataset $FinalData$ and save it to a CSV file.

## References

[1] - <a id='referencia-1'></a> Lindsay I Smith, February 26, 2002. A tutorial on Principal Components Analysis

[2] - <a id='referencia-2'></a> Álgebra Lineal Numérica, Facultad de Ingeniería, UdelaR, 2013. ALN-SVD, [https://www.fing.edu.uy/inco/cursos/numerico/aln/SVD_2013.pdf]

[3] - <a id='referencia-3'></a> Scikit Learn, v0.19.1. TruncatedSVD, [http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html]

[4] - <a id='referencia-4'></a> StackOverflow, [https://stackoverflow.com/questions/31523575/get-u-sigma-v-matrix-from-truncated-svd-in-scikit-learn]
