# Dimensionality Reduction
## Introduction to Data Science 
##### Cristobal Donoso O.

November 23, 2018<br>
Universidad de Concepción, CHILE

## Curse of Dimensionality 

- Computation time and storage space shrink as the number of dimensions decreases.
- Removing redundant features can be useful.
- High-dimensional data can be visualized by projecting it onto a lower-dimensional space (2D or 3D).

## Feature Selection
Should we do it or not? Common options: missing-values filter, low-variance filter, high-correlation filter, and random forest.
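As an illustration of two of the filters listed above, here is a minimal sketch using NumPy on synthetic data. The function names `low_variance_filter` and `high_correlation_filter`, the thresholds, and the data are assumptions made for this example, not part of the original notebook.

```python
import numpy as np

def low_variance_filter(X, threshold):
    """Return the indices of the columns whose variance exceeds `threshold`."""
    variances = X.var(axis=0)
    return np.flatnonzero(variances > threshold)

def high_correlation_filter(X, threshold):
    """Greedily keep a column only if its absolute correlation with every
    already-kept column is below `threshold`."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)
    return np.array(kept)

# Synthetic data (illustrative only): column 2 nearly duplicates column 0,
# and column 3 is almost constant.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=500), 1e-3 * rng.normal(size=500)])

var_kept = low_variance_filter(X, threshold=0.01)                     # drops column 3
corr_kept = high_correlation_filter(X[:, var_kept], threshold=0.95)  # drops column 2
selected = var_kept[corr_kept]
print(selected)
```

Both filters only discard original columns. They do not create new variables, which is what separates feature selection from the dimensionality reduction methods discussed next.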

## Dimensionality Reduction

There are techniques that reduce the number of dimensions by replacing the original features with a smaller number of **new variables**. The assumption is that the most important information in the features can be summarized in these variables, which are obtained by applying mathematical transformations to the data.

### Principal Components Analysis (PCA)
PCA is a technique that extracts a new, smaller set of variables from an existing large set of variables.
> The idea behind PCA is to capture as much of the variance in the dataset as possible with the first few principal components.

Ok, but what are the principal components?

In [49]:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np
from util import *
%matplotlib notebook

In [50]:
dat3 = np.fromfile('./data/dat3.dat', dtype=np.float32, sep='\n').reshape((1600,3))
labels = np.fromfile('./data/preswissroll_labels.dat', sep='\n')

In [51]:
# Standardize each column to zero mean and unit variance
scaled_data = (dat3 - dat3.mean(axis=0)) / dat3.std(axis=0)
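The same standardization is available in scikit-learn. The sketch below uses synthetic data as a stand-in for `dat3` (an assumption, since the data file is not available here) and checks that `StandardScaler` agrees with the manual formula. Both use the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's data: three columns with
# different means and spreads.
rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, -2.0, 10.0], scale=[1.0, 3.0, 0.5], size=(200, 3))

# Manual standardization: subtract the column mean, divide by the column std
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# scikit-learn equivalent
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
print(manual.mean(axis=0).round(6), manual.std(axis=0).round(6))
```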

In [52]:
fig = plt.figure(figsize=(8,4))
ax = fig.add_subplot(121, projection='3d')
ax.scatter(dat3[:,0], dat3[:,1], dat3[:,2], c = labels, s=1)
plt.title('Data before standardization')

ax = fig.add_subplot(122, projection='3d')
ax.scatter(scaled_data[:,0], scaled_data[:,1], scaled_data[:,2], c = labels, s=1)
plt.title('Data after standardization')
plt.show()


### Eigendecomposition 
The eigenvalues measure the variance of the data along the new feature axes, which are given by the eigenvectors. We can compute the eigendecomposition from the covariance matrix $\Sigma_{d \times d}$:<br><br>
<center>
$\begin{equation}
\Sigma = \frac{1}{n-1}((X-\bar{x})^T(X-\bar{x}))
\end{equation}$
</center>
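
A short derivation, added here as a reminder of the standard definitions rather than taken from the notebook, shows why the eigenvalues measure variance. For an eigenpair $(\lambda_i, v_i)$ of $\Sigma$ with $v_i$ of unit length, the variance of the centered data projected onto $v_i$ is:

<center>
$\begin{equation}
\Sigma v_i = \lambda_i v_i \quad\Rightarrow\quad \mathrm{Var}\big((X-\bar{x})\,v_i\big) = v_i^T \Sigma\, v_i = \lambda_i\, v_i^T v_i = \lambda_i
\end{equation}$
</center>

The eigenvector with the largest eigenvalue is therefore the direction along which the data varies most, and it is taken as the first principal component.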

In [53]:
mean_vec = np.mean(scaled_data, axis=0) # The mean is calculated for each column 
cov_mat = (scaled_data - mean_vec).T.dot((scaled_data - mean_vec)) / (scaled_data.shape[0]-1) 
# cov_mat2 = np.cov(scaled_data.T) # Numpy style
print('Covariance matrix dimension: {}'.format(cov_mat.shape))

Covariance matrix dimension: (3, 3)
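As a quick sanity check on the formula, the following sketch compares the manual computation with `np.cov` on synthetic data (an assumption, used instead of the notebook's data file). Note that `np.cov` treats rows as variables by default, hence `rowvar=False` (or, equivalently, transposing the input as in the commented-out line above).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))  # 100 samples, 3 features (synthetic)

# Manual covariance: centered data, outer product, divided by n - 1
centered = X - X.mean(axis=0)
cov_manual = centered.T @ centered / (X.shape[0] - 1)

# NumPy equivalent, with columns as variables
cov_numpy = np.cov(X, rowvar=False)

print(cov_manual.shape)
print(np.allclose(cov_manual, cov_numpy))
```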


In [54]:
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)

Eigenvectors 
[[-0.70553952  0.70797908  0.03129889]
 [ 0.09876342  0.05449649  0.99361759]
 [-0.70175475 -0.70412767  0.10837186]]

Eigenvalues 
[ 0.79439366  1.20543575  1.00204682]
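To decide how many components to keep, it can help to express each eigenvalue as a fraction of the total variance. The sketch below reuses the eigenvalues printed above; the variable names are assumptions made for this example. As a consistency check, the eigenvalues sum to about 3.0019, which matches 3 × 1600/1599, the trace of the covariance matrix of three columns standardized with the population standard deviation over 1600 samples.

```python
import numpy as np

# Eigenvalues as printed in the cell output above
eig_vals = np.array([0.79439366, 1.20543575, 1.00204682])

# Sort in descending order and normalize by the total variance
order = np.argsort(eig_vals)[::-1]
explained = eig_vals[order] / eig_vals.sum()
cumulative = np.cumsum(explained)

for rank, (ratio, total) in enumerate(zip(explained, cumulative), start=1):
    print(f'PC{rank}: {ratio:.3f} of the variance, {total:.3f} cumulative')
```

With these values, the first two components together retain roughly 74% of the variance, which is the share kept by the 2D projection built below.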


In [55]:
fig = plt.figure(figsize=(8,4))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(scaled_data[:,0], scaled_data[:,1], scaled_data[:,2], c = labels, s=1)
for v in eig_vecs.T:
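    # Draw each eigenvector as an arrow that starts at the data mean
    a = Arrow3D([mean_vec[0], mean_vec[0] + v[0]], [mean_vec[1], mean_vec[1] + v[1]], [mean_vec[2], mean_vec[2] + v[2]], mutation_scale=20, lw=3, arrowstyle="-|>", color="r")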
    ax.add_artist(a)
plt.title('Data Projection')
plt.show()


In [56]:
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
eig_pairs.sort(key=lambda x: x[0], reverse=True)

In [57]:
eig_pairs

[(1.2054358, array([ 0.70797908,  0.05449649, -0.70412767], dtype=float32)),
 (1.0020468, array([ 0.03129889,  0.99361759,  0.10837186], dtype=float32)),
 (0.79439366, array([-0.70553952,  0.09876342, -0.70175475], dtype=float32))]
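Because a covariance matrix is symmetric, `np.linalg.eigh` is an alternative to `np.linalg.eig`: it returns real eigenvalues in ascending order, so sorting reduces to reversing the result. The following is a minimal sketch on synthetic data (the mixing matrix is an arbitrary assumption), not a replacement for the cells above.

```python
import numpy as np

rng = np.random.default_rng(2)
mix = np.array([[2.0, 0.0, 0.0],
                [0.5, 1.0, 0.0],
                [0.0, 0.3, 0.2]])
X = rng.normal(size=(300, 3)) @ mix   # correlated synthetic features
cov = np.cov(X, rowvar=False)

# eigh is meant for symmetric matrices: ascending eigenvalues,
# orthonormal eigenvectors stored as columns
vals, vecs = np.linalg.eigh(cov)

# Reverse to get the components in descending order of variance
vals, vecs = vals[::-1], vecs[:, ::-1]

# Each column v_i satisfies cov @ v_i = lambda_i * v_i
print(np.allclose(cov @ vecs, vecs * vals))
print(np.allclose(vecs.T @ vecs, np.eye(3)))
```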

In [71]:
matrix_w = np.hstack((eig_pairs[0][1].reshape(3,1), eig_pairs[1][1].reshape(3,1)))

In [72]:
new = []
for s in scaled_data:
    transformed = matrix_w.T.dot(s)
    new.append(transformed)

In [73]:
new_samples = np.array(new)
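The sample-by-sample loop above is equivalent to a single matrix product. The sketch below checks this on synthetic data, with a hypothetical projection matrix `w` standing in for `matrix_w`.

```python
import numpy as np

rng = np.random.default_rng(3)
scaled = rng.normal(size=(50, 3))             # stand-in for scaled_data

# Hypothetical 3x2 projection matrix with orthonormal columns
w, _ = np.linalg.qr(rng.normal(size=(3, 2)))

# Loop version: project one sample at a time
looped = np.array([w.T.dot(s) for s in scaled])

# Vectorized version: project all samples at once
vectorized = scaled @ w

print(looped.shape, vectorized.shape)
print(np.allclose(looped, vectorized))
```

In the notebook's terms this would read `new_samples = scaled_data.dot(matrix_w)`.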

In [75]:
plt.clf()  # clear the current figure before plotting
plt.scatter(new_samples[:,1], new_samples[:,0], c=labels, marker='+')
plt.show()


In [76]:
from sklearn.decomposition import PCA

In [84]:
pca = PCA(n_components=2)
pca.fit(dat3)
xx = pca.transform(dat3)

In [86]:
plt.clf()  # clear the current figure before plotting
# Plot the scikit-learn projection to compare it with the manual one
plt.scatter(xx[:,1], xx[:,0], c=labels, marker='+')
plt.show()
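To compare the manual projection with scikit-learn numerically, both must start from the same input: `PCA` centers the data but does not rescale it, so fitting on unstandardized data will generally give a different result from the eigendecomposition of standardized data. The sketch below uses synthetic data (an assumption) and standardizes it before both paths. The sign of each component is arbitrary, so the comparison is made on absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
mix = np.array([[2.0, 0.0, 0.0],
                [0.8, 1.0, 0.0],
                [0.3, 0.4, 0.5]])
X = rng.normal(size=(400, 3)) @ mix            # correlated synthetic features
X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize, as in the notebook

# Manual PCA: eigendecomposition of the covariance matrix
cov = np.cov(X, rowvar=False)
vals, vecs = np.linalg.eigh(cov)               # ascending order
top2 = vecs[:, ::-1][:, :2]                    # two leading eigenvectors
manual = X @ top2

# scikit-learn PCA on the same standardized data
pca = PCA(n_components=2)
projected = pca.fit_transform(X)

# Projections agree up to the sign of each component
print(np.allclose(np.abs(manual), np.abs(projected)))
# The explained variances are the two largest eigenvalues
print(np.allclose(pca.explained_variance_, vals[::-1][:2]))
```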
