# Compressing Data via Dimensionality Reduction

There are various techniques for reducing the dimensionality of a dataset while summarizing its information content.

## Principal Component Analysis

**PCA** is an unsupervised linear transformation technique that is widely used across different fields. PCA helps us identify patterns in data based on the correlation between features. 

To perform PCA we have to follow these steps:

- Standardize the $d$-dimensional dataset
- Construct the covariance matrix
- Decompose the covariance matrix into eigenvectors and eigenvalues
- Select $k$ eigenvectors that correspond to the $k$ largest eigenvalues, where $k$ is the dimensionality of the new feature subspace
- Construct a projection matrix $W$ from the top $k$ eigenvectors
- Transform the $d$-dimensional input dataset $X$ using the projection matrix $W$
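
The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the helper name `pca_project` and the synthetic array `X_demo` are made up for this example and are not part of the chapter's code, and `np.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

def pca_project(X, k):
    """Hypothetical helper: project X (n_samples x d) onto its top-k principal components."""
    # Step 1: standardize each feature to zero mean and unit variance
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: construct the d x d covariance matrix of the features
    cov = np.cov(X_std, rowvar=False)
    # Step 3: decompose it into eigenvalues and eigenvectors
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    # Step 4: pick the indices of the k largest eigenvalues
    order = np.argsort(eig_vals)[::-1][:k]
    # Step 5: build the d x k projection matrix W from the matching eigenvectors
    W = eig_vecs[:, order]
    # Step 6: transform the standardized data into the k-dimensional subspace
    return X_std @ W, eig_vals[order]

# Synthetic data (an assumption for illustration): 100 samples, 5 features,
# with the second feature strongly correlated with the first
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
X_demo[:, 1] = 2.0 * X_demo[:, 0] + 0.1 * rng.normal(size=100)

X_pca, top_vals = pca_project(X_demo, k=2)
print(X_pca.shape)  # (100, 2)
```

The variance of each projected column equals the corresponding eigenvalue, which is why the eigenvalues can be read as the amount of variance each principal component captures.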

### Total and explained variance

We will start by tackling the first four steps described above. We will use the **Wine** dataset from the previous chapter.

In [2]:
import pandas as pd

df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)

Next, let's split the data into training and test sets and standardize the features to zero mean and unit variance.

In [3]:
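# train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Column 0 holds the class label; the remaining 13 columns are the features
x, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.3, random_state=0)
# Fit the scaler on the training set only, then reuse its statistics on the test set
sc = StandardScaler()
x_train_std = sc.fit_transform(x_train)
x_test_std = sc.transform(x_test)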
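
To make the standardization step more concrete, here is a small sketch of what `StandardScaler` computes: each feature is shifted by the training mean and divided by the training standard deviation, and the same training statistics are applied to the test set. The arrays `train` and `test` are synthetic stand-ins (an assumption for illustration), not the Wine data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 4 features with mean 5 and std 3
rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=3.0, size=(50, 4))
test = rng.normal(loc=5.0, scale=3.0, size=(20, 4))

scaler = StandardScaler()
train_std = scaler.fit_transform(train)  # learns mean and std from train
test_std = scaler.transform(test)        # reuses the training statistics

# Manual equivalent: z = (x - mu) / sigma, with mu and sigma from the training set
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population standard deviation (ddof=0)
manual_test_std = (test - mu) / sigma

print(np.allclose(test_std, manual_test_std))
```

Note that the standardized test set will generally not have exactly zero mean and unit variance, because it is scaled with the training set's statistics rather than its own.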