In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir('../notebook_format')
from formats import load_style
load_style()

In [2]:
os.chdir(path)
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = 8, 6 # change default figure size

# 1. magic to print version
# 2. magic so that the notebook will reload external python modules
%load_ext watermark
%load_ext autoreload 
%autoreload 2

%watermark -a 'Ethen' -d -t -v -p numpy,pandas,matplotlib

Ethen 2016-08-09 15:46:44 

CPython 3.5.2
IPython 4.2.0

numpy 1.11.1
pandas 0.18.1
matplotlib 1.5.1


A workflow common to many data analysis problems goes as follows: we start with some raw data and process it to extract simpler, more compact representations. We then perform analyses to identify patterns in the data, visualize and explore them, and hopefully derive insights that we can share.

The starting point of most data analysis problems is some form of exploratory data analysis. It is a broad field covering many topics in both machine learning and visualization. These methods are valuable because, until we understand the basic structure of our raw data, it is hard to know whether the data is suitable for the task at hand.

Unsupervised methods such as dimensionality reduction and clustering are common machine learning methods used in exploratory analysis. These methods help us identify simpler, more compact representations of the original raw data to either aid our understanding or provide useful input to other stages of analysis. Here, we'll be focusing on dimensionality reduction.

Dimensionality reduction includes a well-known analysis technique called PCA (principal component analysis). Its goal is to reduce complex data into a simpler, more compact representation. These simpler representations can often elucidate the underlying pattern or structure of the data.

Imagine that we're trying to understand some underlying phenomenon, and in order to do so we measure various quantities potentially related to it. If we knew exactly what to measure in advance, we might be able to find some simple relationships in our data. But we typically don't, and so we often measure anything that might be relevant, and end up having irrelevant or redundant signals in our measurement.

Let's say we want to gather people's shoe sizes to find some trends associated with them. When we go to a shoe store, we notice that the store uses two different measurements (American size and European size), and we decide to record both of them.

<img src='images/pca1.png' width='30%'>

The plotted data unsurprisingly shows a strong correlation between the two measurements (they are the same quantity expressed in different sizing systems). But the data doesn't lie perfectly on the $y=x$ line, possibly because we made some errors during our collection process. Given this data, how can we find an alternative representation of it? One idea is to pick a single direction in 2D and project our points onto that direction. But which line should we pick? One possible choice is shown below (the blue line).

<img src='images/pca2.png' width='30%'>

Intuitively it seems like a pretty good candidate. Looking at the projections we see that the points projected onto this line all seem close to their initial representations. 

We can formalize this intuition via the notion of reconstruction error. Specifically, our goal is to minimize the Euclidean distances between our original points and their projections, and this is exactly what PCA does. PCA aims to find the projections that minimize the lengths of the black lines between the original points (in red) and the projected points (in blue).
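As a minimal sketch of the reconstruction-error idea, the snippet below simulates noisy, strongly correlated 2D measurements and compares the mean squared distance between the points and their projections for a few candidate directions. The synthetic data, the random seed, and the helper name `reconstruction_error` are assumptions made up for illustration; they are not the data behind the figures. The first principal direction is taken here as the top right singular vector of the centered data, and it should give the smallest error of the three.

```python
import numpy as np

# synthetic stand-in for the shoe-size data (assumption: not the data in the figures)
rng = np.random.RandomState(0)
x = rng.normal(size=200)
X = np.column_stack([x, x + 0.2 * rng.normal(size=200)])
X_centered = X - X.mean(axis=0)


def reconstruction_error(X_centered, direction):
    """Mean squared Euclidean distance between points and their projections onto a line."""
    u = direction / np.linalg.norm(direction)  # unit vector along the line
    projected = np.outer(X_centered @ u, u)    # projection of every point onto the line
    return np.mean(np.sum((X_centered - projected) ** 2, axis=1))


# first principal direction: top right singular vector of the centered data
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
pc1 = Vt[0]

candidates = {
    'x-axis': np.array([1.0, 0.0]),
    'y-axis': np.array([0.0, 1.0]),
    'first principal direction': pc1,
}
errors = {name: reconstruction_error(X_centered, d) for name, d in candidates.items()}
for name, err in errors.items():
    print('{}: {:.4f}'.format(name, err))
```

Trying other directions in `candidates` is an easy way to check that none of them yields a smaller error than the first principal direction.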

Another way to think about this is that, in order to identify patterns in our data, we often look for variation across observations. So it seems reasonable to find a succinct representation that best captures the variation in our initial data. In particular, we could look to explain our data via its directions of maximal variance.

<img src='images/pca3.png' width='30%'>

Let's look at some visualizations to see what this means. If we consider the direction shown by the arrow in the figure, we see that the variation is quite small in this direction.

If we look at the direction shown by this new arrow, we see a large degree of variation. It turns out that the PCA solution represents the original data in terms of its directions of maximal variation.
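To make the maximal-variance view concrete, here is a small sketch that measures the variance of the projected data along two directions. It assumes synthetic correlated data (again, not the data in the figures) and a hypothetical helper `variance_along`; the directions come from an eigendecomposition of the sample covariance matrix, which is one common way to compute PCA.

```python
import numpy as np

# synthetic correlated 2D data (assumption: a stand-in for the plotted measurements)
rng = np.random.RandomState(0)
x = rng.normal(size=200)
X = np.column_stack([x, x + 0.2 * rng.normal(size=200)])
X_centered = X - X.mean(axis=0)

# eigendecomposition of the covariance matrix;
# np.linalg.eigh returns eigenvalues in ascending order
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
max_var_direction = eigvecs[:, -1]  # direction of largest variance
min_var_direction = eigvecs[:, 0]   # direction of smallest variance


def variance_along(X_centered, direction):
    """Sample variance of the data after projecting it onto a direction."""
    u = direction / np.linalg.norm(direction)
    return np.var(X_centered @ u, ddof=1)


var_max = variance_along(X_centered, max_var_direction)
var_min = variance_along(X_centered, min_var_direction)
explained_ratio = var_max / (var_max + var_min)

print('variance along max direction: {:.4f}'.format(var_max))
print('variance along min direction: {:.4f}'.format(var_min))
print('share of total variance kept by one direction: {:.4f}'.format(explained_ratio))
```

The variance along each eigenvector matches the corresponding eigenvalue of the covariance matrix, which is why the eigenvalues are commonly read as the amount of variance each direction explains.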

- [Generalized Low Rank Models](http://docs.h2o.ai/h2o-tutorials/latest-stable/tutorials/glrm/glrm-tutorial.html)
- [Madeleine Udell's GitHub](https://github.com/madeleineudell)
- [h2o glrm video page](http://www.h2o.ai/verticals/algos/glrm/)
- [glrm python](https://github.com/cehorn/GLRM)
- [h2o glrm example](https://github.com/pmnyc/Tools/blob/d2e82ee282d704aabe3faa1256c604afa280fdcd/codes_by_other_people/h2o-3-source-codes/h2o-py/tests/testdir_algos/glrm/pyunit_benign_glrm.py)
- [Generalized Low Rank Models paper (arXiv:1410.0342)](http://arxiv.org/pdf/1410.0342v4.pdf)