# Non-Linear Manifold Learning

In the last class we explored some linear manifold learning techniques, such as PCA, and saw how they can be applied to problems in Biomedical Imaging.

In this class we will look at **non-linear manifold learning**. 


The motivation for our interest in manifold learning was that, in many biomedical applications, our datasets have a **very high dimensionality** which makes analysis challenging. 

However, the data will usually be generated by some **natural process with fewer degrees of freedom** than the dimensionality of the data would suggest. 

What we want are methods to extract these underlying degrees of freedom in the data.

To understand this better, let's look at a simple example in which linear methods like PCA fail to uncover the data's natural **non-linear structure**.

In [None]:
""" Non-Linear Manifold Learning notebook

James Clough 2018-2019

"""

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D # for 3D plotting with matplotlib
import numpy as np

from sklearn.decomposition import PCA   # sklearn has a nice PCA implementation
from scipy.linalg import eigh           # eigendecomposition

# display plots in the notebook
%matplotlib inline                      

First, we will generate a swiss-roll dataset, like the one we saw in the previous class.

We start with a set of $N$ points scattered in a 2D unit square, and will then map those points to lie on a spiral shape in 3D.

If you like, you can try changing the variables here to generate different datasets.

The variable $N$ sets the number of datapoints in the swiss-roll. 

num_rotations sets the number of full rotations of the swiss-roll spiral.

In [None]:
""" Let's generate a dataset in which linear ML methods will fail to uncover the data's natural structure """

N = 2000                       # number of datapoints
num_rotations = 1.2
X_m = np.random.random((N, 2)) # random points scattered in [0,1]^2

def create_spiral(X, num_rotations):
    """ Take 2D manifold X and output 3D spiral made from curling up X in 3D space """
    N = X.shape[0]
    r = np.exp(X[:,1] * num_rotations) * 0.5
    theta = X[:,1] * (2 * np.pi) * num_rotations
    Y = np.zeros((N, 3))
    Y[:,0] = X[:,0] * 6
    Y[:,1] = r * np.cos(theta)
    Y[:,2] = r * np.sin(theta)
    return Y

Y = create_spiral(X_m, num_rotations)

Now we have generated the swiss-roll dataset, we can plot it.

Think of the original 2D data as the 'underlying degrees of freedom'.

Think of the 3D data as the high-dimensional data we have measured.

In [None]:
""" Plot the manifold curled up into a 3D swiss-roll"""

fig = plt.figure(figsize=(16,8))
ax = fig.add_subplot(121)
ax.set_xticks([])
ax.set_yticks([])
_ = ax.scatter(X_m[:,0], X_m[:,1], c=X_m[:,1], marker='o')

ax = fig.add_subplot(122, projection='3d')
ax.set_xticks([])
ax.set_yticks([])
ax.set_zticks([])
_ = ax.scatter(Y[:,0], Y[:,1], Y[:,2], c=X_m[:,1], marker='o')

## Principal Component Analysis

As we saw last week, PCA projects the data onto the hyperplane minimising the approximation error, or, equivalently, maximising the variance along the hyperplane.
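To make these two equivalent descriptions concrete, here is a minimal sketch using plain numpy rather than scikit-learn. The synthetic dataset and all of the variable names are assumptions made for illustration. It checks that the squared error of the 2D approximation is exactly the variance left in the discarded direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3D data with very different spreads along each axis
data = rng.standard_normal((500, 3)) * np.array([3.0, 1.0, 0.2])
centred = data - data.mean(axis=0)

# The rows of Vt are the principal directions, ordered by decreasing variance
U, s, Vt = np.linalg.svd(centred, full_matrices=False)

projected = centred @ Vt[:2].T       # coordinates in the best-fitting 2D hyperplane
reconstructed = projected @ Vt[:2]   # the same points, mapped back into 3D

approx_error = np.sum((centred - reconstructed)**2)
discarded = s[2]**2                  # sum of squares along the dropped direction
kept_variance = np.var(projected, axis=0, ddof=1)

print(approx_error, discarded)
print(kept_variance)
```

Keeping the directions with the most variance and throwing away the direction with the least are the same choice, which is why the two descriptions of PCA agree.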


### What do you expect to happen if we use PCA to reduce the dimensionality of this swiss-roll dataset from 3 dimensions to 2?

In [None]:
""" Try using PCA on this dataset and see if it can recover the underlying 2D structure """

# In scikit-learn, we first create a PCA object...
# n_components refers to the number of principal components we're extracting
# ie. the dimensionality of the space we are trying to find
pca = PCA(n_components=2)

# Then apply it to our dataset
X_pca = pca.fit_transform(Y)

In [None]:
plt.figure(figsize=(6,6))
plt.xticks([])
plt.yticks([])
_ = plt.scatter(X_pca[:,0], X_pca[:,1], c=X_m[:,1], marker='o')

### Was this what you expected? 

### Can you explain what's going on here?

The exact shape of the data you see will vary depending on the random points you started with, but you are unlikely to recover anything resembling our original 2D dataset. Instead, the spiral structure we are trying to unfold will still be there.

We have failed to uncover the 2D manifold structure because PCA can only project the data onto a hyperplane; it can't unfurl curved manifolds. 

Another way of thinking about this problem is to notice that we want to measure the distances between points as the distance *along the manifold* (ie. along the swiss-roll) but if the manifold is curved, that is not the same as the distance between the points in the high-dimensional space (ie. the distance in 3D space not confined to the surface of the swiss-roll).
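As a rough illustration of this difference, the sketch below compares the two distances between the inner and outer ends of the spiral cross-section. It reuses the same parametrisation as `create_spiral`; the helper name `spiral_point` and the numerical arc-length approximation are assumptions made for this example.

```python
import numpy as np

num_rotations = 1.2

def spiral_point(t, num_rotations):
    """ Cross-section of the swiss-roll at manifold coordinate t in [0, 1] """
    r = np.exp(t * num_rotations) * 0.5
    theta = t * (2 * np.pi) * num_rotations
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=-1)

# Sample the curve finely and add up the lengths of the small straight segments
t = np.linspace(0.0, 1.0, 10001)
curve = spiral_point(t, num_rotations)
segment_lengths = np.linalg.norm(np.diff(curve, axis=0), axis=1)

along_manifold = segment_lengths.sum()                   # distance walking along the roll
straight_line = np.linalg.norm(curve[-1] - curve[0])     # distance cutting through 3D space

print(along_manifold, straight_line)
```

The straight-line distance is much shorter than the distance along the roll, so points that are far apart on the manifold can look close together in the measured space.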

# Laplacian Eigenmaps

To try and solve this problem we need non-linear manifold learning.

There are lots of non-linear manifold learning algorithms, but we'll be using one called Laplacian Eigenmaps, which is a simple but effective and commonly used algorithm.

**Laplacian Eigenmaps for Dimensionality Reduction and Data Representation**, *Belkin and Niyogi, Neural Computation, 2003*

The key idea is this: even if the manifold that the data lies on is curved, we can assume that it is ***locally flat***. 

This means that the distances between points in the original data are still meaningful as long as those points are close to each other. 

If we use only the nearest neighbours of each point in the embedding we can get around this problem.

## Nearest neighbours
The first step in many manifold learning methods is to find the k-nearest neighbours of each point. 

This just means that we choose a number, $k$, and for each point in our data, find the $k$ points which are nearest to it.

The result can be described mathematically as a ***graph***. 

Here, a graph is not the thing with an x-axis and a y-axis and a line - rather it is a mathematical object (studied in Graph Theory - https://en.wikipedia.org/wiki/Graph_theory) that consists of some points, or **vertices**, and some relations between them, or **edges**.

In this case, the vertices in the graph are our datapoints, and there will be an edge between two points if one of them is in the k-nearest neighbours of the other.

<img src="imgs/scatter.png">

<img src="imgs/scatter_knn.png">

Let's write a function that finds the k-nearest neighbours of each point in our dataset.

## Exercise 1

Write a function, ***my_knn*** that takes in an array of size NxD of the datapoints, and an integer k, and returns an NxN array where the [i,j] element of the array is a 1 if j is a k-nearest-neighbour of i (excluding i itself), and 0 otherwise.

Don't worry too much about doing this in a clever or fast way, just make sure your function gives you the correct answer.

*Hint - numpy has a function called **argsort** that you might find useful*

In [None]:
def my_knn(X, k):
    """ Finds k-nearest neighbours in X"""
    N, D = X.shape
    A = np.zeros((N,N))
    ###
    ###
    ###
    ###
    return A

## Solution

In [None]:
def my_knn(X, k):
    """ Finds k-nearest neighbours in X """
    N, D = X.shape
    A = np.zeros((N, N))
    for i in range(N):
        i_sq_distances = np.sum((X - X[i])**2, axis=1)
        nearest_points = np.argsort(i_sq_distances)
        # [1:k+1] is because the nearest point to i is i itself - but we don't want it
        k_nearest      = nearest_points[1:k+1]
        for j in k_nearest:
            A[i,j] = 1
    return A

In [None]:
A = my_knn(Y, 20)

## Nearest neighbours check
Let's check our function is doing what we expect it to by plotting a point with its nearest neighbours highlighted.

In [None]:
i = 2
knn_i = A[i].nonzero()[0]   # indices of i's nearest neighbours (nonzero returns a tuple)

fig = plt.figure(figsize=(16,8))

ax = fig.add_subplot(121, projection='3d')
ax.set_xticks([])
ax.set_yticks([])
ax.set_zticks([])
_ = ax.scatter(Y[:,0], Y[:,1], Y[:,2], c=X_m[:,1], marker='o')
_ = ax.scatter(Y[i,0], Y[i,1], Y[i,2], c='r', marker='o', s=300) # plot point i
_ = ax.scatter(Y[knn_i,0], Y[knn_i,1], Y[knn_i,2], c='b', marker='o', s=100) # plot i's nearest neighbours


ax = fig.add_subplot(122)
ax.set_xticks([])
ax.set_yticks([])
_ = ax.scatter(X_m[:,0], X_m[:,1], c=X_m[:,1], marker='o')
_ = ax.scatter(X_m[knn_i,0], X_m[knn_i,1], c='b', marker='o', s=100) # plot i's nearest neighbours
_ = ax.scatter(X_m[i,0], X_m[i,1], c='r', marker='o', s=300) # plot point i

## Laplacian Eigenmaps

Our aim is to find low-dimensional coordinates which maintain the near-neighbour relationships from the original data.

We will say that we are trying to minimise the following cost function:

$C = \sum_{i,j} (x_i - x_j)^2 \mathbf{A}_{ij}$

Where $\mathbf{A}_{ij}$ is our nearest-neighbour matrix which tells us whether two points are close together in the original dataset, and the $x_i$ and $x_j$ represent the low-dimensional coordinates we're trying to find.

$\mathbf{A}$ is fixed, and determined by our input data.
$x$ is the solution we are trying to find.

You can hopefully see that this cost function is minimised if we place points which are close together in the original data (ie. $\mathbf{A}_{ij}$ is high), close together in the new low-dimensional representation (so that $(x_i - x_j)^2$ is low).

Now we know what we are trying to minimise we can mathematically describe our algorithm.

Question: what is the problem with this cost function?

Hint: It has a trivial solution we do not want

Solution: $C$ is minimised by setting all of the coordinates to be equal: $x_i=x_j$

Therefore we also require that the points $x$ are spaced out from each other - ie.

$\mathbf{x}^T \mathbf{D} \mathbf{x} = 1$

where $\mathbf{D}$ is a diagonal matrix given by $\mathbf{D}_{ii} = \sum_j \mathbf{A}_{ij}$.
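The small sketch below illustrates why some normalisation is needed. The four-vertex path graph and the function name `cost` are assumptions chosen for the example. Shrinking the coordinates always lowers the cost, and collapsing every point onto the same value drives it to zero, so the constraint is used to fix the overall scale. Note that a constant vector can still be rescaled to satisfy the constraint, which is why the eigenvector with zero eigenvalue is skipped later on.

```python
import numpy as np

def cost(x, A):
    """ C = sum_ij (x_i - x_j)^2 A_ij for 1D coordinates x """
    diffs = x[:, None] - x[None, :]
    return np.sum(diffs**2 * A)

# Adjacency matrix of a path graph: 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))

x = np.array([0.0, 1.0, 2.0, 3.0])

cost_original = cost(x, A)              # points spread out along the path
cost_shrunk = cost(0.1 * x, A)          # same layout, 10 times smaller
cost_collapsed = cost(np.ones(4), A)    # every point at the same coordinate

# Rescale x so that it satisfies the constraint x^T D x = 1
x_normalised = x / np.sqrt(x @ D @ x)

print(cost_original, cost_shrunk, cost_collapsed)
print(x_normalised @ D @ x_normalised)
```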

Expanding the square, and assuming that $\mathbf{A}$ is symmetric, we can rewrite the cost function step by step:

$C = \sum_{i,j} (x_i - x_j)^2 \mathbf{A}_{ij}$

$C = \sum_{i,j} (x_i^2 + x_j^2 - 2 x_i x_j) \mathbf{A}_{ij}$

$C = \sum_i x_i^2 \mathbf{D}_{ii} + \sum_j x_j^2 \mathbf{D}_{jj} - 2 \sum_{i,j} x_i x_j \mathbf{A}_{ij}$

$C = 2 \sum_i x_i^2 \mathbf{D}_{ii} - 2 \sum_{i,j} x_i x_j \mathbf{A}_{ij}$

$C = 2 \mathbf{x}^T (\mathbf{D} - \mathbf{A}) \mathbf{x}$

$C = 2 \mathbf{x}^T \mathbf{L} \mathbf{x}$

where $\mathbf{L} = \mathbf{D} - \mathbf{A}$.
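As a quick numerical sanity check of this identity, the sketch below builds a random symmetric adjacency matrix (an assumption made for illustration, not our swiss-roll graph), forms $\mathbf{L} = \mathbf{D} - \mathbf{A}$, and compares the double sum with the quadratic form.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50

# Random symmetric 0/1 adjacency matrix with no self-loops
A = (rng.random((N, N)) < 0.1).astype(float)
A = np.maximum(A, A.T)
np.fill_diagonal(A, 0)

D = np.diag(A.sum(axis=1))
L = D - A

x = rng.standard_normal(N)   # arbitrary 1D coordinates

C_direct = np.sum((x[:, None] - x[None, :])**2 * A)   # the original double sum
C_quadratic = 2 * (x @ L @ x)                         # the matrix form

print(C_direct, C_quadratic)
```

The two numbers should agree up to floating-point rounding. If you remove the line that symmetrises `A`, they will generally differ, because the step that merges the two $\mathbf{D}$ terms relies on the symmetry.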

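We can find

$\min \mathbf{x}^T \mathbf{L} \mathbf{x}$ 

subject to the constraint

$\mathbf{x}^T \mathbf{D} \mathbf{x} = 1$ 

by solving the generalised eigenvalue problem $\mathbf{L} \mathbf{x} = \lambda \mathbf{D} \mathbf{x}$. In this notebook we make the simplification of using the ordinary eigenvectors of the matrix $\mathbf{L}$, which are the same thing when every point has the same number of neighbours.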


This matrix is called the **Laplacian** and has many useful properties.
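For reference, the constrained problem can be written as the generalised eigenproblem $\mathbf{L}\mathbf{x} = \lambda \mathbf{D}\mathbf{x}$. The sketch below is one possible way to solve it with plain numpy, via the symmetric matrix $\mathbf{D}^{-1/2} \mathbf{L} \mathbf{D}^{-1/2}$. The small example graph is an assumption, and the approach requires every vertex to have at least one neighbour.

```python
import numpy as np

# A small connected graph: a path 0-1-2-3-4-5 with one extra edge between 0 and 2
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 2)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = 1
    A[j, i] = 1

degrees = A.sum(axis=1)
D = np.diag(degrees)
L = D - A

# Symmetric normalised Laplacian D^{-1/2} L D^{-1/2}
D_inv_sqrt = np.diag(1.0 / np.sqrt(degrees))
L_sym = D_inv_sqrt @ L @ D_inv_sqrt

eigenvalues, U = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
X = D_inv_sqrt @ U                       # each column x satisfies L x = lambda D x

print(eigenvalues)
print(np.diag(X.T @ D @ X))              # every column satisfies x^T D x = 1
```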

## Graph Laplacian
Graphs can be represented in many ways, usually by matrices. The most common representation is an **adjacency matrix**, $\mathbf{A}$, which is what our ***my_knn*** function above generates.

Another matrix representation is the Laplacian matrix, $\mathbf{L}$, which is calculated by:

$\mathbf{L} = \mathbf{D} - \mathbf{A}$

where $\mathbf{D}$ is a diagonal matrix given by $\mathbf{D}_{ii} = \sum_j \mathbf{A}_{ij}$.

The Laplacian matrix can be thought of as a discrete version of the *Laplace operator* which describes diffusion on a surface. 

By looking at the eigenvectors of this matrix (equivalent to the eigenmodes of the Laplace operator) we can see the most significant modes of variation in the graph representing our original data.
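To get a feel for these properties, here is a minimal sketch on a graph small enough to inspect by hand. The six-vertex path graph is an assumption chosen for illustration. Each row of the Laplacian sums to zero, so the smallest eigenvalue is zero, and the next eigenvector varies smoothly from one end of the path to the other.

```python
import numpy as np

# Path graph on 6 vertices: 0 - 1 - 2 - 3 - 4 - 5
n = 6
A_path = np.zeros((n, n))
for i in range(n - 1):
    A_path[i, i + 1] = 1
    A_path[i + 1, i] = 1

D_path = np.diag(A_path.sum(axis=1))
L_path = D_path - A_path

eigenvalues, eigenvectors = np.linalg.eigh(L_path)   # eigenvalues in ascending order

print(L_path.sum(axis=1))     # every row sums to zero
print(eigenvalues)            # the smallest eigenvalue is (numerically) zero
print(eigenvectors[:, 1])     # the next eigenvector changes steadily along the path
```

That second eigenvector behaves like a 1D coordinate along the graph, which is the behaviour Laplacian Eigenmaps relies on.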

The result of your k-nearest-neighbours function should be an NxN matrix, which has a 1 for nearest-neighbours and a 0 otherwise. 

To use in the Laplacian Eigenmaps method, this matrix needs to be symmetric. 

Does your knn algorithm always return a symmetric matrix? If not, we can symmetrise it.
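A tiny example shows why the answer can be no. The helper `knn_adjacency` below is a hypothetical vectorised stand-in for ***my_knn***, and the three points on a line are an assumption chosen to make the asymmetry obvious.

```python
import numpy as np

def knn_adjacency(X, k):
    """ Hypothetical helper: A[i, j] = 1 if j is one of the k nearest neighbours of i """
    N = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    np.fill_diagonal(sq_dists, np.inf)              # a point is not its own neighbour
    nearest = np.argsort(sq_dists, axis=1)[:, :k]
    A = np.zeros((N, N))
    A[np.arange(N)[:, None], nearest] = 1
    return A

# Three points on a line, at positions 0, 1 and 3
X_line = np.array([[0.0], [1.0], [3.0]])
A_small = knn_adjacency(X_line, k=1)

print(A_small)

# Taking the element-wise maximum with the transpose keeps an edge
# if either point counts the other as a neighbour
A_sym = np.maximum(A_small, A_small.T)
print(A_sym)
```

The nearest neighbour of the point at 3 is the point at 1, but the nearest neighbour of the point at 1 is the point at 0, so the relation only holds in one direction.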

In [None]:
def symmetrise(X):
    """ Symmetrises the matrix X.
    
    Notes
    -----
    np.maximum returns the element-wise maximum of its arguments."""
    return np.maximum(X, X.T)
A = symmetrise(A)

In [None]:
# plot first few eigenmodes of this swiss-roll dataset
fig = plt.figure(figsize=(8,8))

D = np.diag(np.sum(A, axis=0))
L = D - A
v, C = eigh(L, subset_by_index=[1, 6])   # subset_by_index replaces the deprecated eigvals keyword

ax = fig.add_subplot(111, projection='3d')
ax.set_xticks([])
ax.set_yticks([])
ax.set_zticks([])
_ = ax.scatter(Y[:,0], Y[:,1], Y[:,2], c=C[:,0], marker='o')

In [None]:
# plot first few eigenmodes of this swiss-roll dataset
fig = plt.figure(figsize=(16,8))

D = np.diag(np.sum(A, axis=0))
L = D - A
v, C = eigh(L, subset_by_index=[1, 6])   # subset_by_index replaces the deprecated eigvals keyword

ax1 = fig.add_subplot(121, projection='3d')
ax1.set_xticks([])
ax1.set_yticks([])
ax1.set_zticks([])
_ = ax1.scatter(Y[:,0], Y[:,1], Y[:,2], c=C[:,0], marker='o')

ax2 = fig.add_subplot(122, projection='3d')
ax2.set_xticks([])
ax2.set_yticks([])
ax2.set_zticks([])
_ = ax2.scatter(Y[:,0], Y[:,1], Y[:,2], c=C[:,1], marker='o')

In [None]:
# plot first few eigenmodes of this swiss-roll dataset
fig = plt.figure(figsize=(24,8))

D = np.diag(np.sum(A, axis=0))
L = D - A
v, C = eigh(L, subset_by_index=[1, 6])   # subset_by_index replaces the deprecated eigvals keyword

ax1 = fig.add_subplot(131, projection='3d')
ax1.set_xticks([])
ax1.set_yticks([])
ax1.set_zticks([])
_ = ax1.scatter(Y[:,0], Y[:,1], Y[:,2], c=C[:,0], marker='o')

ax2 = fig.add_subplot(132, projection='3d')
ax2.set_xticks([])
ax2.set_yticks([])
ax2.set_zticks([])
_ = ax2.scatter(Y[:,0], Y[:,1], Y[:,2], c=C[:,1], marker='o')

ax3 = fig.add_subplot(133, projection='3d')
ax3.set_xticks([])
ax3.set_yticks([])
ax3.set_zticks([])
_ = ax3.scatter(Y[:,0], Y[:,1], Y[:,2], c=C[:,2], marker='o')

Now we can implement the Laplacian Eigenmaps method.

1) Find the k-nearest-neighbours matrix of your data - call this A

2) Find the Laplacian of A - call this L

3) Find the eigenvectors corresponding to the d smallest non-zero eigenvalues of L

These d eigenvectors are the coordinates of each of your points in a d-dimensional representation of your data.
For now we can choose d=2.

## Exercise 2
Write a function that takes in the data and implements the Laplacian Eigenmaps method described above.

*Hint - scipy has a function called scipy.linalg.eigh which can help you find the eigenvectors of a matrix efficiently*

In [None]:
def my_laplacian_eigenmap(Y, k=20, d=2):
    A = my_knn(Y, k)
    ###
    ###
    ###
    ###
    return X

## Solution

In [None]:
def my_laplacian_eigenmap(Y, k=20, d=2):
    A = my_knn(Y, k)
    A = symmetrise(A)
    D = np.diag(np.sum(A, axis=0))
    L = D - A
    v, X = eigh(L, subset_by_index=[1, d])   # skip index 0; subset_by_index replaces the deprecated eigvals keyword
    return X

In [None]:
Z = my_laplacian_eigenmap(Y)

fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111)
ax.set_xticks([])
ax.set_yticks([])
_ = ax.scatter(Z[:,0], Z[:,1], c=X_m[:,1], marker='o')

Can your method recover the original manifold structure? Try changing the number of nearest neighbours $k$. What happens when $k$ is set very high, or very low?
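One thing worth checking while you experiment is whether the neighbour graph is still connected. The sketch below counts connected components as the number of (numerically) zero eigenvalues of the Laplacian. The helper names `knn_graph` and `count_components`, the tolerance, and the two-cluster toy dataset are all assumptions made for illustration.

```python
import numpy as np

def knn_graph(X, k):
    """ Hypothetical helper: symmetrised k-nearest-neighbour adjacency matrix """
    N = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    np.fill_diagonal(sq_dists, np.inf)              # a point is not its own neighbour
    nearest = np.argsort(sq_dists, axis=1)[:, :k]
    A = np.zeros((N, N))
    A[np.arange(N)[:, None], nearest] = 1
    return np.maximum(A, A.T)

def count_components(A, tol=1e-8):
    """ Number of eigenvalues of L = D - A that are zero up to the tolerance """
    L = np.diag(A.sum(axis=1)) - A
    return int(np.sum(np.linalg.eigvalsh(L) < tol))

# Two well-separated clusters of 10 points each, on a line
cluster = np.arange(10) * 0.1
X_two = np.concatenate([cluster, cluster + 10.0])[:, None]

n_small_k = count_components(knn_graph(X_two, k=3))    # neighbours stay inside each cluster
n_large_k = count_components(knn_graph(X_two, k=15))   # neighbours must come from both clusters

print(n_small_k, n_large_k)
```

When the graph splits into several pieces there is more than one zero eigenvalue, so the eigenvectors picked out by skipping only the first one may describe which piece a point belongs to rather than its position on the manifold.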