# Multi-dimensional Scaling (MDS)

Author: Matt Smart

[Overview](#linkOverview)  
[Details](#linkDetails)  
[Algorithm](#linkAlgorithm)  
[Example](#linkExample)  
[Resources](#linkResources)  

### Overview <a id='linkOverview'></a>
- Non-linear dimension reduction technique  
- Rough idea - given high dimensional data $X$, find a lower dimensional representation $Y$ such that the global distance structure is preserved
- Two subtypes of MDS: metric (quantitative) and non-metric (qualitative)
- Common "first resort" technique, like PCA

### Details <a id='linkDetails'></a>

#### Metric MDS
Setup:
- Suppose one has $p$ samples of N-dimensional data points, $x_i\in\mathbb{R}^N$
- Store these samples columnwise as $X\in\mathbb{R}^{N\,\times\,p}$
- We call this the original data matrix, or simply the data
- Assumption: there is a meaningful metric (e.g. Euclidean distance) on the data space (high dim)
- Assumption: there is a meaningful metric (e.g. Euclidean distance) on the latent space (low dim)

Goal:
- Given N-dim data $X$, a metric $d(\cdot,\cdot)$ on $\mathbb{R}^N$, a target dimension $k<N$, and a metric $g(\cdot,\cdot)$ on $\mathbb{R}^k$
- Find an embedding $Y\in\mathbb{R}^{k\,\times\,p}$ (i.e. a $y_i\in\mathbb{R}^k$ for each $x_i\in\mathbb{R}^N$) such that pairwise distances are preserved between representations, i.e. $g_{ij}(Y)\approx d_{ij}(X)$

Objective function: $$Y^\ast=\operatorname*{arg\,min}_Y {\sum_{i<j}{w_{ij}\left|d_{ij}\left(X\right)-g_{ij}\left(Y\right)\right|}}$$

Notes:
- Define the "stress"  $f(W,X,Y)\equiv{\sum_{i<j}{w_{ij}\left|d_{ij}\left(X\right)-g_{ij}\left(Y\right)\right|}}$, then  $Y^\ast=\operatorname*{arg\,min}_Y f(W,X,Y)$
- Use the free weights $w_{ij}\geq 0$ to specify the confidence (or precision) of $d_{ij}(X)$ measurements
- Solution degeneracy: if $Y^\ast$ is optimal, so is any translated, rotated, or reflected copy of it (for Euclidean $g$, these do not affect distances)
- This means for any orthogonal matrix $O$, and constant-column matrix $C$, we have $f(W,X,Y^\ast)=f(W,X,OY^\ast+C)$
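
A minimal numerical sketch of the stress and its invariance, assuming both metrics are Euclidean and all $w_{ij}=1$. The helper names `pairwise_distances` and `stress`, and the `power` argument, are illustrative and not taken from any library: `power=1` matches the absolute-difference form, while `power=2` gives the squared form for which stress majorization is usually stated.

```python
import numpy as np

def pairwise_distances(P):
    """Euclidean distances between the columns of P (shape: dim x p)."""
    diff = P[:, :, None] - P[:, None, :]      # shape: dim x p x p
    return np.sqrt((diff ** 2).sum(axis=0))   # shape: p x p

def stress(W, D, Y, power=1):
    """Weighted stress over pairs i < j.

    power=1: absolute differences; power=2: squared differences.
    """
    G = pairwise_distances(Y)
    i, j = np.triu_indices(D.shape[0], k=1)   # index pairs with i < j
    return float(np.sum(W[i, j] * np.abs(D[i, j] - G[i, j]) ** power))

rng = np.random.default_rng(0)
N, k, p = 5, 2, 8
X = rng.normal(size=(N, p))          # toy data, columns are samples
Y = rng.normal(size=(k, p))          # arbitrary (not optimal) embedding
W = np.ones((p, p))                  # default weights w_ij = 1
D = pairwise_distances(X)

# Random orthogonal matrix O (via QR) and a constant-column matrix C
O, _ = np.linalg.qr(rng.normal(size=(k, k)))
C = np.tile(rng.normal(size=(k, 1)), (1, p))

f_before = stress(W, D, Y)
f_after = stress(W, D, O @ Y + C)
print(f_before, f_after)             # the two values agree up to round-off
```

The invariance holds for any $Y$, not only the optimum, which is why an optimizer can only ever pin down the embedding up to a rigid motion.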

Limitations:
- What would happen if we tried to embed an equilateral triangle in 2D into 1D?
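
One way to explore this question numerically is a brute-force search over all 1D configurations of three points. The sketch below assumes a unit equilateral triangle, unit weights, and Euclidean distance on the line; the grid range and resolution are arbitrary choices.

```python
import numpy as np

# Target: all three pairwise distances equal 1 (unit equilateral triangle).
# Any 1D configuration can be sorted as y1 <= y2 <= y3, so its distances are
# a = y2 - y1, b = y3 - y2 and a + b.
gaps = np.linspace(0.0, 2.0, 401)
a, b = np.meshgrid(gaps, gaps, indexing="ij")

# Stress with absolute differences and with squared differences
stress_abs = np.abs(1 - a) + np.abs(1 - b) + np.abs(1 - (a + b))
stress_sq = (1 - a) ** 2 + (1 - b) ** 2 + (1 - (a + b)) ** 2

best = np.unravel_index(np.argmin(stress_sq), stress_sq.shape)
print("min absolute stress:", stress_abs.min())
print("min squared stress:", stress_sq.min(), "at gaps", a[best], b[best])
```

Neither minimum reaches zero: three mutually equidistant points cannot sit on a line, so some distortion is unavoidable when $k$ is too small.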

Questions:
- Can one show the objective function monotonically increases as target dimension decreases?

#### Non-metric MDS
Setup:
- One has $p$ objects, $\{x_i\}_{i=1}^p$
- Assumption: there is a notion of dissimilarity between the objects
    - note this is weaker, or more general, than specifying a metric
    - e.g. a ranking of dissimilarities may be sufficient, but is clearly weaker than specifying a distance
- Assumption: one can construct a $p \times p$ dissimilarity matrix $D$ from the data

Goal: 
- Preserve ordination of the dissimilarity
- E.g. if $d_{12}\left(X\right)<d_{13}\left(X\right)$, then we should have $d_{12}\left(Y\right)<d_{13}\left(Y\right)$
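
A small sketch of how one might check whether an embedding preserves the ordination. The function `order_agreement` and the toy matrix `D` are made up for illustration; the embedding is 1D with Euclidean distance.

```python
import numpy as np

def order_agreement(D, G):
    """Fraction of pair-of-pairs whose ordering in D is kept in G.

    D, G: symmetric p x p matrices (dissimilarities and embedded distances).
    Ties in D are skipped; returns 1.0 when the ordination is fully preserved.
    """
    i, j = np.triu_indices(D.shape[0], k=1)
    d, g = D[i, j], G[i, j]                     # one entry per pair i < j
    sign_d = np.sign(d[:, None] - d[None, :])   # ordering of every two pairs
    sign_g = np.sign(g[:, None] - g[None, :])
    comparable = sign_d != 0
    return float(np.mean(sign_d[comparable] == sign_g[comparable]))

# Toy dissimilarities between p = 4 objects (values made up for illustration)
D = np.array([[0, 1, 4, 6],
              [1, 0, 2, 5],
              [4, 2, 0, 3],
              [6, 5, 3, 0]], dtype=float)

# A 1D embedding, stored column-wise as a 1 x p matrix
Y = np.array([[0.0, 1.0, 3.2, 5.5]])
G = np.abs(Y.T - Y)                             # 1D Euclidean distances

print(order_agreement(D, G))
print(order_agreement(D, G ** 2))   # a monotone transform keeps the ordering
```

The second call illustrates why the non-metric goal is weaker: squaring the embedded distances changes their values but not their ranking, so the score is unchanged.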

Notes:
- mention stress idea

#### Interpolating between them
Ch4 Cox text p1

### Algorithm <a id='linkAlgorithm'></a>

#### Metric MDS

Input:
- data $X\in\mathbb{R}^{N\,\times\,p}$
- an embedding or target dimension $1\leq k<N$
- a high-dim metric $d:\mathbb{R}^N \times \mathbb{R}^N\to \mathbb{R}$
- a low-dim metric $g:\mathbb{R}^k \times \mathbb{R}^k\to \mathbb{R}$ (typically Euclidean)
- optional: upper triangular weight matrix $W$ (default is all $w_{ij}=1$)

Initialize step: compute $D$, the $p\times p$ distance matrix using input data, $d_{ij}=d(x_i,x_j)$
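
The initialize step can be sketched as follows for an arbitrary metric passed in as a Python callable. The names `distance_matrix`, `euclidean`, and `manhattan` are illustrative helpers; the double loop favours clarity over speed.

```python
import numpy as np

def distance_matrix(X, d):
    """Return the p x p matrix D with D[i, j] = d(x_i, x_j).

    X: N x p array whose columns are the samples x_i.
    d: any callable taking two length-N vectors and returning a float.
    """
    p = X.shape[1]
    D = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):       # d is symmetric, so fill both halves
            D[i, j] = D[j, i] = d(X[:, i], X[:, j])
    return D

def euclidean(u, v):
    return float(np.linalg.norm(u - v))

def manhattan(u, v):
    return float(np.sum(np.abs(u - v)))

# Three points in R^2, stored column-wise: (0,0), (3,4), (6,0)
X = np.array([[0.0, 3.0, 6.0],
              [0.0, 4.0, 0.0]])

print(distance_matrix(X, euclidean))
print(distance_matrix(X, manhattan))
```

Only the $p(p-1)/2$ entries with $i<j$ are actually evaluated, matching the sum over $i<j$ in the objective.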

Optimization:
- solve $Y^\ast=\operatorname*{arg\,min}_Y f(W,X,Y)$ by "stress majorization"
- standard method details: https://en.wikipedia.org/wiki/Stress_majorization
- the procedure monotonically decreases the cost function, so it finds a local (not necessarily global) minimum
- repeat for multiple initial conditions $Y_0$ and keep the best local minimum
- note there are many local minima, $\{Y_{candidate}\}$, each invariant under translation + rotation
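
The optimization loop can be sketched in a few lines of numpy. This is an illustrative version under stated assumptions, not scikit-learn's implementation: unit weights, Euclidean $g$, the squared-difference form of the stress (the form for which the majorization guarantee is usually stated), and samples stored as rows (the transpose of the column-wise convention used in these notes). The names `smacof_once`, `smacof`, `n_init`, and `tol` are hypothetical.

```python
import numpy as np

def pairwise(Y):
    """Euclidean distance matrix between the rows of Y (p x k)."""
    return np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)

def squared_stress(D, Y):
    """Sum over i < j of (d_ij - g_ij(Y))^2, with all weights equal to 1."""
    i, j = np.triu_indices(D.shape[0], k=1)
    return float(np.sum((D[i, j] - pairwise(Y)[i, j]) ** 2))

def smacof_once(D, k, rng, max_iter=300, tol=1e-9):
    """One run of unweighted stress majorization from a random start."""
    p = D.shape[0]
    Y = rng.normal(size=(p, k))                  # initial condition Y_0
    history = [squared_stress(D, Y)]
    for _ in range(max_iter):
        G = pairwise(Y)
        ratio = np.zeros_like(D)
        np.divide(D, G, out=ratio, where=G > 0)  # d_ij / g_ij, 0 where g_ij = 0
        B = -ratio
        B[np.diag_indices(p)] = ratio.sum(axis=1)
        Y = B @ Y / p                            # Guttman transform
        history.append(squared_stress(D, Y))
        if history[-2] - history[-1] < tol:      # stop once progress stalls
            break
    return Y, history

def smacof(D, k, n_init=4, seed=0):
    """Repeat from several starts and keep the lowest-stress embedding."""
    rng = np.random.default_rng(seed)
    runs = [smacof_once(D, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[1][-1])

# Toy data: 30 points in R^5 (rows are samples), embedded into R^2
X = np.random.default_rng(1).normal(size=(30, 5))
D = pairwise(X)
Y_best, history = smacof(D, k=2)
print("initial stress:", history[0], "final stress:", history[-1])
```

Each pass costs $O(p^2)$ for the distance matrix and the matrix product, and `history` records the stress after every update so the monotone decrease can be inspected directly.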

Output:
- locally optimal embedding (k-dim representation) $Y\in\mathbb{R}^{k\,\times\,p}$

Runtime:
- MDS $\approx O\left(p^3\right)$  (where $p$ is the number of $\mathbb{R}^N$ data points)
- compare vs. e.g. PCA $\approx O(p^2)$

#### Non-metric MDS
...mention stress idea

### Example <a id='linkExample'></a>
Placeholder MNIST example from Eugene's notebook

In [26]:
import numpy as np
import matplotlib.pyplot as plt
import time
from keras.datasets import mnist
from sklearn.manifold import MDS
from matplotlib.pyplot import imshow

In [27]:
# Load training + test data
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print("training data shape:", x_train.shape, y_train.shape)
print("test data shape:", x_test.shape, y_test.shape)

# Take subset (1k of the 10k test images)
X = x_test[:1000]
Y = y_test[:1000]  # digit labels (not the MDS embedding Y from the notation above)

# Show sample image; note it's a 28x28 array
example_image = X[986]
print("example image shape:", example_image.shape)
imshow(example_image, cmap='gray')


# Flatten data elements from 28x28 array to 784-dim vector
X = X.reshape(1000, 784)
print("flattened X shape:", X.shape)

training data shape: (60000, 28, 28) (60000,)
test data shape: (10000, 28, 28) (10000,)
example image shape: (28, 28)
flattened X shape: (1000, 784)


In [19]:
# compute MDS embedding (2D)
# Docs: http://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html
# Note: scikit-learn stores samples as rows, i.e. the transpose of the column-wise X and Y used in the notes
# Algorithm: https://en.wikipedia.org/wiki/Stress_majorization
t0 = time.perf_counter()
mds_2d = MDS(n_components=2, max_iter=300, verbose=1).fit_transform(X)
total_time = time.perf_counter() - t0
print("runtime:", total_time)
print("output shape:", mds_2d.shape)

runtime: 51.809746460175724
output shape: (1000, 2)


In [23]:
# unlabelled plot
fig_raw = plt.figure()
ax_raw = fig_raw.add_subplot(111)
ax_raw.scatter(mds_2d[:,0], mds_2d[:,1], marker='.', color='black')
fig_raw.show()

# labelled plot
fig_coloured = plt.figure()
ax_coloured = fig_coloured.add_subplot(111)
# one scatter call per digit class, in sorted order, so the legend is readable
for label in np.unique(Y):
    mask = Y == label
    ax_coloured.scatter(mds_2d[mask, 0], mds_2d[mask, 1], marker='.', label=str(label))
ax_coloured.legend()
fig_coloured.show()


In [24]:
# compute MDS embedding (3D)
t0 = time.perf_counter()
mds_3d = MDS(n_components=3, max_iter=300, verbose=1).fit_transform(X)
total_time = time.perf_counter() - t0
print("runtime:", total_time)
print("output shape:", mds_3d.shape)

runtime: 52.68067187240467
output shape: (1000, 3)
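
To compare how well the 2D and 3D embeddings preserve distances, one option is to correlate the original pairwise distances with the embedded ones. The sketch below is a rough diagnostic under stated assumptions: `distance_correlation` is a hypothetical helper, Euclidean distance is used in both spaces, and stand-in random arrays replace the MNIST data so that the block runs on its own.

```python
import numpy as np

def pairwise(P):
    """Euclidean distance matrix between the rows of P."""
    P = np.asarray(P, dtype=float)               # avoids uint8 overflow
    sq = np.sum(P ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * P @ P.T
    return np.sqrt(np.clip(D2, 0.0, None))       # clip round-off negatives

def distance_correlation(X, emb):
    """Pearson correlation between original and embedded pairwise distances."""
    i, j = np.triu_indices(np.shape(X)[0], k=1)
    return float(np.corrcoef(pairwise(X)[i, j], pairwise(emb)[i, j])[0, 1])

# Stand-in arrays so the sketch runs on its own; in the notebook one would pass
# the flattened X together with mds_2d or mds_3d instead.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20))
emb_demo = X_demo[:, :3]            # crude 3D "embedding": keep 3 coordinates

print("correlation:", distance_correlation(X_demo, emb_demo))
print("perfect case:", distance_correlation(X_demo, X_demo))
```

A value closer to 1 indicates that large original distances tend to stay large in the embedding; it does not measure the stress itself.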


In [25]:
from mpl_toolkits.mplot3d import Axes3D

# enable interactive plots
%matplotlib notebook  

# unlabelled plot
fig_raw = plt.figure()
ax_raw = fig_raw.add_subplot(111, projection='3d')
ax_raw.scatter(mds_3d[:,0], mds_3d[:,1], mds_3d[:,2], marker='.', color='black')
fig_raw.show()

# labelled plot
fig_coloured = plt.figure()
ax_coloured = fig_coloured.add_subplot(111, projection='3d')
# one scatter call per digit class, in sorted order, so the legend is readable
for label in np.unique(Y):
    mask = Y == label
    ax_coloured.scatter(mds_3d[mask, 0], mds_3d[mask, 1], mds_3d[mask, 2], marker='.', label=str(label))
ax_coloured.legend()
fig_coloured.show()


### Resources <a id='linkResources'></a>
- Mehta et al., 2018. A high-bias, low-variance introduction to Machine Learning for physicists. https://arxiv.org/abs/1803.08823
- Cox and Cox, 2001. (MDS textbook, see Ch2, Ch3, Ch4)
- Borg and Groenen, 2005. (MDS textbook)