In [1]:
## settings 
import numpy as np
import matplotlib.pyplot as plt
import scipy, scipy.stats
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
%matplotlib inline
plt.rcParams['figure.figsize'] = (8.0, 4.0)

$\usepackage{amssymb} \newcommand{\R}{\mathbb{R}} \newcommand{\vx}{\vec{x}} \newcommand{\vw}{\vec{w}}$

## 3.6. Generalization: Principal Axis as an Optimization Problem

Good descriptions often result from a minimization of an error function.

### 3.6.1 The Mean as Principal Point

The mean vector 
    $$\vec{\mu} = \langle \vx \rangle_{\vx} = \int\limits_{\R^d} \vx p(\vx) d\vx$$ 
    ($ = \frac{1}{N}\sum_\alpha \vec{x}^\alpha$ in the case of a discrete dataset)
    
minimizes the squared error $E(\vw)$:
$$ \vec{\mu} = \arg \min_{\vw} \underbrace{\langle \|  \vx - \vw \|^2 \rangle_{p(x)}}_{E(\vw)} $$

Proof:
$$  E(\vw) = \langle(\vx-\vw)^\tau (\vx-\vw)\rangle_{\vx} = 
\langle\vx^\tau\vx\rangle - 2 \vw^\tau \langle\vx\rangle + \vw^\tau\vw $$

A minimum has to fulfill the conditions for stationarity $\nabla_{\vw} E = \vec{0}$, i.e. here

$$\nabla_{\vw} E = -2\langle\vx\rangle + 2\vw = \vec{0} \Leftrightarrow \vw = \langle\vx\rangle ~~~ \text{q.e.d.} $$
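The result can also be checked numerically. The following is a minimal sketch on hypothetical sample data (the data, the seed and the helper name `squared_error` are assumptions made for illustration): displacing $\vw$ from the sample mean by $\vec{\delta}$ raises the error by exactly $\|\vec{\delta}\|^2$.

```python
import numpy as np

rng = np.random.default_rng(0)                 # fixed seed, only for reproducibility
X = rng.normal(size=(200, 2)) + [1.0, -2.0]    # hypothetical sample data

def squared_error(w, X):
    """E(w): mean over the data set of ||x - w||^2."""
    return np.mean(np.sum((X - w) ** 2, axis=1))

mu = X.mean(axis=0)                            # sample mean
E_mu = squared_error(mu, X)

# Evaluate E at 50 randomly displaced points w = mu + delta
deltas = rng.normal(size=(50, 2))
E_shifted = np.array([squared_error(mu + d, X) for d in deltas])
gap = E_shifted - E_mu                         # expected: ||delta||^2 for each displacement

print("E(mu) =", E_mu)
print("smallest increase over all displacements:", gap.min())
```

The increase is independent of the data because the cross term $-2\vec{\delta}^\tau\langle\vx - \vec{\mu}\rangle$ vanishes at the mean.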

In [5]:
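# principal point example: move the candidate point (wx, wy) with the sliders
# and watch the summed squared error, which is smallest at the sample mean
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
d = np.random.randn(100, 2)  # 100 samples from a 2D standard normal distribution

def pltprj(wx=1, wy=1):
    """Plot the data, connect every point to (wx, wy) and print the squared error."""
    plt.figure(figsize=(4,4))
    plt.plot(d[:,0], d[:,1], "o", markersize=5)
    plt.axis([-4,4,-4,4])
    E = 0  # sum of squared distances, i.e. N times the mean used in E(w)
    for r in d: 
        E += (r[0]-wx)**2 + (r[1]-wy)**2
        plt.plot([r[0],wx], [r[1],wy], 'r-', linewidth=0.5)
    print("Error: ", E)

interact(pltprj, wx=(-4, 4, 0.1), wy=(-4, 4, 0.1));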


### 3.6.2. Principal Axes

**Given**:
* centered data (with mean vector $\langle \vx \rangle = \vec{0}$)

**Wanted**:
* Axis that minimizes the squared distance to the data
$$
E(\vw) = \frac{1}{N} \sum\limits_{\alpha=1}^N\min_y\left[
  \| \vx^\alpha - \vw\cdot y\|^2
\right] = \frac{1}{N} \sum\limits_{\alpha=1}^N \min_y F_\alpha(y)
$$
<img src="images/PCA-as-min-problem.png" width="50%">

with $$F_\alpha(y) =  \| \vx^\alpha - \vw\cdot y\|^2 = (\vx^\alpha - \vw\cdot y)^2$$

<img src="images/PCA-as-min-problem-2.png" width="20%">


**Step S1**: Minimize
\begin{eqnarray}
F_\alpha(y) &=& (\vx^\alpha - \vw y)^\tau (\vx^\alpha - \vw y)\\
            &=& \vx^{\alpha \tau}\vx^\alpha  - 2 \vx^{\alpha\tau} \vw y + \vw^\tau\vw y^2
\end{eqnarray}

A necessary condition is
$$\frac{\partial F_\alpha}{\partial y} = 0 \Leftrightarrow 2(\vw^\tau\vw y - \vx^{\alpha\tau}\vw) = 0.$$

$$\Rightarrow~y^\alpha = \frac{\vw^\tau\vx^\alpha}{\|\vw\|^2}$$ is a latent variable, namely the projection index of the data point $\vx^\alpha$.

Inserting $y^\alpha$ into $E$ gives 
\begin{eqnarray}
E(\vw) = \frac{1}{N}\sum_\alpha F_\alpha(y^\alpha) &=& \frac{1}{N}\sum_\alpha \vx^{\alpha \tau}\vx^\alpha
            - \frac{2}{N}\sum_\alpha \vw^\tau\vx^\alpha\frac{\vw^\tau\vx^\alpha}{\vw^\tau\vw}
            + \frac{\vw^\tau\vw}{N}\sum_\alpha\frac{\vw^\tau\vx^\alpha\vw^\tau\vx^\alpha}{(\vw^\tau\vw)^2}\\
        &=& \frac{1}{N}\sum_\alpha \vx^{\alpha \tau}\vx^\alpha - \frac{1}{N}\vw^\tau\sum_\alpha
            \frac{\vx^\alpha\vx^{\alpha \tau}\vw}{\|\vw\|^2}
\end{eqnarray}
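As a numerical cross-check of Step S1, the sketch below computes the projection indices $y^\alpha$ for an arbitrary, non-normalized $\vw$ and compares the directly evaluated error with the closed form derived above. The data set and the particular $\vw$ are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3)) @ np.diag([3.0, 1.0, 0.5])   # hypothetical data
X = X - X.mean(axis=0)                                     # center the data

w = np.array([2.0, 1.0, -1.0])        # arbitrary axis direction, not normalized

# Projection index y^alpha = w^T x^alpha / ||w||^2 for every data point
y = X @ w / (w @ w)

# Direct evaluation: E(w) = 1/N sum_alpha ||x^alpha - w y^alpha||^2
residuals = X - np.outer(y, w)
E_direct = np.mean(np.sum(residuals ** 2, axis=1))

# Closed form: 1/N sum x^T x  -  w^T (1/N sum x x^T) w / ||w||^2
C = X.T @ X / len(X)
E_closed = np.mean(np.sum(X ** 2, axis=1)) - w @ C @ w / (w @ w)

print(E_direct, E_closed)
```

The residuals $\vx^\alpha - \vw y^\alpha$ are orthogonal to $\vw$, which is the geometric meaning of the stationarity condition for $y$.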

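**Step S2**: Minimize $E$ with respect to $\vw$

(the first term does not depend on $\vec{w}$, so only the second one matters; minimizing $E$ means maximizing the subtracted term)
\begin{eqnarray*}
\arg\min_{\vw} E(\vw) &=& \arg\max_{\vw} \left(
    \frac{\vw^\tau}{\|\vw\|}\left(
        \frac{1}{N}\sum_\alpha \vx^\alpha\vx^{\alpha \tau}
    \right)\frac{\vw}{\|\vw\|}
\right)\\
 &=& \arg\max_{\hat w} \left(\hat{w}^\tau\mathbf{C}\hat{w}\right)
\end{eqnarray*}
with the unit vector $\hat{w} = \vw / \|\vw\|$ and the covariance matrix $\mathbf{C} = \frac{1}{N}\sum_\alpha \vx^\alpha\vx^{\alpha \tau}$ of the centered data.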


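Since the quadratic form $\hat{w}^\tau\mathbf{C}\hat{w}$ over unit vectors is maximized by the eigenvector with the largest eigenvalue, the axis in the direction of the first eigenvector $\hat{u}_1$ belonging to the largest eigenvalue $\lambda_1$
of the dataset covariance matrix $\mathbf{C}$ minimizes the mean-square error (MSE).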
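A small numerical illustration of this statement, assuming hypothetical 2D data: the error of the axis along $\hat{u}_1$ is compared with the error of axes scanned over all directions in the plane. The helper `axis_error` is an illustrative name and uses the closed form from Step S1.

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[2.0, 0.0], [1.2, 0.5]])          # hypothetical mixing matrix
X = rng.normal(size=(500, 2)) @ A.T
X -= X.mean(axis=0)                              # centered data

C = X.T @ X / len(X)                             # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)             # eigh returns ascending eigenvalues
lam1, u1 = eigvals[-1], eigvecs[:, -1]           # largest eigenvalue and its eigenvector

def axis_error(w, X):
    """Mean squared perpendicular distance of the data to the axis along w."""
    w_hat = w / np.linalg.norm(w)
    return np.mean(np.sum(X ** 2, axis=1)) - np.mean((X @ w_hat) ** 2)

E_u1 = axis_error(u1, X)

# Scan axis directions over half a turn (an axis has no orientation)
angles = np.linspace(0.0, np.pi, 181)
E_scan = np.array([axis_error(np.array([np.cos(a), np.sin(a)]), X) for a in angles])

print("E(u1) =", E_u1, " best scanned direction:", E_scan.min())
```

The minimal error equals $\operatorname{tr}\mathbf{C} - \lambda_1$, i.e. the variance that is not captured by the first principal axis.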

**Remarks:**
* Generalization to $q$-dimensional linear manifolds results in principal (hyper-)planes, which are spanned by the $q$ eigenvectors of $\mathbf{C}$ that belong to the $q$ largest eigenvalues.
* While linear regression in 2D (features $x,y$) minimizes the distance in $y$-direction, treating $y$ as the dependent variable over $x$, here $x$ and $y$ are treated symmetrically: the distance **perpendicular** to the subspace is minimized.
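The difference between the two criteria in the second remark can be made concrete with a short sketch. It assumes hypothetical noisy 2D data and compares the ordinary least-squares slope with the slope of the first principal axis; the function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=400)
y = 0.8 * x + rng.normal(scale=0.6, size=400)    # hypothetical noisy linear relation
X = np.column_stack([x, y])
X -= X.mean(axis=0)                               # centered data

C = X.T @ X / len(X)
slope_ols = C[0, 1] / C[0, 0]                     # least squares of y on x (vertical distances)

u1 = np.linalg.eigh(C)[1][:, -1]                  # first principal axis
slope_pca = u1[1] / u1[0]

def vertical_mse(slope, X):
    """Mean squared distance in y-direction to the line y = slope * x."""
    return np.mean((X[:, 1] - slope * X[:, 0]) ** 2)

def perpendicular_mse(slope, X):
    """Mean squared perpendicular distance to the line y = slope * x."""
    direction = np.array([1.0, slope]) / np.hypot(1.0, slope)
    return np.mean(np.sum(X ** 2, axis=1) - (X @ direction) ** 2)

print("OLS slope:", slope_ols, " PCA slope:", slope_pca)
```

Each line is optimal only under its own criterion: the regression line wins on the vertical error, the principal axis on the perpendicular error.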

### 3.6.3. Principal Curve and Principal Surface

Generalizations to nonlinear manifolds are considerably more difficult. Depending on the dimension $q$ of the latent variable, the result is a principal curve ($q=1$), a principal surface ($q=2$), etc.

First we look at the special case $q=1$.

**Wanted**:

Curve $\vec{f}:\R\to\R^d$, $\vec{f}(y)$, which minimizes the Mean-Square Error (MSE)
$$ E = \langle \min_y \| \vx - \vec{f}(y) \|^2\rangle$$

<img src="images/PCurve-english.png" width="60%">

* Problem 1: Parameterization of $\vec{f}(y)$
 * polygonal line (i.e. concatenated linear segments)
 * Polynomial component functions 

* Problem 2: Limiting the flexibility to avoid trivial interpolation of the data
 * Regularization terms that reward smoothness, e.g.
 $$ E_G = E + \int\left\|\frac{d^2\vec{f}}{dy^2}\right\|^2 dy$$
 * length constraints
 
* Problem 3: Shape of the optimization landscape: the minimization has to be carried out over all parameters
 * Nonlinear optimization problem 
 * $\to$ risk of getting stuck in local optima.

#### Principal Curve by Hastie and Stuetzle (1989)

The principal curve was introduced as a curve that fulfills the so-called self-consistency condition
$$ \vec{f}(y) = \langle \vx | y_f(\vx) = y \rangle_{\vx} $$

* Here $y_f(\vx)$ is the projection index of $\vx$ onto the curve. Each point of the curve is thus the mean of all data points that have this point as their nearest point on the curve.

<img src="images/PCurveHastie-english.png" width="40%">

This means that each point of the curve is the extremal point of a quadratic distance function
$$ E(y^\star) = \frac{1}{2}\int\limits_{y_f(\vx) = y^\star} (\vx-f(y^\star))^2  p(\vx) d\vx $$
* here the integration is over all points that project onto $y^\star$
* that the conditional mean solves this optimization problem follows from Section 3.6.1 (principal point)

**Algorithm**:
1. Start with $\vec{f}(y) = \langle\vx\rangle + \hat u_1 y$ (first principal axis of the PCA)
 * i.e. set for each data point $\vx^\alpha: ~~~ y^\alpha = \hat u_1^\tau (\vx^\alpha - \langle\vx\rangle)$
2. Hold the $y^\alpha$ fixed and minimize $\langle\| \vx - \vec{f}(y)\|^2\rangle$ by setting 
$$ f_j(y) = \langle x_j | y_f(\vx) = y \rangle ~\forall~ j $$
 * Note that in case of finite (discrete) data sets we need a neighborhood window so that all data points in the vicinity of $y$ are averaged.
A practical choice for that is:
$$
\vec{f}(y) = \frac{\sum\limits_\alpha \vx^\alpha \exp\left(-\frac{(y(\vx^\alpha)-y)^2}{2\sigma^2} \right)}{\sum\limits_\alpha \exp\left(-\frac{(y(\vx^\alpha)-y)^2}{2\sigma^2}\right)}
$$
where $\sigma$ is the width of the neighborhood window. 
In practice this is not evaluated for all possible $y$, but only for the $y^\alpha$ onto which data points currently project, so that the curve is represented by a polygonal line of $N$ samples.
3. Hold $\vec{f}$ fixed and calculate the new projection indices $y^\alpha = y_{\vec{f}}(\vx^\alpha)~\forall~ \alpha$
4. Go to step 2 as long as the error is reduced by more than a threshold  
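The four steps can be sketched in a few lines of NumPy. This is an illustrative sketch under simplifying assumptions, not a reference implementation: the projection in step 3 uses the nearest of the $N$ curve samples instead of the exact projection onto the polygonal line, the new $y^\alpha$ is the arc length along the polygon, and the values of `sigma`, `n_iter` and `tol` as well as the test data are hypothetical.

```python
import numpy as np

def principal_curve(X, sigma=0.3, n_iter=20, tol=1e-6):
    """Iterate steps 1-4; the curve is stored as N samples ordered along y."""
    Xc = X - X.mean(axis=0)
    u1 = np.linalg.eigh(Xc.T @ Xc / len(X))[1][:, -1]
    y = Xc @ u1                                    # step 1: start on the first principal axis
    prev_err = np.inf
    for _ in range(n_iter):
        # step 2: Gaussian-weighted mean of the data around each y^alpha
        K = np.exp(-(y[:, None] - y[None, :]) ** 2 / (2 * sigma ** 2))
        F = (K @ X) / K.sum(axis=1, keepdims=True)
        F = F[np.argsort(y)]                       # order the samples along the curve
        # arc length of the polygonal line as new curve parameter (assumption)
        seg = np.linalg.norm(np.diff(F, axis=0), axis=1)
        arc = np.concatenate([[0.0], np.cumsum(seg)])
        # step 3: project every data point onto its nearest curve sample (simplification)
        d2 = ((X[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        y = arc[nearest]
        err = d2[np.arange(len(X)), nearest].mean()
        # step 4: stop once the error no longer drops by more than tol
        if prev_err - err < tol:
            break
        prev_err = err
    return F, arc, err

# Hypothetical test data: a noisy parabola
rng = np.random.default_rng(4)
t = rng.uniform(-1.5, 1.5, size=200)
X = np.column_stack([t, t ** 2]) + 0.1 * rng.normal(size=(200, 2))

F, arc, err_curve = principal_curve(X)

# For comparison: error of the first principal axis, tr(C) - lambda_1
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / len(X)
err_line = np.trace(C) - np.linalg.eigvalsh(C)[-1]
print("principal axis error:", err_line, " principal curve error:", err_curve)
```

The window width `sigma` plays the role of the smoothness constraint: a very small value lets the curve interpolate the data, a very large value pulls every curve sample towards the global mean.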

### 3.6.4 Principal Surfaces in general

* are the result of a generalization of the above principal curve optimization
* instead of a linear combination of eigenvectors we consider a parameterized $q$-dimensional surface $\tilde\vx(y_1,\dots y_q ; \vw) \in \R^d$.
* The parameters $\vw$ are determined by an extremality requirement 
\begin{eqnarray}
	D[\vw] &:=& \frac{1}{2} \left\langle\left(\vx- \tilde{\vx}(y_1 \dots y_q;\vw) \right)^2\right\rangle
		\stackrel{!}{=}\text{minimal}  \\	
	&=& \frac{1}{2}\int\left(\vx- \tilde{\vx}(y_1 \dots y_q;\vw) \right)^2 
	p(\vx)~d\vx ~~\text{for given density} ~~ p(\vx)
\end{eqnarray}
* Optimization gives 'principal surfaces' (or principal curves, principal manifolds)
* Caution: additional smoothness has to be required, otherwise the manifold will overfit the data. 

Smoothness constraints are sometimes implicit in the chosen parameterization of $\tilde{\vx}$,
* e.g. if a low-order polynomial is chosen.

**Computational Procedure**:

* Gradient descent 
$$
\frac{\partial D}{\partial w_j}  = -\left\langle
	\left(\vx-\tilde{\vx}(y_1,\ldots,y_q; \vw)\right)^\tau \cdot 
		\frac{\partial}{\partial w_j} \tilde{\vx}(y_1,\ldots,y_q; \vw)\right\rangle_{p(\vx)}
$$ 
* In case of finite data sets: "stochastic approximation"
\begin{eqnarray} 
\left.\frac{\partial D}{\partial w_j}\right|_{\vx^\alpha} &\approx& 
  	-\left(\vx^\alpha - \tilde{\vx}(y_1^\alpha,\ldots, y_q^\alpha; \vw)\right)^\tau \cdot \frac{\partial}{\partial w_j}\tilde{\vx}(y_1^\alpha,\ldots,y_q^\alpha;\vw)~~\text{and thus} \\
\Delta w_j & = & 
	-\eta \left.\frac{\partial D}{\partial w_j} \right|_{\vx^\alpha} = 
	\eta \cdot \left(\vx^\alpha - \tilde{\vx}(y_1^\alpha,\ldots,y_q^\alpha; \vw)\right)^\tau \cdot \frac{\partial}{\partial w_j}\tilde{\vx}(y_1^\alpha,\ldots,y_q^\alpha;\vw) 
\end{eqnarray}
* Note: the $y_i(\vx)$ depend on both $\vx$ and $\vw$, because they are the coordinates of the _best-match point_ of $\vx$ on the surface.
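The stochastic update can be illustrated for the simplest case $q=1$ with a polynomial curve $\tilde{\vx}(y;\mathbf{W}) = \mathbf{W}^\tau\vec{\phi}(y)$, $\vec{\phi}(y) = (1, y, y^2)^\tau$. The following is a sketch under assumptions that are not part of the text above: the best-match point is found by a search over a fixed grid of $y$ values, and the data, the learning rate and the starting curve are hypothetical.

```python
import numpy as np

def basis(y, degree):
    """Monomials 1, y, ..., y^degree for a scalar or an array of y values."""
    return np.power.outer(np.asarray(y, dtype=float), np.arange(degree + 1))

def curve(y, W):
    """x_tilde(y; W) = W^T phi(y); W has shape (degree + 1, d)."""
    return basis(y, W.shape[0] - 1) @ W

def best_match(x, W, y_grid):
    """Grid value y whose curve point is closest to x (best-match point)."""
    return y_grid[np.argmin(np.sum((curve(y_grid, W) - x) ** 2, axis=1))]

def mean_sq_dist(X, W, y_grid):
    """Mean squared distance of the data to the sampled curve."""
    d2 = ((X[:, None, :] - curve(y_grid, W)[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

def sgd_fit(X, W0, y_grid, eta=0.05, epochs=30, seed=0):
    """Stochastic approximation: one gradient step per presented data point."""
    rng = np.random.default_rng(seed)
    W = W0.copy()
    degree = W.shape[0] - 1
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            y = best_match(x, W, y_grid)          # y depends on x and on the current W
            phi = basis(y, degree)                # d x_tilde_j / d W_kj = phi_k(y)
            W += eta * np.outer(phi, x - curve(y, W))
    return W

# Hypothetical data: a noisy parabola
rng = np.random.default_rng(5)
t = rng.uniform(-1.0, 1.0, size=200)
X = np.column_stack([t, t ** 2]) + 0.05 * rng.normal(size=(200, 2))

y_grid = np.linspace(-1.0, 1.0, 201)
W0 = np.zeros((3, 2))
W0[1, 0] = 1.0                                    # start: straight line along the first coordinate
W = sgd_fit(X, W0, y_grid)

err_before = mean_sq_dist(X, W0, y_grid)
err_after = mean_sq_dist(X, W, y_grid)
print("error before:", err_before, " after:", err_after)
```

The low polynomial degree acts as the implicit smoothness constraint mentioned above.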

**Example:**
Algorithmic procedure on a discrete grid $A$ (e.g. $q=2$) for $\tilde{\vx}(y_1,y_2;\vw)$:
* the surface parameters $y_1,y_2$ become discrete grid indices $\vec{r} \in A$, 
* the function parameters $\vw$ become the set $\vw=\{\vw_{\vec{r}} | \vec{r} \in A\}$ of lattice point positions $\vw_{\vec{r}}$ in the embedding space $\R^d$.

With this, the surface can be written in a 'lattice representation':
$$\tilde{\vx}(y_1,y_2;\vw) \equiv \vw_{\vec{r}} \in \R^d $$

<img src ="images/gitter.png" width="30%">

Mean squared distance between grid and data:

\begin{eqnarray}
D[\vw] & = & 
	\frac{1}{2} \sum\limits_r \int\limits_{F_r} (\vx-\vw_{\vec{r}})^2 p(\vx)d\vx \\
F_{\vec{r}} & = & 
	\{ \vx |~ \|\vx-\vw_r\| \leq \|\vx - \vw_{r'}\| ~\forall~ \vec{r}' \not= \vec{r}\}
\end{eqnarray}
* i.e. 'winner takes all': each $\vx$ is assigned to its nearest lattice point

$F_{\vec{r}}$ is called the _Voronoi cell_ of $\vw_{\vec{r}}$.

\begin{eqnarray}
D^*[\vw] & = & 
	\frac{1}{2} \sum\limits_r \int\limits_{F_r}(\vx-\vw_r)^2 p(\vx)d\vx\\
\frac{\partial D^*}{\partial \vw_r} & = & 
	-\int\limits_{F_r}(\vx-\vw_r)\cdot p(\vx)d\vx\\
\Delta \vw_r & = & 
	-\epsilon \frac{\partial D^*}{\partial \vw_r}   =  \epsilon \langle\vx-\vw_r \rangle|_{\vx \in F_r}
\end{eqnarray}

**Stochastic approximation**: 
$$\Delta \vw_r \stackrel{\text{def}}{=} \epsilon(\vx^\alpha-\vw_r)$$ where $r$ is the index of the center $\vw_r$ with the smallest distance to the current learning example $\vx^\alpha$.
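A minimal sketch of this winner-takes-all rule (online vector quantization) on hypothetical clustered data; the number of centers, the step size `eps` and the initialization from random data points are assumptions chosen for illustration.

```python
import numpy as np

def online_vq(X, n_centers=4, eps=0.05, epochs=20, seed=0):
    """Move only the winning center towards each presented example."""
    rng = np.random.default_rng(seed)
    # initialize the centers on randomly chosen data points (assumption)
    W = X[rng.choice(len(X), size=n_centers, replace=False)].copy()
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            r = np.argmin(np.sum((W - x) ** 2, axis=1))   # winner: x lies in Voronoi cell F_r
            W[r] += eps * (x - W[r])                      # Delta w_r = eps (x - w_r)
    return W

def quantization_error(X, W):
    """Sample estimate of D[w]: half the mean squared distance to the nearest center."""
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return 0.5 * d2.min(axis=1).mean()

# Hypothetical data: four Gaussian blobs
rng = np.random.default_rng(6)
blob_means = np.array([[2, 2], [2, -2], [-2, 2], [-2, -2]], dtype=float)
X = np.concatenate([m + 0.3 * rng.normal(size=(100, 2)) for m in blob_means])

W = online_vq(X)
err_vq = quantization_error(X, W)
err_mean = quantization_error(X, X.mean(axis=0, keepdims=True))   # single center at the mean
print("one center:", err_mean, " four centers:", err_vq)
```

Without a neighborhood term the centers carry no grid topology; each one only tracks the mean of its own Voronoi cell.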

* An additional smoothness constraint is integrated to eliminate the problem of lattice folding:
"blurring" of each adaptation step over the neighborhood of $r$.
$$ \Delta \vec{w_{r'}} = \epsilon \cdot \underbrace{\exp\left(-\frac{(r-r')^2}{2\sigma^2}\right)}_{\text{blurring function}}(\vec{x}-\vec{w_{r'}})~, $$
where $r$ is the index of the Voronoi cell into which $\vec{x}$ falls. 
* $\to$ learning rule of Kohonen nets / self-organizing networks
* As a result, the SOM (self-organizing map) algorithm is obtained from an optimization problem (if smoothness constraints are added).
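The blurred update rule can be sketched as follows for a $6\times 6$ grid. The grid size, the learning rate, the number of steps, the data and in particular the shrinking schedule for $\sigma$ are assumptions added for illustration; the rule above is stated for a fixed $\sigma$.

```python
import numpy as np

def train_som(X, grid_shape=(6, 6), eps=0.1, sigma_start=2.0, sigma_end=0.5,
              n_steps=5000, seed=0):
    """Online SOM training with a Gaussian neighborhood on the grid."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    # grid index vectors r in A, one row per lattice point
    R = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"),
                 axis=-1).reshape(-1, 2)
    # start all lattice points close to the data mean (assumption)
    W = X.mean(axis=0) + 0.01 * rng.normal(size=(len(R), X.shape[1]))
    for step in range(n_steps):
        x = X[rng.integers(len(X))]
        # neighborhood width shrinks geometrically over time (assumption)
        sigma = sigma_start * (sigma_end / sigma_start) ** (step / n_steps)
        r = np.argmin(np.sum((W - x) ** 2, axis=1))            # winner index
        h = np.exp(-np.sum((R - R[r]) ** 2, axis=1) / (2 * sigma ** 2))
        W += eps * h[:, None] * (x - W)                        # blurred adaptation step
    return W.reshape(rows, cols, -1)

def quantization_error(X, W):
    """Half the mean squared distance of the data to the nearest lattice point."""
    nodes = W.reshape(-1, W.shape[-1])
    d2 = ((X[:, None, :] - nodes[None, :, :]) ** 2).sum(axis=2)
    return 0.5 * d2.min(axis=1).mean()

# Hypothetical data: uniform samples in the unit square
rng = np.random.default_rng(7)
X = rng.uniform(0.0, 1.0, size=(2000, 2))

som = train_som(X)
err_som = quantization_error(X, som)
err_mean = quantization_error(X, X.mean(axis=0, keepdims=True))
print("single point at the mean:", err_mean, " trained SOM:", err_som)
```

Plotting `som[:, :, 0]` against `som[:, :, 1]` with lines between grid neighbors shows whether the lattice has unfolded over the data.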

[ws18EOT1207]