# t-distributed stochastic neighbor embedding (t-SNE)

Objective: Embed high-dimensional data for visualization in a low-dimensional space of 2 or 3 dimensions, in such a way that, with high probability, similar objects are modeled by nearby points and dissimilar objects by distant points. Similarity is made precise by the pairwise probabilities defined under "t-SNE details".

1. t-SNE constructs a probability distribution over pairs of high-dimensional objects such that similar objects are assigned a higher probability and dissimilar objects a lower one.
2. t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map.

## Kullback–Leibler divergence

$$
D_{\mathrm{KL}}(P \| Q)=\sum_{x \in \mathcal{X}} P(x) \log \left(\frac{P(x)}{Q(x)}\right)
$$

It measures the surprise incurred by using Q as a model when the actual distribution is P.

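The surprise is non-negative. Since $\log t \ge 1 - 1/t$ for all $t > 0$, taking $t = P(x)/Q(x)$ gives:

$$
\sum_{x \in \mathcal{X}} P(x) \log \left(\frac{P(x)}{Q(x)}\right) \ge \sum_{x \in \mathcal{X}} P(x) \left(1 - \frac{Q(x)}{P(x)}\right) = \sum_{x \in \mathcal{X}} P(x) - \sum_{x \in \mathcal{X}} Q(x) = 0
$$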

With base-2 logarithms, it is the expected number of extra bits needed to encode samples from P using a code optimized for Q rather than one optimized for P.
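As a quick numerical check of the definition, here is a minimal sketch in plain Python. The function name `kl_divergence`, the `base` parameter, and the example distributions `p` and `q` are illustrative choices, not taken from any library. It assumes both inputs are discrete distributions over the same finite set, and it uses the usual convention that terms with $P(x) = 0$ contribute 0.

```python
import math


def kl_divergence(p, q, base=2.0):
    """Return D_KL(P || Q) for two discrete distributions given as sequences.

    With base=2.0 the result is in bits. Assumes p and q have the same
    length and each sums to 1.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if q_x == 0.0:
            return math.inf  # P puts mass where Q puts none
        total += p_x * math.log(p_x / q_x, base)
    return total


# Illustrative distributions (hypothetical values).
p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]

print(kl_divergence(p, q))  # small positive number, roughly 0.085 bits
print(kl_divergence(q, p))  # a different value: the divergence is not symmetric
print(kl_divergence(p, p))  # 0.0
```

The second call illustrates that $D_{\mathrm{KL}}(P \| Q)$ and $D_{\mathrm{KL}}(Q \| P)$ generally differ, which is why the order of the two distributions matters in the t-SNE objective.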

## t-SNE details

$$
p_{j \mid i}=\frac{\exp \left(-\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2 / 2 \sigma_i^2\right)}{\sum_{k \neq i} \exp \left(-\left\|\mathbf{x}_i-\mathbf{x}_k\right\|^2 / 2 \sigma_i^2\right)}
$$
for $i \neq j$, and set $p_{i \mid i}=0$.

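The $\exp(\cdot)$ term is an unnormalized Gaussian kernel centered at $\mathbf{x}_i$ with variance $\sigma_i^2$, so $p_{j \mid i}$ is large when $\mathbf{x}_j$ is close to $\mathbf{x}_i$ and small when it is far away.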

Now define the symmetrized joint probabilities, where $N$ is the number of data points:
$$
p_{i j}=\frac{p_{j \mid i}+p_{i \mid j}}{2 N}
$$

Note that $p_{ii} = 0$, $p_{ij} = p_{ji}$, and $\sum_{i, j} p_{ij} = 1$, since $\sum_j p_{j \mid i} = 1$ for each $i$.
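The two definitions above can be sketched in plain Python as follows. The function names `conditional_p` and `joint_p`, the toy points `X`, and the bandwidths `sigmas` are hypothetical and chosen only for illustration. In particular, the bandwidths $\sigma_i$ are passed in by hand here, since these notes do not cover how they are chosen.

```python
import math


def conditional_p(X, sigmas):
    """Return a matrix C with C[i][j] = p_{j|i}, using one bandwidth per point."""
    n = len(X)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        weights = [0.0] * n  # weights[i] stays 0, which enforces p_{i|i} = 0
        for k in range(n):
            if k == i:
                continue
            sq_dist = sum((a - b) ** 2 for a, b in zip(X[i], X[k]))
            weights[k] = math.exp(-sq_dist / (2.0 * sigmas[i] ** 2))
        norm = sum(weights)  # the sum over k != i in the denominator
        for j in range(n):
            C[i][j] = weights[j] / norm
    return C


def joint_p(C):
    """Symmetrize: p_ij = (p_{j|i} + p_{i|j}) / (2N)."""
    n = len(C)
    return [[(C[i][j] + C[j][i]) / (2.0 * n) for j in range(n)] for i in range(n)]


# Toy data: three points in the plane, with hand-picked bandwidths.
X = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
sigmas = [1.0, 1.0, 2.0]

C = conditional_p(X, sigmas)
P = joint_p(C)

print(sum(sum(row) for row in P))  # approximately 1
print([P[i][i] for i in range(len(X))])  # all zeros on the diagonal
```

Each row of `C` sums to 1 but `C` itself is not symmetric, because each point has its own $\sigma_i$. The averaging step is what produces a single symmetric distribution over pairs.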