# Transductive Centroid Projections - Part 1

### Classifier weights as normals of decision hyperplane

- A deep neural network can be regarded as a classifier attached to a feature extractor.
- The feature extractor consists of all the layers except the final dense layer and outputs an embedding $f$ for an example $x$.
- The final classifier layer takes the embedding $f$ as input and outputs the prediction $\hat{y} = W^Tf$.
- Each element of the prediction is given by $\hat{y}_n = (W^T)_nf = W_{:n}^Tf = w_n^Tf$, i.e. the dot product of the $n$-th column of $W$ with the output of the feature extractor.
- The predicted class $n'$ is the index at which this dot product is the highest, meaning that, out of all $w_n$, $f$ is most closely aligned with $w_{n'}$.
- This means that the weights $W$ of the final dense layer lie in the direction of the normal vectors of the decision hyperplane learned by the model.
- We call them *anchors* of each class.
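
The points above can be illustrated with a small NumPy sketch. The dimensions, the random weights and the variable names are assumptions made for illustration only; the sketch just checks that the matrix product $W^Tf$ is the same as taking the dot product of $f$ with each column (anchor) of $W$, and that the prediction is the arg-max over those dot products.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the paper).
d, num_classes = 4, 3

W = rng.normal(size=(d, num_classes))  # columns w_n play the role of class anchors
f = rng.normal(size=d)                 # embedding produced by the feature extractor

# Prediction of the final dense layer: y_hat = W^T f.
y_hat = W.T @ f

# The same values, computed column by column as dot products w_n^T f.
per_column = np.array([W[:, n] @ f for n in range(num_classes)])

# The predicted class is the anchor with the largest dot product.
predicted = int(np.argmax(y_hat))
print("logits:", y_hat, "predicted class:", predicted)
```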

### How does the model work
- Unlabelled examples are first clustered with some clustering algorithm.
- Each minibatch consists of labelled data $\mathcal{X}_p^L \subset \mathcal{X}^L$ and unlabelled data $\mathcal{X}_q^U \subset \mathcal{X}^U$.
- The labelled part of the minibatch is constructed as usual by selecting $\mathcal{X}_p^L$ at random.
- However, $\mathcal{X}_q^U$ is constructed by randomly selecting $l$ unlabelled clusters with $o$ samples from each cluster, so that $q = l \times o$.
- The layer prior to the classification layer outputs the vectors $f_1,...,f_N$ for a batch of size $N$, which are split into two groups of vectors $[f^L, f^U]$.
- Similarly, the weight matrix can be split into two matrices $W^M, W^l$.
- $W^M$ consists of $M$ column vectors corresponding to the anchors for each of the $M$ classes, whilst $W^l$ has $l$ column vectors corresponding to the centroids of the $l$ clusters.
- The centroids for the unlabelled data are obtained as follows:

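    $$c_i^U = \alpha \sum_{\iota=1}^o \frac{f_{i,\iota}^U}{\lVert {f_{i,\iota}^U} \rVert_2}$$

    $$\alpha = \frac{1}{M} \sum_{j=1}^M \lVert {c_j^L} \rVert_2$$

    Here $f_{i,\iota}^U$ is the feature of the $\iota$-th sample in unlabelled cluster $i$, and $\alpha$ rescales the unlabelled centroids to the average norm of the labelled-class centroids $c_j^L$.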
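
The centroid construction can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the helper name `cluster_centroids`, the array layout and the sizes are hypothetical, and it assumes that the labelled centroids $c_j^L$ are the columns of $W^M$.

```python
import numpy as np

def cluster_centroids(f_u, anchors):
    """Hypothetical helper: scaled sums of unit-normalised cluster features.

    f_u:     array of shape (l, o, d), features of o samples for each of l clusters.
    anchors: array of shape (d, M), one labelled-class anchor per column
             (assumed here to stand in for the labelled centroids c_j^L).
    Returns an array of shape (d, l) with one centroid per column.
    """
    # alpha: mean L2 norm of the M labelled anchors.
    alpha = np.linalg.norm(anchors, axis=0).mean()
    # Normalise every feature to unit L2 norm, then sum within each cluster.
    unit = f_u / np.linalg.norm(f_u, axis=-1, keepdims=True)
    return alpha * unit.sum(axis=1).T

rng = np.random.default_rng(0)
M, l, o, d = 5, 3, 4, 8                 # illustrative sizes only
W_M = rng.normal(size=(d, M))           # anchors of the M labelled classes
f_U = rng.normal(size=(l, o, d))        # unlabelled features grouped by cluster

W_l = cluster_centroids(f_U, W_M)       # centroids of the l sampled clusters
W = np.concatenate([W_M, W_l], axis=1)  # combined weight matrix [W^M, W^l]
print(W.shape)
```

With this layout the combined matrix has $M + l$ columns, so the classification layer scores each feature against both the class anchors and the cluster centroids of the current minibatch.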

### Why use centroids

- They show that the anchors i.e. the columns of $W$ converge to the centroids of the features $f$ of the layer prior to the classification layer for different datasets and different dimensions of $f$.
- The weight update for $w_n$, the $n$-th column of $W$ (i.e. learning rate $\eta$ times the negative loss gradient with respect to $w_n$), can be shown to be:

    $$\Delta w_n = -\eta\nabla_{w_n}l 
    = \eta\sum_{f \in I_n}(1-p_n)f - \eta\sum_{f \notin I_n}p_n f   $$
    
    $$ p_n =  \frac{\exp y_n}{\sum_{n'}{\exp y_{n'}}} \quad \text{(i.e. predicted probability that the class of the example is $n$; the sum runs over all classes)}$$
    
    $$ y = W^Tf$$
    
    
- The first term involves a weighted sum of the features of the examples belonging to class $n$.
- We can think of this term as approximately pointing along the direction of the centroid.
- One way to see this is to note that, for the examples with high predicted probability for class $n$, the dot product between $w_n$ and $f$ is already large and positive.
- So initially $w_n$ is more aligned with the features of these examples.
- However, each feature is weighted by $1 - p_n$ in the update, so the class-$n$ examples with low predicted probability contribute most, which moves $w_n$ closer to the features of those other examples.
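
The form of the update can be checked numerically. The sketch below assumes the loss $l$ is the softmax cross-entropy summed over a batch (an assumption, since the notes do not spell out the loss), and all sizes, names and random data are illustrative. It compares $-\eta\nabla_{w_n}l$, estimated by central finite differences, with a pull towards the features of class $n$ weighted by $1 - p_n$ minus a push away from all other features weighted by $p_n$.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def loss(W, F, labels):
    # Softmax cross-entropy summed over the batch (assumed form of the loss l).
    P = softmax(F @ W)
    return -np.log(P[np.arange(len(labels)), labels]).sum()

def analytic_delta(W, F, labels, n, eta):
    # Update of column w_n: pull towards class-n features weighted by (1 - p_n),
    # push away from every other feature weighted by p_n.
    p_n = softmax(F @ W)[:, n]
    in_class = labels == n
    pull = ((1.0 - p_n[in_class])[:, None] * F[in_class]).sum(axis=0)
    push = (p_n[~in_class][:, None] * F[~in_class]).sum(axis=0)
    return eta * pull - eta * push

rng = np.random.default_rng(0)
B, d, C, eta, n = 12, 5, 3, 0.1, 1      # illustrative sizes only
W = rng.normal(size=(d, C))
F = rng.normal(size=(B, d))             # one feature vector f per row
labels = rng.integers(0, C, size=B)

# Central finite differences for the gradient with respect to column n of W.
eps = 1e-6
grad = np.zeros(d)
for k in range(d):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[k, n] += eps
    W_minus[k, n] -= eps
    grad[k] = (loss(W_plus, F, labels) - loss(W_minus, F, labels)) / (2 * eps)

numeric_delta = -eta * grad
analytic = analytic_delta(W, F, labels, n, eta)
print("max abs difference:", np.abs(numeric_delta - analytic).max())
```

Because the pull term weights each class-$n$ feature by $1 - p_n$, the examples that the current $w_n$ scores poorly dominate the update, which matches the intuition in the bullets above.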