# Example By Hand of Doc2Vec

This notebook carries out a by-hand iteration of Doc2Vec to aid understanding. The general form can be inferred easily from the examples and may be referenced throughout these exercises.

## Word2Vec (CBOW) and one hot examples


Let's start with the continuous bag of words (CBOW) model. The sentence to be converted to a list of vectors is "hello its me". In gensim terms, this is the build-vocabulary phase.

### One hot set up

First of all we map our words to a one-hot representation, which is little more than ML terminology for an orthonormal basis set; in this example our one-hot vectors live in $\mathbb{R}^{3}$. We define the mapping $\phi: V \rightarrow \{ 0, 1\}^{|V|}$, which maps from the vocabulary $V$ to the set of orthonormal vectors that span $\mathbb{R}^{|V|}$:

\begin{equation}
\phi(hello) =\begin{pmatrix}
            1 \\ 0 \\ 0 
           \end{pmatrix} = \mathbf{e_{1}}, \phi(its) =\begin{pmatrix}
            0 \\ 1 \\ 0 
           \end{pmatrix} = \mathbf{e_{2}}  ,  \phi(me) =\begin{pmatrix}
            0 \\ 0 \\ 1 
           \end{pmatrix} = \mathbf{e_{3}}
\end{equation}

More generally for vocabulary $V$  we have:

\begin{equation}
\left \{ \phi(v) \right\}_{v \in V} = \left \{ \mathbf{e_{j}} \right\}_{j=1}^{|V|}
\end{equation}
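
As a minimal sketch of the mapping $\phi$ above, the one-hot vectors can be built with NumPy. The names `vocab`, `word_to_index` and `phi` are my own choices for illustration, and the indices are 0-based here whereas the equations use $\mathbf{e_{1}}, \mathbf{e_{2}}, \mathbf{e_{3}}$:

```python
import numpy as np

# Vocabulary in order of first appearance in the example sentence.
sentence = "hello its me".split()
vocab = list(dict.fromkeys(sentence))  # drops duplicates, keeps order
word_to_index = {word: j for j, word in enumerate(vocab)}

def phi(word):
    """Return the one-hot vector for `word` (0-based position in vocab)."""
    e = np.zeros(len(vocab))
    e[word_to_index[word]] = 1.0
    return e

# Stacking phi(v) for every v in the vocabulary gives the identity matrix,
# i.e. the standard orthonormal basis of R^|V|.
one_hot = np.stack([phi(w) for w in vocab])
print(one_hot)
```

Stacking the vectors gives the identity matrix, which is another way of reading the statement that the one-hot vectors form an orthonormal basis.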

Now we are ready to proceed to the neural model.


### CBOW  (dummy example continued, Forward propagation only)

Moving on to the continuous bag of words model: to simplify the illustration, let's say we choose to encode the words in a 2-dimensional space (the weight matrix $W$ becomes two by three). Each word in the vocabulary is predicted from just the word before it (we could pick a bigger window before and after, but using one word makes for an easier example):

\begin{equation}
W=\begin{pmatrix}
1  & 2 & 3 \\
4 & 5 & 6
\end{pmatrix}
\end{equation}

Then the hidden layer outputs for each word are, respectively:

Hidden layer output used to predict "its":

\begin{equation}
\mathbf{h_{hello}}=W\mathbf{e_{1}}=\begin{pmatrix}
1  & 2 & 3 \\
4 & 5 & 6
\end{pmatrix}\begin{pmatrix}1 \\ 0 \\ 0 \end{pmatrix}=\begin{pmatrix}1 \\ 4 \end{pmatrix}
\end{equation}


Hidden layer output used to predict "me":

\begin{equation}
\mathbf{h_{its}}=W\mathbf{e_{2}}=\begin{pmatrix}
1  & 2 & 3 \\
4 & 5 & 6
\end{pmatrix} \begin{pmatrix}
            0 \\ 1 \\ 0 
           \end{pmatrix}=\begin{pmatrix}2 \\ 5 \end{pmatrix}
\end{equation}


Hidden layer output not used to predict any word here, although it could be concatenated with the one above if we used a window of 2:

\begin{equation}
\mathbf{h_{me}}=W\mathbf{e_{3}}=\begin{pmatrix}
1  & 2 & 3 \\
4 & 5 & 6
\end{pmatrix} \begin{pmatrix}
            0 \\ 0 \\ 1 
           \end{pmatrix}=\begin{pmatrix}3 \\ 6 \end{pmatrix}
\end{equation}

The generalization here is that this neural model's weight matrix has a column for each word: the one-hot representation combined with the linear activation function makes the hidden layer simply copy out a particular column. After this hidden layer a softmax follows, and backprop takes place as usual. Note that a single backprop step does not touch every column of $W$: with a one-hot input, the gradient with respect to $W$ is non-zero only in the column of the input (context) word, while a full softmax output layer updates the output weights of every word in the vocabulary. Over a pass through the corpus, every word that appears as context has its column updated.
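
The column-lookup behaviour of the hidden layer can be checked numerically. The sketch below reuses $W$ from the example. The output weight matrix `U` and the `softmax` helper are hypothetical additions of mine (the text does not specify an output layer), included only to show where the softmax would sit:

```python
import numpy as np

# Weight matrix W from the example: one column per vocabulary word.
W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
E = np.eye(3)  # columns are the one-hot vectors e_1, e_2, e_3

h_hello = W @ E[:, 0]
h_its = W @ E[:, 1]
h_me = W @ E[:, 2]

# Multiplying by a one-hot vector is the same as selecting a column of W.
lookup_matches = all(np.array_equal(W @ E[:, j], W[:, j]) for j in range(3))

# Hypothetical output weights (|V| x 2), NOT taken from the text.
U = np.array([[0.1, 0.0],
              [0.0, 0.1],
              [0.1, 0.1]])

def softmax(z):
    """Numerically stable softmax over a 1-d array of scores."""
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

# Probability distribution over the vocabulary for the word after "hello".
p_next = softmax(U @ h_hello)
print(h_hello, h_its, h_me, lookup_matches)
```

In practice this is why implementations store the word vectors as a lookup table instead of performing the matrix product.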

## Doc2Vec (PV-DM)

This is where the calculation gets a little ugly, but it follows directly from Word2Vec. Let's expand our example to two documents, $\{hello, its, me\}$ and $\{its, me \}$. Note that I did not expand the vocabulary, to minimise the amount of LaTeX I have to type.


We now augment our neural network, which originally had only one weight matrix $W$, with a second weight matrix $D$ whose number of columns is the number of documents and whose number of rows is the dimensionality in which we encode each document (for this example, let's say our document vectors are of size 2 as well). Furthermore, let the symbol $\otimes$ denote concatenation or elementwise sum (and not the tensor product).

Just as in Word2Vec, we define a mapping analogous to $\phi$, called $\theta$ to avoid confusion, which maps from the document domain to a one-hot representation in the same fashion as $\phi$:

\begin{equation}
\theta(\{hello, its, me\}) = \mathbf{e'_{1}} = \begin{pmatrix}1 \\ 0 \end{pmatrix} , \theta(\{its, me\}) = \mathbf{e'_{2}} =  \begin{pmatrix}0 \\ 1 \end{pmatrix}
\end{equation}

Now we can do a forward pass (up to the hidden layer) through our network of $D$ and $W$ when predicting "its" using "hello" in document 1 (remember that for all predictions within document 1 we use the one-hot representation of document 1 as input):

(Instead of a random initialization, I am going to set $D$ to a diagonal matrix with entries 2 and 0.5 to simplify the calculations.)

\begin{equation}
\mathbf{h_{hello|doc1}}=D\mathbf{e'_{1}} \otimes W\mathbf{e_{1}}=\begin{pmatrix}2 & 0 \\ 0 & 0.5 \end{pmatrix}   \begin{pmatrix}1 \\ 0 \end{pmatrix} \otimes \begin{pmatrix}
1  & 2 & 3 \\
4 & 5 & 6
\end{pmatrix} \begin{pmatrix}
            1 \\ 0 \\ 0 
           \end{pmatrix}=\underbrace{\begin{pmatrix}2 \\ 0 \end{pmatrix}}_{docVec} \otimes \overbrace{\begin{pmatrix}1 \\ 4 \end{pmatrix}}^{wordVec}
\end{equation}

Let's do a pass for the second document when predicting "me" based on "its":

\begin{equation}
\mathbf{h_{its|doc2}}=D\mathbf{e'_{2}} \otimes W\mathbf{e_{2}}=\begin{pmatrix}2 & 0 \\ 0 & 0.5 \end{pmatrix}   \begin{pmatrix}0 \\ 1 \end{pmatrix} \otimes \begin{pmatrix}
1  & 2 & 3 \\
4 & 5 & 6
\end{pmatrix} \begin{pmatrix}
            0 \\ 1 \\ 0 
           \end{pmatrix}=\underbrace{\begin{pmatrix}0 \\ 0.5 \end{pmatrix}}_{docVec} \otimes \overbrace{\begin{pmatrix}2 \\ 5 \end{pmatrix}}^{wordVec}
\end{equation}
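
The two forward passes above can be reproduced with a short NumPy sketch. The function name `hidden` and its `combine` argument are my own for illustration. They cover the two readings of $\otimes$ given earlier (concatenation or elementwise sum), and indices are 0-based:

```python
import numpy as np

# W and D exactly as in the worked example.
W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
D = np.array([[2.0, 0.0],
              [0.0, 0.5]])

def hidden(doc_index, word_index, combine="concat"):
    """Hidden layer for (document, context word), both 0-based indices."""
    doc_vec = D[:, doc_index]    # same as D @ e'_i
    word_vec = W[:, word_index]  # same as W @ e_j
    if combine == "concat":
        return np.concatenate([doc_vec, word_vec])
    if combine == "sum":
        return doc_vec + word_vec
    raise ValueError("combine must be 'concat' or 'sum'")

h_hello_doc1 = hidden(0, 0)  # "hello" in document 1
h_its_doc2 = hidden(1, 1)    # "its" in document 2
print(h_hello_doc1, h_its_doc2)
print(hidden(0, 0, "sum"), hidden(1, 1, "sum"))
```

With concatenation the hidden layer has size 4 (document part followed by word part); with the elementwise sum it stays at size 2, which requires the document and word vectors to share a dimensionality, as they do here.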

Does the pattern become apparent now? (Maybe try generalizing these closed forms.) From a high-level perspective, $D$ retains the memory of the paragraph we are in when we predict words. It is not quite the same as a recurrent connection when we formulate it with one-hot vectors, but indirectly it can be thought of as one.
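
One way to write the general form suggested by the two passes, as a sketch: the symbols $\mathbf{d_{i}}$ and $\mathbf{w_{j}}$ are names I introduce here for the $i$-th column of $D$ and the $j$-th column of $W$, and $v_{j}$ denotes the $j$-th vocabulary word. For context word $v_{j}$ in document $i$:

\begin{equation}
\mathbf{h_{v_{j}|doc_{i}}} = D\mathbf{e'_{i}} \otimes W\mathbf{e_{j}} = \mathbf{d_{i}} \otimes \mathbf{w_{j}}
\end{equation}

So the hidden layer always combines one column of $D$ (selected by the document) with one column of $W$ (selected by the context word).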

## Why use this Representation?

Let's say we have 3 documents: one about the British economy, one about Britain and one about the economy. The high-level idea is that their vector representations would encode the following arithmetic:

\begin{equation}
\mathit{britishEconomy} - \mathit{britishFacts} \approx \mathit{economy}
\end{equation}

The relation may not hold as an approximate equality, but subtracting the vectors should at least move the result closer to the economy vector in this Euclidean space.
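
To make the "closer in Euclidean space" idea concrete, here is a toy sketch. The three 2-dimensional vectors are hand-picked, hypothetical values and not the output of any trained model. They only illustrate how one would check whether the subtraction moves a vector towards the economy vector:

```python
import numpy as np

# Hypothetical document vectors, chosen by hand purely for illustration.
british_economy = np.array([1.0, 1.0])
british_facts = np.array([1.0, 0.1])
economy = np.array([0.1, 1.0])

# Distance to "economy" before and after subtracting the "British" part.
dist_before = np.linalg.norm(british_economy - economy)
difference = british_economy - british_facts
dist_after = np.linalg.norm(difference - economy)

print(dist_before, dist_after)
```

With real Doc2Vec vectors the same comparison would be made on the learned document vectors; whether the distance actually shrinks depends on the training data.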