# Neural Networks - Representations

### Recap: Feed-forward Neural Network

We have seen how a neural network can be formalized, both algebraically and graphically. 

$$y= NN(\mathbf{x}) $$

For example, we saw last time that the following network can be formalized as:
<img src="pics/nn.png" width=300> 

$$NN_{MLP1}(\mathbf{x})=g(\mathbf{xW^1+b^1})\mathbf{W^2}+\mathbf{b^2}$$
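The formula above can be sketched in a few lines of NumPy. This is only a minimal illustration: the layer sizes, the random weights, and the choice of `tanh` for $g$ are assumptions made for the example, not values taken from the figure.

```python
import numpy as np

def mlp1(x, W1, b1, W2, b2, g=np.tanh):
    """One-hidden-layer MLP: g(x W1 + b1) W2 + b2, with x as a row vector."""
    h = g(x @ W1 + b1)    # hidden layer
    return h @ W2 + b2    # output layer (no final non-linearity)

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 4, 3, 2           # illustrative sizes (assumed)
x = rng.normal(size=(1, d_in))         # one instance as a row vector
W1 = rng.normal(size=(d_in, d_hid))
b1 = np.zeros(d_hid)
W2 = rng.normal(size=(d_hid, d_out))
b2 = np.zeros(d_out)

y = mlp1(x, W1, b1, W2, b2)
print(y.shape)  # (1, 2)
```

Note that the shapes must chain: $\mathbf{W^1}$ has as many rows as $\mathbf{x}$ has entries, and $\mathbf{W^2}$ has as many rows as the hidden layer has units.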



However, what is the input $\textbf{x}$?

## Recap: Logistic Regression


Before we go further, let's make a detour and recap: how do we represent a training instance in a traditional classifier?

For instance, recall our example from week 1: training a Logistic Regression classifier for sentiment classification. 

* Describe in words: what were the features we used?
* Describe how the features of a single training instance got encoded as $\textbf{x}$
* How can you now describe the entire training data set as one matrix $X$ that holds all training instances, i.e., what are the rows and columns of $X$? $$ X = \{\mathbf{x_1}, \ldots , \mathbf{x_n}\} $$ 

(To be precise, our training data also contains labels; the entire training data is therefore often represented as tuples $\langle \mathbf{x},\mathbf{y} \rangle$.) 

## Feature representation

Probably the biggest jump when moving from traditional linear models with sparse inputs to deep neural networks is to stop representing each feature as a unique dimension and instead represent each feature as a **dense vector** (Goldberg, 2015).

In deep learning we usually work with dense representations. Each feature is a vector of numbers.

**a) sparse representation vs b) dense representation**  (Figure 1 in Yoav Goldberg's primer)
<img src="pics/sparsevsdense.png">

The values of the *embedding vectors* (values of the vectors in Fig 1 b)) are treated as model parameters and trained together with the other parameters of the model (weights).

The common pipeline of extracting features **for an NLP model with a Neural Network** then becomes:

* extract a set of core linguistic features $f_1, \ldots, f_n$
* define a vector **for each feature** (lookup table)
* **combine** vectors of features to get the vector representation for the **instance** $\mathbf{x}$ (**dense representation**)
* use $\mathbf{x}$ as representation for an instance, train the model
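The dense pipeline above might look roughly as follows. This is a sketch under stated assumptions: the tiny vocabulary, the embedding size, and the helper name `v` (mirroring the lookup notation used later) are made up for illustration, and the combination step here is concatenation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature inventory: here the features are simply words.
vocab = {"the": 0, "movie": 1, "was": 2, "great": 3}
emb_dim = 5  # assumed embedding size

# Lookup table: one row (dense vector) per feature.
E = rng.normal(scale=0.1, size=(len(vocab), emb_dim))

def v(feature):
    """Look up the dense vector of a feature."""
    return E[vocab[feature]]

# Combine the feature vectors into the instance representation x.
features = ["movie", "was", "great"]
x_dense = np.concatenate([v(f) for f in features])
print(x_dense.shape)  # (15,)
```

The rows of `E` would be updated during training, just like the weights of the network.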


    

Let's compare this to our traditional approach. There, the common pipeline of extracting features for an NLP model is:

* extract a set of core linguistic features $f_1, \ldots, f_n$
* define a vector whose length is the total number of features with a 1 at position k if the k-th feature is active; this feature vector represents the **instance** $\mathbf{x}$  (**sparse representation**)
* use $\mathbf{x}$ as representation for an instance, train the model
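For comparison, the sparse encoding described in these steps can be sketched like this. The vocabulary and the helper name `sparse_vector` are hypothetical and only serve the illustration.

```python
import numpy as np

# Hypothetical feature inventory: one dimension per feature.
vocab = {"the": 0, "movie": 1, "was": 2, "great": 3}

def sparse_vector(active_features, vocab):
    """Vector of length |vocab| with a 1 at position k if feature k is active."""
    x = np.zeros(len(vocab))
    for f in active_features:
        x[vocab[f]] = 1.0
    return x

x_sparse = sparse_vector(["movie", "was", "great"], vocab)
print(x_sparse)  # [0. 1. 1. 1.]
```

With a realistic inventory of tens of thousands of features, almost all entries of such a vector are zero.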

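Now it should be clear why these are called sparse and dense feature representations: the sparse vector has one dimension per feature and consists almost entirely of zeros, whereas the dense vector is low-dimensional and its entries are real-valued and typically non-zero.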


### How do you combine different feature vector representations?


In an NLP application, $\mathbf{x}$ is usually composed of various embedding vectors.


Following the notation in Goldberg (2015), chapter 4, let's use the function $c(\cdot)$ as the **feature combiner** that creates our input embedding layer.

A common choice for $c$ is **concatenation**:

$\mathbf{x} = c(f_1, f_2, f_3) = [v(f_1); v(f_2); v(f_3)] $

Alternatively, $c$ could be the **sum of the embedding vectors**:

$\mathbf{x} = c(f_1, f_2, f_3) = v(f_1)+v(f_2)+v(f_3) $

or the **mean**:

$\mathbf{x} = c(f_1, f_2, f_3) = \text{mean}(v(f_1),v(f_2),v(f_3)) $

In many papers, $v$ is referred to as the embedding layer or lookup layer.
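The three choices for $c$ can be compared on toy vectors. The values below are made up so that the results are easy to check by hand; they stand in for $v(f_1)$, $v(f_2)$, $v(f_3)$.

```python
import numpy as np

# Toy embedding vectors standing in for v(f_1), v(f_2), v(f_3).
v1 = np.array([1.0, 2.0])
v2 = np.array([3.0, 4.0])
v3 = np.array([5.0, 6.0])
vectors = [v1, v2, v3]

x_concat = np.concatenate(vectors)   # [1. 2. 3. 4. 5. 6.]
x_sum = np.sum(vectors, axis=0)      # [ 9. 12.]
x_mean = np.mean(vectors, axis=0)    # [3. 4.]

print(x_concat.shape, x_sum.shape, x_mean.shape)  # (6,) (2,) (2,)
```

Concatenation keeps the order of the features and grows with their number, whereas sum and mean keep the input dimension fixed but discard the order.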

For instance, let us state the input representation explicitly. Suppose we use the concatenation operator; then our network above becomes:

<img src="pics/nn.png" width=300> 

since: 
$\mathbf{x} = c(f_1, f_2, f_3) = [v(f_1); v(f_2); v(f_3)] $

then: 

$NN_{MLP1}(\mathbf{x})=g(\mathbf{[v(f_1); v(f_2); v(f_3)]W^1+b^1})\mathbf{W^2}+\mathbf{b^2}$
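Putting the lookup and the network together, the formula above could be sketched as follows. All sizes, the feature ids, and the `tanh` non-linearity are assumptions for the example; the point is only that the embedding table `E` sits next to `W1`, `b1`, `W2`, `b2` as one more set of parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, emb_dim, d_hid, d_out = 10, 4, 6, 2   # illustrative sizes (assumed)

# Model parameters: embedding table plus the two layers.
E = rng.normal(scale=0.1, size=(n_features, emb_dim))
W1 = rng.normal(scale=0.1, size=(3 * emb_dim, d_hid))  # input is 3 concatenated embeddings
b1 = np.zeros(d_hid)
W2 = rng.normal(scale=0.1, size=(d_hid, d_out))
b2 = np.zeros(d_out)

def nn_mlp1(feature_ids):
    """g([v(f1); v(f2); v(f3)] W1 + b1) W2 + b2 with g = tanh."""
    x = np.concatenate([E[i] for i in feature_ids])  # lookup + concatenation
    return np.tanh(x @ W1 + b1) @ W2 + b2

scores = nn_mlp1([2, 7, 5])   # three hypothetical feature ids
print(scores.shape)  # (2,)
```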



As computational graph:
<img src="pics/yg-compgraph2.png">

Unrolled (graph with concrete input, expected output, and loss node, Goldberg Figure 3 c):
<img src="pics/yg-compgraph3.png">

### Word Embeddings

So, in deep learning approaches to NLP, words are represented as dense vectors. Where do these word vectors (embeddings) come from?

* **randomly initialized** (small numbers around 0) and *trained with the network*
* **off-the-shelf embeddings**: you can also use already trained, available embeddings (e.g. estimated with *word2vec*) and *initialize* the embedding layer of the network with your pretrained (unsupervised) word embeddings
* **task-specific embeddings**: you could also train your embeddings, read them off the network, and use them for another task (or in a multi-task setup, more later)
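The first two options can be sketched as follows. The vocabulary and the "pretrained" vectors are invented numbers for illustration; in practice they would be read from a file of trained embeddings, and the initialization range is just one possible choice.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"cat": 0, "dog": 1, "table": 2}   # hypothetical vocabulary
emb_dim = 3

# Option 1: random initialization with small numbers around 0.
E = rng.uniform(-0.1, 0.1, size=(len(vocab), emb_dim))

# Option 2: overwrite rows with off-the-shelf vectors where available.
# These values are made up; real ones would come from pretrained embeddings.
pretrained = {
    "cat": np.array([0.2, -0.4, 0.1]),
    "dog": np.array([0.3, -0.3, 0.0]),
}
for word, idx in vocab.items():
    if word in pretrained:
        E[idx] = pretrained[word]

print(E.shape)  # (3, 3)
```

Words without a pretrained vector (here `table`) simply keep their random initialization. In both cases `E` can then be trained further together with the network.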

### Example: animacy classification


### References