# **Homework 3: Questions**

This file is meant to contain all of your answers to the written questions of HW3. Please submit this alongside the rest of your code as a separate pdf labelled **questions.pdf**. 

> Note: Feel free to add code blocks and outputs within this file if you believe they help you in answering the questions but do **not** only screenshot your code and outputs from the homework notebook as a response!

## **Part 3: Encoder-Decoder Model**

For all the questions below, we would like to make it known that we expect your answers to be in the form of generics, not the specific numerical dimensions you have tuned your models to possess. Express your answers in terms of h: hidden dimension size, v: vocab size, e: embedding size, o: output dimension size, b: batch size and s: sentence length.

For example, in question 3.4 when discussing the dimensions of last_hidden and last_cell_state, we state the dimensions are `(2, b, h)`, such that it is true for any general, correct, implementation of the model.

It **is** valid, and sometimes expected, to express the generics in terms of a dimension multiplied by a constant, e.g. `(2*e, 5*b)` (this is just an example; no tensor in the model actually has this dimension). However, all dimensions should only be expressed in terms of dimensional variables and constants. Any answer that only gives specific numerical dimensions will not be given credit.

### Q3.1:
What are the limitations of converting the semantic role labeling task to a Question & Answer task (Model 2) using an Encoder-Decoder model? (max 4 sentences)


*... add your answer here*

### Q3.2
In the initialization of your encoder model, you initialize a LSTM encoding layer and two linear projection layers. Give the dimensions of each of these layers and explain your reasoning. Please reference specific parts of your code in your answer. (max 6 sentences)

First, we let `e = embed_size`, `h = hidden_size`. The dimension of the LSTM layer is given by `(h, e)`: the LSTM takes in an embedding of size `e` and outputs a hidden state of size `h` in each direction.
```
self.encoder = nn.LSTM(
    input_size=embed_size,
    hidden_size=hidden_size,
    batch_first=True,
    bias=True,
    bidirectional=True,
)
```
Then, the dimension of `h_projection` is given by `(h, 2h)`. For each sequence in the batch, it takes in the concatenated final hidden states from the forward and backward LSTMs (a `2h x 1` vector) and outputs the initial hidden state of the decoder, which is in $\mathbb{R}^h$.
```
self.h_projection = nn.Linear(
    in_features=2 * hidden_size, out_features=hidden_size, bias=True
)
```
Similarly, the dimension of `c_projection` is also `(h, 2h)`, as the cell state is essentially another hidden state for LSTMs, and we concatenate the cell states of the forward and backward LSTMs just as we do for the hidden states.
```
self.c_projection = nn.Linear(
    in_features=2 * hidden_size, out_features=hidden_size, bias=True
)
```
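
One caveat on the `(h, e)` shorthand for the LSTM: PyTorch stacks the four LSTM gates into a single parameter, so the stored `weight_ih_l0` of an `nn.LSTM` has shape `(4h, e)` and `weight_hh_l0` has shape `(4h, h)`, with `(h, e)` describing the mapping of each individual gate. The sketch below lists these shapes in plain Python; the helper name `encoder_param_shapes` and the toy sizes are illustrative assumptions, not part of the homework code.

```python
def encoder_param_shapes(e, h):
    """Parameter shapes for a bidirectional nn.LSTM(e, h) and two nn.Linear(2h, h) layers."""
    lstm_one_direction = {
        "weight_ih": (4 * h, e),  # four gates stacked; each gate maps e -> h
        "weight_hh": (4 * h, h),  # four gates stacked; each gate maps h -> h
        "bias_ih": (4 * h,),
        "bias_hh": (4 * h,),
    }
    return {
        "lstm_forward": lstm_one_direction,
        "lstm_backward": dict(lstm_one_direction),
        # nn.Linear stores its weight as (out_features, in_features)
        "h_projection.weight": (h, 2 * h),
        "c_projection.weight": (h, 2 * h),
    }


# Toy sizes standing in for the generic e and h (assumed values).
shapes = encoder_param_shapes(e=3, h=5)
print(shapes["lstm_forward"]["weight_ih"])  # (20, 3)
print(shapes["h_projection.weight"])        # (5, 10)
```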

### Q3.3
In the forward step of your encoder, you construct your tensor X by embedding the input. What should the resulting shape be?

If we let `b = batch_size` and `L` be the max length of a sentence, the input has shape `(b, L)` because we process with `batch_first=True`. Passing it through the embedding layer with embedding dimension `e` replaces each token index with a vector of size `e`, which adds a trailing dimension and gives a shape of `(b, L, e)`.

### Q3.4
In the forward step of your encoder, why are last_hidden and last_cell_state of size (2, b = batch_size, h = hidden_size)? Why is the first dimension 2? Furthermore, why do we want to concatenate the forward and backwards tensors (and what is the shape of the concatenated output)? (3 sentences)

The `last_hidden` and `last_cell_state` tensors are of size `(2, b, h)` because `nn.LSTM` with `bidirectional=True` returns the final hidden/cell states of the forward and backward LSTMs stacked along the first dimension (hence the 2), and each direction contributes a `(b, h)` tensor since we only keep the final hidden/cell state of size `h` for each sequence in the batch. We want to concatenate the forward and backward tensors along dimension `1` instead of stacking them so that the result matches the `2h` input features expected by the `h_projection` and `c_projection` layers. The shape of the concatenated output is `(b, 2h)`.
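
The difference between stacking and concatenating can be illustrated with nested lists standing in for tensors. This is a minimal sketch with made-up values (`b = 2`, `h = 3`); the helper `cat_directions` is hypothetical and plays the role of `torch.cat((last_hidden[0], last_hidden[1]), dim=1)`.

```python
def cat_directions(last_state):
    """Concatenate forward and backward final states along the feature dimension."""
    forward, backward = last_state  # each is b rows of h values
    return [f_row + b_row for f_row, b_row in zip(forward, backward)]


# Toy stand-in for last_hidden with shape (2, b, h) = (2, 2, 3).
last_hidden = [
    [[1, 2, 3], [4, 5, 6]],      # forward direction, shape (b, h)
    [[7, 8, 9], [10, 11, 12]],   # backward direction, shape (b, h)
]

concatenated = cat_directions(last_hidden)
print(len(concatenated), len(concatenated[0]))  # 2 6, i.e. (b, 2h)
```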

### Q3.5
In the forward step of your encoder, what is the dimension of the resulting tensor when we pass through the h_projection layer? (One sentence)

The dimension of the resulting tensor when passed through the `h_projection` layer is `(b, h)`.

### Q3.6
In the forward step of your encoder, what is the dimension of the resulting tensor when we pass through the c_projection layer? (One sentence)

Similarly to `h_projection`, the dimension of the resulting tensor when passed through `c_projection` is `(b, h)`.

### Q3.7
In the initialization of your decoder layer, you initialize your actual decoder, and two projection layers. What are the dimensions with which you initialize all of these variables, and why do these dimensions make sense? Please have your answer include no more than two sentences per variable, and please include your code in the answer as you discuss the initialization.

If `e = embed_size` and `h = hidden_size`, our actual decoder, an `nn.LSTMCell`, is `(h, e+h)`. The input to the $t^{th}$ step of the decoder is a concatenation of the embedding for the $t^{th}$ token (size `e`) and the previous combined-output vector (size `h`), while the previous $h^{dec}$ and $c^{dec}$ states are passed separately as a tuple and handled by the `nn.LSTMCell` module.
```
self.decoder = nn.LSTMCell(
    input_size=embed_size + hidden_size, hidden_size=hidden_size, bias=True
)
```
The `combined_output_projection` layer has dimension `(h, 3h)`. This is because the combined-output vector is of dimension `(h, 1)`, but the input $u_t$ vector is a concatenation of the decoder hidden state, of dimension `(h, 1)`, and the attention vector, of dimension `(2h, 1)`. 
```
self.combined_output_projection = nn.Linear(
    in_features=3 * hidden_size, out_features=hidden_size, bias=False
)
```
Finally, the `target_vocab_projection` layer has dimension `(v, h)`, where `v = output_vocab_size`. The final output is the probability distribution $P_t$, which is a softmax over a `(v, 1)` vector, so a `(v, h)` transformation is what maps the combined-output vector of shape `(h, 1)` to it.

### Q3.8
In the forward step of your decoder you should construct tensor Y by embedding the input. What should the resulting shape of Y be? (1 sentence)

Each index in `source_padded` in the decoder's `forward` is mapped to an embedding vector, so the shape of Y should be `(b, L, e)`, where `L` is the max length of a sentence (the number of time steps).

### Q3.9
In the forward step of your decoder, what is the dimension of the resulting tensor when passed through self.att_projection? (One sentence)

As the dimension of `enc_hiddens` passed from the encoder is `(b, L, 2h)`, we have that passing it through `self.att_projection` results in a tensor of size `(b, L, h)`.

### Q3.10
In the forward step of your decoder, over what dimension do we iterate, and what should the resulting shape of the new tensor be in relation to b and e where b is batch size and e is embedding size? (One sentence)

We iterate over the second dimension which is the time step, or the max sentence length, so that after squeezing along this dimension, the shape of the new tensor should be `(b, e)`.

### Q3.11
In the forward step of your decoder, what is the dimension of Ybar_t? (One sentence)

The dimension of `Ybar_t` should be `(b, e+h)`, as `o_prev` is initialized with size `(b, h)`, and from the last response, `Y_t` has size `(b, e)` after squeezing, and we concatenate over dimension `1`.

### Q3.12
At the end of the forward step of your decoder, what is the shape of the single tensor made from stacking combined_outputs? (One sentence)

As mentioned in the documentation of the code, at each time step, the size of `combined_outputs` is `(b, h)`, so that stacking them across all time steps will give us a tensor of size `(L, b, h)`.
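
The shapes from Q3.10 through Q3.12 can be traced through the decoder loop with plain tuples. This is only shape bookkeeping under assumed toy sizes for `b`, `L`, `e` and `h`, and it assumes the per-step outputs are stacked along dimension `0`, as the answer above states; no model computation is performed.

```python
# Toy sizes standing in for the generic b, L, e, h (assumed values).
b, L, e, h = 4, 7, 3, 5

Y = (b, L, e)        # embedded decoder input
o_prev = (b, h)      # previous combined output
combined_outputs = []

for _ in range(Y[1]):                        # iterate over the time dimension (dim=1)
    Y_t = (Y[0], Y[2])                       # one (b, 1, e) slice squeezed to (b, e)
    Ybar_t = (Y_t[0], Y_t[1] + o_prev[1])    # concatenate along dim=1: (b, e + h)
    o_t = (b, h)                             # shape of the combined output for this step
    combined_outputs.append(o_t)
    o_prev = o_t

# Stacking L tensors of shape (b, h) along dim=0 gives (L, b, h).
stacked = (len(combined_outputs),) + combined_outputs[0]
print(Ybar_t, stacked)  # (4, 8) (7, 4, 5)
```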

### Q3.13
Discuss the manipulations to tensor dimensions that you performed in the attention step of the decoder step function (TODO 2), and why these were necessary. Please include direct references to the code you wrote, as well as the code you are referencing. (Maximum 4 sentences)

```
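# Attention scores: (b, L, h) @ (b, h, 1) -> (b, L, 1), normalized over the L source positions
alpha_t = F.softmax(torch.matmul(enc_hiddens_proj, dec_hidden.unsqueeze(dim=2)), dim=1)
# Weighted sum of encoder states: (b, 2h, L) @ (b, L, 1) -> (b, 2h, 1)
a_t = torch.matmul(torch.transpose(enc_hiddens, dim0=1, dim1=2), alpha_t)
a_t = a_t.squeeze(dim=2)  # (b, 2h)
U_t = torch.cat((dec_hidden, a_t), dim=1)  # (b, 3h)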
```
The code from the attention step of the decoder step function is above. We first need to add a dimension to `dec_hidden` because we want to take the dot product between it and each row of `enc_hiddens_proj`, which we do with a matrix multiplication. Since `enc_hiddens_proj` has dimension `(b, L, h)` and `dec_hidden` has dimension `(b, h)`, we add a dimension to make it `(b, h, 1)`, and we then get `alpha_t` with dimension `(b, L, 1)`. Next, in order to matrix multiply `alpha_t` with `enc_hiddens`, whose dimension is `(b, L, 2h)`, we must transpose the second and third dimensions of `enc_hiddens` to get a tensor of dimension `(b, 2h, L)`, which lets us multiply it with `alpha_t`. Now `a_t` has dimension `(b, 2h, 1)`, so we squeeze out the extra dimension to get `a_t` with dimension `(b, 2h)`.
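
The same manipulations can be checked with a small shape tracer. This is a sketch in plain Python with assumed toy sizes; `bmm_shape` is a hypothetical helper that mirrors how `torch.matmul` treats two 3-D tensors, and the softmax is omitted because it leaves the shape unchanged.

```python
def bmm_shape(x, y):
    """Shape of a batched matrix product (b, n, m) @ (b, m, p) -> (b, n, p)."""
    assert x[0] == y[0] and x[2] == y[1], f"incompatible shapes {x} and {y}"
    return (x[0], x[1], y[2])


# Toy sizes standing in for the generic b, L, h (assumed values).
b, L, h = 4, 7, 5
enc_hiddens = (b, L, 2 * h)
enc_hiddens_proj = (b, L, h)
dec_hidden = (b, h)

dec_hidden_col = dec_hidden + (1,)                      # unsqueeze(dim=2): (b, h, 1)
alpha_t = bmm_shape(enc_hiddens_proj, dec_hidden_col)   # (b, L, 1)
enc_hiddens_T = (enc_hiddens[0], enc_hiddens[2], enc_hiddens[1])  # transpose(1, 2): (b, 2h, L)
a_t = bmm_shape(enc_hiddens_T, alpha_t)[:2]             # squeeze(dim=2): (b, 2h)
U_t = (dec_hidden[0], dec_hidden[1] + a_t[1])           # cat along dim=1: (b, 3h)

print(alpha_t, a_t, U_t)  # (4, 7, 1) (4, 10) (4, 15)
```

Skipping the `unsqueeze` or the transpose would trip the compatibility assertion in `bmm_shape`, which is the same mismatch the real tensors would hit.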

### Q3.14
In the decoder step function, what is the dimension of U_t? (One sentence)

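The dimension of `U_t` is `(b, 3h)`, since `dec_hidden` of size `(b, h)` and `a_t` of size `(b, 2h)` are concatenated along dimension `1`.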

## **Part 4: Model Comparison & Analysis**

### Q4.1:

Compare two models above either using quantitative or qualitative analysis.

The descriptive analysis can take one of two forms:

1. _Nuanced quantitative analysis_ \
If you choose this option, you will need to further break down the quantitative statistics you reported initially. We provide some initial strategies to prime you for what you should think about in doing this. One possible starting point is to consider: if model $X$ achieves greater accuracy than model $Y$, to what extent is $X$ getting everything correct that $Y$ gets correct? For example, what is each model's performance on each semantic role type?

2. _Nuanced qualitative analysis_ \
If you choose this option, you will need to select individual examples and try to explain or reason about why one model may be getting them right whereas the other isn’t. Are there any examples that both models get right or wrong and, if so, can you hypothesize a reason why this occurs?


**NOTE:** The report should be written keeping both of the models in mind, discussing and comparing both of their performances, as well as doing the nuanced analysis with both of the models. Due to this, we won't be setting a hard limit on length of the report, but your report should be a substantial analysis.

**CLARIFICATION:** Whichever option you take, we expect the following (at a minimum):

1. A minimum of 3 clearly stated examples (for qualitative analysis) or statistics (for your quantitative analysis)
2. An explanation as to why you think the phenomena you observed or talked about above are occurring.
3. A discussion as to what conclusions you can draw about your models, particularly in comparison with each other, as a result of these examples or statistics.

Please be clear with your responses, as we will grade according to the presence of the above.


From our experiments, we found that the LSTM encoder model has an F1 entity score of 0.2377 on the validation set and 0.2427 on the test (milestone submission) set. The encoder-decoder has an accuracy of 0.1788 on the validation set and 0.3072 on the test set. Thus, the LSTM encoder model performs worse than our encoder-decoder on the test set, although it scores higher on the validation set. With both models, the errors arise with O labels: the model either misassigns true roles as O or labels true O tokens as something else. The LSTM misses many of the true O labels, labelling them with something else instead. It has a macro precision score of 0.37 and a recall score of 0.4617. Out of the 16646 true O labels, the model only predicts 'O' 15206 times and correctly labels 13935 of them. So even within the 15206 O predictions, 1271 are incorrect.

With the encoder-decoder, the main issue is that the model often incorrectly predicts O when the true role is something else: it predicts 18744 O labels in total, over 2000 more than there actually are. It has a macro precision score of 0.53 but a recall score of 0.24. In fact, for every single category of true labels, the model predicts 'O' the majority of the time, more than any other label. While this is an issue across all categories, it does notably poorly with I-ARG0, with an F1 score of 0.04. Out of the 287 true I-ARG0 labels, it predicts I-ARG0 only 6 times, predicting O 267 times and B-ARG0 11 times.
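
The raw counts quoted above can be converted into per-label scores. The sketch below uses only counts stated in this report; the helper name `precision_recall_f1` is our own, and for I-ARG0 only recall is computed because the total number of I-ARG0 predictions is not reported here.

```python
def precision_recall_f1(true_count, predicted_count, correct_count):
    """Per-label precision, recall and F1 from raw counts."""
    precision = correct_count / predicted_count
    recall = correct_count / true_count
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 'O' label for the LSTM encoder model: 16646 true, 15206 predicted, 13935 correct.
p, r, f1 = precision_recall_f1(true_count=16646, predicted_count=15206, correct_count=13935)
print(f"O (LSTM): precision={p:.4f} recall={r:.4f} f1={f1:.4f}")

# I-ARG0 for the encoder-decoder: 6 of the 287 true labels are recovered.
i_arg0_recall = 6 / 287
print(f"I-ARG0 (encoder-decoder): recall={i_arg0_recall:.4f}")
```

On these counts the LSTM's 'O' label comes out at roughly 0.92 precision and 0.84 recall, which suggests its much lower macro scores are driven by the non-O labels.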