# Question Answering

- Lecture: https://www.youtube.com/watch?v=yIdF-17HwSk&list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z&index=11&t=0s

- Slide: http://web.stanford.edu/class/cs224n/slides/cs224n-2020-lecture10-QA.pdf

![](images/question_answering_1.png)

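- SQuAD and SQuAD 2.0 datasets (question answering) and their limitation (the answer must be a span of the passage)
- Stanford Attentive Reader (SAR) and SAR++
- BiDAF (Bi-Directional Attention Flow) from the Allen Institute for AI: attention flows in both directions, context-to-question and question-to-context
- There are many ways to compute and use attention
- FusionNet: combines all of those attention usages and stacks 5 of them together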

# CNN

Slide: http://web.stanford.edu/class/cs224n/slides/cs224n-2020-lecture11-convnets.pdf

Video: https://youtu.be/EAJoRA0KX7I?t=1407

# Why RNN to CNN

- RNN:
    - has to read the **entire sequence** (e.g. the whole sentence in a classification task) **from one end to the other** to produce a final vector (hidden vector h), and that vector **normally captures too much of the last words** (assuming no LSTM or attention tweak)
    - always goes through the prefix context before reaching a later short phrase, e.g. in "the country of my birth ...", **an RNN cannot capture the phrase "my birth" on its own, without also encoding "the country of"**

    => No explicit modeling of long- and short-range dependencies
    
- CNN: can compute "vectors" for every possible "word subsequence" of a certain length
    - e.g. for the example above: "the country", "country of", "of my", "my birth", ...
    - Regardless of whether the phrase is grammatical
    - Exploits local (short-range) dependencies
    - However, long-distance dependencies require many layers (as with CNNs in computer vision, deeper networks are better at capturing "big picture" information)
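A minimal sketch of the "vector for every word subsequence" idea, in plain Python. The helper names (`ngram_windows`, `conv1d_single_filter`) and the toy 2-dimensional embeddings are made up for illustration; a real model would use a library conv layer and learned weights.

```python
def ngram_windows(tokens, n):
    """Return every contiguous window of n tokens, left to right."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


def conv1d_single_filter(embeddings, kernel):
    """Slide one filter over a sequence of word vectors (no padding, stride 1).

    embeddings: list of word vectors (lists of floats), one per token
    kernel:     list of n weight vectors, same dimension as the word vectors
    Returns one scalar per window of n consecutive words.
    """
    n = len(kernel)
    outputs = []
    for start in range(len(embeddings) - n + 1):
        total = 0.0
        for offset in range(n):
            word_vec = embeddings[start + offset]
            weights = kernel[offset]
            # dot product between this kernel row and the word vector
            total += sum(w * x for w, x in zip(weights, word_vec))
        outputs.append(total)
    return outputs


tokens = "the country of my birth".split()
bigrams = ngram_windows(tokens, 2)

# Toy example: 3 "words" with 2-d embeddings and one filter of size 2
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_kernel = [[1.0, 1.0], [1.0, -1.0]]
toy_output = conv1d_single_filter(toy_embeddings, toy_kernel)
```

Each filter gives one number per window, so stacking several filters gives one vector per window. Note that "my birth" gets its own output without going through "the country of" first.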

# CNN in NLP

## steps and padding

![](images/cnn_1.png)

You can add padding of 1 (a zero vector before the first word and after the last word) to keep the number of outputs (# of rows) equal to the number of words, and add more filters (2 more) to increase the dimension of the output matrix (# of columns)

Also, if you want more outputs (# of rows) than input words, you can increase the padding (padding = 2), a.k.a. wide convolution

![](images/cnn_2.png)
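A quick sketch to check the row counts for the padding cases above. The function names and the 7-word sentence length are assumptions for illustration; the formula is the standard 1D convolution output-length formula.

```python
def conv_output_length(seq_len, kernel_size, padding=0, stride=1):
    """Number of output rows of a 1D convolution over seq_len words."""
    return (seq_len + 2 * padding - kernel_size) // stride + 1


def pad_sequence(embeddings, padding):
    """Add `padding` zero vectors before the first and after the last word."""
    dim = len(embeddings[0])
    left = [[0.0] * dim for _ in range(padding)]
    right = [[0.0] * dim for _ in range(padding)]
    return left + embeddings + right


seq_len = 7      # hypothetical sentence length
kernel_size = 3

no_padding = conv_output_length(seq_len, kernel_size, padding=0)    # fewer rows than words
same_padding = conv_output_length(seq_len, kernel_size, padding=1)  # same number of rows as words
wide_padding = conv_output_length(seq_len, kernel_size, padding=2)  # more rows than words (wide convolution)
```

With kernel size 3, padding 0 loses 2 rows, padding 1 keeps the row count, and padding 2 (= kernel size - 1) adds 2 rows.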

## pooling

Similarly to computer vision, each filter should learn (via SGD) to specialize in something different, e.g. filter 1 can specialize in whether a phrase is "polite" (produces a high value) or "rude" (low value)

So for each phrase in the sentence, the filters answer questions like: is it in a polite tone? is it talking about food? and so on

=> In a sense, we can summarize the whole sentence with respect to these features (produced by the filters) by doing **max pooling** (to be precise: **global max pooling**)

![](images/cnn_3.png)

So if the first 2 features are "polite" and "food", then the whole text is not very polite (0.3 politeness) but is talking about food (1.6 food)

You can also do an **average pooling** to show "what is the average amount of politeness in this text"

**Important**: **max pooling** is often the better choice, because of the nature of natural language:
- signals in language are often sparse (you express politeness in a few words, not in every word)
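A small sketch comparing global max pooling and average pooling over the filter outputs. The feature values are made up (only the maxima 0.3 and 1.6 match the example above), and the function names are just for illustration.

```python
def global_max_pool(feature_map):
    """Max over all windows (rows), separately for each filter (column)."""
    return [max(column) for column in zip(*feature_map)]


def global_avg_pool(feature_map):
    """Average over all windows (rows), separately for each filter (column)."""
    return [sum(column) / len(column) for column in zip(*feature_map)]


# rows = windows (phrases), columns = filters ("polite", "food")
feature_map = [
    [0.3, 0.1],
    [-0.5, 1.6],
    [0.2, 0.4],
]

max_features = global_max_pool(feature_map)  # strongest signal per filter
avg_features = global_avg_pool(feature_map)  # typical signal per filter
```

Here only one phrase is strongly about food. Max pooling keeps that 1.6, while averaging dilutes it with the phrases that are not about food, which is the "sparse signal" argument above.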

**Local max pooling (similar to CNNs in vision: slide a pooling window over the rows and take the max inside each window)**

![](images/cnn_4.png)

Or, for each column, you can keep just the k largest values, in their original order (**k-max pooling**).

Global max pooling is just k-max pooling with k = 1

![](images/cnn_5.png)
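A sketch of both variants on a single column (the outputs of one filter). The values, the function names, and the choice of window 2 / stride 2 with no padding are assumptions for illustration.

```python
def local_max_pool(column, window=2, stride=2):
    """Max inside each sliding window; a trailing partial window is dropped."""
    return [
        max(column[start:start + window])
        for start in range(0, len(column) - window + 1, stride)
    ]


def k_max_pool(column, k):
    """Keep the k largest values, in the order they appear in the column."""
    # indices of the k largest values
    top = sorted(range(len(column)), key=lambda i: column[i], reverse=True)[:k]
    # re-sort the indices so the original order is preserved
    return [column[i] for i in sorted(top)]


column = [0.9, -0.6, 1.6, 0.3]

local = local_max_pool(column)    # one max per window of 2 rows
top_two = k_max_pool(column, 2)   # 0.9 comes before 1.6, as in the input
top_one = k_max_pool(column, 1)   # same as global max pooling
```

Unlike a plain "sort and take top k", k-max pooling keeps the positions in order, so some word-order information survives the pooling step.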

Another type of convolution: dilated convolution
- apply a second conv filter on top of the output of the previous layer (padding 1, stride 1, kernel size 3), but let it skip rows: with dilation 2 it reads rows 1, 3, 5, then rows 2, 4, 6, and so on
- helps see a wider spread of the sentence without having too many parameters

![](images/cnn_6.png)
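A sketch of a dilated 1D convolution on a single column of values, to show which rows each output reads. The function name and the toy numbers are assumptions; a real layer works on vectors and learned weights.

```python
def dilated_conv1d(values, kernel, dilation=1):
    """1D convolution (no padding, stride 1) that reads every `dilation`-th row."""
    # number of rows covered by one application of the kernel
    span = (len(kernel) - 1) * dilation + 1
    return [
        sum(w * values[start + j * dilation] for j, w in enumerate(kernel))
        for start in range(len(values) - span + 1)
    ]


values = [1, 2, 3, 4, 5, 6, 7]  # e.g. one output column of the previous conv layer
kernel = [1, 1, 1]              # kernel size 3, all-ones weights to make sums easy to read

regular = dilated_conv1d(values, kernel, dilation=1)  # reads rows (1,2,3), (2,3,4), ...
dilated = dilated_conv1d(values, kernel, dilation=2)  # reads rows (1,3,5), (2,4,6), (3,5,7)
```

The dilated filter still has only 3 weights but covers 5 rows per output, which is how it widens the receptive field without adding parameters.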

# A simple CNN architecture for NLP

- Note: In the paper, the number of filters for each region size (2, 3, 4) is 100 instead of 2
- The author also uses a "multi-channel input" idea:
    - Initialize with pre-trained word vectors (word2vec or GloVe)
    - **Start with two copies**
    - **Backprop into only one set, keep the other “static”** => keeps both the original version and the updated version; the idea is somewhat similar to a **skip connection** (in a skip connection you add the original input to the processed output)
    - Both channel sets are added to c_i before max-pooling

![](images/cnn_7.png)

![](images/cnn_8.png)

# Advice on using different architectures for NLP tasks

![](images/cnn_9.png)

# Run a CNN at character level to build word embeddings

Used in assignment 5 to build a better machine translation model

TODO: remove! (move to submodel notebook)

![](images/novel_3.png)

# Deep CNN model for NLP classification task

![](images/cnn_10.png)

Each conv block contains 2 conv sub-blocks; each sub-block contains (in order):
- Conv layer: kernel size 3, padded to preserve the dimension
- Batch norm
- ReLU for nonlinearity

In this architecture, the skip connection is not inside the block. It goes between blocks, as shown in the picture

# QRNN

https://medium.com/mlreview/understanding-building-blocks-of-ulmfit-818d3775325b

https://youtu.be/EAJoRA0KX7I?t=4679

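QRNN addresses some of the problems that CNNs and RNNs have:
- convolutions are time-invariant (the same filter is applied at every position, so there is no memory across time steps) and
- LSTMs cannot be parallelized across time steps.

We can say that QRNN combines the best of both worlds: the parallel nature of convolutions and the time dependencies of LSTMs.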

![](images/cnn_11.png)

![](images/cnn_12.png)
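A minimal sketch of the recurrent part of a QRNN, using the element-wise "f-pooling" recurrence h_t = f_t * h_{t-1} + (1 - f_t) * z_t. It assumes the candidate vectors `z_seq` and forget gates `f_seq` were already computed for all time steps by convolutions (the parallel part); the function name and toy values are made up for illustration.

```python
def f_pooling(z_seq, f_seq, h0=None):
    """Element-wise recurrence over time: h_t = f_t * h_{t-1} + (1 - f_t) * z_t.

    z_seq: candidate vectors, one per time step (from a conv layer + tanh)
    f_seq: forget gates in [0, 1], one per time step (from a conv layer + sigmoid)
    """
    dim = len(z_seq[0])
    h = list(h0) if h0 is not None else [0.0] * dim
    states = []
    for z_t, f_t in zip(z_seq, f_seq):
        # no matrix multiplication here, only cheap element-wise operations
        h = [f * h_prev + (1.0 - f) * z for f, h_prev, z in zip(f_t, h, z_t)]
        states.append(h)
    return states


# Toy 1-dimensional example with 2 time steps
z_seq = [[1.0], [0.0]]
f_seq = [[0.5], [0.5]]
hidden_states = f_pooling(z_seq, f_seq)
```

The heavy computation (the convolutions producing z and f) can run for all time steps at once; only this light element-wise loop is sequential, which is where the time dependency comes from.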