# cons of "one representation" of words

Up until now, we’ve basically had one representation of words: The word vectors that we learned about at the beginning
- Word2vec, GloVe, fastText
    
Problems:
- **Problem 1**: a word can have different meanings ("senses") depending on the context, yet we collapse all of them into a single vector and hope the model is complex enough to pick out the correct meaning
- Quick solution: define separate word senses for each word and build a vector for each sense

=> what you want is **not a perfect word vector that captures all of a word's meanings, but a word vector that is correct in the given context**

- **Problem 2**: more generally, words have not just different meanings but also different aspects, including semantics, syntactic behavior, and register/connotations.
    

However, both problems are addressed by training a language model that predicts the next word: **in doing so, the model already produces a context-specific representation of each word**

# TagLM (pre-ELMo) for Named Entity Recognition (NER)

![](images/context_1.png)

Note that the "word embedding model" is the word embedding from w2v/GloVe/fastText

![](images/context_2.png)

The token embedding is from w2v/GloVe/fastText

Note:
- Only the hidden states are concatenated, not the embeddings of the LM and the w2v model
- This is **not end-to-end training**: 'pretrained' means we only use the output hidden states of the LM as input features for the sequence tagger. **In other words, the LM is trained first ('pretrained') and then frozen; we only extract its hidden states**


# ELMO

Peters et al. (2018): ELMo: Embeddings from Language Models
- Train a bidirectional LM
- Aim at performant but not overly large LM:
    - Use 2 biLSTM layers
    - Use character CNN to build initial word representation (only)
        - 2048 char n-gram filters and 2 highway layers, 512 dim projection
    - Use 4096 dim hidden/cell LSTM states with 512 dim projections to next input
    - Use a residual connection
    - Tie parameters of token input and output (softmax) and tie these between forward and backward LMs
    
    
    
- TagLM uses only the top layer of the LSTM stack; **ELMo uses all layers of the LSTM stack and assigns each one a learnable, task-specific weight that determines how much that layer contributes**
    - First run biLM to get representations for each word
    - Then let (whatever) end-task model use them
        - Freeze weights of ELMo for purposes of supervised model
        - Concatenate the (weighted) ELMo representations into the task-specific model
        - Details depend on task
            - Concatenating into intermediate layer as for TagLM is typical
            - Can provide ELMo representations again when producing outputs, as in a question answering system
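
The layer weighting described above can be sketched in a few lines. This is a minimal illustration, assuming the combination used in Peters et al. (2018): a softmax over one learnable scalar per layer, followed by a learnable task-specific scale. The names `elmo_mix`, `layer_logits` and `gamma`, and all numbers, are made up for the example.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def elmo_mix(layer_states, layer_logits, gamma=1.0):
    """Collapse the per-layer vectors of one token into a single vector.

    layer_states: list of L vectors (one per biLM layer), all of the same size.
    layer_logits: L learnable scalars; softmax turns them into mixing weights.
    gamma: learnable task-specific scale applied to the mixed vector.
    """
    s = softmax(layer_logits)
    dim = len(layer_states[0])
    return [gamma * sum(s[j] * layer_states[j][c] for j in range(len(s)))
            for c in range(dim)]


# Made-up states for one token from three layers (token layer + 2 biLSTM layers).
states = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
uniform = elmo_mix(states, [0.0, 0.0, 0.0])      # equal logits: plain average of the layers
top_heavy = elmo_mix(states, [0.0, 0.0, 10.0])   # one large logit: dominated by the top layer
```

With equal logits the result is just the average of the layers; as a task pushes one logit up during training, the mixed vector moves toward that layer.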

## weighting of layers for different tasks

The two biLSTM NLM layers have differentiated uses/meanings
- Lower layer is better for lower-level syntax, etc.
    - Part-of-speech tagging, syntactic dependencies, NER
- Higher layer is better for higher-level semantics
    - Sentiment, Semantic role labeling, question answering, SNLI

# Transformers

http://nlp.seas.harvard.edu/2018/04/03/attention.html

We want parallelization but RNNs are inherently sequential
- Despite GRUs and LSTMs, **RNNs still need an attention mechanism to deal with long-range dependencies**: otherwise the path length between states grows with the sequence length
- But if attention gives us access to any state… **maybe we can just use attention and don’t need the RNN?**


![](images/transformer_1.png)

![](images/transformer_2.png)

![](images/transformer_3.png)

![](images/transformer_12.png)

# Transformer's Encoder

## Self-attention in the **encoder**

- The input word vectors are the queries, keys and values
- In other words: the word vectors themselves select each other
- Word vector stack = Q = K = V
- We’ll see in the decoder why we separate them in the definition
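
To make "the word vectors select each other" concrete, here is a small pure-Python sketch of single-head self-attention where the same stack of vectors plays the role of Q, K and V. The function name `self_attention` and the toy vectors are assumptions for illustration; no learned projections are applied.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x.

    x: list of n word vectors, each a list of d floats.
    Returns (outputs, weights): n output vectors and the n x n attention weights.
    """
    d = len(x[0])
    outputs, weights = [], []
    for q in x:  # every word vector acts as a query against all keys
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        # weighted average of the value vectors (the same x again)
        outputs.append([sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)])
    return outputs, weights


# Toy "sentence" of three 2-dimensional word vectors (made-up numbers).
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(words)
```

Each row of `weights` is a probability distribution over the words of the sentence, so every output vector is a convex combination of the input word vectors.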

## Multihead attention

![](images/transformer_4.png)

Note: Q, K and V are mapped to multiple **lower-dimensional spaces** using **fully-connected linear layers**
- for head 1 (i=1), use weight matrix $W^Q_1$ for Q, $W^K_1$ for K and $W^V_1$ for V

$$
W^Q_i \in
\mathbb{R}^{d_{\text{model}} \times d_k},
W^K_i \in
\mathbb{R}^{d_{\text{model}} \times d_k},
W^V_i \in
\mathbb{R}^{d_{\text{model}} \times d_v}, 
W^O \in \mathbb{R}^{hd_v \times
d_{\text{model}}}
$$

$$
d_k=d_v=d_{\text{model}}/h=64
$$

- Instead of applying attention once to the full Q, K, V
$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V
$$
we project each matrix, which originally has d_model columns, into h lower-dimensional matrices by multiplying it with the h weight matrices stated above, and apply attention to each projection separately

In practice, for each of Q, K and V, we combine the h weight matrices into one matrix of shape (d_model, d_model) to speed up the computation
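
The equivalence between h small projections and one combined projection can be checked numerically. The sketch below uses toy sizes and random numbers (all names and values are assumptions for illustration); in the paper's setting d_model = 512 and h = 8, which gives d_k = 64.

```python
import random


def matmul(a, b):
    """Multiply matrix a (n x p) by matrix b (p x q), both given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


random.seed(0)
n, d_model, h = 3, 8, 2   # toy sizes
d_k = d_model // h        # 4 here

x = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n)]
# One projection matrix per head, each of shape (d_model, d_k).
w_heads = [[[random.gauss(0, 1) for _ in range(d_k)] for _ in range(d_model)]
           for _ in range(h)]

# Combined matrix: the per-head matrices placed side by side, shape (d_model, h * d_k).
w_combined = [sum((w_heads[i][r] for i in range(h)), []) for r in range(d_model)]

per_head = [matmul(x, w_heads[i]) for i in range(h)]   # h results of shape (n, d_k)
combined = matmul(x, w_combined)                       # one result of shape (n, d_model)
# Chopping the combined result into h column blocks recovers each head's projection.
chopped = [[row[i * d_k:(i + 1) * d_k] for row in combined] for i in range(h)]
```

One large matrix multiplication followed by a reshape is generally cheaper on a GPU than h small multiplications, which is why the implementation that follows uses a single `nn.Linear(d_model, d_model)` per input.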

In [1]:
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


def attention(query, key, value, mask=None, dropout=None):
    """
    Compute 'Scaled Dot Product Attention'
    query: shape (bs, h, n, d_k), where n is the number of query rows per batch element and h is the number of attention heads
    key: shape (bs, h, m, d_k), where m is the number of key rows per batch element
    value: shape (bs, h, m, d_k)

    Note that n = m for self-attention, where n is the sentence length (number of tokens)
    """

    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    # (bs, h, n, d_k) @ (bs, h, d_k, m) = (bs, h, n, m), aka QK^T / sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9) # fill mask for softmax
    p_attn = F.softmax(scores, dim = -1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn # (bs,h,n,m) @ (bs, h, m, d_k) = (bs,h,n,d_k)

In [None]:
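import copy


def clones(module, N):
    "Produce N identical layers (helper from the Annotated Transformer linked above)."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])


class MultiHeadedAttention(nn.Module):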
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4) # 4 linear layers
        # we use the first 3 linears (Wq,Wk and Wv) to calculate attention,
        # and the last linear (Wo) to project attention 
        
        # note that technically Wq (or Wk,Wv) is a combination of 
        # h sub matrices shape (d_model,d_k) for projection to h lower dimensional space

        self.attn = None
        self.dropout = nn.Dropout(p=dropout)
        
    def forward(self, query, key, value, mask=None):
        """
        Implements Figure 2
        
        query: (bs,n,d_model)
        key: (bs,m,d_model)
        value: (bs,m,d_model)
        
        Note that d_model = h * d_k
        and d_k = d_v
        For self-attention n = m, the sentence length (number of tokens)
        """
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)
        
        # 1) Do all the linear projections in batch from d_model => h x d_k 
        query, key, value = \
            [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]
        # for query: (bs,n,d_model) @ (d_model,d_model) -> (bs,n,d_model) -> view as (bs,n,h,d_k) -> transpose to (bs,h,n,d_k)
        # do the same for key and value
        
        # 2) Apply attention on all the projected vectors in batch. 
        x, self.attn = attention(query, key, value, mask=mask, 
                                 dropout=self.dropout)
        # x has shape (bs,h,n,d_k)
        
        # 3) "Concat" using a view and apply a final linear. 
        x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)
        # (bs,h,n,d_k) -> transpose to (bs,n,h,d_k) -> view as (bs,n,h*d_k) aka (bs,n,d_model), then @ (d_model,d_model)
        return self.linears[-1](x)

## Complete transformer block

![](images/transformer_5.png)

In [None]:
class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    """
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        """
        Apply residual connection to any sublayer with the same size.

        x shape: (bs,n,d_model)
        n can be sent_length (number of tokens per sentence)
        """

        return x + self.dropout(sublayer(self.norm(x))) #residual

In [None]:
class EncoderLayer(nn.Module):
    "Encoder layer is made up of self-attention and a position-wise feed-forward network"
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn # a MultiHeadedAttention module as defined above
        self.feed_forward = feed_forward # position-wise feed-forward network (two linear layers with a ReLU in between in the Annotated Transformer)
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        """
        Follow Figure 1 (left) for connections.

        x shape: (bs,n,d_model)
        n can be sent_length (number of tokens per sentence)
        """
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward) 

## Layer norm

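The layer norm formula is essentially the same as batch norm, but
- the statistics are calculated for EACH TRAINING POINT (here: each token vector) rather than across the batch
- No moving average is kept
- The mean and variance are taken over the hidden dimension, which makes the normalization independent of the batch size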

In [None]:
class LayerNorm(nn.Module):
    "Construct a layernorm module (See citation for details)."
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        """
        x shape: (bs,n,d_model)
        n can be sent_length (number of words for each batch)
        """
        mean = x.mean(-1, keepdim=True) # mean over the hidden dimension, computed for each token vector individually
        # similar to layer norm in CV: m = x.mean((1,2,3), keepdim=True), where each image is a 3D tensor
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
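
The batch-independence claim can be checked with a small pure-Python comparison. This is a simplified sketch: it uses the population variance inside the square root and no learnable gain or bias, so it is a close variant of, not a copy of, the `LayerNorm` module in these notes. The function names and numbers are made up for the example.

```python
import math


def layer_norm(vec, eps=1e-6):
    """Normalize one vector over its own features."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def batch_norm(batch, eps=1e-6):
    """Normalize each feature across the batch (training-mode statistics only)."""
    n, d = len(batch), len(batch[0])
    means = [sum(row[c] for row in batch) / n for c in range(d)]
    variances = [sum((row[c] - means[c]) ** 2 for row in batch) / n for c in range(d)]
    return [[(row[c] - means[c]) / math.sqrt(variances[c] + eps) for c in range(d)]
            for row in batch]


# The same token vector placed in two different batches.
token = [1.0, 2.0, 3.0, 4.0]
batch_a = [token, [0.0, 0.0, 0.0, 0.0]]
batch_b = [token, [10.0, -10.0, 5.0, 7.0]]

ln_a = [layer_norm(v) for v in batch_a][0]
ln_b = [layer_norm(v) for v in batch_b][0]
bn_a = batch_norm(batch_a)[0]
bn_b = batch_norm(batch_b)[0]
```

`ln_a` and `ln_b` are identical because layer norm never looks at the other rows, while `bn_a` and `bn_b` differ because batch norm statistics depend on what else is in the batch.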

## Multi-block with multi-head attention vs LSTM/RNN

- RNN:
    - \+ carries recurrent information through a sentence by processing one time step (word) at a time
    - \- CANNOT parallelize across time steps on a GPU: each hidden state depends on the previous one, so the words of a sequence cannot all be processed at once
- multi-block with multi-head attention
    - \+ With blocks stacked on top of each other, each block can capture one link in a chain of information: the 1st block captures the 1st link, the 2nd block the 2nd, and so on => similar to an RNN
    - \+ CAN process all positions of every sequence in the batch at once => parallelizes well on GPUs
    - \- disregards the position of words in a sentence: the same word at 2 different positions is treated the same way, and the model does not know where it is in the sentence
        - Solution: **positional encoding**, so that the same word at different locations has a different overall representation
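
As one possible illustration of positional encoding, the sketch below assumes the fixed sinusoidal scheme from the original Transformer paper (sine on even dimensions, cosine on odd ones); these notes do not say which scheme is meant, and learned position embeddings are another option. The function names and vectors are made up for the example.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position: sin on even indices, cos on odd ones."""
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]


def add_position(word_vectors):
    """Add the position encoding to each word vector of a sentence."""
    d_model = len(word_vectors[0])
    return [[w + p for w, p in zip(vec, positional_encoding(pos, d_model))]
            for pos, vec in enumerate(word_vectors)]


# The same (made-up) word vector appears at positions 0 and 2 of a sentence.
same_word = [0.5, -0.5, 0.25, 0.0]
sentence = [same_word, [1.0, 1.0, 1.0, 1.0], same_word]
encoded = add_position(sentence)
```

After the addition, `encoded[0]` and `encoded[2]` are no longer equal, so attention can tell the two occurrences of the word apart.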

## The complete encoder

![](images/transformer_6.png)

## TODO: visualization in encoder

Attention visualization in layer 5

![](images/transformer_7.png)

# Transformer's Decoder

![](images/transformer_11.png)

# BERT

Pre-training of Deep Bidirectional Transformers for Language Understanding

Problem: Language models only use left context or right context, why?
- Reason 1: Directionality is needed to generate a well-formed probability distribution.
- Reason 2: Words can “see themselves” (cheating) in a bidirectional encoder.

**but language understanding is bidirectional.**


Solution: Mask out k% of the input words, and then predict the masked words
- They always use k = 15%
- Example: in `the man went to the [MASK] to buy a [MASK] of milk`, the targets are `store` (first mask) and `gallon` (second mask)
- Too little masking: Too expensive to train
- Too much masking: Not enough context
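
The masking step can be sketched as follows. This is a simplified version that always substitutes `[MASK]`; the actual BERT recipe also sometimes keeps the chosen token or swaps in a random one, which is left out here. The function name `mask_tokens` and its parameters are assumptions for illustration.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random `mask_rate` fraction of tokens with [MASK].

    Returns the masked sequence and a dict {position: original token} holding
    the prediction targets. At least one token is always masked.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = "[MASK]"
    return masked, targets


sentence = "the man went to the store to buy a gallon of milk".split()
masked, targets = mask_tokens(sentence)
```

For this 12-token sentence, 15% rounds to 2 masked positions; the model is trained to recover `targets` from `masked` using context on both sides.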

TODO: can we do a multiclass prediction and choose synonyms of [MASK] words?

## BERT next sentence prediction

![](images/transformer_8.png)

- a binary classification task

## a typical embedding input of BERT

![](images/transformer_9.png)

Note that the 3 embeddings (token, segment and position) are added rather than concatenated, so the input to the multi-head Transformer encoder keeps dimension d_model (it may not be ideal to concatenate them and then perform attention?)
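
A tiny sketch of the difference between adding and concatenating, using made-up 4-dimensional embeddings (the function name and all values are assumptions for illustration):

```python
def bert_input_embedding(token_emb, segment_emb, position_emb):
    """Element-wise sum of the three embeddings for one wordpiece."""
    assert len(token_emb) == len(segment_emb) == len(position_emb)
    return [t + s + p for t, s, p in zip(token_emb, segment_emb, position_emb)]


# Made-up embeddings for a single wordpiece.
token_emb = [0.1, 0.2, 0.3, 0.4]
segment_emb = [1.0, 0.0, 1.0, 0.0]     # e.g. "sentence A"
position_emb = [0.0, 0.5, 0.0, 0.5]    # e.g. position 3

summed = bert_input_embedding(token_emb, segment_emb, position_emb)
concatenated = token_emb + segment_emb + position_emb   # list concatenation
```

The sum keeps the vector at 4 dimensions, whereas concatenation triples it to 12 and would require wider attention projections.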

Note that wordpiece embeddings are not simply word embeddings:
- The wordpiece model tokenizes inside words
- BERT uses a variant of the wordpiece model
    - (Relatively) common words are in the vocabulary: at, fairfax, 1910s
- Other words are built from wordpieces:
    - hypatia = h ##yp ##ati ##a
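
The splitting of `hypatia` can be reproduced with a greedy longest-match-first tokenizer. This is a minimal sketch with a toy vocabulary invented for the example; the function name `wordpiece_tokenize` and the `[UNK]` fallback are assumptions, not the exact BERT tokenizer.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into wordpieces.

    Pieces that do not start the word carry a "##" prefix.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:  # try the longest remaining substring first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, invented for this example.
toy_vocab = {"at", "fairfax", "1910s", "h", "##yp", "##ati", "##a"}
hypatia_pieces = wordpiece_tokenize("hypatia", toy_vocab)
```

Words that are in the vocabulary come back as a single piece, while rarer words are assembled from smaller pieces, so the vocabulary can stay small.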

## losses

There are 2 losses: one from predicting the [MASK] words, and a binary classification loss from predicting whether the second sentence is the correct next sentence or not
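
A minimal sketch of how the two losses combine, assuming both are cross-entropy losses and that the total is their plain sum (mean over the masked positions plus the next-sentence term). The function names and probabilities are made up for the example.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target])


def bert_pretraining_loss(mlm_predictions, nsp_probs, nsp_label):
    """Sum of the mean masked-word loss and the next-sentence loss.

    mlm_predictions: list of (probs_over_vocab, target_id), one per [MASK] position.
    nsp_probs: [P(not next), P(is next)] for the sentence pair.
    nsp_label: 1 if the second sentence really follows the first, else 0.
    """
    mlm_loss = sum(cross_entropy(p, t) for p, t in mlm_predictions) / len(mlm_predictions)
    nsp_loss = cross_entropy(nsp_probs, nsp_label)
    return mlm_loss + nsp_loss


# Made-up probabilities over a 3-word vocabulary for two masked positions.
mlm_predictions = [([0.7, 0.2, 0.1], 0), ([0.1, 0.8, 0.1], 1)]
loss = bert_pretraining_loss(mlm_predictions, nsp_probs=[0.25, 0.75], nsp_label=1)
```

Both terms are backpropagated through the same encoder, so a single set of weights learns from the two objectives at once.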

## BERT model architecture and training


- Transformer encoder (as before)
- Self-attention ⇒ no locality bias
    - Long-distance context has “equal opportunity”
- Single multiplication per layer ⇒ efficiency on GPU/TPU
- Train on Wikipedia + BookCorpus
- Train 2 model sizes:
    - BERT-Base: 12-layer, 768-hidden, 12-head
    - BERT-Large: 24-layer, 1024-hidden, 16-head
- Trained on 4x4 or 8x8 TPU slice for 4 days

## BERT model finetuning

You can take the pre-trained model and use the same architecture + weights for different tasks
- Simply learn a classifier built on the top layer for the task that you fine-tune for (similar to ULMFiT)
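
The classifier on the top layer can be as small as one linear layer followed by a softmax. The sketch below is a hypothetical illustration: the vector stands in for one top-layer output of the encoder (for example the first token's), and the function name, weights and bias are made up rather than learned.

```python
import math


def classify(top_layer_vector, weights, bias):
    """Linear layer + softmax on top of one encoder output vector.

    weights: one row of len(top_layer_vector) floats per class.
    """
    logits = [sum(w * x for w, x in zip(row, top_layer_vector)) + b
              for row, b in zip(weights, bias)]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up 4-dimensional top-layer vector and a 2-class head.
cls_vector = [0.5, -1.0, 0.25, 2.0]
weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
bias = [0.0, 0.0]
probs = classify(cls_vector, weights, bias)
```

Only `weights` and `bias` are new parameters for the task; everything underneath is initialized from the pre-trained model.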

![](images/transformer_10.png)