# Chapter 9: Transformers

## 9.1 Introduction
* Bidirectional Encoder Representations from Transformers (BERT)
    * BERT aims to derive word embeddings from raw textual data by using both the left and the right context when learning vector representations for words.
        * As opposed to word2vec, which learns a single, context-independent vector for each word

### 9.1.1 BERT up close: Transformers
* Transformers: encoder-decoder models developed by Google
* Like Word2Vec, BERT models are trained to produce a prediction based on an internal representation of contextual info.
    * Word2Vec predicts a word given its immediate context (CBOW) or a context given a certain word (Skip-gram)
    * CBOW: Infer a missing word from its context
    * Skip-Gram: Predict contexts from single words
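
The difference between the two Word2Vec objectives is easiest to see in the training pairs they produce. The sketch below is a minimal illustration, not Word2Vec itself: the function name `training_pairs` and the `window` parameter are hypothetical, and no embeddings are learned here.

```python
def training_pairs(tokens, window=2):
    """Return (cbow, skipgram) training pairs for a list of tokens."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        # CBOW: the whole context is the input, the missing word is the label.
        cbow.append((context, target))
        # Skip-gram: the word is the input, each context word is a separate label.
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram


cbow, skipgram = training_pairs(["the", "cat", "sat", "down"], window=1)
print(cbow[1])       # context and target for "cat"
print(skipgram[:3])  # first few (word, context word) pairs
```

Both objectives see the same windows; they only differ in which side is treated as the input and which as the prediction target.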

* Auto-encoder: When a network encodes input data into an intermediate representation and is trained to minimize reconstruction loss
    * Input -> Hidden -> ... -> Encoded -> ... -> Hidden -> Decoded
    * Reconstruction loss: The error the network makes when attempting to reconstruct the original input from the encoded representation
    * This type of learning is called 'bottleneck learning': because the encoded representation has fewer dimensions than the input, the network learns to eliminate noise and focus on the important dimensions.
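
To make the bottleneck idea concrete, here is a rough sketch that uses a *linear* auto-encoder solved in closed form with an SVD (equivalent to PCA) instead of a trained neural network. The toy data, the helper `reconstruction_loss`, and the choice of 10 input dimensions are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points that really vary along 2 directions in 10-D space, plus a little noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))
X = X - X.mean(axis=0)


def reconstruction_loss(X, k):
    """Mean squared error after encoding X into k dimensions and decoding it again."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    W = Vt[:k].T             # (10, k): weights shared by encoder and decoder
    encoded = X @ W          # the bottleneck representation
    decoded = encoded @ W.T  # reconstruction in the original space
    return float(np.mean((X - decoded) ** 2))


loss_1 = reconstruction_loss(X, 1)
loss_2 = reconstruction_loss(X, 2)
print(loss_1, loss_2)
```

With a 2-dimensional bottleneck the reconstruction error drops to roughly the noise level, because two dimensions are enough to capture the structure in this toy data; what gets discarded is mostly noise.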
    
### 9.1.2 Transformer encoders
* A transformer encoder layer has two internal sublayers:
    * Self-Attention Layer
    * Fully-connected feedforward network layer
* Every sublayer has access to the incoming input vector
    * Skip connection: A connection between two layers bypassing the intervening layers.
* Each sublayer adds its own input back to its output: for a sublayer with input x, the result is x + Sublayer(x), and that sum becomes the input of the next sublayer, and so on.
* This sum (sublayer output + input) is then normalized using the mean / standard deviation of all the summed inputs to the neurons in that layer.
    * Standard normalization: Z-score normalization
* All encoder layers are equipped with attention:
    * Pays attention to every other input
    * Attention Heads: Derive embeddings of every separate input element in the context of surrounding input elements.
        * Each head is a triple of matrices (K, Q, V): key, query, and value, containing separate and (initially) randomized weights.
    * The attention weights are computed via the softmax from Chapter 8, where each weight encodes the attention that input i pays to input j
        * This is not the same as the attention j pays to i, which is why we have a separate query matrix Q and key matrix K.
    * What if we had multiple attention heads?
        * Different heads specialize in different attention focuses
            * Some could focus on prepositional phrase patterns
            * Others could focus on verb - direct object relations
    * When we combine all of this together into one big matrix multiplied by the original vector, the created embeddings incrementally get new self-attention context.
        * Words pay attention to themselves and each other!
    * With multiple encoder layers stacked with their own attention heads, we get hierarchies of attention.
* Positional Encoding is a way to encode word positions into vectors by using sine/cosine functions.
    * Since it uses sine/cosine, the vectors can be seen as a continuous alternative to discrete bits.
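
The pieces of this section (positional encoding, self-attention, skip connections, normalization, and the feedforward sublayer) can be put together in a small NumPy sketch. This is a simplified illustration under several assumptions: a single attention head, no bias terms, no learned scale/shift in the normalization, an even `d_model`, and scores scaled by the square root of the key dimension as in the original Transformer design. All function and parameter names are hypothetical.

```python
import numpy as np


def positional_encoding(seq_len, d_model):
    """Sinusoidal position vectors: sine on even dimensions, cosine on odd ones."""
    pos = np.arange(seq_len)[:, None]
    even_dims = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores)  # row i: how much token i attends to each token j
    return weights @ V, weights


def layer_norm(x, eps=1e-5):
    """Z-score normalization of each token vector across its features."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


def encoder_layer(x, p):
    attn_out, weights = self_attention(x, p["Wq"], p["Wk"], p["Wv"])
    x = layer_norm(x + attn_out)              # skip connection, then normalization
    hidden = np.maximum(0.0, x @ p["W1"])     # feedforward sublayer with ReLU
    x = layer_norm(x + hidden @ p["W2"])      # second skip connection + normalization
    return x, weights


rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 16
shapes = {"Wq": (d_model, d_model), "Wk": (d_model, d_model),
          "Wv": (d_model, d_model), "W1": (d_model, d_ff), "W2": (d_ff, d_model)}
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

# Random stand-ins for word embeddings, with position information added.
x = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)
out, weights = encoder_layer(x, params)
print(out.shape)
print(weights.round(3))
```

Each row of `weights` sums to 1, and the matrix is generally not symmetric: because queries and keys use different matrices, the attention i pays to j differs from the attention j pays to i. A multi-head version would run several such (Q, K, V) triples in parallel and combine their outputs.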

### 9.1.3 Transformer decoders
* The encoder goes and embeds the words + their positions into machine-readable vectors for the input layer
* The decoder goes and attempts to generate an output sequence word by word based on the encoder representation of the input data.
* It does help if we give it more information!
    * The decoder does have access to the desired output symbols excluding what it should currently predict.
    * Autoregressive: Generating a current word based on the previous words it generates.
* The decoder needs to start at a *shifted* position compared to the encoder
    * 1 to the right.
    * Without this shift, it would be trained to copy a sequence rather than infer next words.
* Also, in comparison to the encoder, we need to limit the self-attention to be just backwards facing
    * Implemented via masking out words w/ a binary filter.
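
The two decoder-specific tricks above, the right shift and the backwards-facing attention mask, can be sketched as follows. The helper names (`shift_right`, `causal_mask`, `masked_softmax`) and the use of `0` as a start symbol are assumptions for illustration only.

```python
import numpy as np


def shift_right(target_ids, start_id):
    """Decoder input: a start symbol followed by every target token except the last."""
    return [start_id] + list(target_ids[:-1])


def causal_mask(seq_len):
    """Binary filter: entry (i, j) is 1 when position i may attend to position j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=int))


def masked_softmax(scores, mask):
    # Positions that are masked out get a score of -inf, hence a weight of exactly 0.
    scores = np.where(mask == 1, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)


target = [11, 12, 13, 14]
decoder_input = shift_right(target, start_id=0)
mask = causal_mask(len(target))
weights = masked_softmax(np.zeros((4, 4)), mask)
print(decoder_input)
print(mask)
print(weights)
```

At step i the decoder sees the start symbol plus the first i target tokens and is asked to predict token i + 1, so it can never simply copy the token it is supposed to produce.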

## 9.2 BERT: Masked language modeling
* BERT is not a full (encoder-decoder) transformer: it only has the encoder portion.
* Again, it's only meant to produce vector embeddings like Word2Vec.
* Masked language models:
    * Primarily used to model word distribution patterns by masking out certain words and having the model predict the words 'under the masks'.
    * Denoising autoencoding:
        * Attempts to reconstruct a sequence of words from a corrupted 'noisy' signal.
        * This is the sequence of words containing masks.
    * Derived from the 1950s Cloze test for assessing the mastery of a native or second language, as well as the readability of text.
        * EX: Fill in the blanks in the sentence:
            * The _ performed an echoscopy ('nurse', 'doctor')
    * BERT works as it exploits remote and non-adjacent context for predicting a blanked-out, masked word.
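
A masked language modeling example can be produced from plain text in a few lines. The sketch below is deliberately simplified: it always substitutes the `[MASK]` symbol, whereas real BERT pre-training uses a more elaborate replacement scheme. The function name `mask_tokens`, the `seed` parameter, and the default rate of 0.15 (the rate commonly cited for BERT) are illustrative choices.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels hold the hidden word or None."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # the corrupted, 'noisy' input
            labels.append(tok)    # what the model must predict under the mask
        else:
            masked.append(tok)
            labels.append(None)   # no prediction needed at this position
    return masked, labels


sentence = "the doctor performed an echoscopy".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3)
print(masked)
print(labels)
```

The model only receives `masked`; the loss is computed on the positions where `labels` is not `None`, which is exactly the Cloze-style 'fill in the blank' setup.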

### 9.2.1 Training BERT
* BERT is trained on two objectives simultaneously:
    * Predicting words hidden under masks
    * Predicting, for an arbitrary pair of 2 sentences, whether the 2nd sentence is a natural progression of the first.
        * Very similar to the question and answer applications from before!
    * As such, BERT optimizes both the loss function for sentence pair prediction and the loss function for unveiling masked tokens.
* BERT produces contextual embeddings: every word gets an embedding derived from the attention patterns the input evokes from the model.
    * 2 sentences will have different embeddings for the same word (bank):
        * I went to the bank
        * The man went to the river bank
    * The encoder layers re-do their computations of key-query attention values!
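
Optimizing both objectives at once amounts to adding the two losses. The following sketch assumes the model has already produced probability distributions; the numbers, the 4-word vocabulary, and the function names are made up for illustration, and the two terms are summed without any weighting.

```python
import numpy as np


def cross_entropy(probs, target_index):
    """Negative log-probability the model assigned to the correct class."""
    return -float(np.log(probs[target_index]))


def bert_pretraining_loss(mlm_probs, mlm_targets, nsp_probs, nsp_target):
    # Masked-token objective: average over the masked positions only.
    mlm_loss = np.mean([cross_entropy(p, t) for p, t in zip(mlm_probs, mlm_targets)])
    # Sentence-pair objective: is the 2nd sentence a natural continuation (class 0) or not (class 1)?
    nsp_loss = cross_entropy(nsp_probs, nsp_target)
    return float(mlm_loss + nsp_loss)


# Two masked positions over a toy 4-word vocabulary.
mlm_probs = np.array([[0.7, 0.1, 0.1, 0.1],
                      [0.25, 0.25, 0.25, 0.25]])
mlm_targets = [0, 2]
nsp_probs = np.array([0.9, 0.1])

loss = bert_pretraining_loss(mlm_probs, mlm_targets, nsp_probs, nsp_target=0)
print(loss)
```

A confident, correct prediction (0.7 for the first mask, 0.9 for the sentence pair) contributes little to the loss, while the uniform guess for the second mask contributes the most.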

### 9.2.2 Fine-tuning BERT
* A BERT embedding can be fine-tuned by connecting it to a classification problem.
    * Also called a downstream task.
* The loss from the classification predictions will be fed back into the pre-trained BERT model.
* All we need to do here is just add a few softmax layers with labeled data to fine-tune the model.
* Since it's easy to increase the accuracy this way, these types of models are well suited to sentiment analysis.
* Auto-encoding model:
    * It corrupts its input data during training, and tries to find an optimal encoding to optimally reconstruct the masked input data.
    * These typically pair with an explicit decoder which allows for text generation.
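
The fine-tuning setup from this section (a softmax classifier on top of the embeddings, trained on labeled data) can be sketched without a real BERT model. Everything below is a stand-in: the random `features` play the role of sentence embeddings, the toy `labels` play the role of sentiment labels, and only the classification head is updated. In actual fine-tuning the loss would also be propagated back into the pre-trained encoder weights.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def head_step(features, labels, W, b, lr=0.2):
    """One gradient step on a softmax head; returns the loss measured before the step."""
    n = len(labels)
    probs = softmax(features @ W + b)
    loss = -np.mean(np.log(probs[np.arange(n), labels]))
    # Gradient of the cross-entropy loss with respect to the logits.
    grad_logits = probs.copy()
    grad_logits[np.arange(n), labels] -= 1.0
    grad_logits /= n
    # In-place updates of the head parameters.
    W -= lr * features.T @ grad_logits
    b -= lr * grad_logits.sum(axis=0)
    return float(loss)


rng = np.random.default_rng(0)
features = rng.normal(size=(32, 16))         # stand-in for BERT sentence embeddings
labels = (features[:, 0] > 0).astype(int)    # toy 'positive' / 'negative' labels
W, b = np.zeros((16, 2)), np.zeros(2)

losses = [head_step(features, labels, W, b) for _ in range(100)]
print(losses[0], losses[-1])
```

The loss starts at log(2), the value for a classifier that guesses uniformly between two classes, and decreases as the head learns to read the labels off the embeddings.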

### 9.2.3 Beyond BERT
* Primary shortcoming of BERT is its masking approach
    * The masks make sense for the pre-training when BERT is trained to predict words under masks
    * But the masks are not relevant for fine-tuning, where the input contains no [MASK] tokens
* BERT also makes an independence assumption
    * Sometimes the interactions between the masked tokens can be lost.
    * Every masked token is predicted in isolation from its neighboring masks.
* How can we also capture the information between the masks?
    * Enter XLNet!
    * Example:
        * [MASK] [MASK] and a Happy New Year
        * BERT: log P(Merry | and a Happy New Year) + log P(Christmas | and a Happy New Year)
        * XLNet: log P(Merry | and a Happy New Year) + log P(Christmas | Merry, and a Happy New Year)
    * XLNet utilizes Permutation Language Modeling
        * Words are predicted on different permutations!
* Since XLNet does away with explicit masking, we get super nice performance margins!
    * However, it may not always be the best choice, as these permutations are costly to compute and process :(
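
The difference between the two factorizations in the "Merry Christmas" example comes down to what each predicted word is allowed to condition on. The sketch below only lists those conditioning contexts; it does not compute any probabilities, and the function names and the `order` argument (one possible factorization order, given as token positions) are hypothetical.

```python
def bert_contexts(tokens, masked_positions):
    """BERT: every masked token is conditioned on the unmasked tokens only."""
    visible = [t for i, t in enumerate(tokens) if i not in masked_positions]
    return [(tokens[i], list(visible)) for i in masked_positions]


def permutation_contexts(tokens, order, num_targets):
    """Permutation LM: each target is conditioned on everything earlier in the order."""
    result = []
    for step in range(len(order) - num_targets, len(order)):
        context = [tokens[i] for i in order[:step]]
        result.append((tokens[order[step]], context))
    return result


tokens = ["Merry", "Christmas", "and", "a", "Happy", "New", "Year"]

bert = bert_contexts(tokens, masked_positions=[0, 1])
# One factorization order: the five known words first, then "Merry", then "Christmas".
xlnet = permutation_contexts(tokens, order=[2, 3, 4, 5, 6, 0, 1], num_targets=2)

for target, context in bert:
    print("BERT :", target, "|", context)
for target, context in xlnet:
    print("XLNet:", target, "|", context)
```

In the permutation version "Christmas" gets to see "Merry", which is the dependency BERT's independence assumption throws away. Training over many different orders is what makes the approach expensive.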

## Summary:
* Transformers are complex encoder-decoder networks, based on self-attention.
* BERT is a Transformer encoder
* BERT derives attention-weighted word embeddings, using Masked Language Modeling, a complex attention mechanism and positional encoding.
* BERT differs from Word2Vec by creating dynamic embeddings that discriminate between different contexts, and is similar in its downstream fine-tuning facilities.
* XLNet differs from BERT by omitting the masking of words and using a permutation language model instead; it may be a better option in some circumstances, while being more costly from a computational point of view.