# Advanced concepts of transformer architecture

## Decoder models

* transformers were originally developed for language translation and usually consist of two parts
    * encoder -> processes input data (source-language sentence)
    * decoder -> generates output data (translated sentence)
* over time, decoders have become central to text generation tasks, forming the basis of models like GPT and LLaMA

* decoder model
    * generative pre-training -> model is trained to predict the next word in a sequence based on the previous words
    * autoregressive models -> generate text sequentially, predicting each word from the words that came before

* fine-tuning & reinforcement learning
    * after initial training, GPT models can be fine-tuned for specific tasks (question answering or classification) by training on labeled data (supervised learning)
    * reinforcement learning from human feedback (RLHF) is a fine-tuning method where human feedback is used to improve performance, especially in applications like chatbots

* decoders in text generation
    * key difference from translation: in translation, decoders rely on input from encoders (using cross-attention); in text generation they work independently, predicting the next word based on the preceding sequence
    * the autoregressive process starts with a beginning-of-sentence token, predicts the next word, and appends it to the sequence

* masked self-attention in decoders hides future tokens

* text generation process
    * input prompt
    * tokenization
    * word embeddings
    * positional encoding
    * contextual embeddings
    * logits
    * argmax
    * appending & repeat
    * generation stops once end-of-sequence token or token limit is hit
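
The steps above can be sketched as a greedy decoding loop. This is a toy illustration only: `toy_logits` is a hypothetical stand-in for the embedding, positional encoding, and decoder layers, and the vocabulary and token names are made up for the example.

```python
# Toy vocabulary; the token names and ids are illustrative assumptions.
VOCAB = ["<bos>", "the", "cat", "sat", "<eos>"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def toy_logits(token_ids):
    """Hypothetical stand-in for a decoder: one score per vocabulary entry.

    It always favours the id that follows the last id (cyclically),
    so the generated text is deterministic and easy to check by hand.
    """
    favoured = (token_ids[-1] + 1) % len(VOCAB)
    return [1.0 if i == favoured else 0.0 for i in range(len(VOCAB))]


def greedy_generate(prompt_tokens, max_new_tokens=10):
    ids = [TOKEN_TO_ID[tok] for tok in prompt_tokens]  # tokenization
    for _ in range(max_new_tokens):                    # token limit
        logits = toy_logits(ids)                       # scores for the next token
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        ids.append(next_id)                            # append and repeat
        if VOCAB[next_id] == "<eos>":                  # stop at end-of-sequence
            break
    return [VOCAB[i] for i in ids]
```

Replacing the argmax with sampling from the softmax of the logits would give non-deterministic generation; the loop structure stays the same.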

### Decoder training

* key concepts
    * autoregressive approach -> the text is generated sequentially, based on previously generated tokens
    * causal attention masking -> future tokens are hidden to prevent leakage
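
Both concepts can be written compactly. As a sketch in standard notation (the symbols $w_t$ for the token at step $t$, $T$ for the sequence length, and $\theta$ for the model parameters are assumptions introduced here, not part of the notation below), an autoregressive model factorizes the sequence probability so that each token depends only on the tokens before it, and training minimizes the negative log-likelihood:

$$
p_\theta(w_1, \dots, w_T) = \prod_{t=1}^{T} p_\theta(w_t \mid w_{<t}), \qquad
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(w_t \mid w_{<t})
$$

Causal masking is what enforces the conditioning on $w_{<t}$ when all positions are processed in parallel.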

* notation
    * $\Omega_t$ stands for the predicted token at time step $t$, obtained through the decoder's final layer
    * $\hat x_t$ represents predicted word embedding at time step $t$, the hat indicates it is an estimate
    * positional encoding is added to word embeddings to provide information about the position
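
As an illustration of the last point, here is a minimal sketch of adding position information to word embeddings. It assumes the fixed sinusoidal encoding from the original transformer; the notes do not say which encoding is used, and the function names are hypothetical.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Sinusoidal encoding for one position (d_model is assumed to be even)."""
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))  # even dimension
        encoding.append(math.cos(angle))  # odd dimension
    return encoding


def add_positions(embeddings):
    """Add the position encoding to each word embedding element-wise."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(emb, sinusoidal_encoding(pos, d_model))]
        for pos, emb in enumerate(embeddings)
    ]
```

Because the encoding is added rather than concatenated, the embedding dimension stays unchanged.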

* autoregressive prediction process
    * start with the first word embedding $x_0$ at time $t=0$
    * the decoder generates contextual embeddings
    * contextual embeddings are fed to the network that predicts the next word
    * the predicted word embedding $\hat x_t$ is appended to the embeddings seen so far, and the extended sequence is fed back to the model
    * process repeats until an end-of-sequence token is produced or the token limit is reached

* training process
    * training data -> input sentences are paired with outputs (the same sequences shifted by one token), special tokens might be needed (BOS, EOS)
    * word embeddings -> input tokens are converted into word embeddings and fed into the model
    * causal masking -> prevents the model from seeing future tokens; negative infinity is applied to the upper triangle of the attention mask, forcing the corresponding attention weights to zero after softmax
    * teacher forcing -> model is fed the actual previous token rather than its own prediction, which keeps training aligned with the actual sequence
    * loss function -> compares predicted and actual tokens, calculated for every position in the sequence
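
The pairing of inputs with shifted targets and the per-position loss can be sketched in plain Python. The token names `<bos>` and `<eos>` and the helper names are assumptions for illustration; cross-entropy is used as the loss, which is the usual choice for next-token prediction.

```python
import math

BOS, EOS = "<bos>", "<eos>"  # special token names are assumptions


def make_training_pair(tokens):
    """Return (inputs, targets) where targets are the inputs shifted by one."""
    full = [BOS] + tokens + [EOS]
    return full[:-1], full[1:]


def cross_entropy(logits, target_index):
    """Negative log-probability of the target under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
    return log_sum_exp - logits[target_index]


def sequence_loss(logits_per_position, target_indices):
    """Average the per-position losses over the whole sequence."""
    losses = [cross_entropy(l, t) for l, t in zip(logits_per_position, target_indices)]
    return sum(losses) / len(losses)
```

With teacher forcing, the `inputs` list always contains the actual tokens, so an early wrong prediction does not corrupt the inputs for later positions.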

* training vs inference
    * in training, the model processes the entire input sequence at once, uses the actual word embeddings, and employs teacher forcing to ensure correct inputs
    * inference is autoregressive, predictions depend on previously generated tokens

### Causal language model in PyTorch

* causal language model predicts the next word in a sequence based on the previous words
* causal masking as part of the training process

* dataset
    * IMDB reviews -> texts & sentiments
    * special tokens include UNK, PAD, EOS

* processing
    * context size (block size) -> how many tokens serve as input for predicting the next token
    * select a starting point in the sequence and take a window of tokens equal in length to the block size
    * create the target sequence by shifting that window by one token
    * collate function combines multiple sequences into a batch & pads them to a fixed size
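
A minimal sketch of these processing steps in plain Python. The random choice of the starting point, the `PAD_ID` value, and the function names are assumptions; the sequence is assumed to be longer than the block size.

```python
import random

PAD_ID = 0  # assumed id of the PAD token


def sample_block(token_ids, block_size, rng=random):
    """Pick a window and return (source, target), with target shifted by one token."""
    # The last valid start leaves room for block_size inputs plus one target token.
    start = rng.randint(0, len(token_ids) - block_size - 1)
    source = token_ids[start : start + block_size]
    target = token_ids[start + 1 : start + block_size + 1]
    return source, target


def collate(batch, pad_id=PAD_ID):
    """Pad every (source, target) pair to the longest sequence in the batch."""
    max_len = max(len(src) for src, _ in batch)
    sources = [src + [pad_id] * (max_len - len(src)) for src, _ in batch]
    targets = [tgt + [pad_id] * (max_len - len(tgt)) for _, tgt in batch]
    return sources, targets
```

In a PyTorch pipeline, a function like `collate` would typically be passed to the data loader and return tensors instead of lists.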

* masking
    * setting negative inf in the upper triangular part of the attention matrix
    ```
    [0, -inf, -inf, -inf]
    [0,   0, -inf, -inf]
    [0,   0,    0, -inf]
    [0,   0,    0,    0]
    ```
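
To see why this mask hides future tokens, the sketch below adds it to raw attention scores and applies softmax row by row. It is written in plain Python for clarity; the function names are hypothetical.

```python
import math

NEG_INF = float("-inf")


def causal_mask(size):
    """0 on and below the diagonal, -inf above it."""
    return [[0.0 if col <= row else NEG_INF for col in range(size)] for row in range(size)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention_weights(scores):
    """Add the causal mask to raw attention scores, then normalize each row."""
    mask = causal_mask(len(scores))
    return [
        softmax([s + m for s, m in zip(score_row, mask_row)])
        for score_row, mask_row in zip(scores, mask)
    ]
```

Each row of the result sums to one, and every entry above the diagonal is exactly zero, so position $t$ attends only to positions up to $t$.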

* architecture
    * embedding layer -> maps token indices to their higher-dimensional representation
    * positional encoding -> adds information about the position of each token in the sentence
    * transformer decoder -> multiple self-attention heads, uses causal mask
    * linear layer -> outputs logits over the vocabulary
    * forward pass applies all of these steps in sequence
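
A hedged sketch of such an architecture in PyTorch. Several choices here are assumptions rather than the notes' implementation: the class name `TinyCausalLM`, all layer sizes, learned position embeddings in place of a fixed encoding, and the use of `nn.TransformerEncoder` with a causal mask as the decoder-only stack (a common way to get self-attention blocks without cross-attention).

```python
import math

import torch
import torch.nn as nn


class TinyCausalLM(nn.Module):
    """Decoder-only language model sketch; all sizes are illustrative."""

    def __init__(self, vocab_size, d_model=32, n_heads=4, n_layers=2, max_len=64):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Learned positions for brevity; a sinusoidal encoding would also work.
        self.positions = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, dropout=0.0, batch_first=True
        )
        # Self-attention blocks plus a causal mask act as a decoder-only stack.
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        seq_len = token_ids.size(1)
        device = token_ids.device
        pos = torch.arange(seq_len, device=device)
        x = self.embedding(token_ids) * math.sqrt(self.d_model) + self.positions(pos)
        # -inf above the diagonal hides future tokens from each position.
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=device), diagonal=1
        )
        x = self.blocks(x, mask=mask)
        return self.lm_head(x)  # logits of shape (batch, seq_len, vocab_size)
```

Calling the model on a batch of token ids of shape `(batch, seq_len)` returns one logit vector per position; the logits at position $t$ are the scores for the token at position $t+1$.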

* training
    * input sequence and target sequence are obtained from the batch
    * loss is computed with cross-entropy across the target sequence
    * backward pass for parameter update
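
The three training steps can be sketched as a single update function. The stand-in model below (an embedding followed by a linear layer) is hypothetical and only keeps the example self-contained; the `PAD_ID` value, the optimizer, and the learning rate are likewise assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PAD_ID = 0  # assumed id of the PAD token


def train_step(model, optimizer, sources, targets):
    """One update: forward pass, cross-entropy over all positions, backward pass."""
    model.train()
    optimizer.zero_grad()
    logits = model(sources)  # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and time dimensions
        targets.reshape(-1),
        ignore_index=PAD_ID,  # padded positions do not contribute to the loss
    )
    loss.backward()
    optimizer.step()
    return loss.item()


# Hypothetical stand-in model: any module mapping token ids to logits fits here.
vocab_size = 20
model = nn.Sequential(nn.Embedding(vocab_size, 16), nn.Linear(16, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
```

Flattening the batch and time dimensions lets a single cross-entropy call cover every position in every sequence.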

## Encoder models

* BERT (Bidirectional Encoder Representations from Transformers) is pre-trained in a self-supervised manner on large corpora and can be fine-tuned for specific tasks; pre-training uses masked language modeling and next sentence prediction

* architecture
    * encoder-only approach, so it is not designed for text generation, but it excels at comprehension
    * processes text in both directions (left and right)

* key features
    * bidirectional context -> looks at tokens before and after a given word to understand its meaning or predict a missing word
    * segment embeddings -> distinguish the two sentences in paired tasks (question answering)
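    * positional encodings -> added to the token embeddings as in the original transformer architecture, though BERT learns its position embeddings instead of using fixed sinusoids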

* masked language modeling (MLM)
    * randomly masking some words in a sentence and training BERT to predict them
    * helps to learn contextual representations of words and understand their relationships
    * masking strategy
        * select 15% of tokens for prediction
        * 80% of them are replaced with [MASK]
        * 10% of them are replaced with a random word
        * 10% of them are left unchanged
    * prediction
        * the model processes the entire sequence and generates contextual embeddings
        * embeddings are passed through a linear layer to produce logits, and the word with the highest probability is predicted
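
The masking strategy can be sketched as follows. The function name, the use of `None` as the "no prediction" label, and passing the vocabulary as a plain list are assumptions for illustration.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= select_prob:
            corrupted.append(token)  # not selected: left as-is, no loss here
            labels.append(None)
            continue
        labels.append(token)  # selected: the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)  # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random word
        else:
            corrupted.append(token)  # 10%: keep unchanged
    return corrupted, labels
```

The loss is computed only at positions where the label is not `None`, so the model cannot tell from the input alone which unmasked tokens it will be asked to predict.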

* next sentence prediction (NSP)
    * determines whether one sentence logically follows another, which is useful for question answering
    * process overview
        * two sentences are combined into a single sequence with special tokens: [CLS] marks the start, and [SEP] separates the sentences and marks the end
        * segment embeddings are used to indicate whether a token belongs to the first or second sentence
    * strategy
        * model predicts whether the second sentence is the actual next sentence or a random sentence
        * a binary label is used
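
A minimal sketch of building one NSP training example from tokenized sentences. The function names and the label convention (1 for the actual next sentence, 0 for a random one) are assumptions; the 50/50 split between the two cases follows the usual BERT setup.

```python
import random

CLS, SEP = "[CLS]", "[SEP]"


def build_nsp_example(sentence_a, sentence_b):
    """Join two tokenized sentences and mark each token's segment."""
    tokens = [CLS] + sentence_a + [SEP] + sentence_b + [SEP]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    return tokens, segment_ids


def sample_nsp_pair(sentences, index, rng):
    """Return (tokens, segment_ids, label); label 1 means 'actual next sentence'."""
    if rng.random() < 0.5:
        second, label = sentences[index + 1], 1
    else:
        # Exclude the true next sentence so a label of 0 is never wrong by accident.
        candidates = [s for i, s in enumerate(sentences) if i != index + 1]
        second, label = rng.choice(candidates), 0
    tokens, segment_ids = build_nsp_example(sentences[index], second)
    return tokens, segment_ids, label
```

The binary label is predicted from the contextual embedding of the `[CLS]` token, which is why that token is placed at the start of every sequence.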

* training -> minimizing the combined loss from both MLM and NSP tasks
* fine-tuning -> adjusting the model to specific tasks such as sentiment analysis, text summarization or Q&A
* key differences between encoder and decoder architectures
    * encoder
        * processes the entire sequence at once
        * bidirectional attention
        * no text generation
    * decoder
        * causal masking for attention
        * text generation

## Applications for language translation