
### Automatic mixed precision
- Implemented GPT2 training with torch.amp.autocast [here](https://github.com/ajayrfhp/GPT2/blob/master/notebooks/utils.py)
  - Autocast automatically runs eligible operations (such as matrix multiplications) in FP16, while numerically sensitive operations (such as reductions) remain in FP32. The idea is to reduce memory usage and speed up training.
  - The main code modification is loss scaling. Small FP16 gradients can be rounded to 0, so the loss is multiplied by a large scaling factor before backprop, which keeps the gradients above the underflow threshold. The gradients are scaled back down before the weight update.
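
To make the underflow argument concrete, here is a minimal sketch in plain Python (not the code from the linked repo). It emulates FP16 storage with the half-precision format of the `struct` module; the gradient value and the scale factor `SCALE` are made-up numbers chosen for illustration. In PyTorch this bookkeeping is normally delegated to a `GradScaler` object rather than done by hand.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


SCALE = 2.0 ** 16   # hypothetical loss-scaling factor, for illustration only
true_grad = 1e-8    # smaller than the smallest positive FP16 value (about 6e-8)

# Without scaling, the gradient underflows to zero when stored in FP16.
unscaled_fp16 = to_fp16(true_grad)

# Scaling the loss scales every gradient by the same factor (chain rule),
# so the scaled gradient lands in the representable FP16 range.
scaled_fp16 = to_fp16(true_grad * SCALE)

# Unscale in higher precision before the optimizer would use the gradient.
recovered = scaled_fp16 / SCALE

print("stored without scaling:", unscaled_fp16)
print("stored with scaling:   ", scaled_fp16)
print("after unscaling:       ", recovered)
```

The unscaled gradient is lost entirely, while the scaled one survives the FP16 round trip with only a small rounding error.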



### Transformers
- Implemented GPT2 https://github.com/ajayrfhp/GPT2/
  - Tested on toy datasets like repeating abcde
  - Tried training on WikiText101 with 9 million tokens; the 150-million-parameter model overfits. https://github.com/ajayrfhp/GPT2/blob/master/notebooks/GPT2OnWikiTextLarge.ipynb
  - Wrote unit tests to check if model was implemented correctly.
  - Was optimizing the code to run on the 90-million-token WikiText dataset and hit an index error that still needs to be fixed.
- Transformers don't need to compress the context into a single vector; they attend over the whole input matrix. Transformers are easier to parallelize, whereas RNNs process a sequence token by token. RNNs can't model long-range dependencies well.
- We use layer norm, not batch norm, because sequences can be of different lengths. RMSNorm just divides by the root mean square of the activations, with no need to subtract the mean, so it uses fewer FLOPs.
- Cost
  - Compute
    - Computing the attention scores (QK^T) and multiplying the self-attention matrix with the values
    - O (Batch_size * TokenLength * TokenLength * D)
  - Memory
    - Self-attention matrices need to be stored, so memory grows quadratically with sequence length.
    - O (Batch_size * TokenLength * TokenLength)
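
The cost estimates above can be turned into a rough counter. This is a back-of-the-envelope sketch, not a profiler: the function name `attention_cost` and the `n_heads` argument are assumptions added here, and the counts ignore the Q/K/V projections, softmax, and the feedforward block.

```python
def attention_cost(batch_size: int, seq_len: int, d_model: int, n_heads: int = 1):
    """Rough multiply-add and stored-score counts for one self-attention layer."""
    # QK^T: every query position is compared with every key position.
    # Summed over heads, the feature dimension adds up to d_model.
    score_macs = batch_size * seq_len * seq_len * d_model
    # Multiplying the attention weights with V costs the same amount of work.
    value_macs = batch_size * seq_len * seq_len * d_model
    # One (seq_len x seq_len) score matrix per head and per batch element.
    score_elements = batch_size * n_heads * seq_len * seq_len
    return score_macs + value_macs, score_elements


macs, elements = attention_cost(batch_size=1, seq_len=1024, d_model=768, n_heads=12)
macs_2x, elements_2x = attention_cost(batch_size=1, seq_len=2048, d_model=768, n_heads=12)
print("multiply-adds:", macs, "stored scores:", elements)
print("growth when doubling the sequence length:", macs_2x / macs, elements_2x / elements)
```

Doubling the sequence length quadruples both numbers, which is the quadratic term in the two big-O expressions.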

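For the normalization note in this section, here is a small comparison of layer norm and RMSNorm on a single feature vector. It is a sketch in plain Python that leaves out the learned gain and bias parameters; the `eps` value is an assumed default.

```python
import math


def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-5):
    """Divide by the root mean square; no mean subtraction is needed."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]


x = [1.0, 2.0, 3.0, 4.0]
print("layer norm:", layer_norm(x))
print("rms norm:  ", rms_norm(x))
```

RMSNorm skips one pass over the vector (the mean), which is where the FLOP saving comes from.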

### Tokenization
- Transformer models need input text broken into discrete tokens.
- BPE is a common tokenization scheme: merge the most frequently occurring pair of adjacent tokens into one new token, and repeat until a target number of merges is reached.
  - Implemented BPE from scratch https://github.com/ajayrfhp/GPT2/blob/master/notebooks/bytepairencoding.ipynb
  - Used Copilot to port my code to Cython; benchmarks show 66% faster processing time.
  - BPE does not need an UNK token because it starts from the character level.
- SentencePiece does not pre-split on spaces. It treats the text as a raw stream of Unicode characters (whitespace included) and, in its unigram mode, fits a subword vocabulary to the training data using EM. https://github.com/ajayrfhp/GPT2/blob/master/notebooks/comparison_of_tokenizers.ipynb
- WordPiece is older. It splits words into subword pieces and merges the pairs of subwords that most improve the fit to the training data. WordPiece has an UNK token.
- A Vision Transformer takes a (w, h, c) image and applies a (k, k) kernel with stride k to get a (w/k, h/k, d) feature map. The first two dimensions are then flattened to give (w*h)/(k*k) tokens, each a d-dimensional vector, which are fed to the transformer.
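
The BPE merge loop described above can be sketched in a few lines. This is a simplified illustration, not the implementation from the linked notebook: it works on one raw character sequence without any whitespace pre-splitting, and the function name `train_bpe` is made up here.

```python
from collections import Counter


def train_bpe(text: str, num_merges: int):
    """Repeatedly merge the most frequent adjacent pair of tokens."""
    tokens = list(text)  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not compress anything
        merges.append((left, right))
        # Rewrite the sequence, replacing each occurrence of the pair.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges


tokens, merges = train_bpe("abababab", num_merges=2)
print(tokens)
print(merges)
```

Because the starting vocabulary is every character in the text, any string made of those characters can always be tokenized, which is why no UNK token is required in this setup.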

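For the Vision Transformer note in this section, the sketch below shows only the patch-splitting step on a nested-list image. A (k, k) kernel with stride k is equivalent to flattening each k x k x c patch and applying one shared linear projection to d dimensions; that projection is omitted here, and the helper name `patchify` is an assumption.

```python
def patchify(image, k):
    """Split an image (rows x columns x channels, as nested lists) into flattened k x k patches."""
    height, width = len(image), len(image[0])
    assert height % k == 0 and width % k == 0, "image size must be divisible by k"
    patches = []
    for top in range(0, height, k):
        for left in range(0, width, k):
            # Flatten one k x k x c patch into a single vector.
            patch = [
                channel
                for row in range(top, top + k)
                for col in range(left, left + k)
                for channel in image[row][col]
            ]
            patches.append(patch)
    return patches


# A toy 4 x 6 image with 3 channels per pixel.
image = [[[row, col, 0] for col in range(6)] for row in range(4)]
patches = patchify(image, k=2)
print("tokens:", len(patches), "values per token:", len(patches[0]))
```

With k = 2 the toy image yields (4 * 6) / (2 * 2) = 6 tokens, each holding 2 * 2 * 3 = 12 values before projection.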
### Positional embeddings
- Resource https://huggingface.co/blog/designing-positional-encoding
- Desired properties
  - Unique encoding for each position
  - Linear relationship between two encoded positions, so relative offsets are easy to recover
  - Generalizes to sequences longer than those seen in training
  - Deterministic mapping
- Needed in transformers because self-attention on its own ignores token order, so positional information is lost
- Binary positional embedding
  - Add binary representation of each position
  - Transitions from one position to the next are not smooth, because bits flip discretely
- Fixed sinusoidal embedding
  - Smoother transitions than binary positional encoding.
  - Sin, cos functions of different frequencies are used to encode different types of relationships between positions.
- Absolute positional embedding
  - Learn embedding vector for each position. Works best if test sequences are of similar length to train sequences. Different dimensions encode position information captured in different frequencies
- Relative positional embedding
  - In language, meaning often comes from relation of words to other words
  - Encode distance between 2 words.
- RoPE
  - Apply a rotation matrix to the query and key vectors, where the rotation angle depends on the position of the token.
    - Each position is mapped to a unique rotation angle. x' = Rx
    - Angle differences encode relative positions. The inner product of R_m q and R_n k depends only on m - n.
    - As the distance between words increases, attention weights decay. Easy to implement.
- ALiBi
  - A linear bias proportional to the distance between the query and key positions is added to the attention scores. A per-head slope parameter controls how strongly distance is penalized.
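
The RoPE property noted above, that the score depends only on the relative offset, can be checked numerically. This is a minimal sketch with a single 2-D pair of features; real RoPE applies the same idea to many feature pairs with different frequencies. The base angle `theta` and the example vectors are arbitrary values chosen for illustration.

```python
import math


def rotate(vec, position, theta=0.1):
    """Rotate a 2-D vector by the angle position * theta."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (c * x - s * y, s * x + c * y)


def rope_score(q, k, m, n):
    """Inner product of the query rotated to position m and the key rotated to position n."""
    rq, rk = rotate(q, m), rotate(k, n)
    return rq[0] * rk[0] + rq[1] * rk[1]


q, k = (1.0, 2.0), (0.5, -1.0)
# All three pairs of positions share the same offset m - n = 2.
print(rope_score(q, k, 3, 1), rope_score(q, k, 10, 8), rope_score(q, k, 102, 100))
```

The three scores agree up to floating-point error, even though the absolute positions differ.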

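The ALiBi bias from this section can be written out as a small matrix. This is a sketch under two assumptions: the causal mask is folded into the bias as negative infinity for future positions, and the slopes follow the geometric sequence used for head counts that are powers of two. Both helper names are hypothetical.

```python
def alibi_bias(seq_len, slope):
    """bias[i][j] = -slope * (i - j) for keys at or before the query, -inf for future keys."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


def alibi_slopes(n_heads):
    """Geometric sequence of per-head slopes: 2^(-8/n), 2^(-16/n), and so on."""
    return [2.0 ** (-8.0 * (head + 1) / n_heads) for head in range(n_heads)]


slopes = alibi_slopes(8)
print("slopes:", slopes)
for row in alibi_bias(seq_len=4, slope=slopes[0]):
    print(row)
```

The bias is added to the attention scores before the softmax, so keys that are farther away receive a larger penalty, and heads with smaller slopes can look further back.
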
### TO DO
- Study the feedforward (MLP) block of transformers.
- Understand the difference between allocated and reserved memory, transformer parameter count, and memory consumption by attention and feedforward layers: https://erees.dev/transformer-memory/
- Parallel transformers: https://arxiv.org/pdf/2205.05198.pdf
  - Don't go too deep; try to understand concepts like 1) how tensor parallelism is implemented in the feedforward block, 2) how an attention head and its fully connected layer can run in parallel, 3) how sequence parallelism works.