In [2]:
from IPython.display import HTML, Image

# Transformer architecture diagram, displayed in a later cell
transformerarch = Image(url="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*THCl9vdQot0HNZRp6YmzMw.png")

### Prerequisites
Temperature check: how many of these concepts are we not yet familiar with?
1. Introduction to Machine Learning: ML, supervised/unsupervised learning;
2. Deep Learning (basics): neural networks, activation functions, backpropagation, gradient descent;
3. Convolutional Neural Networks: CNNs and their application in CV tasks;
4. Recurrent Neural Networks (RNNs): basics, LSTM and GRU, and their use with sequence data;
5. Intro to NLP: overview of NLP (tokenization, word embeddings, sentiment analysis);
6. Seq2Seq models: how they work and their role in tasks like machine translation;
7. Attention Mechanism: attention and its significance in transforming sequences;
8. Transformers and BERT

### Lecture Overview

1. **Introduction to LLMs**: What LLMs are and why they're important (W1)

2. **Training LLMs**: How LLMs are trained, including the challenges and techniques involved (W1)

3. **Understanding GPT Architecture**: Detailed study of the GPT architecture used in LLMs. (W1)

4. **Fine-tuning Large Language Models**: Understand the concept of fine-tuning and its application in LLMs. (W2)

5. **Applications of LLMs**: Explore real-world applications of LLMs in various industries. (W2)

6. **Evaluating LLMs**: Learn about different metrics and methods to evaluate the performance of LLMs. (W3)

7. **Bias in LLMs**: Discuss the potential for bias in LLMs and how to mitigate it. (W3)

8. **Limitations and Future of LLMs**: Discuss the current limitations of LLMs and future research directions. (W4)

9. **Case Study - GPT-3**: A detailed case study on GPT-3, its architecture, training, and applications. (W4)

### Introduction to LLMs and how they are trained
An LLM is a type of ML model designed to understand, generate and converse in human language; it is 'large' because of its vast number of parameters.
- Ability to generate human-like text
- Patterns learnt from the training data allow the model to generate text based on the inputs it receives
- Rule of thumb: a 7B-parameter model takes one SOTA GPU to run, a 13B-parameter model takes two, etc.
- LLMs can perform natural language processing (NLP) tasks, but note LLM ≠ NLP model
- OpenAI first released GPT-1 in 2018 and GPT-3 in 2020; terabytes of data were used to train these models
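
The rule of thumb above can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below is only an estimate under stated assumptions: weights stored in 16-bit precision (2 bytes per parameter), a hypothetical GPU with 24 GB of memory, and weights only (activations and caches are ignored, so real requirements are higher).

```python
import math

def weight_memory_gb(n_params, bytes_per_param=2):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

def gpus_needed(n_params, gpu_memory_gb=24.0, bytes_per_param=2):
    """Number of GPUs needed just to hold the weights.

    gpu_memory_gb=24.0 is an assumed, illustrative GPU size.
    """
    return math.ceil(weight_memory_gb(n_params, bytes_per_param) / gpu_memory_gb)

for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
    print(f"{name}: ~{weight_memory_gb(n_params):.0f} GB of weights, "
          f"{gpus_needed(n_params)} GPU(s) at 24 GB each")
```

With a larger GPU or a lower-precision format the counts change, which is why this is a rule of thumb rather than a fixed requirement.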

### Architecture of LLM 
'Attention Is All You Need' (2017), the seminal paper that first proposed the most commonly used architecture, the **transformer**, revolutionized the field of NLP.
- Transformers are based on the **attention** mechanism, which allows the model to better associate words with one another across their positions in the sequence; it comes in primarily two forms:
    - self-attention
    - multi-head attention
- Transformer = **encoder** (input) + **decoder** (Output)
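
To make self-attention concrete, here is a minimal pure-Python sketch of scaled dot-product attention, softmax(QKᵀ / √d_k)·V. It is illustrative only: a real transformer first derives Q, K and V from the input through learned linear projections, whereas this toy example (an assumption made for brevity) uses the input vectors directly for all three.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors.

    Returns (outputs, weights); weights[i][j] is how much position i
    attends to position j.
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors (hypothetical values)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = self_attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])
```

Multi-head attention runs several such attention computations in parallel, each on its own learned projection of the input, and concatenates the results.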

In [3]:
transformerarch
# For those keen to look at the two components; note that GPT-3 is decoder-only

### Architecture of LLM - Transformer Layer Types
1. Self-Attention Layer: Scaled Dot-Product Attention
    - Allows the model to focus on different parts of the input sequence when producing an output.
        - e.g. in "The cat, which already ate...., was full." the model can associate "The cat" with "was full" ahead of "already ate";
2. Position-wise Feed-Forward Layer
    - Fully-connected feed-forward network applied to each position separately and identically; it consists of two linear transformations with a ReLU activation in between;
    - Processes the output of the self-attention layer in turn, applying the same learnt weights at every position;
3. Output layer
    - Linear layer followed by a softmax function; transforms the final hidden states into a probability, for each word in the vocabulary, of being the next word in the sequence.
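
The feed-forward and output layers described above can be sketched in a few lines of pure Python. All weights, dimensions and the 4-word vocabulary size below are hypothetical values chosen for illustration; real models learn these weights and use far larger dimensions.

```python
import math

def linear(x, W, b):
    """y = xW + b, where W is a list of rows with shape (in_dim, out_dim)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def relu(xs):
    return [max(0.0, x) for x in xs]

def feed_forward(x, W1, b1, W2, b2):
    """Two linear transformations with a ReLU activation in between."""
    return linear(relu(linear(x, W1, b1)), W2, b2)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weights: model dim 2, hidden dim 3, vocabulary of 4 words
W1, b1 = [[1.0, -1.0, 0.5], [0.5, 1.0, -1.0]], [0.0, 0.0, 0.0]
W2, b2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0]
W_out, b_out = [[1.0, 0.0, -1.0, 0.5], [0.0, 1.0, 1.0, -0.5]], [0.0] * 4

# Stand-in for the self-attention output: one vector per position
hidden_states = [[1.0, 0.0], [0.0, 1.0]]

# The same weights are applied independently at every position
ffn_out = [feed_forward(h, W1, b1, W2, b2) for h in hidden_states]

# Output layer: final hidden state of the last position -> next-word probabilities
next_word_probs = softmax(linear(ffn_out[-1], W_out, b_out))
print(ffn_out, next_word_probs)
```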

### Architecture of LLM -  Data Handling by GPT Models

#### Tokenization
The process of converting a sequence of text into a sequence of tokens, where a token is a word or part of a word. GPT models use _Byte Pair Encoding_ (BPE).
Try it at platform.openai.com/tokenizer

#### BPE
- subword tokenization method that
    - starts by splitting text into individual characters and
    - iteratively merges the most frequent adjacent pair of symbols.
- helps handle out-of-vocabulary words and makes the model more robust to spelling errors and variations
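
The two BPE steps above (split into characters, then repeatedly merge the most frequent adjacent pair) can be sketched as follows. This is a toy, character-level version run on a made-up corpus; production tokenizers such as the one used by GPT models operate on bytes and add further details that are omitted here.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs; `words` maps a tuple of symbols to its frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules from a whitespace-separated corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # ties go to the pair seen first
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

merges, vocab = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(vocab)
```

After two merges the frequent stem "low" has become a single token, while the rarer suffixes in "lower" and "lowest" are still split into characters; this is how BPE can represent unseen words out of known pieces.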

#### Sequence length
- Context window size (2048 tokens for GPT-3; 8K or 32K for GPT-4)
- A limit on the number of tokens that can be passed to the model at once, due to memory constraints
- A decoder-only model takes the input sequence all at once and holds each token, and its relationships with every other token in the sequence, in memory
- Memory usage therefore grows quadratically as the sequence length grows
- The limit exists primarily to make the model feasible to run on available hardware
- **Longer (input + output) sequences take much longer to generate**
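
A rough calculation illustrates the quadratic growth. The sketch assumes a naive implementation that stores one full `seq_len × seq_len` attention score matrix per head per layer in 16-bit precision; optimized implementations can avoid materializing this matrix, so treat the numbers as an illustration of the scaling rather than a measurement.

```python
def attention_matrix_bytes(seq_len, bytes_per_value=2):
    """Memory for one seq_len x seq_len attention score matrix (one head, one layer)."""
    return seq_len * seq_len * bytes_per_value

for n in (2048, 8192, 32768):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"{n:>6} tokens -> {mib:>6.0f} MiB per head per layer")
```

Going from 2048 to 8192 tokens (4× longer) multiplies this matrix by 16, from 8 MiB to 128 MiB, and the total is multiplied again by the number of heads and layers.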

### Architecture of LLM -  Implementation and deployment of GPT

#### Implementation
- Unlikely to have a regular retraining schedule.
- May require initial fine-tuning or tweaking, OR
- can be used as-is (vanilla)
- GPT models in general are implemented with TensorFlow or PyTorch, which provide efficient, GPU-accelerated operations

#### Deployment Process
- Models need to reside on robust hardware or cloud-based solutions that can handle numerous simultaneous requests
- Typically involves API calls made to request services from models deployed on a network
- Requests (e.g. text generation, text completion) are handled and responses sent back

### Architecture of LLM -  Back to the SOTA models
GPT-3, GPT-J, etc. are **autoregressive, decoder-only transformer models** designed to solve natural language processing (NLP) tasks by **predicting** how a piece of text will continue.

This is different from encoder-based transformer models, such as the encoder-only BERT or the original encoder-decoder transformer, where the inputs are first encoded and presented to the model as a whole when making the prediction.

GPT-3-like decoder-only models put more emphasis on the more recent inputs, so the prediction is continuously updated to stay relevant to the most recent information.

- Advantage of decoder-only architecture:
    - Simplicity: easier to train and less computationally expensive
    - Good at generative tasks, producing contextually relevant text, since the model is built to generate output one token at a time;
- Advantage of encoder-decoder model:
    - Better at classification and at tasks where a specific input-to-output structure is needed, e.g. translation and summarization; particularly good when the input information comes all at once

Note that these are just high-level benefits; YMMV with different use cases.

A few notes:

1. _Autoregressive generation cannot be parallelized across output tokens during inference_, because each new token depends on the tokens generated before it
2. _Autoencoder_ models should be differentiated from encoder-decoder models:
    - the former also has an encoder and a decoder but primarily serves the purpose of dimension reduction and denoising, aka 'learning' the input, while
    - the latter is designed to work on sequence-to-sequence tasks.
    - i.e. an autoencoder's output is a representation (reconstruction) of its input
    - an encoder-decoder's output is not just a reconstruction of the input

Tying this back to GPT-3 and similar models: LLMs perform sequence-to-sequence tasks and put more weight on more recent context, yet they still consider all previous tokens (the history) when predicting the upcoming token.
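
The autoregressive loop can be sketched with a toy stand-in for the model. Everything below is hypothetical: `TOY_MODEL` is a hand-written lookup table keyed only by the last token, whereas a real LLM runs a full forward pass conditioned on all previous tokens. The point is the control flow: each step needs the token produced by the previous step, which is why generation cannot be parallelized across output tokens.

```python
# Hypothetical toy "model": next-token probabilities keyed by the last token only
TOY_MODEL = {
    "<s>": {"the": 0.9, "cat": 0.1},
    "the": {"cat": 0.8, "mat": 0.2},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "sat": {"</s>": 1.0},
    "mat": {"</s>": 1.0},
}

def next_token_probs(tokens):
    """Stand-in for an LLM forward pass; a real model conditions on all of `tokens`."""
    return TOY_MODEL[tokens[-1]]

def generate(prompt, max_new_tokens=10):
    """Greedy autoregressive decoding: append the most likely token, then repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        token = max(probs, key=probs.get)  # greedy choice
        if token == "</s>":  # end-of-sequence marker
            break
        tokens.append(token)  # the new token becomes part of the next input
    return tokens

print(generate(["<s>"]))  # ['<s>', 'the', 'cat', 'sat']
```

Greedy selection is only one decoding strategy; sampling-based strategies follow the same loop and differ only in how `token` is chosen from `probs`.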

### Architecture of LLM -  Generative Pre-trained Transformer aka OpenAI's proposal

1. GPT-1: Introduced in "Improving Language Understanding by Generative Pre-Training" in June 2018; 12 transformer layers and 117M parameters;
2. GPT-2: Released in 2019, has 1.5B parameters; its generated text can become indistinguishable from text written by humans;
3. GPT-3: Launched in 2020, with the largest (disclosed) variant at 175B parameters; generates coherent and contextually relevant passages of text; introduction of 'few-shot learning'. Example tasks:
    - Translation
    - Q&A
    - Writing creative content like poems and stories
4. GPT-4: Released in 2023, number of parameters undisclosed; at the time of writing, still regarded as the most powerful LLM that has been released;


GPT-3.5: sub-class of GPT-3 first released in Apr. 2022, leading to '**ChatGPT**' in Nov. 2022.