## Chapter 1: Understanding LLMs

**Stages of building and using LLMs**
- An LLM is pretrained on unlabeled text data
- The LLM has a few basic capabilities after pretraining
- A pretrained LLM can be further trained on a labeled dataset to obtain a fine-tuned LLM for specific tasks

**Pretraining**
- Pretraining creates an initial LLM, often called a base or foundation model. Ex: GPT-3 model
- During pretraining, the LLM is trained on large text datasets to predict the next word in the text. The resulting pretrained LLM can then be further trained on labeled data, a step called fine-tuning

**Fine-tuning**
- Instruction fine-tuning: Labeled dataset consists of instruction and answer pairs, such as a query to translate a text accompanied by the correctly translated text
- Classification fine-tuning: Labeled dataset consists of texts and associated class labels - for example, emails associated with "spam" and "not spam" labels
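
To make the two dataset shapes concrete, here is a minimal sketch of what the labeled records might look like. The field names (`instruction`, `input`, `output`, `text`, `label`), the prompt layout, and the helper functions are all hypothetical and chosen for illustration; they are not a standard schema.

```python
# Hypothetical toy records; field names are illustrative, not a standard schema.
instruction_data = [
    {
        "instruction": "Translate the following text into German.",
        "input": "Good morning",
        "output": "Guten Morgen",
    },
]

classification_data = [
    {"text": "You won a free prize! Click here now.", "label": "spam"},
    {"text": "Can we move our meeting to 3 pm?", "label": "not spam"},
]


def format_instruction_example(record):
    """Join instruction, input, and expected answer into one training string."""
    return (
        f"Instruction: {record['instruction']}\n"
        f"Input: {record['input']}\n"
        f"Response: {record['output']}"
    )


def encode_labels(records):
    """Map each distinct class label to an integer ID, in sorted order."""
    distinct = sorted({r["label"] for r in records})
    label_to_id = {label: i for i, label in enumerate(distinct)}
    return label_to_id, [label_to_id[r["label"]] for r in records]


label_to_id, label_ids = encode_labels(classification_data)
print(format_instruction_example(instruction_data[0]))
print(label_to_id, label_ids)
```

In both cases a human (or an existing labeled corpus) supplies the answer, which is what distinguishes fine-tuning data from the unlabeled text used in pretraining.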

**Introduction to Transformer Architecture**
- Most modern LLMs rely on the transformer architecture

![Transformer Architecture](pic1.png)

- The transformer architecture consists of two submodules: an encoder and a decoder

**Encoder:**
- Processes the input text and encodes it into a series of numerical representations (vectors) that capture the contextual information of the input

**Decoder:**
- Takes the encoded vectors and generates the output text

**Self-attention Mechanism:**
- Enables the model to capture long-range dependencies and contextual relationships within the input data, which enhances its ability to generate coherent and contextually relevant output
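
A rough sketch of the idea, under strong simplifying assumptions: every token vector is scored against every other token vector, so distant tokens can influence each other directly. Real transformers use learned query, key, and value projections and multiple attention heads, all of which are omitted here; the 2-dimensional embeddings are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Simplified self-attention: queries, keys, and values are the inputs themselves."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this token against every token in the sequence, near or far.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The new representation is a weighted average of all token vectors.
        context = [
            sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)
        ]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 2-dimensional embeddings for a three-token input.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contexts, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every token in the input, regardless of distance.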

![BERT VS GPT models](pic2.png)

**BERT:**
- Built upon the original transformer's encoder submodule; differs from GPT in its training approach
- Specializes in masked word prediction, where the model predicts masked or hidden words in a given sentence
- This training strategy equips BERT with strengths in text classification tasks, including sentiment analysis
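
A minimal sketch of how masked-word training examples could be built from plain text. The `[MASK]` placeholder and the 15% masking rate follow common BERT-style setups but are assumptions here, and `mask_words` is a hypothetical helper, not part of any library.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder token


def mask_words(words, mask_prob=0.15, seed=0):
    """Replace a random subset of words with MASK_TOKEN; return inputs and labels."""
    rng = random.Random(seed)
    masked, labels = [], []
    for word in words:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(word)  # the model must recover this word
        else:
            masked.append(word)
            labels.append(None)  # no prediction needed at this position
    return masked, labels


sentence = "this is an example of how an llm can fill in hidden words".split()
masked, labels = mask_words(sentence, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because the model sees the words on both sides of each masked position, this setup suits tasks that require understanding a whole sentence, such as classification.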

**GPT:**
- GPT focuses on the decoder portion of the original transformer architecture and is designed for tasks that require generating text
- This includes machine translation, text summarization, fiction writing, writing computer code, and more
- GPT models, although primarily designed and trained to perform text completion tasks, also show remarkable versatility in their capabilities

### Closer Look at the GPT architecture

- GPT-3 is a scaled-up version of the original GPT model that has more parameters and was trained on a larger dataset
- In addition, the original model offered in ChatGPT was created by fine-tuning GPT-3 on a large instruction dataset using a method from OpenAI's InstructGPT paper
- The next-word prediction task is a form of self-supervised learning, which is a form of self-labeling
- This means that we don't need to collect labels for the training data explicitly but can use the structure of the data itself: we can use the next word in a sentence or document as the label that the model is supposed to predict.
- Since this next-word prediction task allows us to create labels "on the fly," it is possible to use massive unlabeled text datasets to train LLMs
- The general GPT architecture is relatively simple
- Essentially, it is just the decoder part without the encoder. Since decoder-style models like GPT generate text by predicting text one word at a time, they are considered a type of **autoregressive** model
- **Autoregressive Model:** Autoregressive models incorporate their previous outputs as inputs for future predictions. Consequently, in GPT, each new word is chosen based on the sequence that precedes it, which improves the coherence of the resulting text.
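
The two ideas above, labels created "on the fly" and autoregressive generation, can be sketched with a toy example. The bigram counter below is only a stand-in for an LLM: it looks at just the last word, whereas GPT conditions on the whole preceding sequence. All function names and the `context_size` parameter are hypothetical.

```python
from collections import Counter, defaultdict


def make_training_pairs(words, context_size=3):
    """Slide over the text: each context's label is simply the word that follows it."""
    pairs = []
    for i in range(1, len(words)):
        context = words[max(0, i - context_size):i]
        pairs.append((context, words[i]))
    return pairs


def train_bigram_counts(words):
    """Toy stand-in for an LLM: count which word follows each word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, prompt, max_new_words=5):
    """Autoregressive loop: each predicted word is appended and fed back as input."""
    output = list(prompt)
    for _ in range(max_new_words):
        followers = counts.get(output[-1])
        if not followers:
            break  # no known continuation for the last word
        next_word = followers.most_common(1)[0][0]  # greedy choice
        output.append(next_word)
    return output


text = "llms learn to predict the next word in a text".split()
pairs = make_training_pairs(text)
print(pairs[:3])

counts = train_bigram_counts(text)
print(" ".join(generate(counts, ["llms"], max_new_words=4)))
```

No human labeling is involved: `make_training_pairs` derives every target from the text itself, and `generate` extends its own output one word at a time.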

### Building a Large Language Model

![Building a LLM](pic3.png)