## Understanding Large Language Models

This lab covers:
- High-level explanations of the fundamental concepts behind large
language models (LLMs)
- Insights into the transformer architecture from which LLMs are derived
- A plan for building an LLM

**Before the advent of LLMs, traditional methods excelled at
categorization tasks** such as email spam classification and straightforward
pattern recognition that could be captured with handcrafted rules or simpler
models. 

However, they typically **underperformed in language tasks** that
demanded complex understanding and generation abilities, such as parsing
detailed instructions, conducting contextual analysis, or creating coherent and
contextually appropriate original text. 

For example, previous generations of language models could not write an email
from a list of keywords, a task that is trivial for contemporary LLMs.

**While pre-LLM NLP models were typically designed for specific tasks, LLMs are designed to be general-purpose models.**

#### What is an LLM?

**An LLM is a neural network designed to process, generate, and respond to human-like text.**

The **"large"** in large language model refers to both the model's size in terms
of parameters and the immense dataset on which it's trained.

LLMs utilize the **transformer architecture**, which allows them to pay selective attention to different parts of
the input when making predictions, making them especially adept at handling
the nuances and complexities of human language.

#### Stages of building LLMs

<img src="https://i.imgur.com/w6q2vA2.png" width="700px">

1. Pretraining

The "pre" in "pretraining" refers to the initial phase, in which an LLM is
trained on a large, diverse dataset to develop a broad
understanding of language. The pretrained model is often called a **base model** or **foundation model**.

A typical example of such a model is GPT-3. This model is capable of text completion. It also has limited few-shot capabilities, which means it can learn to perform new tasks from
only a few examples instead of needing extensive training data.
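To make the idea of few-shot use concrete, here is a minimal sketch of how a few demonstrations can be packed into a single prompt for a text-completion model. The `Input:`/`Output:` layout and the helper name `build_few_shot_prompt` are assumptions chosen for illustration, not a format required by any particular model.

```python
def build_few_shot_prompt(examples, query):
    """Format (input, output) demonstration pairs followed by an unanswered query."""
    lines = []
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
    # Leave the final output empty so a text-completion model can fill it in.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Illustrative English-to-French demonstrations (hypothetical data).
examples = [("cheese", "fromage"), ("house", "maison")]
prompt = build_few_shot_prompt(examples, "cat")
print(prompt)
```

The model never sees a weight update here; the examples only appear in the input text, and the base model is expected to continue the pattern.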


2. Finetuning

The pretrained model can be further refined through finetuning, a
process where the model is trained further on a narrower, labeled dataset
targeted at particular tasks or domains.

- Instruction finetuning - the labeled dataset consists of instruction and answer pairs
- Finetuning for classification - the labeled dataset consists of text and corresponding labels
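The two dataset shapes listed above might look roughly like this in Python. The field names (`instruction`, `response`, `text`, `label`), the sample records, and the prompt template are all assumptions made for illustration; real finetuning datasets use a variety of schemas.

```python
# Instruction finetuning: each record pairs an instruction with the desired answer.
instruction_data = [
    {"instruction": "Translate to German: Good morning", "response": "Guten Morgen"},
    {"instruction": "Summarize: The cat sat on the mat all day.", "response": "A cat rested on a mat."},
]

# Classification finetuning: each record pairs a text with a class label.
classification_data = [
    {"text": "You won a free prize, click now!", "label": "spam"},
    {"text": "Meeting moved to 3 pm tomorrow.", "label": "not spam"},
]


def format_instruction_example(entry):
    """Render one instruction/answer pair as a single training string (hypothetical template)."""
    return f"### Instruction:\n{entry['instruction']}\n\n### Response:\n{entry['response']}"


# Classification heads predict integer ids, so map each distinct label to an index.
label_to_id = {
    label: idx
    for idx, label in enumerate(sorted({e["label"] for e in classification_data}))
}

print(format_instruction_example(instruction_data[0]))
print(label_to_id)
```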



#### Transformer Architecture

<img src="https://i.imgur.com/EXaitPE.png" width="700px">

The original transformer architecture consists of two submodules, an **encoder** and a **decoder**.

The **encoder** module processes the
input text and encodes it into a series of numerical representations or vectors
that capture the contextual information of the input.

The **decoder**
module takes these encoded vectors and generates the output text from them.

A key component of transformers and LLMs is the **self-attention mechanism**
(not shown), which allows the model to weigh the importance of different
words or tokens in a sequence relative to each other. It enables
the model to capture long-range dependencies and contextual relationships
within the input data.
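The weighting described above can be sketched in a few lines of plain Python. This is a deliberately simplified variant: it uses the token vectors directly as queries, keys, and values, whereas real transformers apply learned projection matrices and multiple attention heads. The three-dimensional embeddings are made-up numbers.

```python
import math


def softmax(values):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def simple_self_attention(embeddings):
    """Return (attention_weights, context_vectors) for a list of token vectors."""
    dim = len(embeddings[0])
    all_weights, contexts = [], []
    for query in embeddings:
        # Score every token against the current one (scaled dot product).
        scores = [dot(query, key) / math.sqrt(dim) for key in embeddings]
        weights = softmax(scores)
        # The context vector is the weighted sum of all token vectors.
        context = [
            sum(w * vec[i] for w, vec in zip(weights, embeddings))
            for i in range(dim)
        ]
        all_weights.append(weights)
        contexts.append(context)
    return all_weights, contexts


# Hypothetical embeddings: tokens 0 and 2 point in similar directions.
embeddings = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.9]]
weights, contexts = simple_self_attention(embeddings)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` shows how strongly one token attends to every token in the sequence, regardless of how far apart they are, which is what lets the mechanism capture long-range dependencies.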

#### BERT vs. GPT

BERT-like LLMs build on the encoder module. They are trained with masked word prediction and are primarily used for tasks like text classification.

GPT-like LLMs build on the decoder module. They are designed for generative tasks and produce coherent text sequences.

<img src="https://i.imgur.com/8kP92gc.png" width="700px">
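As a rough illustration of masked word prediction, the sketch below hides a random subset of words and records which originals the model would have to recover. It is a simplification: the function name `mask_tokens`, the whitespace tokenization, and the 50% masking rate in the demo are assumptions, and actual BERT pretraining uses sub-word tokens and a more elaborate masking scheme.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with a mask and keep the originals as labels."""
    rng = random.Random(seed)  # seeded so the demo is reproducible
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)   # the model must predict this hidden word
        else:
            masked.append(token)
            labels.append(None)    # no prediction needed at this position
    return masked, labels


tokens = "the model fills in words that were hidden".split()
masked, labels = mask_tokens(tokens, mask_prob=0.5)
print(masked)
print(labels)
```

Because the hidden words can sit anywhere in the sentence, the model sees context on both sides of each gap, in contrast to the left-to-right prediction used by GPT-like models.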

#### Generative Pretrained Transformer (GPT) architecture

GPT models are pretrained on a relatively simple next-word prediction task.

<img src="https://i.imgur.com/ta0NRXz.png" width="400px">
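The next-word prediction task can be sketched by turning one sentence into (context, target) training pairs. Splitting on whitespace is an assumption made to keep the example short; real GPT models operate on tokens produced by a tokenizer rather than on whole words.

```python
def next_word_pairs(text):
    """Build (context, target) pairs: every prefix predicts the word that follows it."""
    words = text.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]


pairs = next_word_pairs("LLMs learn to predict one word at a time")
for context, target in pairs[:3]:
    print(" ".join(context), "->", target)
```

No manual labeling is needed here: the target for each example is simply the next word in the text itself.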

The GPT architecture is based on the decoder module of the transformer model.

Since decoder-style models like GPT generate text by predicting one word at a time, they are considered a type of **autoregressive model**.

Autoregressive models incorporate their previous outputs as inputs for future predictions. Consequently, in GPT, each new word is chosen based on the sequence that precedes it, which improves the coherence of the resulting text.

<img src="https://i.imgur.com/hHhATvY.png" width="700px">
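The autoregressive loop itself is short enough to sketch. Here `toy_next_word` is a hypothetical stand-in for a trained model: it is a hard-coded lookup on the last word only, whereas a real GPT model conditions on the entire preceding sequence. The `<end>` marker is likewise an assumption for this example.

```python
def toy_next_word(context):
    """Hypothetical 'model': a fixed lookup on the most recent word."""
    table = {"This": "is", "is": "an", "an": "example"}
    return table.get(context[-1], "<end>")


def generate(prompt_words, next_word_fn, max_new_words=10, end_token="<end>"):
    """Generate text one word at a time, feeding each output back in as input."""
    words = list(prompt_words)
    for _ in range(max_new_words):
        next_word = next_word_fn(words)  # prediction sees everything generated so far
        if next_word == end_token:
            break
        words.append(next_word)          # the output becomes part of the next input
    return words


result = generate(["This"], toy_next_word)
print(" ".join(result))
```

Swapping `toy_next_word` for a function that queries a trained network leaves the loop unchanged, which is the essence of autoregressive generation.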

#### Building a large language model

<img src="https://i.imgur.com/SfY0VaV.png" width="800px">

- **Stage 1** - data processing, the attention mechanism, building the model
- **Stage 2** - pretraining and evaluation
- **Stage 3** - finetuning