# **Introduction to Transformers, Large Language Models and GenAI**

## **What's Covered**
1. What is Language Modeling?
2. Auto-Encoding vs Auto-Regression
3. What are LLMs?
4. Pre-Training, Transfer Learning and Fine-Tuning
5. Why Transformers?
6. An Era Before Transformers
7. Attention is all you need
8. A little bit about Transformers
9. Advantages of Transformers
10. Disadvantages of Transformers
11. Popular Modern LLMs
    - BERT
    - GPT
    - T5
    - Domain Specific LLMs
12. Aligning LLMs With Instructional Prompts
13. Prompt Engineering
14. Quick Summary about Transformers, LLMs and Prompt Engineering
15. What is GenAI?
16. Applications of GenAI
17. Popular GenAI Approaches
18. What Next? How to use LLMs?

## **What is Language Modeling?**
1. If we can model the language, we can solve problems like Machine Translation, Question-Answering, Sentiment Analysis, Conversational Agents much better.
2. Language modeling involves building statistical or deep learning models that predict the likelihood of a sequence of tokens drawn from a specified vocabulary.
3. Two types of Language Modeling Tasks are:  
    a. Autoencoding Task  
    b. Autoregressive Task  
4. **Autoregressive Language Models** are trained to predict the next token in a sentence, based on the previous tokens in the sentence. These models correspond to the **decoder** part of the transformer model. A mask is applied to the full sentence so that the attention heads can only see the tokens that came before. These models are ideal for text generation. For example: **GPT**
5. **Autoencoding Language Models** are trained to reconstruct the original sentence from a corrupted version of the input. These models correspond to the **encoder** part of the transformer model. The full input is passed and no causal attention mask is applied, so autoencoding models create a bidirectional representation of the whole sentence. They can be fine-tuned for a variety of tasks, but their main application is sentence classification or token classification. For example: **BERT**
6. **Combinations of autoregressive and autoencoding language models** are more versatile and flexible in generating text. Such combination models are reported to generate more diverse and creative text in different contexts than pure decoder-based autoregressive models, owing to their ability to capture additional context using the encoder. For example: **T5**
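
To make "predicting the likelihood of a sequence of tokens" concrete, here is a minimal sketch of a count-based bigram language model. The toy corpus, the `<s>` / `</s>` boundary tokens and the function names are assumptions made for illustration; real LLMs replace the count table with a neural network, but the chain-rule idea is the same.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical); each sentence is already split into tokens.
corpus = [
    ["<s>", "the", "cat", "sat", "</s>"],
    ["<s>", "the", "dog", "sat", "</s>"],
    ["<s>", "the", "cat", "ran", "</s>"],
]

# Count how often each token follows each previous token.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    for prev, nxt in zip(sentence, sentence[1:]):
        bigram_counts[prev][nxt] += 1


def next_token_prob(prev, nxt):
    """Estimate P(nxt | prev) from raw counts (no smoothing)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0


def sequence_prob(tokens):
    """Chain rule under a bigram assumption: multiply P(token | previous token)."""
    prob = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        prob *= next_token_prob(prev, nxt)
    return prob


seen = sequence_prob(["<s>", "the", "cat", "sat", "</s>"])    # 1 * 2/3 * 1/2 * 1
unseen = sequence_prob(["<s>", "the", "dog", "ran", "</s>"])  # "dog ran" never occurs
print(seen, unseen)
```

A sequence containing a word pair the model has never seen gets probability zero here, which is one reason practical models use smoothing or learned representations instead of raw counts.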

## **AutoEncoding Vs AutoRegressive Task**
Autoencoding and autoregressive tasks are both types of sequence generation tasks in the field of machine learning, but they have distinct differences in their objectives and approaches:

**Autoencoding Task**:
1. **Objective**:
   - In an autoencoding task, the model is trained to reconstruct the input sequence from a corrupted or noisy version of itself.
   - The objective is to learn a representation of the input data that captures its essential features while filtering out noise or irrelevant information.
2. **Bidirectional Learning**:
   - Autoencoding models are bidirectional, meaning they learn to generate sequences by considering both past and future context simultaneously.
   - The model encodes the entire input sequence into a fixed-size representation (encoding) and then decodes it back into the original sequence (decoding).
3. **Training Signal**:
   - The training signal for an autoencoding task comes from comparing the reconstructed output with the original input.
   - The model adjusts its parameters to minimize the discrepancy between the input and reconstructed output, typically using a reconstruction loss such as mean squared error (MSE) or binary cross-entropy.

**Autoregressive Task**:
1. **Objective**:
   - In an autoregressive task, the model is trained to generate the next token in a sequence given the preceding tokens.
   - The objective is to model the conditional probability distribution of each token in the sequence given its predecessors.
2. **Unidirectional Learning**:
   - Autoregressive models are unidirectional, meaning they generate sequences one token at a time in a left-to-right fashion.
   - At each time step, the model predicts the next token based only on the tokens generated so far, without considering future context.
3. **Training Signal**:
   - The training signal for an autoregressive task comes from comparing the model's predictions for each token with the ground truth.
   - The model adjusts its parameters to maximize the likelihood of generating the correct tokens in the sequence, typically using a cross-entropy loss.

**Key Differences**:
1. **Learning Approach**:
   - Autoencoding models learn to reconstruct the input sequence, while autoregressive models learn to generate new sequences one token at a time.
2. **Directionality**:
   - Autoencoding models are bidirectional, considering both past and future context, while autoregressive models are unidirectional, considering only past context.
3. **Training Signal**:
   - Autoencoding models minimize reconstruction error between input and output sequences, while autoregressive models maximize the likelihood of generating the correct tokens in the sequence.

Overall, autoencoding and autoregressive tasks have different objectives and learning approaches, but both are used for sequence generation tasks in machine learning.
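
The difference between the two tasks is easiest to see in how training examples are built from the same sentence. The sketch below is illustrative only: the `<corrupt>` placeholder, the corruption probability and the function names are assumptions, not part of any specific model.

```python
import random

tokens = ["the", "cat", "sat", "on", "the", "mat"]


def autoregressive_examples(tokens):
    """Each example pairs a left-to-right prefix with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def autoencoding_example(tokens, corrupt_prob=0.3, seed=0):
    """Corrupt some positions; the target is the full original sequence."""
    rng = random.Random(seed)
    corrupted = [
        "<corrupt>" if rng.random() < corrupt_prob else tok for tok in tokens
    ]
    return corrupted, list(tokens)


for prefix, target in autoregressive_examples(tokens):
    print(prefix, "->", target)

corrupted, original = autoencoding_example(tokens)
print(corrupted, "->", original)
```

The autoregressive examples only ever expose the past, while the autoencoding example exposes the whole (corrupted) sentence at once, which is what makes bidirectional context possible.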

## **What are LLMs?**
1. Usually derived from the Transformer architecture (but not necessarily) by training on a large amount of text data.
2. Designed to understand and generate human language, code, and much more.
3. Highly parallelized and scalable.
4. Example: BERT, GPT and T5
5. Techniques like stop word removal, stemming, and truncation are neither used nor necessary for LLMs. LLMs are designed to handle the inherent complexity and variability of human language, including stop words and variations in word forms such as tenses and misspellings.
6. Every LLM on the market has been **pre-trained** on a large corpus of text data and on specific language modeling tasks.
7. **Remember:** How an LLM is **pre-trained** and **fine-tuned** makes all the difference.
8. **How to decide whether to train our own embeddings or use pre-trained embeddings?** - A good rule of thumb is to compute the vocabulary overlap. If the overlap between the vocabulary of our custom domain and that of the pre-trained word embeddings is significant, pre-trained word embeddings tend to give good results.
9. **One more important factor to consider while deploying models with embeddings-based feature extraction approach:** - Remember that learned or pre-trained embedding models have to be stored and loaded into memory while using these approaches. If the model itself is bulky, we need to factor this into our deployment needs.
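
The vocabulary-overlap rule of thumb from point 8 can be sketched in a few lines. The two vocabularies below are made-up examples, and what counts as a "significant" overlap is a judgment call rather than a fixed threshold.

```python
def vocabulary_overlap(domain_vocab, pretrained_vocab):
    """Fraction of the domain vocabulary that the pre-trained embeddings cover."""
    domain_vocab = set(domain_vocab)
    if not domain_vocab:
        return 0.0
    return len(domain_vocab & set(pretrained_vocab)) / len(domain_vocab)


# Hypothetical vocabularies, for illustration only.
domain = {"patient", "dose", "tachycardia", "mg", "daily"}
pretrained = {"patient", "dose", "daily", "the", "of", "mg"}

overlap = vocabulary_overlap(domain, pretrained)
print(f"{overlap:.0%} of the domain vocabulary is covered")
```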

## **Pre-Training, Transfer Learning and Fine-Tuning**
<img style="float: right;" width="400" height="400" src="data/images/transfer_learning.jpeg">

1. **Pre-training** of an LLM happens on a large corpus of text data and on a specific language modeling task. During this phase, the LLM learns general language and the relationships between words.
2. **Transfer Learning** is a technique used in machine learning to leverage the knowledge gained from one task to improve performance on another related task. A pre-trained model has already learned a lot about the language and the relationships between words, and this information can be used as a starting point to improve performance on a new task.  
    **a.** Transfer Learning for LLMs involves taking an LLM that has been pre-trained on one corpus of text data and then fine-tuning it for a specific downstream task, such as text classification or text generation, by updating the model's parameters with task-specific data.  
    **b.** Transfer Learning allows LLMs to be **fine-tuned** for specific tasks with much smaller amounts of task-specific data than it would require if the model were trained from scratch. This greatly reduces the amount of time and resources required to train LLMs.  
<img style="float: right;" width="400" height="400" src="data/images/fine_tuning_loop.jpeg">
3. **Fine-tuning** involves training the LLM on a smaller, task-specific dataset to adjust its parameters for the specific task at hand. The basic fine-tuning loop is more or less the same across tasks:  
    **a.** Define the model you want to fine-tune as well as the fine-tuning hyperparameters (e.g. learning rate).  
    **b.** Aggregate some training data.  
    **c.** Compute the loss and its gradients (via backpropagation).  
    **d.** Update the model parameters using those gradients.  
4. The Transformers package from Hugging Face provides a neat and clean interface for training and fine-tuning LLMs.
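
The four steps of the fine-tuning loop can be illustrated without any deep learning library. The sketch below is a stand-in, not an LLM: it assumes a tiny logistic-regression "model" whose starting weights play the role of pre-trained parameters, and the dataset, learning rate and function names are all hypothetical.

```python
import math


def predict(weights, bias, x):
    """Sigmoid of a linear score: the probability of the positive class."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(weights, bias, data):
    """Average binary cross-entropy over the dataset."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = predict(weights, bias, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def fine_tune(weights, bias, data, learning_rate=0.5, epochs=200):
    weights = list(weights)
    for _ in range(epochs):
        # c. Compute gradients of the loss with respect to each parameter.
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in data:
            error = predict(weights, bias, x) - y  # gradient w.r.t. the linear score
            for i, xi in enumerate(x):
                grad_w[i] += error * xi / len(data)
            grad_b += error / len(data)
        # d. Update the parameters with a gradient descent step.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


# a. "Pre-trained" starting parameters (made up for this sketch).
start_weights, start_bias = [0.1, -0.1], 0.0
# b. A small task-specific dataset: the label equals the first feature.
data = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]

tuned_weights, tuned_bias = fine_tune(start_weights, start_bias, data)
print(mean_loss(start_weights, start_bias, data), "->",
      mean_loss(tuned_weights, tuned_bias, data))
```

Libraries such as Hugging Face Transformers wrap this same loop (model, data, loss and gradients, update) behind a higher-level interface.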

## **Why Transformers?**

1. Scalable and Parallel Compute
2. Revolutionized NLP with LLMs
3. Unification of DL Approaches
4. Multi-Modal Capability
5. Accelerated GenAI

## **An Era Before Transformers**

1. **2013 and before:** Various Neural Network Architectures like ANN, CNN and RNN became very popular. They worked well for tabular data, image data and sequential data like text, respectively.
2. **[(2014) Sequence to Sequence Learning with Neural Networks](https://arxiv.org/pdf/1409.3215.pdf)** paper introduced the concept of **Encoder-Decoder Architecture** to solve a seq2seq task, like machine translation.
    - The paper introduces Seq2Seq models, which are neural network architectures designed for mapping input sequences to output sequences. Unlike traditional models that rely on fixed-length input-output mappings, Seq2Seq models can handle variable-length sequences, making them suitable for tasks such as machine translation, summarization, and question answering.
    - The core of the Seq2Seq model is the encoder-decoder architecture. The encoder processes the input sequence while maintaining the hidden state and generates a fixed-length representation, often referred to as a context vector. This context vector encapsulates the representation of the whole sentence.
    - The decoder then uses this representation to generate the output sequence one token at a time.
    - Both encoder and decoder used RNN/LSTM cells due to their ability to capture sequential dependencies.
    - This architecture worked well with shorter sentences.
    - **The Problem:** While it could handle variable-length input and output sequences, it relied on a single fixed-length context vector for the entire input sequence, which can lead to information loss, especially for longer sequences.
3. **[(2015) Neural Machine Translation by Joint Learning to Align and Translate](https://arxiv.org/pdf/1409.0473.pdf)** paper introduced the concept of **Attention Mechanism** to solve the above problem.
    - Unlike traditional NMT models that encode the entire source sentence into a fixed-length context vector, the **attention mechanism allows the model to focus on different parts of the source sentence dynamically** while generating the translation.
    - The Attention Mechanism also **addressed the problem of learning alignment between input and output sequences**. It enables the model to weigh the importance of each word in the source sentence differently during translation. By dynamically adjusting the attention weights, the model can focus more on relevant words and ignore irrelevant ones, leading to more accurate translations. For example: think about the English to Hindi translation of "I work at Apple Inc" vs "I work at Apple Farm". Which one should use सेब and which one एप्पल इंक?
    - At each time step of the decoder, the dynamically calculated context vector indicates which time steps of the encoder sequence are expected to have the most influence on the current decoding step.
    - In simple terms, the context vector is the weighted sum of the encoder's hidden states, and these weights are called **attention weights**.
    - The attention mechanism improved the quality of translation on long input sentences, but it could not solve a fundamental flaw: sequential training.
    - **The Problem:** Since the architecture relies on LSTM units, a notable challenge arises due to the sequential nature of training. Specifically, only one token can be processed at a time as input to the encoder, leading to slow training times. Consequently, it becomes impractical to train the model efficiently with large datasets. This limitation inhibits the application of techniques like transfer learning, which typically involve leveraging pretrained models on large datasets to improve performance on new tasks. Additionally, fine-tuning, which involves further training pretrained models on task-specific data, is also hindered by the slow training process in this architecture.
    - Because of the above problem, for any task we are supposed to solve, we have to train the model from scratch, which takes a huge amount of time, effort and data.
    - **Transfer Learning:** Transfer learning involves leveraging knowledge gained from solving one problem and applying it to a different, but related, problem.
    - **Fine-Tuning:** Fine-tuning, on the other hand, refers to the process of taking a pretrained model and further training it on task-specific data to adapt it to a particular problem or domain. This typically involves adjusting the parameters of the pretrained model to better suit the new task while retaining the knowledge learned from the original training.
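
The idea that the context vector is a weighted sum of the encoder's hidden states can be sketched as follows. This is a simplification under stated assumptions: it scores each encoder state with a plain dot product, whereas the 2015 paper uses a small learned (additive) scoring network, and the vectors are made-up numbers.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(decoder_state, encoder_states):
    """Return (attention weights, context vector) for one decoding step."""
    # Score each encoder hidden state against the current decoder state.
    # Assumption: dot-product scoring, swapped in for the paper's additive scoring.
    scores = [
        sum(d * h for d, h in zip(decoder_state, hidden))
        for hidden in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector = weighted sum of the encoder hidden states.
    context = [
        sum(w * hidden[j] for w, hidden in zip(weights, encoder_states))
        for j in range(dim)
    ]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per source token
decoder_state = [2.0, 0.0]

weights, context = attention_context(decoder_state, encoder_states)
print(weights, context)
```

Because the weights are recomputed at every decoding step, the decoder is no longer limited to a single fixed-length summary of the source sentence.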

## **Attention is all you need: Introducing Transformer Architecture**

<img style="float: right;" width="400" height="600" src="data/images/transformer.JPG">

The **[(2017) Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf)** paper from Google solves the sequential training problem of earlier architectures by removing the need for RNN cells completely.
1. Transformer has the encoder-decoder architecture.
2. **Encoder** is great at understanding text.
3. **Decoder** is great at generating text.
4. Transformer relies solely on self-attention mechanisms and feed-forward neural networks.
5. Understand that **Attention** is a mechanism that assigns different weights to different parts of the input allowing the model to prioritize and emphasize the most important information while performing tasks like translation or summarization. Attention allows a model to focus on different parts of the input dynamically, leading to improved performance.
6. **Positional Encoding:** To retain positional information of words in the input sequence without using recurrence, the model introduces positional encodings. These encodings are added to the input embeddings to provide information about the position of each word in the sequence.
7. **Self-Attention Mechanism:** The key innovation of the Transformer is the self-attention mechanism, which allows each word in the input sequence to attend to all other words in the sequence. This enables capturing global dependencies and alleviates the need for recurrent connections.
8. **Multi-Head Attention:** The Transformer employs multi-head attention mechanisms, where attention is computed multiple times in parallel with different learned linear projections. This allows the model to focus on different parts of the input sequence simultaneously, enhancing its ability to capture diverse patterns.
9. **Feed-Forward Network:** The main goal of the feed-forward network is to apply non-linear transformations to the input representations, helping the model capture complex patterns and relationships in the input sequences. This helps enrich the representations of words or tokens in the input sequence.
10. **Skip/Residual Connections:** The main goal of skip connections is to enable the network to retain important information from previous layers and make it easier for the model to learn and optimize complex patterns in the data. Think of skip connections as shortcuts that allow information to bypass certain layers in the network. These shortcuts ensure that important information from earlier layers is preserved and remains accessible to later layers.
11. **Parallelization and Scalability:** By relying on self-attention mechanisms and feed-forward layers, the Transformer architecture facilitates parallelization of computation across different parts of the input sequence. This results in faster training times and better scalability compared to traditional recurrent models.
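
Positional encoding and self-attention can be sketched in plain Python. The positional encoding follows the sinusoidal scheme from the original paper. The attention function is deliberately simplified: it assumes queries, keys and values are the inputs themselves (no learned projections and a single head), and the toy embeddings are made up.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd dimensions."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, causal=False):
    """Scaled dot-product self-attention with Q = K = V = x (a simplification)."""
    d = len(x[0])
    output = []
    for i, query in enumerate(x):
        scores = []
        for j, key in enumerate(x):
            if causal and j > i:
                scores.append(float("-inf"))  # hide future positions
            else:
                scores.append(sum(q * k for q, k in zip(query, key)) / math.sqrt(d))
        weights = softmax(scores)
        output.append([sum(w * v[c] for w, v in zip(weights, x)) for c in range(d)])
    return output


# Hypothetical 4-dimensional embeddings for a 3-token sequence.
embeddings = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2], [0.3, 0.3, 0.1, 0.0]]
pe = positional_encoding(len(embeddings), 4)

# Positional encodings are added to the input embeddings.
x = [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]

bidirectional = self_attention(x)            # every token sees every token
causal = self_attention(x, causal=True)      # each token sees only itself and the past
print(bidirectional[0], causal[0])
```

Every output row is computed independently of the others, which is why the computation parallelizes across positions, unlike a recurrent network.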

**The image below shows the full attention pattern for head 5 (from the original Transformer paper). It illustrates the relationships learned between words with the help of the self-attention mechanism.**
<img width="800" height="300" src="data/images/attention_mechanism_full.JPG">

## **A little bit about Transformers**

<img style="float: right;" width="300" height="500" src="data/images/attention_mechanism_isolated_for_word_its.JPG">

1. Introduced by Google in the year 2017
2. Transformer is a Sequence to Sequence Model which was proposed initially to solve the task of Machine Translation
3. Has two main components: an Encoder-Decoder structure and the Attention Mechanism
4. An **encoder** is tasked with taking in raw text, splitting it into its core components (tokens), converting them into vectors and using **self-attention** to understand the context of the text.
5. A **decoder** excels at generating text by using modified types of attention (masked self-attention over the tokens generated so far and **cross attention** over the encoder output) to predict the next best token.
6. Transformers revolutionized NLP by enabling highly scalable training. By leveraging parallel computation and efficient self-attention mechanisms, the Transformer architecture allows for training on massive datasets with unprecedented efficiency. This scalability laid the foundation for the concept of **Transfer Learning** in NLP. Subsequent models such as BERT, GPT, and T5 were developed, leveraging pre-trained Transformer-based architectures that could be easily **fine-tuned** for a wide range of NLP tasks, further advancing the field of natural language processing.
7. Transformers are **trained** to solve a specific NLP task called **Language Modeling**.
8. **Why not RNNs? -** RNN units can become a bottleneck due to sequential training. The Transformer trains in parallel, and its self-attention mechanism allows each word to "attend to" all the other words in the sequence, which enables it to capture long-term dependencies and contextual relationships between words at scale. The goal is to understand each word as it relates to the other tokens in the input text.
9. **Limitations of Transformers:** Transformers are still limited to an input context window (i.e. maximum length of text it can process at any given moment)
10. Timeline
    - Till 2013 - RNN/LSTMs/GRU
    - 2014 - Seq2seq tasks using Encoder-Decoder architecture
    - 2015 - Attention Mechanism
    - 2017 - Transformers
    - 2018 - BERT by Google / GPT by OpenAI
    - 2019 - T5 by Google
    - 2020 - GPT-3
    - 2021 - DALL-E / GitHub Copilot
    - 2022 - Stable Diffusion / ChatGPT

## **Advantages of Transformers**
1. Parallel Training and Scalable
2. Transfer Learning
3. Multimodal Input and Output
4. Flexible Architecture: Encoder-only transformer models like BERT, decoder-only transformer models like GPT and encoder-decoder models like T5.
5. Ecosystem: HuggingFace, OpenAI, Cohere, etc...

## **Disadvantages of Transformers**
1. Needs high computational resources such as storage and GPUs
2. A huge amount of data is required to train a model using transformers
3. Overfitting
4. Energy/Electricity Consumption
5. Interpretability
6. Bias due to data and Ethical Concerns

## **Popular Modern LLMs**

### **1. BERT (Bidirectional Encoder Representations from Transformers)**
<img style="float: right;" width="300" height="300" src="data/images/bert_oov.jpeg">

1. By Google - Autoencoding Language Model
2. **[Click Here](https://arxiv.org/pdf/1810.04805.pdf)** to read the original paper from Google - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
3. Individual NLP tasks have traditionally been solved by individual models created for each specific task. That is, until BERT!
4. Tasks - BERT can solve 11+ NLP tasks such as sentiment analysis, named entity recognition, question answering, etc...
5. **Data:** Pretrained on:  
    **a.** English Wikipedia - At the time 2.5 Billion words  
    **b.** Book Corpus - 800 Million words  
6. Training on a dataset this large takes a long time. BERT’s training was made possible thanks to the novel Transformer architecture and sped up by using TPUs (Tensor Processing Units - Google’s custom circuit built specifically for large ML models). ~64 TPUs trained BERT over the course of 4 days.
7. **Input to BERT:** BERT combines three embeddings for a given piece of text: Token Embedding, Segment Embedding (to distinguish between segment A and B) and Position Embedding.
    - BERT uses WordPiece Embeddings with a 30,000 token vocabulary. The first token of every sequence is always a special classification token i.e. `[CLS]`.
    - Sentence Pairs are packed together into a single sequence with a special separator i.e. `[SEP]`.
    - For a given token, its input representation is constructed by summing the corresponding token, segment and position embedding.
<img style="float: right;" width="300" height="300" src="data/images/bert_language_model_task.jpeg">
8. **Out-of-vocabulary words with BERT** BERT's tokenizer handles OOV tokens (out of vocabulary / previously unknown) by breaking them up into smaller chunks of known tokens.
9. **BERT Training:** Trained on two language modeling specific tasks:  
    **a.** **Masked Language Modeling (MLM) aka Autoencoding Task** - (Covered in detail below) Helps BERT recognize token interaction within the sentence.    
    **b.** **Next Sentence Prediction (NSP) Task** - This task helps BERT learn relationships between sentences and understand how tokens interact with each other across sentences. It is achieved by predicting whether Sentence B is the actual sentence that follows Sentence A, or a random sentence.
<img style="float: right;" width="300" height="300" src="data/images/bert_classification.jpeg">
10. **Why is BERT Training Fast?** BERT uses only the encoder of the Transformer and ignores the decoder. This makes it exceedingly good at processing/understanding massive amounts of text very quickly, relative to slower LLMs that focus on generating text one token at a time.
11. **BERT-Base:** 12 transformer layers, 768 hidden size, 12 attention heads, 110M parameters (This was trained on 4 cloud TPUs for 4 days)
12. **BERT-Large:** 24 transformer layers, 1024 hidden size, 16 attention heads, 340M parameters (This was trained on 16 cloud TPUs for 4 days)
13. BERT itself doesn't classify text or summarize documents but it is often used as a pre-trained model for downstream NLP tasks. 
14. One year later, RoBERTa by Facebook AI showed that the NSP task is not required. It matched and even beat the original BERT model's performance in many areas. Other models:
    - RoBERTa: A Robustly Optimized BERT Pretraining Approach. Trained BERT for more epochs and/or on more data. Slightly improved the masking and the pre-training data.
    - ALBERT: A Lite BERT. Uses a smaller embedding size. It is lite in terms of parameters, not speed.
    - T5: Text-To-Text Transfer Transformer. Has 11B parameters. Trained on 120B words of cleaned Common Crawl text.
15. Reference: [Click here to read more](https://huggingface.co/blog/bert-101)
16. BERT Implementation: [Click here to learn how to use BERT](https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb)
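
The out-of-vocabulary handling described in point 8 can be sketched as a greedy longest-match-first split, which is how WordPiece tokenization is commonly described. The toy vocabulary below is hypothetical (BERT's real vocabulary has about 30,000 entries), and the function is a simplified illustration rather than BERT's actual tokenizer code.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split; non-initial pieces carry a '##' prefix."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no known piece fits: give up on the whole word
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"play", "##ing", "##ed", "un", "##believ", "##able"}

print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("unbelievable", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Splitting unknown words into known pieces keeps the vocabulary small while still giving the model something meaningful to work with for rare or new words.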

#### **Auto-Encoding Task vs Masked Language Modeling (MLM)**   
- **Problem with Auto-Encoding:** As it is a "bidirectional" input reconstruction task without masking, words can see themselves.
- Solution: **Masked LM (proposed by BERT)**
- MLM enables/enforces bidirectional learning from text by masking (hiding) a word in a sentence and forcing BERT to bidirectionally use the words on either side of the covered word to predict the masked word. We naturally do this as humans!
- For eg: "How are `[MASK]` doing today?"
- Can you guess the `[MASK]`. You’re naturally able to predict the missing word by considering the words bidirectionally before and after the missing word as context clues. 
- Here it can be 'you', 'they', etc... The prediction will be 'you' as it has the highest probability here.
- **Fun Fact:** On some benchmarks, newer models like BERT can be more accurate than humans! 🤯
- **What is Masked LM?** The bidirectional approach you used to fill in the `[MASK]` word above is similar to how BERT attains state-of-the-art accuracy. **It only predicts the masked words rather than reconstructing the entire input.**
- In BERT, we mask out a random k% of the input words, and then BERT's job is to correctly predict the masked words. We always use k = 15%.
- Too little masking: Too expensive to train.
- Too much masking: Not enough context to learn.
- **Problem with Masked LM:**
    - There's a problem when using BERT for fine-tuning on specific tasks. During pre-training, BERT uses a special token `[MASK]` to mask certain words in the input sentence and trains the model to predict these masked words. But during fine-tuning (when using BERT for specific tasks like classification or translation), the `[MASK]` token does not exist in the input data.
    - To address this mismatch, the authors describe a strategy where they don't always replace words with the `[MASK]` token during pre-training.
    - k=15% of the words to mask, but don't replace with `[MASK]` 100% of the time. Instead: only 80% of the time replace with `[MASK]`. 10% of the time replace with a random word. 10% of the time keep the same word.
    - With 10% random word replacement, BERT learns to handle noisy input and become more robust to variations in the training data.
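
The 15% selection with the 80/10/10 replacement rule can be sketched as follows. The toy vocabulary, the fixed seed and the use of `None` to mark positions that carry no loss are assumptions made for this illustration.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (model inputs, labels); labels are None where no prediction is needed."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")           # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random word
            else:
                inputs.append(tok)                # 10%: keep the same word
        else:
            inputs.append(tok)
            labels.append(None)  # not selected: contributes nothing to the loss
    return inputs, labels


# Hypothetical sentence and vocabulary, for illustration only.
sentence = "how are you doing today my friend it is a nice day outside".split()
toy_vocab = ["apple", "river", "blue", "walk"]

inputs, labels = mask_tokens(sentence, toy_vocab)
print(inputs)
print(labels)
```

Because the model cannot tell which of the unchanged or randomly replaced tokens it will be asked about, it has to keep a useful representation for every position, not only for `[MASK]`.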

Auto-encoding and Masked LM are both pre-training tasks used in NLP to train transformer-based models like BERT. While they are very similar, they have distinct differences. Both tasks aim to learn rich, context-aware representations of words and sentences, but they achieve this goal through different training objectives and mechanisms:
1. Objective:
    - In an autoencoding task, the model is trained to reconstruct the original input sequence from a corrupted or noisy version of the same sequence.
    - In a masked language model task, the model is trained to predict masked or missing words in an input sequence.
2. Input-Output Relationship
    - In an autoencoding task, the model learns to map the input sequence to itself, with the goal of minimizing the reconstruction error between the input and output sequences.
    - In MLM, the model's task is to predict the original words that were replaced with `[MASK]` tokens based on the surrounding context.
3. Training
    - In an autoencoding task, the model adjusts its parameters to minimize the discrepancy between the input and reconstructed output, typically using a reconstruction loss such as mean squared error (MSE) or binary cross-entropy.
    - In MLM, the model adjusts its parameters to minimize the discrepancy between the predicted tokens and the original tokens, typically using a cross-entropy loss.


### **2. GPT (Generative Pre-Trained Transformer)**

1. By OpenAI - Autoregressive Language Model
2. **[Click Here](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)** to read the original paper from OpenAI - Improving Language Understanding by Generative Pre-Training.
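3. Pretrained on: The original GPT was pre-trained on the BooksCorpus dataset. Later GPT models are pre-trained largely on Proprietary Data (data for which the rights of ownership are restricted so that the ability to freely distribute it is limited).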
4. Autoregressive Language Model that uses attention to predict the next token in a sequence based on the previous tokens.
5. GPT relies on the decoder portion of the Transformer and ignores the encoder to become exceptionally good at generating text one token at a time.
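
Generating text one token at a time can be sketched as a simple loop. The lookup table below is a made-up stand-in for a real model: an actual GPT conditions on all previous tokens through masked self-attention, whereas this toy function only looks at the last token, and it always picks the most probable next token (greedy decoding).

```python
def generate(prompt, next_token_probs, max_new_tokens=10, end_token="</s>"):
    """Greedy decoding: repeatedly append the most probable next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        if not probs:
            break  # the toy model has nothing to suggest
        next_token = max(probs, key=probs.get)
        tokens.append(next_token)
        if next_token == end_token:
            break
    return tokens


# Hypothetical next-token probabilities, for illustration only.
TOY_TABLE = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"</s>": 1.0},
}


def toy_model(tokens):
    """Stand-in for an LLM: looks only at the last token."""
    return TOY_TABLE.get(tokens[-1], {})


print(generate(["the"], toy_model))
```

Each new token requires another pass through the model, which is why generation is inherently sequential even though training is parallel.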


### **3. T5 (Text to Text Transfer Transformer)**
<img style="float: right;" width="400" height="400" src="data/images/t5.jpeg">

1. By Google (2019) - Combination of Autoencoding and Autoregressive Language Model.
2. The largest variant has 11B parameters. Trained on 120B words of cleaned Common Crawl text.
3. Tasks: T5 can solve tasks such as summarization, translation, Q&A, and text classification
4. T5 uses both encoder and decoder of the Transformer to become highly versatile in both processing and generating text.
5. T5-based models can handle a wide range of NLP tasks, from text classification to generation.
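
T5 handles many tasks with one model by casting each of them as "text in, text out", with a short task prefix at the start of the input. A minimal sketch of that formatting is shown below. The helper name is hypothetical, and the prefixes follow the style used in the T5 paper's examples.

```python
def to_text_to_text(task_prefix, text):
    """Prepend a task prefix so that every task becomes plain text in, text out."""
    return f"{task_prefix}: {text}"


inputs = [
    to_text_to_text("translate English to German", "That is good."),
    to_text_to_text("summarize", "Transformers removed recurrence from sequence models."),
]
for model_input in inputs:
    print(model_input)
```

The expected output is also plain text (a translation, a summary, or even a class label written as a word), so the same encoder-decoder model and the same loss can be used for all tasks.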


### **4. Domain Specific LLMs**

1. BioGPT - A GPT-style model pre-trained on large-scale biomedical literature (millions of PubMed abstracts). Developed by Microsoft Research.
2. SciBERT
3. BlueBERT

## **Aligning LLMs With Instructional Prompts**
1. After the initial pre-training, the LLM may undergo fine-tuning where it is further trained on specific tasks or domains. During fine-tuning, the model is exposed to additional data related to the task or domain, along with instructional prompts tailored to the task. This helps the model adapt and specialize for specific applications or use cases.
2. A popular method of aligning language models is to incorporate **Reinforcement Learning** into the training loop.
3. **Reinforcement Learning from Human Feedback (RLHF)** is a popular method of aligning pre-trained LLMs that uses human feedback to enhance their performance.
4. A few language models that have been specifically designed and trained to be aligned with instructional prompts are GPT-3, GPT-4, ChatGPT (closed-source models from OpenAI), FLAN-T5 (an open-source model from Google) and Cohere's Command series (closed-source).

## **Prompt Engineering**
1. Popular LLMs like GPT-3, GPT-4, ChatGPT, Coral, GPT-J, FLAN-T5, etc... have been specifically designed and **trained to be aligned with instructional prompts**.
2. If you are wondering what is the best way to talk to ChatGPT and GPT-4 to get optimal results, we will cover that under **Prompt Engineering**.
3. **Prompt Engineering** involves crafting prompts that effectively communicate the task at hand to the LLM, leading to accurate and useful outputs.
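
One common way to communicate a task clearly is to combine an instruction, a few worked examples and the new input in a consistent layout. The sketch below shows one possible template. The `Input:` / `Output:` labels, the function name and the example reviews are assumptions, not a required format for any particular LLM.

```python
def build_prompt(instruction, examples, query):
    """Assemble an instruction, a few worked examples, and the new input."""
    lines = [instruction.strip(), ""]
    for example_input, example_output in examples:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [("I loved this movie.", "positive"), ("The plot was dull.", "negative")],
    "The acting was wonderful.",
)
print(prompt)
```

Ending the prompt with an empty `Output:` slot signals to the model exactly where and in what form its answer is expected.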

## **Quick Summary about Transformers, LLMs and Prompt Engineering**
1. What really sets Transformers apart from other deep learning architectures is:
    - Its ability to capture long-term dependencies and relationships between tokens using attention mechanism.
    - Its ability to scale and parallelize the computation
2. Attention is the crucial component of Transformer.
3. A key factor behind the Transformer's effectiveness as a language model is that it is highly parallelizable, allowing for faster training and efficient processing of text.
4. LLMs are usually derived from the Transformer architecture (but not necessarily) by training on a large amount of text data.
5. Designed to understand and generate human language, code, and much more.
6. LLMs are pre-trained on large corpus and fine-tuned on smaller datasets for specific tasks.
7. A few popular LLMs: BERT, GPT-4, GPT-3.5, Gemini (previously known as Bard), Cohere, Llama 2, Coral, GPT-J, FLAN-T5, etc...
8. If you are wondering what is the best way to talk to ChatGPT and GPT-4 to get optimal results, we will cover that under **Prompt Engineering**.

Remember, building an LLM requires a huge amount of good quality data and computational resources. **[Read here](https://blog.google/products/gemini/gemini-image-generation-issue/)** where Google explained what went wrong after the launch of Gemini.

## **What is Generative AI?**
A GenAI system typically learns patterns from unstructured input data and learns to generate unstructured output data.
Remember that when the output of a model is one of the following, it is an example of a GenAI model:
1. Text
2. Image
3. Video
4. Audio

## **Applications of GenAI**
1. Text Generation
2. Summarization
3. Code Generation
4. Machine Translation
5. Virtual Assistants
6. Question Answering
7. Image Editing
8. Image Generation
9. Image Inpainting
10. etc...

## **Popular GenAI Approaches**
1. **Autoencoding**
    - In autoencoding, the model is trained to reconstruct the input data. It consists of two main components: an encoder and a decoder.
    - The encoder takes the input data and maps it to a latent space, where it is represented in a compressed form.
    - The decoder then takes this compressed representation and tries to reconstruct the original input data from it.
    - The goal of autoencoding is to learn a compact representation of the input data that captures its salient features, allowing the model to generate new samples similar to the training data.
2. **Generative Adversarial Networks**
    - GANs consist of two neural networks: a generator and a discriminator, which are trained adversarially against each other.
    - The generator learns to generate realistic data samples from random noise.
    - The discriminator learns to distinguish between real data samples from the training set and fake data samples generated by the generator.
    - During training, the generator tries to generate data samples that are indistinguishable from real samples, while the discriminator tries to correctly classify real and fake samples.
    - The objective of GANs is to learn to generate new data samples that are realistic and similar to the training data, without explicitly reconstructing the input data like autoencoders.
    - While both autoencoding and GANs are used for generative modeling, autoencoding focuses on reconstructing the input data, while GANs focus on generating new data samples from random noise.
3. **Autoregressive**
    - Autoregressive models generate new data samples by modeling the conditional probability of each data point given previous data points in the sequence.
    - Autoregressive models are often used for sequential data, such as time series data, where the order of elements is important.
    - Unlike autoencoders, which learn to reconstruct the input data, and GANs, which learn to generate samples from random noise, autoregressive models explicitly model the sequential dependencies in the data and generate new samples one element at a time based on these dependencies.
4. **Diffusion Models (e.g. Stable Diffusion)**
    - Diffusion models generate new data samples by starting from random noise and iteratively refining it over multiple steps.
    - During training, noise is gradually added to real data samples (the forward process), and the model learns to reverse this corruption. At generation time, the model removes noise step by step until the sample resembles the target distribution of the data.
    - Diffusion models typically use deep neural networks, such as convolutional neural networks (CNNs) or transformer architectures, to perform the denoising steps and generate high-quality samples.
    - The objective of diffusion models is to learn the underlying distribution of the data and generate new samples that are realistic and similar to the training data, without relying on explicit modeling of sequential dependencies or adversarial training.
    - Diffusion models are particularly effective for generating high-resolution images and other complex data types where capturing fine-grained details and global coherence is important.
    - Diffusion models differ from autoencoders, GANs, and autoregressive models in that generation is an iterative denoising process over many steps, rather than a reconstruction of the input, a single generator pass over random noise, or element-by-element sequential prediction.
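
To make the autoregressive approach (item 3 above) concrete, the following toy sketch builds a bigram model: it estimates the probability of each token given the previous one from counts, then generates text one token at a time. This is a drastic simplification of what an LLM does, since it conditions on a single previous token instead of the whole context. The corpus, the function names, and the `<s>`/`</s>` boundary markers are all assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, max_tokens=20, seed=0):
    """Sample one token at a time, each conditioned on the previous token."""
    rng = random.Random(seed)
    token, output = "<s>", []
    for _ in range(max_tokens):
        candidates = counts[token]
        # Sample the next token in proportion to how often it followed `token`.
        token = rng.choices(list(candidates), weights=list(candidates.values()))[0]
        if token == "</s>":
            break
        output.append(token)
    return " ".join(output)


corpus = ["the cat sat", "the dog sat"]
model = train_bigram(corpus)
text = generate(model, seed=0)
print(text)
```

With this tiny corpus the model can only ever produce one of the two training sentences; a larger corpus lets it recombine fragments into sequences it has never seen.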

## **What Next? How to use LLMs?**

Given a business problem, ask yourself:
1. What NLP task does it map to?
    - Text Classification
    - Token Classification
    - Text Generation
    - Fill-Mask
    - Conversational
    - Sentence Similarity
    - Question Answering
    - Summarization
    - Table Q&A
    - Translation
    - Zero-Shot Classification
2. Given the task, what model(s) work for that task?

**Example:**  
> **Business Problem:** Generate a news feed for an app that users can scroll through  
> **Mapping to an NLP task:** Given a news article, the standard NLP task is summarization  

Before we get into how to solve problems like the one above, a quick note on the NLP ecosystem:

| Popular Tools | Utility |
| :---: | :---: |
| **Hugging Face Transformers** | Pre-trained models and Pipelines |
| **NLTK** | Classical NLP + corpora |
| **spaCy** | Production-grade NLP, especially NER |
| **Gensim** | Classical NLP + Word2Vec |
| **OpenAI** | ChatGPT, Whisper |
| **Spark NLP** | Scale-out, production-grade NLP |
| **LangChain** | LLM Workflows |
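
As a sketch of how the two questions above can translate into code, the snippet below maps task names from the list to the task identifiers used by the Hugging Face Transformers `pipeline` function, then wraps the news-feed example in a small helper. The helper name `summarize_articles`, the length limits, and the reliance on the library's default summarization model are assumptions made for illustration. The identifiers are those of the 4.x releases of the library, so check the documentation for the version you have installed. Conversational and Sentence Similarity are left out because they are typically handled outside this pipeline interface.

```python
# Task names from the list above mapped to Hugging Face `pipeline` task identifiers.
TASK_TO_PIPELINE = {
    "Text Classification": "text-classification",
    "Token Classification": "token-classification",
    "Text Generation": "text-generation",
    "Fill-Mask": "fill-mask",
    "Question Answering": "question-answering",
    "Summarization": "summarization",
    "Table Q&A": "table-question-answering",
    "Translation": "translation",
    "Zero-Shot Classification": "zero-shot-classification",
}


def summarize_articles(articles, max_length=60):
    """Summarize news articles for a feed (hypothetical helper).

    Assumes `transformers` and a backend such as PyTorch are installed.
    The first call downloads a default pre-trained summarization model.
    """
    from transformers import pipeline  # imported lazily: heavy optional dependency

    summarizer = pipeline(TASK_TO_PIPELINE["Summarization"])
    results = summarizer(articles, max_length=max_length, min_length=10, truncation=True)
    return [result["summary_text"] for result in results]


# Example usage (commented out because it downloads a model on first run):
# feed = summarize_articles(["<full text of a news article>"])
```

In practice you would usually pass an explicit `model=` argument to `pipeline` so that the choice of model is deliberate and does not depend on the library's default.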