#### Flash Attention: Overview and Benefits

**Flash Attention** (FlashAttention) is an optimized, exact implementation of the attention mechanism used in Transformer models. It produces the same result as standard attention but reorganizes the computation so that far less data moves through GPU memory, which reduces both runtime and memory use. The benefit is largest for long sequences and large models, where standard attention becomes expensive in both compute and memory.

##### Traditional Attention Mechanism (in Transformers)
In the original **Scaled Dot-Product Attention** used in Transformers, the attention mechanism computes attention scores for each token (word or other elements) in relation to every other token in the sequence. This involves:

1. **Query (Q), Key (K), and Value (V) matrices**: These matrices are used to compute the attention weights.
2. **Attention Score Calculation**: Scores are the dot products of queries and keys, scaled by the square root of the key dimension and normalized with softmax. The resulting weights are used to take a weighted sum of the value vectors.

The time and memory cost of this approach is **O(N²)**, where **N** is the sequence length. The quadratic cost arises because each token attends to every other token, and a standard implementation stores the full N × N score matrix, which makes long sequences expensive.
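To make the quadratic growth concrete, here is a small back-of-the-envelope sketch. The function name `attention_scores_bytes` and the default of 2 bytes per value (FP16) are assumptions chosen for illustration; it only counts the N × N score matrix for one head and one sequence.

```python
def attention_scores_bytes(seq_len, bytes_per_value=2, num_heads=1, batch_size=1):
    """Bytes needed to store the full (seq_len x seq_len) attention score matrix."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value

# Quadrupling the sequence length multiplies the memory by 16.
for n in (1024, 4096, 16384):
    mib = attention_scores_bytes(n) / 2**20
    print(f"N={n:>6}: {mib:,.0f} MiB per head")
```

Real models multiply this by the number of heads and the batch size, which is why the score matrix quickly dominates memory at long sequence lengths.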

##### Flash Attention: Key Features

Flash Attention reduces both **memory usage** and **computation time**. It is an exact method, not an approximation, so the attention output matches standard attention up to floating-point rounding.

1. **Memory Efficiency**: Instead of writing the full N × N attention matrix to GPU memory, Flash Attention processes the query, key, and value matrices in **blocks (tiles)** that fit in fast on-chip memory and combines the partial results with a running ("online") softmax. The extra memory it needs grows linearly with sequence length rather than quadratically.

2. **Faster Computation**: Flash Attention fuses the matrix multiplications, softmax, and weighting into a specialized GPU kernel. The speedup comes mainly from fewer reads and writes to slow GPU memory, not from doing fewer arithmetic operations.

3. **Efficient Batch Processing**: The kernel parallelizes across sequences in a batch and across attention heads, which allows multiple sequences to be processed quickly.

4. **Low Precision Arithmetic**: Implementations typically run in **low-precision formats** (such as FP16 or BF16 instead of FP32), which further speeds up computation without significantly impacting the model's quality.
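The block-wise idea can be illustrated in plain NumPy. This is only a sketch of the tiling and running-softmax technique under simplifying assumptions (single head, no masking, CPU arrays); it is not the actual Flash Attention GPU kernel, and the names `tiled_attention` and `block_size` are made up for this example.

```python
import numpy as np

def standard_attention(q, k, v):
    """Reference implementation that builds the full (N, N) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block_size=4):
    """Process keys/values block by block, keeping only running statistics."""
    n, d = q.shape
    out = np.zeros((n, v.shape[-1]))
    row_max = np.full((n, 1), -np.inf)   # running max of scores per query
    row_sum = np.zeros((n, 1))           # running softmax denominator
    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = q @ k_blk.T / np.sqrt(d)            # only (N, block_size)
        new_max = np.maximum(row_max, scores.max(axis=-1, keepdims=True))
        correction = np.exp(row_max - new_max)       # rescale earlier partial results
        p = np.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_blk
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(10, 8)) for _ in range(3))
print(np.allclose(tiled_attention(q, k, v), standard_attention(q, k, v)))
```

The two functions agree up to floating-point rounding, which is the sense in which the block-wise computation is exact rather than approximate.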

##### How Flash Attention Helps with Regular Attention

Flash Attention improves the regular attention mechanism in the following ways:

- **Speed**: By cutting the number of reads and writes to GPU memory, Flash Attention speeds up the attention computation, with the largest gains on long sequences.
  
- **Memory Usage**: Traditional attention requires storing large matrices for the attention scores, which can be infeasible for long sequences. Flash Attention reduces the memory footprint, making it possible to handle longer sequences or larger batch sizes within the same hardware constraints.

- **Scalability**: Flash Attention is more scalable, making it suitable for training and inference on large models, such as GPT-3 and BERT, where the traditional attention mechanism would become a bottleneck.

##### Example Use Cases
- **Large Language Models**: Flash Attention can speed up training and inference of large-scale language models like GPT, BERT, etc., where the sequence length can be very long.
- **Vision Transformers (ViT)**: In tasks like image classification using Transformer-based architectures, Flash Attention can help handle the large input sizes (e.g., high-resolution images) more efficiently.


Came across this whitepaper - "Foundational Large Language Models & Text Generation" from Google. This paper dives deep into the evolution of large language models, from the early days of GPT-1 and BERT to advanced models like PaLM, Gemini, and more. It covers key concepts such as:

- Transformer architecture and multi-head attention
- The evolution of GPT and BERT models
- Fine-tuning techniques including supervised learning and reinforcement learning from human feedback
- Practical insights on prompt engineering, sampling techniques, and accelerating inference

https://drive.google.com/file/d/1XixH1yGVXgX6hdPFv8JUaqK5t31U984V/edit

"Prompt Engineering" by Lee Boonstra. This paper covers everything from basic prompting techniques to advanced strategies like Chain of Thought (CoT), Tree of Thoughts (ToT), and ReAct, along with practical tips for code-related prompting. Key sections include:

- Configuring output with temperature, top-K, and top-P settings
- Using zero-shot, one-shot, and few-shot prompting effectively
- System, role, and contextual prompting techniques
- Best practices, including experimenting with input formats and adapting prompts to model updates

https://drive.google.com/file/d/13PXk5hmgoTeFyKCKCKroE4pz7AoFoyjS/view

#### What is KV Cache?

**KV cache** refers to the **Key-Value cache** used in transformer models, particularly decoder-only Large Language Models (LLMs) such as GPT. In the attention mechanism, every layer computes a **Key** and a **Value** vector for each token. During autoregressive inference (i.e., when generating text one token at a time), the Key and Value vectors of tokens that have already been processed are stored and reused instead of being recomputed at every step. These **Key** and **Value** pairs are what the attention mechanism uses to compute dependencies between tokens at different positions.

#### Example: How KV Cache Works in LLMs

Let’s break it down with a simple example to understand how the KV cache works:

#### Scenario: Text Generation with a Transformer Model

Imagine we are generating text one token at a time, with the input prompt being:

**Input**: "The quick brown fox"

1. **Tokenization**: The input text is tokenized into individual words or subwords, e.g., `["The", "quick", "brown", "fox"]`.

2. **Processing with Transformer Layers**: 
   - Each token is processed through multiple transformer layers (e.g., 12 layers in GPT-2 small).
   - In each layer, the **attention mechanism** computes the relevance of each token to every other token in the sequence.

3. **Key and Value Pairs**: 
   - During this process, each layer computes two vectors for each token: 
     - **Key** (`K`): Represents the contextual information about the token.
     - **Value** (`V`): Represents the actual content (value) that will be used to compute attention.
   - These Key and Value matrices are stored for each token at each layer of the transformer model.

4. **Storing in KV Cache**:
   - For the token "The", the model computes its Key and Value matrices and stores them in the cache.
   - As the model processes subsequent tokens ("quick", "brown", "fox"), it stores the Key and Value matrices for each token as well.
   - These matrices are stored in a cache so they can be **reused** without having to recalculate them for each token.

5. **During Inference**:
   - When the model is generating text, it uses the stored **Key-Value pairs** in the cache to quickly compute attention for each new token.
   - For example, when generating the next token after "fox", the model already has the Key and Value vectors for "The", "quick", "brown", and "fox" in its cache from processing the prompt, so it only needs a new Query for the current position to attend over them and predict the next token (say "jumps").

6. **Efficiency**:
   - Instead of recalculating the Key and Value pairs for every token at each layer, the **KV cache** allows the model to efficiently access previously computed values, reducing computational costs and speeding up inference.
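The steps above can be sketched for a single attention layer in NumPy. This is a toy illustration under stated assumptions (one head, one layer, random weights and embeddings standing in for a real model); the class name `KVCache` and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
# Random projection matrices stand in for a trained layer's weights.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend_last_token_no_cache(embeddings):
    """Recompute K and V for every token, then attend from the last token."""
    q = embeddings[-1:] @ W_q
    k = embeddings @ W_k
    v = embeddings @ W_v
    return softmax(q @ k.T / np.sqrt(d_model)) @ v

class KVCache:
    """Stores the Key and Value vectors of all tokens seen so far."""
    def __init__(self):
        self.keys = np.empty((0, d_model))
        self.values = np.empty((0, d_model))

    def attend(self, new_embedding):
        """Project only the new token, append its K/V, attend over the cache."""
        x = new_embedding[None, :]
        self.keys = np.vstack([self.keys, x @ W_k])
        self.values = np.vstack([self.values, x @ W_v])
        q = x @ W_q
        return softmax(q @ self.keys.T / np.sqrt(d_model)) @ self.values

# Five random vectors stand in for the embeddings of five tokens.
tokens = rng.normal(size=(5, d_model))
cache = KVCache()
for t in range(len(tokens)):
    cached_out = cache.attend(tokens[t])
    full_out = attend_last_token_no_cache(tokens[: t + 1])
    assert np.allclose(cached_out, full_out)  # same result, less recomputation

print(cache.keys.shape)  # one cached Key vector per processed token
```

The cached path projects one token per step, while the uncached path re-projects the whole prefix each time; both produce the same attention output.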

#### How KV Cache Optimization Helps

**SwiftKV** is an inference optimization technique (not a model) that combines two ideas, **SingleInputKV** and **AcrossKV**, to reduce the computation and size of the KV cache:
- **SingleInputKV**: Uses the output of an earlier layer to fill the KV cache for subsequent layers, so prompt tokens can skip much of the later-layer computation.
- **AcrossKV**: Reduces memory by sharing KV caches across multiple layers, saving resources while aiming to preserve model quality.
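To see why cache size matters, the sketch below estimates KV cache memory from model dimensions. The formula counts one Key and one Value vector per token, head, and layer. The `layers_per_shared_cache` parameter is a hypothetical knob added here to illustrate cross-layer sharing in general; it is not SwiftKV's actual implementation. The example figures use the GPT-2 small configuration (12 layers, 12 heads, head dimension 64) with FP16 values.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1, layers_per_shared_cache=1):
    """Estimate KV cache size; the factor 2 covers Keys and Values."""
    # Ceiling division: number of distinct caches when layers share one.
    distinct_caches = -(-num_layers // layers_per_shared_cache)
    return (2 * distinct_caches * num_heads * head_dim
            * seq_len * bytes_per_value * batch_size)

gpt2_small = dict(num_layers=12, num_heads=12, head_dim=64, seq_len=1024)
full = kv_cache_bytes(**gpt2_small)
shared = kv_cache_bytes(**gpt2_small, layers_per_shared_cache=2)
print(f"full cache:   {full / 2**20:.0f} MiB")    # 36 MiB
print(f"shared cache: {shared / 2**20:.0f} MiB")  # 18 MiB
```

The estimate scales linearly with sequence length and batch size, so for larger models and longer contexts the cache can rival the model weights in size.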

#### Summary

The **KV cache** stores the Key and Value pairs in transformer models, allowing for faster and more efficient inference by reusing previously computed data. This caching mechanism is essential for models that process large inputs, like in text generation, and optimizations to this cache can greatly improve throughput and reduce latency.


#### Some Important Concepts

1. **Tokens**  
   - **Importance**: Fundamental concept in NLP and LLMs. Everything in LLMs starts with tokenization, and understanding tokens is crucial for understanding how models process input text.  
   - **Usage**: Tokens are used universally across various NLP tasks and models (e.g., GPT, BERT).

2. **Attention Mechanism**  
   - **Importance**: Core to transformer models, which are the foundation of modern LLMs. Attention enables models to focus on relevant parts of the input.  
   - **Usage**: Central to model architectures like BERT, GPT, and T5, attention mechanisms are key in understanding how LLMs process and generate language.

3. **KV Cache (Key-Value Cache)**  
   - **Importance**: A mechanism to improve inference efficiency by caching important information for attention operations, reducing the need to recompute during inference.  
   - **Usage**: Increasingly important for optimizing LLMs, especially for large-scale models like GPT-3.

4. **Zero-shot Learning**  
    - **Importance**: Key to the generalization power of LLMs. Allows models to perform tasks without explicit training on those tasks.  
    - **Usage**: Commonly referenced in papers and literature, especially for models like GPT-3 that perform tasks "out of the box."

5. **Few-shot Learning**  
    - **Importance**: Enhances the versatility of LLMs by allowing them to learn tasks with minimal examples, often using prompt engineering.  
    - **Usage**: Popular in the context of LLMs like GPT-3, where only a few examples are needed to perform specific tasks.

6. **Layer Normalization**  
    - **Importance**: Helps stabilize training by normalizing each token's activations across the feature dimension, keeping their scale consistent from layer to layer. It plays a vital role in training deep models like transformers.  
    - **Usage**: Used in transformers and other deep learning models to help keep gradients well-behaved during training.

7. **AcrossKV**  
    - **Importance**: An optimization technique that reduces memory consumption during inference by sharing key-value caches across multiple layers.  
    - **Usage**: More specialized and used in specific optimizations for LLMs to enhance memory efficiency.
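As a concrete illustration of the Layer Normalization item above, here is a minimal NumPy sketch. The names `gamma`, `beta`, and `eps` follow common convention but are assumptions here; in a real model `gamma` and `beta` are learned parameters.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row (token) to zero mean and unit variance, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Two "tokens" whose activations differ in scale by a factor of 10.
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.round(3))
```

Both rows come out nearly identical after normalization, which shows how the operation removes differences in scale between tokens.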

### 1. Tokenizer
#### What is a Tokenizer?
A tokenizer is a tool in Natural Language Processing (NLP) that converts raw text into smaller units called tokens. Tokens can be words, characters, or subwords, depending on the type of tokenizer used. Tokenization is a crucial preprocessing step for text analysis and training language models, as models operate on tokens rather than raw text.


#### How Does Tokenization Work?
1. **Text Input**: The raw input text (e.g., "Hello, how are you?") is provided to the tokenizer.
2. **Processing**: The tokenizer breaks the text into tokens based on its rules or configuration.
3. **Output**: A list of tokens is generated (e.g., `["Hello", ",", "how", "are", "you", "?"]`).

The process can vary based on the tokenizer type, which may handle spaces, punctuation, and special characters differently.

#### Types of Tokenizers
1. **Whitespace Tokenizer**: Splits text by spaces.  
   Example: "Tokenization is fun" → `["Tokenization", "is", "fun"]`

2. **Word Tokenizer**: Splits text into words while preserving punctuation.  
   Example: "Hello, world!" → `["Hello", ",", "world", "!"]`

3. **Character Tokenizer**: Splits text into individual characters.  
   Example: "Hello" → `["H", "e", "l", "l", "o"]`

4. **Subword Tokenizer**: Splits text into subwords based on frequency or patterns.  
   Example: "unbelievable" → `["un", "believ", "able"]`  
   - Commonly used in modern language models (e.g., BPE, WordPiece).

5. **Sentence Tokenizer**: Splits text into sentences.  
   Example: "Hello world. How are you?" → `["Hello world.", "How are you?"]`
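The simpler tokenizer types above can be approximated with the Python standard library. These are deliberately naive sketches (the function names are made up for this example); production tokenizers handle many more edge cases, and subword tokenization needs a trained vocabulary.

```python
import re

def whitespace_tokenize(text):
    """Split on runs of whitespace."""
    return text.split()

def word_tokenize_simple(text):
    """Runs of word characters, or any single punctuation mark."""
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    """One token per character."""
    return list(text)

def sentence_tokenize_simple(text):
    """Split on whitespace that follows '.', '!' or '?'."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(whitespace_tokenize("Tokenization is fun"))
print(word_tokenize_simple("Hello, world!"))
print(char_tokenize("Hello"))
print(sentence_tokenize_simple("Hello world. How are you?"))
```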


#### Tokenizers Used in ChatGPT and Gemini
1. **ChatGPT**: Uses a byte-level Byte Pair Encoding (BPE) tokenizer, as in earlier GPT models; OpenAI publishes it through the `tiktoken` library. BPE splits text into subword units based on frequency.
2. **Gemini**: Google's technical report describes a SentencePiece tokenizer trained on a large sample of the training corpus; the exact vocabulary and configuration are not fully public.

#### Byte Pair Encoding (BPE): Overview

#### What is BPE?
Byte Pair Encoding (BPE) is a subword tokenization technique commonly used in modern NLP models such as the GPT family (BERT uses the closely related WordPiece algorithm). It splits text into smaller, more frequent subword units, enabling efficient handling of rare and compound words.


#### How Does BPE Work?
1. **Initialization**:
   - Start with the text split into individual characters (e.g., "hello" → `["h", "e", "l", "l", "o"]`).
   
2. **Merging**:
   - Identify the most frequent pair of adjacent symbols (e.g., "l" and "l").
   - Merge the pair into a single unit (e.g., `["h", "e", "ll", "o"]`).
   
3. **Iteration**:
   - Repeat the process until the desired vocabulary size is reached, merging frequent pairs iteratively.

4. **Encoding**:
   - During tokenization, the trained BPE merges are applied to break words into subword units (e.g., "unbelievable" → `["un", "believ", "able"]`).


#### How is BPE Trained?
1. **Input Data**: Collect a large text corpus.
2. **Character Split**: Start with all words split into individual characters.
3. **Frequency Calculation**:
   - Count the frequency of all adjacent character pairs.
4. **Pair Merging**:
   - Merge the most frequent pair and update the vocabulary.
5. **Repeat**:
   - Continue merging until reaching a predefined vocabulary size.
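The training loop above fits in a few lines of Python. This is a minimal sketch on a toy word-frequency table, assuming character-level starting symbols and no end-of-word marker; the function names are made up for this example, and real implementations (byte-level handling, pre-tokenization, efficient pair counting) are more involved.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def train_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges from a {word: frequency} table."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges

def encode(word, merges):
    """Apply the learned merges, in training order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = train_bpe(word_freqs, num_merges=4)
print(merges)
print(encode("lowest", merges))  # ['low', 'est']
```

Note that "lowest" never appears in the toy corpus, yet it is encoded from subwords learned from other words, which is how BPE handles rare or unseen words.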


#### Advantages of BPE
- Handles **rare words** by splitting them into subword units.
- **Compact vocabulary**, reducing memory usage.
- Balances between character- and word-level tokenization.


#### Libraries Supporting BPE
1. SentencePiece
```python
import sentencepiece as spm

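# Train a BPE tokenizer (model_type defaults to 'unigram', so set it explicitly).
# 'text.txt' must be a corpus large enough to support the requested vocab_size.
spm.SentencePieceTrainer.train(
    input='text.txt', model_prefix='bpe', vocab_size=8000, model_type='bpe'
)

# Load and tokenize
sp = spm.SentencePieceProcessor(model_file='bpe.model')
tokens = sp.encode('unbelievable', out_type=str)
print(tokens)  # e.g. ['▁un', 'believ', 'able']; the exact split depends on the training corpus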
```
BPE is a simple yet powerful tokenization method, merging frequent subword pairs to create a flexible vocabulary. It strikes a balance between robustness (handling rare words) and efficiency (compact vocabulary).



#### Standard Python Modules for Tokenization

#### 1. **Using `nltk`**
`nltk` is a popular library for basic tokenization tasks.
```python
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead

# Word Tokenizer
text = "Hello, how are you?"
tokens = word_tokenize(text)
print(tokens)  # Output: ['Hello', ',', 'how', 'are', 'you', '?']

# Sentence Tokenizer
sentences = sent_tokenize("Hello world. How are you?")
print(sentences)  # Output: ['Hello world.', 'How are you?']
```

####  2. Using transformers (Hugging Face)
For subword tokenization, commonly used with models like BERT, GPT, etc.
```python
from transformers import AutoTokenizer

# Load a pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Encode text
text = "Tokenization is fun!"
encoded = tokenizer.encode(text)
print(encoded)  # Output: [Tokens as IDs]

# Decode tokens back to text
decoded = tokenizer.decode(encoded)
print(decoded)  # Output: Tokenization is fun!
```

#### 3. Using spaCy
A powerful library for NLP tasks.
```python
import spacy

# Load a pre-trained model
nlp = spacy.load("en_core_web_sm")

# Tokenize text
doc = nlp("Hello, how are you?")
tokens = [token.text for token in doc]
print(tokens)  # Output: ['Hello', ',', 'how', 'are', 'you', '?']
```

#### 4. Using tiktoken
For OpenAI models like ChatGPT.
```python
import tiktoken

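# Load the tokenizer for a model (get_encoding expects an encoding name
# such as "cl100k_base", so look it up by model name instead)
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")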

# Encode text
text = "Hello, how are you?"
tokens = tokenizer.encode(text)
print(tokens)  # Output: [Token IDs]

# Decode tokens back to text
decoded_text = tokenizer.decode(tokens)
print(decoded_text)  # Output: Hello, how are you?
```

model router
llm as judge
monitoring
finetune

### 2. Attention

#### Attention Mechanism in NLP

#### What is Attention Mechanism?
The attention mechanism allows neural networks to focus on relevant parts of the input data while making predictions. It dynamically assigns weights to input tokens, helping the model prioritize crucial information for tasks like translation, summarization, or language generation.

#### Evolution of Attention Mechanism
1. **Seq2Seq with Attention (2015)**: Attention was first introduced to improve machine translation models, allowing them to focus on specific input words during decoding.
2. **Self-Attention (2017)**: The Transformer model made self-attention its core building block, enabling models to consider relationships between all tokens in a sequence simultaneously.
3. **Multi-Head Attention**: Extends self-attention by running multiple attention heads in parallel so the model can learn different aspects of the input sequence.
4. **Cross-Attention**: Lets tokens in one sequence attend to a different sequence, for example a decoder attending to encoder outputs or text attending to image features.


#### Self-Attention
Self-attention computes attention scores between tokens in the same sequence to capture contextual relationships. For a sentence, each token determines how much attention it should pay to every other token.

#### How It Works
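1. Create **Query (Q)**, **Key (K)**, and **Value (V)** vectors for each token by multiplying its representation by learned weight matrices.
2. Compute attention scores as the dot product of each Q with every K, scaled by the square root of the key dimension.
3. Normalize the scores with a softmax function to obtain attention weights.
4. Multiply the weights by V and sum to get the weighted output.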
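The steps above, together with the usual scaling factor, are commonly summarized in a single formula. Here $d_k$ denotes the dimension of the key vectors, a symbol introduced for this note:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

The softmax is applied row by row, so each token's attention weights over the sequence sum to 1.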

#### Key, Query, and Value Vectors: Intuition
- **Query (Q)**: Represents the token for which attention is being computed, i.e. the current token's "focus" or perspective.
- **Key (K)**: Represents each token being compared against, i.e. what that token offers for other tokens to match on.
- **Value (V)**: Contains the actual information that is passed forward when a token is attended to.

The attention score determines how much of the **Value** vector from one token contributes to another.

In the context of **self-attention**, the **Key (K)**, **Query (Q)**, and **Value (V)** vectors are crucial to understanding how the model attends to different parts of the input sequence when generating an output. As we saw earlier, the self-attention mechanism allows the model to weigh the importance of each token in the input sequence relative to the others. For a given token, self-attention computes a "score" of how much attention it should give to other tokens in the sequence based on their relationships.

#### 2. How Does Self-Attention Work?
- The **Query (Q)** of a token is compared to the **Keys (K)** of all tokens in the sequence (including itself).
- The attention scores are computed by taking the dot product between the Query (Q) of a token and the Keys (K) of all tokens, followed by scaling and softmax normalization.
- The softmax scores determine the amount of attention given to each token. These scores are then used to weight the **Values (V)** corresponding to each token.
- The output of self-attention is a weighted sum of the **Values (V)**, with weights given by the attention scores.

#### 3. How are Key, Query, and Value Updated?
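- What is learned during training are the projection weight matrices for Query, Key, and Value, not the vectors themselves. The **Q**, **K**, and **V** vectors are computed from each token's current representation using those weights.
- Each layer has its own projection weights, so every token gets a fresh set of **Q**, **K**, and **V** vectors at every layer.
- During autoregressive inference, a past token's **K** and **V** at a given layer do not change as new tokens are appended, which is why they can be cached. A new **Query** is computed for each new token.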
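A small NumPy sketch makes the distinction between learned weights and computed vectors concrete. The random matrices stand in for trained weights, and the sizes `d_model` and `d_head` are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 4

# Fixed after training: one projection matrix each for Q, K and V.
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

def project(x):
    """Compute Q, K and V for a batch of token representations."""
    return x @ W_q, x @ W_k, x @ W_v

# Two different 3-token inputs pass through the SAME weights.
x_a = rng.normal(size=(3, d_model))
x_b = rng.normal(size=(3, d_model))
q_a, k_a, v_a = project(x_a)
q_b, k_b, v_b = project(x_b)

print(k_a.shape)              # one Key vector per token
print(np.allclose(k_a, k_b))  # different inputs give different Keys
```

The weights are shared across inputs, while the Q, K, and V vectors depend on the tokens being processed.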

#### 4. Multi-Head Attention
In **multi-head attention**, the attention process is repeated multiple times (using different sets of learned weights) in parallel, with each "head" focusing on different aspects of the input. The outputs of these multiple heads are then concatenated and transformed to produce the final output.

- **Multi-head attention** allows the model to attend to different parts of the sequence simultaneously and capture various relationships between tokens at different levels of abstraction.
- Each head learns to focus on different aspects of the sequence, leading to richer and more nuanced representations.

#### 5. Example: 2 Heads of Multi-Head Attention
Consider an input sentence: **"The cat sat on the mat"**. Let's focus on the token **"sat"**.

- **Query (Q)**: For the token "sat", we compute a Query vector.
- **Keys (K)**: A Key vector is computed for every token in the sentence, including "sat".
- **Values (V)**: A Value vector is likewise computed for every token; it carries the content that gets mixed into the output.

#### Process for each Attention Head:
#### Head 1:
- The **Query** for "sat" is compared with the **Keys** of all tokens in the sentence.
- Attention scores are computed, and the **Values** corresponding to those tokens are weighted accordingly.
- The output of **Head 1** is a weighted sum of the **Values (V)**.

#### Head 2:
- The same process happens again, but with different learned weights for **Query**, **Key**, and **Value** vectors (head-specific).
- The output of **Head 2** is another weighted sum of the **Values (V)**, but focusing on different relationships.

#### 6. Final Output of Multi-Head Attention
After computing attention for both heads, the outputs are concatenated and passed through a linear layer to produce the final output.
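The two-head walkthrough above can be sketched in NumPy as follows. This is a minimal illustration with random weights, six token vectors standing in for "The cat sat on the mat", and no masking; the function and variable names are assumptions made for this example.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(m):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return m.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores)
    heads = weights @ v                                   # (heads, seq, d_head)
    # Concatenate the heads and apply the final linear layer.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 8, 2
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

output, weights = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads)
print(output.shape)   # (6, 8): one output vector per token
print(weights.shape)  # (2, 6, 6): one attention map per head
```

Each head sees a different slice of the projected vectors, so `weights[0]` and `weights[1]` generally differ, which corresponds to the heads focusing on different relationships.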

#### Example Code for Attention Mechanism

#### Implementation of Scaled Dot-Product Attention
```python
import numpy as np

# Define a simple scaled dot-product attention mechanism
def attention(query, key, value):
    # Compute attention scores (dot product) and scale by sqrt(d_k)
    scores = np.dot(query, key.T) / np.sqrt(key.shape[-1])
    # Apply a numerically stable softmax over the keys
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(scores)
    attention_weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
    # Compute output as the weighted sum of the values
    output = np.dot(attention_weights, value)
    return attention_weights, output

# Toy example: one query attending over three key/value pairs
query = np.array([[1.0, 0.0]])  # Example query
key = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])  # Example keys
value = np.array([[1.0, 1.0], [0.5, 0.5], [0.0, 0.0]])  # Example values

weights, output = attention(query, key, value)
print("Attention Weights:\n", weights)
print("Output:\n", output)
```

#### Visualization of Attention in Python
You can use the `transformers` library to load pre-trained models and a visualization tool such as `bertviz` to explore their attention patterns.

Example Using bertviz

```bash
pip install bertviz
```



```python
from transformers import AutoTokenizer, AutoModel
from bertviz import head_view
from IPython.display import display, HTML

# Load model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

# Tokenize input
sentence = "The cat sat on the mat."
inputs = tokenizer(sentence, return_tensors="pt")

# Forward pass
outputs = model(**inputs)
print(outputs.attentions)  # tuple with one attention tensor per layer

# Extract tokens
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

# Add custom styles for size
display(HTML("""
    <style>
        .bertviz {
            width: 1500px; /* Set desired width */
            height: 1000px; /* Set desired height */
            margin: auto;
        }
    </style>
"""))

# Visualize attention
head_view(
    attention=outputs.attentions,
    tokens=tokens
)
```



#### 3. What is KV Cache? (Simplified Explanation)

The Key-Value (KV) Cache is an optimization technique used during inference in transformer-based models, such as GPT. To understand it, let's break it down step by step, starting with the foundation of Keys (K), Queries (Q), and Values (V) in the Attention Mechanism:

How Attention Works (Key, Query, and Value):
- Input Representation: Each token (word or subword) in a sequence is represented as an embedding vector.

- Key (K), Query (Q), and Value (V) Vectors:

    - Query Vector (Q): Represents the "question" each token is asking about the sequence (e.g., "What is the context around me?").
    - Key Vector (K): Represents the "identity" of each token, helping other tokens decide how relevant it is to them.
    - Value Vector (V): Contains the actual content or information that is passed along if a token is found to be relevant.

- Attention Calculation:

    - Attention determines how much focus a token should give to others in the sequence.
    - It computes a score for each token pair using the dot product of Query (Q) and Key (K) vectors.
    - These scores are then normalized (softmax) to produce attention weights, which are used to combine the Value (V) vectors and produce the output.

#### Analogy
Imagine you are reading a book and trying to summarize it. Instead of starting fresh each time you summarize a new chapter, you keep notes (a cache) about what you’ve already summarized. This saves time because you don’t need to re-read everything—just glance at your notes to remember the important parts.

In the world of Large Language Models (LLMs), like ChatGPT, a Key-Value (KV) cache serves a similar purpose. It helps the model remember what it has already processed during a conversation or text generation. This makes responses faster and more efficient because the model doesn’t need to recompute information for the parts of the text it has already seen.

#### Why is KV Cache Useful?

When generating text (inference), the model processes tokens one at a time. Without caching, the model would recompute the Key and Value vectors for all tokens at every step. This is inefficient.

The KV cache stores the Key and Value vectors for tokens that have already been processed. This way:

- The model doesn’t need to recompute K and V for tokens in previous steps.
- Instead, it can reuse the cached K and V, combining them with the Query (Q) vector for the current token to compute attention.

It also:

- Speeds Up Responses: By remembering past context, the model can quickly focus on generating new text instead of recalculating everything.
- Saves Computational Resources: Avoids redundant calculations, making the model more efficient.
- Keeps Context: Attention still sees all earlier tokens through the cached K and V, so caching changes the cost of generation, not its result.

In the context of the Key-Value (KV) cache during inference (like in GPT models), the Key (K) and Value (V) vectors remain fixed once computed for a token, while the Query (Q) vector changes at each step of the token generation process.

#### Here's why:

- **Key (K) and Value (V)** represent the context and content for each token in the sequence.
  - Once a token's Key and Value are computed, they don’t change.
  - The **Key** essentially serves as a static reference to the token, while the **Value** contains the information (the content) that might be used by other tokens.
  - These **Key** and **Value** vectors are cached and used for subsequent steps, saving computation.

- **Query (Q)**, on the other hand, is the vector that changes with each new token being processed.
  - The **Query** represents the current token's "question" or "focus" in relation to the other tokens.
  - As the model generates or processes the next token in the sequence, the **Query** vector updates to reflect the current token's relationship with the previously generated tokens (or the input context).

#### In Simple Terms:
- **K and V** are precomputed and stored (cached) once per token.
- **Q** is dynamically computed at each step as the model generates or processes new tokens.

#### Why Is This Important?
By caching the **Key** and **Value** vectors, the model avoids recalculating them for every token in the sequence, making the process of generating new tokens much faster and more efficient.
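A quick count shows the size of the saving. The sketch below assumes a simplified setting in which the sequence holds t tokens at step t and counts how many per-token K/V projections a single layer performs; the function name is hypothetical.

```python
def kv_projections(num_steps, use_cache):
    """Count per-token K/V projections over `num_steps` generation steps."""
    if use_cache:
        return num_steps                 # only the new token is projected each step
    return sum(range(1, num_steps + 1))  # all t tokens are re-projected at step t

for steps in (10, 100, 1000):
    without = kv_projections(steps, use_cache=False)
    cached = kv_projections(steps, use_cache=True)
    print(f"{steps:>5} steps: {without:>7} without cache, {cached:>5} with cache")
```

Without the cache the work grows quadratically with the number of generated tokens; with it, the projection work grows linearly.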

#### Example:
- In the first pass, the model computes **K**, **V** for the first token ("Once").
- As it moves to the second token ("upon"), the **K** and **V** for "Once" are reused from the cache while a new **Query** is computed for "upon."
- This process continues, and the model can focus on using previously computed information (**K** and **V**) to efficiently generate the next tokens.

This caching mechanism is especially beneficial for tasks involving long sequences or autoregressive generation (like GPT), where tokens are generated one by one.


#### Example: KV Cache in Python
Here’s how a simplified KV cache works during text generation using the Hugging Face transformers library.

#### Step-by-Step Explanation of KV Cache in the Example:
- First Pass (Initial Context):

    - The model computes the K, Q, and V vectors for the input text ("Once upon a time").
    - The Key and Value vectors for this input are stored in the cache (kv_cache).
- Subsequent Passes (Token-by-Token Generation):

    - For each new token generated, the Query vector for the current token is computed.
    - The cached Key and Value vectors are reused to compute attention without recalculating for previous tokens.
    - This saves time and avoids redundant computation.

- Final Output:

    - The model uses the cached K and V vectors to efficiently generate additional tokens, combining them with the Q vector of the current token.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load a pre-trained causal language model (like GPT-2)
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Input text (initial context)
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt")

# Generate text with KV caching
print("Generating with KV caching...\n")
output_ids = []
kv_cache = None  # Initialize empty cache

# Simulate iterative (greedy) generation of tokens
with torch.no_grad():
    for _ in range(5):  # Generate 5 tokens
        outputs = model(**inputs, use_cache=True, past_key_values=kv_cache)
        next_token_id = outputs.logits[:, -1, :].argmax(dim=-1)  # Pick the most likely next token
        output_ids.append(next_token_id.item())
        kv_cache = outputs.past_key_values  # Update KV cache
        # Feed only the new token; earlier tokens are covered by the cache
        inputs = {"input_ids": next_token_id.unsqueeze(0)}

# Decode the generated sequence
generated_text = tokenizer.decode(output_ids, skip_special_tokens=True)
print(f"Generated text: {input_text}{generated_text}")
```

Output:

```
Generating with KV caching...

Generated text: Once upon a time, the world was a
```


#### How It Works
- `past_key_values`: This is the KV cache. It stores the keys and values for each attention layer, helping the model "remember" prior context.
- `use_cache=True`: Enables the KV cache for efficient token generation.
- Iterative Generation: Tokens are generated one by one, reusing the cached context each time to avoid recalculating.

#### Real-World Analogy
Think of KV cache as sticky notes on a book:

- Key (K): Acts like a label on the sticky note that describes what the note is about (e.g., “Main idea of Chapter 1”).
- Value (V): The actual content of the sticky note (e.g., “The story begins with a brave knight.”).

By referring to these notes (cache), you don’t need to re-read the entire chapter each time.

#### Visualization Libraries for Advanced Use
To visualize how the KV cache works in action:

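- Hugging Face transformers: Model outputs expose `past_key_values`, which can be inspected directly (for example, to check how the cached tensors grow as tokens are generated).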