### Cache-Augmented Generation

### üîç Overview
Retrieval-Augmented Generation (RAG) enhances language models by integrating external knowledge but faces challenges like retrieval latency, errors, and system complexity.

Cache-Augmented Generation (CAG) addresses these by preloading relevant data into the
model's context, leveraging modern LLMs' extended context windows and caching runtime parameters.

This eliminates real-time retrieval during inference, enabling direct response generation.

<img src="https://raw.githubusercontent.com/genieincodebottle/genaicodelab/main/cache_augumeted_generation/images/cag_diagram.png" width="300" height="400" alt="CAG">


### ‚ú® Advantages of CAG
* **Reduced Latency:** Faster inference by removing real-time retrieval.
* **Improved Reliability:** Avoids retrieval errors and ensures context relevance.
* **Simplified Design:** Offers a streamlined, low-complexity alternative to RAG with comparable or better performance.

### ‚ö†Ô∏è Limitations of CAG
* **Knowledge Size Limits:** Requires fitting all relevant data into the context window, unsuitable for extremely
large datasets.
* **Context Length Issues:** Performance may degrade with very long contexts.

### üìö References
* [Research Paper](https://arxiv.org/abs/2412.15605)
* https://github.com/hhhuang/CAG/tree/main


### Step - 1 Install Required Packages

In [1]:
!pip install -qU transformers accelerate bitsandbytes sentence-transformers torch plotly python-dotenv

[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m44.4/44.4 kB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m9.7/9.7 MB[0m [31m28.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m336.6/336.6 kB[0m [31m12.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m69.7/69.7 MB[0m [31m9.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m275.7/275.7 kB[0m [31m16.9 MB/s[0m eta [36m0:00:00[0m
[?

### Step - 2 Import Libraries

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache
from sentence_transformers import SentenceTransformer
import pandas as pd
import time

# Check if we can use GPU
print(f"GPU available: {torch.cuda.is_available()}")

GPU available: True


### Step - 3 Load the Model

* How to get your Hugging Face token and store it in Google Colab:
  
  * Visit the [Hugging Face Tokens Page](https://huggingface.co/settings/tokens)
  * Create a new token with read access
  * Copy the Huggigface token
  * In Google Colab, navigate to the Secret section, add a secret with the name HF_TOKEN, and paste the token you copied in the previous step.
  * <img src="https://raw.githubusercontent.com/genieincodebottle/genaicodelab/main/cache_augumeted_generation/images/google_cloab_secret_page.png" width="300" height="200" alt="Colab Secret Screenshot">

* Models
  * [Llama3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B) - For Llama model you need to accept EULA(End User Licence Agreement) at Huggingface site if not accepted earlier.
  * [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) - Fine-tuned dense model using DeepSeek-R1 data, showing exceptional benchmark performance for smaller distilled model

In [3]:
model_name = "meta-llama/Llama-3.2-1B-Instruct"
#model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

def load_model(model_name="meta-llama/Llama-3.2-1B-Instruct"):
    """Load the language model and tokenizer"""

    # Set up model settings based on available hardware
    model_settings = {
        "device_map": "auto",
        "torch_dtype": torch.float16 if torch.cuda.is_available() else torch.float32,
    }

    # Load model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(model_name, **model_settings)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Load BERT(similarity) model
    similarity_model = SentenceTransformer('all-MiniLM-L6-v2')

    # Move model to GPU if available
    if torch.cuda.is_available():
        model.cuda()

    return model, tokenizer, similarity_model

# Load the model
model, tokenizer, similarity_model = load_model()

config.json:   0%|          | 0.00/877 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

### Step - 4 Download and Load Dataset

In [None]:
# Download the sample dataset
!wget https://raw.githubusercontent.com/genieincodebottle/generative-ai/main/cache_augumeted_generation/datasets/sample_qa_dataset.csv

# Load the dataset
df = pd.read_csv("sample_qa_dataset.csv")
print("Dataset preview:")
display(df.head())

--2025-01-24 00:31:17--  https://raw.githubusercontent.com/genieincodebottle/genaicodelab/main/cache_augumeted_generation/datasets/sample_qa_dataset.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 43105 (42K) [text/plain]
Saving to: ‚Äòsample_qa_dataset.csv‚Äô


2025-01-24 00:31:17 (4.35 MB/s) - ‚Äòsample_qa_dataset.csv‚Äô saved [43105/43105]

Dataset preview:


Unnamed: 0,topic,text,sample_question,sample_ground_truth
0,Setting Up a Mobile Device for Company Email,**Setting Up a Mobile Device for Company Email...,"""How do I set up my company email on my mobile...",To set up your company email on your mobile de...
1,Resetting a Forgotten PIN,**Resetting a Forgotten PIN**\n\nIf you have f...,"I forgot my PIN, how can I reset it?","Don't worry, I'm here to help To reset your fo..."
2,Configuring VPN Access for Remote Workers,**Configuring VPN Access for Remote Workers**\...,How do I set up VPN access on my laptop so I c...,To set up VPN access on your laptop and access...
3,Troubleshooting Issues with Microsoft Office,**Troubleshooting Issues with Microsoft Office...,"""My Microsoft Word keeps freezing every time I...",I'd be happy to help you troubleshoot the issu...
4,Setting Up a Conference Call on Cisco Webex,"To set up a conference call on Cisco Webex, fo...",How do I set up a conference call on Cisco Web...,To set up a conference call on Cisco Webex wit...


### Step - 5 Define Helper Functions

In [6]:
def create_prompt(question, context=""):
    """Create a prompt for the model"""
    if context:
        return f"""
        Context information:
        {context}

        Question: {question}
        Answer:"""
    return f"Question: {question}\nAnswer:"

def prepare_kv_cache(
        documents: str,
        instruction: str = None
    ) :
        """
        Prepare the key-value cache for efficient text generation by pre-computing attention patterns.

        Args:
            documents: Context text to be used for generation
            instruction: Optional specific instruction for response generation
                        (defaults to short answer instruction)

        Returns:
            tuple: (past_key_values, execution_time)
                - past_key_values: Cached attention patterns
                - execution_time: Time taken to prepare cache
        """
        # Track execution time for performance monitoring
        start_time = time.time()

        # Set default instruction if none provided
        instruction = instruction or "Answer the question with a short answer."

        # Construct prompt with system context, user documents and instruction
        # Example: If documents="The cat is black" and instruction="Describe color"
        # Prompt becomes: system role + context + instruction + question marker
        prompt = f"""
        <|begin_of_text|>
        <|start_header_id|>system<|end_header_id|>
        You are an assistant for giving short answers based on given context.
        <|eot_id|>
        <|start_header_id|>user<|end_header_id|>
        Context information is below.
        ------------------------------------------------
        {documents}
        ------------------------------------------------
        {instruction}
        Question:
        """

        try:
            # Convert prompt to token IDs and move to model's device
            # Example: "The cat" -> [token_ids]
            input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)

            # Initialize empty cache for storing attention patterns
            past_key_values = DynamicCache()

            # Generate initial attention patterns by disabling gradient calculation for inference
            with torch.no_grad():
                outputs = model(
                    input_ids=input_ids,
                    past_key_values=past_key_values,
                    use_cache=True,  # Enable caching mechanism
                    output_attentions=False,  # Don't need attention matrices
                    output_hidden_states=False  # Don't need hidden states
                )

            if not outputs.past_key_values or len(outputs.past_key_values) == 0:
                raise ValueError("Empty KV cache generated")

            return outputs.past_key_values, time.time() - start_time

        except Exception as e:
            print(f"Error preparing KV cache: {str(e)}")
            return DynamicCache(), time.time() - start_time

def generate_answer(model, tokenizer, prompt, past_key_values, max_tokens=300):
    """
    Generate text using a language model with greedy decoding strategy.

    Args:
        model: The language model to use for generation
        tokenizer: Tokenizer for encoding/decoding text
        prompt: Input text to generate from
        past_key_values: Cached key/value attention pairs from previous forward passes
        max_tokens: Maximum number of tokens to generate (default: 300)

    Returns:
        str: Generated text response
    """
    # Convert prompt text to token IDs and move to model's device
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    origin_ids = input_ids  # Store original input length for trimming later
    output_ids = input_ids.clone()  # Initialize output with input tokens
    next_token = input_ids  # Current token for next prediction

    # Generate tokens one at a time using greedy decoding
    with torch.no_grad():  # Disable gradient calculation for inference
        # Generate tokens sequentially until max_tokens limit
        # Example: If max_tokens=3, loop runs up to 3 times to generate:
        # 1st loop: "The" -> 2nd: "cat" -> 3rd: "sits"
        for _ in range(max_tokens):
            # Get model predictions using cached attention values
            outputs = model(
                input_ids=next_token,
                past_key_values=past_key_values,
                use_cache=True  # Enable caching for faster generation
            )

            # Select token with highest probability (greedy approach)
            # Example: If logits are [-2.0, 0.5, 1.2] for tokens ["the", "cat", "dog"]
            # argmax will select "dog" (index 2) as it has highest value 1.2
            next_token_logits = outputs.logits[:, -1, :]  # Get logits for next token position
            next_token = next_token_logits.argmax(dim=-1).unsqueeze(-1)  # Choose token with highest logit
            next_token = next_token.to(model.device)  # Ensure token tensor is on same device as model

            # Update cache and append new token to output
            # Example: If previous cached attention is for "The cat" and new token is "sits"
            # - past_key_values stores attention patterns for "The cat" to avoid recomputation
            # - output_ids gets extended from ["The", "cat"] to ["The", "cat", "sits"]
            past_key_values = outputs.past_key_values  # Store attention patterns for faster next token generation
            output_ids = torch.cat([output_ids, next_token], dim=1)  # Append new token to growing sequence

            # Stop if end-of-sequence token is generated
            if next_token.item() in model.config.eos_token_id:
                break

    # Remove the original prompt tokens from the generated sequence
    # Example: If prompt is "Tell me a story" (4 tokens) and model generates 10 tokens,
    # this keeps only the 6 newly generated tokens (slices from index 4 onwards)
    # [:, origin_ids.shape[-1]:] means: keep all batches (:) and slice tokens after prompt length
    output = output_ids[:, origin_ids.shape[-1]:]  # Trim prompt tokens
    return tokenizer.decode(output[0], skip_special_tokens=True)

def calculate_similarity(text1, text2, similarity_model):
    """
    Calculate cosine similarity between two text embeddings using a pre-trained model.

    Args:
        text1: First text input
        text2: Second text input
        similarity_model: Pre-trained model for generating text embeddings

    Returns:
        float: Cosine similarity score between 0 and 1
    """
    # Convert texts to embeddings using the model
    # Example: "The cat" -> [0.1, 0.2, 0.3, ...] (high-dimensional vector)
    embedding1 = similarity_model.encode(text1, convert_to_tensor=True)
    embedding2 = similarity_model.encode(text2, convert_to_tensor=True)

    # Add batch dimension and calculate cosine similarity
    # Example: If embeddings are [0.1, 0.2] and [0.2, 0.4]
    # Similarity would be ~1.0 (very similar) due to proportional values
    similarity = torch.nn.functional.cosine_similarity(embedding1.unsqueeze(0),
                                                     embedding2.unsqueeze(0))

    # Convert tensor to Python float
    return similarity.item()

### Step - 6 Test the System

In [8]:
# Get a sample of questions (adjust number as needed , this dataset has max 10 questions)
num_questions = 5
test_questions = df['sample_question'].iloc[:num_questions].tolist()
ground_truths = df['sample_ground_truth'].iloc[:num_questions].tolist()

# Prepare context (combine all relevant texts)
context = '\n\n'.join(df["text"].tolist())

knowledge_cache, prep_time = prepare_kv_cache(context)
print(f"KV Cache prepared in {prep_time:.2f} seconds")

# Store results
results = []

# Process each question
for question, ground_truth in zip(test_questions, ground_truths):
    # Start timing
    start_time = time.time()

    # Create prompt and generate answer
    prompt = create_prompt(question, context)
    answer = generate_answer(model, tokenizer, prompt, knowledge_cache)

    # Calculate generation time
    generation_time = time.time() - start_time

    # Calculate similarity with ground truth
    similarity = calculate_similarity(answer, ground_truth, similarity_model)

    # Store results
    results.append({
        'Question': question,
        'Generated Answer': answer,
        'Ground Truth': ground_truth,
        'Similarity': similarity,
        'Generation Time': generation_time
    })

# Convert results to DataFrame
results_df = pd.DataFrame(results)
display(results_df)

Cache prepared in 0.11 seconds


Unnamed: 0,Question,Generated Answer,Ground Truth,Similarity,Generation Time
0,"""How do I set up my company email on my mobile...","""To set up your company email on your mobile ...",To set up your company email on your mobile de...,0.815589,4.808354
1,"I forgot my PIN, how can I reset it?","To reset your PIN, follow the steps outlined ...","Don't worry, I'm here to help To reset your fo...",0.790417,5.061365
2,How do I set up VPN access on my laptop so I c...,"To set up VPN access on your laptop, follow t...",To set up VPN access on your laptop and access...,0.894838,27.589754
3,"""My Microsoft Word keeps freezing every time I...","To fix the issue, try the following steps:\n\...",I'd be happy to help you troubleshoot the issu...,0.768436,32.585256
4,How do I set up a conference call on Cisco Web...,To set up a conference call on Cisco Webex wi...,To set up a conference call on Cisco Webex wit...,0.980663,38.30336


### Step - 7 Visualize Results

In [9]:
import plotly.graph_objects as go

# Create performance visualization
fig = go.Figure()

# Add generation time trace
fig.add_trace(go.Scatter(
    x=list(range(len(results))),
    y=results_df['Generation Time'],
    name='Generation Time',
    mode='lines+markers'
))

# Add similarity trace
fig.add_trace(go.Scatter(
    x=list(range(len(results))),
    y=results_df['Similarity'],
    name='Answer Similarity',
    mode='lines+markers',
    yaxis='y2'
))

# Update layout
fig.update_layout(
    title='Performance Metrics',
    xaxis_title='Question Number',
    yaxis_title='Generation Time (seconds)',
    yaxis2=dict(
        title='Similarity Score',
        overlaying='y',
        side='right'
    )
)

fig.show()

# Print summary statistics
print("\nSummary Statistics:")
print(f"Average Generation Time: {results_df['Generation Time'].mean():.2f} seconds")
print(f"Average Similarity Score: {results_df['Similarity'].mean():.2f}")


Summary Statistics:
Average Generation Time: 21.67 seconds
Average Similarity Score: 0.85
