<a href="https://colab.research.google.com/github/VishaalChandrasekar0203/LLMs-Cache-Prefetching-/blob/main/Cache_and_Prefetching_for_LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#@title Setup and Imports
!pip install transformers




In [None]:
import time
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Check if a GPU is available and set the device accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Running on:", device)

# Load the GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.to(device)
model.eval()

# Global cache for embeddings
embedding_cache = {}

# Simulated function to retrieve token embeddings with delay if not cached
def get_embedding(token_id):
    # Check if token embedding is already cached
    if token_id in embedding_cache:
        return embedding_cache[token_id]
    else:
        # Simulate a delay for a cache miss (e.g., 10ms per token)
        time.sleep(0.01)
        # Retrieve the embedding from the model's embedding layer
        embedding = model.transformer.wte.weight[token_id].detach().to(device)
        embedding_cache[token_id] = embedding
        return embedding

# Function to process a prompt and simulate the embedding retrieval step
def process_prompt(prompt):
    tokens = tokenizer.encode(prompt)
    embeddings = []
    start_time = time.perf_counter()  # monotonic clock intended for measuring short intervals
    for token in tokens:
        emb = get_embedding(token)
        embeddings.append(emb)
    elapsed = time.perf_counter() - start_time
    print(f"Processed prompt in {elapsed:.4f} seconds (Tokens: {len(tokens)})")
    return embeddings

# Example prompt
prompt = "The quick brown fox jumps over the lazy dog."

# ----- Experimental Setup -----
print("### Baseline Run: Cold Cache ###")
# Clear the cache in place to simulate cold cache conditions
embedding_cache.clear()
_ = process_prompt(prompt)

print("\n### Warm Cache Run: After Prefetching ###")
# Now the cache is warm from the previous run; processing the same prompt again
_ = process_prompt(prompt)

# Optional: Demonstrate prefetching on a new prompt
new_prompt = "A swift auburn fox leaps over a tired canine."
print("\n### New Prompt with Prefetching Strategy ###")
# Proactively prefetch embeddings for all tokens in new_prompt
new_tokens = tokenizer.encode(new_prompt)
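# Prefetch embeddings ahead of time. This loop is synchronous and stands in for
# asynchronous prefetching; it pays the miss delay here, outside the timed section.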
for token in new_tokens:
    get_embedding(token)
# Now process the prompt and measure the time with a warm cache
_ = process_prompt(new_prompt)


Running on: cpu


### Baseline Run: Cold Cache ###
Processed prompt in 0.1095 seconds (Tokens: 10)

### Warm Cache Run: After Prefetching ###
Processed prompt in 0.0000 seconds (Tokens: 10)

### New Prompt with Prefetching Strategy ###
Processed prompt in 0.0000 seconds (Tokens: 12)


Explanation of the Code and Framework
Model & Device Setup:

We install transformers and import time, torch, and the GPT‑2 tokenizer and model classes.

GPT‑2 and its tokenizer are loaded and moved to the appropriate device (CPU or GPU).

Caching Mechanism:

A global dictionary (embedding_cache) stores token embeddings.

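The function get_embedding simulates a cache miss by sleeping for 10 ms (time.sleep(0.01)) when the embedding isn’t in the cache. It then reads the token's row from the model's embedding matrix (model.transformer.wte.weight) and stores it. Subsequent requests for the same token are returned immediately from the dictionary.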

Processing Prompts:

The process_prompt function tokenizes the input prompt and retrieves embeddings for each token, measuring the elapsed time.

Experimental Comparison:

The first run with an empty cache simulates the “cold” cache scenario.

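The second run on the same prompt uses the warmed cache. In the output above it drops from 0.1095 seconds to a time that rounds to 0.0000 seconds, because every lookup is a dictionary hit and no delay is incurred.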

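An additional example shows how prefetching can be simulated by iterating through the tokens of a new prompt before processing it. The prefetch loop itself pays the miss delay for every uncached token, but it runs outside the timed section, so the reported 0.0000 seconds measures only the warm lookups and not the total cost.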

Assumptions Made:

Cache Granularity: We assume the critical performance bottleneck is at the token embedding lookup level.

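Simulated Delay: A fixed delay (10 ms) per uncached token stands in for the latency of a cache miss. This is far larger than a real memory access, which makes the effect easy to see: a cold run over 10 distinct tokens takes at least 0.1 seconds, consistent with the 0.1095 seconds measured above. Real-world delays will vary based on hardware.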
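
As a rough check on the simulated delay, the cold-cache time can be estimated from the number of distinct uncached tokens. The snippet below is a minimal sketch; expected_cold_time and the token ids are hypothetical and not part of the notebook, and the estimate ignores the (small) cost of the lookups themselves.

```python
MISS_DELAY_S = 0.01  # same fixed delay as time.sleep(0.01) in get_embedding


def expected_cold_time(token_ids, cached=()):
    """Estimate retrieval time: one delay per distinct token not yet cached."""
    misses = set(token_ids) - set(cached)
    return len(misses) * MISS_DELAY_S


# Hypothetical token ids: 10 distinct tokens, nothing cached yet
cold = expected_cold_time(range(10))
# The same tokens once they are all cached
warm = expected_cold_time(range(10), cached=range(10))
print(f"cold: {cold:.2f} s, warm: {warm:.2f} s")
```

A repeated token only counts once, since its second lookup is already a hit.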

Prefetching Strategy: The simple prefetching demonstrated here assumes that the prompt is known before it is processed, so its token embeddings can be loaded in advance. The loop in the example is synchronous. In a more complex system, one might use asynchronous prefetching or predictive models to load not just embeddings but also intermediate activations.
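
One way asynchronous prefetching could look is sketched below with a thread pool. This is an illustration under stated assumptions, not the notebook's method: load_embedding is a hypothetical slow loader that returns a placeholder vector instead of a real GPT‑2 embedding, and fetch and prefetch_async are names introduced here.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

cache = {}
cache_lock = threading.Lock()


def load_embedding(token_id):
    """Hypothetical slow loader standing in for a cache miss."""
    time.sleep(0.01)
    return [float(token_id)]  # placeholder vector, not a real embedding


def fetch(token_id):
    """Return the cached value, loading it on a miss."""
    with cache_lock:
        if token_id in cache:
            return cache[token_id]
    value = load_embedding(token_id)  # slow part runs outside the lock
    with cache_lock:
        # Keep the first stored value if another thread finished earlier
        return cache.setdefault(token_id, value)


def prefetch_async(token_ids, executor):
    """Submit one load per distinct token and return immediately."""
    return [executor.submit(fetch, t) for t in set(token_ids)]


token_ids = [1, 2, 3, 2, 1]  # hypothetical token ids
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = prefetch_async(token_ids, pool)
    # The caller could do other work here while the loads are in flight.
    for future in futures:
        future.result()  # wait for completion before the timed lookup

start = time.perf_counter()
vectors = [fetch(t) for t in token_ids]
elapsed = time.perf_counter() - start
print(f"Warm lookup took {elapsed:.4f} s for {len(vectors)} tokens")
```

Because the loads overlap, the total prefetch time is closer to one delay than to one delay per token, which is the benefit the synchronous loop in the notebook cannot show.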

Extending the Framework:

Dynamic Profiling: In a full implementation, integrate a profiler to measure cache hit/miss rates and memory access times.
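
A minimal sketch of what such profiling could track is shown below: a dictionary-backed cache that counts hits and misses and accumulates the time spent on misses. ProfiledCache and its loader argument are hypothetical names; in the notebook, the loader would be the embedding read inside get_embedding.

```python
import time


class ProfiledCache:
    """Dictionary cache that records hit/miss counts and time spent on misses."""

    def __init__(self, loader):
        self.loader = loader      # called on a miss to produce the value
        self.store = {}
        self.hits = 0
        self.misses = 0
        self.miss_time = 0.0      # seconds spent inside the loader

    def get(self, key):
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        start = time.perf_counter()
        value = self.loader(key)
        self.miss_time += time.perf_counter() - start
        self.store[key] = value
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Toy loader for illustration only; keys 1 and 1 again are hits on repeat
profiled = ProfiledCache(loader=lambda key: key * 2)
for key in [1, 2, 1, 3, 1]:
    profiled.get(key)
print(profiled.hits, profiled.misses, profiled.hit_rate())  # prints: 2 3 0.4
```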

Advanced Prefetching: Incorporate a predictive model that leverages the attention mechanism or sequence analysis to decide which parts of the model’s memory should be prefetched.
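
As a much simpler stand-in for such a predictive model, the sketch below uses bigram counts: it records which token ids have followed each token and suggests the most frequent followers as prefetch candidates. It does not use the attention mechanism, and BigramPrefetcher is a hypothetical name introduced for illustration.

```python
from collections import Counter, defaultdict


class BigramPrefetcher:
    """Suggest likely next tokens from counts of previously seen token pairs."""

    def __init__(self):
        self.followers = defaultdict(Counter)

    def observe(self, token_ids):
        """Record each adjacent (previous, next) pair in a token sequence."""
        for prev, nxt in zip(token_ids, token_ids[1:]):
            self.followers[prev][nxt] += 1

    def predict(self, token_id, k=2):
        """Return up to k token ids that most often followed token_id."""
        counts = self.followers.get(token_id)
        if counts is None:
            return []
        return [tok for tok, _ in counts.most_common(k)]


prefetcher = BigramPrefetcher()
prefetcher.observe([5, 7, 5, 7, 5, 9])  # hypothetical token ids
print(prefetcher.predict(5))       # prints: [7, 9]
print(prefetcher.predict(42))      # prints: []
```

The predicted ids could then be passed to get_embedding ahead of time, so that the tokens most likely to come next are already cached when they are requested.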

Hardware Integration: Consider hardware-level caching strategies by profiling on different architectures and potentially interfacing with lower-level APIs for memory management.
