# 📓 Extending an LLM's Context Window

1.  **Setup:** Install libraries and ensure a GPU is active.
2.  **Load Model:** Load a 4-bit quantized version of Llama 3 8B.
3.  **Extend Context:** Apply RoPE scaling on-the-fly during model loading.
4.  **Validate:** Test the new context window with a "Needle in a Haystack" evaluation and instruct the model to return its findings as JSON.

---
## ⚙️ Step 1: Environment Setup

First, we install the necessary libraries from Hugging Face. We also need `bitsandbytes` for quantization, that lets us run this large model on a free T4 GPU.

*Make sure to enable the T4 GPU runtime by going to **Runtime → Change runtime type → T4 GPU**.*

In [None]:
# Step 1: Install all the necessary libraries
# The `-q` flag makes the output less noisy (quiet).

print("⏳ Installing required libraries...")
!pip install transformers torch accelerate bitsandbytes einops -q

print("✅ Installation Complete!")

⏳ Installing required libraries...
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m4.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m48.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m43.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m883.7/883.7 kB[0m [31m40.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.5/211.5 MB[0m [31m6.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.3/56.3 MB[0m [31m18.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m127.9/127.9 MB[0m [31m8.1 MB/s[0m eta [36m0:00:00[0m
[2K   

---
## 🛠️ Step 2: Model Loading and Configuration

### Hugging Face Login
To access Llama 3, you need a Hugging Face account.
1.  Go to the [Meta Llama 3 8B Instruct model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and accept the license terms.
2.  Get an access token from your Hugging Face profile: **Settings → Access Tokens**.

The cell below will prompt you to paste your token.

In [1]:
# You'll need a Hugging Face account and an access token.
from huggingface_hub import login

print("🔑 Please log in to your Hugging Face account.")
login()

🔑 Please log in to your Hugging Face account.


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

### Quantization and Model Definition

To make this powerful model fit onto the Colab T4 GPU (~15 GB VRAM), we'll use **4-bit quantization**. This shrinks the model's memory footprint with a minimal impact on its performance.

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig

# The specific model we'll use

model_id = "Qwen/Qwen1.5-1.8B-Chat"

print("✅ Configuration is ready.")

✅ Configuration is ready.


---
## 🧠 Step 3: Extend Context with RoPE Scaling

This is the core of our experiment. We will instruct the `transformers` library to dynamically adjust the model's **Rotary Position Embeddings (RoPE)**.

By setting `factor=4.0`, we "stretch" the model's original 8,192 token positional understanding to handle approximately **32,768 tokens**. This is done directly when we load the model.

- The model was originally trained to understand positions within a specific window (e.g., 4,000 tokens). Thanks to RoPE's flexible, relative approach, we don't have to re-train the model from scratch to handle more tokens.

 - Instead, during a special kind of fine-tuning, we do exactly what you said: we "adjust the angles." This technique is formally known as **Position Interpolation**.



Let's use the clock analogy:

* **Original Model (8k tokens):** Imagine for every new word, the model turned the clock hand by **1 minute**. It's used to the angles and relationships within this system.

* **Stretched Model (32k tokens):** With Position Interpolation, we tell the model, "From now on, for every new word, only turn the hand by **15 seconds**."



In [4]:
# Load the model, applying our quantization config and RoPE scaling simultaneously.
# The `device_map="auto"` argument will automatically place the model on the GPU.

print("⏳ Loading the model... (This may take a few minutes)")

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    rope_scaling={"type": "linear", "factor": 4.0} # ✨ This is the key line!
)

# Load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(model_id)

print("✅ Model and tokenizer loaded successfully!")

⏳ Loading the model... (This may take a few minutes)
✅ Model and tokenizer loaded successfully!


---
## 🎯 Step 4: Validate
1.  Creating a very long, repetitive text.
2.  Placing a unique sentence with a specific fact deep inside it.
3.  Asking the model a question

### Create the Test Prompt

In [5]:
def get_max_token_size(model_id: str) -> int:
    """
    Retrieves the maximum context window size for a given model from Hugging Face.

    Args:
        model_id: The identifier of the model on the Hugging Face Hub.

    Returns:
        The maximum number of tokens the model can handle.
    """
    # Load the model's configuration file.
    config = AutoConfig.from_pretrained(model_id)

    # The most common attribute is 'max_position_embeddings'.
    max_tokens = getattr(config, 'max_position_embeddings', None)

    # Fallback for some other models like Mistral that might define it differently
    if max_tokens is None and hasattr(config, 'sliding_window'):
        max_tokens = getattr(config, 'sliding_window')

    return max_tokens

In [6]:
get_max_token_size(model_id)

32768

In [7]:
# 1. Define the needle: the specific fact we want the model to find.
needle = "The best way to learn AI is by building hands-on projects."

# 2. Create a long haystack of repetitive text.
haystack = "The quick brown fox jumps over the lazy dog. This sentence is repeated to create a long context. " * 2000

# 3. Place the needle in the middle of the haystack.
text_to_test = haystack + "\n" + needle + "\n" + haystack

# Check how many tokens our test text has.
input_tokens = tokenizer(text_to_test, return_tensors="pt")
print(f"Total tokens in the 'haystack' text: {input_tokens.input_ids.shape[1]}")

# 4. Create the final prompt for the model, instructing it to return JSON.
prompt = f"""
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert information extraction assistant. You will be given a long text and a question. Your task is to find the answer within the text and return it in a structured JSON format.

Your JSON output must contain the following keys:
- "retrieved_fact": The exact sentence you found that answers the question.
- "source_verified": A boolean value, true if you found the answer in the text.
- "confidence_score": A float between 0.0 and 1.0 representing your confidence.<|eot_id|><|start_header_id|>user<|end_header_id|>

Here is the text:
---
Text: {text_to_test}
---
Question: Based on the text provided, what is the best way to learn AI?

Please provide your answer in the specified JSON format.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

# Tokenize the complete prompt and send it to the GPU
final_inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

print("✅ JSON-instructed prompt is ready and tokenized.")

Token indices sequence length is longer than the specified maximum sequence length for this model (80015 > 32768). Running this sequence through the model will result in indexing errors


Total tokens in the 'haystack' text: 80015


RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

### Generate and Parse the JSON Answer

Now we run the model, extract the text response, and parse it into a Python dictionary.

In [None]:
import json
import pprint

print("⏳ Generating the answer...")

# Generate the output from the model
outputs = model.generate(
    final_inputs.input_ids,
    max_new_tokens=150,  # Increase max tokens to ensure the full JSON object is generated
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.1, # Lower temperature for more predictable, structured output
    top_p=0.9,
)

# Decode the generated tokens into text
response_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Clean up the output to isolate the assistant's JSON response
try:
    # Find the start of the assistant's message
    json_part_str = response_text.split("<|end_header_id|>assistant<|end_header_id|>")[1].strip()
    # Find the start and end of the JSON object
    json_start = json_part_str.find('{')
    json_end = json_part_str.rfind('}') + 1
    json_string = json_part_str[json_start:json_end]

    # Parse the string into a Python dictionary
    parsed_json = json.loads(json_string)

    print("\n--- ✅ Successfully Parsed JSON Output ---")
    pprint.pprint(parsed_json)
    print("-----------------------------------------")

except (json.JSONDecodeError, IndexError) as e:
    print("\n--- ❌ Failed to parse JSON ---")
    print(f"Error: {e}")
    print("Raw model output:")
    print(response_text.split("<|end_header_id|>assistant<|end_header_id|>")[1].strip())
    print("-----------------------------")