## 1. Knowledge Base Creation

This section defines the custom Knowledge Base (KB) used for the RAG system.  
The topic is a fictional late-2024 research paper titled:

**“Adaptive Retrieval-Augmented Generation for Memory-Constrained Edge Devices” (Lin & Moretti, 2024)**

The KB contains three detailed paragraphs summarizing the paper.

In [1]:
kb_text = """
In late 2024, Lin and Moretti introduced an approach called Adaptive RAG, designed specifically for deployment on memory-constrained edge devices such as IoT sensors, mobile robots, and compact industrial controllers. Traditional RAG systems rely on large vector stores and GPU-accelerated embeddings, making them difficult to run on hardware with limited RAM and no dedicated accelerators. Adaptive RAG addresses this by using low-dimensional (128–256d) embeddings and a dynamic sliding-window vector store that automatically discards less relevant vectors to maintain a constant memory footprint. The system also employs context-sensitive retrieval, selecting only the top-1 or top-2 chunks to minimize token expansion during generation.

The paper introduces a novel module called the Relevance-Weighted Cache (RWC), which tracks which knowledge fragments have been most frequently and recently used by the model. When a new query arrives, the RWC assigns a relevance score based on recency, similarity, and query intent. If the cache predicts that a chunk is highly likely to be needed again, the system stores it in a dedicated micro-index separate from the sliding-window store. This improves retrieval latency by up to 40% compared to baseline compact RAG systems. The authors evaluate the method on an edge inference benchmark using a Raspberry Pi 4 and several microcontrollers, showing that Adaptive RAG maintains 92% of the accuracy of full RAG systems while reducing memory usage by 65%.

The study concludes that the proposed architecture significantly improves on-device reasoning without requiring cloud offloading. However, the authors note that its performance depends heavily on the quality of the lightweight embedding model and the tuning of the sliding-window size. Future work includes knowledge distillation for ultra-small embedding models, quantized vector stores, and integrating temporal reasoning for edge devices that interact with time-dependent sensor data. Lin and Moretti argue that Adaptive RAG may become a foundational technique for autonomous systems operating in remote or bandwidth-limited environments.
"""
kb_text


## 2. Embedding & Indexing

We split the KB into paragraph-level chunks, embed each chunk with the
`all-MiniLM-L6-v2` Sentence Transformer model, and store the vectors in a
flat FAISS index that is searched exhaustively by L2 distance.

In [2]:
# %pip installs into the environment of the running kernel, which !pip does not guarantee
%pip install sentence-transformers faiss-cpu transformers

from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

# Load embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Simple chunking
chunks = kb_text.split("\n\n")
chunks = [c.strip() for c in chunks if c.strip()]

chunks



['In late 2024, Lin and Moretti introduced an approach called Adaptive RAG, designed specifically for deployment on memory-constrained edge devices such as IoT sensors, mobile robots, and compact industrial controllers. Traditional RAG systems rely on large vector stores and GPU-accelerated embeddings, making them difficult to run on hardware with limited RAM and no dedicated accelerators. Adaptive RAG addresses this by using low-dimensional (128–256d) embeddings and a dynamic sliding-window vector store that automatically discards less relevant vectors to maintain a constant memory footprint. The system also employs context-sensitive retrieval, selecting only the top-1 or top-2 chunks to minimize token expansion during generation.',
 'The paper introduces a novel module called the Relevance-Weighted Cache (RWC), which tracks which knowledge fragments have been most frequently and recently used by the model. When a new query arrives, the RWC assigns a relevance score based on recency, 
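
The paragraph split above works because the KB has only three short paragraphs. For longer documents, a common alternative is a sliding window of sentences with some overlap, so that each chunk stays within the embedding model's input limit. The following is a minimal sketch and not part of the original notebook: `chunk_by_sentences`, `max_sentences`, and `overlap` are hypothetical names, and the regex sentence splitter is deliberately naive.

```python
import re

def chunk_by_sentences(text, max_sentences=3, overlap=1):
    """Split text into overlapping windows of whole sentences (hypothetical helper)."""
    if not 0 <= overlap < max_sentences:
        raise ValueError("overlap must be in [0, max_sentences)")
    # Naive sentence split: break after ., !, or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    step = max_sentences - overlap
    windows = []
    for start in range(0, len(sentences), step):
        windows.append(" ".join(sentences[start:start + max_sentences]))
        if start + max_sentences >= len(sentences):
            break  # this window already reaches the end of the text
    return windows

sample = "A one. B two. C three. D four. E five."
sentence_chunks = chunk_by_sentences(sample, max_sentences=3, overlap=1)
print(sentence_chunks)
```

The overlap repeats the last sentence of one window at the start of the next, which may help when a fact spans a chunk boundary. Suitable window sizes depend on the KB and would need to be tuned.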

In [3]:
# Generate embeddings
embeddings = model.encode(chunks)

# Convert to float32 for FAISS
embeddings = np.array(embeddings).astype("float32")

# Create FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

print("Chunks stored:", len(chunks))

Chunks stored: 3
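
To make clear what the flat index does at query time, here is a pure-Python sketch of an exhaustive nearest-neighbour search. It is an illustration only, not FAISS's implementation; `l2_search` and the toy vectors are hypothetical. Like `IndexFlatL2`, it ranks by squared Euclidean distance, so smaller values mean closer matches.

```python
def l2_search(vectors, query, k):
    """Brute-force search: return the k smallest squared L2 distances and their indices."""
    scored = []
    for i, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # ascending by distance, ties broken by index
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
toy_distances, toy_indices = l2_search(toy_vectors, [0.9, 0.1], k=2)
print(toy_indices, toy_distances)
```

With only three chunks, an exhaustive scan like this is cheap; approximate indexes start to matter only for much larger collections.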


## 3. Retrieval

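This function embeds the user query with the same embedding model and
searches the FAISS index for the `k` nearest chunks. It returns the chunks
together with their distances, where a lower distance means a closer match.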

In [4]:
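def retrieve(query, k=2):
    """Return the k chunks nearest to the query and their squared L2 distances."""
    # FAISS pads missing results with index -1, so never request more than is stored
    k = min(k, index.ntotal)
    query_emb = model.encode([query]).astype("float32")
    distances, indices = index.search(query_emb, k)

    retrieved_chunks = [chunks[i] for i in indices[0]]
    return retrieved_chunks, distances[0]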
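
Nearest-neighbour search always returns `k` chunks, even when nothing in the KB is relevant to the query. One possible guard is to drop chunks whose distance exceeds a cutoff. This is a sketch under assumptions: `filter_by_distance` is a hypothetical helper, and the `max_distance` value shown is arbitrary and would have to be tuned on real queries.

```python
def filter_by_distance(retrieved_chunks, distances, max_distance):
    """Keep only (chunk, distance) pairs whose distance is at most max_distance."""
    return [
        (chunk, float(dist))
        for chunk, dist in zip(retrieved_chunks, distances)
        if dist <= max_distance
    ]

# Toy values standing in for the output of retrieve(); 1.0 is an arbitrary cutoff
kept = filter_by_distance(["chunk A", "chunk B"], [0.4, 1.7], max_distance=1.0)
print(kept)
```

An empty result could then be treated as a signal that the question falls outside the KB.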

## 4. Generation (Augmented LLM Prompting)

We combine the user's query with retrieved KB context
and pass it to a small LLM for final answer generation.

In [5]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load FLAN-T5 Small, a lightweight seq2seq model suited to demos
llm_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(llm_name)
model_llm = AutoModelForSeq2SeqLM.from_pretrained(llm_name)

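def generate_answer(query):
    """Retrieve context for the query and return (answer, retrieved_chunks)."""
    context, _ = retrieve(query, k=2)
    context_text = "\n\n".join(context)

    prompt = f"""
Context:
{context_text}

Question: {query}
Answer:
"""

    # Truncation drops tokens from the end, so an over-long context would cut off the question
    inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
    # For this encoder-decoder model, max_length caps the generated answer, not the prompt
    outputs = model_llm.generate(**inputs, max_length=150)
    return tokenizer.decode(outputs[0], skip_special_tokens=True), context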
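
The prompt above gives the model context and a question but no instruction on what to do when the context lacks the answer. A variant that asks the model to abstain might look like the sketch below. `build_prompt` and `abstain_text` are hypothetical names, and there is no guarantee that a model as small as `flan-t5-small` will follow the instruction reliably.

```python
def build_prompt(query, context_chunks, abstain_text="I don't know"):
    """Assemble a RAG prompt that tells the model to abstain when the context is insufficient."""
    context_text = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context. "
        f'If the context does not contain the answer, reply "{abstain_text}".\n\n'
        f"Context:\n{context_text}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )

demo_prompt = build_prompt("What is RWC?", ["Chunk one.", "Chunk two."])
print(demo_prompt)
```

Placing the instruction first keeps it safe from right-side truncation, although a very long context could still push the question past the token limit.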


## 5. Test Cases

Three tests required by the assignment:
1. Factual (answer appears in KB)
2. General/Foil (not in KB)
3. Synthesis (requires multiple chunks)
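
A rough way to compare the three cases is to measure how much of each answer is backed by the retrieved chunks. The sketch below computes simple word overlap; `grounding_ratio` is a hypothetical helper and only a crude proxy, since an answer can reuse context words and still be wrong.

```python
import re

def grounding_ratio(answer, context_chunks):
    """Fraction of answer words that also occur in the retrieved context."""
    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    answer_tokens = tokenize(answer)
    if not answer_tokens:
        return 0.0
    context_vocab = set(tokenize(" ".join(context_chunks)))
    hits = sum(1 for tok in answer_tokens if tok in context_vocab)
    return hits / len(answer_tokens)

demo_context = ["The cache tracks recently used knowledge fragments."]
grounded = grounding_ratio("tracks recently used fragments", demo_context)
ungrounded = grounding_ratio("Paris", demo_context)
print(grounded, ungrounded)
```

A factual answer copied from the KB should score near 1.0, while a foil answer drawn from the model's own knowledge would tend to score lower.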

In [6]:
### Test Case 1 — Factual
q1 = "What is the purpose of the Relevance-Weighted Cache?"
a1, c1 = generate_answer(q1)
print("ANSWER 1:", a1)
print("\nRETRIEVED CHUNKS:", c1)

### Test Case 2 — Foil / Not in KB
q2 = "How does Adaptive RAG compare to ChatGPT-4?"
a2, c2 = generate_answer(q2)
print("\nANSWER 2:", a2)
print("\nRETRIEVED CHUNKS:", c2)

### Test Case 3 — Synthesis
q3 = "How does Adaptive RAG reduce memory usage while improving retrieval latency?"
a3, c3 = generate_answer(q3)
print("\nANSWER 3:", a3)
print("\nRETRIEVED CHUNKS:", c3)

ANSWER 1: to track which knowledge fragments have been most frequently and recently used by the model

RETRIEVED CHUNKS: ['The paper introduces a novel module called the Relevance-Weighted Cache (RWC), which tracks which knowledge fragments have been most frequently and recently used by the model. When a new query arrives, the RWC assigns a relevance score based on recency, similarity, and query intent. If the cache predicts that a chunk is highly likely to be needed again, the system stores it in a dedicated micro-index separate from the sliding-window store. This improves retrieval latency by up to 40% compared to baseline compact RAG systems. The authors evaluate the method on an edge inference benchmark using a Raspberry Pi 4 and several microcontrollers, showing that Adaptive RAG maintains 92% of the accuracy of full RAG systems while reducing memory usage by 65%.', 'In late 2024, Lin and Moretti introduced an approach called Adaptive RAG, designed specifically for deployment on m