# Improving Agentic Systems with RAG and Prompt Engineering Demo

In this notebook, we'll explore how to use prompt engineering and retrieval-augmented generation (RAG) to improve agentic system performance. First, we'll use the open-source model Zephyr-7B-beta (a fine-tuned version of Mistral-7B, the base model developed by Mistral AI) to observe how changes in prompts influence the model's output. Then, we'll build a RAG system from scratch to help our system provide correct responses.

Below, we filter out warning messages that some libraries emit. Your code will run the same without this step; we just prefer cleaner output.

In [1]:
import warnings
warnings.filterwarnings("ignore")

# Part 1: Prompt Engineering - Designing a Prompt for Text Classification

We'll follow the workflow below, which we saw in module 2, to install the appropriate packages and instantiate our model.

## 🔧 Step 1: Install Required Packages
Install `transformers` and `accelerate`.

In [2]:
%%capture
!pip install -U transformers 
!pip install -U accelerate

## 🤖 Step 2: Import Required Libraries
Import from `transformers`:
* `AutoModelForCausalLM`
* `AutoTokenizer`
* `pipeline`

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

## 🔄 Step 3: Load the Tokenizer
Use `AutoTokenizer` to load the tokenizer for `HuggingFaceH4/zephyr-7b-beta`.

In [4]:
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")


## 🧠 Step 4: Load the model - Zephyr-7b-Beta (a fine-tuned version of Mistral 7B)

In [5]:
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)



## 🗞 Step 5: Wrap everything in a pipeline object

In [6]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False,

)

Device set to use cuda
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


## Engineering a Prompt for Text Classification

## 🌟 Activity
Create a prompt that guides a text generation model to classify a piece of text into categories related to food. 

For example, classify text into categories like 'Cuisine Type,' 'Cooking Techniques,' 'Ingredients,' or 'Food Culture.' Define the persona, task, and expected output format in your prompt. Use the following sample text to test your prompt:

Sample Text: "Spaghetti carbonara is a classic Italian pasta dish made with eggs, cheese, pancetta, and pepper. It originated in the Lazio region of Italy and is known for its creamy texture, achieved without using cream. The dish is quick to prepare and relies on high-quality ingredients for the best flavor."

In [12]:
# Prompt components
persona = "You are a food journalist who excels at identifying the main focus of food-related content.\n"
instruction = "Classify the provided text into one of the following categories: 'Cuisine Type,' 'Cooking Techniques,' 'Ingredients,' or 'Food Culture'.\n"
context = "Examine the content carefully to determine the main subject of the text and provide the most suitable category.\n"
data_format = "Respond with the category label and a brief explanation of your reasoning.\n"
audience = "This classification is intended for a general audience interested in understanding culinary writing.\n"
tone = "The response should be clear, concise, and easy to understand.\n"

text = "Spaghetti carbonara is a classic Italian pasta dish made with eggs, cheese, pancetta, and pepper. It originated in the Lazio region of Italy and is known for its creamy texture, achieved without using cream. The dish is quick to prepare and relies on high-quality ingredients for the best flavor."
data = f"Text to classify: {text}"

# The full prompt
query = persona + instruction + context + data_format + audience + tone + data

messages = [
    {"role": "user", "content": query}
]
# Generate the output
outputs = pipe(messages)
print(outputs[0]["generated_text"])

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Category: Cuisine Type

Explanation: The text provides a brief description of spaghetti carbonara, a specific dish from Italian cuisine. It highlights the regional origin of the dish and its traditional ingredients, which are essential to its identity as a classic Italian pasta dish. Therefore, the main focus of the text is on the cuisine type, specifically the Italian dish of spaghetti carbonara.
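
The component-by-component prompt above can also be assembled and checked programmatically. The following is a minimal sketch, not part of the original notebook: `build_prompt` and `parse_category` are hypothetical helper names, and the parser assumes the model replies with a line starting with `Category:`, as it did in the output above.

```python
import re

# The four labels come from the activity; the helper names are hypothetical.
CATEGORIES = ["Cuisine Type", "Cooking Techniques", "Ingredients", "Food Culture"]

def build_prompt(components, text):
    """Join prompt components (persona, instruction, ...) and append the text to classify."""
    parts = [c.strip() for c in components if c.strip()]
    parts.append(f"Text to classify: {text}")
    return "\n".join(parts)

def parse_category(response):
    """Return the category named on the first 'Category:' line, or None if there is none."""
    match = re.search(r"^\s*Category:\s*(.+)$", response, flags=re.MULTILINE)
    if not match:
        return None
    label = match.group(1).strip().strip("'\".")
    for category in CATEGORIES:
        if category.lower() == label.lower():
            return category
    return None

sample_response = "Category: Cuisine Type\n\nExplanation: The text describes an Italian dish."
print(parse_category(sample_response))
```

Parsing the label this way makes it easy to compare prompt variants on several texts, since a reply that ignores the requested format shows up as `None`.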


## Chain of Thought Prompting

## 🌟 Activity

Ask the model to solve the following questions, first without chain-of-thought (CoT) prompting and then with it:

1. Lily has twice as many apples as John. Together, they have 36 apples. How many apples does each of them have?
2. There are five houses in a row, each painted a different color. In each house lives a person with a different profession. The teacher lives in the red house. The doctor lives next to the teacher. The engineer lives two houses away from the teacher. Who lives in the blue house?

Ensure the model generates the correct output.


In [None]:
prompt_cot = (
    "Q: Lily has twice as many apples as John. Together, they have 36 apples. "
    "How many apples does each of them have?\n"
    "Let's think step by step.\nA:"
)

print(pipe(prompt_cot)[0]['generated_text'])


prompt_cot_logic = (
    "Q: There are five houses in a row, each painted a different color. In each house lives a person with a different profession. "
    "The teacher lives in the red house. The doctor lives next to the teacher. The engineer lives two houses away from the teacher. "
    "Who lives in the blue house?\n"
    "Let's think step by step.\nA:"
)

print(pipe(prompt_cot_logic)[0]['generated_text'])
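
The cell above only shows the CoT variants. Below is a minimal sketch of how the two prompt styles could be built side by side, together with a quick arithmetic check of the expected answer to question 1. `make_prompt` is a hypothetical helper, and the `pipe` call is left commented out because it needs the model loaded earlier.

```python
def make_prompt(question, use_cot):
    """Build a Q/A prompt, optionally adding the step-by-step cue used above."""
    cue = "Let's think step by step.\n" if use_cot else ""
    return f"Q: {question}\n{cue}A:"

apples_question = (
    "Lily has twice as many apples as John. Together, they have 36 apples. "
    "How many apples does each of them have?"
)

prompt_plain = make_prompt(apples_question, use_cot=False)
prompt_cot = make_prompt(apples_question, use_cot=True)

# With the model loaded, the two outputs could be compared like this:
# for prompt in (prompt_plain, prompt_cot):
#     print(pipe(prompt)[0]["generated_text"])

# Reference answer for checking the model: John has x apples and Lily has 2x,
# so x + 2x = 36, which gives x = 12.
john = 36 // 3
lily = 2 * john
print(f"John: {john}, Lily: {lily}")
```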

# Part 2: Implementing a RAG System

🚧 **You will need to restart the notebook kernel before starting this part of the demo; otherwise, you will run out of GPU memory.**

In [2]:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()

Below, we filter out warning messages that some libraries emit. Your code will run the same without this step; we just prefer cleaner output.

In [None]:
import warnings
warnings.filterwarnings("ignore")

We'll follow the workflow below to install the appropriate packages and instantiate our model.

## 🔧 Step 1: Install Required Packages
The new package installed here is `faiss-gpu`, a GPU-accelerated build of FAISS. FAISS is a library for efficient similarity search and clustering of dense vectors; running its computations on the GPU speeds up vector retrieval.

In [None]:
%%capture
!pip install langchain 
!pip install langgraph
!pip install langchain-huggingface 
!pip install -U langchain-community
!pip install faiss-gpu 
!pip install transformers 
!pip install accelerate 
!pip install datasets 
!pip install sentence-transformers

## 🤖 Step 2: Import Required Libraries
The new imports are:
* `from langchain_huggingface import HuggingFacePipeline`
  Loads a Hugging Face model into a LangChain-compatible wrapper for LLM-based chains.
* `from langchain.text_splitter import CharacterTextSplitter`
  Splits large texts into smaller chunks based on character count, useful for document indexing.
* `from langchain.schema import Document`
  Defines the standard format for documents used in LangChain pipelines.
* `from langchain.vectorstores import FAISS`
  Provides access to the FAISS vector database for efficient similarity search and retrieval.
* `from langchain_huggingface import HuggingFaceEmbeddings`
  Loads Hugging Face embedding models for generating dense vector representations of text.
* `from datasets import load_dataset`
  Loads benchmark datasets (like SQuAD) from the Hugging Face datasets library.
* `from langgraph.graph import StateGraph, END`
  Constructs modular state-based execution graphs for chaining together AI agent components.
* `from langchain.schema.runnable import RunnableLambda`
  Wraps Python functions as LangChain-compatible steps in a processing chain.
* `from typing import TypedDict, List, Optional`
  Provides static type definitions for better structure and type-checking in Python.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain_huggingface import HuggingFacePipeline
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema import Document
from langchain.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from datasets import load_dataset
from langgraph.graph import StateGraph, END
from langchain.schema.runnable import RunnableLambda
from typing import TypedDict, List, Optional

## 🔄 Step 3: Load the Tokenizer

In [None]:
model_name = "microsoft/phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

## 🧠 Step 4: Load the model - Phi 3 Mini 4k Instruct

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

## 🗞 Step 5: Wrap everything in a pipeline object

In [None]:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256, return_full_text=False)
llm = HuggingFacePipeline(pipeline=pipe)

## 📰 Step 6: Process the Documents that will form the Knowledge Base

Here we load the Stanford Question Answering Dataset (SQuAD), a reading comprehension dataset of questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is a segment of text, or span, from the corresponding reading passage. (Unanswerable questions were introduced in SQuAD 2.0; the `squad` dataset loaded below is version 1.1, in which every question has an answer.)

We use it because it contains many text passages that we can put into a vector store, along with matching questions we can ask to check whether the RAG system works.

We load the dataset from Hugging Face, create a list of LangChain `Document` objects from the unique passages, and then split any passage longer than 500 characters into overlapping chunks.

In [None]:
squad = load_dataset("squad", split="validation")
docs = [Document(page_content=sample) for sample in squad.unique('context')]
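# CharacterTextSplitter splits on "\n\n" by default; SQuAD passages are usually single
# paragraphs, so we split on spaces to get chunks of roughly 500 characters.
splitter = CharacterTextSplitter(separator=" ", chunk_size=500, chunk_overlap=50)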
split_docs = splitter.split_documents(docs)
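
To make the idea of overlapping chunks concrete, here is a minimal pure-Python sketch of a sliding-window splitter. It is an illustration only, not how `CharacterTextSplitter` is implemented (that class splits on a separator and merges the pieces); `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

passage = "x" * 1200  # stand-in for a long passage
chunks = chunk_text(passage)
print([len(c) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, provided it is shorter than the overlap.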

## 🔢 Step 7: Embed the Documents in the Vector Database

Next, we use the `all-MiniLM-L6-v2` model to embed the chunks of text, i.e. encode them as vectors where "closeness" means similarity of meaning. These embeddings are then stored in a vector store built with Facebook AI Similarity Search (FAISS), a library developed by Facebook (now Meta) for efficient similarity search and clustering of dense vectors.

We set k = 3, where k is the number of similar documents we want to get back from a search.

In [None]:
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(split_docs, embedding_model)
retriever = vectorstore.as_retriever(search_kwargs={'k': 3})
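
Conceptually, the retriever embeds the query and returns the k stored chunks whose vectors are closest to it. The sketch below illustrates that idea with hand-made 3-dimensional vectors and cosine similarity. The vectors and the `top_k` helper are invented for illustration, and FAISS uses its own optimized indexes rather than this brute-force loop.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy "embeddings": invented numbers, not real model output.
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.05, 0.0]

print(top_k(query_vec, doc_vecs, k=2))
```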

## 🔱 Step 8: Define Graph Components

Here, we define a class and two functions. The class `RAGState` defines the state that is passed between the nodes of our LangGraph graph: there will always be a question, and we might also have some helpful documents and, eventually, an answer.

The functions describe what will happen in two nodes of the graph:
* In a `retrieve` node, we'll get the current question from the state, search the vector store for documents similar to the question, and then update the state so that `docs` points to the relevant documents we found.
* In a `generate` node, we'll get the current question and `docs` from the state, construct a prompt that contains the relevant documents, the question, and a slot for the answer, and then pass the prompt to our language model. The node then updates the state so that `answer` contains the response from the model.

In [None]:
class RAGState(TypedDict):
    question: str
    docs: Optional[List[Document]]
    answer: Optional[str]

def retrieve(state: RAGState) -> RAGState:
    question = state["question"]
    results = retriever.invoke(question)
    return {"question": question, "docs": results, "answer": None}

def generate(state: RAGState) -> RAGState:
    question = state["question"]
    docs = state["docs"] or []
    context = "\n".join([doc.page_content for doc in docs])
    prompt_template = f"""<|system|>
    Context: {context}
    Given the context, write an answer to the question. 
    Only provide an answer, do not detail the context or question.<|end|>
    <|user|>
    {question}<|end|>
    <|assistant|>"""
    prompt = prompt_template
    result = llm.invoke(prompt)
    return {"question": question, "docs": docs, "answer": result}
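
The prompt assembly inside `generate` can be isolated and inspected without loading a model. This is a minimal sketch using the same chat markers as the cell above; `build_rag_prompt` is a hypothetical helper, and the plain strings stand in for the `page_content` of retrieved documents.

```python
def build_rag_prompt(question, passages):
    """Assemble a prompt containing the retrieved passages, the question,
    and an open assistant turn for the model to complete."""
    context = "\n".join(passages)
    return (
        "<|system|>\n"
        f"Context: {context}\n"
        "Given the context, write an answer to the question.\n"
        "Only provide an answer, do not detail the context or question.<|end|>\n"
        "<|user|>\n"
        f"{question}<|end|>\n"
        "<|assistant|>"
    )

# Placeholder passages, invented for illustration.
passages = ["Passage one about the topic.", "Passage two about the topic."]
prompt = build_rag_prompt("What is the topic?", passages)
print(prompt)
```

Unlike the indented f-string in `generate`, this version adds no leading whitespace before each marker; whether that changes the model's answers is something to test rather than assume.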

## ▶ Step 9: Build Graph

Here we construct our graph. The flow of events will be retrieve -> generate -> END, at which point the final state contains the answer.

In [None]:
builder = StateGraph(RAGState)
builder.add_node("retrieve", RunnableLambda(retrieve))
builder.add_node("generate", RunnableLambda(generate))
builder.set_entry_point("retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)
graph = builder.compile()
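
For intuition, the compiled graph behaves roughly like calling each node in order and merging its return value into a shared state. The sketch below mimics that with plain dictionaries and stub nodes. It is not LangGraph's implementation, and `fake_retrieve`, `fake_generate`, and `run_pipeline` are hypothetical stand-ins for the real retriever, model, and graph.

```python
def fake_retrieve(state):
    """Stub retrieval node: returns a fixed passage instead of searching a vector store."""
    return {"docs": ["Stub passage relevant to: " + state["question"]]}

def fake_generate(state):
    """Stub generation node: reports how many passages it received instead of calling an LLM."""
    return {"answer": f"(answer based on {len(state['docs'])} passage(s))"}

def run_pipeline(question, nodes):
    """Run nodes in order, merging each node's update into the shared state."""
    state = {"question": question, "docs": None, "answer": None}
    for node in nodes:
        state.update(node(state))
    return state

final_state = run_pipeline("What did Paul Baran develop?", [fake_retrieve, fake_generate])
print(final_state["answer"])
```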

## 🧪 Step 10: Test the System by Asking a Question it Should Know About

First, let's quickly test how the model does without the added context. The specific answer we are looking for is "Distributed Adaptive Message Block Switching".

In [None]:
question = "What did Paul Baran develop?"

In [None]:
ragless_prompt = f"""<|system|>
    Write a concise answer to the question or explain that you don't know it.<|end|>
    <|user|>
    {question}<|end|>
    <|assistant|>"""

In [None]:
ragless_result = llm.invoke(ragless_prompt)
print("Answer:", ragless_result)

Here we see how the model does with the added RAG context.

In [None]:
result = graph.invoke({"question": question, "docs": None, "answer": None})
print("Answer:", result["answer"])

Below, we can see the 3 documents that were found to give context to the model. From this, it is able to answer our query.

In [None]:
result['docs']

## ✅❌ Step 11: Evaluate System Performance
Here we have a small piece of code that evaluates the performance of our RAG system. For each sampled question, it normalizes the reference answer and the system's response (lowercasing and removing articles and punctuation) and counts the response as correct if either string contains the other. We do this for 5 randomly chosen questions and report the percentage answered correctly.

In [None]:
import re, random

def normalize(text):
    text = text.lower()
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    text = re.sub(r'[^\w\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

correct = 0
total = 0

random.seed(1)
test_indices = [random.randint(0, 10_000) for _ in range(5)]
for i in test_indices:
    question = squad[i]["question"]
    ground_truth = squad[i]["answers"]["text"][0] if squad[i]["answers"]["text"] else ""
    response = graph.invoke({"question": question})
    prediction = response["answer"]
    norm_pred = normalize(prediction)
    norm_truth = normalize(ground_truth)

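    # Guard against empty strings, which would otherwise match any text.
    if norm_pred and norm_truth and (norm_truth in norm_pred or norm_pred in norm_truth):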
        correct += 1
    total += 1

    print(f"\nQ: {question}\nPredicted: {prediction}\nActual: {ground_truth}")
print(f"\nAccuracy: {correct}/{total} = {correct / total:.2%}")
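
The matching rule can be tried on its own, without running the model. This is a minimal sketch that repeats the `normalize` function from the cell above and wraps the containment check in a hypothetical `is_match` helper. The emptiness guard is included because an empty string is a substring of every string.

```python
import re

def normalize(text):
    """Lowercase, drop articles and punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    text = re.sub(r'[^\w\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

def is_match(prediction, ground_truth):
    """True if either normalized string contains the other (and neither is empty)."""
    pred, truth = normalize(prediction), normalize(ground_truth)
    if not pred or not truth:
        return False
    return truth in pred or pred in truth

print(is_match("He developed the Distributed Adaptive Message Block Switching.",
               "Distributed Adaptive Message Block Switching"))
print(is_match("", "Distributed Adaptive Message Block Switching"))
```

Because containment is checked in both directions, a very short prediction that happens to appear inside the reference answer also counts as correct, so this metric is lenient and best read as a rough sanity check.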

## Extension  🌟
* Select a question from the SQuAD dataset and see how well you think the system answers it!