RAG AGENT

ADD ALL PRE REQS

In [1]:
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass(
    "Enter OPENAI_API_KEY: "
)

CREATE A BASIC LLM AND ASK UP TO DATE QUESTIONS TO MAKE SURE THE PARAMETRIC KNOWLEDGE IS LIMITED

In [2]:
from agents import Agent

agent = Agent(
    name="Agent",
    model="gpt-4.1-mini"
)

In [3]:
from agents import Runner

query = "SHOULD BE THE SAME QUESTION"

result = await Runner.run(
    starting_agent=agent,
    input=query,
)

print(result.final_output)

Could you please clarify what you mean by "SHOULD BE THE SAME QUESTION"? Are you asking for help in making two questions identical, or do you want me to check if two given questions are the same? If you provide the questions or more context, I’ll be happy to assist!


CREATE ANOTHER LLM WITH ADDITIONAL SOURCE DATA TO SHOW THAT LLMS CAN USE ADDITIONAL DATA TO CREATE AN ANSWER

In [4]:
agent = Agent(
    name="Agent",
    instructions="DATA ABOUT THE QUESTION YOU WANT TO ANSWER",
    model="gpt-4.1-mini"
)

In [5]:
query = "SHOULD BE THE SAME QUESTION"

result = await Runner.run(
    starting_agent=agent,
    input=query,
)

print(result.final_output)

It looks like you want me to answer the same question, but I don’t see the original question in your message. Could you please provide the question you want me to answer again?


GET THE HUGGING FACE DATASET

In [6]:
from datasets import load_dataset

dataset = load_dataset(
    "aurelio-ai/jfk-files",
    split="train"
)

  from .autonotebook import tqdm as notebook_tqdm
Generating train split: 100%|██████████| 606/606 [00:00<00:00, 42875.55 examples/s]


In [7]:
dataset[0]

{'id': 'doc_21c0d725_0fa9_40ef_a217_c062909cc236',
 'filename': '104-10110-10340.pdf',
 'url': 'https://www.archives.gov/files/research/jfk/releases/2025/0318/104-10110-10340.pdf',
 'date': datetime.datetime(2025, 3, 18, 0, 0),
 'content': '[704-10710-10340}\n\n<!-- image -->',
 'pages': 1}

CREATE A KNOWLEDGE BASE USING PINECONE

In [8]:
os.environ["PINECONE_API_KEY"] = os.getenv("PINECONE_API_KEY") or getpass(
    "Enter PINECONE_API_KEY: "
)

In [9]:
from pinecone import Pinecone, ServerlessSpec
    
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

index_name = "rag-example"

if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric='cosine',
        spec=ServerlessSpec(
            cloud='aws',
            region='us-east-1'
        )
    )

index = pc.Index(index_name)

In [10]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 4838}},
 'total_vector_count': 4838,
 'vector_type': 'dense'}

MAKE BASIC EXAMPLE OF EMBEDDING

In [11]:
from openai import OpenAI

client = OpenAI()

In [12]:
texts = [
    'this is the first chunk of text',
    'then this is the second chunk of text'
]

In [13]:
res = client.embeddings.create(
    input=texts,
    model="text-embedding-3-small"
)

In [14]:
len(res.data), len(res.data[0].embedding)

(2, 1536)

EMBED HUGGING FACE FILES

In [20]:
from tqdm.auto import tqdm
import tiktoken

def chunk_text(text, chunk_size=4000):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Tokenize the text
    tokens = tokenizer.encode(text)
    
    # Split into chunks
    chunks = []
    for i in range(0, len(tokens), chunk_size):
        chunk = tokens[i:i + chunk_size]
        chunks.append(tokenizer.decode(chunk))
    
    return chunks

def truncate_text(text, max_length=1000):
    # Truncate text to a reasonable length for metadata
    if len(text) > max_length:
        return text[:max_length] + "..."
    return text

data = dataset.to_pandas()

batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i+batch_size)
    batch = data.iloc[i:i_end]
    
    # Process each document
    all_ids = []
    all_texts = []
    all_metadata = []
    
    for idx, row in batch.iterrows():
        # Chunk the content
        chunks = chunk_text(row['content'])
        
        # Create IDs and metadata for each chunk
        for chunk_idx, chunk in enumerate(chunks):
            all_ids.append(f"{row['id']}-{chunk_idx}")
            all_texts.append(chunk)
            all_metadata.append({
                'text': truncate_text(chunk),  # Truncate text in metadata
                'source': row['url'],
                'title': row['filename'],
                'chunk_id': chunk_idx
            })
    
    # Create embeddings for all chunks
    embeds = client.embeddings.create(
        input=all_texts,
        model="text-embedding-3-small"
    )
    
    vectors = [record.embedding for record in embeds.data]
    
    # Upsert all chunks
    index.upsert(vectors=zip(all_ids, vectors, all_metadata))

100%|██████████| 7/7 [00:49<00:00,  7.05s/it]


In [21]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 5508}},
 'total_vector_count': 5508,
 'vector_type': 'dense'}

MAKING RAG CHATBOT

In [23]:
query_embedding = client.embeddings.create(
    input=["What were the key findings in the JFK assassination investigation?"],
    model="text-embedding-3-small"
).data[0].embedding

results = index.query(
    vector=query_embedding,
    top_k=3,
    include_metadata=True
)

for match in results["matches"]:
    print(match["metadata"]["text"])
    print("\nSource:", match["metadata"]["source"])
    print("File:", match["metadata"]["title"])
    print("-" * 80 + "\n")

2025 RELEASE UNDER THE PRESIDENT JOHN F\_ KENNEDY ASSASSINATION RECORDS ACT OF 1992

HOUSE

<!-- image -->

IP
| REI=   | REVIEED   |
|--------|-----------|
|        | BE        |

DCD-45s/78 19 April 1978

FROY

Ruth Elliff DCD/FIO/PAO

SUBJECT

House Select Committee on Assassinations Request (OLC 78-0986/1}

is forvarded in response to subject request:

- a DCD file A-19-91-59 on Abran Chayes
- b Docunents concerning Monica Krzner and Rita Nazan. (Please escuse the Poor qualitr of sone of this naterial it was iapossible to clear reproduction fron OUI microfiln.) get

<!-- image -->

Attachnents a/s


## RELLIFF:vfc Distribution

- DCD Chrono
- 0 Addressee
- 1 Staff A
- 3 Control
- 3 RElliff

E2 IMPDET CL BY 386090

4~

FRON

SUBJECT

Case 64574

USSR Exteraal Folicy

1 and a set of questions (Enclosure 5) at Qur request\_ inirial of each US participzt, as to the true identities of As result, sOne felt they had attended. cases , holever , Gere questia as to the true idcitity of\_the 

MAKE RAG AS A TOOL FOR AGENT

In [31]:
from agents import function_tool

@function_tool
async def return_source_knowledge(query: str) -> str:
    # 1. Get the query embedding
    embeds_response = await client.embeddings.create(
        input=[query],
        model="text-embedding-3-small"
    )
    query_embedding = embeds_response.data[0].embedding

    # 2. Query Pinecone
    results = await index.query(
        vector=query_embedding,
        top_k=3,
        include_metadata=True
    )

    # 3. Extract the passages
    source_knowledge = "\n".join(
        match["metadata"]["text"] for match in results["matches"]
    )

    return source_knowledge

MAKE FINAL AGENT

In [32]:
rag_agent = Agent(
    name="JFK Document Assistant",
    model="gpt-4o",
    instructions="""You are an assistant specialized in answering questions about the JFK assassination and related documents. 
    When users ask questions about JFK, the assassination, or related historical events, use the return_source_knowledge tool 
    to retrieve relevant information from the official JFK files. Always base your answers on the retrieved documents and 
    clearly indicate when information comes from the source documents. If you're unsure about something, acknowledge the 
    limitations of the available documents rather than making assumptions.""",
    tools=[return_source_knowledge]
)

In [33]:
query = "What were the key findings in the JFK assassination investigation?"

result = await Runner.run(
    starting_agent=rag_agent, 
    input=query,
)

print(result.final_output)

It seems there is a temporary issue with retrieving the document details. However, I can summarize the key findings from the JFK assassination investigation based on well-known historical information.

The primary investigation was conducted by the Warren Commission, which concluded:

1. **Lee Harvey Oswald Acted Alone**: Oswald was determined to have acted alone in the assassination of President Kennedy, shooting from the sixth floor of the Texas School Book Depository.

2. **Three Shots Fired**: The Commission found that three shots were fired, with one shot hitting Kennedy and Governor John Connally, another hitting Kennedy in the head, and a third missing the motorcade.

3. **No Conspiracy**: The Commission found no credible evidence of a conspiracy, whether domestic or foreign, involving Lee Harvey Oswald.

4. **Oswald's Background**: Oswald was found to have a troubled background, having lived in the Soviet Union for a time and expressing Marxist leanings.

5. **U.S. Secret Servi