# Workshop Notebook 3 - RAG

In this notebook, we will ingest the entire 92-document dataset into Aryn and implement RAG to answer some questions over them.

## Full ingestion script

In the previous notebook we walked through how you might come up with a sycamore ingestion script
to write a bunch of documents to Aryn (or another database). Now let's scale that up. Because this is
a new notebook, I've copied the ingestion script from the previous notebook and condensed it a little.
I've also partitioned the documents ahead of time, since partitioning can take a little while (our DoS
protection limits the number of concurrent requests per account), so this script reads them from a
materialized copy on disk. That copy is one of the things you downloaded with `make downloads`.


In [None]:
from pathlib import Path
import re

import sycamore
from sycamore import MaterializeSourceMode
from sycamore.data import Element
from sycamore.llms.openai import OpenAI, OpenAIModels
from sycamore.transforms.partition import ArynPartitioner
from sycamore.transforms.extract_schema import LLMPropertyExtractor
from sycamore.transforms.merge_elements import MarkedMerger
from sycamore.transforms.embed import OpenAIEmbedder
from aryn_sdk.client.client import Client

repo_root = Path.cwd()
pdf_dir = repo_root / "files" / "earnings_calls"
materialize_dir = repo_root / "materialize"

gpt4o = OpenAI(OpenAIModels.GPT_4O)

schema = {
    "type": "object",
    "properties": {
        "quarter": {"type": "string", "description": "Quarter of the earnings call, it should be in the format of Q1, Q2, Q3, Q4"},
        "date":{"type": "string", "description": "The date of the earnings call"},
        "company_name": {"type": "string", "description": "The name of the company in the earnings call"},
        "company_ticker": {"type": "string", "description": "The stock ticker of the company in the earnings call"},
    }
}

def mark_speakers(elt: Element) -> Element:
    if not elt.text_representation:
        return elt

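    # Raw strings keep `\S` from being parsed as an (invalid) string escape.
    # External speakers look like "Name -- Organization -- Role"; internal ones like "Name -- Role".
    external_speaker = re.match(r'([^ ]*[^\S\n\t]){1,4}--[^\S\n\t].*--', elt.text_representation)
    internal_speaker = re.match(r'([^ ]*[^\S\n\t]){1,4}--.*', elt.text_representation)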
    if elt.text_representation.strip() == 'Operator':
        elt.properties['speaker_name'] = 'Operator'
        elt.properties['speaker_role'] = 'Operator'
        elt.properties['speaker'] = True
        elt.data['_break'] = True
    elif external_speaker:
        location = elt.text_representation.find('--')
        location2 = location + elt.text_representation[location+2:].find('--')
        elt.properties['speaker_name'] = elt.text_representation[:location].strip()
        elt.properties['speaker_external_org'] = elt.text_representation[location+2:location2+1].strip()
        elt.properties['speaker_role'] = elt.text_representation[location2+4:].strip()
        elt.properties['speaker'] = True
        elt.data["_break"] = True
    elif internal_speaker:
        location = elt.text_representation.find('--')
        elt.properties['speaker_name'] = elt.text_representation[:location].strip()
        elt.properties['speaker_role'] = elt.text_representation[location+2:].strip()
        elt.properties['speaker'] = True
        elt.data["_break"] = True
    return elt

aryn_client = Client()
aryn_docset = aryn_client.create_docset(name = "haystack-workshop-all")
docset_id = aryn_docset.value.docset_id

ctx = sycamore.init()

In [None]:
(
    ctx.read.binary(paths = str(pdf_dir), binary_format = "pdf")
    .partition(ArynPartitioner())
    .materialize(path=materialize_dir / "alldocs-partitioned", source_mode=MaterializeSourceMode.USE_STORED)
    .filter_elements(lambda e: e.type in ("Section-header", "Text"))
    .extract_properties(LLMPropertyExtractor(llm=gpt4o, schema=schema))
    .map_elements(mark_speakers)
    .merge(MarkedMerger())
    .embed(OpenAIEmbedder(model_name="text-embedding-3-small"))
    .spread_properties(['path', 'entity'])
    .write.aryn(aryn_url="https://api.aryn.ai/v1/storage", docset_id=docset_id, autoschema=True)
)

# Questions

Alrighty, now let's answer some questions on the data!

We've come up with a list of questions that we think will be interesting to try to answer - some are easy and some are hard:

0. In the Broadcom earnings call, what details did the CFO, Kirsten Spears, discuss about the VMware acquisition?
1. What was the change in stock price on the day of the Q2 2024 AirBnB earnings call?
2. List all the speakers in the MongoDB Q4 2024 earnings call.
3. List all the speakers in the Broadcom Q4 2024 earnings call.
4. How many customers did MongoDB have at the end of the Q1 2024 quarter?
5. What was the first PLTR earnings call where Anduril is mentioned?
6. List all the companies that mentioned inflation and give me a count of the number of times each of the companies mentioned inflation.
7. Summarize how the VMware acquisition contributed to revenue changes for Broadcom quarter over quarter.
8. Summarize how Intuit’s latest AI-powered platform, Intuit Assist, is being integrated across its products. Give me a quarter-by-quarter breakdown of the progress.
9. Summarize all the mergers and acquisitions that happened in 2024 and give a breakdown of how each acquisition impacted earnings.
10. Summarize how AI integration is progressing across each company's products. Give me a quarter-by-quarter breakdown of the progress per company and overall.

With the data ingested in Aryn, we should be able to programmatically answer these questions by retrieving the data and using an LLM. We'll
start with RAG (Retrieval-Augmented Generation), which can answer most of the simpler questions. Further down the list, though, RAG starts to
break down: there's simply too much data to retrieve for it all to fit in a single LLM call. In the next notebook, we will use sycamore to
answer these harder questions, but for now let's knock out the simpler retrieval-based ones.

## Simple RAG implementation

RAG essentially consists of two steps: a query step and an LLM step. In other words, you execute a search query, and then you hand an LLM the
search results and the original question, and it crunches the search results into an answer. There are about a million variants, but for the
sake of expediency we'll just use basic RAG here. We can query Aryn using aryn_sdk, so let's do that:

In [None]:
from aryn_sdk.client.client import Client as ArynClient

aryn_client = ArynClient()
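
Before touching the real API, here's the two-step shape of RAG as a toy you can run offline. This is only a sketch: `retrieve`, `toy_rag`, the word-overlap scoring, and the echoing stand-in for an LLM are all made up for illustration and have nothing to do with how Aryn actually searches.

```python
from typing import Callable


def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Query step: rank chunks by how many words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda chunk: len(q_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def toy_rag(question: str, corpus: list[str], llm: Callable[[str], str], k: int = 2) -> str:
    """LLM step: hand the retrieved chunks plus the question to the model."""
    context = "\n".join(retrieve(question, corpus, k))
    prompt = f"Using the provided documents, answer the question: {question}\n\n{context}"
    return llm(prompt)


def echo_llm(prompt: str) -> str:
    """Stand-in for a real LLM: returns the prompt so we can see what was retrieved."""
    return prompt


# Hypothetical toy corpus, not taken from the workshop dataset.
corpus = [
    "revenue grew this quarter",
    "the weather was sunny",
    "customers grew this quarter too",
]
answer = toy_rag("how much did revenue grow this quarter", corpus, echo_llm, k=1)
print(answer)
```

Swapping `retrieve` for a real search call and `echo_llm` for a real model gives you the `rag` function we write later in this notebook.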

In [None]:
# I had way too much fun making this to not include it. 
# Basically just an extremely fancy way of printing all the DocSets in your account
# Import the submodules explicitly: a bare `import rich` does not guarantee
# that `rich.table` and `rich.console` are loaded.
from rich.console import Console
from rich.table import Table

dtable = Table(title="Docsets")
dtable.add_column("docset_id")
dtable.add_column("name")
dtable.add_column("created_at")
dtable.add_column("size")

dses = list(aryn_client.list_docsets())
for ds in dses:
    dtable.add_row(ds.docset_id, ds.name, ds.created_at.isoformat(), str(ds.size))

Console().print(dtable)

# Write the docset id to a file to pick up in the next notebook:
with open("docset_id", "w") as f:
    f.write(docset_id)

### Search
First let's play around a little with the search API. `aryn_client.search()` accepts two parameters: a `docset_id` and a `query`, which is a `SearchRequest`.

An Aryn `SearchRequest` has the following parameters:

- `query`: the query string
- `query_type`: the kind of query to execute. One of "keyword", "lexical", "vector", and "hybrid"
- `properties_filter`: a filter expression. More on those later
- `return_type`: either "element" or "doc". Whether to return individual elements or whole documents.

### Filters
Aryn filter syntax consists of expressions formatted like so: `"(property = \"value\")"`, where the
property name is given in dotted notation. String values are double-quoted. Other comparison operators (<, >, <=, >=, <>) are also supported.
Any number of expressions can be joined with ANDs. Grouping is not allowed, though.
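
Hand-escaping those quotes gets old fast, so here's a small helper that assembles an AND-ed equality filter from a dict. It's a sketch: `build_filter` is my own hypothetical function, not part of aryn_sdk, and it only covers the `=` operator. Since nothing above says how a double quote inside a value should be escaped, it refuses such values rather than guessing.

```python
def build_filter(conditions: dict[str, str]) -> str:
    """Join equality conditions into a single AND-ed filter string.

    Hypothetical helper, not part of aryn_sdk.
    """
    clauses = []
    for prop, value in conditions.items():
        if '"' in value:
            # No escaping rule is documented in this notebook, so don't invent one.
            raise ValueError(f"double quotes are not supported in filter values: {value!r}")
        clauses.append(f'({prop} = "{value}")')
    return " AND ".join(clauses)


mdb_q4 = build_filter({
    "properties.entity.company_ticker": "MDB",
    "properties.entity.quarter": "Q4",
})
print(mdb_q4)
# (properties.entity.company_ticker = "MDB") AND (properties.entity.quarter = "Q4")
```

The result can be passed straight to `properties_filter` in a `SearchRequest`.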

Here's an example query. Feel free to mess with it.

In [None]:
from aryn_sdk.types.search import SearchRequest

response = aryn_client.search(
    docset_id = docset_id,
    query = SearchRequest(
        query="What is Tesla up to these days?",
        query_type="vector",
        return_type="doc",
        properties_filter="(properties.entity.company_ticker = \"TSLA\")"
    )
)

response.value.results

Now we'll write a relatively simple RAG function to reuse for the first several questions. 

I'll use the sycamore LLM interface because it's what I'm most familiar with and it's fairly easy to use. A sycamore LLM has a 
`generate` method that accepts a `RenderedPrompt` which is made up of `RenderedMessage`s, following the messages API that the 
model providers seem to have settled on.

In [None]:
from aryn_sdk.types.search import SearchRequest
from sycamore.llms.openai import OpenAI, OpenAIModels
from sycamore.llms.prompts.prompts import RenderedPrompt, RenderedMessage

llm = OpenAI(OpenAIModels.GPT_4O)

def rag(question: str, search_request: SearchRequest) -> str:
    search_result = aryn_client.search(docset_id = docset_id, query = search_request)

    messages = [RenderedMessage(role="user", content=f"Using the provided documents, answer the question: {question}")]
    for sr in search_result.value.results:
        # We don't want to include the embeddings in the prompt - 
        # It will just take up space with thousands of random numbers.
        sr.pop("embedding", None)
        if "elements" in sr:
            for elt in sr["elements"]:
                elt.pop("embedding", None)
        # This really isn't super intelligent. We just dump each search result 
        # out as a string and let the LLM decide what to pay attention to
        messages.append(RenderedMessage(role="user", content=str(sr)))
        
    return llm.generate(prompt=RenderedPrompt(messages=messages))
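
The embedding-stripping part of `rag` is easy to pull out and test on its own, without calling Aryn or an LLM. Here's a sketch, assuming (as `rag` does) that each search result is a plain dict with an optional `"elements"` list of dicts. `strip_embeddings` is a hypothetical helper name, and the sample result below is made up.

```python
from typing import Any


def strip_embeddings(result: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a search result with every "embedding" field removed."""
    cleaned = {k: v for k, v in result.items() if k != "embedding"}
    if "elements" in cleaned:
        cleaned["elements"] = [
            {k: v for k, v in elt.items() if k != "embedding"}
            for elt in cleaned["elements"]
        ]
    return cleaned


# Made-up result, shaped like the dicts that `rag` pops keys from.
sample = {
    "text_representation": "Operator",
    "embedding": [0.1, 0.2],
    "elements": [{"text_representation": "Hello", "embedding": [0.3]}],
}
cleaned = strip_embeddings(sample)
print(cleaned)
```

Unlike the in-place `pop` calls in `rag`, this returns a copy, so the original search results stay intact if you want to inspect them afterwards.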

### Question 1: What was the change in stock price on the day of the Q2 2024 AirBnB earnings call?

Correct answer: 2.92%

It turns out that plain, unfiltered BM25 search is sufficient to answer this question. Note that we're asking
for elements back rather than documents - typically this is what you want to do in RAG systems.

In [None]:
question1 = "What was the change in stock price on the day of the Q2 2024 AirBnB earnings call?"

search_request1 = SearchRequest(
    query=question1,
    query_type="lexical",
    return_type="element",
)

answer1 = rag(question1, search_request1)
print(answer1)

### Question 2: List all the speakers in the MongoDB Q4 2024 earnings call.

Correct answer: (a list of about 12 people)

Here we need a filter: specifically, we want only documents in Q4 for MongoDB. (We only have 2024 data, but otherwise you'd
probably want to filter on the year too.)

In [None]:
question2 = "List all the speakers in the MongoDB Q4 2024 earnings call."

search_request2 = SearchRequest(
    query=question2,
    query_type="lexical",
    return_type="element",
    properties_filter="(properties.entity.company_ticker=\"MDB\") AND (properties.entity.quarter=\"Q4\")",
)

answer2 = rag(question2, search_request2)
print(answer2)

### Question 3: List all the speakers in the Broadcom Q4 2024 earnings call.

Correct answer: (a list of about 16 people)

Write the search request yourself! (Note: The stock ticker for Broadcom is the extremely intuitive AVGO. Filtering on company name also works)

In [None]:
question3 = "List all the speakers in the Broadcom Q4 2024 earnings call."

search_request3 = ...

answer3 = rag(question3, search_request3)
print(answer3)

### Question 4: How many customers did MongoDB have at the end of the Q1 2024 quarter?

Correct answer: >49.2k

Write the search request yourself! (I didn't need any filters for this one - just lexical search did the trick)

In [None]:
question4 = "How many customers did MongoDB have at the end of the Q1 2024 quarter?"

search_request4 = ...

answer4 = rag(question4, search_request4)
print(answer4)

### Question 5: What was the first PLTR earnings call where Anduril is mentioned?

Correct answer: Q3 2024 (November 4)

This question can be answered by RAG, but mostly because our dataset is rather small. You could imagine that
if our data contained hundreds of 2020s Palantir reports, it would be hard to be sure that we were retrieving the first document
mentioning Anduril, and therefore difficult to tell whether the RAG answer is correct. What we'd actually like to do is get
all the records mentioning Anduril, sort them by date, and then return the first one.
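
In plain Python, that "filter, sort by date, take the first" idea might look like the sketch below. Everything here is illustrative: `first_mention` is a hypothetical helper, the records are made up, and I'm assuming each record already carries a parsed `date` (the `date` property we extracted is a free-form string, so real data would need parsing first).

```python
from datetime import date
from typing import Optional


def first_mention(records: list[dict], term: str) -> Optional[dict]:
    """Return the earliest-dated record whose text mentions `term`, or None."""
    hits = [r for r in records if term.lower() in r["text"].lower()]
    return min(hits, key=lambda r: r["date"], default=None)


# Made-up records; "ExampleCo" is a placeholder, not a company from the dataset.
records = [
    {"quarter": "Q3", "date": date(2024, 9, 1), "text": "We expanded our work with ExampleCo."},
    {"quarter": "Q1", "date": date(2024, 3, 1), "text": "Margins improved."},
    {"quarter": "Q2", "date": date(2024, 6, 1), "text": "ExampleCo became a partner."},
]
earliest = first_mention(records, "exampleco")
print(earliest["quarter"] if earliest else "no mention found")
```

The catch is that this needs every matching record, not just the top few search hits, which is exactly where plain RAG gets shaky.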

For completeness, here's the RAG implementation:

In [None]:
question5 = "What was the first PLTR earnings call where Anduril is mentioned?"

search_request5 = SearchRequest(
    query=question5,
    query_type="lexical",
    return_type="element",
    properties_filter="(properties.entity.company_ticker=\"PLTR\")"
)

answer5 = rag(question5, search_request5)
print(answer5)

### Question 6: List all the companies that mentioned inflation and give me a count of the number of times each of the companies mentioned inflation.

Correct answer:

- Amazon: 3 
- AstraZeneca: 3 
- ...
- Camden Property Trust: 1

I was completely unable to get this question to work, as it requires working over all chunks that mention inflation and counting them
per company. If you can get RAG to answer this (especially in a way that scales), kudos. Here's what it would probably look like:

In [None]:
question6 = "List all the companies that mentioned inflation and give me a count of the number of times each of the companies mentioned inflation."

search_request6 = SearchRequest(
    query="inflation",
    query_type="vector",
    return_type="element",
)

answer6 = rag(question6, search_request6)
print(answer6)
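
For contrast, the computation this question really asks for is a group-by-and-count over every chunk, not a top-k retrieval. Here's a plain-Python sketch of that idea on made-up chunks. `count_mentions` is a hypothetical helper, the company names are placeholders, and counting substring occurrences is just one possible definition of "mention" (it would also count words like "inflationary").

```python
from collections import Counter


def count_mentions(chunks: list[dict], term: str) -> Counter:
    """Count occurrences of `term` per company across all chunks."""
    counts: Counter = Counter()
    for chunk in chunks:
        n = chunk["text"].lower().count(term.lower())
        if n:
            counts[chunk["company_name"]] += n
    return counts


# Made-up chunks; the company names are placeholders, not from the dataset.
chunks = [
    {"company_name": "ExampleCo", "text": "Inflation eased. We still watch inflation."},
    {"company_name": "ExampleCo", "text": "Revenue grew."},
    {"company_name": "SampleCorp", "text": "Wage inflation was a headwind."},
    {"company_name": "OtherInc", "text": "No macro commentary."},
]
counts = count_mentions(chunks, "inflation")
for company, n in counts.most_common():
    print(f"{company}: {n}")
```

The hard part at scale is getting every relevant chunk in front of this loop, which a single search-then-prompt round trip can't guarantee.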