# Question Answering Tutorial

In the previous tutorial we ingested a bunch of earnings call transcripts into Aryn with Sycamore. Now let's answer some questions on the data!

We've come up with a list of questions that we think will be interesting to try to answer - some are easy and some are hard:

0. In the Broadcom earnings call, what details did the CFO, Kirsten Spears, discuss about the VMware acquisition?
1. What was the change in stock price on the day of the Q2 2024 AirBnB earnings call?
2. List all the speakers in the MongoDB Q4 2024 earnings call.
3. List all the speakers in the Broadcom Q4 2024 earnings call.
4. How many customers did MongoDB have at the end of the Q1 2024 quarter?
5. What was the first PLTR earnings call where Anduril is mentioned?
6. Summarize how the VMware acquisition contributed to revenue changes for Broadcom quarter over quarter.
7. Summarize how Intuit’s latest AI-powered platform, Intuit Assist, is being integrated into its products. Give me a quarter-by-quarter breakdown of the progress.
8. List all the companies that mentioned inflation and give me a count of the number of times each of the companies mentioned inflation.
9. Summarize all the mergers and acquisitions that happened in 2024 and give a breakdown of how each acquisition impacted earnings.
10. Summarize how AI integration is progressing across each company's products. Give me a quarter-by-quarter breakdown of the progress per company and overall.

With the data ingested in Aryn, we should be able to programmatically answer these questions by retrieving the data and using an LLM. We'll 
start with RAG (Retrieval Augmented Generation), which can answer most of the simpler questions, but further down the list RAG starts to 
break down: there's simply too much data to retrieve and process in a single LLM call. In practice, most RAG practitioners end up 
building complicated agentic systems to break the question into more manageable pieces - here, we'll use Sycamore to spell out query plans 
that can answer these 'sweep and harvest'-style questions.

## Simple RAG implementation

RAG essentially consists of two steps: a query step and an LLM step. In other words, you execute a search query, then hand an LLM the 
search results and the original question, and it crunches the search results into an answer. There are about a million variants, but for
expediency we'll just use basic RAG here. We can query Aryn using aryn_sdk, so let's do that:

In [None]:
from aryn_sdk.client.client import Client as ArynClient

aryn_client = ArynClient()

In [None]:
# I had way too much fun making this to not include it. 
# Basically just an extremely fancy way of printing all the DocSets in your account
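# Import the submodules explicitly: a bare `import rich` does not guarantee
# that rich.table and rich.console are loaded.
import rich.console
import rich.table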

dtable = rich.table.Table(title="Docsets")
dtable.add_column("docset_id")
dtable.add_column("name")
dtable.add_column("created_at")
dtable.add_column("size")

dses = list(aryn_client.list_docsets())
for ds in dses:
    dtable.add_row(ds.docset_id, ds.name, ds.created_at.isoformat(), str(ds.size))

rich.console.Console().print(dtable)

In [None]:
# ok ok ok let's get the docset_id of the docset we created at the end of the ingestion tutorial
with open("docset_id", "r") as f:
    docset_id = f.read().strip()  # drop any trailing newline from the file

print(docset_id)

### Search
First let's play around a little with the search API. `aryn_client.search()` accepts two parameters: a `docset_id` and a `query`, which is a `SearchRequest`.

An Aryn `SearchRequest` has the following parameters:

- `query`: the query string
- `query_type`: the kind of query to execute. One of "keyword", "lexical", "vector", and "hybrid"
- `properties_filter`: a filter expression. More on those later
- `return_type`: either "element" or "doc". Whether to return individual elements or whole documents.

### Filters
Aryn filter syntax is composed of expressions formatted like so: `"(property = \"value\")"`, where the
property name is given in dotted notation. String values are double-quoted. Other comparison operators (<, >, <=, >=, <>) are supported.
Any number of expressions can be joined with ANDs. No grouping is allowed though.
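
If you end up writing a lot of these by hand, a tiny helper can take care of the quoting. This is just a sketch of the syntax described above: `build_filter` is a made-up name (it is not part of aryn_sdk), it only handles equality on string values, and since nothing here says how to escape a double quote inside a value, it refuses those rather than guessing.

```python
def build_filter(conditions: dict[str, str]) -> str:
    """Join property/value pairs into an AND-ed equality filter string.

    Hypothetical helper for illustration only; not an aryn_sdk function.
    """
    clauses = []
    for prop, value in conditions.items():
        if '"' in value:
            # Escaping rules are not covered in this tutorial, so bail out.
            raise ValueError(f"value for {prop!r} contains a double quote")
        # Each expression is parenthesized and the string value is double-quoted.
        clauses.append(f'({prop} = "{value}")')
    # No grouping is allowed, so the clauses are simply chained with AND.
    return " AND ".join(clauses)


example_filter = build_filter(
    {
        "properties.entity.company_ticker": "MDB",
        "properties.entity.quarter": "Q4",
    }
)
print(example_filter)
```

The resulting string is the kind of thing you would pass as `properties_filter`.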

Here's an example query. Feel free to mess with it.

In [None]:
from aryn_sdk.types.search import SearchRequest

aryn_client.search(
    docset_id = docset_id,
    query = SearchRequest(
        query="What is Tesla up to these days?",
        query_type="vector",
        return_type="doc",
        properties_filter="(properties.entity.company_ticker = \"TSLA\")"
    )
)

Now we'll write a relatively simple RAG function to reuse for the first several questions. 

I'll use the sycamore LLM interface because it's what I'm most familiar with and it's fairly easy to use. A sycamore LLM has a 
`generate` method that accepts a `RenderedPrompt` which is made up of `RenderedMessage`s, following the messages API that the 
model providers seem to have settled on.

In [None]:
from aryn_sdk.types.search import SearchRequest
from sycamore.llms.openai import OpenAI, OpenAIModels
from sycamore.llms.prompts.prompts import RenderedPrompt, RenderedMessage

llm = OpenAI(OpenAIModels.GPT_4O)

def rag(question: str, search_request: SearchRequest) -> str:
    search_result = aryn_client.search(docset_id = docset_id, query = search_request)

    messages = [RenderedMessage(role="user", content=f"Using the provided documents, answer the question: {question}")]
    for sr in search_result.value.results:
        # We don't want to include the embeddings in the prompt - 
        # It will just take up space with thousands of random numbers.
        sr.pop("embedding", None)
        if "elements" in sr:
            for elt in sr["elements"]:
                elt.pop("embedding", None)
        # This really isn't super intelligent. We just dump each search result 
        # out as a string and let the LLM decide what to pay attention to
        messages.append(RenderedMessage(role="user", content=str(sr)))
        
    return llm.generate(prompt=RenderedPrompt(messages=messages))

### Question 1: What was the change in stock price on the day of the Q2 2024 AirBnB earnings call?

Correct answer: 2.92%

It turns out that plain, unfiltered BM25 search is sufficient to answer this question. Note that we're asking
for elements back rather than documents - typically this is what you want to do in RAG systems.

In [None]:
question1 = "What was the change in stock price on the day of the Q2 2024 AirBnB earnings call?"

search_request1 = SearchRequest(
    query=question1,
    query_type="lexical",
    return_type="element",
)

answer1 = rag(question1, search_request1)
print(answer1)

### Question 2: List all the speakers in the MongoDB Q4 2024 earnings call.

Correct answer: <It's like a list of 12 people>

Here we need a filter - specifically we want only documents in Q4 for MongoDB. (We only have 2024 data, but with more years in the dataset 
you'd probably want to filter on the year too.)

In [None]:
question2 = "List all the speakers in the MongoDB Q4 2024 earnings call."

search_request2 = SearchRequest(
    query=question2,
    query_type="lexical",
    return_type="element",
    properties_filter="(properties.entity.company_ticker=\"MDB\") AND (properties.entity.quarter=\"Q4\")",
)

answer2 = rag(question2, search_request2)
print(answer2)

### Question 3: List all the speakers in the Broadcom Q4 2024 earnings call.

Correct answer: <It's like a list of 16 people>

Write the search request yourself! (Note: The stock ticker for Broadcom is the extremely intuitive AVGO. Filtering on company name also works)

In [None]:
question3 = "List all the speakers in the Broadcom Q4 2024 earnings call."

search_request3 = ...

answer3 = rag(question3, search_request3)
print(answer3)

### Question 4: How many customers did MongoDB have at the end of the Q1 2024 quarter?

Correct answer: >49.2k

Write the search request yourself! (I didn't need any filters for this one)

In [None]:
question4 = "How many customers did MongoDB have at the end of the Q1 2024 quarter?"

search_request4 = ...

answer4 = rag(question4, search_request4)
print(answer4)

### Question 5: What was the first PLTR earnings call where Anduril is mentioned?

Correct answer: Q3 2024 (November 4)

This question can be answered by RAG, but mostly because our dataset is rather small. You could imagine that
if our data contained hundreds of 2020s Palantir reports, it would be hard to be sure that we were retrieving the first document
mentioning Anduril, and therefore difficult to tell if the RAG answer is correct. What we'd actually like to do is get
all the records mentioning Anduril, sort them by date, and then return the first one.
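
To make the "filter, sort, take the first" idea concrete, here's a rough sketch in plain Python. Everything in it is a stand-in: the records are invented toy data (not from the earnings-call dataset), `first_mention` is a hypothetical helper, and the search term is a placeholder.

```python
from datetime import date
from typing import Optional

# Invented toy records, purely for illustration.
records = [
    {"call": "call-3", "date": date(2024, 9, 1), "text": "Acme came up again."},
    {"call": "call-1", "date": date(2024, 3, 1), "text": "Nothing relevant here."},
    {"call": "call-2", "date": date(2024, 6, 1), "text": "We started working with ACME."},
]


def first_mention(rows: list[dict], term: str) -> Optional[dict]:
    """Return the earliest row whose text mentions `term` (case-insensitive)."""
    # Step 1: keep every row that mentions the term.
    matches = [row for row in rows if term.lower() in row["text"].lower()]
    if not matches:
        return None
    # Step 2: the earliest date wins, regardless of how "relevant" the row looks.
    return min(matches, key=lambda row: row["date"])


first = first_mention(records, "acme")
print(first["call"], first["date"])
```

Unlike a top-k search, this looks at every record, so the answer doesn't depend on which results a ranking function happened to surface.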

For completeness, here's the RAG implementation:

In [None]:
question5 = "What was the first PLTR earnings call where Anduril is mentioned?"

search_request5 = SearchRequest(
    query=question5,
    query_type="lexical",
    return_type="element",
    properties_filter="(properties.entity.company_ticker=\"PLTR\")"
)

answer5_rag = rag(question5, search_request5)
print(answer5_rag)

Let's use sycamore to answer the question though. We'll start by reading the DocSet from Aryn.

Quiz: We'll be reading this docset from Aryn a bunch. What can we do to cache a local copy to speed this up?

In [None]:
import sycamore
from sycamore import MaterializeSourceMode
from sycamore.data import Element, Document
context = sycamore.init()

read_me_seymoure = (
    context.read.aryn(docset_id=docset_id, aryn_url="https://api.aryn.ai/v1/storage")
    .materialize(path="materialize/temp", source_mode=MaterializeSourceMode.USE_STORED)
)

The documents come out with an "_original_elements" property which contains a copy of the elements. We use this in the UI to render bounding boxes of elements on the 
documents (circumventing any chunking) but for our purposes this is a load of crap so we'll add a transform to remove it.

In [None]:
def remove_original_elements(doc: Document) -> Document:
    doc.properties.pop("_original_elements", None)
    return doc
    
cleaned_docset = read_me_seymoure.map(remove_original_elements)

Now we'll actually answer the question. First we'll filter to only Palantir documents. Next we'll filter out any elements 
that don't contain the string 'anduril'. Then we'll apply another filter to keep only documents that still have elements. Finally we'll 
standardize our date property - since we extracted it with an LLM, it's just an unstructured string. `DateTimeStandardizer` will parse it into a 
Python `datetime` object, which we can sort by.

Once we've done all of that, we can simply take the first document and report, for example, the day.

In [None]:
from sycamore.transforms import DateTimeStandardizer

anduril_docset_sorted = (
    cleaned_docset
    .filter(lambda doc: doc.properties['entity']['company_ticker'] == 'PLTR')
    .filter_elements(lambda elt: "anduril" in (elt.text_representation or "").lower())
    .filter(lambda doc: len(doc.elements) > 0)
    .map(lambda doc: DateTimeStandardizer.standardize(doc, key_path = ["properties","entity","date"]))
    .sort(descending=False, field="properties.entity.dateTime")
)
answer5_sycamore = anduril_docset_sorted.take(1)[0].properties['entity']['day']
print(answer5_sycamore)
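
Why bother standardizing the date before sorting? A quick sketch in plain Python shows what goes wrong if you sort the raw strings. This is not how `DateTimeStandardizer` works internally; `parse_date` and the three formats in `KNOWN_FORMATS` are assumptions made up for the demo, and the date strings are toy values.

```python
from datetime import datetime

# Formats assumed for illustration; real LLM-extracted dates may need more.
KNOWN_FORMATS = ("%B %d, %Y", "%Y-%m-%d", "%b %d %Y")


def parse_date(raw: str) -> datetime:
    """Try each known format in turn and return the first successful parse."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")


# Toy date strings written three different ways.
raw_dates = ["November 4, 2024", "2024-08-05", "Feb 5 2024"]

by_string = sorted(raw_dates)               # alphabetical, not chronological
by_date = sorted(raw_dates, key=parse_date)  # chronological

print(by_string)
print(by_date)
```

Sorting the strings puts the August date ahead of the February one, which is exactly the kind of silent mistake that would give the wrong "first" earnings call.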

### Question 9: Summarize all the mergers and acquisitions that happened in 2024 and give a breakdown of how each acquisition impacted earnings.

Correct answer: A summary of all the mergers and acquisitions. I ain't writing that out.

Plan:

1. Filter elements by whether "merger" or "acquisition" is in the text
2. Summarize everything.

In [None]:
# First, a filter_elements. This is fairly simple
from sycamore.data import Element

def filter_mna(elt: Element) -> bool:
    # text_representation can be None, so fall back to an empty string
    text = (elt.text_representation or "").lower()
    return "acquisition" in text or "merger" in text

mna_filtered_ds = cleaned_docset.filter_elements(filter_mna)

## Summarize

Sycamore exports a function called `summarize_data` which attempts to summarize an entire DocSet. You can think of this as similar to a RAG operator,
except it works on data larger than the context window might allow (by hierarchically summarizing summaries). To call it, pass in a DocSet and a 
Summarizer like so:

In [None]:
from sycamore.transforms.summarize import MultiStepDocumentSummarizer, EtCetera
from sycamore.query.execution.operations import summarize_data
from sycamore.functions.tokenizer import OpenAITokenizer
from sycamore.llms.llms import LLMMode

oaitk = OpenAITokenizer(OpenAIModels.GPT_4O.value.name, max_tokens=100_000)

# Some parameters that will go into the prompt - the question you want the llm to answer, and a description of
# the data being provided
question = "For each of the companies mentioned please summarize the impact of mergers and acquisitions on earnings."
data_desc = "Acquisition Earnings"

# Construct the summarizer:
# llm_mode tells it to make its llm calls asynchronously
# fields is a list of fields to include for each document. a list terminating with the sentinel value EtCetera means
#     "all fields, but any that were specified go first"
# tokenizer is needed to determine how many documents can fit in a batch.
summarizer = MultiStepDocumentSummarizer(
    llm=llm, 
    llm_mode=LLMMode.ASYNC, 
    question=question, 
    data_description=data_desc, 
    fields=[EtCetera],
    tokenizer=oaitk
)

# Summarize:
# Give it the LLM, the question and data description, the filtered DocSet, and the summarizer
summary = summarize_data(
    llm=llm,
    question=question,
    data_description=data_desc,
    input_data=[mna_filtered_ds],
    docset_summarizer=summarizer
)

print(summary)
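
If "summarizing summaries" sounds abstract, here's a toy version of the idea in plain Python. To be clear, this is not Sycamore's actual algorithm: `hierarchical_summarize` is a hypothetical function, the budget is counted in characters rather than tokens, and `fake_summarize` stands in for the LLM call.

```python
from typing import Callable


def hierarchical_summarize(
    texts: list[str],
    summarize: Callable[[list[str]], str],
    max_chars: int,
) -> str:
    """Summarize batches that fit the budget, then summarize the summaries."""
    if not texts:
        return ""
    level = list(texts)
    while True:
        # Greedily pack texts into batches that stay under the budget.
        batches, current, used = [], [], 0
        for text in level:
            if current and used + len(text) > max_chars:
                batches.append(current)
                current, used = [], 0
            current.append(text)
            used += len(text)
        batches.append(current)
        if len(level) > 1 and len(batches) == len(level):
            # Nothing fit together: pair items up so every round makes progress.
            batches = [level[i:i + 2] for i in range(0, len(level), 2)]
        level = [summarize(batch) for batch in batches]
        if len(level) == 1:
            return level[0]


def fake_summarize(batch: list[str]) -> str:
    # Stand-in for an LLM call: keep only the first word of each text.
    return "+".join(text.split()[0] for text in batch)


chunks = ["alpha one two", "beta three four", "gamma five six", "delta seven eight"]
result = hierarchical_summarize(chunks, fake_summarize, max_chars=30)
print(result)
```

With a 30-character budget the four chunks don't fit in one batch, so the first round produces three partial summaries and the second round merges them into one.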

### Question 8: List all the companies that mentioned inflation and give me a count of the number of times each of the companies mentioned inflation.

Simple RAG cannot answer this question. I might be wrong, I guess, but I couldn't get it to work. And I think any RAG solution
that does work would be very unnatural. Let's see what this looks like with sycamore.

In [None]:
question8 = "List all the companies that mentioned inflation and give me a count of the number of times each of the companies mentioned inflation."

First off, we'll want to operate on elements, not documents, since we would like to count the number of inflation mentions in each document.
Doing this with sycamore is actually quite simple, although I tend to avoid it where possible as it makes the data model harder to reason about.
Remember: Each member of a DocSet is a Document, and each Document contains several Elements. We're about to make sycamore treat each Element as
a Document.

The `docset.explode()` transform turns every Element into a top-level Document. However, it keeps the original top-level Documents as husks of their
former selves (they contain no elements). Accordingly, we'll throw in this filter at the end to get rid of them.

In [None]:
element_docset = cleaned_docset.explode().filter(lambda doc: "parent_id" in doc)
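
If the explode-then-filter step feels a bit magical, here's a rough model of it using plain dicts. This is only a mental picture, not Sycamore's implementation: the dict layout, the id scheme, and this standalone `explode` function are all invented for illustration.

```python
def explode(documents: list[dict]) -> list[dict]:
    """Turn every element into its own top-level record, keeping parent husks."""
    exploded = []
    for doc in documents:
        # The original document survives as a husk: no elements, no parent_id.
        exploded.append({"doc_id": doc["doc_id"], "elements": []})
        for index, element_text in enumerate(doc["elements"]):
            exploded.append(
                {
                    "doc_id": f"{doc['doc_id']}-{index}",  # made-up id scheme
                    "parent_id": doc["doc_id"],
                    "text": element_text,
                    "elements": [],
                }
            )
    return exploded


# Toy documents, invented for the example.
docs = [
    {"doc_id": "d1", "elements": ["first chunk", "second chunk"]},
    {"doc_id": "d2", "elements": ["third chunk"]},
]

exploded = explode(docs)
# Same trick as the cell above: only former elements carry a parent_id.
element_docs = [record for record in exploded if "parent_id" in record]
print(len(exploded), len(element_docs))
```

The husks are the records without a `parent_id`, which is why that one membership check is enough to drop them.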

Next we ask an LLM whether each element-now-document mentions inflation, keeping only the ones that do.
We will be left with a DocSet full of quotes that mention inflation, so we can group them by company and
then count, using the `groupby_count` transform.

In [None]:
from sycamore.llms.prompts.default_prompts import LlmFilterMessagesJinjaPrompt
from sycamore.llms.llms import LLMMode
from sycamore.functions.tokenizer import OpenAITokenizer

tk = OpenAITokenizer(model_name=OpenAIModels.GPT_4O.value.name)

inflation_mentions_ds = element_docset.llm_filter(
    llm=llm,
    new_field="inflation_mentioned_confidence",
    prompt = LlmFilterMessagesJinjaPrompt.fork(filter_question="Does this text mention inflation?", use_elements=False),
    tokenizer = tk,
    max_tokens = 80_000,
).groupby_count('properties.entity.company_name')

In [None]:
# hahahahahahaha the return of the rich table
inflation_table = rich.table.Table(title="inflation_mentions")
inflation_table.add_column("company")
inflation_table.add_column("mentions")

counts = [(d.properties['count'], d.properties['key']) for d in inflation_mentions_ds.take_all()]
for c, k in sorted(counts):
    inflation_table.add_row(k, str(c))

# Actually render the table
rich.console.Console().print(inflation_table)
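
For intuition, the filter-then-group-and-count pipeline above boils down to something like this in plain Python. The rows are invented toy data with placeholder company names, and the `mentions_inflation` flag stands in for the LLM's yes/no answer; none of it comes from the real dataset.

```python
from collections import Counter

# Toy rows: one per element, with a pretend LLM verdict attached.
rows = [
    {"company": "Company A", "mentions_inflation": True},
    {"company": "Company B", "mentions_inflation": False},
    {"company": "Company A", "mentions_inflation": True},
    {"company": "Company C", "mentions_inflation": True},
    {"company": "Company B", "mentions_inflation": False},
]

# Filter step: keep only the rows flagged as mentioning inflation.
mentions = [row for row in rows if row["mentions_inflation"]]

# Group-and-count step: one tally per company.
mention_counts = Counter(row["company"] for row in mentions)

for company, count in sorted(mention_counts.items(), key=lambda item: item[1]):
    print(company, count)
```

Companies that never mention inflation simply don't appear in the tally, which matches what happens when the filter runs before the group-by.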

### Question 10: Summarize how AI integration is progressing across each company's products.

Plan:
1. llm_map to get how AI is being integrated into the company's products
2. llm_filter "Is artificial intelligence being discussed?"
3. summarize_data: pointing `fields` at the date, company name, and integration summary properties

See if you can get this one to work. Good Luck!