# Workshop Notebook 4 - Analytics Queries with Sycamore

In the previous tutorial we answered several questions using RAG. However, there were some patterns that RAG could not 
capture, because the context window is too small (and LLMs aren't great at counting).

For a refresher, here are the questions:

0. In the Broadcom earnings call, what details did the CEO, Hock Tan, discuss about the VMware acquisition?
1. What was the revenue in the Q2 AirBnB earnings call?
2. List all the speakers in the MongoDB Q4 2024 earnings call.
3. List all the speakers in the Broadcom Q4 2024 earnings call.
4. How many customers did MongoDB have at the end of the Q1 2024 quarter?
5. What was the first PLTR earnings call where Anduril is mentioned?
6. List all the companies that mentioned inflation and give me a count of the number of times each of the companies mentioned inflation.
7. Summarize how the VMware acquisition contributed to revenue changes for Broadcom quarter over quarter.
8. Summarize how Intuit’s latest AI-powered platform, Intuit Assist, is being integrated through its products. Give me a quarter-by-quarter breakdown of the progress. 
9. Summarize all the mergers and acquisitions that happened in 2024 and give a breakdown of how each acquisition impacted earnings.
10. Summarize how AI integration is progressing across each company's products. Give me a quarter-by-quarter breakdown of the progress per company and overall.

We were able to answer up to question 5 with RAG, although, as discussed, question 5 is a little dangerous: with more data,
RAG is not guaranteed to retrieve the right call. Accordingly, let's pick up from there.

In [None]:
# In the last notebook we wrote our docset id to a file so we could pick it up here.
with open("docset_id", "r") as f:
    docset_id = f.read().strip()  # strip any trailing newline

print(docset_id)

## Question 5: What was the first PLTR earnings call where Anduril is mentioned?

Correct answer: Q3 2024 (November 4)

As shown previously, this question can be answered by RAG, but this is mostly due to the fact that our dataset is rather small.
What we'd actually like to do is:

1. get all the records mentioning Anduril
2. sort them by date
3. return the first one

That looks like a simple enough plan. Let's implement it with sycamore! We'll start by reading the DocSet from Aryn.

Quiz: We'll be reading this docset from Aryn a bunch. What can we do to cache a local copy to speed this up?

In [None]:
import sycamore
from sycamore import MaterializeSourceMode
from sycamore.data import Element, Document
context = sycamore.init()

read_me_seymoure = context.read.aryn(docset_id = docset_id)

The documents come out with an "_original_elements" property which contains a copy of the elements. We use this in the UI to render bounding boxes of elements on the 
documents (circumventing any chunking), but for our purposes it is dead weight, so we'll add a transform to remove it.

In [None]:
def remove_original_elements(doc: Document) -> Document:
    doc.properties.pop("_original_elements", None)
    return doc
    
cleaned_docset = read_me_seymoure.map(remove_original_elements)

Now we'll actually answer the question. We'll need to flesh out the plan from above a little bit:

1. Filter to only Palantir documents
2. Filter out any elements that don't contain the string 'anduril'
3. Filter out any documents that now contain zero elements
4. Parse the date property (the LLM generates a string, but we'd like a `datetime` object to sort by)
5. Sort by date
6. Take the first document

In [None]:
from sycamore.transforms import DateTimeStandardizer

palantir_docset_sorted = (
    cleaned_docset
    .filter(lambda doc: doc.properties['entity']['company_ticker'] == 'PLTR')
    .filter_elements(lambda elt: "anduril" in (elt.text_representation or "").lower())
    .filter(lambda doc: len(doc.elements) > 0)
    .map(lambda doc: DateTimeStandardizer.standardize(doc, key_path = ["properties","entity","date"]))
    .sort(descending=False, field="properties.entity.dateTime")
)
answer5_sycamore = palantir_docset_sorted.take(1)[0].properties['entity']['day']
print(answer5_sycamore)
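
To see why step 4 matters, here is a minimal plain-Python sketch of the same filter, parse, sort, take-first pattern. None of this is sycamore code: the records, tickers, and most of the dates are made up for illustration, and the `%B %d, %Y` format is an assumption about what the LLM emits.

```python
from datetime import datetime

# Hypothetical records standing in for documents; tickers and dates are invented.
records = [
    {"ticker": "AAA", "date": "November 4, 2024", "text": "We partnered with Anduril."},
    {"ticker": "AAA", "date": "February 5, 2025", "text": "Anduril came up again."},
    {"ticker": "AAA", "date": "August 5, 2024", "text": "No relevant mentions here."},
    {"ticker": "BBB", "date": "May 1, 2024", "text": "Anduril is a customer."},
]

def parse_date(raw: str) -> datetime:
    # Assumes a "Month day, year" string; real LLM output may need a more forgiving parser.
    return datetime.strptime(raw, "%B %d, %Y")

# Steps 1-3: keep only the right company and only records that mention the keyword.
matches = [
    r for r in records
    if r["ticker"] == "AAA" and "anduril" in r["text"].lower()
]

# Sorting the raw strings is alphabetical, not chronological.
by_string = sorted(matches, key=lambda r: r["date"])
# Steps 4-5: parse first, then sort.
by_datetime = sorted(matches, key=lambda r: parse_date(r["date"]))

# Step 6: take the first record.
first_by_string = by_string[0]["date"]
first_by_datetime = by_datetime[0]["date"]
print(first_by_string, "|", first_by_datetime)
```

In this toy data the string sort picks the February 2025 record because "F" sorts before "N", while the parsed sort picks the November 2024 one.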

## Question 6: List all the companies that mentioned inflation and give me a count of the number of times each of the companies mentioned inflation.

Simple RAG cannot answer this question. I might be wrong, I guess, but I couldn't get it to work. And I think any RAG solution
that does work would be very unnatural. Let's see what this looks like with sycamore.

In [None]:
question6 = "List all the companies that mentioned inflation and give me a count of the number of times each of the companies mentioned inflation."

First off, we'll want to operate on elements, not documents, since we would like to count the number of inflation mentions in each document.
Doing this with sycamore is actually quite simple, although I tend to avoid it where possible as it makes the data model harder to reason about.
Remember: Each member of a DocSet is a Document, and each Document contains several Elements. We're about to make sycamore treat each Element as
a Document.

The `docset.explode()` transform turns every Element into a top-level Document. However, it keeps the original top-level Documents as husks of their
former selves (they contain no elements). Accordingly, we'll throw in this filter at the end to get rid of them.

In [None]:
element_docset = cleaned_docset.explode().filter(lambda doc: "parent_id" in doc)
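
If the explode-then-filter step feels abstract, here is a toy model of it using plain dicts. This is a sketch of the behavior described above, not sycamore's implementation; the `explode` function and the record layout are hypothetical, and it assumes `parent_id` is set only on the exploded elements.

```python
def explode(docs):
    """Toy explode: emit each parent (now empty) plus one record per element."""
    out = []
    for doc in docs:
        # The parent survives, but without its elements.
        out.append({"doc_id": doc["doc_id"], "elements": []})
        for i, elt in enumerate(doc["elements"]):
            # Each element becomes its own top-level record pointing at its parent.
            out.append({
                "doc_id": f"{doc['doc_id']}-{i}",
                "parent_id": doc["doc_id"],
                "text": elt,
                "elements": [],
            })
    return out

# Made-up documents, each holding a list of element texts.
docs = [
    {"doc_id": "call-1", "elements": ["Opening remarks.", "Inflation was a headwind."]},
    {"doc_id": "call-2", "elements": ["Thanks, operator."]},
]

exploded = explode(docs)
# Drop the empty parents by keeping only records that have a parent.
element_records = [d for d in exploded if "parent_id" in d]
print(len(exploded), len(element_records))
```

Two parents and three elements go in, five records come out of `explode`, and the filter leaves the three element records.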

Next we ask an LLM whether each element-now-document mentions inflation, keeping only the ones that do.
We will be left with a DocSet full of quotes that mention inflation, so we can group them by company and
then count, using the `groupby_count` transform.

In [None]:
from sycamore.llms.prompts.default_prompts import LlmFilterMessagesJinjaPrompt
from sycamore.llms.llms import LLMMode
from sycamore.llms.openai import OpenAI, OpenAIModels
from sycamore.functions.tokenizer import OpenAITokenizer

llm = OpenAI(OpenAIModels.GPT_4O)
tk = OpenAITokenizer(model_name=OpenAIModels.GPT_4O.value.name)

inflation_mentions_ds = element_docset.llm_filter(
    llm=llm,
    new_field="inflation_mentioned_confidence",
    prompt = LlmFilterMessagesJinjaPrompt.fork(filter_question="Does this text mention inflation?", use_elements=False),
    tokenizer = tk,
    max_tokens = 80_000,
).groupby_count('properties.entity.company_name')
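
For intuition, the group-and-count step boils down to something like the following plain-Python sketch. The records and company names are placeholders, and `Counter` is just a stand-in for whatever `groupby_count` does internally.

```python
from collections import Counter

# Hypothetical element records that passed the filter; the names are placeholders.
kept = [
    {"company_name": "Company A", "text": "Inflation pressured margins."},
    {"company_name": "Company B", "text": "We saw inflation ease."},
    {"company_name": "Company A", "text": "Wage inflation, and inflation in freight costs."},
]

# One count per surviving element, keyed by company.
counts = Counter(r["company_name"] for r in kept)

# Order by ascending count for display.
rows = sorted(counts.items(), key=lambda kv: kv[1])
print(rows)
```

Note that Company A's second element says "inflation" twice but still counts once: the unit being counted in this sketch is the element, not the word. As far as I can tell the pipeline above behaves the same way, so "mentions" really means "elements that mention inflation".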

In [None]:
# the return of the rich table
import rich.table  # import the submodule explicitly so rich.table.Table is available
inflation_table = rich.table.Table(title="inflation_mentions")
inflation_table.add_column("company")
inflation_table.add_column("mentions")

counts = [(d.properties['count'], d.properties['key']) for d in inflation_mentions_ds.take_all()]
for c, k in sorted(counts):
    inflation_table.add_row(k, str(c))

rich.print(inflation_table)

## Question 9: Summarize all the mergers and acquisitions that happened in 2024 and give a breakdown of how each acquisition impacted earnings.

Correct answer: A summary of all the mergers and acquisitions. I ain't writing that out.

Plan:

1. Filter elements by whether "merger" or "acquisition" is in the text
2. Summarize everything.

In [None]:
# First, a filter_elements. This is fairly simple.
from sycamore.data import Element

def filter_mna(elt: Element) -> bool:
    # text_representation can be None, so fall back to an empty string
    text = (elt.text_representation or "").lower()
    return "acquisition" in text or "merger" in text

mna_filtered_ds = cleaned_docset.filter_elements(filter_mna)



### Summarize

Sycamore exports a function called `summarize_data` which attempts to summarize an entire DocSet. You can think of this as similar to a RAG operator
except it works on data larger than the context window might allow (by hierarchically summarizing summaries). To call it, pass in a docset and a 
Summarizer like so:

In [None]:
from sycamore.transforms.summarize import MultiStepDocumentSummarizer, EtCetera
from sycamore.query.execution.operations import summarize_data
from sycamore.functions.tokenizer import OpenAITokenizer
from sycamore.llms.llms import LLMMode


oaitk = OpenAITokenizer(OpenAIModels.GPT_4O.value.name, max_tokens=100_000)

# Some parameters that will go into the prompt - the question you want the llm to answer, and a description of
# the data being provided
question = "For each of the companies mentioned please summarize the impact of mergers and acquisitions on earnings."
data_desc = "Acquisition Earnings"

# Construct the summarizer:
# llm_mode tells it to make its llm calls asynchronously
# fields is a list of fields to include for each document. a list terminating with the sentinel value EtCetera means
#     "all fields, but any that were specified go first"
# tokenizer is needed to determine how many documents can fit in a batch.
summarizer = MultiStepDocumentSummarizer(
    llm=llm, 
    llm_mode=LLMMode.ASYNC, 
    question=question, 
    data_description=data_desc, 
    fields=[EtCetera],
    tokenizer=oaitk
)

# Summarize:
summary = summarize_data(
    llm=llm,
    question=question,
    data_description=data_desc,
    input_data=[mna_filtered_ds],
    docset_summarizer=summarizer
)

print(summary)
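
The "summarizing summaries" idea can be sketched in a few lines of plain Python. This is not how `MultiStepDocumentSummarizer` is actually implemented; it is a minimal illustration that assumes a crude word-count tokenizer and uses a hypothetical `fake_summarize_batch` in place of an LLM call.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())

def batch_by_budget(texts, max_tokens):
    """Greedily pack texts into batches whose token totals stay within max_tokens."""
    batches, current, used = [], [], 0
    for t in texts:
        n = count_tokens(t)
        if current and used + n > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(t)
        used += n
    if current:
        batches.append(current)
    return batches

def hierarchical_summarize(texts, summarize_batch, max_tokens):
    """Summarize batch by batch, then summarize the summaries, until one remains."""
    if not texts:
        return ""
    while len(texts) > 1:
        batches = batch_by_budget(texts, max_tokens)
        if len(batches) == len(texts):
            # No two texts fit together; use one oversized batch so the loop ends.
            batches = [texts]
        texts = [summarize_batch(b) for b in batches]
    return texts[0]

calls = []

def fake_summarize_batch(batch):
    # Hypothetical "LLM": records the call and keeps the first word of each text.
    calls.append(batch)
    return " ".join(t.split()[0] for t in batch)

texts = ["alpha one two", "beta one two", "gamma one two", "delta one two"]
result = hierarchical_summarize(texts, fake_summarize_batch, max_tokens=6)
print(result, len(calls))
```

With a budget of 6 tokens the four texts are summarized in two batches, and the two resulting summaries are then summarized together, for three calls in total.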

## Question 10: Summarize how AI integration is progressing across each company's products.

Plan:
1. llm_map to get how AI is being integrated into the company's products
2. llm_filter "Is artificial intelligence being discussed?"
3. summarize_data: pointing `fields` at the date, company name, and integration summary properties

See if you can get this one to work. Good Luck!

I've started it with an example of what the `llm_map` might look like. I'm now realizing `llm_map` is new
to this tutorial. `llm_map` takes a `prompt`, an `llm`, and an `output_field`. It renders the prompt for each
document, calls the LLM, and stores the response in `doc.properties[output_field]`.

More documentation on prompts is [here](https://sycamore.readthedocs.io/en/stable/sycamore/APIs/prompts.html)

In [None]:
from sycamore.llms.prompts.prompts import JinjaPrompt

# JinjaPrompt parametrizes a prompt using the Jinja templating system. Each template gets a reference
#   to `doc` which is the document being parametrized.
# There is also a JinjaElementPrompt which gets references to `elt` and `doc` (the element and the
#   element's parent document)
ai_integration_prompt = JinjaPrompt(
    system = "You are a banana",
    user = """
    You are given a transcript of an earnings call for {{ doc.properties['entity']['company_name'] }}.
    Please generate a summary of how AI is being integrated into the company's products.

    Transcript:
    {% for elt in doc.elements %}
    {{ elt.text_representation }}
    {% endfor %}
    """
)

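# DocSets are lazy: keep a handle on the result so later transforms can build on it.
ai_integration_ds = cleaned_docset.llm_map(prompt=ai_integration_prompt, llm=llm, output_field="ai_integration_summary")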
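
If it helps to see the mechanics without any LLM involved, here is a toy version of the render, call, store loop. It is a sketch only: the `toy_llm_map` function, the `fake_llm` stub, and the dict-shaped documents are all hypothetical, it uses `str.format` rather than Jinja, and unlike a real DocSet transform it runs eagerly and mutates its input.

```python
def toy_llm_map(docs, template, llm, output_field):
    """Render the template for each doc, call the llm, store the reply on the doc."""
    for doc in docs:
        prompt = template.format(
            company=doc["properties"]["company_name"],
            transcript="\n".join(doc["elements"]),
        )
        doc["properties"][output_field] = llm(prompt)
    return docs

def fake_llm(prompt: str) -> str:
    # Stand-in for a model call: just reports how long the prompt was.
    return f"summary of {len(prompt.splitlines())} prompt lines"

template = "Summarize AI integration for {company}.\nTranscript:\n{transcript}"

# A made-up document with two element texts.
docs = [
    {"properties": {"company_name": "Example Corp"}, "elements": ["Line one.", "Line two."]},
]

mapped = toy_llm_map(docs, template, fake_llm, output_field="ai_integration_summary")
print(mapped[0]["properties"]["ai_integration_summary"])
```

The point is only that each document gets its own rendered prompt and its own stored response, which is what makes the output usable by a later filter or summarize step.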