In [0]:
%pip install sturdy-stats-sdk


In [0]:
# pip install sturdy-stats-sdk pandas numpy plotly

from sturdystats import Index, Job
import pandas as pd
import numpy as np

from plotly import express as px
import openai
import json


API_KEY = "PublicIndex_NoKeyNeeded" # Replace with your own Api Key to query your own indices
index_id = "index_05a7cb07da764f0f81397b39ce65ab06" ## Optionally Replace with your own if you want
OPENAI_API_KEY= "XXX"

gpt = openai.OpenAI(api_key=OPENAI_API_KEY)
index = Index(API_key=API_KEY, id=index_id)

quarter = "2024Q4"
prevquarter = "2024Q3"
ticker = "GOOG"



## Section 1: Single Document Summaries

#### Latest Earnings Call

In [0]:
getcall = lambda q, t: index.query(filters=f"quarter='{q}' AND ticker='{t}'", context=100000, limit=1)["docs"][0]["text"]
latest_earnings_call = getcall(quarter, ticker)

print(latest_earnings_call[:1000])
print("...")
print(latest_earnings_call[-1000:])

### GPT

#### Helper Function

This function helps track the cost of our input and output tokens for a given model.

In [0]:
# USD per token for each model, as [input (prompt) rate, output (completion) rate]
COSTS = {
  "gpt-4o-mini": [ .15/1e6, .6/1e6],
  "gpt-4o": [2.5/1e6, 10/1e6]
}
def basicGPT(prompt, model="gpt-4o-mini", response_format={"type":"json_object"}):
  res = gpt.chat.completions.create(
    model=model,
    response_format=response_format,
    messages=[{"role": "user", "content":prompt}]
  )
  pt, ct = res.usage.prompt_tokens, res.usage.completion_tokens
  cost = pt*COSTS[model][0] + ct*COSTS[model][1]

  res = res.choices[0].message.content
  if response_format is not None:
    res = json.loads(res)
  return res, pt+ct, cost
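
To make the cost arithmetic concrete, here is a minimal standalone sketch of the same calculation. The helper name `estimate_cost` and the token counts are hypothetical and chosen only for illustration; the rates are copied from the `COSTS` table above.

```python
# Per-token prices in USD as (input, output), copied from the COSTS table above.
COSTS = {
    "gpt-4o-mini": (0.15 / 1e6, 0.60 / 1e6),
    "gpt-4o": (2.5 / 1e6, 10 / 1e6),
}

def estimate_cost(model, prompt_tokens, completion_tokens):
    """Return the USD cost of one call given its token counts (hypothetical helper)."""
    input_rate, output_rate = COSTS[model]
    return prompt_tokens * input_rate + completion_tokens * output_rate

# Hypothetical token counts, not measured from any real call.
for model in COSTS:
    print(model, round(estimate_cost(model, 20_000, 1_000), 6))
```

Because the input and output rates differ, a prompt that is mostly input (such as a full earnings call) is dominated by the input rate.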

#### Basic Prompt Engineering

In [0]:
prompt = f"""
You are a financial analyst.
You have been given an earnings call from {ticker}'s {quarter}.
Provide a summary with excerpts extracted from the earnings call. 
Provide the summary in a json file under the key 'summary'. The summary itself should be a single formatted string containing structured bullet points. This summary can be relatively verbose. Pull out anything that is interesting with examples. Be as specific as possible

EARNINGS CALL
{latest_earnings_call}
"""
res, tokens, cost = basicGPT(prompt, model="gpt-4o")

print("Tokens:", tokens)
print("Cost:", cost)

In [0]:
print(res["summary"])

### Sturdy Statistical Summary

#### Topic Diff cuts through the noise
The topic diff API compares the dataset selected by q1 (in this case, Google's latest earnings call) to a second specified subset (if none is provided, it uses the dataset as a whole). This automatically filters out boilerplate parts of the conversation and surfaces the most distinctive topical content in the call. It also opens the door to quantitative comparisons, which enable granular, user-controlled summaries.

In [0]:

df = pd.DataFrame(index.topicDiff(f"quarter='{quarter}' AND ticker='{ticker}'", cutoff=1.0, limit=100)["topics"])
df = df.sort_values(["confidence", "prevalence"], ascending=False)
df[["topic_id", "short_title", "prevalence", "confidence"]]
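
The intuition behind contrasting a subset with a baseline can be sketched with a toy table. This is not the Sturdy Statistics algorithm: the API returns its own `prevalence` and `confidence` values, and how they are computed is not described here. The topic shares, the `lift` column, and the 1.5 threshold below are all assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical topic shares (fractions of text) for one call and for the whole corpus.
toy = pd.DataFrame({
    "short_title": ["Safe harbor boilerplate", "Cloud revenue", "AI in consumer devices"],
    "subset_prevalence": [0.10, 0.15, 0.12],
    "baseline_prevalence": [0.10, 0.05, 0.01],
})

# A lift above 1 means the topic is over-represented in the subset relative to the baseline.
toy["lift"] = toy["subset_prevalence"] / toy["baseline_prevalence"]

# Keep only topics that stand out; boilerplate that appears everywhere has lift near 1 and drops out.
distinct = toy[toy["lift"] > 1.5].sort_values("lift", ascending=False)
print(distinct[["short_title", "lift"]])
```

In this toy data the boilerplate topic is equally common in both sets, so it is filtered out, while the two topics that are unusually prominent in the subset remain.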

#### Extractive Summaries
Our topic diff provides a high level list of topics. For GPT outputs, this is often where analysis ends. With Sturdy Statistics, a topic is just the beginning.

We can extract all the excerpts that are associated with any one or more topics. We can perform this extraction on the corpus as a whole or, as in this case, on a single document.

In [0]:
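# df.sample() returns a one-row DataFrame; take the row itself so its fields are scalars
row = df.sample().iloc[0]
docs = index.query(topic_id=int(row.topic_id), filters=f"quarter='{quarter}' AND ticker='{ticker}'")["docs"]
print(row.short_title)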
for doc in docs:
  print(doc["text"])

#### Synergy
We can combine our topic diff meta-analysis with our granular extractions to power what we call __Topic Augmented Retrieval__, or __TAG__ for short.

__TAG__ enables us to get statistically driven summaries at a fraction of the cost of GPT. Instead of using GPT for extraction, analysis, and formatting, we can now use it exclusively as a formatting engine. This lets us both reduce our input token count and use smaller, cheaper LLMs while getting more complete and arguably better results.

Unlike the previous GPT centered approach, we have full visibility over the actual data and can easily fact check and cite the inputs. It is important to note that while RAG also enables this fact checking, __we cannot perform RAG on this task because there is no input query__. 

We will compare TAG to RAG in Sections 2 and 3 on a set of problems better suited to RAG use cases.

In [0]:
def summarizeRow(row):
  topic_id = row['topic_id']
  docs = index.query(topic_id=topic_id, filters=f"quarter='{quarter}' AND ticker='{ticker}'", override_args=dict(max_excerpts_per_doc=5))["docs"]
  text = "\n-\n".join([doc['text'] for doc in docs])
  prompt = f"""You are a financial analyst.
  You have been given a set of EXCERPTS from {ticker}'s {quarter} earnings call.
  These excerpts pertain to {row['short_title']}. Summarize the content relevant to {row['short_title']}.
  Provide a short_title and a brief summary for the EXCERPTS provided below.
  Return the output as a valid json dictionary containing the keys short_title and summary.

  EXCERPTS
  {text}
  """
  res, tokens, cost = basicGPT(prompt)
  res["examples"] = text
  res["cost"] = cost 
  res["tokens"] = tokens
  return res

summaries = [ summarizeRow(row) for row in df.to_dict("records") ]
for key in ["summary", "examples", "cost", "tokens"]:
  df[key] = [ s[key] for s in summaries ]

print("Tokens:", df.tokens.sum())
print("Cost:", df.cost.sum())

for row in df.to_dict("records"):
  print(row["short_title"])
  print("Prevalence:", row["prevalence"])
  print("Topic Id:", row["topic_id"])
  print(row["summary"])
  #print(row["examples"])
  print("-"*20)

In [0]:
for d in docs:
  print(d["text"])
  print("-"*20)
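
Because each summary is stored next to the excerpts it was built from, a simple fact check is possible. The sketch below flags any double-quoted phrase in a summary that does not appear verbatim in the excerpts. The helper name `unsupported_quotes` and the sample strings are hypothetical; a real check would pass in values from the `summary` and `examples` columns built above.

```python
import re

def unsupported_quotes(summary, excerpts):
    """Return double-quoted phrases in `summary` that do not appear verbatim in `excerpts`."""
    quotes = re.findall(r'"([^"]+)"', summary)
    # Normalize whitespace and case so line breaks in the excerpts do not cause false alarms.
    haystack = " ".join(excerpts.split()).lower()
    return [q for q in quotes if " ".join(q.split()).lower() not in haystack]

# Hypothetical data shaped like the `examples` and `summary` columns.
excerpts = "We are moving the Gemini app team to Google DeepMind.\n-\nCloud revenue grew this quarter."
summary = 'Management said it is "moving the Gemini app team" and cited "record hardware sales".'

print(unsupported_quotes(summary, excerpts))
```

This only catches direct quotations; paraphrased claims still need a human to compare the summary against the excerpts.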

#### Recursive Summaries
If you want an even higher-level report, you can add one more layer of summarization above the previous report. Because we have greatly simplified our prompts, asking GPT only to rephrase excerpts rather than find them itself, the GPT summaries are much more reliable and more amenable to recursive use.

In [0]:
def summarizeSummaries(df):
  # Join the per-topic entries into one block of text; interpolating the list directly would insert its Python repr
  text = "\n\n".join( f"TITLE: {row['short_title']}\nPREVALENCE: {row['prevalence']}\nSUMMARY: {row['summary']}" for row in df.to_dict("records") )
  prompt = f"""You are a financial analyst.
  You have been given a set of summaries from {ticker}'s {quarter} earnings call.
  Each entry has a title, a prevalence (the share of the call that the topic covers), and a summary.
  Given the information in the TITLE, PREVALENCE and SUMMARY fields provided about the earnings call, provide a structured overview of the events of the past quarter.
  Provide the summary in a json file under the key 'summary'. The summary itself should be a single formatted string of 1-3 paragraphs. This summary can be relatively verbose. Pull out anything that is interesting with examples. Be as specific as possible.

  SUMMARIES
  {text}
  """
  return basicGPT(prompt)
res, tokens, cost = summarizeSummaries(df)
print("Tokens:", tokens)
print("Cost:", cost)

print("OVERVIEW")
print(res["summary"])
print("-"*50, "\n")
for row in df.to_dict("records"):
  print(row["short_title"])
  print("Topic Id:", row["topic_id"])
  print(row["summary"])
  #print(row["examples"])
  print("-"*20, "\n")

## Section 2: Trends over Time

In the above report, topic 87 (AI in Consumer Devices) caught my eye.

```
AI in Consumer Devices
Topic Id: 87
During the Q4 earnings call, Google discussed its ongoing efforts to enhance the performance and capabilities of its AI models, particularly the Gemini models, which are now utilized across all major products with over 2 billion monthly users, including Google Maps. The company announced plans to introduce advanced AI experiences, particularly through Project Astra, by 2025. Additionally, Gemini is being made available to developers, as highlighted by its integration with GitHub Copilot. Google is restructuring its teams to improve agility in deploying new models, including moving the Gemini app team to Google DeepMind. These changes aim to streamline operations and accelerate advancements in AI technologies.
```

I want to understand how Google's approach to, and results in, consumer devices have changed over the past few quarters.

### Load The Data

In [0]:
docs = index.query(filters="ticker='GOOG'", context=100000, sort_by="quarter")["docs"]
quarters = [ d["metadata"]["quarter"] for d in docs ]
print(quarters)

### GPT 

In [0]:
text = "\n\n---------\n\n".join([ f"QUARTER: {d['metadata']['quarter']}\nCONTENT: {d['text']}" for d in docs ])

prompt = f"""
You are a financial analyst.
You have been given Google's earnings call from the following quarters: {quarters}.
We are interested in understanding how Google's approach to `AI IN CONSUMER DEVICES` has changed over quarters.
Provide an overview and a quarter by quarter summary of how this approach has changed.

Each earnings call has the following structure:
QUARTER: 2024Q1
CONTENT: The full content of the earnings call
This field will be very long.

Provide the summary in a json file under the key 'summary'. The summary itself should be a single formatted string containing structured bullet points. This summary can be relatively verbose. Pull out anything that is interesting with examples. Be as specific as possible


{text}
"""
res, tokens, cost = basicGPT(prompt, model="gpt-4o")

print("Tokens:", tokens)
print("Cost:", cost)

In [0]:
print(res["summary"])

In [0]:
### GPT