In [71]:
# pip install sturdy-stats-sdk pandas numpy plotly

from sturdystats import Index, Job
import pandas as pd
import numpy as np

API_KEY = None

### Load Earnings Call Dataset

For those who do not follow finance, an earnings call is "a conference call between the management of a public company, analysts, investors, and the media to discuss the company’s financial results during a given reporting period, such as a quarter or a fiscal year. An earnings call is usually preceded by an earnings report, which contains summary information on financial performance for the period." [1]

This dataset contains earnings calls for Google, Apple, Meta, Microsoft, and Nvidia from the years 2023 and 2024.

In [23]:
df = pd.read_parquet("data/tech_earnings_calls_oct_2024.parquet")
df.head()

Unnamed: 0,ticker,quarter,year,doc,published,title,author,priceDelta
0,GOOG,2024Q1,2023,"Operator: Welcome, everyone. Thank you for sta...",2024-01-30,GOOG 2024Q1,GOOG,-0.091346
1,GOOG,2023Q4,2023,"Operator: Welcome, everyone. Thank you for sta...",2023-10-24,GOOG 2023Q4,GOOG,-0.083095
2,GOOG,2023Q3,2023,"Operator: Welcome, everyone. Thank you for sta...",2023-07-25,GOOG 2023Q3,GOOG,0.061722
3,GOOG,2023Q2,2023,"Operator: Welcome, everyone. Thank you for sta...",2023-04-25,GOOG 2023Q2,GOOG,-0.021557
4,GOOG,2024Q4,2024,"Operator: Welcome, everyone. Thank you for sta...",2024-10-29,GOOG 2024Q4,GOOG,0.045372


In [24]:
df.ticker.unique(), np.array(sorted(df.quarter.unique()))

(array(['GOOG', 'AAPL', 'META', 'MSFT', 'NVDA'], dtype=object),
 array(['2022Q2', '2022Q3', '2022Q4', '2023Q1', '2023Q2', '2023Q3',
        '2023Q4', '2024Q1', '2024Q2', '2024Q3', '2024Q4'], dtype='<U6'))

#### A brief preview of what the call looks like
The transcripts of these calls tend to run 10 to 40 pages, or 8,000 to 30,000 tokens each.

For a full call see: https://abc.xyz/assets/bd/7b/d57831684953be8bcc2c5a42aee8/2024-q2-earnings-transcript.pdf

In [27]:
print(df.doc.iloc[0][:2000])

Operator: Welcome, everyone. Thank you for standing by for the Alphabet Fourth Quarter 2023 Earnings Conference Call. [Operator Instructions] 
 I would now like to hand the conference over to your speaker today, Jim Friedland, Director of Investor Relations. Please go ahead. 
James Friedland: Thank you. Good afternoon, everyone, and welcome to Alphabet's Fourth Quarter 2023 Earnings Conference Call. With us today are Sundar Pichai, Philipp Schindler and Ruth Porat. 
 Now I'll quickly cover the safe harbor. Some of the statements that we make today regarding our business, operations and financial performance may be considered forward-looking. Such statements are based on current expectations and assumptions that are subject to a number of risks and uncertainties. Actual results could differ materially. Please refer to our Forms 10-K and 10-Q, including the risk factors discussed in our upcoming Form 10-K filing for the year ended December 31, 2023. We undertake no obligation to update a

#### Create an index and upload data



In [31]:
index = Index(API_key=API_KEY, name="demo_tech_earnings_calls__v1")
index.get_status()

Found an existing index with id="index_05a7cb07da764f0f81397b39ce65ab06".


{'index_id': 'index_05a7cb07da764f0f81397b39ce65ab06',
 'name': 'demo_tech_earnings_calls__v1',
 'state': 'untrained'}

### Upload the data
The upload accepts a list of dictionaries (JSON records format). The only requirement is that the text you wish to index appears under the `doc` field.

The upload API supports partial updates and metadata-only updates. If you provide a `doc_id` field, you can update the `doc` content of an existing document. If no `doc_id` is provided, one is created from the SHA hash of the `doc` content.
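As a rough illustration of the record format described above, the sketch below builds two records and derives a content-based ID for the one that lacks a `doc_id`. The exact hashing scheme used by the service is not specified here, so `content_doc_id` and its use of SHA-256 are assumptions made only for illustration, as is the example ID `goog-2023q4`.

```python
import hashlib

def content_doc_id(doc: str) -> str:
    """Hypothetical stand-in: derive a stable ID from the document text."""
    return hashlib.sha256(doc.encode("utf-8")).hexdigest()

records = [
    # `doc` is the only required field; the other keys ride along with it.
    {"doc": "Operator: Welcome, everyone.", "ticker": "GOOG", "quarter": "2024Q1"},
    # An explicit `doc_id` lets a later upload target the same document.
    {"doc_id": "goog-2023q4", "doc": "Operator: Welcome, everyone.",
     "ticker": "GOOG", "quarter": "2023Q4"},
]

# Use the supplied ID when present, otherwise fall back to the content hash.
ids = [r["doc_id"] if "doc_id" in r else content_doc_id(r["doc"]) for r in records]
```

Because the fallback ID depends only on the text, uploading identical content twice would map to the same ID under this scheme.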

In [33]:
res = index.upload(df.to_dict("records"))

Uploading data to UNTRAINED index for training.
uploading data to index...
committing changes to index "index_05a7cb07da764f0f81397b39ce65ab06"...

#### We can immediately query any uploaded data via standard SQL
At the moment, this capability is not particularly interesting. However, once the unstructured data has been statistically indexed, it becomes unified with the existing structured data and can be analyzed with standard quantitative methods.

In [66]:
index.queryMeta("SELECT quarter, count(*) as c FROM doc_meta GROUP BY quarter ORDER BY quarter" )

[{'quarter': '2022Q2', 'c': 1},
 {'quarter': '2022Q3', 'c': 1},
 {'quarter': '2022Q4', 'c': 2},
 {'quarter': '2023Q1', 'c': 3},
 {'quarter': '2023Q2', 'c': 5},
 {'quarter': '2023Q3', 'c': 5},
 {'quarter': '2023Q4', 'c': 5},
 {'quarter': '2024Q1', 'c': 5},
 {'quarter': '2024Q2', 'c': 4},
 {'quarter': '2024Q3', 'c': 4},
 {'quarter': '2024Q4', 'c': 2}]
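Since the query returns a plain list of dictionaries, it drops straight into pandas. The sketch below copies the rows from the output above rather than calling the service, so it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Rows copied from the query output above.
rows = [
    {"quarter": "2022Q2", "c": 1}, {"quarter": "2022Q3", "c": 1},
    {"quarter": "2022Q4", "c": 2}, {"quarter": "2023Q1", "c": 3},
    {"quarter": "2023Q2", "c": 5}, {"quarter": "2023Q3", "c": 5},
    {"quarter": "2023Q4", "c": 5}, {"quarter": "2024Q1", "c": 5},
    {"quarter": "2024Q2", "c": 4}, {"quarter": "2024Q3", "c": 4},
    {"quarter": "2024Q4", "c": 2},
]

# One row per quarter, indexed by the quarter label.
counts = pd.DataFrame(rows).set_index("quarter")["c"]

total_calls = int(counts.sum())
# Quarters in which all five companies have a call on record.
full_quarters = counts[counts == counts.max()].index.tolist()
```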

### Kick off training
No parameters are required for training. However, because our model is a hierarchical Bayesian model, we have a few levers we can pull to point it in the right direction based on prior knowledge of our dataset. The nature of our dataset lets us give the model additional information via two parameters: `doc_hierarchy` and `regex_paragraph_splitter`.

`doc_hierarchy` lets the user set the high-level model structure based on existing metadata. In our case, we have earnings calls from a variety of companies and a variety of quarters. The calls likely look vastly different across companies. Within a company, calls likely also differ across quarters, though I would guess not to the extent they differ across companies. Thus we set the model to hierarchically split the data first across companies, then across quarters. Note that this is not the same as stratifying the data; see Appendix A for more detail.

`regex_paragraph_splitter` lets the user provide a regex string that splits a document in a semantically meaningful way. This allows our model to index data not only across documents but also within a document. In this case, earnings call paragraphs are separated by newlines.
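To get a feel for what a newline splitter does to a transcript, here is a minimal local sketch using Python's `re` module. The sample text is a shortened version of the transcript preview above, and the stripping of whitespace and empty pieces is an assumption of this sketch; the service's own splitting may handle those details differently.

```python
import re

regex_paragraph_splitter = "\n"

# Shortened sample in the style of the transcript preview.
sample = (
    "Operator: Welcome, everyone. Thank you for standing by.\n"
    " I would now like to hand the conference over to your speaker today.\n"
    "James Friedland: Thank you. Good afternoon, everyone."
)

# Split on the pattern, then drop empty pieces and surrounding whitespace.
paragraphs = [
    piece.strip()
    for piece in re.split(regex_paragraph_splitter, sample)
    if piece.strip()
]

for i, paragraph in enumerate(paragraphs, start=1):
    print(f"P{i}: {paragraph}")
```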


In [69]:
doc_hierarchy = ["quarter", "ticker"]
regex_paragraph_splitter = "\n"
job = index.train(dict(doc_hierarchy=doc_hierarchy, regex_paragraph_splitter=regex_paragraph_splitter), wait=False)
job.get_status()

## Crunching the numbers
Because we set `wait=False`, the train function call returned a job object. This object provides a mechanism to poll the job and check its status. Training takes anywhere from minutes to 1-2 days, scaling linearly with the number of tokens in the dataset.
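If you would rather poll than block, a generic loop along these lines is one option. This is only a sketch: `poll_until` is a hypothetical helper, and the `"status"` key and the state names in the stand-in below are invented so the example runs without the service. The real shape of the dictionary returned by `job.get_status()` should be inspected before writing the `is_finished` check.

```python
import time
from typing import Callable

def poll_until(get_status: Callable[[], dict],
               is_finished: Callable[[dict], bool],
               interval_s: float = 30.0,
               max_checks: int = 1000) -> dict:
    """Call get_status() every interval_s seconds until is_finished(status) holds."""
    for _ in range(max_checks):
        status = get_status()
        if is_finished(status):
            return status
        time.sleep(interval_s)
    raise TimeoutError("job did not finish within the allotted checks")

# Stand-in for a job object: the key and state names here are made up.
_fake_states = iter(["PENDING", "RUNNING", "SUCCEEDED"])
final = poll_until(
    get_status=lambda: {"status": next(_fake_states)},
    is_finished=lambda s: s["status"] == "SUCCEEDED",
    interval_s=0.0,
)
```

With a real job you would pass `job.get_status` as the first argument. When blocking is acceptable, `job.wait()`, used later in this notebook, is simpler.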

We are training a custom model tuned to your data and your data alone. This is not a neural network but a Bayesian estimator that hierarchically structures your data and organizes it into a discrete set of topics, applied to each word, sentence, paragraph, document, and metadata structure you provide. This statistical structure does not hallucinate, naturally debiases language (and has configurations to debias around metadata), and opens the door to running quantitative analysis on unstructured data.

Once a model is trained, all future uploads are automatically indexed in the same statistical structure.

While you wait, you can explore a model that has already been trained on this dataset here: https://sturdystatistics.com/analyze?folder_id=index_05a7cb07da764f0f81397b39ce65ab06&comp_fields=ticker,quarter&max_excerpts_per_doc=5&bar_plot_fields=ticker,quarter


# Explore your data!

Once the job is done, you can explore your data on our website's dashboard!


In [79]:
job.wait()
def getURL(index_id, api_key):
    url = f"https://sturdystatistics.com/analyze?folder_id={index_id}&comp_fields=ticker,quarter&max_excerpts_per_doc=5&bar_plot_fields=ticker,quarter"
    if api_key is not None:
        url += f"&api_key={api_key}"
    return url
print("index_id:", index.id)
getURL(index.id, API_KEY)

index_id: index_05a7cb07da764f0f81397b39ce65ab06


'https://sturdystatistics.com/analyze?folder_id=index_05a7cb07da764f0f81397b39ce65ab06&comp_fields=ticker,quarter&max_excerpts_per_doc=5&bar_plot_fields=ticker,quarter'
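As an alternative to assembling the query string by hand, the same URL can be built with the standard library's `urlencode`, which also escapes any unusual characters in the values. This is a sketch of that approach; `build_dashboard_url` is an illustrative name, and the parameter names are taken from the URL above.

```python
from urllib.parse import urlencode

def build_dashboard_url(index_id, api_key=None):
    """Build the dashboard URL from a dict of query parameters."""
    params = {
        "folder_id": index_id,
        "comp_fields": "ticker,quarter",
        "max_excerpts_per_doc": 5,
        "bar_plot_fields": "ticker,quarter",
    }
    if api_key is not None:
        params["api_key"] = api_key
    # safe="," keeps the comma-separated field lists readable instead of %2C.
    return "https://sturdystatistics.com/analyze?" + urlencode(params, safe=",")

url = build_dashboard_url("index_05a7cb07da764f0f81397b39ce65ab06")
print(url)
```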

## Next Steps

This notebook is part of a series:
1. Upload the data
2. Build Visualizations
3. Throw out your dirty old RAGs: Use Topic Augmented Generation to supercharge LLMs.

### Sources
[1] https://www.investopedia.com/terms/e/earnings-call.asp

[2] http://www.stat.columbia.edu/~gelman/book/BDA3.pdf

# APPENDIX

# A.
 

The `doc_hierarchy` parameter lets us tell the model about the high-level structure of the dataset. In machine learning, there is often a trade-off between stratification and sample size. For example, say you polled 50,000 individuals at random about who they are voting for, and also recorded their state, age, gender, and ethnicity. This is a massive sample for a poll (about 50x larger than a standard poll), about as large a polling dataset as you will find. You could lump everyone together and keep a sample size of 50,000, but that doesn't provide much information beyond a raw number: we cannot adjust for demographics, voter turnout likelihood, state distribution, etc.

But let's say we want to know how women are voting. If we follow standard machine learning techniques, we need to stratify our dataset (i.e., throw away any data that doesn't match the sample we care about). Filtering out men reduces our sample size to ~25,000. Let's say we specifically want to know how women in Pennsylvania will vote. This reduces our sample to ~750. Let's say we specifically want to know how Latina women are voting. That reduces our sample size to ~150. Let's say we want to know how Latina women under 30 are voting. Our sample size is now down to ~30. We had a giant dataset, and we are struggling to answer the relatively basic question "How are Latina women under 30 in Pennsylvania voting?"
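To put numbers on what each cut costs, the sketch below computes the textbook 95% margin of error for a proportion at each sample size mentioned above. It assumes simple random sampling, the normal approximation, and the worst case p = 0.5; these are simplifications for illustration rather than a description of how any particular poll is analyzed.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Normal-approximation margin of error for a proportion p with sample size n."""
    return z * math.sqrt(p * (1 - p) / n)

# Sample sizes from the stratification walk-through above.
strata = [
    ("all respondents", 50_000),
    ("women", 25_000),
    ("women in PA", 750),
    ("Latina women in PA", 150),
    ("Latina women under 30 in PA", 30),
]

for label, n in strata:
    moe = 100 * margin_of_error(n)
    print(f"{label:<28} n={n:>6}  +/- {moe:.1f} points")
```

The uncertainty grows with the inverse square root of the sample size, so each filter makes the remaining estimate noticeably noisier.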

You can obviously solve this problem by simply polling more. For example, with a sample size of 500,000, we would have ~300 Latina women under 30 in PA. That is still not a great sample size, and the poll is now 10x more expensive. And by the time you finish polling, the race may have changed drastically.

Another solution would be to poll to answer a specific question. If we are interested in Latina women under 30 in PA, let's poll only them. But as soon as you are interested in a new question, you have to start over from scratch.

Instead of working harder or spending more money, why not use more clever statistics? Hierarchy allows one to balance aggregated knowledge with stratified breakdowns. Instead of splitting our data across a number of different models, we can build the structure of the data into a single model.

In this case:
```
                US Voters
                /   |    \
State:      NY     PA    CA ...
                 /    \
Gender:        Men     Women ...
                    /    |     \
Ethnicity        White  Black  Latin ...
                             /    |    \
Age                        18-30  30-60  60+ 
```

At the lowest level of the model, you can plug in a stratified model that computes, for each class, the voting preference of the most specific possible demographic. One level up, the model aggregates those subdivisions and computes an estimate for the demographic as a whole. For example, it estimates the preference of Latina women under 30 from the data, then aggregates those statistics across all age groups to estimate the preference of all Latina women.

However, this is where the model gets interesting. The graph is bidirectional: information about Latina women aged 18-30 informs the Latina estimator, but the Latina estimator also informs the stratified estimator for Latina women under 30. As a result, our smallest estimator benefits directly from data about all Latina women and thus improves its confidence. For example, if 20 of the 30 Latina women under 30 in our survey planned to vote Republican and 10 Democrat, a simple stratified model would report that Latina women under 30 are 66% likely to vote Republican, ±5. However, if only 60 of the 150 Latina women polled planned to vote Democrat, our hierarchical model would leverage that information to temper its original prediction, weighing each vote from outside the demographic less than a vote from inside it. Going up the chain, the responses of the 750 women polled in PA (say, a 450 to 300 split) provide additional data for the hierarchical model to leverage, though it weighs them less than responses from Latina women specifically. This bidirectional chain of reasoning happens at every node of the graph until the model reaches a stable equilibrium and we have squeezed as much information as we can out of our dataset.
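The tempering effect can be sketched with a simple beta-binomial style shrinkage, using the numbers from the paragraph above. This is a deliberately simplified stand-in, not the model trained in this notebook: `pooled_estimate` is a hypothetical helper, and `prior_strength=20` is an arbitrary choice, whereas a full hierarchical Bayesian model would infer how strongly to pool from the data itself. It also assumes every respondent picks one of the two parties, so 60 Democrats out of 150 implies 90 Republicans.

```python
def pooled_estimate(successes, n, parent_rate, prior_strength):
    """Shrink a subgroup rate toward its parent's rate.

    The parent rate is treated as `prior_strength` pseudo-observations,
    so small subgroups are pulled harder toward the parent than large ones.
    """
    return (successes + prior_strength * parent_rate) / (n + prior_strength)

# Latina women under 30: 20 of 30 plan to vote Republican.
raw_rate = 20 / 30

# All Latina women polled: 60 of 150 Democrat, hence 90 of 150 Republican.
parent_rate = 90 / 150

# Arbitrary pooling strength, chosen only for illustration.
shrunk = pooled_estimate(20, 30, parent_rate, prior_strength=20)

print(f"stratified estimate:   {raw_rate:.3f}")
print(f"parent (all Latina):   {parent_rate:.3f}")
print(f"partially pooled:      {shrunk:.3f}")
```

The pooled value lands between the subgroup's raw rate and the parent's rate, which is the "tempering" described above. Raising `prior_strength` pulls it closer to the parent.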

In our current dataset, we have earnings calls from a variety of companies and a variety of quarters. The calls likely look vastly different across companies. Within a company, calls likely also differ across quarters, though I would guess not to the extent they differ across companies. The dataset is relatively small: 36 calls, ~400k tokens altogether. Unlike for large language models or other neural networks, this is plenty of data for our model. If we wanted, we could train a model from scratch on just a fraction of this dataset. We can train on such specialized datasets because our model structures itself to squeeze as much information from language as possible. And our API provides the ability to adjust the model's default structure to leverage any knowledge you have about your dataset's metadata.

```
             All Earnings Calls
                /   |    \
Company:    GOOG   AAPL  META ...
                 /    \
Quarter:       24Q1   24Q2 ...
                    /         \
Paragraph          P1         P2 ...

Sentence ...

Word ...
```


For more on hierarchical modelling, check out chapter 5 of http://www.stat.columbia.edu/~gelman/book/BDA3.pdf [2].
