# Introduction: The Conceptual What and Why of RAG

Large language models like ChatGPT can process large quantities of text almost instantly. However, there are still limits to how much information can be given to ChatGPT within the context of a conversation. As of this writing, ChatGPT's flagship GPT-4o model can receive a maximum of about 125,000 words$^1$. 

This quantity of words, 125,000, is roughly equivalent to that of a book. This is ample capacity for many use cases, but what if you want to use ChatGPT to analyze the text equivalent of *an entire bookshelf* of books? 

We can't feed the text equivalent of an entire bookshelf of books to ChatGPT because of the aforementioned technical limitation, but we can *select relevant subsets of the text and give these subsets to ChatGPT*. This describes Retrieval Augmented Generation (RAG). 

The purpose of this document is to demonstrate a use case for RAG and to do so in a manner which abstracts from specific technical details. I've authored classes and methods which implement these specific technical aspects, but I'll mostly keep the code in the `.py` files (which are included in this repository) and focus on the conceptual overview of RAG here. 

# The Corpus: 10-Q Corporate Earnings Reports

In this project, we'll demonstrate RAG using a corpus of corporate quarterly reports, in the form of PDF documents. 

Large publicly traded corporations are required by US law to make public a series of annual and quarterly reports. One such report is the 10-Q, which gives a comprehensive overview of financial performance and operations over a given three month period. The 10-Q report includes numerical accounting data, such as income and cash flow statements, as well as management's discussion of the quarter's results. [Here](https://drive.google.com/file/d/1NYVFl_wz9FjFRopKOpQ6a8LBN7adoZ-f/view?usp=sharing) is an example of a 10-Q. 

Of course, the standardized accounting data that is included in the 10-Q is of great interest to investors, but anyone who has followed stock market investments closely would know very well that comments made by management in earnings reports can often be more impactful than the accounting data. Investors scrutinize management comments in order to better understand the current and future realities of the company. 

Management's comments in quarterly reports can be used to analyze a particular company. Furthermore, such commentary from a group of companies can be used to analyze a particular industry or to analyze the entire economy. People who manage large companies likely know a few things about the world! 

However, reading hundreds of very large documents every quarter is a massive investment of time and energy, and it isn't technically possible to send hundreds of thousands of words of text to the ChatGPT API in a single request. 

Therefore, we have the perfect use case for RAG: a collection of text that we want to analyze that is too large to be sent to ChatGPT in a single instance. 

I've manually downloaded the latest 10-Q reports for 35 of the largest S&P 500 companies and used `PyPDF2` to extract the text from these PDFs. The reports don't all cover exactly the same period, because of complexities in fiscal calendars and reporting dates, but they are all from the last 6 months of 2024.

# The Retrieval Part of RAG

Now that we have a dataset and a conceptual idea of how we intend to use RAG, let's continue with a research question in mind: *what did managers of these companies say about supply chain concerns in this quarter?*

To answer this question, we're going to want to look for certain paragraphs in the 10-Q reports that discuss supply chain dynamics. 

I've authored a class `TextHandler` which makes possible the "retrieval" part of RAG. As a preprocessing step, methods on this class decompose the given documents into chunks of text, each about 1,000 words long, and then store these chunks in a way that allows for retrieving only the ones relevant to a given question.

This is the central idea of RAG - we create a way of efficiently searching for subsets of the larger corpus that are relevant to a given question and these subsets will be sent to ChatGPT.

In [4]:
from zero_rag.text_handler import TextHandler

I designed the `TextHandler` class to apply these preprocessing steps to the corpus upon instantiation. The corpus is broken up into chunks of 1,000 words. Each chunk is then turned into a vector (this step is also called "embedding"), which allows for comparing the chunk to a given question. Finally, the vectors are indexed for fast retrieval. 
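The chunking step can be illustrated with a minimal sketch. This is not the code inside `TextHandler`; the function name `chunk_words` and the choice to split on whitespace are assumptions made purely for illustration.

```python
def chunk_words(text: str, chunk_size: int = 1000) -> list[str]:
    """Split text into consecutive chunks of at most `chunk_size` words.

    Hypothetical helper for illustration; words are separated on whitespace.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


# Toy document of 2,500 "words": the last chunk holds the remainder.
document = " ".join(f"word{i}" for i in range(2500))
chunks = chunk_words(document, chunk_size=1000)
print([len(chunk.split()) for chunk in chunks])  # [1000, 1000, 500]
```

Splitting on a fixed word count is simple, but it is also why retrieved chunks can begin and end in the middle of a sentence.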

A quick conceptual explanation of vectorization in this context: each paragraph is turned into a vector. Think of each vector as a list of numbers. When we ask a question, the text of the question is vectorized in exactly the same way. Then, the vectorized question is compared to the vectorized paragraphs from the corpus, and the most similar paragraphs are returned. Numerical similarity between vectors corresponds to textual similarity between the chunks of text they represent. 
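To make the comparison concrete, here is a small sketch using cosine similarity, a common way of comparing embedding vectors. Whether `TextHandler` uses cosine similarity or another measure is not stated here, so treat the metric, the function names, and the toy three-number vectors as assumptions; real embeddings are far longer.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(question_vec, chunk_vecs, n=3):
    """Return the indices of the n chunk vectors most similar to the question."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(question_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:n]


# Toy "embeddings" for three chunks and one question (illustrative values).
chunk_vectors = [
    [0.9, 0.1, 0.0],  # chunk 0: points almost the same way as the question
    [0.0, 1.0, 0.0],  # chunk 1: orthogonal to the question
    [0.7, 0.7, 0.0],  # chunk 2: partially aligned with the question
]
question_vector = [1.0, 0.0, 0.0]

print(most_similar(question_vector, chunk_vectors, n=2))  # [0, 2]
```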

In short, vectorization allows for evaluating similarity between strings of text. OpenAI provides a function that performs this vector embedding. This and other steps cost time and money, so `TextHandler` saves the embeddings so that this step is only done once:

In [3]:
%%time
retriever = TextHandler(path_text_input='/l/pdfs/txt/')

do not have embeddings, re-creating (probably desirable). 
wrote '/l/pdfs/txt/embeddings.pkl'
CPU times: user 7.8 s, sys: 217 ms, total: 8.02 s
Wall time: 10min 54s
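The compute-once-then-reuse behavior shown in the log above follows a common caching pattern. The sketch below illustrates that pattern with Python's `pickle` module; it is not the `TextHandler` implementation, and `load_or_create_embeddings` and `fake_embed` are hypothetical names, with `fake_embed` standing in for a paid embedding call.

```python
import os
import pickle
import tempfile


def load_or_create_embeddings(cache_path, chunks, embed_fn):
    """Return cached embeddings if the file exists; otherwise compute and save them."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    embeddings = [embed_fn(chunk) for chunk in chunks]
    with open(cache_path, "wb") as f:
        pickle.dump(embeddings, f)
    return embeddings


# Stand-in for an expensive embedding call; records every invocation.
calls = []


def fake_embed(text):
    calls.append(text)
    return [float(len(text))]


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "embeddings.pkl")
    first = load_or_create_embeddings(cache, ["a", "bb"], fake_embed)   # computes
    second = load_or_create_embeddings(cache, ["a", "bb"], fake_embed)  # reads cache

print(first == second, len(calls))  # True 2
```

The second call never touches `fake_embed`, which is the point: the slow, billable step runs only when no cache file is present.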


Now we have an instance of `TextHandler` assigned to the variable `retriever`. 

Here is our research question:

In [5]:
question = 'what things did managers of these companies said about supply chain concerns in this quarter?'

...we use the `retrieve_relevant_chunks` method to return three paragraphs that are most similar to the given question:

In [6]:
sample_chunks = retriever.retrieve_relevant_chunks(question, n=3)

...we're just going to collect the three most similar paragraphs for demonstration. Later we'll collect a large number of paragraphs. 

Here is the second of the three results (index `1`):

In [7]:
print(sample_chunks[1].replace('\n', ' ')[:3000])

 reliance on any such forwar d-looking statements, which speak only as of the date they are made. We undertake no obligation to update any forwar d-looking statement, whether as a r esult of new information, futur e events or otherwise. Risks Associated with Commodities and Our Supply Chain During the 12 and 36 weeks ended September 7, 2024, we continued to experience higher operating costs, including on transportation and labor costs, which may continue for the remainder of 2024. Many of the commodities used in the production and transportation of our products are purchased in the open market. The prices  we pay for such items are subject to fluctuation, and we manage this risk through the use of fixed-price contracts and purchase orders, pricing agreements and derivative instruments, including swaps and futures. A number of external factors, including the ongoing conflict in Ukraine, the inflationary cost environment, adverse weather conditions, supply chain disruptions and labor sho

...the formatting has been lost (due to the way that the text was extracted from the PDFs) and the text has been cut off in the middle of sentences. However, ChatGPT will likely still be able to extract the relevant ideas, especially when given many such paragraphs. 

I've written code that allows for programmatically interacting with ChatGPT. The code is available in this repository, but we'll just use the code here without getting into the details of it. It is fundamentally similar to the ChatGPT web app in the sense that we send a question and get a response.

We want to verify that these paragraphs are indeed relevant to supply chain concerns. Let's use ChatGPT to summarize the three paragraphs that were retrieved above:

In [8]:
from zero_rag import chatgpt_convo as chat

In [9]:
base_message = 'In about three sentences, summarize this text: '

for paragraph in sample_chunks:
    convo = chat.init_convo()
    message = f"{base_message} {paragraph}"
    convo, reply = chat.new_message(message, convo, model='gpt-4o')
    print(reply)
    print('\n')

The text discusses the challenges faced by a company due to non-linear sales patterns, manufacturing issues, and supply chain disruptions, which can result in unpredictable revenue and operating results from quarter to quarter. Large orders and their timing further complicate revenue forecasting, as they can significantly impact operating results depending on when they are recognized as revenue. The company also faces risks related to inventory and purchase commitments that could lead to excess or obsolete inventory if demand decreases, and it is heavily reliant on contract manufacturers and suppliers, making it vulnerable to supply chain issues and financial problems within the supply chain.


The text discusses several risks and challenges the company faces, including higher operating costs and commodity price fluctuations due to factors like the Ukraine conflict and inflation. Climate change regulations may also lead to increased costs, and the company is monitoring potential impact

Looking okay so far! There are some irrelevant concepts included in the paragraphs, but ChatGPT should be able to easily filter this kind of thing out$^2$. 

Let's modify the question slightly to examine these paragraphs in a different way:

In [10]:
base_message = 'In about three sentences, summarize the following text as it pertains corporate supply chains: '

for paragraph in sample_chunks:
    convo = chat.init_convo()
    message = f"{base_message} {paragraph}"
    convo, reply = chat.new_message(message, convo, model='gpt-4o')
    print(reply)
    print('\n')

Challenges in corporate supply chains can cause significant impacts on business operations and financial results. Nonlinear shipping patterns and manufacturing issues can increase operational costs and make revenue forecasting difficult due to periods of underutilized capacity or overtime expenses. A dependency on external manufacturers and suppliers further complicates supply chain management, as financial instability, capacity constraints, or external disruptions can lead to increased costs, delays, or inadequate supply fulfillment, impacting gross margins and operating results.


The corporation is experiencing rising operational costs, particularly in transportation and labor, due to fluctuating commodity prices influenced by external factors such as the conflict in Ukraine and inflation. This is managed through fixed-price contracts, pricing agreements, and derivative instruments, but may still impact revenue and operating results if increased costs are not passed on to customers.

Okay! It appears that the retrieved paragraphs are actually discussing supply chain concerns. Let's now get an answer to our research question!

# The Generative Part of RAG

Now we'll put everything together to generate a robust answer to our research question. We'll do a little bit of prompt engineering to get the best possible answer from ChatGPT, and we'll use a cloud of keywords, rather than the question itself, to select the relevant paragraphs:

In [11]:
# words and phrases related to supply chain. 
keywords = """
Supply Chain. Logistics management. Inventory optimization. Demand forecasting. Supplier relationships. 
Procurement strategies. Distribution networks. Warehouse operations. Supply chain analytics. Just-in-time manufacturing. 
Freight transportation. Vendor management. Order fulfillment. Supply chain resilience. Production planning. 
Inventory turnover. Global sourcing. Reverse logistics. Supply chain risk management. Cost efficiency. 
Lead time reduction.
"""

In [12]:
# spot check the first three paragraphs again, with expanded keywords. 
sample_chunks = retriever.retrieve_relevant_chunks(keywords, n=3)

base_message = """
In about three sentences, summarize the following text as it pertains corporate supply chains, ignoring 
concepts that aren't related to supply chain:
"""

for paragraph in sample_chunks:
    convo = chat.init_convo()
    message = f"{base_message} {paragraph}"
    convo, reply = chat.new_message(message, convo, model='gpt-4o')
    print(reply)
    print('\n')

Corporate supply chains face challenges from multiple factors, including rising costs of transportation and resources, constrained labor markets, and disruptions due to natural disasters or geopolitical events. The complexity of operating fulfillment networks and data centers increases as businesses expand, with potential risks related to forecasting demand, managing staffing levels, and handling inventory for third parties. Additionally, reliance on a limited number of shipping companies and the impact of labor and environmental issues can negatively affect operations and customer satisfaction.


The text highlights several risks related to corporate supply chains. Key concerns include reliance on significant suppliers, some of which are single or limited sources, without long-term agreements to ensure supply stability. Supplier issues, such as bankruptcies, geopolitical events, or unethical practices, could disrupt the supply chain, affecting the company's operations and reputation. 

...using a large number of keywords seems to improve the relevance of the retrieved paragraphs. Let's get a larger number of paragraphs:

In [13]:
# get the 100 paragraphs that are most similar to the keyword cloud
paragraphs = retriever.retrieve_relevant_chunks(keywords, n=100)

In [14]:
len(paragraphs)

100
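Before sending this many paragraphs at once, it is worth a back-of-the-envelope check against the limit quoted in the introduction. The sketch below uses this document's own rough figures (chunks of about 1,000 words, a limit of about 125,000 words); the function name and the `reserved_words` headroom for the instructions and the reply are assumptions, and a real check would count tokens rather than words (see footnote 1).

```python
def fits_in_context(chunks, word_limit=125_000, reserved_words=2_000):
    """Rough check: do the chunks plus some headroom fit within a word budget?

    `reserved_words` is a hypothetical allowance for the prompt instructions
    and the model's reply. Returns (fits, total_words).
    """
    total_words = sum(len(chunk.split()) for chunk in chunks)
    return total_words + reserved_words <= word_limit, total_words


# Stand-in for 100 retrieved chunks of 1,000 words each.
toy_paragraphs = [" ".join(["word"] * 1000) for _ in range(100)]

print(fits_in_context(toy_paragraphs))  # (True, 100000)
```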

In [15]:
prompt = """
I'm macroeconomic analyst at a large hedge fund. I need to identify current trends and factors that impact 
supply chains for a given basket of public companies. In this project, I'm not interested in the dynamics 
of the individual companies per se; I want to analyze management comments about supply chain dynamics with the 
desire to identify large macro factors that impact the entire economy. 

The following are snippets of text taken from the most recent 10-Q report for a selection of publicly 
traded companies. Note that these paragraphs may contain a lot of information that isn't related to supply chain 
dynamics, i.e. it will be your task to identify concepts that are pertinent. I'm especially focused on management 
comments which are contextual to the most recent quarter and the upcoming year. 

Analyze these paragraphs and return a SWOT analysis of the supply chain aspects of these companies. 
One paragraph for each of Strengths, Weaknesses, Opportunities and Threats. 

The following are the aforementioned paragraphs: 
"""
prompt = f"{prompt} {'---'.join(paragraphs)}"

In [16]:
convo = chat.init_convo()
convo, reply = chat.new_message(prompt, convo, model='gpt-4o')
print(reply)

A SWOT analysis of the supply chain aspects of the selection of publicly traded companies mentioned in the 10-Q report snippets can be summarized as follows:

**Strengths:**
1. **Technological Investments:** Companies are heavily investing in technology and infrastructure, including AI and machine learning, which can streamline supply chain processes and improve efficiency.
2. **Diverse Global Operations:** The presence in multiple international markets provides a broad supply base and potentially diverse sources for procurement and production, allowing for risk mitigation.
3. **Flexibility in Operations:** Many companies exhibit strong operational flexibility, maintaining various agreements with third-party logistics providers and allowing them to adapt to changes in demand or supply chain disruptions.
4. **Brand Reputation:** Strong brands and a focus on customer service can enhance negotiation power with suppliers and logistics partners, as well as ensuring customer loyalty even in 

Okay! We now have what appears to be, superficially, a fairly good analysis of corporate supply chains based on the quarterly reports. 

# Evaluating the Results

So, we now have an analysis, and one which appears to be a decent one. But how do we know that the text generated by ChatGPT is actually representative of the input text? Perhaps ChatGPT is hallucinating. 

Ideally, there would be a hard quantitative method of evaluating the quality of the output. However, both the inputs to and the output of this process are qualitative in nature; therefore, evaluating the utility of the results will have to be qualitative as well. 

The first step in evaluation should be to simply read through some of the input documents to get a sense of what information is actually there and then compare it to the SWOT analysis. Similarly, in a real-world application, there would likely be an internal subject matter expert who would be primed to spot any inaccuracies in the SWOT analysis. This approach is relatively easy to implement and would likely catch any broad and obvious issues, but it suffers from a lack of breadth. There are only around 35 10-Q reports in this project, but consider how costly it would be to manually review hundreds or thousands of large documents. 

A more robust option for evaluating the accuracy of the output would be to modify the code to use the outputs of multiple LLMs; the consensus (or lack thereof) would then provide some insight. If a few different LLMs arrive at a similar conclusion, then we could have some degree of confidence in the quality of the output. However, this approach has downsides: multiple LLMs could be making similar mistakes, and, alternatively, different LLMs might arrive at substantially different outputs given the same inputs. 
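One crude way to put a number on consensus is to measure how much vocabulary the replies share. The sketch below computes pairwise Jaccard overlap between word sets; the metric, the function names, and the hard-coded replies (which stand in for real model outputs) are all assumptions for illustration, not part of this project's code.

```python
import re
from itertools import combinations


def word_set(text):
    """Lower-cased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Shared words divided by total distinct words across both texts."""
    set_a, set_b = word_set(a), word_set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def pairwise_agreement(replies):
    """Map each pair of model names to the Jaccard overlap of their replies."""
    return {
        (m1, m2): jaccard(replies[m1], replies[m2])
        for m1, m2 in combinations(sorted(replies), 2)
    }


# Hypothetical replies from three different models to the same prompt.
replies = {
    "model_a": "labor shortages and freight costs are threats",
    "model_b": "freight costs and labor shortages are threats",
    "model_c": "strong brand reputation is a strength",
}

agreement = pairwise_agreement(replies)
for pair, score in agreement.items():
    print(pair, round(score, 2))
```

Word overlap is only a weak proxy: two replies can share vocabulary while disagreeing in substance, or agree while using different words, so a score like this would at most flag pairs of outputs that deserve a closer manual read.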

In conclusion, we can evaluate the output through manual review or by seeking consensus between multiple LLMs, but both methods have significant limitations. Ultimately, the only truly valid test is to put the tool into the hands of subject matter experts and let them determine its utility. In this hypothetical, we'd ideally communicate closely with subject matter experts from the beginning of the project and let them start using it early, with a mutual understanding that the project is being launched in an experimental status and that there is always much that can be done to improve it. 

# Footnotes

\[1\]: OpenAI's ChatGPT API defines the length of this context in terms of "tokens". Conceptually, tokens can be thought of as words; in reality, however, tokens are defined slightly differently than words.

\[2\]: Note that the paragraphs don't necessarily include any information about what company the paragraph is discussing. This was planned in this project, as the idea is to analyze dynamics that are impacting many of the companies included in the corpus. 