# RAG for SEC Filing

I was reading about RAG and came across [this repo](https://gist.github.com/virattt/985a352b945a0e1164e91415f1ab2eeb). I wanted to try it out with my own implementation of RAG on Chroma, so here is my attempt.

## Step 1: Get the SEC filings as plain text

In [20]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader

In [15]:
# URLs for the two SEC filings (Apple and Meta)
appl_filing = 'https://d18rn0p25nwr6d.cloudfront.net/CIK-0000320193/f8aaeabb-7a2a-479d-bf72-9559ff51ea5d.pdf'
meta_filing = 'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001326801/ba763267-0ccb-4870-a7c5-e1bfd92a9ca7.pdf'

In [32]:
# helper function to load the PDF from the URL and split its pages into chunks of up to 1000 characters
def load_sec_pdf_data(url):
    loader = PyPDFLoader(url)
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    # each chunk is a Document that keeps its page number and source URL in metadata
    return splitter.split_documents(loader.load())

In [33]:
appl_text = load_sec_pdf_data(appl_filing)
meta_text = load_sec_pdf_data(meta_filing)
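
To make the `chunk_size` and `chunk_overlap` arguments concrete, here is a minimal pure-Python sketch of fixed-size chunking. It is a simplification and not what `RecursiveCharacterTextSplitter` does internally: the real splitter first tries to break on separators such as paragraph and line breaks. `chunk_text` is a hypothetical helper, not part of LangChain.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # each window starts `step` characters after the previous one
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "x" * 2500  # stand-in for the text of a filing
no_overlap = chunk_text(sample, chunk_size=1000, chunk_overlap=0)
with_overlap = chunk_text(sample, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in no_overlap])
print([len(c) for c in with_overlap])
```

With `chunk_overlap=0`, as used above, a sentence that straddles a chunk boundary is cut in two; a small overlap trades some duplicated text for a better chance that such a sentence survives whole in at least one chunk.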

## Step 2: Convert the filing information into vector embeddings

In [34]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

In [35]:
def store_text_embeddings(text):
    embedding = OpenAIEmbeddings()
    # embed the chunks and add them to the Chroma store in ./sec_data
    db = Chroma.from_documents(
        documents=text,
        embedding=embedding,
        persist_directory='./sec_data'
    )
    # write the store to disk
    db.persist()

In [36]:
store_text_embeddings(appl_text)
store_text_embeddings(meta_text)

## Step 3 (optional): Test if data has been stored

In [37]:
db = Chroma(persist_directory='./sec_data', embedding_function=OpenAIEmbeddings())

In [38]:
query = 'What are the risks mentioned by Apple?'

In [42]:
print(db.similarity_search(query)[0].page_content)

Table of Contents
Risks Related to Data, Security, Platform Integrity, and Intellectual Property
•the occurrence of security breaches, improper access to or disclosure of our data or user data, and other cyber incidents, as well as intentional misuse
of our services and other undesirable activity on our platform;
•our ability to obtain, maintain, protect, and enforce our intellectual property rights; and
Risks Related to Ownership of Our Class A Common Stock
•limitations on the ability of holders of our Class A Common Stock to influence corporate matters due to the dual class structure of our common stock
and the control of a majority of the voting power of our outstanding capital stock by our founder, Board Chair, and CEO.
Risks Related to Our Product Offerings
If we fail to retain existing users or add new users, or if our users decrease their level of engagement with our products, our revenue, financial results, and
business may be significantly harmed.
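
Note that the top hit above appears to come from Meta's filing (it mentions Class A Common Stock and control by the founder and CEO), even though the query asks about Apple. Both filings sit in the same store, so the search ranks chunks from both companies together. One possible remedy is to restrict the search by the `source` metadata. The toy sketch below shows the idea with made-up two-dimensional vectors; `search`, `records` and the file names `aapl.pdf` and `meta.pdf` are all hypothetical. With the real store the equivalent would presumably be a metadata filter on `source`, which is an assumption to check against the Chroma docs.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, records, k=4, source=None):
    """Return the k most similar records, optionally limited to one source."""
    candidates = [r for r in records if source is None or r["metadata"]["source"] == source]
    ranked = sorted(candidates, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return ranked[:k]


# made-up vectors: the Meta chunk happens to sit closest to the query
records = [
    {"text": "Meta risk factors", "vector": [0.9, 0.1], "metadata": {"source": "meta.pdf"}},
    {"text": "Apple risk factors", "vector": [0.8, 0.3], "metadata": {"source": "aapl.pdf"}},
    {"text": "Apple revenue table", "vector": [0.1, 0.9], "metadata": {"source": "aapl.pdf"}},
]
query_vec = [1.0, 0.0]  # stand-in for the embedded question about Apple's risks

unfiltered_top = search(query_vec, records, k=1)[0]["text"]
filtered_top = search(query_vec, records, k=1, source="aapl.pdf")[0]["text"]
print(unfiltered_top)
print(filtered_top)
```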


## Step 4: Question the data

In [50]:
# this part is based on the doc https://python.langchain.com/docs/use_cases/question_answering/
from langchain import hub
from langchain_core.runnables import RunnablePassthrough
from langchain.schema import StrOutputParser
from langchain.chat_models import ChatOpenAI

In [44]:
# first connect to database
db = Chroma(persist_directory='./sec_data', embedding_function=OpenAIEmbeddings())
retriever = db.as_retriever()

In [51]:
# now set up the LLM & RAG prompt
prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

In [52]:
# helper function to join the retrieved docs into a single context string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [53]:
# this is a chain expressed using the LangChain Expression Language (LCEL)
# more info at: https://python.langchain.com/docs/expression_language/
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
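
To show what the `|` pipes and the leading dict in the chain above are doing, here is a toy re-creation in plain Python. It is only a sketch of the idea, not LangChain's implementation: `Step`, `fan_out` and the `toy_*` names are hypothetical, and the "retriever" returns hard-coded strings instead of querying Chroma.

```python
class Step:
    """Wraps a function so that steps can be chained with `|`."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # a | b runs a first, then feeds its output into b
        return Step(lambda value: other.invoke(self.invoke(value)))


def fan_out(branches):
    """Send the same input to every branch and collect the results in a dict."""
    return Step(lambda value: {key: step.invoke(value) for key, step in branches.items()})


toy_retriever = Step(lambda question: ["chunk A", "chunk B"])  # pretend search results
toy_format = Step(lambda docs: "\n\n".join(docs))              # same job as format_docs
passthrough = Step(lambda value: value)                        # forwards the question unchanged
toy_prompt = Step(lambda d: f"Context:\n{d['context']}\n\nQuestion: {d['question']}")

toy_chain = fan_out({"context": toy_retriever | toy_format, "question": passthrough}) | toy_prompt
result = toy_chain.invoke("What was the revenue?")
print(result)
```

The question string enters once and reaches both branches: one turns it into retrieved context, the other passes it through untouched, and the prompt step receives both under the keys it expects.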

In [55]:
rag_chain.invoke('What was the revenue of Apple in the reporting period?')

'The revenue of Apple in the reporting period was $81,797 million.'

In [56]:
rag_chain.invoke('What was the revenue of Meta in the reporting period?')

'The revenue of Meta in the reporting period was $34,146 million for the three months ended September 30, 2023, and $94,791 million for the nine months ended September 30, 2023.'

## Step 5: Q&A With Source

In [59]:
from operator import itemgetter
from langchain_core.runnables import RunnableParallel

In [62]:
rag_chain_from_docs = (
    {
        "context": lambda input: format_docs(input["documents"]),
        "question": itemgetter("question"),
    }
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain_with_source = RunnableParallel(
    {"documents": retriever, "question": RunnablePassthrough()}
) | {
    "documents": lambda input: [doc.metadata for doc in input["documents"]],
    "answer": rag_chain_from_docs,
}

In [63]:
rag_chain_with_source.invoke('What was the revenue of Apple in the reporting period?')

{'documents': [{'page': 9,
   'source': 'https://d18rn0p25nwr6d.cloudfront.net/CIK-0000320193/f8aaeabb-7a2a-479d-bf72-9559ff51ea5d.pdf'},
  {'page': 3,
   'source': 'https://d18rn0p25nwr6d.cloudfront.net/CIK-0000320193/f8aaeabb-7a2a-479d-bf72-9559ff51ea5d.pdf'},
  {'page': 18,
   'source': 'https://d18rn0p25nwr6d.cloudfront.net/CIK-0000320193/f8aaeabb-7a2a-479d-bf72-9559ff51ea5d.pdf'},
  {'page': 12,
   'source': 'https://d18rn0p25nwr6d.cloudfront.net/CIK-0000320193/f8aaeabb-7a2a-479d-bf72-9559ff51ea5d.pdf'}],
 'answer': 'The revenue of Apple in the reporting period was $81,797 million.'}
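
The `documents` list repeats the same URL once per retrieved chunk. A small helper can collapse it into one entry per source with a sorted list of pages. This is a sketch under the assumption that every metadata dict has `page` and `source` keys, as in the output above; `summarize_sources` is a hypothetical name and the example uses a shortened placeholder instead of the real URL.

```python
from collections import defaultdict


def summarize_sources(documents):
    """Group page numbers by source so each file is listed only once."""
    pages_by_source = defaultdict(set)
    for meta in documents:
        pages_by_source[meta["source"]].add(meta["page"])
    return {source: sorted(pages) for source, pages in pages_by_source.items()}


# same shape as the 'documents' value returned by rag_chain_with_source
example_documents = [
    {"page": 9, "source": "apple-filing.pdf"},
    {"page": 3, "source": "apple-filing.pdf"},
    {"page": 18, "source": "apple-filing.pdf"},
    {"page": 12, "source": "apple-filing.pdf"},
]
summary = summarize_sources(example_documents)
print(summary)
```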