# Chat with SEC Documents
This project builds a very simple replica of https://www.secinsights.ai/, an application built by LlamaIndex that lets you chat with financial documents such as 10-Ks and 10-Qs.

The tech stack we will use:
- LangChain
- FAISS
- OpenAI API
- Instructor

First, download the SEC documents. Make sure to change `base_dir` to a directory on your machine, and set the institution name and email address to your own to comply with the SEC EDGAR API requirements. We then convert the raw HTML to PDF for better readability for LLMs.

In [2]:
from sec_edgar_downloader import Downloader
from itertools import product
from pathlib import Path
import pdfkit

DEFAULT_CIKS = [
    # AAPL
    "320193",
    # MSFT
    "789019",
    # TSLA
    "1318605",
    # NVDA
    "1045810",
    # LLY (Eli Lilly)
    "59478",
]
DEFAULT_FILING_TYPES = [
    "10-K",
    "10-Q",
]


# Initialize the downloader
base_dir = "/Users/gyf/Desktop/workshop" # Change this to your desired output directory
dl = Downloader("JHU", "gyf@pm.me", base_dir) # Change this to your desired username and email

def download_filing(cik, filing_type, after="2023-01-01", before="2024-01-01"): # Change the dates to your desired range
    dl.get(filing_type, cik, after=after, before=before, include_amends=True, download_details=True)

def convert_to_pdf(output_dir: str):
    """Converts all html files in a directory to pdf files."""

    data_dir = Path(output_dir) / "sec-edgar-filings"
    options = {'enable-local-file-access': None}
    for cik_dir in data_dir.iterdir():
        if cik_dir.is_dir():  # Check if it's a directory
            for filing_type_dir in cik_dir.iterdir():
                if filing_type_dir.is_dir():  # Check if it's a directory
                    for filing_dir in filing_type_dir.iterdir():
                        if filing_dir.is_dir():  # Check if it's a directory
                            filing_doc = filing_dir / "primary-document.html"
                            filing_pdf = filing_dir / "primary-document.pdf"
                            if filing_doc.exists() and not filing_pdf.exists():
                                print("- Converting {}".format(filing_doc))
                                input_path = str(filing_doc.absolute())
                                output_path = str(filing_pdf.absolute())
                                try:
                                    pdfkit.from_file(input_path, output_path, verbose=True, options=options)
                                except Exception as e:
                                    print(f"Error converting {input_path} to {output_path}: {e}")


# # Main function to download and convert filings
# def main():
#     for cik, filing_type in product(DEFAULT_CIKS, DEFAULT_FILING_TYPES):
#         download_filing(cik, filing_type)
#     convert_to_pdf(base_dir)

# main()  # Call the main function to run the script
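
The nested loops in `convert_to_pdf` walk a fixed three-level layout (CIK, filing type, filing folder). As a minimal sketch, assuming that layout holds, the same search can be written as a single glob. `find_unconverted` is a hypothetical helper that is not part of the original notebook, and the folder name in the demo is made up.

```python
from pathlib import Path
import tempfile

def find_unconverted(output_dir):
    """Return primary-document.html files that have no PDF sibling yet."""
    data_dir = Path(output_dir) / "sec-edgar-filings"
    # Layout assumed: <cik>/<filing type>/<filing folder>/primary-document.html
    pattern = "*/*/*/primary-document.html"
    return sorted(
        html for html in data_dir.glob(pattern)
        if not html.with_suffix(".pdf").exists()
    )

# Tiny demo on a throwaway directory tree (the filing folder name is made up).
with tempfile.TemporaryDirectory() as tmp:
    filing = Path(tmp) / "sec-edgar-filings" / "59478" / "10-K" / "example-filing"
    filing.mkdir(parents=True)
    (filing / "primary-document.html").write_text("<html></html>")
    pending = find_unconverted(tmp)
    print([p.name for p in pending])
```

Each returned path could then be passed to `pdfkit.from_file` exactly as in the loop above.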

Load the PDF with PyMuPDFLoader from LangChain. The PDF is converted to text and returned as a list in which each item corresponds to one page of the PDF.


Each page has a `page_content` attribute and a `metadata` attribute.

In [3]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(f"{base_dir}/10K examples/LLY_10K.pdf")
documents = loader.load()
print(documents[0].page_content) # Print the first page of the first document

United States
Securities and Exchange Commission
Washington, D.C. 20549
Form 10-K
Annual report pursuant to Section 13 or 15(d) of the Securities Exchange Act of 1934
for the fiscal year ended December 31, 2022
Commission file number 001-06351
ELI LILLY AND COMPANY
(Exact name of Registrant as specified in its charter)
Indiana
 
35-0470950
(State or other jurisdiction of
(I.R.S. Employer
incorporation or organization)
Identification No.)
Lilly Corporate Center, Indianapolis, Indiana 46285
(Address and zip code of principal executive offices)
Registrant's telephone number, including area code (317) 276-2000
Securities registered pursuant to Section 12(b) of the Exchange Act:
Title of Each Class
Trading Symbol(s)
Name of Each Exchange On Which Registered
Common Stock (no par value)
LLY
New York Stock Exchange
7 1/8% Notes due 2025
LLY25
New York Stock Exchange
1.625% Notes due 2026
LLY26
New York Stock Exchange
2.125% Notes due 2030
LLY30
New York Stock Exchange
0.625% Notes due 2031
LLY

Check out the metadata:

In [4]:
print(documents[0].metadata.keys())
print(documents[0].metadata)

dict_keys(['source', 'file_path', 'page', 'total_pages', 'format', 'title', 'author', 'subject', 'keywords', 'creator', 'producer', 'creationDate', 'modDate', 'trapped'])
{'source': '/Users/gyf/Desktop/workshop/10K examples/LLY_10K.pdf', 'file_path': '/Users/gyf/Desktop/workshop/10K examples/LLY_10K.pdf', 'page': 0, 'total_pages': 120, 'format': 'PDF 1.4', 'title': 'lly-20221231', 'author': '', 'subject': '', 'keywords': '', 'creator': 'wkhtmltopdf 0.12.6', 'producer': 'Qt 4.8.7', 'creationDate': "D:20240118054548-05'00'", 'modDate': '', 'trapped': ''}


## Embedding Models:
We can explore various embedding models on the Hugging Face MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard

#### **text-embedding-ada-002**:
Let's try out OpenAI's text-embedding-ada-002. It offers a sequence length of **8192 tokens** (~10 pages), one of the longest offered by embedding models at the time of writing. The model was released in December 2022, so it is no longer the state-of-the-art embedding model, but it still works well.
#### What does **sequence length** mean here?
It is the maximum number of tokens in a single input that the model can process. We can **chunk** the text into several parts and embed them separately to avoid exceeding the sequence length, which in many cases also improves response quality.
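
As a rough sketch of chunking, the helper below splits text into overlapping windows. It counts whitespace-separated words as a stand-in for tokens, which is only an approximation; a real pipeline would count tokens with the model's tokenizer. `chunk_words` and its parameters are hypothetical names chosen for this illustration.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into chunks of at most `chunk_size` words.

    Neighbouring chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in one of the two chunks.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk can then be embedded on its own, so a long page never exceeds the model's sequence length.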

To better understand how vector search works, here is a function that embeds the pages and the query and ranks the pages by cosine similarity:

In [6]:
# Get all the page text from the pdf
docs = [i.page_content for i in documents]

import numpy as np
from openai import OpenAI
import os

api_key = os.environ.get("OPENAI_API_KEY")

client = OpenAI(api_key=api_key)

def get_embeddings(text):
    if isinstance(text, list):
        # If input is a list of texts, process each text
        embeddings = []
        for t in text:
            response = client.embeddings.create(
                model="text-embedding-ada-002", input=t.replace("\n", " ")
            )
            embeddings.append(response.data[0].embedding)
        return embeddings
    else:
        # If input is a single text
        response = client.embeddings.create(
            model="text-embedding-ada-002", input=text.replace("\n", " ")
        )
        return response.data[0].embedding

def embed_and_search_openai(documents, query, top_k=5):
    # Ensure top_k is not greater than the number of documents
    top_k = min(top_k, len(documents))

    # Encode your documents
    doc_embs = get_embeddings(documents)
    doc_embs = np.asarray(doc_embs)

    # Encode your query
    query_emb = get_embeddings(query)
    query_emb = np.asarray([query_emb])

    # Compute the dot product between query embedding and document embedding
    # Since the embeddings are normalized, this is equivalent to cosine similarity
    scores = np.dot(query_emb, doc_embs.T)[0]

    # Find the indices of the top k scores
    max_idx = np.argsort(-scores)[:top_k]

    # Return the sorted top k documents, their scores, and their 1-based page numbers
    return [{'content': documents[idx], 'score': scores[idx], 'page': idx + 1} for idx in max_idx]

def print_results(query, results):
    print(f"Search Query: '{query}'\n")
    print("Top Results:")
    for i, result in enumerate(results, 1):
        print(f"Rank: {i}")
        print(f"Content: {result['content']}")
        print(f"Score: {result['score']:.4f}")
        print("----------")

In [7]:
query = "what is the revenue of the company?"
result = embed_and_search_openai(docs, query, top_k=2)
# Prettifying the output
print_results(query, result)

Search Query: 'what is the revenue of the company?'

Top Results:
Rank: 1
Content: Net Product Revenue
Revenue from sales of products is recognized at the point where the customer obtains control of the goods and we satisfy our performance obligation,
which generally is at the time we ship the product to the customer. Payment terms differ by jurisdiction and customer, but payment terms in most of our
major jurisdictions typically range from 30 to 70 days from date of shipment. Revenue for our product sales has not been adjusted for the effects of a
financing component as we expect, at contract inception, that the period between when we transfer control of the product and when we receive payment
will be one year or less. Any exceptions are either not material or we collect interest for payments made after the due date. Provisions for rebates,
discounts, and returns are established in the same period the related product sales are recognized. We generally ship product shortly after orders
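
The search function above relies on the embeddings being normalized, so a plain dot product equals cosine similarity. The sketch below spells out the full formula in pure Python for vectors that are not normalized, and ranks a few toy 2-D vectors the same way. The helper names and the vectors are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k most similar vectors."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]

query_vec = [1.0, 0.0]
doc_vecs = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
ranked = top_k(query_vec, doc_vecs)
print(ranked)
```

The second vector points in the same direction as the query, so it ranks first even though its length differs; only the angle matters.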

#### **Cohere English Embed V3**
Let's try the same approach with Cohere's embedding:


In [35]:
import os
import cohere
import numpy as np

# Read the key from the environment instead of hard-coding it in the notebook
cohere_api_key = os.environ.get("COHERE_API_KEY")
co = cohere.Client(cohere_api_key)

def embed_and_search_cohere(documents, query, top_k=5):
    # Ensure top_k is not greater than the number of documents
    top_k = min(top_k, len(documents))

    # Encode your documents with input type 'search_document'
    doc_emb = co.embed(documents, input_type="search_document", model="embed-english-v3.0").embeddings
    doc_emb = np.asarray(doc_emb)

    # Encode your query with input type 'search_query'
    query_emb = co.embed([query], input_type="search_query", model="embed-english-v3.0").embeddings
    query_emb = np.asarray(query_emb)

    # Compute the dot product between query embedding and document embedding, since the embeddings are normalized, this is equivalent to cosine similarity
    scores = np.dot(query_emb, doc_emb.T)[0]

    # Find the indices of the top k scores
    max_idx = np.argsort(-scores)[:top_k]

    # Return the sorted top k documents, their scores, and their 1-based page numbers
    return [{'content': documents[idx], 'score': scores[idx], 'page': idx + 1} for idx in max_idx]

Search for the relevant pages in the SEC document using cosine similarity:

In [36]:
query = "what is the revenue of the company?"
result = embed_and_search_cohere(docs, query, top_k=3)
print_results(query, result)

Search Query: 'what is the revenue of the company?'

Top Results:
Rank: 1
Content: Disaggregation of Revenue
The following table summarizes revenue by product:
U.S.
Outside U.S.
2022
2021
2020
2022
2021
2020
Revenue—to unaffiliated customers:
Diabetes:
Trulicity
$
5,688.8 
$
4,914.4 
$
3,835.9 
$
1,750.9 
$
1,557.6 
$
1,232.2 
Jardiance
1,194.5 
807.3 
620.8 
871.5 
683.5 
533.0 
Humalog
1,191.9 
1,320.7 
1,485.6 
868.7 
1,132.3 
1,140.3 
Humulin
730.2 
832.9 
866.4 
289.2 
389.6 
393.2 
Basaglar
470.7 
588.3 
842.3 
289.7 
304.2 
282.1 
Mounjaro
366.6 
— 
— 
115.9 
— 
— 
Other diabetes
268.4 
255.7 
258.1 
367.8 
401.6 
344.5 
Total diabetes
9,911.1 
8,719.3 
7,909.1 
4,553.7 
4,468.8 
3,925.3 
Oncology:
Verzenio
1,653.2 
834.9 
618.2 
830.3 
515.0 
294.4 
Cyramza
351.4 
358.1 
381.9 
620.0 
674.8 
650.8 
Alimta
543.7 
1,233.9 
1,265.3 
384.0 
827.5 
1,064.7 
Erbitux
500.1 
481.8 
480.1 
66.4 
66.4 
56.3 
Tyvyt
— 
— 
— 
293.3 
418.1 
308.7 
Other oncology
169.7 
120.1 
46.6 
254.1 
21

The quality of the search depends significantly on the choice of embedding model. If the embedding model does not work well, retrieval needs more effort and engineering. Poor retrieval quality increases the likelihood of hallucination from the language model.

In [48]:
from openai import OpenAI

def get_completion_from_messages(messages, model='gpt-4-1106-preview', temperature=0.1, max_tokens=1000):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content

def chat(documents, query, model='gpt-4-1106-preview', search_function=embed_and_search_cohere):
    relevant_context = search_function(documents, query, top_k=3)
    system_message = f"""
    You are a seasoned financial analyst. You have been asked to analyze the financial statements of a company.
    Answer questions from user by inspecting closely at the context from assistant
    """
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": f"{query}"},
        {"role": "assistant", "content": f"Relevant context: \n {relevant_context}"}
    ]
    return get_completion_from_messages(messages, model=model)


In [45]:
query = "what is the net income of the company in the year 2022?"
chat(docs, query)

'The net income of the company for the year 2022 is $6,244.8 million.'

In [41]:
query = "what is the net income of the company in the year 2022?"
chat(docs, query, search_function=embed_and_search_openai)

"The provided context does not contain explicit information about the net income of the company for the year 2022. Net income is typically calculated as revenues minus expenses, taxes, and costs. To determine the net income, you would need to look at the company's income statement, which would provide details on the company's total revenues, cost of goods sold (COGS), operating expenses, interest, taxes, and other incomes or losses.\n\nIf you have access to the company's income statement for 2022, you can calculate the net income by subtracting the total expenses (including COGS, operating expenses, interest, taxes, and any other losses) from the total revenues. If you provide the income statement or specific figures from it, I can assist you in calculating the net income."

In [46]:
query = "what is the net income of the company in the year 2022?"
chat(docs, query, model='gpt-3.5-turbo', search_function=embed_and_search_openai)

"I'm sorry, but I couldn't find the specific information about the net income of the company in the year 2022 in the provided context."

### Introduce LangChain
The code above re-embeds every document on each call. With a vector store we embed the documents only once and then search within the database, instead of wasting time and compute.
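
To make the "embed once, search many times" idea concrete, here is a toy in-memory store in pure Python. It is only a sketch: `TinyVectorStore` and `toy_embed` are hypothetical names, and the bag-of-words "embedding" stands in for a real embedding API call. The counter shows that documents are embedded at insert time and never again.

```python
import math
from collections import Counter

class TinyVectorStore:
    """Toy store: embed documents once, reuse the vectors for every query."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.texts = []
        self.vectors = []

    def add_texts(self, texts):
        for text in texts:
            self.texts.append(text)
            self.vectors.append(self.embed_fn(text))  # embedded exactly once

    def similarity_search(self, query, k=3):
        q = self.embed_fn(query)  # only the query is embedded per search
        scored = [(self._cosine(q, v), t) for v, t in zip(self.vectors, self.texts)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [t for _, t in scored[:k]]

    @staticmethod
    def _cosine(a, b):
        dot = sum(a[w] * b[w] for w in a if w in b)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

calls = {"n": 0}

def toy_embed(text):
    """Stand-in for a real embedding API: a bag-of-words count vector."""
    calls["n"] += 1
    return Counter(text.lower().split())

store = TinyVectorStore(toy_embed)
store.add_texts(["net income rose in 2022", "revenue by product", "board of directors"])
first = store.similarity_search("net income", k=1)
second = store.similarity_search("revenue", k=1)
print(first, second, calls["n"])
```

Two searches cost two extra embedding calls rather than re-embedding all three documents each time; a library such as FAISS adds an efficient index on top of the same principle.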

LangChain provides good abstractions for the pipeline: embed, store, retrieve, and chat:

In [13]:
from langchain_community.vectorstores import FAISS # vectorstore of our choice
from langchain_community.embeddings import CohereEmbeddings
from langchain_community.document_loaders import PyMuPDFLoader


loader = PyMuPDFLoader(f"{base_dir}/10K examples/LLY_10K.pdf")
embeddings = CohereEmbeddings(model="embed-english-v3.0")
documents = loader.load()

# create a vectorstore from documents and embeddings
db = FAISS.from_documents(documents=documents, embedding=embeddings)
# create a retriever on top of the vectorstore
retriever = db.as_retriever(search_kwargs={"k": 5})

# or alternatively, create a search function from a vectorstore
def search(query, db, top_k=3):
    relevant_pages = db.similarity_search(query, k=top_k)
    return [page.page_content for page in relevant_pages]

In [16]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI


prompt = ChatPromptTemplate.from_template(
    """You are a seasoned financial analyst. You have been asked to analyze the financial statements of a company.
    Answer questions from user by inspecting closely at the context from assistant
    <context>
    {context}
    </context>
    Question: {input}"""
)

llm = ChatOpenAI(temperature=0.1, openai_api_key=api_key, model="gpt-4-1106-preview")

document_chain = create_stuff_documents_chain(llm, prompt)
retrieval_chain = create_retrieval_chain(retriever, document_chain)



In [17]:
retrieval_chain.invoke({"input": "What is the revenue of the company for year 2022?"})

{'input': 'What is the revenue of the company for year 2022?',
 'context': [Document(page_content='The following table summarizes our revenue activity in 2022 compared with 2021:\nYear Ended December 31,\n \n2022\n2021\nPercent Change\nProduct\nU.S.\nOutside U.S.\nTotal\nTotal\nTrulicity\n$\n5,688.8 \n$\n1,750.9 \n$\n7,439.7 \n$\n6,471.9 \n15\nVerzenio\n1,653.2 \n830.3 \n2,483.5 \n1,349.9 \n84\nTaltz\n1,724.6 \n757.4 \n2,482.0 \n2,212.8 \n12\nJardiance\n1,194.5 \n871.5 \n2,066.0 \n1,490.8 \n39\nHumalog\n1,191.9 \n868.7 \n2,060.6 \n2,453.0 \n(16)\nCOVID-19 antibodies\n2,008.9 \n14.7 \n2,023.5 \n2,239.3 \n(10)\nHumulin\n730.2 \n289.2 \n1,019.4 \n1,222.6 \n(17)\nCyramza\n351.4 \n620.0 \n971.4 \n1,033.0 \n(6)\nAlimta\n543.7 \n384.0 \n927.7 \n2,061.4 \n(55)\nOlumiant\n148.2 \n682.3 \n830.5 \n1,115.1 \n(26)\nBasaglar\n470.7 \n289.7 \n760.4 \n892.5 \n(15)\nEmgality\n462.8 \n188.1 \n650.9 \n577.2 \n13\nForteo\n367.3 \n245.8 \n613.1 \n801.9 \n(24)\nCialis\n35.2 \n552.1 \n587.3 \n718.4 \n(18)\nE

In [59]:
retrieval_chain.invoke({"input": "What is the net income of the company for year 2022?"})['answer']

{'input': 'What is the net income of the company for year 2022?',
 'context': [Document(page_content='Gross Margin, Costs, and Expenses\nGross margin as a percent of revenue was 76.8 percent in 2022, an increase of 2.6 percentage points compared with 2021, primarily driven by a net\ninventory impairment charge related to our COVID-19 antibodies recognized in 2021 and the unfavorable effect of foreign exchange rates on international\ninventories sold in 2021. Additionally, in 2022, favorable product mix, including the impact of lower sales of COVID-19 antibodies and Olumiant for the\ntreatment of COVID-19, were offset by lower realized prices and increased expenses due to inflation and logistics costs.\nResearch and development expenses increased 4 percent to $7.19 billion in 2022, driven primarily by higher development expenses for late-stage\nassets, partially offset by lower development expenses for COVID-19 antibodies and the favorable impact of foreign exchange rates.\nMarketing, s

In [60]:
retrieval_chain.invoke({"input": "Who is the CEO of the company?"})['answer']

{'input': 'Who is the CEO of the company?',
 'context': [Document(page_content='Name\nAge\nTitles and Business Experience\nDavid Ricks\n55\nChair, President, and Chief Executive Officer (CEO) (since 2017). Previously, Mr. Ricks held various leadership roles with Lilly,\nincluding senior vice president and president, Lilly Bio-Medicines. Mr. Ricks has 26 years of service with Lilly.\nAnat Ashkenazi\n50\nExecutive Vice President and Chief Financial Officer (since 2021). Previously, Ms. Ashkenazi held various leadership roles with Lilly,\nincluding senior vice president, controller and chief financial officer, Lilly Research Laboratories, and vice president, finance and chief\nfinancial officer, Lilly Diabetes and Lilly global manufacturing and quality. Ms. Ashkenazi has 21 years of service with Lilly.\nEric Dozier\n56\nExecutive Vice President, Human Resources and Diversity (since 2022). Previously, Mr. Dozier held various leadership roles with\nLilly, including senior vice president, ch

In [78]:
retrieval_chain.invoke({"input": "Get the revenue, the net income of the company for year 2022."})['answer']

{'input': 'Get the revenue, the net income of the company for year 2022.',
 'context': [Document(page_content='Gross Margin, Costs, and Expenses\nGross margin as a percent of revenue was 76.8 percent in 2022, an increase of 2.6 percentage points compared with 2021, primarily driven by a net\ninventory impairment charge related to our COVID-19 antibodies recognized in 2021 and the unfavorable effect of foreign exchange rates on international\ninventories sold in 2021. Additionally, in 2022, favorable product mix, including the impact of lower sales of COVID-19 antibodies and Olumiant for the\ntreatment of COVID-19, were offset by lower realized prices and increased expenses due to inflation and logistics costs.\nResearch and development expenses increased 4 percent to $7.19 billion in 2022, driven primarily by higher development expenses for late-stage\nassets, partially offset by lower development expenses for COVID-19 antibodies and the favorable impact of foreign exchange rates.\nMar

Recall from yesterday that we can get structured output from documents with function calling:

In [93]:
# plain OpenAI client; function calling needs no patching
client = OpenAI(api_key=api_key)

schema = {
    "properties": {
        "name": {"type": "string"},
        "net income": {"type": "string"},
        "revenue": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["net income", "revenue", "year"],
    "type": "object",
}

def extract(query, retrieval_chain=retrieval_chain):
    answer = retrieval_chain.invoke({"input": f"{query}"})['answer']
    resp = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {
                "role": "user",
                "content": f'extract accordingly, make sure to include the unit. here is the text: {answer}',
            },
        ],
        functions=[{"name": "Company", "parameters": schema}],
        # force the call so that `function_call` below is never None
        function_call={"name": "Company"},
    )
    return resp.choices[0].message.function_call.arguments

In [94]:
query = "Get the revenue, the net income of the company for year 2022."
extract(query, retrieval_chain)

'{"revenue":"28541.4 million","net income":"6244.8 million","year":2022}'
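
The function-call arguments come back as a JSON string, not a Python object. A minimal sketch of turning the output above into a dictionary and checking the fields listed under `required` in the schema:

```python
import json

# The string returned by extract() in the run above.
raw_arguments = '{"revenue":"28541.4 million","net income":"6244.8 million","year":2022}'

record = json.loads(raw_arguments)

# The model is not guaranteed to honour the schema, so check the required keys.
missing = [key for key in ("net income", "revenue", "year") if key not in record]
if missing:
    raise ValueError(f"model omitted required fields: {missing}")

print(record["revenue"], record["year"])
```

This manual parsing and checking is the part that Instructor automates through a Pydantic model.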

Using Instructor is even simpler:

In [102]:
from pydantic import BaseModel, Field
import instructor
from typing import Optional

# patch the client to add `response_model` to the `create` method
client = instructor.patch(OpenAI(api_key=api_key), mode=instructor.Mode.MD_JSON)

class Company(BaseModel):
    name: Optional[str]
    revenue: str
    net_income: str

def extract_instructor(query, response_model, retrieval_chain=retrieval_chain):
    answer = retrieval_chain.invoke({"input": f"{query}"})['answer']
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=[
            {
                "role": "user",
                "content": f"""
                Extract the following information from this statement: {answer}
                """,
            },
        ],
        response_model=response_model,)
    return resp

query = "Get the revenue, the net income of the company for year 2022."
extract_instructor(query, Company, retrieval_chain)

Company(name=None, revenue='$28,541.4 million', net_income='$6,244.8 million')
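
Because the model stores amounts as strings with their unit, any arithmetic on them needs a conversion step. The sketch below is one possible way to do that, assuming the unit is spelled out in English as in the output above. `parse_amount` is a hypothetical helper, and it does not handle negatives written in parentheses.

```python
import re

_SCALE = {"thousand": 1e3, "million": 1e6, "billion": 1e9}

def parse_amount(text):
    """Convert strings like '$28,541.4 million' to a float in dollars."""
    match = re.search(r"(-?\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion)?", text.lower())
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    number = float(match.group(1).replace(",", ""))
    # A missing unit word leaves the number unscaled.
    return number * _SCALE.get(match.group(2), 1.0)

revenue = parse_amount("$28,541.4 million")
net_income = parse_amount("$6,244.8 million")
print(f"net margin: {net_income / revenue:.1%}")
```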

In [103]:
class example(BaseModel):
    ticker: Optional[str]
    ceo: Optional[str]
    location: Optional[str]

query = "what is the ticker of the company? and who is the CEO of the company? and get the location of the company"
extract_instructor(query, example, retrieval_chain)

example(ticker=None, ceo='David Ricks', location='Indianapolis')

#### We can store multiple documents of our choice in the vector database, which is closer to real-world use cases.

In [105]:
from langchain_community.document_loaders import DirectoryLoader, PyMuPDFLoader


loader = DirectoryLoader(f'{base_dir}/pdfs/', glob="*.pdf", loader_cls=PyMuPDFLoader)
embeddings = CohereEmbeddings(model="embed-english-v3.0")
documents = loader.load()

In [114]:
db_new = FAISS.from_documents(documents=documents, embedding=embeddings)
retriever = db_new.as_retriever(search_kwargs={"k": 5})

In [115]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI


prompt = ChatPromptTemplate.from_template(
    """You are a seasoned financial analyst. You have been asked to analyze the financial statements of a company.
    Answer questions from user by inspecting closely at the context from assistant
    <context>
    {context}
    </context>
    Question: {input}"""
)

llm = ChatOpenAI(temperature=0.1, openai_api_key=api_key)

document_chain = create_stuff_documents_chain(llm, prompt)
retrieval_chain = create_retrieval_chain(retriever, document_chain)



In [116]:
retrieval_chain.invoke({"input": "What is the revenue for Eli and Lilly for year 2022?"})['answer']

{'input': 'What is the revenue for Eli and Lilly for year 2022?',
 'context': [Document(page_content='Item 8. Financial Statements and Supplementary Data\nConsolidated Statements of Operations\nELI LILLY AND COMPANY AND SUBSIDIARIES\n(Dollars in millions and shares in thousands, except per-\nshare data)\nYear Ended December 31\n2022\n2021\n2020\nRevenue (Note 2)\n$\n28,541.4 \n$\n28,318.4 \n$\n24,539.8 \nCosts, expenses, and other:\nCost of sales\n6,629.8 \n7,312.8 \n5,483.3 \nResearch and development\n7,190.8 \n6,930.7 \n5,976.3 \nMarketing, selling, and administrative\n6,440.4 \n6,431.6 \n6,121.2 \nAcquired in-process research and development and development milestones (Note 3)\n908.5 \n970.1 \n769.8 \nAsset impairment, restructuring, and other special charges \n(Note 5)\n244.6 \n316.1 \n131.2 \nOther—net, (income) expense (Note 18)\n320.9 \n201.6 \n(1,171.9)\n21,735.0 \n22,162.9 \n17,309.9 \nIncome before income taxes\n6,806.4 \n6,155.5 \n7,229.9 \nIncome taxes (Note 14)\n561.6 \n57

In [117]:
retrieval_chain.invoke({"input": "What is the revenue for Microsoft for year 2022?"})['answer']

{'input': 'What is the revenue for Microsoft for year 2022?',
 'context': [Document(page_content='PART II\nItem 7\n \nSUMMARY RESULTS OF OPERATIONS\n \n(In millions, except percentages and per share amounts)\n \n2023\n2022\nPercentage\nChange  \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \nRevenue\n $\n211,915   $\n198,270\n \n7%\nGross margin\n  \n146,052    \n135,620\n \n8%\nOperating income\n  \n88,523    \n83,383\n \n6%\nNet income\n  \n72,361    \n72,738\n \n(1)%\nDiluted earnings per share\n  \n9.68   \n9.65\n \n0%\nAdjusted gross margin (non-GAAP)\n  \n146,204    \n135,620    \n8%  \nAdjusted operating income (non-GAAP)\n  \n89,694    \n83,383    \n8%  \nAdjusted net income (non-GAAP)\n \n \n73,307   \n \n69,447\n \n6%\nAdjusted diluted earnings per share (non-GAAP)\n  \n9.81   \n9.21\n \n7%\n \n \n \n \n \n \n \n \n \n \n \n \n \n \nAdjusted gross margin, operating income, net income, and diluted earnings per share (“EPS”) are non-GAAP financial measures. Current

In [119]:
retrieval_chain.invoke({"input": "What is revenue for Tesla for year 2022?"})['answer']

{'input': 'What is revenue for Tesla for year 2022?',
 'context': [Document(page_content=' \n2022 compared to 2021\nCost of automotive sales revenue increased $17.18 billion, or 53%, in the year ended December 31, 2022 as compared to the year ended December 31, 2021, in \nline with the growth in revenue year over year, as discussed above. The average combined cost per unit of Model 3 and Model Y increased year over year due to rising \nraw material, logistics and warranty costs. There were also idle capacity charges of $306 million primarily related to the temporary suspension of production at \nGigafactory Shanghai as well as the ramping up of production in Gigafactory Texas and our proprietary battery cells manufacturing during the year ended December 31, \n2022. We had also incurred costs related to the ramp up of production in Gigafactory Berlin-Brandenburg during the year ended December 31, 2022. These increases \nwere partially offset by a decrease in combined average Model S and

In [120]:
retrieval_chain.invoke({"input": "What are some key financials for Eli and Lilly for year 2022?"})['answer']

{'input': 'What are some key financials for Eli and Lilly for year 2022?',
 'context': [Document(page_content='Item 8. Financial Statements and Supplementary Data\nConsolidated Statements of Operations\nELI LILLY AND COMPANY AND SUBSIDIARIES\n(Dollars in millions and shares in thousands, except per-\nshare data)\nYear Ended December 31\n2022\n2021\n2020\nRevenue (Note 2)\n$\n28,541.4 \n$\n28,318.4 \n$\n24,539.8 \nCosts, expenses, and other:\nCost of sales\n6,629.8 \n7,312.8 \n5,483.3 \nResearch and development\n7,190.8 \n6,930.7 \n5,976.3 \nMarketing, selling, and administrative\n6,440.4 \n6,431.6 \n6,121.2 \nAcquired in-process research and development and development milestones (Note 3)\n908.5 \n970.1 \n769.8 \nAsset impairment, restructuring, and other special charges \n(Note 5)\n244.6 \n316.1 \n131.2 \nOther—net, (income) expense (Note 18)\n320.9 \n201.6 \n(1,171.9)\n21,735.0 \n22,162.9 \n17,309.9 \nIncome before income taxes\n6,806.4 \n6,155.5 \n7,229.9 \nIncome taxes (Note 14)\n5

In [122]:
retrieval_chain.invoke({"input": "What are net income for Eli and Lilly for year 2022 and 2021??"})['answer']

{'input': 'What are net income for Eli and Lilly for year 2022 and 2021??',
 'context': [Document(page_content='Consolidated Statements of Comprehensive Income (Loss)\nELI LILLY AND COMPANY AND SUBSIDIARIES\n(Dollars in millions)\nYear Ended December 31\n2022\n2021\n2020\nNet income\n$\n6,244.8 \n$\n5,581.7 \n$\n6,193.7 \nOther comprehensive income (loss):\nChange in foreign currency translation gains (losses)\n(248.1)\n13.5 \n122.1 \nChange in net unrealized gains (losses) on securities\n(53.2)\n(15.9)\n14.2 \nChange in defined benefit pension and retiree health benefit plans (Note 15)\n616.9 \n2,699.4 \n(157.1)\nChange in effective portion of cash flow hedges\n432.9 \n151.6 \n(152.9)\nOther comprehensive income (loss) before income taxes\n748.5 \n2,848.6 \n(173.7)\nBenefit (provision) for income taxes related to other comprehensive income (loss)\n(250.0)\n(695.3)\n200.9 \nOther comprehensive income, net of tax (Note 17)\n498.5 \n2,153.3 \n27.2 \nComprehensive income\n$\n6,743.3 \n$\n

#### Finally, let's add memory capabilities for better user experience
https://python.langchain.com/docs/expression_language/cookbook/retrieval

In [123]:
from operator import itemgetter
from langchain.memory import ConversationBufferMemory
from langchain.schema import format_document
from langchain_core.messages import AIMessage, HumanMessage, get_buffer_string
from langchain_core.runnables import RunnableParallel
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain.prompts.prompt import PromptTemplate
_template = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
ANSWER_PROMPT = ChatPromptTemplate.from_template(template)

DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(template="{page_content}")


def _combine_documents(
    docs, document_prompt=DEFAULT_DOCUMENT_PROMPT, document_separator="\n\n"
):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)

In [132]:
memory = ConversationBufferMemory(
    return_messages=True, output_key="answer", input_key="question"
)

# First we add a step to load memory
# This adds a "chat_history" key to the input object
loaded_memory = RunnablePassthrough.assign(
    chat_history=RunnableLambda(memory.load_memory_variables) | itemgetter("history"),
)
# Now we calculate the standalone question
standalone_question = {
    "standalone_question": {
        "question": lambda x: x["question"],
        "chat_history": lambda x: get_buffer_string(x["chat_history"]),
    }
    | CONDENSE_QUESTION_PROMPT
    | ChatOpenAI(temperature=0.1, openai_api_key=api_key, model="gpt-4-1106-preview")
    | StrOutputParser(),
}
# Now we retrieve the documents
retrieved_documents = {
    "docs": itemgetter("standalone_question") | retriever,
    "question": lambda x: x["standalone_question"],
}
# Now we construct the inputs for the final prompt
final_inputs = {
    "context": lambda x: _combine_documents(x["docs"]),
    "question": itemgetter("question"),
}
# And finally, we do the part that returns the answers
answer = {
    "answer": final_inputs | ANSWER_PROMPT | ChatOpenAI(temperature=0.1, openai_api_key=api_key, model="gpt-4-1106-preview"),
    "docs": itemgetter("docs"),
}
# And now we put it all together!
final_chain = loaded_memory | standalone_question | retrieved_documents | answer



In [139]:
inputs = {"question": "what is the revenue of microsoft for year 2022?"}
result = final_chain.invoke(inputs)
result

{'answer': AIMessage(content="Microsoft's revenue in 2022 was $198,270 million (or $198.27 billion)."),
 'docs': [Document(page_content='PART II\nItem 7\n \nSUMMARY RESULTS OF OPERATIONS\n \n(In millions, except percentages and per share amounts)\n \n2023\n2022\nPercentage\nChange  \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \nRevenue\n $\n211,915   $\n198,270\n \n7%\nGross margin\n  \n146,052    \n135,620\n \n8%\nOperating income\n  \n88,523    \n83,383\n \n6%\nNet income\n  \n72,361    \n72,738\n \n(1)%\nDiluted earnings per share\n  \n9.68   \n9.65\n \n0%\nAdjusted gross margin (non-GAAP)\n  \n146,204    \n135,620    \n8%  \nAdjusted operating income (non-GAAP)\n  \n89,694    \n83,383    \n8%  \nAdjusted net income (non-GAAP)\n \n \n73,307   \n \n69,447\n \n6%\nAdjusted diluted earnings per share (non-GAAP)\n  \n9.81   \n9.21\n \n7%\n \n \n \n \n \n \n \n \n \n \n \n \n \n \nAdjusted gross margin, operating income, net income, and diluted earnings per share (“EPS”) a

In [140]:
memory.save_context(inputs, {"answer": result["answer"]})
inputs = {"question": "what about it's net income?"}
result = final_chain.invoke(inputs)
result["answer"]

AIMessage(content="Microsoft's net income for the year 2022 was $72,738 million.")
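
The follow-up question above never names Microsoft; it is resolved because the chain first rewrites it into a standalone question using the saved history. The sketch below reproduces only that prompt-building step in plain Python, without LangChain. `BufferMemory` is a toy stand-in written for this illustration, not LangChain's class, and the template is a shortened copy of the condense prompt used earlier.

```python
class BufferMemory:
    """Toy conversation buffer: keeps every (question, answer) turn."""

    def __init__(self):
        self.turns = []

    def save_context(self, question, answer):
        self.turns.append((question, answer))

    def as_text(self):
        lines = []
        for question, answer in self.turns:
            lines.append(f"Human: {question}")
            lines.append(f"AI: {answer}")
        return "\n".join(lines)

CONDENSE_TEMPLATE = (
    "Given the following conversation and a follow up question, "
    "rephrase the follow up question to be a standalone question.\n\n"
    "Chat History:\n{chat_history}\n"
    "Follow Up Input: {question}\n"
    "Standalone question:"
)

memory_sketch = BufferMemory()
memory_sketch.save_context(
    "what is the revenue of microsoft for year 2022?",
    "Microsoft's revenue in 2022 was $198,270 million.",
)
condense_prompt = CONDENSE_TEMPLATE.format(
    chat_history=memory_sketch.as_text(),
    question="what about its net income?",
)
print(condense_prompt)
```

The language model's completion of this prompt is the standalone question that gets sent to the retriever, which is why each answer has to be saved to memory before the next follow-up is asked.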

In [141]:
memory.save_context(inputs, {"answer": result["answer"]})
inputs = {"question": "what about 2021?"}
result = final_chain.invoke(inputs)
result["answer"]

AIMessage(content="Microsoft's net income for the year ended June 30, 2021, was $72,738 million.")

In [142]:
inputs = {"question": "did the company have the same net income from 2021 to 2022?"}
result = final_chain.invoke(inputs)
result

{'answer': AIMessage(content="According to the provided context, the net income for Microsoft in the year 2021 is not explicitly stated in the text. The text provides detailed financial results for the fiscal years 2022 and 2023, but the net income for 2021 is not included in the excerpts provided. Therefore, it is not possible to calculate the difference in Microsoft's net income between the years 2021 and 2022 based on the information given."),
 'docs': [Document(page_content='PART II\nItem 7\n \nSUMMARY RESULTS OF OPERATIONS\n \n(In millions, except percentages and per share amounts)\n \n2023\n2022\nPercentage\nChange  \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \nRevenue\n $\n211,915   $\n198,270\n \n7%\nGross margin\n  \n146,052    \n135,620\n \n8%\nOperating income\n  \n88,523    \n83,383\n \n6%\nNet income\n  \n72,361    \n72,738\n \n(1)%\nDiluted earnings per share\n  \n9.68   \n9.65\n \n0%\nAdjusted gross margin (non-GAAP)\n  \n146,204    \n135,620    \n8%  \nA

## Final Remarks
#### RAG can be complicated in real-world scenarios. There are multiple tricks to enhance response quality; some ideas include:
- Chunking
- Reranking
- Classification
- Prompt Engineering
- Query Expansion
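
As one illustration from the list above, reranking takes the passages returned by the first-stage retriever and re-orders them with a second scoring function. The sketch below uses a deliberately simple keyword-overlap score; production rerankers typically use a dedicated model instead. The function name and the example passages are made up for illustration.

```python
def rerank(query, candidates, top_n=2):
    """Re-order retrieved passages by the fraction of query words they contain."""
    query_words = set(query.lower().split())
    if not query_words:
        return candidates[:top_n]

    def overlap(passage):
        passage_words = set(passage.lower().split())
        return len(query_words & passage_words) / len(query_words)

    # sorted() is stable, so ties keep the retriever's original order
    return sorted(candidates, key=overlap, reverse=True)[:top_n]

# Made-up passages standing in for the retriever's top results.
candidates = [
    "cost of automotive sales revenue increased",
    "total revenue for 2022 was reported in the consolidated statements",
    "board of directors and executive officers",
]
best = rerank("total revenue 2022", candidates, top_n=1)
print(best)
```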

There is a lot of research and engineering in this field, and it is moving fast. Feel free to explore and go deeper!