# SEC 10-Q Eval

Evaluating Docugami KG-RAG against OpenAI Assistants Retrieval for this dataset: https://github.com/docugami/KG-RAG-datasets/tree/main/sec-10-q

## Set up Eval

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
!rm -rf temp
!git clone https://github.com/docugami/KG-RAG-datasets.git temp

In [3]:
import os
from pathlib import Path
from datetime import datetime

# Important: Create your OpenAI assistant via https://platform.openai.com/playground
#            and put the assistant ID here. Make sure you upload the identical set of
#            files listed below (these files will be uploaded automatically to Docugami)
OPENAI_ASSISTANT_ID = "asst_qY1M0SeFYlmqkEZsMVZX2VAK"

DOCSET_NAME = "SEC 10Q Filings"
EVAL_NAME = DOCSET_NAME + " " + datetime.now().strftime("%Y-%m-%d")
FILES_DIR = Path(os.getcwd()) / "temp/sec-10-q/data/v1/docs"
FILE_NAMES = [
    "2022 Q3 AAPL.pdf",
    "2022 Q3 AMZN.pdf",
    "2022 Q3 INTC.pdf",
    "2022 Q3 MSFT.pdf",
    "2022 Q3 NVDA.pdf",
    "2023 Q1 AAPL.pdf",
    "2023 Q1 AMZN.pdf",
    "2023 Q1 INTC.pdf",
    "2023 Q1 MSFT.pdf",
    "2023 Q1 NVDA.pdf",
    "2023 Q2 AAPL.pdf",
    "2023 Q2 AMZN.pdf",
    "2023 Q2 INTC.pdf",
    "2023 Q2 MSFT.pdf",
    "2023 Q2 NVDA.pdf",
    "2023 Q3 AAPL.pdf",
    "2023 Q3 AMZN.pdf",
    "2023 Q3 INTC.pdf",
    "2023 Q3 MSFT.pdf",
    "2023 Q3 NVDA.pdf",
]

# Using the mini set to save cost while developing; use the full set for actual runs (~$300 in OpenAI costs per run)
GROUND_TRUTH_CSV = Path(os.getcwd()) / "temp/sec-10-q/data/v1/qna_data_mini.csv"

# We will run each experiment multiple times and average,
# since results vary slightly over runs
PER_EXPERIMENT_RUN_COUNT = 5

# Note: Please specify ~6 (or more!) similar files to process together as a document set
# This is currently a requirement for Docugami to automatically detect motifs
# across the document set to generate a semantic XML Knowledge Graph.
assert len(FILE_NAMES) >= 6, "Please provide at least 6 files"
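
Because the OpenAI assistant and Docugami must see the identical set of files, it can help to confirm that every listed file is present on disk before uploading. The following is a minimal sketch, not part of the original notebook; `find_missing_files` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Iterable, List


def find_missing_files(files_dir: Path, file_names: Iterable[str]) -> List[str]:
    """Return the names in file_names that do not exist as files under files_dir.

    Hypothetical helper (not part of the notebook or any library).
    """
    return [name for name in file_names if not (Path(files_dir) / name).is_file()]
```

In the notebook this could be called as `find_missing_files(FILES_DIR, FILE_NAMES)`, with an assertion that the returned list is empty.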

In [10]:
import pandas as pd
from langsmith import Client

# Read
df = pd.read_csv(GROUND_TRUTH_CSV)

# Dataset
client = Client()
dataset_name = EVAL_NAME
existing_datasets = list(client.list_datasets(dataset_name=dataset_name))
if existing_datasets:
    # read existing dataset
    dataset = client.read_dataset(dataset_name=dataset_name)
else:
    dataset = client.create_dataset(dataset_name=dataset_name)
    # Populate dataset
    for _, row in df.iterrows():
        q = row["Question"]
        a = row["Answer"]
        client.create_example(
            inputs={"question": q}, outputs={"answer": a}, dataset_id=dataset.id
        )
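
The dataset cell above assumes that every CSV row has a `Question` and an `Answer` value. A minimal sketch of a pre-check, using only the standard library, is shown here; `load_qa_pairs` is a hypothetical helper, and skipping incomplete rows is an assumption about the desired behavior, not something the notebook does.

```python
import csv
from typing import Iterable, List, Tuple

REQUIRED_COLUMNS = ("Question", "Answer")  # column names read by the notebook


def load_qa_pairs(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse CSV lines into (question, answer) pairs, skipping incomplete rows.

    Hypothetical helper; raises ValueError if a required column is absent.
    """
    reader = csv.DictReader(lines)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    pairs = []
    for row in reader:
        # Short rows yield None for absent fields, so normalize to "".
        question = (row["Question"] or "").strip()
        answer = (row["Answer"] or "").strip()
        if question and answer:
            pairs.append((question, answer))
    return pairs
```

An open file object works as the `lines` argument, for example `open(GROUND_TRUTH_CSV, newline="")` inside a `with` block.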

## Set up Docugami KG-RAG

### Upload files to Docugami

In [None]:
from docugami import Docugami
from docugami.lib.upload import upload_to_named_docset, wait_for_dgml

dg_client = Docugami()
file_paths = [FILES_DIR / file_name for file_name in FILE_NAMES]

# Files will not be re-uploaded if they were previously uploaded (based on name)
dg_docs = upload_to_named_docset(dg_client, file_paths, DOCSET_NAME)

docset_id = ""
docset_name = ""
for doc in dg_docs:
    if not docset_id:
        docset_id = doc.docset.id
    else:
        # all docs must be in the same docset
        assert docset_id == doc.docset.id

    if not docset_name:
        docset_name = dg_client.docsets.retrieve(doc.docset.id).name

In [None]:
# Wait for files to finish processing (OCR, and zero-shot creation of XML knowledge graph)

# Note: This can take some time on the free Docugami tier (up to ~20 mins). Please contact us for faster paid plans.
wait_for_dgml(dg_client, dg_docs)

In [None]:
# Run indexing
from docugami_kg_rag.indexing import index_docset

assert docset_id
assert docset_name

# Note: This can take some time since it is embedding and creating summaries for all the docs and chunks
index_docset(docset_id=docset_id, name=docset_name)

### Create Docugami Agent

In [6]:
from docugami_kg_rag.agent import build_agent_runnable
from langchain_core.messages import HumanMessage

def predict_docugami_agent(input: dict, config: dict = None) -> str:
    docugami_agent = build_agent_runnable()
    question = input["question"]
    return docugami_agent.invoke(
        {
            "messages": [HumanMessage(content=question)],
        }
    )

Loading default rankgpt3 model for language en
Loading RankGPTRanker model gpt-3.5-turbo


In [8]:
# Test the agent to make sure it is working
predict_docugami_agent({"question": "How much did Microsoft spend for opex in the latest quarter?"})

'The information provided does not specify the exact amount Microsoft spent on operating expenses (opex) for the latest quarter ended September 30, 2023, but it mentions that the operating expenses increased by $119 million, marking a 2% increase from the previous period. To find the exact amount spent on opex, one would need to look at the specific figures from the previous period and apply the mentioned increase.'

## Set up OpenAI Assistants Retrieval

### Create OpenAI Agent

Please go to https://platform.openai.com/playground and create your assistant, then set its ID in `OPENAI_ASSISTANT_ID` above.

In [4]:
from langchain.agents.openai_assistant import OpenAIAssistantRunnable

def predict_openai_agent(input: dict, config: dict = None) -> str:
    openai_agent = OpenAIAssistantRunnable(assistant_id=OPENAI_ASSISTANT_ID, as_agent=True).with_config(config)
    question = input["question"]
    result = openai_agent.invoke({"content": question})

    return result.return_values["output"]

In [5]:
# Test the agent to make sure it is working
predict_openai_agent({"question": "How much did Microsoft spend for opex in the latest quarter?"})

"Microsoft's operating expenses for the latest quarter, which ended on September 30, 2023, increased by $168 million or 1% compared to the previous year.\n\nSOURCE(S): 2023 Q3 MSFT.pdf "

## Run Evals


In [11]:
import uuid
from langsmith.client import Client
from langchain.smith import RunEvalConfig
from langchain.globals import set_llm_cache, get_llm_cache

eval_config = RunEvalConfig(
    evaluators=["qa"],
)


def run_eval(eval_func, eval_run_name):
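    """
    Run eval_func over every example in the EVAL_NAME dataset and log the
    results to a LangSmith project named eval_run_name.
    """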
    client = Client()
    client.run_on_dataset(
        dataset_name=EVAL_NAME,
        llm_or_chain_factory=eval_func,
        evaluation=eval_config,
        verbose=True,
        project_name=eval_run_name,
        concurrency_level=2,  # Reduced to help with rate limits, but will take longer
    )


# Experiments
agent_map = {
    "docugami_kg_rag_zero_shot": predict_docugami_agent,
    "openai_assistant_retrieval": predict_openai_agent,
}

# Disable the global LLM cache to get fresh results every time for all experiments:
# the OpenAI Assistants API supports neither caching nor temperature 0, and
# we want to measure both agents under similar conditions.
cache = get_llm_cache()  # read before the try block so the finally clause can always restore it
try:
    set_llm_cache(None)

    for _ in range(PER_EXPERIMENT_RUN_COUNT):
        run_id = str(uuid.uuid4())
        for project_name, agent in agent_map.items():
            run_eval(agent, project_name + "_" + run_id)
finally:
    # Revert cache setting to global default
    set_llm_cache(cache)
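
The save, override, and restore pattern used in the cell above can also be written as a small context manager. This is a sketch of an alternative, not the notebook's own code; `temporarily_set` is a hypothetical name and takes any getter/setter pair.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def temporarily_set(
    getter: Callable[[], Any], setter: Callable[[Any], None], value: Any
) -> Iterator[None]:
    """Set a global value for the duration of a with-block, then restore it.

    Hypothetical helper: the previous value is restored even if the block raises.
    """
    previous = getter()
    setter(value)
    try:
        yield
    finally:
        setter(previous)
```

Under these assumptions, the experiment loop could run inside `with temporarily_set(get_llm_cache, set_llm_cache, None):`.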

View the evaluation results for project 'docugami_kg_rag_zero_shot_1def70be-e363-4459-bf3d-aaf6194e8bd0' at:
https://smith.langchain.com/o/530c4d06-5640-4c0f-94fe-0be7b769531f/datasets/a5db8a49-d0eb-4150-83e4-68bf08ad8ebf/compare?selectedSessions=84e6f6e4-6ef8-46e9-8c24-61b9941fa65c

View all tests for Dataset SEC 10Q Filings 2024-05-03 at:
https://smith.langchain.com/o/530c4d06-5640-4c0f-94fe-0be7b769531f/datasets/a5db8a49-d0eb-4150-83e4-68bf08ad8ebf
[------------------------------------------------->] 9/9

| statistic | feedback.correctness | error | execution_time | run_id |
|---|---|---|---|---|
| count | 9.0 | 0.0 | 9.0 | 9 |
| unique | | 0.0 | | 9 |
| top | | | | 2c8c1f77-99cf-4863-8c89-646ad3ec96aa |
| freq | | | | 1 |
| mean | 0.555556 | | 13.302264 | |
| std | 0.527046 | | 4.30579 | |
| min | 0.0 | | 9.893197 | |
| 25% | 0.0 | | 10.643519 | |
| 50% | 1.0 | | 11.694967 | |
| 75% | 1.0 | | 13.202559 | |


View the evaluation results for project 'openai_assistant_retrieval_1def70be-e363-4459-bf3d-aaf6194e8bd0' at:
https://smith.langchain.com/o/530c4d06-5640-4c0f-94fe-0be7b769531f/datasets/a5db8a49-d0eb-4150-83e4-68bf08ad8ebf/compare?selectedSessions=49dc2afa-de66-43e4-8a56-605989a9f5e0

View all tests for Dataset SEC 10Q Filings 2024-05-03 at:
https://smith.langchain.com/o/530c4d06-5640-4c0f-94fe-0be7b769531f/datasets/a5db8a49-d0eb-4150-83e4-68bf08ad8ebf
[------------------------------------------------->] 9/9

| statistic | feedback.correctness | error | execution_time | run_id |
|---|---|---|---|---|
| count | 9.0 | 0.0 | 9.0 | 9 |
| unique | | 0.0 | | 9 |
| top | | | | 0eac9a43-0ddf-453b-b2e8-e6c4d6010c33 |
| freq | | | | 1 |
| mean | 0.888889 | | 15.88657 | |
| std | 0.333333 | | 5.047527 | |
| min | 0.0 | | 10.196605 | |
| 25% | 1.0 | | 11.405077 | |
| 50% | 1.0 | | 14.915465 | |
| 75% | 1.0 | | 18.377867 | |


View the evaluation results for project 'docugami_kg_rag_zero_shot_e0d8d8e5-94b8-4fa8-829c-92ad7d3f9f4b' at:
https://smith.langchain.com/o/530c4d06-5640-4c0f-94fe-0be7b769531f/datasets/a5db8a49-d0eb-4150-83e4-68bf08ad8ebf/compare?selectedSessions=e5cd6b66-4fdf-48e5-bae0-5699f73147ae

View all tests for Dataset SEC 10Q Filings 2024-05-03 at:
https://smith.langchain.com/o/530c4d06-5640-4c0f-94fe-0be7b769531f/datasets/a5db8a49-d0eb-4150-83e4-68bf08ad8ebf
[------------------------------------------------->] 9/9

| statistic | feedback.correctness | error | execution_time | run_id |
|---|---|---|---|---|
| count | 9.0 | 0.0 | 9.0 | 9 |
| unique | | 0.0 | | 9 |
| top | | | | c78f8ea0-04de-4848-ba8e-24c152e5a750 |
| freq | | | | 1 |
| mean | 0.555556 | | 1.883969 | |
| std | 0.527046 | | 0.22163 | |
| min | 0.0 | | 1.631498 | |
| 25% | 0.0 | | 1.769093 | |
| 50% | 1.0 | | 1.820031 | |
| 75% | 1.0 | | 1.974672 | |


View the evaluation results for project 'openai_assistant_retrieval_e0d8d8e5-94b8-4fa8-829c-92ad7d3f9f4b' at:
https://smith.langchain.com/o/530c4d06-5640-4c0f-94fe-0be7b769531f/datasets/a5db8a49-d0eb-4150-83e4-68bf08ad8ebf/compare?selectedSessions=d3fe0fbf-4e05-4e2d-ba06-31b4e72154c6

View all tests for Dataset SEC 10Q Filings 2024-05-03 at:
https://smith.langchain.com/o/530c4d06-5640-4c0f-94fe-0be7b769531f/datasets/a5db8a49-d0eb-4150-83e4-68bf08ad8ebf
[--------------------------->                      ] 5/9

KeyboardInterrupt: 
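
Since each experiment is meant to be run several times and averaged, the per-run mean `feedback.correctness` values can be combined per agent. The sketch below is not part of the original notebook; `average_correctness` is a hypothetical helper, and the input holds only the means printed by the completed runs above (the second OpenAI run was interrupted and is excluded).

```python
from statistics import mean
from typing import Dict, List


def average_correctness(run_means: Dict[str, List[float]]) -> Dict[str, float]:
    """Average per-run mean correctness for each experiment.

    Hypothetical helper. Experiments with no completed runs are omitted.
    A plain mean of means is only valid when every run covers the same
    number of examples, as is the case here (9 per run).
    """
    return {name: mean(scores) for name, scores in run_means.items() if scores}


# Mean feedback.correctness values copied from the completed runs above.
completed_runs = {
    "docugami_kg_rag_zero_shot": [0.555556, 0.555556],
    "openai_assistant_retrieval": [0.888889],
}
averages = average_correctness(completed_runs)
```

With only two completed runs for one agent and one for the other, on a 9-question mini set, these averages are too noisy to support a conclusion; the full set and all `PER_EXPERIMENT_RUN_COUNT` runs would be needed.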