# Evaluations on indexing and retrieval
<!-- We want to evaluate:
* The performance of different indexing and retrieval strategies, spanning sparse retrieval, classic dense embeddings, and advanced retrieval models.

* The influence of precise retrieval on the quality of LLM interpretation -->


## Different Models

We compare the performance of the following five indexing and retrieval methods:

1) BM25, a lightweight sparse retrieval method that uses no neural network and ranks document segments by the frequency of the query terms they contain.

2) all-MiniLM-L6-v2, a widely used dense embedding model from SentenceTransformers that maps sentences to a 384-dimensional dense vector space.

3) all-mpnet-base-v2, another widely used embedding model from SentenceTransformers, noted for its larger architecture and improved performance.

4) text-embedding-3-large, a recent embedding model from OpenAI with enhanced capability (labelled `openai` in the code below).

5) ColBERT, an advanced retrieval model that relies on token-level embeddings and fine-grained, contextual late interaction.
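
To make the sparse baseline concrete, here is a minimal sketch of BM25 scoring in plain Python. It is not the implementation used by `uda.utils.retrieve`: the whitespace tokenization, the function name `bm25_scores`, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, segments, k1=1.5, b=0.75):
    """Score each segment against the query with the Okapi BM25 formula."""
    # Assumption: lowercase whitespace tokenization (real pipelines may differ)
    tokenized = [seg.lower().split() for seg in segments]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of segments that contain each term
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by segment length
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


segments = [
    "total long-term assets under supervision",
    "future minimum rental payments due in 2018",
    "loans backed by commercial real estate",
]
scores = bm25_scores("rental payments due in 2018", segments)
best = max(range(len(segments)), key=scores.__getitem__)
print(segments[best])
```

Segments that share no term with the query score zero, which is the main weakness of sparse retrieval that the dense and late-interaction models in the list try to address.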

In [1]:
# Preparation
import sys
import os
from pathlib import Path

# Get the project root directory
root_dir = Path(os.path.abspath("")).resolve().parents[1]
sys.path.append(str(root_dir))
# Change the working directory to the project root
os.chdir(root_dir)


In [2]:
# Set up the configs for this demo
DEMO_SIZE = 2
res_dir = "experiment/retrieval/res/"
os.makedirs(res_dir, exist_ok=True)

Run the retrieval experiments using the implementation provided in the `uda.utils` module.

In [3]:
from uda.utils import retrieve as rt
from uda.utils import retrieve_exp as rt_exp
from uda.utils import preprocess as pre
import json

DATASET_NAME_LIST = ["fin", "paper_tab", "paper_text", "nq", "feta"]
RT_MODEL_LIST = ["bm25", "all-MiniLM-L6-v2", "all-mpnet-base-v2", "openai", "colbert"]
# Complex models can be slow to run; you can restrict the run to a sub-list of models and datasets
DATASET_NAME_LIST = DATASET_NAME_LIST[:]
RT_MODEL_LIST = RT_MODEL_LIST[:1]


for DATASET_NAME in DATASET_NAME_LIST:
    for RT_MODEL in RT_MODEL_LIST:
        print(f"=== Start {DATASET_NAME} on {RT_MODEL} ===")
        res_file = os.path.join(res_dir, f"{DATASET_NAME}_{RT_MODEL}.jsonl")
        bench_json_file = pre.meta_data[DATASET_NAME]["bench_json_file"]
        with open(bench_json_file, "r") as f:
            bench_data = json.load(f)
        doc_list = list(bench_data.keys())
        for doc in doc_list[:1]:
            pdf_path = pre.get_example_pdf_path(DATASET_NAME, doc)
            if pdf_path is None:
                continue
            # Prepare the index for the document
            collection_name = f"{DATASET_NAME}_vector_db"
            collection = rt.prepare_collection(pdf_path, collection_name, RT_MODEL)
            for qa_item in bench_data[doc]:
                question = qa_item["question"]
                q_uid = qa_item["q_uid"]
                # Retrieve the contexts
                contexts = rt.get_contexts(collection, question, RT_MODEL)
                # Save the results
                rt_exp.log_score(
                    contexts, doc, q_uid, DATASET_NAME, res_file, bench_json_file
                )
            rt.reset_collection(collection_name, RT_MODEL)
    print(f"=== Finish {DATASET_NAME} ===\n")


=== Start fin on bm25 ===
Retrieval-Match-Scores {'doc_name': 'GS_2016', 'q_uid': 'GS/2016/page_79.pdf-3', 'Top-1': 0.7555555555555555, 'Top-5': 0.9333333333333333, 'Top-10': 0.9333333333333333, 'Top-20': 0.9333333333333333, 'Top-30': 0.9333333333333333}
Retrieval-Match-Scores {'doc_name': 'GS_2016', 'q_uid': 'GS/2016/page_79.pdf-1', 'Top-1': 0.7555555555555555, 'Top-5': 0.9333333333333333, 'Top-10': 0.9333333333333333, 'Top-20': 0.9333333333333333, 'Top-30': 0.9333333333333333}
Retrieval-Match-Scores {'doc_name': 'GS_2016', 'q_uid': 'GS/2016/page_161.pdf-1', 'Top-1': 0.631578947368421, 'Top-5': 0.9836065573770492, 'Top-10': 0.9836065573770492, 'Top-20': 0.9836065573770492, 'Top-30': 0.9836065573770492}
Retrieval-Match-Scores {'doc_name': 'GS_2016', 'q_uid': 'GS/2016/page_183.pdf-3', 'Top-1': 0.2857142857142857, 'Top-5': 1.0, 'Top-10': 1.0, 'Top-20': 1.0, 'Top-30': 1.0}
Retrieval-Match-Scores {'doc_name': 'GS_2016', 'q_uid': 'GS/2016/page_186.pdf-2', 'Top-1': 0.9411764705882353, 'Top-5

Get the averaged retrieval matching scores

In [4]:
import json
import pandas as pd

def get_avg_score(file_path):
    # Each line of the result file is one JSON record of Top-k scores
    with open(file_path, "r") as f:
        data_list = [json.loads(line) for line in f]
    df = pd.DataFrame(data_list)
    # Average each Top-k column over all logged questions
    res_df = pd.DataFrame(
        {
            "avg_1_score": [df["Top-1"].mean()],
            "avg_5_score": [df["Top-5"].mean()],
            "avg_10_score": [df["Top-10"].mean()],
            "avg_20_score": [df["Top-20"].mean()],
        }
    )
    return res_df

# rt_models=["bm25","all-MiniLM-L6-v2","all-mpnet-base-v2","openai","colbert"]
dataset_name="fin"
rt_model="bm25"
# relative path based on the project root
res_file_name=f"experiment/retrieval/res/{dataset_name}_{rt_model}.jsonl" 
res_df=get_avg_score(res_file_name)
print(f"===== {rt_model} on {dataset_name} =====")
print(res_df)


===== bm25 on fin =====
   avg_1_score  avg_5_score  avg_10_score  avg_20_score
0     0.672708     0.918458      0.918458      0.918458
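
The averaging step can also be done without pandas. The sketch below averages every field whose name starts with `Top-`, so it would also cover `Top-30`, assuming the result file stores the same keys as the printed `Retrieval-Match-Scores` records. The function name `average_top_k` and the two sample records are made up for illustration.

```python
import json
from collections import defaultdict


def average_top_k(lines):
    """Average every 'Top-k' field across JSONL records."""
    totals = defaultdict(float)
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for key, value in record.items():
            if key.startswith("Top-"):
                totals[key] += value
        count += 1
    if count == 0:
        return {}
    return {key: total / count for key, total in totals.items()}


# Two made-up records in the same shape as the logged scores
sample = [
    '{"doc_name": "d", "q_uid": "q1", "Top-1": 0.5, "Top-5": 1.0, "Top-30": 1.0}',
    '{"doc_name": "d", "q_uid": "q2", "Top-1": 0.25, "Top-5": 0.5, "Top-30": 1.0}',
]
averages = average_top_k(sample)
print(averages)
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, for example `with open(res_file_name) as f: average_top_k(f)`.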


## Influence of precise retrieval on the quality of LLM generation

We use human-annotated evidence to evaluate how precise retrieval can affect LLM generation.

In [5]:
import PyPDF2
from uda.utils import retrieve as rt
from uda.utils.parsing_exp import paper_raw_extract_table
import numpy as np

# The contexts here are not retrieved with embedding models.
# They are extracted directly from the human-annotated evidence.
# The following functions extract the direct evidence for each dataset.
 
def direct_fin(pdf_path, qa_item):
    # fetch the target page
    q_uid = qa_item["q_uid"]
    target_page = int(q_uid.split("page_")[-1].split(".pdf")[0]) - 1
    with open(pdf_path, "rb") as file:
        # read the pdf file of target page
        reader = PyPDF2.PdfReader(file)
        pdf_page = reader.pages[target_page]
        text = pdf_page.extract_text()
        return text


def direct_tat(qa_item):
    # fetch the target page(s)
    page_uid = qa_item["doc_page_uid"]
    page_path = f"dataset/extended_qa_info/tat_doc_page_example/{page_uid}.pdf"
    if not os.path.exists(page_path):
        raise Exception(f"TAT Page not found: {page_path}")
    page_text = rt.extract_text_from_pdf(page_path)
    return page_text


def direct_paper_text(qa_item):
    # fetch the direct evidence
    evidence = qa_item["evidence"]
    contexts = []
    for ev in evidence:
        hi_evs = ev["highlighted_evidence"]
        for hi_ev in hi_evs:
            if hi_ev not in contexts:
                contexts.append(hi_ev)
    return "\n".join(contexts)


def direct_paper_tab(pdf_path, qa_item):
    # fetch the direct evidence
    evidence = qa_item["evidence"]
    contexts = []
    for ev in evidence:
        hi_evs = ev["highlighted_evidence"]
        for hi_ev in hi_evs:
            # `ev` is a dict, so the prefix check must run on the evidence string `hi_ev`
            if hi_ev.startswith("FLOAT SELECTED: Table"):
                # extract the evidential table chunk
                table_name = hi_ev.split("FLOAT SELECTED: ")[-1]
                table_context = paper_raw_extract_table(pdf_path, table_name)
                if table_context not in contexts:
                    contexts.append(table_context)
            elif hi_ev not in contexts:
                # directly use the text evidence
                contexts.append(hi_ev)
    return "\n".join(contexts)

def direct_feta(pdf_path, qa_item):
    # fetch the chunk that is most similar to the well-parsed table
    table = qa_item["evidence"]["table_array"]
    well_parsed_table_text = "\n".join([" ".join(row) for row in table])
    text = rt.extract_text_from_pdf(pdf_path)
    chunks = rt.split_text(text,chunk_size=3000,overlap_size=500)
    chunk_scores = []
    for chunk in chunks:
        score, _ = rt.char_lcs(well_parsed_table_text, chunk)
        chunk_scores.append(score)
    target_chunk_idx = int(np.argmax(chunk_scores))
    return chunks[target_chunk_idx]


def direct_nq(pdf_path, qa_item):
    # fetch the chunk that is most similar to the paragraph-level long answer
    long_answer = qa_item["answers"]["long_answer"]
    chunks = rt.split_text(rt.extract_text_from_pdf(pdf_path))
    scores = []
    for chunk in chunks:
        score, _ = rt.word_lcs(long_answer, chunk)
        scores.append(score)
    target_chunk_idx = int(np.argmax(scores))
    return chunks[target_chunk_idx]

def get_direct_context(dataset_name, pdf_path, qa_item):
    if dataset_name == "fin":
        return direct_fin(pdf_path, qa_item)
    elif dataset_name == "tat":
        return direct_tat(qa_item)
    elif dataset_name == "paper_text":
        return direct_paper_text(qa_item)
    elif dataset_name == "paper_tab":
        return direct_paper_tab(pdf_path, qa_item)
    elif dataset_name == "nq":
        return direct_nq(pdf_path, qa_item)
    elif dataset_name == "feta":
        return direct_feta(pdf_path, qa_item)
    else:
        raise Exception(f"Dataset name not found: {dataset_name}")
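
`direct_feta` and `direct_nq` pick the chunk with the highest longest-common-subsequence (LCS) score against the annotated evidence. The sketch below shows a word-level variant of that idea. It is not the code behind `rt.word_lcs`, whose return value is a pair that is not shown here; the names `word_lcs_length` and `best_chunk` are hypothetical, and the score is the raw LCS length in words.

```python
def word_lcs_length(reference, candidate):
    """Length of the longest common subsequence, counted in words."""
    ref_words = reference.split()
    cand_words = candidate.split()
    # prev[j] is the LCS length of the reference words seen so far and cand_words[:j]
    prev = [0] * (len(cand_words) + 1)
    for ref_word in ref_words:
        curr = [0]
        for j, cand_word in enumerate(cand_words, start=1):
            if ref_word == cand_word:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def best_chunk(reference, chunks):
    """Index of the chunk sharing the longest word-level LCS with the reference."""
    lengths = [word_lcs_length(reference, chunk) for chunk in chunks]
    return max(range(len(chunks)), key=lengths.__getitem__)


chunks = [
    "the company reported higher revenue this year",
    "cash and cash equivalents consist of cash and checking accounts",
    "options granted to employees are measured at fair value",
]
idx = best_chunk("cash equivalents consist of checking accounts", chunks)
print(idx, chunks[idx])
```

An LCS tolerates extra words inserted by PDF parsing between the matching words, which is why it suits matching clean annotations against noisy extracted text better than exact substring search.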

In [6]:
# Experimental Configurations

res_dir = "experiment/retrieval/res/"
os.makedirs(res_dir, exist_ok=True)

DATASET_NAME_LIST = ["fin", "tat", "feta",  "paper_text", "nq", "paper_tab"]
LOCAL_LLM_DICT = {"llama-8B": "meta-llama/Meta-Llama-3-8B-Instruct"}
LLM_LIST = ["gpt4", "llama-8B"]

# Sample a subset for faster demo
DATASET_NAME_LIST = DATASET_NAME_LIST[:2]
LLM_LIST = LLM_LIST[:1]
DEMO_DOC_NUM = 2

Run the RAG procedure with human-annotated evidence

In [8]:
from uda.utils import retrieve as rt
from uda.utils import preprocess as pre
import pandas as pd
from uda.utils import llm
from uda.utils import inference
import json

for DATASET_NAME in DATASET_NAME_LIST:
    for LLM_MODEL in LLM_LIST:
        print(f"=== Start {DATASET_NAME} on {LLM_MODEL} ===")
        res_file = os.path.join(res_dir, f"{DATASET_NAME}_{LLM_MODEL}_direct.jsonl")

        # If using a local LLM, initialize the model
        if LLM_MODEL in LOCAL_LLM_DICT:
            llm_name = LOCAL_LLM_DICT[LLM_MODEL]
            llm_service = inference.LLM(llm_name)
            llm_service.init_llm()

        # Load the benchmark data
        bench_json_file = pre.meta_data[DATASET_NAME]["bench_json_file"]
        with open(bench_json_file, "r") as f:
            bench_data = json.load(f)

        # Run experiments on the demo docs
        doc_list = list(bench_data.keys())
        for doc in doc_list[:DEMO_DOC_NUM]:
            pdf_path = pre.get_example_pdf_path(DATASET_NAME, doc)
            if pdf_path is None:
                continue
            for qa_item in bench_data[doc]:
                question = qa_item["question"]
                # Directly retrieve the human-annotated evidence 
                context_text = get_direct_context(DATASET_NAME, pdf_path, qa_item)
                # Create the prompt
                llm_message = llm.make_prompt(question, context_text, DATASET_NAME, LLM_MODEL)
                # Generate the answer
                if LLM_MODEL in LOCAL_LLM_DICT:
                    response = llm_service.infer(llm_message)
                elif LLM_MODEL == "gpt4":
                    # Set up your own GPT-4 service using environment variables
                    response = llm.call_gpt(messages=llm_message)
                    if response is None:
                        print("Make sure your gpt4 service is set up correctly.")
                        raise Exception("GPT4 service returned no response")
                else:
                    # Guard against model names that neither branch handles
                    raise ValueError(f"Unsupported LLM: {LLM_MODEL}")

                # log the results
                res_dict = {"model": LLM_MODEL, "question": question, "response": response, "doc": doc, "q_uid": qa_item["q_uid"], "answers": qa_item["answers"]}
                print(res_dict)
                with open(res_file, "a") as f:
                    f.write(json.dumps(res_dict) + "\n")

    print(f"=== Finish {DATASET_NAME} ===\n")


=== Start fin on gpt4 ===


2024-07-05 10:46:44,360 - INFO - HTTP Request: POST https://qinchuan-hui.openai.azure.com/openai/deployments/yulong-4/chat/completions?api-version=2024-04-01-preview "HTTP/1.1 200 OK"


{'model': 'gpt4', 'question': 'what percentage of total long-term assets under supervision are comprised of fixed income in 2015?', 'response': 'The answer is: 57.5%', 'doc': 'GS_2016', 'q_uid': 'GS/2016/page_79.pdf-3', 'answers': {'str_answer': '57%', 'exe_answer': 0.57484}}


2024-07-05 10:46:48,941 - INFO - HTTP Request: POST https://qinchuan-hui.openai.azure.com/openai/deployments/yulong-4/chat/completions?api-version=2024-04-01-preview "HTTP/1.1 200 OK"


{'model': 'gpt4', 'question': 'what percentage of total long-term assets under supervision are comprised of fixed income in 2016?', 'response': 'The total long-term assets under supervision in 2016 are $983 billion, and the fixed income assets under supervision are $578 billion.\n\nTo calculate the percentage of fixed income assets in the total long-term assets under supervision for 2016:\n\n\\( \\frac{578}{983} \\times 100 = 58.8\\% \\)\n\nThe answer is: 58.8%', 'doc': 'GS_2016', 'q_uid': 'GS/2016/page_79.pdf-1', 'answers': {'str_answer': '59%', 'exe_answer': 0.588}}


2024-07-05 10:46:53,961 - INFO - HTTP Request: POST https://qinchuan-hui.openai.azure.com/openai/deployments/yulong-4/chat/completions?api-version=2024-04-01-preview "HTTP/1.1 200 OK"


{'model': 'gpt4', 'question': 'what percentage of total loans receivable gross in 2016 were loans backed by commercial real estate?', 'response': 'The total loans receivable, gross in 2016 were $50,181 million, and the loans backed by commercial real estate were $4,761 million.\n\nTo calculate the percentage:\n\n(Loans backed by commercial real estate / Total loans receivable, gross) * 100\n= ($4,761 million / $50,181 million) * 100\n= 9.48%\n\nThe answer is: 9.48%', 'doc': 'GS_2016', 'q_uid': 'GS/2016/page_161.pdf-1', 'answers': {'str_answer': '9%', 'exe_answer': 0.09488}}


2024-07-05 10:47:06,577 - INFO - HTTP Request: POST https://qinchuan-hui.openai.azure.com/openai/deployments/yulong-4/chat/completions?api-version=2024-04-01-preview "HTTP/1.1 200 OK"


{'model': 'gpt4', 'question': 'what percentage of future minimum rental payments are due in 2018?', 'response': 'The total future minimum rental payments as of December 2016 are $1,941 million. The payments due in 2018 are $282 million.\n\nTo calculate the percentage due in 2018, we use the formula:\n\n(Annual payments due / Total future payments) * 100\n\nSo for 2018, it would be:\n\n($282 million / $1,941 million) * 100 = 14.53%\n\nThe answer is: 14.53%', 'doc': 'GS_2016', 'q_uid': 'GS/2016/page_183.pdf-3', 'answers': {'str_answer': '15%', 'exe_answer': 0.14529}}


2024-07-05 10:47:11,377 - INFO - HTTP Request: POST https://qinchuan-hui.openai.azure.com/openai/deployments/yulong-4/chat/completions?api-version=2024-04-01-preview "HTTP/1.1 200 OK"


{'model': 'gpt4', 'question': 'in millions , for 2016 , 2015 , and 2014 what was the total amount of common share repurchases?', 'response': 'The answer is: 2016 - 36.6 million, 2015 - 22.1 million, 2014 - 31.8 million', 'doc': 'GS_2016', 'q_uid': 'GS/2016/page_186.pdf-2', 'answers': {'str_answer': '90.1', 'exe_answer': 90.5}}


2024-07-05 10:47:12,859 - INFO - HTTP Request: POST https://qinchuan-hui.openai.azure.com/openai/deployments/yulong-4/chat/completions?api-version=2024-04-01-preview "HTTP/1.1 200 OK"


{'model': 'gpt4', 'question': 'of the total aus net inflows/ ( outflows ) for 2014 were fixed income asset inflows in connection with our acquisition of deutsche asset & wealth management 2019s stable value business greater than the liquidity products inflows in connection with our acquisition of rbs asset management 2019s money market funds?', 'response': 'The answer is: Yes', 'doc': 'GS_2016', 'q_uid': 'GS/2016/page_79.pdf-4', 'answers': {'str_answer': 'yes', 'exe_answer': 'yes'}}


2024-07-05 10:47:19,087 - INFO - HTTP Request: POST https://qinchuan-hui.openai.azure.com/openai/deployments/yulong-4/chat/completions?api-version=2024-04-01-preview "HTTP/1.1 200 OK"


{'model': 'gpt4', 'question': 'what was the percentage change in the 5 year annual performance of the peer group stock from 2010 to 2011', 'response': 'The percentage change in the 5 year annual performance of the peer group stock from 2010 to 2011 is calculated as follows:\n\n((Value in 2011 - Value in 2010) / Value in 2010) * 100\n\nUsing the values from the report:\n\n((136.78 - 100.00) / 100.00) * 100 = 36.78%\n\nThe answer is: 36.78%', 'doc': 'JKHY_2015', 'q_uid': 'JKHY/2015/page_20.pdf-2', 'answers': {'str_answer': '8.3%', 'exe_answer': 0.08276}}


2024-07-05 10:47:20,771 - INFO - HTTP Request: POST https://qinchuan-hui.openai.azure.com/openai/deployments/yulong-4/chat/completions?api-version=2024-04-01-preview "HTTP/1.1 200 OK"


{'model': 'gpt4', 'question': 'in 2010 , what was the cumulative total return of the s&p 500?', 'response': 'The answer is: 100.00', 'doc': 'JKHY_2015', 'q_uid': 'JKHY/2015/page_20.pdf-3', 'answers': {'str_answer': '30.69', 'exe_answer': 30.69}}


2024-07-05 10:47:22,415 - INFO - HTTP Request: POST https://qinchuan-hui.openai.azure.com/openai/deployments/yulong-4/chat/completions?api-version=2024-04-01-preview "HTTP/1.1 200 OK"


{'model': 'gpt4', 'question': 'for the 2010 , what was the cumulative total return on jkhy?', 'response': 'The answer is: 100.00', 'doc': 'JKHY_2015', 'q_uid': 'JKHY/2015/page_20.pdf-1', 'answers': {'str_answer': '27.44', 'exe_answer': 27.44}}
=== Finish fin ===

=== Start tat on gpt4 ===


2024-07-05 10:47:24,149 - INFO - HTTP Request: POST https://qinchuan-hui.openai.azure.com/openai/deployments/yulong-4/chat/completions?api-version=2024-04-01-preview "HTTP/1.1 200 OK"


{'model': 'gpt4', 'question': "What is the difference between  Richard E. Belluzzo's total compensation as compared to Laura Black?", 'response': 'The answer is: 92,500.00', 'doc': 'viavi-solutions-inc_2019', 'q_uid': 'f1b63dd3002720ebbffa8f36d852d351', 'answers': {'answer': 92500, 'answer_type': 'arithmetic', 'scale': ''}}


2024-07-05 10:47:28,347 - INFO - HTTP Request: POST https://qinchuan-hui.openai.azure.com/openai/deployments/yulong-4/chat/completions?api-version=2024-04-01-preview "HTTP/1.1 200 OK"


{'model': 'gpt4', 'question': 'How much did Donald Colvin earn in cash?', 'response': 'The answer is: 97,500', 'doc': 'viavi-solutions-inc_2019', 'q_uid': 'd1a4ba4e1af05c60efa468f953a4f8ca', 'answers': {'answer': ['97,500'], 'answer_type': 'span', 'scale': ''}}


2024-07-05 10:47:33,083 - INFO - HTTP Request: POST https://qinchuan-hui.openai.azure.com/openai/deployments/yulong-4/chat/completions?api-version=2024-04-01-preview "HTTP/1.1 200 OK"


{'model': 'gpt4', 'question': 'What is the percentage difference of the total compensation between Timothy Campos and Tor Braham?', 'response': "The total compensation for Timothy Campos is $255,817 and for Tor Braham is $245,817. To calculate the percentage difference:\n\nDifference = $255,817 - $245,817 = $10,000\n\nPercentage difference = (Difference / Tor Braham's total compensation) * 100\nPercentage difference = ($10,000 / $245,817) * 100 ≈ 4.07%\n\nThe answer is: 4.07%", 'doc': 'viavi-solutions-inc_2019', 'q_uid': '5cff0439063c6a496073ce4b55591d0b', 'answers': {'answer': 4.07, 'answer_type': 'arithmetic', 'scale': 'percent'}}


2024-07-05 10:47:35,502 - INFO - HTTP Request: POST https://qinchuan-hui.openai.azure.com/openai/deployments/yulong-4/chat/completions?api-version=2024-04-01-preview "HTTP/1.1 200 OK"


{'model': 'gpt4', 'question': 'What does cash and cash equivalents consist of?', 'response': 'The answer is: cash, checking accounts, money market accounts, and temporary investments with maturities of three months or less when purchased.', 'doc': 'inpixon_2019', 'q_uid': 'c97f3158996e35cf911eec06d020badc', 'answers': {'answer': ['cash, checking accounts, money market accounts and temporary investments with maturities of three months or less when purchased.'], 'answer_type': 'span', 'scale': ''}}


2024-07-05 10:47:43,382 - INFO - HTTP Request: POST https://qinchuan-hui.openai.azure.com/openai/deployments/yulong-4/chat/completions?api-version=2024-04-01-preview "HTTP/1.1 200 OK"


{'model': 'gpt4', 'question': 'What was the average Professional and legal fees for 2018 and 2019?', 'response': 'The total Professional and legal fees for 2018 and 2019 are $545,000 and $242,000 respectively. To find the average, we add these two amounts together and divide by 2.\n\nAverage Professional and legal fees = ($545,000 + $242,000) / 2 = $787,000 / 2 = $393,500\n\nThe answer is: $393,500', 'doc': 'inpixon_2019', 'q_uid': 'd7d2b6427fb0f566a3ac2e90d842dd06', 'answers': {'answer': 393.5, 'answer_type': 'arithmetic', 'scale': 'thousand'}}


2024-07-05 10:47:48,447 - INFO - HTTP Request: POST https://qinchuan-hui.openai.azure.com/openai/deployments/yulong-4/chat/completions?api-version=2024-04-01-preview "HTTP/1.1 200 OK"


{'model': 'gpt4', 'question': 'How does the company account for options granted to employees?', 'response': 'The company accounts for options granted to employees by measuring the cost of services received in exchange for the award of equity instruments based upon the fair value of the award on the date of grant. The fair value of that award is then ratably recognized as an expense over the period during which the recipient is required to provide services in exchange for that award. Forfeitures of unvested stock options are recorded when they occur.\n\nThe answer is: By measuring the cost of services received in exchange for the award based on the fair value of the award on the grant date and recognizing it over the service period.', 'doc': 'inpixon_2019', 'q_uid': 'fde5598a4f22f1775d27e488f367cdb7', 'answers': {'answer': ['by measuring the cost of services received in exchange for the award of equity instruments based upon the fair value of the award on the date of grant.'], 'answer_t

### Evaluate the accuracy of the generation results

This demo reports the generation performance when the LLM is given the human-annotated evidence as its context.

To evaluate the performance with model-retrieved contexts, please refer to the basic end-to-end demo in [e2e.ipynb](../e2e/e2e.ipynb) 

In [11]:
dataset_name="fin"
llm_model="gpt4"
res_file_name=f"experiment/retrieval/res/{dataset_name}_{llm_model}_direct.jsonl"

from uda.eval.my_eval import eval_from_file
eval_from_file(dataset_name, res_file_name)

Exact-match accuracy: 66.67
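
The logged responses end with a line of the form `The answer is: ...`, while the reference answers mix strings and numbers. The sketch below shows one way such a response could be compared with a reference. It is not the logic of `eval_from_file`, which is not shown here: the helper names, the percent handling, and the 1% relative tolerance are all assumptions for illustration, and it is not expected to reproduce the accuracy printed above.

```python
import re


def extract_final_answer(response):
    """Return the text after the last 'The answer is:' marker, or the whole response."""
    return response.rsplit("The answer is:", 1)[-1].strip()


def to_number(text):
    """Parse the first number in the text; a directly following '%' divides it by 100."""
    match = re.search(r"-?\d[\d,]*\.?\d*", text)
    if match is None:
        return None
    value = float(match.group().replace(",", ""))
    if text[match.end():match.end() + 1] == "%":
        value /= 100
    return value


def is_match(response, gold, rel_tol=0.01):
    """Numeric references use a relative tolerance (assumed 1%); others compare as text."""
    answer = extract_final_answer(response)
    if isinstance(gold, (int, float)):
        predicted = to_number(answer)
        if predicted is None:
            return False
        return abs(predicted - gold) <= rel_tol * abs(gold)
    return answer.lower().rstrip(".") == str(gold).lower()


print(is_match("The answer is: 57.5%", 0.57484))
print(is_match("The answer is: Yes", "yes"))
print(is_match("The answer is: 100.00", 30.69))
```

Any real evaluation also has to decide how to treat units and scales, such as the `'scale': 'thousand'` field in the `tat` records, which this sketch ignores.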
