# End-to-End Evaluations

We construct an end-to-end RAG pipeline:
* Parse unstructured data from PDFs using the raw-text extraction approach.
* Use OpenAI’s text-embedding-3-large model to index and retrieve relevant data chunks.
* In the generation phase, apply the Chain-of-Thought approach to handle arithmetic-intensive tasks.

Based on this end-to-end pipeline, we evaluate 8 LLMs spanning various model sizes and architectures.
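To make the generation step concrete, here is a minimal sketch of a Chain-of-Thought prompt for an arithmetic question over retrieved chunks. The helper names `build_cot_prompt` and `extract_final_answer`, the prompt wording, and the "The answer is:" marker are all assumptions made for illustration; the prompts used in this repository come from `llm.make_prompt` and may look different.

```python
import re
from typing import List, Optional


def build_cot_prompt(question: str, contexts: List[str]) -> str:
    """Hypothetical helper: build a Chain-of-Thought prompt from retrieved chunks."""
    context_text = "\n".join(contexts)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context_text}\n\n"
        f"Question: {question}\n"
        "Think step by step and show each calculation, "
        "then give the result on a final line starting with 'The answer is:'."
    )


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the last 'The answer is:' marker, or None if absent."""
    matches = re.findall(r"The answer is:\s*(.+)", response)
    return matches[-1].strip() if matches else None


prompt = build_cot_prompt(
    "By what percentage did revenue grow?",
    ["Revenue 2019: 100", "Revenue 2020: 120"],
)
print(prompt)
```

Asking for the result on a marked final line keeps the intermediate reasoning separate from the value that is later compared against the reference answer.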

In [1]:
# Preparation
import sys
import os
from pathlib import Path

# Get the project root directory
root_dir = Path(os.path.abspath("")).resolve().parents[1]
sys.path.append(str(root_dir))
# Change the working directory to the project root
os.chdir(root_dir)

# Set up the configs for this demo
DEMO_SIZE = 2
res_dir = "experiment/e2e/res/"
# Create the results directory if it does not exist yet
os.makedirs(res_dir, exist_ok=True)

import warnings
warnings.filterwarnings('ignore')

In our paper, we use the powerful `text-embedding-3-large` model through the AzureOpenAI API. To use it, you need to set up your own api-key, endpoint, and deployment name in [uda/utils/retrieve.py](../../uda/utils/retrieve.py#L81).

For convenience, we use the `BM25` retriever here; you can also select any of the other strategies listed in the configuration below.
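For readers unfamiliar with BM25, the following is a minimal, self-contained sketch of Okapi BM25 scoring over text chunks. It illustrates the ranking idea only and is not the retriever implemented in `uda/utils/retrieve.py`; the function names, the whitespace tokenization, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for this example.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, chunks: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each chunk against the query with the Okapi BM25 formula."""
    if not chunks:
        return []
    tokenized = [chunk.lower().split() for chunk in chunks]
    n = len(tokenized)
    # Average chunk length; fall back to 1.0 if every chunk is empty
    avg_len = sum(len(tokens) for tokens in tokenized) / n or 1.0
    # Document frequency: number of chunks that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k highest-scoring chunks."""
    scores = bm25_scores(query, chunks)
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in ranked[:k]]


chunks = [
    "net revenue increased in 2020",
    "the board met in march",
    "revenue and net income table",
]
print(top_k("net revenue", chunks, k=2))
```

Because BM25 relies on term overlap alone, it needs no API key or model download, which is why it is convenient for a quick demo.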

In [2]:
# Experimental Configurations

# Available retrieval model names: "bm25", "all-MiniLM-L6-v2", "all-mpnet-base-v2", "openai", "colbert"
# We choose bm25 for convenience
RT_MODEL = "bm25"

DATASET_NAME_LIST = ["paper_tab", "paper_text", "nq", "feta", "tat", "fin"]

LOCAL_LLM_DICT = {
    "llama-8B": "meta-llama/Meta-Llama-3-8B-Instruct",
    "qwen-7B": "Qwen/Qwen1.5-7B-Chat",
    "mistral": "mistralai/Mistral-7B-Instruct-v0.2",
    "qwen-32B": "Qwen/Qwen1.5-32B-Chat",
    "mixtral": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "llama-70B": "meta-llama/Meta-Llama-3-70B-Instruct",
}
LLM_LIST = ["gpt4", "llama-8B", "mistral", "qwen-32B", "mixtral", "llama-70B", "qwen-7B"]
# You can sample a subset of LLMs for a faster demo
LLM_LIST = LLM_LIST[:1]

In our implementation, the AzureOpenAI API serves as the interface for accessing GPT models. Set up the service with your own api-key and endpoint in the `call_gpt()` call in the code below.

To use an alternative platform, replace `call_gpt()` with that platform's model-calling function.
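One way to swap in another platform is to wrap its client in a function that accepts the same keyword arguments as the `call_gpt()` call in the loop. The sketch below is hypothetical: `make_llm_caller` and `backend` are illustrative names, and returning `None` on failure is an assumption chosen to match the `response is None` check in the notebook, not a documented contract of `call_gpt()`.

```python
from typing import Any, Callable, Optional


def make_llm_caller(backend: Callable[[Any], str], max_retries: int = 2):
    """Wrap any model-calling function so it can stand in for call_gpt().

    `backend` is a hypothetical callable that takes the prompt messages
    and returns the generated text.
    """

    def call_llm(messages: Any, **_unused_gpt_kwargs: Any) -> Optional[str]:
        # Extra keyword arguments (api_key, endpoint, deployment_name) are
        # accepted and ignored so the call site does not need to change.
        for _ in range(max_retries + 1):
            try:
                return backend(messages)
            except Exception as exc:
                print(f"LLM call failed: {exc}")
        # Signal failure the same way the notebook expects
        return None

    return call_llm


# Usage with a stand-in backend that returns a fixed string
call_llm = make_llm_caller(lambda messages: "The answer is: 42")
print(call_llm(messages="What is 6 * 7?", api_key="abcd", endpoint="https://abcd"))
```

In the loop, `llm.call_gpt(...)` could then be replaced by `call_llm(...)` with the same arguments.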

In [None]:
from uda.utils import retrieve as rt
from uda.utils import retrieve_exp as rt_exp
from uda.utils import preprocess as pre
import pandas as pd
from uda.utils import llm
from uda.utils import inference
import json

for DATASET_NAME in DATASET_NAME_LIST:
    for LLM_MODEL in LLM_LIST:
        print(f"=== Start {DATASET_NAME} on {LLM_MODEL} ===")
        res_file = os.path.join(res_dir, f"{DATASET_NAME}_{LLM_MODEL}_{RT_MODEL}.jsonl")

        # If using a local LLM, initialize the model
        if LLM_MODEL in LOCAL_LLM_DICT:
            llm_name = LOCAL_LLM_DICT[LLM_MODEL]
            llm_service = inference.LLM(llm_name)
            llm_service.init_llm()

        # Load the benchmark data
        bench_json_file = pre.meta_data[DATASET_NAME]["bench_json_file"]
        with open(bench_json_file, "r") as f:
            bench_data = json.load(f)

        # Run experiments on the demo docs
        doc_list = list(bench_data.keys())
        for doc in doc_list[:DEMO_SIZE]:
            pdf_path = pre.get_example_pdf_path(DATASET_NAME, doc)
            if pdf_path is None:
                continue
            for qa_item in bench_data[doc]:
                question = qa_item["question"]
                q_uid = qa_item["q_uid"]
                collection_name = f"{DATASET_NAME}_vector_db"
                # Prepare the index
                collection = rt.prepare_collection(pdf_path, collection_name, RT_MODEL)
                # Retrieve the contexts
                contexts = rt.get_contexts(collection, question, RT_MODEL)
                context_text = '\n'.join(contexts)
                # Create the prompt
                llm_message = llm.make_prompt(question, context_text, DATASET_NAME, LLM_MODEL)
                # Generate the answer
                if LLM_MODEL in LOCAL_LLM_DICT:
                    response = llm_service.infer(llm_message)
                elif LLM_MODEL == "gpt4":
                    # Set up with your own GPT4 service
                    response = llm.call_gpt(
                        messages=llm_message,
                        api_key="abcd",
                        endpoint="https://abcd",
                        deployment_name="abcd",
                    )
                    if response is None:
                        # Fail fast with a descriptive error instead of logging empty responses
                        raise RuntimeError("GPT-4 returned no response; make sure your GPT-4 service is set up correctly.")

                # Log the results
                res_dict = {
                    "model": LLM_MODEL,
                    "question": question,
                    "response": response,
                    "doc": doc,
                    "q_uid": q_uid,
                    "answers": qa_item["answers"],
                }
                print(res_dict)
                with open(res_file, "a") as f:
                    f.write(json.dumps(res_dict) + "\n")

    print(f"=== Finish {DATASET_NAME} ===\n")


### Evaluate the end-to-end results

In [3]:
from uda.eval.my_eval import eval_from_file

dataset_name = "fin"
llm_model = "gpt4"
rt_model = "bm25"
res_file_name = f"experiment/e2e/res/{dataset_name}_{llm_model}_{rt_model}.jsonl"

eval_from_file(dataset_name, res_file_name)

Exact-match accuracy: 66.67
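For intuition about the metric, here is a simplified sketch of exact-match accuracy over records shaped like the JSONL lines written above (with `response` and `answers` fields). It is not the implementation of `eval_from_file`: the normalization rules are assumptions, `answers` is assumed to be a string or a list of strings, and a real evaluator would likely first extract the final answer from a Chain-of-Thought response.

```python
import json
import string
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match_accuracy(records: List[Dict]) -> float:
    """Percentage of records whose response equals any reference answer."""
    if not records:
        return 0.0
    hits = 0
    for rec in records:
        golds = rec["answers"]
        if isinstance(golds, str):
            golds = [golds]
        pred = normalize(str(rec["response"]))
        hits += any(pred == normalize(str(gold)) for gold in golds)
    return 100.0 * hits / len(records)


def load_jsonl(path: str) -> List[Dict]:
    """Read one JSON object per non-empty line."""
    with open(path, "r") as f:
        return [json.loads(line) for line in f if line.strip()]


# In-memory example; use load_jsonl(res_file_name) for a real results file
records = [
    {"response": "20%", "answers": ["20"]},
    {"response": "Paris", "answers": "London"},
]
print(f"Exact-match accuracy: {exact_match_accuracy(records):.2f}")
```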
