# Create the ground truth dataset for retrieval evaluation

The ground truth is a dataset that lists, for each query, the relevant documents that should be retrieved. Think of it as a labeled dataset: we know in advance which documents are the correct results for each query.

- A ground truth can be created in several ways, for example by human annotators or from logged user interactions. In our case we will use synthetic data generated by an LLM.
- We will use an LLM to generate a number of synthetic user questions for each record/document that we want to retrieve.
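
To make the idea of a labeled dataset concrete, here is a minimal sketch of how such a ground truth could later be used to score a retriever. The two ids are taken from the sample records in this notebook, but the question wording, the `hit_rate` helper, and the `retrieved` lists are hypothetical and only illustrate the shape of the data.

```python
# Toy ground truth: each synthetic query maps to the id of the record
# it was generated from (question wording is illustrative only).
ground_truth_example = [
    {"question": "What share of 2023 net sales came through direct channels?", "id": "23a4214a"},
    {"question": "What drove the rise in R&D spending in 2023?", "id": "f5f5ee2a"},
]


def hit_rate(ground_truth_rows, retrieved_ids_per_query):
    """Fraction of queries whose expected id appears among the retrieved ids."""
    hits = sum(
        1
        for row, retrieved_ids in zip(ground_truth_rows, retrieved_ids_per_query)
        if row["id"] in retrieved_ids
    )
    return hits / len(ground_truth_rows)


# Hypothetical retriever output: one list of document ids per query.
retrieved = [["23a4214a", "eb49bbd0"], ["3e4c199c", "7eac6b57"]]
print(hit_rate(ground_truth_example, retrieved))  # 0.5
```

The first query finds its expected document and the second does not, so half of the queries count as hits.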

In [59]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from tqdm import tqdm
from openai import OpenAI
from dotenv import load_dotenv
import os
import json

In [4]:
# Load environment variables from .envrc for the OpenAI API
load_dotenv("../.envrc")

True

In [5]:
# Read the prepared data
data = pd.read_csv('../data/investment_data.csv')
# Explore the first rows of data
data.head()

,question,answer,context,ticker,filing,company,id
0,What area did NVIDIA initially focus on before...,NVIDIA initially focused on PC graphics.,"Since our original focus on PC graphics, we ha...",NVDA,2023_10K,Nvidia Corporation,4f2ccc3b
1,What are some of the recent applications of GP...,Recent applications of GPU-powered deep learni...,Some of the most recent applications of GPU-po...,NVDA,2023_10K,Nvidia Corporation,ee4ed04f
2,What significant invention did NVIDIA create i...,NVIDIA invented the GPU in 1999.,Our invention of the GPU in 1999 defined moder...,NVDA,2023_10K,Nvidia Corporation,7eac6b57
3,How does NVIDIA's platform strategy contribute...,NVIDIA's platform strategy brings together har...,"NVIDIA has a platform strategy, bringing toget...",NVDA,2023_10K,Nvidia Corporation,eb49bbd0
4,What does NVIDIA's CUDA programming model enable?,NVIDIA's CUDA programming model opened the par...,With our introduction of the CUDA programming ...,NVDA,2023_10K,Nvidia Corporation,3e4c199c


In [28]:
# List the companies (tickers) from which the retrieval evaluation sample will be drawn
data.ticker.unique()

array(['NVDA', 'AAPL', 'TSLA', 'LULU', 'PG', 'COST', 'ABNB', 'MSFT',
       'BRK-A', 'META', 'AXP', 'PTON', 'SBUX', 'NKE', 'PLTR', 'AMZN',
       'NFLX', 'GOOGL', 'ABBV', 'V', 'GME', 'AMC', 'CRM', 'LLY', 'AVGO',
       'UNH', 'JNJ', 'HD', 'WMT', 'AMD', 'CVX', 'BAC', 'KO', 'T', 'AZO',
       'CAT', 'SCHW', 'CMG', 'CB', 'CMCSA', 'CVS', 'DVA', 'DAL', 'DLTR',
       'EBAY', 'EA', 'ENPH', 'EFX', 'ETSY', 'FDX', 'F', 'GRMN', 'GIS',
       'GM', 'GILD', 'GS', 'HAS', 'HSY', 'HPE', 'HLT', 'HPQ', 'HUM',
       'IBM', 'ICE', 'INTU', 'IRM', 'JPM', 'KR', 'LVS'], dtype=object)

In [22]:
# Fix the random seed so the sample (15 records per company) is reproducible
np.random.seed(123)
# Create the sample dataset to generate the ground truth for
sample = data.groupby('ticker').apply(lambda x: x.sample(15)).reset_index(drop=True)
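
As an alternative sketch, the seed can be passed to the sampling call itself so that reproducibility does not depend on NumPy's global random state. This uses pandas' `DataFrameGroupBy.sample` with `random_state`; the `toy` frame and its values are made up for illustration, and the rows selected will differ from those picked by the cell above.

```python
import pandas as pd

# Toy stand-in for the real data; only the column names mirror the notebook.
toy = pd.DataFrame({
    "ticker": ["AAA"] * 4 + ["BBB"] * 4,
    "id": [f"doc{i}" for i in range(8)],
})


def sample_per_ticker(df, n, seed):
    # Draw n rows from every ticker group with an explicit seed.
    return df.groupby("ticker").sample(n=n, random_state=seed).reset_index(drop=True)


first = sample_per_ticker(toy, n=2, seed=123)
second = sample_per_ticker(toy, n=2, seed=123)
print(first.equals(second))  # the same seed gives the same rows
```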

In [30]:
# Convert the dataframe into a list of dictionaries for each record
records = sample.to_dict(orient='records')
# Examine the first 2 QA records
records[:2]

[{'question': 'During 2023, what percentage of the Company’s net sales came from direct sales channels?',
  'answer': "During 2023, 37% of the Company's net sales came from direct sales channels.",
  'context': 'During 2023, the Company’s net sales through its direct and indirect distribution channels accounted for 37% and 63%, respectively, of total net sales.',
  'ticker': 'AAPL',
  'filing': '2023_10K',
  'company': 'Apple Inc.',
  'id': '23a4214a'},
 {'question': 'What were the main reasons for the year-over-year growth in R&D expense in 2023?',
  'answer': 'Increases in headcount-related expenses.',
  'context': 'Research and Development The year-over-year growth in R&D expense in 2023 was driven primarily by increases in headcount-related expenses.',
  'ticker': 'AAPL',
  'filing': '2023_10K',
  'company': 'Apple Inc.',
  'id': 'f5f5ee2a'}]

In [37]:
# Create the prompt template for the LLM
prompt_template = """
You emulate a user who wants to ask for information about some companies in the stock market.
Formulate 5 questions this user might ask based on a FAQ record. The record
should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as few words as possible from the record.

The record:

question: {question}
answer: {answer}
context: {context}
company: {company}
ticker: {ticker}

Provide the output in parsable JSON without using code blocks:

["question1", "question2", ..., "question5"]
""".strip()

In [39]:
# Initialize the OpenAI client
client = OpenAI()

In [40]:
# Create the function to generate the questions
def generate_questions(record):
    # Create the prompt from a template
    prompt = prompt_template.format(**record)
    # Send the prompt to the model
    full_response = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{"role": "user", "content": prompt}])
    # Extract the generated text (a JSON string) from the response
    response = full_response.choices[0].message.content
    return response

In [45]:
# Initialize the dictionary that maps each record id to its generated questions
ground_truth_answers = {}

In [46]:
for record in tqdm(records): 
    record_id = record['id']
    # Use the dictionary as a cache so that, if the loop breaks, records already processed are not regenerated
    if record_id in ground_truth_answers:
        continue
    # Generate the user questions with the LLM
    questions = generate_questions(record)
    # Save the questions for each record
    ground_truth_answers[record_id] = questions

100%|██████████| 1035/1035 [28:30<00:00,  1.65s/it]
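
The dictionary cache above only lives as long as the notebook kernel. A minimal sketch of persisting it to disk, so that a restart does not discard 28 minutes of generated questions, could look like this. The `load_cache` and `save_cache` helpers and the file name are hypothetical, and a temporary directory is used here only so the example is self-contained.

```python
import json
import os
import tempfile


def load_cache(path):
    # Return the cached {record_id: raw_response} mapping, or an empty dict.
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {}


def save_cache(cache, path):
    # Overwrite the cache file with the current mapping.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cache, f, ensure_ascii=False)


# Hypothetical location; in the notebook this could sit next to the data files.
cache_path = os.path.join(tempfile.mkdtemp(), "ground_truth_cache.json")

cache = load_cache(cache_path)           # empty on the first run
cache["23a4214a"] = '["q1", "q2"]'       # stand-in for a raw model response
save_cache(cache, cache_path)
restored = load_cache(cache_path)        # what a restarted kernel would see
```

In the generation loop, calling `save_cache` after each record (or every few records) would trade a little I/O for not having to repeat API calls.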


In [61]:
# Parsing the generated results
gt_ids = []
gt_questions = []
gt_company = []

for ids, user_q in ground_truth_answers.items():
    # Parsing all model generated user queries
    user_queries = json.loads(user_q)
    for u_q in user_queries:
        # Getting all the document ids in a list
        gt_ids.append(ids)
        # Getting the company of the record
        gt_company.append(sample[sample['id'] == ids].company.values[0])
        # Adding the user queries
        gt_questions.append(u_q)
# Shaping the result into a dictionary
gt_results = {'question': gt_questions, 'id': gt_ids, 'company': gt_company}
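
`json.loads` raises an error if a response is not valid JSON, for example when the model wraps its answer in a code fence despite the prompt. The following is a hedged sketch of a more defensive parser; `parse_questions` is a hypothetical helper, not part of the notebook, and returning `None` for unparsable responses is one possible choice.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker


def parse_questions(raw):
    """Return a list of question strings, or None if the text is not a JSON list of strings."""
    text = raw.strip()
    # Remove a surrounding Markdown code fence if the model added one.
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]         # drop the closing fence line
        text = "\n".join(lines)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    if isinstance(parsed, list) and all(isinstance(q, str) for q in parsed):
        return parsed
    return None


print(parse_questions('["first question", "second question"]'))
print(parse_questions("not valid JSON"))
```

Records for which the helper returns `None` could be collected and regenerated instead of stopping the whole parsing loop.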

In [65]:
# Creating the Ground truth dataframe
ground_truth = pd.DataFrame(gt_results)
# Examine the first entries of the ground truth dataset
ground_truth.head(10)

,question,id,company
0,What was the contribution of direct sales chan...,23a4214a,Apple Inc.
1,Can you tell me the percentage of total net sa...,23a4214a,Apple Inc.
2,"In 2023, how much of the Company's sales came ...",23a4214a,Apple Inc.
3,"For the Company in 2023, what fraction of net ...",23a4214a,Apple Inc.
4,What portion of the Company's net sales was so...,23a4214a,Apple Inc.
5,What factors contributed to the growth in R&D ...,f5f5ee2a,Apple Inc.
6,Can you explain why there was a year-over-year...,f5f5ee2a,Apple Inc.
7,What drove the rise in R&D spending in 2023?,f5f5ee2a,Apple Inc.
8,Why did headcount-related expenses influence R...,f5f5ee2a,Apple Inc.
9,What aspect of R&D expenditures saw significan...,f5f5ee2a,Apple Inc.


In [66]:
# Saving the data for further use
ground_truth.to_csv('ground_truth.csv', index=False)