Step 1: For each question's chunks (a list of articles), generate 3 test questions using Ragas.
Step 2: Use the following test distribution of question types (distributions={simple: 0.3, reasoning: 0.3, multi_context: 0.4})
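
As a rough sanity check on Step 2, the sketch below verifies that the weights sum to 1 and computes the nominal number of questions per type for a test size of 3. The string keys are stand-ins for the Ragas evolution objects (`simple`, `reasoning`, `multi_context`), and `nominal_counts` is a hypothetical helper, not part of Ragas. How Ragas rounds these fractional counts internally is not covered here.

```python
import math


def nominal_counts(test_size, distributions):
    """Return test_size * weight for each question type (hypothetical helper)."""
    total = sum(distributions.values())
    # The weights are meant to describe a full split of the test set.
    if not math.isclose(total, 1.0):
        raise ValueError(f"distribution weights sum to {total}, expected 1.0")
    return {name: test_size * weight for name, weight in distributions.items()}


# String keys are assumed stand-ins for the Ragas evolution objects.
weights = {"simple": 0.3, "reasoning": 0.3, "multi_context": 0.4}
counts = nominal_counts(3, weights)

for name, value in counts.items():
    print(f"{name}: about {value:.1f} questions")
```

With only 3 questions per document set, each type is nominally allotted roughly one question, so the realised mix per test set can deviate noticeably from the requested weights.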

### Installation

In [9]:
# !pip install -q ragas
# !pip install token-count
import pandas as pd
from tqdm import tqdm
import openai
import numpy as np
from datasets import Dataset, load_dataset
from langchain.document_loaders import DirectoryLoader
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    context_entity_recall
)
from ragas import evaluate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
import os
os.environ['OPENAI_API_KEY'] = 'OPENAPI_KEY' # placeholder; replace with your own OpenAI API key

### Documents

In [2]:
### Using QA Dataset
qaDataFilename = 'qa_data_5'
qaData = pd.read_csv(f'data_backup/{qaDataFilename}.csv') # QA dataset CSV with article chunks
qaDataforRAGAS = qaData.iloc[10:]['chunk_text'].values # Keep rows from index 10 onward (questions 11 - 20) ONLY

for i, articles in enumerate(qaDataforRAGAS): # iterate through questions
    articles = eval(articles)
    with open(f'datasource/{qaDataFilename}_ques_{11+i}.txt', 'w') as f:
        f.write('\n'.join(articles))
    

In [3]:
# Create Langchain Document format for each test set
all_documents = []
for i in range(11, 21): # for questions 11 - 20
    loader = DirectoryLoader('datasource/', glob=f'{qaDataFilename}_ques_{i}.txt') # load all articles relating to that question
    documents = loader.load()
    for doc in documents:
        doc.metadata['filename'] = doc.metadata['source']
    all_documents.append(documents)

### Test-set Generation

In [4]:
# generator with openai models
generator_llm = ChatOpenAI(model="gpt-4-turbo-preview")
critic_llm = ChatOpenAI(model="gpt-4-turbo-preview")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

In [6]:
# Test set generation
# Iterate for all 10 questions
for i, documents in enumerate(all_documents):
    # generate testset
    testset = generator.generate_with_langchain_docs(documents, test_size=3, distributions={simple: 0.3, reasoning: 0.3, multi_context: 0.4})
    testdf = testset.to_pandas()
    testdf.to_csv(f'testsets/testset_{qaDataFilename}_question_{11+i}.csv', index=False)

Generating: 100%|██████████| 3/3 [03:37<00:00, 72.52s/it]       
Generating: 100%|██████████| 3/3 [00:53<00:00, 17.87s/it]      
Generating: 100%|██████████| 3/3 [02:48<00:00, 56.05s/it]       
Generating: 100%|██████████| 3/3 [02:20<00:00, 46.76s/it]      


### Making the final Evaluation Dataset Structure

1. Question (string)
2. Contexts (list of strings)
3. Ground Truth (string)
4. Ground Truth Chunk IDs (list)
5. Retrieved Chunk IDs (list) 
6. Generated Answer (string)
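
The structure above can be checked before running any evaluation. The following is a minimal sketch, assuming the column names used in the cell below (`answer` for the generated answer and `answer_chunk_ids` for the retrieved chunk IDs); `missing_columns` is a hypothetical helper introduced only for illustration.

```python
# Column names assumed from the DataFrame built in the next cell.
REQUIRED_COLUMNS = [
    "question",
    "contexts",
    "ground_truth",
    "ground_truth_chunk_ids",
    "answer",
    "answer_chunk_ids",
]


def missing_columns(columns):
    """Return the required columns that are absent, in the required order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Example: a dataset that only has questions and answers so far.
partial = ["question", "answer"]
print(missing_columns(partial))
```

With a pandas DataFrame, the same check could be called as `missing_columns(df.columns)`.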

In [11]:
ragas_evaluation = pd.DataFrame(
    columns = ['question', 'contexts', 'ground_truth', 'ground_truth_chunk_ids', 'answer', 'answer_chunk_ids']
)

for doc_no in range(0, 6): # Each of the 6 qa_data files (qa_data_0 to qa_data_5)
    qaDataFilename = f'qa_data_{doc_no}'
    qaDataFilePath = 'data_backup/' + qaDataFilename + '.csv'
    qaData = pd.read_csv(qaDataFilePath)
    for row_no in range(11, 21): # Each of the last 10 rows (questions 11 - 20) in each file
        testsetFilePath = f'testsets/testset_{qaDataFilename}_question_{row_no}.csv'
        testset = pd.read_csv(testsetFilePath)
        for index, row in testset.iterrows(): # Each of the Q/A pairs in each testset (max 3)
            data = {
                'question': row['question'],
                'contexts': row['contexts'],
                'ground_truth': row['ground_truth'],
                'ground_truth_chunk_ids': eval(qaData.loc[(row_no-1),'chunk_id']),
                'answer': None,
                'answer_chunk_ids': None
            }
            ragas_evaluation.loc[len(ragas_evaluation.index)] = data

ragas_evaluation.dropna(subset=['ground_truth'], inplace=True)
ragas_evaluation.reset_index(drop=True, inplace=True)

In [12]:
ragas_evaluation.head()

Unnamed: 0,question,contexts,ground_truth,ground_truth_chunk_ids,answer,answer_chunk_ids
0,"Which semiconductor stock, NVIDIA or Taiwan Se...",[' product launch cadence to meet a growing an...,"NVIDIA has performed better this year, gaining...","[25c2084b-d2c7-2c1d-2bdf-5ac4b97e8d7e, 3e619c5...",,
1,What was the percentage increase in TSMC's Oct...,"['Taiwan Semiconductor Manufacturing (TSM), be...",34.8%,"[25c2084b-d2c7-2c1d-2bdf-5ac4b97e8d7e, 3e619c5...",,
2,What are the risks of a complex supply chain w...,['Item 7. Management\'s Discussion and Analysi...,The risks of a complex supply chain with facto...,"[25c2084b-d2c7-2c1d-2bdf-5ac4b97e8d7e, 3e619c5...",,
3,What factors have contributed to Nvidia's posi...,[' programming code.\n\nFor those looking for ...,Nvidia's position as a leader in the AI chip m...,"[911edbf3-396c-c1f4-e97f-18c212dee4c4, 960ff40...",,
4,What strategies does NVIDIA employ to support ...,['G informed us the licensing requirements wer...,NVIDIA employs several strategies to support t...,"[3e619c5b-8801-886f-1153-21429e404e1b, 77899e5...",,


In [42]:
ragas_evaluation.to_csv('testsets/ragas_evaluation.csv', index=False)

#### Combine Ragas and Manual Evaluation Files

In [21]:
evaluation_dataset = ragas_evaluation.copy()
evaluation_dataset['type'] = 'ragas'
manual_datasets = ['aishwarya', 'krisha', 'leo', 'tanya', 'vinay']
for name in manual_datasets:
    fileName = f'final_evaluation_datasets/{name}_manual_questions.csv'
    dataset = pd.read_csv(fileName)
    dataset['type'] = 'manual'
    evaluation_dataset = pd.concat([evaluation_dataset, dataset], ignore_index=True)
evaluation_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   question                180 non-null    object
 1   contexts                180 non-null    object
 2   ground_truth            180 non-null    object
 3   ground_truth_chunk_ids  180 non-null    object
 4   answer                  0 non-null      object
 5   answer_chunk_ids        0 non-null      object
 6   type                    180 non-null    object
dtypes: object(7)
memory usage: 10.0+ KB


In [22]:
evaluation_dataset.to_csv('final_evaluation_datasets/evaluation_dataset.csv', index=False)

### Running Evaluations

#### Baseline 1: GPT-3.5 with Own Knowledge

In [31]:
def promptGPT(prompt):
    response = openai.chat.completions.create(
        model = 'gpt-3.5-turbo-16k',
        messages=[
            {
                'role': 'user',
                'content': prompt,
            },
        ],
    )
    return response.choices[0].message.content

In [7]:
def gptBaselineAnswerGeneration(evaluation_dataset):
    eval_dataset = evaluation_dataset.copy()
    dataset_length = len(eval_dataset.index)

    ### Answer Generation
    print('Running Answer Generation module...')
    for i in tqdm(range(dataset_length), desc='Generating Answers'):
        question = eval_dataset.loc[i, 'question']
        prompt = f'Hey! Using your own knowledge, please answer this question:\n{question}\nPlease provide a concise answer in paragraph format (no bullet points) to the question and return ONLY the answer, nothing else. If you do not know the answer to a question, please return an EMPTY string. Please do not provide additional start and end comments, reply with only the answer.'
        answer = promptGPT(prompt)
        eval_dataset.loc[i, 'answer'] = answer
    print('Answer Generation Done!')
    return eval_dataset

def BaselineEvaluation(evaluation_dataset):
    ### Evaluation
    print('Running Evaluation...')
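    # Convert to Ragas format
    # .copy() avoids assigning into a slice of the caller's DataFrame (SettingWithCopyWarning)
    evaluation_dataset = evaluation_dataset[['question', 'answer', 'contexts', 'ground_truth']].copy()
    # 'contexts' is stored in the CSV as a stringified list, so parse it back into a sequence of strings
    evaluation_dataset['contexts'] = evaluation_dataset['contexts'].apply(lambda x: np.array(eval(x)))
    evaluation_dataset = Dataset.from_pandas(evaluation_dataset)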
    result = evaluate(
        evaluation_dataset,
        metrics=[
            context_precision,
            faithfulness,
            answer_relevancy,
            context_recall,
            context_entity_recall,
        ],
    )
    print('Evaluation done!')
    
    print(f'Here are the results:\n{result}')
    return result.to_pandas()

In [67]:
# Answer Generation
gptBaselineDataset = gptBaselineAnswerGeneration(evaluation_dataset)
# Evaluation (split into 4 chunks due to rate limit)
tmp1 = gptBaselineDataset.iloc[0:50, :]
tmp2 = gptBaselineDataset.iloc[50:100, :]
tmp3 = gptBaselineDataset.iloc[100:150, :]
tmp4 = gptBaselineDataset.iloc[150:, :]
gptBaselineResult1 = BaselineEvaluation(tmp1) 
gptBaselineResult2 = BaselineEvaluation(tmp2) 
gptBaselineResult3 = BaselineEvaluation(tmp3) 
gptBaselineResult4 = BaselineEvaluation(tmp4) 
gptBaselineResult = pd.concat([gptBaselineResult1, gptBaselineResult2, gptBaselineResult3, gptBaselineResult4])
gptBaselineResult.reset_index(drop=True, inplace=True)
gptBaselineResult.to_csv('final_evaluation_datasets/gpt_baseline_results.csv', index = False)
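
The manual split into four slices could also be generated from the dataset length. The sketch below is one possible way to do this, assuming the 180 rows reported by `evaluation_dataset.info()` and the batch size of 50 used above; `batch_bounds` is a hypothetical helper, not part of pandas or Ragas.

```python
def batch_bounds(n_rows, batch_size):
    """Return (start, stop) pairs that cover n_rows in steps of batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        (start, min(start + batch_size, n_rows))
        for start in range(0, n_rows, batch_size)
    ]


# 180 rows and a batch size of 50 are taken from the cells above.
bounds = batch_bounds(180, 50)
print(bounds)

# Each pair could then drive one evaluation call, for example:
#   results = [BaselineEvaluation(gptBaselineDataset.iloc[a:b, :]) for a, b in bounds]
#   gptBaselineResult = pd.concat(results, ignore_index=True)
```

This reproduces the same four slices (0:50, 50:100, 100:150, 150:180) without hard-coding them, which helps if the dataset grows or the rate limit calls for a different batch size.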

In [91]:
# Results for whole dataset
description = gptBaselineResult.describe()
total_results = {
    'context_precision': description.loc['mean', 'context_precision'],
    'faithfulness': description.loc['mean', 'faithfulness'],
    'answer_relevancy': description.loc['mean', 'answer_relevancy'],
    'context_recall': description.loc['mean', 'context_recall'],
    'context_entity_recall': description.loc['mean', 'context_entity_recall'],
}
total_results

{'context_precision': 0.8838734567165447,
 'faithfulness': 0.6924694398832331,
 'answer_relevancy': 0.8958493206186454,
 'context_recall': 0.9627777777777778,
 'context_entity_recall': 0.27091403535645997}

In [11]:
evaluation_dataset = pd.read_csv('final_evaluation_datasets/evaluation_dataset.csv')
evaluation_dataset

Unnamed: 0,question,contexts,ground_truth,ground_truth_chunk_ids,answer,answer_chunk_ids,type
0,"Which semiconductor stock, NVIDIA or Taiwan Semiconductor, has performed better this year?",[' product launch cadence to meet a growing and diverse set of AI opportunities. The increased frequency of these transitions may magnify the challenges associated with managing our supply and dem...,"NVIDIA has performed better this year, gaining 180% year-to-date compared to Taiwan Semiconductor's 38%.","['25c2084b-d2c7-2c1d-2bdf-5ac4b97e8d7e', '3e619c5b-8801-886f-1153-21429e404e1b', 'b78da971-cede-623b-d604-234e42dda7f8', 'bf715864-7c6d-03f2-2587-13ce20a99fcc', 'd87a0bff-9952-0e44-46ff-553112acde...",,,ragas
1,"What was the percentage increase in TSMC's October sales compared to September, driven by demand for AI chips and Apple's iPhone 15?","['Taiwan Semiconductor Manufacturing (TSM), better known as TSMC, reported better-than-expected sales for October, thanks to demand for chips for artificial intelligence and Apple\'s (AAPL) iPhone...",34.8%,"['25c2084b-d2c7-2c1d-2bdf-5ac4b97e8d7e', '3e619c5b-8801-886f-1153-21429e404e1b', 'b78da971-cede-623b-d604-234e42dda7f8', 'bf715864-7c6d-03f2-2587-13ce20a99fcc', 'd87a0bff-9952-0e44-46ff-553112acde...",,,ragas
2,"What are the risks of a complex supply chain with factors like demand visibility, purchase volumes, new vendors, and changing demand expectations?",['Item 7. Management\'s Discussion and Analysis of Financial Condition and Results of Operations The following discussion and analysis of our financial condition and results of operations should b...,"The risks of a complex supply chain with factors like demand visibility, purchase volumes, new vendors, and changing demand expectations include increased complexity and execution risk, potential ...","['25c2084b-d2c7-2c1d-2bdf-5ac4b97e8d7e', '3e619c5b-8801-886f-1153-21429e404e1b', 'b78da971-cede-623b-d604-234e42dda7f8', 'bf715864-7c6d-03f2-2587-13ce20a99fcc', 'd87a0bff-9952-0e44-46ff-553112acde...",,,ragas
3,What factors have contributed to Nvidia's position as a leader in the AI chip market?,"[' programming code.\n\nFor those looking for the top large-cap stocks to buy now, here\'s a dive into NVDA.\n\nNvidia Stock Technical Analysis\n\nThe AI chip leader broke out past a 476.09 buy po...","Nvidia's position as a leader in the AI chip market is attributed to its pioneering of graphics processing units (GPUs) for more realistic video games, expansion into AI chips used in supercompute...","['911edbf3-396c-c1f4-e97f-18c212dee4c4', '960ff408-d29b-e155-298c-15dbc123661a', 'a9a2c391-b9c0-2c62-b5f0-42411e07809c', 'abcc1e96-6637-b573-e425-a668f873dfa6', 'd8934572-b4ab-7b73-e2b0-fd72bf732b...",,,ragas
4,"What strategies does NVIDIA employ to support the compensation, benefits, and well-being of its employees?","['G informed us the licensing requirements were effective immediately for shipments of our A100, A800, H100, H800, and L40S products. Our sales to China decreased as a percentage of total Data Cen...","NVIDIA employs several strategies to support the compensation, benefits, and well-being of its employees, including offering equity tied to the company's stock price, comprehensive benefits for ph...","['3e619c5b-8801-886f-1153-21429e404e1b', '77899e50-1aae-2b16-f8d4-65c30e9717d0', '911edbf3-396c-c1f4-e97f-18c212dee4c4', 'b78da971-cede-623b-d604-234e42dda7f8', 'bf715864-7c6d-03f2-2587-13ce20a99f...",,,ragas
...,...,...,...,...,...,...,...
175,"Who is the CEO of the company which recently launched its Verified services, and is also responsible for the expected global AI spending to exceed $300 billion by 2026?","['Update on June 27, 2023 at 7:30 AM PT:\n\nWe’re excited to begin rolling out Meta Verified to most markets globally over the coming months.\n\nWe’ve heard positive feedback from creators in our ...","Mark Zuckerberg is the CEO of Meta, who recently launched their verified services, is responsible for the expected global AI spending to exceed $300 billion by 2026.","['19857181-dd62-9d08-2747-3331540f0b81', 'adfada1a-9035-2536-c50b-c83d3f18a36a', 'b212ddfc-cb92-43ac-de99-bf0def886139', 'd486c571-c3f4-46d5-86b4-876a28a89790', 'fcfd1b5a-f708-0a3d-3ac6-172751a4c6...",,,manual
176,"Considering the latest strategic and financial developments, how are Meta and NVIDIA adjusting their operations in response to market demands and regulatory challenges?",['Our policy is to include interest and penalties related to unrecognized tax benefits as a component of income tax expense. Revenue Recognition Revenue Allowances For products sold with a right o...,"Meta is reducing its workforce by about 10,000 employees and closing 5,000 unfilled job postings to streamline operations, while NVIDIA has launched the RTX 4090D, a China-specific gaming chip com...","['25c2084b-d2c7-2c1d-2bdf-5ac4b97e8d7e', '3c673603-bc65-e197-6c90-bcb66c977b8c', '77899e50-1aae-2b16-f8d4-65c30e9717d0', 'b78da971-cede-623b-d604-234e42dda7f8', 'd8934572-b4ab-7b73-e2b0-fd72bf732b...",,,manual
177,"As compared to Alphabet, how much more did the stock price of the company that laid off 10,000 people increase?","[""Check out the companies making headlines before the bell.\n\nAlphabet — The Google parent popped more than 6% after topping Wall Street's second-quarter earnings expectations, fueled by growth i...","Meta, who laid off more that 10,000 people increased its stock price by more than 1% as compared to Alphabet","['36a0b9b8-5aeb-f0d1-673e-044da5890e01', '3e619c5b-8801-886f-1153-21429e404e1b', '7b28a0c9-9260-0b69-8d07-4b383b4ec9ee', 'b78da971-cede-623b-d604-234e42dda7f8', 'bf715864-7c6d-03f2-2587-13ce20a99f...",,,manual
178,"Amongst Apple, Google, Meta, Microsoft, NVIDIA and TSM; which company or companies are flagged for high-risk to invest in due to geopolitical risks?",['Our policy is to include interest and penalties related to unrecognized tax benefits as a component of income tax expense. Revenue Recognition Revenue Allowances For products sold with a right o...,"Among the listed companies, Taiwan Semiconductor Manufacturing Co. (TSM) is flagged for high-risk investment due to geopolitical risks, particularly because of the tensions between Taiwan and China.","['0f064687-3f51-7c2c-9ad1-d77b09f66b36', '77899e50-1aae-2b16-f8d4-65c30e9717d0', 'b78da971-cede-623b-d604-234e42dda7f8', 'd8934572-b4ab-7b73-e2b0-fd72bf732bad', 'e5b52cd6-97dd-1b43-0a6c-d523f5f654...",,,manual


In [24]:
evaluation_dataset.loc[160, 'ground_truth']

"Yes, this would benefit Samsung. This is because Samsung has a partnership to license AMD's GPUs to handle advanced graphics that could be used in gaming or for on-device AI in any apps. This would enrich the user experience for Microsoft apps on Samsung devices, and they would have better customer satisfaction."