## Preparing data for training
- Augmenting the dataset with a context-answer setup
- Splitting into Train, Valid and Test by topic so that no topic is duplicated across sets


1. **Extractive question-answering**.  
   Extracts the answer to a question from a given context. That is, the answer is already in the context and we extract it verbatim.
   For this case, we need the question itself, the answer, and the context in which the answer appears. We do not have the context, but we can process our dataset to create it.
   I will assume that the answer provided for each question is the correct one.
2. **Generative question-answering with RAG**.  
   Uses a language generation model (e.g., a pre-trained LLM or foundation model) to create an answer to a question given a context. The context is retrieved from a knowledge base and passed to the GenAI model to generate the answer.
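
To make the extractive setup concrete, here is a minimal sketch of what one record could look like once a context is available. The `answer_start` and `answer_end` field names are assumptions in the style of SQuAD-like datasets, not columns of this notebook's CSV, and the context is shortened from the first glaucoma answer.

```python
# Hypothetical extractive QA record; the field names are illustrative assumptions.
record = {
    "question": "What is (are) Glaucoma?",
    "context": (
        "Glaucoma is a group of diseases that can damage the eye's optic nerve "
        "and result in vision loss and blindness. The most common form of the "
        "disease is open-angle glaucoma."
    ),
    "answer": "Glaucoma is a group of diseases that can damage the eye's optic nerve",
}

# In extractive QA the answer must be a verbatim span of the context,
# so it can be described by two character offsets.
record["answer_start"] = record["context"].find(record["answer"])
record["answer_end"] = record["answer_start"] + len(record["answer"])

# Slicing the context with those offsets gives back the answer text.
extracted = record["context"][record["answer_start"]:record["answer_end"]]
```

A generative setup would not need the offsets, since the model writes the answer instead of pointing at a span.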

In [77]:
import pandas as pd
import numpy as np
import os

In [4]:
abs_path = os.path.abspath('../../')
path_to_data = 'data/processed/'
filename = 'question_answer_with_labels.csv'

In [5]:
dataset = pd.read_csv(os.path.join(abs_path, path_to_data, filename))

In [6]:
dataset

Unnamed: 0,question,answer,voted_label
0,What is (are) Glaucoma?,Glaucoma is a group of diseases that can damag...,glaucoma_prevent
1,Who is at risk for Glaucoma??,Anyone can develop glaucoma. Some people are a...,glaucoma_prevent
2,How to prevent Glaucoma?,"At this time, we do not know how to prevent gl...",glaucoma_prevent
3,What are the symptoms of Glaucoma?,"At first, open-angle glaucoma has no symptoms....",glaucoma_prevent
4,What are the treatments for Glaucoma?,"Yes. Immediate treatment for early stage, open...",glaucoma_prevent
...,...,...,...
13822,What is (are) Diabetic Neuropathies: The Nerve...,Diabetic neuropathies are a family of nerve di...,nerve_diabetic
13823,What causes Diabetic Neuropathies: The Nerve D...,The causes are probably different for differen...,nerve_diabetic
13824,What are the symptoms of Diabetic Neuropathies...,Symptoms depend on the type of neuropathy and ...,nerve_diabetic
13825,How to prevent Diabetic Neuropathies: The Nerv...,The best way to prevent neuropathy is to keep ...,nerve_diabetic


In [7]:
dataset['answer'].values[0]

"Glaucoma is a group of diseases that can damage the eye's optic nerve and result in vision loss and blindness. The most common form of the disease is open-angle glaucoma. With early treatment, you can often protect your eyes against serious vision loss. (Watch the video to learn more about glaucoma. To enlarge the video, click the brackets in the lower right-hand corner. To reduce the video, press the Escape (Esc) button on your keyboard.)  See this graphic for a quick overview of glaucoma, including how many people it affects, whos at risk, what to do if you have it, and how to learn more.  See a glossary of glaucoma terms."

In [20]:
from dotenv import load_dotenv
from pathlib import Path
from openai import OpenAI
import os

In [21]:
dotenv_path = Path(os.path.join(abs_path, '.env'))

In [22]:
load_dotenv(dotenv_path=dotenv_path)

True

In [25]:
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

In [26]:
# decided to use openai here just because it is faster this way

In [36]:
question = dataset['question'].values[150]
context = dataset['answer'].values[150]

In [None]:
## Commented out: this loop was run once to extract the answer span from each context, because the original dataset does not mark which part of the text answers the question.
"""
answers = []

counter = 0
for question,context in zip(dataset['question'].values,dataset['answer'].values):
    print(counter)
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are given a question and a context. Extract the portion of the context where you think the answer is. If the answer is not in the context, say: no-answer, do not generate an answer but extract the sentence or paragraph where the answer is in the context"},
            {
                "role": "user",
                "content": f"This is the question: {question}. This is the context: {context}"
            }
        ]
    )
    counter +=1
    answers.append(completion.choices[0].message.content)
"""
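
Because the system prompt asks the model to extract rather than generate, it may be worth checking how each returned string relates to its context. The helper below is a sketch only: `classify_extraction` and its three labels are hypothetical names, and the context is a made-up sentence rather than a row of the dataset.

```python
def classify_extraction(answer: str, context: str) -> str:
    """Label a model output as 'no-answer', 'verbatim', or 'not-verbatim'."""
    cleaned = answer.strip()
    if cleaned.lower() == "no-answer":
        return "no-answer"
    # A true extraction appears character for character in the context.
    if cleaned and cleaned in context:
        return "verbatim"
    return "not-verbatim"


# Toy context, invented for illustration.
toy_context = "Regular eye exams help detect glaucoma early. Treatment can slow vision loss."
labels = [
    classify_extraction("Treatment can slow vision loss.", toy_context),
    classify_extraction("no-answer", toy_context),
    classify_extraction("Eye exams are the best prevention.", toy_context),
]
```

Counting the `'not-verbatim'` rows would show how often the model paraphrased instead of extracting.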

In [43]:
temp_answer = dataset['answer'].values

In [47]:
dataset['answer'] = answers
dataset['context'] = temp_answer

In [49]:
dataset = dataset[['question', 'context', 'answer','voted_label']]

In [52]:
dataset['voted_label'] = pd.read_csv(os.path.join(abs_path, 'data/processed', 'processed_dataset.csv'))['ac_labels']

In [53]:
dataset.head()

Unnamed: 0,question,context,answer,voted_label
0,What is (are) Glaucoma?,Glaucoma is a group of diseases that can damag...,Glaucoma is a group of diseases that can damag...,glaucoma_earlyonset
1,Who is at risk for Glaucoma??,Anyone can develop glaucoma. Some people are a...,They include - African-Americans over age 40 ...,glaucoma_earlyonset
2,How to prevent Glaucoma?,"At this time, we do not know how to prevent gl...","At this time, we do not know how to prevent gl...",glaucoma_earlyonset
3,What are the symptoms of Glaucoma?,"At first, open-angle glaucoma has no symptoms....","At first, open-angle glaucoma has no symptoms....",glaucoma_earlyonset
4,What are the treatments for Glaucoma?,"Yes. Immediate treatment for early stage, open...","Glaucoma treatments include medicines, laser s...",glaucoma_earlyonset


In [54]:
# dataset.to_csv(os.path.join(abs_path, 'data/processed/question_answer_with_context_and_labels.csv'), index= False)

In [63]:
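def find_start_end_indexes_in_context(answers, context):
    """Locate each answer inside its context.

    Returns one (start_index, answer_length) tuple per row. Note that the second
    value is the length of the answer, not an absolute end offset; the end
    offset in the context is start_index + answer_length.
    """
    start_end_indexes = []
    for a, c in zip(answers, context):
        if a in c:
            start_index = c.index(a)
            end_index = len(a)  # answer length, not an absolute offset
            start_end_indexes.append((start_index, end_index))
        else:
            # answer is not a verbatim substring of the context (e.g. 'no-answer')
            start_end_indexes.append((0, 0))
    return start_end_indexes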

In [69]:
start_end_indexes = find_start_end_indexes_in_context(dataset['answer'].tolist(), dataset['context'].tolist())

In [70]:
start_end_indexes[0]

(0, 170)

In [73]:
dataset['start_end_indexes'] = start_end_indexes

In [74]:
dataset.head()

Unnamed: 0,question,context,answer,voted_label,start_end_indexes
0,What is (are) Glaucoma?,Glaucoma is a group of diseases that can damag...,Glaucoma is a group of diseases that can damag...,glaucoma_earlyonset,"(0, 170)"
1,Who is at risk for Glaucoma??,Anyone can develop glaucoma. Some people are a...,They include - African-Americans over age 40 ...,glaucoma_earlyonset,"(73, 143)"
2,How to prevent Glaucoma?,"At this time, we do not know how to prevent gl...","At this time, we do not know how to prevent gl...",glaucoma_earlyonset,"(0, 366)"
3,What are the symptoms of Glaucoma?,"At first, open-angle glaucoma has no symptoms....","At first, open-angle glaucoma has no symptoms....",glaucoma_earlyonset,"(0, 290)"
4,What are the treatments for Glaucoma?,"Yes. Immediate treatment for early stage, open...","Glaucoma treatments include medicines, laser s...",glaucoma_earlyonset,"(146, 109)"
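
In the last row above, the stored pair is `(146, 109)`: the second value is smaller than the first because it is the answer length, not an end position. If absolute offsets are needed downstream, a variant along these lines could be used. This is a sketch, not the function the notebook ran; `find_answer_spans` is a hypothetical name, and returning `None` for missing answers is an assumption made to avoid confusing them with a real span at offset 0.

```python
def find_answer_spans(answers, contexts):
    """Return (start, end) offsets such that context[start:end] == answer.

    Rows whose answer is not a verbatim substring of the context get None.
    """
    spans = []
    for answer, context in zip(answers, contexts):
        start = context.find(answer)
        if answer and start != -1:
            spans.append((start, start + len(answer)))
        else:
            spans.append(None)
    return spans


# Toy rows, invented for illustration.
toy_contexts = [
    "Yes. Treatments include medicines and surgery.",
    "Anyone can develop it.",
]
toy_answers = ["Treatments include medicines and surgery.", "no-answer"]
spans = find_answer_spans(toy_answers, toy_contexts)
start, end = spans[0]
recovered = toy_contexts[0][start:end]
```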


### Split the dataset into train, valid, and test sets according to `voted_label` (which in our case acts as the topic), so that we evaluate on new, unseen topics

In [75]:
def get_topic_splits(total_topics, size=0.1):
    # Unique topics; the same number of topics is held out for valid and for test.
    train_topics = np.unique(total_topics)
    num_of_topics_for_dev_test = int(len(train_topics)*size)
    # replace=False so that the same topic cannot be drawn twice
    valid_topics = np.random.choice(train_topics, size=num_of_topics_for_dev_test, replace=False)
    train_topics = [x for x in train_topics if x not in valid_topics]
    test_topics = np.random.choice(train_topics, size=num_of_topics_for_dev_test, replace=False)
    train_topics = [x for x in train_topics if x not in test_topics]

    return train_topics, valid_topics, test_topics
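
The function above draws topics with the global NumPy random state, so each run yields a different split. A reproducible alternative is to shuffle the unique topics once with a seeded generator and slice the result. The sketch below assumes that approach; `split_topics` and its `seed` parameter are hypothetical and not part of the notebook.

```python
import numpy as np


def split_topics(topics, size=0.1, seed=0):
    """Shuffle the unique topics once and slice them into three disjoint groups."""
    rng = np.random.default_rng(seed)
    unique_topics = np.unique(topics)
    shuffled = rng.permutation(unique_topics)
    n_holdout = int(len(unique_topics) * size)
    valid = shuffled[:n_holdout]
    test = shuffled[n_holdout:2 * n_holdout]
    train = shuffled[2 * n_holdout:]
    return train.tolist(), valid.tolist(), test.tolist()


# Toy labels: 20 topics, each appearing three times.
toy_topics = [f"topic_{i}" for i in range(20)] * 3
train_t, valid_t, test_t = split_topics(toy_topics, size=0.1, seed=42)
```

Slicing a single permutation makes the three groups disjoint by construction, and the same seed always gives the same split.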

In [78]:
train_topics, valid_topics, test_topics = get_topic_splits(dataset['voted_label'])

In [80]:
train_dataset = dataset[dataset['voted_label'].map(lambda x: x in train_topics)].copy()
valid_dataset = dataset[dataset['voted_label'].map(lambda x: x in valid_topics)].copy()
test_dataset = dataset[dataset['voted_label'].map(lambda x: x in test_topics)].copy()
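
Since the goal is to evaluate on unseen topics, a quick leakage check after filtering can be useful. The sketch below runs on a toy DataFrame; `assert_no_topic_overlap` is a hypothetical helper, and only the `voted_label` column name is taken from the notebook.

```python
import pandas as pd


def assert_no_topic_overlap(splits, topic_column="voted_label"):
    """Raise AssertionError if a topic appears in more than one split.

    Returns the number of distinct topics seen across all splits.
    """
    seen = {}
    for name, frame in splits.items():
        for topic in frame[topic_column].unique():
            assert topic not in seen, f"{topic!r} is in both {seen[topic]} and {name}"
            seen[topic] = name
    return len(seen)


# Toy data, invented for illustration.
toy = pd.DataFrame({
    "voted_label": ["a", "a", "b", "c"],
    "question": ["q1", "q2", "q3", "q4"],
})
toy_splits = {
    "train": toy[toy["voted_label"].isin(["a"])],
    "valid": toy[toy["voted_label"].isin(["b"])],
    "test": toy[toy["voted_label"].isin(["c"])],
}
n_topics = assert_no_topic_overlap(toy_splits)
```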

In [85]:
print(f"Train samples after splitting: {train_dataset.shape[0]}")
print(f"Train context with no answers:{train_dataset[train_dataset['answer']=='no-answer'].shape[0]}")

Train samples after splitting: 11137
Train context with no answers:1468


In [86]:
print(f"Valid samples after splitting: {valid_dataset.shape[0]}")
print(f"Valid context with no answers:{valid_dataset[valid_dataset['answer']=='no-answer'].shape[0]}")

Valid samples after splitting: 1376
Valid context with no answers:163


In [83]:
print(f"Test samples after splitting: {test_dataset.shape[0]}")
print(f"Test context with no answers:{test_dataset[test_dataset['answer']=='no-answer'].shape[0]}")

(1314, 5)

## Save the csv sets to disk

In [88]:
train_dataset.to_csv(os.path.join(abs_path, 'data/processed', 'train.csv'), index=False)
valid_dataset.to_csv(os.path.join(abs_path, 'data/processed', 'valid.csv'), index=False)
test_dataset.to_csv(os.path.join(abs_path, 'data/processed', 'test.csv'), index=False)

### Now, since we are going to experiment with RAG as well, let's create the txt documents that the context will be retrieved from

In [89]:
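def write_set_to_txt_documents(context_set, set_name):
    # One .txt file per topic, holding all contexts of that topic joined by newlines.
    output_dir = os.path.join(abs_path, 'data/processed/documents', set_name)
    os.makedirs(output_dir, exist_ok=True)  # create the folder if it is missing
    for topic in np.unique(context_set['voted_label'].values):
        filename = os.path.join(output_dir, f"{topic}.txt")
        content = '\n'.join(context_set[context_set['voted_label']==topic]['context'].values)
        with open(filename, mode="wt") as f:
            f.write(content)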

In [90]:
write_set_to_txt_documents(train_dataset, 'train')

In [91]:
write_set_to_txt_documents(valid_dataset, 'valid')

In [92]:
write_set_to_txt_documents(test_dataset, 'test')
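
To illustrate how per-topic documents like these could serve as a knowledge base, here is a toy retriever that ranks documents by word overlap with the question. It is only a sketch under stated assumptions: the notebook does not specify a retrieval method, `tokenize` and `retrieve` are hypothetical helpers, and the two documents are made-up stand-ins for the `.txt` files.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, top_k=1):
    """Rank documents by how many distinct question words they contain."""
    question_tokens = tokenize(question)
    scored = [
        (len(question_tokens & tokenize(text)), name)
        for name, text in documents.items()
    ]
    # Highest overlap first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:top_k] if score > 0]


# Made-up stand-ins for the per-topic documents.
toy_documents = {
    "topic_a": "Glaucoma can damage the optic nerve of the eye.",
    "topic_b": "Nerve damage from diabetes often affects the feet.",
}
best = retrieve("What part of the eye does glaucoma damage?", toy_documents)
```

A real RAG pipeline would more likely use embeddings or BM25, but the flow is the same: score the documents, keep the top ones, and pass their text to the generator as context.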