## Building a Retrieval-Augmented Generation (RAG) System
#### Alan García Zermeño
06/11/2024

### Section 1: Understanding the Dataset
#### This section includes:
- A summary of the dataset.
- Code snippets used for loading and cleaning the data.
- A brief report on the data cleaning process.

The dataset is an xlsx file in the 'Dataset' directory; we will load it with the pandas module.

In [2]:
import pandas as pd
data_dir = "../Dataset/DS_exam.xlsx"
data = pd.read_excel(data_dir)
data.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8
0,,,,,23011(Q) Huma Q1 Qual Follow-ups,,,,
1,Country,Requester_Type,Product,Indication,Question,,,,
2,,,,,,Channel,Date_Time_Open,Date_Time_Closed,Answer/Solution
3,UK,HCP,Keytruda,NSCLC,What are the common side effects of Keytruda?,email,2023-06-27 17:00:00,2023-07-21 18:00:00,"Common side effects include fatigue, nausea, a..."
4,US,Researcher,Keytruda,NSCLC,Can Keytruda cause immune-related adverse effe...,email,2023-06-08 08:00:00,2023-07-06 12:30:00,"Yes, Keytruda can cause immune-related adverse..."


- The DS_exam dataset is a set of questions and answers about the Keytruda treatment, along with metadata (country, requester type, channel, dates) that may be useful for filtering.

- The column names are spread across two different rows instead of a single header row, and there are 'nan' values.

- To fix this, we will reload the dataset and name each column appropriately.

In [38]:
data = pd.read_excel(data_dir, skiprows=[0, 1], header=None)   # Skip the first two rows, which hold no column names
col = pd.concat([data.iloc[0][:5], data.iloc[1][5:]])          # First 5 names come from row 0, the rest from row 1
data = data[2:]                                                # Drop the two header rows from the data
data.reset_index(drop=True, inplace=True)
data.columns = col                                             # Assign the merged column names
data.head()                                                    # Preview the cleaned dataset

Unnamed: 0,Country,Requester_Type,Product,Indication,Question,Channel,Date_Time_Open,Date_Time_Closed,Answer/Solution
0,UK,HCP,Keytruda,NSCLC,What are the common side effects of Keytruda?,email,2023-06-27 17:00:00,2023-07-21 18:00:00,"Common side effects include fatigue, nausea, a..."
1,US,Researcher,Keytruda,NSCLC,Can Keytruda cause immune-related adverse effe...,email,2023-06-08 08:00:00,2023-07-06 12:30:00,"Yes, Keytruda can cause immune-related adverse..."
2,UK,HCP,Keytruda,NSCLC,Is Keytruda safe for pregnant women?,call,2023-03-06 04:00:00,2023-04-19 15:00:00,Keytruda is not recommended for use during pre...
3,UK,Researcher,Keytruda,NSCLC,What should patients report immediately while ...,email,2023-02-05 10:30:00,2023-03-05 06:00:00,Patients should report any new or worsening sy...
4,US,Researcher,Keytruda,NSCLC,Are there any known interactions between Keytr...,call,2023-03-14 09:30:00,2023-04-16 18:30:00,"Yes, Keytruda can interact with steroids and c..."
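
The two-row header fix above is specific to this file (five names in the first row, the rest in the second). As a more general sketch, assuming every column has at least one non-null label somewhere in the leading header rows, the names could be collected automatically. The helper name merge_header_rows and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def merge_header_rows(raw: pd.DataFrame, n_header_rows: int = 2) -> pd.DataFrame:
    """Collapse the first n_header_rows rows into a single header.

    For each column, the first non-null value found in the header rows
    becomes the column name. Raises IndexError if a column has no label.
    """
    header = [raw[c].iloc[:n_header_rows].dropna().iloc[0] for c in raw.columns]
    body = raw.iloc[n_header_rows:].reset_index(drop=True)
    body.columns = header
    return body


# Toy frame mimicking the staggered header (hypothetical values).
raw = pd.DataFrame([
    ["Country", "Product", None, None],
    [None, None, "Channel", "Answer/Solution"],
    ["UK", "Keytruda", "email", "Common side effects include fatigue."],
    ["US", "Keytruda", "call", "Yes, it can."],
])

clean = merge_header_rows(raw)
print(clean.columns.tolist())
```

This avoids hard-coding the position where the header switches rows, at the cost of assuming no column has labels in both rows.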


We will save the indices of the rows whose question or answer column is empty, since we will very likely need them later, and inspect those rows to see what the problem is.

In [39]:
na_quest = data[data.Question.isnull()].index                       #Indices from null questions
na_answ = data[data['Answer/Solution'].isnull()].index              #Indices from null answers

print("Response empty instances:")
for Q in data.Question.iloc[na_answ]:
    print(f"---> {Q}")

print("\nQuestion empty instances:")
for Q in data['Answer/Solution'].iloc[na_quest]:
    print(f"---> {Q}")

Response empty instances:
---> What resources are available for patients on Keytruda to manage side effects?
---> Q: What was the impact of Keytruda on quality of life for NSCLC patients in KEYNOTE-123?
   A: Keytruda improved the quality of life by reducing symptoms related to NSCLC and decreasing the side effects typically associated with chemotherapy.
---> Q: Were there any notable differences in efficacy between Keytruda and chemotherapy in NSCLC patients in the KEYNOTE-456 trial?
   A: Yes, Keytruda demonstrated superior efficacy in terms of overall survival and progression-free survival compared to chemotherapy in NSCLC patients.
---> Q: What are the recommended dosing guidelines for Keytruda in NSCLC treatment according to KEYNOTE-123 findings?
   A: The recommended dosing is 200 mg every three weeks for a duration determined by the patient's response and tolerance to the treatment

Question empty instances:
---> Q: What was the primary outcome of the KEYNOTE-123 trial for NSCLC

- The 'nan' values appear because, in some rows, the question and the answer are stored together in the same column. We will fix this when building the corpus.
- It appears that only one of these rows has no answer at all. We are going to delete that row.
- The main idea is to have Question/Answer pairs that form a corpus of documents, against which an input query can be matched in our RAG.
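
To illustrate the layout described above, a combined cell can be split back into its two parts. This is a minimal sketch that assumes the 'Q: ...' / 'A: ...' layout seen in the printed rows, with the answer starting on a new line. The names split_qa and QA_PATTERN are hypothetical; the notebook itself relies on a simpler string replacement when building the corpus.

```python
import re

# Assumed layout: "Q: <question>", optionally followed by a new line with "A: <answer>".
QA_PATTERN = re.compile(
    r"^\s*Q:\s*(?P<question>.*?)\s*(?:\n\s*A:\s*(?P<answer>.*))?$",
    re.DOTALL,
)


def split_qa(cell: str):
    """Return (question, answer) from a combined cell; answer is None if absent."""
    match = QA_PATTERN.match(cell)
    if match is None:
        # No "Q:" prefix: treat the whole cell as the question text.
        return cell.strip(), None
    answer = match.group("answer")
    return match.group("question"), (answer.strip() if answer else None)


combined = (
    "Q: Were there any notable differences in efficacy between Keytruda and "
    "chemotherapy in NSCLC patients in the KEYNOTE-456 trial?\n"
    "   A: Yes, Keytruda demonstrated superior efficacy in terms of overall "
    "survival and progression-free survival compared to chemotherapy in NSCLC patients."
)
question, answer = split_qa(combined)
print(question)
print(answer)
```

A cell that contains only a 'Q:' part comes back with None as the answer, which makes unanswered rows easy to detect.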

In [42]:
data.Question.iloc[30]

'Q: Were there any notable differences in efficacy between Keytruda and chemotherapy in NSCLC patients in the KEYNOTE-456 trial?\n   A: Yes, Keytruda demonstrated superior efficacy in terms of overall survival and progression-free survival compared to chemotherapy in NSCLC patients.'

In [40]:
data = data.drop(index = na_answ[0])
data = data.reset_index(drop=True)
na_quest = data[data.Question.isnull()].index                       #New indices from null questions
na_answ = data[data['Answer/Solution'].isnull()].index              #New indices from null answers
data.iloc[18:23]

Unnamed: 0,Country,Requester_Type,Product,Indication,Question,Channel,Date_Time_Open,Date_Time_Closed,Answer/Solution
18,Canada,HCP,Keytruda,NSCLC,Can Keytruda be used in combination with other...,call,2023-03-30 06:30:00,2023-05-11 02:30:00,"Yes, Keytruda can be used in combination with ..."
19,Canada,HCP,Keytruda,NSCLC,What are the guidelines for Keytruda in treati...,email,2023-05-23 00:00:00,2023-06-06 18:00:00,Keytruda is recommended for relapsed or refrac...
20,US,HCP,Keytruda,NSCLC,Are there patient support programs available f...,email,2023-03-05 21:30:00,2023-04-04 04:30:00,"Yes, there are several patient support program..."
21,US,Pharmacist,Keytruda,NSCLC,How is awareness being raised for Keytruda as ...,call,2023-05-11 08:00:00,2023-06-04 13:00:00,Awareness is being raised through patient advo...
22,Canada,Researcher,Keytruda,NSCLC,Is there a community support group for patient...,call,2023-05-16 11:30:00,2023-06-10 14:00:00,"Yes, several online and local support groups a..."


Now we create our corpus: doc_quest is a list of Q/A pairs. This is where we 'fix' the 'nan' problem.

In [30]:
doc_quest = [f"Question: {data.Question.iloc[i]}:\n Answer: {data['Answer/Solution'].iloc[i]}"\
              for i in range(len(data))]  #Q/A Corpus

for i,Q in zip(na_answ,data.Question.iloc[na_answ]):
    doc_quest[i] = Q.replace("Q: ","Question: ").replace("A: ","Answer: ")
for i,A in zip(na_quest,data['Answer/Solution'].iloc[na_quest]):
    doc_quest[i] = A.replace("Q: ","Question: ").replace("A: ","Answer: ")

print(doc_quest[36])

Question: How did Keytruda perform in NSCLC patients with high PD-L1 expression in the KEYNOTE-051 trial?
   Answer: Keytruda showed significantly improved response rates and survival outcomes in NSCLC patients with high PD-L1 expression in the KEYNOTE-456 trial.
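
Because the corpus is built with f-strings, a missing value ends up as the literal text 'nan', and a cell holding only a 'Q:' part yields a document with no answer. A sanity check along these lines could flag such documents before they reach the retriever. The function find_incomplete_docs and the toy list are assumptions for illustration, not part of the original notebook.

```python
def find_incomplete_docs(docs):
    """Return the indices of documents that look incomplete.

    A document is flagged if a stringified NaN follows one of the labels,
    or if either the "Question:" or the "Answer:" label is missing.
    Note: an answer that legitimately starts with "nan" would also be flagged.
    """
    flagged = []
    for i, doc in enumerate(docs):
        has_nan = "Question: nan" in doc or "Answer: nan" in doc
        missing_part = "Question:" not in doc or "Answer:" not in doc
        if has_nan or missing_part:
            flagged.append(i)
    return flagged


# Toy corpus in the same format as doc_quest (hypothetical subset).
toy_docs = [
    "Question: Is Keytruda safe for pregnant women?:\n Answer: Keytruda is not recommended for use during pregnancy.",
    "Question: nan:\n Answer: Some answer text.",
    "Question: What was the primary outcome of the KEYNOTE-123 trial for NSCLC",
]

flagged = find_incomplete_docs(toy_docs)
print(flagged)
```

Running the same function on doc_quest would list any entries that still need manual attention.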


In [29]:
data.describe()


Unnamed: 0,Country,Requester_Type,Product,Indication,Question,Channel,Date_Time_Open,Date_Time_Closed,Answer/Solution
count,49,49,49,49,47,49,49,49,46
unique,3,3,1,1,47,2,49,49,44
top,UK,HCP,Keytruda,NSCLC,What are the common side effects of Keytruda?,call,2023-06-27 17:00:00,2023-07-21 18:00:00,No detailed information aviable on the given t...
freq,20,19,49,49,1,29,1,1,3


Broadly speaking, the dataset consists of 50 queries concerning the Keytruda treatment: 49 Question/Answer pairs and 1 question without an answer.

In [34]:
for i, d in enumerate(doc_quest):
    print(f"{i}- {d}")

0- Question: What are the common side effects of Keytruda?:
 Answer: Common side effects include fatigue, nausea, and skin rash.
1- Question: Can Keytruda cause immune-related adverse effects?:
 Answer: Yes, Keytruda can cause immune-related adverse effects such as colitis, hepatitis, and pneumonitis.
2- Question: Is Keytruda safe for pregnant women?:
 Answer: Keytruda is not recommended for use during pregnancy due to potential risks to the fetus.
3- Question: What should patients report immediately while on Keytruda treatment?:
 Answer: Patients should report any new or worsening symptoms such as cough, chest pain, or changes in vision immediately.
4- Question: Are there any known interactions between Keytruda and other medications?:
 Answer: Yes, Keytruda can interact with steroids and certain immunosuppressants, potentially affecting its efficacy and safety.
5- Question: How effective is Keytruda in treating non-small cell lung cancer?:
 Answer: Keytruda has shown to improve surviv