# Data Processing and Preparation  

The dataset does not include a context from which the answer can be extracted, so I consider two approaches to this problem.  
The preprocessing steps differ depending on the approach.  

1. **Extractive question-answering**.  
   Extracts the answer to a question from a given context: the answer appears verbatim in the context, and we extract it as is.
   For this case we need the question, the answer, and the context in which the answer appears. We do not have the context, but we can process our dataset to create it.
   I will assume that the answer provided for each question is the correct one.
2. **Generative question-answering with RAG**.  
   Uses a language generation model (e.g., a pre-trained LLM or foundation model) to create an answer to a question given a context. The context is retrieved from a knowledge base and passed to the GenAI model to generate the answer.
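
The notebook does not say how the context for the extractive approach would be built. As one possible sketch, and purely as an assumption, the answers that share the same question could be concatenated into a context, with the character offset of each answer recorded in a SQuAD-like `answer_start` field. The toy data and the `context` and `answer_start` column names are hypothetical.

```python
import pandas as pd

# Toy data standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    'question': ['What is X?', 'What is X?', 'How to prevent X?'],
    'answer': ['X is a disease.', 'X affects the eye.', 'Nobody knows.'],
})

# Assumption: the context of a question is every answer given to it, joined by spaces.
toy['context'] = toy.groupby('question')['answer'].transform(lambda s: ' '.join(s))

# Character offset of each answer inside its context, as extractive QA models expect.
toy['answer_start'] = [c.find(a) for c, a in zip(toy['context'], toy['answer'])]

print(toy[['question', 'answer_start']])
```

Questions with a single answer end up with a context equal to the answer itself, which makes extraction trivial, so a real pipeline would probably need a richer source of context.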

## General preprocessing  

Identified from data exploration:
- Remove repeated question at the beginning of the answer

In [194]:
import pandas as pd
import re
import os

In [3]:
abs_path = os.path.abspath('../../')
path_to_data = 'data/raw/'
filename = 'augmented_intern_screening_dataset.csv'

In [4]:
dataset = pd.read_csv(os.path.join(abs_path, path_to_data, filename))

In [5]:
dataset

Unnamed: 0,question,answer,processed_question,processed_answer,num_words_in_question,num_words_in_answer,distances,kwords,kmean_labels,agglomerative_labels,kmean_labels_n_3000,dbscan_labels
0,What is (are) Glaucoma?,Glaucoma is a group of diseases that can damag...,what is are glaucoma,glaucoma is a group of diseases that can damag...,4,111,0.722222,glaucoma,326,486,261,0
1,What is (are) Glaucoma?,The optic nerve is a bundle of more than 1 mil...,what is are glaucoma,the optic nerve is a bundle of more than 1 mil...,4,20,0.733333,glaucoma,326,486,261,0
2,What is (are) Glaucoma?,Open-angle glaucoma is the most common form of...,what is are glaucoma,openangle glaucoma is the most common form of ...,4,86,0.761905,glaucoma,326,486,261,0
3,Who is at risk for Glaucoma??,Anyone can develop glaucoma. Some people are a...,who is at risk for glaucoma,anyone can develop glaucoma some people are at...,6,78,0.773585,risk glaucoma,326,486,261,0
4,How to prevent Glaucoma?,"At this time, we do not know how to prevent gl...",how to prevent glaucoma,at this time we do not know how to prevent gla...,4,84,0.814815,prevent glaucoma,326,486,261,0
...,...,...,...,...,...,...,...,...,...,...,...,...
16342,What is (are) Diabetic Neuropathies: The Nerve...,Autonomic neuropathy affects the nerves that c...,what is are diabetic neuropathies the nerve da...,autonomic neuropathy affects the nerves that c...,10,492,0.733333,damage neuropathies nerve diabetes diabetic,183,1339,365,2229
16343,What is (are) Diabetic Neuropathies: The Nerve...,"Proximal neuropathy, sometimes called lumbosac...",what is are diabetic neuropathies the nerve da...,proximal neuropathy sometimes called lumbosacr...,10,91,0.704348,damage neuropathies nerve diabetes diabetic,183,1339,365,2229
16344,What is (are) Diabetic Neuropathies: The Nerve...,Focal neuropathy appears suddenly and affects ...,what is are diabetic neuropathies the nerve da...,focal neuropathy appears suddenly and affects ...,10,179,0.726190,damage neuropathies nerve diabetes diabetic,183,1339,365,2229
16345,How to prevent Diabetic Neuropathies: The Nerv...,The best way to prevent neuropathy is to keep ...,how to prevent diabetic neuropathies the nerve...,the best way to prevent neuropathy is to keep ...,10,30,0.681159,prevent damage neuropathies nerve diabetes dia...,183,1339,365,2229


### Remove repeated question at the beginning of the answer  
Repeated questions at the beginning of the answer can be identified by checking the `distances` column.  
If the question is repeated at the beginning of the answer, the distance is approximately below 0.15.  
In the exploration step, I used a margin of 3 words, because the question repeated in the answer is sometimes worded slightly differently.  
For example, *q: what are the signs of xxx, a: what are the signs or symptoms of xxx..*
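
As an illustration of the idea, here is a minimal sketch that detects a reworded question at the start of an answer with the standard library's `difflib.SequenceMatcher`. This is not how the `distances` column was computed; the function name and the `min_ratio` threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher

def strip_repeated_question(question, answer, min_ratio=0.75):
    """Drop the leading sentence of `answer` when it closely matches `question`."""
    # Everything up to the first '?' is treated as the candidate repeated question.
    head, sep, tail = answer.partition('?')
    if not sep:
        return answer  # no question mark, nothing to strip
    ratio = SequenceMatcher(
        None, question.lower().strip(' ?'), head.lower().strip()
    ).ratio()
    return tail.lstrip() if ratio >= min_ratio else answer

cleaned = strip_repeated_question(
    'What are the symptoms of X?',
    'What are the signs and symptoms of X? The list follows.',
)
untouched = strip_repeated_question('What is X?', 'Is it painful? X is a disease.')
print(cleaned)
print(untouched)
```

Comparing strings directly avoids relying on word counts, so it tolerates a reworded question of any length, at the cost of one similarity computation per row.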

In [56]:
## example
"""
question: What are the symptoms of Natal teeth, intestinal pseudoobstruction and patent ductus?
answer: What are the signs and symptoms of Natal teeth, intestinal pseudoobstruction and patent ductus? The Human Phenotype Ontology provides the following list of signs and symptoms for Natal teeth, intestin

question: What are the symptoms of Familial visceral myopathy with external ophthalmoplegia?
answer: What are the signs and symptoms of Familial visceral myopathy with external ophthalmoplegia? The Human Phenotype Ontology provides the following list of signs and symptoms for Familial visceral myopat

question: What are the symptoms of Hypermanganesemia with dystonia polycythemia and cirrhosis?
answer: What are the signs and symptoms of Hypermanganesemia with dystonia polycythemia and cirrhosis? The Human Phenotype Ontology provides the following list of signs and symptoms for Hypermanganesemia with

question: What are the symptoms of Metachromatic leukodystrophy due to saposin B deficiency?
answer: What are the signs and symptoms of Metachromatic leukodystrophy due to saposin B deficiency? The Human Phenotype Ontology provides the following list of signs and symptoms for Metachromatic leukodystr

question: What are the symptoms of Diabetes insipidus nephrogenic mental retardation and intracerebral calcification?
answer: What are the signs and symptoms of Diabetes insipidus nephrogenic mental retardation and intracerebral calcification? The Human Phenotype Ontology provides the following list of signs and symptoms for
"""
# rows whose answer most likely starts with a copy of the question
similar = dataset[dataset['distances']<0.15]
for pos in (0, 5, 25, 100, 120):
    print(f"question: {similar['question'].values[pos]}")
    print(f"answer: {similar['answer'].values[pos][:200]}")
    print()

question: What are the symptoms of Natal teeth, intestinal pseudoobstruction and patent ductus?
answer: What are the signs and symptoms of Natal teeth, intestinal pseudoobstruction and patent ductus? The Human Phenotype Ontology provides the following list of signs and symptoms for Natal teeth, intestin

question: What are the symptoms of Familial visceral myopathy with external ophthalmoplegia?
answer: What are the signs and symptoms of Familial visceral myopathy with external ophthalmoplegia? The Human Phenotype Ontology provides the following list of signs and symptoms for Familial visceral myopat

question: What are the symptoms of Hypermanganesemia with dystonia polycythemia and cirrhosis?
answer: What are the signs and symptoms of Hypermanganesemia with dystonia polycythemia and cirrhosis? The Human Phenotype Ontology provides the following list of signs and symptoms for Hypermanganesemia with

question: What are the symptoms of Metachromatic leukodystrophy due to saposin B defic

In [16]:
## remove where question is identical to the answer
threshold = 0.05
index_to_remove = dataset[dataset['distances']<threshold].index
dataset[dataset['distances']<threshold]

Unnamed: 0,question,answer,processed_question,processed_answer,num_words_in_question,num_words_in_answer,distances,kwords,kmean_labels,agglomerative_labels,kmean_labels_n_3000,dbscan_labels
5774,Is Septo-optic dysplasia inherited?,Is septo-optic dysplasia inherited?,is septooptic dysplasia inherited,is septooptic dysplasia inherited,4,4,0.0,dysplasia septooptic inherited,286,348,66,1074


In [17]:
dataset = dataset.drop(index=index_to_remove).reset_index(drop=True)

In [44]:
## remove where question is very similar to the answer
threshold = 0.30
margin = 3
condition = (dataset['distances']<threshold)&(abs(dataset['num_words_in_answer']-dataset['num_words_in_question']).map(lambda x: x<=margin))
index_to_remove = dataset[condition].index
dataset[condition]

Unnamed: 0,question,answer,processed_question,processed_answer,num_words_in_question,num_words_in_answer,distances,kwords,kmean_labels,agglomerative_labels,kmean_labels_n_3000,dbscan_labels
3020,Is Dentatorubral-pallidoluysian atrophy inheri...,How is dentatorubral-pallidoluysian atrophy (D...,is dentatorubralpallidoluysian atrophy inherited,how is dentatorubralpallidoluysian atrophy drp...,4,6,0.172414,atrophy inherited dentatorubralpallidoluysian,362,2386,608,469
3078,"Is 48,XXYY syndrome inherited?","Can 48,XXYY syndrome be inherited?",is 48xxyy syndrome inherited,can 48xxyy syndrome be inherited,4,5,0.1875,48xxyy inherited syndrome,486,962,825,485
5851,Is Pelizaeus-Merzbacher disease inherited?,How is Pelizaeus-Merzbacher disease inherited?,is pelizaeusmerzbacher disease inherited,how is pelizaeusmerzbacher disease inherited,4,5,0.090909,inherited disease pelizaeusmerzbacher,232,387,190,1088
7221,Is Oculopharyngeal muscular dystrophy inherited?,How is oculopharyngeal muscular dystrophy inhe...,is oculopharyngeal muscular dystrophy inherited,how is oculopharyngeal muscular dystrophy inhe...,5,6,0.078431,dystrophy oculopharyngeal muscular inherited,323,291,1160,1361


In [None]:
dataset = dataset.drop(index=index_to_remove).reset_index(drop=True)

In [61]:
## select rows where the answer starts with a slightly reworded copy of the question
threshold = 0.15
condition = (dataset['distances']<threshold)
index_to_modify = dataset[condition].index
dataset[condition]

Unnamed: 0,question,answer,processed_question,processed_answer,num_words_in_question,num_words_in_answer,distances,kwords,kmean_labels,agglomerative_labels,kmean_labels_n_3000,dbscan_labels
2477,"What are the symptoms of Natal teeth, intestin...",Human Phenotype Ontology provides the followin...,what are the symptoms of natal teeth intestina...,what are the signs and symptoms of natal teeth...,12,262,0.144330,symptoms pseudoobstruction teeth ductus patent...,225,3184,2160,-1
2482,What are the symptoms of Ventricular extrasyst...,What are the signs and symptoms of Ventricular...,what are the symptoms of ventricular extrasyst...,what are the signs and symptoms of ventricular...,13,314,0.120690,symptoms sequence episodes syncopal perodactyl...,392,2224,1396,-1
2516,"What are the symptoms of Symphalangism, distal...",What are the signs and symptoms of Symphalangi...,what are the symptoms of symphalangism distal ...,what are the signs and symptoms of symphalangi...,16,294,0.113821,zygomatic symptoms narrowed pulp symphalangism...,392,2014,2672,-1
2616,"What are the symptoms of Mental retardation, m...",What are the signs and symptoms of Mental reta...,what are the symptoms of mental retardation ma...,what are the signs and symptoms of mental reta...,13,272,0.123894,symptoms craniofacial mental macrocephaly stat...,125,337,2832,-1
2683,What are the symptoms of Mastocytosis cutaneou...,What are the signs and symptoms of Mastocytosi...,what are the symptoms of mastocytosis cutaneou...,what are the signs and symptoms of mastocytosi...,15,347,0.119658,symptoms cutaneous loss mastocytosis conductiv...,400,3253,548,-1
...,...,...,...,...,...,...,...,...,...,...,...,...
7534,What are the symptoms of Microcephalic osteody...,What are the signs and symptoms of Microcephal...,what are the symptoms of microcephalic osteody...,what are the signs and symptoms of microcephal...,11,492,0.147368,symptoms 2 dwarfism osteodysplastic type micro...,389,1279,2357,827
7657,What are the symptoms of Bifid nose with or wi...,What are the signs and symptoms of Bifid nose ...,what are the symptoms of bifid nose with or wi...,what are the signs and symptoms of bifid nose ...,14,255,0.147368,renal symptoms bifid anorectal anomalies witho...,404,1670,2534,-1
7716,What are the symptoms of Familial encephalopat...,What are the signs and symptoms of Familial en...,what are the symptoms of familial encephalopat...,what are the signs and symptoms of familial en...,11,258,0.145833,symptoms familial neuroserpin encephalopathy b...,353,1658,835,1448
7791,"What are the symptoms of Arthrogryposis, dista...",What are the signs and symptoms of Arthrogrypo...,what are the symptoms of arthrogryposis distal...,what are the signs and symptoms of arthrogrypo...,14,259,0.111111,arthrogryposis symptoms facial distal hypopitu...,416,2149,2217,-1


In [68]:
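# add a column to the dataset to mark what we have already processed
# (guarded so that re-running the cell does not reset the flags)
if 'processed_already' not in dataset.columns:
    dataset['processed_already'] = False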

In [70]:
for i in index_to_modify:
    dataset.at[i,'processed_already'] = True

In [77]:
## Commented out because this should run only once; running it twice would truncate some answers again
"""
margin = 3
for i in index_to_modify[1:]:
    print(i)
    question = dataset.iloc[i]['question']
    answer = dataset.iloc[i]['answer']
    num_words_question = len(question.split(' '))
    print(question)
    print(answer)
    print(num_words_question)
    new_answer = ' '.join(answer.split(' ')[num_words_question+margin:])
    print(new_answer)
    dataset.at[i, 'answer'] = new_answer
"""

def modify_answer_in_dataset(modification_indexes, margin=2):
    """Drop the repeated question (plus `margin` extra words) from the start of each answer.

    Modifies the global `dataset` in place.
    """
    for i in modification_indexes:
        question = dataset.at[i, 'question']
        answer = dataset.at[i, 'answer']
        num_words_question = len(question.split(' '))
        # skip the words of the repeated question, plus a margin for slightly different wording
        new_answer = ' '.join(answer.split(' ')[num_words_question+margin:])
        dataset.at[i, 'answer'] = new_answer
        dataset.at[i, 'processed_already'] = True

def remove_question_from_answer():
    raise NotImplementedError  # not implemented yet

In [139]:
index_to_modify

Index([7720, 7722, 7750, 7765, 7779, 7802, 7833, 7841, 7847], dtype='int64')

In [140]:
modify_answer_in_dataset(index_to_modify)
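
Since running the truncation twice would cut into the real answer, one option is to make the step idempotent by checking the `processed_already` flag first. The following is a minimal sketch under that assumption; the function name `strip_leading_question_words` and the toy row are hypothetical, and it splits on any whitespace rather than on single spaces.

```python
import pandas as pd

def strip_leading_question_words(df, indexes, margin=2):
    """Return a copy of `df` where each selected answer loses its first
    len(question) + margin words, skipping rows already processed."""
    out = df.copy()
    if 'processed_already' not in out.columns:
        out['processed_already'] = False
    for i in indexes:
        if out.at[i, 'processed_already']:
            continue  # truncated before; skip to avoid cutting real content
        n_question_words = len(out.at[i, 'question'].split())
        words = out.at[i, 'answer'].split()
        out.at[i, 'answer'] = ' '.join(words[n_question_words + margin:])
        out.at[i, 'processed_already'] = True
    return out

toy = pd.DataFrame({
    'question': ['What are the symptoms of X?'],
    'answer': ['What are the signs and symptoms of X? The list follows.'],
})
once = strip_leading_question_words(toy, toy.index)
twice = strip_leading_question_words(once, once.index)
print(once.at[0, 'answer'])
print(twice.at[0, 'answer'])
```

Returning a copy instead of mutating a global DataFrame also keeps the original data available for comparison.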

In [186]:
## inspect not-yet-processed rows whose distance is between 0.2 and about 0.3
threshold = 0.2
condition = (dataset['distances']>=threshold)&(dataset['distances']<threshold+0.1)&(dataset['processed_already']==False)
index_to_modify = dataset[condition].index
dataset[condition]

Unnamed: 0,question,answer,processed_question,processed_answer,num_words_in_question,num_words_in_answer,distances,kwords,kmean_labels,agglomerative_labels,kmean_labels_n_3000,dbscan_labels,processed_already
847,What causes Childhood Extracranial Germ Cell T...,The cause of most childhood extracranial germ ...,what causes childhood extracranial germ cell t...,the cause of most childhood extracranial germ ...,7,11,0.216667,cell extracranial tumors causes germ childhood,47,204,34,70,False
1024,What is (are) Transitional Cell Cancer of the ...,Key Points\n - Transitional...,what is are transitional cell cancer of the re...,key points transitional cell cancer of the re...,12,303,0.300000,renal cell pelvis ureter cancer transitional,252,1389,928,94,False
1169,What is (are) Osteosarcoma and Malignant Fibro...,Key Points\n - Osteosarcoma...,what is are osteosarcoma and malignant fibrous...,key points osteosarcoma and malignant fibrous...,10,226,0.291139,bone histiocytoma fibrous malignant osteosarcoma,261,3348,771,110,False
1203,What is (are) Adult Central Nervous System Tum...,Key Points\n - An adult cen...,what is are adult central nervous system tumors,key points an adult central nervous system tu...,8,2244,0.277778,central adult tumors nervous system,234,118,1418,112,False
1467,What causes Childhood Brain and Spinal Cord Tu...,The cause of most childhood brain and spinal c...,what causes childhood brain and spinal cord tu...,the cause of most childhood brain and spinal c...,8,12,0.220339,spinal tumors causes brain cord childhood,165,1197,468,134,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
13193,Is PDGFRA-associated chronic eosinophilic leuk...,PDGFRA-associated chronic eosinophilic leukemi...,is pdgfraassociated chronic eosinophilic leuke...,pdgfraassociated chronic eosinophilic leukemia...,6,80,0.283784,eosinophilic inherited leukemia pdgfraassociat...,468,681,562,78,False
15073,Is methylmalonic acidemia with homocystinuria ...,Methylmalonic acidemia with homocystinuria is ...,is methylmalonic acidemia with homocystinuria ...,methylmalonic acidemia with homocystinuria is ...,6,175,0.289855,inherited acidemia methylmalonic homocystinuria,187,842,798,365,False
15231,What causes Ulcerative Colitis?,The exact cause of ulcerative colitis is unkno...,what causes ulcerative colitis,the exact cause of ulcerative colitis is unkno...,4,247,0.300000,colitis ulcerative causes,90,1261,76,145,False
15450,What causes Short Bowel Syndrome?,The main cause of short bowel syndrome is surg...,what causes short bowel syndrome,the main cause of short bowel syndrome is surg...,5,275,0.292683,bowel syndrome causes short,61,1541,1193,1050,False


In [157]:
dataset.at[2494, 'processed_already'] = True

In [158]:
# remove exact question that appears in the answer
for i in dataset[[x in y for x,y in zip(dataset['question'].values,dataset['answer'].values)]].index:
    question = dataset.iloc[i]['question']
    dataset.at[i, 'answer'] = dataset.iloc[i]['answer'].replace(question,'')
    dataset.at[i, 'processed_already'] = True

In [181]:
# remove the question from the answer even when it differs in letter case
for i in dataset[[x in y for x,y in zip(dataset['question'].str.lower(),dataset['answer'].str.lower())]].index:
    question = dataset.iloc[i]['question'].lower()
    answer = dataset.iloc[i]['answer'].lower()

    start_index = answer.index(question)
    end_index = start_index + len(question)
    
    dataset.at[i, 'answer'] = dataset.iloc[i]['answer'][:start_index] + dataset.iloc[i]['answer'][end_index:]
    dataset.at[i, 'processed_already'] = True
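
The same case-insensitive removal can be sketched with the standard `re` module, which avoids the manual index arithmetic. This is an alternative formulation, not the notebook's code: the helper name is hypothetical, and the final `strip()` is an extra step the loop above does not perform.

```python
import re

def remove_question_ignore_case(question, answer):
    """Remove the first case-insensitive occurrence of `question` from `answer`."""
    # re.escape keeps characters such as '?' and '(' from being read as regex syntax.
    pattern = re.compile(re.escape(question), flags=re.IGNORECASE)
    return pattern.sub('', answer, count=1).strip()

result = remove_question_ignore_case(
    'Is Septo-optic dysplasia inherited?',
    'Is septo-optic dysplasia inherited? Answer text follows.',
)
print(result)
```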

In [209]:
processed_dataset = dataset.copy()

In [196]:
re.findall?

Signature: re.findall(pattern, string, flags=0)
Docstring:
Return a list of all non-overlapping matches in the string.

If one or more capturing groups are present in the pattern, return
a list of groups; this will be a list of tuples if the pattern
has more than one group.

Empty matches are included in the result.
File:      /mnt/d/projects/medical_assistant_bot_assignment/venv/lib/python3.10/re.py
Type:      function

In [217]:
matches = re.findall(r'(.*\?)',processed_dataset.iloc[7830]['answer'][:100])
matches

0

In [218]:
for i in processed_dataset[['?' in x[:100] for x in dataset['answer'].str.lower()]].index:
    matches = re.findall(r'(.*\?)',processed_dataset.iloc[i]['answer'][:100])
    if len(matches)>0:
        processed_dataset.at[i, 'answer'] = processed_dataset.iloc[i]['answer'].replace(matches[0], '')
        processed_dataset.at[i, 'processed_already'] = True
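
Note that the pattern `(.*\?)` is greedy: on a line with several question marks it matches up to the last one inside the 100-character window, and `replace` then removes every occurrence of that text. A more conservative variant, sketched here as an assumption rather than as the notebook's method, removes only a single leading question sentence. The pattern, the function name, and the `window` parameter are hypothetical.

```python
import re

# One leading sentence that ends with '?' and contains no other sentence terminator.
LEADING_QUESTION = re.compile(r'^\s*[^?.!\n]*\?\s*')

def drop_leading_question(answer, window=100):
    """Remove a question sentence at the very start of `answer`, if it ends within `window` characters."""
    match = LEADING_QUESTION.match(answer)
    if match and match.end() <= window:
        return answer[match.end():]
    return answer

first = drop_leading_question('Who should be tested? The text goes on. Is it safe? Yes.')
second = drop_leading_question('X is a disease. Is it serious?')
print(first)
print(second)
```

The trade-off is that a leading question containing a period, such as an abbreviation, would not be removed.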

In [221]:
processed_dataset.iloc[127]['answer']

'   - How long has the agency served the community?   - Does this agency provide the services my relative or friend needs?   - How are emergencies handled?   - Is the staff on duty around the clock?   - How much do services and supplies cost?   - Will agency staff be in regular contact with the doctor? Is the agency Medicare-approved? How long has the agency served the community? Does this agency provide the services my relative or friend needs? How are emergencies handled? Is the staff on duty around the clock? How much do services and supplies cost? Will agency staff be in regular contact with the doctor? You can use Medicare\'s "Home Health Compare" tool to compare home health agencies in your area. Visit http://www.medicare.gov. Under "Search Tools," select "Compare Home Health Agencies in Your Area."'

### Let's stop here: there is a lot more preprocessing we could do, but we need to move on

In [222]:
# let's recount num_words in answer since we had some modifications
processed_dataset['num_words_in_answer'] = processed_dataset['answer'].map(lambda answer: len(answer.lower().split()))

In [223]:
processed_dataset[['num_words_in_answer']].describe()

Unnamed: 0,num_words_in_answer
count,16342.0
mean,198.194774
std,246.275907
min,0.0
25%,70.0
50%,137.0
75%,246.0
max,4281.0


#### Some answers have 0 words, let's check them

In [225]:
# let's remove them
processed_dataset[processed_dataset['num_words_in_answer']==0]
processed_dataset = processed_dataset.drop(index=processed_dataset[processed_dataset['num_words_in_answer']==0].index).reset_index(drop=True)

In [229]:
# let's check those that have 1 word
processed_dataset[processed_dataset['num_words_in_answer']==1]
# there's only one, remove it
processed_dataset = processed_dataset.drop(index=processed_dataset[processed_dataset['num_words_in_answer']==1].index).reset_index(drop=True)

In [237]:
# let's remove those that have fewer than 7 words and contain the "Frequently asked questions" phrase
processed_dataset = processed_dataset.drop(index=processed_dataset[(processed_dataset['num_words_in_answer']<7)&(processed_dataset['answer'].map(lambda x: 'Frequently' in x))].index).reset_index(drop=True)
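
The three clean-up steps above (empty answers, one-word answers, short "Frequently asked questions" stubs) can also be expressed as a single boolean mask. This is a minimal sketch of that idea on toy data; the function name and its parameters are hypothetical, and the thresholds simply mirror the ones used in the cells above.

```python
import pandas as pd

def drop_short_answers(df, min_words=2, faq_max_words=7, faq_marker='Frequently'):
    """Drop answers with fewer than `min_words` words, and short answers containing `faq_marker`."""
    n_words = df['answer'].str.split().str.len()
    too_short = n_words < min_words
    faq_stub = (n_words < faq_max_words) & df['answer'].str.contains(faq_marker, regex=False)
    return df[~(too_short | faq_stub)].reset_index(drop=True)

toy = pd.DataFrame({'answer': [
    '',
    'Yes',
    'Frequently asked questions about X',
    'Nobody knows for sure.',
    'X is a group of diseases that can damage the eye.',
]})
kept = drop_short_answers(toy)
print(kept)
```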

In [238]:
processed_dataset[(processed_dataset['num_words_in_answer']<7)&(processed_dataset['answer'].map(lambda x: 'Frequently' in x))]

Unnamed: 0,question,answer,processed_question,processed_answer,num_words_in_question,num_words_in_answer,distances,kwords,kmean_labels,agglomerative_labels,kmean_labels_n_3000,dbscan_labels,processed_already


In [239]:
processed_dataset.to_csv(os.path.join(abs_path, 'data/processed', 'processed_dataset.csv'), index=False)

In [231]:
processed_dataset[['num_words_in_answer']].describe()

Unnamed: 0,num_words_in_answer
count,16333.0
mean,198.303925
std,246.29984
min,4.0
25%,70.0
50%,137.0
75%,246.0
max,4281.0


In [219]:
processed_dataset[['?' in x[:100] for x in dataset['answer'].str.lower()]]

Unnamed: 0,question,answer,processed_question,processed_answer,num_words_in_question,num_words_in_answer,distances,kwords,kmean_labels,agglomerative_labels,kmean_labels_n_3000,dbscan_labels,processed_already
127,What is (are) Medicare and Continuing Care?,- How long has the agency served the commun...,what is are medicare and continuing care,here are questions to ask when considering a h...,7,144,0.696429,medicare care continuing,238,2794,149,12,True
144,What is (are) Balance Problems?,These can be very troublesome sensations. If ...,what is are balance problems,have you ever felt dizzy lightheaded or as if ...,5,625,0.785714,problems balance,186,2437,230,14,True
209,How to diagnose Osteoporosis?,The United States Preventive Service Task For...,how to diagnose osteoporosis,who should be tested the united states prevent...,4,796,0.736842,diagnose osteoporosis,119,452,888,19,True
210,What are the treatments for Osteoporosis?,"Although there is no cure for osteoporosis, i...",what are the treatments for osteoporosis,who treats osteoporosis although there is no c...,6,705,0.716981,treatments osteoporosis,119,452,888,19,True
236,What is (are) Kidney Disease?,- What is my urine albumin result? - What...,what is are kidney disease,when you visit your doctor here are questions ...,5,61,0.755556,kidney disease,59,203,47,11,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8059,how can the spread of visa and vrsa be prevented?,How do VISA and VRSA get their names? What sh...,how can the spread of visa and vrsa be prevented,on this page general information about visavrs...,10,826,0.726190,spread visa vrsa prevented,392,3317,2108,-1,True
8060,what is cdc doing to address visa and vrsa?,How do VISA and VRSA get their names? What sh...,what is cdc doing to address visa and vrsa,on this page general information about visavrs...,9,826,0.753086,visa address vrsa cdc,392,2998,2108,-1,True
8061,how vaccines prevent disease,It is always better to prevent a disease than...,how vaccines prevent disease,why are childhood vaccines so important it is ...,4,641,0.714286,prevent disease vaccines,193,3110,1638,-1,True
15509,What is (are) Kidney Failure: Choosing a Treat...,\n \nThe purpose of kidney tran...,what is are kidney failure choosing a treatmen...,what should i know about kidney transplantatio...,12,527,0.686275,right choosing thats failure kidney treatment,262,1414,251,2168,True
