# Data Processing and Preparation  

The dataset does not include a context from which the answer can be extracted, so I considered two approaches to the problem.  
The preprocessing differs depending on the approach.  

1. **Extractive question-answering**.  
   Extracts the answer to a question from a given context. That is, the answer appears verbatim in the context and we simply extract it.
   For this case, we need the question itself, the answer, and the context within which the answer exists. We do not have the context, but we can process our dataset to create it.
   I will assume that the answer provided for each question is the correct one.
2. **Generative question-answering with RAG**.  
   Uses a language generation model (e.g., a pre-trained LLM or foundation model) to create an answer to a question given a context. The context is retrieved from a knowledge base and passed to the GenAI model to generate the answer.
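
As a rough illustration of the first approach, the sketch below builds a SQuAD-style record by placing the target answer among other answers and recording where it starts. The function name `build_extractive_example`, the record keys, and the idea of using other answers as surrounding text are assumptions made for illustration; the toy strings are not taken from the dataset.

```python
def build_extractive_example(question, answer, other_answers):
    """Build one extractive-QA record (hypothetical strategy).

    The context is the target answer surrounded by other answers,
    so a model has to locate the answer span inside a longer text.
    """
    passages = list(other_answers)
    # Put the target answer in the middle of the other passages.
    passages.insert(len(passages) // 2, answer)
    context = " ".join(passages)
    # First occurrence of the answer text inside the context.
    answer_start = context.index(answer)
    return {
        "question": question,
        "context": context,
        "answers": {"text": [answer], "answer_start": [answer_start]},
    }


example = build_extractive_example(
    "What is X?",
    "X is a toy condition.",
    ["Y is another toy condition.", "Z is a third toy condition."],
)
```

In practice the other passages would probably come from answers about the same topic, so that the context stays coherent.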

## General preprocessing  

Identified from data exploration:
- Remove repeated question at the beginning of the answer

In [261]:
import pandas as pd
import numpy as np
import re
import os

In [3]:
abs_path = os.path.abspath('../../')
path_to_data = 'data/raw/'
filename = 'augmented_intern_screening_dataset.csv'

In [4]:
dataset = pd.read_csv(os.path.join(abs_path, path_to_data, filename))

In [5]:
dataset

Unnamed: 0,question,answer,processed_question,processed_answer,num_words_in_question,num_words_in_answer,distances,kwords,kmean_labels,agglomerative_labels,kmean_labels_n_3000,dbscan_labels
0,What is (are) Glaucoma?,Glaucoma is a group of diseases that can damag...,what is are glaucoma,glaucoma is a group of diseases that can damag...,4,111,0.722222,glaucoma,326,486,261,0
1,What is (are) Glaucoma?,The optic nerve is a bundle of more than 1 mil...,what is are glaucoma,the optic nerve is a bundle of more than 1 mil...,4,20,0.733333,glaucoma,326,486,261,0
2,What is (are) Glaucoma?,Open-angle glaucoma is the most common form of...,what is are glaucoma,openangle glaucoma is the most common form of ...,4,86,0.761905,glaucoma,326,486,261,0
3,Who is at risk for Glaucoma??,Anyone can develop glaucoma. Some people are a...,who is at risk for glaucoma,anyone can develop glaucoma some people are at...,6,78,0.773585,risk glaucoma,326,486,261,0
4,How to prevent Glaucoma?,"At this time, we do not know how to prevent gl...",how to prevent glaucoma,at this time we do not know how to prevent gla...,4,84,0.814815,prevent glaucoma,326,486,261,0
...,...,...,...,...,...,...,...,...,...,...,...,...
16342,What is (are) Diabetic Neuropathies: The Nerve...,Autonomic neuropathy affects the nerves that c...,what is are diabetic neuropathies the nerve da...,autonomic neuropathy affects the nerves that c...,10,492,0.733333,damage neuropathies nerve diabetes diabetic,183,1339,365,2229
16343,What is (are) Diabetic Neuropathies: The Nerve...,"Proximal neuropathy, sometimes called lumbosac...",what is are diabetic neuropathies the nerve da...,proximal neuropathy sometimes called lumbosacr...,10,91,0.704348,damage neuropathies nerve diabetes diabetic,183,1339,365,2229
16344,What is (are) Diabetic Neuropathies: The Nerve...,Focal neuropathy appears suddenly and affects ...,what is are diabetic neuropathies the nerve da...,focal neuropathy appears suddenly and affects ...,10,179,0.726190,damage neuropathies nerve diabetes diabetic,183,1339,365,2229
16345,How to prevent Diabetic Neuropathies: The Nerv...,The best way to prevent neuropathy is to keep ...,how to prevent diabetic neuropathies the nerve...,the best way to prevent neuropathy is to keep ...,10,30,0.681159,prevent damage neuropathies nerve diabetes dia...,183,1339,365,2229


### Remove repeated question at the beginning of the answer  
Repeated questions at the beginning of the answer can be identified by checking the `distances` column.  
If the question is repeated at the beginning of the answer, the distance is approximately < 0.15.  
In the exploration step, I used a margin of 3 words, because sometimes the question repeated in the answer was slightly different.  
For example *q: what are the signs of xxx, a: what are the signs or symptoms of xxx..*
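
The exact metric behind the `distances` column comes from the exploration step and is not shown here. As a stand-in, the sketch below compares the question with the first words of the answer using `difflib.SequenceMatcher` from the standard library; the function name `prefix_dissimilarity` and the use of `difflib` are assumptions, so the values will not match the `distances` column.

```python
from difflib import SequenceMatcher


def prefix_dissimilarity(question, answer, margin=3):
    """Return a value in [0, 1]; low values suggest the answer
    starts with a (near) copy of the question.

    Stand-in metric, not the one used to build `distances`.
    """
    q_words = question.lower().split()
    a_words = answer.lower().split()
    # Compare against the answer prefix, allowing `margin` extra words
    # for small rewordings such as "signs" vs "signs or symptoms".
    prefix = a_words[: len(q_words) + margin]
    ratio = SequenceMatcher(None, " ".join(q_words), " ".join(prefix)).ratio()
    return 1.0 - ratio


repeated = prefix_dissimilarity(
    "what are the signs of xxx?",
    "what are the signs or symptoms of xxx? the ontology lists many features",
)
unrelated = prefix_dissimilarity(
    "what are the signs of xxx?",
    "the optic nerve is a bundle of more than one million fibers",
)
```

The margin plays the same role as in the exploration step: it widens the answer prefix so that a slightly longer rewording of the question is still compared in full.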

In [56]:
## example
"""
question: What are the symptoms of Natal teeth, intestinal pseudoobstruction and patent ductus?
answer: What are the signs and symptoms of Natal teeth, intestinal pseudoobstruction and patent ductus? The Human Phenotype Ontology provides the following list of signs and symptoms for Natal teeth, intestin

question: What are the symptoms of Familial visceral myopathy with external ophthalmoplegia?
answer: What are the signs and symptoms of Familial visceral myopathy with external ophthalmoplegia? The Human Phenotype Ontology provides the following list of signs and symptoms for Familial visceral myopat

question: What are the symptoms of Hypermanganesemia with dystonia polycythemia and cirrhosis?
answer: What are the signs and symptoms of Hypermanganesemia with dystonia polycythemia and cirrhosis? The Human Phenotype Ontology provides the following list of signs and symptoms for Hypermanganesemia with

question: What are the symptoms of Metachromatic leukodystrophy due to saposin B deficiency?
answer: What are the signs and symptoms of Metachromatic leukodystrophy due to saposin B deficiency? The Human Phenotype Ontology provides the following list of signs and symptoms for Metachromatic leukodystr

question: What are the symptoms of Diabetes insipidus nephrogenic mental retardation and intracerebral calcification?
answer: What are the signs and symptoms of Diabetes insipidus nephrogenic mental retardation and intracerebral calcification? The Human Phenotype Ontology provides the following list of signs and symptoms for
"""
similar = dataset[dataset['distances'] < 0.15]
for position in [0, 5, 25, 100, 120]:
    print(f"question: {similar['question'].values[position]}")
    print(f"answer: {similar['answer'].values[position][:200]}")
    print()

question: What are the symptoms of Natal teeth, intestinal pseudoobstruction and patent ductus?
answer: What are the signs and symptoms of Natal teeth, intestinal pseudoobstruction and patent ductus? The Human Phenotype Ontology provides the following list of signs and symptoms for Natal teeth, intestin

question: What are the symptoms of Familial visceral myopathy with external ophthalmoplegia?
answer: What are the signs and symptoms of Familial visceral myopathy with external ophthalmoplegia? The Human Phenotype Ontology provides the following list of signs and symptoms for Familial visceral myopat

question: What are the symptoms of Hypermanganesemia with dystonia polycythemia and cirrhosis?
answer: What are the signs and symptoms of Hypermanganesemia with dystonia polycythemia and cirrhosis? The Human Phenotype Ontology provides the following list of signs and symptoms for Hypermanganesemia with

question: What are the symptoms of Metachromatic leukodystrophy due to saposin B defic

In [16]:
## remove where question is identical to the answer
threshold = 0.05
index_to_remove = dataset[dataset['distances']<threshold].index
dataset[dataset['distances']<threshold]

Unnamed: 0,question,answer,processed_question,processed_answer,num_words_in_question,num_words_in_answer,distances,kwords,kmean_labels,agglomerative_labels,kmean_labels_n_3000,dbscan_labels
5774,Is Septo-optic dysplasia inherited?,Is septo-optic dysplasia inherited?,is septooptic dysplasia inherited,is septooptic dysplasia inherited,4,4,0.0,dysplasia septooptic inherited,286,348,66,1074


In [17]:
dataset = dataset.drop(index=index_to_remove).reset_index(drop=True)

In [44]:
## remove rows where the answer is only a rewording of the question (similar text and similar length)
threshold = 0.30
margin = 3
condition = (dataset['distances']<threshold)&((dataset['num_words_in_answer']-dataset['num_words_in_question']).abs()<=margin)
index_to_remove = dataset[condition].index
dataset[condition]

Unnamed: 0,question,answer,processed_question,processed_answer,num_words_in_question,num_words_in_answer,distances,kwords,kmean_labels,agglomerative_labels,kmean_labels_n_3000,dbscan_labels
3020,Is Dentatorubral-pallidoluysian atrophy inheri...,How is dentatorubral-pallidoluysian atrophy (D...,is dentatorubralpallidoluysian atrophy inherited,how is dentatorubralpallidoluysian atrophy drp...,4,6,0.172414,atrophy inherited dentatorubralpallidoluysian,362,2386,608,469
3078,"Is 48,XXYY syndrome inherited?","Can 48,XXYY syndrome be inherited?",is 48xxyy syndrome inherited,can 48xxyy syndrome be inherited,4,5,0.1875,48xxyy inherited syndrome,486,962,825,485
5851,Is Pelizaeus-Merzbacher disease inherited?,How is Pelizaeus-Merzbacher disease inherited?,is pelizaeusmerzbacher disease inherited,how is pelizaeusmerzbacher disease inherited,4,5,0.090909,inherited disease pelizaeusmerzbacher,232,387,190,1088
7221,Is Oculopharyngeal muscular dystrophy inherited?,How is oculopharyngeal muscular dystrophy inhe...,is oculopharyngeal muscular dystrophy inherited,how is oculopharyngeal muscular dystrophy inhe...,5,6,0.078431,dystrophy oculopharyngeal muscular inherited,323,291,1160,1361


In [None]:
dataset = dataset.drop(index=index_to_remove).reset_index(drop=True)

In [61]:
## select rows where the answer starts with a (near) copy of the question
threshold = 0.15
condition = (dataset['distances']<threshold)
index_to_modify = dataset[condition].index
dataset[condition]

Unnamed: 0,question,answer,processed_question,processed_answer,num_words_in_question,num_words_in_answer,distances,kwords,kmean_labels,agglomerative_labels,kmean_labels_n_3000,dbscan_labels
2477,"What are the symptoms of Natal teeth, intestin...",Human Phenotype Ontology provides the followin...,what are the symptoms of natal teeth intestina...,what are the signs and symptoms of natal teeth...,12,262,0.144330,symptoms pseudoobstruction teeth ductus patent...,225,3184,2160,-1
2482,What are the symptoms of Ventricular extrasyst...,What are the signs and symptoms of Ventricular...,what are the symptoms of ventricular extrasyst...,what are the signs and symptoms of ventricular...,13,314,0.120690,symptoms sequence episodes syncopal perodactyl...,392,2224,1396,-1
2516,"What are the symptoms of Symphalangism, distal...",What are the signs and symptoms of Symphalangi...,what are the symptoms of symphalangism distal ...,what are the signs and symptoms of symphalangi...,16,294,0.113821,zygomatic symptoms narrowed pulp symphalangism...,392,2014,2672,-1
2616,"What are the symptoms of Mental retardation, m...",What are the signs and symptoms of Mental reta...,what are the symptoms of mental retardation ma...,what are the signs and symptoms of mental reta...,13,272,0.123894,symptoms craniofacial mental macrocephaly stat...,125,337,2832,-1
2683,What are the symptoms of Mastocytosis cutaneou...,What are the signs and symptoms of Mastocytosi...,what are the symptoms of mastocytosis cutaneou...,what are the signs and symptoms of mastocytosi...,15,347,0.119658,symptoms cutaneous loss mastocytosis conductiv...,400,3253,548,-1
...,...,...,...,...,...,...,...,...,...,...,...,...
7534,What are the symptoms of Microcephalic osteody...,What are the signs and symptoms of Microcephal...,what are the symptoms of microcephalic osteody...,what are the signs and symptoms of microcephal...,11,492,0.147368,symptoms 2 dwarfism osteodysplastic type micro...,389,1279,2357,827
7657,What are the symptoms of Bifid nose with or wi...,What are the signs and symptoms of Bifid nose ...,what are the symptoms of bifid nose with or wi...,what are the signs and symptoms of bifid nose ...,14,255,0.147368,renal symptoms bifid anorectal anomalies witho...,404,1670,2534,-1
7716,What are the symptoms of Familial encephalopat...,What are the signs and symptoms of Familial en...,what are the symptoms of familial encephalopat...,what are the signs and symptoms of familial en...,11,258,0.145833,symptoms familial neuroserpin encephalopathy b...,353,1658,835,1448
7791,"What are the symptoms of Arthrogryposis, dista...",What are the signs and symptoms of Arthrogrypo...,what are the symptoms of arthrogryposis distal...,what are the signs and symptoms of arthrogrypo...,14,259,0.111111,arthrogryposis symptoms facial distal hypopitu...,416,2149,2217,-1


In [68]:
# add a column to the dataset to mark what we have already processed
if 'processed_already' not in dataset.columns:
    dataset['processed_already'] = False

In [70]:
dataset.loc[index_to_modify, 'processed_already'] = True

In [77]:
## Commented out because this must run only once; running it twice would truncate some answers further
"""
margin = 3
for i in index_to_modify[1:]:
    print(i)
    question = dataset.iloc[i]['question']
    answer = dataset.iloc[i]['answer']
    num_words_question = len(question.split(' '))
    print(question)
    print(answer)
    print(num_words_question)
    new_answer = ' '.join(answer.split(' ')[num_words_question+margin:])
    print(new_answer)
    dataset.at[i, 'answer'] = new_answer
"""

def modify_answer_in_dataset(modification_indexes, margin=2):
    """Drop the repeated question (plus `margin` words) from the start of each selected answer, in place."""
    for i in modification_indexes:
        question = dataset.at[i, 'question']
        answer = dataset.at[i, 'answer']
        num_words_question = len(question.split(' '))
        # skip as many words as the question has, plus `margin` words of slack
        new_answer = ' '.join(answer.split(' ')[num_words_question+margin:])
        dataset.at[i, 'answer'] = new_answer
        dataset.at[i, 'processed_already'] = True

In [139]:
index_to_modify

Index([7720, 7722, 7750, 7765, 7779, 7802, 7833, 7841, 7847], dtype='int64')

In [140]:
modify_answer_in_dataset(index_to_modify)

In [186]:
## inspect rows not yet processed whose distance lies between 0.2 and 0.3
threshold = 0.2
condition = (dataset['distances']>=threshold)&(dataset['distances']<threshold+0.1)&(dataset['processed_already']==False)
index_to_modify = dataset[condition].index
dataset[condition]

Unnamed: 0,question,answer,processed_question,processed_answer,num_words_in_question,num_words_in_answer,distances,kwords,kmean_labels,agglomerative_labels,kmean_labels_n_3000,dbscan_labels,processed_already
847,What causes Childhood Extracranial Germ Cell T...,The cause of most childhood extracranial germ ...,what causes childhood extracranial germ cell t...,the cause of most childhood extracranial germ ...,7,11,0.216667,cell extracranial tumors causes germ childhood,47,204,34,70,False
1024,What is (are) Transitional Cell Cancer of the ...,Key Points\n - Transitional...,what is are transitional cell cancer of the re...,key points transitional cell cancer of the re...,12,303,0.300000,renal cell pelvis ureter cancer transitional,252,1389,928,94,False
1169,What is (are) Osteosarcoma and Malignant Fibro...,Key Points\n - Osteosarcoma...,what is are osteosarcoma and malignant fibrous...,key points osteosarcoma and malignant fibrous...,10,226,0.291139,bone histiocytoma fibrous malignant osteosarcoma,261,3348,771,110,False
1203,What is (are) Adult Central Nervous System Tum...,Key Points\n - An adult cen...,what is are adult central nervous system tumors,key points an adult central nervous system tu...,8,2244,0.277778,central adult tumors nervous system,234,118,1418,112,False
1467,What causes Childhood Brain and Spinal Cord Tu...,The cause of most childhood brain and spinal c...,what causes childhood brain and spinal cord tu...,the cause of most childhood brain and spinal c...,8,12,0.220339,spinal tumors causes brain cord childhood,165,1197,468,134,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
13193,Is PDGFRA-associated chronic eosinophilic leuk...,PDGFRA-associated chronic eosinophilic leukemi...,is pdgfraassociated chronic eosinophilic leuke...,pdgfraassociated chronic eosinophilic leukemia...,6,80,0.283784,eosinophilic inherited leukemia pdgfraassociat...,468,681,562,78,False
15073,Is methylmalonic acidemia with homocystinuria ...,Methylmalonic acidemia with homocystinuria is ...,is methylmalonic acidemia with homocystinuria ...,methylmalonic acidemia with homocystinuria is ...,6,175,0.289855,inherited acidemia methylmalonic homocystinuria,187,842,798,365,False
15231,What causes Ulcerative Colitis?,The exact cause of ulcerative colitis is unkno...,what causes ulcerative colitis,the exact cause of ulcerative colitis is unkno...,4,247,0.300000,colitis ulcerative causes,90,1261,76,145,False
15450,What causes Short Bowel Syndrome?,The main cause of short bowel syndrome is surg...,what causes short bowel syndrome,the main cause of short bowel syndrome is surg...,5,275,0.292683,bowel syndrome causes short,61,1541,1193,1050,False


In [157]:
dataset.at[2494, 'processed_already'] = True

In [158]:
# remove exact question that appears in the answer
for i in dataset[[x in y for x,y in zip(dataset['question'].values,dataset['answer'].values)]].index:
    question = dataset.iloc[i]['question']
    dataset.at[i, 'answer'] = dataset.iloc[i]['answer'].replace(question,'')
    dataset.at[i, 'processed_already'] = True

In [181]:
# remove the question when it appears in the answer with different letter case
for i in dataset[[x in y for x,y in zip(dataset['question'].str.lower(),dataset['answer'].str.lower())]].index:
    question = dataset.iloc[i]['question'].lower()
    answer = dataset.iloc[i]['answer'].lower()

    start_index = answer.index(question)
    end_index = start_index + len(question)
    
    dataset.at[i, 'answer'] = dataset.iloc[i]['answer'][:start_index] + dataset.iloc[i]['answer'][end_index:]
    dataset.at[i, 'processed_already'] = True
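
The two cells above can also be expressed with a single case-insensitive regular expression. This is a minimal sketch on plain strings, assuming the question should be removed only once; the helper name `strip_question` is hypothetical. One reason to consider it: offsets computed on a lowercased string are only valid on the original string if lowercasing does not change its length, which is not guaranteed for every Unicode character.

```python
import re


def strip_question(question, answer):
    """Remove the first case-insensitive occurrence of `question` from `answer`."""
    # re.escape keeps characters such as "?" and "(" literal.
    pattern = re.compile(re.escape(question), flags=re.IGNORECASE)
    return pattern.sub("", answer, count=1).strip()


cleaned = strip_question(
    "Is X inherited?",
    "is x inherited? X is inherited in a dominant pattern.",
)
untouched = strip_question("Is X inherited?", "X is a toy condition.")
```

Because the match is done on the original string, no index arithmetic is needed, and running the helper a second time leaves an already cleaned answer unchanged.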

In [209]:
processed_dataset = dataset.copy()

In [217]:
matches = re.findall(r'(.*\?)',processed_dataset.iloc[7830]['answer'][:100])
matches

0

In [218]:
for i in processed_dataset[['?' in x[:100] for x in processed_dataset['answer']]].index:
    matches = re.findall(r'(.*\?)',processed_dataset.iloc[i]['answer'][:100])
    if len(matches)>0:
        processed_dataset.at[i, 'answer'] = processed_dataset.iloc[i]['answer'].replace(matches[0], '')
        processed_dataset.at[i, 'processed_already'] = True

In [221]:
processed_dataset.iloc[127]['answer']

'   - How long has the agency served the community?   - Does this agency provide the services my relative or friend needs?   - How are emergencies handled?   - Is the staff on duty around the clock?   - How much do services and supplies cost?   - Will agency staff be in regular contact with the doctor? Is the agency Medicare-approved? How long has the agency served the community? Does this agency provide the services my relative or friend needs? How are emergencies handled? Is the staff on duty around the clock? How much do services and supplies cost? Will agency staff be in regular contact with the doctor? You can use Medicare\'s "Home Health Compare" tool to compare home health agencies in your area. Visit http://www.medicare.gov. Under "Search Tools," select "Compare Home Health Agencies in Your Area."'
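
The pattern `(.*\?)` is greedy, so it removes everything up to the last question mark found in the first 100 characters, which can eat into lists of questions that are part of the real answer, as in the output above. A more conservative variant, sketched below under the assumption that only a single leading question should be dropped, anchors the pattern at the start and refuses to cross a sentence boundary. The names `LEADING_QUESTION` and `drop_leading_question` are hypothetical.

```python
import re

# One leading sentence that ends in "?" and contains no other
# sentence-ending punctuation before that question mark.
LEADING_QUESTION = re.compile(r"^\s*[^.?!]*\?\s*")


def drop_leading_question(answer):
    """Remove a single question sentence from the very start of `answer`."""
    return LEADING_QUESTION.sub("", answer, count=1)


stripped = drop_leading_question(
    "What are the signs and symptoms of X? The ontology lists many features."
)
kept = drop_leading_question("X is rare. Is it inherited? Yes.")
```

A known limitation of this sketch: a leading question that contains a period (for example an abbreviation) will not be matched and is left in place.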

### Let's stop here: there is a lot more preprocessing that could be done, but we need to move on

In [222]:
# let's recount num_words in answer since we had some modifications
processed_dataset['num_words_in_answer'] = processed_dataset['answer'].map(lambda answer: len(answer.lower().split()))

In [223]:
processed_dataset[['num_words_in_answer']].describe()

Unnamed: 0,num_words_in_answer
count,16342.0
mean,198.194774
std,246.275907
min,0.0
25%,70.0
50%,137.0
75%,246.0
max,4281.0


#### It seems there are answers with 0 words; let's check them

In [225]:
# let's remove them
processed_dataset[processed_dataset['num_words_in_answer']==0]
processed_dataset = processed_dataset.drop(index=processed_dataset[processed_dataset['num_words_in_answer']==0].index).reset_index(drop=True)

In [229]:
# let's check those that have 1 word
processed_dataset[processed_dataset['num_words_in_answer']==1]
# there's only one, remove it
processed_dataset = processed_dataset.drop(index=processed_dataset[processed_dataset['num_words_in_answer']==1].index).reset_index(drop=True)

In [237]:
# let's remove those that have fewer than 7 words and contain 'Frequently' (from the 'Frequently asked questions' phrase)
processed_dataset = processed_dataset.drop(index=processed_dataset[(processed_dataset['num_words_in_answer']<7)&(processed_dataset['answer'].map(lambda x: 'Frequently' in x))].index).reset_index(drop=True)

In [241]:
processed_dataset[['num_words_in_answer']].describe()

Unnamed: 0,num_words_in_answer
count,16329.0
mean,198.351277
std,246.311421
min,6.0
25%,70.0
50%,137.0
75%,246.0
max,4281.0


#### The average answer length is around 200 words, but some answers have more than 4000 words. Those look odd, so let's remove them.  
Normally I would take a closer look at those answers, check whether the answers to the questions are really there, and clean them. Since I do not have much time, I will check them quickly and remove them if they are too noisy. 
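
Before fixing a cutoff, it can help to look at a few quantiles and at the share of rows a candidate cutoff would keep. The sketch below uses a toy `pandas` Series standing in for `num_words_in_answer`; the values and the variable names are made up for illustration.

```python
import pandas as pd

# Toy word counts standing in for `num_words_in_answer` (hypothetical values).
word_counts = pd.Series([6, 70, 137, 246, 400, 1200, 4281])

# Upper quantiles show how heavy the tail of long answers is.
summary = word_counts.quantile([0.5, 0.9, 0.99])

# Share of rows that would survive a given cutoff.
cutoff = 400
kept_share = (word_counts <= cutoff).mean()
```

Comparing `kept_share` for a few candidate cutoffs gives a quick sense of how much data each choice discards.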

In [285]:
## There's only one example with more than 4000 words in the dataset
## There are a few hundred with more than 1000
## These seem to be very long documents containing a lot of information about the questions, but they are too long for my question-answering model. Remove them for now;
## in the next iteration we would have more time to work with them.

## Since the average number of words per answer is no more than 200, I will prune the dataset, keeping only answers with at most 400 words
processed_dataset[processed_dataset['num_words_in_answer']>400]
processed_dataset = processed_dataset.drop(index = processed_dataset[processed_dataset['num_words_in_answer']>400].index).reset_index(drop=True)

In [286]:
processed_dataset[['num_words_in_answer']].describe()

Unnamed: 0,num_words_in_answer
count,14919.0
mean,145.689926
std,95.068216
min,6.0
25%,66.0
50%,124.0
75%,220.0
max,400.0


#### Are there duplicated answers after this pruning?

In [329]:
# yes, there are; let's quickly review them
# I checked some of them and will remove them; there is no time to analyze them one by one
processed_dataset[processed_dataset.duplicated(subset=['answer'], keep=False)]

Unnamed: 0,question,answer,processed_question,processed_answer,num_words_in_question,num_words_in_answer,distances,kwords,kmean_labels,agglomerative_labels,kmean_labels_n_3000,dbscan_labels,processed_already
749,What causes Childhood Brain Stem Glioma?,The cause of most childhood brain tumors is un...,what causes childhood brain stem glioma,the cause of most childhood brain tumors is un...,6,9,0.470588,glioma stem causes brain childhood,165,158,128,93,False
910,What causes Childhood Ependymoma?,The cause of most childhood brain tumors is un...,what causes childhood ependymoma,the cause of most childhood brain tumors is un...,4,9,0.475000,childhood ependymoma causes,623,829,510,123,False
2213,What is (are) Axenfeld-Rieger syndrome type 1?,Axenfeld-Rieger syndrome is a group of eye dis...,what is are axenfeldrieger syndrome type 1,axenfeldrieger syndrome is a group of eye diso...,7,239,0.765625,type axenfeldrieger 1 syndrome,581,2906,43,391,False
2218,What is (are) Noonan syndrome 3?,Noonan syndrome is a genetic disorder that cau...,what is are noonan syndrome 3,noonan syndrome is a genetic disorder that cau...,6,111,0.775862,noonan syndrome 3,222,2108,2487,392,False
2220,What are the treatments for Noonan syndrome 3?,Management generally focuses on the specific ...,what are the treatments for noonan syndrome 3,how might noonan syndrome be treated managemen...,8,84,0.736111,treatments noonan syndrome 3,222,2108,2487,392,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
14497,What is (are) Hematuria (Blood in the Urine)?,The urinary tract is the bodys drainage system...,what is are hematuria blood in the urine,the urinary tract is the bodys drainage system...,8,131,0.727273,hematuria urine blood,380,1716,582,2190,False
14551,What to do for Primary Biliary Cirrhosis?,A healthy diet is important in all stages of c...,what to do for primary biliary cirrhosis,a healthy diet is important in all stages of c...,7,194,0.611111,biliary cirrhosis primary,150,1544,1247,2197,False
14600,What is (are) Nutrition for Early Chronic Kidn...,CKD usually takes a long time to develop and d...,what is are nutrition for early chronic kidney...,ckd usually takes a long time to develop and d...,11,112,0.828125,nutrition early adults kidney chronic disease,382,620,2466,2202,False
14658,What is (are) Urinary Tract Infections in Chil...,A UTI is an infection in the urinary tract. In...,what is are urinary tract infections in children,a uti is an infection in the urinary tract inf...,8,94,0.526316,tract infections children urinary,102,1572,81,3,False
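
If duplicated answers were to be dropped directly, one option is to compare them on a normalised key so that differences in case or spacing do not hide a duplicate. This is a minimal sketch on a toy frame; the normalisation rule and the variable names are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "question": ["What causes A?", "What causes B?", "What is C?"],
    "answer": [
        "The cause is unknown.",
        "the cause is  unknown.",
        "C is a toy condition.",
    ],
})

# Normalise case and whitespace so near-identical answers share one key.
key = toy["answer"].str.lower().str.split().str.join(" ")

# Keep the first row for each distinct normalised answer.
deduplicated = toy.loc[~key.duplicated(keep="first")].reset_index(drop=True)
```

The trade-off is that different questions can legitimately share one answer, so dropping by answer also discards some valid question-answer pairs.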


### Let's remove duplicated questions

In [343]:
processed_dataset = processed_dataset.drop(index=processed_dataset[processed_dataset.duplicated(subset=['question'], keep='first')].index).reset_index(drop=True)

In [344]:
processed_dataset.shape

(13827, 13)

In [345]:
processed_dataset.describe()

Unnamed: 0,num_words_in_question,num_words_in_answer,distances,kmean_labels,agglomerative_labels,kmean_labels_n_3000,dbscan_labels
count,13827.0,13827.0,13827.0,13827.0,13827.0,13827.0,13827.0
mean,7.275548,144.791133,0.596665,303.323498,1160.867506,1214.002387,899.702972
std,2.337225,95.374848,0.206474,185.340612,887.152393,811.931634,706.33782
min,3.0,6.0,0.090909,0.0,0.0,0.0,-1.0
25%,5.0,65.0,0.457627,128.0,424.0,511.0,257.0
50%,7.0,123.0,0.688172,305.0,962.0,1100.0,810.0
75%,9.0,220.0,0.759259,466.0,1737.5,1848.0,1502.0
max,26.0,400.0,0.933333,649.0,3367.0,2999.0,2229.0


### Now that we have pruned our dataset, let's see if we can identify categories based on the labels assigned by each of the clustering algorithms we used in the exploratory analysis

In [487]:
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from collections import Counter

In [400]:
ignored_words = ['syndrome','symptoms','treatments','inherited','people','affected','many','disease','changes','type','diagnose','causes','research','done','trials','clinical','information','1','2','risk','need','know','3','factor','problems','diseases','ii','infections','also','known','4','v','c','iii','keep','a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z','0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

In [349]:
agglomerative_labels = np.unique(processed_dataset['agglomerative_labels'])
kmean_labels = np.unique(processed_dataset['kmean_labels_n_3000'])
dbscan_labels = np.unique(processed_dataset['dbscan_labels'])

In [464]:
processed_dataset.columns.tolist().index('ac_labels')

13

In [475]:
def define_labels(labels, column, name_for_column_label):
    if name_for_column_label not in processed_dataset.columns:
        processed_dataset[name_for_column_label] = ''

    for l in labels:
        indexes_to_modify = processed_dataset[processed_dataset[column]==l].index
        kwords = word_tokenize(' '.join(processed_dataset[processed_dataset[column]==l]['kwords'].values))
        kwords = [kw for kw in kwords if kw not in ignored_words]
        frequencies = FreqDist(kwords)
        extracted_label = '_'.join([x for x,y in frequencies.most_common(2)])
        if extracted_label == '':
            print(f'no label for {l} in {column}')
            extracted_label = 'no_label'
        processed_dataset.loc[indexes_to_modify, name_for_column_label] = extracted_label

In [476]:
define_labels(agglomerative_labels, 'agglomerative_labels', 'ac_labels')

no label for 162 in agglomerative_labels
no label for 2830 in agglomerative_labels
no label for 2946 in agglomerative_labels
no label for 3004 in agglomerative_labels
no label for 3252 in agglomerative_labels


In [477]:
define_labels(kmean_labels, 'kmean_labels_n_3000', 'km_labels')

no label for 8 in kmean_labels_n_3000
no label for 15 in kmean_labels_n_3000
no label for 1520 in kmean_labels_n_3000


In [478]:
define_labels(dbscan_labels, 'dbscan_labels', 'db_labels')

no label for 218 in dbscan_labels
no label for 798 in dbscan_labels
no label for 801 in dbscan_labels
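
To make the naming rule concrete, here is a small standalone sketch of the same idea: count the keywords of a cluster, ignore generic words, and join the two most frequent ones. It uses `str.split` and `collections.Counter` instead of NLTK, and the stop list `IGNORED` and the function name `name_cluster` are hypothetical.

```python
from collections import Counter

IGNORED = {"symptoms", "causes", "treatments"}  # small hypothetical stop list


def name_cluster(keyword_strings, ignored=IGNORED, top_n=2):
    """Name a cluster after its `top_n` most frequent non-ignored keywords."""
    words = [
        word
        for keywords in keyword_strings
        for word in keywords.split()
        if word not in ignored
    ]
    top = [word for word, _ in Counter(words).most_common(top_n)]
    # Clusters whose keywords are all ignored get a placeholder name.
    return "_".join(top) if top else "no_label"


label = name_cluster(
    ["glaucoma", "risk glaucoma", "prevent glaucoma", "prevent glaucoma symptoms"]
)
empty = name_cluster(["symptoms causes"])
```

The `no_label` case corresponds to the clusters reported above, where every keyword falls in the ignored list.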


In [497]:
for row in processed_dataset.iterrows():
    
    break

glaucoma_prevent


In [498]:
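## assign each row the label chosen by most of the three clustering-derived labels
def vote_for_labels(dataset):
    dataset['voted_label'] = ''
    for row in dataset.iterrows():
        index = row[0]
        labels = row[1][['ac_labels', 'km_labels', 'db_labels']]
        # most_common(1) gives the label with the highest count (ties go to the first one seen)
        dataset.loc[index, 'voted_label'] = Counter(labels.values).most_common(1)[0][0]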

In [499]:
vote_for_labels(processed_dataset)
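
A majority vote over the three cluster-derived labels can be sketched on plain lists with `Counter.most_common`. The helper name `majority_label` is hypothetical, and the tie-breaking rule (first label seen wins) is simply what `most_common` does, not a deliberate design decision.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label; on a tie, the first one encountered."""
    return Counter(labels).most_common(1)[0][0]


voted = majority_label(
    ["glaucoma_earlyonset", "glaucoma_prevent", "glaucoma_prevent"]
)
tie = majority_label(["x", "y", "z"])
```

With three voters a full tie means the three algorithms disagree completely, so the order of the label columns decides the result in that case.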

In [501]:
processed_dataset.iloc[5]['question']

'what research (or clinical trials) is being done for Glaucoma?'

In [502]:
processed_dataset.head(5)

Unnamed: 0,question,answer,processed_question,processed_answer,num_words_in_question,num_words_in_answer,distances,kwords,kmean_labels,agglomerative_labels,kmean_labels_n_3000,dbscan_labels,processed_already,ac_labels,km_labels,db_labels,voted_label
0,What is (are) Glaucoma?,Glaucoma is a group of diseases that can damag...,what is are glaucoma,glaucoma is a group of diseases that can damag...,4,111,0.722222,glaucoma,326,486,261,0,False,glaucoma_earlyonset,glaucoma_prevent,glaucoma_prevent,glaucoma_prevent
1,Who is at risk for Glaucoma??,Anyone can develop glaucoma. Some people are a...,who is at risk for glaucoma,anyone can develop glaucoma some people are at...,6,81,0.773585,risk glaucoma,326,486,261,0,False,glaucoma_earlyonset,glaucoma_prevent,glaucoma_prevent,glaucoma_prevent
2,How to prevent Glaucoma?,"At this time, we do not know how to prevent gl...",how to prevent glaucoma,at this time we do not know how to prevent gla...,4,84,0.814815,prevent glaucoma,326,486,261,0,False,glaucoma_earlyonset,glaucoma_prevent,glaucoma_prevent,glaucoma_prevent
3,What are the symptoms of Glaucoma?,"At first, open-angle glaucoma has no symptoms....",what are the symptoms of glaucoma,at first openangle glaucoma has no symptoms it...,6,45,0.679245,symptoms glaucoma,326,486,261,0,False,glaucoma_earlyonset,glaucoma_prevent,glaucoma_prevent,glaucoma_prevent
4,What are the treatments for Glaucoma?,"Yes. Immediate treatment for early stage, open...",what are the treatments for glaucoma,yes immediate treatment for early stage openan...,6,52,0.612903,treatments glaucoma,326,486,261,0,False,glaucoma_earlyonset,glaucoma_prevent,glaucoma_prevent,glaucoma_prevent


In [507]:
processed_dataset = processed_dataset.reset_index(drop=True)

In [508]:
processed_dataset[['question','answer','voted_label']]

Unnamed: 0,question,answer,voted_label
0,What is (are) Glaucoma?,Glaucoma is a group of diseases that can damag...,glaucoma_prevent
1,Who is at risk for Glaucoma??,Anyone can develop glaucoma. Some people are a...,glaucoma_prevent
2,How to prevent Glaucoma?,"At this time, we do not know how to prevent gl...",glaucoma_prevent
3,What are the symptoms of Glaucoma?,"At first, open-angle glaucoma has no symptoms....",glaucoma_prevent
4,What are the treatments for Glaucoma?,"Yes. Immediate treatment for early stage, open...",glaucoma_prevent
...,...,...,...
13822,What is (are) Diabetic Neuropathies: The Nerve...,Diabetic neuropathies are a family of nerve di...,nerve_diabetic
13823,What causes Diabetic Neuropathies: The Nerve D...,The causes are probably different for differen...,nerve_diabetic
13824,What are the symptoms of Diabetic Neuropathies...,Symptoms depend on the type of neuropathy and ...,nerve_diabetic
13825,How to prevent Diabetic Neuropathies: The Nerv...,The best way to prevent neuropathy is to keep ...,nerve_diabetic


In [509]:
#processed_dataset[['question','answer','voted_label']].to_csv(os.path.join(abs_path, 'data/processed', 'question_answer_with_labels.csv'), index=False)

In [510]:
# processed_dataset.to_csv(os.path.join(abs_path, 'data/processed', 'processed_dataset.csv'), index=False)