## Data Processing

This notebook splits the dataset into train and test sets and saves each split as a JSONL file for later use.

In [21]:
from datasets import load_dataset
import json
from tqdm import tqdm

### Get the Medical Healthcare dataset

In [3]:
dataset_name = "Kabatubare/medical"
dataset = load_dataset(dataset_name, split="all")
dataset

Dataset({
    features: ['Context', 'Question', 'Answer'],
    num_rows: 23437
})

Shuffle the dataset and split it into train (90%) and test (10%) sets.

In [5]:
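dataset = dataset.shuffle(seed=42)
# train_test_split shuffles again by default, so pass a seed here as well to keep the split reproducible
dataset = dataset.train_test_split(test_size=0.1, seed=42)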
dataset

DatasetDict({
    train: Dataset({
        features: ['Context', 'Question', 'Answer'],
        num_rows: 21093
    })
    test: Dataset({
        features: ['Context', 'Question', 'Answer'],
        num_rows: 2344
    })
})
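The row counts above can be checked by hand. The sketch below assumes the test split is rounded up and the train split takes the remaining rows, which is consistent with the counts shown; `split_sizes` is a hypothetical helper, not part of the `datasets` library.

```python
import math

def split_sizes(num_rows, test_size):
    """Return (train_rows, test_rows) for a fractional test_size."""
    # Assumption: the test split is rounded up, the train split gets the rest.
    test_rows = math.ceil(num_rows * test_size)
    train_rows = num_rows - test_rows
    return train_rows, test_rows

train_rows, test_rows = split_sizes(23437, 0.1)
print(train_rows, test_rows)  # 21093 2344
```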

### Writing the train and test splits to JSONL files

Convert each split to a pandas DataFrame.

In [18]:
dataset_train = dataset['train'].to_pandas()
dataset_train = dataset_train.reset_index().rename(columns={'index': 'sample_id'})  # expose the row index as a sample_id column
dataset_train

| | sample_id | Context | Question | Answer |
|---|---|---|---|---|
| 0 | 0 | You are a medical knowledge assistant trained ... | i want to know what i can eat while i have a s... | it used to be recommended that people with ulc... |
| 1 | 1 | You are a medical knowledge assistant trained ... | are immigrants putting the u. s. at increased ... | people applying to enter the u. s. with immigr... |
| 2 | 2 | You are a medical knowledge assistant trained ... | my 6 year old keeps triggering the sensation t... | this is completely normal. he is a normal sens... |
| 3 | 3 | You are a medical knowledge assistant trained ... | my daughter 'zones out' for 5-20 seconds often... | sounds like my history--epilepsy at age 14 com... |
| 4 | 4 | You are a medical knowledge assistant trained ... | is surgery the only option for a smaller fibro... | hi as its started top grow this is normal to h... |
| ... | ... | ... | ... | ... |
| 21088 | 21088 | You are a medical knowledge assistant trained ... | how can i manage nausea or vomiting during pre... | an upset stomach is one of the most common dis... |
| 21089 | 21089 | You are a medical knowledge assistant trained ... | 2 days after my shingle vaccine i have a red p... | it is not uncommon to get local inflammation f... |
| 21090 | 21090 | You are a medical knowledge assistant trained ... | sharp back neck and arm pain on the left side.... | you should see your doctor. sharp pain like th... |
| 21091 | 21091 | You are a medical knowledge assistant trained ... | what are the symptoms of hypothyroidism? | features of hypothyroidism can go unsuspected ... |


In [19]:
dataset_test = dataset['test'].to_pandas()
dataset_test = dataset_test.reset_index().rename(columns={'index': 'sample_id'})  # expose the row index as a sample_id column
dataset_test

| | sample_id | Context | Question | Answer |
|---|---|---|---|---|
| 0 | 0 | You are a medical knowledge assistant trained ... | i'm always nervous and my heart beats too fast... | you are describing classic signs of anxiety. i... |
| 1 | 1 | You are a medical knowledge assistant trained ... | i’m worried about my husband not taking care o... | men need skin care regimens too. they should c... |
| 2 | 2 | You are a medical knowledge assistant trained ... | are there lifestyle and dietary factors that c... | weight is a really important factor to the pos... |
| 3 | 3 | You are a medical knowledge assistant trained ... | what is latent and silent celiac disease? | the terms latent and silent celiac disease are... |
| 4 | 4 | You are a medical knowledge assistant trained ... | how will doctors find out if a woman and her p... | doctors will do an infertility checkup. this i... |
| ... | ... | ... | ... | ... |
| 2339 | 2339 | You are a medical knowledge assistant trained ... | is there an equivalent over the counter medica... | thanks for your question. i understand your si... |
| 2340 | 2340 | You are a medical knowledge assistant trained ... | how do i balance my diet for diabetes high blo... | stage 3 kidney disease is not good. more than ... |
| 2341 | 2341 | You are a medical knowledge assistant trained ... | hi i am 23 is it normal to have heart rate at ... | hi here's a link from webmd it should help you... |
| 2342 | 2342 | You are a medical knowledge assistant trained ... | i performed oral on a man who said as far as h... | any form of unprotected sex has it's risk if t... |


Collect each row as a JSON-serializable dict in a list.

In [22]:
train_dataset_json_list = []

for index, row in dataset_train.iterrows():
    json_element = {'sample_id': row['sample_id'],
                    'context': row['Context'],
                    'question': row['Question'],
                    'answer': row['Answer']}
    
    train_dataset_json_list.append(json_element)

In [23]:
test_dataset_json_list = []

for index, row in dataset_test.iterrows():
    json_element = {'sample_id': row['sample_id'],
                    'context': row['Context'],
                    'question': row['Question'],
                    'answer': row['Answer']}
    
    test_dataset_json_list.append(json_element)
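The two loops above differ only in their input, so the mapping could be factored into one helper. The following is a minimal sketch using plain dicts so that it runs without pandas; `rows_to_records` and the placeholder rows are hypothetical and not taken from the dataset.

```python
def rows_to_records(rows):
    """Map rows with 'Context'/'Question'/'Answer' keys to lower-case records,
    numbering them from 0 in iteration order."""
    records = []
    for sample_id, row in enumerate(rows):
        records.append({
            "sample_id": sample_id,
            "context": row["Context"],
            "question": row["Question"],
            "answer": row["Answer"],
        })
    return records

# Placeholder rows for illustration only.
example_rows = [
    {"Context": "ctx A", "Question": "question A", "Answer": "answer A"},
    {"Context": "ctx B", "Question": "question B", "Answer": "answer B"},
]
records = rows_to_records(example_rows)
print(records[1]["sample_id"], records[1]["question"])  # 1 question B
```

Iterating over `dataset['train']` yields one dict per row, so passing it to such a helper should produce the same records as the loop above, assuming the row order is unchanged.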

Write the samples to a JSONL file, one JSON object per line.

In [24]:
# The data_kabatubare directory must already exist; open() does not create it
file_path = "data_kabatubare/train_kabatubare.jsonl"

with open(file_path, 'w') as file:
    for element in train_dataset_json_list:
        json_line = json.dumps(element)
        file.write(json_line + '\n')

In [25]:
file_path = "data_kabatubare/test_kabatubare.jsonl"

with open(file_path, 'w') as file:
    for element in test_dataset_json_list:
        json_line = json.dumps(element)
        file.write(json_line + '\n')
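To confirm that the files were written correctly, they can be read back line by line and compared with the original records. The sketch below round-trips two placeholder records through a temporary file; `write_jsonl` and `read_jsonl` are hypothetical helper names, and the records are illustrative rather than real dataset rows.

```python
import json
import os
import tempfile

def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

def read_jsonl(path):
    """Parse every non-empty line as a JSON object."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Placeholder records for illustration only.
sample_records = [
    {"sample_id": 0, "context": "ctx A", "question": "question A", "answer": "answer A"},
    {"sample_id": 1, "context": "ctx B", "question": "question B", "answer": "answer B"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.jsonl")
    write_jsonl(path, sample_records)
    loaded = read_jsonl(path)

print(loaded == sample_records)  # True
```

In the notebook, the same check would compare `read_jsonl(file_path)` against `train_dataset_json_list` or `test_dataset_json_list`. If the output directory might be missing, `os.makedirs("data_kabatubare", exist_ok=True)` creates it before writing.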