# The Data
Minor, or common, ailments are conditions that can initially be managed with non-prescription therapy, such as allergies or skin irritations. All pharmacists can assess minor ailment symptoms and recommend self-care or over-the-counter treatments. Pharmacists in many provinces can also prescribe medications for certain minor ailments. Our resources will help pharmacists assess, recommend and (where authorized) prescribe appropriate therapy for a number of minor ailments.



You will build a Question-Answering (QA) system that handles questions about "minor ailments", that is, mild or common illnesses. The QA system is designed to answer from the given text, which describes conditions that can be managed with non-prescription therapy, the pharmacist's role in assessing minor ailment symptoms and recommending self-care or over-the-counter medication, and the ability of pharmacists in some provinces to prescribe medication for certain minor ailments.

The QA system receives questions on this topic and tries to give a suitable answer based on the information in the text. For example, for the question "What are minor ailments?", the system looks for the passage that describes the definition and scope of minor ailments. For the question "Who can prescribe medication for minor ailments?", it looks up and presents the information about the pharmacist's role in prescribing for minor ailments.

You will use natural language processing (NLP) techniques and language modeling to develop this system. The steps that may be needed include:
- Processing the source text (the given text) to understand its structure.
- Building a knowledge base with information on minor ailments, the pharmacist's role, and the relevant laws in the various provinces.
- Preparing training and test data to train and evaluate the NLP model.
- Developing an NLP model that can understand a question and match it against the information in the knowledge base.
- Building an interface that lets users ask questions and receive answers.

During development you also need to ensure that the answers the system produces are accurate and reliable. You can test the system with a variety of questions to confirm that it answers correctly in the context of the source text.

This QA system can be a very useful tool for pharmacists or for anyone who wants to learn more about minor ailments and appropriate treatment.

#### Setup Env & Download Data

In [1]:
!pip install transformers accelerate evaluate datasets rouge_score -q


In [2]:
!pip install -q wget



## Import Library

In [3]:
import re
import json
import wget
import torch
import numpy as np
import pandas as pd
import evaluate
from datasets import load_dataset, Dataset
from transformers import T5ForConditionalGeneration, T5TokenizerFast
from transformers import DataCollatorForSeq2Seq

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

In [4]:
driveLink = "https://drive.google.com/uc?export=download&id=1pX56Zk_rzTXSQWmIy08T6yJYg_fGsG6m"

MODEL_CHECKPOINT = "t5-base"
MAX_INPUT_LENGTH = 512
MAX_TARGET_LENGTH = 512
BATCH_SIZE = 4
LEARNING_RATE = 3e-4
MAX_EPOCHS = 10

START_PREFIX = "question: "
END_PREFIX = " </s>"
MODEL_REPO = "commonaliment"

In [5]:
# Download data
file_name = wget.download(driveLink)

In [6]:
data = pd.read_csv(file_name, delimiter=';')
data = data.dropna()
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 748 entries, 0 to 794
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Code                      748 non-null    int64 
 1   name                      748 non-null    object
 2   symptoms                  748 non-null    object
 3   desc                      748 non-null    object
 4   commonTestsAndProcedures  748 non-null    object
 5   medications1              748 non-null    object
 6   medications2              748 non-null    object
 7   whoIsAtRiskDesc           748 non-null    object
 8   Sympoms desc              748 non-null    object
dtypes: int64(1), object(8)
memory usage: 58.4+ KB


In [7]:
data.head(2)

| | Code | name | symptoms | desc | commonTestsAndProcedures | medications1 | medications2 | whoIsAtRiskDesc | Sympoms desc |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 105 | Panic disorder | [{"symptoms":"Anxiety and nervousness"},{"symp... | Panic disorder,Panic disorder is an anxiety di... | [{"commonTestsAndProcedures":"Psychotherapy"},... | The most commonly prescribed drugs for patient... | [{"commonMedications":"Lorazepam"},{"commonMed... | Groups of people at highest risk for panic dis... | The symptoms that are highly suggestive of pa... |
| 1 | 106 | Vocal cord polyp | [{"symptoms":"Hoarse voice"},{"symptoms":"91"}... | Vocal cord polyp ,Vocal cord polyp is encounte... | [{"commonTestsAndProcedures":"Tracheoscopy and... | The most commonly prescribed drugs for patient... | [{"commonMedications":"Esomeprazole (Nexium)"}... | Groups of people at highest risk for vocal cor... | The symptoms that are highly suggestive of vo... |


In [8]:
for i in range(5):
    print(f"Data idx {i} Sympoms desc :")
    sys_desc = data.iloc[i]["Sympoms desc"]
    print(sys_desc, "\n")

Data idx 0 Sympoms desc :
 The symptoms that are highly suggestive of panic disorder are anxiety and nervousness and breathing fast, although you may still have panic disorder without those symptoms.                                          

Data idx 1 Sympoms desc :
 The symptoms that are highly suggestive of vocal cord polyp are hoarse voice, difficulty speaking, throat swelling, and lump in throat, although you may still have vocal cord polyp without those symptoms.                

Data idx 2 Sympoms desc :
 The symptoms that are highly suggestive of turner syndrome are groin mass, blood in stool, lack of growth, diminished hearing, emotional symptoms, elbow weakness, back weakness, and pus in sputum, although you may still have turner syndrome without those symptoms.           

Data idx 3 Sympoms desc :
 The symptoms that are highly suggestive of cryptorchidism are symptoms of the scrotum and testes, swelling of scrotum, flatulence, pus draining from ear, jaundice, mass in scrot

As you can see, the `Sympoms desc` column always starts with the sentence `The symptoms that are highly suggestive of`, so we will remove that prefix.
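A minimal sketch of that clean-up on plain strings, assuming every description follows the template seen in the printed rows. The helper name `strip_symptom_prefix` is hypothetical and not part of the notebook; besides dropping the prefix it also collapses the long runs of spaces visible in the output.

```python
import re

PREFIX = "The symptoms that are highly suggestive of"

def strip_symptom_prefix(text: str) -> str:
    """Drop the fixed leading sentence fragment and normalize whitespace."""
    text = text.strip()
    if text.startswith(PREFIX):
        text = text[len(PREFIX):]
    # Collapse any run of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()

sample = (
    " The symptoms that are highly suggestive of panic disorder are anxiety "
    "and nervousness and breathing fast, although you may still have panic "
    "disorder without those symptoms.        "
)
print(strip_symptom_prefix(sample))
```

On a pandas column the same function could be applied with `Series.map`.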

---
#### Question 1

`question1 : what disease has these simptoms: [simptomsDesc]`

`Answer: [name]`



In [9]:
q1 = 'what desease has these simptoms:' + data['Sympoms desc'] + '?'
q1 = q1.str.replace('The symptoms that are highly suggestive of', '')
a1 = data['name']

In [10]:
q1[0], a1[0]

('what desease has these simptoms:  panic disorder are anxiety and nervousness and breathing fast, although you may still have panic disorder without those symptoms.                                         ?',
 'Panic disorder')
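Note that the question printed above still contains the disease name twice, so the expected answer is visible in the input. One possible way to keep only the symptom list is sketched below. It assumes the sentence template seen in the printed rows (`... suggestive of <name> are <symptoms>, although you may still have ...`); the regular expression and the name `extract_symptoms` are illustrative and not part of the original notebook.

```python
import re
from typing import Optional

# Assumed template: "... highly suggestive of <name> are <symptoms>, although you may still have ..."
TEMPLATE = re.compile(
    r"highly suggestive of\s+(?P<name>.+?)\s+are\s+(?P<symptoms>.+?),\s+although you may still have",
    re.IGNORECASE | re.DOTALL,
)

def extract_symptoms(description: str) -> Optional[str]:
    """Return only the symptom list, or None when the template does not match."""
    match = TEMPLATE.search(description)
    if match is None:
        return None
    return re.sub(r"\s+", " ", match.group("symptoms")).strip()

desc = (
    " The symptoms that are highly suggestive of vocal cord polyp are hoarse voice, "
    "difficulty speaking, throat swelling, and lump in throat, although you may "
    "still have vocal cord polyp without those symptoms.     "
)
print(extract_symptoms(desc))
```

Rows that do not match the template return `None`, so they can be inspected separately instead of being silently altered.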

#### Question 2
`question2 : who Is At Risk for [name]`

`Answer: [whoIsAtRiskDesc]`

In [11]:
q2 = 'who Is At Risk for ' + data['name'] + '?'
a2 = data['whoIsAtRiskDesc']

In [12]:
q2[0], a2[0]

('who Is At Risk for Panic disorder?',
 'Groups of people at highest risk for panic disorder include     age 30-44 years.   On the other hand, age 1-4 years and age < 1 years almost never get panic disorder.,Within all the people who go to their doctor with panic disorder, 88% report having anxiety and nervousness, 55% report having depression, and 40% report having shortness of breath.   ')

#### Question 3
`question3 : what are the most comon test and procedures for [name] ?`

`Answer: [commonTestsAndProcedures] after JSON decoding`

In [13]:
data['commonTestsAndProcedures'][0]

'[{"commonTestsAndProcedures":"Psychotherapy"},{"commonTestsAndProcedures":"Mental health counseling"},{"commonTestsAndProcedures":"Electrocardiogram"},{"commonTestsAndProcedures":"Depression screen (Depression screening)"},{"commonTestsAndProcedures":"Toxicology screen"},{"commonTestsAndProcedures":"Psychological and psychiatric evaluation and therapy"},{"commonTestsAndProcedures":"Occupational therapy assessment (Speech therapy)"}]'

In [14]:
example = json.loads(data['commonTestsAndProcedures'][0])
example

[{'commonTestsAndProcedures': 'Psychotherapy'},
 {'commonTestsAndProcedures': 'Mental health counseling'},
 {'commonTestsAndProcedures': 'Electrocardiogram'},
 {'commonTestsAndProcedures': 'Depression screen (Depression screening)'},
 {'commonTestsAndProcedures': 'Toxicology screen'},
 {'commonTestsAndProcedures': 'Psychological and psychiatric evaluation and therapy'},
 {'commonTestsAndProcedures': 'Occupational therapy assessment (Speech therapy)'}]

In [15]:
def jsonToStringwithComa(entry):
    m1 = json.loads(entry)
    m2 = [list(x.values()) for x in m1]
    m3 = ''.join([item + ', ' for sublist in m2 for item in sublist])
    return m3

In [16]:
q3 = 'what are the most comon test and procedures for ' + data['name'] + '?'
a3 = data['commonTestsAndProcedures'].map(jsonToStringwithComa)

In [17]:
q3[0], a3[0]

('what are the most comon test and procedures for Panic disorder?',
 'Psychotherapy, Mental health counseling, Electrocardiogram, Depression screen (Depression screening), Toxicology screen, Psychological and psychiatric evaluation and therapy, Occupational therapy assessment (Speech therapy), ')
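The answer printed above ends with a dangling `, ` because each item is suffixed with the separator before concatenation. If that trailing separator is unwanted, a variant based on `str.join` avoids it. This is a sketch with a hypothetical name, `json_values_to_text`, and is not the function used for the results in this notebook.

```python
import json

def json_values_to_text(entry: str) -> str:
    """Flatten a JSON list of single-key objects into 'a, b, c' without a trailing comma."""
    records = json.loads(entry)
    values = [str(value) for record in records for value in record.values()]
    return ", ".join(values)

raw = (
    '[{"commonTestsAndProcedures":"Psychotherapy"},'
    '{"commonTestsAndProcedures":"Mental health counseling"},'
    '{"commonTestsAndProcedures":"Electrocardiogram"}]'
)
print(json_values_to_text(raw))
```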

#### Question 4
`question4 : what are drugs for [name] ?`

` Answer: [medications1]`

In [18]:
q4 = ' what are drugs for  ' + data['name'] + '?'
a4 = data['medications1']

In [19]:
q4[0], a4[0]

(' what are drugs for  Panic disorder?',
 'The most commonly prescribed drugs for patients with panic disorder include       lorazepam,          alprazolam (xanax),          clonazepam,          paroxetine (paxil),          venlafaxine (effexor),          mirtazapine,          buspirone (buspar),          fluvoxamine (luvox),          imipramine,          desvenlafaxine (pristiq),          clomipramine,          acamprosate (campral) and          disulfiram (antabuse)     .')

#### Concat QA

In [20]:
def wikitext_detokenizer(string):
    # contractions
    string = string.replace("s '", "s'")
    string = re.sub(r"/' [0-9]/", r"/'[0-9]/", string)
    # replace more spaces with 1
    string = re.sub(r"\s\s+", " ", string)
    # number separators
    string = string.replace(" @-@ ", "-")
    string = string.replace(" @,@ ", ",")
    string = string.replace(" @.@ ", ".")
    # punctuation
    string = string.replace(" : ", ": ")
    string = string.replace(" ; ", "; ")
    string = string.replace(" . ", ". ")
    string = string.replace(" ! ", "! ")
    string = string.replace(" ? ", "? ")
    string = string.replace(" , ", ", ")
    # double brackets
    string = re.sub(r"\(\s*([^\)]*?)\s*\)", r"(\1)", string)
    string = re.sub(r"\[\s*([^\]]*?)\s*\]", r"[\1]", string)
    string = re.sub(r"{\s*([^}]*?)\s*}", r"{\1}", string)
    string = re.sub(r"\"\s*([^\"]*?)\s*\"", r'"\1"', string)
    string = re.sub(r"'\s*([^']*?)\s*'", r"'\1'", string)
    # miscellaneous
    string = string.replace("= = = =", "====")
    string = string.replace("= = =", "===")
    string = string.replace("= =", "==")
    string = string.replace(" " + chr(176) + " ", chr(176))
    string = string.replace(" \n", "\n")
    string = string.replace("\n ", "\n")
    string = string.replace(" N ", " 1 ")
    string = string.replace(" 's", "'s")

    return string

In [21]:
q = pd.concat([q1, q2, q3, q4])
a = pd.concat([a1, a2, a3, a4])
question = q.map(wikitext_detokenizer)
answers = a.map(wikitext_detokenizer)

In [22]:
qaPairs = pd.concat((question, answers), axis=1)
qaPairs.columns = ["question", "answers_text"]

In [23]:
# Save to csv
qaPairs.to_csv("qapairs.csv", index=False)

## Preprocessing Data

In [24]:
datasets = load_dataset("csv", data_files="/content/qapairs.csv")
datasets

DatasetDict({
    train: Dataset({
        features: ['question', 'answers_text'],
        num_rows: 2992
    })
})

In [25]:
# train/test split
datasets_train_test = datasets["train"].shuffle(42).train_test_split(test_size=596)
datasets_train_validation = datasets_train_test["train"].shuffle(42).train_test_split(test_size=596)

datasets["train"] = datasets_train_validation["train"]
datasets["validation"] = datasets_train_validation["test"]
datasets["test"] = datasets_train_test["test"]

datasets

DatasetDict({
    train: Dataset({
        features: ['question', 'answers_text'],
        num_rows: 1800
    })
    validation: Dataset({
        features: ['question', 'answers_text'],
        num_rows: 596
    })
    test: Dataset({
        features: ['question', 'answers_text'],
        num_rows: 596
    })
})
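The split sizes follow directly from the row counts: 748 diseases times four question templates give 2992 pairs, and two successive hold-outs of 596 rows leave 1800 for training. A quick check of that arithmetic (the variable names are only for illustration):

```python
n_diseases = 748      # rows left after dropna()
n_templates = 4       # question types 1 to 4
holdout = 596         # size of the test split and of the validation split

total = n_diseases * n_templates
after_test = total - holdout
train = after_test - holdout

print(total, after_test, train)
```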

In [26]:
tokenizer = T5TokenizerFast.from_pretrained(MODEL_CHECKPOINT)


In [27]:
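def preprocess_function(examples):
    # Prefix each question with the task tag and append the literal "</s>" suffix.
    inputs = [START_PREFIX + doc + END_PREFIX for doc in examples["question"]]
    target = [doc + END_PREFIX for doc in examples["answers_text"]]
    # Tokenize the inputs, padding/truncating every sequence to MAX_INPUT_LENGTH.
    # `padding="max_length"` replaces the deprecated `pad_to_max_length=True`.
    model_inputs = tokenizer(
        inputs, max_length=MAX_INPUT_LENGTH,
        padding="max_length", truncation=True
    )

    # Tokenize the answers as labels; the data collator pads them per batch.
    labels = tokenizer(
        text_target=target, max_length=MAX_TARGET_LENGTH, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs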

In [28]:
tokenized_datasets = datasets.map(preprocess_function, batched=True, remove_columns=["question", "answers_text"])

In [29]:
model = T5ForConditionalGeneration.from_pretrained(MODEL_CHECKPOINT)

In [30]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
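`DataCollatorForSeq2Seq` pads the label sequences in a batch to a common length using `-100` by default, a value the loss function ignores. The pure-Python sketch below imitates that behaviour on toy token ids to show why `-100` has to be mapped back to the pad id before decoding. The helper names and the token ids are made up for illustration; the pad id of 0 is the one T5 uses.

```python
LABEL_PAD_ID = -100   # default label padding value, ignored by the loss
PAD_TOKEN_ID = 0      # T5 pad token id

def pad_labels(batch, pad_value=LABEL_PAD_ID):
    """Right-pad every label list to the length of the longest one."""
    longest = max(len(labels) for labels in batch)
    return [labels + [pad_value] * (longest - len(labels)) for labels in batch]

def restore_pad_ids(batch, pad_token_id=PAD_TOKEN_ID):
    """Replace the loss-masking value with a real token id so the batch can be decoded."""
    return [[pad_token_id if tok == LABEL_PAD_ID else tok for tok in labels]
            for labels in batch]

toy_batch = [[71, 12, 1], [9, 1]]   # hypothetical token ids
padded = pad_labels(toy_batch)
print(padded)
print(restore_pad_ids(padded))
```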

## Compute Metrics

In [31]:
metrics = evaluate.load("rouge")

In [32]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Turn the generated token ids back into text.
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Label positions padded with -100 are not valid token ids; swap in the pad id before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # ROUGE between the generated and the reference answers.
    result = metrics.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    # result["gen_len"] = np.mean(prediction_lens)

    # Round each score to four decimals for the training log.
    return {k: round(v, 4) for k, v in result.items()}
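To make the `Rouge1` column easier to read, here is a simplified unigram-overlap F1 in plain Python. It is only a sketch of the idea behind ROUGE-1: it splits on whitespace and skips the stemming and tokenization that the `rouge_score` package applies, so its numbers will not match the library exactly. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared word at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("vocal cord", "Vocal cord polyp"))
```

A prediction that drops one of three reference words has precision 1 and recall 2/3, which is why long answers (risk groups, drug lists) can pull the average down even when short answers are exact.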

## Args

In [33]:
# Set up training arguments
training_args = Seq2SeqTrainingArguments(
    MODEL_REPO,
    evaluation_strategy="steps",
    eval_steps=100,
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    weight_decay=0.01,
    num_train_epochs=3,
    predict_with_generate=True,
    load_best_model_at_end=True,
    fp16=True
)

In [34]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

## Training

In [35]:
trainer.train()


| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum |
|---|---|---|---|---|---|---|
| 100 | No log | 1.334022 | 0.5018 | 0.4153 | 0.5 | 0.4994 |
| 200 | No log | 1.144249 | 0.4976 | 0.4092 | 0.4958 | 0.4947 |
| 300 | No log | 1.033437 | 0.4975 | 0.4097 | 0.4957 | 0.4945 |
| 400 | No log | 0.950426 | 0.4994 | 0.405 | 0.4971 | 0.4956 |
| 500 | 1.444100 | 0.900044 | 0.5101 | 0.4183 | 0.5075 | 0.5069 |
| 600 | 1.444100 | 0.865276 | 0.5068 | 0.4183 | 0.5046 | 0.5039 |
| 700 | 1.444100 | 0.877803 | 0.5094 | 0.422 | 0.506 | 0.5052 |
| 800 | 1.444100 | 0.880599 | 0.5092 | 0.4212 | 0.5058 | 0.505 |
| 900 | 1.444100 | 0.88059 | 0.5093 | 0.4213 | 0.5059 | 0.5051 |
| 1000 | 0.989900 | 0.880571 | 0.5093 | 0.4213 | 0.5059 | 0.5051 |


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'].


TrainOutput(global_step=1350, training_loss=1.1487850613064237, metrics={'train_runtime': 1967.7175, 'train_samples_per_second': 2.744, 'train_steps_per_second': 0.686, 'total_flos': 3288372609024000.0, 'train_loss': 1.1487850613064237, 'epoch': 3.0})
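The reported `global_step=1350` matches the configuration: 1800 training rows at a batch size of 4 give 450 optimizer steps per epoch, and three epochs give 1350. The check below assumes a single device and no gradient accumulation, which is consistent with the arguments shown but is not stated explicitly in the notebook.

```python
import math

train_rows = 1800   # size of the training split
batch_size = 4      # BATCH_SIZE
epochs = 3          # num_train_epochs passed to the trainer

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```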

In [39]:
DEVICE = "cuda:0"

In [40]:
input_text =  'what desease has these simptoms:  vocal cord polyp are hoarse voice, difficulty speaking, throat swelling, and lump in throat, although you may still have vocal cord polyp without those symptoms.               ?'
inputs = tokenizer(
    START_PREFIX + input_text, max_length=MAX_INPUT_LENGTH,
    padding="max_length", truncation=True,
    add_special_tokens=True
)

input_ids = torch.tensor(inputs["input_ids"], dtype=torch.long).to(DEVICE).unsqueeze(0)
attention_mask = torch.tensor(inputs["attention_mask"], dtype=torch.long).to(DEVICE).unsqueeze(0)

outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask)

predicted_answer = tokenizer.decode(outputs.flatten(), skip_special_tokens=True)

print("Answer: ", predicted_answer)



Answer:  Vocal cord polyp
