<a href="https://colab.research.google.com/github/data-tamer2410/ds-doctor-chat/blob/master/doctor_chat/doctor_chat.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task

Develop a chatbot that provides advice and recommendations on medical issues.

# Solution to the problem

In [2]:
import pandas as pd
import tensorflow as tf
from datasets import Dataset
from transformers import AutoTokenizer, TFAutoModelForCausalLM, DataCollatorForSeq2Seq

In [None]:
train_dataset = pd.read_csv('train.csv')
test_dataset = pd.read_csv('test.csv')

In [None]:
print(train_dataset.isna().any())
print(test_dataset.isna().any())

Conversation    False
dtype: bool
Conversation    False
dtype: bool


In [None]:
print(train_dataset.duplicated().any())
print(test_dataset.duplicated().any())

False
False


In [None]:
# Insert "\n" before each speaker marker ([|Human|] or [|AI|]) unless one is already there.
# The same pattern is applied to both splits.
train_dataset = train_dataset['Conversation'].str.replace(r'(?<!\n)(\[\|Human\|\]|\[\|AI\|\])', r'\n\1', regex=True)
test_dataset = test_dataset['Conversation'].str.replace(r'(?<!\n)(\[\|Human\|\]|\[\|AI\|\])', r'\n\1', regex=True)

In [None]:
print(train_dataset.str.startswith("The conversation between human and AI assistant.\n[|Human|] ").all())
print(test_dataset.str.startswith("The conversation between human and AI assistant.\n[|Human|] ").all())

True
True


In [None]:
# Remove the fixed preamble and all quotation marks.
start_text = "The conversation between human and AI assistant.\n"
# regex=False: match the preamble literally, so "." is not treated as a wildcard.
train_dataset = train_dataset.str.replace(start_text, '', regex=False)
test_dataset = test_dataset.str.replace(start_text, '', regex=False)

train_dataset = train_dataset.str.replace(r'[“”"‘’]', '', regex=True)
test_dataset = test_dataset.str.replace(r'[“”"‘’]', '', regex=True)

Each record in the file is a text conversation containing detailed statements from both dialogue participants. Main features:

1. **Formatted Dialogue:** Each message begins with a marker indicating the speaker ([|Human|] or [|AI|]), which allows for role identification.

2. **Content of Messages:**
   - Messages from the patient ("Human") contain descriptions of symptoms, complaints, or questions.
   - Messages from the doctor ("AI") contain detailed responses, possible diagnoses, or recommendations.

3. **Message Separation:** Replies are separated by a newline character (`\n`), although in the raw file several messages can share one line, which is why a newline is inserted before each speaker marker during preprocessing.
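
The format described above can be split into speaker turns with a single regular expression. The following is a minimal sketch and not part of the training pipeline: `split_turns` is a hypothetical helper name, and the sample conversation is invented for illustration.

```python
import re

# Hypothetical helper (not from the notebook): split one cleaned conversation
# into (speaker, text) pairs using the [|Human|] / [|AI|] markers.
TURN_RE = re.compile(r"\[\|(Human|AI)\|\]")


def split_turns(conversation):
    parts = TURN_RE.split(conversation)
    # re.split with one capture group yields:
    # [prefix, speaker1, text1, speaker2, text2, ...]
    speakers = parts[1::2]
    texts = parts[2::2]
    return [(speaker, text.strip()) for speaker, text in zip(speakers, texts)]


# Invented sample in the same format as the dataset.
sample = (
    "[|Human|] I have a sore throat.\n"
    "[|AI|] How long has it lasted?\n"
    "[|Human|] Three days."
)
turns = split_turns(sample)
print(turns)
```

A helper like this can be handy for spot-checking that every conversation alternates between the two roles before tokenization.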

In [None]:
# Loading the GPT-2 tokenizer.
tokenizer = AutoTokenizer.from_pretrained('gpt2')

In [None]:
# Adding special tokens.
tokenizer.pad_token = tokenizer.eos_token
special_tokens = {'additional_special_tokens': ['[|Human|]', '[|AI|]']}
tokenizer.add_special_tokens(special_tokens)

In [None]:
# Text tokenization.
train_dataset = tokenizer(train_dataset.to_list(),truncation=True)
test_dataset = tokenizer(test_dataset.to_list(),truncation=True)

In [None]:
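# Preparing datasets.
# Next-token labels: shift input_ids left by one position and append EOS,
# so that the label at position i is the token at position i + 1.
train_labels = [el[1:] + [tokenizer.eos_token_id] for el in train_dataset['input_ids']]
test_labels = [el[1:] + [tokenizer.eos_token_id] for el in test_dataset['input_ids']]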

train_dataset['labels'] = train_labels
test_dataset['labels'] = test_labels

train_dataset = Dataset.from_dict(train_dataset)
test_dataset = Dataset.from_dict(test_dataset)

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, padding=True, return_tensors='tf',label_pad_token_id=tokenizer.eos_token_id)

In [None]:
batch_size = 2

train_dataset = train_dataset.to_tf_dataset(
    columns=['input_ids', 'attention_mask', 'labels'],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator
)

# Shuffling is not needed for evaluation.
test_dataset = test_dataset.to_tf_dataset(
    columns=['input_ids', 'attention_mask', 'labels'],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator
)

In [None]:
# Loading the model.
model = TFAutoModelForCausalLM.from_pretrained('gpt2')

In [None]:
# Changing the size of the model's embedding matrices to account for the new tokens.
model.resize_token_embeddings(len(tokenizer))

In [None]:
model.summary()

Model: "tfgpt2lm_head_model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLay  multiple                  124441344 
 er)                                                             
                                                                 
Total params: 124441344 (474.71 MB)
Trainable params: 124441344 (474.71 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [None]:
class Perplexity(tf.keras.metrics.Metric):
    """Streaming perplexity: exp(total cross-entropy / total number of label positions)."""

    def __init__(self, name="perplexity", **kwargs):
        super().__init__(name=name, **kwargs)
        self.total_loss = self.add_weight(name="total_loss", initializer="zeros")
        self.total_count = self.add_weight(name="total_count", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        # Calculating the loss (SparseCategoricalCrossentropy).
        loss_fn = tf.losses.SparseCategoricalCrossentropy(from_logits=True, reduction=tf.keras.losses.Reduction.SUM)
        loss = loss_fn(y_true, y_pred)

        # Updating the total loss and token count.
        self.total_loss.assign_add(loss)
        self.total_count.assign_add(tf.cast(tf.size(y_true), tf.float32))

    def result(self):
        # Calculating the average loss.
        avg_loss = self.total_loss / self.total_count
        # Perplexity: exp(average_loss).
        return tf.exp(avg_loss)

    def reset_state(self):
        # Clearing the accumulators.
        self.total_loss.assign(0.0)
        self.total_count.assign(0.0)

In [None]:
# Training the model.
epochs = 3
lr = tf.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=3e-5,
    decay_steps=5000,
    decay_rate=0.96,
    staircase=True
)
optimizer = tf.optimizers.AdamW(learning_rate=lr)
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

model.compile(
    optimizer=optimizer,
    loss=loss,
    metrics=["accuracy",Perplexity()]
)

history = model.fit(
    train_dataset,
    epochs=epochs
)

In [None]:
model.save_pretrained('drive/MyDrive/gpt2-doctor-chat')
tokenizer.save_pretrained('drive/MyDrive/gpt2-doctor-chat')

In [None]:
res = model.evaluate(test_dataset)



In [16]:
# Let's test the model's performance.
text = "[|Human|] I've been experiencing persistent dizziness and nausea for the past two weeks, especially in the mornings. I also have occasional headaches and a feeling of pressure behind my eyes. Could this be related to a neurological issue or an inner ear disorder? What tests would you recommend to determine the cause, and what treatment options are available?\n[|AI|]"
tokenize_text = tokenizer(text,return_tensors='tf')

In [19]:
res_generate = model.generate(
    **tokenize_text,
    max_length=500,
    # Stop at end-of-text or at the next [|AI|] marker.
    eos_token_id=[
        tokenizer.eos_token_id,
        tokenizer.additional_special_tokens_ids[1],  # id of '[|AI|]'
    ],
    do_sample=True,
    top_p=0.9,
    repetition_penalty=1.2,
)

In [20]:
tokenizer.decode(res_generate[0])

"[|Human|] I've been experiencing persistent dizziness and nausea for the past two weeks, especially in the mornings. I also have occasional headaches and a feeling of pressure behind my eyes. Could this be related to a neurological issue or an inner ear disorder? What tests would you recommend to determine the cause, and what treatment options are available?\n[|AI|]  Hi, Thankyou for posting your query. I agree with you that your symptoms are likely due to neurological disorders. The treatment options would range from a neurologic examination (carotid Doppler) to some imaging (imaging of brain), which can mimic any neurological disorder. You should get back if you require any additional information. Best wishes, Chat Doctor. Ly/\n<|endoftext|>"

## Conclusion

The developed chatbot for medical consultations is based on a fine-tuned GPT-2 model, adapted to recognize and generate dialogues between users and a virtual doctor.  

### Testing results:  
- **Perplexity**: 12.9094  
- **Loss**: 2.6079  

A test perplexity of about 12.9 means that, on average, the model is roughly as uncertain as if it were choosing among 13 equally likely tokens at each step, which is a relatively low level of uncertainty.  
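
Perplexity is the exponential of the mean per-token cross-entropy. The sketch below illustrates that relationship in plain Python. The function name `perplexity` and the loss values are made up for illustration and are not taken from the training run.

```python
import math


def perplexity(token_losses):
    """Perplexity = exp(mean per-token cross-entropy, measured in nats)."""
    return math.exp(sum(token_losses) / len(token_losses))


# Invented per-token losses, for illustration only.
losses = [2.0, 3.0, 2.5, 2.5]
ppl = perplexity(losses)
print(round(ppl, 4))

# A model that spreads probability evenly over 4 tokens has a per-token
# loss of ln(4), so its perplexity is 4.
uniform = perplexity([math.log(4)] * 10)
print(round(uniform, 4))
```

The reported loss and perplexity need not satisfy this identity exactly. The `Perplexity` metric divides the summed loss by the total number of label positions, while the Keras loss is averaged batch by batch, so the two may weight tokens differently when sequence lengths vary between batches.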

### Key features of the implementation:  
- Use of special tokens (`[|Human|]` and `[|AI|]`) to structure dialogues correctly.  
- Data cleaning and preprocessing to ensure accurate training.  
- Adaptation of the GPT-2 model by adding new tokens and expanding the vocabulary.  
- Application of the **Perplexity** metric to assess the model's ability to predict the next word in a sequence.  
- Training optimization using `AdamW` and an exponential learning rate decay strategy.  
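
The learning rate schedule mentioned in the last item can be reproduced by hand. This is a minimal sketch of the staircase exponential decay formula, using the same hyperparameters as the training cell. `staircase_lr` is a hypothetical name, and the function is a plain-Python illustration rather than the Keras implementation.

```python
def staircase_lr(step, initial_lr=3e-5, decay_steps=5000, decay_rate=0.96):
    """lr = initial_lr * decay_rate ** floor(step / decay_steps)."""
    return initial_lr * decay_rate ** (step // decay_steps)


# The rate stays constant within each 5000-step window and then drops by 4%.
for step in (0, 4999, 5000, 10000):
    print(step, staircase_lr(step))
```

With `staircase=True` the rate changes only at multiples of `decay_steps`. Without it, the exponent would be the fractional `step / decay_steps` and the rate would decay smoothly.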

On the sample query above, the model produced a coherent response that suggested a possible cause and recommended medical examinations. Future improvements may include:  
- Expanding the dataset with more specialized medical texts to enhance response accuracy.  
- Utilizing more advanced language models (e.g., GPT-4).  
- Integrating external medical knowledge bases.  

The results highlight the potential of this approach for developing intelligent assistants in the healthcare sector.