<a href="https://colab.research.google.com/github/fatemafaria142/Impact-of-Instruction-Tuning-in-Biomedical-Language-Processing-Using-Different-LLMs/blob/main/Biomedical_Language_Processing_using_Mistral_7B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Install Required Packages**

In [1]:
!pip install accelerate peft bitsandbytes transformers trl datasets torch

Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Collecting peft
  Downloading peft-0.7.1-py3-none-any.whl (168 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Collecting trl
  Downloading trl-0.7.9-py3-none-any.whl (141 kB)
Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)

### **Dataset Link:** https://huggingface.co/datasets/nlpie/Llama2-MedTuned-Instructions?row=0

In [3]:
from datasets import load_dataset

instruct_tune_dataset = load_dataset("nlpie/Llama2-MedTuned-Instructions")

### **Dataset structure**

* The dataset has a single `train` split with three columns: `instruction`, `input`, and `output`.

In [4]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 205048
    })
})

In [5]:
# Display information for 5 data points from the 'train' split
num_samples_to_show = 5
for i in range(num_samples_to_show):
    data = instruct_tune_dataset['train'][i]
    print(f"Data Point {i + 1}:")
    print("Instruction:", data['instruction'])
    print("Input:", data['input'])
    print("Output:", data['output'])
    print("\n-----------------------------\n")


Data Point 1:
Instruction: Your goal is to determine the relationship between the two provided clinical sentences and classify them into one of the following categories:
Contradiction: If the two sentences contradict each other.
Neutral: If the two sentences are unrelated to each other.
Entailment: If one of the sentences logically entails the other.
Input: Sentence 1: For his hypotension, autonomic testing confirmed orthostatic hypotension.
Sentence 2: the patient has orthostatic hypotension
Output: Entailment

-----------------------------

Data Point 2:
Instruction: In the provided text, your objective is to recognize and tag gene-related Named Entities using the BIO labeling scheme. Start by labeling the initial word of a gene-related phrase as B (Begin), and then mark the following words in the same phrase as I (Inner). Any words not constituting gene-related entities should receive an O label.
Input: The first product of ascorbate oxidation , the ascorbate free radical ( AFR ) , 

### **We will use just a small subset of the data for this training example**

In [6]:
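# Keep a handle on the full split so the test rows come from outside the training subset
full_train = instruct_tune_dataset["train"]
instruct_tune_dataset["train"] = full_train.select(range(1500))
instruct_tune_dataset["test"] = full_train.select(range(1500, 1700))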

In [7]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 1500
    })
    test: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 200
    })
})
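When a held-out set is carved from the same split as the training set, the two index ranges should not overlap, otherwise evaluation is measured on rows the model has already trained on. A minimal sketch of the index arithmetic, using plain Python ranges in place of `Dataset.select` (the helper name `disjoint_ranges` is hypothetical and not part of the notebook):

```python
def disjoint_ranges(n_train: int, n_test: int, n_total: int):
    """Return (train_idx, test_idx) as back-to-back, non-overlapping ranges."""
    if n_train + n_test > n_total:
        raise ValueError("not enough rows for the requested split sizes")
    train_idx = range(0, n_train)
    test_idx = range(n_train, n_train + n_test)
    return train_idx, test_idx

# Sizes from this notebook: 1500 train rows, 200 test rows, 205048 rows in total.
train_idx, test_idx = disjoint_ranges(1500, 200, 205048)
overlap = set(train_idx) & set(test_idx)
print(len(train_idx), len(test_idx), len(overlap))
```

Each range can then be passed to `Dataset.select`, which accepts any iterable of row indices.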

In [8]:
# Earlier draft of create_prompt, disabled by wrapping it in a string literal.
'''
def create_prompt(sample):
  """
  Update the prompt template:
  Combine both the prompt and input into a single column.

  """
  bos_token = "<s>"
  original_system_message = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
  system_message = "Use the provided input to create an instruction that could have been used to generate the response with an LLM."
  response = sample["output"].replace(original_system_message, "").replace("\n\n### Instruction\n", "").replace("\n### Response\n", "").strip()
  input = sample["instruction"]
  eos_token = "</s>"

  full_prompt = ""
  full_prompt += bos_token
  full_prompt += "### Instruction:"
  full_prompt += "\n" + system_message
  full_prompt += "\n\n### Input:"
  full_prompt += "\n" + input
  full_prompt += "\n\n### Response:"
  full_prompt += "\n" + response
  full_prompt += eos_token

  return full_prompt

create_prompt(instruct_tune_dataset["train"][0])
'''

In [18]:
def create_prompt(sample):
  """
  Format one dataset row as a single training string:
  instruction, input, and response sections wrapped in BOS/EOS tokens.
  """
  bos_token = "<s>"

  eos_token = "</s>"

  full_prompt = ""
  full_prompt += bos_token
  full_prompt += "### Instruction:"
  full_prompt += "\n" + sample["instruction"]
  full_prompt += "\n\n### Input:"
  full_prompt += "\n" + sample["input"]
  full_prompt += "\n\n### Response:"
  full_prompt += "\n" + sample["output"]
  full_prompt += eos_token

  return full_prompt
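At inference time the prompt should follow the same layout as this training template but stop after the response header, so that the model writes the answer itself. A minimal sketch; `build_inference_prompt` is a hypothetical helper, not part of the notebook, and it omits `<s>` on the assumption that the tokenizer adds the BOS token when called with `add_special_tokens=True`:

```python
def build_inference_prompt(instruction: str, input_text: str) -> str:
    """Mirror the training template, leaving the response section empty."""
    return (
        "### Instruction:\n" + instruction
        + "\n\n### Input:\n" + input_text
        + "\n\n### Response:\n"
    )

# Placeholder instruction and input, for illustration only.
example = build_inference_prompt(
    "Classify the relationship between the two clinical sentences.",
    "Sentence 1: first sentence\nSentence 2: second sentence",
)
print(example)
```

Keeping the separators (`\n\n` before `### Input:` and `### Response:`) identical to training avoids a mismatch between what the model was tuned on and what it sees when queried.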

In [20]:
create_prompt(instruct_tune_dataset["train"][0])

'<s>### Instruction:\nYour goal is to determine the relationship between the two provided clinical sentences and classify them into one of the following categories:\nContradiction: If the two sentences contradict each other.\nNeutral: If the two sentences are unrelated to each other.\nEntailment: If one of the sentences logically entails the other.\n\n### Input:\nSentence 1: For his hypotension, autonomic testing confirmed orthostatic hypotension.\nSentence 2: the patient has orthostatic hypotension\n\n### Response:\nEntailment</s>'

In [21]:
create_prompt(instruct_tune_dataset["train"][2])

'<s>### Instruction:\nIf you have expertise in healthcare, assist users by addressing their medical questions and concerns.\n\n### Input:\ni am having acute gas problem which i am feeling  and my bp has raised to 130/90, During  this time i have palpitation and i feel there is a missing pulse . Kindly guide me to overcome this problem.On the other hand if there is no gas in my stomach the aforesaid problems are not occuring.\n\n### Response:\nDear-thanks for using our service and will try to help you with my medical advice. You might be experiencing intestinal gas from food that you are eating as some vegetables like broccoli, cauliflower and that can make your heart rate faster too. However, the blood pressure is high, which can be the primary problem or secondary to other things like salt in your diet. I recommend you to watch what your eat, control salt, Chat Doctor.  That will help your blood pressure and the intestinal movements too. Avoid soda and caffeine too. I hope that my adv

In [22]:
create_prompt(instruct_tune_dataset["train"][5])

'<s>### Instruction:\nAs a healthcare expert, provide answers to medical inquiries based on the information given by the user.\n\n### Input:\ni am scheduled for shock wave treatment for kidney stones and blockage. I understand they have more than 1 type of lithrotipsy machines, the ones that use water and those that use padded cushions. Which ones are recommended  and why? Do they both use x-rays to guide them? I know some use ultrasound. Thank you\n\n### Response:\nEndoscopic lithography refers to the visualization of a calculus in the urinary tract and the simultaneous application of energy to fragment the stone or stones into either extractable or passable pieces. Many calculi in the upper urinary tract are treated with extracorporeal shockwave lithography (ESL). However, for stones that are poor candidates for this modality, endoscopic therapy is indicated. Ureteroscopy is the most common means of visualizing an upper urinary tract calculus. In addition, percutaneous techniques (e.

### **Initializing the Model**
* Load the model in 4-bit NF4 precision with double quantization, and set float16 as the compute data type.

* We use the instruction-tuned model here rather than the base model, since fine-tuning a base model to follow instructions generally requires considerably more data.
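To see why 4-bit loading matters on a single Colab GPU, the weight memory can be estimated with simple arithmetic. This is a rough sketch under stated assumptions: about 7.24 billion parameters, a quantization block size of 64, and a second-level block size of 256 for double quantization (the figures described for QLoRA). Real usage will differ, since not every tensor is quantized and activations add overhead.

```python
# Assumption: roughly 7.24e9 parameters for Mistral-7B; treat as approximate.
N_PARAMS = 7.24e9

def weight_memory_gb(bits_per_param: float, n_params: float = N_PARAMS) -> float:
    """Memory needed to store n_params weights at the given bit width, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

fp16_gb = weight_memory_gb(16)
# 4-bit weights plus one 32-bit scale per block of 64 weights (0.5 extra bits each).
nf4_gb = weight_memory_gb(4 + 32 / 64)
# Double quantization stores the scales in 8 bits, with a second-level
# 32-bit scale per 256 blocks: 8/64 + 32/(64*256) extra bits per weight.
nf4_dq_gb = weight_memory_gb(4 + 8 / 64 + 32 / (64 * 256))

print(f"fp16: {fp16_gb:.2f} GB, nf4: {nf4_gb:.2f} GB, nf4 + double quant: {nf4_dq_gb:.2f} GB")
```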

In [23]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

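bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4 bits
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for computation
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)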


* https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1

In [24]:
model_id = "mistralai/Mistral-7B-Instruct-v0.1"

In [25]:
model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto", use_cache=False
    )

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [26]:
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")  # same checkpoint as the model
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

### **Let's examine how well the model currently does at this task:**

In [27]:
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0].replace(prompt, "")

In [29]:
prompt="### Instruction:\nYour goal is to determine the relationship between the two provided clinical sentences and classify them into one of the following categories: Contradiction: If the two sentences contradict each other. Neutral: If the two sentences are unrelated to each other. Entailment: If one of the sentences logically entails the other.### Input:\nSentence 1: For his hypotension, autonomic testing confirmed orthostatic hypotension. Sentence 2: the patient has orthostatic hypotension\n\n### Response:"

In [30]:
generate_response(prompt, model)



'<s> \nNeutral</s>'
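The decoded string still carries the `<s>` and `</s>` markers, which get in the way when comparing the prediction with the reference label. A small sketch of a cleanup step; `strip_special_tokens` is a hypothetical helper, and the marker strings are assumed to be the BOS/EOS tokens seen in the output above:

```python
def strip_special_tokens(decoded: str, special_tokens=("<s>", "</s>")) -> str:
    """Remove BOS/EOS markers and surrounding whitespace from decoded text."""
    for token in special_tokens:
        decoded = decoded.replace(token, "")
    return decoded.strip()

label = strip_special_tokens("<s> \nNeutral</s>")
print(label)  # Neutral
```

An alternative is to pass `skip_special_tokens=True` to `tokenizer.batch_decode`, which drops these tokens during decoding.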

### **Setting up the Training**
We will use Hugging Face's `transformers`, `trl`, and `peft` libraries.

In [31]:
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training

peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")


* We need to prepare the quantized model for 4-bit training, so we use the **`prepare_model_for_kbit_training`** function from `peft`.




In [32]:
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
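To get a feel for how small the trainable part is, the LoRA parameter count can be worked out by hand. This is a sketch under assumptions that are not stated in the notebook: adapters are attached only to `q_proj` and `v_proj` (believed to be the PEFT default for Mistral when `target_modules` is not given), and the layer shapes are the commonly cited Mistral-7B ones. Calling `model.print_trainable_parameters()` on the PEFT model reports the actual figure.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

# Assumed Mistral-7B shapes: hidden size 4096, 32 layers, q_proj 4096 -> 4096,
# v_proj 4096 -> 1024 (grouped-query attention). Adapters on q_proj and v_proj only.
R, ALPHA, N_LAYERS = 8, 16, 32
per_layer = lora_param_count(4096, 4096, R) + lora_param_count(4096, 1024, R)
trainable = per_layer * N_LAYERS
scaling = ALPHA / R  # the low-rank update is multiplied by lora_alpha / r

print(f"trainable LoRA parameters: {trainable:,}; scaling factor: {scaling}")
```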

### **Training Hyperparameters**
Which hyperparameters to choose depends on how long you can afford to train. Two matter most here:

* `num_train_epochs` / `max_steps`: Set how many passes over the data (or how many optimizer steps) are run. When both are given, `max_steps` takes precedence, which is why the run below stops at step 50. Too many passes can lead to overfitting.

* `learning_rate`: Sets the optimizer step size. Too high and training can diverge; too low and convergence is slow.

In [33]:
from transformers import TrainingArguments
output_model= "mistral_biomedical_generation"
training_arguments = TrainingArguments(
        output_dir=output_model,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        optim="paged_adamw_32bit",
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        save_strategy="epoch",
        logging_steps=10,
        num_train_epochs=1,
        max_steps=50,
        fp16=True,
)
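With `lr_scheduler_type="cosine"` the learning rate decays from its peak towards zero over the run. A minimal sketch of that curve, assuming no warmup (the `TrainingArguments` default) and the values used above: a peak of `2e-4` over 50 steps. The function `cosine_lr` is illustrative, not the library's implementation.

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Half-cosine decay from peak_lr at step 0 to 0 at total_steps (no warmup)."""
    progress = min(step, total_steps) / total_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 10, 25, 40, 50):
    print(step, f"{cosine_lr(step, 50, 2e-4):.2e}")
```

Halfway through the run the rate is half the peak, and the last few steps make only very small updates.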


### **Setting up the trainer**

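`max_seq_length`: the maximum number of tokens in each training sequence. With `packing=True`, formatted examples are concatenated and cut into sequences of this length.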
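Packing can be pictured as joining tokenized examples end to end, separated by an EOS token, and slicing the result into fixed-length chunks. The sketch below illustrates the idea on toy token ids; it is a simplification and not TRL's actual implementation, and `pack_sequences` is a hypothetical name.

```python
from typing import Iterable, List

def pack_sequences(tokenized: Iterable[List[int]], seq_length: int, eos_id: int) -> List[List[int]]:
    """Concatenate examples (EOS-separated) and cut into chunks of seq_length tokens."""
    buffer: List[int] = []
    for ids in tokenized:
        buffer.extend(ids + [eos_id])
    n_full = len(buffer) // seq_length
    # The trailing partial chunk is dropped in this sketch.
    return [buffer[i * seq_length:(i + 1) * seq_length] for i in range(n_full)]

# Toy token ids; 0 stands in for the EOS id.
examples = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
chunks = pack_sequences(examples, seq_length=4, eos_id=0)
print(chunks)
```

Because short examples share a sequence, no tokens are spent on padding, but one training sequence may contain parts of several examples.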


In [34]:
from trl import SFTTrainer

max_seq_length = 1024

trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  max_seq_length=max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  formatting_func=create_prompt, # applies create_prompt to every training and evaluation example
  args=training_arguments,
  train_dataset=instruct_tune_dataset["train"],
  eval_dataset=instruct_tune_dataset["test"]
)

Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]



### **Training starts here**

In [35]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 10   | 1.9803        |
| 20   | 1.787         |
| 30   | 1.5734        |
| 40   | 1.6362        |
| 50   | 1.582         |


TrainOutput(global_step=50, training_loss=1.7117459106445312, metrics={'train_runtime': 777.8677, 'train_samples_per_second': 0.257, 'train_steps_per_second': 0.064, 'total_flos': 8741766719078400.0, 'train_loss': 1.7117459106445312, 'epoch': 0.48})
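The reported throughput can be cross-checked against the batch settings. A small sketch of the arithmetic, assuming a single GPU and that each "sample" counted by the trainer is one packed sequence:

```python
# Values copied from the TrainOutput above and the TrainingArguments cell.
global_step = 50
per_device_batch = 1
grad_accum = 4
train_runtime_s = 777.8677
reported_samples_per_s = 0.257
reported_epoch = 0.48

sequences_seen = global_step * per_device_batch * grad_accum  # single-GPU assumption
samples_per_s = sequences_seen / train_runtime_s
steps_per_epoch = global_step / reported_epoch

print(f"sequences seen: {sequences_seen}")
print(f"samples/s: {samples_per_s:.3f} (reported {reported_samples_per_s})")
print(f"approx. optimizer steps per epoch: {steps_per_epoch:.0f}")
```

The 50 optimizer steps cover about 200 packed sequences, a little under half an epoch, which matches the reported `epoch` value of 0.48.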

### **Save the model**

In [36]:
trainer.save_model("mistral_biomedical_generation")

In [37]:
merged_model = model.merge_and_unload()
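Conceptually, merging folds the low-rank update into the base weight: the merged matrix is `W + (lora_alpha / r) * B @ A`. The toy example below shows that arithmetic with plain Python lists; it illustrates the formula only and is not PEFT's code. With a 4-bit base model the merged weights have to be re-quantized, so results may differ slightly from running the base model with the adapter attached.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged weight matrix."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy shapes: W is 2x3, rank r = 1, so A is 1x3 and B is 2x1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B = [[0.5], [-0.5]]
merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)
```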



In [38]:
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0]
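This version returns the full decoded text, prompt included. If only the generated answer is wanted, it can be cut out after the response header. A minimal sketch; `extract_response` is a hypothetical helper and the sample string is made up for illustration:

```python
RESPONSE_MARKER = "### Response:"

def extract_response(decoded: str, marker: str = RESPONSE_MARKER) -> str:
    """Return the text after the last response marker, without the EOS token."""
    _, _, tail = decoded.rpartition(marker)
    return tail.replace("</s>", "").strip()

decoded = "<s> ### Instruction:\nAnswer.\n### Input:\nQuestion?\n\n### Response:\nHello Mike,</s>"
print(extract_response(decoded))  # Hello Mike,
```

Splitting on the marker is more robust than `str.replace(prompt, "")`, which silently does nothing if decoding changes the whitespace of the prompt.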

### **Example: 1**

In [39]:
# Example usage
prompt = "### Instruction:\nIf you have medical expertise, assist the user by responding to their healthcare-related questions."
prompt += "\n### Input:\nHi I am 33 year old male , 200 lbs . Sunday I had a sever panic attack , was dizzy , naseuas , almost fainted . Went to the emergency room , had a ekg and blood work all came back normal . I do suffer from anxiety but as of late I have chest pains . Sharp pains in diffrent spots of my chest . They come and go . I also have a strange feeling I m my head and pressure under my rib cage . My blood pressure is always normal , but have a high pulse rate . Ranges from 85 - 111 bpm evertime I check it . Are all of these related to stress , what can be done . A very scared man , Mike"
prompt += "\n\n### Response:"
response = generate_response(prompt, model)

# Print the response with formatted output
print(response)



<s> ### Instruction:
If you have medical expertise, assist the user by responding to their healthcare-related questions.
### Input:
Hi I am 33 year old male , 200 lbs . Sunday I had a sever panic attack , was dizzy , naseuas , almost fainted . Went to the emergency room , had a ekg and blood work all came back normal . I do suffer from anxiety but as of late I have chest pains . Sharp pains in diffrent spots of my chest . They come and go . I also have a strange feeling I m my head and pressure under my rib cage . My blood pressure is always normal , but have a high pulse rate . Ranges from 85 - 111 bpm evertime I check it . Are all of these related to stress , what can be done . A very scared man , Mike

### Response:
Hello Mike,

I'm sorry to hear about your recent health concerns. It sounds like you may be experiencing a combination of symptoms related to anxiety and possibly some cardiac issues.

The symptoms you described such as panic attacks, dizziness, nausea, and chest pains a

### **Example: 2**

In [40]:
# Example usage
prompt = "### Instruction:\nAs a medical chatbot, your responsibility is to provide information and guidance on medical matters to users."
prompt += "\n### Input:\nCan i give my 6 month old 18lb baby anythi g for allergies shes having clear boogars that were green but now are clear again her eyes are watery and she keeps sneezing she cant breathe and drink her bottle and the pedi is not awnsering called 6 times already"
prompt += "\n\n### Response:"
response = generate_response(prompt, model)

# Print the response with formatted output
print(response)

<s> ### Instruction:
As a medical chatbot, your responsibility is to provide information and guidance on medical matters to users.
### Input:
Can i give my 6 month old 18lb baby anythi g for allergies shes having clear boogars that were green but now are clear again her eyes are watery and she keeps sneezing she cant breathe and drink her bottle and the pedi is not awnsering called 6 times already

### Response:
Hello! I'm sorry to hear that your baby has been experiencing some symptoms of allergies, such as clear and green boogers, watery eyes, sneezing, and difficulty breathing. These symptoms can be caused by a variety of allergens, such as dust mites, pollen, or pet dander. However, since she's a 6-month-old, it could also be a sign of a respiratory infection, not an allergy.

It is always best to consult with your pediatrician for an accurate diagnosis and proper treatment. If you've already called your doctor 6 times, it would be good to seek medical advice in person or by callin

### **Example: 3**

In [41]:
# Example usage
prompt = "### Instruction:\nIn your capacity as a doctor, it is expected that you answer the medical questions relying on the patient's description.Analyze the question given its context. Give both long answer and yes/no decision."
prompt += "\n### Input:\nDoes systemic inflammation response syndrome score predict the mortality in multiple trauma patients?"
prompt += "\n\n### Response:"
response = generate_response(prompt, model)

# Print the response with formatted output
print(response)

<s> ### Instruction:
In your capacity as a doctor, it is expected that you answer the medical questions relying on the patient's description.Analyze the question given its context. Give both long answer and yes/no decision.
### Input:
Does systemic inflammation response syndrome score predict the mortality in multiple trauma patients?

### Response:
Systemic inflammation response syndrome (SIRS) is a nonspecific inflammatory response characterized by a systemic increase in one or more acute phase reactants such as white blood cell count (WBC), C-reactive protein (CRP), and tumor necrosis factor-alpha (TNF-α). It is commonly associated with infections, trauma, and other inflammatory conditions.

In multiple trauma patients, SIRS is a common complication that can lead to a severe systemic inflammatory response, organ dysfunction, and ultimately, death. Several studies have evaluated the role of SIRS in predicting mortality in multiple trauma patients and have found a positive association