<a href="https://colab.research.google.com/github/fatemafaria142/Impact-of-Instruction-Tuning-in-Biomedical-Language-Processing-Using-Different-LLMs/blob/main/Biomedical_Language_Processing_using_Starling_LM_7B_alpha.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Install Required Packages**

In [1]:
!pip install accelerate peft bitsandbytes transformers trl datasets torch

Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Collecting peft
  Downloading peft-0.7.1-py3-none-any.whl (168 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Collecting trl
  Downloading trl-0.7.9-py3-none-any.whl (141 kB)
Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)

#### **Dataset Link:** https://huggingface.co/datasets/nlpie/Llama2-MedTuned-Instructions?row=0

In [2]:
from datasets import load_dataset

instruct_tune_dataset = load_dataset("nlpie/Llama2-MedTuned-Instructions")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/2.58k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/91.7M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/205048 [00:00<?, ? examples/s]

### **Dataset structure**
* The dataset contains three columns: `instruction`, `input`, and `output`.

In [3]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 205048
    })
})

In [5]:
# Display information for 5 data points from the 'train' split
num_samples_to_show = 5
for i in range(num_samples_to_show):
    data = instruct_tune_dataset['train'][i]
    print(f"Data Point {i + 1}:")
    print("Instruction:", data['instruction'])
    print("Input:", data['input'])
    print("Output:", data['output'])
    print("\n-----------------------------\n")


Data Point 1:
Instruction: Your goal is to determine the relationship between the two provided clinical sentences and classify them into one of the following categories:
Contradiction: If the two sentences contradict each other.
Neutral: If the two sentences are unrelated to each other.
Entailment: If one of the sentences logically entails the other.
Input: Sentence 1: For his hypotension, autonomic testing confirmed orthostatic hypotension.
Sentence 2: the patient has orthostatic hypotension
Output: Entailment

-----------------------------

Data Point 2:
Instruction: In the provided text, your objective is to recognize and tag gene-related Named Entities using the BIO labeling scheme. Start by labeling the initial word of a gene-related phrase as B (Begin), and then mark the following words in the same phrase as I (Inner). Any words not constituting gene-related entities should receive an O label.
Input: The first product of ascorbate oxidation , the ascorbate free radical ( AFR ) , 

### **We will use just a small subset of the data for this training example**

In [6]:
# Keep a handle on the full split so that train and test use disjoint rows
full_train = instruct_tune_dataset["train"]
instruct_tune_dataset["train"] = full_train.select(range(3500))
instruct_tune_dataset["test"] = full_train.select(range(3500, 3800))

In [7]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 3500
    })
    test: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 300
    })
})

In [8]:
def create_prompt(sample):
    # Join instruction and input (no separator is added between them),
    # then wrap the result in the single-turn "GPT4 Correct" chat template
    prompt = sample['instruction']
    prompt += sample['input']
    single_turn_prompt = f"GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant: {sample['output']}"
    return single_turn_prompt

In [9]:
create_prompt(instruct_tune_dataset["train"][0])

'GPT4 Correct User: Your goal is to determine the relationship between the two provided clinical sentences and classify them into one of the following categories:\nContradiction: If the two sentences contradict each other.\nNeutral: If the two sentences are unrelated to each other.\nEntailment: If one of the sentences logically entails the other.Sentence 1: For his hypotension, autonomic testing confirmed orthostatic hypotension.\nSentence 2: the patient has orthostatic hypotension<|end_of_turn|>GPT4 Correct Assistant: Entailment'

In [10]:
create_prompt(instruct_tune_dataset["train"][1])

'GPT4 Correct User: In the provided text, your objective is to recognize and tag gene-related Named Entities using the BIO labeling scheme. Start by labeling the initial word of a gene-related phrase as B (Begin), and then mark the following words in the same phrase as I (Inner). Any words not constituting gene-related entities should receive an O label.The first product of ascorbate oxidation , the ascorbate free radical ( AFR ) , acts in biological systems mainly as an oxidant , and through its role in the plasma membrane redox system exerts different effects on the cell .<|end_of_turn|>GPT4 Correct Assistant: The : O\nfirst : O\nproduct : O\nof : O\nascorbate : O\noxidation : O\n, : O\nthe : O\nascorbate : O\nfree : O\nradical : O\n( : O\nAFR : O\n) : O\n, : O\nacts : O\nin : O\nbiological : O\nsystems : O\nmainly : O\nas : O\nan : O\noxidant : O\n, : O\nand : O\nthrough : O\nits : O\nrole : O\nin : O\nthe : O\nplasma : O\nmembrane : O\nredox : O\nsystem : O\nexerts : O\ndifferent :

In [11]:
create_prompt(instruct_tune_dataset["train"][2])

'GPT4 Correct User: If you have expertise in healthcare, assist users by addressing their medical questions and concerns.i am having acute gas problem which i am feeling  and my bp has raised to 130/90, During  this time i have palpitation and i feel there is a missing pulse . Kindly guide me to overcome this problem.On the other hand if there is no gas in my stomach the aforesaid problems are not occuring.<|end_of_turn|>GPT4 Correct Assistant: Dear-thanks for using our service and will try to help you with my medical advice. You might be experiencing intestinal gas from food that you are eating as some vegetables like broccoli, cauliflower and that can make your heart rate faster too. However, the blood pressure is high, which can be the primary problem or secondary to other things like salt in your diet. I recommend you to watch what your eat, control salt, Chat Doctor.  That will help your blood pressure and the intestinal movements too. Avoid soda and caffeine too. I hope that my a

In [12]:
create_prompt(instruct_tune_dataset["train"][3])

"GPT4 Correct User: Being a doctor, your task is to answer the medical questions based on the patient's description.\nDetermine the relation between the two medical terms.###Question: What is the relation between Household Activity and Does not bath self (finding) ?<|end_of_turn|>GPT4 Correct Assistant: ###Answer: Does not bath self (finding) interprets: Household Activity"

In [13]:
create_prompt(instruct_tune_dataset["train"][4])

"GPT4 Correct User: In your role as a medical professional, address the user's medical questions and concerns.Hi I have the last few weeks had a sharp pain in my upper abdoman (centre) and a dull pain on the side, changing from left to right..just under my chest/ribs I also feel dizzy and like im a bit drunk... I have see a doctor who says im aneamic and also referred me for a gastroscopy but im really worried its somethign that can t be seen with this camera..<|end_of_turn|>GPT4 Correct Assistant: Thank you. Upper abdominal pain with anemia indicate some disease of upper GIT.peptic ulcer disease is the most possibility. Have you any history of vomiting out of blood or passage of black tarry stool ? Gastroscopy is confirmatory for this. Serology for h.pylori may be done. If these are inconclusive then other in. Should be considered like USG of abdomen. Dizziness is due to anemia. You have to check He level also."

In [41]:
create_prompt(instruct_tune_dataset["train"][5])

"GPT4 Correct User: Put yourself in the shoes of a thoughtful mental health counselor. Imagine you're all set to help with kindness and understanding, ready to offer care and support. Your task is to extend understanding and guidance, providing support to individuals dealing with mental health challenges. Here is my mental health problems: I'm going through some things with my feelings and myself. I barely sleep and I do nothing but think about how I'm worthless and how I shouldn't be here.\n   I've never tried or contemplated suicide. I've always wanted to fix my issues, but I never get around to it.\n   How can I change my feeling of being worthless to everyone?<|end_of_turn|>GPT4 Correct Assistant: Heck, sure thing, hun!Feelings of 'depression' have a deeply-rooted base in physical structures that may not be functioning very well at present; and, we can certainly turn them on again using means that you are able to find around the house and with relative ease :)After that, emotional 

In [14]:
create_prompt(instruct_tune_dataset["train"][6])

'GPT4 Correct User: Your goal is to determine the relationship between the two provided clinical sentences and classify them into one of the following categories:\nContradiction: If the two sentences contradict each other.\nNeutral: If the two sentences are unrelated to each other.\nEntailment: If one of the sentences logically entails the other.Sentence 1: Patient denies the use of EtOH or any illegal substances prior to the fall.\nSentence 2: Patient has not had recent trauma<|end_of_turn|>GPT4 Correct Assistant: Contradiction'
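
In the formatted samples above, the instruction and the input run together (for example `...entails the other.Sentence 1: ...`), because `create_prompt` concatenates the two fields directly. The following is a minimal sketch of a variant that inserts a separator and can also build an inference prompt without the answer. `build_prompt`, `separator`, and `include_output` are hypothetical names introduced for illustration; they are not part of the notebook, and whether a separator improves fine-tuning is not tested here.

```python
def build_prompt(sample, separator="\n", include_output=True):
    """Format one record in the single-turn template used in this notebook.

    Hypothetical helper: `separator` and `include_output` are illustrative
    additions, not part of the original `create_prompt`.
    """
    user_text = sample["instruction"]
    # Only add the separator when there is an input to append
    if sample.get("input"):
        user_text += separator + sample["input"]
    prompt = f"GPT4 Correct User: {user_text}<|end_of_turn|>GPT4 Correct Assistant:"
    if include_output:
        prompt += f" {sample['output']}"
    return prompt


# Made-up record with the same three columns as the dataset
example = {
    "instruction": "Classify the relationship between the two sentences.",
    "input": "Sentence 1: A\nSentence 2: B",
    "output": "Entailment",
}
print(build_prompt(example))
```

Passing `include_output=False` yields a prompt that ends at `GPT4 Correct Assistant:`, which is the shape used for generation later in the notebook.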

### **Initializing the Model**
* Load the model in 4-bit precision using NF4 quantization with double quantization, and set float16 as the compute data type.

* We start from the instruction-tuned model rather than a base model, because fine-tuning a base model to follow instructions typically requires considerably more data.
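
To give a feel for what 4-bit loading with double quantization saves, here is a back-of-the-envelope estimate of weight memory. Every constant is an assumption for illustration: roughly 7.24e9 parameters (in line with the three 16-bit checkpoint shards downloaded in this notebook, about 14.5 GB in total), and the block sizes described in the QLoRA paper (one fp32 scale per 64 weights; with double quantization, 8-bit scales plus one fp32 second-level scale per 256 scales). The estimate ignores layers that stay unquantized and all activation memory.

```python
# Rough weight-memory estimate; all constants below are assumptions.
N_PARAMS = 7.24e9  # approximate parameter count of a 7B model


def bits_per_param(double_quant: bool) -> float:
    """Storage cost per weight: 4-bit value plus amortized scale constants."""
    weight_bits = 4.0
    if double_quant:
        # 8-bit scale per 64 weights, plus an fp32 scale per 256 of those scales
        scale_bits = 8 / 64 + 32 / (64 * 256)
    else:
        # one fp32 scale per 64 weights
        scale_bits = 32 / 64
    return weight_bits + scale_bits


def gigabytes(bits: float) -> float:
    """Convert a per-parameter bit cost into total gigabytes."""
    return N_PARAMS * bits / 8 / 1e9


fp16_gb = gigabytes(16)
nf4_gb = gigabytes(bits_per_param(double_quant=False))
nf4_dq_gb = gigabytes(bits_per_param(double_quant=True))

print(f"16-bit weights:          {fp16_gb:.2f} GB")
print(f"NF4:                     {nf4_gb:.2f} GB")
print(f"NF4 + double quant:      {nf4_dq_gb:.2f} GB")
```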

In [15]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
)


* **Model and Tokenizer:** https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha

In [16]:
model_id = "berkeley-nest/Starling-LM-7B-alpha"

In [17]:
model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto", use_cache=False
    )

config.json:   0%|          | 0.00/613 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/115 [00:00<?, ?B/s]

In [18]:
tokenizer = AutoTokenizer.from_pretrained("berkeley-nest/Starling-LM-7B-alpha")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

tokenizer_config.json:   0%|          | 0.00/1.61k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/560 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### **Let's examine how well the model currently does at this task:**

In [19]:
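def generate_response(prompt, model):
  # Tokenize the prompt and move the tensors to the GPU
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  # Sample up to 1024 new tokens; the EOS token doubles as the padding token
  generated_ids = model.generate(**model_inputs, max_new_tokens=1024, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  # Note: the decoded text starts with "<s> " and has spaces around special tokens,
  # so this replace() may find no exact match and leave the prompt in the result
  return decoded_output[0].replace(prompt, "")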

In [34]:
prompt = """GPT4 Correct User: Your goal is to determine the relationship between the two provided clinical sentences and classify them into one of the following categories:
Contradiction: If the two sentences contradict each other. Neutral: If the two sentences are unrelated to each other. Entailment: If one of the sentences logically entails the other. """
prompt += '''Sentence 1: For his hypotension, autonomic testing confirmed orthostatic hypotension. Sentence 2: the patient has orthostatic hypotension <|end_of_turn|>'''
prompt += "GPT4 Correct Assistant:"

In [35]:
print(prompt)

GPT4 Correct User: Your goal is to determine the relationship between the two provided clinical sentences and classify them into one of the following categories:
Contradiction: If the two sentences contradict each other. Neutral: If the two sentences are unrelated to each other. Entailment: If one of the sentences logically entails the other. Sentence 1: For his hypotension, autonomic testing confirmed orthostatic hypotension. Sentence 2: the patient has orthostatic hypotension <|end_of_turn|>GPT4 Correct Assistant:


In [36]:
generate_response(prompt, model)



'<s> GPT4 Correct User: Your goal is to determine the relationship between the two provided clinical sentences and classify them into one of the following categories:\nContradiction: If the two sentences contradict each other. Neutral: If the two sentences are unrelated to each other. Entailment: If one of the sentences logically entails the other. Sentence 1: For his hypotension, autonomic testing confirmed orthostatic hypotension. Sentence 2: the patient has orthostatic hypotension <|end_of_turn|> GPT4 Correct Assistant: These two sentences are related and one sentence logically entails the other. \n\nSentence 1 provides information about the clinical condition (orthostatic hypotension) and the confirmation from the autonomic testing. Sentence 2 states the patient has orthostatic hypotension, which is directly mentioned in Sentence 1. Therefore, the category for these two sentences is Entailment. \n\nBased on the provided categories, the relationship between the two sentences is clas
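
The output above still begins with the prompt: the decoded text carries a leading `<s>` and extra spaces around `<|end_of_turn|>`, so `str.replace(prompt, "")` finds no exact match. A common alternative is to drop the prompt tokens before decoding. The sketch below shows the slicing step on plain Python lists standing in for the tensor returned by `model.generate`; `strip_prompt_tokens` is a hypothetical helper, not part of the notebook.

```python
def strip_prompt_tokens(generated_ids, prompt_length):
    """Return only the tokens produced after the prompt for each sequence.

    `generated_ids` is a batch of token-id sequences in which each row is
    the prompt followed by the newly generated tokens; `prompt_length` is
    the number of prompt tokens in each row.
    """
    return [list(sequence[prompt_length:]) for sequence in generated_ids]


# Toy batch: the first three ids play the role of the prompt
toy_generated = [[1, 42, 7, 99, 100, 2]]
print(strip_prompt_tokens(toy_generated, prompt_length=3))  # [[99, 100, 2]]
```

With real tensors the same idea is `generated_ids[:, prompt_length:]`, where `prompt_length` is `model_inputs["input_ids"].shape[1]`, followed by `tokenizer.batch_decode(..., skip_special_tokens=True)`.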

### **Setting up the Training**
We will use Hugging Face's `transformers`, `trl`, and `peft` libraries.

In [37]:
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training

peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")


* To train the model in 4-bit, we first prepare it with the **`prepare_model_for_kbit_training`** function from `peft`.




In [38]:
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
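
To see why LoRA with `r=8` trains only a small fraction of the weights, the count of adapter parameters can be worked out by hand. The dimensions below are assumptions, not values read from the notebook: a Mistral-7B-style architecture (hidden size 4096, 32 layers, 1024-dimensional value projection) with adapters on `q_proj` and `v_proj` only. The actual target modules depend on the `peft` defaults for the loaded model.

```python
# Rough count of LoRA trainable parameters for r=8, lora_alpha=16.
# Layer shapes and target modules are assumptions for illustration.
R = 8
LORA_ALPHA = 16
NUM_LAYERS = 32
TARGETS = {"q_proj": (4096, 4096), "v_proj": (4096, 1024)}  # (d_in, d_out)


def lora_params(d_in, d_out, r):
    # Each adapted matrix gets A with shape (r, d_in) and B with shape (d_out, r)
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in TARGETS.values())
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update B @ A

print(f"trainable LoRA parameters: {total:,}")
print(f"update scaling (alpha / r): {scaling}")
```

Calling `model.print_trainable_parameters()` on the PEFT-wrapped model reports the actual figure for comparison.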

### **Training Hyperparameters**
The choice of hyperparameters is contingent upon the desired training duration. Pay special attention to the following key factors:

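* `num_train_epochs` / `max_steps`: Set how long training runs. `num_train_epochs` counts full passes over the data, while `max_steps` counts optimizer updates and, when set to a positive value, overrides `num_train_epochs`. Exercise caution, as too many of either may lead to overfitting!

* `learning_rate`: Sets the step size of each optimizer update. Too high a value can make training unstable, while too low a value slows convergence.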

In [40]:
from transformers import TrainingArguments

output_model = "Starling_instruct_generation"
training_arguments = TrainingArguments(
        output_dir=output_model,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,  # 1 x 4 = 4 sequences per update on a single GPU
        optim="paged_adamw_32bit",
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        save_strategy="epoch",
        logging_steps=10,
        num_train_epochs=1,
        max_steps=100,  # takes precedence over num_train_epochs
        fp16=True,
)
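
As an illustration of what `lr_scheduler_type="cosine"` does to the learning rate over the 100 steps configured above, here is a minimal sketch of a cosine decay. It assumes no warmup (none is set in the cell above) and a single half-cosine from the base rate down to zero; `cosine_lr` is a hypothetical helper written for this example, not a library function.

```python
import math


def cosine_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Values taken from the TrainingArguments cell: learning_rate=2e-4, max_steps=100
for step in (0, 25, 50, 75, 100):
    print(step, f"{cosine_lr(step, 100, 2e-4):.2e}")
```

The rate starts at the configured `2e-4`, reaches half of it at the midpoint, and decays to zero at the final step.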


### **Setting up the trainer**

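`max_seq_length`: Maximum number of tokens per training sequence. With `packing=True`, formatted examples are concatenated and split into sequences of this length.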


In [41]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  max_seq_length=max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  formatting_func=create_prompt, # this will apply create_prompt to every example in the training and test datasets
  args=training_arguments,
  train_dataset=instruct_tune_dataset["train"],
  eval_dataset=instruct_tune_dataset["test"]
)

Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]
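
The idea behind `packing=True` can be sketched in a few lines: tokenized examples are joined into one stream, with an end-of-sequence token after each, and the stream is cut into blocks of `max_seq_length` tokens. This is a simplified illustration of the concept, not TRL's implementation; `pack_sequences` is a hypothetical helper, and the toy token ids are made up.

```python
def pack_sequences(tokenized_examples, seq_length, eos_token_id):
    """Concatenate examples (each followed by EOS) and cut fixed-size blocks.

    Tokens left over after the last full block are dropped.
    """
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids)
        stream.append(eos_token_id)
    n_blocks = len(stream) // seq_length
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(n_blocks)]


# Three toy "examples" of different lengths, packed into blocks of 4 tokens
toy_examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
packed = pack_sequences(toy_examples, seq_length=4, eos_token_id=2)
print(packed)  # [[5, 6, 7, 2], [8, 9, 2, 10], [11, 12, 13, 2]]
```

Because blocks are cut at fixed lengths, a single example can be split across two training sequences, which is the trade-off for avoiding padding.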



### **Training starts here**

In [42]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 10   | 2.2027        |
| 20   | 1.857         |
| 30   | 1.7818        |
| 40   | 1.5927        |
| 50   | 1.4488        |
| 60   | 1.5798        |
| 70   | 1.393         |
| 80   | 1.411         |
| 90   | 1.4624        |
| 100  | 1.5151        |


TrainOutput(global_step=100, training_loss=1.624425163269043, metrics={'train_runtime': 804.113, 'train_samples_per_second': 0.497, 'train_steps_per_second': 0.124, 'total_flos': 8741776785408000.0, 'train_loss': 1.624425163269043, 'epoch': 0.21})
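
The throughput figures in `TrainOutput` can be reproduced from the settings in the `TrainingArguments` cell, which is a quick sanity check on how much data the run actually saw. The sketch assumes a single GPU; the sequences-per-epoch figure is only an estimate, because the reported `epoch` value is rounded to 0.21.

```python
# Values copied from the TrainingArguments cell and the TrainOutput above
per_device_batch = 1
grad_accum = 4
max_steps = 100
train_runtime_s = 804.113
reported_epoch = 0.21

# Each optimizer step consumes batch_size * accumulation_steps sequences
sequences_seen = per_device_batch * grad_accum * max_steps
samples_per_second = sequences_seen / train_runtime_s
steps_per_second = max_steps / train_runtime_s

# Rough size of one epoch in packed sequences, given the rounded epoch value
approx_sequences_per_epoch = sequences_seen / reported_epoch

print(f"sequences seen:       {sequences_seen}")
print(f"samples per second:   {samples_per_second:.3f}")
print(f"steps per second:     {steps_per_second:.3f}")
print(f"sequences per epoch:  ~{approx_sequences_per_epoch:.0f}")
```

So the 100 steps cover 400 packed sequences of up to 512 tokens each, about a fifth of one pass over the 3,500-row subset.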

### **Save the model**

In [43]:
trainer.save_model("Starling_instruct_generation")

In [44]:
merged_model = model.merge_and_unload()
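
Conceptually, merging folds each low-rank update back into its base weight matrix, W' = W + (lora_alpha / r) · B·A, so that inference no longer needs separate adapter layers. The sketch below shows that arithmetic on tiny nested-list matrices. It is an illustration only: `merge_lora` and `matmul` are hypothetical helpers, and the sketch ignores the dequantization and requantization involved when the base weights are stored in 4-bit.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A for nested-list matrices."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: 2x2 base weight with a rank-1 adapter
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # shape (r, d_in)
B = [[0.5], [0.0]]  # shape (d_out, r)
merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```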



In [45]:
# Redefine the helper for the fine-tuned model: shorter generations,
# and the full decoded text (prompt included) is returned
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs, max_new_tokens=512, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0]

### **Example No:1**

In [49]:
prompt = "GPT4 Correct User: In your role as a medical professional, address the user's medical questions and concerns. "
prompt += "I have a white tab under my tounge that is not only painful when i touch it but bleeds as well. not sure what it is, or why I got it. Can you give me any advise? <|end_of_turn|> "
prompt += "GPT4 Correct Assistant:"
response = generate_response(prompt, model)
# Print the response with formatted output
print(response)

<s> GPT4 Correct User: In your role as a medical professional, address the user's medical questions and concerns. I have a white tab under my tounge that is not only painful when i touch it but bleeds as well. not sure what it is, or why I got it. Can you give me any advise? <|end_of_turn|>  GPT4 Correct Assistant: As a medical professional, I would recommend that you visit a doctor for a proper examination and diagnosis. The white tab you described could be a canker sore, which is a small, painful ulcer that can appear inside the mouth. While they are usually not indicative of a serious problem, it's important to get a professional opinion to rule out any potential causes.

In the meantime, try these home remedies to soothe the pain:
1. Rinse your mouth with a salt water solution several times a day. This can help reduce inflammation and speed up the healing process.
2. Apply a cold compress to the area to reduce swelling and pain.
3. Over-the-counter pain relief medication, such as i

### **Example No:2**

In [50]:
prompt = "GPT4 Correct User: In your capacity as a healthcare expert, offer insights and recommendations in response to users' medical inquiries. "
prompt += "I have terrible anxiety and depression. I've tried various therapists and pills, but nothing's helped. <|end_of_turn|> "
prompt += "GPT4 Correct Assistant:"
response = generate_response(prompt, model)
# Print the response with formatted output
print(response)

<s> GPT4 Correct User: In your capacity as a healthcare expert, offer insights and recommendations in response to users' medical inquiries. I have terrible anxiety and depression. I've tried various therapists and pills, but nothing's helped. <|end_of_turn|>  GPT4 Correct Assistant: I understand the struggle with anxiety and depression, and it's great that you've sought help from various therapists and medications. It's important to remember that finding the right treatment often requires a bit of trial and error. Here are some suggestions for you to consider:

1. Continue exploring therapists: It might be worthwhile to try different types of therapists - psychologists, psychiatrists, cognitive-behavioral therapists, mindfulness-based therapists, etc. Each has their own unique approach, and finding one that resonates with you could be beneficial.

2. Medication trial and error: The process of finding the right medication can be a bit of a journey. Keep an open mind about trying differe

### **Example No:3**

In [51]:
prompt = "GPT4 Correct User: As a medical chatbot, your responsibility is to provide information and guidance on medical matters to users. "
prompt += "Hi sir, I am so happy with this website. First of all thanks for giving this opportunity. I am the  Software employee.My age is 24. My height is 169cm .Recently I got back pain and some pain in chest. How can i get relief from those pains.How i improve my health and which type of diseases will attack to my life in future. Please give Some health tips for heart and kidneys protection. <|end_of_turn|> "
prompt += "GPT4 Correct Assistant:"
response = generate_response(prompt, model)
# Print the response with formatted output
print(response)

<s> GPT4 Correct User: As a medical chatbot, your responsibility is to provide information and guidance on medical matters to users. Hi sir, I am so happy with this website. First of all thanks for giving this opportunity. I am the  Software employee.My age is 24. My height is 169cm .Recently I got back pain and some pain in chest. How can i get relief from those pains.How i improve my health and which type of diseases will attack to my life in future. Please give Some health tips for heart and kidneys protection. <|end_of_turn|>  GPT4 Correct Assistant: Hello! I'm glad to hear that you're satisfied with the website and that you feel comfortable enough to reach out for advice. It's important to consult with a healthcare professional if you're experiencing any chronic pain, as they may be able to provide a more detailed and personalized assessment. In the meantime, here are some general tips that may help in managing pain and promoting overall health:

1. Maintain a healthy lifestyle: T