# “Evaluation Instruction-tuning - Llama2-MedTuned”
libs : 
1. "accelerate" => flexible API for PyTorch to help with distributed and mixed precision training (multi GPUs Or TPUs)
2. "peft" => Parameter-Efficient Fine-Tuning
3. "bitsandbytes" => fast PyTorch optimizers. It provides a fast implementation of optimizers like Adam.
4. "Transformers" => pre-trained transfomers models.
5. "TRL (Transformer Reinforcement Learning) =>  fine-tuning transformer models . 

In [1]:
!pip install -q accelerate==0.21.0 peft==0.7.1 bitsandbytes==0.42 transformers==4.37.0 trl==0.4.7

Load Llama2-MedTuned-Instructions dataset from huggingface , we store it now locally on kaggle . 

you can load by : 
```python
    from datasets import load_dataset
    dataset_base = load_dataset("nlpie/Llama2-MedTuned-Instructions")
```

Here a link to the database : 

- [DataSet Llama2-MedTuned-Instructions](https://huggingface.co/datasets/nlpie/Llama2-MedTuned-Instructions)

In [1]:
import pandas as pd
import torch
from datasets import Dataset, load_dataset , DatasetDict

# Load the parquet files into pandas dataframes
train_df = pd.read_parquet('/kaggle/input/medtuned-instructions/train-00000-of-00001-a8790d88efc2bc45.parquet')
validation_df = pd.read_parquet('/kaggle/input/medtuned-instructions/validation-00000-of-00001-b543c64b1786c03e.parquet')

# Convert the pandas dataframes into datasets
train_dataset = Dataset.from_pandas(train_df)
validation_dataset = Dataset.from_pandas(validation_df)

train_dataset = train_dataset.select(range(9000))
validation_dataset = validation_dataset.select(range(500))

# Create a DatasetDict
dataset_prepared = DatasetDict({
    'train': train_dataset,
    'test':validation_dataset
})

In [2]:
dataset_prepared

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'source'],
        num_rows: 9000
    })
    test: Dataset({
        features: ['instruction', 'input', 'output', 'source'],
        num_rows: 500
    })
})

In [3]:
# Rename the columns
dataset_prepared['train'] = dataset_prepared['train'].rename_column('input', 'context')
dataset_prepared['train'] = dataset_prepared['train'].rename_column('output', 'response')
dataset_prepared['train'] = dataset_prepared['train'].rename_column('source', 'category')

dataset_prepared['test'] = dataset_prepared['test'].rename_column('input', 'context')
dataset_prepared['test'] = dataset_prepared['test'].rename_column('output', 'response')
dataset_prepared['test'] = dataset_prepared['test'].rename_column('source', 'category')

dataset_prepared

DatasetDict({
    train: Dataset({
        features: ['instruction', 'context', 'response', 'category'],
        num_rows: 9000
    })
    test: Dataset({
        features: ['instruction', 'context', 'response', 'category'],
        num_rows: 500
    })
})

Before we can begin training, we need to set up a few helper functions to ensure our dataset is parsed in the correct format and we save our PEFT adapters!

In [4]:
def formatting_func(example):
  if example.get("context", "") != "":
      input_prompt = (f"Below is an instruction that describes a task, paired with an input that provides further context. "
      "Write a response that appropriately completes the request.\n\n"
      "### Instruction:\n"
      f"{example['instruction']}\n\n"
      f"### Input: \n"
      f"{example['context']}\n\n"
      f"### Response: \n"
      f"{example['response']}")

  else:
    input_prompt = (f"Below is an instruction that describes a task. "
      "Write a response that appropriately completes the request.\n\n"
      "### Instruction:\n"
      f"{example['instruction']}\n\n"
      f"### Response:\n"
      f"{example['response']}")

  return {"text" : input_prompt}

In [5]:
formatted_dataset = dataset_prepared.map(formatting_func)

  0%|          | 0/9000 [00:00<?, ?ex/s]

  0%|          | 0/500 [00:00<?, ?ex/s]

In [6]:
formatted_dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'context', 'response', 'category', 'text'],
        num_rows: 9000
    })
    test: Dataset({
        features: ['instruction', 'context', 'response', 'category', 'text'],
        num_rows: 500
    })
})

### We have the Llama2-MedTuned-Instructions dataset pared down to a more reasonable length - let's set up our model for eval steps!

In [9]:
# Login to huggingface hub

from huggingface_hub import login
login("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")

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [10]:
from peft import get_peft_model
import torch
import transformers
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
from transformers import AutoTokenizer


lora_config = LoraConfig.from_pretrained("madarsProd/lama2FineTunBioLast")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained("madarsProd/lama2FineTunBioLast")
model = AutoModelForCausalLM.from_pretrained(
    lora_config.base_model_name_or_path,
    quantization_config=bnb_config,
    device_map="auto"
)

adapter_config.json:   0%|          | 0.00/582 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.76k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/437 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

In [11]:
model = get_peft_model(model, lora_config)
print(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=F

### Let's check one sample inference before evaluation 

In [16]:
from IPython.display import display, Markdown

def make_inference(instruction, context = None):
  if context:
    prompt = f"Below is an instruction that describes a task, paired with an input that provides further context.\n\n### Instruction: \n{instruction}\n\n### Input: \n{context}\n\n### Response: \n"
  else:
    prompt = f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction: \n{instruction}\n\n### Response: \n"
  inputs = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False).to("cuda:0")
  outputs = model.generate(**inputs, max_new_tokens=1000 ,temperature=0.2)
  display(Markdown((tokenizer.decode(outputs[0], skip_special_tokens=True))))

In [18]:
Instruction = """Your task is to determine the relationships between
medical problems, treatments, and tests within the
clinical text. Medical problems are marked as @problem$,
medical tests are marked as @test$, and treatments are
marked as @treatment$. Categorize the relationship
between two entities in the text as one of the following
options:
Treatment improves medical problem (TrIP)
Treatment worsens medical problem (TrWP)
Treatment causes medical problem (TrCP)
Treatment is administered for medical problem (TrAP)
Treatment is not administered because of medical
problem (TrNAP)
Test reveals medical problem (TeRP)
Test conducted to investigate medical problem (TeCP)
Medical problem indicates medical problem (PIP)
No Relations"""

Instruction = Instruction.replace('\n', ' ')

Input = """This is an 83 y/o female with moderate AS , s/p MVR 27
years ago , s/p PPM , @problem$ , CHF , p/w dsypnea
and @problem$ x 4-5 days"""

make_inference(Instruction, Input)

Below is an instruction that describes a task, paired with an input that provides further context.

### Instruction: 
Your task is to determine the relationships between medical problems, treatments, and tests within the clinical text. Medical problems are marked as @problem$, medical tests are marked as @test$, and treatments are marked as @treatment$. Categorize the relationship between two entities in the text as one of the following options: Treatment improves medical problem (TrIP) Treatment worsens medical problem (TrWP) Treatment causes medical problem (TrCP) Treatment is administered for medical problem (TrAP) Treatment is not administered because of medical problem (TrNAP) Test reveals medical problem (TeRP) Test conducted to investigate medical problem (TeCP) Medical problem indicates medical problem (PIP) No Relations

### Input: 
This is an 83 y/o female with moderate AS , s/p MVR 27 years ago , s/p PPM , @problem$ , CHF , p/w dsypnea and @problem$ x 4-5 days

### Response: 
Based on the provided text, the relationships between the medical problems, treatments, and tests can be categorized as follows:

* Treatment improves medical problem:
	+ Treatment with MVR 27 years ago improved the patient's AS.
	+ Treatment with PPM improved the patient's p/w dyspnea.
* Treatment worsens medical problem:
	+ Treatment with MVR 27 years ago worsened the patient's AS.
	+ Treatment with PPM worsened the patient's p/w dyspnea.
* Treatment causes medical problem:
	+ Treatment with MVR 27 years ago caused the patient's AS.
	+ Treatment with PPM caused the patient's p/w dyspnea.
* Treatment is administered for medical problem:
	+ Treatment with MVR was administered to the patient for their AS.
	+ Treatment with PPM was administered to the patient for their p/w dyspnea.
* Treatment is not administered because of medical problem:
	+ Treatment with MVR was not administered to the patient due to their AS.
	+ Treatment with PPM was not administered to the patient due to their p/w dyspnea.
* Test reveals medical problem:
	+ The test revealed the patient's AS.
	+ The test revealed the patient's p/w dyspnea.
* Test conducted to investigate medical problem:
	+ The test was conducted to investigate the patient's AS.
	+ The test was conducted to investigate the patient's p/w dyspnea.
* Medical problem indicates medical problem:
	+ The patient's AS indicates their AS.
	+ The patient's p/w dyspnea indicates their p/w dyspnea.
* No Relations:
	+ There is no relationship between the patient's AS and p/w dyspnea.

Note: The relationships are based on the information provided in the input and may not be applicable to all patients with the same medical conditions.

## evaluate performance using the BLEU (Bilingual Evaluation Understudy) score

In [12]:
from nltk.translate.bleu_score import sentence_bleu
from tqdm import tqdm


def evaluate(sample):
    
    instruction=sample["instruction"]
    context=sample["context"]
        
    if context:
        prompt = f"Below is an instruction that describes a task, paired with an input that provides further context.\n\n### Instruction: \n{instruction}\n\n### Input: \n{context}\n\n### Response: \n"
    else:
        prompt = f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction: \n{instruction}\n\n### Response: \n"
    inputs = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False).to("cuda:0")
    outputs = model.generate(**inputs)
    predicted_answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    return predicted_answer
    

# Initialize lists to store the true and predicted responses
true_responses = []
predicted_responses = []

# Iterate over eval dataset and predict
for s in tqdm(formatted_dataset["test"]):
    true_responses.append(s["text"])
    predicted_responses.append(evaluate(s))

# Compute average BLEU score
bleu_scores = [sentence_bleu([tr.split()], pr.split(), weights=(0.25, 0.25, 0.25, 0.25)) for tr, pr in zip(true_responses, predicted_responses)]
average_bleu = sum(bleu_scores) / len(bleu_scores)

print(f"Average BLEU Score: {average_bleu:.2f}")

100%|██████████| 500/500 [2:41:08<00:00, 19.34s/it]    


Average BLEU Score: 0.48


# **Llama2-MedTuned-7b trained model**
### evaluate performance

In [13]:
# Empty VRAM
del model
import gc
gc.collect()
gc.collect()

0

In [10]:
import torch
import transformers
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

#model_id = "nlpie/Llama2-MedTuned-7b"
model_id = "nlpie/Llama2-MedTuned-13b"

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    #load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    #device_map={"": 0}
    device_map="auto"
)

2024-04-07 19:24:53.730723: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-07 19:24:53.730784: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-07 19:24:53.735647: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

  return self.fget.__get__(instance, owner)()


generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

In [11]:
tokenizer = AutoTokenizer.from_pretrained("nlpie/Llama2-MedTuned-7b")

tokenizer_config.json:   0%|          | 0.00/895 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/96.0 [00:00<?, ?B/s]

In [33]:
from IPython.display import display, Markdown

def make_inference(instruction, context = None):
  if context:
    prompt = f"Below is an instruction that describes a task, paired with an input that provides further context.\n\n### Instruction: \n{instruction}\n\n### Input: \n{context}\n\n### Response: \n"
  else:
    prompt = f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction: \n{instruction}\n\n### Response: \n"
  inputs = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False).to("cuda:0")
  outputs = model.generate(**inputs, max_new_tokens=1000 ,temperature=0.2)
  display(Markdown((tokenizer.decode(outputs[0], skip_special_tokens=True))))

In [5]:
Instruction = """Your task is to determine the relationships between
medical problems, treatments, and tests within the
clinical text. Medical problems are marked as @problem$,
medical tests are marked as @test$, and treatments are
marked as @treatment$. Categorize the relationship
between two entities in the text as one of the following
options:
Treatment improves medical problem (TrIP)
Treatment worsens medical problem (TrWP)
Treatment causes medical problem (TrCP)
Treatment is administered for medical problem (TrAP)
Treatment is not administered because of medical
problem (TrNAP)
Test reveals medical problem (TeRP)
Test conducted to investigate medical problem (TeCP)
Medical problem indicates medical problem (PIP)
No Relations"""

Input = """This is an 83 y/o female with moderate AS , s/p MVR 27
years ago , s/p PPM , @problem$ , CHF , p/w dsypnea
and @problem$ x 4-5 days"""

make_inference(Instruction, Input)

Below is an instruction that describes a task, paired with an input that provides further context.

### Instruction: 
Your task is to determine the relationships between
medical problems, treatments, and tests within the
clinical text. Medical problems are marked as @problem$,
medical tests are marked as @test$, and treatments are
marked as @treatment$. Categorize the relationship
between two entities in the text as one of the following
options:
Treatment improves medical problem (TrIP)
Treatment worsens medical problem (TrWP)
Treatment causes medical problem (TrCP)
Treatment is administered for medical problem (TrAP)
Treatment is not administered because of medical
problem (TrNAP)
Test reveals medical problem (TeRP)
Test conducted to investigate medical problem (TeCP)
Medical problem indicates medical problem (PIP)
No Relations

### Input: 
This is an 83 y/o female with moderate AS , s/p MVR 27
years ago , s/p PPM , @problem$ , CHF , p/w dsypnea
and @problem$ x 4-5 days

### Response: 
No Relations 


In [12]:
from nltk.translate.bleu_score import sentence_bleu
from tqdm import tqdm


def evaluate(sample):
    
    instruction=sample["instruction"]
    context=sample["context"]
        
    if context:
        prompt = f"Below is an instruction that describes a task, paired with an input that provides further context.\n\n### Instruction: \n{instruction}\n\n### Input: \n{context}\n\n### Response: \n"
    else:
        prompt = f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction: \n{instruction}\n\n### Response: \n"
    inputs = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False).to("cuda:0")
    outputs = model.generate(**inputs)
    predicted_answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    return predicted_answer
    

# Initialize lists to store the true and predicted responses
true_responses = []
predicted_responses = []

# Iterate over eval dataset and predict
for s in tqdm(formatted_dataset["test"]):
    true_responses.append(s["text"])
    predicted_responses.append(evaluate(s))

# Compute average BLEU score
bleu_scores = [sentence_bleu([tr.split()], pr.split(), weights=(0.25, 0.25, 0.25, 0.25)) for tr, pr in zip(true_responses, predicted_responses)]
average_bleu = sum(bleu_scores) / len(bleu_scores)

print(f"Average BLEU Score: {average_bleu:.2f}")

100%|██████████| 500/500 [4:28:01<00:00, 32.16s/it]  


Average BLEU Score: 0.94
