This notebook runs batch inference on a medical healthcare question/answer dataset using a LLaMA model.

https://huggingface.co/datasets/Kabatubare/medical

After inference, the LLaMA answers are evaluated with BLEU/ROUGE against the ground-truth answers.

In [1]:
import os

In [2]:
os.environ["CUDA_VISIBLE_DEVICES"] = "7"

In [3]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from tqdm import tqdm
import pandas as pd
import numpy as np
import pickle



### Get the Medical Healthcare dataset

In [4]:
dataset_name = "Kabatubare/medical"
dataset = load_dataset(dataset_name, split="all")
dataset

Dataset({
    features: ['Context', 'Question', 'Answer'],
    num_rows: 23437
})

Reading one instance of the dataset

In [5]:
dataset[0]

{'Context': 'You are a medical knowledge assistant trained to provide information and guidance on various health-related topics.',
 'Question': 'my 5 1/2-year-old son displays adhd symptoms for 20 days then for 10 days he has none. is it adhd or another condition?',
 'Answer': "adhd and bipolar mood disorder (bmd) can coexist or be mistaken for one another. bmd usually is not diagnosed until young adulthood. however studies have shown that the earlier a person is diagnosed with bmd the more likely he is to have been diagnosed with adhd previously. in this case i would just like to reiterate that there is not enough information to discuss either possibility for your son. you mentioned that he becomes hyperactive for 3 weeks but not what his behaviors are like during those 10 days. you also do not mention irritability or mood swings just adhd symptoms. keep documenting the symptoms you are concerned about including what goes on in the home and at school when you see changes in behavior (

### Loading the LLaMA Model

In [6]:
with open('hf_token.key', 'r') as f:
    hf_token = f.read().strip()  # strip any trailing newline from the key file

model_id = "meta-llama/Llama-3.2-3B-Instruct"

Initializing the tokenizer

In [7]:
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
tokenizer.model_max_length = 4096
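
Left padding matters for batched generation with a decoder-only model: generation continues from the last position of each row, so the real prompt tokens need to sit at the right edge and the padding at the left. The following is a minimal sketch of that layout in plain Python. The helper `left_pad` and the token ids are hypothetical and only illustrate what `padding_side = "left"` produces; the tokenizer does the real work.

```python
from typing import List


def left_pad(sequences: List[List[int]], pad_id: int) -> List[List[int]]:
    """Pad every sequence on the left so all rows share the longest length."""
    longest = max(len(seq) for seq in sequences)
    return [[pad_id] * (longest - len(seq)) + seq for seq in sequences]


# Hypothetical token ids; 0 stands in for the pad token.
batch = left_pad([[11, 12, 13], [21]], pad_id=0)
print(batch)
```

Every row now ends with a real prompt token, so newly generated tokens follow the prompt directly instead of following padding.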

Loading the model shards

In [8]:
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto") # Must be float32 for MacBooks!
model.config.pad_token_id = tokenizer.pad_token_id # Point the model config at the pad token (here, the EOS token)

Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.02s/it]


Defining the `eos` and `<|eot_id|>` tokens as the terminators that end generation

In [9]:
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

Defining the function to get the LLaMA Responses

In [10]:
def get_llama_response(question_inputs: list[str]) -> list[str]:
    """Generate one LLaMA answer per question in the batch."""
    
    llama_inputs = [[{"role": "system", "content": "You are a medical knowledge assistant trained to provide information and guidance on various health-related topics."},
                     {"role": "user", "content": question}] for question in question_inputs]

    texts = tokenizer.apply_chat_template(llama_inputs, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(texts, padding="longest", truncation=True, return_tensors="pt")
    inputs = {key: val.to(model.device) for key, val in inputs.items()}
    temp_texts = tokenizer.batch_decode(inputs['input_ids'], skip_special_tokens=True)
    
    gen_tokens = model.generate(
        **inputs, 
        max_new_tokens=4096, 
        pad_token_id=tokenizer.pad_token_id, 
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.7,
        # top_p=0.9
    )

    gen_text = tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)
    gen_text = [i[len(temp_texts[idx]):] for idx, i in enumerate(gen_text)]
    
    return gen_text
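
The function above removes the prompt by comparing decoded string lengths, which depends on the decoded prompt matching the start of the decoded output character for character. A possible alternative is to cut at the token level: for a decoder-only model, the output of `generate` starts with the (padded) input ids, so everything after the input length is new. The sketch below shows the idea on plain lists. The helper `strip_prompt` and the token ids are hypothetical, and in the cell above this would roughly correspond to slicing `gen_tokens[:, inputs["input_ids"].shape[1]:]` before calling `batch_decode`.

```python
from typing import List


def strip_prompt(generated: List[List[int]], prompt_len: int) -> List[List[int]]:
    """Drop the first `prompt_len` positions (padded prompt) from every row."""
    return [row[prompt_len:] for row in generated]


# Hypothetical ids: each row is a left-padded prompt of length 3
# followed by two newly generated tokens.
prompt_len = 3
generated = [
    [11, 12, 13, 101, 102],
    [0, 0, 21, 201, 202],
]
new_tokens = strip_prompt(generated, prompt_len)
print(new_tokens)
```

Because all prompts in a batch are padded to the same length, a single cut position works for every row.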

Preparing the batches of data

In [None]:
batch_size = 100
dataset_questions = dataset['Question']
dataset_answers = dataset['Answer']
batch_indices = np.arange(0, len(dataset_questions), batch_size)
if batch_indices[-1] != len(dataset_questions):
    batch_indices = np.append(batch_indices, len(dataset_questions))
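
To make the boundary logic concrete, here is a small pure-Python sketch that produces the same (start, end) slices as consecutive pairs of `batch_indices`. The helper name `batch_bounds` is hypothetical and not part of the notebook.

```python
from typing import List, Tuple


def batch_bounds(n_items: int, batch_size: int) -> List[Tuple[int, int]]:
    """Return (start, end) slice bounds covering n_items in order."""
    starts = range(0, n_items, batch_size)
    # The final batch is clipped to n_items, so it may be smaller.
    return [(start, min(start + batch_size, n_items)) for start in starts]


bounds = batch_bounds(250, 100)
print(bounds)

# With the dataset size reported earlier (23437 rows) and batch_size = 100:
full = batch_bounds(23437, 100)
print(len(full), full[-1])
```

The last batch holds only the remaining 37 rows, which is why the end index is appended explicitly in the cell above.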

Running the batch inference to get responses from LLaMA

Run the following cell only when new inferences are needed. Otherwise, keep it commented out and move on to the next section for evaluation.

In [14]:
question_list = []
answer_list = []
llama_responses = []
for i in tqdm(range(0, len(batch_indices) - 1)):
    questions_input = dataset_questions[batch_indices[i]:batch_indices[i+1]]
    orig_answers = dataset_answers[batch_indices[i]:batch_indices[i+1]]  # ground-truth answers for the same slice
    llama_resp = get_llama_response(questions_input)
    
    # Extend in place instead of rebuilding the lists on every batch
    question_list.extend(questions_input)
    answer_list.extend(orig_answers)
    llama_responses.extend(llama_resp)

os.makedirs('kabatubare_medical', exist_ok=True)  # make sure the output folder exists

with open('kabatubare_medical/questions.pkl', 'wb') as file:
    pickle.dump(question_list, file)
    
with open('kabatubare_medical/answers.pkl', 'wb') as file:
    pickle.dump(answer_list, file)

with open('kabatubare_medical/llama_resp.pkl', 'wb') as file:
    pickle.dump(llama_responses, file)

100%|██████████| 1/1 [01:56<00:00, 116.94s/it]


### Evaluating the LLaMA Responses

Reading the saved questions, answers and LLaMA responses for evaluation

In [15]:
with open('kabatubare_medical/questions.pkl', 'rb') as file:
    question_list = pickle.load(file)
    
with open('kabatubare_medical/answers.pkl', 'rb') as file:
    answer_list = pickle.load(file)

with open('kabatubare_medical/llama_resp.pkl', 'rb') as file:
    llama_responses = pickle.load(file)
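
As a rough illustration of what an overlap metric measures, the sketch below computes a unigram-overlap F1 between a reference answer and a candidate answer. This is a simplified stand-in in the spirit of ROUGE-1, not the official ROUGE or BLEU implementation: it assumes lowercasing and whitespace tokenization, applies no stemming, and the function name `unigram_f1` and the example strings are hypothetical.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """F1 over overlapping unigrams, with counts clipped per token."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = unigram_f1("rest and drink fluids", "drink fluids and sleep")
print(score)
```

Applied to the loaded lists, one could compute such a score for each `(answer_list[i], llama_responses[i])` pair and average the results; a maintained metric package would be the safer choice for reported numbers.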