<a href="https://colab.research.google.com/github/azzindani/03_LLM_Data_Preprocess/blob/main/00_LLM_Dataset_Preprocess_v5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import Modules

In [1]:
!pip install -q datasets

In [2]:
import torch

from datasets import load_dataset
from datasets import Dataset

from transformers import (
  AutoTokenizer,
  AutoModelForCausalLM,
  GenerationConfig
)

if torch.cuda.is_available():
  print('GPU is available!')
else:
  print('GPU is not available.')

GPU is not available.


In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cpu')

## Import Model Tokenizer

In [4]:
#url = 'https://huggingface.co/Qwen/Qwen2.5-0.5B'
#model_name = url.split('.co/')[-1]

model_name = 'unsloth/Llama-3.2-1B-Instruct'

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
#tokenizer


## Import Dataset

In [6]:
#url = 'https://huggingface.co/datasets/KingNish/reasoning-base-20k'
#dataset_name = url.split('datasets/')[-1]

dataset_name = 'keivalya/MedQuad-MedicalQnADataset'

In [7]:
max_length = 1024

In [8]:
dataset = load_dataset(dataset_name, split = 'train')
dataset

Dataset({
    features: ['qtype', 'Question', 'Answer'],
    num_rows: 16407
})

In [9]:
dataset.select(range(5)).to_pandas().head()

| | qtype | Question | Answer |
|---|---|---|---|
| 0 | susceptibility | Who is at risk for Lymphocytic Choriomeningiti... | LCMV infections can occur after exposure to fr... |
| 1 | symptoms | What are the symptoms of Lymphocytic Choriomen... | LCMV is most commonly recognized as causing ne... |
| 2 | susceptibility | Who is at risk for Lymphocytic Choriomeningiti... | Individuals of all ages who come into contact ... |
| 3 | exams and tests | How to diagnose Lymphocytic Choriomeningitis (... | During the first phase of the disease, the mos... |
| 4 | treatment | What are the treatments for Lymphocytic Chorio... | Aseptic meningitis, encephalitis, or meningoen... |


In [10]:
dataset[0]

{'qtype': 'susceptibility',
 'Question': 'Who is at risk for Lymphocytic Choriomeningitis (LCM)? ?',
 'Answer': 'LCMV infections can occur after exposure to fresh urine, droppings, saliva, or nesting materials from infected rodents.  Transmission may also occur when these materials are directly introduced into broken skin, the nose, the eyes, or the mouth, or presumably, via the bite of an infected rodent. Person-to-person transmission has not been reported, with the exception of vertical transmission from infected mother to fetus, and rarely, through organ transplantation.'}

In [11]:
features = list(dataset.features.keys())
print(features)

['qtype', 'Question', 'Answer']


## Dataset Preprocess

In [12]:
prompt_format = """
Below is an instruction that describes a questions, paired with an question type that provides further classification. Write a response that appropriately answer the question.

### Question:
{}

### Type:
{}

### Answer:
{}"""

In [13]:
EOS_TOKEN = tokenizer.eos_token

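def preprocess(example):
  # map() is not batched here, so each field holds a single string.
  question = example['Question']
  qtype = example['qtype']
  answer = example['Answer']

  # Fill the template and append EOS so the model learns where an answer ends.
  text = prompt_format.format(question, qtype, answer) + EOS_TOKEN
  return {'prompt' : text}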
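
The formatting step can be tried on a single hand-written record without loading the dataset or the tokenizer. This is a minimal sketch, not the notebook's own code: `DEMO_TEMPLATE` is a shortened variant of `prompt_format`, `DEMO_EOS` is assumed to match the `<|eot_id|>` marker, and `format_record` and `demo_record` are hypothetical names with placeholder values.

```python
# Hypothetical stand-ins: a shortened template and a literal end-of-sequence marker.
DEMO_TEMPLATE = """### Question:
{}

### Type:
{}

### Answer:
{}"""
DEMO_EOS = "<|eot_id|>"  # assumption: the EOS string of the tokenizer in use


def format_record(record, eos=DEMO_EOS):
    """Fill the template with one record and append the end-of-sequence marker."""
    text = DEMO_TEMPLATE.format(record["Question"], record["qtype"], record["Answer"])
    return {"prompt": text + eos}


# Placeholder record with the same three columns as the dataset.
demo_record = {
    "qtype": "example-type",
    "Question": "Example question?",
    "Answer": "Example answer.",
}
demo_prompt = format_record(demo_record)["prompt"]
print(demo_prompt)
```

Because the marker is appended after the answer, every training example ends with an explicit stop signal.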

In [14]:
formatted_dataset = dataset.map(
  preprocess,
  remove_columns = list(dataset.features.keys())
)
formatted_dataset

Dataset({
    features: ['prompt'],
    num_rows: 16407
})

In [15]:
print(formatted_dataset[0]['prompt'])


Below is an instruction that describes a questions, paired with an question type that provides further classification. Write a response that appropriately answer the question.

### Question:
Who is at risk for Lymphocytic Choriomeningitis (LCM)? ?

### Type:
susceptibility

### Answer:
LCMV infections can occur after exposure to fresh urine, droppings, saliva, or nesting materials from infected rodents.  Transmission may also occur when these materials are directly introduced into broken skin, the nose, the eyes, or the mouth, or presumably, via the bite of an infected rodent. Person-to-person transmission has not been reported, with the exception of vertical transmission from infected mother to fetus, and rarely, through organ transplantation.<|eot_id|>


## Tokenization

In [16]:
def tokenize_data(example, max_length = max_length):
  # Truncate long prompts and pad short ones so every row has exactly max_length tokens.
  return tokenizer(example['prompt'], truncation = True, padding = 'max_length', max_length = max_length)
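
The effect of `truncation = True` with `padding = 'max_length'` can be illustrated on plain Python lists. The sketch below is an approximation, not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, left padding is assumed because the printed `input_ids` in this notebook begin with the padding id, and `128004` is taken from that printed output.

```python
def pad_or_truncate(token_ids, max_length, pad_id, padding_side="left"):
    """Mimic truncation plus max-length padding on a plain list of token ids."""
    ids = token_ids[:max_length]          # truncation keeps the first max_length ids
    n_pad = max_length - len(ids)         # number of padding positions to add
    if padding_side == "left":
        input_ids = [pad_id] * n_pad + ids
        attention_mask = [0] * n_pad + [1] * len(ids)
    else:
        input_ids = ids + [pad_id] * n_pad
        attention_mask = [1] * len(ids) + [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_or_truncate([11, 12, 13], max_length=6, pad_id=128004)
print(padded["input_ids"])       # [128004, 128004, 128004, 11, 12, 13]
print(padded["attention_mask"])  # [0, 0, 0, 1, 1, 1]
```

Positions marked `0` in the attention mask are padding and are ignored by the model's attention.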

In [17]:
tokenized_dataset = formatted_dataset.map(tokenize_data, batched = True)
tokenized_dataset

Dataset({
    features: ['prompt', 'input_ids', 'attention_mask'],
    num_rows: 16407
})

In [18]:
print(tokenized_dataset[0]['prompt'])


Below is an instruction that describes a questions, paired with an question type that provides further classification. Write a response that appropriately answer the question.

### Question:
Who is at risk for Lymphocytic Choriomeningitis (LCM)? ?

### Type:
susceptibility

### Answer:
LCMV infections can occur after exposure to fresh urine, droppings, saliva, or nesting materials from infected rodents.  Transmission may also occur when these materials are directly introduced into broken skin, the nose, the eyes, or the mouth, or presumably, via the bite of an infected rodent. Person-to-person transmission has not been reported, with the exception of vertical transmission from infected mother to fetus, and rarely, through organ transplantation.<|eot_id|>


In [19]:
print(tokenized_dataset[0]['input_ids'])#[-10:]

[128004, 128004, 128004, 128004, 128004, ... (output truncated)

In [20]:
print(tokenized_dataset[0]['attention_mask'])#[-10:]

[0, 0, 0, 0, 0, ... (output truncated)
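
Printing every position of a padded row is hard to read, so a short summary of the mask may be more informative. The following is a minimal sketch; `summarize_mask` is a hypothetical helper and `example_mask` is a made-up mask, not data from the notebook.

```python
def summarize_mask(attention_mask):
    """Count real and padding positions in one attention mask."""
    real = sum(attention_mask)  # real tokens are marked with 1
    return {
        "length": len(attention_mask),
        "real_tokens": real,
        "padding": len(attention_mask) - real,
    }


example_mask = [0] * 5 + [1] * 3  # hypothetical left-padded mask
print(summarize_mask(example_mask))
# {'length': 8, 'real_tokens': 3, 'padding': 5}
```

With the notebook's objects, the same helper could be called as `summarize_mask(tokenized_dataset[0]['attention_mask'])`.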

## Prompt Test

In [21]:
# Load the weights in half precision to reduce memory use, then move the model to the selected device.
model = AutoModelForCausalLM.from_pretrained(
  model_name,
  torch_dtype = torch.float16,
  trust_remote_code = True
).to(device)

In [22]:
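def assistant(prompt):
  # Fill only the question slot; type and answer stay empty so the model writes the answer.
  inputs = tokenizer(
    [prompt_format.format(prompt, "", "")],
    return_tensors = 'pt'
  ).to(device)
  # top_k = 1 keeps only the most likely token, so sampling is effectively greedy.
  generation_config = GenerationConfig(
    do_sample = True,
    top_k = 1,
    temperature = 0.1,
    max_new_tokens = 1024,
    pad_token_id = tokenizer.eos_token_id
  )
  outputs = model.generate(
    **inputs,
    generation_config = generation_config
  )
  # Decode the full sequence (prompt plus completion) and print it.
  print(tokenizer.decode(outputs[0], skip_special_tokens = True))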
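
The decoded text contains the prompt as well as the completion, so it can help to isolate the part after the answer header. This is a minimal sketch on a plain string: `extract_answer` is a hypothetical helper, and `decoded_example` is a made-up string that only imitates the layout of `prompt_format`.

```python
ANSWER_MARKER = "### Answer:"


def extract_answer(decoded_text, marker=ANSWER_MARKER):
    """Return the text after the first answer marker, or the whole text if it is absent."""
    _, found, tail = decoded_text.partition(marker)
    return tail.strip() if found else decoded_text.strip()


# Made-up decoded text that follows the template layout.
decoded_example = (
    "### Question:\nExample question?\n\n"
    "### Type:\n\n\n"
    "### Answer:\nExample answer text."
)
print(extract_answer(decoded_example))  # Example answer text.
```

Splitting on the first marker keeps any later sections the model writes after the answer, such as an extra header, in the returned text.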

In [25]:
prompt = dataset[0]['Question']
assistant(prompt)


Below is an instruction that describes a questions, paired with an question type that provides further classification. Write a response that appropriately answer the question.

### Question:
Who is at risk for Lymphocytic Choriomeningitis (LCM)??

### Type:


### Answer:
Lymphocytic Choriomeningitis (LCM) is a viral infection that primarily affects children and young adults. It is caused by the LCM virus, which is a type of enterovirus. The virus is transmitted through the fecal-oral route, where it is ingested through contaminated food or water. The main groups at risk for LCM include:

*   Children under the age of 5
*   Young adults (20-30 years old)
*   Pregnant women
*   People with weakened immune systems, such as those with HIV/AIDS or undergoing chemotherapy

### Additional Information:
Lymphocytic Choriomeningitis (LCM) is a serious and potentially life-threatening illness. Symptoms include fever, headache, and muscle aches, which can progress to more severe complications suc

In [24]:
print(dataset[0]['Answer'])

LCMV infections can occur after exposure to fresh urine, droppings, saliva, or nesting materials from infected rodents.  Transmission may also occur when these materials are directly introduced into broken skin, the nose, the eyes, or the mouth, or presumably, via the bite of an infected rodent. Person-to-person transmission has not been reported, with the exception of vertical transmission from infected mother to fetus, and rarely, through organ transplantation.
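
Comparing the generated text with the reference answer by eye does not scale beyond a few rows. One rough option is a token-overlap F1 score; the sketch below assumes whitespace tokenization and lowercasing, `token_f1` is a hypothetical helper, and the two strings are short made-up examples rather than notebook outputs.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between two strings, using lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("rodents spread the virus", "the virus is spread by infected rodents")
print(round(score, 3))  # 0.727
```

A score like this only measures word overlap, so it is a coarse signal and not a check of medical correctness.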
