<a href="https://colab.research.google.com/github/azzindani/03_LLM_Data_Preprocess/blob/main/00_LLM_Dataset_Preprocess_v7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import Modules

In [27]:
!pip install -q datasets transformers

In [28]:
import torch

from datasets import load_dataset
from datasets import Dataset

from transformers import (
  AutoTokenizer,
  AutoModelForCausalLM,
  GenerationConfig
)

if torch.cuda.is_available():
  print('GPU is available!')
else:
  print('GPU is not available.')

GPU is not available.


In [29]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cpu')

## Import Model Tokenizer

In [30]:
#url = 'https://huggingface.co/Qwen/Qwen2.5-0.5B'
#model_name = url.split('.co/')[-1]

model_name = 'unsloth/Llama-3.2-1B-Instruct'

In [31]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
#tokenizer

## Import Dataset

In [32]:
#url = 'https://huggingface.co/datasets/KingNish/reasoning-base-20k'
#dataset_name = url.split('datasets/')[-1]

dataset_name = 'shireesh-uop/nhs_qna'

In [39]:
max_length = 2048

In [34]:
dataset = load_dataset(dataset_name, split = 'train')
dataset

Dataset({
    features: ['q', 'i', 'a', 'filename', 'context', 'count', 'text', 'text1'],
    num_rows: 21281
})

In [35]:
dataset.select(range(5)).to_pandas().head()

| | q | i | a | filename | context | count | text | text1 |
|---|---|---|---|---|---|---|---|---|
| 0 | What are the main problems that affect the mit... | `<noinput>` | The main problems that affect the mitral valve... | Mitral valve problem | The mitral valve is a small flap in the heart... | 21 | `\n <s>\n [INST] <<SYS>> The mitral valve...` | `\n <s>\n [INST]What are the main problem...` |
| 1 | What are the symptoms of mitral valve prolapse? | `<noinput>` | Symptoms of mitral valve prolapse can include ... | Mitral valve problem | The mitral valve is a small flap in the heart... | 21 | `\n <s>\n [INST] <<SYS>> The mitral valve...` | `\n <s>\n [INST]What are the symptoms of ...` |
| 2 | How is mitral valve prolapse treated? | `<noinput>` | Treatment for mitral valve prolapse may involv... | Mitral valve problem | The mitral valve is a small flap in the heart... | 21 | `\n <s>\n [INST] <<SYS>> The mitral valve...` | `\n <s>\n [INST]How is mitral valve prola...` |
| 3 | What can cause mitral valve prolapse? | `<noinput>` | Mitral valve prolapse is usually caused by pro... | Mitral valve problem | The mitral valve is a small flap in the heart... | 21 | `\n <s>\n [INST] <<SYS>> The mitral valve...` | `\n <s>\n [INST]What can cause mitral val...` |
| 4 | What is mitral regurgitation? | `<noinput>` | Mitral regurgitation is a condition where the ... | Mitral valve problem | The mitral valve is a small flap in the heart... | 21 | `\n <s>\n [INST] <<SYS>> The mitral valve...` | `\n <s>\n [INST]What is mitral regurgitat...` |


In [36]:
dataset[0]

{'q': 'What are the main problems that affect the mitral valve?',
 'i': '<noinput>',
 'a': 'The main problems that affect the mitral valve are mitral valve prolapse, mitral regurgitation, and mitral stenosis.',
 'filename': 'Mitral valve problem',
 'context': ' The mitral valve is a small flap in the heart that stops blood flowing the wrong way. Problems with it can affect how blood flows around the body.The main problems that affect the mitral valve are:mitral valve prolapse – the valve becomes too floppymitral regurgitation – the valve leaks and blood flows the wrong waymitral stenosis – the valve does not open as wide as it shouldThese conditions can be serious, but they\'re often treatable.In some cases, mitral valve surgery may be needed. \nMitral valve prolapse\n Mitral valve prolapseMitral valve prolapse is where the mitral valve is too floppy and does not close tightly.SymptomsMany people with a mitral valve prolapse do not have symptoms and it may only be spotted during a hear

In [37]:
features = list(dataset.features.keys())
print(features)

['q', 'i', 'a', 'filename', 'context', 'count', 'text', 'text1']


## Dataset Preprocess

In [40]:
prompt_format = """
Below is an medical questions that explain personal condition of the questioner, paired with the context that explains detailed about medical conditions. Write a answer that appropriately giving medical main problems.

### Question:
{}

### Context:
{}

### Answer:
{}"""

In [41]:
EOS_TOKEN = tokenizer.eos_token

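def preprocess(example):
  # Called once per row (map is not batched here), so each field is a single string.
  question = example['q']
  context = example['context']
  answer = example['a']

  # Fill the template and append EOS so the model can learn where an answer ends.
  text = prompt_format.format(question, context, answer) + EOS_TOKEN
  return {'prompt' : text}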

In [42]:
formatted_dataset = dataset.map(
  preprocess,
  remove_columns = list(dataset.features.keys())
)
formatted_dataset

Dataset({
    features: ['prompt'],
    num_rows: 21281
})
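
The `preprocess` function above formats one row per call. `Dataset.map` also accepts `batched = True`, in which case every column arrives as a list of values. The sketch below shows what a batched variant might look like; `preprocess_batched`, the toy template, and the toy batch are illustrative names and data, not part of the notebook.

```python
def preprocess_batched(examples, template, eos_token):
    """Format a batch of rows; `examples` maps column names to lists."""
    prompts = [
        template.format(q, context, a) + eos_token
        for q, context, a in zip(examples['q'], examples['context'], examples['a'])
    ]
    return {'prompt': prompts}

# Toy template and batch, for illustration only.
toy_template = "### Question:\n{}\n\n### Context:\n{}\n\n### Answer:\n{}"
toy_batch = {
    'q': ['What is A?', 'What is B?'],
    'context': ['A is first.', 'B is second.'],
    'a': ['First.', 'Second.'],
}

result = preprocess_batched(toy_batch, toy_template, eos_token='<eos>')
print(result['prompt'][0])
```

With the notebook's own objects, this could presumably be wired in as `dataset.map(lambda batch: preprocess_batched(batch, prompt_format, EOS_TOKEN), batched = True, remove_columns = list(dataset.features.keys()))`.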

In [43]:
print(formatted_dataset[0]['prompt'])


Below is an medical questions that explain personal condition of the questioner, paired with the context that explains detailed about medical conditions. Write a answer that appropriately giving medical main problems.

### Question:
What are the main problems that affect the mitral valve?

### Context:
 The mitral valve is a small flap in the heart that stops blood flowing the wrong way. Problems with it can affect how blood flows around the body.The main problems that affect the mitral valve are:mitral valve prolapse – the valve becomes too floppymitral regurgitation – the valve leaks and blood flows the wrong waymitral stenosis – the valve does not open as wide as it shouldThese conditions can be serious, but they're often treatable.In some cases, mitral valve surgery may be needed. 
Mitral valve prolapse
 Mitral valve prolapseMitral valve prolapse is where the mitral valve is too floppy and does not close tightly.SymptomsMany people with a mitral valve prolapse do not have symptoms

## Tokenization

In [44]:
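def tokenize_data(examples, max_length = max_length):
  # Called with batched = True, so examples['prompt'] is a list of strings.
  # Every row is truncated or padded to exactly max_length tokens.
  return tokenizer(examples['prompt'], truncation = True, padding = 'max_length', max_length = max_length)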

In [45]:
tokenized_dataset = formatted_dataset.map(tokenize_data, batched = True)
tokenized_dataset

Dataset({
    features: ['prompt', 'input_ids', 'attention_mask'],
    num_rows: 21281
})
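
With `truncation = True`, any prompt longer than `max_length` tokens is cut silently. If the tokenizer truncates on the right (the usual default), the cut removes the end of the prompt, which is where the answer and the EOS token sit. The sketch below summarizes how many rows would be affected. `truncation_report` is a hypothetical helper, and the token counts are made up rather than measured from this dataset.

```python
def truncation_report(token_lengths, max_length):
    """Summarize how many sequences exceed max_length tokens."""
    total = len(token_lengths)
    truncated = sum(1 for n in token_lengths if n > max_length)
    return {
        'total': total,
        'truncated': truncated,
        'truncated_share': truncated / total if total else 0.0,
        'longest': max(token_lengths, default=0),
    }

# Hypothetical token counts, not measured from the dataset.
toy_lengths = [120, 950, 2048, 2300, 4100]
report = truncation_report(toy_lengths, max_length=2048)
print(report)
```

Real lengths could be collected by tokenizing each prompt without padding or truncation, for example `len(tokenizer(text)['input_ids'])`, and passing the resulting list to the helper.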

In [46]:
print(tokenized_dataset[0]['prompt'])


Below is an medical questions that explain personal condition of the questioner, paired with the context that explains detailed about medical conditions. Write a answer that appropriately giving medical main problems.

### Question:
What are the main problems that affect the mitral valve?

### Context:
 The mitral valve is a small flap in the heart that stops blood flowing the wrong way. Problems with it can affect how blood flows around the body.The main problems that affect the mitral valve are:mitral valve prolapse – the valve becomes too floppymitral regurgitation – the valve leaks and blood flows the wrong waymitral stenosis – the valve does not open as wide as it shouldThese conditions can be serious, but they're often treatable.In some cases, mitral valve surgery may be needed. 
Mitral valve prolapse
 Mitral valve prolapseMitral valve prolapse is where the mitral valve is too floppy and does not close tightly.SymptomsMany people with a mitral valve prolapse do not have symptoms

In [47]:
print(tokenized_dataset[0]['input_ids'])

[128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004,

In [48]:
print(tokenized_dataset[0]['attention_mask'])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
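
The two outputs above start with a run of `128004` ids whose attention mask is `0`. This is padding added on the left to bring the row up to `max_length`. The sketch below reproduces that layout in plain Python. `pad_left` is a hypothetical helper and the toy ids are made up; only the pad id `128004` is taken from the output above.

```python
def pad_left(token_ids, max_length, pad_id):
    """Left-pad (or right-truncate) token_ids to exactly max_length."""
    token_ids = token_ids[:max_length]
    n_pad = max_length - len(token_ids)
    input_ids = [pad_id] * n_pad + token_ids
    # 0 marks padding positions, 1 marks real tokens.
    attention_mask = [0] * n_pad + [1] * len(token_ids)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

toy_ids = [101, 102, 103]
encoded = pad_left(toy_ids, max_length=8, pad_id=128004)
print(encoded['input_ids'])
print(encoded['attention_mask'])
print(sum(encoded['attention_mask']))  # number of real (non-padding) tokens
```

Summing a row's `attention_mask` in the same way gives the number of real tokens in that row of `tokenized_dataset`.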

## Prompt Test

In [49]:
model = AutoModelForCausalLM.from_pretrained(
  model_name,
  torch_dtype = torch.float16,
  trust_remote_code = True
).to(device)

In [50]:
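def assistant(prompt):
  # Fill only the question slot; context and answer stay empty so the model writes the answer.
  inputs = tokenizer(
    [prompt_format.format(prompt, "", "")],
    return_tensors = 'pt'
  ).to(device)
  # top_k = 1 leaves a single candidate token, so decoding is effectively greedy despite do_sample = True.
  generation_config = GenerationConfig(
    do_sample = True,
    top_k = 1,
    temperature = 0.1,
    max_new_tokens = 1024,
    pad_token_id = tokenizer.eos_token_id
  )
  outputs = model.generate(
    **inputs,
    generation_config = generation_config
  )
  # The decoded text contains the prompt followed by the generated answer.
  print(tokenizer.decode(outputs[0], skip_special_tokens = True))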
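
The generation config above combines `do_sample = True` with `top_k = 1`. Top-k filtering keeps only the most likely token, so sampling has a single candidate left and the temperature no longer matters. The sketch below illustrates this in plain Python. It is not the `transformers` implementation; `sample_top_k` and the toy logits are hypothetical.

```python
import math
import random

def sample_top_k(logits, top_k, temperature, rng):
    """Sample a token index after temperature scaling and top-k filtering."""
    scaled = [x / temperature for x in logits]
    # Keep the indices of the top_k largest scaled logits.
    kept = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax weights over the kept logits (shifted by the max for stability).
    peak = max(scaled[i] for i in kept)
    weights = [math.exp(scaled[i] - peak) for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

toy_logits = [1.2, 3.4, 0.5, 2.9]  # made-up scores for four tokens
rng = random.Random(0)
picks = {sample_top_k(toy_logits, top_k=1, temperature=0.1, rng=rng) for _ in range(100)}
print(picks)  # only the index of the largest logit is ever drawn
```

Under this reading, setting `do_sample = False` should select the same tokens as the current configuration.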

In [52]:
prompt = dataset[0]['q']
assistant(prompt)


Below is an medical questions that explain personal condition of the questioner, paired with the context that explains detailed about medical conditions. Write a answer that appropriately giving medical main problems.

### Question:
What are the main problems that affect the mitral valve?

### Context:


### Answer:
The main problems that affect the mitral valve are:

1. **Mitral Valve Prolapse (MVP)**: This is the most common condition affecting the mitral valve. It occurs when the mitral valve leaflets bulge back into the left atrium during systole, causing the valve to prolapse. This can lead to symptoms such as chest pain, shortness of breath, and fatigue.

2. **Mitral Valve Stenosis (MVS)**: This condition is characterized by the narrowing of the mitral valve opening, which restricts blood flow from the left atrium to the left ventricle. This can cause symptoms such as shortness of breath, fatigue, and swelling in the legs and feet.

3. **Mitral Valve Regurgitation (MVR)**: This 

In [51]:
print(dataset[0]['a'])

The main problems that affect the mitral valve are mitral valve prolapse, mitral regurgitation, and mitral stenosis.
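
The decoded output echoes the whole prompt before the generated answer, so a comparison with the reference needs the answer isolated first. The sketch below splits on the `### Answer:` marker from the template and computes a crude word-overlap score. Both helpers and the toy strings are hypothetical, and the overlap score is only an illustration, not an evaluation method used in this notebook.

```python
def extract_answer(decoded_text, marker='### Answer:'):
    """Return the text after the last answer marker, or '' if it is absent."""
    _, found, tail = decoded_text.rpartition(marker)
    return tail.strip() if found else ''

def token_overlap(candidate, reference):
    """Share of distinct reference words that also appear in the candidate."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    return len(cand & ref) / len(ref) if ref else 0.0

# Toy strings, for illustration only.
toy_decoded = "### Question:\nWhat is A?\n\n### Context:\n\n\n### Answer:\nA is the first letter."
toy_reference = "A is the first letter of the alphabet."

answer = extract_answer(toy_decoded)
print(answer)
print(token_overlap(answer, toy_reference))
```

To apply this to the cells above, `assistant` would need to return the decoded string instead of only printing it.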
