<a href="https://colab.research.google.com/github/azzindani/03_LLM_Data_Preprocess/blob/main/00_LLM_Dataset_Preprocess_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import Modules

In [1]:
!pip install -q datasets

In [34]:
import torch

from datasets import load_dataset
from datasets import Dataset

from transformers import (
  AutoTokenizer,
  AutoModelForCausalLM,
  GenerationConfig
)

if torch.cuda.is_available():
  print('GPU is available!')
else:
  print('GPU is not available.')

GPU is not available.


In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cpu')

## Import Model Tokenizer

In [4]:
#url = 'https://huggingface.co/Qwen/Qwen2.5-0.5B'
#model_name = url.split('.co/')[-1]

model_name = 'unsloth/Llama-3.2-1B-Instruct'

In [42]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
#tokenizer

## Import Dataset

In [6]:
#url = 'https://huggingface.co/datasets/KingNish/reasoning-base-20k'
#dataset_name = url.split('datasets/')[-1]

dataset_name = 'yahma/alpaca-cleaned'

In [7]:
max_length = 1024
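
One way to sanity-check a `max_length` such as the 1024 chosen here is to measure how many examples would exceed it. The sketch below is not part of the original notebook: `truncation_report` is a hypothetical helper, and whitespace splitting is only a stand-in for real token counting.

```python
# Hypothetical helper, shown for illustration only.
def truncation_report(texts, max_length, count_tokens=lambda t: len(t.split())):
    """Return (longest length, fraction of texts longer than max_length)."""
    lengths = [count_tokens(t) for t in texts]
    too_long = sum(1 for n in lengths if n > max_length)
    return max(lengths), too_long / len(lengths)

# Toy data; whitespace "tokens" are an assumption, not the real tokenizer.
sample_texts = ["a b c", "a b c d e f", "a"]
longest, fraction = truncation_report(sample_texts, max_length=4)
print(longest, fraction)
```

With the notebook's tokenizer loaded, passing `count_tokens=lambda t: len(tokenizer(t)['input_ids'])` should give counts in real tokens rather than words.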

In [8]:
dataset = load_dataset(dataset_name, split = 'train')
dataset

Dataset({
    features: ['output', 'input', 'instruction'],
    num_rows: 51760
})

In [9]:
dataset.select(range(5)).to_pandas().head()

| | output | input | instruction |
|---|---|---|---|
| 0 | 1. Eat a balanced and nutritious diet: Make su... | | Give three tips for staying healthy. |
| 1 | The three primary colors are red, blue, and ye... | | What are the three primary colors? |
| 2 | An atom is the basic building block of all mat... | | Describe the structure of an atom. |
| 3 | There are several ways to reduce air pollution... | | How can we reduce air pollution? |
| 4 | I had to make a difficult decision when I was ... | | Pretend you are a project manager of a constru... |


In [10]:
dataset[0]

{'output': '1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.',
 'input': '',
 'instruction': 'Give three tips for staying healthy.'}

In [11]:
features = list(dataset.features.keys())
print(features)

['output', 'input', 'instruction']


## Dataset Preprocess

In [12]:
prompt_format = """
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

In [13]:
EOS_TOKEN = tokenizer.eos_token

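def preprocess(example):
  # map() is not batched here, so each field holds a single string.
  instruction = example['instruction']
  input_text = example['input']
  output = example['output']

  # Fill the template and append EOS so the end of the response is marked.
  text = prompt_format.format(instruction, input_text, output) + EOS_TOKEN
  return {'prompt' : text}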
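
The `preprocess` function above handles one row per call. A batched variant, where each column arrives as a list, might look like the sketch below. It is an illustration under stated assumptions, not the notebook's method: `preprocess_batched` is a hypothetical name, and the `EOS_TOKEN` string is a stand-in for `tokenizer.eos_token` so the example runs without downloading a tokenizer.

```python
PROMPT_FORMAT = """
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Assumption: placeholder for tokenizer.eos_token.
EOS_TOKEN = "<|eot_id|>"

def preprocess_batched(examples):
    # In batched mode every column is a list, so format row by row.
    texts = [
        PROMPT_FORMAT.format(instruction, input_text, output) + EOS_TOKEN
        for instruction, input_text, output in zip(
            examples["instruction"], examples["input"], examples["output"]
        )
    ]
    return {"prompt": texts}

# Toy batch with made-up rows in the same three-column layout.
batch = {
    "instruction": ["Give three tips for staying healthy.", "Add the numbers."],
    "input": ["", "2 and 3"],
    "output": ["1. Eat a balanced diet.", "5"],
}
result = preprocess_batched(batch)
print(result["prompt"][1])
```

Such a function would be passed to `dataset.map(preprocess_batched, batched = True, remove_columns = features)`.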

In [14]:
formatted_dataset = dataset.map(preprocess, remove_columns = features)
formatted_dataset

Dataset({
    features: ['prompt'],
    num_rows: 51760
})

In [15]:
formatted_dataset[0]

{'prompt': '\nBelow is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Input:\n\n\n### Response:\n1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hour

## Tokenization

In [17]:
def tokenize_data(example, max_length = max_length):
  return tokenizer(example['prompt'], truncation = True, padding = 'max_length', max_length = max_length)

In [18]:
tokenized_dataset = formatted_dataset.map(tokenize_data, batched = True)
tokenized_dataset

Dataset({
    features: ['prompt', 'input_ids', 'attention_mask'],
    num_rows: 51760
})

In [20]:
tokenized_dataset[0]['prompt']

'\nBelow is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Input:\n\n\n### Response:\n1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep 

In [30]:
tokenized_dataset[0]['input_ids'][-10:]

[22, 12, 24, 4207, 315, 6212, 1855, 3814, 13, 128009]

In [31]:
tokenized_dataset[0]['attention_mask'][-10:]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
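
The last ten ids above end in the EOS token with an attention mask of ones, which is what left padding would produce for a prompt shorter than `max_length`. The following sketch mimics `truncation = True` with `padding = 'max_length'` in plain Python. The left padding side and the `pad_id` value are assumptions about the tokenizer configuration, and `pad_or_truncate` is a hypothetical helper rather than a library function.

```python
def pad_or_truncate(token_ids, max_length, pad_id, padding_side="left"):
    # Truncation keeps the first max_length tokens.
    ids = list(token_ids)[:max_length]
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    if padding_side == "left":
        # Pad tokens go in front; real tokens stay at the end.
        ids = [pad_id] * n_pad + ids
        mask = [0] * n_pad + mask
    else:
        ids = ids + [pad_id] * n_pad
        mask = mask + [0] * n_pad
    return {"input_ids": ids, "attention_mask": mask}

short = pad_or_truncate([11, 12, 13], max_length=5, pad_id=0)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=0)
print(short)
print(long)
```

Positions with a mask value of 0 are ignored by attention, so padded rows of different original lengths can share one fixed-size batch.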

## Prompt Test

In [33]:
model = AutoModelForCausalLM.from_pretrained(
  model_name,
  torch_dtype = torch.float16,
  trust_remote_code = True
).to(device)

In [35]:
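def assistant(prompt):
  # Fill the template with an empty input and an empty response to complete.
  text = prompt_format.format(prompt, "", "")
  inputs = tokenizer([text], return_tensors = 'pt').to(device)

  # top_k = 1 keeps only the most likely token, so sampling is effectively greedy.
  generation_config = GenerationConfig(
    do_sample = True,
    top_k = 1,
    temperature = 0.1,
    max_new_tokens = 1024,
    pad_token_id = tokenizer.eos_token_id
  )
  outputs = model.generate(
    **inputs,
    generation_config = generation_config
  )
  # Decode the full sequence (prompt plus completion) and print it.
  print(tokenizer.decode(outputs[0], skip_special_tokens = True))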
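
Because the decoded text includes the prompt template as well as the completion, it can be convenient to keep only what follows the `### Response:` marker. A minimal sketch, assuming the marker appears exactly as in `prompt_format`; `extract_response` is a hypothetical helper and is not in the original notebook.

```python
RESPONSE_MARKER = "### Response:"

def extract_response(decoded_text, marker=RESPONSE_MARKER):
    # Split at the first marker; the prompt's own marker comes before the completion.
    _, found, tail = decoded_text.partition(marker)
    # Fall back to the whole text if the marker is missing.
    return tail.strip() if found else decoded_text.strip()

# Made-up decoded string in the same layout as the template.
decoded = (
    "\nBelow is an instruction...\n\n### Instruction:\nSay hi.\n\n"
    "### Input:\n\n\n### Response:\nHi there!"
)
print(extract_response(decoded))
```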

In [39]:
prompt = dataset[0]['instruction']
assistant(prompt)


Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Input:


### Response:
Here are three tips for staying healthy:

1.  **Stay Hydrated**: Drinking plenty of water is essential for maintaining good health. Aim to drink at least eight glasses of water a day, and make sure to drink water regularly throughout the day, especially when you're physically active.
2.  **Eat a Balanced Diet**: A well-balanced diet that includes a variety of fruits, vegetables, whole grains, lean proteins, and healthy fats is crucial for maintaining good health. Aim to eat at least five servings of fruits and vegetables a day, and choose whole grains over refined grains.
3.  **Get Enough Sleep**: Getting enough sleep is essential for physical and mental health. Aim to get at least seven hours of sleep a night, and establish a consistent sleep sche

In [41]:
print(dataset[0]['output'])

1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.

2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.

3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.
