# Instruction Tuning
This module will guide you through instruction tuning language models. Instruction tuning involves adapting pre-trained models to specific tasks by further training them on task-specific datasets. This process helps models improve their performance on targeted tasks.

In this module, we will explore two topics: 1) Alpaca Prompt Template and 2) SFT

In [8]:
import os
import torch
# Set GPU device
os.environ["CUDA_VISIBLE_DEVICES"] = "3"
#uncomment this if you are not using our department puffer
os.environ['http_proxy']  = 'http://192.41.170.23:3128'
os.environ['https_proxy'] = 'http://192.41.170.23:3128'

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [9]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM

#make our work comparable if restarted the kernel
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

## Dataset format support
The SFTTrainer supports popular dataset formats. This allows you to pass the dataset to the trainer without any pre-processing directly. The following formats are supported:

instruction format 
```sh
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}

```

conversational format
```sh
{"messages": [
    {"role": "system", "content": "You are helpful"}, 
    {"role": "user", "content": "What's the capital of France?"}, 
    {"role": "assistant", "content": "..."}
]},
{"messages": [
    {"role": "system", "content": "You are helpful"}, 
    {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, 
    {"role": "assistant", "content": "..."}
]}, 
{"messages": [
    {"role": "system", "content": "You are helpful"}, 
    {"role": "user", "content": "How far is the Moon from Earth?"}, 
    {"role": "assistant", "content": "..."}
]}
```

If your dataset uses one of the above formats, you can directly pass it to the trainer without pre-processing. The SFTTrainer will then format the dataset for you using the defined format from the model’s tokenizer with the apply_chat_template method.

In [3]:
# Load the model and tokenizer
model_name = "HuggingFaceTB/SmolLM2-135M"
model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_name)
model = model.to(device)

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)
tokenizer.pad_token = tokenizer.eos_token

# Set our name for the finetune to be saved &/ uploaded to
finetune_name = "SmolLM2-FT-MyDataset"
finetune_tags = ["smol-course", "module_1"]

In [4]:
# Step 1: Load the dataset
from datasets import load_dataset
dataset = load_dataset("lucasmccabe-lmi/CodeAlpaca-20k")
dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 20022
    })
})

In [5]:
dataset['train'][0]

{'instruction': 'Create a function that takes a specific input and produces a specific output using any mathematical operators. Write corresponding code in Python.',
 'input': '',
 'output': 'def f(x):\n    """\n    Takes a specific input and produces a specific output using any mathematical operators\n    """\n    return x**2 + 3*x'}

### Standard-Alpaca : Format your input prompts
For instruction fine-tuning, it is quite common to have two columns inside the dataset: one for the prompt & the other for the response.

This allows people to format examples like [Stanford-Alpaca](https://github.com/tatsu-lab/stanford_alpaca) did as follows:

```sh
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{response}
```

**Customize your prompts using packed dataset**

If your dataset has several fields that you want to combine, for example if the dataset has question and answer fields and you want to combine them, you can pass a formatting function to the trainer that will take care of that. For example:

In [6]:
def formatting_func(example):
    text = f"### Question: {example['instruction']}\n ### Answer: {example['output']}"
    return text

formatting_func(dataset['train'][0])

'### Question: Create a function that takes a specific input and produces a specific output using any mathematical operators. Write corresponding code in Python.\n ### Answer: def f(x):\n    """\n    Takes a specific input and produces a specific output using any mathematical operators\n    """\n    return x**2 + 3*x'

In [8]:
## The `formatting_func` should return a list of processed strings since it can lead to silent bugs.
    
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        text = f"### Question: {example['instruction'][i]}\n ### Answer: {example['output'][i]}"
        output_texts.append(text)
    return output_texts

In [9]:
response_template = " ### Answer:"

collator = DataCollatorForCompletionOnlyLM(
    response_template, tokenizer=tokenizer)

# Step 3.1 : Set configure the SFTTrainer
sft_config = SFTConfig(
    output_dir="./sft_alpaca",
    max_steps=1000,  # Adjust based on dataset size and desired training duration
    per_device_train_batch_size=4,  # Set according to your GPU memory capacity
    learning_rate=5e-5,  # Common starting point for fine-tuning
    logging_steps=100,  # Frequency of logging training metrics
    save_steps=500,  # Frequency of saving model checkpoints
    # evaluation_strategy="steps",  # Evaluate the model at regular intervals
    # eval_steps=50,  # Frequency of evaluation
    use_mps_device=(
        True if device == "mps" else False
    ),  # Use MPS for mixed precision training
    hub_model_id=finetune_name,  # Set a unique name for your model
)

# Step 3.2 : Initialize the SFTTrainer
trainer = SFTTrainer(
    model=model,  # The pre-trained model to be fine-tuned
    args=sft_config,  # Configuration settings for fine-tuning, such as training steps and batch size
    formatting_func=formatting_prompts_func,  # Function to format input prompts for the model
    train_dataset=dataset["train"],  # Training dataset used for fine-tuning
    data_collator=collator,  # Handles batch collation and response formatting
    tokenizer=tokenizer,  # Tokenizer used for text processing
)


  trainer = SFTTrainer(


In [10]:
trainer.train()

Step,Training Loss
100,0.9074
200,0.9214
300,0.8355
400,0.8496
500,0.7838
600,0.8398
700,0.8229
800,0.8269
900,0.7514
1000,0.756


TrainOutput(global_step=1000, training_loss=0.8294741744995118, metrics={'train_runtime': 253.7607, 'train_samples_per_second': 15.763, 'train_steps_per_second': 3.941, 'total_flos': 394352973563904.0, 'train_loss': 0.8294741744995118, 'epoch': 0.1997602876548142})

### Test the fine-tuned model on the same prompt

In [21]:
# Load the model and tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer
# model_name = "HuggingFaceTB/SmolLM2-135M"
model_name = "./sft_alpaca/checkpoint-1000"

model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_name)
model = model.to(device)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)

In [27]:
from datasets import load_dataset 
sample = load_dataset("databricks/databricks-dolly-15k", split="train")
sample[0]

{'instruction': 'When did Virgin Australia start operating?',
 'context': "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.",
 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.',
 'category': 'closed_qa'}

In [38]:
def formatting_func(example):
    text = f"### Question: {example['instruction']}\n ### Answer:"
    return text

In [39]:
# Generate response
# inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(device)
input_ids = tokenizer(formatting_func(sample[0]), return_tensors="pt", truncation=True).input_ids.to(device)
# with torch.inference_mode():
outputs = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9,temperature=0.9)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


In [40]:
print(f"Prompt:\n{sample[0]['instruction']}\n")

Prompt:
When did Virgin Australia start operating?



In [41]:
print(f"Generated instruction:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")

Generated instruction:
il Service, and the Australian Air Mail service, was made in August 1929. This was the first manned flight between the two nations. The first passenger flight between the two nations, between Brisbane and Sydney, was made in 1930. This is when the first commercial flights between the two nations were made. The first flights


In [42]:
print(f"Ground truth:\n{sample[0]['response']}")

Ground truth:
Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.


In [20]:
print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")

Memory footprint: 538.06 MB
