# Fine-Tuning a Phi-2 Model for Function Calling on a Synthetic Dataset

In this notebook, we explore the process of fine-tuning a Phi-2 language model for function calling tasks using a synthetic dataset. Phi-2 is a small language model originally trained on synthetic datasets designed to enhance common-sense reasoning and general knowledge. Here, we adapt the model to handle specific function calling scenarios, which is relevant in applications where a model must interpret a user request and emit the correct call to a predefined function, with the right arguments, based on the input.

The primary objective of this notebook is to provide a comprehensive guide to fine-tuning a language model for specialized tasks, showcasing the steps required to adapt a general-purpose model to handle specific, structured interactions in a conversational context. By the end of this notebook, you will have a fully fine-tuned Phi-2 model capable of handling function calling tasks on a synthetic dataset, along with an understanding of the key concepts and techniques involved in the process.


## Imports

This cell contains all the import statements required by the code that follows.

In [1]:
import os
import numpy as np
import wandb
from datasets import Dataset
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import setup_chat_format
from peft import LoraConfig, get_peft_model, cast_mixed_precision_params, prepare_model_for_kbit_training
from trl import SFTConfig, SFTTrainer
import math
from dotenv import load_dotenv
from transformers import GenerationConfig

In [2]:
load_dotenv()

hf_token = os.getenv("HUGGINGFACE_TOKEN")

!huggingface-cli login --token $hf_token

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/cgrodrigues/.cache/huggingface/token
Login successful


## Load the Generated Dataset

In this section, we load a dataset that has been pre-generated and stored in a NumPy array file. The dataset consists of chat messages that include system, user, and assistant roles. The code performs the following steps:
1. Loading the dataset from the file.
2. Converting the data to a `Dataset` object.

In [3]:
messages = np.load('./data/messages.npy', allow_pickle=True)

# messages = [f"{m['system']}{m['user']}{m['assistant']}" for m in messages]

dataset = Dataset.from_dict({
    "messages_templated": messages 
    }
)
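
For orientation, the sketch below builds one record in the shape that the sample printed later in this notebook suggests: a dictionary with `system`, `user`, and `assistant` keys whose values are already wrapped in ChatML markers. The helper `make_record` is hypothetical, and the exact delimiters that close the system and user turns are assumptions, since the script that generated `messages.npy` is not shown here.

```python
def make_record(system: str, user: str, assistant: str) -> dict:
    """Hypothetical helper: wrap each turn in ChatML markers.

    The assistant turn ends with <|im_end|><|endoftext|>, as in the printed
    sample; the endings of the system and user turns are assumed.
    """
    return {
        "system": f"<|im_start|>system\n{system}<|im_end|>\n",
        "user": f"<|im_start|>user\n{user}<|im_end|>\n",
        "assistant": f"<|im_start|>assistant\n{assistant} <|im_end|><|endoftext|>",
    }


record = make_record(
    system="You are a helpful assistant with access to the following functions.",
    user="Search the internet for modern industry news.",
    assistant='<functioncall> {"name": "search_internet", "arguments": "{\'query\': \'modern industry news\'}"}',
)
print(sorted(record))
```

A list of such dictionaries is what `Dataset.from_dict` receives under the `messages_templated` column.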

## Load the Base Model and the Tokenizer 

In this section, we load a pre-trained language model and its corresponding tokenizer, along with configuring the model for efficient inference using quantization techniques. The steps involved are as follows:

1. Define the model_id of the pre-trained model to use. In this case, we are loading the "microsoft/phi-2" model, which is a pre-trained language model from Microsoft.
2. Load the tokenizer. This tokenizer is essential for converting text into a format that the model can process (i.e., token IDs) and for converting the model's output back into human-readable text.
3. Use a BitsAndBytesConfig object to configure the model's quantization settings. This configuration reduces the model's memory footprint by:
    * Loading the model in 4-bit precision.
    * Setting the quantization data type to "nf4" (4-bit NormalFloat).
    * Setting the computation data type to torch.bfloat16.
    * Disabling double quantization.
4. Load the pre-trained model with this quantization configuration.


https://huggingface.co/blog/4bit-transformers-bitsandbytes
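
To get a feel for what 4-bit loading saves, the following back-of-the-envelope sketch compares the storage needed for the weights alone at 16 and 4 bits per parameter. The parameter count is an approximation, and the estimate is a rough lower bound: it ignores the quantization constants, any layers that stay unquantized, activations, and optimizer state.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


PHI2_PARAMS = 2_700_000_000  # assumption: approximate Phi-2 parameter count

for label, bits in [("float16", 16), ("nf4 (4-bit)", 4)]:
    print(f"{label:>12}: {weight_memory_gib(PHI2_PARAMS, bits):.2f} GiB")
```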

In [4]:
base_model_path = "microsoft/phi-2"

tokenizer = AutoTokenizer.from_pretrained(base_model_path)

bnb_config = BitsAndBytesConfig(  
    load_in_4bit= True,
    bnb_4bit_quant_type= "nf4",
    bnb_4bit_compute_dtype= torch.bfloat16,
    bnb_4bit_use_double_quant= False,
)


model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=base_model_path,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
  )

model.config.use_cache = False # disable the use of the cache during the 
                               # generation process.
model.config.pretraining_tp = 1 # disables tensor parallelism, the model is
                                # not split across multiple devices and runs
                                # on a single device.

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

## Setup the Tokenizer Chat Format

In this section, we configure the model and tokenizer to use the 'chatml' chat format. `setup_chat_format` adds the ChatML special tokens (`<|im_start|>` and `<|im_end|>`) to the tokenizer, resizes the model's embedding layer to match the new vocabulary, and sets the tokenizer's chat template, so that conversations are rendered in one consistent format for both training and generation.

In [5]:
# Set up the chat format with default 'chatml' format
model, tokenizer = setup_chat_format(model, tokenizer)

In [6]:
tokenizer.apply_chat_template([{
                                "role": "system", 
                                "content": "You are a nice assistant."
                               },
                               {
                                "role": "user", 
                                "content": "Hello, there!"
                               }, 
                               {
                                "role": "assistant", 
                                "content": "Hi!"
                               }], tokenize=False)

'<|im_start|>system\nYou are a nice assistant.<|im_end|>\n<|im_start|>user\nHello, there!<|im_end|>\n<|im_start|>assistant\nHi!<|im_end|>\n'
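
The ChatML layout itself is simple enough to reproduce by hand. The sketch below is a plain-Python stand-in for the template, not the tokenizer's actual implementation, and the function name `to_chatml` is made up for illustration. It renders the same three messages into the string shown above.

```python
def to_chatml(messages: list[dict]) -> str:
    """Render a list of {"role", "content"} dicts as a ChatML string."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


example = [
    {"role": "system", "content": "You are a nice assistant."},
    {"role": "user", "content": "Hello, there!"},
    {"role": "assistant", "content": "Hi!"},
]
print(repr(to_chatml(example)))
```

Each turn is wrapped in `<|im_start|>{role}` and `<|im_end|>` markers, which is what lets the tokenization step later tell system, user, and assistant text apart.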

In [7]:
print(model) 

PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(50297, 2560)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x PhiDecoderLayer(
        (self_attn): PhiSdpaAttention(
          (q_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (k_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (v_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (dense): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (rotary_emb): PhiRotaryEmbedding()
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear4bit(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear4bit(in_features=10240, out_features=2560, bias=True)
        )
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (final_la

## Prepare the Dataset for Training and Testing

In this section, the dataset is prepared for training and testing by shuffling, selecting a subset of samples, and splitting the data into training and test sets.

In [8]:
# Define the Number of Samples:
NUM_SAMPLES = 1400 #128_000

# Shuffle the Dataset
dataset = dataset.shuffle(seed=42).select(range(NUM_SAMPLES))

# Split the Dataset into Training and Test Sets
dataset = dataset.train_test_split(test_size=0.1, seed=42)
dataset

DatasetDict({
    train: Dataset({
        features: ['messages_templated'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['messages_templated'],
        num_rows: 140
    })
})

In [9]:
# Print some examples
for i in range(2):
  print(dataset["train"]["messages_templated"][i])
  print("\n-------------------------------------------\n")

{'assistant': '<|im_start|>assistant\n<functioncall> {"name": "search_internet", "arguments": "{\'query\': \'Far modern myself.\', \'language\': \'industry\', \'filter_date\': \'account\'}"} <|im_end|><|endoftext|>', 'system': '<|im_start|>system\nYou are a helpful assistant with access to the following functions. Use these functions when they are relevant to assist with a user\'s request\n[{\n    "name": "search_internet",\n    "description": "Perform an internet search based on the user\'s query.",\n    "parameters": {\n        "type": "object",\n        "properties": {\n            "query": {\n                "type": "string",\n                "description": "The search query."\n            },\n            "language": {\n                "type": "string",\n                "description": "The language of the search results."\n            },\n            "filter_date": {\n                "type": "string",\n                "description": "A date filter for the search results."\n        

## Tokenize the Dataset and Define the Collation Function 

Here the dataset is tokenized so that every example has at most 1024 tokens. Because the model should not learn from the user and system messages, the tokens of those messages are marked with the label `IGNORE_INDEX` (-100) so that they are excluded from the loss calculation during training. The second part of the cell defines a collation function that pads the examples of each batch to a common length during training.

In [10]:
IGNORE_INDEX = -100

def tokenize(example):
    max_length = 1024 
    input_ids, attention_mask, labels = [], [], [] 
    message = [example['messages_templated']['system'],
               example['messages_templated']['user'],
               example['messages_templated']['assistant']]
   
    for i, msg in enumerate(message):
        msg_tokenized = tokenizer(  
          msg,   
          truncation=False,   
          add_special_tokens=False)  
  
        # Copy tokens and attention mask without changes  
        input_ids += msg_tokenized["input_ids"]  
        attention_mask += msg_tokenized["attention_mask"]
        
        # Adapt labels for loss calculation: if system or user ->IGNORE_INDEX, 
        # if assistant->input_ids  (calculate loss only for assistant messages)      
        if i == 2:
            labels += msg_tokenized["input_ids"]  
        else:
            labels += [IGNORE_INDEX]*len(msg_tokenized["input_ids"]) 
    
    # truncate to max. length  
    return {  
        "input_ids": input_ids[:max_length],   
        "attention_mask": attention_mask[:max_length],  
        "labels": labels[:max_length],  
    }  

        
dataset_tokenized = dataset.map(tokenize,   
            batched = False,  
            num_proc = os.cpu_count(),    # run in parallel, one worker process per CPU core  
            remove_columns = dataset["train"].column_names  # Remove original columns, no longer needed  
)

def collate(elements):
    tokens=[e["input_ids"] for e in elements]
    tokens_maxlen=max([len(t) for t in tokens])

    for i,sample in enumerate(elements):
        input_ids=sample["input_ids"]
        labels=sample["labels"]
        attention_mask=sample["attention_mask"]

        pad_len=tokens_maxlen-len(input_ids)

        input_ids.extend( pad_len * [tokenizer.pad_token_id] )   
        labels.extend( pad_len * [IGNORE_INDEX] )    
        attention_mask.extend( pad_len * [0] ) 

    batch={
        "input_ids": torch.tensor( [e["input_ids"] for e in elements] ),
        "labels": torch.tensor( [e["labels"] for e in elements] ),
        "attention_mask": torch.tensor( [e["attention_mask"] for e in elements] ),
    }

    return batch


Map (num_proc=12):   0%|          | 0/1260 [00:00<?, ? examples/s]

Map (num_proc=12):   0%|          | 0/140 [00:00<?, ? examples/s]
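
The label masking can be seen in isolation with a toy example. The sketch below swaps the real tokenizer for a whitespace splitter (`toy_tokenize`, an assumption made purely for illustration) so that the effect on `labels` is easy to read: system and user positions become `IGNORE_INDEX`, while assistant positions keep their token ids.

```python
IGNORE_INDEX = -100  # same sentinel value as in the cell above


def toy_tokenize(text: str, vocab: dict) -> list:
    """Stand-in tokenizer (an assumption): one id per whitespace-separated word."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def build_example(system: str, user: str, assistant: str, max_length: int = 1024) -> dict:
    vocab = {}
    input_ids, labels = [], []
    for role, text in (("system", system), ("user", user), ("assistant", assistant)):
        ids = toy_tokenize(text, vocab)
        input_ids += ids
        # Only assistant tokens keep their ids as labels; the rest are masked.
        labels += ids if role == "assistant" else [IGNORE_INDEX] * len(ids)
    return {"input_ids": input_ids[:max_length], "labels": labels[:max_length]}


sample = build_example("You can call functions.", "Search for cats.", "<functioncall> search")
print(sample["input_ids"])  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(sample["labels"])     # [-100, -100, -100, -100, -100, -100, -100, 7, 8]
```

PyTorch's cross-entropy loss skips targets equal to -100 by default, so only the assistant tokens contribute to the gradient.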

In [11]:
dataset_tokenized

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})
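
The padding logic of the collation function can also be sketched without PyTorch. The version below works on plain lists and uses a placeholder pad id, both assumptions for illustration. Unlike the `collate` function above, it builds new lists instead of extending the input samples in place.

```python
IGNORE_INDEX = -100
PAD_TOKEN_ID = 0  # assumption: placeholder for tokenizer.pad_token_id


def pad_batch(elements: list) -> dict:
    """Right-pad every field to the longest sequence without mutating the inputs."""
    max_len = max(len(e["input_ids"]) for e in elements)
    batch = {"input_ids": [], "labels": [], "attention_mask": []}
    for e in elements:
        pad_len = max_len - len(e["input_ids"])
        batch["input_ids"].append(e["input_ids"] + [PAD_TOKEN_ID] * pad_len)
        batch["labels"].append(e["labels"] + [IGNORE_INDEX] * pad_len)
        batch["attention_mask"].append(e["attention_mask"] + [0] * pad_len)
    return batch


batch = pad_batch([
    {"input_ids": [5, 6, 7], "labels": [-100, 6, 7], "attention_mask": [1, 1, 1]},
    {"input_ids": [8], "labels": [8], "attention_mask": [1]},
])
print(batch)
```

Padded positions get label `IGNORE_INDEX` and attention mask 0, so they affect neither the loss nor the attention computation.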

## Configure and Apply LoRA (Low-Rank Adaptation) for Model Fine-Tuning

In this section, we configure and apply LoRA (Low-Rank Adaptation) to the model, a technique used to efficiently fine-tune large language models by adapting specific parts of the model. The code involves setting up the LoRA configuration, applying it to the model and adjusting the precision of model parameters.

In [12]:
# Prepares the model for training using low-bit 
# precision (8-bit or 4-bit precision)
model = prepare_model_for_kbit_training(model, 
                                        use_gradient_checkpointing=True)

# LoRA Configuration
peft_config = LoraConfig(
    r=32, # This parameter controls the rank of the low-rank adaptation 
          # matrices. A lower rank reduces the number of parameters, 
          # making the adaptation more efficient.

    lora_alpha=32, # This scaling factor balances the impact of the 
                   # low-rank adaptation. It helps control how much 
                   # the adaptation affects the original model parameters.
                   
    # Target all linear layers
    target_modules=[
        'q_proj',
        'k_proj',
        'v_proj',
        'dense',
        'fc1',
        'fc2',
    ], # These are the names of the linear layers in the model that will be 
       # adapted using LoRA. By targeting these modules, LoRA modifies only 
       # these parts of the model, making fine-tuning more efficient.

    modules_to_save = ["lm_head", "embed_tokens"],
    lora_dropout=0.05, # A small dropout rate is applied during training to 
                      # prevent overfitting. This means 5% of the neurons 
                      # will be randomly dropped during training.

    bias="none", # No additional bias is added in the adaptation layers.
    task_type="CAUSAL_LM", # Specifies that the task is causal language 
                           # modeling, indicating that the model predicts 
                           # the next word in a sequence.
    
)

# Apply LoRA to the Model:
model = get_peft_model(model, peft_config)

# Cast Mixed Precision Parameters (disabled):
# cast_mixed_precision_params(model, dtype=torch.float16)
# This call is commented out, so no cast is performed. If enabled, it would 
# cast the model's non-trainable parameters to torch.float16 to reduce memory usage.

# Print Trainable Parameters:
model.print_trainable_parameters()
# The output of this line shows the number of trainable parameters 
# in the model after applying LoRA.

model.config.use_cache = False # disable the use of the cache 
                               # during the generation process.

trainable params: 304,756,857 || all params: 3,079,816,434 || trainable%: 9.8953
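
The trainable-parameter count can be reproduced by hand from the layer shapes in the model printout and the LoRA settings. The sketch below assumes that each adapted linear layer adds `r * (in_features + out_features)` parameters, that `modules_to_save` makes full copies of `embed_tokens` and `lm_head` trainable, and that `lm_head` carries a bias term (its shape is not visible in the truncated printout).

```python
RANK = 32
NUM_LAYERS = 32
HIDDEN, MLP, VOCAB = 2560, 10240, 50297

# (in_features, out_features) of each targeted linear layer in one decoder block
targets = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "dense": (HIDDEN, HIDDEN),
    "fc1": (HIDDEN, MLP),
    "fc2": (MLP, HIDDEN),
}

# Each adapted layer gets A (r x in) and B (out x r): r * (in + out) parameters.
lora_params = NUM_LAYERS * sum(RANK * (i + o) for i, o in targets.values())

# modules_to_save trains full copies of these layers.
embed_params = VOCAB * HIDDEN
lm_head_params = HIDDEN * VOCAB + VOCAB  # assumption: lm_head has a bias term

total = lora_params + embed_params + lm_head_params
print(f"LoRA: {lora_params:,}  embed_tokens: {embed_params:,}  lm_head: {lm_head_params:,}")
print(f"total trainable: {total:,}")
```

Under these assumptions the total comes to 304,756,857, the figure reported above. Only about 47 million of those parameters belong to the LoRA matrices; the rest come from `modules_to_save`, which explains why the trainable share is close to 10%.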


## Set Training Configuration and Initialize the Fine-Tuning Trainer

This section defines the hyperparameters and configuration needed to fine-tune the model. The configuration is encapsulated in an SFTConfig object. We then initialize the SFTTrainer, which handles the training loop, evaluation, and model saving according to the parameters defined in the SFTConfig.

In [13]:
# Define Hyperparameters:
max_seq_length = 2048 # Specifies the maximum sequence length for the model, 
                      # meaning each input sequence can be up to 2048 tokens long.
batch_size = 2 # Sets the number of samples processed per batch. A small batch 
               # size of 2 is chosen due to memory constraints.
gradient_accum_steps = 1 # Determines how many batches to process before performing 
                         # a gradient update. Since it's set to 1, gradients are 
                         # updated after every batch.
epochs = 1 # Indicates that the model will be trained for one complete pass 
           # through the dataset.
eval_steps = 50 # The model's performance will be evaluated every 50 steps.
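save_steps = eval_steps * 2 # Save interval of 100 steps (half as often as 
                            # evaluation). It only takes effect when 
                            # save_strategy="steps"; with save_strategy="epoch" 
                            # below, a checkpoint is written once per epoch.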
logging_steps = 50 # Logging information will be output every 50 steps, 
                   # aligning with evaluation intervals.
lr = 2e-5   # Sets the learning rate for training. This controls how much to adjust 
            # the model's parameters with respect to the loss gradient during each update.

print("Eval Steps:", eval_steps)
print("Save Steps:", save_steps)

# Set Model and Adapter Path and Names

new_model_name = "phi-2-function-calling"
new_model_path = f"./{new_model_name}"

train_model_name = f"{new_model_name}-train"
train_model_path = f"./{train_model_name}"

adapter_name = f"{new_model_name}-adapter"
adapter_path = f"./{new_model_name}-adapter"

# Configure Training with SFTConfig:
sft_config = SFTConfig(
    output_dir=train_model_path, # Where the trained model will be saved.
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accum_steps,
    
    optim="paged_adamw_32bit", 
    save_strategy="epoch",
    logging_steps=logging_steps,
    logging_strategy="steps",
    learning_rate=lr, 
    fp16=False,
    bf16=False,
    group_by_length=True,
    disable_tqdm=False,
    max_seq_length=max_seq_length,
    dataset_text_field="messages_templated", # Which field in the dataset contains the text data to be used.
    packing=False, 
    report_to="wandb", # Training metrics will be reported to Weights & Biases
    run_name=train_model_name,
    eval_steps=eval_steps,
    save_steps=save_steps,
    lr_scheduler_type="constant", 
    eval_strategy="steps",
  )


# Initialize the SFTTrainer:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset_tokenized['train'],
    eval_dataset=dataset_tokenized['test'],
    peft_config=peft_config,
    tokenizer=tokenizer,
    args=sft_config,
    data_collator=collate,
)


Eval Steps: 50
Save Steps: 100
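
As a quick sanity check on these settings, the number of optimizer steps in the run follows directly from the size of the training split and the batch configuration. The sketch below assumes a single device and uses the values defined in this notebook.

```python
import math

train_rows = 1260          # size of the training split shown earlier
batch_size = 2
gradient_accum_steps = 1
epochs = 1

# One optimizer step consumes batch_size * gradient_accum_steps examples.
steps_per_epoch = math.ceil(train_rows / (batch_size * gradient_accum_steps))
total_steps = steps_per_epoch * epochs

print(total_steps)  # 630
```

With `eval_steps = 50`, evaluation therefore happens at regular intervals well within a single epoch.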


## Initialize Weights & Biases (W&B) Tracking for the Training Run

In [14]:
run = wandb.init(
    project="kensei-phi2",
    name="testrun-4.8.3",
)

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: carlos-garces-rodrigues (datakensei). Use `wandb login --relogin` to force relogin


## Evaluate the Model Before Training

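The model is evaluated on the test dataset, and the perplexity is calculated. Perplexity is the exponential of the evaluation loss (the mean cross-entropy per token), so lower values mean the model assigns higher probability to the reference tokens. The evaluation call is commented out in the cell below.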

In [15]:
# eval_results = trainer.evaluate()
# print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
# eval_results

## Start the Model Training Process

This step involves feeding the model with the training data, adjusting its parameters to minimize the loss, and iterating through the dataset according to the previous configuration settings.

In [16]:
trainer.train()

  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


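| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50   | 1.4705        | 0.759203        |
| 100  | 0.5624        | 0.378981        |
| 150  | 0.3603        | 0.268035        |
| 200  | 0.203         | 0.234188        |
| 250  | 0.2947        | 0.223158        |
| 300  | 0.3456        | 0.219122        |
| 350  | 0.2618        | 0.212389        |
| 400  | 0.28          | 0.210371        |
| 450  | 0.2843        | 0.210798        |
| 500  | 0.2187        | 0.210504        |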




TrainOutput(global_step=630, training_loss=0.3955709184919085, metrics={'train_runtime': 15336.2616, 'train_samples_per_second': 0.082, 'train_steps_per_second': 0.041, 'total_flos': 6471275893612128.0, 'train_loss': 0.3955709184919085, 'epoch': 1.0})

## Evaluate the Model After Training

In [17]:

# eval_results_after = trainer.evaluate()
# print(f">>> Perplexity: {math.exp(eval_results_after['eval_loss']):.2f}")

## Test a Chat Interaction with the New Model

In [18]:

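def chat(messages):
    # Render the conversation with the chat template, append the assistant 
    # prompt, and move the token ids to the GPU.
    tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

    # Generate up to 128 new tokens; the decoded text contains the prompt 
    # followed by the model's completion.
    outputs = trainer.model.generate(tokenized_chat, max_new_tokens=128) #, stopping_criteria=["<|im_end|>"])
    print(tokenizer.decode(outputs[0]))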

messages = [
    {
        "role":"system", 
        "content":"You are a helpful assistant with access to the following functions. Use these functions when they are relevant to assist with a user's request\n[\n	{\n		\"name\": \"calculate_retirement_savings\",\n		\"description\": \"Project the savings at retirement based on current contributions.\",\n		\"parameters\": {\n			\"type\": \"object\",\n			\"properties\": {\n				\"current_age\": {\n					\"type\": \"integer\",\n					\"description\": \"The current age of the individual.\"\n				},\n				\"retirement_age\": {\n					\"type\": \"integer\",\n					\"description\": \"The desired retirement age.\"\n				},\n				\"current_savings\": {\n					\"type\": \"number\",\n					\"description\": \"The current amount of savings.\"\n				},\n				\"monthly_contribution\": {\n					\"type\": \"number\",\n					\"description\": \"The monthly contribution towards retirement savings.\"\n				}\n			},\n			\"required\": [\"current_age\", \"retirement_age\", \"current_savings\", \"monthly_contribution\"]\n		}\n	}\n]"
    },
    {
        "role": "user", 
        "content": "I am currently 40 years old and plan to retire at 65. I have no savings at the moment, but I intend to save $500 every month. Could you project the savings at retirement based on current contributions?"
    }
]
chat(messages)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
  return fn(*args, **kwargs)


tensor([[50295, 10057,   198,  1639,   389,   257,  7613,  8796,   351,  1895,
           284,   262,  1708,  5499,    13,  5765,   777,  5499,   618,   484,
           389,  5981,   284,  3342,   351,   257,  2836,   338,  2581,   198,
            58,   198,   197,    90,   198, 50294,     1,  3672,  1298,   366,
          9948,  3129,   378,    62,  1186, 24615,    62, 39308,   654,  1600,
           198, 50294,     1, 11213,  1298,   366, 16775,   262, 10653,   379,
         10737,  1912,   319,  1459,  9284, 33283,   198, 50294,     1, 17143,
          7307,  1298,  1391,   198, 50293,     1,  4906,  1298,   366, 15252,
          1600,   198, 50293,     1, 48310,  1298,  1391,   198, 50292,     1,
         14421,    62,   496,  1298,  1391,   198, 50291,     1,  4906,  1298,
           366, 41433,  1600,   198, 50291,     1, 11213,  1298,   366,   464,
          1459,  2479,   286,   262,  1981,   526,   198, 50292,  5512,   198,
         50292,     1,  1186, 24615,    62,   496,  
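
Once the generated text is decoded, an application still has to turn the `<functioncall>` string into a function name and arguments. The sketch below is one possible parser, assuming the reply follows the format of the training sample printed earlier, where `arguments` is a string holding a Python-style dictionary. The function name `parse_function_call` is hypothetical.

```python
import ast
import json


def parse_function_call(text: str):
    """Return {"name", "arguments"} for a <functioncall> reply, else None."""
    marker = "<functioncall>"
    start = text.find(marker)
    if start == -1:
        return None  # plain-text answer, no function call
    payload = text[start + len(marker):]
    # Drop everything from the first end-of-turn marker onwards.
    for stop in ("<|im_end|>", "<|endoftext|>"):
        payload = payload.split(stop)[0]
    call = json.loads(payload.strip())
    arguments = call.get("arguments", {})
    if isinstance(arguments, str):
        # The training samples store arguments as a Python-style dict string.
        arguments = ast.literal_eval(arguments)
    return {"name": call["name"], "arguments": arguments}


reply = (
    '<|im_start|>assistant\n<functioncall> {"name": "search_internet", "arguments": '
    '"{\'query\': \'Far modern myself.\', \'language\': \'industry\', '
    '\'filter_date\': \'account\'}"} <|im_end|><|endoftext|>'
)
print(parse_function_call(reply))
```

`ast.literal_eval` is used for the inner string because it only evaluates literals, which makes it a safer choice than `eval` for model-generated text. Malformed output will still raise an exception that the caller should handle.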

## Merging Model and Adapter

In [None]:
# Merge the model with adapter
new_model = trainer.model.merge_and_unload()

# Save merged model and push to hub

# Model
new_model.save_pretrained(new_model_path, token=True)
# Tokenizer
tokenizer.save_pretrained(new_model_path)
# Generation configuration
generation_config = GenerationConfig(
    max_new_tokens=100, 
    temperature=0.7,
    top_p=0.1,
    top_k=40,
    repetition_penalty=1.18,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
generation_config.save_pretrained(new_model_path)
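
Conceptually, merging folds each adapter back into its base weight as `W + (alpha / r) * B @ A`, after which the adapter matrices are no longer needed at inference time. The sketch below shows that arithmetic on a toy full-precision matrix using plain Python lists. All values are made up, and it ignores the 4-bit quantization of the base weights used in this notebook.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A for plain nested-list matrices."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 layer with a rank-1 adapter (all values are made up for illustration).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # shape (rank, in_features)
B = [[0.5], [0.25]]     # shape (out_features, rank)

print(merge_lora(W, A, B, alpha=2, rank=1))  # [[2.0, 2.0], [0.5, 2.0]]
```

With the settings used here (`lora_alpha=32`, `r=32`) the scaling factor `alpha / r` is 1, so the low-rank update is added to the base weight unscaled.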

In [None]:
# Upload
area = "DataKensei"
new_model.push_to_hub(f"{area}/{new_model_name}", token=True, safe_serialization=True)
tokenizer.push_to_hub(f"{area}/{new_model_name}")