# FINE-TUNING A MODEL FOR FUNCTION-CALLING

In this simple example, **we're going to fine-tune an LLM for function calling**.

This notebook is part of the <a href="https://www.hf.co/learn/agents-course">Hugging Face Agents Course</a>, a free course that takes you from beginner to expert in building agents.

<img src="https://huggingface.co/datasets/agents-course/course-images/resolve/main/en/communication/share.png" alt="Agent Course"/>

## What is Function-Calling?

Function-calling is a process in which we provide an LLM with tools and allow it to determine the parameters with which it should call those tools.

I would separate function-calling from a regular agent because function-calling is something the model has been explicitly trained to do, so it relies less on prompting.

A function-calling model is typically a kind of JSON agent that has been fine-tuned to follow instructions and invoke tools when necessary.

Unlike standard JSON agents, function-calling models have a new set of special tokens used to further delimit conversations.

For example, Mistral models have introduced new tokens to handle tools natively:
- `[AVAILABLE_TOOLS]` – Start the list of available tools  
- `[/AVAILABLE_TOOLS]` – End the list of available tools  
- `[TOOL_CALLS]` – Make a call to a tool (i.e., take an "Action")  
- `[TOOL_RESULTS]` – "Observe" the result of the action  
- `[/TOOL_RESULTS]` – End of the observation (i.e., the model can decode again)  
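
To make the role of these tokens more concrete, here is a rough sketch of how one tool-using exchange could be laid out with them. The helper name `render_tool_exchange`, the `get_weather` tool and the exact layout are illustrative assumptions, not the official Mistral template, so check the chat template of the model you actually use.

```python
import json


def render_tool_exchange(tools, user_message, tool_call, tool_result):
    """Illustrative layout only: wrap each stage of a tool exchange in its delimiters."""
    parts = [
        # The tools the model is allowed to call
        "[AVAILABLE_TOOLS]" + json.dumps(tools) + "[/AVAILABLE_TOOLS]",
        # The user's request
        user_message,
        # The "Action": the model decides which tool to call and with which arguments
        "[TOOL_CALLS]" + json.dumps([tool_call]),
        # The "Observation": the result fed back to the model
        "[TOOL_RESULTS]" + json.dumps(tool_result) + "[/TOOL_RESULTS]",
    ]
    return "\n".join(parts)


# Hypothetical tool and values, for illustration only
tools = [{"name": "get_weather", "parameters": {"city": "string"}}]
print(
    render_tool_exchange(
        tools,
        "What is the weather in Paris?",
        {"name": "get_weather", "arguments": {"city": "Paris"}},
        {"temperature_c": 18},
    )
)
```

The point is only that each stage (tool list, action, observation) has its own delimiter, so the model never has to guess where one ends and the next begins.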

### Summary

If a normal JSON agent primarily operates through **prompting**, a function-calling agent relies on **training** to invoke tools effectively.

## How do we train our model for function-calling?

> Answer: what we need is **data**.

Training a model can be seen as three steps:
1. The model is **pretrained** on a large quantity of data. The output of that step is a pre-trained model.
2. The model can be **fine-tuned** on instruction following, either by the model creator or by an individual (or both).
3. The model can be **aligned** to the creator's preferences.

Usually, a full-fledged product like Gemini has gone through all three steps, while the models you can find on Hugging Face have been through one or more of them.

In this tutorial, we will build a function-calling model based on **"google/gemma-2-2b-it"**. 

The base model is "google/gemma-2-2b". The Google team then fine-tuned it on instruction following, resulting in **"google/gemma-2-2b-it"**. Here we start from **"google/gemma-2-2b-it"** rather than the base model, because the fine-tuning it has already been through matters for our use case.

Since we want to interact with our model through conversations in messages, starting from the base model would require more training in order to learn instruction following, chat AND function-calling.

By taking the instruct-tuned model as a base, we minimize the amount of information that our model needs to learn.

### LoRA (Low-Rank Adaptation of Large Language Models)
LoRA is a popular, lightweight training technique that significantly reduces the number of trainable parameters. It works by inserting a small number of new weights into the model and training only those. This makes training with LoRA much faster and more memory-efficient, and it typically produces smaller model weights (a few hundred MBs) that are easier to store and share.

<img src="https://huggingface.co/datasets/agents-course/course-images/resolve/main/en/unit1/blog_multi-lora-serving_LoRA.gif" alt="LoRA inference" width="50%"/>

This drastically reduces the memory required to train a model.
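
To get a feel for the savings, here is a small back-of-the-envelope calculation. It assumes the usual LoRA formulation, where a frozen weight of shape `out_features x in_features` is complemented by two small matrices of shapes `rank x in_features` and `out_features x rank`. The shapes are those of the `q_proj` layer printed later in this notebook, and the rank matches the `r=8` used in our `LoraConfig`; the helper name `lora_param_counts` is just for illustration.

```python
def lora_param_counts(in_features, out_features, rank):
    """Return (frozen weight parameters, trainable LoRA parameters) for one linear layer."""
    full = in_features * out_features          # frozen weight, not trained
    lora = rank * (in_features + out_features)  # the two low-rank matrices A and B
    return full, lora


# q_proj in gemma-2-2b-it: Linear(in_features=2304, out_features=2048), LoRA rank 8
full, lora = lora_param_counts(2304, 2048, 8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
# full: 4,718,592  lora: 34,816  ratio: 0.74%
```

For this layer, the adapter trains well under 1% of the parameters a full fine-tune would update.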


In [1]:
!pip install -q -U bitsandbytes
!pip install -q -U transformers
!pip install -q -U peft
!pip install -q -U accelerate
!pip install -q -U datasets
!pip install -q -U trl==0.12.2
!pip install -q -U tensorboardX
!pip install -q wandb

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
trl 0.12.2 requires transformers<4.47.0, but you have transformers 4.48.3 which is incompatible.


In [2]:
from enum import Enum
from functools import partial
import pandas as pd
import torch
import json

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig, set_seed
from datasets import load_dataset
from trl import SFTTrainer
from peft import get_peft_model, LoraConfig, TaskType

seed = 42
set_seed(seed)

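import os
# Placeholder: replace "hf_xxx" with your own Hugging Face access token (and never commit a real one)
os.environ['HF_TOKEN']="hf_xxx"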

## Processing the dataset into inputs

In order to train the model, we need to format the inputs into what we want the model to learn.

For this tutorial, I enhanced a popular function-calling dataset, "NousResearch/hermes-function-calling-v1", by adding a new **thinking** step computed with **deepseek-ai/DeepSeek-R1-Distill-Qwen-32B**.

But for the model to learn, we need to format the conversations correctly. If you followed Unit 1, you know that going from a list of messages to a prompt is handled by the **chat_template**. However, the default chat_template of gemma-2-2B does not handle tool calls, so we need to modify it! It also rejects the system role, which is why we merge any system message into the first user message.

This is the role of our **preprocess** function: to go from a list of messages to a prompt that the model can understand.


In [14]:
model_name = "google/gemma-2-2b-it"
dataset_name = "Jofthomas/hermes-function-calling-thinking-V1"
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenizer.chat_template = "{{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{{ '<start_of_turn>' + message['role'] + '\n' + message['content'] | trim + '<end_of_turn><eos>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"


def preprocess(samples):
    batch = []
    for conversation in samples["conversations"]:
        
        # Instead of adding a system message, we merge the content into the first user message
        if conversation[0]["role"] == "system":
            system_message_content = conversation[0]["content"]
            # Merge system content with the first user message
            conversation[1]["content"] = system_message_content + "\n\n" + conversation[1]["content"]
            # Remove the system message from the conversation
            conversation.pop(0)
        
        batch.append(tokenizer.apply_chat_template(conversation, tokenize=False))
    
    return {"content": batch}

dataset = load_dataset(dataset_name)
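
To see what this cell produces without downloading a tokenizer, here is a plain-Python approximation of the Jinja chat template and of the system-message merge done in `preprocess`. The helper names `merge_system_into_first_user` and `render`, and the toy conversation, are assumptions made for illustration; the real rendering is done by `tokenizer.apply_chat_template`.

```python
BOS = "<bos>"


def merge_system_into_first_user(conversation):
    """Return a copy in which a leading system message is folded into the next message."""
    conversation = [dict(message) for message in conversation]  # avoid mutating the input
    if len(conversation) > 1 and conversation[0]["role"] == "system":
        system = conversation.pop(0)
        conversation[0]["content"] = system["content"] + "\n\n" + conversation[0]["content"]
    return conversation


def render(conversation):
    """Approximate the chat template: one <start_of_turn>...<end_of_turn><eos> block per message."""
    prompt = BOS
    for message in conversation:
        prompt += (
            "<start_of_turn>" + message["role"] + "\n"
            + message["content"].strip()
            + "<end_of_turn><eos>\n"
        )
    return prompt


# Toy conversation, for illustration only
example = [
    {"role": "system", "content": "You are a function calling AI model."},
    {"role": "human", "content": "What are the headlines in France?"},
    {"role": "model", "content": "<think>I need to use get_news_headlines.</think>"},
]
print(render(merge_system_into_first_user(example)))
```

Note that every turn ends with `<end_of_turn><eos>`, so the model also learns when to stop generating.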


In [4]:
dataset = dataset.map(
    preprocess,
    batched=True,
    remove_columns=dataset["train"].column_names
)
dataset = dataset["train"].train_test_split(0.1)
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['content'],
        num_rows: 3213
    })
    test: Dataset({
        features: ['content'],
        num_rows: 357
    })
})


In [5]:
print(dataset["train"][8]["content"])

<bos><start_of_turn>human
You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags.You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.Here are the available tools:<tools> [{'type': 'function', 'function': {'name': 'get_news_headlines', 'description': 'Get the latest news headlines', 'parameters': {'type': 'object', 'properties': {'country': {'type': 'string', 'description': 'The country for which headlines are needed'}}, 'required': ['country']}}}, {'type': 'function', 'function': {'name': 'search_recipes', 'description': 'Search for recipes based on ingredients', 'parameters': {'type': 'object', 'properties': {'ingredients': {'type': 'array', 'items': {'type': 'string'}, 'description': 'The list of ingredients'}}, 'required': ['ingredients']}}}] </tools>Use the following pydantic model json schema for each tool call you will make: {'title': 'FunctionCall

In [6]:
peft_config = LoraConfig(r=8,
                         lora_alpha=16,
                         lora_dropout=0.1,
                         target_modules=["gate_proj","q_proj","lm_head","o_proj","k_proj","embed_tokens","down_proj","up_proj","v_proj"],
                         task_type=TaskType.CAUSAL_LM)

In [7]:
print(tokenizer.pad_token)
print(tokenizer.eos_token)

<pad>
<eos>


In [8]:
class ChatmlSpecialTokens(str, Enum):
    tools = "<tools>"
    eotools = "</tools>"
    think = "<think>"
    eothink = "</think>"
    tool_call="<tool_call>"
    eotool_call="</tool_call>"
    pad_token = "<pad>"
    eos_token = "<eos>"
    @classmethod
    def list(cls):
        return [c.value for c in cls]

tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        pad_token=ChatmlSpecialTokens.pad_token.value,
        additional_special_tokens=ChatmlSpecialTokens.list()
    )
tokenizer.chat_template = "{{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{{ '<start_of_turn>' + message['role'] + '\n' + message['content'] | trim + '<end_of_turn><eos>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             attn_implementation='eager',
                                             device_map="auto")
model.resize_token_embeddings(len(tokenizer))
model.to(torch.bfloat16)



The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Gemma2ForCausalLM(
  (model): Gemma2Model(
    (embed_tokens): Embedding(256006, 2304, padding_idx=0)
    (layers): ModuleList(
      (0-25): 26 x Gemma2DecoderLayer(
        (self_attn): Gemma2Attention(
          (q_proj): Linear(in_features=2304, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2304, out_features=1024, bias=False)
          (v_proj): Linear(in_features=2304, out_features=1024, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2304, bias=False)
          (rotary_emb): Gemma2RotaryEmbedding()
        )
        (mlp): Gemma2MLP(
          (gate_proj): Linear(in_features=2304, out_features=9216, bias=False)
          (up_proj): Linear(in_features=2304, out_features=9216, bias=False)
          (down_proj): Linear(in_features=9216, out_features=2304, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): Gemma2RMSNorm((2304,), eps=1e-06)
        (pre_feedforward_layernorm): Gemma2RMSNorm((2304,), eps
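
The `Embedding(256006, 2304, ...)` in the output above can be explained with a little arithmetic. The sketch below assumes the original Gemma 2 tokenizer has 256000 entries and that `<pad>` and `<eos>` are already part of it (they were printed by the tokenizer earlier), so only the remaining tokens of our enum are actually new.

```python
from enum import Enum


class ChatmlSpecialTokens(str, Enum):
    tools = "<tools>"
    eotools = "</tools>"
    think = "<think>"
    eothink = "</think>"
    tool_call = "<tool_call>"
    eotool_call = "</tool_call>"
    pad_token = "<pad>"
    eos_token = "<eos>"

    @classmethod
    def list(cls):
        return [c.value for c in cls]


BASE_VOCAB_SIZE = 256000             # assumption: size of the original Gemma 2 vocabulary
already_in_vocab = {"<pad>", "<eos>"}  # these two tokens already exist in the tokenizer

# Only tokens that are not yet in the vocabulary get a new id
new_tokens = [t for t in ChatmlSpecialTokens.list() if t not in already_in_vocab]
expected_vocab_size = BASE_VOCAB_SIZE + len(new_tokens)
print(new_tokens)
print(expected_vocab_size)
```

This is why `model.resize_token_embeddings(len(tokenizer))` is needed: the embedding matrix must grow by one row per new token.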

In [9]:
output_dir = "gemma-2-2B-it-thinking-function_calling"
per_device_train_batch_size = 1
per_device_eval_batch_size = 1
gradient_accumulation_steps = 4
logging_steps = 5
learning_rate = 1e-4
max_grad_norm = 1.0
num_train_epochs=1
warmup_ratio = 0.1
lr_scheduler_type = "cosine"
max_seq_length = 2048

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    save_strategy="no",
    evaluation_strategy="epoch",
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    max_grad_norm=max_grad_norm,
    weight_decay=0.1,
    warmup_ratio=warmup_ratio,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard",
    bf16=True,
    hub_private_repo=False,
    push_to_hub=False,
    num_train_epochs=num_train_epochs,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False}
)
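
Two of these settings are easier to grasp with numbers. With `per_device_train_batch_size = 1` and `gradient_accumulation_steps = 4`, each optimizer step sees 4 sequences per device. The learning rate then follows a linear warmup followed by a cosine decay. The function below is a simplified sketch of that shape, not the library implementation, and `total_steps=100` is a hypothetical value chosen for readability.

```python
import math


def cosine_with_warmup(step, total_steps, base_lr=1e-4, warmup_ratio=0.1):
    """Simplified sketch: linear warmup to base_lr, then cosine decay towards 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase already completed, between 0 and 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


effective_batch_size = 1 * 4  # per_device_train_batch_size * gradient_accumulation_steps
print("effective batch size per device:", effective_batch_size)

for step in (0, 5, 10, 55, 100):
    print(step, f"{cosine_with_warmup(step, total_steps=100):.2e}")
```

The warmup avoids large updates while the freshly added token embeddings are still far from useful values.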



In [10]:
trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    packing=True,
    dataset_text_field="content",
    max_seq_length=max_seq_length,
    peft_config=peft_config,
    dataset_kwargs={
        "append_concat_token": False,
        "add_special_tokens": False,
    },
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
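
The `packing=True` option deserves a word. The idea is to concatenate the tokenized examples into one long stream and cut it into blocks of `max_seq_length` tokens, so that no compute is wasted on padding. The function below is a toy sketch of that idea under simplifying assumptions (no separator token between examples, incomplete last block dropped); the helper name `pack` is hypothetical and this is not TRL's actual code.

```python
def pack(tokenized_examples, block_size):
    """Concatenate token lists and split the stream into full blocks of block_size tokens."""
    stream = [token for example in tokenized_examples for token in example]
    return [
        stream[start:start + block_size]
        for start in range(0, len(stream) - block_size + 1, block_size)
    ]


# Three toy "tokenized" examples of different lengths, packed into blocks of 4 tokens
blocks = pack([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

A side effect is that a block can contain the end of one conversation and the start of the next, which is one reason the `<eos>` token at the end of each turn matters.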


In [11]:
trainer.train()
trainer.save_model()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Epoch,Training Loss,Validation Loss
0,0.367,0.340518




In [12]:
trainer.push_to_hub()


Upload 7 LFS files:   0%|          | 0/7 [00:00<?, ?it/s]

events.out.tfevents.1739727155.r-jofthomas-fttest-0ihwmg95-70a55-shjb6:   0%|          | 0.00/22.1k [00:00<?, …

events.out.tfevents.1739725934.r-jofthomas-fttest-0ihwmg95-70a55-shjb6:   0%|          | 0.00/21.5k [00:00<?, …

adapter_model.safetensors:   0%|          | 0.00/2.42G [00:00<?, ?B/s]

events.out.tfevents.1739728410.r-jofthomas-fttest-0ihwmg95-70a55-shjb6:   0%|          | 0.00/22.1k [00:00<?, …

events.out.tfevents.1739724308.r-jofthomas-fttest-0ihwmg95-70a55-shjb6:   0%|          | 0.00/21.5k [00:00<?, …

tokenizer.json:   0%|          | 0.00/34.4M [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/5.69k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Jofthomas/gemma-2-2B-it-thinking-function_calling/commit/79a33a3d552cd3dbfaccd75dae85255a27024bad', commit_message='End of training', commit_description='', oid='79a33a3d552cd3dbfaccd75dae85255a27024bad', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Jofthomas/gemma-2-2B-it-thinking-function_calling', endpoint='https://huggingface.co', repo_type='model', repo_id='Jofthomas/gemma-2-2B-it-thinking-function_calling'), pr_revision=None, pr_num=None)

In [13]:
tokenizer.eos_token = "<eos>"
# push the tokenizer to hub
tokenizer.push_to_hub("Jofthomas/gemma-2-2B-it-thinking-function_calling", token=True)

README.md:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Jofthomas/gemma-2-2B-it-thinking-function_calling/commit/b3612520a79067b666d3776d44fdbfa8c50a7923', commit_message='Upload tokenizer', commit_description='', oid='b3612520a79067b666d3776d44fdbfa8c50a7923', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Jofthomas/gemma-2-2B-it-thinking-function_calling', endpoint='https://huggingface.co', repo_type='model', repo_id='Jofthomas/gemma-2-2B-it-thinking-function_calling'), pr_revision=None, pr_num=None)

In [14]:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from datasets import load_dataset
import torch

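# Optional 4-bit quantization config. Note: it is defined here but not passed to from_pretrained below.
bnb_config = BitsAndBytesConfig(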
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
        )

peft_model_id = "Jofthomas/gemma-2-2B-it-thinking-function_calling"
device = "auto"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path,
                                             device_map="auto",
                                             )
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
model.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(model, peft_model_id)
model.to(torch.bfloat16)
# model.cuda()
model.eval()


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Gemma2ForCausalLM(
      (model): Gemma2Model(
        (embed_tokens): lora.Embedding(
          (base_layer): Embedding(256006, 2304, padding_idx=0)
          (lora_dropout): ModuleDict(
            (default): Dropout(p=0.1, inplace=False)
          )
          (lora_A): ModuleDict()
          (lora_B): ModuleDict()
          (lora_embedding_A): ParameterDict(  (default): Parameter containing: [torch.cuda.BFloat16Tensor of size 8x256006 (cuda:0)])
          (lora_embedding_B): ParameterDict(  (default): Parameter containing: [torch.cuda.BFloat16Tensor of size 2304x8 (cuda:0)])
          (lora_magnitude_vector): ModuleDict()
        )
        (layers): ModuleList(
          (0-25): 26 x Gemma2DecoderLayer(
            (self_attn): Gemma2Attention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=2304, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
             

In [21]:
prompt="""<bos><start_of_turn>human
You are a function calling AI model. You are provided with function signatures within <tools> </tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.
<tools>
[{'type': 'function', 'function': {'name': 'ConvertYenToEur', 'description': 'converts a price in Yen into Euros', 'parameters': {'type': 'object', 'properties': {'yen': {'type': 'string', 'description': 'The price in Yen'}}}}]
</tools>
For each function call return a json object (not a json blob needed ```json```) the function name and arguments within <tool_call> </tool_call> tags with the following schema:
<think>
I need to use ...
</think>
<tool_call>
{'arguments': <args-dict>, 'name': <function-name>}
</tool_call>
How much is 24000 Yens in Euros ?
<start_of_turn>model
<think>

"""
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
inputs = {k: v.to(model.device) for k, v in inputs.items()}  # follow the model's device instead of hard-coding "cuda"
outputs = model.generate(**inputs, 
                         max_new_tokens=300, 
                         do_sample=True, 
                         top_p=0.95, 
                         temperature=0.01, 
                         repetition_penalty=1.0, 
                         eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))

<bos><start_of_turn>human
You are a function calling AI model. You are provided with function signatures within <tools> </tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.
<tools>
[{'type': 'function', 'function': {'name': 'ConvertYenToEur', 'description': 'converts a price in Yen into Euros', 'parameters': {'type': 'object', 'properties': {'yen': {'type': 'string', 'description': 'The price in Yen'}}}}]
</tools>
For each function call return a json object (not a json blob needed ```json```) the function name and arguments within <tool_call> </tool_call> tags with the following schema:
<think>
I need to use ...
</think>
<tool_call>
{'arguments': <args-dict>, 'name': <function-name>}
</tool_call>
How much is 24000 Yens in Euros ?
<start_of_turn>model
<think>

That's a great question! To answer it accurately, I need to convert 24000 Yen into Euros. The function 'ConvertYenToEur' is designed 

In [22]:
# Merge the adapter into the base model.
# merge_and_unload() returns the plain base model, so only call it while `model` is still a PeftModel.
if isinstance(model, PeftModel):
    model = model.merge_and_unload()

In [5]:
outputs = model.generate(**inputs, 
                         max_new_tokens=128, 
                         do_sample=True, 
                         top_p=0.95, 
                         temperature=0.01, 
                         repetition_penalty=1.0, 
                         eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))

<bos><start_of_turn>human
You are a function calling AI model. You are provided with function signatures within <tools> </tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.
<tools>
[{'type': 'function', 'function': {'name': 'ConvertYenToEur', 'description': 'converts a price in Yen into Euros', 'parameters': {'type': 'object', 'properties': {'yen': {'type': 'string', 'description': 'The price in Yen'}}}}]
</tools>
For each function call return a json object (not a json blob needed ```json```) the function name and arguments within <tool_call> </tool_call> tags with the following schema:
<tool_call>
{'arguments': <args-dict>, 'name': <function-name>}
</tool_call>
How much is 24000 Yens in Euros ?
<start_of_turn>model
</tool_call>
</tool_call>
{
  "arguments": {
    "yen": 24000
  },
  "name": "ConvertYenToEur"
}
</tool_call><end_of_turn><eos>
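
Once the model emits a tool call, your application still has to extract it before calling the real function. Here is a tolerant parsing sketch. It assumes you pass only the model's completion (the text after `<start_of_turn>model`), and it accepts both valid JSON, as in the output above, and the Python-dict style with single quotes shown in the prompt. The helper name `parse_tool_calls` is hypothetical.

```python
import ast
import json


def parse_tool_calls(completion):
    """Extract tool-call dicts that appear before each closing </tool_call> tag."""
    calls = []
    for chunk in completion.split("</tool_call>"):
        start, end = chunk.find("{"), chunk.rfind("}")
        if start == -1 or end < start:
            continue  # no object in this chunk
        payload = chunk[start:end + 1]
        try:
            call = json.loads(payload)           # strict JSON (double quotes)
        except json.JSONDecodeError:
            try:
                call = ast.literal_eval(payload)  # fallback: Python-dict style (single quotes)
            except (ValueError, SyntaxError):
                continue
        if isinstance(call, dict) and "name" in call:
            calls.append(call)
    return calls


# The completion produced by the cell above
completion = """</tool_call>
</tool_call>
{
  "arguments": {
    "yen": 24000
  },
  "name": "ConvertYenToEur"
}
</tool_call><end_of_turn><eos>"""

print(parse_tool_calls(completion))
# [{'arguments': {'yen': 24000}, 'name': 'ConvertYenToEur'}]
```

Splitting on the closing tag rather than matching a full `<tool_call>...</tool_call>` pair keeps the parser working even when the opening tag is missing, as happens in the output above.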


In [1]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Define the LoRA adapter ID
peft_model_id = "Jofthomas/gemma-2-2B-it-function_calling"

# Load the configuration of the LoRA adapter
peft_config = PeftConfig.from_pretrained(peft_model_id)

# Read the base model ID from the adapter configuration
base_model_id = peft_config.base_model_name_or_path
print(f"Base model ID: {base_model_id}")

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)

# Load the base model and resize its embeddings to match the fine-tuned tokenizer
# (the LoRA adapter itself is not loaded in this cell)
model = AutoModelForCausalLM.from_pretrained(base_model_id)
model.resize_token_embeddings(len(tokenizer))


Base model ID: google/gemma-2-2b-it



Embedding(256004, 2304, padding_idx=0)

In [3]:
#model.push_to_hub("gemma-resized")
peft_model_id = "Jofthomas/gemma-2-2B-it-function_calling"
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
tokenizer.push_to_hub("gemma-resized")

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Jofthomas/gemma-resized/commit/ea89ab96187d414447af990ee561eadd331daffa', commit_message='Upload tokenizer', commit_description='', oid='ea89ab96187d414447af990ee561eadd331daffa', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Jofthomas/gemma-resized', endpoint='https://huggingface.co', repo_type='model', repo_id='Jofthomas/gemma-resized'), pr_revision=None, pr_num=None)