# Fine-Tuning a model for Function-Calling and Thinking

Parameter-Efficient Fine-Tuning (PEFT) is a family of techniques that aims for results comparable to full fine-tuning while reducing computational and storage costs. It works as follows:

1. Adapter-based Tuning: Instead of fine-tuning all the parameters of a massive LLM, PEFT introduces small, task-specific adapter modules that are inserted into the model's architecture.

2. Freezing Base Model: During training, the pre-trained model's weights are frozen, and only the parameters of these adapter modules are updated. This dramatically reduces the number of trainable parameters, making the process more efficient.

3. Low-Rank Adaptation (LoRA): The peft library relies heavily on the LoRA technique, which injects low-rank matrices into the transformer layers of the LLM. This allows for a compact representation of the task-specific knowledge.

## Install dependencies

In [None]:
!pip install bitsandbytes # for memory optimization and quantization wrapper around CUDA

In [None]:
!pip install peft # parameter-efficient fine-tuning for efficiently adapting pre-trained LLMs to downstream tasks

In [3]:
!pip install -q -U trl # Transformer Reinforcement Learning package from Hugging Face: a framework for training and fine-tuning with reinforcement learning algorithms


In [4]:
!pip install -q -U tensorboardX


In [5]:
!pip install -q wandb

In [4]:
from google.colab import drive
drive.mount('/content/drive')

%cd /content/drive/MyDrive/ColabNotebooks/AgentsCourse

Mounted at /content/drive
/content/drive/MyDrive/ColabNotebooks/AgentsCourse


## Library Imports

In [5]:
import json
import torch
import pandas as pd

from enum import Enum
from functools import partial

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model, TaskType
from trl import SFTConfig, SFTTrainer

from google.colab import userdata
from huggingface_hub import login, whoami

hf_hub_token = userdata.get('huggingface_hub_access_token')
login(token=hf_hub_token)

seed = 123
set_seed(seed)

In [6]:
model_name = "google/gemma-2-2b-it" # start with the instruction-tuned model instead of the base model google/gemma-2-2b to minimize the amount of information the model needs to learn
dataset_name = "Jofthomas/hermes-function-calling-thinking-V1"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# the tokenizer uses a template to format chat inputs: it inserts the beginning- and end-of-sequence tokens, turn markers, etc.
tokenizer.chat_template = "{{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{{ '<start_of_turn>' + message['role'] + '\n' + message['content'] | trim + '<end_of_turn><eos>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"
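
To make the effect of the template concrete, here is a minimal pure-Python sketch that mirrors the Jinja template above without needing the tokenizer. The helper name `render_gemma_chat` and the sample messages are illustrative assumptions; in the notebook itself `tokenizer.apply_chat_template` does this work.

```python
def render_gemma_chat(messages, bos_token="<bos>", add_generation_prompt=False):
    """Mirror the chat template with plain string operations (illustrative helper)."""
    if messages and messages[0]["role"] == "system":
        # the template refuses a leading system message
        raise ValueError("System role not supported")
    text = bos_token
    for message in messages:
        # the `trim` filter applies to the message content only
        text += (
            "<start_of_turn>" + message["role"] + "\n"
            + message["content"].strip()
            + "<end_of_turn><eos>\n"
        )
    if add_generation_prompt:
        # open a model turn so generation continues from here
        text += "<start_of_turn>model\n"
    return text


# hypothetical two-turn conversation
sample_messages = [
    {"role": "human", "content": "  Convert 500 USD to EUR.  "},
    {"role": "model", "content": "<think>I should call convert_currency.</think>"},
]
rendered = render_gemma_chat(sample_messages, add_generation_prompt=True)
print(rendered)
```

Every turn is wrapped in `<start_of_turn>` and `<end_of_turn><eos>`, which is the layout visible in the formatted dataset samples printed later in this notebook.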



### Format data

In [7]:
def preprocess(sample):
    """Add the thinking instruction while turning the list of messages into a prompt."""

    messages = sample["messages"]
    first_message = messages[0]

    # The chat template rejects the system role, so merge the system content into the first user message
    if first_message["role"] == "system":
        system_message_content = first_message["content"]
        # Merge system content with the first user message
        messages[1]["content"] = system_message_content + "Also, before making a call to a function take the time to plan the function to take. Make that thinking process between <think>{your thoughts}</think>\n\n" + messages[1]["content"]
        # Remove the system message from the conversation
        messages.pop(0)

    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}
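
The merging step can be tried in isolation, without the tokenizer. The sketch below is an assumption-laden stand-in for the first half of `preprocess`: the helper name `merge_system_into_first_user` and the toy conversation are hypothetical, and unlike `preprocess` it works on a copy so the input is left untouched.

```python
import copy

# same instruction text that preprocess() inserts between the system and user content
THINK_INSTRUCTION = (
    "Also, before making a call to a function take the time to plan the function to take. "
    "Make that thinking process between <think>{your thoughts}</think>\n\n"
)


def merge_system_into_first_user(messages):
    """Return a copy of `messages` with a leading system message folded into the next turn."""
    messages = copy.deepcopy(messages)  # do not mutate the caller's list
    if len(messages) > 1 and messages[0]["role"] == "system":
        system_content = messages[0]["content"]
        messages[1]["content"] = system_content + THINK_INSTRUCTION + messages[1]["content"]
        messages.pop(0)  # the system turn is now redundant
    return messages


# hypothetical two-message conversation
toy_conversation = [
    {"role": "system", "content": "You are a function calling AI model."},
    {"role": "human", "content": "Convert 500 USD to EUR."},
]
merged = merge_system_into_first_user(toy_conversation)
print(len(merged), merged[0]["role"])
print(merged[0]["content"])
```

The result is a single `human` turn whose content is the system prompt, the thinking instruction, and the original user request, in that order.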

In [8]:
dataset = load_dataset(dataset_name)
print(dataset)
feature_names = list(dataset["train"].features)  # Get a list of feature names
print(feature_names[0:5])
dataset = dataset.rename_column("conversations", "messages")


DatasetDict({
    train: Dataset({
        features: ['conversations'],
        num_rows: 3570
    })
})
['conversations']


## Custom dataset

Add the **Thinking** step to the function-calling dataset NousResearch/hermes-function-calling-v1, so that the model has room to generate some thinking tokens before any function call.

In [9]:
dataset = dataset.map(preprocess, remove_columns="messages")
dataset = dataset["train"].train_test_split(0.1)
print(dataset)

Map:   0%|          | 0/3570 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 3213
    })
    test: Dataset({
        features: ['text'],
        num_rows: 357
    })
})


## Inputs

In [12]:
# print the tokenizer's special tokens
tokenizer.special_tokens_map

{'bos_token': '<bos>',
 'eos_token': '<eos>',
 'unk_token': '<unk>',
 'pad_token': '<pad>',
 'additional_special_tokens': ['<start_of_turn>', '<end_of_turn>']}

In [13]:
# look at how the dataset is formatted
print(dataset["train"][8]["text"])

<bos><start_of_turn>human
You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags.You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.Here are the available tools:<tools> [{'type': 'function', 'function': {'name': 'get_stock_price', 'description': 'Get the current stock price of a company', 'parameters': {'type': 'object', 'properties': {'company_name': {'type': 'string', 'description': 'The name of the company'}, 'exchange': {'type': 'string', 'description': 'The stock exchange where the company is listed'}}, 'required': ['company_name', 'exchange']}}}, {'type': 'function', 'function': {'name': 'search_restaurants', 'description': 'Search for restaurants based on cuisine or location', 'parameters': {'type': 'object', 'properties': {'cuisine': {'type': 'string', 'description': 'The cuisine of the restaurant'}, 'location': {'type': 'string', 'description':

## Modify Tokenizer


Add the segmenting tokens think, tool_call, and tool_response to the tokenizer and modify the chat_template

In [15]:
class ChatmlSpecialTokens(str, Enum):
    """Special tokens, defined as strings, that mark different parts of the text: tool lists,
    thinking steps, tool calls, tool responses, padding and end of sequence."""
    tools = "<tools>"
    eotools = "</tools>"
    think = "<think>"
    eothink = "</think>"
    tool_call = "<tool_call>"
    eotool_call = "</tool_call>"
    tool_response = "<tool_response>"
    eotool_response = "</tool_response>"
    pad_token = "<pad>"
    eos_token = "<eos>"

    @classmethod
    def list(cls):
        """Return a list of all the special tokens,
        used to add these tokens to the tokenizer's vocabulary."""
        return [c.value for c in cls]

# update the tokenizer with newly defined special tokens
tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        pad_token=ChatmlSpecialTokens.pad_token.value,
        additional_special_tokens=ChatmlSpecialTokens.list()
    )

# set the chat template, written in Jinja2 syntax, which defines how the tokenizer
# formats chat inputs and where it inserts the special tokens
tokenizer.chat_template = "{{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{{ '<start_of_turn>' + message['role'] + '\n' + message['content'] | trim + '<end_of_turn><eos>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             attn_implementation='eager',
                                             device_map="auto")
model.resize_token_embeddings(len(tokenizer))
model.to(torch.bfloat16)
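
The reason `resize_token_embeddings` is needed can be checked with a little bookkeeping. The sketch below re-declares the enum so that it runs on its own (the class name `SpecialTokens` is illustrative), takes the pre-existing special tokens from the `special_tokens_map` output printed earlier, and assumes a base vocabulary of 256000 entries, a number inferred from the embedding size in the model printout rather than read from the tokenizer.

```python
from enum import Enum


class SpecialTokens(str, Enum):
    """Stand-alone copy of the token enum, for illustration only."""
    tools = "<tools>"
    eotools = "</tools>"
    think = "<think>"
    eothink = "</think>"
    tool_call = "<tool_call>"
    eotool_call = "</tool_call>"
    tool_response = "<tool_response>"
    eotool_response = "</tool_response>"
    pad_token = "<pad>"
    eos_token = "<eos>"

    @classmethod
    def list(cls):
        return [c.value for c in cls]


# special tokens the tokenizer already had (from the special_tokens_map output)
existing_special_tokens = {"<bos>", "<eos>", "<unk>", "<pad>", "<start_of_turn>", "<end_of_turn>"}
base_vocab_size = 256000  # assumption: inferred from the resized embedding size

# only tokens that are not already known enlarge the vocabulary
new_tokens = [t for t in SpecialTokens.list() if t not in existing_special_tokens]
expected_vocab_size = base_vocab_size + len(new_tokens)

print(new_tokens)
print(expected_vocab_size)
```

Eight of the ten listed tokens are new (`<pad>` and `<eos>` already exist), which under the stated assumption gives 256008 rows, the size shown for `embed_tokens` in the model printout.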


The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Gemma2ForCausalLM(
  (model): Gemma2Model(
    (embed_tokens): Embedding(256008, 2304, padding_idx=0)
    (layers): ModuleList(
      (0-25): 26 x Gemma2DecoderLayer(
        (self_attn): Gemma2Attention(
          (q_proj): Linear(in_features=2304, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2304, out_features=1024, bias=False)
          (v_proj): Linear(in_features=2304, out_features=1024, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2304, bias=False)
        )
        (mlp): Gemma2MLP(
          (gate_proj): Linear(in_features=2304, out_features=9216, bias=False)
          (up_proj): Linear(in_features=2304, out_features=9216, bias=False)
          (down_proj): Linear(in_features=9216, out_features=2304, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): Gemma2RMSNorm((2304,), eps=1e-06)
        (post_attention_layernorm): Gemma2RMSNorm((2304,), eps=1e-06)
        (pre_feedforward_layernorm): Gemm

## Configure LoRA

Low-Rank Adaptation of Large Language Models (LoRA) is a lightweight training technique that reduces the number of trainable parameters.

LoRA works by adding pairs of rank-decomposition matrices to Transformer layers, typically the linear layers. During training, the rest of the model is frozen and only the weights of the newly added adapters are updated.

As a result, the number of parameters that need to be trained drops considerably, since only the adapter weights are updated.
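
As a rough illustration of the idea, the sketch below computes a LoRA-style forward pass, h = W x + (alpha / r) * B (A x), on tiny hand-written matrices in plain Python. All values and helper names are made up for the example, and the zero initialization of B follows a common LoRA convention rather than anything shown in this notebook.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank update."""
    base = matvec(W, x)               # frozen pre-trained weights
    update = matvec(B, matvec(A, x))  # low-rank path: project down to r, then back up
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# toy sizes: 4 inputs, 3 outputs, rank r = 2
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]                           # frozen, shape 3 x 4
A = [[0.1, 0.2, 0.3, 0.4],
     [0.0, 0.0, 0.0, 1.0]]                           # trainable, shape r x 4
B_init = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]        # trainable, shape 3 x r, zero at start
B_trained = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]     # pretend values after some training
x = [1.0, 2.0, 3.0, 4.0]

h_start = lora_forward(W, A, B_init, x, alpha=4, r=2)
h_later = lora_forward(W, A, B_trained, x, alpha=4, r=2)
print(h_start)
print(h_later)
```

With B at zero the adapter contributes nothing, so the model starts from exactly the pre-trained behaviour; only A and B change during training while W stays fixed.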

Define the adapter parameters

In [16]:
from peft import LoraConfig, TaskType

# LoRA parameters
# r: rank dimension for LoRA update matrices (smaller = more compression)
rank_dimension = 16
# lora_alpha: scaling factor for LoRA layers (higher = stronger adaptation)
lora_alpha = 64
# lora_dropout: dropout probability for LoRA layers (helps prevent overfitting)
lora_dropout = 0.05

peft_config = LoraConfig(r=rank_dimension,
                         lora_alpha=lora_alpha,
                         lora_dropout=lora_dropout,
                         target_modules=["gate_proj","q_proj","lm_head","o_proj","k_proj","embed_tokens","down_proj","up_proj","v_proj"], # which modules of the transformer to target
                         task_type=TaskType.CAUSAL_LM)
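
To get a feel for the savings, the following back-of-the-envelope sketch counts LoRA parameters for the seven linear projections of each decoder layer, using the shapes from the Gemma2 module printout above and the rank chosen in this cell. It deliberately leaves out `embed_tokens` and `lm_head`, and the variable names are illustrative.

```python
# (in_features, out_features) per projection, copied from the model printout
linear_shapes = {
    "q_proj": (2304, 2048),
    "k_proj": (2304, 1024),
    "v_proj": (2304, 1024),
    "o_proj": (2048, 2304),
    "gate_proj": (2304, 9216),
    "up_proj": (2304, 9216),
    "down_proj": (9216, 2304),
}
num_layers = 26        # decoder layers in the printout
rank = 16              # r in LoraConfig
alpha = 64             # lora_alpha in LoraConfig


def lora_param_count(in_features, out_features, r):
    """A is r x in_features and B is out_features x r."""
    return r * (in_features + out_features)


lora_per_layer = sum(lora_param_count(i, o, rank) for i, o in linear_shapes.values())
full_per_layer = sum(i * o for i, o in linear_shapes.values())
lora_total = lora_per_layer * num_layers
scaling = alpha / rank  # factor applied to the low-rank update

print(f"LoRA parameters per layer: {lora_per_layer:,}")
print(f"Frozen parameters per layer (same projections): {full_per_layer:,}")
print(f"LoRA parameters over {num_layers} layers: {lora_total:,}")
print(f"Share of trainable parameters: {lora_per_layer / full_per_layer:.2%}")
print(f"Scaling factor alpha / r: {scaling}")
```

For these projections the adapters amount to roughly one percent of the weights they sit next to. The count for the whole run is larger because the configuration above also targets `embed_tokens` and `lm_head`.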

## Fine-Tuning model and hyperparameters

In [17]:
username="elliemci"
output_dir = "gemma-2-2B-it-thinking-function_calling-V0" # directory for trained model checkpoints and logs; also the default name of the model when pushed to the Hub
per_device_train_batch_size = 1
per_device_eval_batch_size = 1
gradient_accumulation_steps = 4
logging_steps = 5
learning_rate = 1e-4 # initial learning rate for the optimizer

max_grad_norm = 1.0
num_train_epochs=1
warmup_ratio = 0.1
lr_scheduler_type = "cosine"
max_seq_length = 1500

training_arguments = SFTConfig(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    save_strategy="no",
    eval_strategy="epoch",
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    max_grad_norm=max_grad_norm,
    weight_decay=0.1,
    warmup_ratio=warmup_ratio,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard",
    bf16=True,
    hub_private_repo=False,
    push_to_hub=False,
    num_train_epochs=num_train_epochs,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    packing=True,
    max_seq_length=max_seq_length,
)
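
Two of these settings are easier to reason about with numbers: the effective batch size and the shape of the learning-rate schedule. The sketch below is a simplified re-implementation of a linear warmup followed by cosine decay, not the scheduler code used by the trainer, and `total_steps = 200` is a hypothetical value because the real step count depends on how packing groups the samples.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay towards zero (simplified sketch)."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# per_device_train_batch_size * gradient_accumulation_steps on a single device
effective_batch_size = 1 * 4

total_steps = 200  # hypothetical number of optimizer steps
for step in (0, 10, 20, 110, 200):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
print("effective batch size:", effective_batch_size)
```

With `packing=True` each item in a batch is a packed sequence of up to `max_seq_length` tokens, so one optimizer step sees at most about 4 x 1500 tokens under these settings.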

In [19]:
# create the Supervised Fine-Tuning trainer
trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
    peft_config=peft_config,
)

In [20]:
trainer.train()
trainer.save_model()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 0 | 0.3112 | 0.299605 |
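
One way to read these numbers is to convert them to perplexity. The snippet below assumes the reported losses are mean per-token cross-entropy values in nats, which is the usual convention for causal language model training but is not stated in the output itself.

```python
import math

train_loss = 0.3112    # training loss from the table above
eval_loss = 0.299605   # validation loss from the table above

# perplexity is the exponential of the mean per-token cross-entropy
train_perplexity = math.exp(train_loss)
eval_perplexity = math.exp(eval_loss)

print(f"train perplexity: {train_perplexity:.3f}")
print(f"eval perplexity:  {eval_perplexity:.3f}")
```

Values this close to 1 are plausible here because most tokens in each sample belong to the long, highly repetitive tool-description prompt.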


The 'batch_size' argument of HybridCache is deprecated and will be removed in v4.49. Use the more precisely named 'max_batch_size' argument instead.
The 'batch_size' attribute of HybridCache is deprecated and will be removed in v4.49. Use the more precisely named 'self.max_batch_size' attribute instead.


In [21]:
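# note: the first positional argument of push_to_hub is the commit message (see commit_message in the output below);
# the target repository name is derived from output_dir
trainer.push_to_hub(f"{username}/{output_dir}")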



Upload 5 LFS files:   0%|          | 0/5 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/34.4M [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

events.out.tfevents.1740534646.fdfa8332f5ec.1478.0:   0%|          | 0.00/35.4k [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/5.62k [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/2.48G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/elliemci/gemma-2-2B-it-thinking-function_calling-V0/commit/8491f4856efd88f3a8fbc16d35e04ad0a23e417b', commit_message='elliemci/gemma-2-2B-it-thinking-function_calling-V0', commit_description='', oid='8491f4856efd88f3a8fbc16d35e04ad0a23e417b', pr_url=None, repo_url=RepoUrl('https://huggingface.co/elliemci/gemma-2-2B-it-thinking-function_calling-V0', endpoint='https://huggingface.co', repo_type='model', repo_id='elliemci/gemma-2-2B-it-thinking-function_calling-V0'), pr_revision=None, pr_num=None)

In [22]:
tokenizer.eos_token = "<eos>"
tokenizer.push_to_hub(f"{username}/{output_dir}", token=True)

README.md:   0%|          | 0.00/1.54k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/elliemci/gemma-2-2B-it-thinking-function_calling-V0/commit/097f8648a29ccf55242541aaba5d8e84fecb6756', commit_message='Upload tokenizer', commit_description='', oid='097f8648a29ccf55242541aaba5d8e84fecb6756', pr_url=None, repo_url=RepoUrl('https://huggingface.co/elliemci/gemma-2-2B-it-thinking-function_calling-V0', endpoint='https://huggingface.co', repo_type='model', repo_id='elliemci/gemma-2-2B-it-thinking-function_calling-V0'), pr_revision=None, pr_num=None)

## Test the Model

Load the pre-trained language model, resize its embeddings to account for the new tokens, and apply the fine-tuned adapter to it. Parameter-Efficient Fine-Tuning adapts the base model by training only a small set of parameters while the base weights stay frozen.

In [12]:
from peft import PeftModel, PeftConfig # tools for loading and working with Parameter-Efficient Fine-Tuning models
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from datasets import load_dataset
import torch

# optional 4-bit quantization config to reduce memory use; it is defined here but not passed
# to from_pretrained below, so the model is loaded unquantized
bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
        )

username="elliemci"
output_dir = "gemma-2-2B-it-thinking-function_calling-V0"

peft_model_id = f"{username}/{output_dir}"
device = "auto"

# load the configuration of the PEFT adapter
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path,
                                             device_map="auto",
                                             )
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
# adjusts the model's token embeddings to match the size of the tokenizer's vocabulary because of the addition of new tokens
model.resize_token_embeddings(len(tokenizer))

# combine the loaded base model with the PEFT adapter into the final fine-tuned model
model = PeftModel.from_pretrained(model, peft_model_id)
# cast the model to bfloat16 precision, which improves inference speed and memory efficiency
model.to(torch.bfloat16)
model.eval()


The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


adapter_model.safetensors:   0%|          | 0.00/2.48G [00:00<?, ?B/s]

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Gemma2ForCausalLM(
      (model): Gemma2Model(
        (embed_tokens): lora.Embedding(
          (base_layer): Embedding(256008, 2304, padding_idx=0)
          (lora_dropout): ModuleDict(
            (default): Dropout(p=0.05, inplace=False)
          )
          (lora_A): ModuleDict()
          (lora_B): ModuleDict()
          (lora_embedding_A): ParameterDict(  (default): Parameter containing: [torch.cuda.BFloat16Tensor of size 16x256008 (cuda:0)])
          (lora_embedding_B): ParameterDict(  (default): Parameter containing: [torch.cuda.BFloat16Tensor of size 2304x16 (cuda:0)])
          (lora_magnitude_vector): ModuleDict()
        )
        (layers): ModuleList(
          (0-25): 26 x Gemma2DecoderLayer(
            (self_attn): Gemma2Attention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=2304, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
          

In [10]:
# from previously preprocessed, mapped and split "Jofthomas/hermes-function-calling-thinking-V1" dataset
print(dataset["test"][8]["text"])

<bos><start_of_turn>human
You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags.You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.Here are the available tools:<tools> [{'type': 'function', 'function': {'name': 'get_movie_details', 'description': 'Get the details of a movie', 'parameters': {'type': 'object', 'properties': {'title': {'type': 'string', 'description': 'The title of the movie'}, 'year': {'type': 'integer', 'description': 'The release year of the movie'}, 'language': {'type': 'string', 'description': 'The language of the movie'}}, 'required': ['title', 'year']}}}, {'type': 'function', 'function': {'name': 'calculate_loan_payment', 'description': 'Calculate the monthly loan payment', 'parameters': {'type': 'object', 'properties': {'loan_amount': {'type': 'number', 'description': 'The total loan amount'}, 'interest_rate': {'type': 'number', '

### Prompt Engineering

A multiline prompt structure defines the instructions and context for the fine-tuned model: it tells the model that it is a function-calling AI, which tools are available and how to use them, specifies the schema for tool calls, and asks for a thinking step before calling a function.

In [None]:
# prompt taken from one of the test set examples and cut right after the model turn opens with <think>, so that generation continues from there
prompt="""<bos><start_of_turn>human
You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags.You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.Here are the available tools:<tools> [{'type': 'function', 'function': {'name': 'convert_currency', 'description': 'Convert from one currency to another', 'parameters': {'type': 'object', 'properties': {'amount': {'type': 'number', 'description': 'The amount to convert'}, 'from_currency': {'type': 'string', 'description': 'The currency to convert from'}, 'to_currency': {'type': 'string', 'description': 'The currency to convert to'}}, 'required': ['amount', 'from_currency', 'to_currency']}}}, {'type': 'function', 'function': {'name': 'calculate_distance', 'description': 'Calculate the distance between two locations', 'parameters': {'type': 'object', 'properties': {'start_location': {'type': 'string', 'description': 'The starting location'}, 'end_location': {'type': 'string', 'description': 'The ending location'}}, 'required': ['start_location', 'end_location']}}}] </tools>Use the following pydantic model json schema for each tool call you will make: {'title': 'FunctionCall', 'type': 'object', 'properties': {'arguments': {'title': 'Arguments', 'type': 'object'}, 'name': {'title': 'Name', 'type': 'string'}}, 'required': ['arguments', 'name']}For each function call return a json object with function name and arguments within <tool_call></tool_call> XML tags as follows:
<tool_call>
{tool_call}
</tool_call>Also, before making a call to a function take the time to plan the function to take. Make that thinking process between <think>{your thoughts}</think>

Hi, I need to convert 500 USD to Euros. Can you help me with that?<end_of_turn><eos>
<start_of_turn>model
<think>"""

### Tokenization

The prompt is processed by the tokenizer and converted into a numerical representation. The special tokens are already included in the prompt, so the tokenizer is told not to add them again.

In [None]:
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
# move the tokenized input tensors to the GPU
inputs = {k: v.to("cuda") for k,v in inputs.items()}
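
The reason for `add_special_tokens=False` can be shown with a toy stand-in for the tokenizer. Everything below is hypothetical (the three-word vocabulary, the `toy_encode` helper); it only imitates the behaviour, assumed here for the real tokenizer, of prepending a beginning-of-sequence id when special tokens are added automatically.

```python
# toy vocabulary; id 2 for <bos> matches the first id in the real input_ids printed below
TOY_VOCAB = {"<bos>": 2, "hello": 10, "world": 11}


def toy_encode(text, add_special_tokens=True):
    """Toy tokenizer: split on whitespace and optionally prepend <bos>."""
    pieces = text.replace("<bos>", "<bos> ").split()
    ids = [TOY_VOCAB[piece] for piece in pieces]
    if add_special_tokens:
        ids = [TOY_VOCAB["<bos>"]] + ids  # automatic beginning-of-sequence token
    return ids


prompt_with_bos = "<bos>hello world"  # the prompt text already starts with <bos>

ids_duplicated = toy_encode(prompt_with_bos, add_special_tokens=True)
ids_clean = toy_encode(prompt_with_bos, add_special_tokens=False)
print(ids_duplicated)
print(ids_clean)
```

Since the prompt string already begins with `<bos>`, letting the tokenizer add another one would produce a sequence that differs from what the model saw during fine-tuning.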

In [19]:
# unpacking inputs dictionary into key-value pairs with ** operator
unpacked_inputs = {**inputs}
print(unpacked_inputs.keys())
print(unpacked_inputs)

dict_keys(['input_ids', 'attention_mask'])
{'input_ids': tensor([[     2,    106,  17877,    108,   2045,    708,    476,   1411,  11816,
          16481,   2091, 235265,   1646,    708,   4646,    675,   1411,  48907,
           2819, 235248, 256000, 256001,  26176,  16323, 235265,   2045,   1249,
           2409,    974,    689,    978,   7257,    577,   5422,    675,    573,
           2425,   8164, 235265,   4257, 235303, 235251,   1501,  29125,   1105,
           1212,   4035,    577,  18330,   1280,   7257, 235265,   4858,    708,
            573,   2506,   8112, 235292, 256000,  20411, 235303,   1425,   2130,
            777,   1929,    920,    777,   1929,   2130,  19276,   1067,   2130,
            777,  20700, 235298,  16093,    920,    777,   6448,   2130,    777,
          29039,    774,    974,  18968,    577,   2550,    920,    777,  17631,
           2130,  19276,   1425,   2130,    777,   4736,    920,    777,  16696,
           2130,  19276,  10526,   2130,  19276,   1

### Model Inference

To test the function-calling capabilities of the fine-tuned model, generate text based on the input.

In [13]:
outputs = model.generate(**inputs, # unpack the inputs dictionary into keyword arguments
                         max_new_tokens=300, # change if necessary
                         do_sample=True,
                         top_p=0.95,
                         temperature=0.01,
                         repetition_penalty=1.0,
                         eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))
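
Once text has been generated, the thinking step and the tool call usually need to be pulled out of it. The sketch below shows one possible way to do that with the standard library. The helper names and the sample string are hypothetical: the string is a made-up completion in the format the prompt asks for, not actual output from this model.

```python
import ast
import json
import re


def extract_block(text, tag):
    """Return the content of the first <tag>...</tag> block, or None if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_tool_call(text):
    """Parse the tool call payload as JSON, falling back to Python-literal syntax."""
    payload = extract_block(text, "tool_call")
    if payload is None:
        return None
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        # single-quoted dictionaries are not valid JSON but are valid Python literals
        return ast.literal_eval(payload)


# hypothetical completion, written by hand for this example
hypothetical_output = (
    "<think>The user wants to convert 500 USD to EUR, so convert_currency fits.</think>\n"
    "<tool_call>\n"
    "{'name': 'convert_currency', 'arguments': "
    "{'amount': 500, 'from_currency': 'USD', 'to_currency': 'EUR'}}\n"
    "</tool_call><end_of_turn><eos>"
)

thoughts = extract_block(hypothetical_output, "think")
call = parse_tool_call(hypothetical_output)
print(thoughts)
print(call["name"], call["arguments"])
```

When applying this to real output, it is safer to decode only the newly generated tokens, for example `outputs[0][inputs["input_ids"].shape[1]:]`, because the prompt itself contains `<think>` and `<tool_call>` placeholders that the regular expression would otherwise match first.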

The 'batch_size' attribute of HybridCache is deprecated and will be removed in v4.49. Use the more precisely named 'self.max_batch_size' attribute instead.


<bos><start_of_turn>human
You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags.You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.Here are the available tools:<tools> [{'type': 'function', 'function': {'name': 'convert_currency', 'description': 'Convert from one currency to another', 'parameters': {'type': 'object', 'properties': {'amount': {'type': 'number', 'description': 'The amount to convert'}, 'from_currency': {'type': 'string', 'description': 'The currency to convert from'}, 'to_currency': {'type': 'string', 'description': 'The currency to convert to'}}, 'required': ['amount', 'from_currency', 'to_currency']}}}, {'type': 'function', 'function': {'name': 'calculate_distance', 'description': 'Calculate the distance between two locations', 'parameters': {'type': 'object', 'properties': {'start_location': {'type': 'string', 'description': 'The star