# **Fine-tuning Open-Source LLMs for Function Calling**

This notebook demonstrates fine-tuning of an open-source model ([Mistral 7B](https://mistral.ai/news/announcing-mistral-7b/)). It leverages the `transformers` and `PEFT` libraries from Hugging Face for quantization, LoRA, and training, and a custom-built data set for function calling.

This notebook builds on the [basic fine-tuning](/notebooks/fine-tuning/basic) example by introducing the following innovations:  
- A prompt loss mask to focus the model's attention and encourage structured responses  
- A stop sequence after responses to encourage conciseness  
- A small but high-quality function-calling data set to fine-tune the model to respond with function names and query parameters  

**Notes**  
- The example data set used in this notebook is for function calling but these techniques work for any Q&A data set.  
- The system prompts have been omitted, but you can add them back if you wish to fine-tune for a certain system message.
- While you can run this notebook on an NVIDIA T4 GPU (free on Google Colab), I recommend using an A6000 or A100 to get better results. These larger GPUs are available in Google Colab Pro or from providers such as RunPod and Lambda Labs.

**Recommended Reading**
- [Tokenization](/docs/handbook/tokenization)
- [Low-Rank Adaptation (LoRA)](/docs/handbook/lora)

**Attribution**
- Some functions in this notebook were adapted from [Trelis](https://trelis.com) examples with modifications.
- A related, but simpler training notebook by Hugging Face is available [here](https://colab.research.google.com/drive/1VoYNfYDKcKRQRor98Zbf2-9VQTtGJ24k?usp=sharing).

## Why should you read this notebook?

You want to learn how to:
- Fine-tune an open-source model for structured and concise responses
- Fine-tune using just a single GPU  
- Use prompt loss masks to control which tokens the model is graded on

# Key Concepts

Typically, a model is graded on its prediction of the next token in both the question and the answer. However, our goal is for the model to attend to the full question while being graded solely on how well it predicts the answer. The attention mask and the loss mask, respectively, achieve this.  

## Attention mask
The attention mask tells the model which positions of the input sequence it may attend to. It is a sequence of 1s and 0s with the same shape as the input IDs: a 1 marks a real token, and a 0 marks a position to ignore. The mask does not change the input IDs themselves. Instead, it is applied inside each attention layer, where the scores for masked positions are suppressed so that those positions contribute nothing to the output.

```
{'input_ids': tensor([[9204, 18, 3763, 456, 222, 13563, 22580, 584]]),
 'attention_mask': tensor([[1, 1, 1, 0, 1, 1, 0, 1]])}
```

In this example, the tokens at positions 3 and 6 (IDs 456 and 22580, counting from 0) are ignored. In practice, the most common use is masking `PAD` tokens, so that padding added to equalize sequence lengths within a batch does not influence the result.
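
To make the mechanics concrete, here is a minimal sketch in plain Python of how a mask affects the attention weights for a single query. Scores at masked positions are set to negative infinity before the softmax, so they receive zero weight. The function name `masked_softmax` and the score values are illustrative assumptions, not the actual implementation in `transformers`.

```python
import math

def masked_softmax(scores, attention_mask):
    """Softmax over scores, giving zero weight to positions where mask == 0."""
    # Masked positions get -inf so that exp(-inf) == 0 after the softmax
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, attention_mask)]
    max_score = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention scores for one query over 8 key positions
scores = [0.5, 1.2, -0.3, 2.0, 0.1, 0.7, 1.5, 0.0]
attention_mask = [1, 1, 1, 0, 1, 1, 0, 1]

weights = masked_softmax(scores, attention_mask)
print([round(w, 3) for w in weights])  # positions 3 and 6 come out as 0.0
```

In a real transformer this happens per attention head on tensors, alongside causal masking. The sketch only isolates the effect of a padding-style mask and assumes at least one position is unmasked.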

## Loss mask

A loss mask is used to calculate the loss or error during training. It specifies which parts of the model's output should be considered when computing the loss. When training a model, we take the losses and multiply them by the loss mask.

To improve model performance, in this notebook we mask the losses associated with the prompt so that the model is trained to answer the question, not to predict the next tokens of the question itself.
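
As a rough illustration, the sketch below applies a loss mask to a list of per-token losses in plain Python. The loss values are made up, and averaging over the unmasked tokens only is one reasonable normalization choice, not necessarily the one used by any particular trainer.

```python
def masked_mean_loss(token_losses, loss_mask):
    """Average the per-token losses over positions where loss_mask == 1."""
    masked = [loss * m for loss, m in zip(token_losses, loss_mask)]
    n_graded = sum(loss_mask)
    return sum(masked) / n_graded if n_graded > 0 else 0.0

# Hypothetical per-token cross-entropy losses: 4 prompt tokens, then 3 response tokens
token_losses = [2.0, 1.5, 3.0, 2.5, 0.4, 0.6, 0.2]
loss_mask = [0, 0, 0, 0, 1, 1, 1]

print(masked_mean_loss(token_losses, loss_mask))          # response tokens only
print(masked_mean_loss(token_losses, [1] * len(token_losses)))  # no masking
```

With the mask applied, the high losses on the prompt tokens no longer contribute, so gradient updates are driven entirely by the response.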


## Stop sequence
Have you ever noticed how verbose some models are? By fine-tuning with a stop sequence, such as `USER:`, we can teach the model to be more concise:

```
{
  prompt: "What is the stock price of Apple?\n\nBOT:",
  completion: "Apple stock price is $188.04.\n\nUSER: ",
},
...
```
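
At inference time, the same stop sequence can be used to cut off anything the model generates past its answer. Here is a minimal sketch, assuming the generated text is already decoded to a string; `truncate_at_stop` is a hypothetical helper, not part of any library.

```python
def truncate_at_stop(text, stop_sequences=("USER:",)):
    """Return the text up to (not including) the earliest stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].rstrip()

generated = "Apple stock price is $188.04.\n\nUSER: What about Microsoft?"
print(truncate_at_stop(generated))  # Apple stock price is $188.04.
```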


# Mistral 7B Instruct

Mistral 7B Instruct is an instruction fine-tuned version of Mistral 7B, available on [Hugging Face](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1).

Per the HF model card:  

> ## Instruction format
> The template used to build a prompt for the Instruct model is defined as follows:
>
> ```
> <s> [INST] Instruction [/INST] Model answer</s> [INST] Follow-up instruction [/INST]
> ```
>
> ## Model architecture
>
> This instruction model is based on Mistral-7B-v0.1, a transformer model with the following architecture choices:  
>
> - Grouped-Query Attention  
> - Sliding-Window Attention  
> - Byte-fallback BPE tokenizer  
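
The template from the model card can be assembled with a few lines of Python. This is only a sketch: `build_mistral_prompt` is a hypothetical helper, and it writes `<s>` as literal text to mirror the template. When tokenizing with `add_special_tokens=True`, the tokenizer typically adds the BOS token itself, so you would omit it from the string in that case.

```python
def build_mistral_prompt(turns):
    """Format (instruction, answer) pairs; answer is None for the turn awaiting a reply."""
    prompt = "<s>"
    for instruction, answer in turns:
        prompt += f" [INST] {instruction} [/INST]"
        if answer is not None:
            # Completed turns end with the EOS token
            prompt += f" {answer}</s>"
    return prompt

turns = [
    ("Instruction", "Model answer"),
    ("Follow-up instruction", None),
]
print(build_mistral_prompt(turns))
```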


# Setup
- You can run QLoRA training for 7B models on a free Google Colab notebook.
- To configure a GPU on Google Colab, navigate to **Connect to a new runtime** and select **T4 High-RAM**.  
- In the code below, be sure to leave flash attention commented out when loading the model on a T4, since Flash Attention 2 is only supported on Ampere or newer GPUs (A6000, A100, H100, etc.).
- (Optional) Uncomment the code to mount Google Drive to download the model to your Google Drive. This will reduce total start time.
- If you don't already have one, create a [Hugging Face account](https://huggingface.co) and [create an Access Token](https://huggingface.co/settings/tokens) called "Notebooks" or similar with `write` permissions.

In [None]:
# # Print GPU info
# gpu_info = !nvidia-smi
# gpu_info = '\n'.join(gpu_info)
# if gpu_info.find('failed') >= 0:
#   print('Not connected to a GPU')
# else:
#   print(gpu_info)

In [None]:
# # Print system RAM (not GPU VRAM)
# from psutil import virtual_memory
# ram_gb = virtual_memory().total / 1e9
# print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

# if ram_gb < 20:
#   print('Not using a high-RAM runtime')
# else:
#   print('You are using a high-RAM runtime!')

# Install

In [50]:
# Authenticate to Hugging Face to pull and push models
!pip install huggingface_hub -q
from huggingface_hub import notebook_login

notebook_login()

In [3]:
# (Optional) Configure Weights & Biases (wandb) to track training runs
!pip install wandb -q -U
import wandb
wandb.login()

wandb: Currently logged in as: gadkins. Use `wandb login --relogin` to force relogin


True

In [4]:
# base_model = "./Mistral-7B-Instruct-v0.1-function-calling-v2"
# base_model = "meta-llama/Llama-2-7b-hf"
# base_model = "meta-llama/Llama-2-7b-chat-hf"
# base_model = "meta-llama/Llama-2-13b-chat-hf"
# base_model = "codellama/CodeLlama-34b-Instruct-hf"
# base_model = "meta-llama/Llama-2-70b-chat-hf"
base_model = "mistralai/Mistral-7B-Instruct-v0.1"
# base_model = "deepseek-ai/deepseek-coder-1.3b-instruct"
# base_model = "deepseek-ai/deepseek-coder-6.7b-instruct"
# base_model = "deepseek-ai/deepseek-coder-33b-instruct"
# base_model = "larryvrh/Yi-34B-200K-Llamafied"
# base_model = "./Yi-34B-200K-Llamafied-chat-SFT"
# base_model = "openchat/openchat_3.5"
# base_model = "SUSTech/SUS-Chat-34B"
# base_model = "mistralai/Mixtral-8x7B-Instruct-v0.1"
# base_model = "microsoft/phi-2"

cache_dir = None  # Use the default Hugging Face cache location
# (Optionally, you can set Google Drive as the cache_dir below)

In [5]:
# stable versions

!python -m pip install --upgrade pip
!pip install -U -q transformers
!pip install -q -U bitsandbytes
!pip install -q -U peft
!pip install -q -U accelerate
!pip install -q datasets
!pip install -q -U scipy
!pip install -q -U trl
# Optional: flash-attn only works on Ampere or newer GPUs; skip this on a T4
!pip install -U flash-attn -q

Collecting pip
  Downloading pip-24.0-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-24.0


In [6]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline, AutoConfig
import transformers
import torch
from torch.utils.data import DataLoader, Dataset

## If using Google Colab + Google Drive

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
import os
cache_dir = "/content/drive/My Drive/huggingface_cache"
os.makedirs(cache_dir, exist_ok=True) # Ensure the directory exists


In [9]:
# https://stackoverflow.com/questions/56081324/why-are-google-colab-shell-commands-not-working
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

# Load model

**Note about quantization:**  

In this section, we have the option to load a quantized version of the model (see the [QLoRA notebook](/notebooks/fine-tuning/qlora) for quantization details) to reduce the memory requirements so that it fits on a free T4 GPU in Google Colab. If cost matters most to you, I recommend this option: just uncomment the `quantization_config` argument below.

However, I've observed slightly better performance in function-calling fine-tunes when using models at full precision. Note that if you use full precision, you'll need a larger GPU such as an A100. If you're using Google Colab, you'll need to upgrade to Pro or use another service like RunPod or Lambda Labs (which are a bit cheaper).
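
A back-of-the-envelope calculation shows why quantization matters on a T4, which has 16 GB of VRAM. The sketch below estimates the memory for the weights alone. It ignores activations, gradients, optimizer state, and quantization overhead, so treat the numbers as lower bounds. The parameter count is derived from the totals printed later in this notebook.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory needed to hold the weights alone, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# All params reported after applying LoRA (7,262,703,616) minus the adapters (20,971,520)
n_params = 7_241_732_096

print(f"bf16:  {weight_memory_gb(n_params, 16):.1f} GB")  # bf16:  14.5 GB
print(f"4-bit: {weight_memory_gb(n_params, 4):.1f} GB")   # 4-bit: 3.6 GB
```

At bf16, the weights alone nearly fill a T4, leaving little room for training. At 4-bit, there is headroom for activations and the LoRA adapters.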

In [10]:
# QLoRA config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Instantiate model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    # quantization_config=bnb_config, # Uncomment to use quantized version
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
    #attn_implementation="flash_attention_2", # Supported in Ampere GPUs or newer
    cache_dir=cache_dir
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Tokenization

In [11]:
# # Required for certain tokenizers like Yi
# !pip install sentencepiece -q -U

In [12]:
tokenizer = AutoTokenizer.from_pretrained(base_model, cache_dir=cache_dir, trust_remote_code=True)
# tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", cache_dir=cache_dir)

In [13]:
print("EOS token:", tokenizer.eos_token)
print("EOS token id:", tokenizer.eos_token_id)

EOS token: </s>
EOS token id: 2


In [14]:
# If pad token is None, we'll need to set one in the next section
print("Pad token: ", tokenizer.pad_token)
print("Pad token ID: ", tokenizer.pad_token_id)

Pad token:  None
Pad token ID:  None


In [15]:
# Padding to the right (i.e. after) the prompt and response has better results
tokenizer.padding_side='right'
print(tokenizer)

LlamaTokenizerFast(name_or_path='mistralai/Mistral-7B-Instruct-v0.1', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


## Set pad token if none exists
Some models already have a pad token set. You can check by inspecting the tokenizer output printed above. **If a pad token is already set, you don't need to do anything further.**  

If no pad token exists, then you have three options, in order of preference:  

**Options**  

1. Use an existing token in the vocab as the pad token instead of introducing a new one. This avoids having to add a new token to the tokenizer. For this option, we use the existing `<unk>` token (i.e. "unknown") to pad. Note that this assumes the `<unk>` token exists in the vocab.  
2. Use the EOS token.  
3. Add a new pad token. This expands the model embeddings so that their size is no longer a multiple of 16, which can slow down inference, so treat it as a last resort.  

In [16]:
## (Recommended) OPTIONS 1 and 2
# If <unk> is in the tokenizer, set the pad token to <unk>
# Else, set pad token to EOS token
if '<unk>' in tokenizer.get_vocab():
    print('Found \'<unk>\' token in tokenizer. Using \'<unk>\' for pad.')
    # Set the pad token
    tokenizer.pad_token = '<unk>'
else:
    print(f'Using EOS token, \'{tokenizer.eos_token}\', for padding')
    tokenizer.pad_token = tokenizer.eos_token

## OPTION 3
# # Check if the pad token is already in the tokenizer vocabulary
# if '<pad>' not in tokenizer.get_vocab():
#     print('pad token not in the tokenizer')

#     # Add the pad token
#     tokenizer.add_tokens(['<pad>'])

# # Set the pad token
# tokenizer.pad_token = '<pad>'

# # Resize token embeddings
# model.resize_token_embeddings(len(tokenizer))

Found '<unk>' token in tokenizer. Using '<unk>' for pad.


In [17]:
# Update pad token id in model and its config
model.pad_token_id = tokenizer.pad_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Check if they are equal
assert model.pad_token_id == tokenizer.pad_token_id, "The model's pad token ID \
does not match the tokenizer's pad token ID!"

# Print the pad token ids
print('Tokenizer pad token ID:', tokenizer.pad_token_id)
print('Model pad token ID:', model.pad_token_id)
print('Model config pad token ID:', model.config.pad_token_id)
print('Number of tokens now in tokenizer:', len(tokenizer))

Tokenizer pad token ID: 0
Model pad token ID: 0
Model config pad token ID: 0
Number of tokens now in tokenizer: 32000


In [18]:
# Print model configuration
print(model.config)

MistralConfig {
  "_name_or_path": "mistralai/Mistral-7B-Instruct-v0.1",
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-05,
  "rope_theta": 10000.0,
  "sliding_window": 4096,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.37.2",
  "use_cache": true,
  "vocab_size": 32000
}



In [19]:
# Sample string
# sample_string = ['hello [/INST]', 'my good friend</s>']
sample_string = ['Caio!']

# Tokenize the sample string
encoded_sample = tokenizer(sample_string, truncation=True, padding=True, max_length=1024, return_tensors='pt', add_special_tokens=True)

BOS_token_id = tokenizer.bos_token_id
EOS_token_id = tokenizer.eos_token_id
BOS_token = tokenizer.decode([BOS_token_id])
EOS_token = tokenizer.decode([EOS_token_id])

print(f"Beginning of the sequence: {sample_string[0]} (BOS token: {BOS_token}, id: {BOS_token_id})")
print(f"End of the sequence: {sample_string[-1]} (EOS token: {EOS_token}, id: {EOS_token_id})")

# Count tokens along the sequence dimension (len(encoded_sample) would count dict keys)
token_count = encoded_sample['input_ids'].shape[1]

print(f"Tokens in the string: {token_count}")
print(f"Token IDs: {encoded_sample}")

# Decode the input_ids
decoded_sample = tokenizer.decode(encoded_sample['input_ids'][0], skip_special_tokens=False)

# Print the decoded string
print(f"Decoded string: {decoded_sample}")

# Print the attention mask
print(f"Attention mask: {encoded_sample['attention_mask']}")


Beginning of the sequence: Caio! (BOS token: <s>, id: 1)
End of the sequence: Caio! (EOS token: </s>, id: 2)
Tokens in the string: 4
Token IDs: {'input_ids': tensor([[    1, 11013,   691, 28808]]), 'attention_mask': tensor([[1, 1, 1, 1]])}
Decoded string: <s> Caio!
Attention mask: tensor([[1, 1, 1, 1]])


## Set up LoRA

In [20]:
# # If loading with adapters
# # Note: Instead, it's often faster to download base model then add adapters
# from peft import PeftModel

# # adapter_model = f'{base_model}' + '-function-calling-adapters' # replace

# # Load peft model with adapters
# model = PeftModel.from_pretrained(
#     model,
#     adapter_model,
# )

In [21]:
# To reduce VRAM usage (supported by most models)
model.gradient_checkpointing_enable()

# If using quantized model
# from peft import prepare_model_for_kbit_training
# model = prepare_model_for_kbit_training(model)

In [22]:
# Print list of modules
print(model.state_dict().keys())

odict_keys(['model.embed_tokens.weight', 'model.layers.0.self_attn.q_proj.weight', 'model.layers.0.self_attn.k_proj.weight', 'model.layers.0.self_attn.v_proj.weight', 'model.layers.0.self_attn.o_proj.weight', 'model.layers.0.mlp.gate_proj.weight', 'model.layers.0.mlp.up_proj.weight', 'model.layers.0.mlp.down_proj.weight', 'model.layers.0.input_layernorm.weight', 'model.layers.0.post_attention_layernorm.weight', 'model.layers.1.self_attn.q_proj.weight', 'model.layers.1.self_attn.k_proj.weight', 'model.layers.1.self_attn.v_proj.weight', 'model.layers.1.self_attn.o_proj.weight', 'model.layers.1.mlp.gate_proj.weight', 'model.layers.1.mlp.up_proj.weight', 'model.layers.1.mlp.down_proj.weight', 'model.layers.1.input_layernorm.weight', 'model.layers.1.post_attention_layernorm.weight', 'model.layers.2.self_attn.q_proj.weight', 'model.layers.2.self_attn.k_proj.weight', 'model.layers.2.self_attn.v_proj.weight', 'model.layers.2.self_attn.o_proj.weight', 'model.layers.2.mlp.gate_proj.weight', 'mod

In [23]:
print(model)

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )
    (norm): MistralRMSNorm()
  

In [24]:
# # If extending model context
# def set_added_trainable_params(model):
#     """
#     Sets the parameters with names containing "embed" or "norm" as trainable.
#     """
#     trainable_params_dict = {}

#     for name, param in model.named_parameters():
#         if "embed" in name or "norm" in name: #for most models
#         # if "ln" in name or "embd" in name: #for Phi-2
#             param.requires_grad_()
#             trainable_params_dict[name] = param

#     return trainable_params_dict

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()

    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable %: {100 * trainable_params / all_param}"
    )

from peft import LoraConfig, get_peft_model

# Initialize LoRA configuration
config = LoraConfig(
    # Lower rank results in smaller update matrices with fewer trainable params
    r=8, # Use 8 for models of 7B or larger, else 128
    lora_alpha=32,
    target_modules=[
    #     "Wqkv", #for Phi-2
    #     "fc1", #for Phi-2
    #     "fc2" #for Phi-2
      "self_attn.q_proj",
      "self_attn.k_proj",
      "self_attn.v_proj",
      "self_attn.o_proj",
      # "self_attn.rotary_emb.inv_freq",
      "mlp.gate_proj",
      "mlp.up_proj",
      "mlp.down_proj",
      # "input_layernorm.weight",
      # "post_attention_layernorm.weight",
      # "model.norm.weight",
      # "lm_head.weight"
    ],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA to the model
model = get_peft_model(model, config)

# # Set added parameters with names containing "embed" or "norm" as trainable.
# # Recommended if you are extending an LLM's context window.
# set_added_trainable_params(model)

# Print out the number of trainable parameters
print_trainable_parameters(model)

trainable params: 20971520 || all params: 7262703616 || trainable %: 0.2887563792882719
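
The trainable parameter count can be checked by hand. For each targeted linear layer, LoRA adds two matrices, `A` of shape `(r, in_features)` and `B` of shape `(out_features, r)`, for a total of `r * (in_features + out_features)` parameters. The sketch below plugs in the layer shapes from the model printout above. The helper name `lora_param_count` is illustrative.

```python
def lora_param_count(in_features, out_features, r):
    """LoRA adds A (r x in_features) and B (out_features x r) per adapted layer."""
    return r * (in_features + out_features)

r = 8
num_layers = 32

# (in_features, out_features) of each targeted projection, from print(model)
target_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

per_layer = sum(lora_param_count(i, o, r) for i, o in target_shapes.values())
total = per_layer * num_layers
print(total)  # 20971520, matching the trainable params reported above
```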


# Prepare data

Each function in the data set is stored as JSON in its own file. All functions follow OpenAI's metadata format.

### JSON data format
``` json
{
    "type": "function",
    "function": {
        "name": "function_name",
        "description": "function description",
        "parameters": {
            "type": "object",
            "properties": {
                "property_1": {
                    "type": "property_type", //#e.g. string
                    "description": "property description"
                },
                "property_2": {
                    "type": "property_type", //#e.g. string
                    "description": "property description"
                }
            },
            "required": ["property_1","property_2"]
        }
    },
    "samplePromptResponsePairs": [
        {
            "prompt": "sample_prompt",
            "response": {
                "name": "function_name",
                "arguments": {
                    "property_1": "property_value",
                    "property_2": "property_value"
                }
            }
        },
        ...
    ]
}

```
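
Because responses follow a fixed `name` / `arguments` shape, a generated answer can be checked mechanically against the function metadata. The following is a minimal sketch of such a check. The helper `is_valid_call` and the `get_stock_price` function are hypothetical, made up for illustration, and are not part of the data set.

```python
import json

def is_valid_call(response_text, function_metadata):
    """Return True if the response is JSON naming the function with valid arguments."""
    try:
        call = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False

    spec = function_metadata["function"]
    params = spec["parameters"]
    args = call.get("arguments")
    if call.get("name") != spec["name"] or not isinstance(args, dict):
        return False

    # Every required property must be present, and no unknown properties are allowed
    has_required = all(key in args for key in params.get("required", []))
    all_known = all(key in params["properties"] for key in args)
    return has_required and all_known

# Hypothetical function metadata, for illustration only
get_stock_price = {
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Get the latest price for a stock ticker",
        "parameters": {
            "type": "object",
            "properties": {
                "ticker": {"type": "string", "description": "Ticker symbol"}
            },
            "required": ["ticker"],
        },
    },
}

good = '{"name": "get_stock_price", "arguments": {"ticker": "AAPL"}}'
bad = "Sure! Here is some Python code that fetches the price..."
print(is_valid_call(good, get_stock_price))  # True
print(is_valid_call(bad, get_stock_price))   # False
```

A check like this does not validate argument types or values. It only confirms that the output is structured the way the fine-tuning data teaches.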

In [25]:
!pip install -q -U datasets

In [26]:
from datasets import load_dataset

# From Hugging Face Hub
data = load_dataset(
    "Trelis/function_calling_v3"
    )


In [27]:
print(data)

DatasetDict({
    train: Dataset({
        features: ['functionList', 'userPrompt', 'assistantResponse'],
        num_rows: 66
    })
    validation: Dataset({
        features: ['functionList', 'userPrompt', 'assistantResponse'],
        num_rows: 19
    })
    test: Dataset({
        features: ['functionList', 'userPrompt', 'assistantResponse'],
        num_rows: 7
    })
})


In [28]:
class TextDataset(Dataset):
    def __init__(self, encodings, response_lengths, input_lengths):
        self.encodings = encodings
        self.response_lengths = response_lengths
        self.input_lengths = input_lengths

    def __getitem__(self, idx):
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}

        # Set labels to be the same as input_ids
        item["labels"] = item["input_ids"].clone()

        # Calculate the start and end positions of the response
        response_start_position = self.input_lengths[idx]
        response_end_position = self.input_lengths[idx] + self.response_lengths[idx]

        # Create a loss mask that covers only the response tokens
        item["loss_mask"] = torch.zeros_like(item["input_ids"])
        item["loss_mask"][response_start_position:response_end_position] = 1

        # Shift the loss mask to the left by one position
        shifted_loss_mask = torch.cat([item["loss_mask"][1:], torch.tensor([0])])
        item["loss_mask"] = shifted_loss_mask

        # Shift the labels to the left by one position
        item["labels"][:-1] = item["input_ids"][1:]

        # Replace the token after the response with an EOS token
        item["labels"][response_end_position - 1] = tokenizer.eos_token_id

        # Set the loss mask to 1 at that position so the EOS prediction is graded
        item["loss_mask"][response_end_position - 1] = 1

        return item

    def __len__(self):
        return len(self.encodings["input_ids"])
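
The index arithmetic in `TextDataset` is easier to follow on a toy sequence. The sketch below mirrors the same steps using plain Python lists instead of tensors. The token IDs are made up, and `build_labels_and_mask` is a hypothetical stand-in for the logic in `__getitem__`.

```python
def build_labels_and_mask(input_ids, prompt_len, response_len, eos_id):
    """Plain-list version of the label shift and loss mask built in TextDataset."""
    n = len(input_ids)
    response_end = prompt_len + response_len

    # Labels are the inputs shifted left by one: position i predicts token i + 1
    labels = input_ids[1:] + [input_ids[-1]]

    # Mark the response tokens, then shift the mask left to line up with the labels
    loss_mask = [1 if prompt_len <= i < response_end else 0 for i in range(n)]
    loss_mask = loss_mask[1:] + [0]

    # The last response token should predict EOS, and that prediction is graded
    labels[response_end - 1] = eos_id
    loss_mask[response_end - 1] = 1
    return labels, loss_mask

# Toy IDs: BOS=1, prompt=[10, 11], response=[20, 21], PAD=0, EOS=2
input_ids = [1, 10, 11, 20, 21, 0, 0]
labels, loss_mask = build_labels_and_mask(
    input_ids, prompt_len=3, response_len=2, eos_id=2
)
print(labels)     # [10, 11, 20, 21, 2, 0, 0]
print(loss_mask)  # [0, 0, 1, 1, 1, 0, 0]
```

The graded positions are the last prompt token (which must predict the first response token), the first response token, and the last response token (which must predict EOS). Padding positions are never graded.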

In [29]:
# Define the function start and end strings
# \n\n is added at the end during training to avoid different tokenizations of
# the E_INST string with whatever follows.
B_FUNC, E_FUNC = "You have access to the following functions. Use them if required:\n\n", "\n\n"

# Define the user prompt start and end strings
# B_INST, E_INST = "GPT4 Correct User: ", "<|end_of_turn|>GPT4 Correct Assistant:" # OpenChat style
B_INST, E_INST = "[INST] ", " [/INST]" # Llama 2 or Mistral style
# B_INST, E_INST = "Instruct:", "\nOutput:" # Phi 2
# B_INST, E_INST = "\n### Instruction:\n", "\n### Response:\n" # DeepSeek Coder style
# B_INST, E_INST = "Human: ", " Assistant:" # Yi style for function calling, no trailing space
# B_INST, E_INST = "### Human: ", "\n\n### Assistant: " # SUSChat

In [30]:
def prepare_dataset(dataset, tokenizer):
    # Create the formatted text with the correct roles for each part of the dialogue
    formatted_dataset = dataset.map(
        lambda x: {
            "input_text": "".join([
                f"{B_INST}{B_FUNC}{x['functionList'].strip()}{E_FUNC}",
                f"{x['userPrompt'].strip()}{E_INST}\n\n",
                f"{x['assistantResponse'].strip()}",  # append EOS token in TextData...
            ]),
            "response_text": "".join([
                f"{x['assistantResponse'].strip()}",  # append EOS token in TextData...
            ]),
        }
    )

    # Tokenize the datasets
    encodings = tokenizer([dialogue["input_text"] for dialogue in \
                           formatted_dataset], truncation=True, padding=True, \
                          max_length=1024, return_tensors='pt', \
                          add_special_tokens=True)

    # Tokenize the response one by one without padding and special tokens for
    # the purpose of calculating length
    response_lengths = [len(tokenizer.encode(dialogue["response_text"], \
                                             truncation=True, max_length=1024, \
                                             padding=False, \
                                             add_special_tokens=False)) \
                        for dialogue in formatted_dataset]

    # Tokenize the input one by one without padding and with the initial
    # special token for the purpose of calculating length
    total_lengths = [len(tokenizer.encode(dialogue["input_text"], \
                                          truncation=True, max_length=1024, \
                                          padding=False, \
                                          add_special_tokens=True)) \
                     for dialogue in formatted_dataset]
    input_lengths = [total_length - response_length \
                     for total_length, response_length in \
                     zip(total_lengths, response_lengths)]

    # Create TextDataset
    text_dataset = TextDataset(encodings, response_lengths, input_lengths)

    return text_dataset

In [31]:
# Apply function to your datasets
train_dataset = prepare_dataset(data['train'], tokenizer)
test_dataset = prepare_dataset(data['test'], tokenizer)
validation_dataset = prepare_dataset(data['validation'], tokenizer)


### Examine the datasets

In [32]:
# Print the number of items in the dataset
print(f"Number of samples in the dataset: {len(train_dataset)}")

# Get a sample item
sample_item = train_dataset[1]  # replace with the index of any sample

# Print the dimensions of the sample item
print(f"Dimensions of input_ids: {sample_item['input_ids'].shape}")
print(f"Dimensions of attention_mask: {sample_item['attention_mask'].shape}")
print(f"Dimensions of loss_mask: {sample_item['loss_mask'].shape}")
print(f"Dimensions of labels: {sample_item['labels'].shape}")

# Print some tokens from the start and end of the sample
num_tokens_to_print = 336  # replace with the number of tokens you want to print

print("\nTokens at the start of the sample:")
print(sample_item['input_ids'][:num_tokens_to_print].tolist())
print(tokenizer.convert_ids_to_tokens(sample_item['input_ids'][:num_tokens_to_print].tolist()))

print("\nLabels at the start of the sample:")
print(sample_item['labels'][:num_tokens_to_print].tolist())
print(tokenizer.convert_ids_to_tokens(sample_item['labels'][:num_tokens_to_print].tolist()))

print("Attention mask at the start of the sample:")
print(sample_item['attention_mask'][:num_tokens_to_print].tolist())

print("Loss mask at the start of the sample:")
print(sample_item['loss_mask'][:num_tokens_to_print].tolist())

print("\nTokens at the end of the sample:")
print(sample_item['input_ids'][-num_tokens_to_print:].tolist())
print(tokenizer.convert_ids_to_tokens(sample_item['input_ids'][-num_tokens_to_print:].tolist()))

print("\nLabels at the end of the sample:")
print(sample_item['labels'][-num_tokens_to_print:].tolist())
print(tokenizer.convert_ids_to_tokens(sample_item['labels'][-num_tokens_to_print:].tolist()))

print("Attention mask at the end of the sample:")
print(sample_item['attention_mask'][-num_tokens_to_print:].tolist())

print("Loss mask at the end of the sample:")
print(sample_item['loss_mask'][-num_tokens_to_print:].tolist())


Number of samples in the dataset: 66
Dimensions of input_ids: torch.Size([677])
Dimensions of attention_mask: torch.Size([677])
Dimensions of loss_mask: torch.Size([677])
Dimensions of labels: torch.Size([677])

Tokens at the start of the sample:
[1, 733, 16289, 28793, 995, 506, 2735, 298, 272, 2296, 5572, 28723, 5938, 706, 513, 3030, 28747, 13, 13, 28792, 13, 2287, 371, 13, 5390, 345, 1123, 1264, 345, 2628, 548, 13, 5390, 345, 2628, 1264, 371, 13, 17422, 345, 861, 1264, 345, 2360, 28730, 283, 28744, 449, 548, 13, 17422, 345, 6518, 1264, 345, 7009, 354, 3332, 10374, 356, 1010, 28814, 449, 28723, 6746, 938, 302, 5771, 28725, 3994, 304, 5457, 12765, 390, 7658, 298, 5175, 3471, 2373, 272, 5709, 9191, 13, 17422, 345, 11438, 1264, 371, 13, 1417, 28705, 345, 1123, 1264, 345, 2814, 548, 13, 1417, 28705, 345, 10723, 1264, 371, 13, 359, 2287, 345, 3385, 1264, 371, 13, 359, 5390, 345, 1123, 1264, 345, 1427, 548, 13, 359, 5390, 345, 6518, 1264, 345, 1014, 3472, 5709, 1423, 28739, 13, 359, 2287, 4

# Generate a sample

In [33]:
import textwrap
wrapper = textwrap.TextWrapper(width=80)

In [34]:
import re  # import regular expressions module

In [35]:
import gc  # import Python's garbage collection module

def generate(index,data_split="test"):

    functionList = data[data_split][index]['functionList']
    user_prompt = data[data_split][index]['userPrompt']
    correct_answer = data[data_split][index]['assistantResponse']

    # model.config.use_cache = True    # Unsure this is needed

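    # Format your prompt template.
    # Note: the backslash continues the string literal, so the next line's
    # leading spaces become part of the prompt. Keep this identical to the
    # format used to build the training samples.
    prompt = f"{B_INST}{B_FUNC}{functionList.strip()}\
    {E_FUNC}{user_prompt.strip()}{E_INST}\n\n"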

    print(f"Using the {data_split} data split.\n\nPrompt:")
    print(prompt)

    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    if "token_type_ids" in inputs:
        del inputs["token_type_ids"]

    # print(f'model is on: {next(model.parameters()).device}')  # Debug
    # print(f'input_ids is on: {inputs["input_ids"].device}')  # Debug

    output = model.generate(**inputs,
                            max_new_tokens=200,
                            # do_sample=False,
                            pad_token_id=tokenizer.pad_token_id,
                            eos_token_id=tokenizer.eos_token_id,
                            # temperature=0.01,
                            # top_k=0
                           )

    print()

    # Slice off the prompt tokens so that only the model's response is decoded
    output_text = tokenizer.decode(output[0, len(inputs.input_ids[0]):],
                                   skip_special_tokens=False)
    output_text = re.sub('\n+', '\n', output_text)  # collapse repeated newlines

    print("**Generated Assistant Response:**")
    print(output_text)

    print()

    print("**Correct Assistant Response:**")
    print(correct_answer)

    print()

    # Clear GPU cache and run garbage collection
    torch.cuda.empty_cache()  # Clear GPU cache
    gc.collect()  # Run garbage collection

### Run validation before fine-tuning
Before fine-tuning, let's look at how the base model responds to the prompts in the test split.  

The model should respond with only a function name and its query parameters. Instead, the base model tries to write code itself and pads its answer with extra words.
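To make "structured response" concrete, here is a minimal sketch of a checker that accepts a reply only if it is a single JSON object with a `name` string and an `arguments` object. That target shape is an assumption taken from the fine-tuned output shown later in this notebook, and `parse_function_call` is a hypothetical helper, not part of the training code.

```python
import json


def parse_function_call(text, eos_token="</s>"):
    """Return (name, arguments) if text is a JSON function call, else None."""
    cleaned = text.strip()
    # Drop a trailing stop token, if the model emitted one
    if cleaned.endswith(eos_token):
        cleaned = cleaned[: -len(eos_token)].strip()
    try:
        call = json.loads(cleaned)
    except json.JSONDecodeError:
        return None  # prose, code, or anything else that is not pure JSON
    if not isinstance(call, dict):
        return None
    name, arguments = call.get("name"), call.get("arguments")
    if not isinstance(name, str) or not isinstance(arguments, dict):
        return None
    return name, arguments
```

A base model that answers with prose or with its own code fails this check, which is the behavior the fine-tune is meant to remove.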

In [41]:
# Run validation before training
for index in range(len(test_dataset)):
    print(f'---Running index {index}---')
    generate(index, "test")

---Running index 0---
Using the test data split.

Prompt:
[INST] You have access to the following functions. Use them if required:

[
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Get the stock price of an array of stocks",
            "parameters": {
                "type": "object",
                "properties": {
                    "names": {
                        "type": "array",
                        "items": {
                            "type": "string"
                        },
                        "description": "An array of stocks"
                    }
                },
                "required": [
                    "names"
                ]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_big_stocks",
            "description": "Get the names of the largest N stocks by market cap",
            "parameters": {
              

# Training

In [37]:
import torch.nn as nn

In [38]:
class CustomTrainer(transformers.Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Define number of tokens to display
        # Displays actual and predicted token info at end of each sequence
        num_tokens = 25

        labels = inputs.pop("labels")

        # # Get the first 200 label IDs for each sequence in the batch
        # first_hundred_label_ids = labels[:, :200]

        # # Convert to tokens
        # first_hundred_tokens = [tokenizer.convert_ids_to_tokens(label_ids) \
        # for label_ids in first_hundred_label_ids]

        # # Print them
        # for batch_idx, tokens in enumerate(first_hundred_tokens):
        #     print(f"First 200 decoded tokens for sequence {batch_idx + 1}: {tokens}")

        loss_mask = inputs.pop("loss_mask")

        # Forward pass
        outputs = model(**inputs)

        logits = outputs.logits

        # Check for NaN in logits
        if torch.isnan(logits).any():
            print("NaN detected in logits")
            print(logits)

        # Most probable token at each position. Softmax is monotonic, so argmax
        # over the raw logits gives the same result as argmax over probabilities.
        # Used only by the optional debug printout below.
        predicted_token_ids = torch.argmax(logits, dim=-1)

        # Compute the loss
        loss_fct = nn.CrossEntropyLoss(reduction='none')
        losses = loss_fct(logits.view(-1, self.model.config.vocab_size), labels.view(-1))

        # Reshaping the losses to have dimensions [batch_size, seq_length]
        losses = losses.view(-1, inputs['input_ids'].size(1))

        # Apply the loss mask
        masked_loss = losses * loss_mask

        # Check for NaN in losses
        if torch.isnan(losses).any():
            print("NaN detected in losses")
            # print(losses)

        # Guard against batches where every position is masked out
        if loss_mask.sum() == 0:
            print("Sum of loss_mask is zero")
            # Multiply by zero instead of creating a fresh tensor so the result
            # stays attached to the graph and backward() still works
            zero_loss = masked_loss.sum() * 0.0
            return (zero_loss, outputs) if return_outputs else zero_loss

        # Aggregate the masked losses
        # Normalize by the number of tokens considered + epsilon to prevent
        # division by zero
        loss = masked_loss.sum() / (loss_mask.sum() + 1e-9)

        # Print formatted tokens
        batch_size, seq_length = inputs['input_ids'].size()

        # num_tokens = len(inputs['input_ids'][0])

        # # Useful for debugging training
        # # Recommend training a small number of steps
        # print("-" * 120)
        # print(f"Token analysis for last {num_tokens} tokens:")
        # header_format = "{:<10}{:<20}{:<20}{:<20}{:<20}{:<30}{:<30}".format("Index", "Input Token", "Predicted Token", "True Token", "Loss Mask", "Raw Loss", "Masked Loss")

        # for batch_idx in range(batch_size):
        #     input_tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][batch_idx])  # Using batch_idx
        #     predicted_tokens = tokenizer.convert_ids_to_tokens(predicted_token_ids[batch_idx])  # Using batch_idx
        #     true_tokens = tokenizer.convert_ids_to_tokens(labels[batch_idx])  # Using batch_idx

        #     print(f"\nBatch {batch_idx + 1} of {batch_size}:")
        #     print(header_format)
        #     for i in range(-num_tokens, 0, 1):
        #         index = seq_length + i  # Correct index based on sequence length
        #         print("{:<10}{:<20}{:<20}{:<20}{:<20.1f}{:<30.6f}{:<30.6f}".format(index, input_tokens[index], predicted_tokens[index], true_tokens[index], loss_mask[batch_idx, i].item(), losses[batch_idx, i], masked_loss[batch_idx, i]))
        #     print("-" * 120)

        return (loss, outputs) if return_outputs else loss

    def get_train_dataloader(self):
        train_dataset = self.train_dataset
        data_collator = self.data_collator

        dataloader_params = {
            "batch_size": self.args.train_batch_size,
            "collate_fn": data_collator,
            "num_workers": self.args.dataloader_num_workers,
            "pin_memory": self.args.dataloader_pin_memory,
        }

        if not isinstance(train_dataset, torch.utils.data.IterableDataset):
            dataloader_params["sampler"] = self._get_train_sampler()
            dataloader_params["drop_last"] = self.args.dataloader_drop_last

        return DataLoader(train_dataset, **dataloader_params)

    def get_eval_dataloader(self, eval_dataset=None):
        eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset
        if eval_dataset is None:
            raise ValueError("Trainer: evaluation requires an eval_dataset.")

        data_collator = self.data_collator

        # Parameters for the DataLoader
        dataloader_params = {
            "batch_size": self.args.eval_batch_size,
            "collate_fn": data_collator,
            "num_workers": self.args.dataloader_num_workers,
            "pin_memory": self.args.dataloader_pin_memory,
        }

        # If your dataset isn't an instance of torch's IterableDataset, you can
        # provide sampler and drop_last
        if not isinstance(eval_dataset, torch.utils.data.IterableDataset):
            dataloader_params["sampler"] = self._get_eval_sampler(eval_dataset)
            # Typically we don't drop the last batch for evaluation
            dataloader_params["drop_last"] = False

        return DataLoader(eval_dataset, **dataloader_params)
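
The masked loss in `compute_loss` reduces to a simple rule: compute the cross-entropy per token, zero out positions where the loss mask is 0, and average over the positions that remain. The pure-Python sketch below reproduces that arithmetic for a single short sequence. `masked_mean_loss` is an illustrative name, and the toy logits are made up; the trainer applies the same idea to batched tensors.

```python
import math


def masked_mean_loss(logits, labels, loss_mask):
    """Average cross-entropy over the positions where loss_mask is 1."""
    total, count = 0.0, 0
    for token_logits, label, keep in zip(logits, labels, loss_mask):
        if not keep:
            continue  # masked position (e.g. prompt token): contributes nothing
        # Cross-entropy = log-sum-exp(logits) - logit of the true token
        peak = max(token_logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in token_logits))
        total += log_z - token_logits[label]
        count += 1
    return total / count if count else 0.0


# Position 0 is badly predicted but masked out, so only position 1 is graded
logits = [[10.0, 0.0], [0.0, 0.0]]
print(masked_mean_loss(logits, labels=[1, 0], loss_mask=[0, 1]))
```

Because the first position is masked, its large error never reaches the loss, which is how the prompt tokens are excluded from grading.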

In [39]:
class CustomDataCollator: # Needed if the EOS token is included in training
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, batch):

        input_ids = torch.stack([item['input_ids'] for item in batch])
        attention_mask = torch.stack([item['attention_mask'] for item in batch])
        labels = torch.stack([item['labels'] for item in batch])
        loss_mask = torch.stack([item['loss_mask'] for item in batch])

        # # Debugging: print details of the first sequence in the batch
        # num_elements_to_view = 20  # Number of last elements to view

        # # Decoding the input_ids
        # decoded_input_tokens = self.tokenizer.convert_ids_to_tokens(input_ids[0].tolist())

        # print("Debugging last", num_elements_to_view, "elements of the first sequence in the batch:")
        # print("{:<20}{:<20}{:<20}{:<20}".format("Token", "Input ID", "Label", "Loss Mask"))
        # for i in range(-num_elements_to_view, 0, 1):
        #   print("{:<20}{:<20}{:<20}{:<20}".format(decoded_input_tokens[i], input_ids[0, i].item(), labels[0, i].item(), loss_mask[0, i].item()))

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels,
            'loss_mask': loss_mask
        }

data_collator = CustomDataCollator(tokenizer)


In [40]:
trainer = CustomTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    args=transformers.TrainingArguments(
        # max_steps=1,
        num_train_epochs=1, # Larger models typically only need 1 epoch
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
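        evaluation_strategy="steps", # Renamed to eval_strategy in newer transformers releases
        max_grad_norm=1,
        warmup_ratio=0.1,
        eval_steps=0.2, # A float below 1 is read as a fraction of the total training steps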
        learning_rate=1e-4, # 1e-4 for LoRA
        # learning_rate=1e-5, # 1e-5 for full fine-tuning
        # fp16=True, # Use instead of bf16 on GPUs without bf16 support (pre-Ampere, e.g. T4 or V100)
        bf16=True,
        logging_steps=1,
        output_dir="outputs",
        # optim="paged_adamw_8bit", # For training in 4bit (quantized)
        optim="adamw_torch", # For training in full fp16/bf16 precision
        lr_scheduler_type='constant',
        hub_private_repo=True
    ),
    data_collator=data_collator,
    # data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # Silence warnings (Set to True for inference!)

In [42]:
trainer.train()
torch.cuda.empty_cache()



| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 14   | 0.109         | 0.818482        |
| 28   | 0.0306        | 0.809609        |
| 42   | 0.0013        | 0.767155        |
| 56   | 0.1517        | 0.75937         |
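
The evaluation steps in this log follow from `eval_steps=0.2`. A short sketch of the arithmetic, assuming the run has 66 optimizer steps (66 samples, batch size 1, no gradient accumulation, one epoch) and that a fractional `eval_steps` is rounded up to a whole number of steps; `eval_step_schedule` is a hypothetical helper, not a `transformers` function:

```python
import math


def eval_step_schedule(total_steps, eval_steps):
    """Return the global steps at which evaluation runs."""
    # Assumption: a float below 1 is a fraction of total steps, rounded up
    if eval_steps < 1:
        interval = math.ceil(total_steps * eval_steps)
    else:
        interval = int(eval_steps)
    return list(range(interval, total_steps + 1, interval))


print(eval_step_schedule(total_steps=66, eval_steps=0.2))  # [14, 28, 42, 56]
```

Under these assumptions the interval is `ceil(13.2) = 14`, which matches the steps logged above.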


# Example After Fine Tuning

In [43]:
model.config.use_cache = True
model.eval()

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralAttention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear(
                (base_layer): Linear(in_features

In [44]:
# Run validation
for index in range(len(test_dataset)):
    print(f'---Running index {index}---')
    generate(index, "test")

---Running index 0---
Using the test data split.

Prompt:
[INST] You have access to the following functions. Use them if required:

[
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Get the stock price of an array of stocks",
            "parameters": {
                "type": "object",
                "properties": {
                    "names": {
                        "type": "array",
                        "items": {
                            "type": "string"
                        },
                        "description": "An array of stocks"
                    }
                },
                "required": [
                    "names"
                ]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_big_stocks",
            "description": "Get the names of the largest N stocks by market cap",
            "parameters": {
              

# Merge Adapters and Save Model to Hub

In [53]:
# Extract the last portion of the base_model
base_model_name = base_model.split("/")[-1]

adapter_model = f"gadkins/{base_model_name}-function-calling-adapters"
new_model = f"gadkins/{base_model_name}-function-calling" # Your HF account

print(f"Adapter Model: {adapter_model}\nNew Model: {new_model}")

Adapter Model: gadkins/Mistral-7B-Instruct-v0.1-function-calling-adapters
New Model: gadkins/Mistral-7B-Instruct-v0.1-function-calling


In [56]:
# (Optional) Create repo + branch for gguf and awq

from huggingface_hub import HfApi, create_branch, create_repo

# Initialize the HfApi class
api = HfApi()

create_repo(new_model, private=False)

create_branch(new_model, repo_type="model", branch="gguf")

# create_branch(new_model, repo_type="model", branch="awq")

# create_branch(new_model, repo_type="model", branch="gptq")

In [57]:
# model.config._name_or_path="gadkins/Yi-34B-200K-Llamafied-chat-SFT"
# print(model.config._name_or_path)

In [58]:
# Save the model
model.save_pretrained(adapter_model, push_to_hub=True, use_auth_token=True)

In [77]:
# Push the model to the hub
# model.push_to_hub(adapter_model, use_auth_token=True)

In [80]:
# (Optional) Reload the base model before merging.
# - Loading the full, unquantized model needs a high-RAM environment, so you might need a Pro subscription.
# - You may prefer device_map='auto' over 'cpu' if you have a GPU.
# - If you trained in full precision (not quantized), you may not need to reload the model; you can merge and unload directly.
# - For very large models, you may need to restart the kernel and reload the base model, as there may not be enough space on the GPU.

# # from transformers import AutoModelForCausalLM, PretrainedConfig
# # import torch

# # model = AutoModelForCausalLM.from_pretrained(base_model, device_map='auto', trust_remote_code=True, torch_dtype=torch.float16, cache_dir=cache_dir)

# from peft import PeftModel

# # Load the PEFT model with the new adapters
# model = PeftModel.from_pretrained(
#     model,
#     './gadkins/Yi-34B-200K-Llamafied-chat-SFT-function-calling-adapters-v2',
# )

In [78]:
model = model.merge_and_unload() # merge adapters with the base model.
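
Merging folds each LoRA adapter into its base weight so the result is an ordinary model with no PEFT wrapper. For one linear layer, the standard LoRA merge is `W + (alpha / r) * B @ A`. The sketch below works through that arithmetic with tiny made-up matrices in plain Python; `merge_lora` and the values of `alpha` and `r` are illustrative assumptions, not this notebook's settings.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_A, lora_B, alpha, r):
    """Return weight + (alpha / r) * (lora_B @ lora_A)."""
    scale = alpha / r
    delta = matmul(lora_B, lora_A)  # shape: out_features x in_features
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy layer: 2 inputs, 2 outputs, rank 1 (real adapters are far larger)
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # r x in_features
B = [[1.0], [0.0]]  # out_features x r
print(merge_lora(W, A, B, alpha=2, r=1))  # [[3.0, 4.0], [0.0, 1.0]]
```

After the merge the low-rank matrices are no longer needed at inference time, which is why the merged checkpoint can be loaded without the adapter files.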

In [79]:
# (Optional) Save the merged model locally so you can run inference without downloading it again
model.save_pretrained(f"gadkins/{base_model_name}-function-calling-v3")

In [81]:
model.push_to_hub(new_model, token=True, max_shard_size="10GB", safe_serialization=True)

CommitInfo(commit_url='https://huggingface.co/gadkins/Mistral-7B-Instruct-v0.1-function-calling/commit/ca16aa19012184d3f5721a3a4ba7829876a385d1', commit_message='Upload MistralForCausalLM', commit_description='', oid='ca16aa19012184d3f5721a3a4ba7829876a385d1', pr_url=None, pr_revision=None, pr_num=None)

### Copy the base model's README.md and tokenizer.model (needed for GGUF and GPTQ)

In [82]:
import os
import requests
from huggingface_hub import HfApi

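def download_file_from_huggingface(model_id, filename, save_path):
    """Download one file from the main branch of a Hub repo; return True on success."""
    url = f"https://huggingface.co/{model_id}/resolve/main/{filename}"
    r = requests.get(url, timeout=60)  # Time out rather than hang on a stalled connection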
    if r.status_code != 200:
        print(f"Failed to download {filename}. HTTP Status Code: {r.status_code}")
        return False
    with open(os.path.join(save_path, filename), 'wb') as f:
        f.write(r.content)
    return True

def main():
    # Files to download and upload
    files_to_process = ["tokenizer.model", "README.md"]

    # Directory to save the downloaded files
    save_path = "./models"
    os.makedirs(save_path, exist_ok=True)

    # Initialize HfApi class
    api = HfApi()

    # Specify the repository where you want to upload the files
    repo_id = new_model  # Assuming new_model is in the format "username/repo"

    for filename in files_to_process:
        # Download the file
        success = download_file_from_huggingface(base_model, filename, save_path)
        if not success:
            continue  # Skip uploading; the download helper already reported the failure
        print(f"Successfully downloaded {filename}")

        # File path to upload
        local_file_path = os.path.join(save_path, filename)

        # Upload the file
        api.upload_file(
            path_or_fileobj=local_file_path,
            path_in_repo=filename,  # Using filename directly, adjust as needed
            repo_id=repo_id,
            repo_type="model",  # Assuming it's a model; can be "dataset" or "space" as well
        )
        print(f"Uploaded {filename} to {repo_id}")

if __name__ == "__main__":
    main()


Successfully downloaded tokenizer.model
Uploaded tokenizer.model to gadkins/Mistral-7B-Instruct-v0.1-function-calling
Successfully downloaded README.md
Uploaded README.md to gadkins/Mistral-7B-Instruct-v0.1-function-calling


## Set up chat template (advanced option)
This more advanced step lets you customize the chat template for function calling.

Typically, you start by copying the `chat_template` from the base model's `tokenizer_config.json` and pasting it into the cell below. You then customize that template to handle the `function_metadata`, `function_response`, and `function_call` roles. The example below shows one such template, but it won't be correct for all models.

In [64]:
print(tokenizer.chat_template)

{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token + ' ' }}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}


In [65]:
print(tokenizer.bos_token)
print(tokenizer.eos_token)

<s>
</s>


In [66]:
import json

In [67]:
function_metadata = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "This function gets the current weather in a given city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city, e.g., San Francisco"
                    },
                    "format": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The temperature unit to use."
                    }
                },
                "required": ["city"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_clothes",
            "description": "This function provides a suggestion of clothes to wear based on the current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "temperature": {
                        "type": "string",
                        "description": "The temperature, e.g., 15 C or 59 F"
                    },
                    "condition": {
                        "type": "string",
                        "description": "The weather condition, e.g., 'Cloudy', 'Sunny', 'Rainy'"
                    }
                },
                "required": ["temperature", "condition"]
            }
        }
    }
]

In [68]:
# Comment out later messages to test various stages of generation.

sample_messages = [
    # {
    #     "role": "system",
    #     "content": "you are a helpful assistant"
    # },
    {
        "role": "function_metadata",
        "content": "FUNCTION_METADATA"
    },
    {
        "role": "user",
        "content": "What is the current weather in London?"
    },
    # {
    #     "role": "function_call",
    #     "content": "{\n    \"name\": \"get_current_weather\",\n    \"arguments\": {\n        \"city\": \"London\"\n    }\n}</s>"
    # },
    # {
    #     "role": "function_response",
    #     "content": "{\n    \"temperature\": \"15 C\",\n    \"condition\": \"Cloudy\"\n}"
    # },
    # {
    #     "role": "assistant",
    #     "content": "The current weather in London is Cloudy with a temperature of 15 Celsius.</s>"
    # },
    # {
    #     "role": "user",
    #     "content": "That's great. Now say hello."
    # },
    # {
    #     "role": "assistant",
    #     "content": "Hello!</s>"
    # }
]

In [69]:
# Iterate through each message in the list
for message in sample_messages:
    if message['role'] == 'function_metadata':
        # Replace the 'FUNCTION_METADATA' placeholder with the JSON-serialized function_metadata
        message['content'] = message['content'].replace('FUNCTION_METADATA', json.dumps(function_metadata, indent=4))

In [70]:
# Llama 2 / Mistral-style template, extended with function-calling roles
tokenizer.chat_template = """{{ bos_token }} [INST] {% for message in messages %}{% if message['role'] == 'system' %}<<SYS>>\n{{ message['content'] }}\n<</SYS>>\n\n{% elif message['role'] == 'function_metadata' %}You have access to the following functions. Use them if required:\n\n{{ message['content'] }}\n\n{% elif message['role'] == 'user' %}{{ message['content'] }} [/INST]\n\n{% elif message['role'] == 'assistant' %}{{ message['content'] }} [INST] {% elif message['role'] == 'function_call' %}{{ message['content'] }} [INST] {% elif message['role'] == 'function_response' %}Here is the response to the function call. If helpful, use it to respond to my question:\n\n{{ message['content'] }} [/INST]\n\n{% endif %}{% endfor %}"""

In [71]:
print(tokenizer.chat_template)

{{ bos_token }} [INST] {% for message in messages %}{% if message['role'] == 'system' %}<<SYS>>
{{ message['content'] }}
<</SYS>>

{% elif message['role'] == 'function_metadata' %}You have access to the following functions. Use them if required:

{{ message['content'] }}

{% elif message['role'] == 'user' %}{{ message['content'] }} [/INST]

{% elif message['role'] == 'assistant' %}{{ message['content'] }} [INST] {% elif message['role'] == 'function_call' %}{{ message['content'] }} [INST] {% elif message['role'] == 'function_response' %}Here is the response to the function call. If helpful, use it to respond to my question:

{{ message['content'] }} [/INST]

{% endif %}{% endfor %}
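
For readers less used to Jinja, the template above can be mirrored in plain Python. This is a reading aid only: `render_prompt` is a hypothetical helper, it assumes unknown roles are silently skipped (as the template's missing `else` branch implies), and the tokenizer still uses the Jinja template.

```python
def render_prompt(messages, bos_token="<s>"):
    """Plain-Python mirror of the custom chat template's branches."""
    parts = [f"{bos_token} [INST] "]
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "system":
            parts.append(f"<<SYS>>\n{content}\n<</SYS>>\n\n")
        elif role == "function_metadata":
            parts.append("You have access to the following functions. "
                         f"Use them if required:\n\n{content}\n\n")
        elif role == "user":
            parts.append(f"{content} [/INST]\n\n")
        elif role in ("assistant", "function_call"):
            # Both roles end the model's turn and reopen an instruction block
            parts.append(f"{content} [INST] ")
        elif role == "function_response":
            parts.append("Here is the response to the function call. If helpful, "
                         f"use it to respond to my question:\n\n{content} [/INST]\n\n")
    return "".join(parts)


demo = [
    {"role": "function_metadata", "content": "[]"},
    {"role": "user", "content": "What is the current weather in London?"},
]
print(render_prompt(demo))
```

Note that every `user` and `function_response` message closes the instruction block with `[/INST]`, so the model's reply always starts right after it.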


In [72]:
# View the template applied without tokenization
prompt = tokenizer.apply_chat_template(sample_messages, tokenize=False)
print(prompt)

<s> [INST] You have access to the following functions. Use them if required:

[
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "This function gets the current weather in a given city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city, e.g., San Francisco"
                    },
                    "format": {
                        "type": "string",
                        "enum": [
                            "celsius",
                            "fahrenheit"
                        ],
                        "description": "The temperature unit to use."
                    }
                },
                "required": [
                    "city"
                ]
            }
        }
    },
    {
        "type": "function",
        "f

In [73]:
## Test generation

inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

if "token_type_ids" in inputs:
    del inputs["token_type_ids"]

# print(f'model is on: {next(model.parameters()).device}')  # Debug line
# print(f'input_ids is on: {inputs["input_ids"].device}')  # Debug line

output = model.generate(**inputs,
                        max_new_tokens=200,
                        do_sample=False,
                        pad_token_id=tokenizer.pad_token_id,
                        eos_token_id=tokenizer.eos_token_id,
                        # temperature=0.01,
                        # top_k=0
                       )

print()

# Subtract the length of input_ids from output to get only the model's response
output_text = tokenizer.decode(output[0, len(inputs.input_ids[0]):], skip_special_tokens=False)
print(output_text)


{
    "name": "get_current_weather",
    "arguments": {
        "city": "London"
    }
}</s>
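
Once the model emits a call like this, the application still has to run the function and feed the result back. The following is a minimal sketch of that step under stated assumptions: `get_current_weather` is a stub returning the sample values from the commented-out messages above, and `TOOLS` and `run_function_call` are hypothetical names.

```python
import json


def get_current_weather(city, format="celsius"):
    """Stub standing in for a real weather lookup (hypothetical).

    The parameter is named `format` to match the function metadata schema.
    """
    return {"temperature": "15 C", "condition": "Cloudy"}


# Registry mapping function names the model may emit to Python callables
TOOLS = {"get_current_weather": get_current_weather}


def run_function_call(output_text, eos_token="</s>"):
    """Parse the model's JSON call, run the tool, and build the next message."""
    call = json.loads(output_text.replace(eos_token, "").strip())
    result = TOOLS[call["name"]](**call["arguments"])
    return {"role": "function_response", "content": json.dumps(result, indent=4)}


generated = '{\n    "name": "get_current_weather",\n    "arguments": {\n        "city": "London"\n    }\n}</s>'
print(run_function_call(generated))
```

The returned message can be appended to the conversation after a `function_call` message holding the model's output, and the chat template then renders it under the `function_response` role.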


## Push Tokenizer

In [74]:
# (Optional) Save the tokenizer locally so you can run inference immediately without downloading
tokenizer.save_pretrained(f"gadkins/{base_model_name}-function-calling-v3")

('gadkins/Mistral-7B-Instruct-v0.1-function-calling-v3/tokenizer_config.json',
 'gadkins/Mistral-7B-Instruct-v0.1-function-calling-v3/special_tokens_map.json',
 'gadkins/Mistral-7B-Instruct-v0.1-function-calling-v3/tokenizer.model',
 'gadkins/Mistral-7B-Instruct-v0.1-function-calling-v3/added_tokens.json',
 'gadkins/Mistral-7B-Instruct-v0.1-function-calling-v3/tokenizer.json')

In [75]:
# Push the tokenizer
tokenizer.push_to_hub(new_model, token=True)

## RELOAD IF NEEDED (NOT RECOMMENDED IF tokenizer.chat_template was updated)
# from transformers import AutoTokenizer
# # reload the tokenizer because you don't want to have an off-size tokenizer with pad tokens.
# tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

CommitInfo(commit_url='https://huggingface.co/gadkins/Mistral-7B-Instruct-v0.1-function-calling/commit/dfe2015a6083826389d11212b55d530816a0e0c6', commit_message='Upload tokenizer', commit_description='', oid='dfe2015a6083826389d11212b55d530816a0e0c6', pr_url=None, pr_revision=None, pr_num=None)