# Model Distillation for Function Calling

This notebook is part of a series demonstrating advanced model distillation techniques for creating specialized, function-calling-aware models. The goal is to distill the knowledge from a large language model (Amazon Nova Premier) into a smaller, more efficient model while maintaining high-quality function calling capabilities.

## Learning Objectives
- Prepare training data for function calling model distillation
- Design structured output formats for consistent function parameter generation
- Implement function selection and parameter extraction
- Create evaluation datasets for measuring function calling accuracy

## Dataset: Berkeley Function Calling Leaderboard (BFCL) V2 Live
We use the Berkeley Function Calling Leaderboard (BFCL) V2 Live dataset as our base dataset. This dataset is particularly suitable for function-calling model training because:

1. Contains 2,251 question-function-answer pairs total
2. Provides diverse function calling scenarios:
   - 258 simple calls
   - 1,053 multiple parameter calls
   - 16 parallel function calls
   - 24 parallel multiple parameter calls
   - 882 irrelevance detection cases
   - 18 relevance detection cases
3. Offers complexity with an average of 3 function choices per entry (maximum 37)
4. Includes parameter diversity with an average of 4 parameters per function (maximum 28)

The dataset is processed and stored in optimized formats for efficient model training and evaluation.
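As a quick sanity check, the category counts listed above can be summed to confirm they match the stated total. The sketch below simply hardcodes the numbers from the list; the dictionary keys are illustrative names chosen here, not identifiers from the BFCL package.

```python
# Category counts copied from the list above (BFCL V2 Live).
# The key names are illustrative labels, not official BFCL identifiers.
category_counts = {
    "simple": 258,
    "multiple": 1053,
    "parallel": 16,
    "parallel_multiple": 24,
    "irrelevance": 882,
    "relevance": 18,
}

total = sum(category_counts.values())

# Share of each category, useful when deciding how to balance a training mix.
category_share = {name: count / total for name, count in category_counts.items()}

print(f"total entries: {total}")
for name, share in sorted(category_share.items(), key=lambda kv: -kv[1]):
    print(f"{name:>18}: {share:6.1%}")
```

The shares make the imbalance visible: the parallel and relevance categories are very small compared with the multiple and irrelevance categories.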

In [None]:
# upgrade pip and boto3, then install the BFCL evaluation package
%pip install --upgrade pip --quiet
%pip install boto3 --upgrade --quiet
%pip install bfcl-eval

In [None]:
# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

We need to set the project root so results are put in the correct location. 

In [12]:
import os
os.environ['BFCL_PROJECT_ROOT'] = os.getcwd()

If you're running this on your own machine, enter your AWS access keys in an .env file.
Uncomment the below cell to copy down an example .env.

In [None]:
# # set up environment file
# %cp $(python -c "import bfcl_eval; print(bfcl_eval.__path__[0])")/.env.example $BFCL_PROJECT_ROOT/.env
# # Fill in necessary values in `.env`

In [None]:
# download sample data
# %cp $(python -c "import bfcl_eval, pathlib; print(pathlib.Path(bfcl_eval.__path__[0]) / 'test_case_ids_to_generate.json.example')") $BFCL_PROJECT_ROOT/test_case_ids_to_generate.json

For this example we're going to train and evaluate with a combination of `v3_simple`, `v3_multiple`, `v3_live_relevance`, and `v3_irrelevance`. For more information on these categories and their intents, please visit the [official BFCL documentation](https://huggingface.co/datasets/gorilla-llm/Berkeley-Function-Calling-Leaderboard).

Let's move these to our local directory so we can begin preparing the data for distillation.

In [25]:
%mkdir questions
%cp $(python -c "import bfcl_eval, pathlib; print(pathlib.Path(bfcl_eval.__path__[0]) / 'data' / 'BFCL_v3_simple.json')") ./questions/BFCL_v3_simple.json
%cp $(python -c "import bfcl_eval, pathlib; print(pathlib.Path(bfcl_eval.__path__[0]) / 'data' / 'BFCL_v3_multiple.json')") ./questions/BFCL_v3_multiple.json
%cp $(python -c "import bfcl_eval, pathlib; print(pathlib.Path(bfcl_eval.__path__[0]) / 'data' / 'BFCL_v3_irrelevance.json')") ./questions/BFCL_v3_irrelevance.json
%cp $(python -c "import bfcl_eval, pathlib; print(pathlib.Path(bfcl_eval.__path__[0]) / 'data' / 'BFCL_v3_live_relevance.json')") ./questions/BFCL_v3_live_relevance.json

Now we'll grab the corresponding answers. For the simple and multiple datasets, we are provided with possible answers, and we'll use these for our mix-in labels.

Per the BFCL documentation, the correct answer for any question in the `BFCL_v3_irrelevance` dataset is an empty list of functions, as these questions are designed specifically to test the model's ability to recognize that none of the provided functions are relevant.

The correct answer for any question in the `BFCL_v3_live_relevance` dataset is "at least one" function call returned.

Here's an excerpt from [their documentation](https://huggingface.co/datasets/gorilla-llm/Berkeley-Function-Calling-Leaderboard):

> **Irrelevance Detection (875):** The scenario where none of the function choices provided are relevant to the user query and none should be invoked. We expect the model to not output a function call; the model can either output a message explaining why the function provided are not relevant or simply output a non-function call response (e.g., an empty list).

> **Relevance Detection (41):** The opposite of irrelevance detection. The scenario where at least one of the function choices provided are relevant to the user query and should be invoked, but the way the user prompt or the function doc is stated means that there could be infinitely many correct function calls and impossible to use a pre-defined possible answer set to evaluate. We expect the model to output some function call (one or multiple) that is relevant to the user query; we don't check for the correctness of the function call in this category (eg, correct parameter value).
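The two rules quoted above reduce to a check on how many function calls the model returned. The helper below is a minimal sketch of that idea; the function name and the `category` strings are hypothetical and are not part of the BFCL package.

```python
def is_expected_behavior(category: str, predicted_calls: list) -> bool:
    """Hypothetical scoring rule for the two detection categories.

    - irrelevance: correct only when the model returns no function call
    - relevance:   correct when the model returns at least one function call
                   (parameter values are not checked in this category)
    """
    if category == "irrelevance":
        return len(predicted_calls) == 0
    if category == "relevance":
        return len(predicted_calls) >= 1
    raise ValueError(f"Unsupported category: {category}")


# An empty list is the right answer for an irrelevance question...
print(is_expected_behavior("irrelevance", []))
# ...while any call at all satisfies a relevance question.
print(is_expected_behavior("relevance", [{"get_weather": {"location": "Paris"}}]))
```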

In [26]:
%mkdir answers
%cp $(python -c "import bfcl_eval, pathlib; print(pathlib.Path(bfcl_eval.__path__[0]) / 'data' / 'possible_answer' / 'BFCL_v3_simple.json')") ./answers/BFCL_v3_simple.json
%cp $(python -c "import bfcl_eval, pathlib; print(pathlib.Path(bfcl_eval.__path__[0]) / 'data' / 'possible_answer' / 'BFCL_v3_multiple.json')") ./answers/BFCL_v3_multiple.json


To make things a bit cleaner, we'll create answer files for the relevance and irrelevance datasets as well. This will make it easier to combine for our training dataset as our mix-in will require a few ground truth examples.
We'll emulate the ground truth response structure the BFCL team is using, and return this for irrelevance answers:
```json
{"id": "irrelevance_13", "ground_truth": []}
```

and this for relevance answers, by picking a random function name from the list. Note that they "don't check for the correctness of the function call in this category (eg, correct parameter value)." so we'll use placeholders for the actual ground truth answers. Remember we're just doing this for labeled mix-in training data to hint our teacher model during distillation:
```json
{"id": "live_relevance_5-5-0", "ground_truth": [{"get_copyright_info": {"copyright_content": ["The specific content that is claimed to be copyrighted."], "copyright_holder": ["The name of the individual or organization that holds the copyright."], "confidence_score": [0.8]}}]}
```
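The two record shapes above could be generated along the following lines. This is a sketch under assumptions: it presumes each question record carries an `id` and a `function` list whose entries have `name` and `parameters.properties` fields, and the helper names are made up for illustration. Each placeholder is the parameter's own description, so numeric placeholders such as the `0.8` in the example above are not reproduced.

```python
import json
import random


def irrelevance_ground_truth(question: dict) -> dict:
    """Irrelevance cases expect no function call at all."""
    return {"id": question["id"], "ground_truth": []}


def relevance_ground_truth(question: dict, rng: random.Random) -> dict:
    """Pick one candidate function and fill each parameter with a placeholder.

    The placeholder is the parameter's description wrapped in a list, mirroring
    the "list of acceptable values" shape used in the answer files.
    """
    function = rng.choice(question["function"])
    properties = function.get("parameters", {}).get("properties", {})
    placeholders = {
        name: [spec.get("description", "")] for name, spec in properties.items()
    }
    return {"id": question["id"], "ground_truth": [{function["name"]: placeholders}]}


# Toy question record (assumed structure, not copied from the dataset).
example_question = {
    "id": "live_relevance_0-0-0",
    "function": [
        {
            "name": "get_weather",
            "parameters": {
                "properties": {
                    "location": {"type": "string", "description": "City name."}
                }
            },
        }
    ],
}

rng = random.Random(42)  # fixed seed so the random pick is reproducible
print(json.dumps(irrelevance_ground_truth({"id": "irrelevance_13"})))
print(json.dumps(relevance_ground_truth(example_question, rng)))
```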

### Steps to build training data set:


* Using pandas, pick a random 50% of the records from each question dataset and put them into a subdirectory called `training`. Put the remaining records in a subdirectory called `eval`.
* For the training data, create a directory called `mix_in` to hold the labeled questions we'll use in our training dataset. Put 10% of the records from each dataset into this folder and keep the rest where they are.
* Build a Bedrock Invoke API prompt with tool calling, using the questions.
* For the mix-in data, attach ground truth answers by looking up each answer in the answer dataset and adding an assistant turn containing the correct answer.
* Combine all of the training data, including the mix-in data, into a single `.jsonl` file in the format required by the Bedrock distillation service, and save it in the training directory.

Let's take a look at each of the files.

For our distillation, we're going to use a 10% mix-in of labeled data to hint to the teacher model what we would expect a good response to look like.
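The 50/50 split and the 10% mix-in described in the steps above can be sketched with `DataFrame.sample` and `DataFrame.drop`. The function name, the default seed, and the toy DataFrame are assumptions made for illustration; the sketch also assumes the DataFrame index is unique so that `drop` removes exactly the sampled rows.

```python
import pandas as pd


def split_questions(df: pd.DataFrame, train_frac: float = 0.5,
                    mix_in_frac: float = 0.1, seed: int = 42):
    """Split questions into unlabeled training, labeled mix-in, and eval sets."""
    # Random half for training; everything else is held out for evaluation.
    train_df = df.sample(frac=train_frac, random_state=seed)
    eval_df = df.drop(train_df.index)

    # 10% of the training rows become labeled mix-in examples.
    mix_in_df = train_df.sample(frac=mix_in_frac, random_state=seed)
    unlabeled_df = train_df.drop(mix_in_df.index)
    return unlabeled_df, mix_in_df, eval_df


# Toy data standing in for one BFCL question file.
questions_df = pd.DataFrame({"id": [f"simple_{i}" for i in range(100)]})
unlabeled_df, mix_in_df, eval_df = split_questions(questions_df)
print(len(unlabeled_df), len(mix_in_df), len(eval_df))
```

Applying the split per question file, rather than on the combined data, keeps the 10% mix-in proportional within each category.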

In [None]:
import json
import sys
import os
import re
import pandas as pd
import numpy as np

current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)
skip_dir = os.path.dirname(parent_dir)
sys.path.append(skip_dir)
from utils import read_jsonl_to_dataframe

splits = {'train': 'squad_v2/train-00000-of-00001.parquet', 'validation': 'squad_v2/validation-00000-of-00001.parquet'}
df_train = pd.read_parquet("hf://datasets/rajpurkar/squad_v2/" + splits["train"])
df_eval = pd.read_parquet("hf://datasets/rajpurkar/squad_v2/" + splits["validation"])

## Advanced System Prompt Engineering

This section implements a specialized system prompt for [Amazon Nova Foundation models](https://docs.aws.amazon.com/nova/latest/userguide/prompting.html). The prompt engineering focuses on:

1. **Function Selection**: Accurate identification of the appropriate function to call
2. **Parameter Extraction**: Precise extraction of required parameters from user queries
3. **Structured Output**: JSON-based response format for consistent parsing
4. **Edge Case Handling**: Explicit handling of irrelevant queries and parameter validation

### System Prompt JSON Schema
The system prompt leverages the following formatting to support function calling with parameters. In the following cells we'll build helper functions to parse these out to measure function calling accuracy. This style of prompting is specific to Amazon Nova and will provide the best accuracy for function calling use cases.
- **Function Selection Logic**: Accurate identification of the most appropriate function
- **Parameter Validation**: Ensuring all required parameters are properly extracted
- **Schema Compliance**: Adherence to function signature requirements
- **Extensibility**: Support for complex nested parameters and multiple function calls

```json
{
  "query": "What's the weather like in San Francisco today?",
  "function_call": {
    "name": "get_weather",
    "parameters": {
      "location": "San Francisco",
      "unit": "celsius",
      "date": "today"
    }
  }
}
```
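A response in the shape shown above could be parsed and compared with a ground truth entry roughly as follows. This is a simplified sketch, not the BFCL evaluator: the helper names are hypothetical, and it assumes the ground truth maps each function name to a dictionary of parameters with lists of acceptable values. It does not handle optional parameters or type coercion.

```python
import json


def parse_function_call(model_output: str):
    """Return (name, parameters) from a JSON response, or None if no call is present."""
    try:
        payload = json.loads(model_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    call = payload.get("function_call")
    if not isinstance(call, dict) or "name" not in call:
        return None
    return call["name"], call.get("parameters") or {}


def matches_ground_truth(call, ground_truth: list) -> bool:
    """Check a parsed call against ground truth entries.

    An empty ground truth list means no call is expected.
    """
    if not ground_truth:
        return call is None
    if call is None:
        return False
    name, params = call
    for entry in ground_truth:
        if name not in entry:
            continue
        expected = entry[name]
        # Every expected parameter must take one of its acceptable values.
        if all(params.get(key) in allowed for key, allowed in expected.items()):
            return True
    return False


response = json.dumps({
    "query": "What's the weather like in San Francisco today?",
    "function_call": {
        "name": "get_weather",
        "parameters": {"location": "San Francisco", "unit": "celsius", "date": "today"},
    },
})
truth = [{"get_weather": {"location": ["San Francisco", "SF"],
                          "unit": ["celsius"], "date": ["today"]}}]
print(matches_ground_truth(parse_function_call(response), truth))
```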

## Data Processing Pipeline Implementation

The evaluation of distilled models using Bedrock native tools requires three different datasets, all built from our SQuAD dataset.
1. **Distillation dataset.** This dataset will be used to fine-tune Nova Lite. Here we're using a 10% mix-in, so 10% of the records will include ground-truth answers and the rest will not. For more information on these best practices, please visit [our documentation on this topic](https://docs.aws.amazon.com/nova/latest/userguide/custom-distill-prepare.html).
2. **Batch Inference dataset.** This will be the hold out of records we'll use for evaluating each model's performance. We'll use this dataset in Batch Inference to get each model's responses.
3. **Labeled dataset.** Using the same records from our batch inference dataset, we'll create a labeled dataset that includes the correct answers. We'll use this in our Evaluation job to measure each model's response to the ground-truth answer.

In [None]:
def parse_answer_structure(answers_dict):
    """
    Parse different formats of answer dictionaries and extract text and start positions.
    Returns lists of texts and start positions.
    """
    # Case 1: NumPy arrays with direct keys
    if 'text' in answers_dict and isinstance(answers_dict['text'], np.ndarray):
        texts = answers_dict['text'].tolist()
        starts = answers_dict['answer_start'].tolist()
        
    # Case 2: Lists or single values with direct keys
    elif 'text' in answers_dict:
        texts = answers_dict['text'] if isinstance(answers_dict['text'], list) else [answers_dict['text']]
        starts = answers_dict['answer_start'] if isinstance(answers_dict['answer_start'], list) else [answers_dict['answer_start']]
        
    # Case 3: any other format is unsupported (JSON strings are parsed by the caller)
    else:
        raise ValueError(f"Unknown answer format: {answers_dict}")
        
    return texts, starts

def create_xml_answer(row, no_answer_text='I could not find an exact answer to the question.'):
    """
    Takes a pandas DataFrame row and renders its 'answers' column as an XML answer string.
    """
    try:
        # Handle answers as string (JSON) if needed; `json` is imported at the top of the notebook
        answers_dict = row['answers']
        if isinstance(answers_dict, str):
            answers_dict = json.loads(answers_dict)
            
        # Parse answer structure using our helper function
        texts, starts = parse_answer_structure(answers_dict)
        context = row['context']
        
        # Split context into sentences more accurately
        sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s', context)
        
        # Build XML structure
        xml_parts = ['<answer>']
        
        if len(texts) > 0:
            for i, (text, start) in enumerate(zip(texts, starts)):
                xml_parts.append('<answer_part>')
                xml_parts.append('<text>')
                xml_parts.append(str(text))
                xml_parts.append('</text>')
                xml_parts.append('<sources>')
                
                # Find the sentence containing the answer based on the start position
                char_count = 0
                source_sentence = "No relevant source found"
                for sentence in sentences:
                    sentence_len = len(sentence) + 1  # +1 for the space after sentence
                    if char_count <= int(start) < (char_count + sentence_len):
                        source_sentence = sentence.strip()
                        break
                    char_count += sentence_len
                
                xml_parts.append(f'<source>{source_sentence}</source>')
                xml_parts.append('</sources>')
                xml_parts.append('</answer_part>')
        
            xml_parts.append('</answer>')
        else: # use no answer text
            xml_parts.append(f"<answer_part>\n<text>\n{no_answer_text}\n</text>\n</answer_part></answer>")
        return '\n'.join(xml_parts)
    except Exception as e:
        return f"<answer>\n<error>Error generating XML: {str(e)}</error>\n</answer>"



In [None]:
def create_bedrock_payload(row, model_type="conversation", system_prompt=None, include_answer=False, additional_params=None):
    """
    Create a payload dictionary for Amazon Bedrock batch inference API requests.
    Batch inference uses the invoke api.
    
    Args:
        row: A row from the pandas DataFrame containing context, question, and optionally answers
        model_type: The type of model payload to create ("conversation" or "invoke")
        system_prompt: The system message to include (for conversation-based models)
        include_answer: Whether to include the answer in the conversation (for evaluation)
        additional_params: Dictionary of additional parameters to include in the payload
    
    Returns:
        dict: A formatted payload dictionary ready for Bedrock batch inference API
    """
    try:
        # Extract needed information
        context = row['context']
        question = row['question']
        
        # Create the user prompt with context and question
        user_prompt = f"""<context>{context}</context> <question>{question}</question>"""
        
        # Get the answer if needed
        assistant_response = create_xml_answer(row) if include_answer else None
        
        # Create appropriate payload based on model_type
        if model_type == "conversation":
            
            payload = {
                "schemaVersion": "bedrock-conversation-2024",
                "system": [{"text": system_prompt}] if system_prompt else [],
                "messages": [
                    {
                        "role": "user",
                        "content": [{"text": user_prompt}]
                    }
                ]
            }
            
            # Add assistant response if needed (for evaluation)
            if include_answer and assistant_response:
                payload["messages"].append({
                    "role": "assistant",
                    "content": [{"text": assistant_response}]
                })
                
        elif model_type == "invoke":
            # For basic invoke request (non-conversation models like Titan, etc.)
            payload = {
                "system": [{"text": system_prompt}] if system_prompt else [],
                "messages": [
                    {
                        "role": "user",
                        "content": [{"text": user_prompt}]
                    }
                ],
                "inferenceConfig":{ 
                    # "maxTokens": int, 
                    "temperature": .1, 
                    "topP": .9, 
                    "topK": 50, 
                    "stopSequences": ['</answer>']
                }
            }
            if include_answer and assistant_response:
                payload["messages"].append({
                    "role": "assistant",
                    "content": [{"text": assistant_response}]
                })
            
            # Add optional parameters specific to invoke requests
            if additional_params:
                payload.update(additional_params)
                
        else:
            raise ValueError(f"Unsupported model_type: {model_type}")
            
        # Add any additional parameters passed
        if additional_params and model_type == "conversation":
            # For conversation models, additional params might need to be added at the root level
            for key, value in additional_params.items():
                if key not in payload:
                    payload[key] = value
                    
        return payload
        
    except Exception as e:
        print(f"Error creating payload for row: {str(e)}")
        return None

In [None]:
def create_batch_inf_record(row, system_prompt, include_answer=False):
    """
    Takes a pandas df row and creates a jsonl record for batch inference
    """
    conversation = create_bedrock_payload(
                                row=row, 
                                system_prompt=system_prompt, 
                                model_type="invoke", 
                                additional_params={},
                                include_answer=include_answer)
    return {
        "recordId": row['id'],
        "modelInput": conversation
    }

### Including non-answers
Any citations use case will need to support scenarios where a correct answer is not possible given the passages available.
To this end, we'll make half of our training dataset include non-answers, and half will include examples with answers.
Bedrock distillation jobs can have a maximum of 15,000 records.

In [None]:
# Filter for rows with empty answers (questions that cannot be answered from the context)
empty_answers_df = df_train[df_train['answers'].apply(lambda x: 
    len(x['text']) == 0 and len(x['answer_start']) == 0)]

# Filter for rows with actual answers
with_answers_df = df_train[df_train['answers'].apply(lambda x: len(x['text']) > 0)]

# take 7500 of each dataframe and combine to use in distillation. 
df_train_revised = pd.concat([
    empty_answers_df.sample(n=7500, random_state=42), 
    with_answers_df.sample(n=7500, random_state=42)], ignore_index=True) # max 15k for bedrock distillation

As stated earlier, it is a best practice to include a ground truth answer for ~10% of the total training set. We will take a random sample of 10% and call `create_bedrock_payload` with `include_answer` set to True. For the remaining 90%, we leave out the ground truth answer.

In [None]:
row_count = len(df_train_revised)
ground_truth_included = 0.1 * row_count

# here we'll take 10% of our training data set and add the answers
training_data_with_gt_df = df_train_revised.sample(n=int(ground_truth_included), random_state=17)

# next we'll drop our ground truth examples so as not to mix with our labels excluding answers.
training_data_without_gt_df = df_train_revised.drop(training_data_with_gt_df.index)

# next we'll build our training data with ground truth
# NOTE: `system_prompt` must already be defined in an earlier cell (see the system prompt section above)
training_data_with_gt_df['conversation'] = training_data_with_gt_df.apply(lambda row: create_bedrock_payload(row=row, model_type="conversation", system_prompt=system_prompt, include_answer=True), axis=1)


In [None]:
# then we'll build our training data without ground truth
training_data_without_gt_df['conversation'] = training_data_without_gt_df.apply(lambda row: create_bedrock_payload(row=row, model_type="conversation", system_prompt=system_prompt, include_answer=False), axis=1)

In [None]:
# Now we'll concatenate the dataframes
final_training_dataset = pd.concat([training_data_with_gt_df, training_data_without_gt_df], axis=0, ignore_index=True).sort_index()

In [None]:
# now we'll output to .jsonl to use in distillation job
final_training_dataset['conversation'].to_json('distillation_data.jsonl', orient='records', lines=True)

## Batch Inference Dataset Creation

Now that our distillation data set is created, we'll move on to creating our batch inference dataset.
We'll also be using the same dataset (with labeled answers) for our evaluation jobs, and Bedrock Evaluations will only handle a maximum of 1,000 records.
We'll use 500 total rows from our dataset, which is enough for a meaningful evaluation while keeping processing time reasonable.

In [None]:
eval_empty_answers_df = df_eval[df_eval['answers'].apply(lambda x: 
    len(x['text']) == 0 and len(x['answer_start']) == 0)]

# Filter for rows with actual answers
eval_with_answers_df = df_eval[df_eval['answers'].apply(lambda x: len(x['text']) > 0)]

batch_inf_df = pd.concat([
    eval_empty_answers_df.sample(n=250, random_state=15), 
    eval_with_answers_df.sample(n=250, random_state=15)], ignore_index=True)


batch_inf_df.apply(lambda row: create_batch_inf_record(row, system_prompt), axis=1).to_json('batch_inf_data.jsonl', orient='records', lines=True)

## Labeled Dataset for BYOI Bedrock Evaluation
This section creates a labeled dataset by applying our `create_batch_inf_record` method on each row and setting `include_answer` to True.

In [None]:
batch_inf_df.apply(lambda row: create_batch_inf_record(row, system_prompt=system_prompt, include_answer=True), axis=1).to_json('labeled_data.jsonl', orient='records', lines=True)

### Datasets Created
By now you should see 3 `.jsonl` files:
1. distillation_data.jsonl. This is the dataset we'll use for distillation.
2. batch_inf_data.jsonl. This is the dataset we'll use to run inference on all of our models, including our distilled one.
3. labeled_data.jsonl. This is the dataset we'll use to evaluate each model's performance against the ground truth.
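Before moving on, it can be worth confirming that each file parses line by line and stays within the record limits mentioned earlier (15,000 for distillation, 1,000 for evaluation). The helper below is a hypothetical convenience function, not part of any Bedrock SDK; the file names and limits are taken from this notebook.

```python
import json
from pathlib import Path


def count_jsonl_records(path) -> int:
    """Count records in a JSONL file, failing loudly on any line that is not valid JSON."""
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}: line {line_number} is not valid JSON") from err
            count += 1
    return count


# Record limits as stated in this notebook.
expected_limits = {
    "distillation_data.jsonl": 15000,
    "batch_inf_data.jsonl": 1000,
    "labeled_data.jsonl": 1000,
}

for file_name, limit in expected_limits.items():
    if not Path(file_name).exists():
        print(f"{file_name}: not found")
        continue
    records = count_jsonl_records(file_name)
    status = "ok" if records <= limit else "over limit"
    print(f"{file_name}: {records} records ({status})")
```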

### Next Steps

Proceed to [02_distill.ipynb](02_distill.ipynb) to:
1. Submit a distillation job using our distillation dataset
2. Create a provisioned throughput endpoint to host our distilled model.

You can now move on to `02_distill.ipynb` where we'll kick off our distillation job!