# Function Calling Evaluation Results Analysis

This notebook analyzes the Berkeley Function Calling Leaderboard (BFCL) evaluation results for different Nova models across specific test categories.

## Overview
- **Models**: custom-nova-lite, nova-lite-v1.0, nova-micro-v1.0, nova-premier-v1.0, nova-pro-v1.0
- **Test Categories**: irrelevance, multiple, live_relevance, simple
- **Analysis Focus**: Model-specific performance comparison using accuracy scores from individual test categories

## Key Features
- ✅ **Model-specific analysis**: Results from each model's score directory
- ✅ **Category-focused**: Only the 4 test categories with generated results
- ✅ **Accuracy metrics**: Using first row summary statistics from each score file
- ✅ **Comparative visualizations**: Performance comparison across models and categories

## Set up BFCL Evaluations

In [None]:
%store -r custom_model_deployment_arn_tool_config

In [None]:
# this example depends on a specific version of the framework
!git clone --branch v1.3 https://github.com/ShishirPatil/gorilla.git bfcl

In [None]:
!python --version

In [None]:
%cd bfcl/berkeley-function-call-leaderboard
!pip install -e .

In [None]:
print("Here is the value for your model_id", custom_model_deployment_arn_tool_config)

## Update BFCL framework
We'll update the BFCL framework to use your custom model deployment by copying a few modified files into it. These files replace the relevant classes so that the BFCL `generate` and `evaluate` commands can call your model.

In [None]:
%cd ..

In [None]:
!pwd

### Update model_config
Update [`model_config.py`](./model_config.py) by replacing the `model_name` placeholder in the entry below with your custom model deployment (CMD) or Provisioned Throughput (PT) ARN. This example assumes nova-lite is the base model:

```python    
"custom-nova-lite": ModelConfig(
        model_name="<CMD or PT ARN here>",
        display_name="My Custom Nova Lite Model (FC)",
        url="https://my-organization.com/models/custom-nova",
        org="My Organization",
        license="Custom License",
        model_handler=NovaHandler,
        input_price=1.5,
        output_price=6.0,
        is_fc_model=True,
        underscore_to_dot=True,
        base_model="nova-lite-v1.0",  # Specify the base model this custom model is derived from
    ),
```

Now copy `model_config.py`, `supported_models.py`, and `nova.py` to the correct directories in the BFCL eval framework. These are the modifications needed to support custom model evaluation on Amazon Bedrock.

In [None]:
!cp model_config.py bfcl/berkeley-function-call-leaderboard/bfcl_eval/constants/model_config.py
!cp supported_models.py bfcl/berkeley-function-call-leaderboard/bfcl_eval/constants/supported_models.py
!cp nova.py bfcl/berkeley-function-call-leaderboard/bfcl_eval/model_handler/api_inference/nova.py

### Generate Responses
We will run the same test for our custom model and, for comparison, the Nova base models; uncomment the corresponding commands in the cells below to include them.
Before we do that, copy `test_case_ids_to_generate.json` to the correct location so that the `--run-ids` flag generates responses only for the evaluation data we held out of training.
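
Before copying the file, it can help to confirm how many held-out test cases it lists per category. The sketch below assumes the file is a JSON object that maps each test category name to a list of test case IDs; that layout is an assumption about the run-ids file, and `summarize_run_ids` is a hypothetical helper, not part of BFCL.

```python
import json
from pathlib import Path


def summarize_run_ids(path):
    """Return {category: number_of_ids} for a run-ids file.

    Assumes (not verified against BFCL) that the file is a JSON object
    mapping test category names to lists of test case IDs.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object keyed by test category")
    return {category: len(ids) for category, ids in data.items()}


# Only runs when the file is present in the working directory.
run_ids_file = Path("test_case_ids_to_generate.json")
if run_ids_file.exists():
    for category, count in summarize_run_ids(run_ids_file).items():
        print(f"{category}: {count} test case IDs")
```

If a category you expect is missing from the output, the later evaluation for that category has nothing to score.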


In [None]:
!cp test_case_ids_to_generate.json bfcl/berkeley-function-call-leaderboard/test_case_ids_to_generate.json

Now we can generate responses for our evaluation set using the `generate` command. If you are unsure of the name assigned to your model, check [`model_config.py`](bfcl/berkeley-function-call-leaderboard/bfcl_eval/constants/model_config.py); it should be the first entry in the list.

In [None]:
!bfcl generate --model custom-nova-lite --run-ids
# !bfcl generate --model nova-lite-v1.0 --run-ids
# !bfcl generate --model nova-pro-v1.0 --run-ids

### Run Evaluations
With the responses generated, we can now run the evaluations. This creates a folder of scores for each model.
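
After the evaluation finishes, a quick check for the expected score files can catch a category that was skipped. This is a minimal sketch that assumes the `score/<model>/BFCL_v3_<category>_score.json` layout used by the loading cell later in this notebook; `missing_score_files` is a hypothetical helper.

```python
from pathlib import Path

# Models you have evaluated so far; extend this list as you uncomment more commands.
MODELS = ["custom-nova-lite"]
CATEGORIES = ["irrelevance", "multiple", "live_relevance", "simple"]


def missing_score_files(score_root, models, categories):
    """Return the expected score file paths that do not exist yet."""
    root = Path(score_root)
    expected = [
        root / model / f"BFCL_v3_{category}_score.json"
        for model in models
        for category in categories
    ]
    return [path for path in expected if not path.is_file()]


# Run from the notebook's starting directory (the one that contains bfcl/).
missing = missing_score_files(
    "bfcl/berkeley-function-call-leaderboard/score", MODELS, CATEGORIES
)
for path in missing:
    print(f"Missing: {path}")
```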

In [None]:
!bfcl evaluate --model custom-nova-lite --test-category irrelevance,multiple,live_relevance,simple
# !bfcl evaluate --model nova-lite-v1.0 --test-category irrelevance,multiple,live_relevance,simple
# !bfcl evaluate --model nova-pro-v1.0 --test-category irrelevance,multiple,live_relevance,simple
# !bfcl evaluate --model nova-premier-v1.0 --test-category irrelevance,multiple,live_relevance,simple
# !bfcl evaluate --model nova-micro-v1.0 --test-category irrelevance,multiple,live_relevance,simple

## Evaluation Data Ready for Analysis
Now let's take a look at how our customized model compares to others.

In [None]:
# Import required libraries
import json
import os
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Configure plotly for jupyter
import plotly.io as pio
pio.renderers.default = 'notebook'

print("Libraries imported successfully!")

## 📊 Load Model-Specific Score Data

We'll load the accuracy scores from each model's score files. Each score file contains a summary row with accuracy, correct_count, and total_count metrics.
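
To make the file layout concrete, the sketch below parses a made-up score file held in memory. The summary keys match the ones named above; the numbers and the fields of the second, per-case line are invented for illustration.

```python
import io
import json

# Illustrative score file contents: one JSON object per line, summary first.
# The values and the per-case fields are made up.
sample = io.StringIO(
    '{"accuracy": 0.9, "correct_count": 18, "total_count": 20}\n'
    '{"id": "simple_3", "valid": false}\n'
)

# Only the first line is needed for the summary statistics.
summary = json.loads(sample.readline())
accuracy_percentage = summary["accuracy"] * 100
print(f"{accuracy_percentage:.1f}% ({summary['correct_count']}/{summary['total_count']})")
# prints: 90.0% (18/20)
```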

In [None]:
def load_score_summary(file_path):
    """Load the first row (summary statistics) from a score file."""
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            first_line = f.readline().strip()
            if first_line:
                return json.loads(first_line)
    except Exception as e:
        print(f"Error reading {file_path}: {e}")
    return None

def build_score_dataframe(score_folder='score'):
    """Build a comprehensive DataFrame from all score files."""
    score_data = []
    
    # Define the test categories we're interested in
    test_categories = ['irrelevance', 'live_relevance', 'multiple', 'simple']
    
    # Iterate through model directories
    for model_dir in os.listdir(score_folder):
        model_path = os.path.join(score_folder, model_dir)
        
        # Skip anything that is not a model directory, including entries whose names start with 'data_'
        if not os.path.isdir(model_path) or model_dir.startswith('data_'):
            continue
            
        print(f"Processing model: {model_dir}")
        
        # Process each test category
        for category in test_categories:
            score_file = f"BFCL_v3_{category}_score.json"
            score_path = os.path.join(model_path, score_file)
            
            if os.path.exists(score_path):
                summary = load_score_summary(score_path)
                if summary:
                    row = {
                        'model': model_dir,
                        'test_category': category,
                        'accuracy': summary.get('accuracy', 0),
                        'correct_count': summary.get('correct_count', 0),
                        'total_count': summary.get('total_count', 0),
                        'accuracy_percentage': summary.get('accuracy', 0) * 100
                    }
                    score_data.append(row)
                    print(f"  {category}: {summary.get('accuracy', 0):.3f} ({summary.get('correct_count', 0)}/{summary.get('total_count', 0)})")
            else:
                print(f"  {category}: Score file not found")
    
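    # Name the columns explicitly so an empty result still has the expected schema
    columns = ['model', 'test_category', 'accuracy', 'correct_count', 'total_count', 'accuracy_percentage']
    df = pd.DataFrame(score_data, columns=columns)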
    print(f"\n📈 Score DataFrame created with {len(df)} rows")
    print(f"Models: {sorted(df['model'].unique())}")
    print(f"Test categories: {sorted(df['test_category'].unique())}")
    
    return df

# Load the score data
print("Loading model-specific score data...")
scores_df = build_score_dataframe(score_folder='bfcl/berkeley-function-call-leaderboard/score')
print("\n✅ Score data loaded successfully!")

## 📋 Score Summary Table

Let's examine the accuracy scores for each model across all test categories.
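
The `average` column in the table below is an unweighted mean of the per-category accuracies, so a small category counts as much as a large one. The sketch below uses made-up counts for one hypothetical model to show how that can differ from pooling all test cases together.

```python
# Made-up counts for one hypothetical model: (correct_count, total_count) per category.
counts = {
    "simple": (90, 100),
    "irrelevance": (5, 10),
}

# Unweighted (macro) average: mean of the per-category accuracies.
macro = sum(c / t for c, t in counts.values()) / len(counts) * 100

# Pooled (micro) average: total correct answers over total test cases.
micro = sum(c for c, _ in counts.values()) / sum(t for _, t in counts.values()) * 100

print(f"macro: {macro:.1f}%  micro: {micro:.1f}%")
# prints: macro: 70.0%  micro: 86.4%
```

Keep this in mind when the categories in your held-out set differ widely in size; the `correct_count` and `total_count` columns of `scores_df` are enough to compute the pooled figure if you want it.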

In [None]:
# Display the complete score data
print("🎯 Model Performance Summary:")
print("=" * 80)

# Create a pivot table for better readability
pivot_df = scores_df.pivot(index='model', columns='test_category', values='accuracy_percentage')
pivot_df = pivot_df.round(2)

# Add overall average
pivot_df['average'] = pivot_df.mean(axis=1).round(2)

# Sort by average performance
pivot_df = pivot_df.sort_values('average', ascending=False)

print(pivot_df)
print("\n📊 Values shown as accuracy percentages (%)")

# Also show the raw DataFrame
print("\n📈 Detailed Score Data:")
scores_df_display = scores_df.copy()
scores_df_display['accuracy'] = scores_df_display['accuracy'].round(4)
scores_df_display['accuracy_percentage'] = scores_df_display['accuracy_percentage'].round(2)
scores_df_display

## 📊 Performance Visualization

### Overall Performance Comparison

In [None]:
# Create overall performance comparison chart
fig = px.bar(
    scores_df, 
    x='model', 
    y='accuracy_percentage',
    color='test_category',
    title='🎯 Model Performance Across Test Categories',
    labels={
        'accuracy_percentage': 'Accuracy (%)',
        'model': 'Model',
        'test_category': 'Test Category'
    },
    color_discrete_map={
        'simple': '#2E86AB',
        'multiple': '#A23B72', 
        'live_relevance': '#F18F01',
        'irrelevance': '#C73E1D'
    }
)

fig.update_layout(
    height=500,
    xaxis_tickangle=-45,
    showlegend=True,
    plot_bgcolor='white',
    font=dict(size=12)
)

fig.show()

### Category-by-Category Performance Analysis

In [None]:
# Create subplots for each test category
categories = sorted(scores_df['test_category'].unique())
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=[f'{cat.replace("_", " ").title()} Test Category' for cat in categories],
    vertical_spacing=0.15,
    horizontal_spacing=0.1
)

colors = ['#2E86AB', '#A23B72', '#F18F01', '#C73E1D']

for i, category in enumerate(categories):
    cat_data = scores_df[scores_df['test_category'] == category].sort_values('accuracy_percentage', ascending=True)
    
    row = (i // 2) + 1
    col = (i % 2) + 1
    
    fig.add_trace(
        go.Bar(
            x=cat_data['accuracy_percentage'],
            y=cat_data['model'],
            orientation='h',
            name=category,
            marker_color=colors[i],
            showlegend=False,
            text=cat_data['accuracy_percentage'].round(1),
            textposition='outside'
        ),
        row=row, col=col
    )
    
    fig.update_xaxes(title_text="Accuracy (%)", row=row, col=col)
    fig.update_yaxes(title_text="Model", row=row, col=col)

fig.update_layout(
    height=700,
    title_text="📊 Performance Analysis by Test Category",
    showlegend=False,
    plot_bgcolor='white'
)

fig.show()

### Model Ranking Analysis

In [None]:
# Calculate average performance and ranking
avg_performance = scores_df.groupby('model')['accuracy_percentage'].mean().sort_values(ascending=False)

print("🏆 MODEL RANKINGS BY AVERAGE ACCURACY")
print("=" * 50)

for rank, (model, avg_acc) in enumerate(avg_performance.items(), 1):
    print(f"{rank}. {model:<20} {avg_acc:6.2f}%")

# Create ranking visualization
fig = go.Figure()

fig.add_trace(go.Bar(
    x=avg_performance.values,
    y=avg_performance.index,
    orientation='h',
    marker=dict(
        color=avg_performance.values,
        colorscale='RdYlBu_r',
        showscale=True,
        colorbar=dict(title="Accuracy (%)")
    ),
    text=[f"{val:.1f}%" for val in avg_performance.values],
    textposition='outside'
))

fig.update_layout(
    title='🏆 Overall Model Ranking by Average Accuracy',
    xaxis_title='Average Accuracy (%)',
    yaxis_title='Model',
    height=400,
    plot_bgcolor='white',
    font=dict(size=12)
)

fig.show()

## 📈 Performance Statistics Summary

In [None]:
# Generate comprehensive statistics
print("📊 COMPREHENSIVE PERFORMANCE STATISTICS")
print("=" * 60)

# Overall statistics
overall_stats = scores_df['accuracy_percentage'].describe()
print("\n🎯 Overall Accuracy Statistics:")
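for stat, value in overall_stats.items():
    if stat == 'count':
        # count is the number of scores, not a percentage
        print(f"{stat.capitalize():<10}: {int(value):6d}")
    else:
        print(f"{stat.capitalize():<10}: {value:6.2f}%")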

# Best and worst performers by category
print("\n🏆 Best Performers by Category:")
for category in sorted(scores_df['test_category'].unique()):
    cat_data = scores_df[scores_df['test_category'] == category]
    best = cat_data.loc[cat_data['accuracy_percentage'].idxmax()]
    print(f"{category.replace('_', ' ').title():<15}: {best['model']:<20} ({best['accuracy_percentage']:.2f}%)")

print("\n📉 Performance Ranges by Category:")
for category in sorted(scores_df['test_category'].unique()):
    cat_data = scores_df[scores_df['test_category'] == category]['accuracy_percentage']
    print(f"{category.replace('_', ' ').title():<15}: {cat_data.min():.2f}% - {cat_data.max():.2f}% (range: {cat_data.max()-cat_data.min():.2f}%)")

# Model consistency analysis
print("\n📊 Model Consistency (Standard Deviation):")
model_std = scores_df.groupby('model')['accuracy_percentage'].std().sort_values()
for model, std in model_std.items():
    consistency = "High" if std < 5 else "Medium" if std < 10 else "Low"
    print(f"{model:<20}: {std:5.2f}% ({consistency} consistency)")

## 🔍 Detailed Analysis Insights

In [None]:
# Generate insights and observations
print("🔍 KEY INSIGHTS FROM FUNCTION CALLING EVALUATION")
print("=" * 60)

# Find top performer
top_model = avg_performance.index[0]
top_score = avg_performance.iloc[0]
print(f"\n🥇 Top Performer: {top_model} with {top_score:.2f}% average accuracy")

# Find most challenging category
category_avg = scores_df.groupby('test_category')['accuracy_percentage'].mean().sort_values()
hardest_category = category_avg.index[0]
hardest_score = category_avg.iloc[0]
easiest_category = category_avg.index[-1]
easiest_score = category_avg.iloc[-1]

print(f"\n📊 Most Challenging Category: {hardest_category.replace('_', ' ').title()} ({hardest_score:.2f}% avg)")
print(f"📊 Easiest Category: {easiest_category.replace('_', ' ').title()} ({easiest_score:.2f}% avg)")

# Custom model performance
if 'custom-nova-lite' in scores_df['model'].values:
    custom_avg = avg_performance['custom-nova-lite']
    custom_rank = list(avg_performance.index).index('custom-nova-lite') + 1
    
    print(f"\n🎯 Custom Model Performance:")
    print(f"   custom-nova-lite ranks #{custom_rank} with {custom_avg:.2f}% average accuracy")
    
    # Compare with base models
    if 'nova-lite-v1.0' in avg_performance.index:
        base_lite_avg = avg_performance['nova-lite-v1.0']
        improvement = custom_avg - base_lite_avg
        direction = "improvement" if improvement > 0 else "decrease" if improvement < 0 else "change"
        print(f"   vs nova-lite-v1.0: {abs(improvement):.2f} percentage point {direction}")

print(f"\n📈 Performance Distribution:")
high_performers = (avg_performance >= 90).sum()
medium_performers = ((avg_performance >= 80) & (avg_performance < 90)).sum()
low_performers = (avg_performance < 80).sum()

print(f"   High (≥90%): {high_performers} models")
print(f"   Medium (80-90%): {medium_performers} models")
print(f"   Lower (<80%): {low_performers} models")

## 💾 Export Results

Save the analysis results for further use or reporting.

In [None]:
# Export results to CSV
output_dir = 'analysis_output'
os.makedirs(output_dir, exist_ok=True)

# Export detailed scores
scores_df.to_csv(f'{output_dir}/model_scores_detailed.csv', index=False)
print(f"✅ Detailed scores exported to {output_dir}/model_scores_detailed.csv")

# Export pivot table
pivot_df.to_csv(f'{output_dir}/model_scores_pivot.csv')
print(f"✅ Pivot table exported to {output_dir}/model_scores_pivot.csv")

# Export rankings
ranking_df = pd.DataFrame({
    'model': avg_performance.index,
    'average_accuracy': avg_performance.values,
    'rank': range(1, len(avg_performance) + 1)
})
ranking_df.to_csv(f'{output_dir}/model_rankings.csv', index=False)
print(f"✅ Rankings exported to {output_dir}/model_rankings.csv")

print(f"\n📁 All analysis files saved to '{output_dir}/' directory")

---

## 🎯 Next Steps

Now that you've completed the function calling evaluation analysis, here are recommended next steps:

### 1. **Model Selection and Optimization**
- Review the model rankings and select the best-performing model for your use case
- Consider the trade-offs between accuracy and model size/cost
- If using custom-nova-lite, analyze its performance relative to base models

### 2. **Deep Dive into Specific Categories**
- Investigate why certain test categories are more challenging
- Examine individual test cases in categories where models struggle
- Consider additional training or fine-tuning for underperforming categories

### 3. **Production Deployment Planning**
- Use the consistency analysis to understand model reliability
- Plan fallback strategies for function calling failures
- Consider ensemble approaches using multiple models

### 4. **Further Analysis Options**
- Analyze error patterns from the detailed score files
- Compare token usage and latency across models (from result files)
- Conduct cost-benefit analysis including inference costs
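
As a starting point for the error-pattern analysis, the sketch below counts error types in the per-case records that follow the summary row of a score file. The `error_type` key is an assumption about the per-case record layout, and `tally_error_types` is a hypothetical helper; inspect one of your own score files and adjust the key if it differs.

```python
import json
from collections import Counter


def tally_error_types(lines):
    """Count error types across the per-case records of one score file.

    `lines` is an iterable of JSON strings; the first one is the summary row
    and is skipped. The "error_type" key is an assumed field name.
    """
    records = [json.loads(line) for line in lines if line.strip()]
    return Counter(rec.get("error_type", "unknown") for rec in records[1:])


# Example usage with an open score file (score_path is whichever file you want to inspect):
# with open(score_path, encoding="utf-8") as f:
#     print(tally_error_types(f).most_common(5))
```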

### 5. **Documentation and Reporting**
- Use the exported CSV files for stakeholder reports
- Document model selection rationale
- Create monitoring dashboards for production function calling performance

### 📚 **Additional Resources**
- [Berkeley Function Calling Leaderboard](https://gorilla.cs.berkeley.edu/leaderboard.html) - Official BFCL leaderboard
- [Amazon Bedrock Custom Models](https://docs.aws.amazon.com/bedrock/latest/userguide/custom-models.html) - Documentation for custom model development
- [Function Calling Best Practices](https://docs.aws.amazon.com/bedrock/latest/userguide/function-calling.html) - AWS Bedrock function calling guidance

### 🛠️ **Tools for Continued Analysis**
- Review notebook `01_prepare_data.ipynb` for data preparation insights
- Use notebook `02_generate_responses.ipynb` for generating additional test cases
- Examine notebook `03_evaluate_responses.ipynb` for evaluation methodology

---

*This analysis focused on model-specific performance using official BFCL evaluation scores for the four test categories: irrelevance, live_relevance, multiple, and simple.*