# W2 Form Extraction - Custom Model Evaluation

Evaluate the fine-tuned model and compare it against the base model baseline
established in notebook 01.

In this notebook we will:

- Run the fine-tuned model on 100 test samples
- Compare accuracy metrics against the base model
- Clean up AWS resources (deployment, IAM role and policy)

**Prerequisite:** Run `04_deploy_on_bedrock.ipynb` first and ensure the deployment is active.

## Environment Setup

In [None]:
import warnings
warnings.filterwarnings("ignore")

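# Import only the helpers this notebook uses, rather than a wildcard import
from util import (
    build_test_data_in_memory,
    clean_up,
    evaluate_model_on_test_data,
    get_aws_clients,
)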

clients = get_aws_clients()
session         = clients["session"]
bedrock         = clients["bedrock"]
bedrock_runtime = clients["bedrock_runtime"]
account_id      = clients["account_id"]

In [None]:
%store -r bucket_name
%store -r test_s3_paths
%store -r deployment_arn
%store -r role_name
%store -r policy_arn
%store -r base_eval_results

print(f"Deployment ARN: {deployment_arn}")

## Evaluate Fine-tuned Model (100 test samples)

In [3]:
test_data = build_test_data_in_memory(test_s3_paths, account_id)

num_samples = 100  # number of test samples to evaluate

print(f"Running custom model evaluation on {num_samples} test samples...")
custom_eval_results = evaluate_model_on_test_data(
    bedrock_runtime, test_data, num_samples, deployment_arn
)

so = custom_eval_results["structured_output"]
print("\nStructured Output (Valid JSON):")
print(f"  Parse success rate: {so['parse_success_rate']:.2%} ({so['parse_successes']}/{so['total']})")
print(f"  Parse failures:     {so['parse_failures']}")

print(f"\nOverall Field Extraction Accuracy: {custom_eval_results['overall_accuracy']:.2%}")
print("\nAccuracy by Field Category:")
for category, accuracy in custom_eval_results["category_accuracies"].items():
    print(f"  - {category}: {accuracy:.2%}")

Running custom model evaluation on 100 test samples...


Evaluating model: 100%|██████████| 100/100 [03:19<00:00,  1.99s/it]


Structured Output (Valid JSON):
  Parse success rate: 100.00% (100/100)
  Parse failures:     0

Overall Field Extraction Accuracy: 91.13%

Accuracy by Field Category:
  - Employee Information: 92.67%
  - Employer Information: 84.33%
  - Earnings: 88.71%
  - Benefits: 100.00%
  - Multi-State Employment: 93.68%
  - Other: 0.00%
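The scoring logic lives in `util` and is not shown in this notebook. As a rough illustration of how an overall accuracy and per-category accuracies like those above could be derived, here is a sketch under stated assumptions: exact string match per field, a prediction of `None` for any response that failed to parse, and a hypothetical field-to-category mapping. The function name, the field names, and the fallback to `"Other"` are illustrative choices and may not match what `evaluate_model_on_test_data` actually does.

```python
from collections import defaultdict

# Hypothetical mapping from extracted field names to reporting categories.
FIELD_CATEGORIES = {
    "employee_name": "Employee Information",
    "employer_ein": "Employer Information",
    "wages": "Earnings",
}


def score_predictions(predictions, ground_truths, field_categories):
    """Compute overall and per-category exact-match accuracy.

    `predictions` holds one dict per sample, or None if parsing failed.
    `ground_truths` holds the expected field values for each sample.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, truth in zip(predictions, ground_truths):
        for field, expected in truth.items():
            # Assumption of this sketch: unmapped fields are grouped as "Other".
            category = field_categories.get(field, "Other")
            total[category] += 1
            if pred is not None and str(pred.get(field, "")).strip() == str(expected).strip():
                correct[category] += 1
    category_accuracies = {c: correct[c] / total[c] for c in total}
    n_fields = sum(total.values())
    overall = sum(correct.values()) / n_fields if n_fields else 0.0
    return overall, category_accuracies


truths = [
    {"employee_name": "Jane Doe", "wages": "50000.00"},
    {"employee_name": "John Roe", "wages": "42000.00"},
]
preds = [
    {"employee_name": "Jane Doe", "wages": "5000.00"},  # one field wrong
    None,                                               # unparseable response
]
overall, by_category = score_predictions(preds, truths, FIELD_CATEGORIES)
print(f"{overall:.2%}", by_category)
```

Counting a failed parse as a miss on every field keeps the structured-output rate and the field accuracy on the same denominator, which is one reasonable design choice among several.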





## Compare Base vs. Fine-tuned Model

In [4]:
base_so   = base_eval_results["structured_output"]
custom_so = custom_eval_results["structured_output"]
base_acc   = base_eval_results["overall_accuracy"]
custom_acc = custom_eval_results["overall_accuracy"]

print(f"{'Metric':<30} {'Base Model':>12} {'Fine-tuned':>12} {'Improvement':>12}")
print("-" * 68)
print(f"{'Structured Output Rate':<30} {base_so['parse_success_rate']:>11.2%} {custom_so['parse_success_rate']:>11.2%} {custom_so['parse_success_rate'] - base_so['parse_success_rate']:>+11.2%}")
print(f"{'Overall Accuracy':<30} {base_acc:>11.2%} {custom_acc:>11.2%} {custom_acc - base_acc:>+11.2%}")
print()

for category in base_eval_results["category_accuracies"]:
    b = base_eval_results["category_accuracies"][category]
    c = custom_eval_results["category_accuracies"].get(category, 0)
    print(f"{category:<30} {b:>11.2%} {c:>11.2%} {c - b:>+11.2%}")

Metric                           Base Model   Fine-tuned  Improvement
--------------------------------------------------------------------
Structured Output Rate             100.00%     100.00%      +0.00%
Overall Accuracy                    51.90%      91.13%     +39.22%

Employee Information                60.00%      92.67%     +32.67%
Employer Information                48.00%      84.33%     +36.33%
Earnings                            42.57%      88.71%     +46.14%
Benefits                            50.00%     100.00%     +50.00%
Multi-State Employment              62.91%      93.68%     +30.77%
Other                                0.00%       0.00%      +0.00%
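The "Improvement" column is an absolute difference in percentage points. A complementary view is the relative error reduction, that is, the share of the base model's errors that the fine-tuned model eliminates. The sketch below applies it to the overall accuracies from the table; `relative_error_reduction` is a hypothetical helper introduced here for illustration.

```python
def relative_error_reduction(base_accuracy, tuned_accuracy):
    """Fraction of the base model's error rate removed by the tuned model."""
    base_error = 1.0 - base_accuracy
    if base_error == 0:
        return 0.0  # nothing left to improve
    return (tuned_accuracy - base_accuracy) / base_error


# Overall accuracies taken from the comparison table above.
reduction = relative_error_reduction(0.5190, 0.9113)
print(f"Relative error reduction: {reduction:.1%}")  # 81.6%
```

Read this way, the error rate falls from about 48.1% to about 8.9%, which removes roughly four fifths of the base model's field-level errors.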


## Resource Cleanup

Delete the deployment and IAM resources to avoid unnecessary costs.

In [None]:
clean_up(
    session,
    bedrock,
    deployment_arn=deployment_arn,
    role_name=role_name,
    policy_arn=policy_arn,
)
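The internals of `clean_up` are defined in `util` and not shown here. For the IAM part, the order of operations matters: IAM rejects deletion of a role that still has managed policies attached, and of a policy that is still attached to any entity. The following is a minimal sketch of that ordering, assuming `iam` is a boto3 IAM client (for example `session.client("iam")`, if `session` is a boto3 session) and that the role has no other attached or inline policies. `delete_role_and_policy` is a hypothetical name, and deployment deletion is deliberately left out.

```python
def delete_role_and_policy(iam, role_name, policy_arn):
    """Detach the managed policy, then delete the role and the policy.

    `iam` is assumed to be a boto3 IAM client. This sketch assumes the role
    has no inline policies or other attachments; otherwise delete_role fails.
    """
    # 1. Detach first: both delete calls fail while the attachment exists.
    iam.detach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    # 2. Delete the now policy-free role.
    iam.delete_role(RoleName=role_name)
    # 3. Delete the customer-managed policy, which is no longer attached.
    iam.delete_policy(PolicyArn=policy_arn)


# Example call (not executed here because it needs AWS credentials):
# delete_role_and_policy(session.client("iam"), role_name, policy_arn)
```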