# Serbian Legal Named Entity Recognition (NER) Pipeline - 5-Fold Cross-Validation

This notebook implements 5-fold cross-validation for the Serbian Legal NER pipeline using the base BERTić model (classla/bcms-bertic), which loads as an ELECTRA-architecture checkpoint.

## Key Features
- **5-Fold Cross-Validation**: Robust evaluation across different data splits
- **Base BERTić Architecture**: Uses classla/bcms-bertic (loaded as `ElectraForTokenClassification`) for token classification
- **Sliding Window Tokenization**: Handles long sequences without truncation
- **Comprehensive Metrics**: Precision, recall, F1-score, and accuracy tracking
- **Statistical Analysis**: Mean and standard deviation across folds

## Entity Types
- **COURT**: Court institutions
- **DECISION_DATE**: Dates of legal decisions
- **CASE_NUMBER**: Case identifiers
- **CRIMINAL_ACT**: Criminal acts/charges
- **PROSECUTOR**: Prosecutor entities
- **DEFENDANT**: Defendant entities
- **JUDGE**: Judge names
- **REGISTRAR**: Court registrar
- **SANCTION**: Sanctions/penalties
- **SANCTION_TYPE**: Type of sanction
- **SANCTION_VALUE**: Value/duration of sanction
- **PROVISION**: Legal provisions
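- **PROCEDURE_COSTS**: Legal procedure costs

Note: the annotation export analyzed in Section 3 contains 14 entity types. It includes PROVISION_MATERIAL, PROVISION_PROCEDURAL, and VERDICT, and has no spans labeled SANCTION or PROVISION.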

In [1]:
# Mount Google Drive (for Colab)
try:
    from google.colab import drive
    drive.mount('/content/drive')
    USE_COLAB = True
except ImportError:
    USE_COLAB = False
    print("Running locally")

Mounted at /content/drive


## 1. Environment Setup and Dependencies

In [2]:
# Install required packages
!pip install transformers torch datasets tokenizers scikit-learn seqeval pandas numpy matplotlib seaborn tqdm

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16162 sha256=adfd8f0b362064cd1d5d59f943a6acddbaf34a377f3736bbd97513222d909e9b
  Stored in directory: /root/.cache/pip/wheels/5f/b8/73/0b2c1a76b701a677653dd79ece07cfabd7457989dbfbdcd8d7
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


In [3]:
# Import shared modules
import sys
import os

# Add the shared modules to path
if USE_COLAB:
    sys.path.append('/content/drive/MyDrive/NER_Master/ner/')
else:
    sys.path.append('../shared')

import importlib
import shared
import shared.model_utils
import shared.data_processing
import shared.dataset
import shared.evaluation
import shared.config
importlib.reload(shared.config)
importlib.reload(shared.data_processing)
importlib.reload(shared.dataset)
importlib.reload(shared.model_utils)
importlib.reload(shared.evaluation)
importlib.reload(shared)

# Import from shared modules
from shared import (
    # Configuration
    ENTITY_TYPES, BIO_LABELS, DEFAULT_TRAINING_ARGS,
    get_default_model_config, get_paths, setup_environment,

    # Data processing
    LabelStudioToBIOConverter, load_labelstudio_data,
    analyze_labelstudio_data, validate_bio_examples,

    # Dataset
    NERDataset, split_dataset, tokenize_and_align_labels_with_sliding_window,
    print_sequence_analysis, create_huggingface_datasets,

    # Model utilities
    load_model_and_tokenizer, create_training_arguments, create_trainer,
    detailed_evaluation, save_model_info, setup_device_and_seed,

    # Evaluation
    generate_evaluation_report, plot_training_history, plot_entity_distribution
)

# Standard imports
import warnings
warnings.filterwarnings('ignore')
import numpy as np
from sklearn.model_selection import KFold
import torch
from transformers import DataCollatorForTokenClassification, AutoTokenizer

# Setup device and random seed
device = setup_device_and_seed(42)

🔧 Setup complete:
  PyTorch version: 2.8.0+cu126
  CUDA available: False
  Device: cpu
  Random seed: 42


## 2. Configuration and Environment Setup

In [4]:
# Setup environment and paths
env_setup = setup_environment(use_local=not USE_COLAB, create_dirs=True)
paths = env_setup['paths']

# Model configuration
MODEL_NAME = "classla/bcms-bertic"
model_config = get_default_model_config()

# Output directory
OUTPUT_DIR = f"{paths['models_dir']}/bertic_base_5fold_cv"
os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"🔧 Configuration:")
print(f"  Model: {MODEL_NAME}")
print(f"  Output directory: {OUTPUT_DIR}")
print(f"  Entity types: {len(ENTITY_TYPES)}")
print(f"  BIO labels: {len(BIO_LABELS)}")

🔧 Environment setup (cloud):
  ✅ labelstudio_json: /content/drive/MyDrive/NER_Master/annotations.json
  ✅ judgments_dir: /content/drive/MyDrive/NER_Master/judgments
  ✅ labelstudio_files_dir: /content/drive/MyDrive/NER_Master/judgments
  ✅ mlm_data_dir: /content/drive/MyDrive/NER_Master/dapt-mlm
  ✅ models_dir: /content/drive/MyDrive/NER_Master/models
  ✅ logs_dir: /content/drive/MyDrive/NER_Master/logs
  ✅ results_dir: /content/drive/MyDrive/NER_Master/results
🔧 Configuration:
  Model: classla/bcms-bertic
  Output directory: /content/drive/MyDrive/NER_Master/models/bertic_base_5fold_cv
  Entity types: 16
  BIO labels: 33


## 3. Data Loading and Analysis

In [5]:
# Load LabelStudio data
labelstudio_data = load_labelstudio_data(paths['labelstudio_json'])

# Stop with an exception instead of exit(), which would shut down the notebook kernel
if not labelstudio_data:
    raise RuntimeError("❌ No data loaded. Please check your paths.")

# Analyze the data
analysis = analyze_labelstudio_data(labelstudio_data)

✅ Loaded 225 annotated documents from /content/drive/MyDrive/NER_Master/annotations.json
📊 Analysis Results:
Total documents: 225
Total annotations: 225
Unique entity types: 14

Entity distribution:
  DEFENDANT: 1240
  PROVISION_MATERIAL: 1177
  CRIMINAL_ACT: 792
  PROVISION_PROCEDURAL: 686
  REGISTRAR: 460
  COURT: 458
  JUDGE: 451
  PROSECUTOR: 395
  DECISION_DATE: 359
  SANCTION_TYPE: 248
  SANCTION_VALUE: 241
  VERDICT: 238
  PROCEDURE_COSTS: 231
  CASE_NUMBER: 225


## 4. Data Preprocessing and BIO Conversion

In [6]:
# Convert LabelStudio data to BIO format
converter = LabelStudioToBIOConverter(
    judgments_dir=paths['judgments_dir'],
    labelstudio_files_dir=paths.get('labelstudio_files_dir')
)

bio_examples = converter.convert_to_bio(labelstudio_data)
print(f"✅ Converted {len(bio_examples)} examples to BIO format")

# Validate BIO examples
valid_examples, stats = validate_bio_examples(bio_examples)
print(f"📊 Validation complete: {stats['valid_examples']} valid examples")

✅ Converted 225 examples to BIO format
📊 BIO Validation Results:
Total examples: 225
Valid examples: 225
Invalid examples: 0
Empty examples: 0
📊 Validation complete: 225 valid examples
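
The conversion step above turns character-offset annotations into per-token BIO tags. The internals of `LabelStudioToBIOConverter` are not shown in this notebook, so the following is only a minimal sketch of the idea, assuming whitespace tokenization and `(start, end, label)` character spans. The function name `spans_to_bio` and the sample sentence are hypothetical.

```python
from typing import List, Tuple


def spans_to_bio(text: str, spans: List[Tuple[int, int, str]]) -> List[Tuple[str, str]]:
    """Tag whitespace tokens with BIO labels from (start, end, label) character spans."""
    tagged = []
    pos = 0
    previous_span = None  # index of the span the previous token belonged to
    for token in text.split():
        start = text.index(token, pos)  # character offset of this token
        end = start + len(token)
        pos = end
        tag = "O"
        current_span = None
        for i, (s, e, label) in enumerate(spans):
            if start < e and end > s:  # token overlaps the annotated span
                current_span = i
                # Continue the entity if the previous token was in the same span
                tag = ("I-" if previous_span == i else "B-") + label
                break
        previous_span = current_span
        tagged.append((token, tag))
    return tagged


# Hypothetical sentence; the first 18 characters are annotated as a COURT
text = "Osnovni sud u Nišu doneo je presudu"
spans = [(0, 18, "COURT")]
for token, tag in spans_to_bio(text, spans):
    print(f"{token}\t{tag}")
```

The first token of a span receives `B-`, later tokens of the same span receive `I-`, and everything else is `O`.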


## 5. Dataset Preparation

In [7]:
# Create NER dataset
ner_dataset = NERDataset(valid_examples)
prepared_examples = ner_dataset.prepare_for_training()

print(f"📊 Dataset statistics:")
print(f"  Number of unique labels: {ner_dataset.get_num_labels()}")
print(f"  Prepared examples: {len(prepared_examples)}")

# Get label statistics
label_stats = ner_dataset.get_label_statistics()
print(f"  Total tokens: {label_stats['total_tokens']}")
print(f"  Entity types found: {len(label_stats['entity_counts'])}")

📊 Dataset statistics:
  Number of unique labels: 29
  Prepared examples: 225
  Total tokens: 232475
  Entity types found: 14
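
The label counts reported in this notebook are consistent with a plain BIO scheme: one `B-` and one `I-` tag per entity type plus a single `O` tag, so 14 observed types give 2 × 14 + 1 = 29 labels, and the 16 configured types give 2 × 16 + 1 = 33. The sketch below rebuilds such a label set from the 14 types listed in the entity distribution. The function name `build_bio_labels` is hypothetical, and the ordering used by the shared module is an assumption.

```python
def build_bio_labels(entity_types):
    """Return the BIO label list: 'O' plus a B-/I- pair for every entity type."""
    labels = ["O"]
    for entity in entity_types:
        labels.extend([f"B-{entity}", f"I-{entity}"])
    return labels


# The 14 entity types reported by the data analysis cell
observed_types = [
    "DEFENDANT", "PROVISION_MATERIAL", "CRIMINAL_ACT", "PROVISION_PROCEDURAL",
    "REGISTRAR", "COURT", "JUDGE", "PROSECUTOR", "DECISION_DATE",
    "SANCTION_TYPE", "SANCTION_VALUE", "VERDICT", "PROCEDURE_COSTS", "CASE_NUMBER",
]

labels = build_bio_labels(observed_types)
label_to_id = {label: idx for idx, label in enumerate(labels)}
id_to_label = {idx: label for label, idx in label_to_id.items()}

print(len(labels))  # 2 * 14 + 1 = 29
```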


## 6. K-Fold Cross-Validation Setup

In [8]:
# Set up 5-fold cross-validation
N_FOLDS = 5
kfold = KFold(n_splits=N_FOLDS, shuffle=True, random_state=42)

# Convert to numpy array for easier indexing
examples_array = np.array(prepared_examples, dtype=object)

print(f"Setting up {N_FOLDS}-fold cross-validation")
print(f"Total examples: {len(prepared_examples)}")
print(f"Examples per fold (approx): {len(prepared_examples) // N_FOLDS}")

# Load tokenizer (will be used across all folds)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print(f"\nLoaded tokenizer for {MODEL_NAME}")
print(f"Tokenizer vocab size: {tokenizer.vocab_size}")

# Store results from all folds
fold_results = []

Setting up 5-fold cross-validation
Total examples: 225
Examples per fold (approx): 45



Loaded tokenizer for classla/bcms-bertic
Tokenizer vocab size: 32000
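
The split is made over whole documents before any tokenization, so windows cut from one judgment cannot end up on both sides of a fold. As a rough illustration of what a shuffled 5-fold split does with 225 documents, here is a hand-rolled sketch using only the standard library. It is not scikit-learn's implementation and will not reproduce the same fold assignments as `KFold(..., random_state=42)`; the function name `kfold_indices` is hypothetical.

```python
import random


def kfold_indices(n_items, n_folds, seed=42):
    """Yield (train_indices, val_indices) for a shuffled k-fold split."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_items // n_folds + (1 if i < n_items % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


validated = []
for fold, (train, val) in enumerate(kfold_indices(225, 5), 1):
    assert not set(train) & set(val)  # no document is in both sets
    validated.extend(val)
    print(f"Fold {fold}: train={len(train)}, val={len(val)}")

# Every document is used for validation exactly once
assert sorted(validated) == list(range(225))
```

With 225 documents and 5 folds, every fold holds out 45 documents and trains on the remaining 180.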


## 7. K-Fold Cross-Validation Helper Functions

In [9]:
# ============================================================================
# K-FOLD CROSS-VALIDATION HELPER FUNCTIONS
# ============================================================================

def prepare_fold_data(train_examples, val_examples, tokenizer, ner_dataset):
    """
    Prepare training and validation datasets for a specific fold.

    Args:
        train_examples: Training examples for this fold
        val_examples: Validation examples for this fold
        tokenizer: Tokenizer instance
        ner_dataset: NER dataset instance

    Returns:
        tuple: (train_dataset, val_dataset, data_collator)
    """
    # Tokenize datasets with sliding window
    train_tokenized = tokenize_and_align_labels_with_sliding_window(
        train_examples, tokenizer, ner_dataset.label_to_id,
        max_length=model_config['max_length'], stride=model_config['stride']
    )

    val_tokenized = tokenize_and_align_labels_with_sliding_window(
        val_examples, tokenizer, ner_dataset.label_to_id,
        max_length=model_config['max_length'], stride=model_config['stride']
    )

    # Create HuggingFace datasets
    train_dataset, val_dataset, _ = create_huggingface_datasets(
        train_tokenized, val_tokenized, val_tokenized  # Using val as placeholder for test
    )

    # Data collator
    data_collator = DataCollatorForTokenClassification(
        tokenizer=tokenizer,
        padding=True,
        return_tensors="pt"
    )

    return train_dataset, val_dataset, data_collator

print("✅ K-fold helper functions defined successfully!")

✅ K-fold helper functions defined successfully!
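
`prepare_fold_data` relies on `tokenize_and_align_labels_with_sliding_window`, whose body lives in the shared module and is not shown here. The sketch below illustrates the general windowing idea on already-tokenized input. It assumes that `stride` means the number of tokens shared by consecutive windows (the Hugging Face tokenizer convention); the shared module may define it differently, and the function name `sliding_windows` is hypothetical.

```python
def sliding_windows(tokens, labels, max_length, stride):
    """Split parallel token/label lists into overlapping windows.

    `stride` is treated as the overlap (in tokens) between consecutive windows.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    if not tokens:
        return []
    step = max_length - stride
    windows = []
    start = 0
    while True:
        end = min(start + max_length, len(tokens))
        windows.append((tokens[start:end], labels[start:end]))
        if end == len(tokens):  # the last token is covered, stop
            break
        start += step
    return windows


tokens = [f"t{i}" for i in range(10)]
labels = ["O"] * 10
for window_tokens, _ in sliding_windows(tokens, labels, max_length=4, stride=1):
    print(window_tokens)
```

Each document therefore contributes several training examples, which is why 180 training documents become 1845 windows in the fold 1 output.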


In [19]:
def create_model_and_trainer(fold_num, train_dataset, val_dataset, data_collator, tokenizer, ner_dataset, device):
    """
    Create model and trainer for a specific fold.

    Args:
        fold_num: Current fold number
        train_dataset: Training dataset for this fold
        val_dataset: Validation dataset for this fold
        data_collator: Data collator
        tokenizer: Tokenizer instance
        ner_dataset: NER dataset instance
        device: Device to use (cuda/cpu)

    Returns:
        tuple: (model, trainer, fold_output_dir)
    """
    # Create fold-specific output directory (os is already imported at the top of the notebook)
    fold_output_dir = f"{OUTPUT_DIR}/fold_{fold_num}"
    os.makedirs(fold_output_dir, exist_ok=True)

    # Load fresh model for this fold
    model, _ = load_model_and_tokenizer(
        MODEL_NAME,
        ner_dataset.get_num_labels(),
        ner_dataset.id_to_label,
        ner_dataset.label_to_id
    )

    # Move model to device
    model.to(device)

    # Create training arguments for this fold
    training_args = create_training_arguments(
        output_dir=fold_output_dir,
        num_epochs=model_config['num_epochs'],
        batch_size=model_config['batch_size'],
        learning_rate=model_config['learning_rate'],
        warmup_steps=500,
        weight_decay=0.01,
        logging_steps=50,
        eval_steps=100,
        save_steps=500,
        early_stopping_patience=3
    )

    # Create trainer
    trainer = create_trainer(
        model=model,
        training_args=training_args,
        train_dataset=train_dataset,
        val_dataset=val_dataset,
        data_collator=data_collator,
        tokenizer=tokenizer,
        id_to_label=ner_dataset.id_to_label,
        early_stopping_patience=3
    )

    print(f"Trainer initialized for fold {fold_num}")
    return model, trainer, fold_output_dir

print("✅ Model and trainer creation function defined successfully!")

✅ Model and trainer creation function defined successfully!


In [20]:
def train_and_evaluate_fold(fold_num, trainer, val_dataset, ner_dataset):
    """
    Train and evaluate a model for a specific fold.

    Args:
        fold_num: Current fold number
        trainer: Trainer instance
        val_dataset: Validation dataset for this fold
        ner_dataset: NER dataset instance

    Returns:
        dict: Fold results including metrics
    """
    print(f"\n🏋️  Training fold {fold_num}...")

    # Train the model
    trainer.train()

    print(f"💾 Saving model for fold {fold_num}...")
    trainer.save_model()

    # Evaluate on validation set
    print(f"📊 Evaluating fold {fold_num}...")
    eval_results = detailed_evaluation(
        trainer, val_dataset, f"Fold {fold_num} Validation", ner_dataset.id_to_label
    )

    # Extract metrics
    fold_result = {
        'fold': fold_num,
        'precision': eval_results['precision'],
        'recall': eval_results['recall'],
        'f1': eval_results['f1'],
        'accuracy': eval_results['accuracy'],
        'true_predictions': eval_results['true_predictions'],
        'true_labels': eval_results['true_labels']
    }

    print(f"\nFold {fold_num} completed successfully!")
    return fold_result

print("✅ Training and evaluation helper function defined successfully!")

✅ Training and evaluation helper function defined successfully!
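
The precision, recall, and F1 values collected per fold are presumably entity-level scores (seqeval is installed above), while accuracy is typically token-level. `detailed_evaluation` is not shown, so the following is only a sketch of how entity-level scoring differs from token accuracy: a predicted entity counts as correct only if its label and both boundaries match. The helper names are hypothetical, and the lenient treatment of an `I-` tag without a preceding `B-` is a simplifying assumption.

```python
def extract_entities(tags):
    """Return the set of (label, start, end) spans in one BIO tag sequence."""
    entities = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            entities.add((label, start, i))
        if tag.startswith(("B-", "I-")):  # a stray I- tag opens a new span
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return entities


def entity_prf(true_seqs, pred_seqs):
    """Micro-averaged entity-level precision, recall and F1."""
    tp = n_pred = n_true = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
        tp += len(gold & pred)
        n_pred += len(pred)
        n_true += len(gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


true_seqs = [["B-JUDGE", "I-JUDGE", "O", "B-COURT"]]
pred_seqs = [["B-JUDGE", "O", "O", "B-COURT"]]  # JUDGE span cut short by one token
print(entity_prf(true_seqs, pred_seqs))
```

In this toy case three of four tokens are tagged correctly (75% token accuracy), yet only one of two entities matches exactly, so entity-level precision, recall, and F1 are all 0.5.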


## 8. K-Fold Cross-Validation Training Loop

In [None]:
# Check device availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Main K-Fold Cross-Validation Loop
print(f"\n{'='*80}")
print(f"STARTING {N_FOLDS}-FOLD CROSS-VALIDATION")
print(f"{'='*80}")
print(f"Total examples: {len(examples_array)}")
print(f"Model: {MODEL_NAME}")
print(f"Device: {device}")

# Execute K-Fold training
for fold_num, (train_idx, val_idx) in enumerate(kfold.split(examples_array), 1):
    print(f"\n{'='*80}")
    print(f"FOLD {fold_num}/{N_FOLDS}")
    print(f"{'='*80}")
    print(f"Train indices: {len(train_idx)}, Val indices: {len(val_idx)}")

    # Get fold data
    train_examples = examples_array[train_idx].tolist()
    val_examples = examples_array[val_idx].tolist()

    print(f"Training examples: {len(train_examples)}")
    print(f"Validation examples: {len(val_examples)}")

    # Prepare data for this fold
    print(f"\n🔤 Preparing data for fold {fold_num}...")
    train_dataset, val_dataset, data_collator = prepare_fold_data(
        train_examples, val_examples, tokenizer, ner_dataset
    )

    print(f"📦 Fold {fold_num} datasets:")
    print(f"  Training: {len(train_dataset)} examples")
    print(f"  Validation: {len(val_dataset)} examples")

    # Create model and trainer for this fold
    print(f"\n🤖 Creating model and trainer for fold {fold_num}...")
    model, trainer, fold_output_dir = create_model_and_trainer(
        fold_num, train_dataset, val_dataset, data_collator, tokenizer, ner_dataset, device
    )

    # Train and evaluate this fold
    fold_result = train_and_evaluate_fold(fold_num, trainer, val_dataset, ner_dataset)
    fold_results.append(fold_result)

    # Clean up to free memory
    del model, trainer, train_dataset, val_dataset
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    print(f"\n✅ Fold {fold_num} completed!")
    print(f"   Precision: {fold_result['precision']:.4f}")
    print(f"   Recall: {fold_result['recall']:.4f}")
    print(f"   F1-Score: {fold_result['f1']:.4f}")
    print(f"   Accuracy: {fold_result['accuracy']:.4f}")

print(f"\n{'='*80}")
print(f"K-FOLD CROSS-VALIDATION COMPLETED!")
print(f"{'='*80}")

Using device: cpu

STARTING 5-FOLD CROSS-VALIDATION
Total examples: 225
Model: classla/bcms-bertic
Device: cpu

FOLD 1/5
Train indices: 180, Val indices: 45
Training examples: 180
Validation examples: 45

🔤 Preparing data for fold 1...
📦 Created HuggingFace datasets:
  Training: 1845 examples
  Validation: 500 examples
  Test: 500 examples
📦 Fold 1 datasets:
  Training: 1845 examples
  Validation: 500 examples

🤖 Creating model and trainer for fold 1...
🔄 Loading model and tokenizer...
📥 Model: classla/bcms-bertic
🏷️  Number of labels: 29
✅ Loaded tokenizer (vocab size: 32000)


Some weights of ElectraForTokenClassification were not initialized from the model checkpoint at classla/bcms-bertic and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Loaded model (parameters: 110,049,053)
🖥️  Device: cpu
⚙️  Training configuration:
  Epochs: 8
  Batch size: 4
  Learning rate: 3e-05
  Warmup steps: 500
  Weight decay: 0.01
  Early stopping patience: 3
🏋️  Created trainer with early stopping (patience: 3)
📊 Training dataset size: 1845
📊 Validation dataset size: 500
Trainer initialized for fold 1

🏋️  Training fold 1...



Step,Training Loss,Validation Loss
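
The training configuration printed for fold 1 can be turned into a rough step budget. The sketch below uses the values from the output and code above (1845 training windows, batch size 4, 8 epochs, 500 warmup steps, evaluation every 100 steps, patience 3). It assumes no gradient accumulation, a single device, and step-based evaluation where patience counts consecutive evaluations without improvement; none of these are confirmed by the notebook.

```python
import math

# Values taken from the fold 1 output and create_model_and_trainer
train_windows = 1845
batch_size = 4
num_epochs = 8
warmup_steps = 500
eval_steps = 100
patience = 3

# Assumption: one optimizer step per batch (no gradient accumulation)
steps_per_epoch = math.ceil(train_windows / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_fraction = warmup_steps / total_steps
patience_window = patience * eval_steps  # steps without improvement before stopping

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total steps:     {total_steps}")
print(f"Warmup fraction: {warmup_fraction:.1%}")
print(f"Patience window: {patience_window} steps")
```

Under these assumptions the warmup covers a little over one epoch, and the patience window is shorter than the warmup, which may be worth keeping in mind when reading early-stopped runs.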


## 9. K-Fold Results Analysis and Summary

In [None]:
# ============================================================================
# K-FOLD RESULTS SUMMARY
# ============================================================================

print(f"\n{'='*80}")
print(f"K-FOLD CROSS-VALIDATION RESULTS SUMMARY")
print(f"{'='*80}")

# Extract metrics from all folds
precisions = [result['precision'] for result in fold_results]
recalls = [result['recall'] for result in fold_results]
f1_scores = [result['f1'] for result in fold_results]
accuracies = [result['accuracy'] for result in fold_results]

# Calculate statistics
print(f"\n📊 PERFORMANCE METRICS ACROSS {N_FOLDS} FOLDS:")
print(f"{'='*50}")

print(f"\n🎯 PRECISION:")
print(f"  Mean: {np.mean(precisions):.4f} ± {np.std(precisions):.4f}")
print(f"  Min:  {np.min(precisions):.4f} (Fold {np.argmin(precisions) + 1})")
print(f"  Max:  {np.max(precisions):.4f} (Fold {np.argmax(precisions) + 1})")

print(f"\n🎯 RECALL:")
print(f"  Mean: {np.mean(recalls):.4f} ± {np.std(recalls):.4f}")
print(f"  Min:  {np.min(recalls):.4f} (Fold {np.argmin(recalls) + 1})")
print(f"  Max:  {np.max(recalls):.4f} (Fold {np.argmax(recalls) + 1})")

print(f"\n🎯 F1-SCORE:")
print(f"  Mean: {np.mean(f1_scores):.4f} ± {np.std(f1_scores):.4f}")
print(f"  Min:  {np.min(f1_scores):.4f} (Fold {np.argmin(f1_scores) + 1})")
print(f"  Max:  {np.max(f1_scores):.4f} (Fold {np.argmax(f1_scores) + 1})")

print(f"\n🎯 ACCURACY:")
print(f"  Mean: {np.mean(accuracies):.4f} ± {np.std(accuracies):.4f}")
print(f"  Min:  {np.min(accuracies):.4f} (Fold {np.argmin(accuracies) + 1})")
print(f"  Max:  {np.max(accuracies):.4f} (Fold {np.argmax(accuracies) + 1})")

# Individual fold results
print(f"\n📋 INDIVIDUAL FOLD RESULTS:")
print(f"{'='*50}")
for i, result in enumerate(fold_results, 1):
    print(f"Fold {i}: P={result['precision']:.4f}, R={result['recall']:.4f}, F1={result['f1']:.4f}, Acc={result['accuracy']:.4f}")
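
A note on the "± std" figures: `np.std` defaults to the population standard deviation (`ddof=0`). With only 5 folds, the sample standard deviation (`ddof=1`) is noticeably larger, so it helps to state which one is reported. The sketch below shows the difference using the standard library; the F1 values are hypothetical placeholders, not results from this notebook.

```python
import statistics

# Hypothetical per-fold F1 scores, for illustration only
f1_scores = [0.90, 0.92, 0.88, 0.91, 0.89]

mean_f1 = statistics.fmean(f1_scores)
population_std = statistics.pstdev(f1_scores)  # matches np.std(x), i.e. ddof=0
sample_std = statistics.stdev(f1_scores)       # matches np.std(x, ddof=1)

print(f"Mean:           {mean_f1:.4f}")
print(f"Population std: {population_std:.4f}")
print(f"Sample std:     {sample_std:.4f}")
```

For n folds the two differ by a factor of sqrt(n / (n - 1)), which is about 1.118 for n = 5.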

In [None]:
# ============================================================================
# VISUALIZATION OF K-FOLD RESULTS
# ============================================================================

import matplotlib.pyplot as plt

# Create visualization of fold results
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))
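fig.suptitle(f'{N_FOLDS}-Fold Cross-Validation Results - Base BERTić Model', fontsize=16, fontweight='bold')

# Use the folds that actually completed so the x positions match the metric lists
fold_numbers = [result['fold'] for result in fold_results]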

# Precision plot
ax1.bar(fold_numbers, precisions, alpha=0.7, color='skyblue', edgecolor='navy')
ax1.set_title('Precision by Fold', fontweight='bold')
ax1.set_xlabel('Fold')
ax1.set_ylabel('Precision')
ax1.set_ylim(0, 1)
ax1.grid(True, alpha=0.3)
for i, v in enumerate(precisions):
    ax1.text(i+1, v+0.01, f'{v:.3f}', ha='center', va='bottom')

# Recall plot
ax2.bar(fold_numbers, recalls, alpha=0.7, color='lightgreen', edgecolor='darkgreen')
ax2.set_title('Recall by Fold', fontweight='bold')
ax2.set_xlabel('Fold')
ax2.set_ylabel('Recall')
ax2.set_ylim(0, 1)
ax2.grid(True, alpha=0.3)
for i, v in enumerate(recalls):
    ax2.text(i+1, v+0.01, f'{v:.3f}', ha='center', va='bottom')

# F1-Score plot
ax3.bar(fold_numbers, f1_scores, alpha=0.7, color='gold', edgecolor='orange')
ax3.set_title('F1-Score by Fold', fontweight='bold')
ax3.set_xlabel('Fold')
ax3.set_ylabel('F1-Score')
ax3.set_ylim(0, 1)
ax3.grid(True, alpha=0.3)
for i, v in enumerate(f1_scores):
    ax3.text(i+1, v+0.01, f'{v:.3f}', ha='center', va='bottom')

# Accuracy plot
ax4.bar(fold_numbers, accuracies, alpha=0.7, color='lightcoral', edgecolor='darkred')
ax4.set_title('Accuracy by Fold', fontweight='bold')
ax4.set_xlabel('Fold')
ax4.set_ylabel('Accuracy')
ax4.set_ylim(0, 1)
ax4.grid(True, alpha=0.3)
for i, v in enumerate(accuracies):
    ax4.text(i+1, v+0.01, f'{v:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

In [None]:
# Box plot for metric distribution
fig, ax = plt.subplots(figsize=(10, 6))
metrics_data = [precisions, recalls, f1_scores, accuracies]
labels = ['Precision', 'Recall', 'F1-Score', 'Accuracy']

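# Set tick labels separately: the `labels` keyword of boxplot is deprecated in newer Matplotlib releases
box_plot = ax.boxplot(metrics_data, patch_artist=True)
ax.set_xticklabels(labels)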
colors = ['skyblue', 'lightgreen', 'gold', 'lightcoral']
for patch, color in zip(box_plot['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

ax.set_title(f'{N_FOLDS}-Fold Cross-Validation Metrics Distribution', fontsize=14, fontweight='bold')
ax.set_ylabel('Score')
ax.grid(True, alpha=0.3)
ax.set_ylim(0, 1)

plt.tight_layout()
plt.show()

## 10. Save Results and Final Summary

In [None]:
# ============================================================================
# SAVE RESULTS TO FILE
# ============================================================================

import json
import pandas as pd
from datetime import datetime

# Create results summary
results_summary = {
    'experiment_info': {
        'model_name': MODEL_NAME,
        'n_folds': N_FOLDS,
        'total_examples': len(prepared_examples),
        'timestamp': datetime.now().isoformat(),
        'device': str(device)
    },
    'overall_metrics': {
        'precision': {
            'mean': float(np.mean(precisions)),
            'std': float(np.std(precisions)),
            'min': float(np.min(precisions)),
            'max': float(np.max(precisions))
        },
        'recall': {
            'mean': float(np.mean(recalls)),
            'std': float(np.std(recalls)),
            'min': float(np.min(recalls)),
            'max': float(np.max(recalls))
        },
        'f1_score': {
            'mean': float(np.mean(f1_scores)),
            'std': float(np.std(f1_scores)),
            'min': float(np.min(f1_scores)),
            'max': float(np.max(f1_scores))
        },
        'accuracy': {
            'mean': float(np.mean(accuracies)),
            'std': float(np.std(accuracies)),
            'min': float(np.min(accuracies)),
            'max': float(np.max(accuracies))
        }
    },
    'fold_results': [
        {
            'fold': result['fold'],
            'precision': float(result['precision']),
            'recall': float(result['recall']),
            'f1': float(result['f1']),
            'accuracy': float(result['accuracy'])
        }
        for result in fold_results
    ]
}

# Save results to JSON
results_file = f"{OUTPUT_DIR}/5fold_cv_results.json"
with open(results_file, 'w', encoding='utf-8') as f:
    json.dump(results_summary, f, indent=2, ensure_ascii=False)

print(f"✅ Results saved to: {results_file}")

# Create CSV for easy analysis
df_results = pd.DataFrame([
    {
        'Fold': result['fold'],
        'Precision': result['precision'],
        'Recall': result['recall'],
        'F1-Score': result['f1'],
        'Accuracy': result['accuracy']
    }
    for result in fold_results
])

# Add summary row
summary_row = {
    'Fold': 'Mean ± Std',
    'Precision': f"{np.mean(precisions):.4f} ± {np.std(precisions):.4f}",
    'Recall': f"{np.mean(recalls):.4f} ± {np.std(recalls):.4f}",
    'F1-Score': f"{np.mean(f1_scores):.4f} ± {np.std(f1_scores):.4f}",
    'Accuracy': f"{np.mean(accuracies):.4f} ± {np.std(accuracies):.4f}"
}

df_results = pd.concat([df_results, pd.DataFrame([summary_row])], ignore_index=True)

csv_file = f"{OUTPUT_DIR}/5fold_cv_results.csv"
df_results.to_csv(csv_file, index=False)
print(f"✅ Results CSV saved to: {csv_file}")

# Display final summary table
print(f"\n📊 FINAL RESULTS TABLE:")
print(df_results.to_string(index=False))

## 11. Conclusion

This notebook implements 5-fold cross-validation for the Serbian Legal NER pipeline using the base BERTić model (classla/bcms-bertic).

### Key Achievements:
- ✅ **Robust Evaluation**: Every document serves as validation data in exactly one of the 5 folds, so the estimate does not hinge on a single train/validation split
- ✅ **Comprehensive Metrics**: Precision, recall, F1-score, and accuracy tracked for every fold
- ✅ **Statistical Analysis**: Mean, standard deviation, minimum, and maximum calculated for all metrics
- ✅ **Visualization**: Per-fold bar charts and a box plot of the metric distributions
- ✅ **Results Persistence**: JSON and CSV files saved for further analysis

### Next Steps:
1. **Compare with Other Models**: Use this same framework to evaluate other model variants
2. **Hyperparameter Tuning**: Optimize learning rate, batch size, and other parameters
3. **Error Analysis**: Examine misclassified entities to identify improvement opportunities
4. **Ensemble Methods**: Combine predictions from the models trained on different folds for better performance

The 5-fold cross-validation framework is now ready to be applied to other models in your pipeline!
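
As a starting point for the ensemble idea in the next steps, one simple option is token-level majority voting over the tag sequences predicted by the per-fold models. This is only a sketch under assumptions: it needs a separate held-out test set (each fold model has seen four fifths of the cross-validation data), the function name `majority_vote` is hypothetical, and the example predictions are made up.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine equal-length tag sequences, one per model, by per-token majority.

    Ties go to the tag seen first, i.e. to the earliest model in the list.
    """
    voted = []
    for tags_at_position in zip(*predictions_per_model):
        tag, _count = Counter(tags_at_position).most_common(1)[0]
        voted.append(tag)
    return voted


# Made-up predictions from three fold models for the same three tokens
fold_predictions = [
    ["B-JUDGE", "I-JUDGE", "O"],
    ["B-JUDGE", "O", "O"],
    ["B-JUDGE", "I-JUDGE", "B-COURT"],
]
print(majority_vote(fold_predictions))
```

Voting per token can produce sequences that break BIO consistency (for example an `I-` tag after `O`), so a repair pass or span-level voting may be preferable in practice.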