# Serbian Legal Named Entity Recognition (NER) Pipeline - BERT-CRF 5-Fold Cross-Validation

This notebook implements 5-fold cross-validation for the Serbian Legal NER pipeline using BERT-CRF architecture.
BERT-CRF places a Conditional Random Field (CRF) layer on top of BERT's token representations, so tag transitions are scored jointly across the sequence instead of each token being labeled independently.

## Key Features
- **5-Fold Cross-Validation**: Robust evaluation across different data splits
- **BERT-CRF Architecture**: BERT embeddings + CRF layer for sequence constraints
- **Sliding Window Tokenization**: Handles long sequences without truncation
- **Comprehensive Metrics**: Precision, recall, F1-score, and accuracy tracking
- **Statistical Analysis**: Mean and standard deviation across folds
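
The sliding-window idea can be illustrated with a minimal sketch. It is not the implementation of `tokenize_and_align_labels_with_sliding_window` from the shared module; the helper name `sliding_windows` is hypothetical, and it assumes that `stride` means the number of tokens shared by consecutive windows.

```python
def sliding_windows(tokens, max_length, stride):
    """Split tokens into windows of at most max_length tokens.

    Assumption: `stride` is the overlap between consecutive windows.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break  # the last window reached the end of the sequence
        start += step
    return windows


tokens = list(range(10))
windows = sliding_windows(tokens, max_length=4, stride=2)
for window in windows:
    print(window)
```

Every token ends up in at least one window, so nothing is dropped the way it would be with plain truncation.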

## BERT-CRF Advantages
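- **Better Sequence Modeling**: The CRF learns transition scores that discourage invalid BIO sequences (for example, an `I-` tag directly after `O`)
- **Improved Entity Boundaries**: Scoring transitions helps keep the `B-`/`I-` tags of one entity contiguous, which can give more accurate entity spans
- **Global Optimization**: Viterbi decoding selects the highest-scoring tag sequence as a whole rather than the best tag for each token in isolation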
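
To make "valid BIO sequence" concrete, here is a small sketch of the rule the CRF is expected to learn. The helper `is_valid_bio` is hypothetical and is not part of the notebook or of `pytorch-crf`; it only checks that every `I-X` tag directly follows a `B-X` or `I-X` tag of the same entity type.

```python
def is_valid_bio(tags):
    """Return True if every I-X tag continues a B-X or I-X of the same type."""
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity = tag[2:]
            # An I- tag is only valid directly after B- or I- of the same entity type.
            if previous not in (f"B-{entity}", f"I-{entity}"):
                return False
        previous = tag
    return True


print(is_valid_bio(["B-COURT", "I-COURT", "O", "B-JUDGE"]))
print(is_valid_bio(["O", "I-COURT"]))
print(is_valid_bio(["B-COURT", "I-JUDGE"]))
```

A softmax classifier can emit any of these sequences; a CRF assigns low scores to the invalid ones once it has seen enough training data.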

## Entity Types
- **COURT**: Court institutions
- **DECISION_DATE**: Dates of legal decisions
- **CASE_NUMBER**: Case identifiers
- **CRIMINAL_ACT**: Criminal acts/charges
- **PROSECUTOR**: Prosecutor entities
- **DEFENDANT**: Defendant entities
- **JUDGE**: Judge names
- **REGISTRAR**: Court registrar
- **SANCTION**: Sanctions/penalties
- **SANCTION_TYPE**: Type of sanction
- **SANCTION_VALUE**: Value/duration of sanction
- **PROVISION**: Legal provisions
- **PROCEDURE_COSTS**: Legal procedure costs
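
For orientation, the sketch below shows how a BIO label set could be derived from these entity types. The notebook itself imports `BIO_LABELS` from the shared module, so the label order and the helper name `build_bio_labels` here are assumptions made for illustration.

```python
entity_types = [
    "COURT", "DECISION_DATE", "CASE_NUMBER", "CRIMINAL_ACT", "PROSECUTOR",
    "DEFENDANT", "JUDGE", "REGISTRAR", "SANCTION", "SANCTION_TYPE",
    "SANCTION_VALUE", "PROVISION", "PROCEDURE_COSTS",
]


def build_bio_labels(entity_types):
    """Return 'O' plus a B- and an I- label for every entity type."""
    labels = ["O"]
    for entity in entity_types:
        labels.extend([f"B-{entity}", f"I-{entity}"])
    return labels


bio_labels = build_bio_labels(entity_types)
# Hypothetical id mapping; the shared module may order labels differently.
label_to_id = {label: index for index, label in enumerate(bio_labels)}

print(len(bio_labels))
print(bio_labels[:5])
```

With 13 entity types this gives 27 labels (two per type plus `O`), which is the size of the tag space the classifier and CRF layer operate on under this scheme.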

In [None]:
# Mount Google Drive (for Colab)
try:
    from google.colab import drive
    drive.mount('/content/drive')
    USE_COLAB = True
except ImportError:
    USE_COLAB = False
    print("Running locally")

## 1. Environment Setup and Dependencies

In [None]:
# Install required packages including pytorch-crf for CRF layer
!pip install transformers torch datasets tokenizers scikit-learn seqeval pandas numpy matplotlib seaborn tqdm pytorch-crf

In [None]:
# Import shared modules
import sys
import os

# Add the shared modules to path
if USE_COLAB:
    sys.path.append('/content/drive/MyDrive/NER_Master/ner/')
else:
    sys.path.append('../shared')

import importlib
import shared
import shared.model_utils
import shared.data_processing
import shared.dataset
import shared.evaluation
import shared.config
importlib.reload(shared.config)
importlib.reload(shared.data_processing)
importlib.reload(shared.dataset)
importlib.reload(shared.model_utils)
importlib.reload(shared.evaluation)
importlib.reload(shared)

# Import from shared modules
from shared import (
    # Configuration
    ENTITY_TYPES, BIO_LABELS, DEFAULT_TRAINING_ARGS,
    get_default_model_config, get_paths, setup_environment,

    # Data processing
    LabelStudioToBIOConverter, load_labelstudio_data,
    analyze_labelstudio_data, validate_bio_examples,

    # Dataset
    NERDataset, split_dataset, tokenize_and_align_labels_with_sliding_window,
    print_sequence_analysis, create_huggingface_datasets,

    # Model utilities
    load_model_and_tokenizer, create_training_arguments, create_trainer,
    detailed_evaluation, save_model_info, setup_device_and_seed,

    # Evaluation
    generate_evaluation_report, plot_training_history, plot_entity_distribution
)

# Standard imports
import warnings
warnings.filterwarnings('ignore')
import numpy as np
from sklearn.model_selection import KFold
import torch
from transformers import DataCollatorForTokenClassification, AutoTokenizer

# Setup device and random seed
device = setup_device_and_seed(42)

## 2. BERT-CRF Specific Imports and Setup

In [None]:
# BERT-CRF specific imports
import torch.nn as nn
from torchcrf import CRF
from transformers import AutoModel, AutoConfig
from transformers import Trainer, TrainingArguments
from transformers.modeling_outputs import TokenClassifierOutput

print("✅ BERT-CRF specific imports loaded successfully!")

## 3. Configuration and Environment Setup

In [None]:
# Setup environment and paths
env_setup = setup_environment(use_local=not USE_COLAB, create_dirs=True)
paths = env_setup['paths']

# Model configuration
MODEL_NAME = "classla/bcms-bertic"
model_config = get_default_model_config()

# Output directory
OUTPUT_DIR = f"{paths['models_dir']}/bertic_crf_5fold_cv"
os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"🔧 Configuration:")
print(f"  Model: {MODEL_NAME}")
print(f"  Architecture: BERT-CRF")
print(f"  Output directory: {OUTPUT_DIR}")
print(f"  Entity types: {len(ENTITY_TYPES)}")
print(f"  BIO labels: {len(BIO_LABELS)}")

## 4. Data Loading and Analysis

In [None]:
# Load LabelStudio data
labelstudio_data = load_labelstudio_data(paths['labelstudio_json'])

# Analyze the data
if labelstudio_data:
    analysis = analyze_labelstudio_data(labelstudio_data)
else:
    # Raise instead of calling exit(), which would shut down the notebook kernel
    raise RuntimeError("❌ No data loaded. Please check your paths.")

## 5. Data Preprocessing and BIO Conversion

In [None]:
# Convert LabelStudio data to BIO format
converter = LabelStudioToBIOConverter(
    judgments_dir=paths['judgments_dir'],
    labelstudio_files_dir=paths.get('labelstudio_files_dir')
)

bio_examples = converter.convert_to_bio(labelstudio_data)
print(f"✅ Converted {len(bio_examples)} examples to BIO format")

# Validate BIO examples
valid_examples, stats = validate_bio_examples(bio_examples)
print(f"📊 Validation complete: {stats['valid_examples']} valid examples")

## 6. Dataset Preparation

In [None]:
# Create NER dataset
ner_dataset = NERDataset(valid_examples)
prepared_examples = ner_dataset.prepare_for_training()

print(f"📊 Dataset statistics:")
print(f"  Number of unique labels: {ner_dataset.get_num_labels()}")
print(f"  Prepared examples: {len(prepared_examples)}")

# Get label statistics
label_stats = ner_dataset.get_label_statistics()
print(f"  Total tokens: {label_stats['total_tokens']}")
print(f"  Entity types found: {len(label_stats['entity_counts'])}")

## 7. BERT-CRF Model Definition
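
Before the model class, a minimal pure-Python sketch of Viterbi decoding may help show what the CRF layer adds over per-token argmax. It is a simplification for illustration, not the `pytorch-crf` implementation: it omits start and end transition scores and batching, and the scores below are made up.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label index path.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    num_labels = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_scores, pointers = [], []
        for j in range(num_labels):
            # Best previous label for ending in label j at position t
            best_i = max(range(num_labels), key=lambda i: scores[i] + transitions[i][j])
            pointers.append(best_i)
            new_scores.append(scores[best_i] + transitions[best_i][j] + emissions[t][j])
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final label
    path = [max(range(num_labels), key=lambda j: scores[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


labels = ["O", "B-JUDGE", "I-JUDGE"]
emissions = [[1.0, 0.8, 0.0],   # token 0 slightly prefers O
             [0.0, 0.0, 1.0]]   # token 1 prefers I-JUDGE
transitions = [[0.0, 0.0, -5.0],  # O -> I-JUDGE is heavily penalized
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]

greedy = [labels[max(range(3), key=lambda j: row[j])] for row in emissions]
decoded = [labels[j] for j in viterbi_decode(emissions, transitions)]
print(greedy)
print(decoded)
```

Per-token argmax picks `O` followed by `I-JUDGE`, which is not a valid BIO sequence, while Viterbi decoding trades a slightly lower emission score at the first token for a valid `B-JUDGE`, `I-JUDGE` pair.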

In [None]:
class BertCrfForTokenClassification(nn.Module):
    """
    BERT model with CRF layer for token classification.
    Combines BERT embeddings with CRF for better sequence modeling.
    """
    
    def __init__(self, config, num_labels):
        super().__init__()
        self.num_labels = num_labels
        self.config = config
        
        # BERT backbone
        self.bert = AutoModel.from_pretrained(MODEL_NAME, config=config)
        
        # Dropout and classifier
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, num_labels)
        
        # CRF layer
        self.crf = CRF(num_labels, batch_first=True)
        
    def forward(self, input_ids, attention_mask=None, labels=None, **kwargs):
        # Get BERT outputs
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            return_dict=True
        )
        
        # Apply dropout and classifier
        sequence_output = outputs.last_hidden_state
        sequence_output = self.dropout(sequence_output)
        logits = self.classifier(sequence_output)
        
        loss = None
        if labels is not None:
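            # CRF loss calculation
            # Build the CRF mask from the labels (positions labeled -100 are ignored)
            mask = labels != -100
            # Replace -100 with 0 for CRF (will be masked anyway)
            labels_masked = labels.clone()
            labels_masked[labels == -100] = 0
            # pytorch-crf requires the first timestep of every sequence to be unmasked.
            # Position 0 is typically [CLS] with label -100, so switch it on explicitly;
            # it then contributes to the loss as label id 0.
            mask[:, 0] = True
            
            # Calculate CRF loss (negative mean log-likelihood)
            log_likelihood = self.crf(logits, labels_masked, mask=mask, reduction='mean')
            loss = -log_likelihood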
        
        return TokenClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
    
    def predict(self, input_ids, attention_mask=None):
        """Predict using CRF decoding"""
        with torch.no_grad():
            outputs = self.bert(
                input_ids=input_ids,
                attention_mask=attention_mask,
                return_dict=True
            )
            sequence_output = self.dropout(outputs.last_hidden_state)
            logits = self.classifier(sequence_output)
            
            # CRF decoding
            mask = attention_mask.bool() if attention_mask is not None else None
            predictions = self.crf.decode(logits, mask=mask)
            return predictions

print("✅ BERT-CRF model class defined successfully!")

## 8. K-Fold Cross-Validation Setup
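
As a reminder of what a shuffled k-fold split produces, here is a hand-rolled sketch using only the standard library. It is not scikit-learn's `KFold` implementation and will not reproduce its shuffle order; the helper name `kfold_indices` is hypothetical.

```python
import random


def kfold_indices(n_samples, n_splits, seed=42):
    """Yield (train_idx, val_idx) pairs that partition shuffled indices into folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first n_samples % n_splits folds receive one extra example
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(kfold_indices(n_samples=12, n_splits=5))
for fold_num, (train_idx, val_idx) in enumerate(folds, 1):
    print(f"Fold {fold_num}: train={len(train_idx)}, val={len(val_idx)}")
```

Each example appears in exactly one validation fold, so every document contributes to the evaluation once and to training in the other four folds.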

In [None]:
# Set up 5-fold cross-validation
N_FOLDS = 5
kfold = KFold(n_splits=N_FOLDS, shuffle=True, random_state=42)

# Convert to numpy array for easier indexing
examples_array = np.array(prepared_examples, dtype=object)

print(f"Setting up {N_FOLDS}-fold cross-validation")
print(f"Total examples: {len(prepared_examples)}")
print(f"Examples per fold (approx): {len(prepared_examples) // N_FOLDS}")

# Load tokenizer (will be used across all folds)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print(f"\nLoaded tokenizer for {MODEL_NAME}")
print(f"Tokenizer vocab size: {tokenizer.vocab_size}")

# Store results from all folds
fold_results = []

## 9. K-Fold Cross-Validation Helper Functions
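
The helper functions below report entity-level precision, recall, and F1 through `seqeval`. The following sketch shows the underlying idea with a simplified, strict span matcher. It is not seqeval's implementation (for instance, it ignores an `I-` tag that does not follow a matching `B-` tag), and the function names are hypothetical.

```python
def extract_entities(tags):
    """Return a set of (entity_type, start, end) spans from one BIO-tagged sequence."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and tag[2:] == current
        if current is not None and not continues:
            entities.add((current, start, i - 1))
            current = None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return entities


def entity_f1(true_sequences, pred_sequences):
    """Compute micro-averaged precision, recall, and F1 over exact span matches."""
    tp = n_true = n_pred = 0
    for true_tags, pred_tags in zip(true_sequences, pred_sequences):
        true_spans = extract_entities(true_tags)
        pred_spans = extract_entities(pred_tags)
        tp += len(true_spans & pred_spans)
        n_true += len(true_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


true_sequences = [["B-JUDGE", "I-JUDGE", "O"], ["B-COURT", "O"]]
pred_sequences = [["B-JUDGE", "I-JUDGE", "O"], ["O", "O"]]
precision, recall, f1 = entity_f1(true_sequences, pred_sequences)
print(precision, recall, round(f1, 4))
```

Spans are extracted per sequence, so an entity at the end of one sequence can never merge with one at the start of the next.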

In [None]:
# ============================================================================
# BERT-CRF K-FOLD CROSS-VALIDATION HELPER FUNCTIONS
# ============================================================================

def prepare_fold_data(train_examples, val_examples, tokenizer, ner_dataset):
    """
    Prepare training and validation datasets for a specific fold.
    
    Args:
        train_examples: Training examples for this fold
        val_examples: Validation examples for this fold
        tokenizer: Tokenizer instance
        ner_dataset: NER dataset instance
    
    Returns:
        tuple: (train_dataset, val_dataset, data_collator)
    """
    # Tokenize datasets with sliding window
    train_tokenized = tokenize_and_align_labels_with_sliding_window(
        train_examples, tokenizer, ner_dataset.label_to_id,
        max_length=model_config['max_length'], stride=model_config['stride']
    )
    
    val_tokenized = tokenize_and_align_labels_with_sliding_window(
        val_examples, tokenizer, ner_dataset.label_to_id,
        max_length=model_config['max_length'], stride=model_config['stride']
    )
    
    # Create HuggingFace datasets
    train_dataset, val_dataset, _ = create_huggingface_datasets(
        train_tokenized, val_tokenized, val_tokenized  # Using val as placeholder for test
    )
    
    # Data collator
    data_collator = DataCollatorForTokenClassification(
        tokenizer=tokenizer,
        padding=True,
        return_tensors="pt"
    )
    
    return train_dataset, val_dataset, data_collator

print("✅ Data preparation function defined successfully!")

In [None]:
def create_bert_crf_model_and_trainer(fold_num, train_dataset, val_dataset, data_collator, tokenizer, ner_dataset, device):
    """
    Create BERT-CRF model and trainer for a specific fold.
    
    Args:
        fold_num: Current fold number
        train_dataset: Training dataset for this fold
        val_dataset: Validation dataset for this fold
        data_collator: Data collator
        tokenizer: Tokenizer instance
        ner_dataset: NER dataset instance
        device: Device to use (cuda/cpu)
    
    Returns:
        tuple: (model, trainer, fold_output_dir)
    """
    # Create fold-specific output directory (os is already imported at the top of the notebook)
    fold_output_dir = f"{OUTPUT_DIR}/fold_{fold_num}"
    os.makedirs(fold_output_dir, exist_ok=True)
    
    # Load config and create BERT-CRF model
    config = AutoConfig.from_pretrained(MODEL_NAME)
    model = BertCrfForTokenClassification(config, ner_dataset.get_num_labels())
    
    # Move model to device
    model.to(device)
    
    # Create training arguments for this fold
    training_args = TrainingArguments(
        output_dir=fold_output_dir,
        num_train_epochs=model_config['num_epochs'],
        per_device_train_batch_size=model_config['batch_size'],
        per_device_eval_batch_size=model_config['batch_size'],
        learning_rate=model_config['learning_rate'],
        warmup_steps=500,
        weight_decay=0.01,
        logging_steps=50,
        eval_steps=100,
        save_steps=500,
        save_total_limit=2,
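        load_best_model_at_end=True,
        # compute_metrics returns only a placeholder F1, so select the best
        # checkpoint by validation loss instead
        metric_for_best_model="eval_loss",
        greater_is_better=False,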
        evaluation_strategy="steps",
        save_strategy="steps",
        report_to="none",  # Disable wandb for cleaner output
        run_name=f"bertic_crf_fold_{fold_num}",
        dataloader_pin_memory=False,
        remove_unused_columns=False
    )
    
    # Create trainer with custom compute_metrics for CRF
    def compute_metrics_crf(eval_pred):
        predictions, labels = eval_pred
        # The Trainer passes raw emission logits here, not CRF-decoded tags, so no
        # meaningful F1 is computed at this stage. The real metrics are computed with
        # CRF decoding in train_and_evaluate_bert_crf_fold.
        return {"f1": 0.0}  # Placeholder value
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics_crf
    )
    
    print(f"BERT-CRF Trainer initialized for fold {fold_num}")
    return model, trainer, fold_output_dir

print("✅ BERT-CRF model and trainer creation function defined successfully!")

In [None]:
def train_and_evaluate_bert_crf_fold(fold_num, trainer, val_dataset, ner_dataset):
    """
    Train and evaluate a BERT-CRF model for a specific fold.
    
    Args:
        fold_num: Current fold number
        trainer: Trainer instance
        val_dataset: Validation dataset for this fold
        ner_dataset: NER dataset instance
    
    Returns:
        dict: Fold results including metrics
    """
    print(f"\n🏋️  Training BERT-CRF fold {fold_num}...")
    
    # Train the model
    trainer.train()
    
    print(f"💾 Saving BERT-CRF model for fold {fold_num}...")
    trainer.save_model()
    
    # Evaluate on validation set
    print(f"📊 Evaluating BERT-CRF fold {fold_num}...")
    
    # For BERT-CRF, we need custom evaluation since CRF decoding is different
    model = trainer.model
    model.eval()
    
    all_predictions = []
    all_labels = []
    
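    # BertCrfForTokenClassification is a plain nn.Module, which has no .device
    # attribute, so read the device from its parameters instead
    model_device = next(model.parameters()).device
    
    with torch.no_grad():
        for batch in trainer.get_eval_dataloader():
            batch = {k: v.to(model_device) for k, v in batch.items()}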
            
            # Get CRF predictions
            predictions = model.predict(
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask']
            )
            
            # Process labels
            labels = batch['labels']
            
            # Convert to lists and filter out special tokens
            for pred_seq, label_seq, attention_seq in zip(predictions, labels, batch['attention_mask']):
                # Filter based on attention mask and ignore -100 labels
                valid_length = attention_seq.sum().item()
                pred_seq = pred_seq[:valid_length]
                label_seq = label_seq[:valid_length]
                
                # Filter out -100 labels
                valid_indices = label_seq != -100
                if valid_indices.any():
                    pred_filtered = [pred_seq[i] for i in range(len(pred_seq)) if valid_indices[i]]
                    label_filtered = [label_seq[i].item() for i in range(len(label_seq)) if valid_indices[i]]
                    
                    # Keep each sequence separate so entities cannot merge across
                    # sequence boundaries during entity-level scoring
                    all_predictions.append(pred_filtered)
                    all_labels.append(label_filtered)
    
    # Convert to label names for evaluation (one list of tags per sequence)
    pred_labels = [[ner_dataset.id_to_label[p] for p in seq] for seq in all_predictions]
    true_labels = [[ner_dataset.id_to_label[l] for l in seq] for seq in all_labels]
    
    # Calculate metrics using seqeval
    from seqeval.metrics import precision_score, recall_score, f1_score, accuracy_score
    
    # seqeval expects a list of tag sequences
    pred_sequences = pred_labels
    true_sequences = true_labels
    
    precision = precision_score(true_sequences, pred_sequences)
    recall = recall_score(true_sequences, pred_sequences)
    f1 = f1_score(true_sequences, pred_sequences)
    accuracy = accuracy_score(true_sequences, pred_sequences)
    
    # Extract metrics
    fold_result = {
        'fold': fold_num,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'accuracy': accuracy,
        'true_predictions': pred_labels,
        'true_labels': true_labels
    }
    
    print(f"\nBERT-CRF Fold {fold_num} completed successfully!")
    return fold_result

print("✅ BERT-CRF training and evaluation function defined successfully!")

## 10. K-Fold Cross-Validation Training Loop

In [None]:
# Check device availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Main K-Fold Cross-Validation Loop for BERT-CRF
print(f"\n{'='*80}")
print(f"STARTING {N_FOLDS}-FOLD CROSS-VALIDATION - BERT-CRF")
print(f"{'='*80}")
print(f"Total examples: {len(examples_array)}")
print(f"Model: {MODEL_NAME} + CRF")
print(f"Device: {device}")

# Execute K-Fold training
for fold_num, (train_idx, val_idx) in enumerate(kfold.split(examples_array), 1):
    print(f"\n{'='*80}")
    print(f"BERT-CRF FOLD {fold_num}/{N_FOLDS}")
    print(f"{'='*80}")
    print(f"Train indices: {len(train_idx)}, Val indices: {len(val_idx)}")
    
    # Get fold data
    train_examples = examples_array[train_idx].tolist()
    val_examples = examples_array[val_idx].tolist()
    
    print(f"Training examples: {len(train_examples)}")
    print(f"Validation examples: {len(val_examples)}")
    
    # Prepare data for this fold
    print(f"\n🔤 Preparing data for BERT-CRF fold {fold_num}...")
    train_dataset, val_dataset, data_collator = prepare_fold_data(
        train_examples, val_examples, tokenizer, ner_dataset
    )
    
    print(f"📦 BERT-CRF Fold {fold_num} datasets:")
    print(f"  Training: {len(train_dataset)} examples")
    print(f"  Validation: {len(val_dataset)} examples")
    
    # Create BERT-CRF model and trainer for this fold
    print(f"\n🤖 Creating BERT-CRF model and trainer for fold {fold_num}...")
    model, trainer, fold_output_dir = create_bert_crf_model_and_trainer(
        fold_num, train_dataset, val_dataset, data_collator, tokenizer, ner_dataset, device
    )
    
    # Train and evaluate this fold
    fold_result = train_and_evaluate_bert_crf_fold(fold_num, trainer, val_dataset, ner_dataset)
    fold_results.append(fold_result)
    
    # Clean up to free memory
    del model, trainer, train_dataset, val_dataset
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    
    print(f"\n✅ BERT-CRF Fold {fold_num} completed!")
    print(f"   Precision: {fold_result['precision']:.4f}")
    print(f"   Recall: {fold_result['recall']:.4f}")
    print(f"   F1-Score: {fold_result['f1']:.4f}")
    print(f"   Accuracy: {fold_result['accuracy']:.4f}")

print(f"\n{'='*80}")
print(f"BERT-CRF K-FOLD CROSS-VALIDATION COMPLETED!")
print(f"{'='*80}")

## 11. BERT-CRF Results Analysis and Summary
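
One detail worth keeping in mind when reading the "mean ± std" figures: `np.std` divides by the number of folds by default (`ddof=0`), whereas the sample standard deviation divides by one less. The sketch below shows the difference with the standard library; the F1 values are hypothetical and are not results from this notebook.

```python
import statistics

# Hypothetical per-fold F1 scores, for illustration only
f1_scores = [0.90, 0.92, 0.88, 0.91, 0.89]

mean_f1 = statistics.mean(f1_scores)
population_std = statistics.pstdev(f1_scores)  # divides by n, like np.std(x)
sample_std = statistics.stdev(f1_scores)       # divides by n - 1, like np.std(x, ddof=1)

print(f"mean={mean_f1:.4f}")
print(f"population std={population_std:.4f}")
print(f"sample std={sample_std:.4f}")
```

With only five folds the two estimates differ noticeably, so it helps to state which one a reported spread refers to when comparing against other models.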

In [None]:
# ============================================================================
# BERT-CRF K-FOLD RESULTS SUMMARY
# ============================================================================

print(f"\n{'='*80}")
print(f"BERT-CRF K-FOLD CROSS-VALIDATION RESULTS SUMMARY")
print(f"{'='*80}")

# Extract metrics from all folds
precisions = [result['precision'] for result in fold_results]
recalls = [result['recall'] for result in fold_results]
f1_scores = [result['f1'] for result in fold_results]
accuracies = [result['accuracy'] for result in fold_results]

# Calculate statistics
print(f"\n📊 BERT-CRF PERFORMANCE METRICS ACROSS {N_FOLDS} FOLDS:")
print(f"{'='*50}")

print(f"\n🎯 PRECISION:")
print(f"  Mean: {np.mean(precisions):.4f} ± {np.std(precisions):.4f}")
print(f"  Min:  {np.min(precisions):.4f} (Fold {np.argmin(precisions) + 1})")
print(f"  Max:  {np.max(precisions):.4f} (Fold {np.argmax(precisions) + 1})")

print(f"\n🎯 RECALL:")
print(f"  Mean: {np.mean(recalls):.4f} ± {np.std(recalls):.4f}")
print(f"  Min:  {np.min(recalls):.4f} (Fold {np.argmin(recalls) + 1})")
print(f"  Max:  {np.max(recalls):.4f} (Fold {np.argmax(recalls) + 1})")

print(f"\n🎯 F1-SCORE:")
print(f"  Mean: {np.mean(f1_scores):.4f} ± {np.std(f1_scores):.4f}")
print(f"  Min:  {np.min(f1_scores):.4f} (Fold {np.argmin(f1_scores) + 1})")
print(f"  Max:  {np.max(f1_scores):.4f} (Fold {np.argmax(f1_scores) + 1})")

print(f"\n🎯 ACCURACY:")
print(f"  Mean: {np.mean(accuracies):.4f} ± {np.std(accuracies):.4f}")
print(f"  Min:  {np.min(accuracies):.4f} (Fold {np.argmin(accuracies) + 1})")
print(f"  Max:  {np.max(accuracies):.4f} (Fold {np.argmax(accuracies) + 1})")

# Individual fold results
print(f"\n📋 INDIVIDUAL BERT-CRF FOLD RESULTS:")
print(f"{'='*50}")
for i, result in enumerate(fold_results, 1):
    print(f"Fold {i}: P={result['precision']:.4f}, R={result['recall']:.4f}, F1={result['f1']:.4f}, Acc={result['accuracy']:.4f}")

In [None]:
# ============================================================================
# SAVE BERT-CRF RESULTS TO FILE
# ============================================================================

import json
import pandas as pd
from datetime import datetime

# Create results summary
results_summary = {
    'experiment_info': {
        'model_name': MODEL_NAME,
        'architecture': 'BERT-CRF',
        'n_folds': N_FOLDS,
        'total_examples': len(prepared_examples),
        'timestamp': datetime.now().isoformat(),
        'device': str(device)
    },
    'overall_metrics': {
        'precision': {
            'mean': float(np.mean(precisions)),
            'std': float(np.std(precisions)),
            'min': float(np.min(precisions)),
            'max': float(np.max(precisions))
        },
        'recall': {
            'mean': float(np.mean(recalls)),
            'std': float(np.std(recalls)),
            'min': float(np.min(recalls)),
            'max': float(np.max(recalls))
        },
        'f1_score': {
            'mean': float(np.mean(f1_scores)),
            'std': float(np.std(f1_scores)),
            'min': float(np.min(f1_scores)),
            'max': float(np.max(f1_scores))
        },
        'accuracy': {
            'mean': float(np.mean(accuracies)),
            'std': float(np.std(accuracies)),
            'min': float(np.min(accuracies)),
            'max': float(np.max(accuracies))
        }
    },
    'fold_results': [
        {
            'fold': result['fold'],
            'precision': float(result['precision']),
            'recall': float(result['recall']),
            'f1': float(result['f1']),
            'accuracy': float(result['accuracy'])
        }
        for result in fold_results
    ]
}

# Save results to JSON
results_file = f"{OUTPUT_DIR}/bert_crf_5fold_cv_results.json"
with open(results_file, 'w', encoding='utf-8') as f:
    json.dump(results_summary, f, indent=2, ensure_ascii=False)

print(f"✅ BERT-CRF Results saved to: {results_file}")

# Create CSV for easy analysis
df_results = pd.DataFrame([
    {
        'Fold': result['fold'],
        'Precision': result['precision'],
        'Recall': result['recall'],
        'F1-Score': result['f1'],
        'Accuracy': result['accuracy']
    }
    for result in fold_results
])

# Add summary row
summary_row = {
    'Fold': 'Mean ± Std',
    'Precision': f"{np.mean(precisions):.4f} ± {np.std(precisions):.4f}",
    'Recall': f"{np.mean(recalls):.4f} ± {np.std(recalls):.4f}",
    'F1-Score': f"{np.mean(f1_scores):.4f} ± {np.std(f1_scores):.4f}",
    'Accuracy': f"{np.mean(accuracies):.4f} ± {np.std(accuracies):.4f}"
}

df_results = pd.concat([df_results, pd.DataFrame([summary_row])], ignore_index=True)

csv_file = f"{OUTPUT_DIR}/bert_crf_5fold_cv_results.csv"
df_results.to_csv(csv_file, index=False)
print(f"✅ BERT-CRF Results CSV saved to: {csv_file}")

# Display final summary table
print(f"\n📊 BERT-CRF FINAL RESULTS TABLE:")
print(df_results.to_string(index=False))

## 12. Conclusion

This notebook successfully implemented 5-fold cross-validation for the Serbian Legal NER pipeline using BERT-CRF architecture.

### Key Achievements:
- ✅ **BERT-CRF Implementation**: Combined BERT embeddings with CRF layer for better sequence modeling
- ✅ **Robust Evaluation**: 5-fold cross-validation provides reliable performance estimates
- ✅ **CRF Decoding**: Proper CRF decoding for optimal sequence labeling
- ✅ **Comprehensive Metrics**: Precision, recall, F1-score, and accuracy tracked across all folds
- ✅ **Results Persistence**: JSON and CSV files saved for comparison with other models

### BERT-CRF Advantages:
- **Better Sequence Constraints**: The CRF learns transition scores that discourage invalid BIO tag transitions
- **Global Optimization**: Considers entire sequence for optimal labeling
- **Improved Entity Boundaries**: More accurate entity span detection

The BERT-CRF 5-fold cross-validation results can now be compared with the base BERT model, using the same folds and metrics, to evaluate the impact of the CRF layer.