# Memory-Optimized CodeBERT for Swift Code Understanding

This notebook fine-tunes the [CodeBERT](https://github.com/microsoft/CodeBERT) model on the [Swift Code Intelligence dataset](https://huggingface.co/datasets/mvasiliniuc/iva-swift-codeint) for a binary classification task: detecting `Package.swift` files. It applies several memory optimizations intended to avoid the "Resource Exhausted" error during TPU training.

## Key Optimizations
- 📉 Reduced sequence length from 512 to 384
- 📊 Reduced batch size and implemented gradient accumulation
- 🧠 Added gradient checkpointing to save memory
- 🔧 Optimized tokenization and data processing
- 🛠️ Enhanced error handling and recovery
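
To put the first item in perspective, here is a rough back-of-the-envelope sketch of how much memory the self-attention score matrices alone would need at each sequence length. It assumes a RoBERTa-base-sized encoder (12 layers, 12 heads), float32 values, and a batch of 16; the function name is made up for illustration, and real memory use includes many other tensors.

```python
# Rough size of the self-attention score tensors for one forward pass.
# Assumptions: 12 layers, 12 heads, float32 scores, and only the
# (seq_len x seq_len) score matrices are counted, nothing else.

def attention_score_bytes(batch_size, seq_len, num_layers=12, num_heads=12, bytes_per_value=4):
    """Bytes needed to hold every layer's attention score matrices."""
    return batch_size * num_layers * num_heads * seq_len * seq_len * bytes_per_value

full = attention_score_bytes(batch_size=16, seq_len=512)
reduced = attention_score_bytes(batch_size=16, seq_len=384)

print(f"512 tokens: {full / 2**30:.2f} GiB")
print(f"384 tokens: {reduced / 2**30:.2f} GiB")
print(f"ratio: {reduced / full:.4f}")
```

Because this term grows with the square of the sequence length, shortening sequences to three quarters of their length cuts it to roughly 56% of its original size.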

Let's start by installing the necessary libraries:

In [None]:
!pip install transformers datasets evaluate torch scikit-learn tqdm dropbox requests psutil

In [None]:
import os
import json
import torch
import random
import numpy as np
import time
import gc
from tqdm.auto import tqdm
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score, precision_recall_fscore_support
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification,
    RobertaForSequenceClassification,
    Trainer, 
    TrainingArguments,
    set_seed,
    DataCollatorWithPadding,
    get_scheduler
)
# Import AdamW from torch.optim instead of transformers.optimization
from torch.optim import AdamW
from transformers.trainer_utils import get_last_checkpoint

# Set a seed for reproducibility
set_seed(42)

# Add memory management functions
def cleanup_memory():
    """Force garbage collection and clear CUDA cache if available."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    print("Memory cleaned up.")
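
Why call `gc.collect()` at all when `del` already drops references? The small standard-library sketch below shows one reason: objects that reference each other in a cycle are not freed by reference counting alone. The `Node` class and `make_cycle` helper are made up for this illustration.

```python
import gc

class Node:
    """Toy object that can hold a reference to another Node."""
    def __init__(self):
        self.other = None

def make_cycle():
    # Two objects that reference each other: their reference counts never
    # reach zero, so dropping the local names alone cannot free them.
    a, b = Node(), Node()
    a.other, b.other = b, a

gc.collect()          # start from a clean state
gc.disable()          # keep the automatic collector out of the demo
for _ in range(100):
    make_cycle()      # each cycle is unreachable once the call returns
freed = gc.collect()  # number of unreachable objects the collector found
gc.enable()

print(f"Unreachable objects found by gc.collect(): {freed}")
```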

## Accelerator Detection and Configuration

Let's detect and configure the available accelerator (CPU, GPU, or TPU):

In [None]:
# Function to detect and configure TPU
def detect_and_configure_accelerator():
    """Detect and configure the available accelerator (CPU, GPU, or TPU)."""
    try:
        # Check for TPU
        import torch_xla.core.xla_model as xm
        print("TPU detected! Configuring for TPU training...")
        device = xm.xla_device()
        use_tpu = True
        use_gpu = False
        
        # Configure XLA for TPU
        import torch_xla.distributed.parallel_loader as pl
        import torch_xla.distributed.xla_multiprocessing as xmp
        
        print(f"TPU cores available: {xm.xrt_world_size()}")
        return device, use_tpu, use_gpu
        
    except ImportError:
        # Check for GPU
        if torch.cuda.is_available():
            print(f"GPU detected! Using {torch.cuda.get_device_name(0)}")
            device = torch.device("cuda")
            use_tpu = False
            use_gpu = True
            print(f"GPU memory available: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
        else:
            print("No GPU or TPU detected. Using CPU (this will be slow).")
            device = torch.device("cpu")
            use_tpu = False
            use_gpu = False
        
        return device, use_tpu, use_gpu
    except Exception as e:
        print(f"Error detecting accelerator: {e}")
        print("Defaulting to CPU.")
        return torch.device("cpu"), False, False

# Detect and configure accelerator
device, use_tpu, use_gpu = detect_and_configure_accelerator()
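
The priority order used above (TPU first, then GPU, then CPU) can be sketched on its own, without importing any accelerator library. The helper names below are hypothetical, and `importlib.util.find_spec` only tells us whether a package is installed, not whether a TPU is actually attached.

```python
import importlib.util

def module_available(name):
    """True if the top-level module `name` is installed, checked without importing it."""
    return importlib.util.find_spec(name) is not None

def pick_backend(has_xla, has_cuda):
    """Apply the notebook's priority order: TPU, then GPU, then CPU."""
    if has_xla:
        return "tpu"
    if has_cuda:
        return "gpu"
    return "cpu"

has_xla = module_available("torch_xla")
# torch.cuda.is_available() would be the real CUDA check; the flag is
# hard-coded here so the sketch runs without PyTorch installed.
has_cuda = False
print(pick_backend(has_xla, has_cuda))
```

Keeping the decision in a small pure function makes the fallback order easy to test separately from the hardware probing.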

## Dataset and Model Configuration

Let's define the model and dataset we'll be using with memory-optimized parameters:

In [None]:
# Set model and dataset IDs
MODEL_ID = "microsoft/codebert-base"
DATASET_ID = "mvasiliniuc/iva-swift-codeint"

# Max sequence length, reduced from 512 to save memory
MAX_LENGTH = 384

# Configure batch sizes based on available hardware - optimized for memory efficiency
if use_tpu:
    # Significantly reduced batch size to prevent TPU memory exhaustion
    TRAIN_BATCH_SIZE = 16  # Reduced from 64 to prevent memory issues
    EVAL_BATCH_SIZE = 32   # Reduced from 128
    # Increased gradient accumulation to maintain effective batch size
    GRADIENT_ACCUMULATION_STEPS = 4  # Accumulate gradients to simulate larger batch
elif use_gpu:
    TRAIN_BATCH_SIZE = 12   
    EVAL_BATCH_SIZE = 24
    GRADIENT_ACCUMULATION_STEPS = 2
else:
    TRAIN_BATCH_SIZE = 6    
    EVAL_BATCH_SIZE = 12
    GRADIENT_ACCUMULATION_STEPS = 4

# Effective batch size = TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS
EFFECTIVE_BATCH_SIZE = TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS

print(f"Using device: {device}")
print(f"Max sequence length: {MAX_LENGTH}")
print(f"Training batch size: {TRAIN_BATCH_SIZE}")
print(f"Gradient accumulation steps: {GRADIENT_ACCUMULATION_STEPS}")
print(f"Effective batch size: {EFFECTIVE_BATCH_SIZE}")
print(f"Evaluation batch size: {EVAL_BATCH_SIZE}")
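
As a quick check of the arithmetic, the sketch below computes the effective batch size for each of the three hardware configurations above. The `num_devices` parameter is an addition for illustration: it assumes that each TPU core processes its own per-device batch, which may or may not match how your runtime launches training.

```python
def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Examples contributing to one optimizer update."""
    return per_device_batch * accumulation_steps * num_devices

# (per-device batch, accumulation steps) from the configuration cell
configs = {"tpu": (16, 4), "gpu": (12, 2), "cpu": (6, 4)}

for name, (batch, accum) in configs.items():
    print(f"{name}: {effective_batch_size(batch, accum)} examples per update")

# Assumption: if all 8 TPU cores each process their own batch, the global
# figure is 8x larger than the single-device one.
print("tpu, 8 cores:", effective_batch_size(16, 4, num_devices=8))
```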

## Data Loading

Now let's load the Swift code dataset, retrying on transient failures:

In [None]:
# Function to load dataset with retry logic
def load_dataset_with_retry(dataset_id, max_retries=3, retry_delay=5):
    """Load a dataset with retry logic."""
    for attempt in range(max_retries):
        try:
            print(f"Loading dataset (attempt {attempt+1}/{max_retries})...")
            data = load_dataset(dataset_id, trust_remote_code=True)
            print(f"Dataset loaded successfully with {len(data['train'])} examples")
            return data
        except Exception as e:
            print(f"Error loading dataset (attempt {attempt+1}/{max_retries}): {e}")
            if attempt < max_retries - 1:
                print(f"Retrying in {retry_delay} seconds...")
                time.sleep(retry_delay)
            else:
                print("Maximum retries reached. Could not load dataset.")
                raise

# Load the dataset with retry logic
try:
    data = load_dataset_with_retry(DATASET_ID)
    print("Dataset structure:")
    print(data)
except Exception as e:
    print(f"Fatal error loading dataset: {e}")
    raise
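
The loader above waits a fixed 5 seconds between attempts. A common variation is exponential backoff, where the wait doubles after each failure. The sketch below is one possible way to write it; `retry_with_backoff` and `flaky_load` are hypothetical names, and the `sleep` argument exists only so the example can record the waits instead of actually sleeping.

```python
import time

def retry_with_backoff(fn, max_retries=3, base_delay=5, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Hypothetical flaky loader that fails twice, then succeeds.
calls = {"count": 0}

def flaky_load():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network failure")
    return "dataset"

waits = []
result = retry_with_backoff(flaky_load, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```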

In [None]:
# Create a classification dataset based on whether the file is a Package.swift file
def add_labels(example):
    # Label 1 if it's a Package.swift file, 0 otherwise
    example['label'] = 1 if 'Package.swift' in example['path'] else 0
    return example

try:
    # Apply the labeling function
    labeled_data = data['train'].map(add_labels)
    
    # Check the distribution of labels using collections.Counter
    import collections
    all_labels = labeled_data['label']
    label_counter = collections.Counter(all_labels)
    print("Label distribution:")
    for label, count in label_counter.items():
        print(f"Label {label}: {count} examples ({count/len(labeled_data)*100:.2f}%)")
    
    # Split the dataset with stratification to maintain label distribution
    from datasets import ClassLabel
    
    # Get unique labels
    unique_labels = sorted(set(labeled_data["label"]))
    num_labels = len(unique_labels)
    
    # Create a new dataset with ClassLabel feature
    labeled_data = labeled_data.cast_column("label", ClassLabel(num_classes=num_labels, names=[str(i) for i in unique_labels]))
    
    # Split the dataset with stratification
    train_test_split = labeled_data.train_test_split(test_size=0.1, seed=42, stratify_by_column='label')
    train_data = train_test_split['train']
    val_data = train_test_split['test']
    
    print(f"Training set size: {len(train_data)}")
    print(f"Validation set size: {len(val_data)}")
    
    # Free up memory
    del labeled_data
    del data
    cleanup_memory()
except Exception as e:
    print(f"Error preparing dataset: {e}")
    raise
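
To see what stratification buys us, here is a minimal pure-Python sketch of a stratified split on toy labels. It is not how the `datasets` library implements `stratify_by_column`; it only illustrates the idea of holding out the same fraction of every class, and the 950/50 label counts are invented.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.1, seed=42):
    """Return (train_idx, test_idx) with about test_size of each class held out."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_size))  # keep at least one per class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy imbalanced labels: 950 ordinary files, 50 Package.swift files.
labels = [0] * 950 + [1] * 50
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in test_idx))
```

With a plain random split, a rare class can end up over- or under-represented in the validation set; splitting per class keeps the 95:5 ratio on both sides.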

## Loading the CodeBERT Tokenizer and Tokenization

Now, let's load the tokenizer and tokenize our data with memory-efficient settings:

In [None]:
# Load the CodeBERT tokenizer with error handling
try:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    print(f"Tokenizer loaded successfully with {len(tokenizer)} tokens in vocabulary")
except Exception as e:
    print(f"Error loading tokenizer: {e}")
    raise

In [None]:
# Memory-efficient tokenization function
def tokenize_function(examples):
    """Tokenize Swift code with memory efficiency."""
    return tokenizer(
        examples["content"],
        padding=False,  # No padding during preprocessing saves memory
        truncation=True,
        max_length=MAX_LENGTH,  # Using reduced max length
        return_special_tokens_mask=False,  # Save memory
        return_offsets_mapping=False,      # Save memory
        return_token_type_ids=False,      # CodeBERT is RoBERTa-based and does not use segment IDs
        return_attention_mask=True        # Needed for proper masking
    )

# Tokenize the datasets with smaller batch size
try:
    # Use fewer parallel processes
    import multiprocessing
    num_cpus = multiprocessing.cpu_count()
    num_proc = max(1, int(num_cpus * 0.5))  # Use 50% of CPU cores
    
    # Lower batch size for tokenization
    tokenization_batch_size = 32
    
    print("Tokenizing training data...")
    tokenized_train_data = train_data.map(
        tokenize_function,
        batched=True,
        batch_size=tokenization_batch_size,
        num_proc=num_proc,
        remove_columns=[col for col in train_data.column_names if col != 'label'],
        desc="Tokenizing training data"
    )
    
    print("Tokenizing validation data...")
    tokenized_val_data = val_data.map(
        tokenize_function,
        batched=True,
        batch_size=tokenization_batch_size,
        num_proc=num_proc,
        remove_columns=[col for col in val_data.column_names if col != 'label'],
        desc="Tokenizing validation data"
    )
    
    # Print token statistics
    train_lengths = [len(x["input_ids"]) for x in tokenized_train_data]
    print(f"Average training sequence length: {sum(train_lengths)/len(train_lengths):.1f} tokens")
    print(f"Percent of examples at max length (likely truncated): {sum(1 for n in train_lengths if n >= MAX_LENGTH)/len(train_lengths)*100:.2f}%")
    
    # Clean up memory
    del train_data
    del val_data
    cleanup_memory()
except Exception as e:
    print(f"Error tokenizing data: {e}")
    raise
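
Setting `padding=False` during preprocessing defers padding to batch time, where each batch only needs to be padded to its own longest example. The sketch below compares the two approaches on made-up token counts; the helper name and the lengths are illustrative only.

```python
def padded_token_count(lengths, batch_size, pad_to=None):
    """Total tokens after padding each batch to pad_to, or to its longest item."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = pad_to if pad_to is not None else max(batch)
        total += target * len(batch)
    return total

# Hypothetical token counts for eight tokenized files (already capped at 384).
lengths = [40, 120, 75, 60, 384, 384, 200, 90]

fixed = padded_token_count(lengths, batch_size=4, pad_to=384)
dynamic = padded_token_count(lengths, batch_size=4)
print(f"fixed padding: {fixed} tokens, dynamic padding: {dynamic} tokens")
```

The saving depends on batch composition: a single long example forces its whole batch up to that length, so the benefit is largest when many files are short.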

## Model Preparation

Now let's load the CodeBERT model with memory-efficient settings:

In [None]:
try:
    # Load model with memory efficiency
    print("Loading CodeBERT model with memory optimization...")
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID, 
        num_labels=2,
        low_cpu_mem_usage=True  # For memory efficiency
    )
    
    # Enable gradient checkpointing (critical for memory savings)
    try:
        model.gradient_checkpointing_enable()
        print("Gradient checkpointing enabled for memory efficiency.")
    except Exception as e:
        print(f"Could not enable gradient checkpointing: {e}")
    
    # Move model to device if not TPU
    if not use_tpu:
        model.to(device)
        
    print(f"Model loaded successfully with {sum(p.numel() for p in model.parameters()):,} parameters")
except Exception as e:
    print(f"Error loading model: {e}")
    raise
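
Gradient checkpointing trades extra computation for lower memory. A toy accounting model, sketched below, gives some intuition. It assumes 12 encoder layers and an invented figure of 10 equally sized intermediate tensors per layer, so treat the numbers as illustrative rather than as measurements of CodeBERT.

```python
def stored_activations(num_layers, per_layer_intermediates, checkpointing):
    """Toy count of the peak number of activation tensors held for the backward pass."""
    if not checkpointing:
        # Every layer keeps all of its intermediates until backward reaches it.
        return num_layers * per_layer_intermediates
    # Only each layer's input is kept; layers are recomputed one at a time,
    # so at most one layer's intermediates are alive alongside the inputs.
    return num_layers + per_layer_intermediates

# Assumptions: 12 encoder layers and a made-up figure of 10 intermediate
# tensors per layer, all treated as equally large.
baseline = stored_activations(12, 10, checkpointing=False)
checkpointed = stored_activations(12, 10, checkpointing=True)
print(f"without checkpointing: {baseline} tensors")
print(f"with checkpointing:    {checkpointed} tensors")
```

The price is that each layer's forward pass runs a second time during the backward pass, so training steps take longer.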

## Training Setup

Now let's define our training arguments and evaluation metrics with memory-efficient settings:

In [None]:
# Simple metrics function
def compute_metrics(eval_preds):
    """Compute basic evaluation metrics."""
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    
    # Calculate metrics
    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average='weighted'
    )
    
    # Return basic metrics to save memory
    return {
        'accuracy': accuracy,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
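
One caveat with `average='weighted'`: if `Package.swift` files are rare, a weighted F1 can look healthy even when the model never finds one. The pure-Python sketch below reproduces the calculation by hand on toy data (95 negatives, 5 positives, and a model that always predicts class 0); the helper names are hypothetical.

```python
def f1_for_class(labels, preds, cls):
    """F1 for one class, returning 0.0 when precision or recall is undefined."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def weighted_f1(labels, preds):
    """Per-class F1 averaged with true-label frequencies as weights."""
    classes = sorted(set(labels))
    return sum(f1_for_class(labels, preds, c) * labels.count(c) for c in classes) / len(labels)

# Toy validation set: 95 ordinary files, 5 Package.swift files,
# and a degenerate model that always predicts class 0.
labels = [0] * 95 + [1] * 5
preds = [0] * 100

print(f"weighted F1: {weighted_f1(labels, preds):.3f}")
print(f"F1 for class 1: {f1_for_class(labels, preds, 1):.3f}")
```

If the label distribution printed earlier turns out to be heavily skewed, it may be worth also reporting the F1 of class 1 on its own.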

In [None]:
# Create a data collator for dynamic padding (saves memory)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Define training arguments with TPU-optimized memory settings
try:
    # Set up training arguments for memory efficiency and stability
    training_args = TrainingArguments(
        output_dir="./results/codebert-swift",
        # Basic training parameters
        save_steps=200,               
        save_total_limit=2,           # Keep fewer checkpoints
        learning_rate=3e-5,           
        per_device_train_batch_size=TRAIN_BATCH_SIZE,
        per_device_eval_batch_size=EVAL_BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,  # Critical for memory efficiency
        num_train_epochs=2,           # Reduced from 3 to 2 epochs
        weight_decay=0.01,
        warmup_steps=100,             
        logging_dir="./logs",
        logging_steps=50,
        
        # TPU-specific configurations
        tpu_num_cores=8 if use_tpu else None,  
        dataloader_drop_last=True if use_tpu else False,  # Important for TPU
        
        # Memory optimizations
        fp16=use_gpu,                 # Mixed precision on GPU
        dataloader_num_workers=2,     # Reduced worker count
        dataloader_pin_memory=use_gpu,  # Pinned host memory only helps CUDA transfers
        max_grad_norm=1.0,           
        
        # Optimizer settings
        adam_beta1=0.9,
        adam_beta2=0.999,
        adam_epsilon=1e-8,
    )
    
    print("Training arguments configured successfully.")
    print(f"Effective batch size: {TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS}")
except Exception as e:
    print(f"Error configuring training arguments: {e}")
    raise
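
With `warmup_steps=100` and the scheduler left at its default (linear, as far as the `Trainer` defaults go), the learning rate ramps up to `3e-5` and then decays linearly to zero. Here is a small sketch of that shape. The function name is made up, and `total_steps=1000` is a placeholder because the real value depends on dataset size, batch size, accumulation steps, and epochs.

```python
def linear_warmup_decay_lr(step, base_lr=3e-5, warmup_steps=100, total_steps=1000):
    """Learning rate at `step`: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# total_steps=1000 is a placeholder, not a value taken from this notebook.
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay_lr(step))
```

Note that with gradient accumulation a "step" here means one optimizer update, not one forward pass.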

In [None]:
# Create the Trainer with memory-efficient settings
try:
    # Initialize the trainer without any callbacks
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train_data,
        eval_dataset=tokenized_val_data,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
        data_collator=data_collator,  # Dynamic padding
        # No callbacks for simplicity
    )
    
    print("Trainer initialized successfully.")
except Exception as e:
    print(f"Error creating trainer: {e}")
    raise

## Training the Model

Now let's train our CodeBERT model with enhanced error handling for memory issues:

In [None]:
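# Start training with checkpoint recovery and enhanced error handling
try:
    # Clean up memory before training
    cleanup_memory()
    
    # Resume from the latest checkpoint in output_dir, if one exists
    last_checkpoint = None
    if os.path.isdir(training_args.output_dir):
        last_checkpoint = get_last_checkpoint(training_args.output_dir)
    if last_checkpoint is not None:
        print(f"Resuming from checkpoint: {last_checkpoint}")
    
    print("Starting model training...")
    train_result = trainer.train(resume_from_checkpoint=last_checkpoint)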
    print(f"Training completed successfully! Metrics: {train_result.metrics}")
    
    # Save the final model
    print("Saving final model...")
    trainer.save_model()
    print("Final model saved.")
    
    # Save training metrics
    trainer.log_metrics("train", train_result.metrics)
    trainer.save_metrics("train", train_result.metrics)
    trainer.save_state()
    print("Training metrics and state saved.")
except RuntimeError as e:
    # Handle memory-related errors specially
    error_msg = str(e)
    print(f"Runtime error during training: {error_msg}")
    
    if any(key in error_msg.lower() for key in ("memory", "resource exhausted", "resource_exhausted")):
        print("\nMEMORY ERROR DETECTED! Try further reducing these parameters:")
        print(f"1. MAX_LENGTH (currently {MAX_LENGTH}). Try 256 or 192.")
        print(f"2. TRAIN_BATCH_SIZE (currently {TRAIN_BATCH_SIZE}). Try 8 or 4.")
        print(f"3. Increase GRADIENT_ACCUMULATION_STEPS (currently {GRADIENT_ACCUMULATION_STEPS}). Try 8 or 16.")
    
    # Try to save current state
    try:
        print("Attempting to save current model state...")
        trainer.save_model("./results/codebert-swift-emergency-save")
        print("Emergency model save completed.")
    except Exception as save_err:
        print(f"Could not perform emergency save: {save_err}")
    
    raise
except Exception as e:
    print(f"Error during training: {e}")
    raise
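
The advice printed by the error handler can also be expressed as a small fallback rule: halve the batch size and double the accumulation steps, so the effective batch size stays the same. The sketch below is one way to write that rule; both function names are hypothetical, and the message check is only a heuristic.

```python
def is_memory_error(message):
    """Heuristic: does an error message look like an out-of-memory failure?"""
    text = message.lower()
    return any(key in text for key in ("out of memory", "resource exhausted", "resource_exhausted"))

def next_fallback(batch_size, accumulation_steps):
    """Halve the batch and double accumulation (product is preserved for even batch sizes)."""
    if batch_size <= 1:
        raise ValueError("batch size cannot be reduced further")
    return batch_size // 2, accumulation_steps * 2

# Two rounds of fallback starting from the TPU settings in this notebook.
config = (16, 4)
for _ in range(2):
    config = next_fallback(*config)
print(config)  # (4, 16)
```

After changing these values, re-run the configuration and training-arguments cells so that the `Trainer` picks them up.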

## Evaluating the Model

Let's evaluate our model on the validation dataset:

In [None]:
# Evaluate the model with memory efficiency
try:
    # Clean up memory before evaluation
    cleanup_memory()
    
    print("Evaluating model on validation dataset...")
    eval_results = trainer.evaluate()
    print(f"Evaluation results: {eval_results}")
    
    # Save evaluation metrics
    trainer.log_metrics("eval", eval_results)
    trainer.save_metrics("eval", eval_results)
except Exception as e:
    print(f"Error during evaluation: {e}")
    raise

## Conclusion

This notebook applied several memory optimizations to CodeBERT fine-tuning for Swift code classification:

1. **Memory Optimizations**: Reduced the sequence length and batch size, and added gradient accumulation to preserve the effective batch size
2. **Gradient Checkpointing**: Added gradient checkpointing to trade computation for memory
3. **Efficient Tokenization**: Optimized the tokenization process to use less memory
4. **Improved Error Handling**: Added better error handling and recovery mechanisms
5. **TPU-Specific Settings**: Added configurations specific to TPU memory efficiency

These settings are intended to let the model train on TPU without hitting memory limits. If a "Resource Exhausted" error still occurs, reduce `MAX_LENGTH` or `TRAIN_BATCH_SIZE` further, as suggested by the training cell's error message.