# Enhanced CodeBERT for Swift Code Understanding with LoRA

In this notebook, we fine-tune the [CodeBERT](https://github.com/microsoft/CodeBERT) model on the [Swift Code Intelligence dataset](https://huggingface.co/datasets/mvasiliniuc/iva-swift-codeint) using LoRA (Low-Rank Adaptation). CodeBERT is a pre-trained model specifically designed for programming languages, much like how BERT was pre-trained for natural language text. Created by Microsoft Research, CodeBERT can understand both programming language and natural language, making it ideal for code-related tasks.

## What is LoRA?

LoRA (Low-Rank Adaptation) freezes the pre-trained weights and adds small, trainable rank-decomposition matrices alongside them, so only a small fraction of the parameters is updated during fine-tuning. This approach:

- Reduces memory usage during training (figures such as 3-4x depend on the model and setup), because gradients and optimizer state are kept only for the adapter parameters
- Can speed up training, since far fewer parameters are updated per step
- Allows for efficient model adaptation with minimal parameters
- Often maintains model quality comparable to full fine-tuning
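
To make the parameter savings concrete, here is a small back-of-the-envelope sketch. It assumes a RoBERTa-base-sized encoder (hidden size 768, 12 layers), LoRA applied to the query and value projections, and rank 8; these sizes are assumptions used for illustration, and biases are ignored.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA pair: A has shape (r, d_in), B has shape (d_out, r)."""
    return r * d_in + d_out * r

HIDDEN = 768      # assumed hidden size of a RoBERTa-base-sized encoder
NUM_LAYERS = 12   # assumed number of transformer layers
TARGETS = 2       # query and value projections in each layer
RANK = 8          # LoRA rank

full = HIDDEN * HIDDEN                          # one dense projection matrix
adapter = lora_param_count(HIDDEN, HIDDEN, RANK)
total_adapter = adapter * TARGETS * NUM_LAYERS

print(f"Dense projection weights:        {full:,}")
print(f"LoRA weights per projection:     {adapter:,} ({adapter / full:.1%})")
print(f"LoRA weights across the encoder: {total_adapter:,}")
```

Under these assumptions each adapted projection trains 12,288 LoRA weights instead of 589,824 dense weights. The classifier head is trained in addition to this, so the count reported by the model itself will be higher.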

## Overview

The process of fine-tuning CodeBERT with LoRA involves:

1. **🔧 Setup**: Install necessary libraries and prepare our environment
2. **📥 Data Loading**: Load the Swift code dataset from Hugging Face
3. **🧹 Preprocessing**: Prepare the data for training by tokenizing the code samples
4. **🔄 LoRA Configuration**: Set up LoRA for efficient fine-tuning
5. **🧠 Model Training**: Fine-tune CodeBERT on our prepared data with optimized performance
6. **📊 Evaluation**: Assess how well our model performs
7. **📤 Export & Upload**: Save the model and upload it to Dropbox

Let's start by installing the necessary libraries:

In [None]:
!pip install transformers datasets evaluate torch scikit-learn tqdm dropbox requests accelerate peft

In [None]:
import os
import json
import torch
import random
import numpy as np
import time
from tqdm.auto import tqdm
from datasets import load_dataset, Dataset, Features, Value, ClassLabel
from sklearn.metrics import accuracy_score, f1_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split as sklearn_train_test_split
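# Do not import torch's Dataset here: it would shadow datasets.Dataset, which is needed for Dataset.from_pandas below
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler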
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification,
    RobertaForSequenceClassification,
    Trainer, 
    TrainingArguments,
    set_seed,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
    default_data_collator,
    get_scheduler
)
from torch.optim import AdamW  # transformers' own AdamW is deprecated in favor of the PyTorch implementation
from transformers.trainer_utils import get_last_checkpoint

# For memory optimization
import gc
from accelerate import Accelerator

# Import PEFT for LoRA
from peft import (
    get_peft_model, 
    LoraConfig, 
    TaskType, 
    PeftModel, 
    PeftConfig,
    prepare_model_for_kbit_training
)

# Set a seed for reproducibility
set_seed(42)

## Accelerator Detection and Configuration

Let's detect and configure the available accelerator (CPU, GPU, or TPU) with enhanced detection:

In [None]:
# Function to detect and configure accelerator with better error handling
def detect_and_configure_accelerator():
    """Detect and configure the available accelerator (CPU, GPU, or TPU) with enhanced detection."""
    try:
        # Initialize accelerator from HF accelerate library
        accelerator = Accelerator()
        # Newer accelerate releases report TPUs as "XLA" rather than "TPU"
        if accelerator.distributed_type in ("TPU", "XLA"):
            print("TPU detected! Configuring for TPU training...")
            device = accelerator.device
            use_tpu = True
            use_gpu = False
            use_mixed_precision = True
            return device, use_tpu, use_gpu, use_mixed_precision, accelerator
        
        # Check for GPU
        if torch.cuda.is_available():
            print(f"GPU detected! Using {torch.cuda.get_device_name(0)}")
            device = torch.device("cuda")
            use_tpu = False
            use_gpu = True
            use_mixed_precision = True  # Enable mixed precision by default for GPU
            print(f"GPU memory available: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
            
            # Clear GPU cache to free up memory
            torch.cuda.empty_cache()
            gc.collect()
        else:
            print("No GPU or TPU detected. Using CPU (this will be slow).")
            device = torch.device("cpu")
            use_tpu = False
            use_gpu = False
            use_mixed_precision = False  # Disable mixed precision for CPU
        
        return device, use_tpu, use_gpu, use_mixed_precision, accelerator
    except Exception as e:
        print(f"Error detecting accelerator: {e}")
        print("Defaulting to CPU.")
        return torch.device("cpu"), False, False, False, None

# Detect and configure accelerator
device, use_tpu, use_gpu, use_mixed_precision, accelerator = detect_and_configure_accelerator()

## Dataset and Model Configuration

Let's define the model and dataset we'll be using with optimized batch sizes and memory settings:

In [None]:
# Set model and dataset IDs
MODEL_ID = "microsoft/codebert-base"
DATASET_ID = "mvasiliniuc/iva-swift-codeint"

# Configure batch sizes based on available hardware with optimized values
# With LoRA, we can use larger batch sizes due to reduced memory requirements
if use_tpu:
    TRAIN_BATCH_SIZE = 128  # Larger batch size for TPU with LoRA
    EVAL_BATCH_SIZE = 256
    GRADIENT_ACCUMULATION_STEPS = 1
    NUM_WORKERS = 8
elif use_gpu:
    # Dynamically adjust batch size based on available GPU memory
    gpu_memory_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if gpu_memory_gb > 16:  # High-end GPU
        TRAIN_BATCH_SIZE = 64  # Increased due to LoRA efficiency
        EVAL_BATCH_SIZE = 128
        GRADIENT_ACCUMULATION_STEPS = 1
    elif gpu_memory_gb > 8:  # Mid-range GPU
        TRAIN_BATCH_SIZE = 32  # Increased due to LoRA efficiency
        EVAL_BATCH_SIZE = 64
        GRADIENT_ACCUMULATION_STEPS = 2
    else:  # Low-end GPU
        TRAIN_BATCH_SIZE = 16  # Increased due to LoRA efficiency
        EVAL_BATCH_SIZE = 32
        GRADIENT_ACCUMULATION_STEPS = 4
    NUM_WORKERS = min(4, os.cpu_count() or 1)
else:
    TRAIN_BATCH_SIZE = 8   # Increased for CPU with LoRA
    EVAL_BATCH_SIZE = 16
    GRADIENT_ACCUMULATION_STEPS = 8
    NUM_WORKERS = 0  # No multiprocessing on CPU

# Set maximum sequence length for tokenization
MAX_SEQ_LENGTH = 512  # CodeBERT's maximum sequence length

# Configure LoRA parameters
LORA_R = 8        # Rank of the update matrices
LORA_ALPHA = 16   # Scaling factor for the update matrices
LORA_DROPOUT = 0.1  # Dropout probability for LoRA layers

print(f"Using device: {device}")
print(f"Training batch size: {TRAIN_BATCH_SIZE}")
print(f"Evaluation batch size: {EVAL_BATCH_SIZE}")
print(f"Gradient accumulation steps: {GRADIENT_ACCUMULATION_STEPS}")
print(f"Number of dataloader workers: {NUM_WORKERS}")
print(f"Using mixed precision: {use_mixed_precision}")
print(f"LoRA configuration: r={LORA_R}, alpha={LORA_ALPHA}, dropout={LORA_DROPOUT}")
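
The batch sizes and gradient accumulation steps above are paired so that smaller devices still take optimizer steps over the same number of examples. A minimal sketch of that arithmetic, using the values from the configuration cell (the tier names are labels chosen here for illustration):

```python
def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of examples that contribute to a single optimizer step."""
    return per_device * accumulation_steps * num_devices

# (per-device train batch size, gradient accumulation steps) from the cell above
settings = {
    "gpu_high": (64, 1),
    "gpu_mid": (32, 2),
    "gpu_low": (16, 4),
    "cpu": (8, 8),
}

effective = {name: effective_batch_size(b, a) for name, (b, a) in settings.items()}
for name, size in effective.items():
    print(f"{name}: {size} examples per optimizer step")
```

Every single-device tier works out to 64 examples per step. The TPU setting (128 per device, no accumulation) is larger, and is multiplied again by the number of cores in use.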

## Data Loading

Now let's load the Swift code dataset and examine its structure with proper error handling and caching:

In [None]:
# Function to load dataset with retry logic and caching
def load_dataset_with_retry(dataset_id, max_retries=3, retry_delay=5):
    """Load a dataset with retry logic and caching."""
    cache_dir = os.path.join(os.getcwd(), "dataset_cache")
    os.makedirs(cache_dir, exist_ok=True)
    
    for attempt in range(max_retries):
        try:
            print(f"Loading dataset (attempt {attempt+1}/{max_retries})...")
            data = load_dataset(dataset_id, trust_remote_code=True, cache_dir=cache_dir)
            print(f"Dataset loaded successfully with {len(data['train'])} examples")
            return data
        except Exception as e:
            print(f"Error loading dataset (attempt {attempt+1}/{max_retries}): {e}")
            if attempt < max_retries - 1:
                print(f"Retrying in {retry_delay} seconds...")
                time.sleep(retry_delay)
            else:
                print("Maximum retries reached. Could not load dataset.")
                raise

# Load the dataset with retry logic
try:
    data = load_dataset_with_retry(DATASET_ID)
    print("Dataset structure:")
    print(data)
except Exception as e:
    print(f"Fatal error loading dataset: {e}")
    raise
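
The loader above waits a fixed delay between attempts. A common variation is exponential backoff, where the delay doubles after each failure. The sketch below is a generic helper, not part of the `datasets` API; `retry_with_backoff` and its parameters are hypothetical names, and the `sleep` argument is injectable so the logic can be exercised without actually waiting.

```python
import time

def retry_with_backoff(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds, doubling the delay after each failure."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Simulated flaky operation that fails twice and then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

delays = []  # record the requested delays instead of sleeping
result = retry_with_backoff(flaky, max_retries=3, base_delay=1.0, sleep=delays.append)
print(result, delays)
```

In the notebook, `fn` could be a lambda wrapping the `load_dataset` call.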

In [None]:
# Verify dataset structure and column names
def verify_dataset_structure(dataset):
    """Verify that the dataset has the expected structure and columns."""
    required_columns = ['repo_name', 'path', 'content']
    
    if 'train' not in dataset:
        print("WARNING: Dataset does not have a 'train' split.")
        return False
    
    missing_columns = [col for col in required_columns if col not in dataset['train'].column_names]
    if missing_columns:
        print(f"WARNING: Dataset is missing required columns: {missing_columns}")
        return False
    
    print("Dataset structure verification passed.")
    return True

# Verify dataset structure
dataset_valid = verify_dataset_structure(data)
if not dataset_valid:
    print("Dataset structure is not as expected. Proceeding with caution.")

In [None]:
# Let's take a look at an example from the dataset
try:
    if 'train' in data:
        example = data['train'][0]
    else:
        example = data[list(data.keys())[0]][0]
        
    print("Example features:")
    for key, value in example.items():
        if isinstance(value, str) and len(value) > 100:
            print(f"{key}: {value[:100]}...")
        else:
            print(f"{key}: {value}")
except Exception as e:
    print(f"Error exploring dataset example: {e}")

## Loading the CodeBERT Tokenizer

Now, let's load the CodeBERT tokenizer, which has been specially trained to handle code tokens:

In [None]:
# Load the CodeBERT tokenizer with error handling and caching
try:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)  # Use fast tokenizer for better performance
    print(f"Tokenizer vocabulary size: {len(tokenizer)}")
    print(f"Tokenizer type: {tokenizer.__class__.__name__}")
except Exception as e:
    print(f"Error loading tokenizer: {e}")
    raise

## Data Preparation

Since we're dealing with a code understanding task, we need to prepare our data appropriately. The dataset contains Swift code files, so we'll need to create labeled data for our task.

For this demonstration, we'll create a binary classification task that determines whether the code is a Package.swift file (which is used for Swift package management) or not. This is just an example task - in a real application, you might have more complex classification targets.

In [None]:
# Create a classification dataset based on whether the file is a Package.swift file
def add_labels(example):
    # Label 1 if it's a Package.swift file, 0 otherwise
    example['label'] = 1 if 'Package.swift' in example['path'] else 0
    return example

try:
    # Apply the labeling function
    labeled_data = data['train'].map(add_labels)
    
    # Check the distribution of labels using collections.Counter
    import collections
    all_labels = labeled_data['label']
    label_counter = collections.Counter(all_labels)
    print("Label distribution:")
    for label, count in label_counter.items():
        print(f"Label {label}: {count} examples ({count/len(labeled_data)*100:.2f}%)")
        
    # Check for label imbalance
    min_label_count = min(label_counter.values())
    max_label_count = max(label_counter.values())
    imbalance_ratio = max_label_count / min_label_count if min_label_count > 0 else float('inf')
    
    if imbalance_ratio > 10:
        print(f"WARNING: Severe label imbalance detected (ratio: {imbalance_ratio:.2f}). Consider using class weights or resampling.")
    elif imbalance_ratio > 3:
        print(f"WARNING: Moderate label imbalance detected (ratio: {imbalance_ratio:.2f}). Consider using class weights.")
except Exception as e:
    print(f"Error preparing dataset: {e}")
    raise
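
Note that the substring test in `add_labels` also matches paths such as `Sources/MyPackage.swift`. If only files named exactly `Package.swift` should be labeled positive, a stricter check on the file name is one option. This is a sketch of that alternative, assuming the dataset's `path` values use forward slashes; `is_package_manifest` is a hypothetical helper name.

```python
import posixpath

def is_package_manifest(path: str) -> bool:
    """True only when the final path component is exactly 'Package.swift'."""
    return posixpath.basename(path) == "Package.swift"

examples = [
    "Package.swift",
    "Example/Package.swift",
    "Sources/MyPackage.swift",   # matched by the substring test, rejected here
    "Sources/App/main.swift",
]
strict_labels = [int(is_package_manifest(p)) for p in examples]
substring_labels = [int("Package.swift" in p) for p in examples]
print(strict_labels, substring_labels)
```

Which behavior is wanted depends on the task definition; the rest of the notebook works with either labeling.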

## Dataset Splitting

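Now let's split our data into training and validation sets. The `datasets` library's own `train_test_split(stratify_by_column=...)` requires the column to be a `ClassLabel` feature, and our `label` column is a plain integer, so we use scikit-learn's `train_test_split` for the stratified split instead: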

In [None]:
try:
    # Convert to pandas DataFrame for easier manipulation
    df = labeled_data.to_pandas()
    
    # Split using scikit-learn's train_test_split with stratification
    train_df, val_df = sklearn_train_test_split(
        df, 
        test_size=0.1, 
        random_state=42, 
        stratify=df['label']
    )
    
    # Convert back to HuggingFace datasets
    train_data = Dataset.from_pandas(train_df)
    val_data = Dataset.from_pandas(val_df)
    
    # Verify label distribution after split
    train_label_counter = collections.Counter(train_data['label'])
    val_label_counter = collections.Counter(val_data['label'])
    
    print(f"Training set size: {len(train_data)}")
    print(f"Training label distribution: {dict(train_label_counter)}")
    print(f"Validation set size: {len(val_data)}")
    print(f"Validation label distribution: {dict(val_label_counter)}")
    
    # Check if dataset is large (might cause memory issues)
    if len(train_data) > 10000:
        print("\nWARNING: You are training on a large dataset.")
        print("This may require significant memory, especially when using a GPU.")
        print("Consider reducing batch size or using gradient accumulation if you encounter memory issues.")
except Exception as e:
    print(f"Error splitting dataset: {e}")
    raise
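
To illustrate what stratification does, here is a simplified pure-Python sketch: indices are grouped by label and the same fraction is sampled from each group. It is not scikit-learn's implementation, and `stratified_split` is a hypothetical helper used only for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.1, seed=42):
    """Split indices so each label keeps roughly the same share in both parts."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_val = max(1, round(len(indices) * test_size))  # at least one per class
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

labels = [0] * 90 + [1] * 10          # toy data with a 9:1 imbalance
train_idx, val_idx = stratified_split(labels)
val_positives = sum(labels[i] for i in val_idx)
print(len(train_idx), len(val_idx), val_positives)
```

With a purely random split, a rare class can end up missing from the validation set entirely; stratification guarantees it is represented.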

## Optimized Tokenization

Now we need to tokenize our code samples. We'll use the CodeBERT tokenizer to convert the Swift code into token IDs that the model can understand. This implementation is optimized for speed and memory efficiency:

In [None]:
def tokenize_function(examples):
    """Tokenize the Swift code samples with optimized settings.
    
    Args:
        examples: Batch of examples from the dataset
        
    Returns:
        Tokenized examples
    """
    # Tokenize the code content with optimized settings
    # - No return_tensors="pt" for memory efficiency
    # - padding=False for dynamic padding later with DataCollator
    # - truncation=True to handle long sequences
    return tokenizer(
        examples["content"],
        padding=False,  # We'll use dynamic padding with DataCollator
        truncation=True,
        max_length=MAX_SEQ_LENGTH
    )

In [None]:
try:
    # Process the data with progress bars and optimized settings
    tokenized_train_data = train_data.map(
        tokenize_function,
        batched=True,
        batch_size=1000,  # Process in larger batches for speed
        remove_columns=[col for col in train_data.column_names if col != 'label'],
        desc="Tokenizing training data",  # This adds a progress bar
        num_proc=NUM_WORKERS if NUM_WORKERS > 0 else None  # Use multiprocessing if available
    )
    
    tokenized_val_data = val_data.map(
        tokenize_function,
        batched=True,
        batch_size=1000,  # Process in larger batches for speed
        remove_columns=[col for col in val_data.column_names if col != 'label'],
        desc="Tokenizing validation data",  # This adds a progress bar
        num_proc=NUM_WORKERS if NUM_WORKERS > 0 else None  # Use multiprocessing if available
    )
    
    # Set format for pytorch
    tokenized_train_data = tokenized_train_data.with_format("torch")
    tokenized_val_data = tokenized_val_data.with_format("torch")
    
    print("Training data after tokenization:")
    print(tokenized_train_data)
    print("\nValidation data after tokenization:")
    print(tokenized_val_data)
except Exception as e:
    print(f"Error tokenizing data: {e}")
    raise
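
Tokenizing with `padding=False` defers padding to the data collator, which pads each batch only to the length of its longest sequence. The sketch below shows the difference in pure Python with made-up token IDs; the pad ID of 1 is an assumption for illustration, and `pad_batch` is a hypothetical helper rather than a library function.

```python
def pad_batch(sequences, pad_id=1, pad_to=None):
    """Pad every sequence to pad_to, or to the longest sequence in the batch."""
    target = pad_to if pad_to is not None else max(len(s) for s in sequences)
    return [s + [pad_id] * (target - len(s)) for s in sequences]

batch = [[0, 11, 2], [0, 11, 12, 13, 2], [0, 2]]   # made-up token IDs

dynamic = pad_batch(batch)               # pad to the longest sequence in this batch
fixed = pad_batch(batch, pad_to=512)     # pad everything to the maximum length

dynamic_tokens = sum(len(s) for s in dynamic)
fixed_tokens = sum(len(s) for s in fixed)
print(f"dynamic padding: {dynamic_tokens} tokens, fixed padding: {fixed_tokens} tokens")
```

For batches of short files the model processes far fewer padding tokens, which is where the speed and memory benefit comes from.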

## Model Preparation with LoRA

Now that our data is ready, let's load the CodeBERT model and configure it for sequence classification with LoRA for efficient fine-tuning:

In [None]:
try:
    # Load the CodeBERT model for sequence classification (2 classes)
    base_model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID, 
        num_labels=2,
        # Enable gradient checkpointing for memory efficiency
        gradient_checkpointing=True if use_gpu or use_tpu else False
    )
    
    print(f"Base model type: {base_model.__class__.__name__}")
    
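    # Calculate class weights for the imbalanced dataset.
    # Note: the default Trainer loss does not use these weights; they only take effect
    # if passed to a custom loss (e.g. a Trainer subclass that overrides compute_loss).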
    label_counts = collections.Counter(train_data['label'])
    total_samples = len(train_data)
    class_weights = {label: total_samples / (len(label_counts) * count) for label, count in label_counts.items()}
    print(f"Class weights for handling imbalance: {class_weights}")
    
    # Convert class weights to tensor for loss function
    class_weights_tensor = torch.tensor(
        [class_weights[i] for i in range(len(class_weights))],
        dtype=torch.float
    ).to(device)
    
    # Configure LoRA for efficient fine-tuning
    # For RoBERTa/CodeBERT, we target the attention layers (query and value)
    peft_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,  # Sequence classification task
        r=LORA_R,                    # Rank of the update matrices
        lora_alpha=LORA_ALPHA,       # Scaling factor
        lora_dropout=LORA_DROPOUT,   # Dropout probability
        # Target the attention layers in CodeBERT (which is RoBERTa-based)
        target_modules=["query", "value"],
        bias="none",                 # Don't train bias parameters
        modules_to_save=["classifier"],  # Save the classifier head
    )
    
    # Create the LoRA model
    model = get_peft_model(base_model, peft_config)
    
    # Print trainable parameters to verify LoRA setup
    model.print_trainable_parameters()
    
    # Move model to the appropriate device
    if not use_tpu:  # For TPU, the Trainer will handle device placement
        model.to(device)
        
except Exception as e:
    print(f"Error loading model: {e}")
    raise
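
The class-weight formula used above is `total_samples / (num_classes * class_count)`, which gives rare classes proportionally larger weights. A small worked example with made-up label counts (`balanced_class_weights` is a hypothetical helper name):

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by total / (num_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * count) for label, count in counts.items()}

labels = [0] * 95 + [1] * 5            # toy data: 5% positives
weights = balanced_class_weights(labels)
weighted_total = sum(weights[label] for label in labels)
print(weights, weighted_total)
```

Here the rare class gets a weight of 10.0 and the common class about 0.53, so each class contributes equally to the weighted total.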

## Training Setup

Now let's define our training arguments and evaluation metrics with optimized settings for LoRA:

In [None]:
# Function to compute metrics during evaluation
def compute_metrics(eval_preds):
    """Compute evaluation metrics."""
    try:
        logits, labels = eval_preds
        predictions = np.argmax(logits, axis=-1)
        
        # Calculate multiple metrics
        accuracy = accuracy_score(labels, predictions)
        precision, recall, f1, _ = precision_recall_fscore_support(
            labels, predictions, average='weighted'
        )
        
        # Calculate per-class metrics for better understanding
        per_class_precision, per_class_recall, per_class_f1, _ = precision_recall_fscore_support(
            labels, predictions, average=None
        )
        
        result = {
            'accuracy': accuracy,
            'f1': f1,
            'precision': precision,
            'recall': recall
        }
        
        # Add per-class metrics
        for i, (p, r, f) in enumerate(zip(per_class_precision, per_class_recall, per_class_f1)):
            result[f'precision_class_{i}'] = p
            result[f'recall_class_{i}'] = r
            result[f'f1_class_{i}'] = f
            
        return result
    except Exception as e:
        print(f"Error computing metrics: {e}")
        return {'error': str(e)}
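
Per-class metrics matter here because of the label imbalance. The following pure-Python sketch computes the positive-class metrics by hand on a toy example where the model always predicts the majority class; `binary_metrics` is a hypothetical helper, not the scikit-learn function used above.

```python
def binary_metrics(labels, predictions, positive=1):
    """Accuracy plus precision, recall, and F1 for the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    accuracy = sum(1 for y, p in pairs if y == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

labels = [0] * 9 + [1]          # one positive in ten examples
predictions = [0] * 10          # a model that never predicts the positive class
metrics = binary_metrics(labels, predictions)
print(metrics)
```

Accuracy is 0.9 even though the positive class is never found, which is why the per-class F1 scores are worth tracking alongside the weighted averages.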

In [None]:
# Create a data collator for dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Define training arguments with optimized settings for LoRA
try:
    # With LoRA, we can use a higher learning rate
    training_args = TrainingArguments(
        output_dir="./results/codebert-swift-lora",
        evaluation_strategy="steps",  # named eval_strategy in newer transformers releases
        eval_steps=100,
        save_strategy="steps",
        save_steps=100,
        save_total_limit=3,  # Keep at most 3 checkpoints on disk (older ones are deleted)
        learning_rate=1e-4,  # Higher learning rate for LoRA
        per_device_train_batch_size=TRAIN_BATCH_SIZE,
        per_device_eval_batch_size=EVAL_BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        num_train_epochs=3,
        weight_decay=0.01,
        logging_dir="./logs",
        logging_steps=50,
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        push_to_hub=False,
        # TPU-specific configurations
        tpu_num_cores=8 if use_tpu else None,  # 8 cores for TPU v2/v3
        dataloader_drop_last=True if use_tpu else False,  # Important for TPU
        # Memory and performance optimizations
        fp16=use_mixed_precision and use_gpu,  # fp16 mixed precision targets CUDA GPUs; on TPU, bf16 is the usual choice
        dataloader_num_workers=NUM_WORKERS,
        # Gradient clipping to prevent exploding gradients
        max_grad_norm=1.0,
        # Warmup steps for learning rate scheduler
        warmup_ratio=0.1,  # Warm up over 10% of training steps
        # Reporting
        report_to=["tensorboard"],
        # Optimizer settings
        optim="adamw_torch",  # Use PyTorch's AdamW implementation
        # Trade compute for memory: recompute activations in the backward pass instead of storing them
        gradient_checkpointing=True if (use_gpu or use_tpu) else False,
        # Avoid unnecessary memory usage
        remove_unused_columns=True,
        # Keep the Trainer's tqdm progress bars enabled
        disable_tqdm=False,
    )
    
    print("Training arguments configured successfully.")
except Exception as e:
    print(f"Error configuring training arguments: {e}")
    raise
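
It helps to know roughly how many optimizer steps these settings produce, since `eval_steps`, `save_steps`, and the warmup are all expressed in steps. The sketch below estimates this for a hypothetical training set of 10,000 examples on a single device; the Trainer's exact rounding may differ between versions, so treat the numbers as approximate.

```python
import math

def schedule_summary(num_examples, per_device_batch, accumulation_steps, epochs, warmup_ratio):
    """Approximate optimizer-step counts for a single-device training run."""
    batches_per_epoch = math.ceil(num_examples / per_device_batch)
    updates_per_epoch = max(1, batches_per_epoch // accumulation_steps)
    total_updates = updates_per_epoch * epochs
    warmup_updates = math.ceil(total_updates * warmup_ratio)
    return {
        "updates_per_epoch": updates_per_epoch,
        "total_updates": total_updates,
        "warmup_updates": warmup_updates,
    }

# Hypothetical dataset size; batch settings match the mid-range GPU tier
summary = schedule_summary(
    num_examples=10_000,
    per_device_batch=32,
    accumulation_steps=2,
    epochs=3,
    warmup_ratio=0.1,
)
print(summary)
```

With about 468 total steps and `eval_steps=100`, such a run would be evaluated only four times, so smaller datasets may call for a smaller `eval_steps`.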

In [None]:
# Create the Trainer with data collator and callbacks
try:
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train_data,
        eval_dataset=tokenized_val_data,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
        data_collator=data_collator,  # Added data collator for dynamic padding
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # Added early stopping
    )
    
    print("Trainer initialized successfully.")
except Exception as e:
    print(f"Error creating trainer: {e}")
    raise
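
With `early_stopping_patience=3` and an evaluation every 100 steps, training stops once the tracked metric has failed to improve for three evaluations in a row. The sketch below is a simplified model of that patience logic, not the callback's actual implementation; `early_stopping_step` is a hypothetical helper.

```python
def early_stopping_step(history, patience=3, threshold=0.0):
    """Return the index of the evaluation that triggers a stop, or None."""
    best = None
    evals_without_improvement = 0
    for i, value in enumerate(history):
        if best is None or value > best + threshold:
            best = value
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return i
    return None

f1_history = [0.80, 0.85, 0.84, 0.85, 0.83, 0.86]   # made-up F1 values per evaluation
stop_at = early_stopping_step(f1_history, patience=3)
print(stop_at)
```

In this made-up history, training stops at the fifth evaluation (index 4) and never reaches the later value of 0.86, which shows the trade-off a small patience makes.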

## Checkpoint Recovery

Let's add improved checkpoint recovery logic to resume training if it was interrupted:

In [None]:
# Check for existing checkpoints using HuggingFace's built-in function
try:
    # Create output directory if it doesn't exist
    os.makedirs(training_args.output_dir, exist_ok=True)
    
    # Use HuggingFace's get_last_checkpoint function
    latest_checkpoint = get_last_checkpoint(training_args.output_dir)
    
    if latest_checkpoint:
        print(f"Found existing checkpoint at {latest_checkpoint}. Training will resume from this point.")
    else:
        print("No existing checkpoint found. Training will start from scratch.")
except Exception as e:
    print(f"Error checking for checkpoints: {e}")
    latest_checkpoint = None
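
Checkpoints are written to subdirectories named `checkpoint-<step>`, and resuming means picking the one with the highest step number. The sketch below shows that selection in plain Python as an illustration of the idea; it is not the library's implementation, and `find_latest_checkpoint` is a hypothetical helper.

```python
import os
import re
import tempfile

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")

def find_latest_checkpoint(output_dir):
    """Return the checkpoint directory with the highest step number, or None."""
    candidates = []
    for name in os.listdir(output_dir):
        match = CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    return os.path.join(output_dir, max(candidates)[1])

# Demonstrate on a temporary directory with fake checkpoint folders
with tempfile.TemporaryDirectory() as tmp:
    empty_result = find_latest_checkpoint(tmp)
    for step in (100, 200, 1000):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    latest = os.path.basename(find_latest_checkpoint(tmp))

print(empty_result, latest)
```

The step numbers are compared as integers because a plain string sort would rank `checkpoint-200` above `checkpoint-1000`.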

## Training the Model with LoRA

Now let's train our CodeBERT model for Swift code classification with LoRA for efficient fine-tuning:

In [None]:
# Start training with checkpoint recovery and memory optimization
try:
    # Clear memory before training
    if use_gpu:
        torch.cuda.empty_cache()
    gc.collect()
    
    print("Starting model training with LoRA...")
    train_result = trainer.train(resume_from_checkpoint=latest_checkpoint)
    print(f"Training completed. Metrics: {train_result.metrics}")
    
    # Save the final model
    trainer.save_model()
    print("Final model saved.")
    
    # Save training metrics
    trainer.log_metrics("train", train_result.metrics)
    trainer.save_metrics("train", train_result.metrics)
    trainer.save_state()
except Exception as e:
    print(f"Error during training: {e}")
    raise

## Model Evaluation

Let's evaluate our trained model on the validation set:

In [None]:
# Evaluate the model
try:
    print("Evaluating model on validation set...")
    eval_results = trainer.evaluate()
    print(f"Evaluation results: {eval_results}")
    
    # Save evaluation metrics
    trainer.log_metrics("eval", eval_results)
    trainer.save_metrics("eval", eval_results)
except Exception as e:
    print(f"Error during evaluation: {e}")
    raise

## Model Export

Now let's export our model for deployment. We'll merge the LoRA weights with the base model for easier deployment:

In [None]:
# Export the model to a specific directory
try:
    export_dir = "./exported-model-lora"
    os.makedirs(export_dir, exist_ok=True)
    
    # Save the LoRA adapter first: merge_and_unload() modifies the wrapped model in place,
    # so saving the adapter afterwards may not capture the LoRA weights
    lora_dir = "./lora-adapter"
    os.makedirs(lora_dir, exist_ok=True)
    model.save_pretrained(lora_dir)
    print(f"LoRA adapter saved to {lora_dir}")
    
    # Merge LoRA weights with the base model
    print("Merging LoRA weights with base model...")
    merged_model = model.merge_and_unload()
    
    # Save the merged model
    merged_model.save_pretrained(export_dir)
    tokenizer.save_pretrained(export_dir)
    
    # Save model configuration and metadata
    model_info = {
        "model_name": "CodeBERT-Swift-LoRA",
        "base_model": MODEL_ID,
        "task": "binary_classification",
        "labels": ["Not Package.swift", "Package.swift"],
        "metrics": eval_results,
        "lora_config": {
            "rank": LORA_R,
            "alpha": LORA_ALPHA,
            "dropout": LORA_DROPOUT,
            "target_modules": ["query", "value"]
        },
        "training_params": {
            "batch_size": TRAIN_BATCH_SIZE,
            "learning_rate": training_args.learning_rate,
            "epochs": training_args.num_train_epochs,
            "weight_decay": training_args.weight_decay,
            "training_samples": len(tokenized_train_data),
            "validation_samples": len(tokenized_val_data)
        }
    }
    
    with open(os.path.join(export_dir, "model_info.json"), "w") as f:
        json.dump(model_info, f, indent=2)
        
    print(f"Model exported to {export_dir}")
except Exception as e:
    print(f"Error exporting model: {e}")
    raise
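
Merging works because the LoRA update is itself a matrix: the adapted layer computes `W x + (alpha / r) * B (A x)`, which equals `(W + (alpha / r) * B A) x`. The toy example below checks that identity with tiny hand-written matrices in pure Python; the numbers are made up and the helpers are for illustration only.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_scaled(a, b, scale):
    """Return a + scale * b element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight (2 x 2)
B = [[0.5], [-1.0]]            # LoRA B matrix (2 x r), with r = 1
A = [[2.0, 1.0]]               # LoRA A matrix (r x 2)
alpha, r = 2, 1
scale = alpha / r
x = [[1.0], [1.0]]             # input as a column vector

# Adapter form: base output plus the scaled low-rank path
y_adapter = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Merged form: fold the update into the weight once, then do a single matmul
W_merged = add_scaled(W, matmul(B, A), scale)
y_merged = matmul(W_merged, x)

print(y_adapter, y_merged)
```

Both forms give the same output, so the merged model needs no extra computation or PEFT dependency at inference time.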

## Upload to Dropbox (Optional)

If you want to upload the model to Dropbox for easy access, you can use the following code:

In [None]:
# Function to upload a file to Dropbox
def upload_to_dropbox(file_path, dropbox_path, access_token):
    """Upload a file to Dropbox.
    
    Args:
        file_path: Path to the file to upload
        dropbox_path: Path in Dropbox where the file should be uploaded
        access_token: Dropbox access token
        
    Returns:
        Response from Dropbox API
    """
    try:
        import dropbox
        dbx = dropbox.Dropbox(access_token)
        
        with open(file_path, "rb") as f:
            file_size = os.path.getsize(file_path)
            chunk_size = 4 * 1024 * 1024  # 4MB chunks
            
            if file_size <= chunk_size:
                # Small file, upload in one go
                return dbx.files_upload(f.read(), dropbox_path, mode=dropbox.files.WriteMode.overwrite)
            else:
                # Large file, use chunked upload
                upload_session_start_result = dbx.files_upload_session_start(f.read(chunk_size))
                cursor = dropbox.files.UploadSessionCursor(
                    session_id=upload_session_start_result.session_id,
                    offset=f.tell()
                )
                commit = dropbox.files.CommitInfo(path=dropbox_path, mode=dropbox.files.WriteMode.overwrite)
                
                while f.tell() < file_size:
                    if (file_size - f.tell()) <= chunk_size:
                        # Last chunk
                        return dbx.files_upload_session_finish(f.read(chunk_size), cursor, commit)
                    else:
                        # Intermediate chunk
                        dbx.files_upload_session_append_v2(f.read(chunk_size), cursor)
                        cursor.offset = f.tell()
    except Exception as e:
        print(f"Error uploading to Dropbox: {e}")
        raise

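# To use this function, uncomment the lines below and provide your Dropbox access token.
# The call expects ./exported-model-lora.zip to exist; the export step above only writes
# a directory, so the directory must be archived first.
# DROPBOX_ACCESS_TOKEN = "your_access_token_here"
# upload_to_dropbox("./exported-model-lora.zip", "/CodeBERT-Swift-LoRA/model.zip", DROPBOX_ACCESS_TOKEN)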
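
The upload call refers to a `.zip` file, while the export step writes a directory. One way to create the archive is `shutil.make_archive` from the standard library. The sketch below runs against a temporary directory with a placeholder file so it is safe to execute anywhere; `zip_directory` is a hypothetical wrapper.

```python
import os
import shutil
import tempfile
import zipfile

def zip_directory(source_dir, archive_base):
    """Create <archive_base>.zip from the contents of source_dir; return its path."""
    return shutil.make_archive(archive_base, "zip", root_dir=source_dir)

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the real export directory
    export_dir = os.path.join(tmp, "exported-model-lora")
    os.makedirs(export_dir)
    with open(os.path.join(export_dir, "model_info.json"), "w") as f:
        f.write("{}")

    archive_path = zip_directory(export_dir, os.path.join(tmp, "exported-model-lora"))
    archive_name = os.path.basename(archive_path)
    with zipfile.ZipFile(archive_path) as zf:
        archived_names = zf.namelist()

print(archive_name, archived_names)
```

In the notebook, `zip_directory("./exported-model-lora", "./exported-model-lora")` would produce the file that the upload call expects.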

## Inference Example

Let's create a simple inference example to demonstrate how to use the trained model:

In [None]:
def predict_with_model(code_text, model, tokenizer, device):
    """Make a prediction with the trained model.
    
    Args:
        code_text: Swift code text to classify
        model: Trained model
        tokenizer: Tokenizer
        device: Device to run inference on
        
    Returns:
        Prediction label and confidence score
    """
    # Tokenize the input
    inputs = tokenizer(
        code_text,
        padding=True,
        truncation=True,
        max_length=MAX_SEQ_LENGTH,
        return_tensors="pt"
    ).to(device)
    
    # Make prediction (eval mode disables dropout so outputs are deterministic)
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probabilities = torch.nn.functional.softmax(logits, dim=-1)
    
    # Get prediction and confidence
    predicted_class = torch.argmax(probabilities, dim=-1).item()
    confidence = probabilities[0, predicted_class].item()
    
    return predicted_class, confidence

# Example Swift code (raw string, so Swift's \(...) interpolation is not parsed as a Python escape)
example_code = r"""
import Foundation

struct Person {
    let name: String
    let age: Int
    
    func greet() -> String {
        return "Hello, my name is \(name) and I am \(age) years old."
    }
}

let john = Person(name: "John", age: 30)
print(john.greet())
"""

# Make a prediction
try:
    predicted_class, confidence = predict_with_model(example_code, merged_model, tokenizer, device)
    print(f"Prediction: {'Package.swift' if predicted_class == 1 else 'Not Package.swift'}")
    print(f"Confidence: {confidence:.4f}")
except Exception as e:
    print(f"Error making prediction: {e}")
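
The confidence score returned by `predict_with_model` is the softmax probability of the predicted class. A minimal pure-Python sketch of that computation, using made-up logits:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(v - highest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.0]                             # made-up scores for classes 0 and 1
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, round(probs[predicted], 4))
```

A logit gap of 2.0 corresponds to a confidence of about 0.88. Softmax confidence is not a calibrated probability, so treat it as a relative score rather than a guarantee.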

## Conclusion

In this notebook, we've fine-tuned the CodeBERT model on Swift code using LoRA for efficient training. The model can now be used for code understanding tasks related to Swift.

Key optimizations implemented:

1. **LoRA (Low-Rank Adaptation)** for efficient fine-tuning with significantly fewer parameters
2. **Mixed precision training** for faster computation
3. **Gradient checkpointing** to reduce memory usage
4. **Dynamic batch sizing** based on available hardware
5. **Efficient data loading** with multiprocessing
6. **Memory management** with garbage collection and cache clearing
7. **Optimized tokenization** with batched processing
8. **Proper stratification** using scikit-learn instead of datasets library
9. **Improved checkpoint handling** for reliable training resumption

The model is now ready for deployment in your applications!

### Benefits of LoRA

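This notebook does not benchmark LoRA against full fine-tuning, so the points below are the benefits LoRA is generally expected to provide rather than measured results:
- **Faster training**: Fewer trainable parameters mean less gradient and optimizer work per step
- **Lower memory usage**: Reduced memory footprint allows for larger batch sizes
- **Smaller model size**: The adapter is much smaller than a fully fine-tuned model
- **Comparable performance**: Results are often close to full fine-tuning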

This approach is ideal for efficiently adapting large pre-trained models to specific tasks with limited computational resources.