# CodeBERT for Swift Code Understanding

In this notebook, we fine-tune the [CodeBERT](https://github.com/microsoft/CodeBERT) model on the [Swift Code Intelligence dataset](https://huggingface.co/datasets/mvasiliniuc/iva-swift-codeint). CodeBERT is a model pre-trained by Microsoft Research on paired natural language and source code, much as BERT was pre-trained on natural language text. Its pre-training corpus (CodeSearchNet) covers Python, Java, JavaScript, PHP, Ruby, and Go but not Swift, so fine-tuning is the step that adapts it to Swift code.

We'll use the Swift code dataset to fine-tune the model on a simple binary classification task: detecting whether a file is a `Package.swift` manifest. After training, we'll upload the model to Dropbox for easy access and distribution.

## Overview

The process of fine-tuning CodeBERT involves:

1. **🔧 Setup**: Install necessary libraries and prepare our environment
2. **📥 Data Loading**: Load the Swift code dataset from Hugging Face
3. **🧹 Preprocessing**: Prepare the data for training by tokenizing the code samples
4. **🧠 Model Training**: Fine-tune CodeBERT on our prepared data
5. **📊 Evaluation**: Assess how well our model performs
6. **📤 Export & Upload**: Save the model and upload it to Dropbox

Let's start by installing the necessary libraries:

In [None]:
!pip install transformers datasets evaluate torch scikit-learn tqdm dropbox requests

In [None]:
import os
import json
import torch
import random
import numpy as np
from tqdm.auto import tqdm
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification,
    RobertaForSequenceClassification,
    Trainer, 
    TrainingArguments,
    set_seed
)

# Set a seed for reproducibility
set_seed(42)

## Dataset and Model Configuration

Let's define the model and dataset we'll be using:

In [None]:
# Set model and dataset IDs
MODEL_ID = "microsoft/codebert-base"
DATASET_ID = "mvasiliniuc/iva-swift-codeint"

# Set device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## Data Loading

Now let's load the Swift code dataset and examine its structure:

In [None]:
# Load the dataset 
data = load_dataset(DATASET_ID, trust_remote_code=True)
print("Dataset structure:")
print(data)

In [None]:
# Let's take a look at an example from the dataset
if 'train' in data:
    example = data['train'][0]
else:
    example = data[list(data.keys())[0]][0]
    
print("Example features:")
for key, value in example.items():
    if isinstance(value, str) and len(value) > 100:
        print(f"{key}: {value[:100]}...")
    else:
        print(f"{key}: {value}")

## Loading the CodeBERT Tokenizer

Now, let's load the CodeBERT tokenizer. CodeBERT reuses the RoBERTa byte-level BPE tokenizer, so it can encode arbitrary source text, including Swift, without out-of-vocabulary tokens:

In [None]:
# Load the CodeBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
print(f"Tokenizer vocabulary size: {len(tokenizer)}")
print(f"Tokenizer type: {tokenizer.__class__.__name__}")

## Data Preparation

Since we're dealing with a code understanding task, we need to prepare our data appropriately. The dataset contains Swift code files, so we'll need to create labeled data for our task.

For this demonstration, we'll create a binary classification task that determines whether the code is a Package.swift file (which is used for Swift package management) or not. This is just an example task - in a real application, you might have more complex classification targets.

In [None]:
# Create a classification dataset based on whether the file is a Package.swift file
def add_labels(example):
    # Label 1 if the file's base name is exactly Package.swift, 0 otherwise
    # (a substring test would also match names such as MyPackage.swift)
    example['label'] = 1 if os.path.basename(example['path']) == 'Package.swift' else 0
    return example

# Apply the labeling function
labeled_data = data['train'].map(add_labels)

# Check the distribution of labels using collections.Counter
import collections
all_labels = labeled_data['label']
label_counter = collections.Counter(all_labels)
print("Label distribution:")
for label, count in label_counter.items():
    print(f"Label {label}: {count} examples ({count/len(labeled_data)*100:.2f}%)")
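
Before mapping a labeling rule over the whole dataset, it helps to check it on a few paths. A substring test such as `'Package.swift' in path` also matches names like `MyPackage.swift`, so the sketch below compares the file's base name instead. It assumes paths use `/` separators; `label_for_path` and the sample paths are hypothetical and not taken from the dataset.

```python
import posixpath


def label_for_path(path: str) -> int:
    """Return 1 when the file's base name is exactly Package.swift, else 0."""
    return 1 if posixpath.basename(path) == "Package.swift" else 0


# Hypothetical repository paths, for illustration only
sample_paths = [
    "Package.swift",
    "Vendor/Lib/Package.swift",
    "Sources/App/MyPackage.swift",
    "Sources/App/main.swift",
]
for path in sample_paths:
    print(f"{path}: {label_for_path(path)}")
```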

Now let's split our data into training and validation sets:

In [None]:
# Split the dataset without stratification: stratify_by_column requires a
# ClassLabel column, and 'label' here is a plain integer column
train_test_split = labeled_data.train_test_split(test_size=0.1, seed=42)
train_data = train_test_split['train']
val_data = train_test_split['test']

# Verify label distribution after split
train_label_counter = collections.Counter(train_data['label'])
val_label_counter = collections.Counter(val_data['label'])

print(f"Training set size: {len(train_data)}")
print(f"Training label distribution: {dict(train_label_counter)}")
print(f"Validation set size: {len(val_data)}")
print(f"Validation label distribution: {dict(val_label_counter)}")
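
Because the split above is not stratified, a rare class can end up under-represented in the validation set. The following is a minimal pure-Python sketch of a stratified split over a list of labels; `stratified_split` and the toy label list are hypothetical, and the sketch illustrates the idea rather than the `datasets` implementation.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.1, seed=42):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Take roughly test_size of each class, and at least one example
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced labels: 5% positives
toy_labels = [1] * 50 + [0] * 950
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
print(sum(toy_labels[i] for i in test_idx), "positives in the test split")
```

With the `datasets` library, a comparable result should be achievable by first converting the column with `class_encode_column("label")` and then passing `stratify_by_column="label"` to `train_test_split`.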

## Tokenization

Now we need to tokenize our code samples. We'll use the CodeBERT tokenizer to convert the Swift code into token IDs that the model can understand:

In [None]:
def tokenize_function(examples):
    """Tokenize the Swift code samples.
    
    Args:
        examples: Batch of examples from the dataset
        
    Returns:
        Tokenized examples
    """
    # Tokenize the code content
    return tokenizer(
        examples["content"],
        padding="max_length",
        truncation=True,
        max_length=512,  # CodeBERT supports sequences up to 512 tokens
        return_tensors="pt"
    )
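
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a simplified sketch of what happens to one sequence of token IDs. `pad_or_truncate` and the sample IDs are hypothetical; the real tokenizer additionally keeps the special start and end tokens when it truncates, which this sketch does not model.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Clip token_ids to max_length, then right-pad and build an attention mask."""
    clipped = token_ids[:max_length]
    n_pad = max_length - len(clipped)
    return {
        "input_ids": clipped + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(clipped) + [0] * n_pad,
    }


# Hypothetical token IDs; pad_id=1 is an assumption for this example
short = pad_or_truncate([0, 713, 16, 2], max_length=8, pad_id=1)
long = pad_or_truncate(list(range(20)), max_length=8, pad_id=1)
print(short)
print(long)
```

Every sequence ends up with exactly `max_length` entries, which is why files much shorter than 512 tokens still cost the same compute as long ones under this padding strategy.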

In [None]:
# Process the data
tokenized_train_data = train_data.map(
    tokenize_function,
    batched=True,
    remove_columns=[col for col in train_data.column_names if col != 'label']
)

tokenized_val_data = val_data.map(
    tokenize_function,
    batched=True,
    remove_columns=[col for col in val_data.column_names if col != 'label']
)

print("Training data after tokenization:")
print(tokenized_train_data)
print("\nValidation data after tokenization:")
print(tokenized_val_data)

## Model Preparation

Now that our data is ready, let's load the CodeBERT model and configure it for sequence classification:

In [None]:
# Load the CodeBERT model for sequence classification (2 classes)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)
model.to(device)
print(f"Model type: {model.__class__.__name__}")

## Training Setup

Now let's define our training arguments and evaluation metrics:

In [None]:
# Function to compute metrics during evaluation
def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    
    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average='weighted')
    
    return {
        'accuracy': accuracy,
        'f1': f1
    }
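
On an imbalanced task, accuracy and weighted F1 can both look strong even when the rare class is never predicted. The pure-Python sketch below works through that case on hypothetical labels (95 negatives, 5 positives) with a classifier that always predicts 0; the helper names are illustrative and not part of scikit-learn.

```python
def f1_for_class(labels, predictions, positive):
    """F1 score treating `positive` as the class of interest."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(labels, predictions):
    """Per-class F1 averaged with weights equal to each class's support."""
    total = len(labels)
    return sum(
        f1_for_class(labels, predictions, c) * labels.count(c) / total
        for c in set(labels)
    )


# Hypothetical evaluation set and a majority-class predictor
toy_labels = [0] * 95 + [1] * 5
toy_preds = [0] * 100

toy_accuracy = sum(y == p for y, p in zip(toy_labels, toy_preds)) / len(toy_labels)
print(f"accuracy:    {toy_accuracy:.4f}")
print(f"weighted F1: {weighted_f1(toy_labels, toy_preds):.4f}")
print(f"F1 (class 1): {f1_for_class(toy_labels, toy_preds, 1):.4f}")
```

If Package.swift files turn out to be rare in the label distribution printed earlier, reporting the positive-class F1 alongside the weighted score may give a more honest picture.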

In [None]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./results/codebert-swift",
    evaluation_strategy="epoch",  # newer transformers releases rename this to eval_strategy
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    push_to_hub=False,
)
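
The batch size, epoch count, and `logging_steps` together determine how many optimizer steps run and how often progress is logged. A small arithmetic sketch, assuming a single device, no gradient accumulation, and a hypothetical training set of 10,000 examples (the real size is printed by the split cell):

```python
import math


def training_steps(num_examples, per_device_batch_size, num_epochs, num_devices=1):
    """Return (steps_per_epoch, total_steps) when the last partial batch is kept."""
    effective_batch = per_device_batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch, steps_per_epoch * num_epochs


# 10_000 examples is an assumed figure for illustration
steps_per_epoch, total_steps = training_steps(10_000, per_device_batch_size=8, num_epochs=3)
log_events = total_steps // 500  # logging_steps=500
print(steps_per_epoch, total_steps, log_events)
```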

In [None]:
# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_data,
    eval_dataset=tokenized_val_data,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,  # newer transformers releases prefer processing_class=tokenizer
)

## Training the Model

Now let's train our CodeBERT model for Swift code classification:

In [None]:
# Start training
print("Starting model training...")
trainer.train()

## Evaluating the Model

Let's evaluate our model on the validation dataset:

In [None]:
# Evaluate the model
eval_results = trainer.evaluate()
print(f"Evaluation results: {eval_results}")

## Testing the Model with Example Predictions

Let's test our model on some sample Swift code files:

In [None]:
# Get some test examples
test_examples = val_data.select(range(5))

# Tokenize them
tokenized_test_examples = tokenize_function({"content": test_examples["content"]})

# Move to device
for key, val in tokenized_test_examples.items():
    if isinstance(val, torch.Tensor):
        tokenized_test_examples[key] = val.to(device)

# Make predictions
with torch.no_grad():
    outputs = model(**{k: v for k, v in tokenized_test_examples.items() if k != "label"})
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_labels = torch.argmax(predictions, dim=-1).cpu().numpy()

# Print results
for i, (pred, true) in enumerate(zip(predicted_labels, test_examples["label"])):
    is_package_swift = "Yes" if pred == 1 else "No"
    true_is_package_swift = "Yes" if true == 1 else "No"
    print(f"File path: {test_examples['path'][i]}")
    print(f"Prediction: Is Package.swift? {is_package_swift} (Confidence: {predictions[i][pred].item():.4f})")
    print(f"True label: Is Package.swift? {true_is_package_swift}")
    print(f"First few lines: {test_examples['content'][i][:100]}...")
    print("---\n")
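
The confidence printed above is the softmax probability of the predicted class. The sketch below reproduces that computation in plain Python for a single pair of logits; the logit values are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the classes [not Package.swift, Package.swift]
toy_logits = [2.0, -1.0]
probs = softmax(toy_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(f"predicted class: {predicted}, confidence: {probs[predicted]:.4f}")
```

Note that a high softmax value reflects the margin between logits and is not a calibrated probability of being correct.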

## Saving the Model

Now let's save the model and tokenizer for later use:

In [None]:
# Create a directory for the model
model_save_dir = "./codebert-swift-model"
os.makedirs(model_save_dir, exist_ok=True)

# Save the model
model.save_pretrained(model_save_dir)
tokenizer.save_pretrained(model_save_dir)

print(f"Model and tokenizer saved to {model_save_dir}")

## Uploading to Dropbox

Now let's upload our trained model to Dropbox for easy access and distribution. We'll use the same approach as in the Groq downloader notebook:

In [None]:
import zipfile
import dropbox
from dropbox.files import WriteMode
from dropbox.exceptions import ApiError, AuthError

# First, let's zip the model directory
def zip_directory(directory, output_path):
    """Compress a directory into a zip file."""
    with zipfile.ZipFile(output_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, _, files in os.walk(directory):
            for file in files:
                file_path = os.path.join(root, file)
                zipf.write(file_path, os.path.relpath(file_path, os.path.dirname(directory)))
    # Report success only after the archive has been closed and fully written
    print(f"Created zip file: {output_path}")

# Zip the model directory
model_zip_path = "./codebert-swift-model.zip"
zip_directory(model_save_dir, model_zip_path)

# Check the size of the zip file
zip_size_mb = os.path.getsize(model_zip_path) / (1024 * 1024)
print(f"Zip file size: {zip_size_mb:.2f} MB")
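
It can be worth confirming how paths are stored inside the archive before uploading it. The sketch below applies the same `os.path.relpath` rule to a temporary directory and lists the resulting entry names; the directory and file names are hypothetical stand-ins for the saved model files.

```python
import os
import tempfile
import zipfile


def list_archive(zip_path):
    """Return the sorted entry names stored in a zip file."""
    with zipfile.ZipFile(zip_path) as zipf:
        return sorted(zipf.namelist())


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in for the model directory
    model_dir = os.path.join(tmp, "demo-model")
    os.makedirs(model_dir)
    for name in ("config.json", "vocab.json"):
        with open(os.path.join(model_dir, name), "w") as fh:
            fh.write("{}")

    archive = os.path.join(tmp, "demo-model.zip")
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zipf:
        for root, _, files in os.walk(model_dir):
            for file in files:
                file_path = os.path.join(root, file)
                # Store paths relative to the parent, so entries keep the folder name
                zipf.write(file_path, os.path.relpath(file_path, os.path.dirname(model_dir)))

    archive_names = list_archive(archive)

print(archive_names)
```

Entries are stored under the top-level folder name, so extracting the archive recreates the model directory rather than scattering files into the current folder.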

In [None]:
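# Dropbox API credentials: read from environment variables instead of
# hardcoding secrets in the notebook. The variable names are placeholders;
# set them in your environment before running this cell.
APP_KEY = os.environ["DROPBOX_APP_KEY"]
APP_SECRET = os.environ["DROPBOX_APP_SECRET"]
REFRESH_TOKEN = os.environ["DROPBOX_REFRESH_TOKEN"]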

# Initialize Dropbox client
def get_dropbox_client():
    """Initialize and authenticate Dropbox client using refresh token."""
    try:
        dbx = dropbox.Dropbox(
            app_key=APP_KEY,
            app_secret=APP_SECRET,
            oauth2_refresh_token=REFRESH_TOKEN
        )
        # Check that the access token is valid
        dbx.users_get_current_account()
        return dbx
    except AuthError as e:
        print(f"ERROR: Invalid credentials. {e}")
        return None
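
Credentials written directly into a notebook are exposed to anyone who can read or fork it. One hedged alternative is to read them from the environment and fail early with a clear message when any are missing. In this sketch the variable names and `load_dropbox_credentials` are hypothetical choices, not part of the Dropbox SDK.

```python
import os

# Assumed variable names; pick whatever fits your deployment
REQUIRED_VARS = ("DROPBOX_APP_KEY", "DROPBOX_APP_SECRET", "DROPBOX_REFRESH_TOKEN")


def load_dropbox_credentials(env=None):
    """Return the required credentials from env, raising if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}


# Demonstration with a stand-in mapping instead of the real environment
demo_env = {name: "placeholder" for name in REQUIRED_VARS}
demo_credentials = load_dropbox_credentials(demo_env)
print(sorted(demo_credentials))
```

Any credential that has already been committed or shared in a notebook should be treated as compromised and rotated.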

In [None]:
# Upload the file to Dropbox
def upload_file_to_dropbox(file_path, dropbox_path):
    """Upload a file to Dropbox."""
    dbx = get_dropbox_client()
    if not dbx:
        return False
        
    with open(file_path, 'rb') as f:
        file_size = os.path.getsize(file_path)
        chunk_size = 4 * 1024 * 1024  # 4MB chunks
        
        if file_size <= chunk_size:
            # Small file, upload in one go
            print(f"Uploading {file_path} to Dropbox as {dropbox_path}...")
            try:
                dbx.files_upload(f.read(), dropbox_path, mode=WriteMode('overwrite'))
                print("Upload complete!")
                return True
            except ApiError as e:
                print(f"ERROR: Dropbox API error - {e}")
                return False
        else:
            # Large file, use chunked upload
            print(f"Uploading {file_path} to Dropbox as {dropbox_path} in chunks...")
            try:
                upload_session_start_result = dbx.files_upload_session_start(f.read(chunk_size))
                cursor = dropbox.files.UploadSessionCursor(
                    session_id=upload_session_start_result.session_id,
                    offset=f.tell()
                )
                commit = dropbox.files.CommitInfo(path=dropbox_path, mode=WriteMode('overwrite'))

                # Upload the remaining chunks
                uploaded = f.tell()
                with tqdm(total=file_size, desc="Uploading", unit="B", unit_scale=True) as pbar:
                    pbar.update(uploaded)

                    while uploaded < file_size:
                        data = f.read(chunk_size)
                        if (file_size - uploaded) <= chunk_size:
                            # Last chunk: finish the session and commit the file
                            dbx.files_upload_session_finish(data, cursor, commit)
                        else:
                            # More chunks to upload
                            dbx.files_upload_session_append_v2(data, cursor)
                        uploaded += len(data)
                        cursor.offset = uploaded
                        pbar.update(len(data))
            except ApiError as e:
                # Handle API failures the same way as the single-request branch
                print(f"ERROR: Dropbox API error - {e}")
                return False

            print("Chunked upload complete!")
            return True

# Upload the model zip to Dropbox
dropbox_path = "/codebert-swift-model/codebert-swift-model.zip"
success = upload_file_to_dropbox(model_zip_path, dropbox_path)

if success:
    print(f"Successfully uploaded model to Dropbox at {dropbox_path}")
else:
    print("Failed to upload model to Dropbox.")
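
The chunked branch depends on keeping the session offset in step with the number of bytes already sent. The sketch below exercises that bookkeeping against an in-memory stand-in, so it can run without network access. `FakeUploadSession` and `upload_in_chunks` are hypothetical and simplify the flow: the real Dropbox session is started with the first chunk, whereas here every chunk goes through `append` or `finish`.

```python
import io


class FakeUploadSession:
    """In-memory stand-in that rejects chunks sent at the wrong offset."""

    def __init__(self):
        self.buffer = bytearray()
        self.finished = False

    def append(self, data, offset):
        if offset != len(self.buffer):
            raise ValueError(f"offset {offset} != {len(self.buffer)} bytes received")
        self.buffer.extend(data)

    def finish(self, data, offset):
        self.append(data, offset)
        self.finished = True


def upload_in_chunks(stream, total_size, session, chunk_size):
    """Send the stream chunk by chunk and return the number of chunks sent."""
    offset = 0
    chunks = 0
    while offset < total_size:
        data = stream.read(chunk_size)
        if total_size - offset <= chunk_size:
            session.finish(data, offset)  # last chunk closes the session
        else:
            session.append(data, offset)
        offset += len(data)  # the next chunk must start where this one ended
        chunks += 1
    return chunks


payload = bytes(range(256)) * 10  # 2560 bytes of sample data
session = FakeUploadSession()
n_chunks = upload_in_chunks(io.BytesIO(payload), len(payload), session, chunk_size=1000)
print(n_chunks, session.finished, bytes(session.buffer) == payload)
```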

## Creating a Shareable Link

Finally, let's create a shareable link for our model:

In [None]:
# Create a shared link for the file
def create_shared_link(dropbox_path):
    """Create a shared link for a file in Dropbox."""
    dbx = get_dropbox_client()
    if not dbx:
        return None
        
    try:
        shared_link = dbx.sharing_create_shared_link_with_settings(dropbox_path)
        return shared_link.url
    except ApiError as e:
        # If the file already has a shared link, the API will return an error
        if isinstance(e.error, dropbox.sharing.CreateSharedLinkWithSettingsError) and \
           e.error.is_shared_link_already_exists():
            # Get existing links
            links = dbx.sharing_list_shared_links(dropbox_path).links
            if links:
                return links[0].url
        print(f"ERROR: {e}")
        return None

# Create a shared link
shared_link = create_shared_link(dropbox_path)

if shared_link:
    # Convert to direct download link
    download_link = shared_link.replace("www.dropbox.com", "dl.dropboxusercontent.com").replace("?dl=0", "")
    print(f"Download link: {download_link}")
else:
    print("Failed to create shared link.")
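
Shared links can carry query parameters other than `dl=0`, in which case a plain string replacement of `?dl=0` leaves the link unchanged. As a hedged alternative, the sketch below sets `dl=1` (which Dropbox documents as forcing a download) while preserving any other parameters. `force_download` is a hypothetical helper, and the URLs only illustrate link shapes and are not real shared links.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def force_download(url: str) -> str:
    """Return url with its dl query parameter set to 1, keeping other parameters."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "dl"
    ]
    query.append(("dl", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical link shapes, for illustration only
print(force_download("https://www.dropbox.com/s/abc123/model.zip?dl=0"))
print(force_download("https://www.dropbox.com/scl/fi/abc123/model.zip?rlkey=xyz&dl=0"))
```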

## Conclusion

In this notebook, we've successfully:

1. Loaded and prepared the Swift code dataset for training
2. Fine-tuned the CodeBERT model on this dataset
3. Evaluated the model's performance
4. Saved and uploaded the model to Dropbox for easy access

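The resulting model classifies Swift files as `Package.swift` manifests or not. The same pipeline can be adapted to other Swift code understanding tasks by swapping in different labels, for example richer code classification, or by using the fine-tuned encoder as a feature extractor for more complex code intelligence tasks.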