# Enhanced CodeBERT for Swift Code Understanding

In this notebook, we fine-tune the [CodeBERT](https://github.com/microsoft/CodeBERT) model on the [Swift Code Intelligence dataset](https://huggingface.co/datasets/mvasiliniuc/iva-swift-codeint). CodeBERT is a pre-trained model for programming languages, much as BERT is for natural-language text. Developed by Microsoft Research, it is trained on paired natural-language and source-code data, which makes it well suited to code-related tasks.

Unlike the previous version that focused only on identifying Package.swift files, this enhanced version trains the model on the entire dataset by classifying Swift files into meaningful categories based on their purpose in a codebase.

## Overview

The process of fine-tuning CodeBERT involves:

1. **🔧 Setup**: Install necessary libraries and prepare our environment
2. **📥 Data Loading**: Load the Swift code dataset from Hugging Face
3. **🧹 Enhanced Preprocessing**: Prepare the data for training by categorizing files and tokenizing the code samples
4. **🧠 Model Training**: Fine-tune CodeBERT on our prepared data
5. **📊 Evaluation**: Assess how well our model performs
6. **📤 Export & Upload**: Save the model and upload it to Dropbox

Let's start by installing the necessary libraries:

In [None]:
# Uninstall TensorFlow and install TensorFlow-cpu (better for Kaggle environment)
!pip uninstall -y tensorflow
!pip install tensorflow-cpu
# Install required libraries
!pip install transformers datasets evaluate torch scikit-learn tqdm dropbox requests


In [None]:
# Standard-library and third-party imports used throughout the notebook
import os
import json
import torch
import random
import numpy as np
import time
import gc
import re
import collections
from tqdm.auto import tqdm
from datasets import load_dataset, ClassLabel
from sklearn.metrics import accuracy_score, f1_score, precision_recall_fscore_support
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification,
    RobertaForSequenceClassification,
    Trainer, 
    TrainingArguments,
    set_seed,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
    get_scheduler
)

# Import AdamW from torch.optim instead of transformers.optimization
from torch.optim import AdamW
from transformers.trainer_utils import get_last_checkpoint

# Set a seed for reproducibility
set_seed(42)

# Add memory management function
def cleanup_memory():
    """Force garbage collection and clear CUDA cache if available."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    print("Memory cleaned up.")


## Accelerator Detection and Configuration

Let's detect the available accelerator, falling back to the CPU when no GPU is found:

In [None]:
# Check if GPU is available
import torch

if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device('cpu')
    print("Using CPU")

# Set random seeds for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)


## Dataset and Model Configuration

Let's define the model and dataset we'll be using:

In [None]:
# Dataset configuration
DATASET_ID = "mvasiliniuc/iva-swift-codeint"

# Model configuration
MODEL_NAME = "microsoft/codebert-base"
MAX_LENGTH = 512
BATCH_SIZE = 16
LEARNING_RATE = 2e-5
WEIGHT_DECAY = 0.01
NUM_EPOCHS = 5
WARMUP_STEPS = 500
GRADIENT_ACCUMULATION_STEPS = 4

print("Using default configuration values.")
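
With gradient accumulation, the optimizer updates the weights once every `GRADIENT_ACCUMULATION_STEPS` batches, so the effective batch size is larger than `BATCH_SIZE`. The sketch below works through that arithmetic. The training-set size of 40,000 is a made-up figure for illustration (not the size of the real dataset), and a single device is assumed, so treat the step counts as approximate.

```python
import math

# Values copied from the configuration cell above
BATCH_SIZE = 16
GRADIENT_ACCUMULATION_STEPS = 4
NUM_EPOCHS = 5
WARMUP_STEPS = 500

# Hypothetical training-set size, used only for illustration
num_train_examples = 40_000

effective_batch_size = BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS
batches_per_epoch = math.ceil(num_train_examples / BATCH_SIZE)
# One optimizer update per GRADIENT_ACCUMULATION_STEPS batches (single device assumed)
updates_per_epoch = math.ceil(batches_per_epoch / GRADIENT_ACCUMULATION_STEPS)
total_updates = updates_per_epoch * NUM_EPOCHS
warmup_fraction = WARMUP_STEPS / total_updates

print(f"Effective batch size: {effective_batch_size}")
print(f"Optimizer updates per epoch: {updates_per_epoch}")
print(f"Total optimizer updates: {total_updates}")
print(f"Warmup covers {warmup_fraction:.1%} of training")
```

If the real training set is much smaller than this, 500 warmup steps could cover most of the run, so it is worth repeating the calculation once the dataset size is known.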


## Data Loading

Now let's load the Swift code dataset and examine its structure with proper error handling:

In [None]:
# Function to load dataset with retry logic
def load_dataset_with_retry(dataset_id, max_retries=3, retry_delay=5):
    """Load a dataset with retry logic."""
    for attempt in range(max_retries):
        try:
            print(f"Loading dataset (attempt {attempt+1}/{max_retries})...")
            data = load_dataset(dataset_id, trust_remote_code=True)
            print(f"Dataset loaded successfully with {len(data['train'])} examples")
            return data
        except Exception as e:
            print(f"Error loading dataset (attempt {attempt+1}/{max_retries}): {e}")
            if attempt < max_retries - 1:
                print(f"Retrying in {retry_delay} seconds...")
                time.sleep(retry_delay)
            else:
                print("Maximum retries reached. Could not load dataset.")
                raise

# Make sure the configuration is defined (in case the previous cell didn't execute).
# The fallback values mirror the configuration cell so later cells see the same names.
if 'DATASET_ID' not in globals():
    print("Warning: DATASET_ID not found. Using default values.")
    DATASET_ID = "mvasiliniuc/iva-swift-codeint"  # Default value as fallback
    MODEL_NAME = "microsoft/codebert-base"
    MAX_LENGTH = 512
    BATCH_SIZE = 16
    LEARNING_RATE = 2e-5
    WEIGHT_DECAY = 0.01
    NUM_EPOCHS = 5
    WARMUP_STEPS = 500
    GRADIENT_ACCUMULATION_STEPS = 4
    print("Using default configuration values.")

# Load the dataset with retry logic
try:
    print(f"Loading dataset: {DATASET_ID}")
    data = load_dataset_with_retry(DATASET_ID)
    print("Dataset structure:")
    print(data)
except Exception as e:
    print(f"Fatal error loading dataset: {e}")
    raise


In [None]:
# Verify dataset structure and column names
def verify_dataset_structure(dataset):
    """Verify that the dataset has the expected structure and columns."""
    required_columns = ['repo_name', 'path', 'content']
    if 'train' not in dataset:
        print("WARNING: Dataset does not have a 'train' split.")
        return False
    
    missing_columns = [col for col in required_columns if col not in dataset['train'].column_names]
    if missing_columns:
        print(f"WARNING: Dataset is missing required columns: {missing_columns}")
        return False
    
    print("Dataset structure verification passed.")
    return True

# Verify dataset structure
dataset_valid = verify_dataset_structure(data)
if not dataset_valid:
    print("Dataset structure is not as expected. Proceeding with caution.")


In [None]:
# Let's take a look at an example from the dataset
try:
    if 'train' in data:
        example = data['train'][0]
    else:
        example = data[list(data.keys())[0]][0]
    
    print("Example features:")
    for key, value in example.items():
        if isinstance(value, str) and len(value) > 100:
            print(f"{key}: {value[:100]}...")
        else:
            print(f"{key}: {value}")
except Exception as e:
    print(f"Error exploring dataset example: {e}")


## Loading the CodeBERT Tokenizer

Now, let's load the CodeBERT tokenizer, which has been specially trained to handle code tokens:

In [None]:
# Load the CodeBERT tokenizer with error handling
try:
    # MODEL_NAME is defined in the configuration cell
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    print(f"Tokenizer vocabulary size: {len(tokenizer)}")
    print(f"Tokenizer type: {tokenizer.__class__.__name__}")
except Exception as e:
    print(f"Error loading tokenizer: {e}")
    raise


## Enhanced Data Preparation

Instead of focusing only on Package.swift files, we'll create a more meaningful multi-class classification task that categorizes Swift files based on their purpose in a codebase. This approach utilizes the entire dataset and provides more valuable insights into code understanding.

We'll categorize files into the following classes:
1. **Models** - Data structures and model definitions
2. **Views** - UI related files
3. **Controllers** - Application logic
4. **Utilities** - Helper functions and extensions
5. **Tests** - Test files
6. **Configuration** - Package and configuration files

In [None]:
def extract_file_type(path):
    """
    Extract the file type/category based on the file path and naming conventions in Swift projects.
    
    Args:
        path (str): The file path
        
    Returns:
        int: The category label (0-5)
    """
    path_lower = path.lower()
    filename = path.split('/')[-1].lower()
    
    # Category 0: Models - Data structures and model definitions
    if ('model' in path_lower or 
        'struct' in path_lower or 
        'entity' in path_lower or
        ('data' in path_lower and 'class' in path_lower)):
        return 0
    
    # Category 1: Views - UI related files
    # Note: 'view' also matches view-controller paths such as "LoginViewController.swift"
    elif ('view' in path_lower or 
          'ui' in path_lower or 
          'screen' in path_lower or 
          'page' in path_lower):
        return 1
    
    # Category 2: Controllers - Application logic
    elif ('controller' in path_lower or 
          'manager' in path_lower or 
          'coordinator' in path_lower or
          'service' in path_lower):
        return 2
    
    # Category 3: Utilities - Helper functions and extensions
    elif ('util' in path_lower or 
          'helper' in path_lower or 
          'extension' in path_lower or
          'common' in path_lower):
        return 3
    
    # Category 4: Tests - Test files
    elif ('test' in path_lower or 
          'spec' in path_lower or 
          'mock' in path_lower):
        return 4
    
    # Category 5: Configuration - Package and configuration files
    elif ('package.swift' in path_lower or 
          'config' in path_lower or 
          'settings' in path_lower or
          'info.plist' in path_lower):
        return 5
    
    # Default to category 3 (Utilities) if no clear category is found
    return 3

def analyze_content_for_category(content):
    """
    Analyze file content to help determine its category when path-based classification is ambiguous.
    
    Args:
        content (str): The file content
        
    Returns:
        int: The suggested category based on content analysis
    """
    content_lower = content.lower()
    
    # Check for model patterns
    if (re.search(r'struct\s+\w+', content) or 
        re.search(r'class\s+\w+\s*:\s*\w*codable', content_lower) or
        'encodable' in content_lower or 'decodable' in content_lower):
        return 0
    
    # Check for view patterns
    elif ('uiview' in content_lower or 
          'uitableview' in content_lower or 
          'uicollectionview' in content_lower or
          'swiftui' in content_lower or
          'view {' in content_lower):
        return 1
    
    # Check for controller patterns
    elif ('viewcontroller' in content_lower or 
          'uiviewcontroller' in content_lower or
          'navigationcontroller' in content_lower or
          'viewdidload' in content_lower):
        return 2
    
    # Check for utility patterns
    elif ('extension' in content_lower or 
          ('func ' in content and 'class' not in content_lower[:100]) or
          'protocol' in content_lower):
        return 3
    
    # Check for test patterns
    elif ('xctest' in content_lower or 
          'testcase' in content_lower or
          'func test' in content_lower):
        return 4
    
    # Check for configuration patterns
    elif ('package(' in content_lower or 
          ('dependencies' in content_lower and 'package' in content_lower) or
          ('products' in content_lower and 'targets' in content_lower)):
        return 5
    
    # Default to -1 (undetermined)
    return -1

def enhanced_add_labels(example):
    """
    Enhanced labeling function that categorizes Swift files based on their purpose.
    
    Categories:
    0: Models - Data structures and model definitions
    1: Views - UI related files
    2: Controllers - Application logic
    3: Utilities - Helper functions and extensions
    4: Tests - Test files
    5: Configuration - Package and configuration files
    
    Args:
        example: Dataset example with 'path' and 'content' fields
        
    Returns:
        example: The example with added 'label' field
    """
    # First try to determine category from path
    path_category = extract_file_type(example['path'])
    
    # If the path-based category is ambiguous (category 3 - Utilities is our default),
    # try to analyze the content for a more specific category
    if path_category == 3:
        content_category = analyze_content_for_category(example['content'])
        # Only use content category if it's determined (-1 means undetermined)
        if content_category != -1:
            example['label'] = content_category
        else:
            example['label'] = path_category
    else:
        example['label'] = path_category
    
    return example
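
The path rules above are checked in order, so the first matching keyword decides the label. The sketch below is a simplified, table-driven stand-in for that logic (it is not the notebook's `extract_file_type`, and it omits a few keywords such as `ui`); the sample paths are made up. It shows how rule order affects the labels.

```python
# Simplified stand-in for the path rules: (category id, keywords), checked in order
RULES = [
    (0, ("model", "struct", "entity")),
    (1, ("view", "screen", "page")),
    (2, ("controller", "manager", "coordinator", "service")),
    (3, ("util", "helper", "extension", "common")),
    (4, ("test", "spec", "mock")),
    (5, ("package.swift", "config", "settings")),
]
NAMES = ["Models", "Views", "Controllers", "Utilities", "Tests", "Configuration"]

def first_match(path, default=3):
    """Return the first category whose keyword occurs in the lowercased path."""
    path_lower = path.lower()
    for category, keywords in RULES:
        if any(keyword in path_lower for keyword in keywords):
            return category
    return default

# Hypothetical paths, chosen to exercise the rule order
sample_paths = [
    "App/Models/User.swift",
    "App/Login/LoginViewController.swift",
    "Tests/UserModelTests.swift",
    "Package.swift",
    "Sources/Networking/APIClient.swift",
]
for path in sample_paths:
    print(f"{path} -> {NAMES[first_match(path)]}")
```

Because the rules are checked top to bottom, a view controller is labeled as a View and a model test as a Model. Reordering the rules (for example, checking tests first) would change the label distribution, so the order is a design choice worth reviewing.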


In [None]:
try:
    # Apply the enhanced labeling function
    labeled_data = data['train'].map(enhanced_add_labels)
    
    # Check the distribution of labels
    all_labels = labeled_data['label']
    label_counter = collections.Counter(all_labels)
    
    print("Label distribution:")
    for label, count in label_counter.items():
        category_names = {
            0: "Models",
            1: "Views",
            2: "Controllers",
            3: "Utilities",
            4: "Tests",
            5: "Configuration"
        }
        category_name = category_names.get(label, f"Category {label}")
        print(f"Label {label} ({category_name}): {count} examples ({count/len(labeled_data)*100:.2f}%)")
    
    # Check for label imbalance
    min_label_count = min(label_counter.values())
    max_label_count = max(label_counter.values())
    imbalance_ratio = max_label_count / min_label_count if min_label_count > 0 else float('inf')
    
    if imbalance_ratio > 10:
        print(f"WARNING: Severe label imbalance detected (ratio: {imbalance_ratio:.2f}). Consider using class weights or resampling.")
    elif imbalance_ratio > 3:
        print(f"WARNING: Moderate label imbalance detected (ratio: {imbalance_ratio:.2f}). Consider using class weights.")
        
except Exception as e:
    print(f"Error preparing dataset: {e}")
    raise


Now let's split our data into training, validation, and test sets (80/10/10), using stratification to preserve the label distribution in each split:

In [None]:
try:
    # Get unique labels
    unique_labels = sorted(set(labeled_data["label"]))
    num_labels = len(unique_labels)
    
    # Create a new dataset with ClassLabel feature
    labeled_data = labeled_data.cast_column("label", ClassLabel(num_classes=num_labels, names=[str(i) for i in unique_labels]))
    
    # First split: Create train and temp sets (temp will be split into val and test)  
    train_temp_split = labeled_data.train_test_split(test_size=0.2, seed=42, stratify_by_column='label')
    train_data = train_temp_split['train']
    
    # Second split: Split temp into validation and test sets
    val_test_split = train_temp_split['test'].train_test_split(test_size=0.5, seed=42, stratify_by_column='label')
    val_data = val_test_split['train']
    test_data = val_test_split['test']
    
    # Verify label distribution after split
    train_label_counter = collections.Counter(train_data['label'])
    val_label_counter = collections.Counter(val_data['label'])
    test_label_counter = collections.Counter(test_data['label'])
    
    print(f"Training set size: {len(train_data)}")
    print(f"Training label distribution: {dict(train_label_counter)}")
    
    print(f"Validation set size: {len(val_data)}")
    print(f"Validation label distribution: {dict(val_label_counter)}")
    
    print(f"Test set size: {len(test_data)}")
    print(f"Test label distribution: {dict(test_label_counter)}")
    
except Exception as e:
    print(f"Error splitting dataset: {e}")
    raise
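
To make the idea of stratification concrete, here is a minimal pure-Python sketch of an 80/10/10 stratified split on a toy label list. It illustrates the principle only and is not how the `datasets` library implements `stratify_by_column`; the helper `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction, seed=42):
    """Split indices so each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy, imbalanced label list (made up for illustration)
labels = [0] * 100 + [1] * 50 + [2] * 10

# 80% train, then split the remaining 20% in half for validation and test
train_idx, temp_idx = stratified_split(labels, test_fraction=0.2)
temp_labels = [labels[i] for i in temp_idx]
val_pos, test_pos = stratified_split(temp_labels, test_fraction=0.5)
val_idx = [temp_idx[i] for i in val_pos]
test_idx = [temp_idx[i] for i in test_pos]

for name, idx in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    print(name, len(idx), dict(Counter(labels[i] for i in idx)))
```

Each split keeps the same 10:5:1 ratio between the three classes, which is what the label distributions printed by the cell above let us check on the real data.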


## Tokenization

Now let's tokenize our data for the model:

In [None]:
def tokenize_function(examples):
    # Tokenize the file contents, padding or truncating every sample to MAX_LENGTH.
    # No return_tensors here: map() stores plain lists, and set_format() below
    # converts the selected columns to PyTorch tensors.
    return tokenizer(
        examples['content'],
        padding='max_length',
        truncation=True,
        max_length=MAX_LENGTH
    )

# Apply tokenization to the datasets
tokenized_train_data = train_data.map(tokenize_function, batched=True)
tokenized_val_data = val_data.map(tokenize_function, batched=True)
tokenized_test_data = test_data.map(tokenize_function, batched=True)

# Set the format for PyTorch
tokenized_train_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
tokenized_val_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
tokenized_test_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

print(f"Tokenized {len(tokenized_train_data)} training examples")
print(f"Tokenized {len(tokenized_val_data)} validation examples")
print(f"Tokenized {len(tokenized_test_data)} test examples")
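
To see what `padding='max_length'` and `truncation=True` do to a single sample, here is a toy sketch. It works on plain lists of token ids and ignores the special start and end tokens that the real tokenizer preserves when truncating; the helper `pad_or_truncate` is hypothetical, and the pad id of 1 follows the RoBERTa vocabulary that CodeBERT uses.

```python
def pad_or_truncate(token_ids, max_length, pad_id=1):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation: drop the tail
    n_pad = max_length - len(kept)                   # padding needed, if any
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 0 marks padding positions
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([0, 100, 200, 2], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Swift files longer than `MAX_LENGTH` tokens are cut off, so the model only ever sees the beginning of long files.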


## Model Setup

Now let's set up the CodeBERT model for our multi-class classification task:

In [None]:
try:
    # Load the model with the correct number of labels
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, 
        num_labels=num_labels,
        problem_type="single_label_classification"
    )
    
    # Move model to the appropriate device
    model.to(device)
    
    print(f"Model loaded with {num_labels} output classes")
    print(f"Model type: {model.__class__.__name__}")
    
except Exception as e:
    print(f"Error loading model: {e}")
    raise


## Class Weights Calculation

The label distribution printed earlier may be imbalanced. Class weights counter this by making errors on rare classes cost more than errors on common ones, so let's calculate them from the training labels:

In [None]:
# Calculate class weights to handle imbalanced data
from sklearn.utils.class_weight import compute_class_weight

# Convert PyTorch tensor to numpy array if needed
if hasattr(tokenized_train_data['label'], 'numpy'):
    labels = tokenized_train_data['label'].numpy()
else:
    # If it's already a list or another type, convert to numpy array
    labels = np.array(tokenized_train_data['label'])

# Compute balanced class weights
class_weights = compute_class_weight('balanced', classes=np.unique(labels), y=labels)
class_weights = torch.tensor(class_weights, dtype=torch.float).to(device)

print("Class weights:")
for i, weight in enumerate(class_weights):
    category_names = {
        0: "Models",
        1: "Views",
        2: "Controllers",
        3: "Utilities",
        4: "Tests",
        5: "Configuration"
    }
    category_name = category_names.get(i, f"Category {i}")
    print(f"  Class {i} ({category_name}): {weight:.4f}")
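
The `'balanced'` option weights each class by `n_samples / (n_classes * class_count)`, so rare classes get larger weights. The sketch below reproduces that formula in plain Python on a made-up label list; the helper `balanced_class_weights` is hypothetical and exists only to show the arithmetic.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c]), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

# Toy labels: class 0 is common, class 2 is rare (made up for illustration)
labels = [0] * 60 + [1] * 30 + [2] * 10
weights = balanced_class_weights(labels)

for label, weight in weights.items():
    print(f"class {label}: {weight:.4f}")
```

A class that makes up exactly `1 / n_classes` of the data would get a weight of 1.0; anything rarer is weighted above 1 and anything more common below 1.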


## Training Setup

Let's set up the training configuration:

In [None]:
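# Metrics computed at each evaluation; the export cell reads all four keys
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average='weighted', zero_division=0)
    return {"accuracy": accuracy_score(labels, preds), "f1": f1,
            "precision": precision, "recall": recall}

# Training arguments built from the configuration cell. The output directory,
# per-epoch evaluation, and early-stopping patience are assumed defaults.
# On transformers < 4.41, eval_strategy is named evaluation_strategy.
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    warmup_steps=WARMUP_STEPS,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,  # required by EarlyStoppingCallback
    metric_for_best_model="f1",
    report_to="none",
)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
early_stopping_callback = EarlyStoppingCallback(early_stopping_patience=2)

# Create a custom loss function with class weights
class WeightedLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs absorbs extra arguments that newer transformers versions pass
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        
        # Use class weights in the loss calculation
        loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights.to(logits.device))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        
        return (loss, outputs) if return_outputs else loss

# Create trainer with weighted loss
trainer = WeightedLossTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_data,
    eval_dataset=tokenized_val_data,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[early_stopping_callback]
)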

print("Training setup complete")
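
To show what the class weights do inside the loss, here is a small pure-Python version of weighted cross-entropy. It follows the weighted-mean reduction that `torch.nn.CrossEntropyLoss` applies when given a `weight` tensor: each sample's loss is scaled by the weight of its true class, and the sum is divided by the sum of those weights. The logits, targets, and weights are made up for illustration.

```python
import math

def weighted_cross_entropy(logits, targets, class_weights):
    """Weighted mean of per-sample negative log-likelihoods."""
    total, weight_sum = 0.0, 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp over the row of logits
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_sum_exp - row[target]          # -log softmax(row)[target]
        total += class_weights[target] * nll
        weight_sum += class_weights[target]
    return total / weight_sum

# Two samples with identical logits; the second one (true class 1) is misclassified
logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]

unweighted = weighted_cross_entropy(logits, targets, [1.0, 1.0])
weighted = weighted_cross_entropy(logits, targets, [1.0, 3.0])

print(f"Unweighted loss: {unweighted:.4f}")
print(f"Loss with class 1 weighted 3x: {weighted:.4f}")
```

The weighted loss is higher because the mistake falls on the up-weighted class, which is how the weights push the model to pay more attention to rare categories.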


## Model Training

Now let's train the model:

In [None]:
try:
    print("Starting training...")
    train_result = trainer.train()
    
    # Print training results
    print(f"Training completed in {train_result.metrics['train_runtime']:.2f} seconds")
    print(f"Training loss: {train_result.metrics['train_loss']:.4f}")
    
    # Save the model
    trainer.save_model("./final_model")
    print("Model saved to ./final_model")
    
    # Clean up memory
    cleanup_memory()
    
except Exception as e:
    print(f"Error during training: {e}")
    raise


## Model Evaluation

Let's evaluate our model on the test set:

In [None]:
try:
    print("Evaluating model on test set...")
    test_results = trainer.evaluate(tokenized_test_data)
    
    # Print evaluation results
    print("Test results:")
    for key, value in test_results.items():
        print(f"{key}: {value:.4f}")
    
    # Create a confusion matrix
    from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
    import matplotlib.pyplot as plt
    
    # Get predictions
    predictions = trainer.predict(tokenized_test_data)
    preds = predictions.predictions.argmax(-1)
    labels = predictions.label_ids
    
    # Create confusion matrix
    cm = confusion_matrix(labels, preds)
    
    # Define class names
    class_names = ["Models", "Views", "Controllers", "Utilities", "Tests", "Configuration"]
    
    # Display confusion matrix
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
    fig, ax = plt.subplots(figsize=(10, 10))
    disp.plot(ax=ax, cmap=plt.cm.Blues, values_format='d')
    plt.title('Confusion Matrix')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    
except Exception as e:
    print(f"Error during evaluation: {e}")
    raise
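
A confusion matrix also gives per-class precision and recall directly: in scikit-learn's layout, rows are true classes and columns are predicted classes. The sketch below computes both from a small made-up 3x3 matrix; the helper `per_class_metrics` is hypothetical.

```python
def per_class_metrics(cm):
    """Return a list of (precision, recall) per class from a square confusion matrix."""
    n = len(cm)
    results = []
    for k in range(n):
        true_positives = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum: predicted as k
        actual_k = sum(cm[k])                          # row sum: truly k
        precision = true_positives / predicted_k if predicted_k else 0.0
        recall = true_positives / actual_k if actual_k else 0.0
        results.append((precision, recall))
    return results

# Made-up confusion matrix (rows = true class, columns = predicted class)
cm = [
    [8, 2, 0],
    [1, 6, 3],
    [0, 1, 9],
]
metrics = per_class_metrics(cm)
for name, (precision, recall) in zip(["Models", "Views", "Controllers"], metrics):
    print(f"{name}: precision={precision:.3f}, recall={recall:.3f}")
```

Reading the matrix this way helps spot which categories the model confuses with each other, which a single weighted F1 score hides.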


## Prediction Examples

Let's look at some examples of model predictions:

In [None]:
try:
    # Get a few examples from the test set
    num_examples = 5
    examples = test_data.select(range(num_examples))
    
    # Tokenize examples
    inputs = tokenizer(examples['content'], padding='max_length', truncation=True, max_length=MAX_LENGTH, return_tensors='pt')
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    # Get predictions (eval mode disables dropout so the outputs are deterministic)
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
    
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_classes = torch.argmax(predictions, dim=-1).cpu().numpy()
    
    # Define class names
    class_names = ["Models", "Views", "Controllers", "Utilities", "Tests", "Configuration"]
    
    # Print predictions
    print("Example predictions:")
    for i in range(num_examples):
        print(f"\nExample {i+1}:")
        print(f"File path: {examples['path'][i]}")
        print(f"Content preview: {examples['content'][i][:100]}...")
        print(f"Predicted class: {predicted_classes[i]} ({class_names[predicted_classes[i]]})")
        print(f"True class: {examples['label'][i]} ({class_names[examples['label'][i]]})")
        print(f"Confidence: {predictions[i][predicted_classes[i]].item():.4f}")
        
        # Print top 3 predictions
        top_3 = torch.topk(predictions[i], 3)
        print("Top 3 predictions:")
        for j in range(3):
            idx = top_3.indices[j].item()
            prob = top_3.values[j].item()
            print(f"  {class_names[idx]}: {prob:.4f}")
    
except Exception as e:
    print(f"Error during prediction examples: {e}")
    raise
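
The confidence values above come from a softmax over the model's logits, followed by picking the highest-probability classes. Here is a minimal pure-Python sketch of those two steps; the logits are made up, and the helpers `softmax` and `top_k` are hypothetical stand-ins for `torch.nn.functional.softmax` and `torch.topk`.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Return the k (index, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

class_names = ["Models", "Views", "Controllers", "Utilities", "Tests", "Configuration"]
logits = [0.2, 2.5, 1.9, -0.3, -1.0, 0.1]  # made-up logits for one file

probs = softmax(logits)
top3 = top_k(probs, 3)
for idx, prob in top3:
    print(f"{class_names[idx]}: {prob:.4f}")
```

When the top two probabilities are close, as with Views and Controllers here, the prediction is less certain than the predicted class alone suggests.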


## Model Export

Let's save and export our model:

In [None]:
try:
    # Save model and tokenizer
    output_dir = "./swift_codebert_classifier"
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    
    # Save class names and other metadata
    metadata = {
        "class_names": class_names,
        "num_classes": num_labels,
        "max_length": MAX_LENGTH,
        "model_name": MODEL_NAME,
        "dataset": DATASET_ID,
        "accuracy": float(test_results["eval_accuracy"]),
        "f1": float(test_results["eval_f1"]),
        "precision": float(test_results["eval_precision"]),
        "recall": float(test_results["eval_recall"])
    }
    
    with open(f"{output_dir}/metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)
    
    print(f"Model, tokenizer, and metadata saved to {output_dir}")
    
    # Create a zip file for easy download
    import shutil
    shutil.make_archive("swift_codebert_classifier", "zip", ".", output_dir)
    print("Model package created: swift_codebert_classifier.zip")
    
except Exception as e:
    print(f"Error saving model: {e}")
    raise


## Uploading to Dropbox

Now let's upload the trained model to Dropbox for easy access and distribution. We validate the credentials first and retry failed uploads:

In [None]:
# Read your Dropbox credentials from environment variables; never hard-code them in a notebook.
# The variable names below are assumptions; use whatever names your environment defines.
APP_KEY = os.environ.get("DROPBOX_APP_KEY", "")
APP_SECRET = os.environ.get("DROPBOX_APP_SECRET", "")
REFRESH_TOKEN = os.environ.get("DROPBOX_REFRESH_TOKEN", "")


In [None]:
import dropbox
from dropbox.exceptions import ApiError
from dropbox.files import WriteMode
from tqdm.notebook import tqdm
import time

def validate_dropbox_credentials(app_key, app_secret, refresh_token):
    """Test Dropbox credentials before attempting upload."""
    try:
        print("Validating Dropbox credentials...")
        dbx = dropbox.Dropbox(
            app_key=app_key,
            app_secret=app_secret,
            oauth2_refresh_token=refresh_token
        )
        # Check that the access token is valid
        account = dbx.users_get_current_account()
        print(f"✅ Connected to Dropbox account: {account.name.display_name}")
        return True, dbx
    except Exception as e:
        print(f"❌ Error connecting to Dropbox: {e}")
        return False, None

# Validate Dropbox credentials
credentials_valid, dbx = validate_dropbox_credentials(APP_KEY, APP_SECRET, REFRESH_TOKEN)

if not credentials_valid:
    print("Please check your Dropbox credentials and try again.")


In [None]:
def upload_to_dropbox(file_path, dropbox_path, max_retries=3):
    """Upload a file to Dropbox with retry logic."""
    if not credentials_valid:
        print("Dropbox credentials are not valid. Cannot upload.")
        return False
        
    file_size = os.path.getsize(file_path)
    chunk_size = 4 * 1024 * 1024  # 4MB chunks
    
    for attempt in range(max_retries):
        try:
            with open(file_path, 'rb') as f:
                # For small files, upload in one go
                if file_size <= chunk_size:
                    print(f"Uploading {file_path} to Dropbox as {dropbox_path}...")
                    try:
                        dbx.files_upload(f.read(), dropbox_path, mode=WriteMode('overwrite'))
                        print("Upload complete!")
                        return True
                    except ApiError as e:
                        print(f"ERROR: Dropbox API error - {e}")
                        if attempt < max_retries - 1:
                            print(f"Retrying... (Attempt {attempt+1}/{max_retries})")
                            continue
                        return False
                
                # For large files, use chunked upload
                else:
                    print(f"Uploading {file_path} to Dropbox as {dropbox_path} in chunks...")
                    upload_session_start_result = dbx.files_upload_session_start(f.read(chunk_size))
                    cursor = dropbox.files.UploadSessionCursor(
                        session_id=upload_session_start_result.session_id,
                        offset=f.tell()
                    )
                    commit = dropbox.files.CommitInfo(path=dropbox_path, mode=WriteMode('overwrite'))
                    
                    # Upload the file in chunks with progress tracking
                    uploaded = f.tell()
                    with tqdm(total=file_size, desc="Uploading", unit="B", unit_scale=True) as pbar:
                        pbar.update(uploaded)
                        while uploaded < file_size:
                            if (file_size - uploaded) <= chunk_size:
                                dbx.files_upload_session_finish(f.read(chunk_size), cursor, commit)
                                uploaded = file_size
                                pbar.update(file_size - pbar.n)
                            else:
                                dbx.files_upload_session_append_v2(f.read(chunk_size), cursor)
                                uploaded = f.tell()
                                cursor.offset = uploaded
                                pbar.update(chunk_size)
                    print("Chunked upload complete!")
                    return True
        except Exception as e:
            print(f"ERROR: Upload failed - {e}")
            if attempt < max_retries - 1:
                print(f"Retrying... (Attempt {attempt+1}/{max_retries})")
                time.sleep(2)  # Wait before retrying
            else:
                print("Maximum retries reached. Upload failed.")
                return False
    return False

def create_shared_link(dropbox_path):
    """Create a shared link for a file in Dropbox."""
    if not credentials_valid:
        print("Dropbox credentials are not valid. Cannot create shared link.")
        return None
        
    try:
        shared_link = dbx.sharing_create_shared_link_with_settings(dropbox_path)
        return shared_link.url
    except ApiError as e:
        # If the file already has a shared link, the API will return an error
        if isinstance(e.error, dropbox.sharing.CreateSharedLinkWithSettingsError) and \
           e.error.is_path() and e.error.get_path().is_shared_link_already_exists():
            # Get existing shared links
            shared_links = dbx.sharing_list_shared_links(path=dropbox_path).links
            if shared_links:
                return shared_links[0].url
        print(f"ERROR: Could not create shared link - {e}")
        return None
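
The chunked branch of `upload_to_dropbox` follows a start / append / finish pattern. The sketch below plans that sequence for a given file size without touching Dropbox at all; `plan_chunks` is a hypothetical helper, and the 10 MiB file size is made up.

```python
def plan_chunks(file_size, chunk_size):
    """Return a list of (call, offset, length) steps for a chunked upload."""
    first = min(chunk_size, file_size)
    plan = [("start", 0, first)]          # the session starts with the first chunk
    offset = first
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        # The last chunk is sent with the finish call, which also commits the file
        call = "finish" if offset + length == file_size else "append"
        plan.append((call, offset, length))
        offset += length
    return plan

MIB = 1024 * 1024
plan = plan_chunks(file_size=10 * MIB, chunk_size=4 * MIB)
for call, offset, length in plan:
    print(f"{call:<6} offset={offset // MIB} MiB length={length // MIB} MiB")
```

The offset passed with each step must equal the number of bytes already uploaded, which is why the upload function updates `cursor.offset` after every append.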


In [None]:
# Upload the model zip to Dropbox
if credentials_valid:
    zip_path = "swift_codebert_classifier.zip"
    dropbox_path = f"/swift_codebert_classifier/{os.path.basename(zip_path)}"
    
    if upload_to_dropbox(zip_path, dropbox_path):
        print(f"Successfully uploaded model to Dropbox at {dropbox_path}")
        shared_link = create_shared_link(dropbox_path)
        if shared_link:
            print(f"Shared link: {shared_link}")
    else:
        print("Failed to upload model to Dropbox.")
else:
    print("Skipping Dropbox upload due to invalid credentials.")


## Conclusion

We've successfully enhanced the CodeBERT training process to utilize the entire Swift code dataset instead of focusing only on Package.swift files. Our model now classifies Swift code files into meaningful categories based on their purpose in a codebase:

1. **Models** - Data structures and model definitions
2. **Views** - UI related files
3. **Controllers** - Application logic
4. **Utilities** - Helper functions and extensions
5. **Tests** - Test files
6. **Configuration** - Package and configuration files

This multi-class classification approach provides more valuable insights for code understanding tasks and makes better use of the available data. The model can be used for various code intelligence tasks such as:

- Automatically categorizing new code files
- Suggesting file organization in large codebases
- Identifying misplaced code (e.g., model logic in controller files)
- Assisting in code navigation and understanding

The same approach can be extended to other programming languages and tasks, such as code search, code completion, and bug detection.