# Load Dataset and Libraries

## Install Packages

In [None]:
!pip install datasets


In [None]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting datasets>=2.0.0 (from evaluate)
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill (from evaluate)
  Downloading dill-0.3.9-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from evaluate)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.17-py311-none-any.whl.metadata (7.2 kB)
Collecting dill (from evaluate)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec>=2021.05.0 (from fsspec[http]>=2021.05.0->evaluate)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)

In [None]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=ca95742bf1c17f588620812c8fc64e82f55901721a878e47a71e6a552d72972c
  Stored in directory: /root/.cache/pip/wheels/1e/19/43/8a442dc83660ca25e163e1bd1f89919284ab0d0c1475475148
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


## Import Libraries

In [None]:
# Core libraries
import numpy as np
import torch
from torch.optim import AdamW  # transformers' own AdamW is deprecated in favor of this one
from torch.utils.data import DataLoader
from tqdm import tqdm

# Dataset and evaluation
from datasets import load_dataset
import evaluate

# Transformers library
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    get_scheduler,
)

## Load Dataset

In [None]:
# Load the Python subset of the CodeXGLUE CT Code-to-Text dataset
dataset = load_dataset("code_x_glue_ct_code_to_text", "python")

# Work with the first 1% of the dataset
def get_subset(dataset, percentage=1):
    subset = {}
    for split in dataset.keys():
        num_samples = int(len(dataset[split]) * percentage / 100)
        subset[split] = dataset[split].select(range(num_samples))
    return subset

# Get the smaller subset (first 1% of the dataset)
subset_percentage = 1
subset = get_subset(dataset, percentage=subset_percentage)

# Print the first 5 samples in the subset (index the row first so only that row is loaded)
for i in range(5):
    sample = subset['train'][i]
    print(f"Sample {i+1}:")
    print("Code:", sample['code'])
    print("Docstring:", sample['docstring'])
    print("-" * 80)

Generating train split:   0%|          | 0/251820 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13914 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/14918 [00:00<?, ? examples/s]

Sample 1:
Code: def settext(self, text, cls='current'):
        """Set the text for this element.

        Arguments:
            text (str): The text
            cls (str): The class of the text, defaults to ``current`` (leave this unless you know what you are doing). There may be only one text content element of each class associated with the element.
        """
        self.replace(TextContent, value=text, cls=cls)
Docstring: Set the text for this element.

        Arguments:
            text (str): The text
            cls (str): The class of the text, defaults to ``current`` (leave this unless you know what you are doing). There may be only one text content element of each class associated with the element.
--------------------------------------------------------------------------------
Sample 2:
Code: def setdocument(self, doc):
        """Associate a document with this element.

        Arguments:
            doc (:class:`Document`): A document

        Each element must be ass
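
In Sample 1 above, the `code` field still contains the full docstring, so the target text is visible in the model input. A minimal sketch for measuring how often this happens is shown below. `docstring_in_code` and `leak_fraction` are hypothetical helper names, and the two records are abbreviated, hand-written stand-ins for dataset rows, not actual entries.

```python
def docstring_in_code(code: str, docstring: str) -> bool:
    """Return True if the docstring text appears verbatim inside the code."""
    doc = docstring.strip()
    return bool(doc) and doc in code


def leak_fraction(records) -> float:
    """Fraction of records whose code contains its own docstring."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(docstring_in_code(r["code"], r["docstring"]) for r in records)
    return hits / len(records)


# Abbreviated, hand-written stand-ins for dataset rows (assumption, not real data)
records = [
    {
        "code": 'def settext(self, text):\n    """Set the text for this element."""\n    self.replace(text)',
        "docstring": "Set the text for this element.",
    },
    {
        "code": "def add(a, b):\n    return a + b",
        "docstring": "Add two numbers.",
    },
]

print(leak_fraction(records))  # 0.5
```

With the real data, the same check could be run over the rows of `subset['train']`. A high fraction would suggest that BLEU and ROUGE scores partly reflect copying from the input rather than summarization.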

In [None]:
print("Dataset columns:", subset['train'].column_names)

Dataset columns: ['id', 'repo', 'path', 'func_name', 'original_string', 'language', 'code', 'code_tokens', 'docstring', 'docstring_tokens', 'sha', 'url']


We only need the "code" and "docstring" columns, so the remaining metadata columns are dropped.

In [None]:
# Function to retain only specified columns in a dataset split
def retain_columns(dataset, columns_to_keep):
    # Remove all columns except those in `columns_to_keep`
    columns_to_remove = [col for col in dataset.column_names if col not in columns_to_keep]
    return dataset.remove_columns(columns_to_remove)

# Columns to keep
columns_to_keep = ["code", "docstring"]

# Apply to each split in the subset
subset["train"] = retain_columns(subset["train"], columns_to_keep)
if "validation" in subset:
    subset["validation"] = retain_columns(subset["validation"], columns_to_keep)
if "test" in subset:
    subset["test"] = retain_columns(subset["test"], columns_to_keep)

# Print the structure of the updated subset
print(subset)

{'train': Dataset({
    features: ['code', 'docstring'],
    num_rows: 2518
}), 'validation': Dataset({
    features: ['code', 'docstring'],
    num_rows: 139
}), 'test': Dataset({
    features: ['code', 'docstring'],
    num_rows: 149
})}
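
The row counts above follow directly from the truncating arithmetic in `get_subset`. The sketch below reproduces it with the full split sizes reported during dataset generation (251820, 13914, and 14918 examples). `subset_size` is a hypothetical helper that mirrors the `int(len(...) * percentage / 100)` expression.

```python
def subset_size(total: int, percentage: float = 1) -> int:
    """Number of rows kept when taking the first `percentage` percent of a split."""
    return int(total * percentage / 100)  # int() truncates toward zero


full_sizes = {"train": 251820, "validation": 13914, "test": 14918}
sizes = {split: subset_size(n) for split, n in full_sizes.items()}
print(sizes)  # {'train': 2518, 'validation': 139, 'test': 149}
```

Because the rows are taken from the start of each split rather than sampled, the subset is reproducible but not necessarily representative of the full dataset.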


# Preprocessing

In [None]:
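# Tokenize code (model input) and docstrings (target) to fixed-length sequences
def preprocess_function(examples, tokenizer, max_input_length=512, max_output_length=128):
    # Source side: code, padded/truncated to max_input_length tokens
    inputs = tokenizer(
        examples["code"], max_length=max_input_length, padding="max_length", truncation=True
    )
    # Target side: docstrings, padded/truncated to max_output_length tokens
    outputs = tokenizer(
        examples["docstring"], max_length=max_output_length, padding="max_length", truncation=True
    )
    # Note: padding positions in the labels keep the pad token id, so they count toward the loss
    return {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
        "labels": outputs["input_ids"],
    }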
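
One common refinement, not applied in the function above, is to replace padding positions in the labels with -100, the index that PyTorch's cross-entropy loss ignores by default, so that padding does not contribute to the loss. A minimal pure-Python sketch follows. `mask_pad_labels` and `unmask_labels` are hypothetical helpers, and the pad token id of 1 is only an illustrative value.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_pad_labels(label_ids, pad_token_id, ignore_index=IGNORE_INDEX):
    """Replace padding token ids with ignore_index in a batch of label sequences."""
    return [
        [ignore_index if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]


def unmask_labels(label_ids, pad_token_id, ignore_index=IGNORE_INDEX):
    """Map ignore_index back to the pad token id, e.g. before decoding references."""
    return [
        [pad_token_id if tok == ignore_index else tok for tok in seq]
        for seq in label_ids
    ]


labels = [[42, 17, 2, 1, 1], [8, 2, 1, 1, 1]]  # illustrative ids, pad id assumed to be 1
masked = mask_pad_labels(labels, pad_token_id=1)
print(masked)  # [[42, 17, 2, -100, -100], [8, 2, -100, -100, -100]]
```

Labels masked this way would need `unmask_labels` (or an equivalent step) before being passed to `tokenizer.batch_decode`, since -100 is not a valid token id.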

# Implementation

## Encoder-decoder Models


*   PLBART
*   CodeT5



In [None]:
# Load BLEU and ROUGE metrics
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

Downloading builder script:   0%|          | 0.00/5.94k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.34k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [None]:
# Evaluation function: generate a summary for each batch and score it with BLEU and ROUGE
def evaluate_model(model, dataloader, tokenizer, device):
    model.eval()
    predictions, references = [], []
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating"):
            input_ids = batch["input_ids"].to(device)
            # Pass the attention mask so padding tokens in the input are ignored
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"]
            outputs = model.generate(
                input_ids=input_ids, attention_mask=attention_mask, max_length=512
            )
            predictions += tokenizer.batch_decode(outputs, skip_special_tokens=True)
            references += tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Compute corpus-level metrics over all decoded predictions
    bleu_result = bleu.compute(predictions=predictions, references=references)
    rouge_result = rouge.compute(predictions=predictions, references=references)

    return bleu_result, rouge_result
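
To make the ROUGE-1 score more concrete, the sketch below computes a simplified unigram-overlap F1 for one prediction and one reference. It is an illustration only: it splits on whitespace and skips the tokenization and stemming options of the `rouge` metric loaded above, so its values will not match that metric exactly. `rouge1_f1` is a hypothetical name.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("Set the text", "Set the text for this element")
print(round(score, 4))  # 0.6667
```

Here every predicted token appears in the reference (precision 1.0), but only half of the reference is covered (recall 0.5), which gives an F1 of about 0.67.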

In [None]:
# Training function
def train_model(model, train_dataloader, val_dataloader, optimizer, device, epochs=3):
    for epoch in range(epochs):
        print(f"Epoch {epoch + 1}/{epochs}")
        model.train()
        train_loss = 0
        progress_bar = tqdm(train_dataloader, desc="Training")

        for batch in progress_bar:
            batch = {key: val.to(device) for key, val in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
            progress_bar.set_postfix(loss=loss.item())

        print(f"Training Loss: {train_loss / len(train_dataloader)}")

        # Validation
        if val_dataloader:
            model.eval()
            valid_loss = 0
            with torch.no_grad():
                for batch in tqdm(val_dataloader, desc="Validation"):
                    batch = {key: val.to(device) for key, val in batch.items()}
                    outputs = model(**batch)
                    valid_loss += outputs.loss.item()
            print(f"Validation Loss: {valid_loss / len(val_dataloader)}")
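
The imports include `get_scheduler`, but the training loop above keeps the learning rate constant. As an illustration of what a linear schedule would compute (not what this notebook ran), the sketch below reimplements the usual linear-warmup, linear-decay rule in plain Python. `linear_lr` is a hypothetical helper, and the 1890 total steps assume 630 batches per epoch for 3 epochs, as in the training logs further down.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 5e-5, warmup_steps: int = 0) -> float:
    """Learning rate at `step` under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 630 * 3  # batches per epoch * epochs (assumed from the logs)
for step in (0, 945, 1890):
    # Starts at base_lr, is halved at the midpoint, and reaches zero at the end
    print(step, linear_lr(step, total_steps))
```

In an actual training loop, a scheduler object would be stepped once per batch, right after `optimizer.step()`.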

In [None]:
# Initialize data and models
def pipeline(model_name, dataset, batch_size=4, learning_rate=5e-5, epochs=3):
    print(f"Processing pipeline for {model_name}...")

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Preprocess the dataset
    tokenized_train = dataset["train"].map(
        lambda x: preprocess_function(x, tokenizer), batched=True
    )
    tokenized_validation = dataset["validation"].map(
        lambda x: preprocess_function(x, tokenizer), batched=True
    )
    tokenized_test = dataset["test"].map(
        lambda x: preprocess_function(x, tokenizer), batched=True
    )

    # Format data for PyTorch
    tokenized_train.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
    tokenized_validation.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
    tokenized_test.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

    # DataLoader
    data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
    train_dataloader = DataLoader(
        tokenized_train, shuffle=True, batch_size=batch_size, collate_fn=data_collator
    )
    val_dataloader = DataLoader(
        tokenized_validation, batch_size=batch_size, collate_fn=data_collator
    )
    test_dataloader = DataLoader(
        tokenized_test, batch_size=batch_size, collate_fn=data_collator
    )

    # Optimizer
    optimizer = AdamW(model.parameters(), lr=learning_rate)

    # Train the model
    train_model(model, train_dataloader, val_dataloader, optimizer, device, epochs)

    # Evaluate the model
    bleu_result, rouge_result = evaluate_model(model, test_dataloader, tokenizer, device)
    print(f"BLEU Score for {model_name}: {bleu_result['bleu']}")
    print(f"ROUGE Scores for {model_name}: {rouge_result}")

    return bleu_result, rouge_result
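
The progress bars in the logs below report 630, 35, and 38 batches. These follow from the subset sizes and the default `batch_size=4`, since a `DataLoader` keeps the final partial batch unless `drop_last=True` is set. A small sketch of that arithmetic (`num_batches` is a hypothetical helper):

```python
import math


def num_batches(num_rows: int, batch_size: int = 4) -> int:
    """Batches per epoch when the last partial batch is kept (DataLoader default)."""
    return math.ceil(num_rows / batch_size)


rows = {"train": 2518, "validation": 139, "test": 149}
batches = {split: num_batches(n) for split, n in rows.items()}
print(batches)  # {'train': 630, 'validation': 35, 'test': 38}
```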

In [None]:
# Compare PLBART and CodeT5
results = {}
results["PLBART"] = pipeline("uclanlp/plbart-base", subset)
results["CodeT5"] = pipeline("Salesforce/codet5-small", subset)

# Display comparison
print("\nComparison Results:")
for model_name, metrics in results.items():
    print(f"\n{model_name}:")
    print(f"BLEU: {metrics[0]['bleu']}")
    print(f"ROUGE: {metrics[1]}")

Processing pipeline for uclanlp/plbart-base...


Map:   0%|          | 0/2518 [00:00<?, ? examples/s]

Map:   0%|          | 0/139 [00:00<?, ? examples/s]

Map:   0%|          | 0/149 [00:00<?, ? examples/s]



Epoch 1/3


Training: 100%|██████████| 630/630 [03:20<00:00,  3.14it/s, loss=0.0101]


Training Loss: 0.08181892174311325


Validation: 100%|██████████| 35/35 [00:03<00:00,  9.68it/s]


Validation Loss: 0.020966943696216083
Epoch 2/3


Training: 100%|██████████| 630/630 [03:24<00:00,  3.07it/s, loss=7.21e-5]


Training Loss: 0.01439259035545951


Validation: 100%|██████████| 35/35 [00:03<00:00,  9.67it/s]


Validation Loss: 0.015268680770194415
Epoch 3/3


Training: 100%|██████████| 630/630 [03:24<00:00,  3.07it/s, loss=0.0109]


Training Loss: 0.011256696411264928


Validation: 100%|██████████| 35/35 [00:03<00:00,  9.61it/s]


Validation Loss: 0.01625968839977889


Evaluating: 100%|██████████| 38/38 [00:53<00:00,  1.41s/it]


BLEU Score for uclanlp/plbart-base: 0.9564276700745586
ROUGE Scores for uclanlp/plbart-base: {'rouge1': 0.9042146268510616, 'rouge2': 0.9039743931529767, 'rougeL': 0.9070110398827219, 'rougeLsum': 0.9054893853227471}
Processing pipeline for Salesforce/codet5-small...


Map:   0%|          | 0/2518 [00:00<?, ? examples/s]

Map:   0%|          | 0/139 [00:00<?, ? examples/s]

Map:   0%|          | 0/149 [00:00<?, ? examples/s]

Epoch 1/3


Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
Training: 100%|██████████| 630/630 [01:54<00:00,  5.49it/s, loss=0.0881]


Training Loss: 0.21691958414915477


Validation: 100%|██████████| 35/35 [00:02<00:00, 16.76it/s]


Validation Loss: 0.04251538830056753
Epoch 2/3


Training: 100%|██████████| 630/630 [01:54<00:00,  5.50it/s, loss=0.0204]


Training Loss: 0.037057896147459926


Validation: 100%|██████████| 35/35 [00:02<00:00, 17.04it/s]


Validation Loss: 0.028153632702950356
Epoch 3/3


Training: 100%|██████████| 630/630 [01:54<00:00,  5.51it/s, loss=0.0161]


Training Loss: 0.025638019573751888


Validation: 100%|██████████| 35/35 [00:02<00:00, 17.13it/s]


Validation Loss: 0.025255575138466418


Evaluating: 100%|██████████| 38/38 [00:36<00:00,  1.05it/s]


BLEU Score for Salesforce/codet5-small: 0.968911900812081
ROUGE Scores for Salesforce/codet5-small: {'rouge1': 0.991518107686995, 'rouge2': 0.9908478976767303, 'rougeL': 0.9913725254197496, 'rougeLsum': 0.9916625626615034}

Comparison Results:

PLBART:
BLEU: 0.9564276700745586
ROUGE: {'rouge1': 0.9042146268510616, 'rouge2': 0.9039743931529767, 'rougeL': 0.9070110398827219, 'rougeLsum': 0.9054893853227471}

CodeT5:
BLEU: 0.968911900812081
ROUGE: {'rouge1': 0.991518107686995, 'rouge2': 0.9908478976767303, 'rougeL': 0.9913725254197496, 'rougeLsum': 0.9916625626615034}
