<a href="https://colab.research.google.com/github/chamtgm/SM-Model-Comparison/blob/main/Testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Here's a breakdown of your process:

1. Loading Pre-trained Models: You are indeed starting with models that have been pre-trained on large datasets (often a mix of text and code, or primarily code for code-specific models). This pre-training allows the models to learn general language patterns, syntax, and possibly some coding structures.

2. Fine-tuning on Your Data: This is the crucial step. You are not simply evaluating the pre-trained models on your data out-of-the-box to compare them against their original benchmarks. Instead, you are using your C Train Data.jsonl dataset to fine-tune these pre-trained models. Fine-tuning adapts the pre-trained model's knowledge to your specific task (generating C code from natural language descriptions in your dataset) and your specific data distribution.

3. Evaluating the Fine-tuned Models: After fine-tuning each model on your training data, you are then evaluating its performance on your C Test Data.jsonl. The evaluation metrics you calculate (loss from the trainer, BLEU, CodeBLEU) are based on how well the fine-tuned model performs on your specific test set.

4. Comparing Fine-tuned Model Performance: The comparison you are doing is between the performance of different fine-tuned models on your test data, not comparing the fine-tuned models to the original pre-training benchmarks. The original benchmarks would be based on the datasets the models were initially trained on, which are likely different from your C Train Data.jsonl and C Test Data.jsonl.

Analogy:

Imagine a student who has learned general math principles (pre-training). You then give them specific practice problems in a particular area of math (your training data) and evaluate them on a test covering those specific problems (your test data). You are comparing how well different students (pre-trained models) learn and perform on your specific practice problems and test, not comparing their performance on your test to their scores on a standardized math exam they took before they started your practice problems (original pre-training benchmark).

In summary:

You are taking pre-trained models and adapting them to your specific code generation task by fine-tuning them on your data. Your comparison is based on the performance of these fine-tuned models on your test set. This is a standard and effective approach for leveraging the power of pre-trained models on custom tasks.

1. Using Input and Output of Train Data: During the trainer.train() phase, you are feeding the model pairs of text (the natural language input) and code (the desired code output) from your C Train Data.jsonl. The model receives the text as input and attempts to generate the corresponding code.

2. Comparing Model Output with Desired Output (Training): Internally, during training, the model's generated code is compared to the actual code in your training data. This comparison is typically done using a loss function (e.g., cross-entropy loss) that measures how different the model's output probability distribution for the next token is from the probability distribution of the true next token in the reference code.

3. Pre-trained Model Changes its Parameters: Based on the calculated loss, the model updates its internal parameters (weights and biases) through a process called backpropagation and optimization (controlled by the learning rate and optimizer in your Seq2SeqTrainingArguments). This is the learning or "fine-tuning" process. The model learns to adjust its parameters to minimize the loss, meaning it gets better at generating code that is closer to the desired output in your training data.

4. Using Input and Output of Test Data (Evaluation): During the trainer.evaluate() phase (and also when you manually generate predictions for BLEU/CodeBLEU), you are feeding the model the text inputs from your C Test Data.jsonl. The model generates code outputs based on what it learned during training.

5. Comparing Model Output with Desired Output (Evaluation): For evaluation, you compare the model's generated code outputs from the test data to the actual code in the test data using evaluation metrics (loss, BLEU, CodeBLEU, etc.). This evaluation measures how well the fine-tuned model generalizes to unseen data. The model's parameters are not changed during evaluation; it's a measurement of performance.

Simplified Flow:

Training: Input Text -> Pre-trained Model -> Generated Code (Predicted) -> Compare with Reference Code (Actual) -> Calculate Loss -> Adjust Model Parameters. Repeat over epochs.
Evaluation: Input Text -> Fine-tuned Model -> Generated Code (Predicted) -> Compare with Reference Code (Actual) -> Calculate Metrics. Model parameters are not adjusted.

Here's what the code in the test dataset does:

1. Provides the Correct Answer: For each natural language text description in your test dataset, the corresponding code is the correct and desired code output that the model should ideally generate.

2. Enables Performance Measurement: After your fine-tuned model generates code for the text inputs in the test set, the code in the test dataset is used to compare against the model's generated output. Evaluation metrics like BLEU, CodeBLEU, and potentially others (like exact match) quantify how similar the generated code is to the reference code.

Crucially, during both training and evaluation:


*   Training: The code in the training dataset is used as the target for the model to learn from. The model adjusts its parameters to minimize the difference between its generated output and this training code.
*   Evaluation: The code in the test dataset is only used to measure the performance of the trained model. The model's parameters are not updated based on the comparison with the test code. Using the test data for training would lead to data leakage and an overestimation of the model's true performance on unseen data.

Think of it like a student taking a test:

Training Data: The student studies examples with both problems (text) and solutions (code). They learn how to solve the problems by looking at the solutions.
Test Data: The student is given new problems (text) without the solutions (code). They solve the problems based on what they learned. Their answers are then compared to the correct solutions (code in the test data) to see how well they understood the material. The solutions in the test data are not used to help the student learn during the test; they are only for grading.

In [1]:
# Remove all previous installation attempts for torch and transformers
!pip uninstall -y torch torchvision transformers
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install transformers
!pip install langchain
!pip install langchain-huggingface
!pip install datasets
!pip install --upgrade transformers
!pip install nltk evaluate

Found existing installation: torch 2.6.0+cu124
Uninstalling torch-2.6.0+cu124:
  Successfully uninstalled torch-2.6.0+cu124
Found existing installation: torchvision 0.21.0+cu124
Uninstalling torchvision-0.21.0+cu124:
  Successfully uninstalled torchvision-0.21.0+cu124
Found existing installation: transformers 4.52.4
Uninstalling transformers-4.52.4:
  Successfully uninstalled transformers-4.52.4
Looking in indexes: https://download.pytorch.org/whl/cu118
Collecting torch
  Downloading https://download.pytorch.org/whl/cu118/torch-2.7.1%2Bcu118-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (28 kB)
Collecting torchvision
  Downloading https://download.pytorch.org/whl/cu118/torchvision-0.22.1%2Bcu118-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (6.1 kB)
Collecting sympy>=1.13.3 (from torch)
  Downloading https://download.pytorch.org/whl/sympy-1.13.3-py3-none-any.whl.metadata (12 kB)
Collecting nvidia-cuda-nvrtc-cu11==11.8.89 (from torch)
  Downloading https://download.pytorch.org/whl/cu1

In [2]:
import nltk
from nltk.translate.bleu_score import sentence_bleu
from evaluate import load #For CodeBLEU

nltk.download('punkt') #Word tokenization
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [3]:
import pandas as pd
from datasets import DatasetDict, Dataset

train_df = pd.read_json('/content/C Train Data.jsonl', lines = True)
test_df = pd.read_json('/content/C Test Data.jsonl', lines = True)

dataset = DatasetDict({
    "train": Dataset.from_pandas(train_df),
    "test": Dataset.from_pandas(test_df)
})

In [4]:
from transformers import AutoTokenizer
results = {}
all_metrics = {} #Store all evaluation metrics including custom ones

# Models used

# Model 1

In [5]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainer, Seq2SeqTrainingArguments
from nltk.translate.bleu_score import sentence_bleu

model_name = "Salesforce/codet5-small"
print(f"Training and Evaluating Model: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
def tokenize_function(example):
    return tokenizer(
        example["text"],
        text_target=example["code"],
        padding="max_length",
        truncation=True,
        max_length=512
    )
tokenized_datasets = dataset.map(tokenize_function, batched=True)
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    save_total_limit=2,
    predict_with_generate=True
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer
)
trainer.train()
#Perform evaluation using trainer.evaluate() for basic metrics
evaluation_results = trainer.evaluate()
print(f"Trainer Evaluation Results: {evaluation_results}")

Training and Evaluating Model: Salesforce/codet5-small


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/1.48k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/703k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/294k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/12.5k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.57k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/242M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

Map:   0%|          | 0/463 [00:00<?, ? examples/s]

Map:   0%|          | 0/51 [00:00<?, ? examples/s]

  trainer = Seq2SeqTrainer(


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mjcham240[0m ([33mjcham240-university-of-nottingham-malaysia[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch,Training Loss,Validation Loss
1,No log,0.632353
2,No log,0.537051
3,No log,0.515845




Trainer Evaluation Results: {'eval_loss': 0.5158452987670898, 'eval_runtime': 75.5462, 'eval_samples_per_second': 0.675, 'eval_steps_per_second': 0.093, 'epoch': 3.0}


In [8]:
#Generate predictions for custom metrics
predictions = trainer.predict(tokenized_datasets["test"])
predicted_token_ids = predictions.predictions
label_token_ids = predictions.label_ids
# Decode the token IDs back to text
predicted_code = tokenizer.batch_decode(predicted_token_ids, skip_special_tokens=True)
reference_code = tokenizer.batch_decode(label_token_ids, skip_special_tokens=True)

# Calculate BLEU Score
# BLEU requires tokenized sentences
reference_code_tokenized = [nltk.word_tokenize(code) for code in reference_code]
predicted_code_tokenized = [nltk.word_tokenize(code) for code in predicted_code]
# BLEU is typically calculated for each sentence and averaged, or as a corpus BLEU
# Let's calculate a corpus BLEU for simplicity here
# Note: Corpus BLEU is often preferred for model evaluation
# For corpus_bleu, references should be a list of lists of tokens
# where each inner list is a single reference sentence's tokens.
# Since we have one reference per prediction, it's a list of lists with one list inside
corpus_reference = [[tokens] for tokens in reference_code_tokenized]
corpus_candidate = predicted_code_tokenized
try:
    bleu_score = nltk.translate.bleu_score.corpus_bleu(corpus_reference, corpus_candidate)
    print(f"Corpus BLEU Score: {bleu_score}")
    evaluation_results['corpus_bleu'] = bleu_score
except ZeroDivisionError:
    print("Could not calculate BLEU score (likely due to zero n-grams).")
    evaluation_results['corpus_bleu'] = 0.0
# Calculate CodeBLEU Score
# The 'evaluate' library has a metric for CodeBLEU
# You'll need to make sure your data format is compatible
try:
    # List available metrics to confirm the name
    from evaluate.list import list_metrics
    available_metrics = list_metrics()
    print(f"Available metrics: {available_metrics}")

    if 'codebleu' in available_metrics:
        codebleu = load("codebleu")
        # The CodeBLEU metric expects references and predictions as lists of strings
        codebleu_results = codebleu.compute(references=reference_code, predictions=predicted_code)
        print(f"CodeBLEU Results: {codebleu_results}")
        evaluation_results.update(codebleu_results) # Add CodeBLEU components to results
    else:
        print("CodeBLEU metric not found in available metrics.")
        # Add placeholder if metric is not found
        evaluation_results['codebleu_weighted_ngram_match'] = 0.0
        evaluation_results['codebleu_syntax_match'] = 0.0
        evaluation_results['codebleu_dataflow_match'] = 0.0
        evaluation_results['codebleu'] = 0.0

except Exception as e:
    print(f"Could not calculate CodeBLEU score: {e}")
    # Add a placeholder if calculation fails for other reasons
    evaluation_results['codebleu_weighted_ngram_match'] = 0.0
    evaluation_results['codebleu_syntax_match'] = 0.0
    evaluation_results['codebleu_dataflow_match'] = 0.0
    evaluation_results['codebleu'] = 0.0


# Store all metrics for comparison
all_metrics[model_name] = evaluation_results



Corpus BLEU Score: 5.635698075798914e-22
Could not calculate CodeBLEU score: cannot import name 'list_metrics' from 'evaluate' (/usr/local/lib/python3.11/dist-packages/evaluate/__init__.py)

Comparison Results (including custom metrics)
Salesforce/codet5-small: {'eval_loss': 0.5158452987670898, 'eval_runtime': 75.5462, 'eval_samples_per_second': 0.675, 'eval_steps_per_second': 0.093, 'epoch': 3.0, 'corpus_bleu': 5.635698075798914e-22, 'codebleu': 0.0, 'codebleu_weighted_ngram_match': 0.0, 'codebleu_syntax_match': 0.0, 'codebleu_dataflow_match': 0.0}
deepseek-ai/deepseek-coder-1.3b-base: {'eval_loss': 0.5158452987670898, 'eval_runtime': 75.5462, 'eval_samples_per_second': 0.675, 'eval_steps_per_second': 0.093, 'epoch': 3.0, 'corpus_bleu': 5.635698075798914e-22, 'codebleu': 0.0, 'codebleu_weighted_ngram_match': 0.0, 'codebleu_syntax_match': 0.0, 'codebleu_dataflow_match': 0.0}


# Model 2

In [7]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from datasets import DatasetDict, Dataset

model_name = "deepseek-ai/deepseek-coder-1.3b-base"
print(f"Training and Evaluating Model: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
def tokenize_function(example):
    return tokenizer(
        example["text"],
        text_target=example["code"],
        padding="max_length",
        truncation=True,
        max_length=512
    )
tokenized_datasets = dataset.map(tokenize_function, batched=True)
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    save_total_limit=2,
    predict_with_generate=True
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer
)
trainer.train()
#Perform evaluation using trainer.evaluate() for basic metrics
evaluation_results = trainer.evaluate()
print(f"Trainer Evaluation Results: {evaluation_results}")

Training and Evaluating Model: deepseek-ai/deepseek-coder-1.3b-base


tokenizer_config.json:   0%|          | 0.00/793 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.37M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/482 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/631 [00:00<?, ?B/s]

ValueError: Unrecognized configuration class <class 'transformers.models.llama.configuration_llama.LlamaConfig'> for this kind of AutoModel: AutoModelForSeq2SeqLM.
Model type should be one of BartConfig, BigBirdPegasusConfig, BlenderbotConfig, BlenderbotSmallConfig, EncoderDecoderConfig, FSMTConfig, GPTSanJapaneseConfig, GraniteSpeechConfig, LEDConfig, LongT5Config, M2M100Config, MarianConfig, MBartConfig, MT5Config, MvpConfig, NllbMoeConfig, PegasusConfig, PegasusXConfig, PLBartConfig, ProphetNetConfig, Qwen2AudioConfig, SeamlessM4TConfig, SeamlessM4Tv2Config, SwitchTransformersConfig, T5Config, UMT5Config, XLMProphetNetConfig.

# Model 3

In [None]:
model_name = "Salesforce/codegen-350M-mono"
print(f"Training and Evaluating Model: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
def tokenize_function(example):
    return tokenizer(
        example["text"],
        text_target=example["code"],
        padding="max_length",
        truncation=True,
        max_length=512
    )
tokenized_datasets = dataset.map(tokenize_function, batched=True)
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    save_total_limit=2,
    predict_with_generate=True
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer
)
trainer.train()
#Perform evaluation using trainer.evaluate() for basic metrics
evaluation_results = trainer.evaluate()
print(f"Trainer Evaluation Results: {evaluation_results}")

# Model 4

In [None]:
model_name = "mesolitica/mallam-1.1B-4096"
print(f"Training and Evaluating Model: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
def tokenize_function(example):
    return tokenizer(
        example["text"],
        text_target=example["code"],
        padding="max_length",
        truncation=True,
        max_length=512
    )
tokenized_datasets = dataset.map(tokenize_function, batched=True)
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    save_total_limit=2,
    predict_with_generate=True
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer
)
trainer.train()
#Perform evaluation using trainer.evaluate() for basic metrics
evaluation_results = trainer.evaluate()
print(f"Trainer Evaluation Results: {evaluation_results}")

# Calculation for all models

In [None]:
for model_name in model_names:
    #Generate predictions for custom metrics
  predictions = trainer.predict(tokenized_datasets["test"])
  predicted_token_ids = predictions.predictions
  label_token_ids = predictions.label_ids

  # Decode the token IDs back to text
  predicted_code = tokenizer.batch_decode(predicted_token_ids, skip_special_tokens=True)
  reference_code = tokenizer.batch_decode(label_token_ids, skip_special_tokens=True)

# Calculate BLEU Score
  # BLEU requires tokenized sentences
  reference_code_tokenized = [nltk.word_tokenize(code) for code in reference_code]
  predicted_code_tokenized = [nltk.word_tokenize(code) for code in predicted_code]

  # BLEU is typically calculated for each sentence and averaged, or as a corpus BLEU
  # Let's calculate a corpus BLEU for simplicity here
  # Note: Corpus BLEU is often preferred for model evaluation
  # For corpus_bleu, references should be a list of lists of tokens
  # where each inner list is a single reference sentence's tokens.
  # Since we have one reference per prediction, it's a list of lists with one list inside
  corpus_reference = [[tokens] for tokens in reference_code_tokenized]
  corpus_candidate = predicted_code_tokenized

  try:
      bleu_score = nltk.translate.bleu_score.corpus_bleu(corpus_reference, corpus_candidate)
      print(f"Corpus BLEU Score: {bleu_score}")
      evaluation_results['corpus_bleu'] = bleu_score
  except ZeroDivisionError:
      print("Could not calculate BLEU score (likely due to zero n-grams).")
      evaluation_results['corpus_bleu'] = 0.0


  # Calculate CodeBLEU Score
  # The 'evaluate' library has a metric for CodeBLEU
  # You'll need to make sure your data format is compatible
  try:
      codebleu = load("codebleu")
      # The CodeBLEU metric expects references and predictions as lists of strings
      codebleu_results = codebleu.compute(references=reference_code, predictions=predicted_code)
      print(f"CodeBLEU Results: {codebleu_results}")
      evaluation_results.update(codebleu_results) # Add CodeBLEU components to results
  except Exception as e:
      print(f"Could not calculate CodeBLEU score: {e}")
      evaluation_results['codebleu'] = 0.0 # Add a placeholder if calculation fails


  # Store all metrics for comparison
  all_metrics[model_name] = evaluation_results


In [None]:
print("\nComparison Results (including custom metrics)")
for model, metrics in all_metrics.items():
    print(f"{model}: {metrics}")