
CONCODE Code Generation With CodeT5

This project fine-tunes Salesforce/codet5-base on the CodeXGLUE CONCODE text-to-code dataset and reports Exact Match, BLEU, and learning curves.

This repository tracks the source code, tests, requirements, and final report. Raw CONCODE files, virtual environments, and intermediate training checkpoints are intentionally excluded: they are either reproducible from the scripts here or exceed GitHub's normal file-size limits. The one exception is the best model checkpoint, which is tracked with Git LFS (see below).

Final Result

Split   Exact Match   BLEU
Dev     19.05         34.15
Test    21.65         37.18
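
For context, Exact Match is the percentage of generated programs that exactly match the reference. A minimal sketch of the idea; note that the official CodeXGLUE scorer may apply its own tokenization or normalization before comparing:

def exact_match(predictions, references):
    # Percentage of predictions identical to their references.
    # The CodeXGLUE scorer may tokenize/normalize before comparing.
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)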

The final coursework report is available at report/concode_codet5_report.md. Training logs are retained under report/:

  • training_log.md: readable training configuration, validation logs, and loss logs
  • training_log.csv: exported TrainerState.log_history (see the loading sketch after this list)
  • trainer_state.json: raw Hugging Face Trainer state
  • training_config.json, eval_dev_metrics.json, eval_test_metrics.json: reproducibility artifacts
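
For a quick look at the training history without regenerating plots, training_log.csv can be loaded directly. A minimal sketch, assuming the export keeps the step and loss keys from log_history (the actual column names in this export may differ):

import pandas as pd

# training_log.csv is an export of TrainerState.log_history; the column
# names used below (step, loss) are assumptions and may differ here.
df = pd.read_csv("report/training_log.csv")
train = df.dropna(subset=["loss"])     # keep rows that carry a training loss
print(train[["step", "loss"]].tail())  # last few logged training losses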

Best Model Weights

The best checkpoint is included under models/codet5-base-concode-best/. The large model.safetensors file is tracked by Git LFS because it is about 851 MB and exceeds GitHub's normal 100 MB file limit.

After cloning:

git lfs install
git lfs pull

Then load the model with:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_dir = "models/codet5-base-concode-best"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)
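
As a quick check, generation follows the standard Hugging Face seq2seq API. The prompt below is purely illustrative (in CONCODE, the nl input typically also embeds class context), and the generation settings mirror the flags used for training:

nl = "returns the maximum element of the array"  # illustrative prompt only
inputs = tokenizer(nl, return_tensors="pt")
outputs = model.generate(**inputs, max_length=256, num_beams=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))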

Environment

Conda:

conda create -n concode-codegen python=3.12 -y
conda activate concode-codegen
pip install -r requirements.txt

Venv:

python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -r requirements.txt

If you need a CUDA-specific PyTorch build, install PyTorch >=2.6.0,<2.8 from the official PyTorch index first, then run pip install -r requirements.txt. The version floor matters because recent transformers releases refuse to load legacy .bin checkpoints under older PyTorch versions, as a mitigation for CVE-2025-32434.
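
For example, on a CUDA 12.6 machine (the index URL is CUDA-version specific; adjust it to match your driver):

pip install "torch>=2.6.0,<2.8" --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt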

The pinned Hugging Face range is intentional: transformers currently requires huggingface-hub<1.0, so requirements.txt pins huggingface-hub>=0.34.0,<1.0.

If huggingface.co is slow or unavailable, use the Hugging Face mirror:

export HF_ENDPOINT=https://hf-mirror.com

Data

python scripts/download_data.py --output_dir data/concode

Expected files:

  • data/concode/train.jsonl
  • data/concode/dev.jsonl
  • data/concode/test.jsonl

Each JSONL record must contain nl and code.
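
An illustrative record (the values are made up; in the real dataset, the nl string typically concatenates the description with surrounding class context):

{"nl": "returns the sum of two integers", "code": "int add(int a, int b) { return a + b; }"}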

Smoke Test

Use this before launching a full run:

python -m pytest -q
HF_ENDPOINT=https://hf-mirror.com python scripts/train.py \
  --model_name Salesforce/codet5-small \
  --train_file data/concode/train.jsonl \
  --validation_file data/concode/dev.jsonl \
  --output_dir outputs/smoke \
  --max_train_samples 8 \
  --max_eval_samples 4 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1

Multi-GPU Training

The reported run used GPUs 0,2,5:

HF_ENDPOINT=https://hf-mirror.com CUDA_VISIBLE_DEVICES=0,2,5 torchrun --nproc_per_node=3 scripts/train.py \
  --model_name Salesforce/codet5-base \
  --train_file data/concode/train.jsonl \
  --validation_file data/concode/dev.jsonl \
  --output_dir outputs/codet5-base-concode \
  --num_train_epochs 5 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16 \
  --gradient_accumulation_steps 2 \
  --learning_rate 5e-5 \
  --source_max_length 256 \
  --target_max_length 256 \
  --generation_max_length 256 \
  --generation_num_beams 5 \
  --logging_steps 50 \
  --eval_steps 1000 \
  --save_steps 1000 \
  --save_total_limit 3 \
  --bf16
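
With 3 GPUs, a per-device batch of 16, and gradient accumulation of 2, the effective global batch size is 3 × 16 × 2 = 96.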

If a GPU does not support bf16, replace --bf16 with --fp16 on hardware with stable fp16 support, or remove the mixed-precision flag entirely.

Evaluation

HF_ENDPOINT=https://hf-mirror.com python scripts/evaluate.py \
  --model_dir outputs/codet5-base-concode \
  --split dev \
  --output_dir outputs/codet5-base-concode

HF_ENDPOINT=https://hf-mirror.com python scripts/evaluate.py \
  --model_dir outputs/codet5-base-concode \
  --split test \
  --output_dir outputs/codet5-base-concode

Outputs:

  • outputs/codet5-base-concode/eval_dev_metrics.json
  • outputs/codet5-base-concode/eval_test_metrics.json
  • outputs/codet5-base-concode/predictions_dev.jsonl (see the inspection sketch after this list)
  • outputs/codet5-base-concode/predictions_test.jsonl
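
To spot-check a few generations, the prediction files can be read as plain JSONL. A minimal sketch; the field names inside each record depend on scripts/evaluate.py's output schema and are not assumed here:

import json

# Print the first three records of the dev predictions; the keys inside
# each record depend on scripts/evaluate.py's output schema.
with open("outputs/codet5-base-concode/predictions_dev.jsonl") as f:
    for i, line in enumerate(f):
        print(json.loads(line))
        if i == 2:
            break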

Learning Curve And Report

python scripts/plot_learning_curve.py \
  --trainer_state outputs/codet5-base-concode/trainer_state.json \
  --output report/learning_curve.png

python scripts/make_report.py \
  --output report/concode_codet5_report.md \
  --dev_metrics outputs/codet5-base-concode/eval_dev_metrics.json \
  --test_metrics outputs/codet5-base-concode/eval_test_metrics.json \
  --learning_curve learning_curve.png

The report includes the model architecture diagram, metric table, and learning curve section.
