This project fine-tunes Salesforce/codet5-base on the CodeXGLUE CONCODE text-to-code dataset and reports Exact Match, BLEU, and learning curves.
This repository tracks the source code, tests, requirements, and final report. Raw CONCODE files, virtual environments, model weights, and training checkpoints are intentionally excluded because they are generated artifacts and exceed normal GitHub file-size limits.
| Split | Exact Match | BLEU |
|---|---|---|
| Dev | 19.05 | 34.15 |
| Test | 21.65 | 37.18 |
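Exact Match is the percentage of generated programs that are identical to the reference after whitespace normalization. A minimal sketch of the metric (the official CodeXGLUE scorer may normalize differently, so treat this as illustrative):

```python
def exact_match(predictions, references):
    """Percentage of predictions identical to their reference.

    Comparison is whitespace-normalized; this is a simplified stand-in
    for the official CodeXGLUE scoring script.
    """
    assert len(predictions) == len(references)
    hits = sum(
        " ".join(p.split()) == " ".join(r.split())
        for p, r in zip(predictions, references)
    )
    return 100.0 * hits / len(predictions)
```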
The final coursework report is available at `report/concode_codet5_report.md`. Training logs are retained under `report/`:
- `training_log.md`: readable training configuration, validation logs, and loss logs
- `training_log.csv`: exported `TrainerState.log_history`
- `trainer_state.json`: raw Hugging Face `Trainer` state
- `training_config.json`, `eval_dev_metrics.json`, `eval_test_metrics.json`: reproducibility artifacts
The best checkpoint is included under `models/codet5-base-concode-best/`. The large `model.safetensors` file is tracked by Git LFS because, at about 851 MB, it exceeds GitHub's normal 100 MB file limit.
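If `git lfs pull` is skipped, the checkout contains a small text pointer instead of the real weights, and loading the model fails with a confusing error. A quick sanity check (the first line of a pointer file is Git LFS's standard spec identifier):

```python
from pathlib import Path

def is_lfs_pointer(path):
    """Return True if `path` is a Git LFS pointer file rather than real data.

    LFS pointers are tiny text files whose first line names the LFS spec;
    actual model weights are hundreds of megabytes.
    """
    p = Path(path)
    if p.stat().st_size > 1024:  # real weights are far larger than any pointer
        return False
    head = p.read_bytes()[:100]
    return head.startswith(b"version https://git-lfs.github.com/spec/")
```

Run it against `models/codet5-base-concode-best/model.safetensors` before loading the model if in doubt.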
After cloning:
```shell
git lfs install
git lfs pull
```

Then load the model with:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_dir = "models/codet5-base-concode-best"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)
```

Conda:
```shell
conda create -n concode-codegen python=3.12 -y
conda activate concode-codegen
pip install -r requirements.txt
```

Venv:
```shell
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -r requirements.txt
```

If you need a CUDA-specific PyTorch build, install PyTorch >=2.6.0,<2.8 from the official PyTorch index first, then run `pip install -r requirements.txt`. This version floor is required because recent transformers releases refuse to load legacy `.bin` checkpoints with older PyTorch versions as mitigation for CVE-2025-32434.
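The >=2.6.0,<2.8 window can be checked at runtime before loading any checkpoint. A minimal sketch using plain string comparison (not a full PEP 440 parser, so pre-release versions may need extra handling):

```python
def torch_in_supported_range(version):
    """Check a torch version string against the >=2.6.0,<2.8 window.

    Local build suffixes (e.g. "2.7.1+cu121") are stripped first;
    this is a simplified comparison for illustration only.
    """
    core = version.split("+")[0]
    parts = tuple(int(x) for x in core.split(".")[:2])
    return (2, 6) <= parts < (2, 8)
```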
The pinned Hugging Face range is intentional: transformers currently requires `huggingface-hub<1.0`, so `requirements.txt` pins `huggingface-hub>=0.34.0,<1.0`.
If huggingface.co is slow or unavailable, use the Hugging Face mirror:
```shell
export HF_ENDPOINT=https://hf-mirror.com
python scripts/download_data.py --output_dir data/concode
```

Expected files:

- `data/concode/train.jsonl`
- `data/concode/dev.jsonl`
- `data/concode/test.jsonl`
Each JSONL record must contain `nl` and `code` fields.
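A quick sanity check over the downloaded files (field names from the schema above; only the standard library is needed):

```python
import json

def validate_jsonl(path, required=("nl", "code")):
    """Yield (line_number, missing_fields) for every malformed record."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            record = json.loads(line)
            missing = [k for k in required if k not in record]
            if missing:
                yield i, missing
```

An empty result from `list(validate_jsonl("data/concode/train.jsonl"))` means every record carries both fields.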
Run the test suite and a small smoke-training pass before launching a full run:
```shell
python -m pytest -q

HF_ENDPOINT=https://hf-mirror.com python scripts/train.py \
  --model_name Salesforce/codet5-small \
  --train_file data/concode/train.jsonl \
  --validation_file data/concode/dev.jsonl \
  --output_dir outputs/smoke \
  --max_train_samples 8 \
  --max_eval_samples 4 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1
```

The reported run used GPUs 0, 2, and 5:
```shell
HF_ENDPOINT=https://hf-mirror.com CUDA_VISIBLE_DEVICES=0,2,5 torchrun --nproc_per_node=3 scripts/train.py \
  --model_name Salesforce/codet5-base \
  --train_file data/concode/train.jsonl \
  --validation_file data/concode/dev.jsonl \
  --output_dir outputs/codet5-base-concode \
  --num_train_epochs 5 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16 \
  --gradient_accumulation_steps 2 \
  --learning_rate 5e-5 \
  --source_max_length 256 \
  --target_max_length 256 \
  --generation_max_length 256 \
  --generation_num_beams 5 \
  --logging_steps 50 \
  --eval_steps 1000 \
  --save_steps 1000 \
  --save_total_limit 3 \
  --bf16
```

If a GPU does not support bf16, replace `--bf16` with `--fp16` on hardware where fp16 training is stable, or remove mixed precision entirely.
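With 3 GPUs, a per-device batch of 16, and 2 gradient-accumulation steps, the optimizer sees an effective global batch of 96 examples per update. The arithmetic, as a small helper:

```python
def effective_batch_size(n_gpus, per_device_batch, grad_accum_steps):
    """Global batch size seen by the optimizer per update step."""
    return n_gpus * per_device_batch * grad_accum_steps

# The reported run: 3 GPUs x 16 per device x 2 accumulation steps = 96.
```

Keeping this product constant is the usual way to preserve training dynamics when changing GPU count or per-device batch size.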
```shell
HF_ENDPOINT=https://hf-mirror.com python scripts/evaluate.py \
  --model_dir outputs/codet5-base-concode \
  --split dev \
  --output_dir outputs/codet5-base-concode

HF_ENDPOINT=https://hf-mirror.com python scripts/evaluate.py \
  --model_dir outputs/codet5-base-concode \
  --split test \
  --output_dir outputs/codet5-base-concode
```

Outputs:

- `outputs/codet5-base-concode/eval_dev_metrics.json`
- `outputs/codet5-base-concode/eval_test_metrics.json`
- `outputs/codet5-base-concode/predictions_dev.jsonl`
- `outputs/codet5-base-concode/predictions_test.jsonl`
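The metrics files are plain JSON and the predictions are JSONL, so spot-checking results needs only the standard library. A sketch (the exact keys inside each record are an assumption; adjust them to whatever `scripts/evaluate.py` actually emits):

```python
import json

def load_metrics(path):
    """Read an eval_*_metrics.json file produced by the evaluation step."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def first_predictions(path, n=3):
    """Return the first n records from a predictions_*.jsonl file."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            records.append(json.loads(line))
            if len(records) >= n:
                break
    return records
```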
```shell
python scripts/plot_learning_curve.py \
  --trainer_state outputs/codet5-base-concode/trainer_state.json \
  --output report/learning_curve.png

python scripts/make_report.py \
  --output report/concode_codet5_report.md \
  --dev_metrics outputs/codet5-base-concode/eval_dev_metrics.json \
  --test_metrics outputs/codet5-base-concode/eval_test_metrics.json \
  --learning_curve learning_curve.png
```

The report includes the model architecture diagram, metric table, and learning curve section.