Build a robust retrieval augmented code model setup and learn along the way.
some scalability, hyperparameter search, tracking of training metrics
research and handcraft new model architecture components, extreme scalability, multimodal, sota scores
- specific: a model and code that run inference on one gpu and that i consider using day to day
- measurable: do i still like to use it after a week
- achievable: yes
- relevant: i think people would like to use foss code search and/or build on top of it as a backbone
- time-bound: be done in half a year
this experimental project investigates BERT trained on CodeSearchNet with the contrastive SimCSE setup.
the comparable model CodeBERT was trained for 160 GPU hours; because of the difficulty of implementing distributed computation, we could practically only train for a shorter period.
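as a rough illustration of the contrastive setup, the sketch below shows an in-batch InfoNCE-style loss over paired embeddings, where the matching pair is the positive and the rest of the batch acts as negatives. the function and argument names are illustrative only and are not taken from the repository's training code.

```python
# Minimal sketch of an in-batch contrastive (SimCSE/InfoNCE-style) loss.
# All names here are illustrative; the actual training code may differ.
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor,
                     code_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """query_emb, code_emb: [batch, dim] pooled encoder outputs of paired inputs."""
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(code_emb, dim=-1)
    # similarity of every query against every code snippet in the batch
    sim = q @ c.t() / temperature            # [batch, batch]
    labels = torch.arange(sim.size(0), device=sim.device)
    # the diagonal (true pair) should score highest; off-diagonals act as negatives
    return F.cross_entropy(sim, labels)

# usage sketch:
# loss = contrastive_loss(model(doc_batch), model(code_batch))
# loss.backward()
```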
our latest results can be seen in visualization/Screenshot from 2022-02-28 23-51-52.png
our previous graph, trained on just the python subset of CodeSearchNet, is visualization/Screenshot from 2022-02-17 11-54-51.png
- pipenv
- docker (amd gpu)
# install pipenv:
pip install pipenv
# activate the virtual environment:
pipenv shell
# install the needed packages in the virtual environment:
pipenv install
python -m secora.train configs/default.yml --progress --name distilroberta
sudo ./environments/container/run.sh
python -m secora.train configs/default.yml --progress --name distilroberta
tensorboard --logdir output
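the tensorboard command above reads event files that the training loop writes; below is a minimal sketch of such metric logging with torch.utils.tensorboard. the log directory and tag names are assumptions, not necessarily what the training script uses.

```python
# Sketch of logging training metrics for tensorboard; the directory and
# tag names are assumptions, not necessarily what the training script uses.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="output/distilroberta")
for step in range(100):
    loss = 1.0 / (step + 1)                 # placeholder value
    writer.add_scalar("train/loss", loss, global_step=step)
writer.close()
```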
mypy secora/train.py
# run all tests
pytest
# skip slow and cuda tests
pytest --fastonly --nocuda
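--fastonly and --nocuda are custom pytest options; one plausible way to register flags like these is via conftest.py hooks, sketched below. the marker names slow and cuda are assumptions about how the tests are tagged, not a description of the repository's actual conftest.

```python
# conftest.py sketch: how custom --fastonly / --nocuda flags could be wired up.
# The marker names "slow" and "cuda" are assumptions about how tests are tagged.
import pytest

def pytest_addoption(parser):
    parser.addoption("--fastonly", action="store_true", help="skip slow tests")
    parser.addoption("--nocuda", action="store_true", help="skip tests that need a gpu")

def pytest_collection_modifyitems(config, items):
    skip_slow = pytest.mark.skip(reason="skipped because of --fastonly")
    skip_cuda = pytest.mark.skip(reason="skipped because of --nocuda")
    for item in items:
        if config.getoption("--fastonly") and "slow" in item.keywords:
            item.add_marker(skip_slow)
        if config.getoption("--nocuda") and "cuda" in item.keywords:
            item.add_marker(skip_cuda)
```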
if build_container is true, the docker container image is saved to builddir. if you don't have enough ram/disk space, change the builddir path.
meson setup /tmp/builddir
# after some modifications you may also need
meson setup --reconfigure /tmp/builddir
# configure the build, see meson_options.txt
cd /tmp/builddir
meson configure -Dbuild_container=true
meson compile -v
i needed a general build tool: bash scripts are error-prone, and setuptools doesn't seem to work well with multi-language projects or build containers. meson has good syntax and features.
it is simpler to use the repl for research, and flat hierarchies are generally better.
ml training has high compute and time costs, so validation is key to delivering progress. validation is also required for reproducible and comparable research and ml models. i try to consider and explore each of the four parts of an ml "deliverable":
- dataset and domain insights
- model training metrics and methodologies
- ml coding strategies and patterns (throughout this repo)
- a runnable model for inference
validation happens through the ordered steps:
- dependency building / integration testing
- mypy type checking
- linting
- config checking (see the sketch after this list)
- running all tests, including a short training run
- packaging in docker
- gradually scale the runs
- monitoring metrics / hparam search; the hparam search is robust against spurious failed runs
- model evaluation/checklist
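for the config checking step, one lightweight approach is a fail-fast schema check over the yaml config before anything expensive starts. the sketch below is only illustrative; the required keys are assumptions and not the actual schema of configs/default.yml.

```python
# Sketch of a fail-fast config check before starting a long run.
# The required keys are assumptions, not the real schema of configs/default.yml.
import yaml

REQUIRED_KEYS = {"batch_size": int, "learning_rate": float, "epochs": int}

def check_config(path: str) -> dict:
    with open(path) as f:
        cfg = yaml.safe_load(f)
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in cfg:
            raise ValueError(f"missing config key: {key}")
        if not isinstance(cfg[key], expected_type):
            raise TypeError(f"{key} should be of type {expected_type.__name__}")
    return cfg

# usage sketch:
# cfg = check_config("configs/default.yml")
```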
important papers:
challenge:
implementation:
other background information and literature:
- Retrieval Augmented Coding
- benchmark
- robustness analysis
- collection of ai4code papers
- evaluation of large models for code
- unsupervised code retrieval
- comprehensive literature review of the field
- codedotAI, open organization
existing software:
the previous course submission can be found under the tag submission_1