Build a robust retrieval augmented code model setup and learn along the way.
some scalability, hyperparameter search, tracking of training metrics
research and handcraft new model architecture components, extreme scalability, multimodal, sota scores
- specific: a model and code that run inference on one gpu and that i consider using day to day
- measurable: do i still like to use it after a week
- achievable: yes
- relevant: i think people would like to use foss code search and/or build on top of it as a backbone
- time-bound: be done in half a year
this experimental project investigates BERT trained on CodeSearchNet with the contrastive SimCSE setup.
the comparable model CodeBERT was trained for 160 GPU hours; because of the difficulty of implementing distributed computation, we could practically only train for a shorter period.
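as a rough illustration of the contrastive setup, the sketch below shows an in-batch InfoNCE-style loss over paired embeddings, where the matching pair is the positive and the rest of the batch acts as negatives. the function and argument names are illustrative only and are not taken from the repository's training code.

```python
# Minimal sketch of an in-batch contrastive (SimCSE/InfoNCE-style) loss.
# All names here are illustrative; the actual training code may differ.
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor,
                     code_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """query_emb, code_emb: [batch, dim] pooled encoder outputs of paired inputs."""
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(code_emb, dim=-1)
    # similarity of every query against every code snippet in the batch
    sim = q @ c.t() / temperature            # [batch, batch]
    labels = torch.arange(sim.size(0), device=sim.device)
    # the diagonal (true pair) should score highest; off-diagonals act as negatives
    return F.cross_entropy(sim, labels)

# usage sketch:
# loss = contrastive_loss(model(doc_batch), model(code_batch))
# loss.backward()
```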
our latest results can be seen in visualization/Screenshot from 2022-02-28 23-51-52.png
our previous graph, trained on just the python subset of CodeSearchNet, is visualization/Screenshot from 2022-02-17 11-54-51.png
- pipenv
- docker (amd gpu)
# install pipenv:
pip install pipenv
# activate the virtual environment:
pipenv shell
# install the needed packages in the virtual environment:
pipenv install
python -m secora.train configs/default.yml --progress --name distilroberta
sudo ./environments/container/run.sh
python -m secora.train configs/default.yml --progress --name distilroberta
tensorboard --logdir output
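the tensorboard command above reads event files that the training loop writes; below is a minimal sketch of such metric logging with torch.utils.tensorboard. the log directory and tag names are assumptions, not necessarily what the training script uses.

```python
# Sketch of logging training metrics for tensorboard; the directory and
# tag names are assumptions, not necessarily what the training script uses.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="output/distilroberta")
for step in range(100):
    loss = 1.0 / (step + 1)                 # placeholder value
    writer.add_scalar("train/loss", loss, global_step=step)
writer.close()
```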
mypy secora/train.py
# run all tests
pytest
# skip slow and cuda tests
pytest --fastonly --nocuda
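--fastonly and --nocuda are custom pytest options; one plausible way to register flags like these is via conftest.py hooks, sketched below. the marker names slow and cuda are assumptions about how the tests are tagged, not a description of the repository's actual conftest.

```python
# conftest.py sketch: how custom --fastonly / --nocuda flags could be wired up.
# The marker names "slow" and "cuda" are assumptions about how tests are tagged.
import pytest

def pytest_addoption(parser):
    parser.addoption("--fastonly", action="store_true", help="skip slow tests")
    parser.addoption("--nocuda", action="store_true", help="skip tests that need a gpu")

def pytest_collection_modifyitems(config, items):
    skip_slow = pytest.mark.skip(reason="skipped because of --fastonly")
    skip_cuda = pytest.mark.skip(reason="skipped because of --nocuda")
    for item in items:
        if config.getoption("--fastonly") and "slow" in item.keywords:
            item.add_marker(skip_slow)
        if config.getoption("--nocuda") and "cuda" in item.keywords:
            item.add_marker(skip_cuda)
```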
if build_container is true, the docker container image is saved to builddir. if you don't have enough ram/disk space, change the builddir path.
meson setup /tmp/builddir
# after some modifications you may also need
meson setup --reconfigure /tmp/builddir
# configure the build, see meson_options.txt
cd /tmp/builddir
meson configure -Dbuild_container=true
meson compile -v
i needed a general build tool: bash scripts are error-prone, and setuptools doesn't seem to work well with multi-language projects or build containers. meson has good syntax and features.
it is simpler to use the repl for research, and flat hierarchies are generally better.
ml training has high compute and time costs, so validation is key to delivering progress. validation is also required for reproducible and comparable research and ml models. i try to consider and explore each of the four parts of an ml "deliverable":
- dataset and domain insights
- model training metrics and methodologies
- ml coding strategies and patterns (throughout this repo)
- a runnable model for inference
validation happens through the ordered steps:
- dependency building / integration testing
- mypy type checking
- linting
- config checking (see the sketch after this list)
- running all tests, including a short training run
- packaging in docker
- gradually scale the runs
- monitoring metrics / hparam search; the hparam search is robust against spurious failed runs
- model evaluation/checklist
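for the config checking step, one lightweight approach is a fail-fast schema check over the yaml config before anything expensive starts. the sketch below is only illustrative; the required keys are assumptions and not the actual schema of configs/default.yml.

```python
# Sketch of a fail-fast config check before starting a long run.
# The required keys are assumptions, not the real schema of configs/default.yml.
import yaml

REQUIRED_KEYS = {"batch_size": int, "learning_rate": float, "epochs": int}

def check_config(path: str) -> dict:
    with open(path) as f:
        cfg = yaml.safe_load(f)
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in cfg:
            raise ValueError(f"missing config key: {key}")
        if not isinstance(cfg[key], expected_type):
            raise TypeError(f"{key} should be of type {expected_type.__name__}")
    return cfg

# usage sketch:
# cfg = check_config("configs/default.yml")
```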
important papers:
challenge:
implementation:
other background information and literature:
- Retrieval Augmented Coding
- benchmark
- robustness analysis
- collection of ai4code papers
- evaluation of large models for code
- unsupervised code retrieval
- comprehensive literature review of the field
- codedotAI, open organization
existing software:
the previous course submission can be found under the tag submission_1