We release the code used for pre-training the CodeSSM model. For finetuning, the code released with the respective benchmarks can be used.
The code and the associated trained weights are intended for research purposes only, with the aim of reproducing the results of this work or extending and improving it.
The CodeSSM model is based on BiGS. The code in the models directory is taken from the BiGS repository, with changes made by us for different model architectures, such as removing positional embeddings.
For pre-training, run one of the multi_train_*.py files.
- multi_train_mlm.py : Pre-train CodeSSM on the CSN dataset
- multi_train_sc_code.py : Pre-train CodeSSM on a subset of the StarCoder dataset (see creating StarCoder subset)
- multi_train_bert.py : Pre-train BertCoder on the CSN dataset
These files expect the following arguments:
- config : YAML file with the training configuration. The config files are in the configs directory. Add the appropriate save_path and filename. When resuming training from a saved checkpoint, also add the load_path and set resume to true.
  - filename: the name of the saved weight file.
- weight (only for multi_train_mlm.py) : The weight file for initialization. Download the PyTorch weight file.
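The setup described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the repository's own tooling: the helper names prepare_config and build_pretrain_command are hypothetical, the config keys are assumed to be top-level entries of the YAML file, and the --config and --weight flag spellings are assumptions that should be checked against the argument parser in the multi_train_*.py scripts.

```python
import sys
from typing import Optional


def prepare_config(config: dict, save_path: str, filename: str,
                   load_path: Optional[str] = None) -> dict:
    """Return a copy of a training config with the output fields filled in.

    ASSUMPTION: save_path, filename, load_path and resume are top-level keys.
    """
    updated = dict(config)
    updated["save_path"] = save_path
    updated["filename"] = filename
    if load_path is not None:
        # Resuming from a checkpoint needs both load_path and resume.
        updated["load_path"] = load_path
        updated["resume"] = True
    return updated


def build_pretrain_command(script: str, config_file: str,
                           weight_file: Optional[str] = None) -> list:
    """Build an argument list for one of the multi_train_*.py scripts.

    ASSUMPTION: the arguments are passed as --config and --weight flags.
    """
    command = [sys.executable, script, "--config", config_file]
    if weight_file is not None:
        # The weight argument is documented only for multi_train_mlm.py.
        if script != "multi_train_mlm.py":
            raise ValueError("weight is only accepted by multi_train_mlm.py")
        command += ["--weight", weight_file]
    return command
```

The returned list can be passed to subprocess.run once the flag names have been confirmed. Reading and writing the YAML file itself needs a YAML parser, for example yaml.safe_load and yaml.safe_dump from PyYAML.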
Use the create_sc_code_subset_dataset.ipynb notebook to generate the subset dataset. Using the StarCoder dataset requires a HuggingFace account and accepting the "Terms and Conditions" of the authors of that work.
The StarCoder dataset is very large and requires a large amount of storage. The notebook downloads data for only 8 languages, but even this subset requires substantial storage.
- By default, the models are trained with a context length of 2048.
- For a context length of 256, set max_position_embeddings to 256 in the config file.
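As a small illustration of the context-length setting, the sketch below derives a short-context variant from an already loaded config. The helper name with_context_length is hypothetical, and it assumes max_position_embeddings is a top-level key of the config; only the key name and the values 2048 and 256 come from the notes above.

```python
def with_context_length(config: dict, context_length: int = 2048) -> dict:
    """Return a copy of the config with the given context length.

    ASSUMPTION: max_position_embeddings is a top-level key of the config.
    """
    if context_length <= 0:
        raise ValueError("context_length must be a positive integer")
    updated = dict(config)  # leave the original config untouched
    updated["max_position_embeddings"] = context_length
    return updated


# Derive a 256-token variant from a default-length config.
default_config = with_context_length({})
short_config = with_context_length(default_config, 256)
```

Copying the config instead of modifying it in place makes it easy to keep the 2048 and 256 variants side by side.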
For finetuning, code released from the following benchmarks can be used or easily adapted:
- NLCodeSearch, Type inference, Clone Detection, Devign: CodeXGlue.
- Complexity Prediction: CodeSage.
- DiverseVul: Defect detection code of CodeSage, adapted to DiverseVul by changing the dataset creation code.
- SQA: NLCodeSearch code from CodeXGlue can be easily adapted to be used with SQA.
Pretrained weights are available here: https://figshare.com/s/14238287e9078f92cd50.