abx04/CodeSSM

We release the code used for pre-training the CodeSSM model. For finetuning, the code released with the respective benchmarks can be used.

The code and associated trained weights are intended for research purposes only, with the aim of reproducing the results of this work or further extending/improving it.

The CodeSSM model is based on BiGS. The code in the models directory is taken from the BiGS repository, with our changes for the different model architectures, such as removing positional embeddings.

For pre-training, run one of the multi_train_*.py files.

Different pre-training files

  • multi_train_mlm.py : Pre-train CodeSSM on the CSN dataset
  • multi_train_sc_code.py : Pre-train CodeSSM on a subset of the StarCoder dataset (see Creating StarCoder subset below)
  • multi_train_bert.py : Pre-train BertCoder on the CSN dataset

These files expect the following arguments:

  • config : YAML file with the training configuration. The config files are in the configs directory. Add the appropriate save_path and filename. When resuming training from a saved checkpoint, also add the load_path and set resume to true (a hypothetical example config is sketched after this list).
  • filename : The name of the saved weight file.
  • weight (only for multi_train_mlm.py) : The weight file used for initialization. Download the PyTorch weight file.
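
What exactly goes into the config, and how the scripts are invoked, is defined by the files in the configs directory and by each script's argument parser. The sketch below is a hypothetical example, not copied from the repository: only save_path, filename, load_path, resume, and max_position_embeddings are named in this README, and every other name and value is illustrative.

    # Hypothetical sketch of a training config and how a multi_train_*.py script
    # might read it; keys/values not named in the README are illustrative.
    import yaml

    example_config = """
    save_path: ./checkpoints/                  # directory where weights are written
    filename: codessm_csn_2048                 # name of the saved weight file
    max_position_embeddings: 2048              # pre-training context length
    resume: false                              # set to true to resume from a checkpoint
    load_path: ./checkpoints/codessm_csn_2048  # only read when resume is true
    """

    config = yaml.safe_load(example_config)
    print(config["save_path"], config["resume"])

    # Illustrative invocation (flag names are assumptions; check the script's
    # argument parser):
    #   python multi_train_mlm.py --config configs/codessm_csn.yaml \
    #       --filename codessm_csn_2048 --weight bigs_pytorch_weights.bin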

Creating StarCoder subset

Use the create_sc_code_subset_dataset.ipynb notebook to generate the subset dataset. Using the StarCoder dataset requires a HuggingFace account and accepting the "Terms and Conditions" set by the authors of that dataset.

The StarCoder dataset is huge and requires a large amount of storage. The notebook downloads data for only 8 languages, but even that requires substantial storage.
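
For a rough picture of what the notebook does, here is a minimal Python sketch that streams a fixed number of samples per language from the bigcode/starcoderdata dataset on the HuggingFace Hub and saves them to disk. The language list, sample count, and column names are assumptions; the notebook remains the authoritative recipe.

    # Minimal sketch of building a StarCoder subset with HuggingFace datasets.
    # Requires `huggingface-cli login` and accepting the dataset's terms on the Hub.
    # Language directories, sample counts, and the "content" column are assumptions.
    from datasets import Dataset, load_dataset

    LANGUAGES = ["python", "java", "javascript", "go", "ruby", "php", "c", "cpp"]  # illustrative 8 languages
    SAMPLES_PER_LANGUAGE = 100_000  # illustrative; pick a size that fits your storage

    def iter_subset():
        for lang in LANGUAGES:
            # Streaming avoids materializing each full per-language split on disk.
            ds = load_dataset("bigcode/starcoderdata", data_dir=lang,
                              split="train", streaming=True)
            for i, example in enumerate(ds):
                if i >= SAMPLES_PER_LANGUAGE:
                    break
                yield {"content": example["content"], "lang": lang}

    subset = Dataset.from_generator(iter_subset)
    subset.save_to_disk("starcoder_subset")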

Different pre-training context lengths

  • By default, the models are trained with a context length of 2048.
  • For a context length of 256, set max_position_embeddings to 256 in the config file (see the helper sketched below).
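
For illustration, here is a hypothetical helper that derives a 256-context config from an existing 2048-context one. Per the note above, max_position_embeddings is the only key that changes; the file names are made up.

    # Hypothetical helper: write a 256-context variant of an existing config.
    import yaml

    with open("configs/codessm_2048.yaml") as f:      # illustrative file name
        config = yaml.safe_load(f)

    config["max_position_embeddings"] = 256           # shorter pre-training context

    with open("configs/codessm_256.yaml", "w") as f:  # illustrative file name
        yaml.safe_dump(config, f)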

Finetuning

For finetuning, the code released with the following benchmarks can be used or easily adapted:

  • NLCodeSearch, Type inference, Clone Detection, Devign: CodeXGlue.
  • Complexity Prediction: CodeSage.
  • DiverseVul: Defect detection code of CodeSage adapted to DiverseVul by changing the dataset creation code.
  • SQA: The NLCodeSearch code from CodeXGlue can easily be adapted for SQA.

Pretrained weights are available here: https://figshare.com/s/14238287e9078f92cd50.

About

Code for the EMNLP 2025 paper "CodeSSM: Towards State Space Models for Code Understanding".
