GitHub - abx04/CodeSSM: Code for EMNLP 2025 paper CodeSSM: Towards State Space Models for Code Understanding

We release the code used for pre-training the CodeSSM model. For finetuning, the code released with the respective benchmarks can be used.

The code and associated trained weights are intended for research purposes only, with the aim of reproducing the results of the work or extending and improving it.

The CodeSSM model is based on BiGS. The code in the models directory is taken from the BiGS repository, with changes made by us for different model architectures, such as removing positional embeddings.

To pre-train, run one of the multi_train_*.py files.

Different pre-training files

multi_train_mlm.py: Pre-train CodeSSM on the CSN dataset
multi_train_mlm_sc_code.py: Pre-train CodeSSM on a subset of the StarCoder dataset (see Creating StarCoder subset)
multi_train_bert.py: Pre-train BertCoder on the CSN dataset

These files expect the following arguments:

config: YAML file with the training configuration. The config files are in the configs directory. Set the appropriate save_path and filename. To resume training from a saved checkpoint, also set load_path and set resume to true.
filename: Name of the saved weight file.
weight (only for multi_train_mlm.py): Weight file used for initialization. Download the PyTorch weight file.
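
The config requirements above can be checked before launching a long run. The following is a minimal sketch, not code from this repository: it assumes the YAML file has already been loaded into a plain dict and that the keys are spelled save_path, filename, load_path, and resume as in the description above. The helper name `check_train_config` and the example values are hypothetical.

```python
def check_train_config(cfg: dict) -> list[str]:
    """Return a list of problems found in a training config (empty if none).

    Assumption: `cfg` is the YAML config already loaded into a dict, with the
    keys named in the argument description (save_path, filename, load_path, resume).
    """
    problems = []
    # save_path and filename must always be filled in.
    for key in ("save_path", "filename"):
        if not cfg.get(key):
            problems.append(f"missing required key: {key}")
    # load_path is only needed when resuming from a saved checkpoint.
    if cfg.get("resume") and not cfg.get("load_path"):
        problems.append("resume is true but load_path is not set")
    return problems


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    example = {"save_path": "checkpoints/", "filename": "codessm.pt", "resume": True}
    print(check_train_config(example))
```

Returning a list of messages instead of raising on the first problem lets all missing fields be reported in one pass.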

Creating StarCoder subset

Use the create_sc_code_subset_dataset.ipynb notebook to generate the subset dataset. Using the StarCoder dataset requires a Hugging Face account and accepting the "Terms and Conditions" of the authors of that work.

The StarCoder dataset is very large. The notebook downloads data for only 8 languages, but even that subset requires a large amount of storage.
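
For a quick look at the data without committing that much disk space, streaming is one option. The sketch below is not the notebook's method: it assumes the Hugging Face dataset id `bigcode/starcoderdata` with one `data_dir` per language, an installed `datasets` package, and a logged-in account that has accepted the dataset terms. The `LANGUAGES` list and the function names are illustrative placeholders, not the 8 languages used by the notebook.

```python
from itertools import islice

# Illustrative only: the notebook's actual list of 8 languages is not reproduced here.
LANGUAGES = ["python", "java"]


def take(stream, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(stream, n))


def stream_language(lang):
    """Lazily stream one language of the StarCoder data from the Hugging Face Hub.

    Assumptions: dataset id "bigcode/starcoderdata" with one data_dir per
    language, and a logged-in account that has accepted the dataset terms.
    """
    from datasets import load_dataset  # third-party dependency, imported lazily

    return load_dataset(
        "bigcode/starcoderdata", data_dir=lang, split="train", streaming=True
    )


def sample_subset(per_language=1000):
    """Collect a small sample per language; streaming avoids a full download."""
    return {lang: take(stream_language(lang), per_language) for lang in LANGUAGES}
```

Calling `sample_subset(per_language=100)` would fetch only the first 100 records of each listed language, which is enough to inspect the record format before running the full notebook.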

Different pre-training context lengths

By default, the models are trained with a context length of 2048.
For a context length of 256, set max_position_embeddings to 256 in the config file.
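
When both context lengths are needed, the shorter config can be derived from the default one instead of being edited by hand. This is a minimal sketch under the assumption that the config has been loaded into a dict and that max_position_embeddings is a top-level key; `with_context_length` is a hypothetical helper, not part of the repository.

```python
import copy


def with_context_length(cfg: dict, context_length: int) -> dict:
    """Return a copy of cfg with max_position_embeddings set to context_length.

    Assumption: max_position_embeddings is a top-level key of the config.
    """
    if context_length <= 0:
        raise ValueError("context_length must be positive")
    new_cfg = copy.deepcopy(cfg)  # leave the original config untouched
    new_cfg["max_position_embeddings"] = context_length
    return new_cfg


if __name__ == "__main__":
    base = {"max_position_embeddings": 2048}
    print(with_context_length(base, 256))
```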

Finetuning

For finetuning, the code released with the following benchmarks can be used or easily adapted:

NLCodeSearch, Type Inference, Clone Detection, Devign: CodeXGLUE.
Complexity Prediction: CodeSage.
DiverseVul: Defect detection code of CodeSage, adapted to DiverseVul by changing the dataset creation code.
SQA: The NLCodeSearch code from CodeXGLUE can be easily adapted for use with SQA.

Pretrained weights are available here: https://figshare.com/s/14238287e9078f92cd50.

Folders and files

configs
models
LICENSE
README.md
collater.py
collater256.py
create_sc_code_subset_dataset.ipynb
eval.py
get_model.py
mlmc.py
mlmc_starcoder.py
multi_train_bert.py
multi_train_mlm.py
multi_train_mlm_sc_code.py
optimizer.py
train.py
