Learning Program Semantics with Code Representations: An Empirical Study

This repository contains the code and data for our paper, "Learning Program Semantics with Code Representations: An Empirical Study", published at SANER 2022. It includes the POJ104Clone and POJ datasets.

Clone Detection - Pairwise Clone Detection
Code Classification - Classify code into its respective label
Vulnerability Detection - See Devign

Dataset

I have uploaded the dataset to Google Drive. You can download it here.

Train

You can train the model with the sample command:

python3 -u train.py --config_path ./ymls/clone_detection/tfidf/naivebayes.yml

See ./ymls/<tasks>/*.yml for the configuration settings.
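To run several configurations in sequence, the sample command can be wrapped in a short script. The following is a minimal sketch, not part of the repository: it assumes train.py is run from the repository root and needs only the --config_path flag shown in the sample command. The helper names find_configs, build_command, and run_all are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def find_configs(yml_root, task):
    """Return all .yml files below <yml_root>/<task>, sorted for a stable order."""
    return sorted(Path(yml_root, task).rglob("*.yml"))


def build_command(config_path, train_script="train.py"):
    """Build the argument list for one training run, mirroring the sample command."""
    # Assumption: train.py takes no flags other than --config_path.
    return [sys.executable, "-u", str(train_script), "--config_path", str(config_path)]


def run_all(yml_root="./ymls", task="clone_detection", dry_run=True):
    """Print (dry_run=True) or execute one training run per config file."""
    for config in find_configs(yml_root, task):
        cmd = build_command(config)
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the loop as soon as one run fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all()
```

With dry_run left at True the script only prints the commands, which is a cheap way to confirm which config files would be picked up before starting any training.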

Citation

If you find this repository useful in your research, please consider citing it:

@inproceedings{siow2022learning,
  title={Learning Program Semantics with Code Representations: An Empirical Study},
  author={Siow, Jing Kai and Liu, Shangqing and Xie, Xiaofei and Meng, Guozhu and Liu, Yang},
  booktitle={Proceedings of the 29th IEEE International Conference on Software Analysis, Evolution and Reengineering},
  year={2022}
}

Dataset format

Test dataset: a JSON list of 5000 items. Each item pairs two functions (item_1, item_2) with a binary target (1 or 0).

[
  {
    "item_1": {"function_id": "1", "jsgraph": {"graph": [[1, 2, 0], ...], "function": ""}},
    "item_2": {},
    "target": 1/0
  }
]
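Before training on the pairwise data it can help to confirm that every item has the expected keys and to see how the two target values are distributed. The following is a minimal sketch under stated assumptions: item_2 is assumed to share the layout of item_1, the sample values are invented for illustration, and check_pair and label_counts are hypothetical helpers rather than repository code.

```python
from collections import Counter

REQUIRED_KEYS = ("item_1", "item_2", "target")


def check_pair(pair):
    """Raise ValueError if a pair lacks a required key or has a non-binary target."""
    missing = [key for key in REQUIRED_KEYS if key not in pair]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if pair["target"] not in (0, 1):
        raise ValueError(f"target must be 0 or 1, got {pair['target']!r}")


def label_counts(pairs):
    """Validate every pair and count how many carry each target value."""
    counts = Counter()
    for pair in pairs:
        check_pair(pair)
        counts[pair["target"]] += 1
    return counts


# Invented sample data that follows the layout shown above.
sample_pairs = [
    {
        "item_1": {"function_id": "1", "jsgraph": {"graph": [[1, 2, 0]], "function": ""}},
        "item_2": {"function_id": "2", "jsgraph": {"graph": [[1, 2, 0]], "function": ""}},
        "target": 1,
    },
    {
        "item_1": {"function_id": "1", "jsgraph": {"graph": [[1, 2, 0]], "function": ""}},
        "item_2": {"function_id": "3", "jsgraph": {"graph": [[2, 3, 0]], "function": ""}},
        "target": 0,
    },
]

print(label_counts(sample_pairs))
```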

POJ104

[
  {
    "function_id": "str",
    "target": "int",
    "jsgraph": {
      "graph": [
        [
          1,
          2,
          0
        ]
      ],
      "node_features": {
        "0": ["Function", "", "0", "False"]
      }
    },
    "jsgraph_file_path": "str",
    "function": "str",
    "graph_size": "int",
    "cfile_path": "str"
  }
]
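The layout above can be checked mechanically when preparing custom data. The following is a minimal sketch, assuming that "str" and "int" in the layout are type placeholders rather than literal values; validate_record is a hypothetical helper, and the sample record uses invented values.

```python
# Field names come from the layout above; the Python types are an
# interpretation of its "str" / "int" placeholders.
EXPECTED_TYPES = {
    "function_id": str,
    "target": int,
    "jsgraph": dict,
    "jsgraph_file_path": str,
    "function": str,
    "graph_size": int,
    "cfile_path": str,
}


def validate_record(record):
    """Return a list of problems; an empty list means the record fits the layout."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    jsgraph = record.get("jsgraph")
    if isinstance(jsgraph, dict):
        for key in ("graph", "node_features"):
            if key not in jsgraph:
                problems.append(f"missing field: jsgraph.{key}")
    return problems


# Invented record for illustration only.
sample_record = {
    "function_id": "42",
    "target": 7,
    "jsgraph": {
        "graph": [[1, 2, 0]],
        "node_features": {"0": ["Function", "", "0", "False"]},
    },
    "jsgraph_file_path": "example.json",
    "function": "int main() { return 0; }",
    "graph_size": 3,
    "cfile_path": "example.c",
}

print(validate_record(sample_record))
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a record at once.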

Dataset loading steps (using Tree-LSTM as an example)

DatasetFactory().get_dataset(config)
dataloader = POJ104
BaseDataset
- Define self.dataformatter = FormatterFactory().get_formatter(self.config)
  - TreeLSTMFormatter
  - In BaseFormatter, define collate_fn: collate_graph_for_classification
- Load the train/val/test JSON data from gzip files
- self.format_data()
- self._format(train/val/test)
- datapoints.append(self.dataformatter.format(item, self.get_vocabs()))
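The last three steps above (load gzip-compressed JSON, then format each item into a datapoint) can be sketched as follows. This is a simplified stand-in, not the repository's implementation: load_split, format_split, and EdgeCountFormatter are hypothetical names, and the formatter only counts edges where a real formatter such as TreeLSTMFormatter would build model inputs.

```python
import gzip
import json


def load_split(path_or_file):
    """Read one split (train/val/test) from a gzip-compressed JSON source."""
    with gzip.open(path_or_file, "rt", encoding="utf-8") as handle:
        return json.load(handle)


class EdgeCountFormatter:
    """Stand-in formatter exposing the same format(item, vocabs) call shape."""

    def format(self, item, vocabs):
        # vocabs is unused here; a real formatter would use it to map tokens to ids.
        return {
            "function_id": item["function_id"],
            "target": item["target"],
            "num_edges": len(item["jsgraph"]["graph"]),
        }


def format_split(items, formatter, vocabs=None):
    """Mirror the loop above: format every item and collect the datapoints."""
    datapoints = []
    for item in items:
        datapoints.append(formatter.format(item, vocabs))
    return datapoints
```

Keeping the formatter behind a single format(item, vocabs) method is what lets a factory swap in a different formatter per model without touching the loading loop.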

trained_model

code_classification

treelstm

20220727-163602: textual
