MuCoS

WIP

This repo provides the code for reproducing the experiments in "Is a Single Model Enough? MuCoS: A Multi-Model Ensemble Learning for Semantic Code Search" (accepted at CIKM 2021). MuCoS is a multi-model ensemble learning architecture for semantic code search.

Dependencies

  • pip install torch==1.4.0
  • pip install transformers==2.5.0
  • pip install javalang nltk numpy h5py # data processing
  • pip install filelock more_itertools tensorboardX scikit-learn matplotlib # run MuCoS

Code Search

Data Generation

An example of the CodeSearchNet data used in this process is csn.pkl (available in data/). For the full dataset, parse the CodeSearchNet data into the same format. Adversarial data is then generated with generate_adversarial_datasets.py; this script calls a jar package from https://github.com/mdrafiqulrabin/tnpa-framework#1-variable-renaming to apply 6 types of program transformation to the original data and saves the results to data/output/<transform_type>/<filename>. save_dataset.py then reads all the augmented data, adds it to the original CodeSearchNet data, parses the APIs, tokens, and method names, and saves everything to valid.adv_data.pkl or train.adv_data.pkl.
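
To get a feel for the expected format, you can inspect csn.pkl directly. A minimal sketch (the exact record fields depend on how the data was parsed, so treat the printed structure as illustrative):

import pickle

# Load the example CodeSearchNet pickle shipped in data/.
with open("../data/csn.pkl", "rb") as f:
    examples = pickle.load(f)

# For the full dataset, CodeSearchNet entries must be parsed into this same shape.
print(type(examples), len(examples))
print(examples[0])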

Step 1: generate data

cd src

python generate_adversarial_datasets.py

The output will match the contents of output/ in this repo.
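
Internally, each transformation is applied by shelling out to the tnpa-framework jar. A hedged sketch of such a call (the jar name and argument order here are hypothetical; see the tnpa-framework README for the real interface):

import subprocess

# Hypothetical invocation: apply one transformation type to a Java file and
# write the transformed source under data/output/<transform_type>/.
subprocess.run(
    ["java", "-jar", "tnpa.jar", "VariableRenaming",
     "Example.java", "../data/output/VariableRenaming/"],
    check=True,  # raise CalledProcessError if the jar fails
)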

Step 2: parse and save augmented and original data

Merge all augmented data into the original data, parse both the original and augmented examples, and save everything to a new pickle file:

python save_dataset.py --data_split train
python save_dataset.py --data_split val
python save_dataset.py --data_split test
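
The parsing relies on javalang. A rough sketch of how tokens, method names, and API calls can be extracted from a Java method (the wrapper class is only needed because javalang parses full compilation units; the actual logic lives in save_dataset.py):

import javalang

code = "public int add(int a, int b) { return Math.max(a, b); }"

# Token sequence for the token view of the data.
tokens = [t.value for t in javalang.tokenizer.tokenize(code)]

# javalang parses compilation units, so wrap the method in a class.
tree = javalang.parse.parse("class Wrapper { %s }" % code)

# Method names and invoked APIs for the name and API views.
method_names = [m.name for _, m in tree.filter(javalang.tree.MethodDeclaration)]
api_calls = [i.member for _, i in tree.filter(javalang.tree.MethodInvocation)]

print(tokens, method_names, api_calls)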

Step 3: run scripts to build a separate dataset for each baseline

python transform2codebert.py --data_split train
python transform2codebert.py --data_split val
python transform2codebert.py --data_split test
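
transform2codebert.py reshapes the pickled data into the line-per-pair format used by CodeBERT-style classifiers. A minimal sketch, assuming the usual <CODESPLIT>-separated layout of label, URL, method name, docstring, and code (check transform2codebert.py for the exact field order):

# Assumed layout: fields joined by '<CODESPLIT>', one (query, code) pair per line.
def to_codebert_line(label, url, func_name, docstring, code):
    fields = [str(label), url, func_name,
              docstring.replace("\n", " "), code.replace("\n", " ")]
    return "<CODESPLIT>".join(fields)

print(to_codebert_line(1, "https://example.com", "add",
                       "Adds two integers.", "int add(int a, int b) { return a + b; }"))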

Train Individual Model

Fine-Tune

lang=java #fine-tuning a language-specific model for each programming language
pretrained_model=microsoft/codebert-base  #Roberta: roberta-base

python ensemble_train.py \
--model_type roberta \
--model_name_or_path $pretrained_model \
--task_name codesearch \
--do_train \
--output_dir ../data/ \
--train_file train_origin.txt \
--dev_file valid_origin.txt \
--data_dir ../data/ \
--max_seq_length 200 \
--pred_modelA_dir ./models_api/java/checkpoint-best/ \
--pred_modelB_dir ./models_var/java/checkpoint-best/ \
--pred_modelC_dir ./models_struct/java/checkpoint-best/ \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--pred_model_dir ./models_ensemble/$lang/checkpoint-best/

Inference

lang=java #programming language
idx=0 #test batch idx

python ensemble_train.py \
--model_type roberta \
--model_name_or_path microsoft/codebert-base \
--task_name codesearch \
--do_predict \
--output_dir ../data/codesearch/test/$lang \
--data_dir ../data/codesearch/test/$lang \
--max_seq_length 200 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 1e-5 \
--pred_modelA_dir ./models_api/java/checkpoint-best/ \
--pred_modelB_dir ./models_var/java/checkpoint-best/ \
--pred_modelC_dir ./models_struct/java/checkpoint-best/ \
--num_train_epochs 8 \
--test_file batch_${idx}.txt \
--pred_model_dir ./models_ensemble/$lang/checkpoint-best/ \
--test_result_dir ./results_ensemble/$lang/${idx}_batch_result.txt 
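
At prediction time the three single-view models (API, variable names, structure) are combined into one ranking score. A simplified sketch of score-level fusion by averaging each model's softmax probabilities (the uniform weights are an assumption; ensemble_train.py is the authoritative implementation):

import numpy as np

def ensemble_scores(logits_a, logits_b, logits_c):
    # Each argument: (num_pairs, 2) logits from one component model.
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    # Uniform weighting over the API, variable-name, and structure models.
    probs = (softmax(logits_a) + softmax(logits_b) + softmax(logits_c)) / 3.0
    return probs[:, 1]  # probability that the code matches the query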

Evaluation

Compute the Mean Reciprocal Rank (MRR) over the batched test results:

python mrr.py
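
For reference, MRR averages the reciprocal rank of the correct code snippet over all queries. A minimal sketch of the metric itself (mrr.py presumably reads the ranks from the batch result files):

def mean_reciprocal_rank(ranks):
    # ranks: 1-based rank of the correct snippet for each query.
    return sum(1.0 / r for r in ranks) / len(ranks)

print(mean_reciprocal_rank([1, 2, 5]))  # 0.5666...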
