The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

Table of contents

The Vault Dataset
The Vault Toolkit
- Getting Started
- Processing Pipeline
Citing The Vault
Contact Us
License

The Vault Dataset

Data Summary

The Vault dataset is a comprehensive, large-scale, multilingual parallel dataset that features high-quality code-text pairs derived from The Stack, the largest permissively-licensed source code dataset.

The Vault contains code snippets in 10 popular programming languages: Java, JavaScript, Python, Ruby, Rust, Golang, C#, C++, C, and PHP. The dataset provides multiple code-snippet levels, metadata, and 11 docstring styles for enhanced usability and versatility.

Data Structure

Data Instances

Every sample of The Vault is stored as a JSON object, and the samples are compressed into a large JSON Lines file. Each sample corresponds to one raw code file. The content of the file is used to extract the function, class, and inline sets; other information (repository name, licenses, etc.) is collected from the source dataset (The Stack).
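As a rough illustration of the storage format described above, the sketch below reads a JSON Lines file one sample at a time. The helper name `iter_jsonl` is made up for this example, and no assumption is made about the dataset's actual field names.

```python
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one JSON object per non-empty line of a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines between records
                yield json.loads(line)
```

Reading lazily keeps memory usage flat, which matters for files of this size.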

Data Fields

See the details of the data fields and an example for each type of set here.

Data Near-Deduplication

We applied both internal and external deduplication.

Internal: Apply exact deduplication across the full dataset.
External: Apply near-deduplication against the test sets of CodeSearchNet, HumanEval, and APPS.

*Near-deduplication uses MinHash LSH to cluster samples based on their code, so samples that are close to each other (including modified versions) can be detected.
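To make the idea behind MinHash LSH concrete, here is a minimal standard-library sketch. It is not the implementation used to build The Vault: the tokenizer, the shingle size, the number of hash functions, and the band count are all assumptions chosen for illustration.

```python
import hashlib
import re
from typing import List, Set, Tuple


def shingles(code: str, k: int = 3) -> Set[str]:
    """Tokenize code (ignoring whitespace) and return its k-token shingles."""
    tokens = re.findall(r"\w+|[^\w\s]", code)
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items: Set[str], num_perm: int = 64) -> List[int]:
    """For each salted hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def lsh_buckets(signature: List[int], bands: int = 16) -> Set[Tuple[int, Tuple[int, ...]]]:
    """Split the signature into bands; samples sharing a bucket are candidates."""
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching slots approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two samples become near-duplicate candidates when `lsh_buckets` returns at least one bucket in common; `estimated_jaccard` can then confirm the match against a chosen threshold.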

Splitting train/eval/test

We have divided the complete dataset into three distinct sets: a training set, an evaluation set, and a test set, to maintain consistency throughout the experiment.

To avoid data leakage, we allocated all samples from the same repository to a single set. We then subdivided these sets using code tokens as splitting factors. As a result, these subsets mirror the distribution of the full dataset.
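The repository-level constraint can be sketched as follows. This is only an illustration of keeping every repository inside one set: the hash-based assignment, the `repo` field name, and the percentages are assumptions, and the token-based balancing mentioned above is not reproduced.

```python
import hashlib
from typing import Dict, Iterable, List


def assign_split(repo: str, eval_pct: int = 5, test_pct: int = 5) -> str:
    """Map a repository name to a split using a stable hash of the name."""
    digest = hashlib.sha256(repo.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + eval_pct:
        return "eval"
    return "train"


def split_by_repo(samples: Iterable[dict]) -> Dict[str, List[dict]]:
    """Route each sample to the split owned by its repository."""
    splits: Dict[str, List[dict]] = {"train": [], "eval": [], "test": []}
    for sample in samples:
        splits[assign_split(sample["repo"])].append(sample)
    return splits
```

Because the split depends only on the repository name, two samples from the same repository can never end up in different sets.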

Splitting trainset into multiple subsets

Given the substantial size of our dataset, we found it beneficial to further divide the training set into two smaller subsets, alongside the full training set, for ease of experimentation:

A small training set, which contains 5% of the total data.
A medium training set, which comprises 20% of the full dataset.
The full training set.

Language	Small set	Medium set	Train set	Validation	Test	Total
Python	370,657	1,952,110	7,772,647	30,992	21,652	7,825,291
Java	351,213	1,612,366	6,629,193	22,677	15,552	6,667,422
JavaScript	82,931	404,729	1,640,416	22,044	21,108	1,683,568
PHP	236,638	1,155,476	4,656,371	21,375	19,010	4,696,756
C	105,978	381,207	1,639,319	27,525	19,122	1,685,966
C#	141,090	783,166	3,305,891	24,787	19,638	3,350,316
C++	87,420	410,907	1,671,268	20,011	18,169	1,709,448
Go	267,535	1,319,547	5,109,020	19,102	25,314	5,153,436
Ruby	23,921	112,574	424,339	17,338	19,908	461,585
Rust	35,367	224,015	825,130	16,716	23,141	864,987
TOTAL	1,702,750	8,356,097	33,673,594	222,567	202,614	34,098,775

Download dataset

Load dataset on Huggingface

We publish The Vault (function/inline/class) on the Hugging Face dataset hub.

from datasets import load_dataset

# Load full function/class/inline level dataset
dataset = load_dataset("Fsoft-AIC/the-vault-{function/class/inline}")

# Load function level train/validation/test set
dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"])

# Load "small" (or "medium", "full") function level training set
dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train/small"])

# Load a specific language (e.g. Python)
dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"], languages=['Python'])

# Stream the dataset instead of downloading it in full
data = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"], streaming=True)
for sample in iter(data['train']):
    print(sample)

Download via link

Alternatively, download The Vault directly from Azure Blob Storage via a download link. The link pattern for each download option is:

https://ai4code.blob.core.windows.net/thevault/v1/{function,class,inline}/{python,java,javascript,go,cpp,c_sharp,c,rust,ruby,php}.zip

For example, to download the class-level set for Python:

https://ai4code.blob.core.windows.net/thevault/v1/class/python.zip
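The link pattern can also be expanded programmatically. The helper below is a small sketch (the function name `build_download_url` is made up for this example); it only builds the URL string from the pattern above and does not check that the file exists on the server.

```python
BASE_URL = "https://ai4code.blob.core.windows.net/thevault/v1"
SETS = ("function", "class", "inline")
LANGUAGES = ("python", "java", "javascript", "go", "cpp",
             "c_sharp", "c", "rust", "ruby", "php")


def build_download_url(set_name: str, language: str) -> str:
    """Build a download URL following the documented link pattern."""
    if set_name not in SETS:
        raise ValueError(f"unknown set: {set_name!r}")
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
    return f"{BASE_URL}/{set_name}/{language}.zip"
```

Validating the two components up front turns a typo into an immediate error instead of a failed download.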

Or download using the script download_dataset.py:

python download_dataset.py "<path/to/destination>" --set "function" # or class/inline

Note: the blob storage currently only contains the function-level version.

The Vault Toolkit

Getting Started

To set up the environment and install the dependencies via pip:

pip install -r requirements.txt

Install the codetext parser, which extracts code using tree-sitter, via pip:

pip install codetext

Or build codetext manually from source (see the Codetext repo for more details):

git clone https://github.com/FSoft-AI4Code/CodeText-parser.git
cd CodeText-parser
pip install -e .

Processing Pipeline

Our toolkit takes raw source code files as input and streamlines the extraction and generation of code-text pairs, as illustrated in the figure above. There are 3 separate processes:

Extracting Raw Code: Use the Tree-sitter extractor to identify function/class/line nodes inside a raw file and obtain their metadata.
Extracting Docstring Style: We implement a docstring style parser to capture all the informative sections and parameter descriptions inside a docstring.
Filtering Docstring: The docstrings gathered in the previous process are the main factor for filtering quality samples (removing empty docstrings, uninformative docstrings, etc.). See our paper for more about the cleaning methodologies.

We separated the source code into multiple steps (one corresponding to each process). Alternatively, you can run the full pipeline by following this tutorial.

Extracting Raw code

From raw code, you can extract functions and classes using process_raw_node(). An example structure of a raw code snippet is shown in the figure below. A node contains an identifier, a parameter or argument list, code, and a comment (docstring).

from codetext.utils import parse_code
from codetext.parser import PythonParser
# process_raw_node() and get_line_definitions() are provided by The Vault
# toolkit; import them from the toolkit source before running this example.

code_snippet = """
def sum2num(a: int, b: int):
  '''
  :param a: first number
  :param b: second number
  '''
  return a + b # result
"""
# Parse with the grammar that matches the snippet's language (Python here)
code_tree = parse_code(code_snippet, 'python')

# Extract function/class nodes
res = process_raw_node(
    tree=code_tree, 
    blob=code_snippet,
    language_parser=PythonParser(),
    metadata={'repo': 'test'}  # Optional
)

# or extract inline comments

res = get_line_definitions(
    tree=code_tree, 
    blob=code_snippet,
    language_parser=PythonParser(),
    source_metadata={'repo': 'test'}  # Optional
)

To extract raw inline comments, use get_line_definitions(), which extracts each line comment and returns its parent code block along with the previous and next context (i.e. code blocks).

from codetext.utils import parse_code
from codetext.parser import PythonParser

code_snippet = """
def sum2num(a: int, b: int):
  '''
  :param a: first number
  :param b: second number
  '''
  return a + b # result
"""
# Parse with the grammar that matches the snippet's language (Python here)
code_tree = parse_code(code_snippet, 'python')

res = get_line_definitions(
    tree=code_tree, 
    blob=code_snippet,
    language_parser=PythonParser(),
    source_metadata={'repo': 'test'}  # Optional
)

Filtering Extracted code snippet

With the function or class nodes and their metadata extracted in the previous process, execute get_node_definitions() to filter samples based on their docstrings. Heuristic rules remove samples that do not meet the minimum requirements (we describe them in detail in our paper).
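To give a feel for this kind of docstring-based filtering, here is a small sketch. The rules and the `min_words` threshold are illustrative assumptions only; they are neither the heuristics from the paper nor the behavior of get_node_definitions().

```python
import re


def is_informative_docstring(docstring: str, min_words: int = 3) -> bool:
    """Return True if the docstring passes a few simple quality checks."""
    text = docstring.strip()
    if not text:
        return False  # empty docstring
    words = re.findall(r"[A-Za-z]+", text)
    if len(words) < min_words:
        return False  # too short to describe the code
    if text.lower().startswith(("todo", "fixme")):
        return False  # a note to the developer, not a description
    return True
```

A sample would be kept only when its docstring passes every check, so adding a rule can only shrink the dataset.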

Lastly, to extract the docstring style, we implement a function called extract_docstring() that takes a docstring (as a string) as input and returns metadata about the docstring style, as demonstrated in the figure above (e.g. parameter docstrings, types, return docstring, etc.).
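As a simplified picture of what parsing one docstring style involves, the sketch below pulls `:param name: description` entries out of a reST-style docstring such as the one in the sum2num example. It is not extract_docstring(): it handles a single style, ignores types and return sections, and the function name `parse_rest_params` is made up for this example.

```python
import re
from typing import Dict

# Matches lines such as "  :param a: first number"
PARAM_PATTERN = re.compile(
    r"^[ \t]*:param[ \t]+(\w+)[ \t]*:[ \t]*(.*)$", re.MULTILINE
)


def parse_rest_params(docstring: str) -> Dict[str, str]:
    """Map each documented parameter name to its description."""
    return {
        name: description.strip()
        for name, description in PARAM_PATTERN.findall(docstring)
    }
```

A full style parser would repeat this idea for each supported convention and for the other sections of a docstring.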

Processing Custom Dataset

We use a .yaml file to define which fields to load when processing data. Usually only the source code is needed, but any additional information about the raw code can be added through the .yaml file.

For example, CodeSearchNet stores its data in the following structure:

# CodeSearchNet jsonline format 
# https://github.com/github/CodeSearchNet#data-details

code: original_string # raw code
repo: repo # additional info
path: path # additional info
language: language # additional info
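One way to read this format file is as a mapping from the field name used by the pipeline to the field name in the raw record. The sketch below applies such a mapping to a single record. This interpretation and the helper `apply_format` are assumptions for illustration, not the toolkit's loading code, and the mapping is written as a plain dict so that no YAML library is needed.

```python
from typing import Dict

# The format above, written as a dict:
# pipeline field name -> field name in the raw record.
CODESEARCHNET_FORMAT = {
    "code": "original_string",
    "repo": "repo",
    "path": "path",
    "language": "language",
}


def apply_format(raw: Dict[str, object], data_format: Dict[str, str]) -> Dict[str, object]:
    """Pick the configured fields from a raw record and rename them."""
    return {
        target: raw[source]
        for target, source in data_format.items()
        if source in raw  # silently skip fields missing from the record
    }
```

Fields that are not listed in the mapping are dropped, so the processed record carries only what the format file asks for.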

Inside processing.py, we merged the functions for extracting raw code, filtering docstrings, and extracting docstring styles into one simple pipeline for quickly extracting a dataset from raw source data. You can use processing.py as follows:

# --save_path: path to the save directory
# --load_from_file: load from a file instead of the dataset cache
# --language: Python, Java, JavaScript, ...
# --data_format: path to the raw data format .yaml
# --n_split: split the original dataset into N subsets
# --n_core: number of processes (defaults to 1; -1 uses all cores)
python -m codetext.processing \
    <DATASET_PATH> \
    --save_path <SAVE_PATH> \
    --load_from_file \
    --language Python \
    --data_format './data/format/codeparot-format.yaml' \
    --n_split 20 \
    --n_core -1

Arguments list:

positional arguments:
  data_path             data folder contain file.jsonl or huggingface dataset cache

options:
  -h, --help            show this help message and exit
  --save_path SAVE_PATH
                        Processed data save path
  --level LEVEL         Extract function/class/inline level or all
  --language LANGUAGE   Declare processing language (e.g: Python, Java)
  --data_format DATA_FORMAT
                        Path to file .yaml contains data format
  --load_from_file      Load from .json or .jsonl
  --cons_from_raw       Continues from raw .jsonl (pass folder path to data)
  --raw_only
  --filtered_only
  --extracted_only
  --n_split N_SPLIT     Split all the raw data into N file and feed into process pool
  --n_core N_CORE       Number of maximum process to create
  --debug

Citing The Vault

More details can be found in our paper.

If you're using The Vault or the toolkit in your research or applications, please cite using this BibTeX:

@article{manh2023vault,
  title={The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation},
  author={Manh, Dung Nguyen and Hai, Nam Le and Dau, Anh TV and Nguyen, Anh Minh and Nghiem, Khanh and Guo, Jin and Bui, Nghi DQ},
  journal={arXiv preprint arXiv:2305.06156},
  year={2023}
}

Contact us

If you have any questions, comments or suggestions, please do not hesitate to contact us.

Website: fpt-aicenter
Email: support.ailab@fpt.com

License

MIT License
