CoDesc contains the following files, which together make up the 4.2M-entry CoDesc dataset and its related information.
- CoDesc.json:
  - List of Python dictionaries
  - Each entry has the following keys:
    - id: unique id in the CoDesc dataset
    - src: source dataset
    - src_div: which subset of the source the entry was taken from, e.g. train, test, etc.
    - src_idx: index in the source subset
    - code: Java function
    - nl: natural language description after initial filtering
    - original_code: source code as taken from the source
    - original_nl: natural language description as taken from the source
    - partition: "train", "valid" or "test"
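
As a rough illustration of the entry layout above, the following Python sketch loads the file and collects the (code, nl) pairs of one partition. The helper names `load_codesc` and `select_partition` and the sample entry are made up for illustration. The sketch also assumes the file is a single JSON array that fits in memory, which may not be practical for all 4.2M entries.

```python
import json


def load_codesc(path):
    """Load CoDesc.json, assumed here to be one JSON array of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def select_partition(entries, partition):
    """Return (code, nl) pairs for entries whose partition matches."""
    return [(e["code"], e["nl"]) for e in entries if e["partition"] == partition]


# Made-up entry that only mirrors the documented keys; values are not from CoDesc.
sample_entries = [
    {
        "id": 0,
        "src": "CodeSearchNet-Java",
        "src_div": "train",
        "src_idx": 0,
        "code": "int add(int a, int b) { return a + b; }",
        "nl": "adds two integers",
        "original_code": "public int add(int a, int b) { return a + b; }",
        "original_nl": "Adds two integers.",
        "partition": "train",
    }
]

train_pairs = select_partition(sample_entries, "train")
print(len(train_pairs))
```

With the real data, `select_partition(load_codesc("CoDesc.json"), "train")` would return the training pairs, assuming the file sits in the working directory.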
- src2id.json:
  - Dictionary type
  - src2id[src][src_div] is a list of ids from the CoDesc dataset
  - src -> src_div:
    - "CodeSearchNet-Java" -> "test", "valid", "train", "removed"
    - "FunCom" -> "none"
    - "DeepCom" -> "test", "valid", "train"
    - "CONCODE" -> "test", "valid", "train"
    - "CodeSearchNet-Py2Java" -> "full", "truncated"
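
The two-level lookup described above can be sketched in Python as follows. The helper names `ids_for` and `subset_sizes` are hypothetical and the ids in the sample are made up; only the key layout follows the description.

```python
def ids_for(src2id, src, src_div):
    """Return the CoDesc ids recorded for one source subset."""
    return src2id[src][src_div]


def subset_sizes(src2id):
    """Map each (src, src_div) pair to its number of ids."""
    return {
        (src, div): len(ids)
        for src, divs in src2id.items()
        for div, ids in divs.items()
    }


# Made-up ids; the real file would be read with json.load.
sample_src2id = {
    "FunCom": {"none": [0, 1, 2]},
    "DeepCom": {"test": [3], "valid": [4], "train": [5, 6]},
}

print(ids_for(sample_src2id, "DeepCom", "train"))
print(subset_sizes(sample_src2id)[("FunCom", "none")])
```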
- id2src.csv:
  - CSV type
  - Columns:
    - id: unique id in the CoDesc dataset
    - src: source dataset
    - src_div: which subset of the source the entry was taken from, e.g. train, test, etc.
    - src_idx: index in the source subset
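
A minimal Python sketch of reading this file into an id-to-source map is given below. It assumes the CSV starts with a header row naming the four columns listed above, which the description does not confirm. The function name `read_id2src` and the sample rows are made up.

```python
import csv
import io


def read_id2src(lines):
    """Build {id: (src, src_div, src_idx)} from id2src.csv rows.

    Assumes a header row with the columns id, src, src_div, src_idx.
    All values stay strings, as returned by the csv module.
    """
    reader = csv.DictReader(lines)
    return {row["id"]: (row["src"], row["src_div"], row["src_idx"]) for row in reader}


# Made-up rows standing in for the real file.
sample_csv = io.StringIO(
    "id,src,src_div,src_idx\n"
    "0,FunCom,none,12\n"
    "1,DeepCom,train,7\n"
)
id2src = read_id2src(sample_csv)
print(id2src["1"])
```

For the real file, the same function could be called on a handle opened with `open("id2src.csv", newline="", encoding="utf-8")`.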
- src_len.csv:
  - CSV type
  - Columns:
    - src: source dataset
    - src_div: which subset of the source the entry was taken from, e.g. train, test, etc.
    - len: number of datapoints under this subset
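
As an example of using these columns, the Python sketch below sums `len` over all subsets of each source. It assumes a header row with the three column names above. The function name `totals_per_source` and the numbers in the sample are made up.

```python
import csv
import io
from collections import defaultdict


def totals_per_source(lines):
    """Sum the len column over all src_div values of each src."""
    totals = defaultdict(int)
    for row in csv.DictReader(lines):
        totals[row["src"]] += int(row["len"])
    return dict(totals)


# Made-up rows; the counts are not taken from CoDesc.
sample_csv = io.StringIO(
    "src,src_div,len\n"
    "DeepCom,train,5\n"
    "DeepCom,test,2\n"
    "FunCom,none,9\n"
)
totals = totals_per_source(sample_csv)
print(totals)
```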
- partition2id.json:
  - Dictionary type
  - partition2id["train"], partition2id["valid"], and partition2id["test"] are lists of the CoDesc ids that belong to each partition.
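
The id lists above can be used to group entries by partition, as in the following Python sketch. It assumes the ids in partition2id.json have the same type as the `id` key of the CoDesc.json entries. The function name `split_by_partition` and the sample data are made up.

```python
def split_by_partition(entries, partition2id):
    """Group entries into partitions using the id lists from partition2id.json."""
    by_id = {e["id"]: e for e in entries}
    return {
        name: [by_id[i] for i in ids if i in by_id]
        for name, ids in partition2id.items()
    }


# Made-up entries carrying only the keys this sketch needs.
sample_entries = [
    {"id": 0, "partition": "train"},
    {"id": 1, "partition": "test"},
]
sample_partition2id = {"train": [0], "valid": [], "test": [1]}

splits = split_by_partition(sample_entries, sample_partition2id)
print({name: len(items) for name, items in splits.items()})
```

Since each entry also carries its own `partition` key, the two sources can be compared as a sanity check.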

Name | #Projects | #Raw data | #Clean data | Code: #Unique tokens | Code: Avg len | Code: ≤ 200 (%) | NL: #Unique tokens | NL: Avg len | NL: ≤ 50 (%) |
---|---|---|---|---|---|---|---|---|---|
CSN-Java | N/A | 542,991 | 490,169 | 284,214 | 140.41 | 83.42 | 168,507 | 25.14 | 89.42 |
DeepCom | 9,714 | 588,108 | 424,028 | 306,422 | 128.35 | 84.04 | 91,933 | 17.80 | 94.76 |
FunCom | 28,000 | 2,149,121 | 2,130,247 | 469,354 | 51.30 | 99.83 | 399,338 | 15.52 | 95.87 |
CONCODE | 33,000 | 2,184,310 | 733,040 | 131,852 | 33.75 | 99.99 | 166,239 | 14.87 | 96.27 |
CSN-Py2Java | N/A | 456,000 | 434,032 | 414,018 | 163.78 | 72.32 | 223,277 | 57.11 | 68.69 |
CoDesc (All) | N/A | 5,920,530 | 4,211,516 | 1,128,909 | 77.97 | 93.53 | 813,078 | 21.04 | 92.28 |
Balanced train-valid-test split for CoDesc data | | | | | | | | | |
train | - | - | 3,369,218 | 991,395 | 78.01 | 93.53 | 718,204 | 21.05 | 92.28 |
valid | - | - | 421,149 | 269,435 | 77.73 | 93.51 | 188,145 | 21.08 | 92.26 |
test | - | - | 421,149 | 269,318 | 77.88 | 93.55 | 187,230 | 20.97 | 92.33 |
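
The counts in the table are mutually consistent, which the short Python check below illustrates. The numbers are copied from the table; the variable names are only for illustration.

```python
# Raw and clean datapoint counts per source, copied from the table above.
raw_by_source = {
    "CSN-Java": 542_991,
    "DeepCom": 588_108,
    "FunCom": 2_149_121,
    "CONCODE": 2_184_310,
    "CSN-Py2Java": 456_000,
}
clean_by_source = {
    "CSN-Java": 490_169,
    "DeepCom": 424_028,
    "FunCom": 2_130_247,
    "CONCODE": 733_040,
    "CSN-Py2Java": 434_032,
}
split_sizes = {"train": 3_369_218, "valid": 421_149, "test": 421_149}

total_raw = 5_920_530    # "CoDesc (All)" row, #Raw data
total_clean = 4_211_516  # "CoDesc (All)" row, #Clean data

# Per-source counts and split sizes each add up to the reported totals.
assert sum(raw_by_source.values()) == total_raw
assert sum(clean_by_source.values()) == total_clean
assert sum(split_sizes.values()) == total_clean

# Share of the clean data that falls into each split.
for name, n in split_sizes.items():
    print(f"{name}: {n / total_clean:.1%}")
```

The split sizes correspond to roughly 80%, 10% and 10% of the clean data.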