Details of CoDesc

Files description

CoDesc consists of the following files, which contain the 4.2M-entry CoDesc dataset and related information.

  1. CoDesc.json:

    • List of Python dictionaries
    • Each entry has the following keys:
      • id: unique id in CoDesc dataset
      • src: source dataset
      • src_div: which subset the entry was taken from, e.g. train, test, etc.
      • src_idx: idx in source subset
      • code: Java function
      • nl: natural language description after initial filtering
      • original_code: source code taken from source
      • original_nl: natural language description taken from source
      • partition: "train", "valid" or "test"
  2. src2id.json:

    • Dictionary type
    • src2id[src][src_div] is a list of ids from CoDesc dataset
    • src -> src_div:
      • "CodeSearchNet-Java" -> "test", "valid", "train", "removed"
      • "FunCom" -> "none"
      • "DeepCom" -> "test", "valid", "train"
      • "CONCODE" -> "test", "valid", "train"
      • "CodeSearchNet-Py2Java" -> "full", "truncated"
  3. id2src.csv:

    • CSV type
    • Columns:
      • id: unique id in CoDesc dataset
      • src: source dataset
      • src_div: which subset the entry was taken from, e.g. train, test, etc.
      • src_idx: idx in source subset
  4. src_len.csv:

    • CSV type
    • Columns:
      • src: source dataset
      • src_div: which subset the entry was taken from, e.g. train, test, etc.
      • len: number of datapoints under this subset
  5. partition2id.json:

    • Dictionary type
    • partition2id["train"], partition2id["valid"], and partition2id["test"] are lists of ids from the CoDesc dataset, one list per partition.
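
The index files above can be derived from the entries of CoDesc.json. The following is a minimal sketch of that relationship, assuming CoDesc.json is a single JSON array and that the keys are exactly those listed above. The helper names (`load_entries`, `build_partition2id`, `build_src2id`) and the sample records are hypothetical and only illustrate the shapes involved; the actual id type and field contents may differ.

```python
import json
from collections import defaultdict


def load_entries(path):
    """Load CoDesc.json (assumed here to be one JSON array of dictionaries)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def build_partition2id(entries):
    """Group ids by the "partition" key, mirroring the shape of partition2id.json."""
    partition2id = defaultdict(list)
    for entry in entries:
        partition2id[entry["partition"]].append(entry["id"])
    return dict(partition2id)


def build_src2id(entries):
    """Group ids as src2id[src][src_div], mirroring the shape of src2id.json."""
    src2id = defaultdict(lambda: defaultdict(list))
    for entry in entries:
        src2id[entry["src"]][entry["src_div"]].append(entry["id"])
    return {src: dict(divs) for src, divs in src2id.items()}


# Hypothetical records with made-up values, used only to show the structure.
sample = [
    {"id": 0, "src": "DeepCom", "src_div": "train", "src_idx": 5,
     "code": "int one() { return 1; }", "nl": "return one", "partition": "train"},
    {"id": 1, "src": "FunCom", "src_div": "none", "src_idx": 9,
     "code": "void noop() { }", "nl": "do nothing", "partition": "test"},
    {"id": 2, "src": "DeepCom", "src_div": "train", "src_idx": 7,
     "code": "int two() { return 2; }", "nl": "return two", "partition": "train"},
]

partition2id = build_partition2id(sample)
src2id = build_src2id(sample)
```

Note that `src_div` refers to the subset of the source dataset, while `partition` is CoDesc's own split, so the two need not agree for a given entry. With the real file, `load_entries("CoDesc.json")` would replace `sample`; loading all entries at once may need a substantial amount of memory.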

Dataset Statistics

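| Name | #Projects | #Raw data | #Clean data | Code: #Unique tokens | Code: Avg len | Code: ≤ 200 (%) | NL: #Unique tokens | NL: Avg len | NL: ≤ 50 (%) |
|---|---|---|---|---|---|---|---|---|---|
| CSN-Java | N/A | 542,991 | 490,169 | 284,214 | 140.41 | 83.42 | 168,507 | 25.14 | 89.42 |
| DeepCom | 9,714 | 588,108 | 424,028 | 306,422 | 128.35 | 84.04 | 91,933 | 17.80 | 94.76 |
| FunCom | 28,000 | 2,149,121 | 2,130,247 | 469,354 | 51.30 | 99.83 | 399,338 | 15.52 | 95.87 |
| CONCODE | 33,000 | 2,184,310 | 733,040 | 131,852 | 33.75 | 99.99 | 166,239 | 14.87 | 96.27 |
| CSN-Py2Java | N/A | 456,000 | 434,032 | 414,018 | 163.78 | 72.32 | 223,277 | 57.11 | 68.69 |
| CoDesc (All) | N/A | 5,920,530 | 4,211,516 | 1,128,909 | 77.97 | 93.53 | 813,078 | 21.04 | 92.28 |

Balanced train-valid-test split for CoDesc data

| Name | #Projects | #Raw data | #Clean data | Code: #Unique tokens | Code: Avg len | Code: ≤ 200 (%) | NL: #Unique tokens | NL: Avg len | NL: ≤ 50 (%) |
|---|---|---|---|---|---|---|---|---|---|
| train | - | - | 3,369,218 | 991,395 | 78.01 | 93.53 | 718,204 | 21.05 | 92.28 |
| valid | - | - | 421,149 | 269,435 | 77.73 | 93.51 | 188,145 | 21.08 | 92.26 |
| test | - | - | 421,149 | 269,318 | 77.88 | 93.55 | 187,230 | 20.97 | 92.33 |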
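
As a quick sanity check on the statistics above, the per-source clean counts should sum to the CoDesc total, and the three partitions should sum to the same number. The sketch below hard-codes the figures from the table; the variable names are illustrative only.

```python
# Clean datapoint counts per source, copied from the statistics table.
clean_counts = {
    "CSN-Java": 490_169,
    "DeepCom": 424_028,
    "FunCom": 2_130_247,
    "CONCODE": 733_040,
    "CSN-Py2Java": 434_032,
}

# Partition sizes, copied from the train-valid-test rows.
split_counts = {"train": 3_369_218, "valid": 421_149, "test": 421_149}

total_clean = sum(clean_counts.values())
total_split = sum(split_counts.values())

# Share of each partition in percent, rounded to one decimal place.
split_share = {
    name: round(100 * count / total_clean, 1)
    for name, count in split_counts.items()
}

print(total_clean)   # 4211516
print(total_split)   # 4211516
print(split_share)   # {'train': 80.0, 'valid': 10.0, 'test': 10.0}
```

The shares work out to roughly 80/10/10 for train, valid, and test.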