
RCompletion Evaluation Dataset (RCED)

Evaluation dataset for the code completion task in the R programming language, introduced in the paper "Time-Efficient Code Completion Model for the R Programming Language".

The dataset was collected from open-source GitHub projects. You can download it via the link (link here) or collect it yourself using the GitHub API (see below).

Dataset structure

The dataset consists of synthetic completion events for the R programming language.

All events are divided into the following groups:

  • 6 dataset groups correspond to completion events following the operators $, %>%, ->, ::, <-, =
  • 4 dataset groups cover completion of positional or keyword arguments in vector constructions (the c groups in the table below) or function calls (the f groups)
  • 1 group contains package import (library) contexts
  • 2 groups correspond to the completion of a variable or a function name at the start of a new line

We also divide the dataset into prefix and non-prefix (full) groups. In the prefix group the last program token is always incomplete and must be continued; in the full group the answer is a new, unseen token.

The size of each group is reported in the table below.

group                  | with prefix | w/o prefix | total
-----------------------|-------------|------------|------
after operator $       |         964 |       1194 |  2158
after operator %>%     |         667 |        826 |  1493
after operator ->      |          19 |         24 |    43
after operator ::      |         457 |        567 |  1024
after operator <-      |        1675 |       2073 |  3748
after operator =       |        2005 |       2483 |  4488
c key argument         |         267 |        331 |   598
c positional argument  |         793 |        983 |  1776
f key argument         |        3986 |       4934 |  8920
f positional argument  |        3007 |       3723 |  6730
library                |         341 |        407 |   748
new line variable      |         792 |        982 |  1774
new line function      |         497 |        616 |  1113

How to use the dataset?

You can load the dataset as a list of dicts using the following code:

import json

# Each line of the events file is a separate JSON-encoded completion event.
evaluation_data = []
with open('<PATH_TO_EVENTS_FILE>', 'r', encoding='utf8') as f:
    for line in f:
        evaluation_data.append(json.loads(line))
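
To get a quick overview of what was loaded, you can count events per group and per prefix/full subset. A small sketch (assuming the field names described below):

from collections import Counter

# Count events for every (group, prefix) pair; field names are described below.
counts = Counter((event['group'], event['prefix']) for event in evaluation_data)
for (group, prefix), n in sorted(counts.items()):
    print(f'{group:25} {prefix:7} {n}')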

Each element of evaluation_data is a dict with the following fields:

  • url - link from which the file was downloaded.
  • before_cursor - context of the completion event.
  • after_cursor - answer for the completion event.
  • after_cursor_token - full answer for the completion event. If prefix == 'full', it is equal to after_cursor. If prefix == 'prefix', it is the complete desired programming token.
  • group - group of the completion event.
  • prefix - If prefix == 'prefix', the last programming token of the context is unfinished and should be continued. If prefix == 'full', the answer corresponds to a new, unseen token.
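
A hypothetical example of a single event, with made-up values that only illustrate the field layout (not an actual dataset entry):

# Made-up illustration of one prefix event; none of these values come from the dataset.
example_event = {
    'url': 'https://raw.githubusercontent.com/<user>/<repo>/master/example.R',
    'before_cursor': 'result <- mean(val',  # context ends with an unfinished token
    'after_cursor': 'ues',                  # expected continuation of that token
    'after_cursor_token': 'values',         # the full desired token
    'group': 'f positional argument',       # group name as in the table above
    'prefix': 'prefix',                     # 'prefix' or 'full'
}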

How to collect the dataset from GitHub?

  1. Get a personal access token at https://github.com/settings/tokens.

  2. Set the OAUTH_TOKEN variable to your token and the USERNAME variable to your GitHub login in the crawl_data_from_github.py script (see the sketch below this list).

  3. Run crawl_data_from_github.py. Note that GitHub has a rate limit of 5,000 requests per hour; if you hit it, wait an hour and run the download script again:

    python crawl_data_from_github.py --urls=data/urls_to_download.txt --folder=<FOLDER_WITH_RAW_FILES>
    
  4. Run get_events_from_data.py:

    python get_events_from_data.py --indexes=data/indexes_to_extract.json --folder=<FOLDER_WITH_RAW_FILES> --events=data/extracted_events.json
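
For step 2, the edit might look like this (a sketch: the variable names come from the instructions above, the values are placeholders to replace with your own):

# In crawl_data_from_github.py (sketch; replace the placeholders with your own values).
OAUTH_TOKEN = '<YOUR_PERSONAL_ACCESS_TOKEN>'
USERNAME = '<YOUR_GITHUB_LOGIN>'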
    
