The goal of this repository is to enable end-to-end software engineering research. It starts from the source code as input and provides multiple process metrics that can be used as concepts in supervised learning.
While predictive models have value, we would like to take a step further and look for causal relations. For that we provide the source code at different dates. That enables using co-change analysis in order to identify likely causal relations. The source code also provides intervention points, where one can use the classical experimental methodology.
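As a rough illustration of co-change analysis, the sketch below takes metric values for the same files at two dates and checks whether an improvement in one metric comes with an improvement in another more often than the baseline. The metric names (`length`, `bug_rate`), the data layout and the function name are hypothetical and are not part of this repository.

```python
def co_change(before, after, cause, effect):
    """Compare P(effect decreases | cause decreases) with P(effect decreases).

    `before` and `after` map a file name to a dict of metric values
    at two dates. Only files present at both dates are used.
    """
    shared = [f for f in before if f in after]
    if not shared:
        raise ValueError("no files appear at both dates")

    cause_down = [f for f in shared if after[f][cause] < before[f][cause]]
    effect_down = [f for f in shared if after[f][effect] < before[f][effect]]
    both_down = [f for f in cause_down if after[f][effect] < before[f][effect]]

    baseline = len(effect_down) / len(shared)
    conditional = len(both_down) / len(cause_down) if cause_down else float("nan")
    return conditional, baseline


# Hypothetical toy data: metric values per file at two dates.
june = {
    "A.java": {"length": 300, "bug_rate": 0.30},
    "B.java": {"length": 200, "bug_rate": 0.20},
    "C.java": {"length": 100, "bug_rate": 0.10},
    "D.java": {"length": 150, "bug_rate": 0.10},
}
august = {
    "A.java": {"length": 250, "bug_rate": 0.20},
    "B.java": {"length": 150, "bug_rate": 0.10},
    "C.java": {"length": 120, "bug_rate": 0.10},
    "D.java": {"length": 150, "bug_rate": 0.15},
}

conditional, baseline = co_change(june, august, "length", "bug_rate")
```

A conditional probability well above the baseline marks a likely causal relation worth investigating; it does not establish one by itself.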
This repository contains the source code needed for the generation and some auxiliary data.
The main data is stored in separate repositories since its volume is very large; their structure is rather similar. The first batch of 1.8 million Java files from June 2021 is here. The last batch is from December 2022, a year and a half later. The BigQuery github_repos schema was last updated on November 27, 2022, and therefore we cannot use it to further extend this repository.
See data sheet
Contains the commits, including their messages, from the two months after the source code was extracted. The files involved in the commits are given too.
The goal of this data is to enable coming up with new process metrics.
The samples are split into train, validation and test files. Splits are done with a pseudo-random function to ease reproducibility and to be backward and forward compatible.
The first split is by code repository. The repositories are split and appear in different directories as zipped files. Inside each file, the samples (which are files in the subject repository) are also split into train, validation and test.
Note that the samples of the different splits should have similar properties. However, when splitting by repository, the repositories are different and do not have to have similar properties. This split is suitable for investigating domain adaptation.
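One common way to get a reproducible, backward and forward compatible split is to hash a stable key. The sketch below shows the idea at both levels (repository and file). The hash function, the 80/10/10 ratios and the function names are assumptions made for illustration; they are not the actual function used to generate this data.

```python
import hashlib


def assign_split(key, train=0.8, validation=0.1):
    """Map a string key to 'train', 'validation' or 'test' deterministically."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    # Turn the first 8 hex digits into a number in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    if bucket < train:
        return "train"
    if bucket < train + validation:
        return "validation"
    return "test"


def split_sample(repo_name, file_path):
    """Return (repository split, file split) for one sample."""
    repo_split = assign_split(repo_name)
    file_split = assign_split(repo_name + "/" + file_path)
    return repo_split, file_split
```

Because the assignment depends only on the key, adding repositories or files later does not move existing samples between splits.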
If you publish results, please report them on the test set so they will be comparable to others. In case the volume of the train set is too high, you can use the validation set instead.
file_labels contains the process metrics for each file. Note that there are many metrics so the data set can be used for many tasks.
The program_repair file contains files modified once in the observed period. That makes them suitable for program repair (as a bug fix or a refactor).
previous_file_properties includes the file properties computed on the period prior to the extraction.
We compute pairs of easy/hard files. The pairs are generated by three labeling functions:
- Files from a developer's first period in the project vs. later.
- Files with many/few bugs by the same developer.
- Files written quickly vs. over a longer time by the same developer.
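To make the idea of a labeling function concrete, here is a minimal sketch of the second one (many/few bugs by the same developer). The input layout, the `min_gap` threshold and the function name are hypothetical; the actual labeling functions in this repository may differ.

```python
from collections import defaultdict
from itertools import combinations


def easy_hard_pairs(files, min_gap=0.2):
    """Return (easy_file, hard_file) pairs written by the same developer.

    `files` is a list of (file_name, developer, bug_rate) tuples.
    A pair is kept only when the bug rates differ by at least `min_gap`.
    """
    by_developer = defaultdict(list)
    for name, developer, bug_rate in files:
        by_developer[developer].append((name, bug_rate))

    pairs = []
    for developer_files in by_developer.values():
        for (name_a, rate_a), (name_b, rate_b) in combinations(developer_files, 2):
            if abs(rate_a - rate_b) >= min_gap:
                # The file with the lower bug rate is labeled as the easy one.
                easy, hard = (name_a, name_b) if rate_a < rate_b else (name_b, name_a)
                pairs.append((easy, hard))
    return pairs
```

Comparing files of the same developer keeps the developer's skill fixed, so the difference is more likely to come from the file itself.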
The evolution of code enables defining a similarity function over files. Different versions of the same file are similar, other files are not. Of course, the negative set is much larger, so we require the negatives to be files from the same project and directory, and downsample them. The content files in the code_similarity files contain those changed in the next version (e.g., August). The rest of the files should be taken from the current version (e.g., June).
The similarity file contains the labels.
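The pair construction described above can be sketched as follows. Paths are assumed to look like `project/directory/File.java`; the function name, the `negatives_per_positive` parameter and the seed are hypothetical and only illustrate the sampling scheme, not the code that produced the similarity file.

```python
import random


def build_similarity_pairs(current, next_version, negatives_per_positive=1, seed=0):
    """Return (changed_path, other_path, label) triples.

    `current` is the set of file paths in the current version (e.g., June).
    `next_version` is the set of paths whose content changed in the next
    version (e.g., August). Label 1 means two versions of the same file,
    label 0 means a different file from the same project and directory.
    """
    rng = random.Random(seed)
    pairs = []
    for path in sorted(next_version):
        if path not in current:
            continue
        pairs.append((path, path, 1))  # next version vs. current version
        directory = path.rsplit("/", 1)[0]
        # Negative candidates: other current files in the same directory.
        candidates = sorted(
            p for p in current
            if p != path and p.rsplit("/", 1)[0] == directory
        )
        # Downsample the negatives to keep the classes balanced.
        k = min(negatives_per_positive, len(candidates))
        for other in rng.sample(candidates, k):
            pairs.append((path, other, 0))
    return pairs
```

Restricting negatives to the same directory makes them harder to tell apart from positives than files drawn at random.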