av9ash/gitbugs

GitBugs

License and Citation

This project is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
If you use or reuse this work, please cite the following:

@article{patil2025gitbugs,
  title={GitBugs: Bug Reports for Duplicate Detection, Retrieval Augmented Generation, Triage, and More},
  author={Patil, Avinash},
  journal={arXiv preprint arXiv:2504.09651},
  year={2025}
}

For more details on the license, visit CC BY 4.0.

Bug Report Datasets: Bugs from some popular open source projects.

Column Description

Summary: Short description of the issue.

Issue ID: Unique identifier.

Status: The current state of the issue (e.g., Open, Resolved).

Priority: The assigned importance level (e.g., Blocker, Critical).

Resolution: The outcome of the issue (e.g., Fixed, Duplicate). Some values are missing.

Created: The timestamp when the issue was created.

Resolved: The timestamp when the issue was resolved. Many are missing.

Affects Version/s: The project version(s) affected by the bug (for example, the Hadoop version in the Hadoop dataset).

Description: A detailed description of the bug. Some values are missing.
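
The column layout above can be explored with a short script. The following is a minimal sketch using only the Python standard library; the inline sample rows, the comma delimiter, the column order, and the timestamp format are all assumptions made for illustration and may not match the actual dataset files.

```python
import csv
import io

# Hypothetical sample in the column layout described above; the real files
# may use a different delimiter, column order, or timestamp format.
SAMPLE_CSV = """Summary,Issue ID,Status,Priority,Resolution,Created,Resolved,Affects Version/s,Description
NPE on startup,1001,Resolved,Critical,Fixed,2021-01-05 10:00:00,2021-01-07 12:30:00,3.2.0,Crash when config is empty
Crash at launch,1002,Resolved,Major,Duplicate,2021-01-06 09:15:00,2021-01-06 11:00:00,3.2.0,
Slow shuffle,1003,Open,Minor,,2021-02-01 08:00:00,,,Shuffle stage takes too long
"""


def load_reports(text):
    """Parse CSV text into a list of dicts, mapping empty cells to None."""
    reader = csv.DictReader(io.StringIO(text))
    return [{k: (v if v else None) for k, v in row.items()} for row in reader]


def summarize(reports):
    """Count duplicate-resolved reports and missing values per column."""
    duplicates = sum(1 for r in reports if r["Resolution"] == "Duplicate")
    missing = {col: sum(1 for r in reports if r[col] is None) for col in reports[0]}
    return {"total": len(reports), "duplicates": duplicates, "missing": missing}


reports = load_reports(SAMPLE_CSV)
stats = summarize(reports)
print(stats["total"], stats["duplicates"], stats["missing"]["Resolved"])
```

Mapping empty cells to `None` makes the missing values noted for Resolution, Resolved, and Description explicit, so later steps can skip them instead of treating them as empty strings.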

Project Total Bug Reports Duplicates
Cassandra 4,612 300
Firefox 28,824 6,255
Hadoop 2,503 128
HBase 5,403 108
Mozilla Core 85,673 17,899
VS Code 32,829 9,272
SeaMonkey 1,076 120
Spark 20,275 497
Thunderbird 15,192 4,200
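
The share of duplicates differs a lot between projects, which matters when choosing a dataset for duplicate detection. As a small illustration, the sketch below computes the duplicate rate per project from the counts in the table; the dictionary and the function name are introduced here for the example only.

```python
# Counts copied from the table above: (total bug reports, duplicates).
COUNTS = {
    "Cassandra": (4612, 300),
    "Firefox": (28824, 6255),
    "Hadoop": (2503, 128),
    "HBase": (5403, 108),
    "Mozilla Core": (85673, 17899),
    "VS Code": (32829, 9272),
    "SeaMonkey": (1076, 120),
    "Spark": (20275, 497),
    "Thunderbird": (15192, 4200),
}


def duplicate_rates(counts):
    """Return the fraction of reports marked as duplicates for each project."""
    return {project: dup / total for project, (total, dup) in counts.items()}


rates = duplicate_rates(COUNTS)
# Print projects from the highest to the lowest duplicate rate.
for project, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{project:<14}{rate:6.1%}")
```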

More information copied from Logpai/Bugrepo

BugRepo

BugRepo maintains a collection of bug reports that are publicly available for research purposes. Bug reports are a main data source for facilitating NLP-based research in software engineering. We categorize the datasets into the following research directions.

Newer Publications

Previous Publications

2. Bug localization

Bug localization is the process of mapping a bug report to the corresponding buggy source file. This dataset contains bug reports, commit history, and API descriptions for six open source Java projects: Eclipse Platform UI, SWT, JDT, AspectJ, Birt, and Tomcat. The dataset is currently available here.

Project Timespan #Bugs mapped
AspectJ 2002-03-13 ~ 2014-01-10 593
Birt 2005-06-14 ~ 2013-12-19 4,178
Eclipse 2001-10-10 ~ 2014-01-17 6,495
JDT 2001-10-10 ~ 2014-01-14 6,274
SWT 2002-02-19 ~ 2014-01-17 4,151
Tomcat 2002-07-06 ~ 2014-01-18 1,056
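
One common baseline for this mapping is to rank source files by textual similarity to the bug report. The sketch below uses TF-IDF weighting with cosine similarity in plain Python. It is not the method of any particular publication, and the file names and file contents are hypothetical stand-ins for real source files.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) with smoothed IDF."""
    tokenized = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter(term for counts in tokenized for term in counts)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in counts.items()} for counts in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_files(bug_report, files):
    """Return (score, file name) pairs sorted from most to least similar."""
    names = list(files)
    vectors = tfidf_vectors([bug_report] + [files[n] for n in names])
    query, file_vectors = vectors[0], vectors[1:]
    scored = [(cosine(query, v), name) for name, v in zip(names, file_vectors)]
    return sorted(scored, reverse=True)


# Hypothetical file contents, reduced to a few keywords each.
files = {
    "ui/TableViewer.java": "table viewer refresh selection column widget",
    "core/Parser.java": "parser token syntax tree compile",
    "net/HttpClient.java": "http client socket connection timeout",
}
ranking = rank_files("Table selection is lost after refresh in the viewer", files)
print(ranking[0][1])
```

In practice the commit history and API descriptions mentioned above would give additional signals beyond raw text overlap; this sketch only covers the text-similarity part.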

Publications

3. Bug triaging

Given a software bug report, bug triaging is the process of identifying an appropriate developer to fix the bug. Automatic bug triaging can be formulated as a classification problem that takes the bug title and description as input and maps them to one of the available developers (the class labels). The dataset is currently available here.

Project #Bugs #Bugs for classifier
Chromium 383,104 118,643
Mozilla Core 314,388 128,215
Firefox 162,307 24,214
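
To make the classification framing concrete, here is a minimal sketch of a multinomial naive Bayes classifier written in plain Python. It is one simple choice of classifier, not the approach used by the dataset authors, and the training reports and developer names (`alice`, `bob`) are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTriager:
    """Multinomial naive Bayes over bag-of-words, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # developer -> token counts
        self.doc_counts = Counter()              # developer -> number of reports
        self.vocab = set()

    def fit(self, reports):
        """Train on (title, description, developer) triples."""
        for title, description, developer in reports:
            tokens = tokenize(title + " " + description)
            self.word_counts[developer].update(tokens)
            self.doc_counts[developer] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, title, description):
        """Return the developer with the highest log posterior."""
        tokens = tokenize(title + " " + description)
        total_docs = sum(self.doc_counts.values())
        best_dev, best_score = None, -math.inf
        for dev, n_docs in self.doc_counts.items():
            counts = self.word_counts[dev]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_dev, best_score = dev, score
        return best_dev


# Hypothetical training data: (title, description, developer).
train = [
    ("Crash in renderer", "GPU process crashes when rendering canvas", "alice"),
    ("Canvas flicker", "Rendering glitch on canvas resize", "alice"),
    ("Login fails", "Network request times out during sign in", "bob"),
    ("Proxy error", "Network proxy settings ignored for requests", "bob"),
]
triager = NaiveBayesTriager().fit(train)
print(triager.predict("Canvas crash", "Renderer crashes on resize"))
```

With hundreds of thousands of reports and many developers, as in the table above, a real setup would also need to handle rare developers and class imbalance, which this sketch ignores.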

Publications

4. Bug-fixing time estimation

The bug report datasets hosted in this repository contain detailed information about bug fixing time tracking, which can thus be used for research on bug-fixing time estimation.
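
As an illustration, the fixing time of a report can be derived from its Created and Resolved timestamps. The sketch below assumes a `YYYY-MM-DD HH:MM:SS` timestamp format and uses made-up records; the actual files may use a different format, and reports without a Resolved timestamp are skipped.

```python
from datetime import datetime
from statistics import median

# Assumed timestamp format; adjust to match the actual dataset files.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def fixing_time_hours(created, resolved):
    """Return hours between creation and resolution, or None if either is missing."""
    if not created or not resolved:
        return None
    start = datetime.strptime(created, TIMESTAMP_FORMAT)
    end = datetime.strptime(resolved, TIMESTAMP_FORMAT)
    return (end - start).total_seconds() / 3600.0


# Hypothetical (Created, Resolved) pairs; the last report is still unresolved.
records = [
    ("2021-01-05 10:00:00", "2021-01-07 12:30:00"),
    ("2021-01-06 09:15:00", "2021-01-06 11:00:00"),
    ("2021-02-01 08:00:00", None),
]
durations = [fixing_time_hours(c, r) for c, r in records]
resolved_only = [d for d in durations if d is not None]
print(durations, median(resolved_only))
```

These durations can then serve as regression targets, or be bucketed into classes such as fast and slow fixes.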

Publications

5. Bug information mining

Lamkanfi et al. [MSR'13] contributed a dataset of over 200,000 reported bugs extracted from the Eclipse and Mozilla projects. In addition to a single snapshot of each bug report, the dataset includes all the incremental modifications made during the report's lifetime. The dataset is currently available here.

Project #Components #Bugs
Eclipse Platform 22 24,775
JDT 6 10,814
CDT 20 5,640
GEF 5 5,655
Mozilla Core 137 74,292
Firefox 47 69,879
Thunderbird 23 19,237
Bugzilla 21 4,616

Publications

  • [MSR'13] Ahmed Lamkanfi, Javier Perez, and Serge Demeyer. The Eclipse and Mozilla Defect Tracking Dataset: A Genuine Dataset for Mining Bug Information. International Working Conference on Mining Software Repositories (MSR), 2013.
