[Bug][Jira] _raw_jira_api_epics table accumulates duplicate data across runs, causing extractEpics subtask performance degradation #8409

@narrowizard

Search before asking

  • I searched the existing issues and found no similar ones.

What happened

When running Jira data collection pipelines, we observed that the _raw_jira_api_epics table in the DevLake database accumulates duplicate entries for the same Jira epics across collection runs. Each successful run adds a new batch of raw epic data, but the seemingly identical data for the same epics from previous runs is neither removed nor updated.

This unchecked growth causes a significant performance problem. The extractEpics subtask, which presumably processes this raw data, takes longer to complete with each collection run because of the growing volume of redundant rows it has to handle.
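
To make the growth pattern concrete, here is a toy model of the symptom using an in-memory SQLite database. It is not DevLake's collector code: the simplified columns (`params`, `url`, `data`) and the helper names are assumptions made for illustration only.

```python
import sqlite3


def make_raw_table(conn):
    # Simplified, assumed stand-in for the _raw_jira_api_epics table.
    conn.execute(
        "CREATE TABLE _raw_jira_api_epics ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, params TEXT, url TEXT, data TEXT)"
    )


def collect_append_only(conn, params, epics):
    # Mimics the reported symptom: every run inserts all epics again
    # without touching rows left behind by earlier runs.
    conn.executemany(
        "INSERT INTO _raw_jira_api_epics (params, url, data) VALUES (?, ?, ?)",
        [(params, url, data) for url, data in epics],
    )
    conn.commit()


def row_counts_over_runs(runs, epics):
    # Returns the table's row count after each simulated collection run.
    conn = sqlite3.connect(":memory:")
    make_raw_table(conn)
    counts = []
    for _ in range(runs):
        collect_append_only(conn, "hypothetical-connection/board", epics)
        total = conn.execute("SELECT COUNT(*) FROM _raw_jira_api_epics").fetchone()[0]
        counts.append(total)
    conn.close()
    return counts
```

With two unchanged epics, the row count grows by two on every run, so any step that scans the whole raw table does work proportional to the number of runs rather than the number of epics.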

What do you expect to happen

We expect DevLake to manage the data in the _raw_jira_api_epics table in a way that prevents the indefinite accumulation of identical duplicate records across collection runs for the same source data.

Ideally, on subsequent collection runs for the same Jira connection and boards:

  1. The system should avoid inserting data that is an exact duplicate of what is already present for a given epic.
  2. Alternatively, old raw data for epics could be replaced or purged before or after inserting fresh data, ensuring the raw table doesn't grow indefinitely with duplicates.

Preventing this accumulation of duplicates in the raw table should resolve the observed performance degradation and reduce the execution time of the extractEpics subtask to a consistent level.
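
The two options above could look roughly like the following sketch. It runs against an in-memory SQLite table with an assumed, simplified schema (`params`, `url`, `data`); the function names are hypothetical and this is not how DevLake's collector is implemented today.

```python
import sqlite3

# Assumed, simplified schema; the real raw table may have other columns.
SCHEMA = """
CREATE TABLE _raw_jira_api_epics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    params TEXT NOT NULL,
    url TEXT NOT NULL,
    data TEXT NOT NULL
)
"""

INSERT_SQL = "INSERT INTO _raw_jira_api_epics (params, url, data) VALUES (?, ?, ?)"


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def collect_skip_duplicates(conn, params, epics):
    # Option 1: insert a row only if no identical (params, url, data) row exists.
    with conn:
        for url, data in epics:
            conn.execute(
                "INSERT INTO _raw_jira_api_epics (params, url, data) "
                "SELECT ?, ?, ? WHERE NOT EXISTS ("
                "SELECT 1 FROM _raw_jira_api_epics "
                "WHERE params = ? AND url = ? AND data = ?)",
                (params, url, data, params, url, data),
            )


def collect_purge_then_insert(conn, params, epics):
    # Option 2: delete the previous raw rows for this scope, then insert the
    # fresh batch, all inside one transaction.
    with conn:
        conn.execute("DELETE FROM _raw_jira_api_epics WHERE params = ?", (params,))
        conn.executemany(INSERT_SQL, [(params, url, data) for url, data in epics])


def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM _raw_jira_api_epics").fetchone()[0]
```

Option 1 still keeps older versions of an epic whose payload has changed, so the table can grow slowly over time. Option 2 keeps exactly one batch per scope but discards raw history. Which trade-off fits DevLake is a question for the maintainers.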

How to reproduce

  1. Set up an Apache DevLake instance.
  2. Configure a data connection to a Jira instance that contains some epics.
  3. Create and run a data collection pipeline using the configured Jira connection for one or more boards containing epics.
  4. After the first run completes successfully, trigger and run the same DevLake collection pipeline for the same Jira connection and boards again.
  5. Repeat step 4 multiple times (e.g., 2-3 more times).
  6. Observe the execution time of the extractEpics subtask in the later runs compared to the first run; it should show a noticeable increase.
  7. Inspect the contents of the _raw_jira_api_epics table in the DevLake database after multiple runs. You should find multiple rows with identical content (representing the same Jira epic, e.g., identified by the same URL), confirming the presence of duplicate data.
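
For step 7, a GROUP BY query is one way to surface the duplicates. The sketch below runs the query against a small in-memory SQLite table so it is self-contained. The `url` column is assumed from the description above, and `find_duplicate_urls` and `demo` are hypothetical helpers. Against a real DevLake database the same SELECT would be run through that database's own client.

```python
import sqlite3

# Lists every URL stored more than once, with the number of copies.
DUPLICATE_QUERY = """
SELECT url, COUNT(*) AS copies
FROM _raw_jira_api_epics
GROUP BY url
HAVING COUNT(*) > 1
ORDER BY copies DESC, url
"""


def find_duplicate_urls(conn):
    # Returns a list of (url, copies) tuples.
    return conn.execute(DUPLICATE_QUERY).fetchall()


def demo():
    # Builds a throwaway table that mimics three append-only collection runs.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE _raw_jira_api_epics ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT, data TEXT)"
    )
    epics = [
        ("https://jira.example.com/epic/1", '{"key": "E-1"}'),
        ("https://jira.example.com/epic/2", '{"key": "E-2"}'),
    ]
    for _ in range(3):
        conn.executemany(
            "INSERT INTO _raw_jira_api_epics (url, data) VALUES (?, ?)", epics
        )
    # One epic that was only collected once, so it must not be reported.
    conn.execute(
        "INSERT INTO _raw_jira_api_epics (url, data) VALUES (?, ?)",
        ("https://jira.example.com/epic/3", '{"key": "E-3"}'),
    )
    conn.commit()
    duplicates = find_duplicate_urls(conn)
    conn.close()
    return duplicates
```

Grouping by `url` alone also flags epics whose payload changed between runs. Adding the payload column to the GROUP BY would restrict the result to exact duplicates.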

Anything else

No response

Version

main

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Labels

  • component/plugins: This issue or PR relates to plugins
  • severity/p2: This bug doesn’t affect the functionality or isn’t evident
  • type/bug: This issue is a bug