What happened
When running data collection pipelines for Jira, we observed that the _raw_jira_api_epics table in the DevLake database continuously accumulates duplicate entries for the same Jira epics across multiple collection runs. Each subsequent successful collection run adds a new batch of raw data for epics, but the older, seemingly identical data for the same epics from previous runs is not removed or updated.
This unchecked growth of duplicate raw data in _raw_jira_api_epics causes a significant performance problem. The extractEpics subtask, which presumably processes this raw data, takes progressively longer to complete with each collection run because of the growing volume of redundant rows it has to handle.
What do you expect to happen
We expect DevLake to manage the data in the _raw_jira_api_epics table in a way that prevents the indefinite accumulation of identical duplicate records across collection runs for the same source data.
Ideally, on subsequent collection runs for the same Jira connection and boards:
- The system should avoid inserting data that is an exact duplicate of what is already present for a given epic.
- Alternatively, old raw data for epics could be replaced or purged before or after inserting fresh data, ensuring the raw table doesn't grow indefinitely with duplicates.
Preventing this accumulation of duplicates in the raw table should resolve the observed performance degradation and reduce the execution time of the extractEpics subtask to a consistent level.
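The second option above (purging old raw rows before inserting fresh ones) could look roughly like the following. This is only a sketch and not DevLake's actual implementation: it uses an in-memory SQLite database as a stand-in, the column layout (params, data, url) is an assumption about the raw table, and replace_raw_epics is a hypothetical helper name.

```python
import json
import sqlite3

TABLE = "_raw_jira_api_epics"


def create_raw_table(conn):
    # Assumed, simplified column layout; the real raw table may differ.
    conn.execute(
        f"""CREATE TABLE {TABLE} (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            params TEXT NOT NULL,
            data TEXT NOT NULL,
            url TEXT NOT NULL
        )"""
    )


def replace_raw_epics(conn, params, epics):
    """Delete earlier rows for this scope, then insert the fresh batch."""
    with conn:  # one transaction: both statements commit or neither does
        conn.execute(f"DELETE FROM {TABLE} WHERE params = ?", (params,))
        conn.executemany(
            f"INSERT INTO {TABLE} (params, data, url) VALUES (?, ?, ?)",
            [(params, json.dumps(epic), epic["self"]) for epic in epics],
        )


def row_count(conn):
    return conn.execute(f"SELECT COUNT(*) FROM {TABLE}").fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_raw_table(conn)
    # Hypothetical scope key and epic payloads, for illustration only.
    scope = json.dumps({"ConnectionId": 1, "BoardId": 8})
    epics = [
        {"id": 1, "self": "https://jira.example.com/epic/1"},
        {"id": 2, "self": "https://jira.example.com/epic/2"},
    ]
    for _ in range(3):  # three collection runs for the same scope
        replace_raw_epics(conn, scope, epics)
    print(row_count(conn))
```

With this approach the row count for a given connection and board stays equal to the number of epics in the latest run, so the amount of data the extraction step has to read would no longer depend on how many runs came before.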
How to reproduce
- Set up an Apache DevLake instance.
- Configure a data connection to a Jira instance that contains some epics.
- Create and run a data collection pipeline using the configured Jira connection for one or more boards containing epics.
- After the first run completes successfully, trigger and run the same DevLake collection pipeline for the same Jira connection and boards again.
- Repeat the previous step several more times (e.g., 2-3 times).
- Observe the execution time of the extractEpics subtask in the later runs compared to the first run; it should show a noticeable increase.
- Inspect the contents of the _raw_jira_api_epics table in the DevLake database after multiple runs. You should find multiple rows with identical content (representing the same Jira epic, e.g., identified by the same URL), confirming the presence of duplicate data.
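The inspection in the last step amounts to grouping the raw rows by URL and counting them. Below is a minimal, self-contained sketch of that check. It runs against an in-memory SQLite stand-in rather than a real DevLake database, the column layout is an assumption, and simulate_runs is a hypothetical helper that only mimics append-only collection.

```python
import sqlite3

# Assumed, simplified column layout; the real raw table may differ.
SCHEMA = """
CREATE TABLE _raw_jira_api_epics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    params TEXT,
    data TEXT,
    url TEXT
)
"""


def find_duplicate_epics(conn):
    """Return (url, copies) for every URL stored more than once."""
    return conn.execute(
        """
        SELECT url, COUNT(*) AS copies
        FROM _raw_jira_api_epics
        GROUP BY url
        HAVING COUNT(*) > 1
        ORDER BY copies DESC, url
        """
    ).fetchall()


def simulate_runs(conn, urls, runs):
    """Mimic append-only collection: every run inserts every epic again."""
    for _ in range(runs):
        conn.executemany(
            "INSERT INTO _raw_jira_api_epics (params, data, url) VALUES (?, ?, ?)",
            [("{}", "{}", url) for url in urls],
        )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    # Hypothetical epic URLs, for illustration only.
    epic_urls = [
        "https://jira.example.com/epic/1",
        "https://jira.example.com/epic/2",
    ]
    simulate_runs(conn, epic_urls, runs=3)
    for url, copies in find_duplicate_epics(conn):
        print(url, copies)
```

The same GROUP BY / HAVING query should carry over to the actual DevLake database with little or no change, provided the raw table has a url column as assumed here.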
Anything else
No response
Version
main