fix(circleci-plugin): pipeline collector time range #7820
Conversation
// CircleCI API will return a 404 response for a workflow/job that has been deleted
// due to their data retention policy. We should ignore these errors.
if res.StatusCode == http.StatusNotFound {
	return api.ErrIgnoreAndContinue
Shouldn't it be api.ErrFinishCollect? If the cause is data retention, there would be no more data to collect, would there?
With api.ErrFinishCollect, the collector could miss workflows/jobs that are still available if the CI data isn't processed exactly in order from newest to oldest. With api.ErrIgnoreAndContinue, those rows would still be collected if that happens.
E.g. say there are 600 pipelines to be collected and 1 has been deleted. They end up out of order, and the deleted record gets processed as 300/600. The DevLake pipeline would end with 300 rows uncollected.
There shouldn't be a significant number of deleted workflows and jobs in any attempted collection (as the associated pipelines will also have been deleted), so the increase in DevLake pipeline duration should be minimal and worth the reduced risk of missing rows.
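To make the 600-pipeline example concrete, here is a minimal, self-contained sketch of the two behaviours. It is not the plugin's code: `errIgnoreAndContinue`, `errFinishCollect`, `collect`, and `simulate` are hypothetical stand-ins that only mirror the names used in this thread, and the collector loop is heavily simplified.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Hypothetical stand-ins for the two sentinel errors discussed in this thread.
// They are NOT DevLake's definitions; only the names are borrowed.
var (
	errIgnoreAndContinue = errors.New("ignore this input and continue")
	errFinishCollect     = errors.New("stop collecting")
)

// collect walks the inputs in order. statuses[i] is the simulated HTTP status
// for input i, and onNotFound is the sentinel applied when a 404 is seen.
// It returns how many inputs were collected.
func collect(statuses []int, onNotFound error) int {
	collected := 0
	for _, status := range statuses {
		if status == http.StatusNotFound {
			if errors.Is(onNotFound, errFinishCollect) {
				// Everything after this input is skipped.
				return collected
			}
			// Only this input is dropped.
			continue
		}
		collected++
	}
	return collected
}

// simulate reproduces the example: 600 inputs, with the deleted one
// processed as number 300 (index 299).
func simulate() (withIgnore, withFinish int) {
	statuses := make([]int, 600)
	for i := range statuses {
		statuses[i] = http.StatusOK
	}
	statuses[299] = http.StatusNotFound
	return collect(statuses, errIgnoreAndContinue), collect(statuses, errFinishCollect)
}

func main() {
	withIgnore, withFinish := simulate()
	fmt.Println("ignore and continue collected:", withIgnore)
	fmt.Println("finish collect collected:", withFinish)
}
```

In this sketch, ignoring the 404 collects 599 of the 600 inputs, while finishing on the 404 collects only the 299 inputs before it and leaves the remaining 300 available rows uncollected.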
Aah, I see. It makes total sense.
Sorry for taking so long; I was on vacation and somehow missed the notification.
Hey @klesh, is there an ETA on when this PR might be reviewed/merged?
@Nickcw6 Thanks for your contribution.
@klesh looks like this was never cherry-picked into any beta or release. It's missing from 1.0.1 and from any of the 1.0.2 betas.
* fix(circleci-plugin): only collect pipelines from after data time range
* fix(circleci-plugin): ignore 404 not found when collecting jobs or workflows

* feat(circleci-plugin): incremental data collection (#7986)
* feat(api_collector_stateful): handle case were response has records from both before & after createdAfter of last collection
* feat(circleci-plugin): incremental collection for pipelines
* feat(api_collector_stateful): expose Input collector arg for StatefulFinalizableEntity to collect data based on previous collection
* feat(circleci-plugin): incremental data collection for workflows
* feat(circleci-plugin): incremental data collection for jobs
* refactor(circleci-plugin): use common query param function
* refactor(circleci-plugin): use BuildQueryParamsWithPageToken func when collecting unfinished job details
* fix(circleci-plugin): pipeline collector time range (#7820)
* fix(circleci-plugin): only collect pipelines from after data time range
* fix(circleci-plugin): ignore 404 not found when collecting jobs or workflows
* Cleanup from bad merge

---------

Co-authored-by: Nick Williams <65220492+Nickcw6@users.noreply.github.com>
Co-authored-by: John Ibsen <john@videa.ai>
Summary
Only collect pipelines from after the timeAfter config of the DevLake pipeline
Does this close any open issues?
Closes #7797
Screenshots
Before:
Running the DevLake pipeline with the time range set to 2024-06-01 would pull in results from Feb (i.e. going back 180 days until hitting CircleCI's data retention policy for my org):


After:
Only results from 2024-06-01 onwards are considered:

Other Information
May have uncovered another bug in the CircleCI plugin whereby existing data is overwritten by a subsequent collection, see this comment. Confirmed as intended behaviour for a full sync plugin.