[Bug][CircleCI Plugin] CircleCI pipelines collected from before time range #7797

@Nickcw6

Description

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

When collecting CircleCI pipelines, the time range specified in the sync policy has no effect on the data collected - pipelines are collected from before the specified date.

For example, with the sync policy set to collect from 1st June 2024:
[Screenshot: 2024-07-30 at 16 34 46]

[Screenshot: 2024-07-30 at 16 53 41]

Excerpt of the data JSON blob from the top row. It has created_at and updated_at values of 1st Feb 2024 (i.e. 180 days before today's date, 2024-07-30):

{
    "id" : "eae60b4c-7dcc-4293-8b00-45f18a494881",
    "updated_at" : "2024-02-01T14:11:31.988Z",
    "created_at" : "2024-02-01T14:11:31.988Z",
    ...
    "trigger" : {
      "received_at" : "2024-02-01T14:11:31.481Z",
      "type" : "webhook",
    ...
    },
    ...
  }
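The 180-day figure can be checked with a quick date calculation. This is a minimal Python sketch that uses only the dates quoted in this report; the variable names are illustrative.

```python
from datetime import date, timedelta

today = date(2024, 7, 30)             # date of this report
oldest = today - timedelta(days=180)  # 180-day retention window observed for this account
range_start = date(2024, 6, 1)        # start of the sync policy time range

print(oldest)                # 2024-02-01
print(oldest < range_start)  # True: the oldest record predates the range start
```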

What do you expect to happen

No CircleCI pipelines, workflows or jobs are collected from before the time range start point.

How to reproduce

  1. Set the sync policy time range to start more recently than the CircleCI data retention limit (e.g. if the retention period is 90 days, set the time range to the last 30 days - see below).
  2. Run the DevLake pipeline
  3. Sort the _raw_circleci_api_pipelines table by the created_at JSON property of the data column:
SELECT *
FROM _raw_circleci_api_pipelines
ORDER BY STR_TO_DATE(JSON_UNQUOTE(JSON_EXTRACT(CONVERT(data USING utf8mb4), '$.created_at')), '%Y-%m-%dT%H:%i:%s.%fZ') ASC;
  4. Compare the created_at property to the start date set in the sync policy time range.
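The comparison in the last step can also be scripted. The sketch below is a rough Python illustration, not part of DevLake: it assumes the data column values have been exported as JSON strings, that created_at always carries fractional seconds as in the excerpt above, and the helper names (parse_created_at, rows_before) and the second sample row are hypothetical.

```python
import json
from datetime import datetime, timezone

def parse_created_at(blob: str) -> datetime:
    """Parse the created_at field of one raw pipeline JSON blob."""
    raw = json.loads(blob)["created_at"]
    # Timestamps in the excerpt look like 2024-02-01T14:11:31.988Z (UTC).
    return datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)

def rows_before(blobs, range_start):
    """Return the blobs whose created_at is earlier than range_start."""
    return [b for b in blobs if parse_created_at(b) < range_start]

# Start of the sync policy time range used in this report.
range_start = datetime(2024, 6, 1, tzinfo=timezone.utc)

sample = [
    # First row is taken from the excerpt above; second row is made up for contrast.
    '{"id": "eae60b4c-7dcc-4293-8b00-45f18a494881", "created_at": "2024-02-01T14:11:31.988Z"}',
    '{"id": "hypothetical-in-range", "created_at": "2024-06-15T09:00:00.000Z"}',
]

offenders = rows_before(sample, range_start)
print(len(offenders))  # 1
```

Any non-empty result means pipelines from before the time range were collected.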

Anything else

This is partly a side effect of the recent pagination fix on the plugin (#7770). Pagination now works, but because the CircleCI API does not offer any date range controls for pagination, the collector loops through the pages until next_page_token is null. That only happens once the data retention limit for the account is reached (for me it is 180 days, but it could be shorter or longer).

When the collector subsequently attempts to collect the workflows and jobs for such a pipeline, any data point that falls outside the data retention range can return a 404 and fail the DevLake pipeline. This is a race condition between the collector and CircleCI cleaning up build data:

subtask collectWorkflows ended unexpectedly Wraps: (2) Retry exceeded 3 times calling /v2/pipeline/6b7c4513-56bd-4e0c-ad72-d562df7513b1/workflow. The last error was: Http DoAsync error calling [method:GET path:/v2/pipeline/6b7c4513-56bd-4e0c-ad72-d562df7513b1/workflow query:map[]]. Response: {:message "Pipeline not found"} (404) Error types: (1) *hintdetail.withDetail (2) *errors.errorString

There needs to be an additional check that the created_at property of the returned pipelines is not before the specified time range starting point.
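One possible shape for that check is sketched below in Python, purely as an illustration (the plugin itself is not written in Python, and this is not its actual code). It assumes each page is a dict with an items list and a next_page_token, and that pipelines come back newest first, so collection can stop at the first pipeline older than the range start. fetch_page, collect_pipelines, and the PAGES fixture are hypothetical.

```python
from datetime import datetime, timezone

def parse_ts(value: str) -> datetime:
    """Parse a timestamp such as 2024-02-01T14:11:31.988Z as UTC."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)

def collect_pipelines(fetch_page, range_start):
    """Collect pipelines created at or after range_start.

    fetch_page(token) is a hypothetical callable returning
    {"items": [...], "next_page_token": str or None}.
    """
    kept = []
    token = None
    while True:
        page = fetch_page(token)
        for item in page["items"]:
            if parse_ts(item["created_at"]) < range_start:
                # Assumes newest-first ordering: everything after this is older.
                return kept
            kept.append(item)
        token = page.get("next_page_token")
        if token is None:
            return kept

# Made-up pages standing in for the API; only the last timestamp comes from this report.
PAGES = {
    None: {"items": [{"id": "a", "created_at": "2024-07-01T00:00:00.000Z"},
                     {"id": "b", "created_at": "2024-06-02T00:00:00.000Z"}],
           "next_page_token": "p2"},
    "p2": {"items": [{"id": "c", "created_at": "2024-05-20T00:00:00.000Z"},
                     {"id": "d", "created_at": "2024-02-01T14:11:31.988Z"}],
           "next_page_token": "p3"},
    "p3": {"items": [], "next_page_token": None},
}

calls = []

def fake_fetch(token):
    calls.append(token)  # record which pages were requested
    return PAGES[token]

result = collect_pipelines(fake_fetch, datetime(2024, 6, 1, tzinfo=timezone.utc))
print([p["id"] for p in result])  # ['a', 'b']
print(calls)                      # [None, 'p2']
```

Stopping early also means the third page is never requested, so fewer pipelines near the retention limit would be passed on to the workflow and job collectors. If the ordering assumption does not hold, the early return would need to become a per-item skip instead.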

Version

44c3ecb

Are you willing to submit PR?

  • Yes I am willing to submit a PR!


Labels

  • component/plugins: This issue or PR relates to plugins
  • pr-type/bug-fix: This PR fixes a bug
  • severity/p1: This bug affects functionality or significantly affect ux
  • type/bug: This issue is a bug
