[EPIC] - Connectors CI/CD Pipeline in Production #24403

Closed
alafanechere opened this issue Mar 23, 2023 · 12 comments

@alafanechere
Contributor

alafanechere commented Mar 23, 2023

Definition of done:

  • A CI check runs for all the connectors modified in a connector PR
  • A PR can't be merged if one connector test is failing
  • All our connectors can be tested with a single command, both in the CI and locally.
  • Nightly builds are running for alpha, beta and GA connectors on the Connector CI pipeline
  • Connectors health reports are based on dagger pipelines output
  • The /test command is deprecated
  • Documentation on how to test connectors is up to date
  • The Connector Org is onboarded on how to use the Connector CI CLI
  • All the gradle related code is removed from Python and Low code connectors.

Milestone:

  1. ✅ PRs modifying connectors trigger the testing pipelines of the modified connectors
  2. ✅ A twin nightly build of GA and Beta connectors runs with Dagger
  3. 🚧 All connectors (Java + Python) can be tested with Dagger. /test is only required when local CDK / local CAT is required. We can deprecate /test and communicate about it.
  4. 🕓 Successful connector test is required to merge a connector PR
  5. 🕓 Nightly builds provide results consistent with /test and are used to power connector health reports
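
Milestone 1 relies on working out which connectors a PR touches. A minimal sketch of that mapping, assuming connectors live under `airbyte-integrations/connectors/<name>/` and that the list of changed files is already available (the function name and the path constant are illustrative, not the actual pipeline code):

```python
from pathlib import PurePosixPath

# Assumed repository layout: airbyte-integrations/connectors/<connector-name>/...
CONNECTORS_ROOT = ("airbyte-integrations", "connectors")


def modified_connectors(changed_files):
    """Return the sorted names of connectors touched by the changed files."""
    names = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # Only count files that sit inside a connector directory,
        # not files placed directly under the connectors root.
        if len(parts) > 3 and parts[:2] == CONNECTORS_ROOT:
            names.add(parts[2])
    return sorted(names)


if __name__ == "__main__":
    print(modified_connectors([
        "airbyte-integrations/connectors/source-faker/setup.py",
        "docs/README.md",
    ]))
```

Each returned name could then be handed to one test pipeline, so a PR only pays for the connectors it modifies.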
@evantahler evantahler changed the title [EPIC] - Connectors CI in production [EPIC] - Connectors CI/CD in Production Mar 24, 2023
@evantahler evantahler changed the title [EPIC] - Connectors CI/CD in Production [EPIC] - Connectors CI/CD Pipeline in Production Mar 31, 2023
@alafanechere alafanechere self-assigned this Apr 4, 2023
@alafanechere
Contributor Author

Project update 04/04/2023:

✅ What's been completed:

🚧 What got in the way:

  • Testing Java connectors with Gradle / Dagger is tedious and requires "complex" docker-in-docker logic. More details here

🌱 What's next:

❓ Learning / Questions / Changes:
There might be some tradeoffs to make on Java connector testing in terms of performance.

  • Parallel use of the Gradle cache leads to errors.
  • Testing destination means building the normalization, which relies on airbyte-python.gradle plugin that we want to discard.

⛔ Blockers:

  • A new Dagger SDK version should be cut so we can benefit from this fix, which impacts our docker-in-docker use case.

@alafanechere
Contributor Author

@bnchrch do you think we can descope the connector health report generation from this project? I think it's also scoped in the metadata project.
Let me know if we shall adapt our current reporting logic to S3.

@alafanechere
Contributor Author

alafanechere commented Apr 11, 2023

✅ What's been completed:
1. Dagger was officially ratified by our EMs to run connectors CI/CD.
2. I think the PR to test and build all connectors (Java + Python) is ready to be merged.
The logic to build Java connector images will be reusable for Java connector publishing!

  3. The code structure of the project has been revamped to support multiple CI use cases: the metadata project successfully implemented Dagger pipelines on top of the package we've been iterating on:

🚧 What got in the way:

🌱 What's next:
To complete feature parity with /test I think the next steps are:

  1. [EPIC] - Connectors CI/CD Pipeline in Production #24403: Once connectors-ci: test java connectors #24225 is merged, we shall compare nightly build results from /test to the nightly builds run by Dagger. We shall identify and fix discrepancies.
  2. 23978: When CDK changes are made on a branch, the connectors should be tested with the "local" CDK version.

⛔ Blockers:

@alafanechere
Contributor Author

Weekly project update: Connectors CI/CD Pipeline in Production
✅ What's been completed:
The delivery on this project slowed down a bit because Ben and I focused last week on the metadata project

  • All our connectors can be tested (Java / Python / Low Code).
  • We can track discrepancies between /test and Dagger nightly builds results in this dashboard
  • I got access to the EKS cluster to monitor what’s happening on the CI infra and to nuke pods if needed.
  • We upgraded to Dagger 0.5.0 to benefit from recent fixes

🚧 What got in the way:

After merging the PR enabling testing of all our connectors (Java / Python / Low Code), we suffered a huge performance loss due to caching issues in Dagger. We reached out to the Dagger team to understand the root cause. Nuking the Dagger engine and the runner (but not the cache!) made this problem disappear. I’m going to monitor the pipelines closely this week to check whether the problem recurs.

🌱 What's next:
To complete feature parity with /test and /publish I think the next steps are:

⛔ Blockers:

No hard blocker at the moment, but infra instabilities require extra communication and collaboration with the infra / Dagger teams, which can impact the project velocity.

@alafanechere
Contributor Author

Weekly project update: Connectors CI/CD Pipeline in Production

✅ What's been completed:

  • Change the scope of the project to focus on connectors testing and release only: no stories will be done to test CDK or CAT
  • Expose a build command to locally build connectors and load them to the user docker host: airbyte-ci connectors --name=source-faker build
  • Expose a publish command triggered on merge when a metadata.yaml changed (only pre-release images are published ATM).
  • Package normalization in java destinations at build time
  • Write documentation for the CLI and socialize it.
  • Continue optimizing the pipeline following the Dagger team's suggestions.
  • Nightly builds are getting more stable but we're not there yet...

🚧 What got in the way:

  • The project requirements have to adapt to ongoing build logic changes. Parallel efforts of the destination team on normalization have to be reproduced in Dagger pipelines.
  • It's hard to nail a single, broadly applicable way to build connectors: I discovered that our Python connectors are not all using the same base image (alpine is not compatible with all the Python dependencies).
  • The Dagger team closely monitors our nightly build pipelines and discovers new bugs on their side or in Buildkit. For example, a memory leak in Buildkit can cause instabilities. These instabilities make it hard to get repeatable nightly build results and to assess whether something is wrong in our pipelines or on the Dagger/Buildkit side.

🌱 What's next:
To complete feature parity with /test and /publish I think the next steps are:

⛔ Blockers:

No hard blocker at the moment, but infra instabilities require extra communication and collaboration with the infra / Dagger teams, which can impact the project velocity.

@alafanechere
Contributor Author

alafanechere commented May 2, 2023

Weekly project update: Connectors CI/CD Pipeline in Production

✅ What's been completed:

  • /test vs Connectors CI tests pipelines discrepancies: improved stability with Python connector build or packaging tweaks.
  • More stable nightly builds: 5 successful days in a row.
  • We added a new check to make sure a connector version was correctly bumped on a connector PR.
  • Dagger team now has access to our engine logs: will simplify the investigation of the new bugs we hit.
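
A version bump check of this kind can be sketched as a comparison between the version on the main branch and the version on the PR branch. This is only an illustration under the assumption that versions are plain `MAJOR.MINOR.PATCH` strings; the function names are hypothetical and not taken from the actual pipeline:

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '0.2.10' into (0, 2, 10) so that versions compare numerically."""
    return tuple(int(part) for part in version.strip().split("."))


def is_version_bumped(main_version: str, pr_version: str) -> bool:
    """Return True when the PR version is strictly greater than the one on main."""
    return parse_version(pr_version) > parse_version(main_version)


if __name__ == "__main__":
    print(is_version_bumped("0.2.9", "0.2.10"))
```

Comparing integer tuples rather than raw strings matters here: as strings, "0.2.10" sorts before "0.2.9".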

🚧 What got in the way:

  • Iterating on Java connector debugging is slow, but I have a lead that may explain why source-postgres connector tests are failing.

🌱 What's next:
Benefit from the Airbyte Assemble week to pair with other teams to:

  • Finish the packaging of normalization inside destination containers
  • Investigate the integration test failure on source-postgres (and other JDBC sources)
  • Assess the quality of Dagger-built connector images on publish and deprecate /publish
  • Sync with Conor on the infra next steps
  • Meet the Dagger team in person

@alafanechere
Contributor Author

alafanechere commented May 9, 2023

Weekly project update: Connectors CI/CD Pipeline in Production

✅ What's been completed:

🚧 What got in the way

  • Debugging discrepancies between /test and Dagger pipelines is hard because I’m not able to reproduce locally the errors I spot in the Python connector tests. Erik provided helpful suggestions to reproduce locally the errors happening in the CI infra. Nevertheless, I’ll prioritize getting the publish pipeline to production this week and will come back to investigating discrepancies next week.

🌱 What's next:
The immediate milestone we’d like to achieve is replacing /publish with publish on merge.

  • Finish assessing the correctness of the images built by Dagger pipelines.
  • Identify a path to a stable infrastructure for publish pipelines, not corrupted by the nightly tests load.
  • Deprecate /publish (PR ready).

⛔ Blocker

  • The infra health remains questionable. I have the impression that the nightly builds are clogging the engines, or that not enough resources are available to build Java connectors. We hoped the latest improvements we pushed to the cluster last week would mitigate the kind of errors we spotted, but that is not the case. I’ve reported the errors to the Dagger team and they’re on it. I’ll try to use a larger runner for connector publish pipelines.

@alafanechere
Contributor Author

alafanechere commented May 16, 2023

Weekly project update: Connectors CI/CD in production [PUBLISH ON MERGE FOCUS]

✅ What's been accomplished:

Isolate environments

To maximize the publish pipeline availability Conor and the dagger team provisioned a separate set of runners for connector testing. Publish and test are not sharing the same cache anymore, so test should not corrupt the publish environment.

Test

  • We manually assessed the correctness of the connector Docker images built by Dagger by creating connections with pre-release images. The built images look healthy.
  • We wrote unit and integration tests of the publish pipeline to confirm its logical correctness.

Stabilize

  • Invalid specs were written to the spec cache. We now perform spec validation before writing them to GCS.
  • We now pull images after pushing them to ensure the images we push to DockerHub can be consumed.
  • Decrease the reliance of the publish pipeline on docker-in-docker flows that can be unstable.
  • Improve the Java connectors build logic to get an image size comparable to what manage.sh builds.
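
The validate-before-write idea for the spec cache can be sketched as follows. This is an assumption-laden illustration, not the pipeline's code: "valid" is reduced to "parses as JSON and carries a `connectionSpecification` object", and the cache is a plain dict standing in for the GCS bucket (`validate_spec` and `cache_spec` are hypothetical names):

```python
import json


def validate_spec(raw_spec: str) -> dict:
    """Parse a connector spec and reject it if it is obviously malformed."""
    spec = json.loads(raw_spec)  # raises ValueError on malformed JSON
    if not isinstance(spec, dict):
        raise ValueError("spec must be a JSON object")
    # Assumed minimal validity rule for this sketch.
    if not isinstance(spec.get("connectionSpecification"), dict):
        raise ValueError("spec is missing a connectionSpecification object")
    return spec


def cache_spec(raw_spec: str, key: str, cache: dict) -> None:
    """Write the spec to the cache only after it passed validation."""
    spec = validate_spec(raw_spec)
    cache[key] = json.dumps(spec, sort_keys=True)


if __name__ == "__main__":
    fake_cache = {}  # stand-in for the real spec cache
    cache_spec('{"connectionSpecification": {"type": "object"}}', "source-faker", fake_cache)
    print(sorted(fake_cache))
```

Because validation runs before the write, an invalid spec raises and the cache is left untouched.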

Prepare /publish deprecation

We opened a PR in which we:

  • Update our documentation related to /publish
  • Post a deprecation message to /publish users on PRs
  • Rename /publish to /legacy-publish for normalization publication usage or exceptional use in case of bugs.
  • Make the publish on merge pipeline publish main releases and not pre-releases anymore

⛔ Blockers:

I detailed in this doc the potential blockers that can hurt the stability of publish on merge. I’d like to discuss with the team whether we shall consider these hard blockers that prevent us from going live with publish on merge.

🌱 What’s next

Assess with the team if we’re 👍 to make publish on merge go live

@alafanechere
Contributor Author

alafanechere commented May 23, 2023

Weekly project update:

✅ What's been accomplished

  • Publish on merge is live! The pipeline looks healthy and can be monitored on #publish-on-merge-updates
  • The /publish command on PRs is deprecated and the related documentation was updated

⚠️ What got in the way

  • Some connector images got built with layers compressed with an algorithm incompatible with Docker version < 21. Dagger released a new API flag to force the compression algorithm. Resolution was fast but delayed the original rollout plan.
  • Java connectors are not built from a Dockerfile anymore, but some connectors like destination-s3 have custom system dependency installations that we have to reproduce in the pipeline building the connectors with Dagger.

🔮 What's next:

Now that publish on merge is live, we should work on deprecating /test ASAP because the images we publish are not exactly the ones we test.

  • Stabilize the test infrastructure: since the infra isolation between publish and test, the pipelines running on the test infrastructure freeze and lead to GHA timeouts.
  • Stabilize the Java connector testing: improve the Gradle caching logic for concurrent Gradle task runs.
  • Fix any testing discrepancies between /test and automated pipelines
  • Solidify the commit check reporting logic with GitHub API and make the test passing a required check to merge
  • Build connector health report from the new nightly build pipeline results.

@alafanechere
Contributor Author

Weekly project update

✅ What was done

🌱 Next steps:

We’d like to be code complete in terms of pipeline logic this week.
The remaining stories are:

In parallel I expect continued investigation from the Dagger team on our infra problems.

I’d like to focus the next week on:

  • Deprecation strategy for /test
  • Comparing /test nightly build results to Dagger test results and fixing discrepancies, if any
  • Officially deprecating /test if discrepancies are fixed and the infra is stable enough.

⛔ Blockers

The infra instabilities still delay stable nightly builds, which we need for an in-depth comparison of /test results to Dagger test results. Here are the three problems we're actively investigating with the Dagger team:

@alafanechere
Contributor Author

Connectors CI/CD weekly update

✅ What's been accomplished:

  • We overcame the main infrastructure problem that prevented us from running nightly builds
  • We implemented dynamic Gradle dependency resolution to trigger tests on Java connectors if a base project was modified
  • We made some DX improvements: tests are triggered more rapidly on PRs, and pytest logs are written to the local filesystem to ease debugging
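
The dependency-based triggering can be pictured as a reverse walk over a project dependency graph. The sketch below assumes the graph has already been extracted (for instance from Gradle) as a mapping from each project to the projects it depends on; the function name and the project names in the example are hypothetical:

```python
def affected_projects(dependencies, modified):
    """Return every project that depends, directly or transitively, on a modified one.

    dependencies maps a project name to the set of projects it depends on.
    """
    # Invert the graph: for each project, who depends on it?
    dependents = {}
    for project, deps in dependencies.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(project)

    affected = set()
    stack = list(modified)
    while stack:
        current = stack.pop()
        for dependent in dependents.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                stack.append(dependent)
    return affected


if __name__ == "__main__":
    graph = {
        "source-postgres": {"source-jdbc"},
        "source-jdbc": {"base-java"},
        "destination-s3": {"base-java"},
        "source-faker": set(),
    }
    print(sorted(affected_projects(graph, {"base-java"})))
```

With this shape, a change to a shared base project selects every connector built on top of it, while unrelated connectors are skipped.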

⚠️ What got in the way:

  • Some Python connectors relying on other containers are hard to test in Dagger, and we have to rework their tests or slightly change the pipeline logic
  • It was tough to fix the different bugs that were preventing nightly builds from running. A lot of coordination was required with the Dagger team.
  • A DockerHub API downtime led to cascading publish failures on Friday. We are working on a workaround to be less dependent on this service.

🔮 What's next:

  • Create connector health reports based on the airbyte-ci nightly builds.
  • Understand and fix the occasional test results discrepancies
  • Prepare a big PR to deprecate /test and make airbyte-ci the default
  • Release next week

As we get closer to the release I post daily project updates on this doc!

@alafanechere
Contributor Author

Closing this epic as both publish and test pipelines are in production.
I'm tracking upcoming improvements in #27310
