Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Airbyte-ci: Add local cdk support #30461

Merged
merged 8 commits into from
Sep 19, 2023
Merged

Airbyte-ci: Add local cdk support #30461

merged 8 commits into from
Sep 19, 2023

Conversation

bnchrch
Copy link
Contributor

@bnchrch bnchrch commented Sep 15, 2023

Overview

This allows developers to build and test connectors using the local version of the cdk by running

airbyte-ci connectors --use-local-cdk --name=source-faker build
airbyte-ci connectors --use-local-cdk --name=source-faker test

closes #30242
unblocks #30326

@bnchrch bnchrch requested a review from a team September 15, 2023 02:07
@bnchrch bnchrch requested a review from a team as a code owner September 15, 2023 02:07
@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Sep 15, 2023
Copy link
Contributor

@alafanechere alafanechere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this!
The apply_python_delopment_overrides LGTM but the change in get_cdk_version_from_python_connector look very brittle to me. We already have a function to find_local_dependencies_in_setup_py.
FYI I'm currently refactoring the build/test logic to test connectors from within their real built image: this will allow us to remove the ConnectorPackageInstall step and the environment functions it uses.

connector_container = (
connector_container.with_mounted_directory(f"/{path_to_cdk}", directory_to_mount)
.with_entrypoint("pip")
.with_exec(["install", "--no-deps", "--find-links=.", f"/{path_to_cdk}"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of overriding the entrypoint you can use the skip_entrypoint flag on with_exec

Suggested change
.with_exec(["install", "--no-deps", "--find-links=.", f"/{path_to_cdk}"])
.with_exec(["pip", "install", "--no-deps", "--find-links=.", f"/{path_to_cdk}"], skip_entrypoint=True)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah shoot! I was looking for a flag like that! Thank you!

connector_container = (
connector_container.with_mounted_directory(f"/{path_to_cdk}", directory_to_mount)
.with_entrypoint("pip")
.with_exec(["install", "--no-deps", "--find-links=.", f"/{path_to_cdk}"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind adding a comment about the purpose of no-deps and --find-links options?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments added!

Also removed --find-links as it was an unessesary relic

if not pip_dependencies:
return None

airbyte_cdk_line = pip_dependencies[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it very brittle to consider the cdk package is always the first in the list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah so originally I was filtering the pip_dependencies down to only those that contained airbyte-cdk

I updated this to make use of next so that we dont need to explicitly get an element at index

Comment on lines 888 to 889
pip_dependencies = [dep for dep in pip_freeze_stdout.split("\n") if "airbyte-cdk" in dep]
if not pip_dependencies:
return None

airbyte_cdk_line = pip_dependencies[0]
if "file://" in airbyte_cdk_line:
return "LOCAL"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a function which lists the local dependencies of a connector:
find_local_dependencies_in_setup_py

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah so this file is based off of what package is installed, and these are retrieved from pip freeze. So we dont parse the file itself

Copy link
Contributor

@alafanechere alafanechere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Would appreciate a bit of extra testing but we can do this when refactoring on to of #30474

@@ -365,6 +365,25 @@ def with_python_connector_source(context: ConnectorContext) -> Container:
return with_python_package(context, testing_environment, connector_source_path)


async def apply_python_development_overrides(context: ConnectorContext, connector_container: Container) -> Container:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love to see an unit test on this. I'll implement one when I'll port it on top of #30474

Copy link
Contributor

@clnoll clnoll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @bnchrch, but I don't think we want to block the CDK from being able to install its own dependencies. Is that a tradeoff that you discussed?

context.logger.info(f"Mounting {directory_to_mount}")

# Install the airbyte-cdk package from the local directory
# We use --no-deps to avoid conflicts with the airbyte-cdk version required by the connector
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this prevent us from being able to test the local CDK if it has any new/updated dependencies?

For new dependencies, this will throw an exception, but for updates it'll be undefined (i.e. we won't necessarily be testing the code that'll be used in production).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clnoll great question! and you are exactly right it will ignore sub-dependency conflicts that may crop up.

e.g. if airbyte-cdk needs pandas v2.0.0 but source-s3 need pandas v1.2.3 you wont see the error until you introduce the PR that explicitly bumps source-s3's airbytec-dk version.

The alternative is that we do not include --no-deps and to test local cdk changes a cdk developer would have to

  1. bump the library version number in the pyproject.toml for airbyte-cdk
  2. Ensure/bump the airbyte-cdk library version in the pyproject.toml for a connector you would like to test with to match
  3. Run airbyte-ci --use-local-cdk --name=source-s3 test

The reason I did not choose this route was because

  1. Its an extra step / commit to test different connectors (you have to update their dependency file)
  2. I assumed that the PR that released a new cdk version would not be the same PR that bumped a connector to use the new version. That assumption let me believe the dependency conflicts would be found later in the second PR since you would be running test without --use-local-cdk.

@clnoll more than happy to change the approach, what are your thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 thanks! I want to take a closer look at the code and understand why we’re stuck with this tradeoff between --no-deps and extra steps, given that the original implementation of the local cdk flag didn’t have that tradeoff. This isn’t a super common scenario but it’s common enough that it doesn’t feel quite right to me to skip the deps.

I hear you re the fact that conflicts would be caught before release, but I think that rooting out the issue early is important. Correct me if I'm wrong, but with the current implementation, to fix a dependency conflict I believe we’d have to toggle back and forth between updating the CDK and updating the connector until we settle on a version that agrees with both. Some conflicts could be resolved by doing that locally, but sometimes there are issues that will only crop up in certain development environments. i.e. you’ll see them in CI but not during local development, especially if it involves something like numpy or pandas that has a C dependency - and conflicts related to those libraries are the biggest sources of these problems that I've seen.

Copy link
Contributor Author

@bnchrch bnchrch Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to take a closer look at the code and understand why we’re stuck with this tradeoff between --no-deps and extra steps

Ah whoops forgot to put a link to part of the why:
jazzband/pip-tools#215

the original implementation of the local cdk flag didn’t have that tradeoff

I can answer that!

So the original implementation used sed to edit the setup.py file using find and replace.

This approach is both brittle and a victim of the python ecosystem where there are approximately 5 "standard" ways in 2023 to define your dependencies

  1. Using requirements.txt
  2. Using setup.py
  3. using setup.py + requirements.txt
  4. Using pyproject.toml
  5. Using pyproject.toml + poetry.lock

Were currently planning on forcing python connectors to all move to the "pyproject.toml + poetry.lock" method but thats still loose plans and today were stuck with 5 possible ways that the airbyte-cdk dependency is declared.

To continue to use the sed approach we would have to have parsing and local override code for all 4 of these files.

However, because they all end up using pip at the end of the day we chose to

  1. Let the connector install using any of the above methods (for now)
  2. Call pip install path/to/airbyte-cdk/python after the fact.

As the trade off is to either

  1. Ignore connector vs airbyte-cdk dependencies conflict during cdk testing
  2. Or, require you to update a connectors dependency file when testing the cdk

Correct me if I'm wrong, but with the current implementation, to fix a dependency conflict I believe we’d have to toggle back and forth between updating the CDK and updating the connector until we settle on a version that agrees with both

🤔 I dont think you'd have to toggle back and forth.

Let me explain

Lets assume when your updating a connector to use a new version of the cdk you have this goal in mind

  • Upgrade to the latest possible version of cdk that wont require a massive refactor

WIth this you'll likely target latest cdk.

  • If theres a dependency conflict:
    • you will look at upgrading your own depenencies first
  • If you cant upgrade your dependencies (You need an old version of pandas or something):
    • you will likely walk back a version of the airbyte-cdk and try again.

I dont think a connector dev would be messing with cdk dependency versions
and vice versa, I dont think a cdk dev would be messing with a connector version (if theres a major conflict, try testing with another connector)

@clnoll Is that a fair way to think about it?

but sometimes there are issues that will only crop up in certain development environments. i.e. you’ll see them in CI but not during local development, especially if it involves something like numpy or pandas that has a C dependency - and conflicts related to those libraries are the biggest sources of these problems that I've seen.

Were hoping this method solves this very issue where airbyte-ci connectors build, airbyte-ci connectors test, and airbyte-ci connectors publish both use the same, repeatable (dockerized) environment both on your machine and in CI

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for the details, @bnchrch!

I agree with your description of how a connector dev would be thinking about this situation. Re the CDK dev perspective - we do test changes on existing connectors and I could see us wanting to make sure that a package upgrade was appropriate for specific connectors, particularly if it's a popular connector, or if a CDK update was made with solving problems specific to those connectors.

But, if we're confident that airbyte-ci is reproducing the prod environment I think we're good to just deal with this scenario locally.

@bnchrch bnchrch merged commit 07a7989 into master Sep 19, 2023
31 checks passed
@bnchrch bnchrch deleted the bnchrch/ci/local-cdk branch September 19, 2023 19:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CDK Connector Development Kit
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optionally build python connector with the local CDK
4 participants