-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
airbyte_ci: up_to_date command #37487
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 1 Ignored Deployment
|
39e208c
to
b276012
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice @bleonard!
Requesting one change related to the steps_results[-1]
indexing, and added a few more minor comments.
...te-ci/connectors/pipelines/pipelines/airbyte_ci/connectors/migrate_to_base_image/pipeline.py
Outdated
Show resolved
Hide resolved
airbyte-ci/connectors/pipelines/pipelines/airbyte_ci/connectors/up_to_date/commands.py
Outdated
Show resolved
Hide resolved
status=StepStatus.SKIPPED, | ||
stderr="The connector is not a Python connector.", | ||
) | ||
if self.context.connector.connector_type != "source": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we require it to be a source connector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open to changing later. I felt like I was mostly targeting the source CDK
status=StepStatus.SKIPPED, | ||
stderr="The connector is not a source connector.", | ||
) | ||
if not "poetry.lock" in connector_dir_entries or not "pyproject.toml" in connector_dir_entries: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the poetry.lock file is auto-generated, it feels like for this check we can assert that the connector has a pyproject.toml.
Can we add a check here that the pyproject.toml includes the CDK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll leave that TODO for my next round. I'd a bit worried about the Candidate
notion if this pipeline does lots of things anyway. Not sure it will be helpful if there are N possible things up_to_date
could do.
stderr="The connector CDK can't be updated because it does not have a base image defined in the metadata.", | ||
) | ||
|
||
# TODO: is there a fast way to check if the connector is already using the latest CDK version? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Poetry has some tools to play around with for this. poetry show -o | awk '{print $1}' | grep airbyte-cdk
will work. I didn't immediately see a way to make poetry emit json but might have missed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did that below. I didn't know if you could do that outside of the container.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't want to build the connector container the following shortcuts to suggest
Approach A: compare dependencies.json
with the latest cdk verion
- Check if
poetry.lock
has changed on the branch by checking the content ofself.context.modified_files
- If
poetry.lock
was not modified we can consider that the CDK version did not changed compared to the last version. - As we archive the connector dependency at build time (for the data team) you can consume the last dependency snapshot of the connector by GETting and parsing this kind of url: https://storage.googleapis.com/prod-airbyte-cloud-connector-metadata-service/connector_dependencies/source-airtable/4.2.0/dependencies.json
- You can get the latest cdk version by using the
get_latest_cdk_version
from this module - You can parse
dependencies.json
and compare the installed cdk version to the latest cdk version.
Approach B: pip freeze
on the latest released connector
Another approach, which would use a container but without the build footprint could be:
- Keep step 1 and 2 from previous suggestion
- If the lockfile was not updated:
#Instantiate a dagger container with the latest connector image:
latest_connector_container = self.dagger_client.container().from(f"{self.connector.metadata["dockerRepository"]}:latest")
# The following line will pull the connector image if it's not cached. It has some network overhead
installed_dependencies = (await latest_connector_container.with_exec(["pip", "freeze"], skip_entrypoint=True).stdout()).splitlines()
cdk_version = [dependency.split("==")[1] for dependency in installed_dependencies if "airbyte-cdk" in dependency][0]
latest_cdk_version = get_latest_cdk_version()
Approach C [preferred?] : parse the poetry.lock
file
lockfile_content = await self.context.get_connector_dir().file("poetry.lock").contents()
Then you can use regex or this library to parse and query the lockfile and finally compare the cdk version with the output of get_latest_cdk_version()
poetry_show_result = await connector_container.with_exec(sh_dash_c(["poetry show | grep airbyte-cdk"])).stdout() | ||
if not poetry_show_result: | ||
return StepResult(step=self, status=StepStatus.FAILURE, stderr="CDK is not installed in the connector.") | ||
current_cdk_version = extract_poetry_show_version(poetry_show_result) | ||
if not current_cdk_version: | ||
return StepResult(step=self, status=StepStatus.FAILURE, stderr="CDK was not found in the connector poetry.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like it belongs in CheckIsUpdateCdkCandidate
. Any reason not to put it there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was back to when I spun up the container. I tried to do it all in one spot. Is that common?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, that makes sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like these checks do belong here and not in a separate step. I would also suggest placing here the "check if the current cdk version is the latest and skip if it is" logic.
It will lead to a more self contained and explanatory Step
: Why are we skipping the UpdateCDK execution? because the connector is already on the latest version sounds simpler to understand compared to Why did UpdateCDK did not run? > Because CheckIsUpdateCdkCandidate
was skipped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the world we are moving to that will do several (but not all) of 7 things, should there be 7 different checks like this? Or should there be 7 tasks that might return SKIPPED
- i lean towards the later because otherwise, you end up with duplicated logic between the Candidate step and the ActuallyDoIt step.
airbyte-ci/connectors/pipelines/pipelines/airbyte_ci/connectors/up_to_date/pipeline.py
Show resolved
Hide resolved
) | ||
|
||
|
||
def compare_semver_versions(version1: str, version2: str) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't used it myself but it looks like packaging.version.Version does this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest using the semver
package, you can reuse some logic from the VersionCheck
step
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
random note I learned semver
crashes with valid PyPi versions like "2.4" and "2.3.1.dev1" - I'm going to use use the python packaging.version
as it's more likely to work as expected.
…s/up_to_date/commands.py Co-authored-by: Catherine Noll <clnoll@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have a command to ugprade the cdk for python and java connectors (logic here doc here). It has not been updated to support poetry though.
I expected the up_to_date
command to not only update the cdk but also all other dependencies of the connector if they exist.
It would "simply" be a "à la Dependabot" global dependecy update with: poetry update
.
poetry update
would upgrade all the dependencies while still respecting the dependency upper bounds defined in pyproject.toml
.
I'm requesting changes because I'd appreciate either:
A. A rework of the upgrade_cdk
command to support poetry
.
B. A new command which would update all the dependencies, not only the CDK.
|
||
|
||
class CheckIsUpdateCdkCandidate(Step): | ||
"""Check if the connector is a candidate for migration to poetry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""Check if the connector is a candidate for migration to poetry. | |
"""Check if the connector is a candidate for CDK update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The plan is to work this in to the whole dependabot thing. The stuff I mentioned in the PR. Is it not possible to check in current progress?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bleonard we could check in current progress but what I don't get is why we should consider the CDK
a specific dependency, different from the ones that would be updated with a single poetry update
execution.
From your PR description if think it don't get the else of do a poetry update to update everything else.
I'd be happier merging a new command which just does poetry update
on a connector and exports back the updated poetry.lock
file. It would introduce a new command which has a net new feature compared to the existing upgrade-cdk
one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well I feel like we should set the CDK explicitly so doing a poetry update wouldn't fix it. It's easy to set that explicitly and clarity on CDK conpatibility is important. *
or ^0
won't provide that. I can add the Poetry update command now if that's the request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should set the CDK explicitly
Ahh makes sense, this is what I did not get. Let me double check with @girarda that this is what we want.
I can add the Poetry update command now if that's the request.
This can be a follow up PR if the update to a specific CDK version is what we want instead of "always latest".
PACKAGE_NAME_PATTERN = r"^([a-zA-Z0-9_.\-]+)(?:\[(.*?)\])?([=~><!]=?[a-zA-Z0-9\.]+)?$" | ||
|
||
|
||
class CheckIsUpdateCdkCandidate(Step): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: if the pipeline is the language agnostic but the Step
is not I would suggest creating language specific submodules under a step
subpackage like we did for build. You could argue that it's not something I respected for migrate_to_poetry
and migrate_to_base_image
commands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I copied migrate_to_poetry basically.
status=StepStatus.SKIPPED, | ||
stderr="The connector is not a source connector.", | ||
) | ||
if "poetry.lock" not in connector_dir_entries or "pyproject.toml" not in connector_dir_entries: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we should have an helper to check whether the connector is poetry based
PACKAGE_NAME_PATTERN = r"^([a-zA-Z0-9_.\-]+)(?:\[(.*?)\])?([=~><!]=?[a-zA-Z0-9\.]+)?$" | ||
|
||
|
||
class CheckIsUpdateCdkCandidate(Step): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a significant code duplication with CheckIsMigrationCandidate
on the poetry pipeline in this step. Can we create a parent class for both steps?
I also realize that these step are not concretely performing anything. Capturing this logic inside a function which would early exit the pipeline might be more straightforward.
stderr="The connector CDK can't be updated because it does not have a base image defined in the metadata.", | ||
) | ||
|
||
# TODO: is there a fast way to check if the connector is already using the latest CDK version? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't want to build the connector container the following shortcuts to suggest
Approach A: compare dependencies.json
with the latest cdk verion
- Check if
poetry.lock
has changed on the branch by checking the content ofself.context.modified_files
- If
poetry.lock
was not modified we can consider that the CDK version did not changed compared to the last version. - As we archive the connector dependency at build time (for the data team) you can consume the last dependency snapshot of the connector by GETting and parsing this kind of url: https://storage.googleapis.com/prod-airbyte-cloud-connector-metadata-service/connector_dependencies/source-airtable/4.2.0/dependencies.json
- You can get the latest cdk version by using the
get_latest_cdk_version
from this module - You can parse
dependencies.json
and compare the installed cdk version to the latest cdk version.
Approach B: pip freeze
on the latest released connector
Another approach, which would use a container but without the build footprint could be:
- Keep step 1 and 2 from previous suggestion
- If the lockfile was not updated:
#Instantiate a dagger container with the latest connector image:
latest_connector_container = self.dagger_client.container().from(f"{self.connector.metadata["dockerRepository"]}:latest")
# The following line will pull the connector image if it's not cached. It has some network overhead
installed_dependencies = (await latest_connector_container.with_exec(["pip", "freeze"], skip_entrypoint=True).stdout()).splitlines()
cdk_version = [dependency.split("==")[1] for dependency in installed_dependencies if "airbyte-cdk" in dependency][0]
latest_cdk_version = get_latest_cdk_version()
Approach C [preferred?] : parse the poetry.lock
file
lockfile_content = await self.context.get_connector_dir().file("poetry.lock").contents()
Then you can use regex or this library to parse and query the lockfile and finally compare the cdk version with the output of get_latest_cdk_version()
airbyte-ci/connectors/pipelines/pipelines/airbyte_ci/connectors/up_to_date/pipeline.py
Outdated
Show resolved
Hide resolved
airbyte-ci/connectors/pipelines/pipelines/airbyte_ci/connectors/up_to_date/pipeline.py
Outdated
Show resolved
Hide resolved
await new_connector_container.with_mounted_file( | ||
"pyproject.toml", (await self.context.get_connector_dir(include=["pyproject.toml"])).file("pyproject.toml") | ||
).with_exec(["poetry", "run", self.context.connector.technical_name, "spec"], skip_entrypoint=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we mounting these files because they are not kept after build on the built connector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know anything about mounting. I was just copying what you did in the migrate_to_poetry action. Is this what it means: You should mount things that might have changed since the container was made ?
) | ||
|
||
|
||
def compare_semver_versions(version1: str, version2: str) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest using the semver
package, you can reuse some logic from the VersionCheck
step
airbyte-ci/connectors/pipelines/pipelines/airbyte_ci/connectors/up_to_date/pipeline.py
Outdated
Show resolved
Hide resolved
I've updated this with sevral changes based on the Slack discussions. It will now expclity move the airbyte-cdk pin so that PyAirbyte and others using pip will work predictably. It also has a place to handle the logic for which version that can be updated to. It will This is not getting called anywhere (low risk) and it being on this branch is blocking me from working on all the connector PRs in another branch. If there is nothing scary, can we merge and keep a list for later updates? |
airbyte-ci/connectors/pipelines/pipelines/airbyte_ci/connectors/up_to_date/pipeline.py
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look like a good enough first iteration on this command 👍 Thanks for your patience with this code base. Can you please bump the tool version in pyproject.toml
and add a changelog entry in the README? 🙏
There are room for logic consolidation but I suggest to tackle it once we want to automatically orchestrate it and make it do other things.
I left a couple of other suggestions. But as you said, nothing in this branch is in the critical path so feel free to merge if you need to manage your lots of connectors PRs.
async def get_poetry_versions(connector_container: dagger.Container, only: str | None = None) -> dict[str, Version]: | ||
# -T makes it only the top-level ones | ||
# poetry show -T --only main will jsut be the main dependecies | ||
command = ["poetry", "show", "-T"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL
) | ||
|
||
|
||
class UpdatePoetry(Step): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I find the name of this class not matchin what it does. UpdateConnectorDependencies
?
- The original dependencies are installed in the new connector image. | ||
- The dev dependencies are not installed in the new connector image. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The original dependencies are installed in the new connector image. | |
- The dev dependencies are not installed in the new connector image. |
match = re.match(package_name_pattern, dep) | ||
if match: | ||
package = match.group(1) | ||
versions[package] = dep |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I followed the code path correctly we'd get a dict with the following structure:
{
"airbyte-cdk: "airbyte-cdk==0.80.0"
}
Am I correct? (would be worth adding a docstring)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. the second part being what would be poetry add
# update everything else | ||
connector_container = await connector_container.with_exec(["poetry", "update"]) | ||
poetry_update_output = await connector_container.stdout() | ||
self.logger.info(poetry_update_output) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this section execution be conditioned by a flag passed to UpdatePoetry
to optionally run a full poetry update
(this could also be a different step in the pipeline)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You think there's a reason to be optional? I'm pretty open to that.
current_cdk_version = before_versions.get("airbyte-cdk") or None | ||
if current_cdk_version: | ||
# We want the CDK pinned exactly so it also works as expected in PyAirbyte and other `pip` scenarios | ||
new_cdk_version = pick_airbyte_cdk_version(current_cdk_version, self.context) | ||
self.logger.info(f"Updating airbyte-cdk from {current_cdk_version} to {new_cdk_version}") | ||
if new_cdk_version > current_cdk_version: | ||
connector_container = await connector_container.with_exec(["poetry", "add", f"airbyte-cdk=={new_cdk_version}"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove this else
block and instantiate a second UpdatePoetry
step in the run_up_to_date_pipeline
, the pipeline would pick the right cdk version (according to inputs or by fetching the latest version).
I would remove "business" logic from the Step
and make it more reusable - self contained.
In other words: can we make self.specified_versions
not optional 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe. But I don't thin kthe eventual cron
script will specify the versions for all the updates that should be poetry update
--- specified_versions
was really only for local work. "Let's change try changing the semantic version of pytest" kind of stuff.
type=bool, | ||
default=False, | ||
is_flag=True, | ||
help="Force update when there are only dev changes.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean only dev dependencies are changed?
Meant to be run on a cron script.
Actions:
Examples
airbyte-ci connectors --name=source-openweather up_to_date
: upgrades main dependeciesairbyte-ci connectors --name=source-openweather up_to_date --dev
: forces update if there are only dev changesairbyte-ci connectors --name=source-openweather up_to_date --dep pytest@^8.10 --dep airbyte-cdk@0.80.0
: allows update to toml files as wellOther things it could do later