Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

airbyte_ci: up_to_date command #37487

Merged
merged 17 commits into from
Apr 26, 2024
Merged

airbyte_ci: up_to_date command #37487

merged 17 commits into from
Apr 26, 2024

Conversation

bleonard
Copy link
Contributor

@bleonard bleonard commented Apr 22, 2024

Meant to be run on a cron script.

Actions:

  • Upgrades python connectors to the newest versions of libraries.
  • It's meant to (eventually) be run every N days, but can also be helpful in doing mass work.

Examples

  • airbyte-ci connectors --name=source-openweather up_to_date: upgrades main dependecies
  • airbyte-ci connectors --name=source-openweather up_to_date --dev: forces update if there are only dev changes
  • airbyte-ci connectors --name=source-openweather up_to_date --dep pytest@^8.10 --dep airbyte-cdk@0.80.0: allows update to toml files as well

Other things it could do later

  • make the pull requests on a well known branch, replacing the last one if still open
  • bump the conenctor version and metadata and changelog
  • also bump the manifest version of the CDK

@bleonard bleonard requested a review from a team as a code owner April 22, 2024 21:18
Copy link

vercel bot commented Apr 22, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Visit Preview Apr 26, 2024 7:53pm

Copy link
Contributor

@clnoll clnoll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice @bleonard!

Requesting one change related to the steps_results[-1] indexing, and added a few more minor comments.

status=StepStatus.SKIPPED,
stderr="The connector is not a Python connector.",
)
if self.context.connector.connector_type != "source":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we require it to be a source connector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open to changing later. I felt like I was mostly targeting the source CDK

status=StepStatus.SKIPPED,
stderr="The connector is not a source connector.",
)
if not "poetry.lock" in connector_dir_entries or not "pyproject.toml" in connector_dir_entries:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the poetry.lock file is auto-generated, it feels like for this check we can assert that the connector has a pyproject.toml.

Can we add a check here that the pyproject.toml includes the CDK?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave that TODO for my next round. I'd a bit worried about the Candidate notion if this pipeline does lots of things anyway. Not sure it will be helpful if there are N possible things up_to_date could do.

stderr="The connector CDK can't be updated because it does not have a base image defined in the metadata.",
)

# TODO: is there a fast way to check if the connector is already using the latest CDK version?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Poetry has some tools to play around with for this. poetry show -o | awk '{print $1}' | grep airbyte-cdk will work. I didn't immediately see a way to make poetry emit json but might have missed it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did that below. I didn't know if you could do that outside of the container.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't want to build the connector container the following shortcuts to suggest

Approach A: compare dependencies.json with the latest cdk verion

  1. Check if poetry.lock has changed on the branch by checking the content of self.context.modified_files
  2. If poetry.lock was not modified we can consider that the CDK version did not changed compared to the last version.
  3. As we archive the connector dependency at build time (for the data team) you can consume the last dependency snapshot of the connector by GETting and parsing this kind of url: https://storage.googleapis.com/prod-airbyte-cloud-connector-metadata-service/connector_dependencies/source-airtable/4.2.0/dependencies.json
  4. You can get the latest cdk version by using the get_latest_cdk_version from this module
  5. You can parse dependencies.json and compare the installed cdk version to the latest cdk version.

Approach B: pip freeze on the latest released connector

Another approach, which would use a container but without the build footprint could be:

  • Keep step 1 and 2 from previous suggestion
  • If the lockfile was not updated:
#Instantiate a dagger container with the latest connector image:
latest_connector_container = self.dagger_client.container().from(f"{self.connector.metadata["dockerRepository"]}:latest")
# The following line will pull the connector image if it's not cached. It has some network overhead
installed_dependencies = (await latest_connector_container.with_exec(["pip", "freeze"], skip_entrypoint=True).stdout()).splitlines()
cdk_version = [dependency.split("==")[1] for dependency in installed_dependencies if "airbyte-cdk" in dependency][0]
latest_cdk_version = get_latest_cdk_version()

Approach C [preferred?] : parse the poetry.lock file

lockfile_content = await self.context.get_connector_dir().file("poetry.lock").contents()

Then you can use regex or this library to parse and query the lockfile and finally compare the cdk version with the output of get_latest_cdk_version()

Comment on lines 108 to 113
poetry_show_result = await connector_container.with_exec(sh_dash_c(["poetry show | grep airbyte-cdk"])).stdout()
if not poetry_show_result:
return StepResult(step=self, status=StepStatus.FAILURE, stderr="CDK is not installed in the connector.")
current_cdk_version = extract_poetry_show_version(poetry_show_result)
if not current_cdk_version:
return StepResult(step=self, status=StepStatus.FAILURE, stderr="CDK was not found in the connector poetry.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like it belongs in CheckIsUpdateCdkCandidate. Any reason not to put it there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was back to when I spun up the container. I tried to do it all in one spot. Is that common?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, that makes sense to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like these checks do belong here and not in a separate step. I would also suggest placing here the "check if the current cdk version is the latest and skip if it is" logic.
It will lead to a more self contained and explanatory Step: Why are we skipping the UpdateCDK execution? because the connector is already on the latest version sounds simpler to understand compared to Why did UpdateCDK did not run? > Because CheckIsUpdateCdkCandidate was skipped.

Copy link
Contributor Author

@bleonard bleonard Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the world we are moving to that will do several (but not all) of 7 things, should there be 7 different checks like this? Or should there be 7 tasks that might return SKIPPED - i lean towards the later because otherwise, you end up with duplicated logic between the Candidate step and the ActuallyDoIt step.

)


def compare_semver_versions(version1: str, version2: str) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't used it myself but it looks like packaging.version.Version does this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest using the semver package, you can reuse some logic from the VersionCheck step

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

random note I learned semver crashes with valid PyPi versions like "2.4" and "2.3.1.dev1" - I'm going to use use the python packaging.version as it's more likely to work as expected.

bleonard and others added 4 commits April 22, 2024 16:07
@bleonard bleonard requested a review from clnoll April 22, 2024 23:22
Copy link
Contributor

@alafanechere alafanechere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a command to ugprade the cdk for python and java connectors (logic here doc here). It has not been updated to support poetry though.

I expected the up_to_date command to not only update the cdk but also all other dependencies of the connector if they exist.
It would "simply" be a "à la Dependabot" global dependecy update with: poetry update.
poetry update would upgrade all the dependencies while still respecting the dependency upper bounds defined in pyproject.toml.

I'm requesting changes because I'd appreciate either:
A. A rework of the upgrade_cdk command to support poetry.
B. A new command which would update all the dependencies, not only the CDK.



class CheckIsUpdateCdkCandidate(Step):
"""Check if the connector is a candidate for migration to poetry.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Check if the connector is a candidate for migration to poetry.
"""Check if the connector is a candidate for CDK update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plan is to work this in to the whole dependabot thing. The stuff I mentioned in the PR. Is it not possible to check in current progress?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bleonard we could check in current progress but what I don't get is why we should consider the CDK a specific dependency, different from the ones that would be updated with a single poetry update execution.

From your PR description if think it don't get the else of do a poetry update to update everything else.

I'd be happier merging a new command which just does poetry update on a connector and exports back the updated poetry.lock file. It would introduce a new command which has a net new feature compared to the existing upgrade-cdk one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I feel like we should set the CDK explicitly so doing a poetry update wouldn't fix it. It's easy to set that explicitly and clarity on CDK conpatibility is important. * or ^0 won't provide that. I can add the Poetry update command now if that's the request.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should set the CDK explicitly

Ahh makes sense, this is what I did not get. Let me double check with @girarda that this is what we want.

I can add the Poetry update command now if that's the request.

This can be a follow up PR if the update to a specific CDK version is what we want instead of "always latest".

PACKAGE_NAME_PATTERN = r"^([a-zA-Z0-9_.\-]+)(?:\[(.*?)\])?([=~><!]=?[a-zA-Z0-9\.]+)?$"


class CheckIsUpdateCdkCandidate(Step):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if the pipeline is the language agnostic but the Step is not I would suggest creating language specific submodules under a step subpackage like we did for build. You could argue that it's not something I respected for migrate_to_poetry and migrate_to_base_image commands.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I copied migrate_to_poetry basically.

status=StepStatus.SKIPPED,
stderr="The connector is not a source connector.",
)
if "poetry.lock" not in connector_dir_entries or "pyproject.toml" not in connector_dir_entries:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we should have an helper to check whether the connector is poetry based

PACKAGE_NAME_PATTERN = r"^([a-zA-Z0-9_.\-]+)(?:\[(.*?)\])?([=~><!]=?[a-zA-Z0-9\.]+)?$"


class CheckIsUpdateCdkCandidate(Step):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a significant code duplication with CheckIsMigrationCandidate on the poetry pipeline in this step. Can we create a parent class for both steps?
I also realize that these step are not concretely performing anything. Capturing this logic inside a function which would early exit the pipeline might be more straightforward.

stderr="The connector CDK can't be updated because it does not have a base image defined in the metadata.",
)

# TODO: is there a fast way to check if the connector is already using the latest CDK version?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't want to build the connector container the following shortcuts to suggest

Approach A: compare dependencies.json with the latest cdk verion

  1. Check if poetry.lock has changed on the branch by checking the content of self.context.modified_files
  2. If poetry.lock was not modified we can consider that the CDK version did not changed compared to the last version.
  3. As we archive the connector dependency at build time (for the data team) you can consume the last dependency snapshot of the connector by GETting and parsing this kind of url: https://storage.googleapis.com/prod-airbyte-cloud-connector-metadata-service/connector_dependencies/source-airtable/4.2.0/dependencies.json
  4. You can get the latest cdk version by using the get_latest_cdk_version from this module
  5. You can parse dependencies.json and compare the installed cdk version to the latest cdk version.

Approach B: pip freeze on the latest released connector

Another approach, which would use a container but without the build footprint could be:

  • Keep step 1 and 2 from previous suggestion
  • If the lockfile was not updated:
#Instantiate a dagger container with the latest connector image:
latest_connector_container = self.dagger_client.container().from(f"{self.connector.metadata["dockerRepository"]}:latest")
# The following line will pull the connector image if it's not cached. It has some network overhead
installed_dependencies = (await latest_connector_container.with_exec(["pip", "freeze"], skip_entrypoint=True).stdout()).splitlines()
cdk_version = [dependency.split("==")[1] for dependency in installed_dependencies if "airbyte-cdk" in dependency][0]
latest_cdk_version = get_latest_cdk_version()

Approach C [preferred?] : parse the poetry.lock file

lockfile_content = await self.context.get_connector_dir().file("poetry.lock").contents()

Then you can use regex or this library to parse and query the lockfile and finally compare the cdk version with the output of get_latest_cdk_version()

Comment on lines +174 to +176
await new_connector_container.with_mounted_file(
"pyproject.toml", (await self.context.get_connector_dir(include=["pyproject.toml"])).file("pyproject.toml")
).with_exec(["poetry", "run", self.context.connector.technical_name, "spec"], skip_entrypoint=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we mounting these files because they are not kept after build on the built connector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know anything about mounting. I was just copying what you did in the migrate_to_poetry action. Is this what it means: You should mount things that might have changed since the container was made ?

)


def compare_semver_versions(version1: str, version2: str) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest using the semver package, you can reuse some logic from the VersionCheck step

@bleonard
Copy link
Contributor Author

I've updated this with sevral changes based on the Slack discussions. It will now expclity move the airbyte-cdk pin so that PyAirbyte and others using pip will work predictably. It also has a place to handle the logic for which version that can be updated to. It will poetry update the rest.

This is not getting called anywhere (low risk) and it being on this branch is blocking me from working on all the connector PRs in another branch. If there is nothing scary, can we merge and keep a list for later updates?

Copy link
Contributor

@alafanechere alafanechere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look like a good enough first iteration on this command 👍 Thanks for your patience with this code base. Can you please bump the tool version in pyproject.toml and add a changelog entry in the README? 🙏

There are room for logic consolidation but I suggest to tackle it once we want to automatically orchestrate it and make it do other things.

I left a couple of other suggestions. But as you said, nothing in this branch is in the critical path so feel free to merge if you need to manage your lots of connectors PRs.

async def get_poetry_versions(connector_container: dagger.Container, only: str | None = None) -> dict[str, Version]:
# -T makes it only the top-level ones
# poetry show -T --only main will jsut be the main dependecies
command = ["poetry", "show", "-T"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL

)


class UpdatePoetry(Step):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I find the name of this class not matchin what it does. UpdateConnectorDependencies?

Comment on lines +189 to +190
- The original dependencies are installed in the new connector image.
- The dev dependencies are not installed in the new connector image.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- The original dependencies are installed in the new connector image.
- The dev dependencies are not installed in the new connector image.

match = re.match(package_name_pattern, dep)
if match:
package = match.group(1)
versions[package] = dep
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I followed the code path correctly we'd get a dict with the following structure:

{
  "airbyte-cdk: "airbyte-cdk==0.80.0"
}

Am I correct? (would be worth adding a docstring)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. the second part being what would be poetry add

Comment on lines +121 to +124
# update everything else
connector_container = await connector_container.with_exec(["poetry", "update"])
poetry_update_output = await connector_container.stdout()
self.logger.info(poetry_update_output)
Copy link
Contributor

@alafanechere alafanechere Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this section execution be conditioned by a flag passed to UpdatePoetry to optionally run a full poetry update (this could also be a different step in the pipeline)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You think there's a reason to be optional? I'm pretty open to that.

Comment on lines +113 to +119
current_cdk_version = before_versions.get("airbyte-cdk") or None
if current_cdk_version:
# We want the CDK pinned exactly so it also works as expected in PyAirbyte and other `pip` scenarios
new_cdk_version = pick_airbyte_cdk_version(current_cdk_version, self.context)
self.logger.info(f"Updating airbyte-cdk from {current_cdk_version} to {new_cdk_version}")
if new_cdk_version > current_cdk_version:
connector_container = await connector_container.with_exec(["poetry", "add", f"airbyte-cdk=={new_cdk_version}"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this else block and instantiate a second UpdatePoetry step in the run_up_to_date_pipeline, the pipeline would pick the right cdk version (according to inputs or by fetching the latest version).
I would remove "business" logic from the Step and make it more reusable - self contained.

In other words: can we make self.specified_versions not optional 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe. But I don't thin kthe eventual cron script will specify the versions for all the updates that should be poetry update --- specified_versions was really only for local work. "Let's change try changing the semantic version of pytest" kind of stuff.

@bleonard bleonard enabled auto-merge (squash) April 26, 2024 19:51
@bleonard bleonard merged commit c47142f into master Apr 26, 2024
35 checks passed
@bleonard bleonard deleted the up_to_date_command branch April 26, 2024 20:15
type=bool,
default=False,
is_flag=True,
help="Force update when there are only dev changes.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean only dev dependencies are changed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants