python-connector-base: add CDK system dependencies #31929
Conversation
The distinction makes sense to me: these go in the base image rather than in build customization because they're not single-connector dependencies, they're CDK dependencies. I'd be interested to hear more about the hypothetical plan to eventually record somewhere a mapping between CDK version and compatible base image.
:)
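For the "mapping between CDK version and compatible base image" idea, one possible shape is a small declarative table plus a lookup helper. Everything below is invented for illustration (the version numbers, image tags, and function name are all assumptions, not anything from this PR):

```python
# Hypothetical sketch: map the minimum CDK version (inclusive) that each
# base image supports to its image tag. All versions/tags are made up.
CDK_TO_BASE_IMAGE = {
    "0.51.0": "airbyte/python-connector-base:1.0.0",
    "0.52.0": "airbyte/python-connector-base:1.1.0",  # adds CDK system dependencies
}


def compatible_base_image(cdk_version: str) -> str:
    """Return the newest base image whose minimum CDK version is <= cdk_version."""
    def as_tuple(version: str) -> tuple:
        return tuple(int(part) for part in version.split("."))

    candidates = [
        (as_tuple(minimum), image)
        for minimum, image in CDK_TO_BASE_IMAGE.items()
        if as_tuple(minimum) <= as_tuple(cdk_version)
    ]
    if not candidates:
        raise ValueError(f"No known base image for CDK {cdk_version}")
    # max() compares the parsed version tuples, so the newest floor wins.
    return max(candidates)[1]
```

Keeping it as data (rather than logic scattered across pipelines) would make it easy to check in CI that a connector's CDK pin matches its declared base image.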
No blocking issues, just some questions + suggestions 🚂
```markdown
Our base images are declared in code, using the [Dagger Python SDK](https://dagger-io.readthedocs.io/en/sdk-python-v0.6.4/).

- [Python base image code declaration](https://github.com/airbytehq/airbyte/blob/master/airbyte-ci/connectors/base_images/base_images/python/bases.py)
- ~Java base image code declaration~ *TODO*

## Where are the Dockerfiles?
```
This new version should also come with an artificially generated Dockerfile, right?
I actually can't find those in the repo anymore but I swear they were there when we added this functionality...
Ah! Maybe it just updates the README according to this, and we just need to change the "However, we do artificially generate Dockerfiles for debugging and documentation purposes." part.
```python
ntlk_data = {
    "tokenizers": {"https://github.com/nltk/nltk_data/raw/5db857e6f7df11eabb5e5665836db9ec8df07e28/packages/tokenizers/punkt.zip"},
    "taggers": {
        "https://github.com/nltk/nltk_data/raw/5db857e6f7df11eabb5e5665836db9ec8df07e28/packages/taggers/averaged_perceptron_tagger.zip"
    },
```
Just noting that since these are raw GitHub locations, if these packages move for any reason, all of our connectors will start to fail to build (right?).
I guess this is the same for e.g. any PyPI package, but it feels a bit more dangerous on GitHub, where e.g. a repository rename or something that removes this specific commit could affect this. I don't think there's any action to take, just wanted to flag it!
> all of our connectors will start to fail to build

Nope: the base image is published to DockerHub, so once it's built successfully, all the connectors can use it as a base image with a `FROM`-like statement; the layers downloading the remote resources are not recomputed at connector build time.
This official NLTK data is published on GitHub. We use per-commit URLs to make sure we get static file content that can't change, which improves build reproducibility.
It's unlikely that GitHub deletes per-commit files, unless NLTK explicitly asks them to unindex these URLs.
In any case, if it breaks, it will break when we try to cut a new base image version, not a connector version.
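The pinning trick is that the commit SHA in the raw URL path fixes the file content forever. A small sketch of how such URLs are assembled (the SHA and package names come from the diff above; the helper function itself is illustrative, not from the PR):

```python
# The commit SHA pins the exact bytes of each data package: a later change
# to nltk_data's master branch cannot alter what this URL serves.
NLTK_DATA_COMMIT = "5db857e6f7df11eabb5e5665836db9ec8df07e28"


def nltk_package_url(category: str, package: str) -> str:
    """Build a per-commit raw GitHub URL for an NLTK data package (sketch)."""
    return (
        "https://github.com/nltk/nltk_data/raw/"
        f"{NLTK_DATA_COMMIT}/packages/{category}/{package}.zip"
    )
```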
```python
# Install CDK system dependencies
.with_(self.install_cdk_system_dependencies())
```
Love this - helps us figure out where things go if there are more of them, and what things to pull out if we ever move them somewhere else
```python
await python_sanity_checks.check_nltk_data(container)
await python_sanity_checks.check_tesseract_version(container, "5.3.0")
await python_sanity_checks.check_poppler_utils_version(container, "22.12.0")
```
Optional: bundling these under a `check_cdk_system_dependencies` method might make it clearer how the checks match up to the way the image is built.
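A rough sketch of that bundling, assuming the suggested name; the three check functions below are stand-ins for the real `python_sanity_checks` helpers (here they just record that they ran), so only the grouping idea is the point:

```python
import asyncio

# Stand-in checks: the real ones exec commands inside a dagger container.
CALLS = []

async def check_nltk_data(container):
    CALLS.append("nltk_data")  # real check runs nltk.data.find(...) in the container

async def check_tesseract_version(container, version):
    CALLS.append(f"tesseract=={version}")

async def check_poppler_utils_version(container, version):
    CALLS.append(f"poppler-utils=={version}")

async def check_cdk_system_dependencies(container):
    # Mirrors install_cdk_system_dependencies: one check per installed dependency.
    await check_nltk_data(container)
    await check_tesseract_version(container, "5.3.0")
    await check_poppler_utils_version(container, "22.12.0")

asyncio.run(check_cdk_system_dependencies(container=None))
```

The benefit is that the sanity-check file then has one entry point per logical install step, so adding a fourth CDK dependency touches exactly one check and one install function.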
```python
    },
}


def install_cdk_system_dependencies(self) -> Callable:
```
❓ I guess this callable pattern is what allows us to perform the manipulation of the container as part of the container-building pipeline, rather than e.g.

```python
container = (self.get_base_container(..... other steps ))
return with_file_based_connector_dependencies(container)
```
If so, nice, and I'm wondering if this can help with the strangeness we have in pipelines where some things go from context -> container and others go from container -> container 🤔
@erohmensing yes, `container.with_(callable)` is a nice pattern to keep a clean container operation call chain.
It was introduced in a "recent" dagger version, which explains why we're not using it everywhere.
I also found it tricky in some cases to create an async function which returns a callable.
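To make the "Container -> Container" point concrete, here is a toy emulation of the pattern (this is not the dagger API; the toy `Container` only records steps):

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Container:
    steps: list = field(default_factory=list)

    def with_exec(self, args):
        # Each step returns a new Container, as in dagger's immutable chaining.
        return Container(self.steps + [("exec", tuple(args))])

    def with_(self, fn: Callable[["Container"], "Container"]) -> "Container":
        # The trick: apply a Container -> Container callable without
        # breaking out of the call chain.
        return fn(self)


def install_cdk_system_dependencies():
    def _install(container: Container) -> Container:
        return container.with_exec(["apt-get", "install", "-y", "tesseract-ocr"])
    return _install


built = (
    Container()
    .with_exec(["apt-get", "update"])
    .with_(install_cdk_system_dependencies())
)
```

Because every helper is `Container -> Container`, it composes the same way whether it is called mid-chain via `with_` or applied to a finished container, which is exactly what the "context -> container vs container -> container" strangeness lacks.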
Got it, thanks for the context!
```python
container = container.with_exec(
    ["sh", "-c", "apt-get update && apt-get install -y tesseract-ocr=5.3.0-2 poppler-utils=22.12.0-2+b1"], skip_entrypoint=True
```
Seeing multiple places where the `sh_dash_c` util from pipelines could potentially be helpful here.
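For readers outside the pipelines package: a helper in this spirit wraps plain shell command strings into the `["sh", "-c", ...]` form that `with_exec` expects, joining multiple commands with `&&`. The real util lives in the pipelines package; this signature is an assumption:

```python
def sh_dash_c(commands: list) -> list:
    """Sketch of an sh_dash_c-style helper: chain commands under one sh -c."""
    return ["sh", "-c", " && ".join(commands)]
```

That would let the install step above read as `container.with_exec(sh_dash_c(["apt-get update", "apt-get install -y ..."]), skip_entrypoint=True)` instead of hand-building the `sh -c` string.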
```python
try:
    await with_nltk.with_exec(
        ["python", "-c", 'import nltk;nltk.data.find("taggers/averaged_perceptron_tagger");nltk.data.find("tokenizers/punkt")'],
        skip_entrypoint=True,
    )
except dagger.ExecError as e:
    raise errors.SanityCheckError(e)
```
Doesn't have to be in this PR, but I think we would benefit from a custom assertion mechanism for this test file; we are doing quite a lot of:

```python
try:
    ...
except dagger.ExecError as e:
    raise errors.SanityCheckError(e)
```
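One possible shape for that mechanism is a small context manager that rewraps exec failures. In this sketch, `ExecError` and `SanityCheckError` are local stand-ins for `dagger.ExecError` and `errors.SanityCheckError`:

```python
from contextlib import contextmanager

class ExecError(Exception):
    """Stand-in for dagger.ExecError."""

class SanityCheckError(Exception):
    """Stand-in for errors.SanityCheckError."""

@contextmanager
def sanity_check():
    # Centralizes the repeated try/except: any exec failure inside the
    # block surfaces as a sanity-check failure.
    try:
        yield
    except ExecError as e:
        raise SanityCheckError(e) from e
```

Each check then becomes `with sanity_check(): await container.with_exec(...)`, and the error-wrapping policy lives in one place.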
## What
Connectors parsing unstructured data require system dependencies (see #31904).
The unstructured data parsing logic is defined at the CDK level, so these dependencies can be considered "CDK system dependencies".
This PR bundles the following dependencies into our Python connector base image:
## How
## 🚨 User Impact 🚨
The base image size grows from 80 MB to 140 MB.