S3 and Azure Blob Storage: Update File CDK to support document file types #31904
Conversation
@alafanechere All the file-based sources that support document file types will require the same build customization. For now it's not a huge lift as there aren't tons of them, but it might be nice to put this into a special spot so we won't have the same file in x places. Maybe the CDK exports the build customization (or you have another better idea)?
"pytz", | ||
"fastavro==1.4.11", | ||
"pyarrow", | ||
"unstructured==0.10.19", |
This stuff is also included in airbyte-cdk[file-based], but then there would be a version mismatch with fastavro 1.4.11, which is listed explicitly here (it's fastavro~=1.8.0 in the file-based extra). Is this a leftover, or is there a strong reason not to rely on the CDK's "standard" dependencies for file-based sources?
The same question applies to S3 (which makes me think it wasn't a conscious choice).
Do you know where this came from, @clnoll?
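A hedged sketch of what relying on the extra could look like in the connector's setup.py (the extra name and version bound come from this thread; the exact contents of the extra are an assumption):

```python
# Sketch only: instead of re-pinning fastavro/pyarrow/unstructured in the
# connector, delegate those pins to the CDK's file-based extra, so the
# connector and the CDK cannot disagree on the fastavro version.
MAIN_REQUIREMENTS = [
    "airbyte-cdk[file-based]>=0.51.17",  # assumed to pull fastavro~=1.8.0 etc.
    "smart_open[azure]",
    "pytz",
]
```

With no explicit fastavro pin in the connector, there is nothing left to conflict with the `fastavro~=1.8.0` constraint declared by the extra.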
I read this comment after I submitted my first review.
According to your PR title, I would assume that these two connectors would depend on the file-based CDK and that the unstructured lib would be bundled in it. It'd be great if the connectors and the CDK could agree on the fastavro version indeed.
I'm not aware of any reason not to use the file-based CDK's dependencies.
Switched it to use the extra
Requesting changes until we find the best option to avoid build_customization.py code duplication.
This build_customization.py is the same as the S3 one, right?
I'd like to think a bit with my team about the best option to avoid code duplication:
A. Install these dependencies in our python-connector-base image. Advantage: a simple incremental change. Drawback: all our Python connector images will grow in size.
B. Put this build_customization.py file in the file CDK and create symlinks to it in the S3 and Azure connectors. Advantage: code reuse without a base image change. Drawback: it's another edge case for our team to remember.
C. Create a specific base image for file/LLM connectors based on python-connector-base. Advantage: a new, clean, centralized artifact. Drawback: a change in this package would mean building multiple images for the same connector language.
I'm more inclined toward C, but would recommend B so as not to block you right now.
I agree with all of this (B for now, C in a follow-up). We could re-build the base image as part of the CDK publish action (this would still require manually bumping the affected connectors, but I think it would still be helpful).
After thinking a bit more about it, I'm not sure that C is our best long-term bet. I believe it adds complexity, and we could end up in a situation similar to the one we have today with strict-encrypt connectors. Managing variants is feasible but likely to be cumbersome. The base image would become a dependency of the file base image, and compatibility and version-pinning problems would come with it.
As of today, I think option A (installing your new system dependencies in the base image) is the one with the lightest long-term maintenance burden. But as I said here, I find it risky to download the nltk data through a Python script execution, as we don't have any reproducibility guarantee and would have to maintain nltk version equivalence between the base image and the CDK...
Not sure about the best solution, but the symlink approach seems wrong as well. For now I duplicated the file, but I'm happy to go with a better solution.
Since build_customization.py is just a Python file that can import things, would it be possible for all of the helper methods to live in the file-based CDK and be imported from build_customization.py?
Then pre_connector_install and post_connector_install might be duplicated, but they'd be small and declarative: "before installing, install_tesseract_and_poppler".
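A rough sketch of that idea; the module split and helper name are hypothetical (nothing like this is exported by the CDK today), and the container object mimics the dagger-style API used elsewhere in this PR:

```python
# --- hypothetically exported by the file-based CDK ---
def install_tesseract_and_poppler(container):
    """Install the system packages needed for document parsing."""
    return container.with_exec(
        ["sh", "-c", "apt-get update && apt-get install -y tesseract-ocr poppler-utils"],
        skip_entrypoint=True,
    )


# --- each connector's build_customization.py stays small and declarative ---
async def pre_connector_install(connector_container):
    # "before installing, install_tesseract_and_poppler"
    return install_tesseract_and_poppler(connector_container)
```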
So here's what we realized could be the best approach to take, with @flash1293:
- @flash1293's changes mean the CDK now has system dependencies, which is definitely a new thing.
- These system dependencies can change with the CDK version.
- We should bundle the CDK system dependencies in the Python connector base image.
- As system dependencies can change when the CDK version changes, we now have a coupling between the CDK version used by the connector and the base image it can use.
- We should hardcode somewhere a mapping between CDK versions and compatible base image versions.
- On connector build, we should compare the connector's CDK version to the base image version and fail the build if they're not compatible according to our mapping.
- Connector developers will then have to update the baseImage metadata when this failure occurs.
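The last three bullets could look roughly like this; every version number below is made up for illustration, and where the mapping would actually live is still an open question:

```python
# Hypothetical mapping: minimum base image version required from a given
# CDK version onward.
CDK_TO_MIN_BASE_IMAGE = {
    (0, 52, 0): (1, 2, 0),  # e.g. CDK >= 0.52.0 needs base image >= 1.2.0
}


def parse_version(version: str) -> tuple:
    return tuple(int(part) for part in version.split("."))


def is_compatible(cdk_version: str, base_image_version: str) -> bool:
    """Return False when the connector's CDK needs a newer base image."""
    cdk = parse_version(cdk_version)
    base = parse_version(base_image_version)
    # Check constraints from newest to oldest; the first matching one applies.
    for min_cdk, min_base in sorted(CDK_TO_MIN_BASE_IMAGE.items(), reverse=True):
        if cdk >= min_cdk:
            return base >= min_base
    return True  # no recorded constraint for older CDK versions
```

A build step could call `is_compatible` with the versions read from setup.py and the connector metadata, and fail the pipeline when it returns False.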
I think @erohmensing's suggestion is interesting. It means we would need to install the CDK, at the version specified in the connector's setup.py, in the Python environment that is building the dagger pipeline. If that's possible it might work, although it feels a little crazy.
Quoting @erohmensing: "Since the build_customization.py is just a py file that can import things, would it be possible for all of the helper methods to live in the file based CDK, imported from build_customization.py?"
The build_customization.py is imported at runtime in the build process here. @erohmensing, it would mean that airbyte-ci would depend on the CDK if we imported helpers from there. As these helpers can change with the CDK version, we could end up building a connector image that depends on an old version of the CDK but uses helpers from the latest version.
@@ -5,7 +5,20 @@
  from setuptools import find_packages, setup

- MAIN_REQUIREMENTS = ["airbyte-cdk>=0.51.17", "smart_open[azure]", "pytz", "fastavro==1.4.11", "pyarrow"]
+ MAIN_REQUIREMENTS = [
+     "airbyte-cdk>=0.51.17",
You are not updating the CDK?
According to your changelog entry, I would expect a new CDK version to come with the libraries you added as requirements to this connector.
Explicitly updated
""" | ||
|
||
connector_container = connector_container.with_exec( | ||
["sh", "-c", "apt-get update && apt-get install -y tesseract-ocr poppler-utils"], skip_entrypoint=True |
Can we install specific versions of these utils to maximize build reproducibility?
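One possible shape for that pinning, sketched in the style of the hook above; the version strings are placeholders, not the actual package versions available in the base image's Debian release:

```python
# Placeholder pins - the real values must match what `apt-cache policy`
# reports inside the base image's Debian release.
TESSERACT_OCR_VERSION = "5.3.0-2"
POPPLER_UTILS_VERSION = "22.12.0-2"


def pinned_install_command() -> str:
    """Build an apt-get command line with explicit version pins."""
    return (
        "apt-get update && apt-get install -y "
        f"tesseract-ocr={TESSERACT_OCR_VERSION} "
        f"poppler-utils={POPPLER_UTILS_VERSION}"
    )
```

The resulting string would replace the unpinned command passed to `with_exec` above.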
    nltk_python_script = textwrap.dedent(
        """
        import nltk
        nltk.download('punkt')
        nltk.download('averaged_perceptron_tagger')
        """
    )
    connector_container = (
        connector_container.with_new_file("/tmp/nltk_python_script.py", nltk_python_script)
        .with_exec(["python", "/tmp/nltk_python_script.py"], skip_entrypoint=True)
        .with_exec(["rm", "/tmp/nltk_python_script.py"], skip_entrypoint=True)
    )
Could you comment on which system path the nltk data will be written to?
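For context on the question: nltk resolves its download directory from a search list. A sketch of the effective default inside the container, assuming the script runs as root with `NLTK_DATA` unset (this mirrors nltk's documented fallback behavior, not something verified against this image):

```python
import os


def default_nltk_data_dir() -> str:
    """Where nltk.download() is expected to write by default:
    $NLTK_DATA if set, otherwise ~/nltk_data
    (i.e. /root/nltk_data when running as root)."""
    return os.environ.get(
        "NLTK_DATA", os.path.join(os.path.expanduser("~"), "nltk_data")
    )
```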
Done
""" | ||
|
||
# Setup nltk in the container | ||
connector_container = setup_nltk(connector_container) |
Something interesting to note here, which argues against bundling the nltk data in the base image:
- this execution would depend on the installation of nltk in the base image
- the result of this execution (the file download) is dynamic; we don't control exactly what's being downloaded

This is why it makes sense for it to be a post_connector_install step.
If we downloaded this data in the base image, we'd have to make sure the nltk version we download it with is the same as the one declared in the CDK...
I'm also a bit concerned by the fact that this post_connector_install can't guarantee reproducible results, as we rely on nltk to decide which data is downloaded.
@flash1293, could we host snapshots of this data on one of our public buckets and make this hook download that data and mount it into the container?
What I mean by "reproducible build" is that we always get the same image when running the build command on the same commit. If the data-downloading function of nltk is dynamic, we don't have reproducibility.
Inlined the nltk data index in the customization script so it points to a raw GitHub URL at a specific commit. I'm fairly confident this is always reproducible (short of the maintainers deleting the repo, and then we have bigger problems anyway).
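Roughly, the pinning amounts to something like this; the commit hash below is a placeholder, not the one used in the PR, and the exact path of the index file inside the nltk_data repo is an assumption:

```python
# Placeholder commit - the real script pins a specific nltk_data commit.
NLTK_DATA_COMMIT = "0123456789abcdef0123456789abcdef01234567"


def pinned_index_url(commit: str = NLTK_DATA_COMMIT) -> str:
    """raw.githubusercontent.com serves a file frozen at a commit, so the
    index (and therefore what gets downloaded) cannot change under us."""
    return f"https://raw.githubusercontent.com/nltk/nltk_data/{commit}/index.xml"
```

The download script could then construct the downloader against this fixed index, e.g. via `nltk.downloader.Downloader(server_index_url=...)` (check the signature for the pinned nltk version).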
    connector_container = setup_nltk(connector_container)

    # Install Tesseract and Poppler
    connector_container = install_tesseract_and_poppler(connector_container)
Can this be called in a pre_install_hook?
These are system dependency installations that don't depend on the Python package installation, if I'm not mistaken. This would speed up the build, as the apt-get install layers would be cached. In your current implementation, we'll re-download and install these tools on any code change.
Moved the install to pre_connector_install and pinned the versions.
Question about symlinks and git:
Do you see the connector's build_customization.py (symlink) in the git diff when you modify the root file_based_build_customization.py?
As of today, we detect connector changes based on changes in their folder via a git diff command. That conditionally triggers our CI and connector tests, and I think we'd like to keep this behavior.
In other words: when you change file_based_build_customization.py, do you want to release a new connector version for all the connectors using it, or do you want to decouple the hook changes from the connector release?
@alafanechere CDK changes are currently decoupled from connector changes: if a connector wants to use changed CDK functionality, the CDK needs to be published and a separate PR bumping the version needs to be opened (like the one I'm doing here). It seems right to me to extend this to the build customization script as well.
So no, this shouldn't automatically trigger a connector release. That makes the symlink approach less attractive.
@@ -21,7 +21,7 @@
      "order": 10,
      "type": "array",
      "items": {
-         "title": "S3FileBasedStreamConfig",
+         "title": "BasedStreamConfig",
"title": "BasedStreamConfig", | |
"title": "FileBasedStreamConfig", |
Thanks to @alafanechere, native dependencies are now provided by the base image. I added an acceptance test for Azure to make sure the new parser actually works (one already exists for S3).
Neat, all 🟢. Let's wait for a 👍 from my team on #31929 before merging.
Looks good! 🚀
#31929 was approved, so our base image version 1.2.0 is now official.
What
This PR updates the file CDK usage for the S3 and Azure Blob Storage sources.
For S3, it moves the document file type parsing logic into the CDK.
For Azure Blob Storage, it adds support for document file types, similar to S3.