
S3 and Azure Blob Storage: Update File CDK to support document file types #31904

Merged
merged 13 commits into master from flash1293/update-file-cdk on Oct 31, 2023

Conversation

flash1293 (Contributor):

What

This PR updates the CDK for S3 and Azure Blob Storage sources.

For S3, this moves the document file type parsing logic into the CDK.

For Azure Blob Storage, it adds support for document file types, similar to S3.

github-actions bot (Contributor) commented Oct 27, 2023:

Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉

To merge this PR, ensure the following has been done/considered for each connector added or updated:

  • PR name follows PR naming conventions
  • Breaking changes are considered. If a Breaking Change is being introduced, ensure an Airbyte engineer has created a Breaking Change Plan.
  • Connector version has been incremented in the Dockerfile and metadata.yaml according to our Semantic Versioning for Connectors guidelines
  • You've updated the connector's metadata.yaml file with any other relevant changes, including a breakingChanges entry for major version bumps. See metadata.yaml docs
  • Secrets in the connector's spec are annotated with airbyte_secret
  • All documentation files are up to date. (README.md, bootstrap.md, docs.md, etc...)
  • Changelog updated in docs/integrations/<source or destination>/<name>.md with an entry for the new version. See changelog example
  • Migration guide updated in docs/integrations/<source or destination>/<name>-migrations.md with an entry for the new version, if the version is a breaking change. See migration guide example
  • If set, you've ensured the icon is present in the platform-internal repo. (Docs)

If the checklist is complete, but the CI check is failing,

  1. Check for hidden checklists in your PR description

  2. Toggle the github label checklist-action-run on/off to re-run the checklist CI.

@flash1293 flash1293 marked this pull request as ready for review October 27, 2023 08:33
@flash1293 flash1293 requested a review from a team October 27, 2023 08:33
@octavia-squidington-iv octavia-squidington-iv requested a review from a team October 27, 2023 08:35
flash1293 (Contributor Author):

@alafanechere All the file-based sources that support document file types will require the same build customization. For now it's not a huge lift as there aren't many of them, but it might be nice to put this in a shared spot so we don't end up with the same file in x places. Maybe the CDK could export the build customization (or you may have a better idea)?

"pytz",
"fastavro==1.4.11",
"pyarrow",
"unstructured==0.10.19",
flash1293 (Contributor Author) commented Oct 27, 2023:

This stuff is also included in airbyte-cdk[file-based], but then there would be a version mismatch with fastavro 1.4.11, which is listed explicitly here (it's fastavro~=1.8.0 in the file-based extra). Is this a leftover, or is there a strong reason not to rely on the CDK "standard" dependencies for file-based sources?

Same question applies to S3 (which makes me think it wasn't a conscious choice).

Do you know where this came from, @clnoll?

Contributor:

I read this comment after I submitted my first review.

Contributor:

According to your PR title, I would assume that these two connectors would depend on the file-based CDK and that the unstructured lib would be bundled in it. It'd be great if connectors and the CDK could agree on the fastavro version, indeed.

Contributor:

I'm not aware of any reason not to use the file-based CDK's dependencies.

flash1293 (Contributor Author):

Switched it to use the extra
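
For illustration, here is a minimal sketch of what relying on the extra could look like in the connector's setup.py; the version bound and package name are placeholders for illustration, not the exact values from this PR:

# setup.py (sketch) -- the file-format dependencies (fastavro, pyarrow,
# unstructured, ...) come in transitively through the CDK's file-based extra
# instead of being pinned here; ">=0.52.0" is a placeholder version bound.
from setuptools import find_packages, setup

MAIN_REQUIREMENTS = [
    "airbyte-cdk[file-based]>=0.52.0",
    "smart_open[azure]",
    "pytz",
]

setup(
    name="source-azure-blob-storage",  # package name assumed for the sketch
    packages=find_packages(),
    install_requires=MAIN_REQUIREMENTS,
)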

alafanechere (Contributor) left a comment:

Requesting changes until we find the best option to avoid build_customization.py code duplication

Contributor:

This build_customization.py is the same as the S3 one, right?

I'd like to think a bit with my team about what could be the best option to avoid code duplication:

A. Install these dependencies in our python-connector-base image. Advantage: simple incremental change. Drawback: all our Python connector images will grow in size.
B. Put this build_customization.py file in the file CDK and create symlinks in the S3 and Azure connectors to this file. Advantage: code reuse without a base image change. Drawback: it's another edge case for our team to keep in mind.
C. Create a specific base image for file/LLM connectors based on python-connector-base. Advantage: a new, clean, centralized artifact. Drawback: a change in this package means building multiple images for the same connector language.

I'm more inclined toward C, but would recommend B so as not to block you right now.

flash1293 (Contributor Author):

I agree with all of this (B for now, C in a follow-up). We could re-build the base image as part of the CDK publish action (that would still require manually bumping the affected connectors, but I think it would still be helpful).

Contributor:

After thinking a bit more about it, I'm not sure that C is our best long-term bet. I believe it adds complexity and we could end up in a situation similar to the one we have today with strict-encrypt connectors. Managing variants is feasible but likely to be cumbersome. The base image would become a dependency of the file base image, and compatibility and version-pinning problems etc. would come with it.
As of today, I think option A - installing your new system dependencies in the base image - is the one with the lightest long-term maintenance burden, but as I said here, I find it risky to download the nltk data through a Python script execution, as we don't have any reproducibility guarantee and would have to maintain nltk version equivalence between the base image and the CDK...

flash1293 (Contributor Author):

Not sure about the best solution, but the symlink thing seems wrong as well. For now I duplicated the file, but happy to go for a better solution

Contributor:

Since build_customization.py is just a .py file that can import things, would it be possible for all of the helper methods to live in the file-based CDK and be imported from build_customization.py?

Then the pre_connector_install and post_connector_install hooks might be duplicated, but they'd be small and declarative: "before installing, install_tesseract_and_poppler".
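
A rough sketch of that idea, assuming the file-based CDK exposed such helpers from a module like airbyte_cdk.utils.docker_build (the module path, helper names, and async hook signatures are all assumptions made for illustration):

# build_customization.py (sketch) -- the connector-side file stays small and
# declarative, while the actual install logic would live in the file-based CDK.
# `airbyte_cdk.utils.docker_build` and its helpers are hypothetical names.
from dagger import Container

from airbyte_cdk.utils.docker_build import install_tesseract_and_poppler, setup_nltk


async def pre_connector_install(connector_container: Container) -> Container:
    # System packages don't depend on the connector's python package,
    # so install them early to keep the apt layers cacheable.
    return install_tesseract_and_poppler(connector_container)


async def post_connector_install(connector_container: Container) -> Container:
    # nltk is only importable once the connector (and CDK) are installed.
    return setup_nltk(connector_container)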

Contributor:

So here's what we realized could be the best approach to take with @flash1293:

  • @flash1293's changes mean the CDK now has system dependencies, which is definitely a new thing.
  • These system dependencies can change along with the CDK version.
  • We should bundle the CDK system dependencies in the python connector base image.
  • As system dependencies can change on a CDK version change, we now have a coupling between the CDK version used by a connector and the base image it can use.
  • We should hardcode somewhere a mapping between CDK version and base image version compatibility (a rough sketch of such a check follows below).
  • On connector build, we should compare the connector's CDK version to the base image version and fail the build if they're not compatible according to our mapping.
  • Connector developers will then have to change the baseImage metadata according to this failure.
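
A rough sketch of what such a hardcoded mapping and build-time check could look like; the mapping values, version numbers, and function name are illustrative assumptions, not the actual implementation:

# Sketch of a CDK-version -> minimum-base-image compatibility check.
from packaging.version import Version

# Hypothetical mapping: minimum base image required per CDK minor line.
CDK_TO_MIN_BASE_IMAGE = {
    "0.52": "airbyte/python-connector-base:1.2.0",
}


def check_base_image_compatibility(cdk_version: str, base_image: str) -> None:
    cdk_minor = ".".join(cdk_version.split(".")[:2])
    required = CDK_TO_MIN_BASE_IMAGE.get(cdk_minor)
    if required is None:
        return  # no constraint recorded for this CDK line
    if Version(base_image.split(":")[1]) < Version(required.split(":")[1]):
        raise ValueError(
            f"CDK {cdk_version} requires at least {required}, "
            f"but metadata.yaml declares {base_image}"
        )


# Passes because 1.2.0 satisfies the minimum; an older tag would fail the build.
check_base_image_compatibility("0.52.3", "airbyte/python-connector-base:1.2.0")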

flash1293 (Contributor Author) commented Oct 27, 2023:

I think @erohmensing's suggestion is interesting - it means we would need to install the CDK at the version specified in the connector's setup.py into the Python environment that is building the dagger pipeline - if that's possible it might work, although it feels a little crazy.

Contributor:

Since build_customization.py is just a .py file that can import things, would it be possible for all of the helper methods to live in the file-based CDK and be imported from build_customization.py?

The build_customization.py is imported at runtime in the build process here. @erohmensing, it would mean that airbyte-ci would depend on the CDK if we imported helpers from there. As these helpers can change according to the CDK version, we could end up in a situation where we build a connector image that depends on an old version of the CDK with helpers from the latest version.

@@ -5,7 +5,20 @@

from setuptools import find_packages, setup

MAIN_REQUIREMENTS = ["airbyte-cdk>=0.51.17", "smart_open[azure]", "pytz", "fastavro==1.4.11", "pyarrow"]
MAIN_REQUIREMENTS = [
"airbyte-cdk>=0.51.17",
Contributor:

You are not updating the CDK?
According to your changelog entry, I would expect that a new CDK version comes with the libraries you added as requirements to this connector.

flash1293 (Contributor Author):

Explicitly updated

@flash1293 flash1293 requested a review from a team as a code owner October 27, 2023 09:18
@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Oct 27, 2023
"""

connector_container = connector_container.with_exec(
["sh", "-c", "apt-get update && apt-get install -y tesseract-ocr poppler-utils"], skip_entrypoint=True
Contributor:

Can we install specific versions of these utils to maximize build reproducibility?

Comment on lines 20 to 31
nltk_python_script = textwrap.dedent(
"""
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
"""
)
connector_container = (
connector_container.with_new_file("/tmp/nltk_python_script.py", nltk_python_script)
.with_exec(["python", "/tmp/nltk_python_script.py"], skip_entrypoint=True)
.with_exec(["rm", "/tmp/nltk_python_script.py"], skip_entrypoint=True)
)
Contributor:

Could you comment on which system path the nltk data will be written?

flash1293 (Contributor Author):

Done
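
For reference, a sketch of how the target path can be made explicit instead of relying on nltk's default per-user search path; the /usr/share/nltk_data directory is an assumption for illustration, not necessarily the path the connector documents:

import nltk

# Write the tokenizer/tagger data to an explicit, well-known system path so the
# image contents don't depend on nltk's default download location.
NLTK_DATA_DIR = "/usr/share/nltk_data"  # assumed path
nltk.download("punkt", download_dir=NLTK_DATA_DIR)
nltk.download("averaged_perceptron_tagger", download_dir=NLTK_DATA_DIR)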

"""

# Setup nltk in the container
connector_container = setup_nltk(connector_container)
Contributor:

Something interesting to note here, which is playing against the bundling of the nltk data in the base image, is that:

  • this execution would depend on the installation of nltk in the base image
  • the result of this execution (the file download) is dynamic. We don't control what's being downloaded exactly

This is why it makes sense for it to be a post_connector_install thing.
If we downloaded this data in the base image we'd have to make sure the nltk version we download this data with is the same as the one declared in the CDK...

I'm also a bit concerned by the fact that this post_connector_install can't guarantee reproducible results, as we rely on nltk to decide which data is downloaded.
@flash1293 could we host snapshots of this data on one of our public buckets and make this hook download that data and mount it into the container?

What I mean by "reproducible build" is that we always get the same image when running the build command on the same commit. If the data downloading function of nltk is dynamic we don't have reproducibility.

flash1293 (Contributor Author):

Inlined the nltk data index in the customization script so it points to a raw GitHub URL of that specific commit - I'm pretty confident this is always reproducible (short of the maintainers deleting the repo, in which case we have problems anyway).
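
A minimal sketch of that kind of pinning, assuming it goes through nltk's Downloader and its server_index_url option; the commit placeholder is not the actual commit used in this PR:

from nltk.downloader import Downloader

# Point the downloader at the nltk_data index of one specific commit instead of
# the moving gh-pages branch, so repeated builds resolve the same packages.
# "<commit-sha>" is a placeholder, not the commit pinned in this PR.
PINNED_INDEX_URL = "https://raw.githubusercontent.com/nltk/nltk_data/<commit-sha>/index.xml"

downloader = Downloader(server_index_url=PINNED_INDEX_URL)
downloader.download("punkt")
downloader.download("averaged_perceptron_tagger")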

connector_container = setup_nltk(connector_container)

# Install Tesseract and Poppler
connector_container = install_tesseract_and_poppler(connector_container)
Contributor:

Can this be called in a pre_install_hook?
These are system dependency installations that do not depend on the Python package installation, if I'm not mistaken. This will speed up the build, as the apt-get install layers will be cached. In your current implementation we'll re-download and install these tools on any code change.

flash1293 (Contributor Author):

Moved the install to pre_connector_install and pinned the versions
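
A sketch of what that could look like; the Debian package versions are placeholders, and the async hook signature is assumed, but the with_exec call mirrors the snippet quoted above:

from dagger import Container


async def pre_connector_install(connector_container: Container) -> Container:
    # Pin the apt packages so rebuilding the same commit installs the same tool
    # versions; the "<pinned-version>" strings are placeholders, not real pins.
    return connector_container.with_exec(
        [
            "sh",
            "-c",
            "apt-get update && apt-get install -y "
            "tesseract-ocr=<pinned-version> poppler-utils=<pinned-version>",
        ],
        skip_entrypoint=True,
    )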

Contributor:

Question about symlinks and git:
Do you see the connector's build_customization.py (symlink) in the git diff when you modify the root file_based_build_customization.py?

As of today we detect connector changes based on changes in their folder with a git diff command. It conditionally triggers our CI and connector tests. I think we would like to keep this behavior.

In other words:
When you change file_based_build_customization.py, do you want to release a new connector version for all the connectors using it, or do you want to decouple the hook changes from the connector release?

flash1293 (Contributor Author):

@alafanechere CDK changes are currently decoupled from connector changes - if a connector wants to use changed CDK stuff, the CDK needs to be published and a separate PR bumping the version needs to be opened (like what I'm doing here). It seems correct to me to extend this to the build customization script as well.

So no, this shouldn't automatically trigger a connector release. Seems like this makes the symlink approach less attractive.

@@ -21,7 +21,7 @@
"order": 10,
"type": "array",
"items": {
"title": "S3FileBasedStreamConfig",
"title": "BasedStreamConfig",
Contributor:

Suggested change:
- "title": "BasedStreamConfig",
+ "title": "FileBasedStreamConfig",

@octavia-squidington-iii octavia-squidington-iii removed the CDK Connector Development Kit label Oct 30, 2023
flash1293 (Contributor Author):

Thanks to @alafanechere, native dependencies are now provided by the base image. I added an acceptance test for Azure to make sure the new parser actually works (this already exists for S3).

alafanechere (Contributor):

Neat, all 🟢. Let's wait for a 👍 from my team on #31929 before merging.

aaronsteers (Collaborator) left a comment:

Looks good! 🚀

alafanechere (Contributor) left a comment:

#31929 was approved, so our base image version 1.2.0 is now official.

@flash1293 flash1293 merged commit 7c7acad into master Oct 31, 2023
27 of 28 checks passed
@flash1293 flash1293 deleted the flash1293/update-file-cdk branch October 31, 2023 10:21
Labels: area/connectors (Connector related issues), area/documentation (Improvements or additions to documentation), connectors/source/azure-blob-storage, connectors/source/s3