S3 and Azure Blob Storage: Update File CDK to support document file types #31904
Conversation
@alafanechere All the file-based sources that support document file types will require the same build customization. For now it's not a huge lift as there aren't tons of them, but it might be nice to put this into a special spot so we won't have the same file in x places. Maybe the CDK exports the build customization (or you have another better idea)?
"pytz", | ||
"fastavro==1.4.11", | ||
"pyarrow", | ||
"unstructured==0.10.19", |
This stuff is also included in airbyte-cdk[file-based], but then there would be a version mismatch with fastavro 1.4.11, which is listed explicitly here (it's fastavro~=1.8.0 in the file-based extra). Is this a leftover, or is there a strong reason not to rely on the CDK's "standard" dependencies for file-based sources?
The same question applies to S3 (which makes me think it wasn't a conscious choice).
Do you know where this came from, @clnoll?
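A hedged sketch of what relying on the extra could look like in the connector's setup.py (the extra name and version bound come from this thread; the exact contents of the extra are an assumption):

```python
# Sketch only: instead of re-pinning fastavro/pyarrow/unstructured in the
# connector, delegate those pins to the CDK's file-based extra, so the
# connector and the CDK cannot disagree on the fastavro version.
MAIN_REQUIREMENTS = [
    "airbyte-cdk[file-based]>=0.51.17",  # assumed to pull fastavro~=1.8.0 etc.
    "smart_open[azure]",
    "pytz",
]
```

With no explicit fastavro pin in the connector, there is nothing left to conflict with the `fastavro~=1.8.0` constraint declared by the extra.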
I read this comment after I submitted my first review.
According to your PR title, I would assume that these two connectors would depend on the file-based CDK and that the unstructured lib would be bundled in it. It'd be great if the connectors and the CDK could agree on the fastavro version indeed.
I'm not aware of any reason not to use the file-based CDK's dependencies.
Switched it to use the extra
Requesting changes until we find the best option to avoid build_customization.py code duplication.
This build_customization.py is the same as the S3 one, right?
I'd like to think a bit with my team about the best option to avoid code duplication:
A. Install these dependencies in our python-connector-base image. Advantage: a simple incremental change. Drawback: all our Python connector images will grow in size.
B. Put this build_customization.py file in the file CDK and create symlinks to it in the S3 and Azure connectors. Advantage: code reuse without a base image change. Drawback: it's another edge case for our team to remember.
C. Create a specific base image for file/LLM connectors based on python-connector-base. Advantage: a new, clean, centralized artifact. Drawback: a change in this package would mean building multiple images for the same connector language.
I'm more inclined toward C, but would recommend B so as not to block you right now.
I agree with all of this (B for now, C in a follow-up). We could re-build the base image as part of the CDK publish action (this would still require manually bumping the affected connectors, but I think it would still be helpful).
After thinking a bit more about it, I'm not sure that C is our best long-term bet. I believe it adds complexity, and we could end up in a situation similar to the one we have today with strict-encrypt connectors. Managing variants is feasible but likely to be cumbersome. The base image would become a dependency of the file base image, and compatibility and version-pinning problems would come with it.
As of today, I think option A (installing your new system dependencies in the base image) is the one with the lightest long-term maintenance burden. But as I said here, I find it risky to download the nltk data through a Python script execution, as we don't have any reproducibility guarantee and would have to maintain nltk version equivalence between the base image and the CDK...
Not sure about the best solution, but the symlink approach seems wrong as well. For now I duplicated the file, but I'm happy to go with a better solution.
Since build_customization.py is just a Python file that can import things, would it be possible for all of the helper methods to live in the file-based CDK and be imported from build_customization.py?
Then pre_connector_install and post_connector_install might be duplicated, but they'd be small and declarative: "before installing, install_tesseract_and_poppler".
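A rough sketch of that idea; the module split and helper name are hypothetical (nothing like this is exported by the CDK today), and the container object mimics the dagger-style API used elsewhere in this PR:

```python
# --- hypothetically exported by the file-based CDK ---
def install_tesseract_and_poppler(container):
    """Install the system packages needed for document parsing."""
    return container.with_exec(
        ["sh", "-c", "apt-get update && apt-get install -y tesseract-ocr poppler-utils"],
        skip_entrypoint=True,
    )


# --- each connector's build_customization.py stays small and declarative ---
async def pre_connector_install(connector_container):
    # "before installing, install_tesseract_and_poppler"
    return install_tesseract_and_poppler(connector_container)
```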
So here's what we realized could be the best approach to take, with @flash1293:
- @flash1293's changes mean the CDK now has system dependencies, which is definitely a new thing.
- These system dependencies can change with the CDK version.
- We should bundle the CDK system dependencies in the Python connector base image.
- As system dependencies can change when the CDK version changes, we now have a coupling between the CDK version used by the connector and the base image it can use.
- We should hardcode somewhere a mapping between CDK versions and compatible base image versions.
- On connector build, we should compare the connector's CDK version to the base image version and fail the build if they're not compatible according to our mapping.
- Connector developers will then have to update the baseImage metadata when this failure occurs.
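The last three bullets could look roughly like this; every version number below is made up for illustration, and where the mapping would actually live is still an open question:

```python
# Hypothetical mapping: minimum base image version required from a given
# CDK version onward.
CDK_TO_MIN_BASE_IMAGE = {
    (0, 52, 0): (1, 2, 0),  # e.g. CDK >= 0.52.0 needs base image >= 1.2.0
}


def parse_version(version: str) -> tuple:
    return tuple(int(part) for part in version.split("."))


def is_compatible(cdk_version: str, base_image_version: str) -> bool:
    """Return False when the connector's CDK needs a newer base image."""
    cdk = parse_version(cdk_version)
    base = parse_version(base_image_version)
    # Check constraints from newest to oldest; the first matching one applies.
    for min_cdk, min_base in sorted(CDK_TO_MIN_BASE_IMAGE.items(), reverse=True):
        if cdk >= min_cdk:
            return base >= min_base
    return True  # no recorded constraint for older CDK versions
```

A build step could call `is_compatible` with the versions read from setup.py and the connector metadata, and fail the pipeline when it returns False.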
I think @erohmensing's suggestion is interesting. It means we would need to install the CDK, at the version specified in the connector's setup.py, in the Python environment that is building the dagger pipeline. If that's possible it might work, although it feels a little crazy.
Quoting @erohmensing: "Since the build_customization.py is just a py file that can import things, would it be possible for all of the helper methods to live in the file based CDK, imported from build_customization.py?"
The build_customization.py is imported at runtime in the build process here. @erohmensing, it would mean that airbyte-ci would depend on the CDK if we imported helpers from there. As these helpers can change with the CDK version, we could end up building a connector image that depends on an old version of the CDK but uses helpers from the latest version.
@@ -5,7 +5,20 @@
  from setuptools import find_packages, setup

- MAIN_REQUIREMENTS = ["airbyte-cdk>=0.51.17", "smart_open[azure]", "pytz", "fastavro==1.4.11", "pyarrow"]
+ MAIN_REQUIREMENTS = [
+     "airbyte-cdk>=0.51.17",
You are not updating the CDK?
According to your changelog entry, I would expect a new CDK version to come with the libraries you added as requirements to this connector.
Explicitly updated
""" | ||
|
||
connector_container = connector_container.with_exec( | ||
["sh", "-c", "apt-get update && apt-get install -y tesseract-ocr poppler-utils"], skip_entrypoint=True |
Can we install specific versions of these utils to maximize build reproducibility?
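One possible shape for that pinning, sketched in the style of the hook above; the version strings are placeholders, not the actual package versions available in the base image's Debian release:

```python
# Placeholder pins - the real values must match what `apt-cache policy`
# reports inside the base image's Debian release.
TESSERACT_OCR_VERSION = "5.3.0-2"
POPPLER_UTILS_VERSION = "22.12.0-2"


def pinned_install_command() -> str:
    """Build an apt-get command line with explicit version pins."""
    return (
        "apt-get update && apt-get install -y "
        f"tesseract-ocr={TESSERACT_OCR_VERSION} "
        f"poppler-utils={POPPLER_UTILS_VERSION}"
    )
```

The resulting string would replace the unpinned command passed to `with_exec` above.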
    nltk_python_script = textwrap.dedent(
        """
        import nltk
        nltk.download('punkt')
        nltk.download('averaged_perceptron_tagger')
        """
    )
    connector_container = (
        connector_container.with_new_file("/tmp/nltk_python_script.py", nltk_python_script)
        .with_exec(["python", "/tmp/nltk_python_script.py"], skip_entrypoint=True)
        .with_exec(["rm", "/tmp/nltk_python_script.py"], skip_entrypoint=True)
    )
Could you comment on which system path the nltk data will be written to?
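For context on the question: nltk resolves its download directory from a search list. A sketch of the effective default inside the container, assuming the script runs as root with `NLTK_DATA` unset (this mirrors nltk's documented fallback behavior, not something verified against this image):

```python
import os


def default_nltk_data_dir() -> str:
    """Where nltk.download() is expected to write by default:
    $NLTK_DATA if set, otherwise ~/nltk_data
    (i.e. /root/nltk_data when running as root)."""
    return os.environ.get(
        "NLTK_DATA", os.path.join(os.path.expanduser("~"), "nltk_data")
    )
```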
Done
""" | ||
|
||
# Setup nltk in the container | ||
connector_container = setup_nltk(connector_container) |
Something interesting to note here, which argues against bundling the nltk data in the base image:
- this execution would depend on the installation of nltk in the base image
- the result of this execution (the file download) is dynamic; we don't control exactly what's being downloaded

This is why it makes sense for it to be a post_connector_install step.
If we downloaded this data in the base image, we'd have to make sure the nltk version we download it with is the same as the one declared in the CDK...
I'm also a bit concerned by the fact that this post_connector_install can't guarantee reproducible results, as we rely on nltk to decide which data is downloaded.
@flash1293, could we host snapshots of this data on one of our public buckets and make this hook download that data and mount it into the container?
What I mean by "reproducible build" is that we always get the same image when running the build command on the same commit. If the data-downloading function of nltk is dynamic, we don't have reproducibility.
Inlined the nltk data index in the customization script so it points to a raw GitHub URL at a specific commit. I'm fairly confident this is always reproducible (short of the maintainers deleting the repo, and then we have bigger problems anyway).
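Roughly, the pinning amounts to something like this; the commit hash below is a placeholder, not the one used in the PR, and the exact path of the index file inside the nltk_data repo is an assumption:

```python
# Placeholder commit - the real script pins a specific nltk_data commit.
NLTK_DATA_COMMIT = "0123456789abcdef0123456789abcdef01234567"


def pinned_index_url(commit: str = NLTK_DATA_COMMIT) -> str:
    """raw.githubusercontent.com serves a file frozen at a commit, so the
    index (and therefore what gets downloaded) cannot change under us."""
    return f"https://raw.githubusercontent.com/nltk/nltk_data/{commit}/index.xml"
```

The download script could then construct the downloader against this fixed index, e.g. via `nltk.downloader.Downloader(server_index_url=...)` (check the signature for the pinned nltk version).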
    connector_container = setup_nltk(connector_container)

    # Install Tesseract and Poppler
    connector_container = install_tesseract_and_poppler(connector_container)
Can this be called in a pre_install_hook?
These are system dependency installations that don't depend on the Python package installation, if I'm not mistaken. This would speed up the build, as the apt-get install layers would be cached. In your current implementation, we'll re-download and install these tools on any code change.
Moved the install to pre_connector_install and pinned the versions.
Question about symlinks and git:
Do you see the connector's build_customization.py (symlink) in the git diff when you modify the root file_based_build_customization.py?
As of today, we detect connector changes based on changes in their folder via a git diff command. That conditionally triggers our CI and connector tests, and I think we'd like to keep this behavior.
In other words: when you change file_based_build_customization.py, do you want to release a new connector version for all the connectors using it, or do you want to decouple the hook changes from the connector release?
@alafanechere CDK changes are currently decoupled from connector changes: if a connector wants to use changed CDK functionality, the CDK needs to be published and a separate PR bumping the version needs to be opened (like the one I'm doing here). It seems right to me to extend this to the build customization script as well.
So no, this shouldn't automatically trigger a connector release. That makes the symlink approach less attractive.
@@ -21,7 +21,7 @@
      "order": 10,
      "type": "array",
      "items": {
-         "title": "S3FileBasedStreamConfig",
+         "title": "BasedStreamConfig",
"title": "BasedStreamConfig", | |
"title": "FileBasedStreamConfig", |
Thanks to @alafanechere, native dependencies are now provided by the base image. I added an acceptance test for Azure to make sure the new parser actually works (one already exists for S3).
Neat, all 🟢. Let's wait for a 👍 from my team on #31929 before merging.
Looks good! 🚀
#31929 was approved, so our base image version 1.2.0 is now official.
What
This PR updates the file CDK usage for the S3 and Azure Blob Storage sources.
For S3, it moves the document file type parsing logic into the CDK.
For Azure Blob Storage, it adds support for document file types, similar to S3.