
python-connector-base: add CDK system dependencies #31929

Conversation

@alafanechere (Contributor) commented Oct 27, 2023

What

Connectors parsing unstructured data require system dependencies (see #31904).
The unstructured data parsing logic is defined at the CDK level, so these dependencies can be considered "CDK system dependencies".
This PR bundles the following dependencies into our Python connector base image:

  • nltk data
  • tesseract
  • poppler

How

  • Download the nltk data, unzip it in a separate container, and write it to the base image container at `/usr/share/nltk_data` (a sketch of these steps follows this list)
  • Install tesseract and poppler
  • Write sanity checks to validate these dependencies are properly installed
  • Release a new base image minor version
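
For illustration, a rough sketch of these steps using the Dagger Python SDK. The image tags, function name, and exact shell commands below are assumptions for illustration, not the actual declarations in `bases.py`:

```python
import dagger

NLTK_DATA_PATH = "/usr/share/nltk_data"
PUNKT_URL = "https://github.com/nltk/nltk_data/raw/5db857e6f7df11eabb5e5665836db9ec8df07e28/packages/tokenizers/punkt.zip"


def build_python_connector_base(client: dagger.Client) -> dagger.Container:
    # Step 1: download and unzip the nltk data in a throwaway container,
    # keeping only the resulting /usr/share/nltk_data directory.
    nltk_data = (
        client.container()
        .from_("debian:bookworm-slim")  # assumed helper image
        .with_exec(["sh", "-c", "apt-get update && apt-get install -y curl unzip"])
        .with_exec(
            [
                "sh",
                "-c",
                f"mkdir -p {NLTK_DATA_PATH}/tokenizers"
                f" && curl -fsSL -o /tmp/punkt.zip {PUNKT_URL}"
                f" && unzip /tmp/punkt.zip -d {NLTK_DATA_PATH}/tokenizers",
            ]
        )
        .directory(NLTK_DATA_PATH)
    )
    # Step 2: write that data into the base image and install the system packages.
    return (
        client.container()
        .from_("python:3.9-slim-bookworm")  # assumed base tag
        .with_directory(NLTK_DATA_PATH, nltk_data)
        .with_exec(["sh", "-c", "apt-get update && apt-get install -y tesseract-ocr poppler-utils"])
    )
```

The sanity checks and the minor version release then run against the container this returns before the new image is published.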

🚨 User Impact 🚨

The base image size grows from 80 MB to 140 MB.


@alafanechere (Contributor, Author) commented Oct 27, 2023

Current dependencies on/for this PR:

This stack of pull requests is managed by Graphite.

@alafanechere alafanechere force-pushed the augustin/10-27-python-connector-base_file_based_CDK_system_deps branch from 73a2959 to 0e1d1db Compare October 30, 2023 10:42
@alafanechere alafanechere changed the title python-connector-base: file based CDK system deps python-connector-base: add CDK system dependencies Oct 30, 2023
@alafanechere alafanechere force-pushed the augustin/10-27-python-connector-base_file_based_CDK_system_deps branch from 0e1d1db to d59fc1a Compare October 30, 2023 10:47
@alafanechere alafanechere marked this pull request as ready for review October 30, 2023 10:52
@alafanechere alafanechere requested review from a team and flash1293 October 30, 2023 10:52
@erohmensing (Contributor) left a comment

The difference makes sense to me - these go in the base image rather than build customization because they're not just one connector's dependencies, they're CDK dependencies. I'd be interested to hear more about a hypothetical plan for "Eventually record somewhere a mapping between cdk-version <> compatible base image". :)

No blocking issues, just some questions + suggestions 🚂

Our base images are declared in code, using the [Dagger Python SDK](https://dagger-io.readthedocs.io/en/sdk-python-v0.6.4/).

- [Python base image code declaration](https://github.com/airbytehq/airbyte/blob/master/airbyte-ci/connectors/base_images/base_images/python/bases.py)
- ~Java base image code declaration~ TODO
- ~Java base image code declaration~ *TODO*


## Where are the Dockerfiles?
Contributor

This new version should also come with an artificially generated Dockerfile, right?

Contributor

I actually can't find those in the repo anymore but I swear they were there when we added this functionality...

Contributor

Ah! Maybe it just updates the readme accordingly, and we just need to change the "However, we do artificially generate Dockerfiles for debugging and documentation purposes." line.

Comment on lines +22 to +26
ntlk_data = {
"tokenizers": {"https://github.com/nltk/nltk_data/raw/5db857e6f7df11eabb5e5665836db9ec8df07e28/packages/tokenizers/punkt.zip"},
"taggers": {
"https://github.com/nltk/nltk_data/raw/5db857e6f7df11eabb5e5665836db9ec8df07e28/packages/taggers/averaged_perceptron_tagger.zip"
},
Contributor

Just noting that since these are raw github locations, if these packages move for any reason, all of our connectors will start to fail to build (right?).

I guess this is the same for e.g. any pypi package, but it feels a bit more dangerous on github where e.g. a repository rename or something that removes this specific commit could affect this. Don't think there's any action to take, just wanted to flag!

Contributor Author

> all of our connectors will start to fail to build

Nope. As the base image is published to DockerHub, once it's built successfully all the connectors can use it as a base image with a FROM-like statement; the layers downloading the remote resources are not recomputed at connector build time.

The official NLTK data is published on GitHub. We use per-commit URLs to make sure we get static file content that can't change, which improves build reproducibility.
It's unlikely that GitHub deletes per-commit files, unless NLTK explicitly asks them to unindex these URLs.

In any case: if it breaks, it will break when we try to cut a new base image version, not when building a connector.
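
To make the pinned-URL point concrete, here is a hedged sketch of how such a per-commit mapping could be consumed when building the base image. The helper name and shell commands are illustrative, and it assumes curl and unzip are available in the container:

```python
import dagger

# Category -> pinned per-commit zip URLs, as quoted in the diff above.
ntlk_data = {
    "tokenizers": {
        "https://github.com/nltk/nltk_data/raw/5db857e6f7df11eabb5e5665836db9ec8df07e28/packages/tokenizers/punkt.zip"
    },
    "taggers": {
        "https://github.com/nltk/nltk_data/raw/5db857e6f7df11eabb5e5665836db9ec8df07e28/packages/taggers/averaged_perceptron_tagger.zip"
    },
}

NLTK_DATA_PATH = "/usr/share/nltk_data"


def with_nltk_data(container: dagger.Container) -> dagger.Container:
    # Because each URL points at a specific nltk_data commit, the downloaded content
    # is immutable: two builds of the same base image version fetch identical bytes.
    for category, urls in ntlk_data.items():
        for url in urls:
            container = container.with_exec(
                [
                    "sh",
                    "-c",
                    f"curl -fsSL -o /tmp/data.zip {url}"
                    f" && unzip -o /tmp/data.zip -d {NLTK_DATA_PATH}/{category}",
                ],
                skip_entrypoint=True,
            )
    return container
```

And, as the reply above notes, connector builds only `FROM` the published image on DockerHub, so these download layers are never re-run at connector build time.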

Comment on lines +105 to +106
# Install CDK system dependencies
.with_(self.install_cdk_system_dependencies())
Contributor

Love this - helps us figure out where things go if there are more of them, and what things to pull out if we ever move them somewhere else

Comment on lines 126 to 128
await python_sanity_checks.check_nltk_data(container)
await python_sanity_checks.check_tesseract_version(container, "5.3.0")
await python_sanity_checks.check_poppler_utils_version(container, "22.12.0")
Contributor

Optional: bundling these under a check_cdk_system_dependencies method might make it clearer how the checks match up to the way the image is built
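
A sketch of what that bundling might look like, reusing the existing check helpers quoted in this diff (the wrapper name comes from the suggestion; the import path is an assumption):

```python
import dagger

from base_images import python_sanity_checks  # assumed import path


async def check_cdk_system_dependencies(container: dagger.Container) -> None:
    # One entry point per "CDK system dependencies" install step,
    # so the checks mirror the way the image is built.
    await python_sanity_checks.check_nltk_data(container)
    await python_sanity_checks.check_tesseract_version(container, "5.3.0")
    await python_sanity_checks.check_poppler_utils_version(container, "22.12.0")
```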

},
}

def install_cdk_system_dependencies(self) -> Callable:
Contributor

❓ I guess this callable pattern is what allows us to perform the manipulation of the container as part of the container building pipeline rather than e.g.

container = (self.get_base_container(..... other steps )) 
return with_file_based_connector_dependencies(container)

If so, nice, and I'm wondering if this can help with the strangeness we have in pipelines where some things go from context -> container and others go from container -> container 🤔

Contributor Author

@erohmensing yes, the container.with_(callable) is a nice pattern to keep a clean container operation call chain.
It was introduced in a "recent" dagger version, which explains why we're not using it everywhere.
I also found it tricky in some cases to create an async function which returns a callable.
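
For readers unfamiliar with the pattern, a minimal sketch of how such a `with_`-compatible callable can be structured (simplified; the real `install_cdk_system_dependencies` is a method on the base image class and does more than this):

```python
from typing import Callable

import dagger


def install_cdk_system_dependencies() -> Callable[[dagger.Container], dagger.Container]:
    # Container.with_() takes a Container -> Container callable and applies it,
    # so this operation slots into an existing fluent call chain.
    def with_cdk_system_dependencies(container: dagger.Container) -> dagger.Container:
        return container.with_exec(
            ["sh", "-c", "apt-get update && apt-get install -y tesseract-ocr poppler-utils"],
            skip_entrypoint=True,
        )

    return with_cdk_system_dependencies


# Usage inside the build chain:
# container = base_container.with_(install_cdk_system_dependencies())
```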

Contributor

Got it, thanks for the context!

Comment on lines +56 to +57
container = container.with_exec(
["sh", "-c", "apt-get update && apt-get install -y tesseract-ocr=5.3.0-2 poppler-utils=22.12.0-2+b1"], skip_entrypoint=True
Contributor

Seeing multiple places where the sh_dash_c util from pipelines could potentially be helpful here
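
For context, a helper along the lines of the sh_dash_c util mentioned here essentially joins commands and hands them to `sh -c`; a minimal sketch, assuming the exact signature in the pipelines package may differ:

```python
from typing import List


def sh_dash_c(commands: List[str]) -> List[str]:
    # Join the commands with '&&' and wrap them for `sh -c`, so callers pass plain
    # command strings instead of hand-building the ["sh", "-c", ...] list.
    return ["sh", "-c", " && ".join(commands)]


# The install step above could then read:
# container.with_exec(
#     sh_dash_c(["apt-get update", "apt-get install -y tesseract-ocr=5.3.0-2 poppler-utils=22.12.0-2+b1"]),
#     skip_entrypoint=True,
# )
```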

Comment on lines +103 to +109
try:
await with_nltk.with_exec(
["python", "-c", 'import nltk;nltk.data.find("taggers/averaged_perceptron_tagger");nltk.data.find("tokenizers/punkt")'],
skip_entrypoint=True,
)
except dagger.ExecError as e:
raise errors.SanityCheckError(e)
Contributor

Doesn't have to be here, but I think we would benefit from making some custom assertion mechanism for this test file - we are doing quite a lot of

try: 
    ... 
except dagger.ExecError as e:
    raise errors.SanityCheckError(e)
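
A hedged sketch of what such an assertion helper could look like (the helper name is a placeholder; the import path mirrors the imports visible in this diff but is an assumption):

```python
import dagger

from base_images import errors  # assumed import path


async def assert_exec_succeeds(container: dagger.Container, command: list) -> None:
    # Centralize the repeated try/except: run the command and convert any
    # dagger.ExecError into the SanityCheckError the sanity check module expects.
    try:
        await container.with_exec(command, skip_entrypoint=True)
    except dagger.ExecError as e:
        raise errors.SanityCheckError(e)
```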

@alafanechere alafanechere force-pushed the augustin/10-27-python-connector-base_file_based_CDK_system_deps branch from d59fc1a to ff0a9ad Compare October 31, 2023 08:30
@alafanechere alafanechere enabled auto-merge (squash) October 31, 2023 08:32
@alafanechere alafanechere merged commit deef5ee into master Oct 31, 2023
21 checks passed
@alafanechere alafanechere deleted the augustin/10-27-python-connector-base_file_based_CDK_system_deps branch October 31, 2023 08:42