
python-connector-base: add CDK system dependencies #31929

Conversation

@alafanechere (Contributor) commented Oct 27, 2023

What

Connectors parsing unstructured data require system dependencies (see #31904).
The unstructured data parsing logic is defined at the CDK level, so these dependencies can be considered "CDK system dependencies".
This PR bundles the following dependencies into our Python connector base image:

  • nltk data
  • tesseract
  • poppler

How

  • Download the nltk data, unzip it in a separate container, and write it to the base image container at `/usr/share/nltk_data` (a sketch of these steps follows this list)
  • Install tesseract and poppler
  • Write sanity checks to validate these dependencies are properly installed
  • Release a new base image minor version
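
For illustration, a rough sketch of these steps using the Dagger Python SDK. The image tags, function name, and exact shell commands below are assumptions for illustration, not the actual declarations in `bases.py`:

```python
import dagger

NLTK_DATA_PATH = "/usr/share/nltk_data"
PUNKT_URL = "https://github.com/nltk/nltk_data/raw/5db857e6f7df11eabb5e5665836db9ec8df07e28/packages/tokenizers/punkt.zip"


def build_python_connector_base(client: dagger.Client) -> dagger.Container:
    # Step 1: download and unzip the nltk data in a throwaway container,
    # keeping only the resulting /usr/share/nltk_data directory.
    nltk_data = (
        client.container()
        .from_("debian:bookworm-slim")  # assumed helper image
        .with_exec(["sh", "-c", "apt-get update && apt-get install -y curl unzip"])
        .with_exec(
            [
                "sh",
                "-c",
                f"mkdir -p {NLTK_DATA_PATH}/tokenizers"
                f" && curl -fsSL -o /tmp/punkt.zip {PUNKT_URL}"
                f" && unzip /tmp/punkt.zip -d {NLTK_DATA_PATH}/tokenizers",
            ]
        )
        .directory(NLTK_DATA_PATH)
    )
    # Step 2: write that data into the base image and install the system packages.
    return (
        client.container()
        .from_("python:3.9-slim-bookworm")  # assumed base tag
        .with_directory(NLTK_DATA_PATH, nltk_data)
        .with_exec(["sh", "-c", "apt-get update && apt-get install -y tesseract-ocr poppler-utils"])
    )
```

The sanity checks and the minor version release then run against the container this returns before the new image is published.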

🚨 User Impact 🚨

The base image size grows from 80 MB to 140 MB.


@alafanechere (Contributor, Author) commented Oct 27, 2023

Current dependencies on/for this PR:

This stack of pull requests is managed by Graphite.

@alafanechere alafanechere force-pushed the augustin/10-27-python-connector-base_file_based_CDK_system_deps branch from 73a2959 to 0e1d1db Compare October 30, 2023 10:42
@alafanechere alafanechere changed the title python-connector-base: file based CDK system deps python-connector-base: add CDK system dependencies Oct 30, 2023
@alafanechere alafanechere force-pushed the augustin/10-27-python-connector-base_file_based_CDK_system_deps branch from 0e1d1db to d59fc1a Compare October 30, 2023 10:47
@alafanechere alafanechere marked this pull request as ready for review October 30, 2023 10:52
@alafanechere alafanechere requested review from a team and flash1293 October 30, 2023 10:52
@erohmensing (Contributor) left a comment

The difference makes sense to me - these go in the base image rather than build customization because they're not just one connector's dependencies, they're CDK dependencies. I'd be interested to hear more about a hypothetical plan for "Eventually record somewhere a mapping between cdk-version <> compatible base image". :)

No blocking issues, just some questions + suggestions 🚂

Our base images are declared in code, using the [Dagger Python SDK](https://dagger-io.readthedocs.io/en/sdk-python-v0.6.4/).

- [Python base image code declaration](https://github.com/airbytehq/airbyte/blob/master/airbyte-ci/connectors/base_images/base_images/python/bases.py)
- ~Java base image code declaration~ TODO
- ~Java base image code declaration~ *TODO*


## Where are the Dockerfiles?
Contributor

This new version should also come with an artificially generated Dockerfile, right?

Contributor

I actually can't find those in the repo anymore but I swear they were there when we added this functionality...

Contributor

Ah! Maybe it just updates the readme accordingly, and we just need to change the "However, we do artificially generate Dockerfiles for debugging and documentation purposes." line.

Comment on lines +22 to +26
ntlk_data = {
"tokenizers": {"https://github.com/nltk/nltk_data/raw/5db857e6f7df11eabb5e5665836db9ec8df07e28/packages/tokenizers/punkt.zip"},
"taggers": {
"https://github.com/nltk/nltk_data/raw/5db857e6f7df11eabb5e5665836db9ec8df07e28/packages/taggers/averaged_perceptron_tagger.zip"
},
Contributor

Just noting that since these are raw github locations, if these packages move for any reason, all of our connectors will start to fail to build (right?).

I guess this is the same for e.g. any pypi package, but it feels a bit more dangerous on github where e.g. a repository rename or something that removes this specific commit could affect this. Don't think there's any action to take, just wanted to flag!

Contributor Author

> all of our connectors will start to fail to build

Nope. As the base image is published to DockerHub, once it's built successfully all the connectors can use it as a base image with a FROM-like statement; the layers downloading the remote resources are not recomputed at connector build time.

The official NLTK data is published on GitHub. We use per-commit URLs to make sure we get static file content that can't change, which improves build reproducibility.
It's unlikely that GitHub deletes per-commit files, unless NLTK explicitly asks them to unindex these URLs.

In any case: if it breaks, it will break when we try to cut a new base image version, not when building a connector.
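
To make the pinned-URL point concrete, here is a hedged sketch of how such a per-commit mapping could be consumed when building the base image. The helper name and shell commands are illustrative, and it assumes curl and unzip are available in the container:

```python
import dagger

# Category -> pinned per-commit zip URLs, as quoted in the diff above.
ntlk_data = {
    "tokenizers": {
        "https://github.com/nltk/nltk_data/raw/5db857e6f7df11eabb5e5665836db9ec8df07e28/packages/tokenizers/punkt.zip"
    },
    "taggers": {
        "https://github.com/nltk/nltk_data/raw/5db857e6f7df11eabb5e5665836db9ec8df07e28/packages/taggers/averaged_perceptron_tagger.zip"
    },
}

NLTK_DATA_PATH = "/usr/share/nltk_data"


def with_nltk_data(container: dagger.Container) -> dagger.Container:
    # Because each URL points at a specific nltk_data commit, the downloaded content
    # is immutable: two builds of the same base image version fetch identical bytes.
    for category, urls in ntlk_data.items():
        for url in urls:
            container = container.with_exec(
                [
                    "sh",
                    "-c",
                    f"curl -fsSL -o /tmp/data.zip {url}"
                    f" && unzip -o /tmp/data.zip -d {NLTK_DATA_PATH}/{category}",
                ],
                skip_entrypoint=True,
            )
    return container
```

And, as the reply above notes, connector builds only `FROM` the published image on DockerHub, so these download layers are never re-run at connector build time.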

Comment on lines +105 to +106
# Install CDK system dependencies
.with_(self.install_cdk_system_dependencies())
Contributor

Love this - helps us figure out where things go if there are more of them, and what things to pull out if we ever move them somewhere else

Comment on lines 126 to 128
await python_sanity_checks.check_nltk_data(container)
await python_sanity_checks.check_tesseract_version(container, "5.3.0")
await python_sanity_checks.check_poppler_utils_version(container, "22.12.0")
Contributor

Optional: bundling these under a check_cdk_system_dependencies method might make it clearer how the checks match up to the way the image is built
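
A sketch of what that bundling might look like, reusing the existing check helpers quoted in this diff (the wrapper name comes from the suggestion; the import path is an assumption):

```python
import dagger

from base_images import python_sanity_checks  # assumed import path


async def check_cdk_system_dependencies(container: dagger.Container) -> None:
    # One entry point per "CDK system dependencies" install step,
    # so the checks mirror the way the image is built.
    await python_sanity_checks.check_nltk_data(container)
    await python_sanity_checks.check_tesseract_version(container, "5.3.0")
    await python_sanity_checks.check_poppler_utils_version(container, "22.12.0")
```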

},
}

def install_cdk_system_dependencies(self) -> Callable:
Contributor

❓ I guess this callable pattern is what allows us to perform the manipulation of the container as part of the container building pipeline rather than e.g.

container = (self.get_base_container(..... other steps )) 
return with_file_based_connector_dependencies(container)

If so, nice, and I'm wondering if this can help with the strangeness we have in pipelines where some things go from context -> container and others go from container -> container 🤔

Contributor Author

@erohmensing yes, the container.with_(callable) is a nice pattern to keep a clean container operation call chain.
It was introduced in a "recent" dagger version, which explains why we're not using it everywhere.
I also found it tricky in some cases to create an async function which returns a callable.
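
For readers unfamiliar with the pattern, a minimal sketch of how such a `with_`-compatible callable can be structured (simplified; the real `install_cdk_system_dependencies` is a method on the base image class and does more than this):

```python
from typing import Callable

import dagger


def install_cdk_system_dependencies() -> Callable[[dagger.Container], dagger.Container]:
    # Container.with_() takes a Container -> Container callable and applies it,
    # so this operation slots into an existing fluent call chain.
    def with_cdk_system_dependencies(container: dagger.Container) -> dagger.Container:
        return container.with_exec(
            ["sh", "-c", "apt-get update && apt-get install -y tesseract-ocr poppler-utils"],
            skip_entrypoint=True,
        )

    return with_cdk_system_dependencies


# Usage inside the build chain:
# container = base_container.with_(install_cdk_system_dependencies())
```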

Contributor

Got it, thanks for the context!

Comment on lines +56 to +57
container = container.with_exec(
["sh", "-c", "apt-get update && apt-get install -y tesseract-ocr=5.3.0-2 poppler-utils=22.12.0-2+b1"], skip_entrypoint=True
Contributor

Seeing multiple places where the sh_dash_c util from pipelines could potentially be helpful here
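
For context, a helper along the lines of the sh_dash_c util mentioned here essentially joins commands and hands them to `sh -c`; a minimal sketch, assuming the exact signature in the pipelines package may differ:

```python
from typing import List


def sh_dash_c(commands: List[str]) -> List[str]:
    # Join the commands with '&&' and wrap them for `sh -c`, so callers pass plain
    # command strings instead of hand-building the ["sh", "-c", ...] list.
    return ["sh", "-c", " && ".join(commands)]


# The install step above could then read:
# container.with_exec(
#     sh_dash_c(["apt-get update", "apt-get install -y tesseract-ocr=5.3.0-2 poppler-utils=22.12.0-2+b1"]),
#     skip_entrypoint=True,
# )
```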

Comment on lines +103 to +109
try:
await with_nltk.with_exec(
["python", "-c", 'import nltk;nltk.data.find("taggers/averaged_perceptron_tagger");nltk.data.find("tokenizers/punkt")'],
skip_entrypoint=True,
)
except dagger.ExecError as e:
raise errors.SanityCheckError(e)
Contributor

Doesn't have to be here, but I think we would benefit from making some custom assertion mechanism for this test file - we are doing quite a lot of

try: 
    ... 
except dagger.ExecError as e:
    raise errors.SanityCheckError(e)
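
A hedged sketch of what such an assertion helper could look like (the helper name is a placeholder; the import path mirrors the imports visible in this diff but is an assumption):

```python
import dagger

from base_images import errors  # assumed import path


async def assert_exec_succeeds(container: dagger.Container, command: list) -> None:
    # Centralize the repeated try/except: run the command and convert any
    # dagger.ExecError into the SanityCheckError the sanity check module expects.
    try:
        await container.with_exec(command, skip_entrypoint=True)
    except dagger.ExecError as e:
        raise errors.SanityCheckError(e)
```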

@alafanechere alafanechere force-pushed the augustin/10-27-python-connector-base_file_based_CDK_system_deps branch from d59fc1a to ff0a9ad Compare October 31, 2023 08:30
@alafanechere alafanechere enabled auto-merge (squash) October 31, 2023 08:32
@alafanechere alafanechere merged commit deef5ee into master Oct 31, 2023
21 checks passed
@alafanechere alafanechere deleted the augustin/10-27-python-connector-base_file_based_CDK_system_deps branch October 31, 2023 08:42