S3 and Azure Blob Storage: Update File CDK to support document file types #31904
New file `build_customization.py`:

```python
@@ -0,0 +1,63 @@
#
# Copyright (c) 2023 Airbyte, Inc., all rights reserved.
#
from __future__ import annotations

import textwrap
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from dagger import Container


def setup_nltk(connector_container: Container) -> Container:
    """
    Seeds the connector with nltk data at build time. This is because the nltk data
    is large and takes a long time to download. It runs a python script that downloads
    the data following connector installation.
    """
    nltk_python_script = textwrap.dedent(
        """
        import nltk
        nltk.download('punkt')
        nltk.download('averaged_perceptron_tagger')
        """
    )
    connector_container = (
        connector_container.with_new_file("/tmp/nltk_python_script.py", nltk_python_script)
        .with_exec(["python", "/tmp/nltk_python_script.py"], skip_entrypoint=True)
        .with_exec(["rm", "/tmp/nltk_python_script.py"], skip_entrypoint=True)
    )

    return connector_container
```

> **Review comment (on `setup_nltk`):** Could you comment on which system path the nltk data will be written?
>
> **Reply:** Done.
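On the reviewer's question about where the nltk data lands: by default, `nltk.download()` picks a directory from nltk's search path (influenced by the `NLTK_DATA` environment variable and the running user). One way to make the target explicit is the standard `download_dir` argument. This is only a sketch, not the PR's final code, and the `/usr/share/nltk_data` path is an assumption (it is one of the locations nltk searches by default on Linux):

```python
import textwrap


def build_nltk_script(data_dir: str = "/usr/share/nltk_data") -> str:
    """Render a seeding script that pins the nltk data directory explicitly."""
    # download_dir is a standard nltk.download() argument; without it, the
    # destination depends on NLTK_DATA and the user running the build.
    return textwrap.dedent(
        f"""
        import nltk
        nltk.download('punkt', download_dir='{data_dir}')
        nltk.download('averaged_perceptron_tagger', download_dir='{data_dir}')
        """
    )
```

The rendered string can then be written into the container with `with_new_file` exactly like the script in the diff above.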
```python
def install_tesseract_and_poppler(connector_container: Container) -> Container:
    """
    Installs Tesseract-OCR and Poppler-utils in the container. These tools are
    necessary for OCR (Optical Character Recognition) processing and for working
    with PDFs, respectively.
    """
    connector_container = connector_container.with_exec(
        ["sh", "-c", "apt-get update && apt-get install -y tesseract-ocr poppler-utils"], skip_entrypoint=True
    )

    return connector_container
```

> **Review comment (on the `apt-get` call):** Can we install specific versions of these utils to maximize build reproducibility?
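On the reviewer's reproducibility point: Debian-based images allow pinning exact package versions with `apt-get install pkg=version`. A sketch of how the `with_exec` command could carry such pins — the version strings below are hypothetical placeholders, not versions the connector actually uses (check `apt-cache policy tesseract-ocr poppler-utils` inside the base image for real ones):

```python
def build_apt_install_command(pins: dict[str, str]) -> list[str]:
    """Build the sh -c command passed to with_exec, with explicit version pins."""
    # pkg=version is apt's syntax for requesting one exact package version.
    pkgs = " ".join(f"{name}={version}" for name, version in sorted(pins.items()))
    return ["sh", "-c", f"apt-get update && apt-get install -y {pkgs}"]


# Hypothetical versions for illustration only.
cmd = build_apt_install_command({"tesseract-ocr": "5.3.0-2", "poppler-utils": "22.12.0-2"})
```

Note that pinning helps but is not sufficient on its own: the apt index fetched by `apt-get update` still moves over time, so old pins can disappear from the archive.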
```python
async def post_connector_install(connector_container: Container) -> Container:
    """
    Handles post-installation setup for the connector by setting up nltk and
    installing necessary system dependencies such as Tesseract-OCR and Poppler-utils.

    These steps are necessary if the unstructured parser from the file-based CDK
    is exposed in the connector.
    """
    # Set up nltk in the container
    connector_container = setup_nltk(connector_container)

    # Install Tesseract and Poppler
    connector_container = install_tesseract_and_poppler(connector_container)

    return connector_container
```

> **Review comment (on the `setup_nltk` call):** Something interesting to note here, which is playing against the bundling of the nltk data in the base image, is that: […] This is why it makes sense for it to be a […]. I'm also a bit concerned by the fact that this […]. What I mean by "reproducible build" is that we always get the same image when running the build command on the same commit. If the data-downloading function of nltk is dynamic, we don't have reproducibility.
>
> **Reply:** Inlined the nltk data index in the customization script so it points to a raw GitHub URL of that specific commit. I have pretty high confidence this is always reproducible (short of the maintainers deleting the repo, and then we have problems anyway).

> **Review comment (on the `install_tesseract_and_poppler` call):** Can this be called in a […]?
>
> **Reply:** Moved the install to […].

> **Review comment:** This […]. I'd like to think a bit with my team about what could be the best option to avoid code duplication: A. Install these dependencies in our […] I'm more inclined toward C, but would recommend B to not block you right now.
>
> **Reply:** I agree with all of this (B for now, C in a follow-up). We could re-build the base image as part of the CDK publish action (this would still require manually bumping the affected connectors, but I think it would still be helpful).
>
> **Reply:** After thinking a bit more about it, I'm not sure that C is our best long-term bet. I believe it adds complexity, and we can end up in a similar situation as we are in today with […].
>
> **Reply:** Not sure about the best solution, but the symlink thing seems wrong as well. For now I duplicated the file, but happy to go for a better solution.
>
> **Reply:** Since the […], then the […].
>
> **Reply:** So here's what we realized could be the best approach to take with @flash1293: […]

> **Review comment:** I think @erohmensing's suggestion is interesting. This means we would need to install the CDK, in the version specified in the setup.py of the connector, in the python environment that is building the dagger pipeline. If that's possible it might work, although it feels a little crazy.
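The author's reproducibility fix was to point nltk at the data index of one specific `nltk_data` commit instead of the moving default index. A sketch of what that can look like — the commit hash is a placeholder, and while `server_index_url` is a real parameter of `nltk.downloader.Downloader`, whether the PR used exactly this mechanism is an assumption:

```python
# Hypothetical sketch: pin the nltk download index to one nltk_data commit so
# the same build command always fetches identical data.
PINNED_COMMIT = "abcdef0"  # placeholder, not the commit the PR actually pinned
INDEX_URL = f"https://raw.githubusercontent.com/nltk/nltk_data/{PINNED_COMMIT}/index.xml"

# The seeding script run inside the container then binds a Downloader to the
# pinned index instead of nltk's default, always-moving one:
NLTK_SCRIPT = f"""
from nltk.downloader import Downloader

downloader = Downloader(server_index_url={INDEX_URL!r})
downloader.download('punkt')
downloader.download('averaged_perceptron_tagger')
"""
```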
New file `build_customization.py` (symlink):

```
@@ -0,0 +1 @@
../../../airbyte-cdk/python/file_based_build_customization.py
```
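Git stores a symlink as a one-line file whose content is the link target, which is exactly what the one-line diff above shows. A small sketch reproducing that layout (the `source-s3` path is illustrative, and the shared script here is just an empty stand-in):

```python
import os
import tempfile

# Mirror the repo layout: a shared script in airbyte-cdk/python and a
# connector directory three levels away from the repo root.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "airbyte-cdk", "python"))
os.makedirs(os.path.join(root, "airbyte-integrations", "connectors", "source-s3"))
open(os.path.join(root, "airbyte-cdk", "python", "file_based_build_customization.py"), "w").close()

# The connector's build_customization.py is a relative symlink; its target
# string is the single line git records for the file.
link = os.path.join(root, "airbyte-integrations", "connectors", "source-s3", "build_customization.py")
os.symlink("../../../airbyte-cdk/python/file_based_build_customization.py", link)
```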
`setup.py` changes:

```diff
@@ -5,7 +5,20 @@

 from setuptools import find_packages, setup

-MAIN_REQUIREMENTS = ["airbyte-cdk>=0.51.17", "smart_open[azure]", "pytz", "fastavro==1.4.11", "pyarrow"]
+MAIN_REQUIREMENTS = [
+    "airbyte-cdk>=0.52.5",
+    "smart_open[azure]",
+    "pytz",
+    "fastavro==1.4.11",
+    "pyarrow",
+    "unstructured==0.10.19",
+    "pdf2image==1.16.3",
+    "pdfminer.six==20221105",
+    "unstructured[docx]==0.10.19",
+    "unstructured.pytesseract>=0.3.12",
+    "pytesseract==0.3.10",
+    "markdown",
+]

 TEST_REQUIREMENTS = ["requests-mock~=1.9.3", "pytest-mock~=3.6.1", "pytest~=6.2"]
```

> **Review comment (on the `unstructured` pins):** This stuff is also included in […]. The same question applies to S3 (which makes me think that it wasn't a conscious choice). Do you know where this came from @clnoll?
>
> **Reply:** I read this comment after I submitted my first review.
>
> **Reply:** According to your PR title I would assume that these two connectors would depend on the file-based CDK, and the unstructured lib would be bundled in it. It'd be great if connectors and CDK could agree on the […].
>
> **Reply:** I'm not aware of any reason not to use the file-based CDK's dependencies.
>
> **Reply:** Switched it to use the extra […].
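The thread above was resolved by consuming the file-based CDK's own dependency pins through a setuptools "extra" rather than duplicating them in each connector. A sketch of the mechanism — the extra name `file-based` and the pin list are assumptions for illustration, not taken verbatim from the PR:

```python
# In the CDK's setup.py (sketch): document-parsing deps live behind an extra.
CDK_EXTRAS = {
    "file-based": [
        "unstructured==0.10.19",
        "pdf2image==1.16.3",
        "pdfminer.six==20221105",
        "pytesseract==0.3.10",
        "markdown",
    ],
}

# In the connector's setup.py: a single requirement pulls the whole pinned set
# in, so the CDK and the connector cannot drift apart.
MAIN_REQUIREMENTS = [
    "airbyte-cdk[file-based]>=0.52.5",
    "smart_open[azure]",
    "pytz",
    "fastavro==1.4.11",
    "pyarrow",
]
```

The design benefit is that bumping the CDK version is the only change a connector needs when the parsing dependencies move.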
This file was deleted.
New file `build_customization.py` (symlink):

```
@@ -0,0 +1 @@
../../../airbyte-cdk/python/file_based_build_customization.py
```
Spec changes:

```diff
@@ -21,7 +21,7 @@
     "order": 10,
     "type": "array",
     "items": {
-      "title": "S3FileBasedStreamConfig",
+      "title": "BasedStreamConfig",
      "type": "object",
      "properties": {
        "name": {
```

> **Review comment (on the `title` line):** Suggested change: […]

```diff
@@ -270,7 +270,7 @@
           }
         },
         {
-          "title": "Markdown/PDF/Docx Format (Experimental)",
+          "title": "Document File Type Format (Experimental)",
           "type": "object",
           "properties": {
             "filetype": {
```

```diff
@@ -280,7 +280,7 @@
               "type": "string"
             }
           },
-          "description": "Extract text from document formats and emit as one record per file."
+          "description": "Extract text from document formats (.pdf, .docx, .md) and emit as one record per file."
         }
       ]
     },
```
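Reassembled from the visible diff lines above, the renamed format option corresponds roughly to this JSON-schema fragment. The `filetype` const value `unstructured` is an assumption — the diff does not show it:

```python
# Sketch of the spec's "Document File Type Format" oneOf branch, rebuilt from
# the hunks above; only the title, filetype type, and description are confirmed.
document_format_option = {
    "title": "Document File Type Format (Experimental)",
    "type": "object",
    "properties": {
        "filetype": {
            "title": "Filetype",
            "const": "unstructured",  # assumption: discriminator value not visible in the diff
            "type": "string",
        }
    },
    "description": "Extract text from document formats (.pdf, .docx, .md) and emit as one record per file.",
}
```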
> **Review comment:** Question about symlink and git: do you see the connector `build_customization.py` (symlink) in the git diff when you modify the root `file_based_build_customization.py`?
>
> As of today we detect connector changes based on changes in their folder with a `git diff` command. It conditionally triggers our CI and connector tests. I think we would like to keep this behavior. In other words: when you change `file_based_build_customization.py`, do you want to release a new connector version for all the connectors using it, or do you want to decouple the hook changes from the connector release?
>
> **Reply:** @alafanechere CDK changes are currently decoupled from connector changes: if a connector wants to use changed CDK stuff, the CDK needs to be published and a separate PR bumping the version needs to be opened (like what I'm doing here). It seems correct to me to extend this to the build customization script as well.
>
> So no, this shouldn't automatically trigger a connector release. Seems like this makes the symlink approach less attractive.