feat: add FsspecConnector to easily integrate new connectors with a fsspec implementation available #318
I also wanted to get your input on the best way to integrate this within the CLI. I thought about having separate CLI options per host, e.g. |
CI doesn't seem to be passing the tests because both |
This isn't a cache issue. The CI performs the following installs prior to starting the ingest tests (unstructured/.github/workflows/ci.yml, lines 110 to 112 in 95109db).
We would expect make install-ingest-s3 to install the (new) required dependencies (s3fs and fsspec), but make install-ingest-s3 is simply this (lines 52 to 55 in 95109db).
And this PR does not update unstructured/requirements/ingest-s3.txt (lines 19 to 20 in 95109db),
which therefore does not install e.g. s3fs or fsspec. Resolving this issue is as simple as running
and committing the changes.
|
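A quick way to catch this kind of gap before CI does is to check a requirements file for the packages the new code imports. The following is only a minimal sketch: the helper names and the sample file contents are hypothetical and are not taken from the repository.

```python
import re


def listed_packages(requirements_text: str) -> set[str]:
    """Return the normalized package names pinned in a requirements file."""
    names = set()
    for line in requirements_text.splitlines():
        # Drop comments (including pip-compile "# via ..." lines) and blanks.
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.add(match.group(0).lower().replace("_", "-"))
    return names


def missing_packages(requirements_text: str, required: list[str]) -> list[str]:
    """List the required packages that the requirements file does not pin."""
    wanted = {name.lower().replace("_", "-") for name in required}
    return sorted(wanted - listed_packages(requirements_text))


# Hypothetical contents, NOT the real requirements/ingest-s3.txt.
SAMPLE = """\
# sample pinned requirements
boto3==1.26.0
    # via -r requirements/ingest-s3.in
botocore==1.29.0
"""

print(missing_packages(SAMPLE, ["s3fs", "fsspec", "boto3"]))
```

With the sample above, s3fs and fsspec are reported as missing while boto3 is not.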
True @tomaarsen I was running those locally installing the extras. I'll rerun |
I also have another question @cragwolfe for
I think that we need to discuss the Azure Blob Storage integration further, but that can be done separately in #257 |
https://github.com/aio-libs/aiobotocore not being highly maintained gives me pause on making this the default for s3. In the context of ingest multiprocessing fetching one file per process, aiobotocore probably isn't helping much over boto3 either.
Besides that (and additional considerations below), Fsspec seems like a really neat library to support and quickly get a broad range of sources to connect to.
What about making fsspec the connector where filesystem could be the first param? Not entirely sure how the filesystem-specific extra params would be passed through but there are ways to solve that. Maybe easier if fsspec is a subcommand. I'm also not entirely sure how to handle the dependencies as they may be different than a "native" implementation, e.g. s3. I mean, there could be a lot of make install-fsspec- targets and setup.py extras, but that's starting to get excessive. Maybe draw the line there and leave it to the user to get the deps right.
@tomaarsen's prior comments on Azure Blob Storage (ABS) also kind of align here. If there are quirks for other filesystems in fsspec that make them best handled natively, then that option is preserved by just having fsspec remain its own subcommand.
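To make the subcommand idea above concrete, here is a rough sketch of one way it could look. Everything in it is an assumption for illustration: the program name `ingest`, the `--option KEY=VALUE` flag, and the helper names are hypothetical, and it uses argparse rather than whatever the project's CLI actually uses. The only real library call is `fsspec.filesystem(protocol, **storage_options)`, which is imported lazily so that getting fsspec and the backend package installed is left to the user.

```python
import argparse


def parse_option(text: str) -> tuple[str, str]:
    """Split a KEY=VALUE string into a (key, value) pair."""
    key, sep, value = text.partition("=")
    if not sep or not key:
        raise argparse.ArgumentTypeError(f"expected KEY=VALUE, got {text!r}")
    return key, value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ingest")  # hypothetical program name
    subcommands = parser.add_subparsers(dest="command", required=True)
    fsspec_cmd = subcommands.add_parser("fsspec", help="read from any fsspec filesystem")
    # The filesystem (fsspec protocol) is the first positional parameter.
    fsspec_cmd.add_argument("filesystem", help="fsspec protocol name, e.g. s3")
    fsspec_cmd.add_argument("path", help="remote path to read documents from")
    # Filesystem-specific extras are passed through as repeatable KEY=VALUE pairs.
    fsspec_cmd.add_argument(
        "--option", "-o", action="append", type=parse_option, default=[],
        metavar="KEY=VALUE", help="filesystem-specific setting (repeatable)",
    )
    return parser


def open_filesystem(args: argparse.Namespace):
    """Build the filesystem object; requires fsspec plus the backend package."""
    # Lazy import: the CLI still parses when the optional deps are missing.
    import fsspec

    # Values stay strings in this sketch; a real CLI would need type conversion.
    return fsspec.filesystem(args.filesystem, **dict(args.option))


# "profile=dev" is a placeholder option, not a recommendation.
example = build_parser().parse_args(["fsspec", "s3", "my-bucket/docs", "-o", "profile=dev"])
print(example.filesystem, example.path, dict(example.option))
```

Keeping the extras as opaque key/value pairs means the CLI never has to know each backend's parameters, at the cost of losing per-filesystem validation and help text.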
So AFAIU s3fs is actively maintained (see https://github.com/fsspec/s3fs) and widely used by some popular libraries such as
I think the line should be drawn at this point, because support is offered for
Also, the Azure Blob Storage comments above were mine, not @tomaarsen's 😆, but sure I agree, maybe we can merge this for now (just active S3 support) and I'll work on the CLI improvement in a separate PR to both keep the current functionality and support fast plug-and-play for any other filesystem with a |
my bad!
I guess I don't see the benefit of immediately replacing the existing connector vs keeping them side by side (hesitation being really old version of boto3, unneeded async layer from aiobotocore, weirder stack traces when something goes wrong through extra layers of abstraction) but if the test passes, sure. We could always revert. |
So the issue seems to be related to But for now I'd say we're good to merge this PR 👍🏻 |
Would this interfere with Linux users' ability to use the functionality from this PR? |
Hopefully only an s3fs issue, but good point. Probably worth a mention in a docstring for now in _s3.py and fsspec.py. |
I'll explore this further today and come back with either the docstrings update or a solution on that :) But yes it seems just a |
Ok I've already solved it and made sure it works on macOS, Windows, and Linux, so I'll push the commits now to use I'll give you more details once I make sure the CI is also passing @cragwolfe @tomaarsen |
So the issue seems to be just occurring in Linux since the default To check that Windows, Linux, and macOS support Also, I took the decision to include that on the top of the file surrounded with a I would have posted more references, but I checked a lot of resources and took ideas from here and there while trying to test this on Windows, Linux, and macOS to ensure it works consistently across OS. |
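For readers following along, the start-method guard discussed here can be sketched as below. It mirrors the `with suppress(RuntimeError): mp.set_start_method("spawn")` lines in this PR's diff; the wrapper function and its name are hypothetical and only there to make the snippet easy to run. As background, Linux has historically defaulted to the fork start method, while Windows only supports spawn.

```python
import multiprocessing as mp
from contextlib import suppress


def ensure_spawn_start_method() -> str:
    """Request the "spawn" start method and return the one actually in effect."""
    # set_start_method raises RuntimeError if a start method has already been
    # fixed for this interpreter, so the guard makes the call safe to repeat.
    with suppress(RuntimeError):
        mp.set_start_method("spawn")
    return mp.get_start_method()


print(ensure_spawn_start_method())
# Calling it a second time is harmless: the RuntimeError is swallowed.
print(ensure_spawn_start_method())
```

Note that if something else already chose a different start method, the guard silently keeps that choice, so the returned value is worth logging.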
Another finding is that when using Current one: Former one: Note also the differences between the |
from unstructured.ingest.connector.wikipedia import (
    SimpleWikipediaConfig,
    WikipediaConnector,
)
from unstructured.ingest.doc_processor.generalized import initialize, process_document
from unstructured.ingest.logger import ingest_log_streaming_init, logger

with suppress(RuntimeError):
    mp.set_start_method("spawn")
Big change here and I agree it seems to be a good one.
To confirm, I tested all the other example scripts on both Linux and Mac. 😅
All very nice detailed comments, and thanks for the references @alvarobartt ! ...the prior double logging on Linux 🤦 |
Glad to see this merged by the way. The use of |
What's in this PR?
So as you may see this is a pretty big PR, that basically adds an "adapter" to easily plug in any connector with an available fsspec implementation. This is a way to standardize how the remote filesystems are used within unstructured.
I've additionally renamed s3_connector.py to s3.py for readability and consistency and tested that the current approach works as expected and is aligned with the expectations.