
feat: add FsspecConnector to easily integrate new connectors with a fsspec implementation available #318

Merged (65 commits into Unstructured-IO:main, Mar 10, 2023)

Conversation

@alvarobartt (Contributor) commented Mar 1, 2023:

What's in this PR?

As you may see, this is a pretty big PR: it basically adds an "adapter" to easily plug in any connector with an available fsspec implementation. This is a way to standardize how remote filesystems are used within unstructured.

I've additionally renamed s3_connector.py to s3.py for readability and consistency, and tested that the current approach works as expected and is aligned with the expectations.
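
To make the idea concrete, here is a minimal sketch of what an fsspec-backed fetch can look like (the function and argument names are hypothetical, not the exact classes in this PR):

import fsspec

def fetch_remote_docs(protocol: str, remote_dir: str, local_dir: str, **storage_options):
    # fsspec resolves the concrete implementation (s3fs, gcsfs, adlfs, ...)
    # from the protocol name, so a single code path serves every backend.
    fs = fsspec.filesystem(protocol, **storage_options)
    for remote_path in fs.ls(remote_dir):  # note: may also list sub-directories
        fs.get(remote_path, f"{local_dir}/")  # download for downstream partitioning

# e.g.: fetch_remote_docs("s3", "my-bucket/docs", "/tmp/docs", anon=True)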

@alvarobartt (Contributor Author):

I also wanted to get your input on the best way to integrate this within the CLI. I thought about having separate CLI options per host, e.g. unstructured-ingest s3 --path ... --auth ..., or something more general, unstructured-ingest --path ..., with the connector inferred from the path: if it starts with http/https, check the available connectors with URLs starting like that; otherwise, extract the fsspec protocol and, if that's supported, call one connector or the other.
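
For reference, a rough sketch of the inference option (resolve_url_connector and the FsspecConnector constructor are hypothetical placeholders):

import fsspec
from fsspec.core import split_protocol

def infer_connector(path: str):
    protocol, _ = split_protocol(path)  # "s3://bucket/key" -> ("s3", "bucket/key")
    if protocol in ("http", "https"):
        return resolve_url_connector(path)  # hypothetical URL-based lookup
    # Raises ValueError (unknown protocol) or ImportError (backend not installed).
    fsspec.get_filesystem_class(protocol)
    return FsspecConnector(protocol)  # hypothetical constructor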

@alvarobartt (Contributor Author):

CI doesn't seem to be passing the tests because both s3fs and fsspec are not installed properly as part of the s3 extra, so I assume this has something to do with an issue with the caching that doesn't reinstall everything. Anyway, help is appreciated!

@tomaarsen (Contributor):

This isn't a cache issue. The CI performs the following installs prior to starting the ingest tests:

make install-ingest-s3
make install-ingest-github
make install-ingest-wikipedia

We would expect make install-ingest-s3 to install the (new) required dependencies (s3fs and fsspec), but make install-ingest-s3 is simply this:

unstructured/Makefile, lines 52 to 55 at 95109db:

## install-ingest-s3: install requirements for the s3 connector
.PHONY: install-ingest-s3
install-ingest-s3:
	pip install -r requirements/ingest-s3.txt

And this PR does not update the requirements/ingest-s3.txt file, which still happily installs the old dependency of boto3:

boto3==1.26.80
# via unstructured (setup.py)

And it does not install e.g. s3fs or fsspec. Resolving this issue is as simple as running

pip-compile --extra=s3 --output-file=requirements/ingest-s3.txt requirements/base.txt setup.py

and committing the changes.

  • Tom Aarsen

@alvarobartt (Contributor Author):

True @tomaarsen, I was running those locally by installing the extras.

I'll rerun pip-compile now!

@alvarobartt (Contributor Author):

I also have another doubt, @cragwolfe: for gcs and s3 it seems straightforward to define the protocol name, the names of the connector classes, etc., but for Azure Blob Storage (ABS) it's awkward, because:

  1. It's not usually shortened to ABS, and that's not a well-known acronym
  2. ABS is the acronym, but the fsspec protocol is abfs
  3. Since adlfs supports both Azure Blob Storage and Azure Data Lake Gen1 & 2, should we rename the ABSConnector to AzureConnector and handle both filesystems? Or is that too wide?

I think that we need to discuss the Azure Blob Storage integration further, but that can be done separately in #257

@cragwolfe (Contributor) left a comment:

https://github.com/aio-libs/aiobotocore not being highly maintained gives me pause on making this the default for s3. In the context of ingest multiprocessing fetching one file per process, aiobotocore probably isn't helping much over boto3 either.

Besides that (and additional considerations below), fsspec seems like a really neat library to support in order to quickly get a broad range of sources to connect to.

What about making fsspec the connector, where filesystem could be the first param? I'm not entirely sure how the filesystem-specific extra params would be passed through, but there are ways to solve that. Maybe it's easier if fsspec is a subcommand. I'm also not entirely sure how to handle the dependencies, as they may be different from a "native" implementation, e.g. s3. I mean, there could be a lot of make install-fsspec- targets and setup.py extras, but that's starting to get excessive. Maybe draw the line there and leave it to the user to get the deps right.

@tomaarsen 's prior comments on Azure Blob Storage (ABS) also kind of align here. If there are quirks for other filesystems in fsspec that make them best handled natively, then that option is preserved by just having fsspec remain its own subcommand.

(Resolved review threads on .github/workflows/ci.yml, Makefile, and requirements/ingest-s3.txt.)
@alvarobartt (Contributor Author):

> https://github.com/aio-libs/aiobotocore not being highly maintained gives me pause on making this the default for s3. In the context of ingest multiprocessing fetching one file per process, aiobotocore probably isn't helping much over boto3 either.

So AFAIU s3fs is actively maintained (see https://github.com/fsspec/s3fs) and widely used by popular libraries such as transformers or datasets, since it makes it easy to plug in new remote data sources based on fsspec.
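
For context, s3fs registers itself with fsspec, so the generic code path dispatches to it directly; a quick check (bucket name is a placeholder):

import fsspec

fs = fsspec.filesystem("s3", anon=True)  # dispatches to s3fs.S3FileSystem
print(fs.ls("my-public-bucket"))  # placeholder bucket name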

> Besides that (and additional considerations below), fsspec seems like a really neat library to support in order to quickly get a broad range of sources to connect to.

> What about making fsspec the connector, where filesystem could be the first param? I'm not entirely sure how the filesystem-specific extra params would be passed through, but there are ways to solve that. Maybe it's easier if fsspec is a subcommand. I'm also not entirely sure how to handle the dependencies, as they may be different from a "native" implementation, e.g. s3. I mean, there could be a lot of make install-fsspec- targets and setup.py extras, but that's starting to get excessive. Maybe draw the line there and leave it to the user to get the deps right.

I think the line should be drawn at this point, because support is offered for s3, gcs, and abs, which are the three main cloud storage providers. For the rest, we just offer the abstract connector, so that users can either add their own connector based on fsspec, or we can improve the CLI so that, if the protocol has an available fsspec implementation (registered in fsspec), the user can still use unstructured-ingest over that filesystem while being in charge and control of the specific dependencies that filesystem may need.
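
A minimal sketch of that protocol check against fsspec's registry (the function name is hypothetical):

from fsspec.registry import known_implementations

def validate_protocol(protocol: str) -> None:
    # known_implementations maps protocol names to their fsspec backend classes.
    if protocol not in known_implementations:
        raise ValueError(f"{protocol!r} has no registered fsspec implementation")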

> @tomaarsen 's prior comments on Azure Blob Storage (ABS) also kind of align here. If there are quirks for other filesystems in fsspec that make them best handled natively, then that option is preserved by just having fsspec remain its own subcommand.

Also, the Azure Blob Storage comments above were mine, not @tomaarsen's 😆. But sure, I agree: maybe we can merge this as-is (with just active S3 support) and I'll work on the CLI improvement in a separate PR, to both keep the current functionality and support fast plug-and-play for any other filesystem with an fsspec interface, if that's OK with you?

@cragwolfe (Contributor):

> Also, the Azure Blob Storage comments above were mine, not @tomaarsen's 😆.

my bad!

> maybe we can merge this as-is (with just active S3 support)

I guess I don't see the benefit of immediately replacing the existing connector vs. keeping them side by side (my hesitations being the really old version of boto3, the unneeded async layer from aiobotocore, and weirder stack traces when something goes wrong through extra layers of abstraction), but if the tests pass, sure. We could always revert.

@alvarobartt (Contributor Author):

So the issue seems to be related to multiprocessing and fsspec on Linux, as I'm not able to reproduce it on macOS. I'll try to explore this further and hopefully create another PR on top to solve it, since I think I have a stable solution, but it requires more testing.

But for now I'd say we're good to merge this PR 👍🏻

@tomaarsen (Contributor):

> So the issue seems to be related to multiprocessing and fsspec on Linux, as I'm not able to reproduce it on macOS. I'll try to explore this further and hopefully create another PR on top to solve it, since I think I have a stable solution, but it requires more testing.

Would this interfere with Linux users' ability to use the functionality from this PR?
Even if yes, I suppose it beats not having the functionality for anyone, haha

@cragwolfe (Contributor):

> Would this interfere with Linux users' ability to use the functionality from this PR?
> Even if yes, I suppose it beats not having the functionality for anyone, haha

Hopefully it's only an s3fs issue, but good point. Probably worth a mention in a docstring in _s3.py and fsspec.py for now.

@alvarobartt (Contributor Author):

> So the issue seems to be related to multiprocessing and fsspec on Linux, as I'm not able to reproduce it on macOS. I'll try to explore this further and hopefully create another PR on top to solve it, since I think I have a stable solution, but it requires more testing.

> Would this interfere with Linux users' ability to use the functionality from this PR? Even if yes, I suppose it beats not having the functionality for anyone, haha

I'll explore this further today and come back with either the docstrings update or a solution on that :)

But yes, it seems to be just an s3fs issue as far as I could check; anyway, I'll explore this further today before we merge!

@alvarobartt (Contributor Author):

OK, I've already solved it and made sure it works on macOS, Windows, and Linux, so I'll push the commits now to use _s3.py instead; if the unit tests pass, I'll then remove s3.py and rename _s3.py to s3.py.

I'll give you more details once I make sure the CI is also passing @cragwolfe @tomaarsen

@alvarobartt (Contributor Author):

So the issue seems to occur only on Linux, since there the default multiprocessing start method is fork (which is not safe on macOS), while on both Windows and macOS the default start method is spawn. So, for consistency, if we set the default start method to spawn, s3fs will also work on Linux, as the issue happens because multiprocessing uses fork as the start method.

To check that Windows, Linux, and macOS all support spawn, one can run multiprocessing.get_all_start_methods(): spawn will always be listed there, so the introduced multiprocessing.set_start_method("spawn") won't break anything on any OS.
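
A minimal illustration of that guard:

import multiprocessing

# "spawn" is listed on Linux, macOS, and Windows alike, unlike "fork".
assert "spawn" in multiprocessing.get_all_start_methods()
multiprocessing.set_start_method("spawn")  # raises RuntimeError if already set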

Also, I decided to include that at the top of the file, wrapped in a try-except; as far as I could check, that's the safest way, even though it could also be placed inside the if __name__ == "__main__" block.

I would have posted more references, but I checked a lot of resources and took ideas from here and there while trying to test this on Windows, Linux, and macOS to ensure it works consistently across OS.

@alvarobartt alvarobartt requested review from cragwolfe and removed request for cragwolfe March 9, 2023 15:32
@alvarobartt (Contributor Author) commented Mar 9, 2023:

Another finding: when using fork, the logs were duplicated, as anyone can check in previous runs of the CI when running test_unstructured_ingest/test-ingest-s3.sh. Plus, this implementation is a little bit faster than the previous one (though that may have something to do with latency): the former run of test_unstructured_ingest/test-ingest-s3.sh was taking around 3 minutes in the CI job, while the one implemented here takes around 2.5 minutes.

Current one: [screenshot: CI job timing, ~2:30]

Former one: [screenshot: CI job timing, ~3:00]

Note also the differences between the spawn and fork start methods.

The code under review, with the imports it uses:

import multiprocessing as mp
from contextlib import suppress

from unstructured.ingest.connector.wikipedia import (
    SimpleWikipediaConfig,
    WikipediaConnector,
)
from unstructured.ingest.doc_processor.generalized import initialize, process_document
from unstructured.ingest.logger import ingest_log_streaming_init, logger

# Force the "spawn" start method on every OS; suppress the RuntimeError
# raised if the start method has already been set.
with suppress(RuntimeError):
    mp.set_start_method("spawn")
Review comment (Contributor):

Big change here and I agree it seems to be a good one.

To confirm, I tested all the other example scripts on both Linux and Mac. 😅

@cragwolfe cragwolfe enabled auto-merge (squash) March 10, 2023 05:58
@cragwolfe cragwolfe disabled auto-merge March 10, 2023 05:58
@cragwolfe cragwolfe enabled auto-merge (squash) March 10, 2023 06:01
@cragwolfe (Contributor):

All very nice, detailed comments, and thanks for the references, @alvarobartt!

...the prior double logging on Linux 🤦

@cragwolfe cragwolfe merged commit c51adb2 into Unstructured-IO:main Mar 10, 2023
@tomaarsen (Contributor):

Glad to see this merged by the way. The use of fsspec is quite smart.
