feat: add FsspecConnector to easily integrate new connectors with a fsspec implementation available #318

Merged 65 commits on Mar 10, 2023
f428dae
Add `ingest/connector/fsspec.py`
alvarobartt Mar 1, 2023
0d5ebfe
Improve S3 connector & rename to `s3.py`
alvarobartt Mar 1, 2023
28d5b21
Add `s3fs` and `gcsfs` dependencies
alvarobartt Mar 1, 2023
ea97470
Add `gcs` and `azure` extras placeholders
alvarobartt Mar 1, 2023
8025524
Update `requires_depedencies` params
alvarobartt Mar 1, 2023
6fdf2cb
Add `GCSConnector` #301
alvarobartt Mar 1, 2023
7e742f9
Add `ABSConnector` #257
alvarobartt Mar 1, 2023
044cf2d
Fix to keep previous `output_filename` format
alvarobartt Mar 1, 2023
0aeacf4
Bump version to 0.5.2-dev2 & update `CHANGELOG.md`
alvarobartt Mar 1, 2023
af6ef04
Make `black` and `ruff` smile again
alvarobartt Mar 1, 2023
1d5acc4
Make `ruff` smile (for real)
alvarobartt Mar 1, 2023
97c2fe8
Make `black` smile (for real)
alvarobartt Mar 1, 2023
b784499
Manually solve `flake8` line-length complaint
alvarobartt Mar 1, 2023
b5900eb
Update `requirements/ingest-s3.txt`
alvarobartt Mar 1, 2023
3eb6b91
Add `ingest-azure.txt` and `ingest-gcs.txt`
alvarobartt Mar 1, 2023
f4ca767
Update `Ingest.md`
alvarobartt Mar 1, 2023
6d4f9f8
Update `bricks.rst` to use `s3fs` instead
alvarobartt Mar 1, 2023
8269e0e
Add `install-ingest-` for `azure` and `gcs`
alvarobartt Mar 1, 2023
1cdfd4d
Add note in `ci.yml`
alvarobartt Mar 1, 2023
ee58b52
Format YAML files (just cosmetic)
alvarobartt Mar 1, 2023
d0a9bf1
Add `.ruff_cache/` to `.gitignore`
alvarobartt Mar 1, 2023
b7fcc5b
Fix `abfs` protocol name
alvarobartt Mar 1, 2023
8139970
Merge remote-tracking branch 'upstream/main' into any-fsspec-connector
alvarobartt Mar 2, 2023
0fda5fb
Update `CHANGELOG.md`
alvarobartt Mar 2, 2023
62c13b8
Apply suggestions from code review
alvarobartt Mar 2, 2023
1e3a43a
Merge branch 'main' into any-fsspec-connector
alvarobartt Mar 3, 2023
b5b0507
Rename `ABSConnector` to `AzureBlobStorageConnector`
alvarobartt Mar 3, 2023
59c051f
Add `AzureBlobStorageConnector` in ingest-cli
alvarobartt Mar 3, 2023
ef35067
Add `azure/ingest.sh`
alvarobartt Mar 3, 2023
2840227
Rename `ABS` to `AzureBlobStorage` in `CHANGELOG.md`
alvarobartt Mar 3, 2023
af1fd45
Merge remote-tracking branch 'upstream/main' into any-fsspec-connector
alvarobartt Mar 3, 2023
a8271d7
Merge branch 'any-fsspec-connector' of https://github.com/alvarobartt…
alvarobartt Mar 3, 2023
b25829c
Fix `mypy` type checking
alvarobartt Mar 5, 2023
addc1a9
Merge branch 'main' into any-fsspec-connector
alvarobartt Mar 5, 2023
4fe07e7
Update unstructured/ingest/main.py
alvarobartt Mar 7, 2023
bc8cfa6
Merge remote-tracking branch 'upstream/main' into any-fsspec-connector
alvarobartt Mar 7, 2023
aa9275d
Remove extra stuff from PR
alvarobartt Mar 7, 2023
fd9b279
Update `Ingest.md` as suggested in code review
alvarobartt Mar 7, 2023
5385278
Revert "Format YAML files (just cosmetic)"
alvarobartt Mar 7, 2023
8fcfcbd
Run `make pip-compile`
alvarobartt Mar 7, 2023
c5083fa
Fully revert `ci.yml` changes
alvarobartt Mar 7, 2023
fbea573
Remove unused stuff
alvarobartt Mar 7, 2023
7d41d3f
Align `FsspecConnector` with `S3Connector` logging
alvarobartt Mar 7, 2023
cecebcc
remove verbose
cragwolfe Mar 8, 2023
d3e58b6
Merge branch 'main' into any-fsspec-connector
cragwolfe Mar 8, 2023
d1e62b0
version bump
cragwolfe Mar 8, 2023
475c6e1
Replace `!=` with `-ne` in `test-ingest-s3.sh`
alvarobartt Mar 8, 2023
bedaa9f
Fix `diff` between generated and existing
alvarobartt Mar 8, 2023
0f9919a
Merge branch 'main' into any-fsspec-connector
cragwolfe Mar 8, 2023
d893de6
Remove `GCSConnector`
alvarobartt Mar 8, 2023
9d7aa66
Bring former `S3Connector` back
alvarobartt Mar 8, 2023
7a443dd
Move `S3Connector` with `fsspec` to `_s3.py`
alvarobartt Mar 8, 2023
149c471
Remove `s3fs` usage from `bricks.rst`
alvarobartt Mar 8, 2023
c644efd
Run `make pip-compile`
alvarobartt Mar 8, 2023
11f95d5
Update extra dependencies in `setup.py`
alvarobartt Mar 8, 2023
533fcb5
Update `CHANGELOG.md`
alvarobartt Mar 8, 2023
d76e405
Use former `diff` in `test-ingest-s3.sh`
alvarobartt Mar 9, 2023
fb2d41d
Set `multiprocessing.set_start_method("spawn")`
alvarobartt Mar 9, 2023
9e8d41e
Use `contextlib.suppress` instead of `try-except-pass`
alvarobartt Mar 9, 2023
5786971
Add `ingest-s3-alt.txt` requirements
alvarobartt Mar 9, 2023
e8ce4f4
Merge branch 'main' into any-fsspec-connector
alvarobartt Mar 9, 2023
a9d7dfe
Run `make pip-compile`
alvarobartt Mar 9, 2023
fbdb32d
Use `s3fs` instead of `boto3`
alvarobartt Mar 9, 2023
48e6853
Merge branch 'main' into any-fsspec-connector
cragwolfe Mar 10, 2023
41e8a2f
version
cragwolfe Mar 10, 2023
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -184,3 +184,6 @@ tags
[._]*.un~

.DS_Store

# Ruff cache
.ruff_cache/
8 changes: 7 additions & 1 deletion CHANGELOG.md
@@ -1,7 +1,13 @@
## 0.5.4-dev0
## 0.5.4-dev1

### Enhancements


* Add `FsspecConnector` to easily integrate any existing `fsspec` filesystem as a connector.
* Rename `s3_connector.py` to `s3.py` for readability and consistency with the
rest of the connectors.
* Now `S3Connector` relies on `s3fs` instead of on `boto3`, and it inherits
from `FsspecConnector`.
* Adds an `UNSTRUCTURED_LANGUAGE_CHECKS` environment variable to control whether or not language
specific checks like vocabulary and POS tagging are applied. Set to `"true"` for higher
resolution partitioning and `"false"` for faster processing.
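The idea behind the first changelog entry can be illustrated with a minimal sketch. This is not the `FsspecConnector` added in this PR: `FileSystemLike`, `LocalDirFS`, `SketchConnector`, and `download_all` are hypothetical names. The only assumption about the filesystem object is that it offers `find(path)` and `get_file(rpath, lpath)`, which `fsspec` filesystems provide. A small local stand-in is used so the example runs without `fsspec` installed.

```python
import os
import shutil
import tempfile
from typing import List, Protocol


class FileSystemLike(Protocol):
    """Subset of the fsspec filesystem interface this sketch relies on."""

    def find(self, path: str) -> List[str]: ...

    def get_file(self, rpath: str, lpath: str) -> None: ...


class LocalDirFS:
    """Hypothetical stand-in that mimics `find` and `get_file` on local disk."""

    def find(self, path: str) -> List[str]:
        found = []
        for root, _dirs, files in os.walk(path):
            found.extend(os.path.join(root, name) for name in files)
        return sorted(found)

    def get_file(self, rpath: str, lpath: str) -> None:
        shutil.copyfile(rpath, lpath)


class SketchConnector:
    """Hypothetical connector: lists files under a remote path and downloads them."""

    def __init__(self, fs: FileSystemLike, remote_path: str, download_dir: str):
        self.fs = fs
        self.remote_path = remote_path
        self.download_dir = download_dir

    def download_all(self) -> List[str]:
        os.makedirs(self.download_dir, exist_ok=True)
        local_paths = []
        for rpath in self.fs.find(self.remote_path):
            # Keep the path relative to the remote root so nested files do not collide.
            rel = os.path.relpath(rpath, self.remote_path)
            lpath = os.path.join(self.download_dir, rel)
            os.makedirs(os.path.dirname(lpath), exist_ok=True)
            self.fs.get_file(rpath, lpath)
            local_paths.append(lpath)
        return local_paths


# Usage: copy one file from a temporary "remote" directory to a download directory.
src, dst = tempfile.mkdtemp(), tempfile.mkdtemp()
with open(os.path.join(src, "a.txt"), "w") as f:
    f.write("hello")
downloaded = SketchConnector(LocalDirFS(), src, dst).download_all()
```

Because the connector only depends on two methods, swapping the storage backend would mean passing a different filesystem object; whether a given `fsspec` implementation works unchanged with this sketch is untested here.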
10 changes: 6 additions & 4 deletions Ingest.md
@@ -37,7 +37,9 @@ just execute `unstructured/ingest/main.py`, e.g.:

## Adding Data Connectors

To add a connector, refer to [unstructured/ingest/connector/s3_connector.py](unstructured/ingest/connector/s3_connector.py) as an example that implements the three relevant abstract base classes.
To add a connector, refer to [unstructured/ingest/connector/github.py](unstructured/ingest/connector/github.py) as an example that implements the three relevant abstract base classes.

If the connector has an available `fsspec` implementation, then refer to [unstructured/ingest/connector/s3.py](unstructured/ingest/connector/s3.py).

Then, update [unstructured/ingest/main.py](unstructured/ingest/main.py) to instantiate
the connector specific to your class if its command line options are invoked.
@@ -56,7 +58,7 @@ The `main.py` flags of --re-download/--no-re-download , --download-dir, --preser

In checklist form, the above steps are summarized as:

- [ ] Create a new module under [unstructured/ingest/connector/](unstructured/ingest/connector/) implementing the 3 abstract base classes, similar to [unstructured/ingest/connector/s3_connector.py](unstructured/ingest/connector/s3_connector.py).
- [ ] Create a new module under [unstructured/ingest/connector/](unstructured/ingest/connector/) implementing the 3 abstract base classes, similar to [unstructured/ingest/connector/github.py](unstructured/ingest/connector/github.py).
- [ ] The subclass of `BaseIngestDoc` overrides `process_file()` if extra processing logic is needed other than what is provided by [auto.partition()](unstructured/partition/auto.py).
- [ ] Update [unstructured/ingest/main.py](unstructured/ingest/main.py) with support for the new connector.
- [ ] Create a folder under [examples/ingest](examples/ingest) that includes at least one well documented script.
@@ -67,12 +69,12 @@ In checklist form, the above steps are summarized as:
- [ ] Add them as an extra to [setup.py](unstructured/setup.py).
- [ ] Update the Makefile, adding a target for `install-ingest-<name>` and adding another `pip-compile` line to the `pip-compile` make target. See [this commit](https://github.com/Unstructured-IO/unstructured/commit/ab542ca3c6274f96b431142262d47d727f309e37) for a reference.
- [ ] The added dependencies should be imported at runtime when the new connector is invoked, rather than as top-level imports.
- [ ] Add the decorator `unstructured.utils.requires_dependencies` on top of each class instance or function that uses those connector-specific dependencies e.g. for `S3Connector` should look like `@requires_dependencies(dependencies=["boto3"], extras="s3")`
- [ ] Add the decorator `unstructured.utils.requires_dependencies` on top of each class instance or function that uses those connector-specific dependencies e.g. for `GitHubConnector` should look like `@requires_dependencies(dependencies=["github"], extras="github")`
- [ ] Run `make tidy` and `make check` to ensure linting checks pass.
- [ ] Honors the conventions of `BaseConnectorConfig` defined in [unstructured/ingest/interfaces.py](unstructured/ingest/interfaces.py) which is passed through [the CLI](unstructured/ingest/main.py):
- [ ] If running with an `.output_dir` where structured outputs already exists for a given file, the file content is not re-downloaded from the data source nor is it reprocessed. This is made possible by implementing the call to `MyIngestDoc.has_output()` which is invoked in [MainProcess._filter_docs_with_outputs](ingest-prep-for-many/unstructured/ingest/main.py).
- [ ] Unless `.reprocess` is `True`, then documents are always reprocessed.
- [ ] If `.preserve_download` is `True`, documents downloaded to `.download_dir` are not removed after processing.
- [ ] Else if `.preserve_download` is `False`, documents downloaded to `.download_dir` are removed after they are **successfully** processed during the invocation of `MyIngestDoc.cleanup_file()` in [process_document](unstructured/ingest/doc_processor/generalized.py)
- [ ] Does not re-download documents to `.download_dir` if `.re_download` is False, enforced in `MyIngestDoc.get_file()`
- [ ] Prints more details if `.verbose` similar to [unstructured/ingest/connector/s3_connector.py](unstructured/ingest/connector/s3_connector.py).
- [ ] Prints more details if `--verbose` in ingest CLI, similar to [unstructured/ingest/connector/github.py](unstructured/ingest/connector/github.py) logging messages.
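The checklist items about runtime imports and `requires_dependencies` can be sketched as follows. This is a hypothetical re-implementation for illustration, not the code in `unstructured.utils`; the name `requires_dependencies_sketch`, the error message wording, and the example functions are assumptions.

```python
import functools
import importlib
from typing import Callable, List, Optional


def requires_dependencies_sketch(
    dependencies: List[str], extras: Optional[str] = None
) -> Callable:
    """Hypothetical stand-in for `unstructured.utils.requires_dependencies`."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Check each dependency only when the decorated callable is invoked.
            missing = []
            for name in dependencies:
                try:
                    importlib.import_module(name)
                except ImportError:
                    missing.append(name)
            if missing:
                # The install hint below is illustrative wording, not the real message.
                hint = f' Try: pip install "unstructured[{extras}]"' if extras else ""
                raise ImportError(f"Missing dependencies: {', '.join(missing)}.{hint}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependencies_sketch(dependencies=["json"], extras="example")
def dump(value):
    # Import at call time rather than at module top level, as the checklist asks.
    import json

    return json.dumps(value)


@requires_dependencies_sketch(dependencies=["not_a_real_package_xyz"], extras="example")
def needs_missing():
    return "unreachable"
```

Calling `dump({"a": 1})` succeeds because `json` is importable, while calling `needs_missing()` raises `ImportError` before the function body runs.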
2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -14,7 +14,7 @@ certifi==2022.12.7
# via
# -r requirements/build.in
# requests
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via requests
docutils==0.18.1
# via
6 changes: 3 additions & 3 deletions docs/source/bricks.rst
@@ -1230,12 +1230,12 @@ files to an S3 bucket.

# Upload staged data files to S3 from local output directory.
def upload_staged_files():
    import boto3
    s3 = boto3.client("s3")
    from s3fs import S3FileSystem
    fs = S3FileSystem()
    for filename in os.listdir(LOCAL_OUTPUT_DIRECTORY):
        filepath = os.path.join(LOCAL_OUTPUT_DIRECTORY, filename)
        upload_key = os.path.join(S3_BUCKET_KEY_PREFIX, filename)
        s3.upload_file(filepath, Bucket=S3_BUCKET_NAME, Key=upload_key)
        fs.put_file(lpath=filepath, rpath=os.path.join(S3_BUCKET_NAME, upload_key))

upload_staged_files()
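
One detail of the snippet above is that it builds S3 keys with `os.path.join`, which produces backslashes on Windows. A minimal sketch of building the same local/remote path pairs with `posixpath.join` follows. `build_upload_pairs` is a hypothetical helper, not part of `unstructured` or `s3fs`; the bucket and prefix are placeholders, and the pairs are only computed, not uploaded.

```python
import os
import posixpath
import tempfile
from typing import List, Tuple


def build_upload_pairs(
    local_dir: str, bucket: str, key_prefix: str
) -> List[Tuple[str, str]]:
    """Return (local path, remote path) pairs for every file in `local_dir`."""
    pairs = []
    for filename in sorted(os.listdir(local_dir)):
        lpath = os.path.join(local_dir, filename)
        if not os.path.isfile(lpath):
            continue  # skip subdirectories in this simple sketch
        # Remote paths always use forward slashes, whatever the local OS is.
        rpath = posixpath.join(bucket, key_prefix, filename)
        pairs.append((lpath, rpath))
    return pairs


# Usage with a throwaway directory; "my-bucket" and "staged" are placeholders.
staging_dir = tempfile.mkdtemp()
with open(os.path.join(staging_dir, "doc.json"), "w") as f:
    f.write("{}")
pairs = build_upload_pairs(staging_dir, "my-bucket", "staged")
```

Each pair could then be passed to `fs.put_file(lpath=..., rpath=...)` as in the snippet above.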

18 changes: 13 additions & 5 deletions requirements/base.txt
@@ -6,7 +6,7 @@
#
anyio==3.6.2
# via httpcore
argilla==1.3.1
argilla==1.4.0
# via unstructured (setup.py)
backoff==2.2.1
# via argilla
@@ -16,10 +16,12 @@ certifi==2022.12.7
# httpx
# requests
# unstructured (setup.py)
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via requests
click==8.1.3
# via nltk
commonmark==0.9.1
# via rich
deprecated==1.2.13
# via argilla
et-xmlfile==1.1.0
@@ -66,8 +68,10 @@ pillow==9.4.0
# via
# python-pptx
# unstructured (setup.py)
pydantic==1.10.5
pydantic==1.10.6
# via argilla
pygments==2.14.0
# via rich
python-dateutil==2.8.2
# via pandas
python-docx==0.8.11
@@ -84,19 +88,23 @@ requests==2.28.2
# via unstructured (setup.py)
rfc3986[idna2008]==1.5.0
# via httpx
rich==13.0.1
# via argilla
six==1.16.0
# via python-dateutil
sniffio==1.3.0
# via
# anyio
# httpcore
# httpx
tqdm==4.64.1
tqdm==4.65.0
# via
# argilla
# nltk
typing-extensions==4.5.0
# via pydantic
# via
# pydantic
# rich
urllib3==1.26.14
# via requests
wrapt==1.14.1
2 changes: 1 addition & 1 deletion requirements/build.txt
@@ -14,7 +14,7 @@ certifi==2022.12.7
# via
# -r requirements/build.in
# requests
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via requests
docutils==0.18.1
# via
14 changes: 7 additions & 7 deletions requirements/dev.txt
@@ -55,7 +55,7 @@ filelock==3.9.0
# via virtualenv
fqdn==1.5.1
# via jsonschema
identify==2.5.18
identify==2.5.19
# via pre-commit
idna==3.4
# via
@@ -67,7 +67,7 @@ importlib-metadata==6.0.0
# nbconvert
importlib-resources==5.12.0
# via jsonschema
ipykernel==6.21.2
ipykernel==6.21.3
# via
# ipywidgets
# jupyter
@@ -115,7 +115,7 @@ jupyter-client==8.0.3
# nbclient
# notebook
# qtconsole
jupyter-console==6.6.2
jupyter-console==6.6.3
# via jupyter
jupyter-core==5.2.0
# via
@@ -132,7 +132,7 @@ jupyter-core==5.2.0
# qtconsole
jupyter-events==0.6.3
# via jupyter-server
jupyter-server==2.3.0
jupyter-server==2.4.0
# via
# nbclassic
# notebook-shim
@@ -152,7 +152,7 @@ matplotlib-inline==0.1.6
# ipython
mistune==2.0.5
# via nbconvert
nbclassic==0.5.2
nbclassic==0.5.3
# via notebook
nbclient==0.7.2
# via nbconvert
@@ -176,7 +176,7 @@ nest-asyncio==1.5.6
# notebook
nodeenv==1.7.0
# via pre-commit
notebook==6.5.2
notebook==6.5.3
# via jupyter
notebook-shim==0.2.2
# via nbclassic
@@ -199,7 +199,7 @@ pip-tools==6.12.3
# via -r requirements/dev.in
pkgutil-resolve-name==1.3.10
# via jsonschema
platformdirs==3.0.0
platformdirs==3.1.0
# via
# jupyter-core
# virtualenv
17 changes: 12 additions & 5 deletions requirements/huggingface.txt
@@ -6,7 +6,7 @@
#
anyio==3.6.2
# via httpcore
argilla==1.3.1
argilla==1.4.0
# via unstructured (setup.py)
backoff==2.2.1
# via argilla
@@ -16,12 +16,14 @@ certifi==2022.12.7
# httpx
# requests
# unstructured (setup.py)
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via requests
click==8.1.3
# via
# nltk
# sacremoses
commonmark==0.9.1
# via rich
deprecated==1.2.13
# via argilla
et-xmlfile==1.1.0
@@ -36,7 +38,7 @@ httpcore==0.16.3
# via httpx
httpx==0.23.3
# via argilla
huggingface-hub==0.12.1
huggingface-hub==0.13.1
# via transformers
idna==3.4
# via
@@ -82,8 +84,10 @@ pillow==9.4.0
# via
# python-pptx
# unstructured (setup.py)
pydantic==1.10.5
pydantic==1.10.6
# via argilla
pygments==2.14.0
# via rich
python-dateutil==2.8.2
# via pandas
python-docx==0.8.11
@@ -110,6 +114,8 @@ requests==2.28.2
# unstructured (setup.py)
rfc3986[idna2008]==1.5.0
# via httpx
rich==13.0.1
# via argilla
sacremoses==0.0.53
# via unstructured (setup.py)
sentencepiece==0.1.97
@@ -128,7 +134,7 @@ tokenizers==0.13.2
# via transformers
torch==1.13.1
# via unstructured (setup.py)
tqdm==4.64.1
tqdm==4.65.0
# via
# argilla
# huggingface-hub
@@ -141,6 +147,7 @@ typing-extensions==4.5.0
# via
# huggingface-hub
# pydantic
# rich
# torch
urllib3==1.26.14
# via requests
21 changes: 17 additions & 4 deletions requirements/ingest-github.txt
@@ -8,7 +8,7 @@ anyio==3.6.2
# via
# -r requirements/base.txt
# httpcore
argilla==1.3.1
argilla==1.4.0
# via
# -r requirements/base.txt
# unstructured (setup.py)
@@ -25,14 +25,18 @@ certifi==2022.12.7
# unstructured (setup.py)
cffi==1.15.1
# via pynacl
charset-normalizer==3.0.1
charset-normalizer==3.1.0
# via
# -r requirements/base.txt
# requests
click==8.1.3
# via
# -r requirements/base.txt
# nltk
commonmark==0.9.1
# via
# -r requirements/base.txt
# rich
deprecated==1.2.13
# via
# -r requirements/base.txt
@@ -111,12 +115,16 @@ pillow==9.4.0
# unstructured (setup.py)
pycparser==2.21
# via cffi
pydantic==1.10.5
pydantic==1.10.6
# via
# -r requirements/base.txt
# argilla
pygithub==1.57.0
# via unstructured (setup.py)
pygments==2.14.0
# via
# -r requirements/base.txt
# rich
pyjwt==2.6.0
# via pygithub
pynacl==1.5.0
@@ -154,6 +162,10 @@ rfc3986[idna2008]==1.5.0
# via
# -r requirements/base.txt
# httpx
rich==13.0.1
# via
# -r requirements/base.txt
# argilla
six==1.16.0
# via
# -r requirements/base.txt
@@ -164,7 +176,7 @@ sniffio==1.3.0
# anyio
# httpcore
# httpx
tqdm==4.64.1
tqdm==4.65.0
# via
# -r requirements/base.txt
# argilla
@@ -173,6 +185,7 @@ typing-extensions==4.5.0
# via
# -r requirements/base.txt
# pydantic
# rich
urllib3==1.26.14
# via
# -r requirements/base.txt