
feat: add FsspecConnector to easily integrate new connectors with a fsspec implementation available #318

Merged: 65 commits, Mar 10, 2023
Changes from 22 commits
Commits (65)
f428dae
Add `ingest/connector/fsspec.py`
alvarobartt Mar 1, 2023
0d5ebfe
Improve S3 connector & rename to `s3.py`
alvarobartt Mar 1, 2023
28d5b21
Add `s3fs` and `gcsfs` dependencies
alvarobartt Mar 1, 2023
ea97470
Add `gcs` and `azure` extras placeholders
alvarobartt Mar 1, 2023
8025524
Update `requires_depedencies` params
alvarobartt Mar 1, 2023
6fdf2cb
Add `GCSConnector` #301
alvarobartt Mar 1, 2023
7e742f9
Add `ABSConnector` #257
alvarobartt Mar 1, 2023
044cf2d
Fix to keep previous `output_filename` format
alvarobartt Mar 1, 2023
0aeacf4
Bump version to 0.5.2-dev2 & update `CHANGELOG.md`
alvarobartt Mar 1, 2023
af6ef04
Make `black` and `ruff` smile again
alvarobartt Mar 1, 2023
1d5acc4
Make `ruff` smile (for real)
alvarobartt Mar 1, 2023
97c2fe8
Make `black` smile (for real)
alvarobartt Mar 1, 2023
b784499
Manually solve `flake8` line-length complaint
alvarobartt Mar 1, 2023
b5900eb
Update `requirements/ingest-s3.txt`
alvarobartt Mar 1, 2023
3eb6b91
Add `ingest-azure.txt` and `ingest-gcs.txt`
alvarobartt Mar 1, 2023
f4ca767
Update `Ingest.md`
alvarobartt Mar 1, 2023
6d4f9f8
Update `bricks.rst` to use `s3fs` instead
alvarobartt Mar 1, 2023
8269e0e
Add `install-ingest-` for `azure` and `gcs`
alvarobartt Mar 1, 2023
1cdfd4d
Add note in `ci.yml`
alvarobartt Mar 1, 2023
ee58b52
Format YAML files (just cosmetic)
alvarobartt Mar 1, 2023
d0a9bf1
Add `.ruff_cache/` to `.gitignore`
alvarobartt Mar 1, 2023
b7fcc5b
Fix `abfs` protocol name
alvarobartt Mar 1, 2023
8139970
Merge remote-tracking branch 'upstream/main' into any-fsspec-connector
alvarobartt Mar 2, 2023
0fda5fb
Update `CHANGELOG.md`
alvarobartt Mar 2, 2023
62c13b8
Apply suggestions from code review
alvarobartt Mar 2, 2023
1e3a43a
Merge branch 'main' into any-fsspec-connector
alvarobartt Mar 3, 2023
b5b0507
Rename `ABSConnector` to `AzureBlobStorageConnector`
alvarobartt Mar 3, 2023
59c051f
Add `AzureBlobStorageConnector` in ingest-cli
alvarobartt Mar 3, 2023
ef35067
Add `azure/ingest.sh`
alvarobartt Mar 3, 2023
2840227
Rename `ABS` to `AzureBlobStorage` in `CHANGELOG.md`
alvarobartt Mar 3, 2023
af1fd45
Merge remote-tracking branch 'upstream/main' into any-fsspec-connector
alvarobartt Mar 3, 2023
a8271d7
Merge branch 'any-fsspec-connector' of https://github.com/alvarobartt…
alvarobartt Mar 3, 2023
b25829c
Fix `mypy` type checking
alvarobartt Mar 5, 2023
addc1a9
Merge branch 'main' into any-fsspec-connector
alvarobartt Mar 5, 2023
4fe07e7
Update unstructured/ingest/main.py
alvarobartt Mar 7, 2023
bc8cfa6
Merge remote-tracking branch 'upstream/main' into any-fsspec-connector
alvarobartt Mar 7, 2023
aa9275d
Remove extra stuff from PR
alvarobartt Mar 7, 2023
fd9b279
Update `Ingest.md` as suggested in code review
alvarobartt Mar 7, 2023
5385278
Revert "Format YAML files (just cosmetic)"
alvarobartt Mar 7, 2023
8fcfcbd
Run `make pip-compile`
alvarobartt Mar 7, 2023
c5083fa
Fully revert `ci.yml` changes
alvarobartt Mar 7, 2023
fbea573
Remove unused stuff
alvarobartt Mar 7, 2023
7d41d3f
Align `FsspecConnector` with `S3Connector` logging
alvarobartt Mar 7, 2023
cecebcc
remove verbose
cragwolfe Mar 8, 2023
d3e58b6
Merge branch 'main' into any-fsspec-connector
cragwolfe Mar 8, 2023
d1e62b0
version bump
cragwolfe Mar 8, 2023
475c6e1
Replace `!=` with `-ne` in `test-ingest-s3.sh`
alvarobartt Mar 8, 2023
bedaa9f
Fix `diff` between generated and existing
alvarobartt Mar 8, 2023
0f9919a
Merge branch 'main' into any-fsspec-connector
cragwolfe Mar 8, 2023
d893de6
Remove `GCSConnector`
alvarobartt Mar 8, 2023
9d7aa66
Bring former `S3Connector` back
alvarobartt Mar 8, 2023
7a443dd
Move `S3Connector` with `fsspec` to `_s3.py`
alvarobartt Mar 8, 2023
149c471
Remove `s3fs` usage from `bricks.rst`
alvarobartt Mar 8, 2023
c644efd
Run `make pip-compile`
alvarobartt Mar 8, 2023
11f95d5
Update extra dependencies in `setup.py`
alvarobartt Mar 8, 2023
533fcb5
Update `CHANGELOG.md`
alvarobartt Mar 8, 2023
d76e405
Use former `diff` in `test-ingest-s3.sh`
alvarobartt Mar 9, 2023
fb2d41d
Set `multiprocessing.set_start_method("spawn")`
alvarobartt Mar 9, 2023
9e8d41e
Use `contextlib.suppress` instead of `try-except-pass`
alvarobartt Mar 9, 2023
5786971
Add `ingest-s3-alt.txt` requirements
alvarobartt Mar 9, 2023
e8ce4f4
Merge branch 'main' into any-fsspec-connector
alvarobartt Mar 9, 2023
a9d7dfe
Run `make pip-compile`
alvarobartt Mar 9, 2023
fbdb32d
Use `s3fs` instead of `boto3`
alvarobartt Mar 9, 2023
48e6853
Merge branch 'main' into any-fsspec-connector
cragwolfe Mar 10, 2023
41e8a2f
version
cragwolfe Mar 10, 2023
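
Two of the later commits are easy to skim past but change runtime behavior: fb2d41d sets the multiprocessing start method to "spawn" before the ingest worker pool is created, and 9e8d41e replaces a `try-except-pass` block with `contextlib.suppress`. Below is a minimal, self-contained sketch of both patterns; the `fetch_document` helper and the suppressed exception type are illustrative placeholders, not code from this PR.

```python
import contextlib
import multiprocessing


def fetch_document(remote_path: str) -> None:
    """Hypothetical stand-in for a connector's per-document download."""
    print(f"downloading {remote_path}")


def process_batch(paths: list) -> None:
    # contextlib.suppress is the tidier equivalent of the
    # `try: ... except FileNotFoundError: pass` pattern flagged in review.
    for path in paths:
        with contextlib.suppress(FileNotFoundError):
            fetch_document(path)


if __name__ == "__main__":
    # Pinning "spawn" avoids fork-inherited state (e.g. open filesystem handles)
    # leaking into worker processes, which can misbehave with some fsspec backends.
    multiprocessing.set_start_method("spawn")
    with multiprocessing.Pool(processes=2) as pool:
        pool.map(process_batch, [["a.pdf"], ["b.pdf"]])
```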
166 changes: 84 additions & 82 deletions .github/workflows/ci.yml
@@ -5,68 +5,67 @@ on:
# We can switch to running on push if we make this repo public or are fine with
# paying for CI minutes.
push:
branches: [ main ]
branches: [main]
pull_request:
branches: [ main ]

branches: [main]

jobs:
setup:
strategy:
matrix:
python-version: ["3.8","3.9","3.10"]
python-version: ["3.8", "3.9", "3.10"]
runs-on: ubuntu-latest
env:
NLTK_DATA: ${{ github.workspace }}/nltk_data
steps:
- uses: actions/checkout@v3
- uses: actions/cache@v3
id: virtualenv-cache
with:
path: |
.venv
nltk_data
key: unstructured-${{ runner.os }}-${{ matrix.python-version }}-${{ hashFiles('requirements/*.txt') }}
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Setup virtual environment (no cache hit)
if: steps.virtualenv-cache.outputs.cache-hit != 'true'
run: |
python${{ matrix.python-version }} -m venv .venv
source .venv/bin/activate
make install-ci
- uses: actions/checkout@v3
- uses: actions/cache@v3
id: virtualenv-cache
with:
path: |
.venv
nltk_data
key: unstructured-${{ runner.os }}-${{ matrix.python-version }}-${{ hashFiles('requirements/*.txt') }}
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Setup virtual environment (no cache hit)
if: steps.virtualenv-cache.outputs.cache-hit != 'true'
run: |
python${{ matrix.python-version }} -m venv .venv
source .venv/bin/activate
make install-ci

lint:
strategy:
matrix:
python-version: ["3.8","3.9","3.10"]
python-version: ["3.8", "3.9", "3.10"]
runs-on: ubuntu-latest
needs: setup
steps:
- uses: actions/checkout@v3
- uses: actions/cache@v3
id: virtualenv-cache
with:
path: .venv
key: unstructured-${{ runner.os }}-${{ matrix.python-version }}-${{ hashFiles('requirements/*.txt') }}
# NOTE(robinson) - This is a fallback in case the lint job does not find the cache.
# We can take this out when we implement the fix in CORE-99
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Setup virtual environment (no cache hit)
if: steps.virtualenv-cache.outputs.cache-hit != 'true'
run: |
python${{ matrix.python-version }} -m venv .venv
source .venv/bin/activate
make install-ci
- name: Lint
run: |
source .venv/bin/activate
make check
- uses: actions/checkout@v3
- uses: actions/cache@v3
id: virtualenv-cache
with:
path: .venv
key: unstructured-${{ runner.os }}-${{ matrix.python-version }}-${{ hashFiles('requirements/*.txt') }}
# NOTE(robinson) - This is a fallback in case the lint job does not find the cache.
# We can take this out when we implement the fix in CORE-99
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Setup virtual environment (no cache hit)
if: steps.virtualenv-cache.outputs.cache-hit != 'true'
run: |
python${{ matrix.python-version }} -m venv .venv
source .venv/bin/activate
make install-ci
- name: Lint
run: |
source .venv/bin/activate
make check

shellcheck:
runs-on: ubuntu-latest
@@ -78,50 +77,53 @@ jobs:
test:
strategy:
matrix:
python-version: ["3.8","3.9","3.10"]
python-version: ["3.8", "3.9", "3.10"]
runs-on: ubuntu-latest
env:
NLTK_DATA: ${{ github.workspace }}/nltk_data
needs: [setup, lint]
steps:
- uses: actions/checkout@v3
- uses: actions/cache@v3
id: virtualenv-cache
with:
path: |
.venv
nltk_data
key: unstructured-${{ runner.os }}-${{ matrix.python-version }}-${{ hashFiles('requirements/*.txt') }}
# NOTE(robinson) - This is a fallback in case the lint job does not find the cache.
# We can take this out when we implement the fix in CORE-99
- name: Setup virtual environment (no cache hit)
if: steps.virtualenv-cache.outputs.cache-hit != 'true'
run: |
python${{ matrix.python-version}} -m venv .venv
source .venv/bin/activate
make install-ci
- name: Test
run: |
source .venv/bin/activate
make install-detectron2
sudo apt-get install -y libmagic-dev poppler-utils tesseract-ocr libreoffice
make test
make check-coverage
make install-ingest-s3
make install-ingest-github
make install-ingest-wikipedia
./test_unstructured_ingest/test-ingest.sh
- uses: actions/checkout@v3
- uses: actions/cache@v3
id: virtualenv-cache
with:
path: |
.venv
nltk_data
key: unstructured-${{ runner.os }}-${{ matrix.python-version }}-${{ hashFiles('requirements/*.txt') }}
# NOTE(robinson) - This is a fallback in case the lint job does not find the cache.
# We can take this out when we implement the fix in CORE-99
- name: Setup virtual environment (no cache hit)
if: steps.virtualenv-cache.outputs.cache-hit != 'true'
run: |
python${{ matrix.python-version}} -m venv .venv
source .venv/bin/activate
make install-ci
# NOTE: we need to add `make install-ingest-...` for each connector, and some are currently missing
# because the tests have not been created e.g. gcs or azure (abs), once a public remote path is available
# for testing purposes, the tests can be developed and triggered as part of the CI.
- name: Test
run: |
source .venv/bin/activate
make install-detectron2
sudo apt-get install -y libmagic-dev poppler-utils tesseract-ocr libreoffice
make test
make check-coverage
make install-ingest-s3
make install-ingest-github
make install-ingest-wikipedia
./test_unstructured_ingest/test-ingest.sh

changelog:
runs-on: ubuntu-latest
steps:
- if: github.ref != 'refs/heads/main'
uses: dorny/paths-filter@v2
id: changes
with:
filters: |
src:
- 'unstructured/**'
- if: github.ref != 'refs/heads/main'
uses: dorny/paths-filter@v2
id: changes
with:
filters: |
src:
- 'unstructured/**'

- if: steps.changes.outputs.src == 'true' && github.ref != 'refs/heads/main'
uses: dangoslen/changelog-enforcer@v3
- if: steps.changes.outputs.src == 'true' && github.ref != 'refs/heads/main'
uses: dangoslen/changelog-enforcer@v3
65 changes: 32 additions & 33 deletions .github/workflows/codeql-analysis.yml
@@ -13,12 +13,12 @@ name: "CodeQL"

on:
push:
branches: [ "main" ]
branches: ["main"]
pull_request:
# The branches below must be a subset of the branches above
branches: [ "main" ]
branches: ["main"]
schedule:
- cron: '21 21 * * 5'
- cron: "21 21 * * 5"

jobs:
analyze:
@@ -32,43 +32,42 @@ jobs:
strategy:
fail-fast: false
matrix:
language: [ 'python' ]
language: ["python"]
# CodeQL supports [ 'cpp', 'csharp', 'go', 'java', 'javascript', 'python', 'ruby' ]
# Learn more about CodeQL language support at https://aka.ms/codeql-docs/language-support

steps:
- name: Checkout repository
uses: actions/checkout@v3
- name: Checkout repository
uses: actions/checkout@v3

# Initializes the CodeQL tools for scanning.
- name: Initialize CodeQL
uses: github/codeql-action/init@v2
with:
languages: ${{ matrix.language }}
# If you wish to specify custom queries, you can do so here or in a config file.
# By default, queries listed here will override any specified in a config file.
# Prefix the list here with "+" to use these queries and those in the config file.

# Details on CodeQL's query packs refer to : https://docs.github.com/en/code-security/code-scanning/automatically-scanning-your-code-for-vulnerabilities-and-errors/configuring-code-scanning#using-queries-in-ql-packs
# queries: security-extended,security-and-quality
# Initializes the CodeQL tools for scanning.
- name: Initialize CodeQL
uses: github/codeql-action/init@v2
with:
languages: ${{ matrix.language }}
# If you wish to specify custom queries, you can do so here or in a config file.
# By default, queries listed here will override any specified in a config file.
# Prefix the list here with "+" to use these queries and those in the config file.


# Autobuild attempts to build any compiled languages (C/C++, C#, or Java).
# If this step fails, then you should remove it and run the build manually (see below)
- name: Autobuild
uses: github/codeql-action/autobuild@v2
# Details on CodeQL's query packs refer to : https://docs.github.com/en/code-security/code-scanning/automatically-scanning-your-code-for-vulnerabilities-and-errors/configuring-code-scanning#using-queries-in-ql-packs
# queries: security-extended,security-and-quality

# ℹ️ Command-line programs to run using the OS shell.
# 📚 See https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsrun
# Autobuild attempts to build any compiled languages (C/C++, C#, or Java).
# If this step fails, then you should remove it and run the build manually (see below)
- name: Autobuild
uses: github/codeql-action/autobuild@v2

# If the Autobuild fails above, remove it and uncomment the following three lines.
# modify them (or add more) to build your code if your project, please refer to the EXAMPLE below for guidance.
# ℹ️ Command-line programs to run using the OS shell.
# 📚 See https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsrun

# - run: |
# echo "Run, Build Application using script"
# ./location_of_script_within_repo/buildscript.sh
# If the Autobuild fails above, remove it and uncomment the following three lines.
# modify them (or add more) to build your code if your project, please refer to the EXAMPLE below for guidance.

- name: Perform CodeQL Analysis
uses: github/codeql-action/analyze@v2
with:
category: "/language:${{matrix.language}}"
# - run: |
# echo "Run, Build Application using script"
# ./location_of_script_within_repo/buildscript.sh

- name: Perform CodeQL Analysis
uses: github/codeql-action/analyze@v2
with:
category: "/language:${{matrix.language}}"
30 changes: 15 additions & 15 deletions .github/workflows/sphinx.yml
@@ -5,23 +5,23 @@ name: Sphinx build

on:
push:
branches: [ main ]
branches: [main]

jobs:
docs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Build HTML
uses: ammaraskar/sphinx-action@e781e9af3e80bfe0ea539e4ea46858d51e027214
- name: Upload artifacts
uses: actions/upload-artifact@v3
with:
name: html-docs
path: docs/build/html/
- name: Deploy
uses: peaceiris/actions-gh-pages@v3
if: github.ref == 'refs/heads/main'
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
publish_dir: docs/build/html
- uses: actions/checkout@v3
- name: Build HTML
uses: ammaraskar/sphinx-action@e781e9af3e80bfe0ea539e4ea46858d51e027214
- name: Upload artifacts
uses: actions/upload-artifact@v3
with:
name: html-docs
path: docs/build/html/
- name: Deploy
uses: peaceiris/actions-gh-pages@v3
if: github.ref == 'refs/heads/main'
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
publish_dir: docs/build/html
3 changes: 3 additions & 0 deletions .gitignore
@@ -182,3 +182,6 @@ tags
[._]*.un~

.DS_Store

# Ruff cache
.ruff_cache/
13 changes: 12 additions & 1 deletion CHANGELOG.md
@@ -1,9 +1,20 @@
## 0.5.2-dev1
## 0.5.2-dev2

### Enhancements

* Add `FsspecConnector` to easily integrate any existing `fsspec` filesystem
as a connector.
* Rename `s3_connector.py` to `s3.py` for readability and consistency with the
rest of the connectors.

### Features

* Update `S3Connector` to inherit from `FsspecConnector`
* Add `GCSConnector` (missing CLI integration and working example but has been tested
with a private bucket with a PDF and works as expected)
* Add `ABSConnector` (missing CLI integration and working example but has been tested
with a private container with a PDF and works as expected)

### Fixes

## 0.5.1
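
The changelog entries above capture the core idea of this PR: a generic `FsspecConnector` (added in `ingest/connector/fsspec.py`) that filesystem-backed connectors such as S3, GCS, and Azure Blob Storage can build on instead of each wrapping its own SDK. The sketch below only illustrates that shape; the class and field names (`SimpleFsspecConfig`, `access_kwargs`, `list_files`) are assumptions for illustration and do not reproduce the interface actually merged.

```python
# Illustrative sketch of an fsspec-backed connector; names are assumed, not taken from the PR.
from dataclasses import dataclass, field

import fsspec  # the protocol's backend (s3fs, gcsfs, adlfs, ...) must also be installed


@dataclass
class SimpleFsspecConfig:
    remote_url: str  # e.g. "s3://my-bucket/docs"
    access_kwargs: dict = field(default_factory=dict)

    @property
    def protocol(self) -> str:
        return self.remote_url.split("://", 1)[0]


class FsspecConnector:
    """Lists and fetches documents from any fsspec-compatible filesystem."""

    def __init__(self, config: SimpleFsspecConfig):
        self.config = config
        self.fs = fsspec.get_filesystem_class(config.protocol)(**config.access_kwargs)

    def list_files(self) -> list:
        # fsspec paths drop the protocol prefix.
        path = self.config.remote_url.split("://", 1)[1]
        return self.fs.find(path)


class S3Connector(FsspecConnector):
    """With s3fs installed, S3 support needs little more than the protocol mapping."""


# Usage sketch (placeholder bucket, anonymous access):
# connector = S3Connector(SimpleFsspecConfig("s3://my-public-bucket/docs", {"anon": True}))
# print(connector.list_files())
```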
2 changes: 1 addition & 1 deletion Ingest.md
@@ -67,7 +67,7 @@ In checklist form, the above steps are summarized as:
- [ ] Add them as an extra to [setup.py](unstructured/setup.py).
- [ ] Update the Makefile, adding a target for `install-ingest-<name>` and adding another `pip-compile` line to the `pip-compile` make target. See [this commit](https://github.com/Unstructured-IO/unstructured/commit/ab542ca3c6274f96b431142262d47d727f309e37) for a reference.
- [ ] The added dependencies should be imported at runtime when the new connector is invoked, rather than as top-level imports.
- [ ] Add the decorator `unstructured.utils.requires_dependencies` on top of each class instance or function that uses those connector-specific dependencies e.g. for `S3Connector` should look like `@requires_dependencies(dependencies=["boto3"], extras="s3")`
- [ ] Add the decorator `unstructured.utils.requires_dependencies` on top of each class instance or function that uses those connector-specific dependencies e.g. for `S3Connector` should look like `@requires_dependencies(dependencies=["s3fs", "fsspec"], extras="s3")`
- [ ] Honors the conventions of `BaseConnectorConfig` defined in [unstructured/ingest/interfaces.py](unstructured/ingest/interfaces.py) which is passed through [the CLI](unstructured/ingest/main.py):
- [ ] If running with an `.output_dir` where structured outputs already exists for a given file, the file content is not re-downloaded from the data source nor is it reprocessed. This is made possible by implementing the call to `MyIngestDoc.has_output()` which is invoked in [MainProcess._filter_docs_with_outputs](ingest-prep-for-many/unstructured/ingest/main.py).
- [ ] Unless `.reprocess` is `True`, then documents are always reprocessed.
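
The one-line change above swaps the documented decorator example from `boto3` to the `s3fs`/`fsspec` pair, since the reworked `S3Connector` goes through fsspec rather than calling `boto3` directly. A short sketch of how that convention is typically applied is below; the class body is a placeholder, and only the decorator call and the import paths follow what the checklist documents.

```python
from unstructured.ingest.interfaces import BaseConnectorConfig
from unstructured.utils import requires_dependencies


class S3Connector:
    """Placeholder connector; the decorator usage is the point of this sketch."""

    def __init__(self, config: BaseConnectorConfig):
        self.config = config

    @requires_dependencies(dependencies=["s3fs", "fsspec"], extras="s3")
    def initialize(self):
        # The connector-specific dependency is imported at runtime, inside the
        # guarded method, so users who never touch S3 do not need s3fs installed.
        import s3fs

        self.fs = s3fs.S3FileSystem(anon=True)
```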
11 changes: 10 additions & 1 deletion Makefile
@@ -49,11 +49,18 @@ install-dev:
install-build:
pip install -r requirements/build.txt

## install-ingest-s3: install requirements for the s3 connector
.PHONY: install-ingest-s3
install-ingest-s3:
pip install -r requirements/ingest-s3.txt

.PHONY: install-ingest-gcs
install-ingest-gcs:
pip install -r requirements/ingest-gcs.txt

.PHONY: install-ingest-azure
install-ingest-azure:
pip install -r requirements/ingest-azure.txt

.PHONY: install-ingest-github
install-ingest-github:
pip install -r requirements/ingest-github.txt
@@ -95,6 +102,8 @@ pip-compile:
# sphinx docs looks for additional requirements
cp requirements/build.txt docs/requirements.txt
pip-compile --upgrade --extra=s3 --output-file=requirements/ingest-s3.txt requirements/base.txt setup.py
pip-compile --upgrade --extra=gcs --output-file=requirements/ingest-gcs.txt requirements/base.txt setup.py
pip-compile --upgrade --extra=azure --output-file=requirements/ingest-azure.txt requirements/base.txt setup.py
pip-compile --upgrade --extra=reddit --output-file=requirements/ingest-reddit.txt requirements/base.txt setup.py
pip-compile --upgrade --extra=github --output-file=requirements/ingest-github.txt requirements/base.txt setup.py
pip-compile --upgrade --extra=wikipedia --output-file=requirements/ingest-wikipedia.txt requirements/base.txt setup.py
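
The new `pip-compile` lines above assume matching `gcs` and `azure` extras exist in `setup.py` (commit 11f95d5 updates the extra dependencies there). A hedged sketch of what those extras entries could look like follows; the exact packages and pins in the PR may differ, and `adlfs` is assumed here as the Azure fsspec implementation behind the `abfs` protocol.

```python
# setup.py (excerpt) -- illustrative only, not the file as merged.
from setuptools import find_packages, setup

setup(
    name="unstructured",
    packages=find_packages(),
    extras_require={
        # One extra per fsspec-backed connector, so users install only the
        # filesystem implementation they actually need.
        "s3": ["fsspec", "s3fs"],
        "gcs": ["fsspec", "gcsfs"],
        "azure": ["fsspec", "adlfs"],
    },
)
```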