Connector for Google Drive #294

Merged: 18 commits, Mar 7, 2023
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,4 @@
-## 0.5.3-dev1
+## 0.5.3-dev2

### Enhancements

@@ -7,6 +7,7 @@
* Add `--wikipedia-auto-suggest` argument to the ingest CLI to disable automatic redirection
to pages with similar names.
* Add optional `encoding` argument to the `partition_(text/email/html)` functions.
* Add Google Drive connector to the ingest CLI.

### Fixes

1 change: 1 addition & 0 deletions Ingest.md
@@ -68,6 +68,7 @@ In checklist form, the above steps are summarized as:
- [ ] Update the Makefile, adding a target for `install-ingest-<name>` and adding another `pip-compile` line to the `pip-compile` make target. See [this commit](https://github.com/Unstructured-IO/unstructured/commit/ab542ca3c6274f96b431142262d47d727f309e37) for a reference.
- [ ] The added dependencies should be imported at runtime when the new connector is invoked, rather than as top-level imports.
- [ ] Add the decorator `unstructured.utils.requires_dependencies` to each class or function that uses those connector-specific dependencies. For example, for `S3Connector` it should look like `@requires_dependencies(dependencies=["boto3"], extras="s3")`.
- [ ] Run `make tidy` and `make check` to ensure linting checks pass.
- [ ] Honors the conventions of `BaseConnectorConfig` defined in [unstructured/ingest/interfaces.py](unstructured/ingest/interfaces.py) which is passed through [the CLI](unstructured/ingest/main.py):
- [ ] If running with an `.output_dir` where structured outputs already exist for a given file, the file content is neither re-downloaded from the data source nor reprocessed. This is made possible by implementing `MyIngestDoc.has_output()`, which is invoked in [MainProcess._filter_docs_with_outputs](ingest-prep-for-many/unstructured/ingest/main.py).
- [ ] If `.reprocess` is `True`, documents are always reprocessed, even when outputs already exist.
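The decorator item in the checklist above can be illustrated with a minimal sketch. This is not the actual `unstructured.utils.requires_dependencies` implementation, which is not shown in this diff. It is a hypothetical stand-in with the same call shape, and the two decorated functions and the package name `surely_not_installed_pkg_xyz` are invented for the example.

```python
import importlib
from functools import wraps
from typing import Callable, List, Optional


def requires_dependencies(dependencies: List[str], extras: Optional[str] = None) -> Callable:
    """Illustrative stand-in (an assumption, not the library's real code)."""

    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Check at call time that every required package can be imported.
            missing = []
            for name in dependencies:
                try:
                    importlib.import_module(name)
                except ImportError:
                    missing.append(name)
            if missing:
                hint = f' Try: pip install "unstructured[{extras}]"' if extras else ""
                raise ImportError(f"Missing dependencies: {', '.join(missing)}.{hint}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependencies(dependencies=["json"], extras="example")
def load_with_json(text: str) -> dict:
    # The dependency is imported at runtime, not at module top level.
    import json

    return json.loads(text)


@requires_dependencies(dependencies=["surely_not_installed_pkg_xyz"], extras="example")
def needs_missing_package() -> None:
    pass
```

Calling `load_with_json` succeeds because `json` is always available, while calling `needs_missing_package` raises an `ImportError` that names the missing package, which is the behavior the checklist asks connectors to provide.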
5 changes: 5 additions & 0 deletions Makefile
@@ -49,6 +49,10 @@ install-dev:
install-build:
	pip install -r requirements/build.txt

.PHONY: install-ingest-google-drive
install-ingest-google-drive:
	pip install -r requirements/ingest-google-drive.txt

## install-ingest-s3: install requirements for the s3 connector
.PHONY: install-ingest-s3
install-ingest-s3:
@@ -98,6 +102,7 @@ pip-compile:
	pip-compile --upgrade --extra=reddit --output-file=requirements/ingest-reddit.txt requirements/base.txt setup.py
	pip-compile --upgrade --extra=github --output-file=requirements/ingest-github.txt requirements/base.txt setup.py
	pip-compile --upgrade --extra=wikipedia --output-file=requirements/ingest-wikipedia.txt requirements/base.txt setup.py
	pip-compile --upgrade --extra=google-drive --output-file=requirements/ingest-google-drive.txt requirements/base.txt setup.py

## install-project-local: install unstructured into your local python environment
.PHONY: install-project-local
36 changes: 36 additions & 0 deletions examples/ingest/google_drive/ingest.sh
@@ -0,0 +1,36 @@
#!/usr/bin/env bash

# Processes a Google Drive file or folder
# through Unstructured's library in 2 processes.

# Structured outputs are stored in google-drive-ingest-output/

# NOTE: This script is not ready to run as-is!
# You must enter a Drive ID and a Drive Service Account Key before running.

# To learn how to create a service account key, see:
# https://developers.google.com/workspace/guides/create-credentials#service-account

# The file or folder ID can be taken from the URL of the file or folder, such as:
# https://drive.google.com/drive/folders/{folder-id}
# https://drive.google.com/file/d/{file-id}

# NOTE: Using the service account key only works when the file or folder
# is shared with at least "Anyone with the link" view permission,
# OR the service account's email address is given access to the file
# or folder.

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
cd "$SCRIPT_DIR"/../../.. || exit 1

PYTHONPATH=. ./unstructured/ingest/main.py \
    --drive-id "<file or folder id>" \
    --drive-service-account-key "<path to drive service account key>" \
    --structured-output-dir google-drive-ingest-output \
    --num-processes 2 \
    --drive-recursive \
    --verbose \
    # --extension ".docx" # Ensures only .docx files are processed.

# Alternatively, you can call it using:
# unstructured-ingest --drive-id ...
218 changes: 218 additions & 0 deletions requirements/ingest-google-drive.txt
@@ -0,0 +1,218 @@
#
# This file is autogenerated by pip-compile with Python 3.9
# by the following command:
#
# pip-compile --extra=google-drive --output-file=requirements/ingest-google-drive.txt requirements/base.txt setup.py
#
anyio==3.6.2
# via
# -r requirements/base.txt
# httpcore
argilla==1.3.1
# via
# -r requirements/base.txt
# unstructured (setup.py)
backoff==2.2.1
# via
# -r requirements/base.txt
# argilla
cachetools==5.3.0
# via google-auth
certifi==2022.12.7
# via
# -r requirements/base.txt
# httpcore
# httpx
# requests
# unstructured (setup.py)
charset-normalizer==3.0.1
# via
# -r requirements/base.txt
# requests
click==8.1.3
# via
# -r requirements/base.txt
# nltk
deprecated==1.2.13
# via
# -r requirements/base.txt
# argilla
et-xmlfile==1.1.0
# via
# -r requirements/base.txt
# openpyxl
google-api-core==2.11.0
# via google-api-python-client
google-api-python-client==2.80.0
# via unstructured (setup.py)
google-auth==2.16.2
# via
# google-api-core
# google-api-python-client
# google-auth-httplib2
google-auth-httplib2==0.1.0
# via google-api-python-client
googleapis-common-protos==1.58.0
# via google-api-core
h11==0.14.0
# via
# -r requirements/base.txt
# httpcore
httpcore==0.16.3
# via
# -r requirements/base.txt
# httpx
httplib2==0.21.0
# via
# google-api-python-client
# google-auth-httplib2
httpx==0.23.3
# via
# -r requirements/base.txt
# argilla
idna==3.4
# via
# -r requirements/base.txt
# anyio
# requests
# rfc3986
importlib-metadata==6.0.0
# via
# -r requirements/base.txt
# markdown
joblib==1.2.0
# via
# -r requirements/base.txt
# nltk
lxml==4.9.2
# via
# -r requirements/base.txt
# python-docx
# python-pptx
# unstructured (setup.py)
markdown==3.4.1
# via
# -r requirements/base.txt
# unstructured (setup.py)
monotonic==1.6
# via
# -r requirements/base.txt
# argilla
nltk==3.8.1
# via
# -r requirements/base.txt
# unstructured (setup.py)
numpy==1.23.5
# via
# -r requirements/base.txt
# argilla
# pandas
openpyxl==3.1.1
# via
# -r requirements/base.txt
# unstructured (setup.py)
packaging==23.0
# via
# -r requirements/base.txt
# argilla
pandas==1.5.3
# via
# -r requirements/base.txt
# argilla
# unstructured (setup.py)
pillow==9.4.0
# via
# -r requirements/base.txt
# python-pptx
# unstructured (setup.py)
protobuf==4.22.0
# via
# google-api-core
# googleapis-common-protos
pyasn1==0.4.8
# via
# pyasn1-modules
# rsa
pyasn1-modules==0.2.8
# via google-auth
pydantic==1.10.5
# via
# -r requirements/base.txt
# argilla
pyparsing==3.0.9
# via httplib2
python-dateutil==2.8.2
# via
# -r requirements/base.txt
# pandas
python-docx==0.8.11
# via
# -r requirements/base.txt
# unstructured (setup.py)
python-magic==0.4.27
# via
# -r requirements/base.txt
# unstructured (setup.py)
python-pptx==0.6.21
# via
# -r requirements/base.txt
# unstructured (setup.py)
pytz==2022.7.1
# via
# -r requirements/base.txt
# pandas
regex==2022.10.31
# via
# -r requirements/base.txt
# nltk
requests==2.28.2
# via
# -r requirements/base.txt
# google-api-core
# unstructured (setup.py)
rfc3986[idna2008]==1.5.0
# via
# -r requirements/base.txt
# httpx
rsa==4.9
# via google-auth
six==1.16.0
# via
# -r requirements/base.txt
# google-auth
# google-auth-httplib2
# python-dateutil
sniffio==1.3.0
# via
# -r requirements/base.txt
# anyio
# httpcore
# httpx
tqdm==4.64.1
# via
# -r requirements/base.txt
# argilla
# nltk
typing-extensions==4.5.0
# via
# -r requirements/base.txt
# pydantic
uritemplate==4.1.1
# via google-api-python-client
urllib3==1.26.14
# via
# -r requirements/base.txt
# requests
wrapt==1.14.1
# via
# -r requirements/base.txt
# argilla
# deprecated
xlsxwriter==3.0.8
# via
# -r requirements/base.txt
# python-pptx
zipp==3.15.0
# via
# -r requirements/base.txt
# importlib-metadata
1 change: 1 addition & 0 deletions setup.py
@@ -85,6 +85,7 @@
],
"reddit": ["praw"],
"wikipedia": ["wikipedia"],
"google-drive": ["google-api-python-client"],
},
package_dir={"unstructured": "unstructured"},
package_data={"unstructured": ["nlp/*.txt"]},
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.5.3-dev1" # pragma: no cover
+__version__ = "0.5.3-dev2" # pragma: no cover
5 changes: 3 additions & 2 deletions unstructured/file_utils/filetype.py
@@ -218,10 +218,11 @@ def detect_filetype(
             with open(filename, "rb") as f:
                 filetype = _detect_filetype_from_octet_stream(file=f)

+        extension = extension if extension else ""
         if filetype == FileType.UNK:
-            return FileType.ZIP
+            return EXT_TO_FILETYPE.get(extension.lower(), FileType.ZIP)
         else:
-            return filetype
+            return EXT_TO_FILETYPE.get(extension.lower(), filetype)

     logger.warn(
         f"MIME type was {mime_type}. This file type is not currently supported in unstructured.",
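To make the effect of the `filetype.py` change easier to see, here is a self-contained sketch of the new fallback logic. The `FileType` enum and `EXT_TO_FILETYPE` mapping below are simplified stand-ins, not the library's real definitions, and `resolve_zip_filetype` is a hypothetical wrapper around the two changed return statements.

```python
from enum import Enum
from typing import Optional


class FileType(Enum):
    # Simplified stand-in for the library's FileType enum.
    UNK = 0
    ZIP = 1
    DOCX = 2
    XLSX = 3
    PPTX = 4


# Simplified stand-in for the library's extension mapping.
EXT_TO_FILETYPE = {
    ".docx": FileType.DOCX,
    ".xlsx": FileType.XLSX,
    ".pptx": FileType.PPTX,
}


def resolve_zip_filetype(detected: FileType, extension: Optional[str]) -> FileType:
    """Mirror the changed branch: a known extension wins, else fall back."""
    extension = extension if extension else ""
    if detected == FileType.UNK:
        # Content sniffing failed: use the extension, defaulting to ZIP.
        return EXT_TO_FILETYPE.get(extension.lower(), FileType.ZIP)
    # Content sniffing succeeded: a known extension still takes precedence.
    return EXT_TO_FILETYPE.get(extension.lower(), detected)
```

Note that in this logic a recognized extension overrides the type detected from the file contents. The detected type (or `ZIP`) is used only when the extension is missing or unknown.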
9 changes: 9 additions & 0 deletions unstructured/file_utils/google_filetype.py
@@ -0,0 +1,9 @@
GOOGLE_DRIVE_EXPORT_TYPES = {
"application/vnd.google-apps.document": "application/"
"vnd.openxmlformats-officedocument.wordprocessingml.document",
"application/vnd.google-apps.spreadsheet": "application/"
"vnd.openxmlformats-officedocument.spreadsheetml.sheet",
"application/vnd.google-apps.presentation": "application/"
"vnd.openxmlformats-officedocument.presentationml.presentation",
"application/vnd.google-apps.photo": "image/jpeg",
}
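
Each value in the mapping above is written as two adjacent string literals, which Python joins into one MIME type string. The sketch below copies the mapping and adds a lookup helper to show the resulting values. The helper name `export_mime_type` is hypothetical, and how the connector actually consumes the mapping is not shown in this section.

```python
from typing import Optional

# Copied from the diff; adjacent string literals are concatenated by Python.
GOOGLE_DRIVE_EXPORT_TYPES = {
    "application/vnd.google-apps.document": "application/"
    "vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.google-apps.spreadsheet": "application/"
    "vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.google-apps.presentation": "application/"
    "vnd.openxmlformats-officedocument.presentationml.presentation",
    "application/vnd.google-apps.photo": "image/jpeg",
}


def export_mime_type(source_mime_type: str) -> Optional[str]:
    """Hypothetical helper: return the export MIME type for a Google-native
    file type, or None when the file is not a Google-native type."""
    return GOOGLE_DRIVE_EXPORT_TYPES.get(source_mime_type)
```

A `None` result here would suggest that the file is not one of the four mapped Google-native types.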