Connector for Google Drive (#294)
Implements issue #244
HAKSOAT committed Mar 7, 2023
1 parent 905e4ae commit 4117f57
Showing 11 changed files with 611 additions and 4 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
-## 0.5.3-dev1
+## 0.5.3-dev2

### Enhancements

@@ -7,6 +7,7 @@
* Add `--wikipedia-auto-suggest` argument to the ingest CLI to disable automatic redirection
to pages with similar names.
* Add optional `encoding` argument to the `partition_(text/email/html)` functions.
* Added Google Drive connector for ingest cli.

### Fixes

1 change: 1 addition & 0 deletions Ingest.md
@@ -68,6 +68,7 @@ In checklist form, the above steps are summarized as:
- [ ] Update the Makefile, adding a target for `install-ingest-<name>` and adding another `pip-compile` line to the `pip-compile` make target. See [this commit](https://github.com/Unstructured-IO/unstructured/commit/ab542ca3c6274f96b431142262d47d727f309e37) for a reference.
- [ ] The added dependencies should be imported at runtime when the new connector is invoked, rather than as top-level imports.
- [ ] Add the decorator `unstructured.utils.requires_dependencies` to each class or function that uses those connector-specific dependencies, e.g. for `S3Connector` it should look like `@requires_dependencies(dependencies=["boto3"], extras="s3")`
- [ ] Run `make tidy` and `make check` to ensure linting checks pass.
- [ ] Honors the conventions of `BaseConnectorConfig` defined in [unstructured/ingest/interfaces.py](unstructured/ingest/interfaces.py) which is passed through [the CLI](unstructured/ingest/main.py):
- [ ] If running with an `.output_dir` where structured outputs already exist for a given file, the file content is not re-downloaded from the data source nor is it reprocessed. This is made possible by implementing the call to `MyIngestDoc.has_output()` which is invoked in [MainProcess._filter_docs_with_outputs](ingest-prep-for-many/unstructured/ingest/main.py).
- [ ] The exception is when `.reprocess` is `True`, in which case documents are always reprocessed.
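
The runtime-import and decorator items in the checklist above can be sketched roughly as follows. This is a minimal illustration and not the implementation of `unstructured.utils.requires_dependencies`: the name `requires_dependencies_sketch`, the error message, and the `dump` function are hypothetical, and `json` merely stands in for a connector-specific package such as `boto3`.

```python
import importlib.util
from functools import wraps
from typing import Callable, List, Optional


def requires_dependencies_sketch(
    dependencies: List[str], extras: Optional[str] = None
) -> Callable:
    """Hypothetical stand-in for unstructured.utils.requires_dependencies."""

    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Check at call time, so importing the connector module never
            # fails just because an optional package is absent.
            missing = [
                dep for dep in dependencies
                if importlib.util.find_spec(dep) is None
            ]
            if missing:
                hint = (
                    f' Install with: pip install "unstructured[{extras}]"'
                    if extras
                    else ""
                )
                raise ImportError(
                    f"Missing dependencies: {', '.join(missing)}.{hint}"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependencies_sketch(dependencies=["json"], extras="example")
def dump(payload: dict) -> str:
    # The connector-specific import happens inside the function,
    # not at the top of the module.
    import json

    return json.dumps(payload)
```

Checking at call time rather than import time is what lets the base package be installed without every connector's extras.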
5 changes: 5 additions & 0 deletions Makefile
@@ -49,6 +49,10 @@ install-dev:
install-build:
pip install -r requirements/build.txt

.PHONY: install-ingest-google-drive
install-ingest-google-drive:
pip install -r requirements/ingest-google-drive.txt

## install-ingest-s3: install requirements for the s3 connector
.PHONY: install-ingest-s3
install-ingest-s3:
@@ -98,6 +102,7 @@ pip-compile:
pip-compile --upgrade --extra=reddit --output-file=requirements/ingest-reddit.txt requirements/base.txt setup.py
pip-compile --upgrade --extra=github --output-file=requirements/ingest-github.txt requirements/base.txt setup.py
pip-compile --upgrade --extra=wikipedia --output-file=requirements/ingest-wikipedia.txt requirements/base.txt setup.py
pip-compile --upgrade --extra=google-drive --output-file=requirements/ingest-google-drive.txt requirements/base.txt setup.py

## install-project-local: install unstructured into your local python environment
.PHONY: install-project-local
36 changes: 36 additions & 0 deletions examples/ingest/google_drive/ingest.sh
@@ -0,0 +1,36 @@
#!/usr/bin/env bash

# Processes a Google Drive file or folder
# through Unstructured's library in 2 processes.

# Structured outputs are stored in google-drive-ingest-output/

# NOTE, this script is not ready-to-run!
# You must enter a Drive ID and a Drive Service Account Key before running.

# You can find out how to create the Service Account Key here:
# https://developers.google.com/workspace/guides/create-credentials#service-account

# The file or folder ID can be taken from the URL of the file or folder, such as:
# https://drive.google.com/drive/folders/{folder-id}
# https://drive.google.com/file/d/{file-id}

# NOTE: Using the Service Account Key only works when the file or folder
# is shared at least with permission for "Anyone with the link" to view,
# OR the email address for the service account is given access to the file
# or folder.

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
cd "$SCRIPT_DIR"/../../.. || exit 1

PYTHONPATH=. ./unstructured/ingest/main.py \
--drive-id "<file or folder id>" \
--drive-service-account-key "<path to drive service account key>" \
--structured-output-dir google-drive-ingest-output \
--num-processes 2 \
--drive-recursive \
--verbose \
# --extension ".docx" # Ensures only .docx files are processed.

# Alternatively, you can call it using:
# unstructured-ingest --drive-id ...
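
The script comments say the ID comes from the Drive URL. A small helper along these lines could pull it out; this is a sketch only, `extract_drive_id` is a hypothetical name that is not part of this commit, it handles just the two URL shapes quoted in the script, and the allowed ID characters are an assumption.

```python
import re
from typing import Optional

# Only the two URL shapes mentioned in ingest.sh are handled.
# Assumption: IDs consist of letters, digits, "-" and "_".
_DRIVE_ID_PATTERN = re.compile(
    r"drive\.google\.com/(?:drive/folders|file/d)/([A-Za-z0-9_-]+)"
)


def extract_drive_id(url: str) -> Optional[str]:
    """Return the file or folder ID embedded in a Drive URL, or None."""
    match = _DRIVE_ID_PATTERN.search(url)
    return match.group(1) if match else None
```

The returned value is what would be passed to `--drive-id`.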
218 changes: 218 additions & 0 deletions requirements/ingest-google-drive.txt
@@ -0,0 +1,218 @@
#
# This file is autogenerated by pip-compile with Python 3.9
# by the following command:
#
# pip-compile --extra=google-drive --output-file=requirements/ingest-google-drive.txt requirements/base.txt setup.py
#
anyio==3.6.2
# via
# -r requirements/base.txt
# httpcore
argilla==1.3.1
# via
# -r requirements/base.txt
# unstructured (setup.py)
backoff==2.2.1
# via
# -r requirements/base.txt
# argilla
cachetools==5.3.0
# via google-auth
certifi==2022.12.7
# via
# -r requirements/base.txt
# httpcore
# httpx
# requests
# unstructured (setup.py)
charset-normalizer==3.0.1
# via
# -r requirements/base.txt
# requests
click==8.1.3
# via
# -r requirements/base.txt
# nltk
deprecated==1.2.13
# via
# -r requirements/base.txt
# argilla
et-xmlfile==1.1.0
# via
# -r requirements/base.txt
# openpyxl
google-api-core==2.11.0
# via google-api-python-client
google-api-python-client==2.80.0
# via unstructured (setup.py)
google-auth==2.16.2
# via
# google-api-core
# google-api-python-client
# google-auth-httplib2
google-auth-httplib2==0.1.0
# via google-api-python-client
googleapis-common-protos==1.58.0
# via google-api-core
h11==0.14.0
# via
# -r requirements/base.txt
# httpcore
httpcore==0.16.3
# via
# -r requirements/base.txt
# httpx
httplib2==0.21.0
# via
# google-api-python-client
# google-auth-httplib2
httpx==0.23.3
# via
# -r requirements/base.txt
# argilla
idna==3.4
# via
# -r requirements/base.txt
# anyio
# requests
# rfc3986
importlib-metadata==6.0.0
# via
# -r requirements/base.txt
# markdown
joblib==1.2.0
# via
# -r requirements/base.txt
# nltk
lxml==4.9.2
# via
# -r requirements/base.txt
# python-docx
# python-pptx
# unstructured (setup.py)
markdown==3.4.1
# via
# -r requirements/base.txt
# unstructured (setup.py)
monotonic==1.6
# via
# -r requirements/base.txt
# argilla
nltk==3.8.1
# via
# -r requirements/base.txt
# unstructured (setup.py)
numpy==1.23.5
# via
# -r requirements/base.txt
# argilla
# pandas
openpyxl==3.1.1
# via
# -r requirements/base.txt
# unstructured (setup.py)
packaging==23.0
# via
# -r requirements/base.txt
# argilla
pandas==1.5.3
# via
# -r requirements/base.txt
# argilla
# unstructured (setup.py)
pillow==9.4.0
# via
# -r requirements/base.txt
# python-pptx
# unstructured (setup.py)
protobuf==4.22.0
# via
# google-api-core
# googleapis-common-protos
pyasn1==0.4.8
# via
# pyasn1-modules
# rsa
pyasn1-modules==0.2.8
# via google-auth
pydantic==1.10.5
# via
# -r requirements/base.txt
# argilla
pyparsing==3.0.9
# via httplib2
python-dateutil==2.8.2
# via
# -r requirements/base.txt
# pandas
python-docx==0.8.11
# via
# -r requirements/base.txt
# unstructured (setup.py)
python-magic==0.4.27
# via
# -r requirements/base.txt
# unstructured (setup.py)
python-pptx==0.6.21
# via
# -r requirements/base.txt
# unstructured (setup.py)
pytz==2022.7.1
# via
# -r requirements/base.txt
# pandas
regex==2022.10.31
# via
# -r requirements/base.txt
# nltk
requests==2.28.2
# via
# -r requirements/base.txt
# google-api-core
# unstructured (setup.py)
rfc3986[idna2008]==1.5.0
# via
# -r requirements/base.txt
# httpx
rsa==4.9
# via google-auth
six==1.16.0
# via
# -r requirements/base.txt
# google-auth
# google-auth-httplib2
# python-dateutil
sniffio==1.3.0
# via
# -r requirements/base.txt
# anyio
# httpcore
# httpx
tqdm==4.64.1
# via
# -r requirements/base.txt
# argilla
# nltk
typing-extensions==4.5.0
# via
# -r requirements/base.txt
# pydantic
uritemplate==4.1.1
# via google-api-python-client
urllib3==1.26.14
# via
# -r requirements/base.txt
# requests
wrapt==1.14.1
# via
# -r requirements/base.txt
# argilla
# deprecated
xlsxwriter==3.0.8
# via
# -r requirements/base.txt
# python-pptx
zipp==3.15.0
# via
# -r requirements/base.txt
# importlib-metadata
1 change: 1 addition & 0 deletions setup.py
@@ -85,6 +85,7 @@
],
"reddit": ["praw"],
"wikipedia": ["wikipedia"],
"google-drive": ["google-api-python-client"],
},
package_dir={"unstructured": "unstructured"},
package_data={"unstructured": ["nlp/*.txt"]},
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.5.3-dev1" # pragma: no cover
+__version__ = "0.5.3-dev2" # pragma: no cover
5 changes: 3 additions & 2 deletions unstructured/file_utils/filetype.py
@@ -218,10 +218,11 @@ def detect_filetype(
with open(filename, "rb") as f:
filetype = _detect_filetype_from_octet_stream(file=f)

+extension = extension if extension else ""
 if filetype == FileType.UNK:
-    return FileType.ZIP
+    return EXT_TO_FILETYPE.get(extension.lower(), FileType.ZIP)
 else:
-    return filetype
+    return EXT_TO_FILETYPE.get(extension.lower(), filetype)

logger.warn(
f"MIME type was {mime_type}. This file type is not currently supported in unstructured.",
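
The effect of the `detect_filetype` change above can be seen in isolation with a toy version of the lookup. This is a sketch under stated assumptions: the `FileType` enum and `EXT_TO_FILETYPE` mapping below are cut-down stand-ins for the library's own, and `resolve_with_extension` is a hypothetical name for the changed branch.

```python
from enum import Enum
from typing import Optional


class FileType(Enum):
    # Toy subset; the library's enum has many more members.
    UNK = 0
    ZIP = 1
    DOCX = 2
    XLSX = 3


# Toy subset of the extension lookup table.
EXT_TO_FILETYPE = {".docx": FileType.DOCX, ".xlsx": FileType.XLSX}


def resolve_with_extension(
    detected: FileType, extension: Optional[str]
) -> FileType:
    """Prefer a known extension; otherwise fall back to the sniffed type."""
    extension = extension if extension else ""
    if detected == FileType.UNK:
        # Nothing recognised inside the archive: default to ZIP.
        return EXT_TO_FILETYPE.get(extension.lower(), FileType.ZIP)
    # Content was sniffed, but a known extension still takes precedence.
    return EXT_TO_FILETYPE.get(extension.lower(), detected)
```

Before the change, the extension was ignored in both branches; after it, a known extension wins and the sniffed type (or `ZIP`) is only the fallback.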
9 changes: 9 additions & 0 deletions unstructured/file_utils/google_filetype.py
@@ -0,0 +1,9 @@
GOOGLE_DRIVE_EXPORT_TYPES = {
    "application/vnd.google-apps.document": "application/"
    "vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.google-apps.spreadsheet": "application/"
    "vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.google-apps.presentation": "application/"
    "vnd.openxmlformats-officedocument.presentationml.presentation",
    "application/vnd.google-apps.photo": "image/jpeg",
}
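
Google-native files (Docs, Sheets, Slides) have no binary content to download directly and must be exported to another format through the Drive API. How the connector consumes this mapping is not shown in this part of the diff, so the following is only a sketch of one plausible lookup; `export_mime_type` is a hypothetical helper, and the dictionary is repeated so the example is self-contained.

```python
from typing import Optional

# Copied from unstructured/file_utils/google_filetype.py in this commit.
GOOGLE_DRIVE_EXPORT_TYPES = {
    "application/vnd.google-apps.document": "application/"
    "vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.google-apps.spreadsheet": "application/"
    "vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.google-apps.presentation": "application/"
    "vnd.openxmlformats-officedocument.presentationml.presentation",
    "application/vnd.google-apps.photo": "image/jpeg",
}


def export_mime_type(drive_mime_type: str) -> Optional[str]:
    """Return the export MIME type for a Google-native file.

    None means the MIME type is not in the mapping, in which case the
    file would presumably be downloaded as is (an assumption).
    """
    return GOOGLE_DRIVE_EXPORT_TYPES.get(drive_mime_type)
```

With this mapping a Google Doc would be exported as `.docx`, a Sheet as `.xlsx`, and a Slides deck as `.pptx`, all formats the library already partitions.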