Implements issue #244
Showing 11 changed files with 611 additions and 4 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
#!/usr/bin/env bash | ||
|
||
# Processes a Google Drive file or folder | ||
# through Unstructured's library in 2 processes. | ||
|
||
# Structured outputs are stored in google-drive-ingest-output/ | ||
|
||
# NOTE, this script is not ready-to-run! | ||
# You must enter a Drive ID and a Drive Service Account Key before running. | ||
|
||
# You can find out how to create the service account key here: | ||
# https://developers.google.com/workspace/guides/create-credentials#service-account | ||
|
||
# The file or folder ID can be found in the URL of the file or folder, such as: | ||
# https://drive.google.com/drive/folders/{folder-id} | ||
# https://drive.google.com/file/d/{file-id} | ||
|
||
# NOTE: Using the service account key only works when the file or folder | ||
# is shared at least with permission for "Anyone with the link" to view, | ||
# OR the email address for the service account is given access to the file | ||
# or folder. | ||
|
||
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd ) | ||
cd "$SCRIPT_DIR"/../../.. || exit 1 | ||
|
||
PYTHONPATH=. ./unstructured/ingest/main.py \ | ||
--drive-id "<file or folder id>" \ | ||
--drive-service-account-key "<path to drive service account key>" \ | ||
--structured-output-dir google-drive-ingest-output \ | ||
--num-processes 2 \ | ||
--drive-recursive \ | ||
--verbose \ | ||
# --extension ".docx" # Ensures only .docx files are processed. | ||
|
||
# Alternatively, you can call it using: | ||
# unstructured-ingest --drive-id ... |
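
The script comments above say the value for `--drive-id` is taken from the Drive URL. A minimal sketch of that extraction follows, assuming the URL has one of the two shapes shown in the comments. `extract_drive_id` is a hypothetical helper written for illustration; it is not part of this commit or of the unstructured library, and other URL variants are not handled.

```python
import re
from typing import Optional

# Assumption: only the two URL shapes from the script comments are supported:
#   https://drive.google.com/drive/folders/{folder-id}
#   https://drive.google.com/file/d/{file-id}
_DRIVE_ID_PATTERN = re.compile(
    r"drive\.google\.com/(?:drive/folders|file/d)/([A-Za-z0-9_-]+)"
)


def extract_drive_id(url: str) -> Optional[str]:
    """Return the file or folder ID embedded in a Drive URL, or None if absent."""
    match = _DRIVE_ID_PATTERN.search(url)
    return match.group(1) if match else None
```

The returned string is what would be passed as the `--drive-id` argument in the command above.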
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,218 @@ | ||
# | ||
# This file is autogenerated by pip-compile with Python 3.9 | ||
# by the following command: | ||
# | ||
# pip-compile --extra=google-drive --output-file=requirements/ingest-google-drive.txt requirements/base.txt setup.py | ||
# | ||
anyio==3.6.2 | ||
# via | ||
# -r requirements/base.txt | ||
# httpcore | ||
argilla==1.3.1 | ||
# via | ||
# -r requirements/base.txt | ||
# unstructured (setup.py) | ||
backoff==2.2.1 | ||
# via | ||
# -r requirements/base.txt | ||
# argilla | ||
cachetools==5.3.0 | ||
# via google-auth | ||
certifi==2022.12.7 | ||
# via | ||
# -r requirements/base.txt | ||
# httpcore | ||
# httpx | ||
# requests | ||
# unstructured (setup.py) | ||
charset-normalizer==3.0.1 | ||
# via | ||
# -r requirements/base.txt | ||
# requests | ||
click==8.1.3 | ||
# via | ||
# -r requirements/base.txt | ||
# nltk | ||
deprecated==1.2.13 | ||
# via | ||
# -r requirements/base.txt | ||
# argilla | ||
et-xmlfile==1.1.0 | ||
# via | ||
# -r requirements/base.txt | ||
# openpyxl | ||
google-api-core==2.11.0 | ||
# via google-api-python-client | ||
google-api-python-client==2.80.0 | ||
# via unstructured (setup.py) | ||
google-auth==2.16.2 | ||
# via | ||
# google-api-core | ||
# google-api-python-client | ||
# google-auth-httplib2 | ||
google-auth-httplib2==0.1.0 | ||
# via google-api-python-client | ||
googleapis-common-protos==1.58.0 | ||
# via google-api-core | ||
h11==0.14.0 | ||
# via | ||
# -r requirements/base.txt | ||
# httpcore | ||
httpcore==0.16.3 | ||
# via | ||
# -r requirements/base.txt | ||
# httpx | ||
httplib2==0.21.0 | ||
# via | ||
# google-api-python-client | ||
# google-auth-httplib2 | ||
httpx==0.23.3 | ||
# via | ||
# -r requirements/base.txt | ||
# argilla | ||
idna==3.4 | ||
# via | ||
# -r requirements/base.txt | ||
# anyio | ||
# requests | ||
# rfc3986 | ||
importlib-metadata==6.0.0 | ||
# via | ||
# -r requirements/base.txt | ||
# markdown | ||
joblib==1.2.0 | ||
# via | ||
# -r requirements/base.txt | ||
# nltk | ||
lxml==4.9.2 | ||
# via | ||
# -r requirements/base.txt | ||
# python-docx | ||
# python-pptx | ||
# unstructured (setup.py) | ||
markdown==3.4.1 | ||
# via | ||
# -r requirements/base.txt | ||
# unstructured (setup.py) | ||
monotonic==1.6 | ||
# via | ||
# -r requirements/base.txt | ||
# argilla | ||
nltk==3.8.1 | ||
# via | ||
# -r requirements/base.txt | ||
# unstructured (setup.py) | ||
numpy==1.23.5 | ||
# via | ||
# -r requirements/base.txt | ||
# argilla | ||
# pandas | ||
openpyxl==3.1.1 | ||
# via | ||
# -r requirements/base.txt | ||
# unstructured (setup.py) | ||
packaging==23.0 | ||
# via | ||
# -r requirements/base.txt | ||
# argilla | ||
pandas==1.5.3 | ||
# via | ||
# -r requirements/base.txt | ||
# argilla | ||
# unstructured (setup.py) | ||
pillow==9.4.0 | ||
# via | ||
# -r requirements/base.txt | ||
# python-pptx | ||
# unstructured (setup.py) | ||
protobuf==4.22.0 | ||
# via | ||
# google-api-core | ||
# googleapis-common-protos | ||
pyasn1==0.4.8 | ||
# via | ||
# pyasn1-modules | ||
# rsa | ||
pyasn1-modules==0.2.8 | ||
# via google-auth | ||
pydantic==1.10.5 | ||
# via | ||
# -r requirements/base.txt | ||
# argilla | ||
pyparsing==3.0.9 | ||
# via httplib2 | ||
python-dateutil==2.8.2 | ||
# via | ||
# -r requirements/base.txt | ||
# pandas | ||
python-docx==0.8.11 | ||
# via | ||
# -r requirements/base.txt | ||
# unstructured (setup.py) | ||
python-magic==0.4.27 | ||
# via | ||
# -r requirements/base.txt | ||
# unstructured (setup.py) | ||
python-pptx==0.6.21 | ||
# via | ||
# -r requirements/base.txt | ||
# unstructured (setup.py) | ||
pytz==2022.7.1 | ||
# via | ||
# -r requirements/base.txt | ||
# pandas | ||
regex==2022.10.31 | ||
# via | ||
# -r requirements/base.txt | ||
# nltk | ||
requests==2.28.2 | ||
# via | ||
# -r requirements/base.txt | ||
# google-api-core | ||
# unstructured (setup.py) | ||
rfc3986[idna2008]==1.5.0 | ||
# via | ||
# -r requirements/base.txt | ||
# httpx | ||
rsa==4.9 | ||
# via google-auth | ||
six==1.16.0 | ||
# via | ||
# -r requirements/base.txt | ||
# google-auth | ||
# google-auth-httplib2 | ||
# python-dateutil | ||
sniffio==1.3.0 | ||
# via | ||
# -r requirements/base.txt | ||
# anyio | ||
# httpcore | ||
# httpx | ||
tqdm==4.64.1 | ||
# via | ||
# -r requirements/base.txt | ||
# argilla | ||
# nltk | ||
typing-extensions==4.5.0 | ||
# via | ||
# -r requirements/base.txt | ||
# pydantic | ||
uritemplate==4.1.1 | ||
# via google-api-python-client | ||
urllib3==1.26.14 | ||
# via | ||
# -r requirements/base.txt | ||
# requests | ||
wrapt==1.14.1 | ||
# via | ||
# -r requirements/base.txt | ||
# argilla | ||
# deprecated | ||
xlsxwriter==3.0.8 | ||
# via | ||
# -r requirements/base.txt | ||
# python-pptx | ||
zipp==3.15.0 | ||
# via | ||
# -r requirements/base.txt | ||
# importlib-metadata |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1 @@ | ||
-__version__ = "0.5.3-dev1" # pragma: no cover | ||
+__version__ = "0.5.3-dev2" # pragma: no cover |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
GOOGLE_DRIVE_EXPORT_TYPES = { | ||
"application/vnd.google-apps.document": "application/" | ||
"vnd.openxmlformats-officedocument.wordprocessingml.document", | ||
"application/vnd.google-apps.spreadsheet": "application/" | ||
"vnd.openxmlformats-officedocument.spreadsheetml.sheet", | ||
"application/vnd.google-apps.presentation": "application/" | ||
"vnd.openxmlformats-officedocument.presentationml.presentation", | ||
"application/vnd.google-apps.photo": "image/jpeg", | ||
} |
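
The mapping above pairs each Google-native MIME type with a format it can be exported to. How the commit consumes the mapping is not visible in this diff, so the following is only a sketch of one plausible use: look up the source MIME type and decide whether the file needs an export or a plain download. `choose_download` is a hypothetical name, and the dictionary is repeated here only to keep the example self-contained.

```python
from typing import Optional, Tuple

# Copied from the diff above so the example runs on its own.
GOOGLE_DRIVE_EXPORT_TYPES = {
    "application/vnd.google-apps.document": "application/"
    "vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.google-apps.spreadsheet": "application/"
    "vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.google-apps.presentation": "application/"
    "vnd.openxmlformats-officedocument.presentationml.presentation",
    "application/vnd.google-apps.photo": "image/jpeg",
}


def choose_download(mime_type: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: decide how a Drive file should be fetched.

    Returns ("export", target_mime) when the type is Google-native and must be
    converted, otherwise ("download", None) for files that can be fetched as-is.
    """
    export_mime = GOOGLE_DRIVE_EXPORT_TYPES.get(mime_type)
    if export_mime is not None:
        return "export", export_mime
    return "download", None
```

Under this sketch, a Google Docs file would be exported as a .docx document, while a file with an ordinary MIME type such as `application/pdf` would be downloaded unchanged.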