feat: add google cloud storage connector #746

Merged
merged 38 commits into from
Jun 21, 2023
Changes from 12 commits
Commits
28d8d60
fix incorrect pip install message
Jun 9, 2023
43ef3bc
GCS initial changes that worked
Jun 9, 2023
56e4399
Recursive file finder for GCS added
Jun 10, 2023
e2de170
Small improvement
Jun 10, 2023
2fa6fb5
Improvements on gcs. Add recursive folder walking
Jun 14, 2023
0e229ef
Added requirements for Makefile etc...
Jun 14, 2023
8825035
fixed comment
Jun 14, 2023
639a5d9
fix small comment
Jun 14, 2023
9eda17a
fix small error
Jun 14, 2023
8cf6242
add to manifest
Jun 14, 2023
79a4706
fix conflict
Jun 14, 2023
eddc385
unfix conflict
Jun 14, 2023
a976fa7
more improvements. still working on main.py though
Jun 15, 2023
d0369a1
replace drive recursive, local recurs w/ recursive
Jun 15, 2023
612aa94
Added gcs token flag
Jun 16, 2023
2284baf
ran make tidy and make check
Jun 16, 2023
66a063c
Add auth to gcs test. And fix comment
Jun 17, 2023
40c6d77
run tidy and check
Jun 17, 2023
9c9b349
remove unwanted files
Jun 17, 2023
46aee8e
Added strange azure file?
Jun 17, 2023
902076f
un space strange azure file
Jun 17, 2023
07b8536
fix strange azure file that was pushed
Jun 17, 2023
c3ce8d8
again with the same azure file
Jun 17, 2023
a9f48c0
The file won't fix itself.
Jun 17, 2023
53a4cc8
compile requirements
ryannikolaidis Jun 20, 2023
aeeb907
bump cloud storage example repo
ryannikolaidis Jun 21, 2023
fc9775e
update expected count to 6
ryannikolaidis Jun 21, 2023
b58207c
Merge branch 'main' into potter/google-cloud-storage
ryannikolaidis Jun 21, 2023
b45ddeb
fix args
ryannikolaidis Jun 21, 2023
d22df8e
fix access kwargs
ryannikolaidis Jun 21, 2023
959742a
echo GCP_INGEST_SERVICE_KEY
ryannikolaidis Jun 21, 2023
06bd011
install gcs deps in ci
ryannikolaidis Jun 21, 2023
173eebe
clean write the expected files
ryannikolaidis Jun 21, 2023
11b69a7
update local-recursive flag to recursive
ryannikolaidis Jun 21, 2023
ad5282e
Update ingest test fixtures (#788)
Unstructured-DevOps Jun 21, 2023
da99841
Merge branch 'main' into potter/google-cloud-storage
ryannikolaidis Jun 21, 2023
d33f0de
bump changelog and version
ryannikolaidis Jun 21, 2023
6f806af
rename google-cloud to gcs everywhere for consistency
ryannikolaidis Jun 21, 2023
1 change: 1 addition & 0 deletions MANIFEST.in
Original file line number Diff line number Diff line change
@@ -2,6 +2,7 @@ include requirements/base.in
include requirements/huggingface.in
include requirements/local-inference.in
include requirements/ingest-s3.in
include requirements/ingest-gcs.in
include requirements/ingest-azure.in
include requirements/ingest-discord.in
include requirements/ingest-github.in
5 changes: 5 additions & 0 deletions Makefile
@@ -62,6 +62,10 @@ install-ingest-google-drive:
install-ingest-s3:
	python3 -m pip install -r requirements/ingest-s3.txt

.PHONY: install-ingest-gcs
install-ingest-gcs:
	python3 -m pip install -r requirements/ingest-gcs.txt

.PHONY: install-ingest-azure
install-ingest-azure:
	python3 -m pip install -r requirements/ingest-azure.txt
@@ -127,6 +131,7 @@ pip-compile:
	# sphinx docs looks for additional requirements
	cp requirements/build.txt docs/requirements.txt
	pip-compile --upgrade requirements/ingest-s3.in
	pip-compile --upgrade requirements/ingest-gcs.in
	pip-compile --upgrade requirements/ingest-azure.in
	pip-compile --upgrade requirements/ingest-discord.in
	pip-compile --upgrade requirements/ingest-reddit.in
16 changes: 16 additions & 0 deletions examples/ingest/google_cloud_storage/ingest.sh
@@ -0,0 +1,16 @@
#!/usr/bin/env bash

# Processes several files in a nested folder structure from gs://unstructured_public/
# through Unstructured's library in 2 processes.

# Structured outputs are stored in gcs-output/

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
cd "$SCRIPT_DIR"/../../.. || exit 1

PYTHONPATH=. ./unstructured/ingest/main.py \
    --remote-url gs://unstructured_public/ \
    --structured-output-dir gcs-output \
    --num-processes 2 \
    --recursive \
    --verbose
Contributor

I pushed up a gcs test bucket with private and public paths, we should update to use that. I'll follow up with details directly.

Contributor

let's use gs://utic-test-ingest-fixtures

Contributor Author

@potter-potter Jun 17, 2023

This would require the user to have an authentication token. Is that what we really want? Or do we want a no-auth example?

Contributor

@ryannikolaidis Jun 19, 2023

oh sorry, this is for the example. yes, no auth. I'll see if I can set up a public bucket for this.

4 changes: 4 additions & 0 deletions requirements/ingest-gcs.in
@@ -0,0 +1,4 @@
-c constraints.in
-c base.txt
gcsfs
fsspec
1 change: 1 addition & 0 deletions setup.py
@@ -81,6 +81,7 @@ def load_requirements(file_list: Optional[Union[str, List[str]]] = None) -> List
        "slack": load_requirements("requirements/ingest-slack.in"),
        "wikipedia": load_requirements("requirements/ingest-wikipedia.in"),
        "google-drive": load_requirements("requirements/ingest-google-drive.in"),
        "gcs": load_requirements("requirements/ingest-gcs.in"),
    },
    package_dir={"unstructured": "unstructured"},
    package_data={"unstructured": ["nlp/*.txt"]},
@@ -0,0 +1,112 @@
[
{
"element_id": "855ecc17dee3ddb9d89d8f48740c9853",
"text": "MIME-Version: 1.0 Date: Fri, 16 Dec 2022 17:04:16 -0500 Message-ID: <CADc-_xaLB2FeVQ7mNsoX+NJb_7hAJhBKa_zet-rtgPGenj0uVw@mail.gmail.com> Subject: Test Email From: Matthew Robinson <mrobinson@unstructured.io> To: Matthew Robinson <mrobinson@unstructured.io> Content-Type: multipart/alternative; boundary=\"00000000000095c9b205eff92630\"",
"type": "UncategorizedText",
"metadata": {
"data_source": {},
"filetype": "text/plain",
"page_number": 1
}
},
{
"element_id": "c3db8e6c584627c190cc8e1750bdac9c",
"text": "-00000000000095c9b205eff92630",
"type": "ListItem",
"metadata": {
"data_source": {},
"filetype": "text/plain",
"page_number": 1
}
},
{
"element_id": "b91c2196ba2a3190ec703710671918b2",
"text": "Content-Type: text/plain; charset=\"UTF-8\"",
"type": "Title",
"metadata": {
"data_source": {},
"filetype": "text/plain",
"page_number": 1
}
},
{
"element_id": "f49fbd614ddf5b72e06f59e554e6ae2b",
"text": "This is a test email to use for unit tests.",
"type": "NarrativeText",
"metadata": {
"data_source": {},
"filetype": "text/plain",
"page_number": 1
}
},
{
"element_id": "9c218520320f238595f1fde74bdd137d",
"text": "Important points:",
"type": "Title",
"metadata": {
"data_source": {},
"filetype": "text/plain",
"page_number": 1
}
},
{
"element_id": "8522061b991b1db70453502d328fe07e",
"text": "Roses are red",
"type": "ListItem",
"metadata": {
"data_source": {},
"filetype": "text/plain",
"page_number": 1
}
},
{
"element_id": "c3c4527761d4e4b8d0a4c4a0d46954c8",
"text": "Violets are blue",
"type": "ListItem",
"metadata": {
"data_source": {},
"filetype": "text/plain",
"page_number": 1
}
},
{
"element_id": "c3db8e6c584627c190cc8e1750bdac9c",
"text": "-00000000000095c9b205eff92630",
"type": "ListItem",
"metadata": {
"data_source": {},
"filetype": "text/plain",
"page_number": 1
}
},
{
"element_id": "c30942ddb17655a8226bf9d50b5c2fb2",
"text": "Content-Type: text/html; charset=\"UTF-8\"",
"type": "Title",
"metadata": {
"data_source": {},
"filetype": "text/plain",
"page_number": 1
}
},
{
"element_id": "9f5297daa98b670a4529a64fb1e29067",
"text": "<div dir=\"ltr\"><div>This is a test email to use for unit tests.</div><div><br></div><div>Important points:</div><div><ul><li>Roses are red</li><li>Violets are blue</li></ul></div></div>",
"type": "NarrativeText",
"metadata": {
"data_source": {},
"filetype": "text/plain",
"page_number": 1
}
},
{
"element_id": "f09df5cef9c41280d2d859b808e5f658",
"text": "-00000000000095c9b205eff92630--",
"type": "ListItem",
"metadata": {
"data_source": {},
"filetype": "text/plain",
"page_number": 1
}
}
]
@@ -0,0 +1,62 @@
[
{
"element_id": "1df8eeb8be847c3a1a7411e3be3e0396",
"text": "This is a test document to use for unit tests.",
"type": "NarrativeText",
"metadata": {
"data_source": {},
"filetype": "text/plain",
"page_number": 1
}
},
{
"element_id": "a9d4657034aa3fdb5177f1325e912362",
"text": "Doylestown, PA 18901",
"type": "Address",
"metadata": {
"data_source": {},
"filetype": "text/plain",
"page_number": 1
}
},
{
"element_id": "9c218520320f238595f1fde74bdd137d",
"text": "Important points:",
"type": "Title",
"metadata": {
"data_source": {},
"filetype": "text/plain",
"page_number": 1
}
},
{
"element_id": "39a3ae572581d0f1fe7511fd7b3aa414",
"text": "Hamburgers are delicious",
"type": "ListItem",
"metadata": {
"data_source": {},
"filetype": "text/plain",
"page_number": 1
}
},
{
"element_id": "fc1adcb8eaceac694e500a103f9f698f",
"text": "Dogs are the best",
"type": "ListItem",
"metadata": {
"data_source": {},
"filetype": "text/plain",
"page_number": 1
}
},
{
"element_id": "0b61e826b1c4ab05750184da72b89f83",
"text": "I love fuzzy blankets",
"type": "ListItem",
"metadata": {
"data_source": {},
"filetype": "text/plain",
"page_number": 1
}
}
]
@@ -0,0 +1,12 @@
[
{
"element_id": "c08fcabe68ba13b7a7cc6592bd5513a8",
"text": "January 2023(Someone fed my essays into GPT to make something that could answer\nquestions based on them, then asked it where good ideas come from. The\nanswer was ok, but not what I would have said. This is what I would have said.)The way to get new ideas is to notice anomalies: what seems strange,\nor missing, or broken? You can see anomalies in everyday life (much\nof standup comedy is based on this), but the best place to look for\nthem is at the frontiers of knowledge.Knowledge grows fractally.\nFrom a distance its edges look smooth, but when you learn enough\nto get close to one, you'll notice it's full of gaps. These gaps\nwill seem obvious; it will seem inexplicable that no one has tried\nx or wondered about y. In the best case, exploring such gaps yields\nwhole new fractal buds.",
"type": "NarrativeText",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
}
}
]
@@ -0,0 +1,26 @@
[
{
"element_id": "f8db6c6e535705336195aa2c1d23d414",
"text": "\n \n \n Team\n Location\n Stanley Cups\n \n \n Blues\n STL\n 1\n \n \n Flyers\n PHI\n 2\n \n \n Maple Leafs\n TOR\n 13\n \n \n",
"type": "Table",
"metadata": {
"data_source": {},
"filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
"page_number": 1,
"page_name": "Stanley Cups",
"text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Team</td>\n <td>Location</td>\n <td>Stanley Cups</td>\n </tr>\n <tr>\n <td>Blues</td>\n <td>STL</td>\n <td>1</td>\n </tr>\n <tr>\n <td>Flyers</td>\n <td>PHI</td>\n <td>2</td>\n </tr>\n <tr>\n <td>Maple Leafs</td>\n <td>TOR</td>\n <td>13</td>\n </tr>\n </tbody>\n</table>"
}
},
{
"element_id": "20f5163a43ac6eb04a40d269d3ad0663",
"text": "\n \n \n Team\n Location\n Stanley Cups\n \n \n Blues\n STL\n 1\n \n \n Flyers\n PHI\n 2\n \n \n Maple Leafs\n TOR\n 0\n \n \n",
"type": "Table",
"metadata": {
"data_source": {},
"filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
"page_number": 2,
"page_name": "Stanley Cups Since 67",
"text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Team</td>\n <td>Location</td>\n <td>Stanley Cups</td>\n </tr>\n <tr>\n <td>Blues</td>\n <td>STL</td>\n <td>1</td>\n </tr>\n <tr>\n <td>Flyers</td>\n <td>PHI</td>\n <td>2</td>\n </tr>\n <tr>\n <td>Maple Leafs</td>\n <td>TOR</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>"
}
}
]
@@ -0,0 +1,13 @@
[
{
"element_id": "f8db6c6e535705336195aa2c1d23d414",
"text": "\n \n \n Team\n Location\n Stanley Cups\n \n \n Blues\n STL\n 1\n \n \n Flyers\n PHI\n 2\n \n \n Maple Leafs\n TOR\n 13\n \n \n",
"type": "Table",
"metadata": {
"data_source": {},
"filetype": "text/csv",
"page_number": 1,
"text_as_html": "<table border=\"1\" class=\"dataframe\">\n <tbody>\n <tr>\n <td>Team</td>\n <td>Location</td>\n <td>Stanley Cups</td>\n </tr>\n <tr>\n <td>Blues</td>\n <td>STL</td>\n <td>1</td>\n </tr>\n <tr>\n <td>Flyers</td>\n <td>PHI</td>\n <td>2</td>\n </tr>\n <tr>\n <td>Maple Leafs</td>\n <td>TOR</td>\n <td>13</td>\n </tr>\n </tbody>\n</table>"
}
}
]
46 changes: 46 additions & 0 deletions test_unstructured_ingest/test-ingest-google-cloud.sh
@@ -0,0 +1,46 @@
#!/usr/bin/env bash

set -e

SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)
cd "$SCRIPT_DIR"/.. || exit 1

if [[ "$(find test_unstructured_ingest/expected-structured-output/google-cloud-storage/ -type f -size +2 | wc -l)" -ne 5 ]]; then
    echo "The test fixtures in test_unstructured_ingest/expected-structured-output/ look suspicious. At least one of the files is too small."
    echo "Did you overwrite test fixtures with bad outputs?"
    exit 1
fi
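
The guard above depends on how `find -size +2` counts: with no unit suffix the size is measured in 512-byte blocks, rounded up, so only files larger than 1024 bytes match. A minimal sketch of that behaviour on a throwaway directory follows; the directory and file names are hypothetical and not part of this PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch directory standing in for the expected-output fixtures.
fixtures_dir=$(mktemp -d)
trap 'rm -rf "$fixtures_dir"' EXIT

# One plausible fixture (2000 bytes) and one suspiciously small file (2 bytes).
head -c 2000 /dev/zero > "$fixtures_dir/real-output.json"
printf '[]' > "$fixtures_dir/empty-output.json"

# -size +2 matches files that need more than two 512-byte blocks.
large_count=$(find "$fixtures_dir" -type f -size +2 | wc -l)

# Arithmetic expansion strips the padding that wc adds on some platforms.
large_count=$((large_count))
echo "files above the size threshold: $large_count"
```

Only the 2000-byte file is counted, which is why a fixture that was accidentally overwritten with an empty or near-empty JSON array makes the count drop below the expected number.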

PYTHONPATH=. ./unstructured/ingest/main.py \
    --metadata-exclude filename,file_directory,metadata.data_source.date_processed \
    --remote-url gs://unstructured_public/ \
Contributor

let's use this test folder: gs://utic-test-ingest-fixtures
it has two folders, public and private, where private requires that the auth token is used. hopefully this can leverage what was done in the google drive test that I just added?

Contributor Author

@potter-potter Jun 17, 2023

Worked, but I had to cat the cred info into the tempfile instead of using echo. Is that ok? Could be an OSX bash thing.

Contributor

huh, weird. yea, as long as it works locally and in CI, that works for me.

Contributor

@ryannikolaidis Jun 21, 2023

looking at CI, it doesn't look like linux is happy about the cat change....taking a look

cat: '***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n''***'$'\n': File name too long

https://github.com/Unstructured-IO/unstructured/actions/runs/5330890729/jobs/9658103678

Contributor

ah, right, I see the confusion here. (should have noticed before). I think this wasn't working for you locally because you were setting GCP_INGEST_SERVICE_KEY to a filepath. GCP_INGEST_SERVICE_KEY is the actual key itself. This should work with echo on any platform.
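
To illustrate the point in this thread: when `GCP_INGEST_SERVICE_KEY` holds the key material itself, its value has to be written into a file, not treated as a path to read. This is a minimal sketch, not the script from this PR; the placeholder JSON and the temp-file handling are assumptions, and `printf '%s\n'` is used here as a variant of `echo` that does not interpret backslashes.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder value for illustration only; in CI the variable would hold the real key.
GCP_INGEST_SERVICE_KEY=${GCP_INGEST_SERVICE_KEY:-'{"type": "service_account", "project_id": "example"}'}

# Write the key *contents* to a temporary file that is removed on exit.
key_file=$(mktemp)
trap 'rm -f "$key_file"' EXIT
printf '%s\n' "$GCP_INGEST_SERVICE_KEY" > "$key_file"

# By contrast, `cat "$GCP_INGEST_SERVICE_KEY"` would treat the whole JSON string
# as a file name, which is consistent with the "File name too long" error in CI.
echo "wrote $(wc -c < "$key_file") bytes of credentials to a temp file"
```

Writing the value with `printf` or `echo` behaves the same on Linux and macOS, whereas `cat` only appears to work when the variable happens to contain a path.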

    --structured-output-dir gcs-output \
    --recursive \
    --preserve-downloads \
    --reprocess

OVERWRITE_FIXTURES=${OVERWRITE_FIXTURES:-false}

set +e

# to update ingest test fixtures, run scripts/ingest-test-fixtures-update.sh on x86_64
if [[ "$OVERWRITE_FIXTURES" != "false" ]]; then

    # copy recursively: gcs-output contains nested folders
    cp -R gcs-output/* test_unstructured_ingest/expected-structured-output/google-cloud-storage

elif ! diff -ru test_unstructured_ingest/expected-structured-output/google-cloud-storage gcs-output ; then
    echo
    echo "There are differences from the previously checked-in structured outputs."
    echo
    echo "If these differences are acceptable, overwrite the fixtures by setting the env var:"
    echo
    echo " export OVERWRITE_FIXTURES=true"
    echo
    echo "and then rerun this script."
    echo
    echo "NOTE: You'll likely just want to run scripts/ingest-test-fixtures-update.sh on x86_64 hardware"
    echo "to update fixtures for CI."
    echo
    exit 1

fi
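
The comparison step at the end of this script boils down to a recursive `diff` between the checked-in fixtures and the fresh output, plus an environment variable that opts into overwriting. A self-contained sketch on temporary directories, assuming hypothetical file names and contents (nothing here touches the real fixture paths):

```shell
#!/usr/bin/env bash
set -uo pipefail

# Scratch directories standing in for the checked-in fixtures and the fresh output.
work_dir=$(mktemp -d)
trap 'rm -rf "$work_dir"' EXIT
expected_dir="$work_dir/expected"
output_dir="$work_dir/output"
mkdir -p "$expected_dir/nested" "$output_dir/nested"

printf '[{"text": "old"}]\n' > "$expected_dir/nested/doc.json"
printf '[{"text": "new"}]\n' > "$output_dir/nested/doc.json"

# Set explicitly here so the sketch behaves the same in any environment.
OVERWRITE_FIXTURES=false

compare_or_overwrite() {
  if [[ "$OVERWRITE_FIXTURES" != "false" ]]; then
    # Copy the directory contents recursively so nested folders are preserved.
    cp -R "$output_dir"/. "$expected_dir"/
    return 0
  fi
  # diff exits non-zero when any file differs; -r recurses, -u selects unified output.
  diff -ru "$expected_dir" "$output_dir" > /dev/null
}

if compare_or_overwrite; then first_result=match; else first_result=differ; fi

# Opt in to overwriting once, then compare again.
OVERWRITE_FIXTURES=true
compare_or_overwrite
OVERWRITE_FIXTURES=false
if compare_or_overwrite; then second_result=match; else second_result=differ; fi

echo "before overwrite: $first_result, after overwrite: $second_result"
```

The first comparison reports a difference and the second reports a match, mirroring the intended workflow: inspect the diff, rerun with `OVERWRITE_FIXTURES=true` if the changes are acceptable, and commit the updated fixtures.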
1 change: 1 addition & 0 deletions test_unstructured_ingest/test-ingest.sh
@@ -19,3 +19,4 @@ export OMP_THREAD_LIMIT=1
./test_unstructured_ingest/test-ingest-local.sh
./test_unstructured_ingest/test-ingest-slack.sh
./test_unstructured_ingest/test-ingest-against-api.sh
./test_unstructured_ingest/test-ingest-google-cloud.sh