-
Notifications
You must be signed in to change notification settings - Fork 573
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add google cloud storage connector #746
Conversation
oh, also, some additions to the base class came in very recently. these will be really important for future cases where we need to be thoughtful about how/when things get reprocessed as well as giving users more access to that information. we haven't added support in any of the connectors yet so there aren't good references, but want to blaze the trail on adding these? Specifically, properties supporting: date_modified, date_processed, exists, record_locator, source_url, and version? |
I can take a look at these. But I want to finish this PR first. |
#!/usr/bin/env bash | ||
|
||
# Processes several files in a nested folder structure from gs://unstructured_public/ | ||
# through Unstructured's library in 2 processes. | ||
|
||
# Structured outputs are stored in gcs-output/ | ||
|
||
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd ) | ||
cd "$SCRIPT_DIR"/../../.. || exit 1 | ||
|
||
PYTHONPATH=. ./unstructured/ingest/main.py \ | ||
--remote-url gs://unstructured_public/ \ | ||
--structured-output-dir gcs-output \ | ||
--num-processes 2 \ | ||
--recursive \ | ||
--verbose |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed up a gcs test bucket with private and public paths, we should update to use that. I'll follow up with details directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's use gs://utic-test-ingest-fixtures
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would require the user to have authentication token. Is that what we really want? Or do we want a no auth example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh sorry, this is for the example. yes, no auth. I'll see if I can set up a public bucket for this.
unstructured/ingest/main.py
Outdated
@click.option( | ||
"--gcs-token", | ||
default=None, | ||
help="Token used to access Google Cloud. By default shouldn't be needed.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than "By default shouldn't be needed" might drop more info that you had in comment below. around using default gcloud credentials - attempt to get credentials from the google metadata service - fall back to anonymous access.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
PYTHONPATH=. ./unstructured/ingest/main.py \ | ||
--metadata-exclude filename,file_directory,metadata.data_source.date_processed \ | ||
--remote-url gs://unstructured_public/ \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's use this test folder: gs://utic-test-ingest-fixtures
it has two folders public
and private
, where private will require that the auth token is used. hopefully can leverage what was done in the google drive test that I just added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worked. But had to cat
the cred info into the tempfile instead of echo
is that ok? Could be an OSX bash thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh, weird. yea, as long as it works locally and in CI, that works for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking at CI, it doesn't look like linux is happy about the cat
change....taking a look
cat: '***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n''***'$'\n': File name too long
https://github.com/Unstructured-IO/unstructured/actions/runs/5330890729/jobs/9658103678
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, right, I see the confusion here. (should have noticed before). I think this wasn't working for you locally because you were setting GCP_INGEST_SERVICE_KEY to a filepath. GCP_INGEST_SERVICE_KEY is the actual key itself. This should works with echo on any platform.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a few comments about switching over the test source and a nit about a description, but otherwise this is looking great!
bump expected output
update expected files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed some changes to support updated fixture location, test key handling, expected output, and addressed some bugs.
Everything looks good!
Closes 301
Note: properties supporting date_modified, date_processed, exists, record_locator, source_url, and version will be handled in a follow-on PR.