Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add google cloud storage connector #746

Merged
merged 38 commits into from
Jun 21, 2023

Conversation

potter-potter
Copy link
Contributor

@potter-potter potter-potter commented Jun 14, 2023

Closes 301

  • Adds google cloud storage connector
  • Adds recursive folder walking for gcs, s3 and azure
  • Adds ingest test for google cloud storage

Note: properties supporting date_modified, date_processed, exists, record_locator, source_url, and version will be handled in a follow-on PR.

@ryannikolaidis
Copy link
Contributor

ryannikolaidis commented Jun 14, 2023

oh, also, some additions to the base class came in very recently. these will be really important for future cases where we need to be thoughtful about how/when things get reprocessed as well as giving users more access to that information.

we haven't added support in any of the connectors yet so there aren't good references, but want to blaze the trail on adding these? Specifically, properties supporting: date_modified, date_processed, exists, record_locator, source_url, and version?

@potter-potter
Copy link
Contributor Author

oh, also, some additions to the base class came in very recently. these will be really important for future cases where we need to be thoughtful about how/when things get reprocessed as well as giving users more access to that information.

we haven't added support in any of the connectors yet so there aren't good references, but want to blaze the trail on adding these? Specifically, properties supporting: date_modified, date_processed, exists, record_locator, source_url, and version?

I can take a look at these. But I want to finish this PR first.

Comment on lines 1 to 16
#!/usr/bin/env bash

# Processes several files in a nested folder structure from gs://unstructured_public/
# through Unstructured's library in 2 processes.

# Structured outputs are stored in gcs-output/

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
cd "$SCRIPT_DIR"/../../.. || exit 1

PYTHONPATH=. ./unstructured/ingest/main.py \
--remote-url gs://unstructured_public/ \
--structured-output-dir gcs-output \
--num-processes 2 \
--recursive \
--verbose
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed up a gcs test bucket with private and public paths, we should update to use that. I'll follow up with details directly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's use gs://utic-test-ingest-fixtures

Copy link
Contributor Author

@potter-potter potter-potter Jun 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would require the user to have authentication token. Is that what we really want? Or do we want a no auth example?

Copy link
Contributor

@ryannikolaidis ryannikolaidis Jun 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh sorry, this is for the example. yes, no auth. I'll see if I can set up a public bucket for this.

@click.option(
"--gcs-token",
default=None,
help="Token used to access Google Cloud. By default shouldn't be needed.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than "By default shouldn't be needed" might drop more info that you had in comment below. around using default gcloud credentials - attempt to get credentials from the google metadata service - fall back to anonymous access.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

Comment on lines 14 to 16
PYTHONPATH=. ./unstructured/ingest/main.py \
--metadata-exclude filename,file_directory,metadata.data_source.date_processed \
--remote-url gs://unstructured_public/ \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's use this test folder: gs://utic-test-ingest-fixtures
it has two folders public and private, where private will require that the auth token is used. hopefully can leverage what was done in the google drive test that I just added?

Copy link
Contributor Author

@potter-potter potter-potter Jun 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worked. But had to cat the cred info into the tempfile instead of echo is that ok? Could be an OSX bash thing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh, weird. yea, as long as it works locally and in CI, that works for me.

Copy link
Contributor

@ryannikolaidis ryannikolaidis Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking at CI, it doesn't look like linux is happy about the cat change....taking a look

cat: '***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n'' ***'$'\n''***'$'\n': File name too long

https://github.com/Unstructured-IO/unstructured/actions/runs/5330890729/jobs/9658103678

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, right, I see the confusion here. (should have noticed before). I think this wasn't working for you locally because you were setting GCP_INGEST_SERVICE_KEY to a filepath. GCP_INGEST_SERVICE_KEY is the actual key itself. This should works with echo on any platform.

Copy link
Contributor

@ryannikolaidis ryannikolaidis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a few comments about switching over the test source and a nit about a description, but otherwise this is looking great!

@ryannikolaidis ryannikolaidis changed the title Potter/google cloud storage feat: add google cloud storage connector Jun 21, 2023
Copy link
Contributor

@ryannikolaidis ryannikolaidis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed some changes to support updated fixture location, test key handling, expected output, and addressed some bugs.

Everything looks good!

@ryannikolaidis ryannikolaidis merged commit 3b472cb into main Jun 21, 2023
22 checks passed
@ryannikolaidis ryannikolaidis deleted the potter/google-cloud-storage branch June 21, 2023 22:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Connector for Google Cloud Storage
3 participants