Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connector for Google Drive #294

Merged
merged 18 commits into from
Mar 7, 2023

Conversation

HAKSOAT
Copy link
Contributor

@HAKSOAT HAKSOAT commented Feb 27, 2023

Implements enhancement #244

@HAKSOAT HAKSOAT marked this pull request as ready for review March 3, 2023 00:12
@HAKSOAT
Copy link
Contributor Author

HAKSOAT commented Mar 3, 2023

Hi, @cragwolfe I'd appreciate it if a review could begin on my PR.

Everything works as expected from running it, but I am looking to do two more things:

Integrate Google's Auth ID (so files can be downloaded without making them available for everyone to view) and find a way to allow downloads on public files without a token.

I'm not sure why the CI CD checks are not running at the moment.

No tests at the moment since API keys are needed (I'm not sure how to go about creating a test for that without exposing keys).

@cragwolfe
Copy link
Contributor

cragwolfe commented Mar 3, 2023

Hi, @cragwolfe I'd appreciate it if a review could begin on my PR.

Hi @HAKSOAT , happy to! Great to hear you got this working.

Everything works as expected from running it, but I am looking to do two more things:

Integrate Google's Auth ID (so files can be downloaded without making them available for everyone to view) and find a way to allow downloads on public files without a token.

Does that mean following the credentials.json flow described here: https://developers.google.com/drive/api/quickstart/python ?

I'm not sure why the CI CD checks are not running at the moment.

I think it was mostly the gihub outage. I just merged into Main to help trigger that (hope that's OK!).

No tests at the moment since API keys are needed (I'm not sure how to go about creating a test for that without exposing keys).

How does one go about getting an API Key?

A few more things to address:

with open(self.filename, "wb") as handler:
handler.write(file.getbuffer())
if self.config.verbose:
print(f"File downloaded: {self.filename}.")
Copy link
Contributor

@cragwolfe cragwolfe Mar 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would be the else condition (i.e. if not file)? should we print a warning if unexpected? (someday soon this would be "log a warning" but logging in multiprocessing is not working yet).

@HAKSOAT
Copy link
Contributor Author

HAKSOAT commented Mar 3, 2023

Hi @cragwolfe

Thanks for the review. I'll go through your comments now.

Does that mean following the credentials.json flow described here: https://developers.google.com/drive/api/quickstart/python ?

More like here:

https://developers.google.com/workspace/guides/create-credentials

I believe a service account is better as it is made for robot accounts. The reason is that the regular API key gets throttled as I experienced an error after a couple of requests that says:

We're sorry...

... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now.

See Google Help for more information.

@HAKSOAT
Copy link
Contributor Author

HAKSOAT commented Mar 4, 2023

Hi @cragwolfe

I think I just hit a roadblock for service accounts. Looks like there's an issue with using the service account approach with multiprocessing.

Traceback (most recent call last):
  File "/mnt/c/Users/HAKS/dev/unstructured/./unstructured/ingest/main.py", line 362, in <module>
    main()
  File "/home/haks/Envs/unstructured/lib/python3.9/site-packages/click/core.py", line 1126, in __call__
    return self.main(*args, **kwargs)
  File "/home/haks/Envs/unstructured/lib/python3.9/site-packages/click/core.py", line 1051, in main
    rv = self.invoke(ctx)
  File "/home/haks/Envs/unstructured/lib/python3.9/site-packages/click/core.py", line 1393, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/haks/Envs/unstructured/lib/python3.9/site-packages/click/core.py", line 752, in invoke
    return __callback(*args, **kwargs)
  File "/mnt/c/Users/HAKS/dev/unstructured/./unstructured/ingest/main.py", line 353, in main
    MainProcess(
  File "/mnt/c/Users/HAKS/dev/unstructured/./unstructured/ingest/main.py", line 73, in run
    results = pool.map(self.doc_processor_fn, docs)  # noqa: F841
  File "/home/haks/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/haks/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
  File "/home/haks/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/pool.py", line 537, in _handle_tasks
    put(task)
  File "/home/haks/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/haks/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
TypeError: cannot pickle '_cffi_backend.FFI' object

The traceback isn't so helpful either. Just putting this here to serve as an update as I have tried to find a fix but I am yet to find one. I will take another look in the morning here.

@cragwolfe
Copy link
Contributor

I think I just hit a roadblock for service accounts. Looks like there's an issue with using the service account approach with multiprocessing.

Yes, looks that way. A way around that is to remove self.service from the config object, and only create the service object on demand (it looks like the api_key, and drive_id are sufficient to build the service object).

@HAKSOAT
Copy link
Contributor Author

HAKSOAT commented Mar 5, 2023

Yes, looks that way. A way around that is to remove self.service from the config object, and only create the service object on demand (it looks like the api_key, and drive_id are sufficient to build the service object).

Thanks for the hint @cragwolfe It works now. I think a second review will be great so I can fix any other issue.

@HAKSOAT
Copy link
Contributor Author

HAKSOAT commented Mar 7, 2023

Hi @cragwolfe please can you help take a look at this when chanced?

@cragwolfe
Copy link
Contributor

Hi @cragwolfe please can you help take a look at this when chanced?
For sure, on my list for today!

Copy link
Contributor

@cragwolfe cragwolfe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works very nicely! 🚀

Though, I did get tripped up by not reading the Note in ingest.sh carefully enough the first time:

# NOTE: Using the Service Account key only works when the file or folder
# is shared atleast with permission for "Anyone with the link" to view
# OR the email address for the service account is given access to the file
# or folder.

service account emails do not look like user emails.

@cragwolfe cragwolfe enabled auto-merge (squash) March 7, 2023 05:52
@cragwolfe cragwolfe merged commit 4117f57 into Unstructured-IO:main Mar 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants