
Google drive: Use smart_open library #31866

Closed
wants to merge 39 commits

Conversation

flash1293
Contributor

Based on #31458

This PR uses the smart_open library instead of the native Google SDK helpers to download the file.
The advantage is the ability to read the file incrementally (a feature of smart_open); the downside is that it requires using undocumented parts of the Google library to put together the right headers for the request.
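
For illustration, here is a minimal sketch of the idea (not the actual connector code): it assumes smart_open's HTTP transport accepts a `headers` transport parameter and that a refreshed google-auth bearer token is enough to authorize the Drive `alt=media` download endpoint. The file ID and service-account path are placeholders.

```python
# Illustrative sketch only, not the connector's implementation.
# Assumes smart_open's HTTP transport accepts a `headers` transport param
# and that a refreshed google-auth access token authorizes the Drive
# files.get?alt=media endpoint.
from google.auth.transport.requests import Request
from google.oauth2 import service_account
from smart_open import open as smart_open_open

FILE_ID = "<drive-file-id>"  # placeholder

credentials = service_account.Credentials.from_service_account_file(
    "service_account.json",  # placeholder path
    scopes=["https://www.googleapis.com/auth/drive.readonly"],
)
credentials.refresh(Request())  # fetch a fresh access token

url = f"https://www.googleapis.com/drive/v3/files/{FILE_ID}?alt=media"
with smart_open_open(
    url,
    "rb",
    transport_params={"headers": {"Authorization": f"Bearer {credentials.token}"}},
) as remote_file:
    for chunk in iter(lambda: remote_file.read(1024 * 1024), b""):
        pass  # process the file incrementally, one 1 MiB chunk at a time
```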


@github-actions
Contributor

Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉

To merge this PR, ensure the following has been done/considered for each connector added or updated:

  • PR name follows PR naming conventions
  • Breaking changes are considered. If a Breaking Change is being introduced, ensure an Airbyte engineer has created a Breaking Change Plan.
  • Connector version has been incremented in the Dockerfile and metadata.yaml according to our Semantic Versioning for Connectors guidelines
  • You've updated the connector's metadata.yaml file with any other relevant changes, including a breakingChanges entry for major version bumps. See metadata.yaml docs
  • Secrets in the connector's spec are annotated with airbyte_secret
  • All documentation files are up to date. (README.md, bootstrap.md, docs.md, etc...)
  • Changelog updated in docs/integrations/<source or destination>/<name>.md with an entry for the new version. See changelog example
  • Migration guide updated in docs/integrations/<source or destination>/<name>-migrations.md with an entry for the new version, if the version is a breaking change. See migration guide example
  • If set, you've ensured the icon is present in the platform-internal repo. (Docs)

If the checklist is complete but the CI check is failing:

  1. Check for hidden checklists in your PR description

  2. Toggle the github label checklist-action-run on/off to re-run the checklist CI.

@aaronsteers
Collaborator

I'm a fan of smart_open, so overall I'd be inclined toward this. But for context, don't we need to read the full file anyway? I don't see yet why reading incrementally is helpful for our use cases. Can you clarify for me?

@flash1293
Contributor Author

I don't see yet why reading incrementally is helpful for our use cases. Can you clarify for me?

cc @clnoll in case I'm missing something here, but reading incrementally allows us to process huge files that wouldn't fit into memory at once (like a multi-GB CSV file). I agree that for the use cases we imagined (reading a large number of smaller files with a little bit of text in them), it doesn't matter too much, which is why I split it out and don't see it as a blocker.
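
To illustrate what incremental reading buys us (a hypothetical sketch, not connector code): given any streaming file-like object, such as one returned by smart_open, a CSV can be parsed row by row so memory stays bounded regardless of file size. `open_stream` below is a stand-in for whatever produces that remote file object.

```python
# Hypothetical illustration: parse a huge CSV one record at a time from a
# streaming binary file object, without loading the whole file into memory.
import csv
import io

def iter_records(binary_stream):
    # Wrap the binary stream so the csv module can read text lazily.
    text_stream = io.TextIOWrapper(binary_stream, encoding="utf-8")
    yield from csv.DictReader(text_stream)

# Usage (open_stream is a placeholder for the smart_open-based reader):
# for record in iter_records(open_stream("multi_gb_file.csv")):
#     process(record)
```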

@aaronsteers
Collaborator

I don't see yet why reading incrementally is helpful for our use cases. Can you clarify for me?

cc @clnoll in case I'm missing something here, but reading incrementally allows us to process huge files that wouldn't fit into memory at once (like a multi-GB CSV file). I agree that for the use cases we imagined (reading a large number of smaller files with a little bit of text in them), it doesn't matter too much, which is why I split it out and don't see it as a blocker.

I see! I just wasn't thinking about file types that emit multiple records per file. For document-type files, we generally have to read the whole thing into memory anyway to get the (singular) record data, but yes, 100% - I agree that with CSVs and any source that can send a record at a time, we definitely should parse it serially rather than all at once.

So, yes, I'm in favor of this approach for the reasons you mention. 👍 Thanks!

Base automatically changed from flash1293/source-google-drive to master November 2, 2023 13:28
@flash1293 flash1293 closed this Nov 8, 2023
@clnoll
Contributor

clnoll commented Nov 8, 2023

@flash1293 just curious - are you no longer planning to use smart_open for google drive?

@flash1293
Contributor Author

Still planning to do this, but I ran into some issues and had to switch to other things. I created an issue to track this work here: https://github.com/airbytehq/airbyte-internal-issues/issues/2599
