Google drive: Use smart_open library #31866
…airbyte into flash1293/source-google-drive
I'm a fan of |
cc @clnoll in case I'm missing something here, but reading incrementally allows us to process huge files that wouldn't fit into memory at once (like a multi-GB CSV file). I agree that for the use cases we imagined (reading a large number of smaller files with a little bit of text in them), it doesn't matter too much. That's why I split it out and don't see it as a blocker.
I see! I just wasn't thinking about file types that emit multiple records per file. For document-type files, we generally have to read the whole thing into memory anyway to get the (singular) record data, but yes, 100%: I agree that with CSVs and any source that can send a record at a time, we definitely should parse it serially rather than all at once. So, yes, I'm in favor of this approach for the reasons you mention. 👍 Thanks!
@flash1293 just curious - are you no longer planning to use |
Still planning to do this, but I ran into some issues and had to switch to other things. I created an issue to track this work here: https://github.com/airbytehq/airbyte-internal-issues/issues/2599
Based on #31458
This PR uses the smart_open library instead of the native Google SDK helpers to download the file.
The advantage is the ability to read the file incrementally (a feature of smart_open). The downside is that it requires using undocumented parts of the Google library to put together the right headers for the request.
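
To illustrate why incremental reading matters, here is a minimal sketch of parsing CSV records from a binary stream one row at a time. It is not the code from this PR: `iter_csv_records` is a hypothetical helper, and an in-memory `io.BytesIO` stands in for the remote file so the example runs with the standard library only. Any file-like object opened in binary mode, such as one returned by smart_open, should be usable in its place.

```python
import csv
import io
from typing import BinaryIO, Dict, Iterator


def iter_csv_records(stream: BinaryIO, encoding: str = "utf-8") -> Iterator[Dict[str, str]]:
    """Yield one record per CSV row without loading the whole file into memory.

    Hypothetical helper for illustration; not part of the connector.
    """
    # TextIOWrapper decodes lazily, pulling bytes from the stream in chunks,
    # so memory use depends on the row size rather than on the file size.
    text = io.TextIOWrapper(stream, encoding=encoding, newline="")
    for row in csv.DictReader(text):
        yield row


# Stand-in for a remote file. With smart_open, the stream would presumably
# come from smart_open.open(url, "rb", transport_params={"headers": ...});
# how the auth headers are built is the undocumented part and is not shown.
sample = io.BytesIO(b"id,name\n1,alpha\n2,beta\n")
records = list(iter_csv_records(sample))
```

Because the helper is a generator, a caller can emit each record as soon as it is parsed. Collecting everything into a list, as done here for demonstration, would give up the memory benefit on a large file.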