
Google drive: Use smart_open library #31866

Closed
wants to merge 39 commits

Conversation

flash1293
Contributor

Based on #31458

This PR uses the smart_open library instead of the native Google SDK helpers to download the file.
The advantage is the ability to read the file incrementally (a feature of smart_open); the downside is that it requires using undocumented parts of the Google library to put together the right headers for the request.
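
For illustration, here is a minimal sketch of the idea (not the actual connector code): it assumes smart_open's HTTP transport accepts a `headers` transport parameter and that a refreshed google-auth bearer token is enough to authorize the Drive `alt=media` download endpoint. The file ID and service-account path are placeholders.

```python
# Illustrative sketch only, not the connector's implementation.
# Assumes smart_open's HTTP transport accepts a `headers` transport param
# and that a refreshed google-auth access token authorizes the Drive
# files.get?alt=media endpoint.
from google.auth.transport.requests import Request
from google.oauth2 import service_account
from smart_open import open as smart_open_open

FILE_ID = "<drive-file-id>"  # placeholder

credentials = service_account.Credentials.from_service_account_file(
    "service_account.json",  # placeholder path
    scopes=["https://www.googleapis.com/auth/drive.readonly"],
)
credentials.refresh(Request())  # fetch a fresh access token

url = f"https://www.googleapis.com/drive/v3/files/{FILE_ID}?alt=media"
with smart_open_open(
    url,
    "rb",
    transport_params={"headers": {"Authorization": f"Bearer {credentials.token}"}},
) as remote_file:
    for chunk in iter(lambda: remote_file.read(1024 * 1024), b""):
        pass  # process the file incrementally, one 1 MiB chunk at a time
```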


@github-actions
Contributor

Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉

To merge this PR, ensure the following has been done/considered for each connector added or updated:

  • PR name follows PR naming conventions
  • Breaking changes are considered. If a Breaking Change is being introduced, ensure an Airbyte engineer has created a Breaking Change Plan.
  • Connector version has been incremented in the Dockerfile and metadata.yaml according to our Semantic Versioning for Connectors guidelines
  • You've updated the connector's metadata.yaml file with any other relevant changes, including a breakingChanges entry for major version bumps. See metadata.yaml docs
  • Secrets in the connector's spec are annotated with airbyte_secret
  • All documentation files are up to date. (README.md, bootstrap.md, docs.md, etc...)
  • Changelog updated in docs/integrations/<source or destination>/<name>.md with an entry for the new version. See changelog example
  • Migration guide updated in docs/integrations/<source or destination>/<name>-migrations.md with an entry for the new version, if the version is a breaking change. See migration guide example
  • If set, you've ensured the icon is present in the platform-internal repo. (Docs)

If the checklist is complete but the CI check is failing:

  1. Check for hidden checklists in your PR description

  2. Toggle the github label checklist-action-run on/off to re-run the checklist CI.

@aaronsteers
Collaborator

I'm a fan of smart_open, so overall I'd be inclined toward this. But for context, don't we need to read the full file anyway? I don't see yet why reading incrementally is helpful for our use cases. Can you clarify for me?

@flash1293
Contributor Author

I don't see yet why reading incrementally is helpful for our use cases. Can you clarify for me?

cc @clnoll in case I'm missing something here, but reading incrementally allows us to process huge files that wouldn't fit into memory at once (like a multi-GB CSV file). I agree that for the use cases we imagined (reading a large number of smaller files with a little bit of text in them), it doesn't matter too much, which is why I split it out and don't see it as a blocker.
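
To illustrate what incremental reading buys us (a hypothetical sketch, not connector code): given any streaming file-like object, such as one returned by smart_open, a CSV can be parsed row by row so memory stays bounded regardless of file size. `open_stream` below is a stand-in for whatever produces that remote file object.

```python
# Hypothetical illustration: parse a huge CSV one record at a time from a
# streaming binary file object, without loading the whole file into memory.
import csv
import io

def iter_records(binary_stream):
    # Wrap the binary stream so the csv module can read text lazily.
    text_stream = io.TextIOWrapper(binary_stream, encoding="utf-8")
    yield from csv.DictReader(text_stream)

# Usage (open_stream is a placeholder for the smart_open-based reader):
# for record in iter_records(open_stream("multi_gb_file.csv")):
#     process(record)
```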

@aaronsteers
Collaborator

I don't see yet why reading incrementally is helpful for our use cases. Can you clarify for me?

cc @clnoll in case I'm missing something here, but reading incrementally allows us to process huge files that wouldn't fit into memory at once (like a multi-GB CSV file). I agree that for the use cases we imagined (reading a large number of smaller files with a little bit of text in them), it doesn't matter too much, which is why I split it out and don't see it as a blocker.

I see! I just wasn't thinking about file types that emit multiple records per file. For document-type files, we generally have to read the whole thing into memory anyway to get the (singular) record data, but yes, 100% - I agree that with CSVs and any source that can send a record at a time, we definitely should parse it serially rather than all at once.

So, yes, I'm in favor of this approach for the reasons you mention. 👍 Thanks!

Base automatically changed from flash1293/source-google-drive to master November 2, 2023 13:28
@flash1293 flash1293 closed this Nov 8, 2023
@clnoll
Contributor

clnoll commented Nov 8, 2023

@flash1293 just curious - are you no longer planning to use smart_open for google drive?

@flash1293
Contributor Author

Still planning to do this, but I ran into some issues and had to switch to other things. I created an issue to track this work here: https://github.com/airbytehq/airbyte-internal-issues/issues/2599
