Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🎉 Airbyte CDK (File-based CDK): Stop the sync if the record could not be parsed #32589

Merged
merged 28 commits into from
Jan 11, 2024

Conversation

bazarnov
Copy link
Collaborator

@bazarnov bazarnov commented Nov 16, 2023

What

Resolving: #31605

How

  • Added logic to raise the RecordParseError to stream.default_file_based_stream.
  • Added new class FileBasedErrorCollector to collect the parse_errors and yield them after all the streams are read + raised the AirbyteTracebackException to highlight the issues with the sync, instead of silently succeed with a bunch of logged errors.
  • Corrected existing unit tests to cover the case.

🚨 User Impact 🚨

Some of the users, that have their sync succeeded so far, will hit the error that should make them act to fix their files on their end or change the file input to continue syncing.
This is made on purpose as mentioned in this comment:
#31605 (comment)

Copy link

vercel bot commented Nov 16, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Visit Preview Jan 11, 2024 7:07pm

@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Nov 16, 2023
@bazarnov bazarnov marked this pull request as ready for review November 20, 2023 10:12
@bazarnov bazarnov requested a review from a team as a code owner November 20, 2023 10:12
@bazarnov bazarnov requested review from girarda and a team November 20, 2023 10:12
@bazarnov bazarnov removed the request for review from a team November 20, 2023 11:34
@girarda girarda requested a review from clnoll November 21, 2023 05:12
@bazarnov
Copy link
Collaborator Author

bazarnov commented Nov 28, 2023

@clnoll
I'll get back to this after this one: #32345 has another pre-release version with BULK migration.

cc @oustynova @lazebnyi

Copy link
Contributor

@clnoll clnoll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, just requesting a few small changes @bazarnov.

@bazarnov bazarnov requested a review from clnoll January 3, 2024 11:15
@clnoll
Copy link
Contributor

clnoll commented Jan 4, 2024

Thanks @bazarnov this is getting really close! In addition to the new comments, can you also upgrade the S3 connector (and other file-based connectors if it's not too much trouble) to use the version of the CDK from this branch, and verify that the CATs pass?

@bazarnov
Copy link
Collaborator Author

bazarnov commented Jan 9, 2024

Thanks @bazarnov this is getting really close! In addition to the new comments, can you also upgrade the S3 connector (and other file-based connectors if it's not too much trouble) to use the version of the CDK from this branch, and verify that the CATs pass?

I'll do my best to test the source-s3 using this branch's updated CDK version. All the dependent sources should be updated in the separate PR respectively.

@bazarnov
Copy link
Collaborator Author

bazarnov commented Jan 9, 2024

Thanks @bazarnov this is getting really close! In addition to the new comments, can you also upgrade the S3 connector (and other file-based connectors if it's not too much trouble) to use the version of the CDK from this branch, and verify that the CATs pass?

The CATs have passed successfully using the CDK 0.58.4 (the version introduced in this PR) build for source-s3. Tested locally with:

airbyte-ci connectors --use-local-cdk --name=source-s3 test

I''ll create the follow-up issue to upgrade the source-s3 and other ones (File-based dependent), after this one is merged. Thanks!

@bazarnov
Copy link
Collaborator Author

bazarnov commented Jan 9, 2024

We have a File-Based dependent:

  • source-s3
  • source-azure-blob-storage
  • source-gcs
  • source-google-drive
  • source-microsoft-onedrive

Which of these should be updated? @clnoll

@lazebnyi lazebnyi linked an issue Jan 11, 2024 that may be closed by this pull request
3 tasks
@bazarnov bazarnov requested a review from clnoll January 11, 2024 18:51
Copy link
Contributor

@clnoll clnoll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

@bazarnov bazarnov merged commit cf7f700 into master Jan 11, 2024
21 checks passed
@bazarnov bazarnov deleted the baz/airbyte-cdk/raise-RecordParseError branch January 11, 2024 19:26
@clnoll
Copy link
Contributor

clnoll commented Jan 12, 2024

@bazarnov please start by migrating a couple of the less-used of the connectors (maybe azure-blob-storage first, then gcs) and monitor for errors, before migrating the others.

@lazebnyi
Copy link
Collaborator

@clnoll Thanks for your proposition. We have already generated follow-up issues for all connectors and will include them in upcoming sprints. Thank you.

@clnoll
Copy link
Contributor

clnoll commented Jan 15, 2024

Great, thank you @lazebnyi!

jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024
jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CDK Connector Development Kit team/connectors-python
Projects
No open projects
Status: Implementation in progress
Development

Successfully merging this pull request may close these issues.

🎉 File-based CDK: stop the sync on parse errors
5 participants