AirbyteLib: Fix column count mismatch bug #34783

Merged
merged 13 commits into from
Feb 3, 2024

@aaronsteers aaronsteers commented Feb 2, 2024

This PR resolves a bug in which a fully-null or fully-missing column causes the insert to fail, because the number of named columns in the destination table does not match the number of columns in the parquet files.
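
For illustration only, the failure mode can be reproduced in miniature with the standard-library `sqlite3` module standing in for the real cache backend. The table names (`users`, `staged`) and columns are hypothetical, and the `staged` table is an assumption standing in for a parquet file whose `email` column was never written.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Destination table declares three named columns.
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, email TEXT)")

# Hypothetical stand-in for a parquet file in which 'email' is absent
# because every record omitted it.
conn.execute("CREATE TABLE staged (id INTEGER, name TEXT)")
conn.execute("INSERT INTO staged VALUES (1, 'a')")

# Three named destination columns, but only two source columns: the insert fails.
try:
    conn.execute("INSERT INTO users (id, name, email) SELECT * FROM staged")
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)

# Supplying the missing column explicitly as NULL makes the counts match.
conn.execute("INSERT INTO users (id, name, email) SELECT id, name, NULL FROM staged")
rows = conn.execute("SELECT id, name, email FROM users").fetchall()
```

The fix in this PR takes the second route, but upstream: the missing column is added before the data reaches the insert.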

The implementation required a fair amount of refactoring:

  1. File writers now also get initialized with a catalog manager. This allows the file writer to make decisions based upon the catalog schema. In our case, this means detecting which properties are expected for the given stream so they can be appended.
  2. Source's get_records() implementation needed similar treatment for adding any missing properties. Now both Lazy and Cached datasets will have all top-level keys present, even if the source omits them entirely.
  3. I moved some methods like _get_stream_config() and _get_stream_json_schema() down from the SQLCache class into the underlying RecordProcessors base class, so that FileWriters can share the same code.
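
A minimal sketch of the idea behind item 2, assuming the expected top-level keys are already known from the stream's schema. The helper name `with_all_keys` and its parameters are hypothetical and are not the names used in this PR.

```python
from typing import Any, Dict, Iterable, Iterator, List


def with_all_keys(
    records: Iterable[Dict[str, Any]],
    expected_keys: List[str],
) -> Iterator[Dict[str, Any]]:
    """Yield each record with every expected top-level key present.

    Keys the source omitted are filled with None; keys that are already
    present keep their original values.
    """
    for record in records:
        # Start from an all-None template, then overlay the actual record.
        yield {**{key: None for key in expected_keys}, **record}


# Hypothetical sample data: the second record omits 'email' entirely.
sample = [{"id": 1, "email": "a@example.com"}, {"id": 2}]
filled = list(with_all_keys(sample, ["id", "email"]))
```

Because the helper is a generator, it would work equally for lazily iterated and fully cached record sets.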

Drive-by changes:

  1. source_catalog was deprecated from processors but not removed. Now it is removed.
  2. Catalog managers' initialization is cleaned up. Only streams that are 'incoming' will actually have their catalog metadata written.
  3. "streams with data" has been tightened up somewhat.
  4. We had typed "pa.Table | pa.RecordBatch" in several places, even though the type is always pa.Table. Since only Table has the append_column() method, I've now made typing consistent across the board so there's no confusion about the expected type.
  5. I've broken out the large register_source() method in the catalog manager into smaller methods: _update_catalog() and _save_catalog_to_internal_table().


@aaronsteers aaronsteers marked this pull request as ready for review February 3, 2024 00:26
@aaronsteers aaronsteers merged commit d9b500c into master Feb 3, 2024
19 checks passed
@aaronsteers aaronsteers deleted the aj/airbyte-lib/fix-column-count-mismatch-bug branch February 3, 2024 01:09
jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 21, 2024
jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024