Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: Align internal Airbyte columns with Dv2; Feat: Auto-add missing columns to Cache tables. #144

Merged
merged 31 commits into from Apr 3, 2024

Conversation

aaronsteers
Copy link
Contributor

@aaronsteers aaronsteers commented Mar 27, 2024

Resolves: #14
Resolves: #118

This update does a couple things:

  1. Adds these internal columns to all PyAirbyte datasets and cache tables:
    • _airbyte_emitted_at - The timestamp corresponding to emitted_at in the Airbyte Record message.~~
    • _airbyte_loaded_at - The timestamp representing utcnow() at the time the record is received by PyAirbyte. - removed per comments below
    • _airbyte_meta - For now, defaults to an empty dictionary.
    • _airbyte_raw_id - A unique ID per record, assigned as it is received from source. (Unlike Dv2, there is no raw table but I kept the same column name for consistency.)
  2. Auto-adds any missing columns to cache tables if they are declared in the schema. This including the above-named internal columns.

As of this iteration, PyAirbyte will not...

  • ...will not attempt to expand or change column types if they are incorrect or stale.
  • ...will not attempt to align _airbyte_loaded_at with the batch load time.
    • As noted above, the time is when we first start processing the record. In the future, we can try to set this during load time, but doing so is fragile and may not be possible in a universal/generic manner. (Removed, so no longer relevant.)
  • ...will not try to populate _airbyte_meta with any data.

Tests

  • I've added an integration test across all 4 SQL platforms (DuckDB, Postgres, Snowflake, BigQuery) which confirms that PyAirbyte can correctly add declared columns when they are detected missing.

@aaronsteers aaronsteers linked an issue Mar 27, 2024 that may be closed by this pull request
@aaronsteers aaronsteers changed the title [Draft] Align columns with Dv2 Feat: [Draft] Align columns with Dv2 Mar 27, 2024
@aaronsteers aaronsteers changed the title Feat: [Draft] Align columns with Dv2 Feat: Align columns with Dv2 [Draft] Mar 27, 2024
@aaronsteers

This comment was marked as resolved.

@aaronsteers

This comment was marked as resolved.

@aaronsteers

This comment was marked as resolved.

@aaronsteers

This comment was marked as resolved.

@aaronsteers

This comment was marked as off-topic.

@aaronsteers

This comment was marked as resolved.

@aaronsteers aaronsteers marked this pull request as ready for review April 2, 2024 23:45
@aaronsteers aaronsteers requested a review from jbfbell April 2, 2024 23:54
@aaronsteers
Copy link
Contributor Author

aaronsteers commented Apr 3, 2024

@jbfbell, @evantahler - Cc'ing you here for visibility and to confirm the direction...

This PR gets us closer to compatibility with the Dv2 specs, which I've documented / linked to in the related issue:

A couple callouts/questions:

  1. Should we just forget about _airbyte_loaded_at? We don't use a long-running raw table and Dv2 doesn't include it in its final tables. Initially I wanted to include, but I don't know that it adds enough value to warrant a split.
    • I'd love to see Dv3 adding _airbyte_loaded_at into the final table also, and (as discussed) consider dropping the long-running raw table, but again, probably not a high enough priority, and its easy for us to just ignore this column for the time being.
  2. We are lower-casing all columns and table names as of now, but according to Notion, I think you are preserving case for all except Snowflake - is that correct? (Ignoring DuckDB, the others we have covered here are BigQuery and Postgres.)

An important caveat:

  • We're not expecting that the user will be able to simply "continue" a sync from PyAirbyte in Airbyte Cloud. If they want to use the exact same table names, this will probably mean a 'reset' after migrating their jobs to Cloud. The important thing is that their downstream processing shouldn't have to change, or shouldn't have to change much. (Those downstream transformations and pipelines can represent 10+ or in some cases 100+ hours of effort, so as much as we can avoid disrupting it, the better. 🤞)

Update: I just decided to go ahead and drop the loaded-at column and I've added an implementation of _airbyte_raw_id which can be used as a unique identifier per row/record.

@aaronsteers aaronsteers changed the title Feat: Align columns with Dv2 [Draft] Feat: Align internal Airbyte columns with Dv2; Feat: Auto-add missing columns to Cache tables. Apr 3, 2024
@evantahler
Copy link

evantahler commented Apr 3, 2024

Nice work AJ!

Should we just forget about _airbyte_loaded_at? We don't use a long-running raw table and Dv2 doesn't include it in its final tables. Initially I wanted to include, but I don't know that it adds enough value to warrant a split.

We made a funny choice with Dv2 which tried to balance "give the users lots of data about the sync" with "users seem to be mad when we make too many new columns". That's how we ended up showing extracted_at and not loaded_at in the final table. extracted_at is more logically useful as the source time matters far more for analysis - probably what you are going to use in your analysis. Users can still get to loaded_at via the join to the raw table, but we figured that was rarer.

I'd love to see Dv3 adding _airbyte_loaded_at into the final table also, and (as discussed) consider dropping the long-running raw table, but again, probably not a high enough priority, and its easy for us to just ignore this column for the time being.

So the long-running raw table are getting even more important now for the refresh work, especially as we are merging records across generations, and to power truncating refreshes without data-towntime. I think they are with us for the long haul now. If none of those words made sense (because we made them up recently), check out this draft of the updated user-facing doc and join #1-point-0-refrehes.

We are lower-casing all columns and table names as of now, but according to Notion, I think you are preserving case for all except Snowflake - is that correct? (Ignoring DuckDB, the others we have covered here are BigQuery and Postgres.)

Casse-sensitivity is kind of per-destination now. I'll let @jbfbell take that one, as I don't have the latest info any more.

@aaronsteers
Copy link
Contributor Author

@evantahler - Thanks for the review and for this context.

I'll drop _airbyte_loaded_at from scope.

This PR doesn't touch capitalization rules so I think we're ok taking any needed changes to those in a subsequent pass.

@jbfbell - Let me know if you have any other questions/concerns here. I think we're good to go with the addition of _airbyte_extracted_at, _airbyte_meta (always {} for now) and _airbyte_raw_id.

This PR doesn't dive into what to do with other special characters, but if you can point me at source code or a Notion/Docs page, I'll take on any other name normalization rules in a subsequent PR.

Thanks, both!

@aaronsteers
Copy link
Contributor Author

aaronsteers commented Apr 3, 2024

/fix-pr

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

@aaronsteers
Copy link
Contributor Author

@jbfbell and @bindipankhudi - If one or both of you is available to review/approve tomorrow, I'll go ahead and merge when ready.

All tests are passing now. ✅ 😄

@aaronsteers aaronsteers merged commit 0a20aa2 into main Apr 3, 2024
15 checks passed
@aaronsteers aaronsteers deleted the aj/fix14/align-columns-with-dv2 branch April 3, 2024 18:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ability to add missing columns Align column and table normalization with Dv2 destinations
3 participants