
Add load_id to arrow tables in extract step instead of normalize #1449

Merged: 7 commits into devel on Jun 18, 2024

Conversation

@steinitzu (Collaborator) commented Jun 7, 2024

Description

  • Add _dlt_load_id to arrow tables in extract step before writing the file to disk (a minimal sketch of the idea is included at the end of this description).
  • Remove the load_id adding logic from normalize

Should be backwards compatible, aside from the edge case where you upgrade dlt in between extract and normalize

Will add/check tests
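
A minimal sketch of the idea, using pyarrow directly; the helper name and the literal load_id value are illustrative, not the actual dlt code:

```python
import pyarrow as pa


def append_load_id(table: pa.Table, load_id: str) -> pa.Table:
    # Illustrative helper: add a constant `_dlt_load_id` column as the last
    # column of the arrow table, one value per row.
    column = pa.array([load_id] * table.num_rows, type=pa.string())
    return table.append_column("_dlt_load_id", column)


# The extract step would do something like this right before the parquet
# file is written to disk (the load_id value here is a placeholder):
table = pa.table({"value": [1, 2, 3]})
table = append_load_id(table, "1718000000.000000")
assert table.schema.names[-1] == "_dlt_load_id"
```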

Related Issues

Step 1 of #1317

Additional Context

netlify bot commented Jun 7, 2024

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: 0d4347a
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/6671fc488b59f000080c58ca

Comment on lines 232 to 236
# Inject the parts of normalize configuration that are used here
@with_config(
spec=ItemsNormalizerConfiguration, sections=(known_sections.NORMALIZE, "parquet_normalizer")
)
def __init__(self, *args: Any, add_dlt_load_id: bool = False, **kwargs: Any) -> None:
@steinitzu (Collaborator, Author) commented Jun 7, 2024

This was my solution to support the old normalize config.
I think it makes sense to keep it and have it consistent with the object normalizer config, rather than moving it to the extract section.

Collaborator

hmmm idea is very good... but maybe you could decorate a method in this class, not __init__? So you call it just to retrieve the config.
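
A rough sketch of that suggestion, reusing the names from the snippet above and assuming @with_config can wrap a plain method the same way it wraps __init__ (class and method names are illustrative):

```python
class ArrowNormalizer:
    # Sketch: keep __init__ undecorated and resolve the setting through a
    # decorated helper that exists only to retrieve the injected config value.
    @with_config(
        spec=ItemsNormalizerConfiguration,
        sections=(known_sections.NORMALIZE, "parquet_normalizer"),
    )
    def _get_add_dlt_load_id(self, add_dlt_load_id: bool = False) -> bool:
        # @with_config injects the configured value; the method just returns it.
        return add_dlt_load_id

    def __init__(self, *args: Any, **kwargs: Any) -> None:
        self.add_dlt_load_id = self._get_add_dlt_load_id()
```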

@rudolfix (Collaborator) left a comment

LGTM!

I think we're missing some tests. There's surely already a test that checks if load_id is added. However, now we need to test two things:

  1. adding load_id in extract + sensitivity to your config "hack" - just pipeline.extract() an arrow table and see if the extracted parquet has load_id (a rough sketch follows this list)
  2. load some json data and request parquet in the normalize stage - this will create a parquet file in the normalizer and we test if load_id is added there
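
A rough sketch of what test 1 might look like; the resource name, pipeline name, and the working-directory layout are assumptions, not the actual test code:

```python
import glob
import os

import dlt
import pyarrow as pa
import pyarrow.parquet as pq


def test_load_id_added_in_extract() -> None:
    @dlt.resource
    def arrow_items():
        yield pa.table({"value": [1, 2, 3]})

    pipeline = dlt.pipeline(pipeline_name="load_id_extract_test", destination="duckdb")
    pipeline.extract(arrow_items())

    # Find the parquet file(s) produced by extract in the pipeline working
    # directory (the exact layout is an assumption here) and check that the
    # load id column is already present before normalize runs.
    parquet_files = glob.glob(
        os.path.join(pipeline.working_dir, "**", "*.parquet"), recursive=True
    )
    assert parquet_files
    for path in parquet_files:
        assert "_dlt_load_id" in pq.read_table(path).schema.names
```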

@rudolfix (Collaborator) left a comment

@steinitzu thx for the tests!
still a few issues with normalizing column names. pls check

Resolved (outdated) review threads:
  • dlt/common/libs/pyarrow.py
  • dlt/common/libs/pyarrow.py
  • dlt/extract/extractors.py
@rudolfix rudolfix marked this pull request as ready for review June 13, 2024 22:06
@steinitzu (Collaborator, Author)
Updated with normalized identifiers :)

@rudolfix (Collaborator) left a comment

@steinitzu looks good! but new test is not passing. IMO some column mismatch?

@steinitzu (Collaborator, Author)

> @steinitzu looks good! but new test is not passing. IMO some column mismatch?

@rudolfix this is fixed by b1be9c9
The issue was that schema.update_table was adding the column at the front. I'm not sure if it's right to use that method here since it creates the table in schema and that should be left for after all columns are computed?
Added check for _dlt_load_id being last in common tests now too.
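
For reference, the kind of assertion that check boils down to (the file name is a placeholder):

```python
import pyarrow.parquet as pq

# The load id column should be appended as the last column of the extracted
# parquet schema, not inserted at the front.
schema = pq.read_table("extracted_items.parquet").schema
assert schema.names[-1] == "_dlt_load_id"
```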

@rudolfix (Collaborator)

> @steinitzu looks good! but new test is not passing. IMO some column mismatch?
>
> @rudolfix this is fixed by b1be9c9 The issue was that schema.update_table was adding the column at the front. I'm not sure if it's right to use that method here since it creates the table in schema and that should be left for after all columns are computed? Added check for _dlt_load_id being last in common tests now too.

This is really weird. I added one more test and the columns are added at the end. The current version is btw better: update_table is done at the end to modify the actual schema.

@rudolfix rudolfix merged commit b267c70 into devel Jun 18, 2024
46 of 48 checks passed
@rudolfix rudolfix deleted the write-load-id-in-extract branch June 18, 2024 21:30