
Add load_id to arrow tables in extract step instead of normalize #1449

Merged: 7 commits into devel on Jun 18, 2024

Conversation

@steinitzu (Collaborator) commented Jun 7, 2024

Description

  • Add _dlt_load_id to arrow tables in extract step before writing the file to disk (a minimal sketch of the idea is included at the end of this description).
  • Remove the load_id adding logic from normalize

Should be backwards compatible, aside from the edge case where you upgrade dlt in between extract and normalize

Will add/check tests
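
A minimal sketch of the idea, using pyarrow directly; the helper name and the literal load_id value are illustrative, not the actual dlt code:

```python
import pyarrow as pa


def append_load_id(table: pa.Table, load_id: str) -> pa.Table:
    # Illustrative helper: add a constant `_dlt_load_id` column as the last
    # column of the arrow table, one value per row.
    column = pa.array([load_id] * table.num_rows, type=pa.string())
    return table.append_column("_dlt_load_id", column)


# The extract step would do something like this right before the parquet
# file is written to disk (the load_id value here is a placeholder):
table = pa.table({"value": [1, 2, 3]})
table = append_load_id(table, "1718000000.000000")
assert table.schema.names[-1] == "_dlt_load_id"
```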

Related Issues

Step 1 of #1317

Additional Context

netlify bot commented Jun 7, 2024

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: 0d4347a
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/6671fc488b59f000080c58ca

Comment on lines 232 to 236
# Inject the parts of normalize configuration that are used here
@with_config(
spec=ItemsNormalizerConfiguration, sections=(known_sections.NORMALIZE, "parquet_normalizer")
)
def __init__(self, *args: Any, add_dlt_load_id: bool = False, **kwargs: Any) -> None:
@steinitzu (Collaborator, Author) commented Jun 7, 2024

This was my solution to support the old normalize config.
I think it makes sense to keep it and have it consistent with the object normalizer config, rather than moving it to the extract section.

Collaborator

hmmm idea is very good... but maybe you could decorate a method in this class, not __init__? So you call it just to retrieve the config.
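
A rough sketch of that suggestion, reusing the names from the snippet above and assuming @with_config can wrap a plain method the same way it wraps __init__ (class and method names are illustrative):

```python
class ArrowNormalizer:
    # Sketch: keep __init__ undecorated and resolve the setting through a
    # decorated helper that exists only to retrieve the injected config value.
    @with_config(
        spec=ItemsNormalizerConfiguration,
        sections=(known_sections.NORMALIZE, "parquet_normalizer"),
    )
    def _get_add_dlt_load_id(self, add_dlt_load_id: bool = False) -> bool:
        # @with_config injects the configured value; the method just returns it.
        return add_dlt_load_id

    def __init__(self, *args: Any, **kwargs: Any) -> None:
        self.add_dlt_load_id = self._get_add_dlt_load_id()
```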

@rudolfix (Collaborator) left a comment

LGTM!

I think we're missing some tests. There's surely already a test that checks if load_id is added. However, now we need to test two things:

  1. adding load_id in extract + sensitivity to your config "hack" - just pipeline.extract() an arrow table and see if the extracted parquet has load_id (a rough sketch follows this list)
  2. load some json data and request parquet in the normalize stage - this will create a parquet file in the normalizer and we test if load_id is added there
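
A rough sketch of what test 1 might look like; the resource name, pipeline name, and the working-directory layout are assumptions, not the actual test code:

```python
import glob
import os

import dlt
import pyarrow as pa
import pyarrow.parquet as pq


def test_load_id_added_in_extract() -> None:
    @dlt.resource
    def arrow_items():
        yield pa.table({"value": [1, 2, 3]})

    pipeline = dlt.pipeline(pipeline_name="load_id_extract_test", destination="duckdb")
    pipeline.extract(arrow_items())

    # Find the parquet file(s) produced by extract in the pipeline working
    # directory (the exact layout is an assumption here) and check that the
    # load id column is already present before normalize runs.
    parquet_files = glob.glob(
        os.path.join(pipeline.working_dir, "**", "*.parquet"), recursive=True
    )
    assert parquet_files
    for path in parquet_files:
        assert "_dlt_load_id" in pq.read_table(path).schema.names
```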

@rudolfix (Collaborator) left a comment

@steinitzu thx for the tests!
still a few issues with normalizing column names. pls check

Resolved (outdated) review threads:
  • dlt/common/libs/pyarrow.py
  • dlt/common/libs/pyarrow.py
  • dlt/extract/extractors.py
@rudolfix rudolfix marked this pull request as ready for review June 13, 2024 22:06
@steinitzu (Collaborator, Author)
Updated with normalized identifiers :)

@rudolfix (Collaborator) left a comment

@steinitzu looks good! but new test is not passing. IMO some column mismatch?

@steinitzu (Collaborator, Author)

> @steinitzu looks good! but new test is not passing. IMO some column mismatch?

@rudolfix this is fixed by b1be9c9
The issue was that schema.update_table was adding the column at the front. I'm not sure if it's right to use that method here since it creates the table in schema and that should be left for after all columns are computed?
Added check for _dlt_load_id being last in common tests now too.
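
For reference, the kind of assertion that check boils down to (the file name is a placeholder):

```python
import pyarrow.parquet as pq

# The load id column should be appended as the last column of the extracted
# parquet schema, not inserted at the front.
schema = pq.read_table("extracted_items.parquet").schema
assert schema.names[-1] == "_dlt_load_id"
```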

@rudolfix (Collaborator)

> @steinitzu looks good! but new test is not passing. IMO some column mismatch?
>
> @rudolfix this is fixed by b1be9c9 The issue was that schema.update_table was adding the column at the front. I'm not sure if it's right to use that method here since it creates the table in schema and that should be left for after all columns are computed? Added check for _dlt_load_id being last in common tests now too.

This is really weird. I added one more test and the columns are added at the end. The current version is btw better: update_table is done at the end to modify the actual schema.

@rudolfix rudolfix merged commit b267c70 into devel Jun 18, 2024
46 of 48 checks passed
@rudolfix rudolfix deleted the write-load-id-in-extract branch June 18, 2024 21:30