generate load_id in extract step #756

rudolfix · 2023-11-11T11:39:49Z

Background
We currently generate load_id during normalization, but we should generate it in the extract step. This will improve data lineage: all step infos will be correlated by it. It also lets us refactor step infos (like LoadInfo) to share the same structure, with load_id as the primary identifier.

Currently the load_id is a string, but its value is the Unix timestamp of the time at which the package was created. We'll keep it that way but clean up the code, and possibly let users override it.
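
To make the format concrete, here is a minimal sketch of a load_id that is a string holding a Unix timestamp. The standalone function and its `now` parameter are assumptions for illustration (the parameter stands in for the possible user override); this is not the existing implementation.

```python
import time
from typing import Optional


def create_load_id(now: Optional[float] = None) -> str:
    """Return a load_id: a string holding a Unix timestamp.

    `now` is a hypothetical override hook; when omitted, the current
    time is used.
    """
    timestamp = time.time() if now is None else now
    return str(timestamp)
```

Keeping the value a string means callers never depend on it being numeric, so the format could change later without touching the type.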

Also note that the normalize step currently forms packages by grouping files by schema name. If many extract steps are run for a single schema, followed by one normalize step, there will be a single load package. We want to preserve this behavior, even if it complicates our code a little.

Tasks

- In ExtractorStorage, rename create_extract_id to create_load_id and use now() to create the load_id.
- If we create a load id for a schema name for which a load id is already present, reuse the existing one. This must happen via a lookup into the existing file structure (the mechanism must survive restarts).
- Change normalize.py so it works with load packages created in the extract step. Currently we copy all files into a flat structure and ask normalize to group them by schema. Now the grouping is already done, so we just reuse it.
- Methods in Pipeline that list extracted files must be revised and modeled after the methods that list normalized and loaded packages. We may keep list_extracted_resources, which produces a flat list of files, for backward compatibility.
- NormalizeStorage and TParsedNormalizeFileName should be refactored (i.e. the parse method), and we should bump the version of the normalize storage because we change its internal file layout.
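
The reuse rule from the second task could be sketched roughly as below. The directory layout (`<root>/<schema_name>/<load_id>/`) and both function names are assumptions made up for this sketch, not the actual ExtractorStorage layout or API. The point is only that the lookup reads the file system, so it survives restarts.

```python
import time
from pathlib import Path
from typing import Optional


def find_load_id(root: Path, schema_name: str) -> Optional[str]:
    """Look up an existing load_id for a schema by scanning the file structure."""
    schema_dir = root / schema_name  # hypothetical layout: one folder per schema
    if not schema_dir.is_dir():
        return None
    load_ids = sorted(p.name for p in schema_dir.iterdir() if p.is_dir())
    return load_ids[0] if load_ids else None


def create_load_id(root: Path, schema_name: str) -> str:
    """Reuse the load_id already present for the schema, else create a new one."""
    existing = find_load_id(root, schema_name)
    if existing is not None:
        return existing
    load_id = str(time.time())
    # The package folder itself is the persistent record of the load_id.
    (root / schema_name / load_id).mkdir(parents=True)
    return load_id
```

With this shape, several extract runs for the same schema end up in one package folder, which preserves the single-load-package behavior described in the background.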

This was referenced Nov 11, 2023

refactor NormalizeInfo and LoadInfo #757

Closed

Implement ExtractInfo #754

Closed

rudolfix added the sprint label Nov 12, 2023

rudolfix mentioned this issue Nov 13, 2023

0.4.1 release notes and master merge #763

Closed

rudolfix self-assigned this Nov 13, 2023

rudolfix mentioned this issue Nov 28, 2023

Rfix/load package extract #790

Merged

rudolfix closed this as completed Nov 29, 2023

rudolfix linked a pull request Nov 29, 2023 that will close this issue

Rfix/load package extract #790

Merged
