generate load_id in extract step #756

Closed
rudolfix opened this issue Nov 11, 2023 · 0 comments · Fixed by #790
rudolfix commented Nov 11, 2023

Background
We generate load_id during normalization, but we should do it in the extract step. This will improve data lineage: all step infos will be correlated with it. It also allows us to refactor step infos (like LoadInfo) to share the same structure, with load_id as the primary identifier.

Currently the load_id is a string, but its value is the unix timestamp of the time at which the package was created. We'll keep it that way but clean up the code, and possibly let users override it.
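
As a rough illustration of the format described above, a load_id could be produced like this. This is a minimal sketch using only the standard library; `make_load_id` and its `now` and `override` parameters are hypothetical names chosen for the example, not the existing dlt API.

```python
import time
from typing import Optional


def make_load_id(now: Optional[float] = None, override: Optional[str] = None) -> str:
    """Return a load_id string (hypothetical helper, not the dlt API).

    By default the id is the unix timestamp of package creation rendered
    as a string. `override` models the "let users override" idea.
    """
    if override is not None:
        return override
    # Fall back to the current time when no timestamp is supplied.
    timestamp = time.time() if now is None else now
    return str(timestamp)
```

Keeping the id a string means a user-supplied value does not have to be numeric, while the default still sorts roughly by creation time.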

Also note that the normalize step currently forms packages by grouping extracted files by schema name. If many extract steps are run for a single schema, followed by one normalize step, there will be a single load package. We want to preserve this behavior, even if it complicates our code a little.
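
The grouping behavior to preserve can be sketched as follows. The input shape (pairs of schema name and file name) and the `group_by_schema` helper are assumptions made for illustration; they do not reflect the actual file naming used by normalize storage.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_schema(extracted: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group extracted files into one package per schema name.

    `extracted` yields (schema_name, file_name) pairs, possibly coming
    from several extract runs (hypothetical input shape).
    """
    packages: Dict[str, List[str]] = defaultdict(list)
    for schema_name, file_name in extracted:
        packages[schema_name].append(file_name)
    return dict(packages)


# Three extract runs for "events" and one for "users" give two packages.
runs = [
    ("events", "run1.jsonl"),
    ("users", "run1.jsonl"),
    ("events", "run2.jsonl"),
    ("events", "run3.jsonl"),
]
packages = group_by_schema(runs)
```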

Tasks

    • In ExtractorStorage, rename create_extract_id to create_load_id and use now() to create the load_id.
    • If we create a load id for a schema name for which a load id is already present, reuse the existing one. This must happen via a lookup into the existing file structure (the mechanism must survive restarts).
    • Change normalize.py so it works with load packages created in the extract step. Currently we copy all files into a flat structure and ask normalize to group them by schema. Now the grouping is already done, so we just reuse it.
    • Methods in Pipeline that list extracted files must be revised and modeled after the methods that list normalized and loaded packages. We may keep list_extracted_resources, which produces a flat list of files, for backward compatibility.
    • NormalizeStorage and TParsedNormalizeFileName should be refactored (i.e. the parse method), and we should bump the version of normalize storage because we change its internal file layout.
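
The first two tasks could look roughly like the sketch below. It assumes a hypothetical directory layout of `<root>/<schema_name>/<load_id>/`, which is not the layout dlt actually uses; the point is only that the lookup goes through the file system, so an existing load id is found again after a restart.

```python
import os
import time


def create_load_id(root: str, schema_name: str) -> str:
    """Return the load_id for `schema_name`, reusing a pending one if present.

    Assumed layout (hypothetical): <root>/<schema_name>/<load_id>/
    State lives only on disk, so the lookup survives process restarts.
    """
    schema_dir = os.path.join(root, schema_name)
    if os.path.isdir(schema_dir):
        pending = sorted(os.listdir(schema_dir))
        if pending:
            # A package for this schema already exists: reuse its id.
            return pending[0]
    # No package yet: create one named after the current unix timestamp.
    load_id = str(time.time())
    os.makedirs(os.path.join(schema_dir, load_id))
    return load_id
```

Calling `create_load_id` twice for the same schema returns the same id until the package directory is moved or removed by a later step, which keeps the "many extracts, one load package" behavior.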
@rudolfix rudolfix self-assigned this Nov 13, 2023
@rudolfix rudolfix linked a pull request Nov 29, 2023 that will close this issue