refactor NormalizeInfo and LoadInfo #757

Closed · opened Nov 11, 2023 by rudolfix (Collaborator) · 0 comments

Background
NormalizeInfo and LoadInfo do not hold all relevant tracking information from the normalize and load steps. The biggest issue is that they are not aware of the many load packages that can be processed in a single step. In this ticket we apply a similar structure to both step infos.

Note: each step info:

  • must have a human-readable representation with several verbosity levels.
  • must generate an object that is dlt-friendly to load.
  • must be picklable without any weird or complicated behaviors.

ExtractInfo must behave exactly the same (see #754). A minimal sketch of the shared contract follows.
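
A minimal sketch of that contract, assuming a plain dataclass of builtin-ish fields (which pickles cleanly by default); `StepInfo`, `asdict`, and `asstr` are illustrative names here, not dlt's actual API:

```python
from dataclasses import dataclass, field, asdict as _asdict
from datetime import datetime
from typing import Any, Dict, List


@dataclass
class StepInfo:
    pipeline_name: str
    started_at: datetime
    load_ids: List[str] = field(default_factory=list)

    def asdict(self) -> Dict[str, Any]:
        # a plain dict of simple values is easy to load back with dlt itself
        return _asdict(self)

    def asstr(self, verbosity: int = 0) -> str:
        # human-readable representation; verbosity controls the detail level
        msg = f"{self.pipeline_name}: {len(self.load_ids)} load package(s)"
        if verbosity > 0:
            msg += "\n" + "\n".join(self.load_ids)
        return msg

    def __str__(self) -> str:
        return self.asstr(verbosity=0)
```

Because the class holds only plain fields and module-level methods, pickling needs no custom `__reduce__` or other tricks.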

Tasks

    • all step infos must contain pipeline, started_at, and load_ids fields, which work exactly like in LoadInfo (see the field sketch after this list)
    • we add a completed_at DateTime to record the real moment processing completed (LoadInfo cheats by using now)
    • load_packages holds the information on load packages, like in LoadInfo; each step deals with load packages (even if they are represented by different tuple types) - see generate load_id in extract step #756
    • when step info is collected after an exception, it should contain only information that actually happened and got committed - see LoadInfo
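
A hypothetical layout of these common fields; `LoadPackageInfo` stands in for the step-specific tuple types mentioned above:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, NamedTuple, Optional


class LoadPackageInfo(NamedTuple):
    # placeholder for the different per-step package tuple types
    load_id: str
    state: str  # e.g. "extracted", "normalized", "loaded", "aborted"


@dataclass
class StepInfoFields:
    pipeline_name: str
    started_at: datetime
    # the real completion moment, instead of computing "now" on access
    completed_at: Optional[datetime] = None
    load_ids: List[str] = field(default_factory=list)
    load_packages: List[LoadPackageInfo] = field(default_factory=list)
```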

LoadInfo

    • the dataset name may be different for each load_id, so extend LoadInfo to keep this information
    • add the schema name and hash per load_id
    • remove elapsed from LoadJobInfo. Instead, implement proper metrics for jobs generated in load.py; we should initially get the start and stop times of the jobs. Mind that you need to persist the metrics, as this step is not atomic and may be restarted (see the record sketch after this list)
    • make sure that when the load step raises an exception we collect a partial LoadInfo (I think this is already implemented).
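
One possible shape for the extended per-load_id record and the job metrics that replace elapsed; all names are illustrative, not dlt's current classes:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional


@dataclass
class LoadJobMetrics:
    # explicit start/stop timestamps replace the removed elapsed field
    started_at: datetime
    finished_at: Optional[datetime] = None  # None if the job did not finish


@dataclass
class LoadPackageRecord:
    load_id: str
    dataset_name: str  # may differ per load_id
    schema_name: str
    schema_hash: str   # version hash of the schema applied for this package
    # job file name -> its metrics; must be persisted because the load
    # step is not atomic and may be restarted mid-package
    job_metrics: Dict[str, LoadJobMetrics] = field(default_factory=dict)
```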

when collecting partial LoadInfo:

    • include the schema update ONLY if it got committed to the db
    • include a flag that schemas were actually updated
    • we'll need to (possibly) store some information (per load_id) to survive restarts (see the sketch below)
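
A sketch of one way to persist the per-load_id schema-commit fact across restarts; the side-car file name and JSON layout are assumptions, not dlt's actual format:

```python
import json
from pathlib import Path


def mark_schema_committed(package_dir: Path, schema_hash: str) -> None:
    # written only AFTER the destination transaction commits, so a partial
    # LoadInfo never reports a schema update that did not really happen
    (package_dir / "applied_schema.json").write_text(
        json.dumps({"schema_hash": schema_hash, "schema_updated": True})
    )


def read_schema_update_state(package_dir: Path) -> dict:
    path = package_dir / "applied_schema.json"
    # a missing file means no schema update was committed before the failure
    if path.exists():
        return json.loads(path.read_text())
    return {"schema_updated": False}
```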

NormalizeInfo

    • add a list of normalized packages, like in LoadInfo
    • collect row counts per load_id and per file; keep the current row count as a property that aggregates the per-file data
    • collect processing time per load_id and per file
    • make sure that when the normalize step raises an exception we collect partial info
    • we may collect partial information via files (normalize is multiprocess) - see the sketch below
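
A sketch of how the per-file metrics and the aggregated row-count property could fit together; worker processes would write their per-file counts to disk so partial info survives a crash. All names, plus the file-to-table naming convention, are assumptions:

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class NormalizeMetrics:
    # load_id -> file name -> rows written into that file
    row_counts_per_file: Dict[str, Dict[str, int]] = field(default_factory=dict)
    # load_id -> file name -> processing time in seconds
    processing_time_per_file: Dict[str, Dict[str, float]] = field(default_factory=dict)

    @property
    def row_counts(self) -> Dict[str, int]:
        # aggregate per-file counts into per-table totals, assuming the
        # table name can be derived from the file name (an assumption here)
        totals: Dict[str, int] = defaultdict(int)
        for files in self.row_counts_per_file.values():
            for file_name, count in files.items():
                totals[file_name.split(".")[0]] += count
        return dict(totals)
```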