refactor NormalizeInfo and LoadInfo #757

Closed · opened Nov 11, 2023 by rudolfix (Collaborator) · 0 comments

Background
NormalizeInfo and LoadInfo do not hold all relevant tracking information from the normalize and load steps. The biggest issue is that they are not aware of the many load packages that can be processed in a single step. In this ticket we apply a similar structure to both step infos.

Note: each step info:

  • must have a human-readable representation with several verbosity levels.
  • must generate an object that is dlt-friendly to load.
  • must be picklable without any weird or complicated behaviors.

ExtractInfo must behave exactly the same (see #754). A minimal sketch of the shared contract follows.
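
A minimal sketch of that contract, assuming a plain dataclass of builtin-ish fields (which pickles cleanly by default); `StepInfo`, `asdict`, and `asstr` are illustrative names here, not dlt's actual API:

```python
from dataclasses import dataclass, field, asdict as _asdict
from datetime import datetime
from typing import Any, Dict, List


@dataclass
class StepInfo:
    pipeline_name: str
    started_at: datetime
    load_ids: List[str] = field(default_factory=list)

    def asdict(self) -> Dict[str, Any]:
        # a plain dict of simple values is easy to load back with dlt itself
        return _asdict(self)

    def asstr(self, verbosity: int = 0) -> str:
        # human-readable representation; verbosity controls the detail level
        msg = f"{self.pipeline_name}: {len(self.load_ids)} load package(s)"
        if verbosity > 0:
            msg += "\n" + "\n".join(self.load_ids)
        return msg

    def __str__(self) -> str:
        return self.asstr(verbosity=0)
```

Because the class holds only plain fields and module-level methods, pickling needs no custom `__reduce__` or other tricks.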

Tasks

    • all step infos must contain pipeline, started_at, and load_ids fields, which work exactly like in LoadInfo (see the field sketch after this list)
    • we add a completed_at DateTime to record the real moment processing completed (LoadInfo cheats by using now)
    • load_packages holds the information on load packages, like in LoadInfo; each step deals with load packages (even if they are represented by different tuple types) - see generate load_id in extract step #756
    • when step info is collected after an exception, it should contain only information that actually happened and got committed - see LoadInfo
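
A hypothetical layout of these common fields; `LoadPackageInfo` stands in for the step-specific tuple types mentioned above:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, NamedTuple, Optional


class LoadPackageInfo(NamedTuple):
    # placeholder for the different per-step package tuple types
    load_id: str
    state: str  # e.g. "extracted", "normalized", "loaded", "aborted"


@dataclass
class StepInfoFields:
    pipeline_name: str
    started_at: datetime
    # the real completion moment, instead of computing "now" on access
    completed_at: Optional[datetime] = None
    load_ids: List[str] = field(default_factory=list)
    load_packages: List[LoadPackageInfo] = field(default_factory=list)
```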

LoadInfo

    • the dataset name may be different for each load_id, so extend LoadInfo to keep this information
    • add the schema name and hash per load_id
    • remove elapsed from LoadJobInfo. Instead, implement proper metrics for jobs generated in load.py; we should initially get the start and stop times of the jobs. Mind that you need to persist the metrics, as this step is not atomic and may be restarted (see the record sketch after this list)
    • make sure that when the load step raises an exception we collect a partial LoadInfo (I think this is already implemented).
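
One possible shape for the extended per-load_id record and the job metrics that replace elapsed; all names are illustrative, not dlt's current classes:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional


@dataclass
class LoadJobMetrics:
    # explicit start/stop timestamps replace the removed elapsed field
    started_at: datetime
    finished_at: Optional[datetime] = None  # None if the job did not finish


@dataclass
class LoadPackageRecord:
    load_id: str
    dataset_name: str  # may differ per load_id
    schema_name: str
    schema_hash: str   # version hash of the schema applied for this package
    # job file name -> its metrics; must be persisted because the load
    # step is not atomic and may be restarted mid-package
    job_metrics: Dict[str, LoadJobMetrics] = field(default_factory=dict)
```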

when collecting partial LoadInfo:

    • include the schema update ONLY if it got committed to the db
    • include a flag that schemas were actually updated
    • we'll need to (possibly) store some information (per load_id) to survive restarts (see the sketch below)
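
A sketch of one way to persist the per-load_id schema-commit fact across restarts; the side-car file name and JSON layout are assumptions, not dlt's actual format:

```python
import json
from pathlib import Path


def mark_schema_committed(package_dir: Path, schema_hash: str) -> None:
    # written only AFTER the destination transaction commits, so a partial
    # LoadInfo never reports a schema update that did not really happen
    (package_dir / "applied_schema.json").write_text(
        json.dumps({"schema_hash": schema_hash, "schema_updated": True})
    )


def read_schema_update_state(package_dir: Path) -> dict:
    path = package_dir / "applied_schema.json"
    # a missing file means no schema update was committed before the failure
    if path.exists():
        return json.loads(path.read_text())
    return {"schema_updated": False}
```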

NormalizeInfo

    • add a list of normalized packages, like in LoadInfo
    • collect row counts per load_id and per file; keep the current row count as a property that aggregates the per-file data
    • collect processing time per load_id and per file
    • make sure that when the normalize step raises an exception we collect partial info
    • we may collect partial information via files (normalize is multiprocess) - see the sketch below
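
A sketch of how the per-file metrics and the aggregated row-count property could fit together; worker processes would write their per-file counts to disk so partial info survives a crash. All names, plus the file-to-table naming convention, are assumptions:

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class NormalizeMetrics:
    # load_id -> file name -> rows written into that file
    row_counts_per_file: Dict[str, Dict[str, int]] = field(default_factory=dict)
    # load_id -> file name -> processing time in seconds
    processing_time_per_file: Dict[str, Dict[str, float]] = field(default_factory=dict)

    @property
    def row_counts(self) -> Dict[str, int]:
        # aggregate per-file counts into per-table totals, assuming the
        # table name can be derived from the file name (an assumption here)
        totals: Dict[str, int] = defaultdict(int)
        for files in self.row_counts_per_file.values():
            for file_name, count in files.items():
                totals[file_name.split(".")[0]] += count
        return dict(totals)
```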