Background

`NormalizeInfo` and `LoadInfo` do not hold all the relevant tracking information from the normalize and load steps. The biggest issue is that they are not aware that many load packages can be processed in a single step. In this ticket we apply a similar structure to all the step infos.
Note: each step info (a minimal sketch follows this list):
- must have a human-readable representation with several verbosity levels
- must generate an object that is dlt-friendly to load
- must be picklable without any weird and complicated behavior
- `ExtractInfo` must behave exactly the same (see #754)
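A minimal sketch of the contract above, assuming a plain dataclass; the class name `_StepInfoSketch` and the exact `asdict`/`asstr` signatures are illustrative assumptions, not dlt's final API:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List


@dataclass
class _StepInfoSketch:
    pipeline_name: str
    started_at: datetime
    load_ids: List[str] = field(default_factory=list)

    def asdict(self) -> Dict[str, Any]:
        """Return a flat, dlt-friendly dict that can itself be loaded as data."""
        return {
            "pipeline_name": self.pipeline_name,
            "started_at": self.started_at.isoformat(),
            "load_ids": list(self.load_ids),
        }

    def asstr(self, verbosity: int = 0) -> str:
        """Human-readable representation; higher verbosity adds detail."""
        msg = f"{self.pipeline_name}: {len(self.load_ids)} load package(s)"
        if verbosity > 0:
            msg += "\n" + "\n".join(self.load_ids)
        return msg

    # only plain data fields above - no locks, handles or lambdas - so
    # pickle.dumps(info) round-trips without custom __reduce__ logic
```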
Tasks
- all step infos must contain `pipeline`, `started_at`, and `load_ids` fields which work exactly like in `LoadInfo` (a sketch of the resulting shape follows this list)
- we add a `completed_at` DateTime to record the real moment processing completed (`LoadInfo` cheats by using now)
- `load_packages` holds the information on load packages, like in `LoadInfo`; each step deals with load packages (even if they are represented by different tuple types) - see generate load_id in extract step #756
- when step info is collected after an exception, it should contain only information that actually happened and got committed - see `LoadInfo`
- the dataset name may be different for each `load_id`, so extend the load info to keep this information
- add the schema name and hash per `load_id`
- remove `elapsed` from `LoadJobInfo`. Instead, implement proper metrics for jobs generated in `load.py`. We should collect start and stop times of the jobs initially. Mind that you need to persist the metrics, as this step is not atomic and may be restarted (see the persistence sketch after this list)
- make sure that when the load step raises an exception we collect a partial `LoadInfo` (I think this is already implemented)
- when collecting a partial `LoadInfo`:
  - include the schema update ONLY if it got committed to the db
  - include a flag that schemas were actually updated
  - we'll need to (possibly) store some information (per load id) to survive restarts...
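A hedged sketch of how the per-`load_id` information above could hang together; all class and field names here are assumptions for illustration, not dlt's final schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class _LoadPackageInfoSketch:
    load_id: str
    dataset_name: str   # may be different for each load_id
    schema_name: str
    schema_hash: str
    # only True once the schema update was actually committed to the db
    schema_update_committed: bool = False


@dataclass
class _LoadInfoSketch:
    pipeline_name: str
    started_at: datetime
    # the real completion moment, None while the step is still running
    completed_at: Optional[datetime]
    load_ids: List[str]
    load_packages: List[_LoadPackageInfoSketch]
```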
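For the job metrics, one way to persist start/stop times so they survive a restarted load step; the JSONL file layout and the function name are assumptions, not an existing dlt mechanism:

```python
import json
import os
from datetime import datetime


def record_job_metric(metrics_dir: str, load_id: str, job_id: str,
                      started_at: datetime, finished_at: datetime) -> None:
    """Append one job's timings to a per-load_id metrics file."""
    os.makedirs(metrics_dir, exist_ok=True)
    path = os.path.join(metrics_dir, f"{load_id}.metrics.jsonl")
    entry = {
        "job_id": job_id,
        "started_at": started_at.isoformat(),
        "finished_at": finished_at.isoformat(),
    }
    # append-only, so a restarted load step keeps earlier measurements
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```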
NormalizeInfo
- add a list of normalized packages, like in `LoadInfo`
- collect row counts per `load_id` and per file; keep the current row count as a property that aggregates the per-file data (see the sketch after this list)
- collect processing time per `load_id` and file
- make sure that when the normalize step raises an exception we collect partial info
  - we may collect partial information via files (normalize is multiprocess)
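A sketch of the `NormalizeInfo` shape described above, with per-file row counts aggregated into the top-level row count kept today; the names and nesting are illustrative assumptions:

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class _NormalizeInfoSketch:
    # {load_id: {file_name: row_count}}
    row_counts_per_file: Dict[str, Dict[str, int]] = field(default_factory=dict)
    # {load_id: {file_name: seconds}}
    processing_time_per_file: Dict[str, Dict[str, float]] = field(default_factory=dict)

    @property
    def row_counts(self) -> Dict[str, int]:
        """Aggregate the per-file data into the row count exposed today."""
        totals: Dict[str, int] = {}
        for files in self.row_counts_per_file.values():
            for name, count in files.items():
                totals[name] = totals.get(name, 0) + count
        return totals
```

Since normalize runs in multiple processes, each worker could dump its own partial dict to a file and the parent process merge them, which would also cover collecting partial info after an exception.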