Background
We currently generate load_id during normalization, but we should generate it in the extract step. This will improve data lineage: all step infos will be correlated with it. It also lets us refactor step infos (like LoadInfo) so they share the same structure, with load_id as the primary identifier.
Currently the load_id is a string, but its value is the unix timestamp of the time at which the package was created. We'll keep it that way but clean up the code, and possibly let users override it.
Also note that the normalize step currently forms packages by grouping files by schema name. If many extract steps were run for a single schema, followed by one normalize step, there will be a single load package. We want to preserve this behavior, even if it complicates our code a little.
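To make the load_id format concrete, here is a minimal sketch of a string identifier derived from the unix timestamp at package creation, with an optional user override. The function name and the `override` parameter are hypothetical illustrations, not the existing API.

```python
import time
from typing import Optional


def make_load_id(override: Optional[str] = None) -> str:
    """Return a load_id string (hypothetical helper, not the library's API).

    If the caller supplies an override it is used verbatim; otherwise the
    current unix timestamp is rendered as a string.
    """
    if override is not None:
        return override
    return str(time.time())
```

Keeping the value a plain string means an override does not have to be a timestamp at all, at the cost of losing the implicit creation-time ordering.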
Tasks
In ExtractorStorage, rename create_extract_id to create_load_id and use now() to create the load_id.
If we create a load id for a schema name for which a load id is already present, reuse the existing one. This must happen via a lookup into the existing file structure (the mechanism must survive restarts).
Change normalize.py so it works with load packages created in the extract step. Currently we copy all files into a flat structure and ask normalize to group them by schema. Now the grouping is already done, so we just reuse it.
Methods in Pipeline that list extracted files must be revised and modeled after the methods that list normalized and loaded packages. We may keep list_extracted_resources, which produces a flat list of files, for backward compatibility.
NormalizeStorage and TParsedNormalizeFileName should be refactored (i.e. the parse method), and we should bump the version of normalize storage because we change its internal file layout.
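The "reuse an existing load id via file structure lookup" task could be sketched roughly as follows. This assumes a hypothetical on-disk layout of `<root>/<schema_name>/<load_id>/`; the real storage layout, and the standalone function names used here, are assumptions made for illustration only.

```python
import os
import time
from typing import Optional


def find_pending_load_id(root: str, schema_name: str) -> Optional[str]:
    """Look up an existing load_id for a schema by scanning the directory tree.

    Assumed (hypothetical) layout: <root>/<schema_name>/<load_id>/
    """
    schema_dir = os.path.join(root, schema_name)
    if not os.path.isdir(schema_dir):
        return None
    load_ids = sorted(
        name for name in os.listdir(schema_dir)
        if os.path.isdir(os.path.join(schema_dir, name))
    )
    # Pick the lexicographically first package folder, if any exists.
    return load_ids[0] if load_ids else None


def create_load_id(root: str, schema_name: str) -> str:
    """Return the pending load_id for the schema, creating one if none exists."""
    existing = find_pending_load_id(root, schema_name)
    if existing is not None:
        return existing
    load_id = str(time.time())
    os.makedirs(os.path.join(root, schema_name, load_id))
    return load_id
```

Because the lookup reads only the directory tree and keeps no in-memory state, a second extract for the same schema finds the same package folder even after a process restart, which preserves the "one load package per schema" behavior described in the background.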