Logging and reporting is a crucial aspect of a data factory system like this.
What kind of logs
Log format
Log storage
Log access
Job stories
When a Run is initiated by an Operator, they want to see that it is running and be notified of application and (meta)data errors as soon as possible, especially “halts”, so that they can debug and re-run
If there are a lot of (data) errors, I want to examine them in a system that lets me view and analyse them easily (i.e. my page shouldn’t crash as it tries to load 100k error messages)
I don’t want to receive 100k error emails …
When a scheduled Run happens, as an Operator (Sysadmin) I want to be notified afterwards (with a report?) if something went wrong, so that I can do something about it …
When I need to report to my colleagues about the Harvesting system, I want an overall report of how it is going (e.g. how many datasets have been harvested) so that I can tell them
Domain Model
Status info: this Run is running, it has finished, it took this long …
If the process takes longer than I expect, we could show a window with live logs (using the Airflow API). We don’t yet have statuses like “running step X”, “running step Y”, “stopped by error”, “finished”; we need to add these to the NG Harvester.
(Raw) Log information …
Logs on run execution (classic INFO, WARN etc logging)
Including handled application errors ERROR
(Meta)data errors (and warnings) => What do these look like?
(Unhandled) Exceptions or errors (caught by parent system)
Reports / Summaries, e.g. 200 records processed, 5 errors, 2 warnings, 8 new datasets, 192 existing records updated
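As a sketch of what such a summary could look like as data, here is a minimal example. The field names are assumptions for illustration, not a settled schema:

```python
import json

def build_run_summary(processed, errors, warnings, new_datasets, updated):
    """Aggregate per-run counts into a machine-readable summary dict."""
    return {
        "records_processed": processed,
        "errors": errors,
        "warnings": warnings,
        "datasets_new": new_datasets,
        "records_updated": updated,
    }

# The example numbers from the Domain Model above:
summary = build_run_summary(200, 5, 2, 8, 192)
print(json.dumps(summary, indent=2))
```

A structured summary like this could feed both the post-run notification email and the overall “how is harvesting going” report.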
4 cases
Run Status Info (Live and Historic)
Who: someone running a Job in real time. When something does not work, I want to see the history of jobs (e.g. when jobs stopped running) so that I can debug.
Provided by: the Orchestrator (i.e. Airflow). TODO: does the orchestrator provide historic info?
Format: whatever that API gives
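For live and historic status, a thin client over the Airflow 2.x stable REST API could be enough. The base URL and DAG id below are placeholders; the endpoint and field names (`dag_runs`, `dag_run_id`, `state`) follow the stable API:

```python
import json
import urllib.request

AIRFLOW_URL = "http://localhost:8080/api/v1"  # placeholder deployment URL

def fetch_dag_runs(dag_id, limit=5):
    """Fetch the most recent runs of a DAG from the Airflow REST API."""
    req = urllib.request.Request(
        f"{AIRFLOW_URL}/dags/{dag_id}/dagRuns?order_by=-start_date&limit={limit}",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def summarise_runs(payload):
    """Reduce an API response to (run_id, state) pairs for a quick status view."""
    return [(r["dag_run_id"], r["state"]) for r in payload.get("dag_runs", [])]
```

This also answers the “historic info” TODO in part: the `dagRuns` endpoint returns past runs with their final states, so a simple history view can be built on top of it.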
App Log
Who: Someone running a Job (if they want real-time feedback)
Someone debugging a failed job (and a specific source)
Someone creating a new pipeline and wanting to debug it
Provided by: logging in the code using the standard log library, with the storage location configured either in code or by the orchestrator
Format: regular logs (text format) plus a custom JSON file as a final log report
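A minimal sketch of this combination, using only the standard library. The file names and report fields are illustrative assumptions, not a decided layout:

```python
import json
import logging

# Regular text logs during the run...
logging.basicConfig(
    filename="run.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
log = logging.getLogger("harvester.pipeline")

def finish_run(report, path="run_report.json"):
    """...plus a custom machine-readable JSON report written at the end."""
    log.info("run finished: %s", report)
    with open(path, "w") as f:
        json.dump(report, f, indent=2)
    return report

finish_run({"records_processed": 200, "errors": 5, "warnings": 2})
```

Keeping the text log for humans and the JSON file for tooling means the orchestrator can collect the former as-is, while report dashboards consume the latter.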
(Meta)Data Quality Warn / Errors
Who: “Owner” of a harvest source who wants to get those corrected
A Harvest Admin who is overseeing the process and wants to know what happened (and maybe how to fix the pipeline)
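Answering the earlier question of what (meta)data errors might look like: one possible shape is a small structured record per issue, so that thousands of them can be stored, filtered, and browsed rather than emailed. The class and field names below are assumptions for discussion:

```python
from dataclasses import dataclass, asdict

@dataclass
class DataQualityIssue:
    severity: str   # "warning" or "error"
    source_id: str  # which harvest source produced the record
    record_id: str  # identifier of the offending record
    field: str      # the field that failed validation
    message: str    # human-readable description

# One issue, ready to write to a database or a JSON-lines file:
issue = DataQualityIssue("error", "src-42", "rec-1001", "license", "missing licence URI")
row = asdict(issue)
```

With issues stored as rows, both the source “owner” and the Harvest Admin can query by source, severity, or field instead of paging through raw logs.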
Google Cloud Composer already provides a lot of logs. We may be able to create a sink in GCP Cloud Logging (Operations) and redirect the collected logs to another service.