Add ECMWF observability guidelines for logging and metrics #39
tlmquintino merged 7 commits into main
cfkanesan
left a comment
Hi @sametd, this is a very nicely written document in my opinion. I added some comments, and I think there could be additional guidelines for the case where the application pushes logs to the logger (as opposed to having them collected from stdout). For instance, logging should be delegated to a sidecar thread or process so that it neither blocks the main thread or process nor adds latency. The guidelines should also make sure that failure to deliver logs, full buffers, or temporary downtime of the log collection infrastructure does not impact the uptime or performance of the service.
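To make the suggestion concrete, here is a minimal sketch of non-blocking logging using Python's standard `logging.handlers.QueueHandler` and `QueueListener`. The names `DroppingQueueHandler` and `attach_nonblocking_handler`, the drop-when-full policy, and the default capacity are assumptions for illustration, not something the guideline prescribes.

```python
import logging
import logging.handlers
import queue


class DroppingQueueHandler(logging.handlers.QueueHandler):
    """Enqueue records without blocking; count drops when the buffer is full."""

    def __init__(self, log_queue):
        super().__init__(log_queue)
        self.dropped = 0  # exposed so the service can report lost records

    def enqueue(self, record):
        try:
            self.queue.put_nowait(record)
        except queue.Full:
            # Assumed policy: losing a log line is preferred over stalling the service.
            self.dropped += 1


def attach_nonblocking_handler(logger, sink, capacity=10_000):
    """Route `logger` through a bounded queue drained by a background thread.

    `sink` is the handler that does the slow work (network, file, ...).
    Returns (handler, listener); call listener.stop() at shutdown to flush.
    """
    log_queue = queue.Queue(maxsize=capacity)
    handler = DroppingQueueHandler(log_queue)
    listener = logging.handlers.QueueListener(log_queue, sink)
    listener.start()
    logger.addHandler(handler)
    return handler, listener
```

The main thread only pays for a `put_nowait`; a slow or unavailable sink affects the listener thread alone, and `dropped` gives a number to alert on.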
"severity_text": "INFO",
"severity_number": 9,
"body": "Operation completed",
"resource": {
We should also agree with CD on a set of fields for resource here. I'm already seeing different approaches in the test OpenSearch infrastructure they're setting up.
There are a few things I find surprising not to see discussed at all in the context of observability guidelines:
- Access to logs (both how, and who should have access). This all seems to be delegated to something downstream. But it is really rather important, and not something that is universal or the same for every service.
- Log lifetime. We have decades of contiguous log coverage for MARS/ECFS/others. We may not have the same requirements for other services. Currently we build and maintain the archive of logs ourselves. Is this intended to remain the responsibility of the service developer, or to move to the logging infrastructure? If the latter, who is in charge of the lifetimes? And of long-term archival?
- How do we handle an outage of the telemetry system? During outages (power, network, ...) we have seen that much of this infrastructure is not top priority relative to operations when bringing things back up. That means it is likely to lag our services. How do we access logs in that timeframe (which may be very urgent), and how do we ensure there are no gaps in our log coverage?
- How do we interact with existing code bases and services? We have large, stable systems whose logs are already being processed and used for various purposes. It would be a huge undertaking to migrate MARS, for instance (especially the server side), to output only JSON-based logs.
- How do we handle logs for non-request-based tooling (startup/shutdown, housekeeping operations)?
### 4.7 Exception and Error Logging

- Log an exception once at the handling boundary.
- Avoid duplicate logging of the same error in multiple layers.
Hmmz. Typically there is a cascade (e.g. an I/O error causes a decode error causes a user-request failure). Logging that cascade has serious value, no?
Again, tricky. Section 4.7 was revised to preserve causal-chain context while avoiding duplicate full stack logs at every layer.
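One way to reconcile the two positions, sketched in Python under assumed names (`DecodeError`, `RequestFailed`, and the three-layer call chain are hypothetical): inner layers chain exceptions with `raise ... from ...` and do not log, and the handling boundary logs once with `logger.exception`, whose traceback includes every chained cause.

```python
import logging

logger = logging.getLogger("example.service")


class DecodeError(Exception):
    """Hypothetical mid-layer error."""


class RequestFailed(Exception):
    """Hypothetical top-level error."""


def read_block():
    raise OSError("disk read failed")


def decode():
    try:
        read_block()
    except OSError as exc:
        # No logging here: attach the cause and let it propagate.
        raise DecodeError("cannot decode field") from exc


def handle_request():
    try:
        decode()
    except DecodeError as exc:
        raise RequestFailed("retrieve failed") from exc


def serve():
    """Handling boundary: the only place that logs the failure."""
    try:
        handle_request()
    except RequestFailed:
        # One log record; its traceback walks the whole __cause__ chain.
        logger.exception("request failed")
        return False
    return True
```

The single record then carries the I/O error, the decode error and the request failure in order, so the cascade is preserved without a stack trace per layer.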
@simondsmart Appreciate the review, very helpful. I’d kept some topics out initially to keep things short, and I’ve now expanded the document and addressed the majority of your feedback.
I added Section 3.2 “Log Access and Ownership” with access-path requirements, role approval, IAM process alignment, emergency access expectations, and ownership split across Development / Platform / Production. Please check if this matches the operational model you had in mind.
I added Section 3.3 “Log Retention and Archival” with retention declaration at onboarding, default retention from central logging, override process, long-term archival declaration, and ownership responsibilities for implementation/review.
I was planning to add this in the next iteration anyway. I have now added Section 3.4 “Telemetry Outage and Recovery”, covering degraded-mode behavior for logs/metrics/traces, buffering/retry/backfill expectations, gap detection/reporting, and runbook requirements for urgent access during outages.
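As a rough illustration of the buffering, backfill and gap-reporting expectations, the sketch below keeps a bounded local buffer, delivers in order once the collector is reachable again, and records which sequence numbers were lost to overflow. `BufferedShipper`, the `send` callable, and the drop-oldest policy are assumptions made for this example, not the content of Section 3.4.

```python
from collections import deque


class BufferedShipper:
    """Buffer records locally and backfill them once delivery works again."""

    def __init__(self, send, capacity=1000):
        self._send = send        # callable(record); assumed to raise on failure
        self._buffer = deque()
        self._capacity = capacity
        self._next_seq = 0
        self.gaps = []           # sequence numbers lost to buffer overflow

    def submit(self, body):
        record = {"seq": self._next_seq, "body": body}
        self._next_seq += 1
        if len(self._buffer) >= self._capacity:
            # Assumed policy: drop the oldest record and remember the gap.
            lost = self._buffer.popleft()
            self.gaps.append(lost["seq"])
        self._buffer.append(record)
        self.flush()

    def flush(self):
        """Deliver in order; stop at the first failure and keep the rest."""
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except Exception:
                return False
            self._buffer.popleft()
        return True
```

Because every record carries a sequence number, the receiving side can also detect missing ranges on its own, and `gaps` can be reported once the pipeline is back.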
Tricky one. I added Section 4.11 “Legacy Compatibility and Migration” to state target model vs transition model explicitly: no immediate JSON-only requirement for stable legacy services; collector/pipeline mapping is acceptable during phased migration.
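The collector/pipeline mapping could look roughly like the sketch below, which turns a plain-text legacy line into a structured record without touching the legacy service. The line format matched by `LEGACY_LINE`, the `timestamp` field name, and the function name `map_legacy_line` are hypothetical; the severity field names follow the later revision in this thread, and the severity numbers are the first value of each OpenTelemetry range.

```python
import re

# Hypothetical legacy format: "<date> <time> [<LEVEL>] <component>: <message>"
LEGACY_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>[A-Z]+)\] (?P<component>[\w.-]+): (?P<body>.*)$"
)

# First number of each OpenTelemetry severity range.
SEVERITY_NUMBER = {
    "TRACE": 1, "DEBUG": 5, "INFO": 9, "WARN": 13, "ERROR": 17, "FATAL": 21,
}


def map_legacy_line(line):
    """Map one legacy text line to a structured record (illustrative only)."""
    line = line.rstrip("\n")
    match = LEGACY_LINE.match(line)
    if match is None:
        # Keep unparseable lines as a bare body rather than dropping them.
        return {"body": line}
    level = match.group("level")
    record = {
        "timestamp": match.group("ts"),
        "severityText": level,
        "body": match.group("body"),
        "resource": {"service.name": match.group("component")},
    }
    number = SEVERITY_NUMBER.get(level)
    if number is not None:
        record["severityNumber"] = number
    return record
```

For example, `map_legacy_line("2024-03-01 12:00:00 [WARN] mars: queue is full")` yields a record with `severityNumber` 13 and `service.name` set to `mars`.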
I think the guideline does not limit logging to request-bound events: startup/shutdown/housekeeping and other background tooling logs are first-class logs, with the same structure/severity/redaction expectations. Correlation fields such as trace_id are optional when no tracing/request context exists.
cauzm
left a comment
I don’t have any further comments from my side. This is not my main area of expertise, and Simon has already provided a number of relevant comments.
- Rename severity_text/severity_number to severityText/severityNumber
- Rename trace_id/span_id to traceId/spanId and move to top-level LogRecord fields
- Upgrade traceId requirement from SHOULD to MUST when available
- Split required fields table into LogRecord fields and Resource attributes
- Downgrade deployment.environment from MUST to SHOULD
- Add TRACE severity level and severityText-to-severityNumber mapping table
- Add OTel exception attributes: exception.type, exception.message, exception.stacktrace
- Remove deprecated event.domain attribute
- Replace error.message with exception.message per OTel semantic conventions
- Fix MUST not -> MUST NOT in library logging rules
- Fix lowercase normative keywords (should -> SHOULD/MUST NOT)
- Update event.name examples to follow three-part domain.action.result format
- Align 4.10 ownership table with deployment.environment SHOULD requirement
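For readers who want to see the renamed fields together, here is a hedged sketch of a Python `logging.Formatter` that emits them as JSON. The class name `OTelJsonFormatter`, the mapping from Python levels, and passing `traceId`/`spanId` through `extra=` are assumptions for this example; timestamp and resource fields are left out for brevity. The severity numbers are the first value of each OpenTelemetry range (INFO = 9, as in the excerpt above).

```python
import json
import logging
import traceback


def severity(levelno):
    """Map a Python level to (severityText, severityNumber)."""
    if levelno >= logging.CRITICAL:
        return "FATAL", 21
    if levelno >= logging.ERROR:
        return "ERROR", 17
    if levelno >= logging.WARNING:
        return "WARN", 13
    if levelno >= logging.INFO:
        return "INFO", 9
    if levelno >= logging.DEBUG:
        return "DEBUG", 5
    return "TRACE", 1


class OTelJsonFormatter(logging.Formatter):
    """Illustrative formatter; not the guideline's reference implementation."""

    def format(self, record):
        text, number = severity(record.levelno)
        out = {
            "severityText": text,
            "severityNumber": number,
            "body": record.getMessage(),
            "attributes": {},
        }
        # traceId/spanId are top-level and only present when a context exists.
        for field in ("traceId", "spanId"):
            value = getattr(record, field, None)
            if value:
                out[field] = value
        if record.exc_info and record.exc_info[0] is not None:
            exc_type, exc, tb = record.exc_info
            attrs = out["attributes"]
            attrs["exception.type"] = exc_type.__qualname__
            attrs["exception.message"] = str(exc)
            attrs["exception.stacktrace"] = "".join(
                traceback.format_exception(exc_type, exc, tb)
            )
        return json.dumps(out)
```

A caller with a trace context would log with `logger.info("Operation completed", extra={"traceId": "..."})`; a startup or housekeeping message without one simply omits the field.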
tlmquintino
left a comment
The previous reviews have been excellent and very complete.
I have nothing to add.
But please reorganise the file into: Development Practices/Observability.md
Context
This is a draft proposal to align observability guidelines across ECMWF software and services.
The objective of this PR is to collect broad feedback early, before we finalize requirements and structure.
Review approach
Please focus on whether the proposed direction is workable for your teams and platforms.
You are welcome to add colleagues as reviewers where you think it is useful.
Discussion scope for this PR
Please avoid deep implementation debates in this PR thread.
If needed, we can open follow-up issues/PRs for detailed technical discussions.
Next steps
After this review round: