Whitepaper on cloud-native observability #16

AloisReitbauer · 2020-05-28T10:32:30Z

Goal: Support users in implementing observability and monitoring for their cloud native workloads

Target: End users building cloud native applications

Scope: Define basic concepts of data collection and analysis and how CNCF projects can be used for this. Maybe add 1-3 real-world reference examples

Details:

Data collection: Logs, Metrics, Traces
- Which data source to use for what
- Examples with CNCF projects: Prometheus, Jaeger, OpenTelemetry, ...
Make your Kubernetes cluster - and the apps running on it - observable
Storage backends for data
- Options based on CNCF projects
- Enterprise requirements: HA, RBAC, ....
Data analysis patterns:
- Alerting, Anomaly detection, trace analytics, log analytics

mhausenblas · 2020-05-28T10:35:21Z

That sounds like a really good idea to me and I'm happy to contribute to this.

Vlaaaaaaad · 2020-05-28T12:36:10Z

+1 to this.

A good first step may be a landscape, just to have all "players", from open-source and cloud-native, all the way to hosted SaaS options.

AloisReitbauer · 2020-06-05T07:30:17Z

We would need to define the categories for the landscape first. Observability is too broad and abstract a term.

fktkrt · 2020-08-19T12:45:21Z

I am also happy to contribute to this.
Is there any reason to use different categories than the categories defined in the official CNCF landscape's Observability & Analysis category?
Monitoring, logging and tracing cover most aspects of the term. Chaos Engineering may be a bit of an outlier, but it can also be included if this SIG wants to cover it as well.

danielkhan · 2020-11-13T14:24:49Z

@mhausenblas, @sferlin, @ArthurSens, and @danielkhan met today to agree on a rough outline and schedule.

We plan to have the draft ready for review by January 19th and a final version in February.
@ArthurSens already kicked off a google doc and we will continue to work in this doc.

ArthurSens · 2020-11-13T16:02:21Z

I've finished reorganizing the sections. I added several comments to the doc with a brief explanation of the newly added sections.
@mhausenblas @sferlin @danielkhan (or anyone else interested), feel free to review and to ping me on Slack if you have any questions.

ArthurSens · 2020-11-13T16:50:56Z

It's worth mentioning that the organization of the sections was inspired by the whitepaper that SIG-Security is currently working on.

ArthurSens · 2020-11-24T17:18:54Z

"Goals" and "Target Audience" ready for review :)

ArthurSens · 2020-12-12T20:53:39Z

"Introduction" ready for review 🙂

rakyll · 2021-03-12T17:11:52Z

A couple of thoughts...

The current structure puts a lot of emphasis on the different signals (metrics, traces, logs and more) but offers little structure to explain how observability goals change the way you collect these signals. For example, you collect signals consistently with the same labels (dimensions) and can later narrow down the data by filtering on those labels.
Context propagation is a very critical component in observability from the perspective of propagated labels and trace/request context. The paper may need a section dedicated to context (both in-process and on the wire).
The paper should include a high-level section on what observability enables. For example, response time to an incident can be improved from hours to minutes. Observability can help onboard new engineers in relatively large systems. It can contribute to the efficiency of your infrastructure and help you optimize (not just for performance but also for cost). It can help identify malicious cases and security attacks. A section listing the various things observability can contribute to would be great.
The paper could also have a more compelling section on the software lifecycle and how observability fits into it. From development to deployment to production, observability makes unique contributions. A high-level overview would be great to capture.

danielkhan · 2021-03-16T11:02:35Z

Thank you for the feedback, Jaana.

> Context propagation is a very critical component in observability from the perspective of propagated labels and trace/request context. The paper may need a section dedicated to context (both in-process and on the wire).

I could not participate as it got unexpectedly busy, but I can take a stab at this within the next two weeks if no one else is already working on it.

halcyondude · 2021-05-12T10:33:14Z

Coordination and collaboration are happening in the Google Doc and on Slack; this issue is used for high-level tracking.

dominick-blue · 2022-04-29T17:41:03Z

Hi everyone,

I would love to contribute to this. I see the last actions were taken in April 2021. Where is the whitepaper in the process?

bwplotka · 2023-06-19T11:54:29Z

Hi all, hope you are all great!

Thanks, everybody, for the help and feedback. It's finally time to claim the v1.0 version of the whitepaper we started a long time ago!

To do so, we have to perform final touches and a review. For that, we created this document, which is now open for review. The aim is to have the final version of the whitepaper, with TODOs addressed, done by 1st August. After that we will share it more widely, save it as the official v1 and open a WIP version for v1.1, so we can continuously evolve it with the content and updates the community brings in the coming months! (:

The final review document holds the latest whitepaper version (copied from this version), available for collaborative review and for addressing TODOs. Feel free to add comments & suggestions. For bigger suggestions (e.g. further details or new sections), we might move them into new GH issues for whitepaper v1.1. The doc also outlines all feedback items we have received so far for consideration. Feel free to help with any of those TODOs and, generally, with the final review of this paper 🤗

Thanks! 💪🏽

larryck · 2023-07-13T09:14:06Z

Is "Make your Kubernetes cluster - and the apps running on it - observable" mentioned above still a goal of the observability whitepaper?

bwplotka · 2023-08-01T09:48:15Z

> Is "Make your Kubernetes cluster - and the apps running on it - observable" mentioned above still a goal of the observability whitepaper?

Not sure; it feels like a separate tutorial might be useful here (we can then link to it from the whitepaper). Happy to be told otherwise here (:

Added #131 to track that particular idea.

bwplotka · 2023-08-01T09:50:53Z

All the points here were either addressed in the final 1.0 review period or added as TODOs (help wanted!) and tracked in individual issues with the cn-o11y-whitepaper-v1.1 label.

A PR with the changes from the review period to the main branch will follow. Closing for now; feel free to keep the discussion going, ideally in a separate issue so it's easier to track and address 💪🏽

Thanks, everybody, for the epic work on reviewing, suggesting and contributing!

bwplotka · 2023-08-01T11:26:36Z

See: #132


…ion period. (#132)

* whitepaper: Syncing changes from the 1.0 community review & contribution period.

Thanks everyone for amazing feedback! Apologies for a bit short period, but the paper was sitting for 2y without changes, so it made sense to time box 1.0 and allow structured work on further iterations.

Still, within [this community review & contribution period](https://docs.google.com/document/d/19am_KCYWU28ebLiIXv_P3ji96edxCTscVb4CzemXV5A/edit) I counted 67 individual contributions (count of comment/suggestion bubbles, excluding my own and not counting individual discussion comments) from 7 new contributors.

High level changes (suggested by community, but also clean up by me):

* Note on aggregatability and volume of metrics.
* Added non goals
* Changing example from temperature to memory gauge
* Add reasons for metric efficiency
* Added info about cardinality (new section)
* Mentioned metric data models
* Added info about types
* Metric time series vs count
* Addressing feedback on logs, traces
* Adding profile screenshot
* Cleaning up, simplifying correlation section
* Removing how to setup Prometheus with exemplars
* Transitions
* Box based monitoring refactor - changing "closed box" traditional
* Clarified SLO/SLA
* Added image and figure captions
* More automatic and non-intrusive instrumentation solutions in OSS
* Linking ebay paper
* Added gap around streaming API, not enough DBs and standardized query language
* Did grammarly pass for typos.

...and more.

As mentioned in [the doc](https://docs.google.com/document/d/19am_KCYWU28ebLiIXv_P3ji96edxCTscVb4CzemXV5A/edit) I went through all additional and old feedback. It's now either addressed in this PR or added as todos in [separate issues](https://github.com/cncf/tag-observability/issues?q=is%3Aissue+is%3Aopen+label%3Acn-o11y-whitepaper-v1.1)

I admit, it was fun to process that doc! Reminds me of a year ago when I was, fully focused, writing my [book](https://www.oreilly.com/library/view/efficient-go/9781098105709/) on a slightly different topic.

Signed-off-by: bwplotka <bwplotka@gmail.com>

* Added Jaana and Alois as contributors.

Thanks for your ideas and feedback in #16

Signed-off-by: bwplotka <bwplotka@gmail.com>

* Apply suggestions from Richi's code review

Co-authored-by: RichiH-travel <146006063+RichiH-travel@users.noreply.github.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Apply suggestions from code review

Co-authored-by: RichiH-travel <146006063+RichiH-travel@users.noreply.github.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Fixed references, added tip version.

Signed-off-by: bwplotka <bwplotka@gmail.com>

---------

Signed-off-by: bwplotka <bwplotka@gmail.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Co-authored-by: RichiH-travel <146006063+RichiH-travel@users.noreply.github.com>


halcyondude added this to Backlog (scoped and planned) in TAG Observability Kanban via automation Oct 13, 2020

sferlin mentioned this issue Nov 10, 2020

Draft on observability topics/aspects for a white paper #28

Closed

halcyondude moved this from Backlog (scoped and planned) to In Progress in TAG Observability Kanban Dec 1, 2020

halcyondude added the documentation label May 12, 2021

halcyondude removed this from In Progress in TAG Observability Kanban May 12, 2021

halcyondude added this to Active in TAG Observability Working Groups May 12, 2021

halcyondude assigned ArthurSens May 12, 2021

alolita added the cn-o11y-whitepaper label Jun 14, 2023

bwplotka removed the cn-o11y-whitepaper label Aug 1, 2023

bwplotka closed this as completed Aug 1, 2023

bwplotka added a commit that referenced this issue Aug 1, 2023

Added Jaana and Alois as contributors.

ac7dfce

Thanks for your ideas and feedback in #16 Signed-off-by: bwplotka <bwplotka@gmail.com>
