Define new incident events for the DORA metrics #59

Closed
afrittoli opened this issue Aug 23, 2022 · 11 comments · Fixed by #107

@afrittoli
Contributor

No description provided.

@afrittoli afrittoli added this to the v0.1 milestone Aug 23, 2022
@afrittoli afrittoli removed this from the v0.1 milestone Oct 4, 2022
@afrittoli afrittoli added the roadmap Items on the roadmap label Oct 6, 2022
@e-backmark-ericsson e-backmark-ericsson added this to the v0.2 milestone Oct 21, 2022
@afrittoli afrittoli self-assigned this Dec 13, 2022
@menehune23

Agreed that this would be super helpful, especially for those interested in using CDEvents for DORA metrics.

Here's a possible starting point. It's a new incident subject with reported and resolved predicates:

{
  "context": {
    "version": "0.1.x",
    "type": "dev.cdevents.incident.reported.0.1.x",
    "id": "<incident_id>",
    "source": "<source>",
    "timestamp": "<reported_time>"
  },
  "subject": {
    "id": "<incident_id>",
    "type": "incident",
    "content":  {
      "environment": { "id": "<environment>" },
      "artifactId": "some-package-url"
    }
  }
}

And similarly for dev.cdevents.incident.resolved.0.1.x
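As a rough illustration of the shape above, here is a minimal Python sketch that builds both the reported and resolved variants. The function name `make_incident_event` and its parameters are hypothetical, and the version string and type prefix are copied from the proposal rather than from any released CDEvents schema.

```python
import json
from datetime import datetime, timezone

# Placeholders copied from the proposal above; not a released CDEvents schema.
SPEC_VERSION = "0.1.x"
TYPE_PREFIX = "dev.cdevents.incident"


def make_incident_event(predicate, incident_id, source, environment_id,
                        artifact_id, timestamp=None):
    """Build an incident event dict in the shape proposed in this comment."""
    if predicate not in ("reported", "resolved"):
        raise ValueError(f"unsupported predicate: {predicate}")
    if timestamp is None:
        # Default to the current UTC time in ISO 8601 format.
        timestamp = datetime.now(timezone.utc).isoformat()
    return {
        "context": {
            "version": SPEC_VERSION,
            "type": f"{TYPE_PREFIX}.{predicate}.{SPEC_VERSION}",
            "id": incident_id,  # the proposal reuses the incident id here
            "source": source,
            "timestamp": timestamp,
        },
        "subject": {
            "id": incident_id,
            "type": "incident",
            "content": {
                "environment": {"id": environment_id},
                "artifactId": artifact_id,
            },
        },
    }


if __name__ == "__main__":
    event = make_incident_event(
        "reported", "incident-1", "monitoring", "production", "some-package-url"
    )
    print(json.dumps(event, indent=2))
```

Only the predicate segment of the type and the timestamp differ between the two events, so a resolved event can be produced by passing `"resolved"` with the same incident id.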

@menehune23

menehune23 commented Jan 12, 2023

I could see one being able to link an incident to service deployments and artifacts using the artifactId and environment values.
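A minimal sketch of that linking idea, assuming the consumer already keeps a list of deployment records. The record keys `environment_id` and `artifact_id`, and the function name `match_deployments`, are hypothetical and only illustrate the join on the two values.

```python
def match_deployments(incident_event, deployments):
    """Return deployment records that share the incident's environment and artifact.

    `deployments` is assumed to be a list of dicts with hypothetical keys
    "environment_id" and "artifact_id".
    """
    content = incident_event["subject"]["content"]
    key = (content["environment"]["id"], content["artifactId"])
    return [
        d for d in deployments
        if (d["environment_id"], d["artifact_id"]) == key
    ]
```

A consumer could then attribute the incident to the most recent matching deployment, which is one possible input for a change failure rate calculation.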

@afrittoli
Contributor Author

Notes from the CDEvents WG on Nov 23rd, 2022:

  • Incident events
    • Incident subjects
      • Fields: ID, source, Environment, Service, Kind
        • Source can be monitoring system, the application itself, a ticketing system, an SRE
        • ID is a UUID
        • Environment and Service are references
        • Kind could be something that describes the kind of degradation that was detected, with a fixed set of keywords
          • Response time, reliability, functional
        • SLA breached
        • Best practices SIG / Supply Chain SIG - Maturity Workflow
          • Working on describing metrics associated with key activities
        • Priority
      • Predicate:
        • Update / Report
        • Finish / Restore
        • No start event?
        • Remediation
        • Ignored / accepted
          • Technical debt / accepted risk
          • Capturing this is important for metrics / audit trail
        • Upgrade / downgrade / triaged
      • Incident report can be a risk or an issue
      • The time the incident is reported is not necessarily the time the incident is started
    • Remediation subject?
      • Important to capture decisions not to act based on an incident
      • Keep track of what was done before to address an incident / issue
    • Metric?
      • We could use events to report changes in a metric
        • A change beyond a certain threshold would be an incident
      • Maybe OpenTelemetry is best used for that
        • OpenTelemetry reports values of metrics
        • Event could be used to react to a change

@menehune23

menehune23 commented Jan 13, 2023

@afrittoli Those notes are fantastic! Some really good thoughts in there.

It might be good to narrow down what the minimum pieces of data are for the initial hack at this, since we both know that specs are easier to grow than to shrink ;) It would also be faster to get out. Then incremental additive changes could be driven by feedback (e.g. reported issues/feature requests).

I could see the absolute minimum for an incident at this point being:

  • A new subject type: incident
  • Two new predicates: reported and resolved
  • The following attributes (outside of the base CDEvents requirements like timestamp, etc.):
    • Environment
    • Artifact (useful for consumers wanting to link to a specific service/deployment)

I agree that you might want an additional predicate to distinguish reported from began, but I would consider that more an additive change to make at a later time if it becomes a need.
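
With only reported and resolved, a consumer could already approximate the DORA time to restore. Below is a minimal sketch, assuming events in the shape proposed earlier in this thread, ISO 8601 timestamps, and events processed in time order. The function names are hypothetical, and the duration is measured from the reported time, not from when the incident actually began.

```python
from datetime import datetime, timedelta


def time_to_restore(events):
    """Map incident id -> timedelta between its reported and resolved events."""
    reported = {}
    durations = {}
    for event in events:
        # Type looks like "dev.cdevents.incident.<predicate>.<version>".
        predicate = event["context"]["type"].split(".")[3]
        incident_id = event["subject"]["id"]
        timestamp = datetime.fromisoformat(event["context"]["timestamp"])
        if predicate == "reported":
            reported[incident_id] = timestamp
        elif predicate == "resolved" and incident_id in reported:
            durations[incident_id] = timestamp - reported[incident_id]
    return durations


def mean_time_to_restore(durations):
    """Average the durations, or return None when no incident was resolved."""
    if not durations:
        return None
    return sum(durations.values(), timedelta()) / len(durations)
```

Incidents that were reported but never resolved are simply left out of the result, so a consumer would have to decide separately how to count them.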

@menehune23

It might be that the way consumers report incidents is so varied that simpler is better here, and there's always customData for those with special use cases.

@e-backmark-ericsson
Contributor

I believe we also need to have a crisp definition of what an "incident" is. I believe there might be some gray areas there between incidents and issues/requirements.

One example could be CVEs. Are they incidents? And what about a "request to step a 3pp" even if there is no CVE related to it? It could be a downstream user who argues that a 3pp should be stepped in the upstream application due to some misbehavior, but not really due to an "incident". How could we capture that?

One way around it could be to go for the GitHub and GitLab nomenclature and call all of them "issues".

@afrittoli
Contributor Author

@afrittoli Those notes are fantastic! Some really good thoughts in there.

It might be good to narrow down what the minimum pieces of data are for the initial hack at this, since we both know that specs are easier to grow than to shrink ;) It would also be faster to get out. Then incremental additive changes could be driven by feedback (e.g. reported issues/feature requests).

Thanks @menehune23 - yes, definitely I agree we should start small.
But at the same time I would like to avoid starting in a direction that will require significant changes to grow.

I could see the absolute minimum for an incident at this point being:

  • A new subject type: incident

  • Two new predicates: reported and resolved

  • The following attributes (outside of the base CDEvents requirements like timestamp, etc.):

    • Environment
    • Artifact (useful for consumers wanting to link to a specific service/deployment)

I would prefer to use the service subject here, which is then associated with an artifact.
The incident itself is a degradation of some kind of a service, and it might not be associated with any specific artifact.

I agree that you might want an additional predicate to distinguish reported from began, but I would consider that more an additive change to make at a later time if it becomes a need.

+1

I was envisioning including a kind attribute in the initial data model, but that's also something we can add incrementally.
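
To illustrate what I have in mind, here is a rough Python sketch of the subject content with a service reference instead of an artifact and an optional kind. Everything here is an assumption for discussion: the class names, the `service` key and the keyword spellings are hypothetical, with the kind values loosely taken from the WG notes (response time, reliability, functional).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class IncidentKind(Enum):
    # Hypothetical keyword spellings based on the WG notes.
    RESPONSE_TIME = "response-time"
    RELIABILITY = "reliability"
    FUNCTIONAL = "functional"


@dataclass
class IncidentContent:
    environment_id: str
    service_id: str
    kind: Optional[IncidentKind] = None  # could be added incrementally

    def to_dict(self):
        """Render the content in the nested reference style used in this thread."""
        content = {
            "environment": {"id": self.environment_id},
            "service": {"id": self.service_id},
        }
        if self.kind is not None:
            content["kind"] = self.kind.value
        return content
```

Leaving `kind` optional means events produced without it stay valid if the attribute is introduced later.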

@afrittoli
Contributor Author

I believe we also need to have a crisp definition of what an "incident" is. I believe there might be some gray areas there between incidents and issues/requirements.

Thanks @e-backmark-ericsson for highlighting this, I agree there are grey areas and we should have a clear definition.
From my POV incident events model measurable disruptions in a production environment, like service not being accessible, degradation in response time, functionality etc.

One example could be CVEs. Are they incidents? And what about a "request to step a 3pp" even if there is no CVE related to it? It could be a downstream user who argues that a 3pp should be stepped in the upstream application due to some misbehavior, but not really due to an "incident". How could we capture that?

A CVE being exploited could be the root cause of an incident but I would not consider it an incident in itself.
I think CVEs, change requests and feature requests are something we could try and model in a different bucket.

One way around it could be to go for the GitHub and GitLab nomenclature and call all of them "issues".

I think I would prefer incidents to be separated from issues. An issue could be used to report an incident.

@afrittoli
Contributor Author

@menehune23 you're very welcome to join the CDEvents WG if that is something of interest.

@e-backmark-ericsson
Contributor

From my POV incident events model measurable disruptions in a production environment, like service not being accessible, degradation in response time, functionality etc.

Ok, so "bugs" are not necessarily incidents either then, but could be used as a type on an issue reporting an incident potentially?

@afrittoli afrittoli changed the title Define new incident (name TBD) events for the DORA metrics Define new incident events for the DORA metrics Jan 17, 2023
@afrittoli
Contributor Author

@e-backmark-ericsson @menehune23 @erkist @salaboy I started detailing the new incident events in a HackMD document. Please let me know if you have comments. If we can get some agreement on this as a starting point, I'd be happy to create a PR for it.

afrittoli added a commit to afrittoli/cdevents-spec that referenced this issue Jan 24, 2023
Introduce incident events.

TBD: schema and README updates

Partially-fixes: cdevents#59

Signed-off-by: Andrea Frittoli <andrea.frittoli@gmail.com>
@afrittoli afrittoli mentioned this issue Jan 24, 2023
4 tasks
afrittoli added a commit to afrittoli/cdevents-spec that referenced this issue Jan 30, 2023
Introduce incident events.

Partially-fixes: cdevents#59

Signed-off-by: Andrea Frittoli <andrea.frittoli@gmail.com>
afrittoli added a commit that referenced this issue Mar 9, 2023
Introduce incident events.

Partially-fixes: #59

Signed-off-by: Andrea Frittoli <andrea.frittoli@gmail.com>
3 participants