
Crons: Linking monitors to errors #43647

Open
evanpurkhiser opened this issue Jan 24, 2023 · 9 comments
@evanpurkhiser
Member

evanpurkhiser commented Jan 24, 2023

Problem Statement

When using the Sentry Crons feature, errors produced during execution of a monitored task should be associated with the monitor ID. We should also associate errors and transactions with the specific check-in that is reported.

What exists now?

Currently monitors support linking to errors that occur during execution via the monitor.id tag.

This tag is promoted from the MonitorContext.

Right now we are manually setting up this context in a few different places.

  • When a checkin fails it sets the context on the created “checkin failed” error event

    🤔 Note that it also manually sets the monitor.id in the tags. I believe this may be a mistake since the context promotes the id to a tag

  • The sentry monitor utils that are used in our celery tasks set the monitor context to associate errors to the monitor
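To make the promotion concrete, here is a minimal sketch (an assumption about the ingest behavior, not the actual Sentry code) of how the monitor context's id field gets promoted to the monitor.id tag:

```python
# Sketch (assumption): promotion of the "monitor" context's id to a tag.
def promote_monitor_tag(event: dict) -> dict:
    monitor = event.get("contexts", {}).get("monitor", {})
    if "id" in monitor:
        # Because of this promotion, manually setting the monitor.id tag
        # as well is redundant, which is the suspected mistake noted above.
        event.setdefault("tags", {})["monitor.id"] = monitor["id"]
    return event
```

If promotion works this way, setting the context alone is sufficient for the tag-based linking that exists today.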

So what’s the problem?

This works, but it has a few issues:

  1. Checkin failure events do not track which particular checkin triggered the failure
  2. Users have to manually set the monitors context to associate an error to a particular monitor
  3. Even when the monitor.id context is manually set, there is no association of error to checkin.

Proposed checkin association strategy

We should use our Trace ID to associate check-ins with errors and transactions. Realistically this means doing the following:

  1. A new trace_id UUID column should be added to the MonitorCheckin table. This should be indexed, as we will use it to look up associated check-ins

  2. The monitor context should continue to exist on errors so that we can easily query for errors that we know are part of a monitor

  3. The trace_id should be provided as part of the checkin API request. It will not be required

    🤔 Open Question: Should we generate a trace ID if they do not pass one, and if so, should we return it as part of the check-in API response?

Proposed integration with sentry-cli and SDKs

Both the sentry-cli sending check-ins and the SDKs producing errors and traces must be aware of the monitor context and the Trace ID being used for the monitor run. Here’s what needs to happen:

  1. The sentry-cli should generate a Trace ID and send that with the checkins.
  2. The sentry-cli monitor run <monitor_id> -- <command> should set two environment variables during execution of <command>
    1. SENTRY_TRACE_ID: The Trace ID generated by the CLI
      feat(monitors): Pass SENTRY_TRACE_ID down execution path sentry-cli#1441
    2. SENTRY_MONITOR_ID: The Monitor UUID
      feat(monitors): Pass SENTRY_MONITOR_ID to executing process sentry-cli#1438
  3. Sentry SDKs should be updated to understand both of these environment variables
    1. SENTRY_TRACE_ID should be used in place of the SDK generating its own trace ID
    2. SENTRY_MONITOR_ID should be used to setup the monitor context of any events that occur during execution
      feat(crons): Add CronMonitorIntegration sentry-python#1866

Updates

  1. At the moment, propagating a SENTRY_TRACE_ID is more involved than the scope of this ticket allows; see Crons: Linking monitors to errors #43647 (comment)
@getsantry
Contributor

getsantry bot commented Jan 24, 2023

Routing to @getsentry/crons for triage, due by Thu, January 26th at 13:08 (sfo). ⏲️

evanpurkhiser added a commit to getsentry/sentry-cli that referenced this issue Jan 24, 2023
This adds `SENTRY_MONITOR_ID` to the environ of the executing command.

This is in support of getsentry/sentry#43647
and will allow SDKs embedded in the executing command to associate the
monitor context to any events produced by the SDK.
@mitsuhiko
Member

mitsuhiko commented Jan 26, 2023

This is a change to this proposal:

Sentry SDKs should be updated to understand both of these environment variables

SENTRY_TRACE_ID should be used in place of the SDK generating its own trace ID

For the combination of dynamic sampling and trace propagation to work we need three pieces of information:

  • The entire traceparent (sentry-trace see docs here) which contains trace_id, parent_span_id, and the sampling flags which at present is basically the sampling decision (where 1 means sampled)
  • The sentry- prefixed items of the baggage (called dynamic sampling context or DSC). These are needed to achieve consistent head based sampling decisions across multiple separated segments of the trace. The DSC is documented here.

We have defined how this information is carried in HTTP headers, Celery tasks, and I believe Kafka headers. I do not believe we have defined a format for how this is to be transmitted in environment variables, but that should be trivial enough. It's probably a good idea to just define environment variables and use the HTTP serialization. This would imply:

SENTRY_BAGGAGE="other-vendor-value-1=foo;bar;baz, sentry-trace_id=771a43a4192642f0b136d5159a501700, sentry-public_key=49d0f7386ad645858ae85020e393bef3, sentry-sample_rate=0.01337, sentry-user_id=Am%C3%A9lie, other-vendor-value-2=foo;bar;"
SENTRY_TRACE="771a43a4192642f0b136d5159a501700-6b945b814e6c2434-1"

The rest of the system would work the same way as documented.

The monitor id itself could become part of the baggage with sentry-monitor-id=... if that plays a role.
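The SENTRY_TRACE value above follows the documented sentry-trace serialization, trace_id-parent_span_id-sampled. A small parser sketch of that format:

```python
def parse_sentry_trace(value: str) -> dict:
    """Parse the sentry-trace serialization: trace_id-parent_span_id[-sampled],
    where a trailing "1" means the trace was sampled."""
    parts = value.strip().split("-")
    result = {
        "trace_id": parts[0],
        "parent_span_id": parts[1],
        "sampled": None,  # absent flag means the decision is deferred
    }
    if len(parts) > 2 and parts[2]:
        result["sampled"] = parts[2] == "1"
    return result
```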

@evanpurkhiser
Member Author

evanpurkhiser commented Jan 31, 2023

Had some discussion with @cleptric and @AbhiPrasad for the SDK side of this.

It's looking like the trace propagation part of this plan is a bit more work than expected.

Right now the trace identifier needs to be associated with a span in order to propagate in the SDK; there is no mechanism to hold a trace ID in any sort of context without a span. SDKs that implement tracing will always either start a trace from a new transaction/root-span, OR continue a trace and immediately start a new span (such as when continuing from a frontend request).

Because a monitor run should not start a new span, there's currently no way to propagate a Trace ID from the CLI.

I see two paths forward for associating the checkin with errors:

  1. @mitsuhiko is working with SDK teams now to see if we can come up with a plan for SDKs to hold a trace ID outside of a transaction/span, using this we'll be able to propagate a trace from sentry-cli monitor run

  2. We can use the same monitor context that we're storing the monitor ID on to also store a checkin ID. This gives us the same functionality of associating checkins → errors, but is less "correct" in the sense that what we're modeling really is a trace.

I am leaning towards implementing number two while we work towards number one. We're already using the monitor context to maintain the monitor ID; there's no reason we can't associate the checkin ID as well.

@dcramer
Member

dcramer commented Jan 31, 2023

@mitsuhiko do we have an ETA on when we could solve the trace ID propagation? The goal is to push weekly-level progress w/ Crons given its stage, so if that's going to take us ages we need to find a faster path to the cron linking (so maybe in the near term we just do this monitor thing... but even an SDK release still takes time, so hoping we could accelerate)

@mitsuhiko
Member

The RFC for the trace propagation was merged. Maybe @antonpirker can give some updates here on how long it takes to implement this for Python.

https://github.com/getsentry/rfcs/blob/main/text/0071-continue-trace-over-process-boundaries.md

@evanpurkhiser
Member Author

We'll also need to implement some type of trace generation within Sentry CLI. I assume the Rust SDK doesn't already expose anything to do that, right?

@evanpurkhiser
Member Author

evanpurkhiser commented Feb 21, 2023

Going to move this back into planned. We will be focusing on an improved ingestion experience which will allow this work to be done in a simpler way without requiring each SDK to be updated to send monitor contexts.

@getsantry
Contributor

getsantry bot commented Sep 7, 2023

Routing to @getsentry/product-owners-crons for triage ⏲️

@antonpirker
Member

Sorry for not answering your request @mitsuhiko
FYI: Loading trace information from environment headers was released in version 1.16.0 of the Python SDK: getsentry/sentry-python#2176

Documentation is still missing, I think (at least I was not able to find it right now...)
