Limit monitor in-progress concurrency #47674

@evanpurkhiser

At the moment there is no limit on the number of "concurrent checkins" that can be in progress at the same time (aside from what the rate limit enforces).

This can lead to a single monitor having a large number of in-progress checkins at once. As currently implemented, the check monitors task looks at all in-progress checkins to determine whether a monitor is past its maximum runtime (this may be improved with #49283). Because of this, a monitor with many in-progress checkins puts a strain on the system.

In addition to the system concerns, from a user perspective, it may not always be desirable for a monitor to have overlapping checkins.

We should consider limiting the number of possible in progress checkins.

There are two approaches to this:

  1. Enforce a hard limit on concurrency. We could start by saying you can only have 5 in-progress checkins.

  2. Configurable concurrency -- This could look like an option that the user sets for "do not allow overlapping jobs". Ticking this would disable monitors receiving new checkins while there are IN_PROGRESS checkins.

    [!!]: In theory we could now have our APIs work such that creating a new checkin would always affect the most recent in progress checkin when this option is on. But we probably don't want to do this, since it would make the API more confusing.
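A minimal sketch of how the two approaches could share one check, assuming nothing about the actual Sentry models. `ConcurrencyPolicy`, `allow_overlap`, `hard_limit`, and `can_start_checkin` are hypothetical names used only for illustration; the default of 5 is the example figure from approach 1.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical default, taken from the "5 in-progress checkins" example above.
DEFAULT_HARD_LIMIT = 5


@dataclass
class ConcurrencyPolicy:
    """Hypothetical per-monitor settings covering both approaches."""

    # Approach 1: a hard cap on in-progress checkins (None means no cap).
    hard_limit: Optional[int] = DEFAULT_HARD_LIMIT
    # Approach 2: the "do not allow overlapping jobs" option, inverted.
    allow_overlap: bool = True

    def max_in_progress(self) -> Optional[int]:
        # Disallowing overlap is the same as a limit of one in-progress checkin.
        if not self.allow_overlap:
            return 1
        return self.hard_limit


def can_start_checkin(policy: ConcurrencyPolicy, in_progress_count: int) -> bool:
    """Return True if a new in-progress checkin may be created."""
    limit = policy.max_in_progress()
    return limit is None or in_progress_count < limit
```

Treating "no overlapping jobs" as a limit of 1 means the ingestion path would only need a single count comparison, whichever approach is chosen.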

Proposed implementation

Implementing this would involve the following:

### Tasks
- [ ] https://github.com/getsentry/sentry/pull/48308
- [ ] Update ingestion APIs (APIs, consumer) to query whether the monitor environment has any `IN_PROGRESS` checkins. Depending on the logic for how many concurrent checkins are allowed, we can then decide: if the checkin is not going to complete or update an existing `IN_PROGRESS` checkin, we can reject it.
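
The ingestion decision described in the task above might look roughly like this. It is a sketch over an in-memory mapping rather than the real checkin models or queries; `Status`, `Decision`, and `decide` are hypothetical names, and `limit` stands in for whatever concurrency rule ends up being chosen.

```python
from enum import Enum
from typing import Dict, Optional


class Status(Enum):
    IN_PROGRESS = "in_progress"
    OK = "ok"
    ERROR = "error"


class Decision(Enum):
    CREATE = "create"
    UPDATE = "update"
    REJECT = "reject"


def decide(
    existing: Dict[str, Status],
    checkin_id: Optional[str],
    limit: int,
) -> Decision:
    """Decide what to do with an incoming checkin for one monitor environment.

    `existing` maps checkin IDs to their current status (a stand-in for a
    database query). `checkin_id` is the ID the client sent, if any.
    """
    # A checkin that completes or updates an existing IN_PROGRESS checkin
    # is always allowed through.
    if checkin_id is not None and existing.get(checkin_id) is Status.IN_PROGRESS:
        return Decision.UPDATE

    # Otherwise it would start a new checkin, so apply the concurrency limit.
    in_progress = sum(1 for s in existing.values() if s is Status.IN_PROGRESS)
    if in_progress >= limit:
        return Decision.REJECT
    return Decision.CREATE
```

For example, with `limit=1` and one `IN_PROGRESS` checkin, a request carrying that checkin's ID is treated as an update, while a request without a matching ID is rejected.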

Additional considerations

  • We may want to update the checkins list with a way to explicitly mark checkins as canceled from within the UI.

    This covers the case where a customer sends a checkin, fails to keep track of its UUID, and has a very high timeout set for it (which, after Crons: Introduce timeout_at column on checkins #49283, could not be reconfigured once the checkin starts). They would then be unable to mark it as failed / done / canceled, and all their other checkins would fail until it passes its timeout.
