[server] track more startWorkspace failures #12378

Merged
roboquat merged 1 commit into main from sefftinge/create-an-alert-for-failing-12332 Aug 26, 2022

Conversation

svenefftinge
Member

Counts additional reasons for startWorkspace failures.

Related Issue(s)

fixes #12332

Release Notes

NONE
  • /werft with-preview

svenefftinge requested a review from a team August 25, 2022 10:38
github-actions bot added the team: webapp label Aug 25, 2022
svenefftinge force-pushed the sefftinge/create-an-alert-for-failing-12332 branch from 585dbb6 to 3678ffd August 25, 2022 12:24
@@ -244,6 +248,12 @@ export async function getWorkspaceClassForInstance(
}
}

class StartInstanceError extends Error {
constructor(public readonly reason: FailedInstanceStartReason, public readonly cause: Error) {
super("Starting workspace instance failed: " + cause.message);
Member

How is the reason value assigned on this subclass before it is used below?

if (err instanceof StartInstanceError) {
    failedReason = err.reason;
}

I don't have much experience with Errors in JS/TS, but here it looks like we're assigning it a potentially arbitrary string. My worry is that for the Prometheus metric we should ideally have a label with fixed cardinality, so we wouldn't want the variance that could come from this constructor taking an arbitrary message.
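
The pattern in the quoted snippet can be sketched end to end as follows. This is a minimal, self-contained illustration, not the server's code: the reason values other than "other" are hypothetical, and an in-memory Map (failedStarts, also hypothetical) stands in for the Prometheus counter. It shows that the label value is read from the typed reason property after instanceof narrowing, never from the error message, so a varying message cannot introduce new label values.

```typescript
// Hypothetical reason values: only "other" appears in this PR's diff.
type FailedInstanceStartReason = "clusterSelectionFailed" | "startOnClusterFailed" | "other";

class StartInstanceError extends Error {
    constructor(public readonly reason: FailedInstanceStartReason, public readonly cause: Error) {
        super("Starting workspace instance failed: " + cause.message);
        // Keep `instanceof` working when compiling down to ES5 targets.
        Object.setPrototypeOf(this, new.target.prototype);
    }
}

// Hypothetical in-memory stand-in for a Prometheus counter keyed by the reason label.
const failedStarts = new Map<FailedInstanceStartReason, number>();

function recordFailedStart(e: unknown): FailedInstanceStartReason {
    // Default for errors that carry no typed reason.
    let failedReason: FailedInstanceStartReason = "other";
    if (e instanceof StartInstanceError) {
        // Narrowed: `reason` is a typed property, not derived from the message text.
        failedReason = e.reason;
    }
    failedStarts.set(failedReason, (failedStarts.get(failedReason) ?? 0) + 1);
    return failedReason;
}

// Usage: the messages differ arbitrarily, but the label values stay within the union.
try {
    throw new StartInstanceError("startOnClusterFailed", new Error("connection refused: 10.0.0.7"));
} catch (e) {
    recordFailedStart(e);
}
try {
    throw new Error("unexpected failure");
} catch (e) {
    recordFailedStart(e);
}
```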

Member Author

The FailedInstanceStartReason type is restricted to only three values.
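
A sketch of how a string literal union bounds the label's cardinality. The two values besides "other" are hypothetical (the PR only confirms "other"), as is the helper isFailedInstanceStartReason for values that arrive untyped at runtime. Deriving the union from an as const tuple keeps the compile-time type and the runtime list of allowed values in sync.

```typescript
// Hypothetical values; the PR only confirms that "other" is one of the three.
const FAILED_INSTANCE_START_REASONS = ["clusterSelectionFailed", "startOnClusterFailed", "other"] as const;

// The union is derived from the tuple, so it admits exactly these three strings.
type FailedInstanceStartReason = (typeof FAILED_INSTANCE_START_REASONS)[number];

// const bad: FailedInstanceStartReason = "Starting workspace instance failed: timeout";
// ^ rejected at compile time: an arbitrary message is not assignable to the union.

// Hypothetical runtime guard for strings that did not come through the type system.
function isFailedInstanceStartReason(value: string): value is FailedInstanceStartReason {
    return (FAILED_INSTANCE_START_REASONS as readonly string[]).indexOf(value) !== -1;
}

// Upper bound on the number of distinct values the reason label can take.
const maxLabelCardinality = FAILED_INSTANCE_START_REASONS.length;
```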

Member

easyCZ left a comment

Spoke on a call. Sven explained that declaring the constructor parameter as public readonly reason: FailedInstanceStartReason makes reason a property of the object. I was not familiar with that syntax before.

This resolves my question.

Thanks for the changes!
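
For readers who, like the reviewer, haven't met this syntax: TypeScript parameter properties let a modifier such as public or readonly on a constructor parameter both declare a class property and assign it. A minimal sketch comparing the two forms; the class names and the reason values other than "other" are hypothetical.

```typescript
// Hypothetical values; only "other" appears in this PR's diff.
type FailedInstanceStartReason = "clusterSelectionFailed" | "startOnClusterFailed" | "other";

// Parameter-property form: the modifiers declare the fields and assign them.
class WithParameterProperties {
    constructor(public readonly reason: FailedInstanceStartReason, public readonly cause: Error) {}
}

// Equivalent explicit form: fields declared separately, assigned in the constructor.
class WithExplicitFields {
    public readonly reason: FailedInstanceStartReason;
    public readonly cause: Error;
    constructor(reason: FailedInstanceStartReason, cause: Error) {
        this.reason = reason;
        this.cause = cause;
    }
}

const viaParams = new WithParameterProperties("other", new Error("boom"));
const viaFields = new WithExplicitFields("other", new Error("boom"));
```

Both instances end up with reason and cause as own properties; the parameter-property form is simply shorter.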

roboquat merged commit dda2ebd into main Aug 26, 2022
roboquat deleted the sefftinge/create-an-alert-for-failing-12332 branch August 26, 2022 13:06
roboquat added the deployed: webapp and deployed labels Aug 29, 2022
@@ -414,6 +424,11 @@ export class WorkspaceStarter {
forceRebuild,
);
} catch (e) {
let failedReason: FailedInstanceStartReason = "other";
Member

@geropl this is the value counted by the alert expression rate(gitpod_server_instance_starts_failed_total{}[$__rate_interval]).
I'm not sure it makes sense to have "other" there if that's the default?

Member

You're 💯 right, it doesn't (with the current quality of errors we're receiving). PR to fix this: #12574

Labels
  • deployed: webapp (Meta team change is running in production)
  • deployed (Change is completely running in production)
  • release-note-none
  • size/M
  • team: webapp (Issue belongs to the WebApp team)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create an alert for failing image-builds (in app clusters)
5 participants