Production - [Alerting] Android devices disconnected #1383

Closed
dotnet-eng-status bot opened this issue Nov 8, 2023 · 32 comments
Labels
  • Active Alert: Issues from Grafana alerts that are now active
  • Critical
  • Grafana Alert: Issues opened by Grafana
  • Ops - First Responder
  • Production: Tied to the Production environment (as opposed to Staging)

Comments

@dotnet-eng-status

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-105} 100

Go to rule

@dotnet/dnceng, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-35f560112f7a4bfabf9fd69bc1bd76fa

dotnet-eng-status bot added the Active Alert, Critical, Grafana Alert, Ops - First Responder, and Production labels on Nov 8, 2023
@dkurepa
Member

dkurepa commented Nov 8, 2023

DNCENGWIN-105 has been disabled, and here's the ICM: https://portal.microsofticm.com/imp/v3/incidents/incident/439700036/summary

Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-105} 100

Go to rule

Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-105} 100

Go to rule

Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-105} 100

Go to rule

Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-105} 100

Go to rule

Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-105} 100

Go to rule

dotnet-eng-status bot added the Inactive Alert label and removed the Active Alert label on Nov 11, 2023
Author

💚 Metric state changed to ok

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

Go to rule

dotnet-eng-status bot removed the Inactive Alert label on Nov 11, 2023
Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-115} 100

Go to rule

dotnet-eng-status bot added the Active Alert label on Nov 11, 2023
Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-115} 100

Go to rule

Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-115} 100

Go to rule

Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-115} 100

Go to rule

Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-115} 100

Go to rule

Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-115} 100

Go to rule

Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-115} 100

Go to rule

Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-103} 100
  • FailureRate {Machine=DNCENGWIN-106} 100
  • FailureRate {Machine=DNCENGWIN-115} 100

Go to rule

@missymessa
Member

Re-enabled DNCENGWIN-105 per DDFUN's update: https://portal.microsofticm.com/imp/v3/incidents/incident/439700036/summary

@missymessa
Member

Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-103} 100
  • FailureRate {Machine=DNCENGWIN-106} 100
  • FailureRate {Machine=DNCENGWIN-115} 100

Go to rule

Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-103} 100
  • FailureRate {Machine=DNCENGWIN-106} 100
  • FailureRate {Machine=DNCENGWIN-115} 100

Go to rule

@missymessa
Member

Added DNCENGWIN-103 and DNCENGWIN-106 back to the queue. DDFUN needs additional intervention with DNCENGWIN-115, however.

Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-115} 100

Go to rule

Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-115} 100

Go to rule

Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-115} 100

Go to rule

Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-115} 100

Go to rule

dotnet-eng-status bot added the Inactive Alert label and removed the Active Alert label on Nov 18, 2023
Author

💚 Metric state changed to ok

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

Go to rule

dotnet-eng-status bot added the Active Alert label and removed the Inactive Alert label on Nov 20, 2023
Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-038} 100

Go to rule

@dougbu
Member

dougbu commented Nov 20, 2023

offlined DNCENGWIN-038. filed a new ticket similar to the ones mentioned above: https://portal.microsofticm.com/imp/v3/incidents/incident/443602374/summary

Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-038} 100

Go to rule

Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-038} 100

Go to rule

AlitzelMendez modified the milestone: Autoscaler vNext on Nov 21, 2023
Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-038} 100

Go to rule

Author

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=DNCENGWIN-038} 100

Go to rule

@dougbu
Member

dougbu commented Nov 22, 2023

ticket resolved. reenabled the machine and opportunistically closing this issue since that should leave nothing disabled.
