@@ -39,6 +39,12 @@ Refer to the [general Runbook for investigating cert-manager alerts]({{< ref "ce
- **certManagerAbsent**: cert-manager is down or not reachable by Prometheus.
- **certManagerHittingRateLimits**: cert-manager is hitting Let's Encrypt rate limits, which may prevent certificate generation for up to a week.

### DNS

CoreDNS alerts are defined in our [aurora-platform-charts](https://github.com/gccloudone-aurora/aurora-platform-charts/tree/main/stable/aurora-platform/charts/aurora-core/conf/prometheus_rules/kube_prometheus_stack/coredns_alerts).

Refer to the [general Runbook for investigating DNS-related alerts]({{< ref "dns" >}}).

### Miscellaneous

These alerts are defined in our [aurora-platform-charts](https://github.com/gccloudone-aurora/aurora-platform-charts/tree/main/stable/aurora-platform/charts/aurora-core/conf/prometheus_rules/kube_prometheus_stack).
34 changes: 34 additions & 0 deletions content/en/team/monitoring-alerts/alert-cluster-level/dns.en.md
@@ -0,0 +1,34 @@
---
title: "DNS Alerts"
linkTitle: "DNS Alerts"
weight: 5
aliases: ["/team/monitoring/clusteralerts/dns"]
draft: false
---

<gcds-alert alert-role="danger" container="full" heading="Avis de traduction" hide-close-btn="true" hide-role-icon="false" is-fixed="false" class="hydrated mb-400">
<gcds-text>Veuillez noter que ce document est actuellement en cours de développement actif et pourrait être sujet à des révisions. Une fois terminé, il sera entièrement traduit en français et mis à disposition dans sa version finale.</gcds-text>
</gcds-alert>

## Overview

DNS-related issues can occur at two levels:

- Cloud Service Provider (CSP) networking level
- Kubernetes level (usually an issue with CoreDNS)

If the application logs show DNS lookup errors when resolving an external name such as `google.com`, the issue most likely stems from the DNS servers configured by the CSP. Collect the relevant information (the environment, the source IP of the Pod, and the host that is not resolving correctly) and pass it to the point of contact responsible for managing the CSP infrastructure for Aurora.

If the application logs include entries similar to:

```
Could not resolve host: someservice.namespace.svc.cluster.local
```

This indicates that CoreDNS is failing to resolve a Service. Refer to the information below to investigate and resolve the issue.
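
The triage rule above can be sketched as a small shell helper. This is only an illustration: `classify_dns_failure` is a hypothetical name, and the check assumes the cluster domain is the default `cluster.local`. It is a heuristic on the fully qualified name only, so short Service names (for example `someservice.namespace`) would need to be expanded before being classified.

```shell
# classify_dns_failure HOST
# Prints "kubernetes" when HOST looks like a cluster-internal Service name,
# and "csp" for anything else (for example an external name).
# Assumption: the cluster domain is the default "cluster.local".
classify_dns_failure() {
  case "$1" in
    *.svc.cluster.local|*.svc.cluster.local.) echo "kubernetes" ;;
    *) echo "csp" ;;
  esac
}
```

For example, `classify_dns_failure someservice.namespace.svc.cluster.local` prints `kubernetes`, while `classify_dns_failure google.com` prints `csp`.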

## Alert: CoreDNSDown

Check the `coredns` Deployment in the `kube-system` namespace and ensure that at least one replica is healthy.
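
One possible way to run this check from a terminal is sketched below. The function names `has_healthy_replica` and `check_coredns` are hypothetical helpers, not part of any existing tooling, and the sketch assumes `kubectl` is configured for the affected cluster with read access to `kube-system`.

```shell
# has_healthy_replica COUNT
# Succeeds when COUNT (the number of ready replicas) is at least 1.
# An empty value is treated as 0, because the readyReplicas status field
# may be absent when no replica is ready.
has_healthy_replica() {
  [ "${1:-0}" -ge 1 ]
}

# check_coredns
# Reads the ready-replica count of the coredns Deployment and reports it.
# Requires kubectl and access to the kube-system namespace.
check_coredns() {
  ready="$(kubectl -n kube-system get deployment coredns -o jsonpath='{.status.readyReplicas}')"
  if has_healthy_replica "$ready"; then
    echo "coredns: ${ready} ready replica(s)"
  else
    echo "coredns: no ready replicas" >&2
    return 1
  fi
}
```

Calling `check_coredns` returns a non-zero status when no replica is ready, so it can be chained with further diagnostics such as `kubectl -n kube-system describe deployment coredns`.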

## Other DNS-related issues

If the DNS issue is at the Kubernetes level, refer to [this runbook](https://containersolutions.github.io/runbooks/posts/kubernetes/dns-failures/#overview) (credit to Ian Miell) for investigating and resolving DNS failures.