copy in information that came up in the failure scenarios #1262

Merged 3 commits on Feb 13, 2023
2 changes: 1 addition & 1 deletion docs/configuration/environment-variables.md
@@ -61,7 +61,7 @@ A list of strings representing the host/domain names that this Django site can s

!!! warning "Deployment configuration"

You may change this setting when deploying the app to a non-localhost domain
Do not enable this in production

!!! tldr "Django docs"

2 changes: 2 additions & 0 deletions docs/deployment/infrastructure.md
@@ -78,6 +78,8 @@ The following things in Azure are managed by the California Department of Techno
- [Resource Groups](https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/manage-resource-groups-portal)
- Networking
- Front Door
- Web Application Firewall (WAF)
- Distributed denial-of-service (DDoS) protection
- IAM
- Service connections

46 changes: 38 additions & 8 deletions docs/deployment/troubleshooting.md
@@ -1,14 +1,16 @@
# Troubleshooting

## Monitoring
## Tools

### Monitoring

We have [ping tests](https://docs.microsoft.com/en-us/azure/azure-monitor/app/monitor-web-app-availability) set up to notify about availability of each [environment](../infrastructure/#environments). Alerts go to [#benefits-notify](https://cal-itp.slack.com/archives/C022HHSEE3F).

## Logs
### Logs

Logs can be found in a couple of places:

### Azure App Service Logs
#### Azure App Service Logs

[Open the `Logs` for the environment you are interested in.](https://docs.google.com/document/d/11EPDIROBvg7cRtU2V42c6VBxcW_o8HhcyORALNtL_XY/edit#heading=h.6pxjhslhxwvj) The following tables are likely of interest:

@@ -18,7 +20,7 @@ Logs can be found a couple of places:

For some pre-defined queries, click `Queries`, then `Group by: Query type`, and look under `Query pack queries`.

### [Azure Monitor Logs](https://docs.microsoft.com/en-us/azure/azure-monitor/logs/data-platform-logs)
#### [Azure Monitor Logs](https://docs.microsoft.com/en-us/azure/azure-monitor/logs/data-platform-logs)

[Open the `Logs` for the environment you are interested in.](https://docs.google.com/document/d/11EPDIROBvg7cRtU2V42c6VBxcW_o8HhcyORALNtL_XY/edit#heading=h.n0oq4r1jo7zs)

@@ -31,19 +33,23 @@ In the latter two, you should see recent log output. Note [there is some latency

See [`Failures`](https://docs.microsoft.com/en-us/azure/azure-monitor/app/asp-net-exceptions#diagnose-failures-using-the-azure-portal) in the sidebar (or `exceptions` under `Logs`) for application errors/exceptions.

### Live tail
#### Live tail

After [setting up the Azure CLI](#making-changes), you can use the following command to [stream live logs](https://docs.microsoft.com/en-us/azure/app-service/troubleshoot-diagnostic-logs#in-local-terminal):

```sh
az webapp log tail --resource-group RG-CDT-PUB-VIP-CALITP-P-001 --name AS-CDT-PUB-VIP-CALITP-P-001 2>&1 | grep -v /healthcheck
```

### SCM
#### SCM

<https://as-cdt-pub-vip-calitp-p-001-dev.scm.azurewebsites.net/api/logs/docker>

## Terraform lock
## Specific issues

This section serves as the [runbook](https://www.pagerduty.com/resources/learn/what-is-a-runbook/) for Benefits.

### Terraform lock

[General info](https://developer.hashicorp.com/terraform/language/state/locking)

@@ -54,7 +60,29 @@ If Terraform commands fail (locally or in the Pipeline) due to an `Error acquiri
1. **Do any engineers have a Terraform command running locally?** You'll need to ask them. For example, they may have started an `apply` that is waiting for them to [approve](https://developer.hashicorp.com/terraform/cli/commands/apply#automatic-plan-mode) it. They will need to exit (gracefully) for the lock to be released.
1. **If none of the steps above identified the source of the lock**, and especially if the `Created` time is more than ten minutes ago, that probably means the last Terraform command didn't release the lock. You'll need to grab the `ID` from the `Lock Info` output and [force unlock](https://developer.hashicorp.com/terraform/language/state/locking#force-unlock).
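
As a rough sketch of that last step, assuming the `terraform` CLI is on your `PATH` and you are in the relevant Terraform working directory, a small wrapper could look like the following. The function name `release_stale_lock` is hypothetical and not part of this repository; `terraform force-unlock` is the real subcommand, and it asks for confirmation before releasing the lock.

```sh
# Hypothetical helper (an illustration, not part of the Benefits repo):
# release a stale Terraform state lock.
# Usage: release_stale_lock <lock-id>   # the ID from the `Lock Info` output
release_stale_lock() {
  if [ -z "${1:-}" ]; then
    echo "usage: release_stale_lock <lock-id>" >&2
    return 1
  fi
  # Prompts for confirmation before unlocking.
  terraform force-unlock "$1"
}
```

Only run this once you are reasonably sure no Pipeline run or engineer still holds the lock.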

## Eligibility Server
### App fails to start

If the container fails to start, you should see a [downtime alert](#monitoring). Assuming this app version was working in another [environment](../infrastructure/#environments), the issue is likely due to misconfiguration. Some things you can do:

- Check the [logs](#logs)
- Ensure the [environment variables](../../configuration/environment-variables/) and [configuration data](../../configuration/data/) are set properly.
- [Turn on debugging](../../configuration/environment-variables/#django_debug)
- Force-push/revert the [environment](../infrastructure/#environments) branch back to the old version to roll back
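
The rollback in the last bullet could be sketched as follows. This is only an illustration: the function name `rollback_environment` and the remote name `origin` are assumptions, and you need to supply the real environment branch and the last known-good commit yourself.

```sh
# Hypothetical helper: point an environment branch back at a known-good commit.
# Usage: rollback_environment <environment-branch> <known-good-commit>
rollback_environment() {
  if [ "$#" -ne 2 ]; then
    echo "usage: rollback_environment <environment-branch> <known-good-commit>" >&2
    return 1
  fi
  # Refresh remote-tracking refs so --force-with-lease compares against
  # the current state of the remote.
  git fetch origin || return 1
  # Overwrite the remote branch, refusing if someone else pushed since the fetch.
  git push --force-with-lease origin "$2:refs/heads/$1"
}
```

`--force-with-lease` is used here instead of a plain `--force` so the push is rejected if the branch moved unexpectedly; that is a design choice of this sketch, not something the runbook prescribes.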

### Littlepay API issue

Littlepay API issues may show up as:

- The [monitor](https://github.com/cal-itp/benefits/actions/workflows/check-api.yml) failing
- The `Connect your card` button not working

A common cause of Littlepay API failures is an expired certificate. To resolve:

1. Reach out to <support@littlepay.com>
1. Receive a new certificate
1. Put that certificate into the [configuration data](../../configuration/data/) and/or the [GitHub Actions secrets](https://github.com/cal-itp/benefits/settings/secrets/actions)
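
Before contacting Littlepay, it can help to confirm that expiry is really the problem. The sketch below assumes you have the certificate available as a local PEM file and that `openssl` is installed; the function name `check_cert_expiry` and the 30-day default are illustrative choices, not project conventions.

```sh
# Hypothetical helper: report when a PEM certificate expires and warn if it is
# already expired or will expire soon.
# Usage: check_cert_expiry <path-to-pem> [days]
check_cert_expiry() {
  cert="$1"
  days="${2:-30}"
  # Print the certificate's notAfter date.
  openssl x509 -noout -enddate -in "$cert" || return 2
  # -checkend exits non-zero if the certificate expires within the given
  # number of seconds.
  if openssl x509 -noout -checkend "$((days * 86400))" -in "$cert" >/dev/null; then
    echo "OK: valid for at least $days more days"
  else
    echo "WARNING: expired or expiring within $days days"
    return 1
  fi
}
```

The same check can be run against the replacement certificate before it is stored, to make sure the new file is the one with the later expiry date.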

### Eligibility Server

If the Benefits application gets a 403 error when trying to make API calls to the [Eligibility Server](https://docs.calitp.org/eligibility-server/), it may be because the outbound IP addresses changed, and the Eligibility Server firewall is still restricting access to the old IP ranges.

@@ -64,3 +92,5 @@ If the Benefits application gets a 403 error when trying to make API calls to th
1. Click `Edit`
1. Click `Variables`
1. Update the relevant variable with the new list of CIDRs
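
To find the current outbound addresses to compare against the firewall list, something like the following may help. It assumes the Azure CLI is set up and reuses the resource group and App Service names from the live tail command above as defaults. The helper names `to_cidrs` and `benefits_outbound_cidrs` are hypothetical, and the exact format the variable expects is not shown here, so adjust the output as needed.

```sh
# Hypothetical helper: turn a comma-separated IP list into one /32 CIDR per line.
to_cidrs() {
  tr ',' '\n' | sed -e '/^$/d' -e 's|$|/32|'
}

# Print the App Service's current outbound IPs as CIDRs.
# Defaults are the names used in the live tail command; override as needed.
# Usage: benefits_outbound_cidrs [resource-group] [app-name]
benefits_outbound_cidrs() {
  az webapp show \
    --resource-group "${1:-RG-CDT-PUB-VIP-CALITP-P-001}" \
    --name "${2:-AS-CDT-PUB-VIP-CALITP-P-001}" \
    --query outboundIpAddresses \
    --output tsv | to_cidrs
}
```

`outboundIpAddresses` lists the addresses currently in use; App Service also reports `possibleOutboundIpAddresses`, which may be the safer set to allow if the addresses change again.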

Note that the Eligibility Server has nightly downtime while it restarts and loads new data.