
Technical Troubleshooting

Ian Liu edited this page May 27, 2024 · 10 revisions

This page collects technical tips and troubleshooting information for FAM. (It can later be expanded into a developer/support guide or playbooks.)

Terraform deployment error - Failed to request discovery document: 500 Internal Server Error

This appears to be a transient error involving connectivity/availability of Terraform Cloud (see https://github.com/hashicorp/terraform/issues/22774). Restart the failed job; it will very likely not encounter this error again.

Terraform deployment error - Decoding an AWS encoded error message

If a Terraform run in any environment (Tools/DEV/TEST/PROD) fails with an encoded error message (similar to: Encoded authorization failure message: juGPyqKA3Cwc4xRFXlpzBxSpr9ia46l5PaNAgHwiTRezn8bq0eecGxtKq5zEkrVXZc...), you can decode it with these steps (the easy way):

  • Log on to the AWS console for the environment where the error occurred.
  • Open the "CloudShell".
  • Run the following commands (replace the placeholder with your encoded error message):
    msg='[Your Error Message]'
    aws sts decode-authorization-message --encoded-message "$msg" --output text | sed 's/,/\n\r/g' | sed 's/{//g' | sed 's/}//g' | sed 's/"//g'
    
  • The output is the decoded message in a readable form.
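The cleanup part of the pipeline above can be seen on a stand-in payload. This is a sketch only: the JSON below is illustrative, not a real decoded authorization message, and `tr` is used here in place of the first `sed` for portability.

```shell
# Stand-in for the JSON that `aws sts decode-authorization-message` returns;
# split on commas, then strip braces and quotes, as in the steps above.
decoded='{"allowed":"false","action":"kms:ScheduleKeyDeletion"}'
echo "$decoded" | tr ',' '\n' | sed 's/[{}"]//g'
```

Each `key:value` pair lands on its own line, which makes long authorization failures much easier to scan.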

Terraform server deployment succeeds - but the Smoke Test fails

The failure message indicates the smoke test did not get a successful response. The cause is not confirmed, but the Lambda behind the AWS API Gateway integration may still be waking up (cold start).

  • Wait a few minutes, or hit the smoke-test endpoint several times.
  • Then re-run the failed pipeline; it should pass.
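The two steps above can be sketched as a small retry loop. This is a hedged sketch: `SMOKE_URL` is a placeholder for the real smoke-test endpoint, and the retry count and delay are arbitrary choices, not values from the pipeline.

```shell
# Retry a command a few times with a delay between attempts, to ride out
# a Lambda cold start. RETRY_DELAY can be overridden (defaults to 10s).
retry() {
  local n=0 max=5
  until "$@"; do
    n=$((n + 1))
    [ "$n" -ge "$max" ] && return 1
    sleep "${RETRY_DELAY:-10}"
  done
}
# Example (SMOKE_URL is a placeholder):
# retry curl -sf "$SMOKE_URL"
```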

Terraform deployment issues related to secrets.

Terraform Destroy

Most often in the AWS "Tools" space, the team may decide to run terraform "destroy" and rebuild the entire infrastructure.

  • Encountering "not authorized to perform: kms:ScheduleKeyDeletion"

    The destroy will probably fail on KMS key deletion. FAM currently has a key designed for BCSC encryption set up in the KMS store, and AWS policy (a service control policy) does not permit deleting KMS keys. If this happens, the error can simply be ignored; the rest of the destroy should complete.

│ Error: deleting KMS Key (26dff151-a814-4ad0-a729-f4b8e4c41eda): operation error KMS: ScheduleKeyDeletion, https response error StatusCode: 400, RequestID: 8959773a-8e80-468c-b1e1-e429072e3393, api error AccessDeniedException: User: arn:aws:sts::377481750915:assumed-role/FAM_GHA_ROLE/server-tools-deployment is not authorized to perform: kms:ScheduleKeyDeletion on resource: arn:aws:kms:ca-central-1:377481750915:key/26dff151-a814-4ad0-a729-f4b8e4c41eda with an explicit deny in a service control policy

Terraform Deployment (Plan/Apply)

  • Error: a secret with this name is already scheduled for deletion.

    A secret in AWS Secrets Manager can be deleted (not immediately), but by default deletion is scheduled with a minimum 7-day recovery window. If you previously ran "destroy" on the infrastructure, its secrets will be scheduled for deletion, and the next deployment will fail with "You can't create this secret because a secret with this name is already scheduled for deletion." Depending on the scenario, examine whether the secret should really be deleted and either:

    • Cancel the deletion (from the AWS console)
    • Leave it for deletion (but you need to wait a minimum of 7 days)
    • Contact AWS platform support to see if it can be deleted right away.
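The "cancel deletion" option can also be done from the CLI. A minimal sketch, assuming your credentials target the right environment: `restore-secret` is the Secrets Manager call that cancels a pending deletion, and the secret name is the one from the error message.

```shell
# Cancel a scheduled secret deletion; pass the secret name or ARN.
cancel_secret_deletion() {
  aws secretsmanager restore-secret --secret-id "$1"
}
# Example:
# cancel_secret_deletion famdb_auth_lambda_creds_secret
```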
  • Error: the secret famdb_auth_lambda_creds_secret already exists.

    Some secrets, like "famdb_auth_lambda_creds_secret", do not have a random-pet string appended to their names, so they cannot be recreated while they still exist. Hitting this "secret ... already exists" error during deployment is probably rare: if the secret was never destroyed, it should still be in the Terraform state file. The likely scenario is that the secret was destroyed (see the error above) and its deletion was then cancelled from the AWS console, bringing the secret back. Because the cancellation happened outside Terraform, the state file no longer tracks the secret, and the next deployment tries to recreate it. To fix this issue:

    • Set up a local Terraform workspace and sync with the Terraform state, then back up the state file (see the instructions in Managing-Terraform-State-from-Local)

    • Do a "Terraform/Terragrunt import". For example, to bring the existing secret (famdb_auth_lambda_creds_secret) back into the Terraform state file, use the following sample commands:

      • Find the secret ID (from AWS console), and make sure you are in the right workspace (e.g., "Tools")

      • Execute command: terragrunt import aws_secretsmanager_secret.famdb_auth_lambda_creds_secret arn:aws:secretsmanager:ca-central-1:377481750915:secret:famdb_auth_lambda_creds_secret-nddahx

      • Find the secret version (and its secret ARN)

      • Execute command: terragrunt import aws_secretsmanager_secret_version.famdb_auth_lambda_creds_secret_version "arn:aws:secretsmanager:ca-central-1:377481750915:secret:famdb_auth_lambda_creds_secret-nddahx|9E2C2ABE-2C28-4963-A684-B6BFAC0D8B0E" (Note: the secret version ID is appended to the end of the secret ARN with a "|" separator, and the whole argument is wrapped in double quotes)

      The "import" commands above write the resource entries into the state file, so the next deployment run will not try to recreate the secret.
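The two import steps can be wrapped so the ARN and version ID are supplied once. A sketch only: the `-nddahx` ARN suffix and the version ID in the example call are the sample values from the text; replace them with the values you find in your AWS console.

```shell
# Import an existing Secrets Manager secret and its current version into
# the Terraform state, mirroring the two terragrunt commands above.
import_secret() {
  local arn="$1" version_id="$2"
  terragrunt import aws_secretsmanager_secret.famdb_auth_lambda_creds_secret "$arn"
  terragrunt import aws_secretsmanager_secret_version.famdb_auth_lambda_creds_secret_version "${arn}|${version_id}"
}
# Example (sample values from the text):
# import_secret 'arn:aws:secretsmanager:ca-central-1:377481750915:secret:famdb_auth_lambda_creds_secret-nddahx' '9E2C2ABE-2C28-4963-A684-B6BFAC0D8B0E'
```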

AWS deployment success but frontend has CORS issue from API Gateway.

This has happened in the prod and tools spaces. After the Terraform deployment, everything in the pipeline deploys successfully, but requests from the frontend to the backend server "fam-admin-management-api-lambda-tools-gateway" fail with a CORS error.

To fix this error (nothing is actually wrong with the AWS configuration, and CORS is enabled in API Gateway):

  • Simply "Deploy" the API Gateway version again from the console.

Flyway data migration failing or incorrect

If a Flyway data migration runs in dev and turns out to be incorrect, we can change the existing migration, but we then need to delete or roll back the database, because Flyway will throw an error and fail if an already-applied migration is altered.

We should not change migrations that have gone to test (or especially prod). Ideally we test migrations against an existing database (not one freshly recreated, like dev) before running them in prod; this is a reason not to completely delete the test database once we are live in prod. It also means that if a migration is wrong, the fix is either a rollback in test or a new migration that corrects the problem.

The reason we don't necessarily create new migrations for dev is to avoid accumulating too many migrations.

Note that we currently have no way to roll back an individual Flyway migration on AWS (though a GitHub Action could probably be created to do this via Flyway). For now, a rollback means a full database restore from a backup snapshot. The easier option is to simply delete the development database (e.g., via the destroy dev backend environment).

Manual fix flyway version conflict

If we apply V21 and then V23 in a hotfix deployment, and later want to apply V22 from a regular deployment, Flyway will complain that the scripts are out of order. There are two ways to fix this kind of problem.

  • If we know exactly what data has changed and which version run we want to remove, we can edit the Flyway migration history table and remove the row for that version:
    • Create an EC2 instance and connect to the Postgres database as the sysadmin user.
    • Run SET schema 'flyway';, check the Flyway migration history table, and remove the row for the version we want to get rid of.
    • Then remove all the records in our tables that were added by that Flyway script version. Remove the records in reverse order, because the foreign keys are all connected.
  • If the situation is really complicated and we don't know the relationships in the data, we can create a new version, for example V24, that does the same thing as V22, and remove V22. But then we need to be careful that the other environments end up with the same Flyway order.
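The manual-history steps from the first option can be sketched as a single psql session. This is a sketch under assumptions: "flyway_schema_history" is Flyway's default history table name (not confirmed for FAM), and version '22' is the example from the text; inspect before deleting, and remember the application-table cleanup still has to be done by hand in reverse foreign-key order.

```shell
# Run the history inspection and row removal against the database,
# connecting as the sysadmin user (connection string passed in).
fix_flyway_history() {
  psql "$1" <<'SQL'
SET schema 'flyway';
-- Inspect the applied migrations first.
SELECT installed_rank, version, script FROM flyway_schema_history ORDER BY installed_rank;
-- Remove the out-of-order version row (V22 in the example above).
DELETE FROM flyway_schema_history WHERE version = '22';
SQL
}
# Example:
# fix_flyway_history "$DATABASE_URL"
```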

Local-Dev: Fix AWS Cognito Not Accessible Due to Wrong IDs

Occasionally the FAM DEV AWS environment may need to be destroyed and rebuilt by the development team. When that happens, the AWS FAM DEV Cognito "COGNITO_USER_POOL_ID" and "COGNITO_CLIENT_ID" values used locally become outdated. As a quick remedy on your development branch, search for the two IDs in the files below and update them to the current FAM DEV Cognito values (access the console to find them).

Files to be updated:

local-dev.env
pytest.ini
env.json
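The search-and-update step can be sketched as a small helper run from the directory containing the three files. The old/new IDs in the example calls are placeholders; read the current values from the FAM DEV Cognito console.

```shell
# Replace one ID with another across the given files, in place.
# (Uses GNU sed's -i; on macOS/BSD sed, use `sed -i ''` instead.)
replace_id() {
  local old="$1" new="$2"
  shift 2
  for f in "$@"; do
    sed -i "s/${old}/${new}/g" "$f"
  done
}
# Examples (placeholder IDs):
# replace_id 'ca-central-1_OLDPOOLID' 'ca-central-1_NEWPOOLID' local-dev.env pytest.ini env.json
# replace_id 'old-client-id' 'new-client-id' local-dev.env pytest.ini env.json
```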

Useful Standard OIDC IDP Information

It can be useful to know the "well-known" path for retrieving the configuration of an OIDC IDP when troubleshooting. For example: https://dev.loginproxy.gov.bc.ca/auth/realms/standard/.well-known/openid-configuration
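The discovery document is plain JSON, so a quick look at a single field needs only curl and sed. A sketch: the helper below pulls a top-level string field out of JSON on stdin without requiring jq (with jq installed, `curl -s <url> | jq .` is simpler).

```shell
# Extract a top-level string field ($1) from a JSON document on stdin.
oidc_field() {
  sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p"
}
# Example against the well-known path above:
# curl -s https://dev.loginproxy.gov.bc.ca/auth/realms/standard/.well-known/openid-configuration | oidc_field token_endpoint
```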
