Skip to content
This repository has been archived by the owner on Feb 3, 2021. It is now read-only.

cellgeolocation e2e tests fail very often #455

Closed
coderbyheart opened this issue Oct 13, 2020 · 9 comments
Closed

cellgeolocation e2e tests fail very often #455

coderbyheart opened this issue Oct 13, 2020 · 9 comments
Assignees
Labels
bug Something isn't working
Projects

Comments

@coderbyheart
Copy link
Member

Most of the failed e2e tests are due to the cellgeolocation API.

This should be improved.

@coderbyheart coderbyheart added the bug Something isn't working label Oct 13, 2020
@coderbyheart coderbyheart self-assigned this Oct 13, 2020
@coderbyheart coderbyheart added this to Software in Development Oct 13, 2020
@coderbyheart
Copy link
Member Author

Funny, that right after I created this issue, the next commit passed: a96e119

coderbyheart added a commit that referenced this issue Oct 20, 2020
coderbyheart added a commit that referenced this issue Oct 20, 2020
coderbyheart added a commit that referenced this issue Oct 20, 2020
@coderbyheart
Copy link
Member Author

After adding enabling logging on the HTTP API, here is an example where the requests fail with a Not found error message:

   ℹ  step I GET /cellgeolocation?cell={cellId}&area={area}&mccmnc={mccmnc} ⏱ +1ms
[RestClient] 3a479ef5-eedd-47f4-8b82-400e4fc9c335 {
  "request": {
    "url": "https://p0dzlj55mk.execute-api.eu-central-1.amazonaws.com/v1/cellgeolocation?cell=98691297&area=167&mccmnc=14434",
    "options": {
      "method": "GET",
      "headers": {
        "Accept": "application/json"
      }
    }
  }
}
[RestClient] 3a479ef5-eedd-47f4-8b82-400e4fc9c335 {
  "error": "The content-type \"\" of the response is not JSON!",
  "headers": {
    "apigw-requestid": "Uw0Lph8BliAEJpg=",
    "connection": "close",
    "content-length": "23",
    "date": "Wed, 21 Oct 2020 13:13:08 GMT"
  },
  "body": "{\"message\":\"Not Found\"}"
}
   ℹ  step error Error: The content-type "" of the response is not JSON! ⏱ +294ms

However the request is not logged in the API Gateway logs, so it seems to not reach the API Gateway.

@coderbyheart
Copy link
Member Author

This cache setting seems to be the culprit:

expires: 86400,

In case the cell is not found in the first request (which is entirely possible), all subsequent requests will return the same answer, because it is cached on API Gateway. Adding the ability to bust the cache could solve the issues.

coderbyheart added a commit that referenced this issue Oct 21, 2020
coderbyheart added a commit that referenced this issue Oct 21, 2020
This adds the ability to bust the cache of the cell geolocation GET endpoint.

See #455
coderbyheart added a commit that referenced this issue Oct 21, 2020
coderbyheart added a commit that referenced this issue Oct 22, 2020
Check if API is ready before testing

See #455
@coderbyheart
Copy link
Member Author

Right now it is not reliably possible to enable logging for ApiGatewayV2: https://github.com/bifravst/aws/runs/1291851657?check_suite_focus=true

7/90 | 10:09:19 AM | CREATE_FAILED        | AWS::ApiGatewayV2::Stage                 | cellGeolocationApi/stage (cellGeolocationApistage3E9E29D9) Insufficient permissions to enable logging (Service: AmazonApiGatewayV2; Status Code: 400; Error Code: BadRequestException; Request ID: 2ad3304d-8cb6-4922-9c54-c24504dac900; Proxy: null)
	new CellGeolocationApi (/home/runner/work/aws/aws/dist/cdk/resources/CellGeolocationApi.js:70:22)
	\_ new BifravstStack (/home/runner/work/aws/aws/dist/cdk/stacks/Bifravst.js:345:28)
	\_ new TestApp (/home/runner/work/aws/aws/dist/cdk/apps/Test.js:10:9)
	\_ /home/runner/work/aws/aws/dist/cdk/cloudformation-test.js:22:21

AWS Support ticket: https://console.aws.amazon.com/support/home#/case/?displayId=7520944471&language=en (in Account 777112256734).

See also https://forums.aws.amazon.com/thread.jspa?messageID=939643&#939643

@coderbyheart
Copy link
Member Author

coderbyheart commented Oct 22, 2020

The cause of this is likely due to a known issue with the API Gateway Service, caused by a large number of different log groups associated to various HTTP APIs in an account. Our internal team is aware of this and is working on a fix.

Please see below for steps to manually resolve the issue:

  1. Run this to find the associated CloudWatch Logs resource policy:
    aws logs describe-resource-policies please see this document for more information on this command[1]

  2. Look for AWSLogDeliveryWrite20150319 - you'll notice this policy has a large number of associated LogGroup resources. You have three options:

  • Adjust this policy by removing some of the potentially unused entries.
    -Change the resource list to "*"
    -You can add another policy, based on this policy, and split the resource records between them.
  1. Apply updates via:
    aws logs put-resource-policy please see this document for more information on this command [2]

[1] https://docs.aws.amazon.com/cli/latest/reference/logs/describe-resource-policies.html
[2] https://docs.aws.amazon.com/cli/latest/reference/logs/put-resource-policy.html

coderbyheart added a commit that referenced this issue Oct 23, 2020
This allows to check that the API itself is working before running functional tests.

See #455
@coderbyheart
Copy link
Member Author

coderbyheart commented Oct 23, 2020

I now have logs and runs that should help to figure out the cause of the issue, so I've created another support case:

AWS Support ticket: https://console.aws.amazon.com/support/home#/case/?displayId=7526446591&language=en (in Account 777112256734).

Hei,

I am running integration tests, where I set up a new HTTP API for a test run (for example in this stack arn:aws:cloudformation:eu-central-1:777112256734:stack/bifravst-macOS-7a23a75b/1cea8380-1520-11eb-8d5c-063c4f89ebe4).

This includes a HTTP API which then will receive requests right after the stack has been completed creation.

Sometimes requests are not reaching the lambda and are not even logged to the log group.
The condition is flaky, but happens continuously roughly every 4 out of 10 runs (you can see the test runs here on PRs here, some pass, some fail: https://github.com/bifravst/aws/pulls).

Example:

First the test issues a GET request, which is handled by the HTTP API:

Fri, 23 Oct 2020 11:14:56 GMT

   ℹ  RestClient 96309a16-2eba-4b13-b9a0-77679b681a2c
       {
         "request": {
           "url": "https://2xc710mgq0.execute-api.eu-central-1.amazonaws.com/v1/cellgeolocation?cell=90966665&area=211&mccmnc=26201&ts=1603451696350",
           "options": {
             "method": "GET",
             "headers": {
               "Accept": "application/json"
             }
           }
         }
       }
   ℹ  RestClient 96309a16-2eba-4b13-b9a0-77679b681a2c
       {
         "response": {
           "statusCode": 200,
           "headers": {
             "access-control-allow-origin": "*",
             "apigw-requestid": "U3IvmhbhFiAEQrg=",
             "cache-control": "public, max-age=86400",
             "connection": "close",
             "content-length": "66",
             "content-type": "application/json",
             "date": "Fri, 23 Oct 2020 11:14:56 GMT",
             "expires": "Sat, 24 Oct 2020 11:14:56 GMT",
             "x-bifravst-version": "9e45f1f5a8f917f918faae7ab926db8e420b57ce"
           },
           "body": {
             "lat": 5.523368480572486,
             "lng": 55.809917243993546,
             "accuracy": 5000
           }
         }
       }

The request is logged in the access logs:
arn:aws:logs:eu-central-1:777112256734:log-group:/bifravst-macOS-7a23a75b/cellGeolocationApiAccessLogs:*

{
    "requestId": "U3IvmhbhFiAEQrg=",
    "awsEndpointRequestId": "c4113350-f644-4149-9b8e-5bfb93c30cf3",
    "requestTime": "23/Oct/2020:11:14:56 +0000",
    "ip": "143.55.64.23",
    "protocol": "HTTP/1.1",
    "routeKey": "GET /cellgeolocation",
    "status": "200",
    "responseLength": "66",
    "integrationLatency": "41",
    "integrationStatus": "200",
    "integrationErrorMessage": "-",
    "integration": {
        "status": "200"
    }
}

Now at the same time (Fri, 23 Oct 2020 11:14:56 GMT) a second request is not properly handled:

 ℹ  RestClient 3ecbcfdf-7188-4f20-82a0-0fd812a21950
       {
         "request": {
           "url": "https://2xc710mgq0.execute-api.eu-central-1.amazonaws.com/v1/cellgeolocation",
           "options": {
             "method": "POST",
             "headers": {
               "Accept": "application/json"
             },
             "body": "{\"cell\":4902666,\"area\":105,\"mccmnc\":18216,\"lat\":53.92319098033689,\"lng\":145.2875897837064,\"accuracy\":11157.32130194882}"
           }
         }
       }
   🚨  RestClient 3ecbcfdf-7188-4f20-82a0-0fd812a21950
       {
         "error": "The content-type \"\" of the response is not JSON!",
         "statusCode": 404,
         "headers": {
           "apigw-requestid": "U3IvpiWuFiAEPfA=",
           "connection": "close",
           "content-length": "23",
           "date": "Fri, 23 Oct 2020 11:14:56 GMT"
         },
         "body": "{\"message\":\"Not Found\"}"
       }

Note that the request ID U3IvpiWuFiAEPfA= cannot be found in the API Gateway access logs.

What could be the issue here?

@coderbyheart
Copy link
Member Author

coderbyheart commented Oct 29, 2020

My takeaways from the screensharing session with the AWS support are:

After adding the $default route (with no integration) to the HTTP API stage, the request which previously did not work, started to work. It also kept working after removing the $default stage. This points to either

  • the HTTP API Gateway needing a $default stage to properly route requests
  • the act of adding the $default stage through the UI triggering another deployment to the HTTP API which might have not happened when it was created through CloudFormation or CDK.

So, the first attempt to solve this is to add a blank $default stage (see da0bf06) and see if this improves the flaky behaviour.

The second option would be to try to trigger another HTTP API stage deployment after CloudFormation has finished setting up the deployment through the AWS CLI to ensure a complete deployment exists on the API.

@coderbyheart
Copy link
Member Author

da0bf06 did not improve the situation, there were more failed runs during the night.

Adding a route after CloudFormation has completed is the other possible fix: ff04fbb, so far after 6 builds it looks promising.

@coderbyheart
Copy link
Member Author

Still no failed tests because of CellGeolocation and 4 releases, so this workaround seems to have fixed the problem.

Development automation moved this from Software to Done Nov 3, 2020
coderbyheart added a commit to bifravst/bifravst that referenced this issue Dec 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
No open projects
Development
  
Done
Development

No branches or pull requests

1 participant