cellgeolocation e2e tests fail very often #455

coderbyheart · 2020-10-13T12:46:31Z

Most of the failed e2e tests are due to the cellgeolocation API.

This should be improved.

coderbyheart · 2020-10-13T15:03:47Z

Funny, that right after I created this issue, the next commit passed: a96e119

See #455

coderbyheart · 2020-10-21T13:32:03Z

After adding enabling logging on the HTTP API, here is an example where the requests fail with a Not found error message:

   ℹ  step I GET /cellgeolocation?cell={cellId}&area={area}&mccmnc={mccmnc} ⏱ +1ms
[RestClient] 3a479ef5-eedd-47f4-8b82-400e4fc9c335 {
  "request": {
    "url": "https://p0dzlj55mk.execute-api.eu-central-1.amazonaws.com/v1/cellgeolocation?cell=98691297&area=167&mccmnc=14434",
    "options": {
      "method": "GET",
      "headers": {
        "Accept": "application/json"
      }
    }
  }
}
[RestClient] 3a479ef5-eedd-47f4-8b82-400e4fc9c335 {
  "error": "The content-type \"\" of the response is not JSON!",
  "headers": {
    "apigw-requestid": "Uw0Lph8BliAEJpg=",
    "connection": "close",
    "content-length": "23",
    "date": "Wed, 21 Oct 2020 13:13:08 GMT"
  },
  "body": "{\"message\":\"Not Found\"}"
}
   ℹ  step error Error: The content-type "" of the response is not JSON! ⏱ +294ms

However the request is not logged in the API Gateway logs, so it seems to not reach the API Gateway.

coderbyheart · 2020-10-21T14:05:57Z

This cache setting seems to be the culprit:

aws/cellGeolocation/httpApi/cell.ts

Line 100 in 6d7d169

expires: 86400,

In case the cell is not found in the first request (which is entirely possible), all subsequent requests will return the same answer, because it is cached on API Gateway. Adding the ability to bust the cache could solve the issues.

See #455

This adds the ability to bust the cache of the cell geolocation GET endpoint. See #455

See #455

Check if API is ready before testing See #455

coderbyheart · 2020-10-22T10:21:04Z

Right now it is not reliably possible to enable logging for ApiGatewayV2: https://github.com/bifravst/aws/runs/1291851657?check_suite_focus=true

7/90 | 10:09:19 AM | CREATE_FAILED        | AWS::ApiGatewayV2::Stage                 | cellGeolocationApi/stage (cellGeolocationApistage3E9E29D9) Insufficient permissions to enable logging (Service: AmazonApiGatewayV2; Status Code: 400; Error Code: BadRequestException; Request ID: 2ad3304d-8cb6-4922-9c54-c24504dac900; Proxy: null)
	new CellGeolocationApi (/home/runner/work/aws/aws/dist/cdk/resources/CellGeolocationApi.js:70:22)
	\_ new BifravstStack (/home/runner/work/aws/aws/dist/cdk/stacks/Bifravst.js:345:28)
	\_ new TestApp (/home/runner/work/aws/aws/dist/cdk/apps/Test.js:10:9)
	\_ /home/runner/work/aws/aws/dist/cdk/cloudformation-test.js:22:21

AWS Support ticket: https://console.aws.amazon.com/support/home#/case/?displayId=7520944471&language=en (in Account 777112256734).

See also https://forums.aws.amazon.com/thread.jspa?messageID=939643&#939643

coderbyheart · 2020-10-22T14:42:25Z

The cause of this is likely due to a known issue with the API Gateway Service, caused by a large number of different log groups associated to various HTTP APIs in an account. Our internal team is aware of this and is working on a fix.

Please see below for steps to manually resolve the issue:

Run this to find the associated CloudWatch Logs resource policy:
aws logs describe-resource-policies please see this document for more information on this command[1]

Look for AWSLogDeliveryWrite20150319 - you'll notice this policy has a large number of associated LogGroup resources. You have three options:

Adjust this policy by removing some of the potentially unused entries.
-Change the resource list to "*"
-You can add another policy, based on this policy, and split the resource records between them.

Apply updates via:
aws logs put-resource-policy please see this document for more information on this command [2]

[1] https://docs.aws.amazon.com/cli/latest/reference/logs/describe-resource-policies.html
[2] https://docs.aws.amazon.com/cli/latest/reference/logs/put-resource-policy.html

This allows to check that the API itself is working before running functional tests. See #455

coderbyheart · 2020-10-23T11:34:46Z

I now have logs and runs that should help to figure out the cause of the issue, so I've created another support case:

AWS Support ticket: https://console.aws.amazon.com/support/home#/case/?displayId=7526446591&language=en (in Account 777112256734).

Hei,

I am running integration tests, where I set up a new HTTP API for a test run (for example in this stack arn:aws:cloudformation:eu-central-1:777112256734:stack/bifravst-macOS-7a23a75b/1cea8380-1520-11eb-8d5c-063c4f89ebe4).

This includes a HTTP API which then will receive requests right after the stack has been completed creation.

Sometimes requests are not reaching the lambda and are not even logged to the log group.
The condition is flaky, but happens continuously roughly every 4 out of 10 runs (you can see the test runs here on PRs here, some pass, some fail: https://github.com/bifravst/aws/pulls).

Example:

First the test issues a GET request, which is handled by the HTTP API:

Fri, 23 Oct 2020 11:14:56 GMT

   ℹ  RestClient 96309a16-2eba-4b13-b9a0-77679b681a2c
       {
         "request": {
           "url": "https://2xc710mgq0.execute-api.eu-central-1.amazonaws.com/v1/cellgeolocation?cell=90966665&area=211&mccmnc=26201&ts=1603451696350",
           "options": {
             "method": "GET",
             "headers": {
               "Accept": "application/json"
             }
           }
         }
       }
   ℹ  RestClient 96309a16-2eba-4b13-b9a0-77679b681a2c
       {
         "response": {
           "statusCode": 200,
           "headers": {
             "access-control-allow-origin": "*",
             "apigw-requestid": "U3IvmhbhFiAEQrg=",
             "cache-control": "public, max-age=86400",
             "connection": "close",
             "content-length": "66",
             "content-type": "application/json",
             "date": "Fri, 23 Oct 2020 11:14:56 GMT",
             "expires": "Sat, 24 Oct 2020 11:14:56 GMT",
             "x-bifravst-version": "9e45f1f5a8f917f918faae7ab926db8e420b57ce"
           },
           "body": {
             "lat": 5.523368480572486,
             "lng": 55.809917243993546,
             "accuracy": 5000
           }
         }
       }

The request is logged in the access logs:
arn:aws:logs:eu-central-1:777112256734:log-group:/bifravst-macOS-7a23a75b/cellGeolocationApiAccessLogs:*

{
    "requestId": "U3IvmhbhFiAEQrg=",
    "awsEndpointRequestId": "c4113350-f644-4149-9b8e-5bfb93c30cf3",
    "requestTime": "23/Oct/2020:11:14:56 +0000",
    "ip": "143.55.64.23",
    "protocol": "HTTP/1.1",
    "routeKey": "GET /cellgeolocation",
    "status": "200",
    "responseLength": "66",
    "integrationLatency": "41",
    "integrationStatus": "200",
    "integrationErrorMessage": "-",
    "integration": {
        "status": "200"
    }
}

Now at the same time (Fri, 23 Oct 2020 11:14:56 GMT) a second request is not properly handled:

 ℹ  RestClient 3ecbcfdf-7188-4f20-82a0-0fd812a21950
       {
         "request": {
           "url": "https://2xc710mgq0.execute-api.eu-central-1.amazonaws.com/v1/cellgeolocation",
           "options": {
             "method": "POST",
             "headers": {
               "Accept": "application/json"
             },
             "body": "{\"cell\":4902666,\"area\":105,\"mccmnc\":18216,\"lat\":53.92319098033689,\"lng\":145.2875897837064,\"accuracy\":11157.32130194882}"
           }
         }
       }
   🚨  RestClient 3ecbcfdf-7188-4f20-82a0-0fd812a21950
       {
         "error": "The content-type \"\" of the response is not JSON!",
         "statusCode": 404,
         "headers": {
           "apigw-requestid": "U3IvpiWuFiAEPfA=",
           "connection": "close",
           "content-length": "23",
           "date": "Fri, 23 Oct 2020 11:14:56 GMT"
         },
         "body": "{\"message\":\"Not Found\"}"
       }

Note that the request ID U3IvpiWuFiAEPfA= cannot be found in the API Gateway access logs.

What could be the issue here?

See #455 [skip-cleanup]

coderbyheart · 2020-10-29T14:57:09Z

My takeaways from the screensharing session with the AWS support are:

After adding the $default route (with no integration) to the HTTP API stage, the request which previously did not work, started to work. It also kept working after removing the $default stage. This points to either

the HTTP API Gateway needing a $default stage to properly route requests
the act of adding the $default stage through the UI triggering another deployment to the HTTP API which might have not happened when it was created through CloudFormation or CDK.

So, the first attempt to solve this is to add a blank $default stage (see da0bf06) and see if this improves the flaky behaviour.

The second option would be to try to trigger another HTTP API stage deployment after CloudFormation has finished setting up the deployment through the AWS CLI to ensure a complete deployment exists on the API.

See #455

coderbyheart · 2020-10-30T11:08:46Z

da0bf06 did not improve the situation, there were more failed runs during the night.

Adding a route after CloudFormation has completed is the other possible fix: ff04fbb, so far after 6 builds it looks promising.

coderbyheart · 2020-11-03T09:27:59Z

Still no failed tests because of CellGeolocation and 4 releases, so this workaround seems to have fixed the problem.

See bifravst/aws#455

coderbyheart added the bug Something isn't working label Oct 13, 2020

coderbyheart self-assigned this Oct 13, 2020

coderbyheart added this to Software in Development Oct 13, 2020

coderbyheart added a commit that referenced this issue Oct 20, 2020

feat(cellgeo): enable logging

1b1868a

See #455

coderbyheart mentioned this issue Oct 20, 2020

Fix flaky cellgeo e2e tests #496

Merged

coderbyheart added a commit that referenced this issue Oct 20, 2020

feat(cellgeo): enable logging

679b050

See #455

coderbyheart added a commit that referenced this issue Oct 20, 2020

feat(cellgeo): enable logging

138f4f3

See #455

coderbyheart added a commit that referenced this issue Oct 21, 2020

feat(cellgeo): enable logging

93b8404

See #455

coderbyheart added a commit that referenced this issue Oct 21, 2020

test(e2e): fix flaky cell geolocation test (#496)

096753c

This adds the ability to bust the cache of the cell geolocation GET endpoint. See #455

coderbyheart added a commit that referenced this issue Oct 21, 2020

fix(cellgeo): remove access logs

b9895f6

See #455

coderbyheart added a commit that referenced this issue Oct 22, 2020

feat(cellgeo): add GET __health endpoint

d13b4bd

Check if API is ready before testing See #455

coderbyheart mentioned this issue Oct 22, 2020

Attempt 2 to fix flaky cellgeo end-to-end test #508

Merged

coderbyheart added a commit that referenced this issue Oct 23, 2020

test(e2e): add health check to cell geo API

a095610

This allows to check that the API itself is working before running functional tests. See #455

coderbyheart added a commit that referenced this issue Oct 29, 2020

refactor(cellgeo): use integration URI format recommended by support

3de8e5a

See #455 [skip-cleanup]

coderbyheart added a commit that referenced this issue Oct 29, 2020

fix(cellgeo): add missing $default route

da0bf06

See #455

coderbyheart added a commit that referenced this issue Oct 30, 2020

test(e2e): create fake route after setup

e4a4631

See #455

coderbyheart mentioned this issue Oct 30, 2020

test(e2e): create fake route after setup #566

Merged

coderbyheart added a commit that referenced this issue Oct 30, 2020

test(e2e): create fake route after setup (#566)

ff04fbb

See #455

coderbyheart closed this as completed Nov 3, 2020

Development automation moved this from Software to Done Nov 3, 2020

coderbyheart added a commit to bifravst/bifravst that referenced this issue Dec 8, 2020

fix(aws): add command to fix HTTP API

10d35ea

See bifravst/aws#455

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cellgeolocation e2e tests fail very often #455

cellgeolocation e2e tests fail very often #455

coderbyheart commented Oct 13, 2020

coderbyheart commented Oct 13, 2020

coderbyheart commented Oct 21, 2020

coderbyheart commented Oct 21, 2020

coderbyheart commented Oct 22, 2020

coderbyheart commented Oct 22, 2020 •

edited

coderbyheart commented Oct 23, 2020 •

edited

coderbyheart commented Oct 29, 2020 •

edited

coderbyheart commented Oct 30, 2020

coderbyheart commented Nov 3, 2020

cellgeolocation e2e tests fail very often #455

cellgeolocation e2e tests fail very often #455

Comments

coderbyheart commented Oct 13, 2020

coderbyheart commented Oct 13, 2020

coderbyheart commented Oct 21, 2020

coderbyheart commented Oct 21, 2020

coderbyheart commented Oct 22, 2020

coderbyheart commented Oct 22, 2020 • edited

coderbyheart commented Oct 23, 2020 • edited

coderbyheart commented Oct 29, 2020 • edited

coderbyheart commented Oct 30, 2020

coderbyheart commented Nov 3, 2020

coderbyheart commented Oct 22, 2020 •

edited

coderbyheart commented Oct 23, 2020 •

edited

coderbyheart commented Oct 29, 2020 •

edited