Job failing on GCP but works on AWS #2057
When we run the integration tests for our application, they fail frequently (~90% of the time) due to what appears to be a networking issue.
Our application uses RSpec/Capybara/Headless Chrome in order to run the integration tests. As part of the tests, we call out to a script to seed the app with data via its API. The architecture from the perspective of the tests is as follows:
The failure mode appears to be that, at some point, the API gateway loses the ability to talk to the microservices behind it. There is no apparent pattern to when this happens: about 10% of the time, the tests complete without error. Sometimes the run fails on the very first request to write seed data to the API; other times it fails halfway through seeding.
For example, the API gateway will see a request and map it correctly:
And the corresponding microservice never sees that request before it is terminated due to the test suite failing (note that the GET to /health is the only request received by the microservice):
The seeding script will then die with a read timeout:
We run all of the processes in the same container, so all networking should go over the loopback interface.
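One way to check the loopback assumption from inside the container is to look at what the host names used between the processes actually resolve to. The following is a minimal diagnostic sketch, not part of our test suite: the helper names `resolved_addresses` and `loopback_only?` are hypothetical, and `localhost` in the usage note is only an assumed example of a host name the gateway might be configured with.

```ruby
require "socket"

# Hypothetical helper: list every address a host name resolves to,
# and whether each one is a loopback address (127.0.0.0/8 or ::1).
def resolved_addresses(host)
  Addrinfo.getaddrinfo(host, nil, nil, :STREAM).map do |ai|
    { address: ai.ip_address,
      loopback: ai.ipv4_loopback? || ai.ipv6_loopback? }
  end.uniq
end

# Hypothetical helper: true only if the host resolves to at least one
# address and every resolved address is a loopback address.
def loopback_only?(host)
  addrs = resolved_addresses(host)
  !addrs.empty? && addrs.all? { |a| a[:loopback] }
end
```

Calling `resolved_addresses("localhost")` on both the GCP and the AWS runner would show whether the name resolves differently in the two environments (for example, to an IPv6 address in one and an IPv4 address in the other).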
We perform a GET request directly to the /health endpoint of each microservice while it is booting, in order to determine that it has come up successfully and that the integration tests can begin. This seems to work reliably.
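For illustration, the readiness check described above can be sketched roughly as follows. This is a simplified approximation under assumptions, not our actual boot script: the names `HTTP_PROBE` and `wait_for_health`, the timeout values, and the URL in the usage note are all hypothetical.

```ruby
require "net/http"
require "uri"

# Default probe (hypothetical): true when GET <url> returns a 2xx response.
# Connection errors and timeouts are treated as "not up yet".
HTTP_PROBE = lambda do |url|
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, open_timeout: 2, read_timeout: 2) do |http|
    http.get(uri.request_uri).is_a?(Net::HTTPSuccess)
  end
rescue SystemCallError, SocketError, Timeout::Error, EOFError
  false
end

# Poll every URL until each has answered successfully once, or raise
# when the overall deadline passes. Returns true on success.
def wait_for_health(urls, timeout: 60, interval: 0.5, probe: HTTP_PROBE)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  pending = urls.dup
  loop do
    # Drop every service that currently reports healthy.
    pending.reject! { |url| probe.call(url) }
    return true if pending.empty?
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "services not healthy after #{timeout}s: #{pending.join(', ')}"
    end
    sleep interval
  end
end
```

A call such as `wait_for_health(["http://127.0.0.1:3001/health"])` (port chosen arbitrarily here) blocks until the service answers. Note that a check like this talks to each microservice directly, so it does not exercise the gateway-to-microservice path where the failure shows up.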
What we tried