Bug: with ECS, /etc/hosts file in application container pollutes the envoy container as well #471

Closed
cormac-ainc opened this issue Jul 7, 2023 · 1 comment
cormac-ainc commented Jul 7, 2023

Summary
Following the Troubleshooting guide, I appended some lines to /etc/hosts in my application container so that the envoy proxy could pick up requests to those hostnames. It turns out that, in awsvpc networking mode, /etc/hosts is actually a shared bind mount, and so modifying the /etc/hosts file in the application container ALSO modifies it for the envoy proxy container.

This prevents envoy from discovering the correct IP addresses for the backend hosts. Instead, it uses whatever random non-loopback address you put in the hosts file, and returns 404 for every request you make.
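
You can confirm the shared bind mount from inside either container: /etc/hosts shows up as its own mount entry (the exact device and options will vary by host, so treat this output as illustrative):

   $ grep 'etc/hosts' /proc/mounts
   /dev/nvme0n1p1 /etc/hosts ext4 rw,noatime 0 0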

Steps to Reproduce

  1. Make an ECS cluster + service with the enable-execute-command flag set (a CLI sketch follows these steps).
  2. (Optional: Register a virtual service in an App Mesh cluster called backend-service.mesh.local, and spin up a task that registers it there.)
  3. Define an application container + envoy sidecar in the task definition as usual. Tell the envoy sidecar to register in a virtual node that has backend-service.mesh.local as one of its backends. The application container should be:
FROM nginx:alpine
RUN echo "10.10.10.10\tbackend-service.mesh.local" >> /etc/hosts
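
A minimal sketch of step 1's service creation, assuming the cluster and task definition already exist; every name and ID here is a placeholder. Note that execute-command also requires the ssmmessages permissions on the task role.

aws ecs create-service \
  --cluster my-cluster \
  --service-name repro-service \
  --task-definition repro-task:1 \
  --desired-count 1 \
  --launch-type FARGATE \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-0abc123],securityGroups=[sg-0abc123]}' \
  --enable-execute-command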

Then

aws ecs execute-command --cluster <cluster> --task <task> --container app --interactive --command '/bin/sh'
   $ cat /etc/hosts
   ...
   10.10.10.10    backend-service.mesh.local
   # that's all as expected.... but...
   
   $ apk add curl
   $ curl http://backend-service.mesh.local -vvv
   ...
   < HTTP/1.1 404 Not Found
   < date: Fri, 07 Jul 2023 05:42:50 GMT
   < server: envoy
   < content-length: 0

   # envoy is looking very confused
   $ curl http://localhost:9901/clusters
   ...
   cds_egress_mesh_service_http_80::10.10.10.10:80::hostname::backend-service.mesh.local
   cds_egress_mesh_service_http_80::10.10.10.10:80::health_flags::/failed_active_hc/active_hc_timeout
   
   # it really is a shared volume mount of some kind...
   $ echo "hello" > /etc/hosts
aws ecs execute-command --cluster <cluster> --task <task> --container envoy --interactive --command '/bin/sh'
   $ cat /etc/hosts
   ...
   10.10.10.10    backend-service.mesh.local
   hello
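
One more check that makes the sharing obvious: stat the file from each container and compare the device:inode pair (the values will differ per host, but should match across the two containers):

   $ stat -c '%d:%i' /etc/hosts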

Are you currently working around this issue?
I am seriously thinking of using a competitor? Idk, this cluster isn't in production yet.

  • The only other options according to the docs are spinning up a DNS resolver, or a Route 53 private hosted zone (sketched after this list)
  • I do not feel like spinning up anything more complicated than a hosts file to serve a dummy IP address. It is a single 32-bit number. Please.
  • I am not convinced that making a private hosted zone won't also affect the envoy container, leading to exactly the same problem, so I almost can't be bothered trying.
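
For reference, this is roughly what the hosted-zone option would look like via the CLI (an untested sketch; the zone ID, VPC ID, and region are placeholders):

aws route53 create-hosted-zone \
  --name mesh.local \
  --caller-reference mesh-local-repro-1 \
  --vpc VPCRegion=ap-southeast-2,VPCId=vpc-0abc123

aws route53 change-resource-record-sets \
  --hosted-zone-id Z0ABC123EXAMPLE \
  --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"backend-service.mesh.local","Type":"A","TTL":300,"ResourceRecords":[{"Value":"10.10.10.10"}]}}]}'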
cormac-ainc (Author) commented
Never mind. I missed a single line of Terraform: setting the user ID of the envoy container to 1337 to match the IgnoredUID in the proxy configuration on the task definition. Without it, envoy itself was also subject to the proxying, so its own outbound traffic was redirected back into itself. Man, that was tough to diagnose.
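
For anyone else who hits this, the relevant Terraform wiring looks roughly like the sketch below. The image tag, ports, and names are placeholders; the only point is that the envoy container's user must match IgnoredUID:

resource "aws_ecs_task_definition" "app" {
  family       = "repro-task"
  network_mode = "awsvpc"
  # ...

  container_definitions = jsonencode([
    {
      name  = "envoy"
      image = "public.ecr.aws/appmesh/aws-appmesh-envoy:v1.27.0.0-prod"
      user  = "1337" # the missing line: must match IgnoredUID below
      # ...
    },
    # ... app container ...
  ])

  proxy_configuration {
    type           = "APPMESH"
    container_name = "envoy"
    properties = {
      AppPorts         = "80"
      IgnoredUID       = "1337"
      ProxyIngressPort = "15000"
      ProxyEgressPort  = "15001"
      EgressIgnoredIPs = "169.254.170.2,169.254.169.254"
    }
  }
}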
