Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add e2e test to validate hitless reload of Envoy Gateway #1503

Open
arkodg opened this issue Jun 8, 2023 · 24 comments
Open

Add e2e test to validate hitless reload of Envoy Gateway #1503

arkodg opened this issue Jun 8, 2023 · 24 comments
Assignees
Milestone

Comments

@arkodg
Copy link
Contributor

arkodg commented Jun 8, 2023

Description:
Add a test case in our e2e framework https://github.com/envoyproxy/gateway/tree/main/test/e2e where

  • a client continuously sends traffic to a backend
  • Envoy Gateway is restarted multiple times
  • client successfully sends traffic using existing connection
@arkodg arkodg added good first issue Good for newcomers help wanted Extra attention is needed area/testing labels Jun 8, 2023
@arkodg arkodg added this to the Backlog milestone Jun 8, 2023
@dboslee
Copy link
Contributor

dboslee commented Jun 8, 2023

Some background on this, I hit an issue where after restarting the controller or after a resync period envoy begins draining the existing listeners. I believe this is caused by a re-ordering of the routes sent over xDS.

In my case I had configured 2 TLSRoute resources and without any changes envoy began draining and eventually sent TCP resets to the long lived connections. I think a key part of this test is having several routes configured in order to reproduce the undesired behavior.

@arkodg
Copy link
Contributor Author

arkodg commented Jun 8, 2023

@dboslee long lived connections might be a separate sub issue related to keep alive timeouts (currently disabled by default)

@dboslee
Copy link
Contributor

dboslee commented Jun 9, 2023

long lived connections might be a separate sub issue related to keep alive timeouts (currently disabled by default)

To clarify in my case the connections were hitting the default 600 second drain timeout (which can be configured in envoy via --drain-time-s). But this only happens when a given filter chain gets drained. So 10 minutes after restarting the controller all the connections get reset.

@arkodg arkodg modified the milestones: Backlog, 0.5.0-rc1 Jun 30, 2023
@Ronnie-personal
Copy link
Contributor

Ronnie-personal commented Jul 6, 2023

@dboslee
I would like to understand more about the new e2e test requirement.

What would be the short name and description for this new test? How about these:

Short name: GatewayRestartDraining
Description: Verify that the client can successfully send traffic using an existing connection even after the Envoy gateway/controller restarts.

Do we expect the client to fail to send requests after 10 minutes following the restart, as the connection is reset after 10 minutes?

Thanks!

@Ronnie-personal
Copy link
Contributor

@arkodg
Do we have any sample test code that I can mimic? I see two existing tests, ratelimit and accesslog. Which one would be a better baseline code to copy for this specific requirement?
Thanks!

@arkodg
Copy link
Contributor Author

arkodg commented Jul 6, 2023

Hi @Ronnie-personal thanks for the looking into this issue, assigning this you for now

@Ronnie-personal
Copy link
Contributor

@arkodg
Would you help me on one question?
When we say reload Envoy gateway, we mean delete the gateway resource and recreate it?
For example, delete gateway named 'same-namespace' in namespace 'gateway-conformance-infra' and recreate it?

Thanks!

@Ronnie-personal
Copy link
Contributor

@arkodg
I have some ideas on /config_dump, would you mind checking if I'm on the right track?

Here are the steps in the reload.go test code,

  • Obtain Kubernetes client from cSuite
  • Use the client to get Envoy Proxy service by name 'proxy-config' and namespace 'envoy-gateway-system'
    (e.g --selector=gateway.envoyproxy.io/owning-gateway-namespace=envoy-gateway-system,gateway.envoyproxy.io/owning-gateway-name=proxy-config)
  • Get admin url (host/ip and port) from the service
  • Send http get request to the url/config_dump, save the response.
  • Send http get request again and compare the config dump with previous response.

Thanks!

@arkodg
Copy link
Contributor Author

arkodg commented Jul 7, 2023

hey @Ronnie-personal that looks right ! updating some steps

Obtain Kubernetes client from cSuite

  1. Use the client to get Envoy Proxy service by name 'proxy-config' and namespace 'envoy-gateway-system'
    (e.g --selector=gateway.envoyproxy.io/owning-gateway-namespace=envoy-gateway-system,gateway.envoyproxy.io/owning-gateway-name=proxy-config)
  2. port forward to envoy proxy pod, since the admin host is 127.0.0.1
  3. Send http get request to the url/config_dump, save the response.
  4. restart Envoy Gateway using kubernetes client
  5. Send http get request again and compare the config dump with previous response.

@Ronnie-personal
Copy link
Contributor

@arkodg Would you help me on one question? When we say reload Envoy gateway, we mean delete the gateway resource and recreate it? For example, delete gateway named 'same-namespace' in namespace 'gateway-conformance-infra' and recreate it?

Thanks!

@arkodg Thanks for the prompt reply!
Could you confirm my understanding of 'restart Envoy Gateway'? This is for for the step #4.

Thanks!

@dboslee
Copy link
Contributor

dboslee commented Jul 7, 2023

Could you confirm my understanding of 'restart Envoy Gateway'? This is for for the step #4.

This means restarting the envoy gateway controller. Which can be done by deleting the controller pods or updating the envoy gateway controller deployment in such a way that it triggers a rollout of new pods for you.

@dboslee
Copy link
Contributor

dboslee commented Jul 7, 2023

Also update on the issue which sparked this conversation, while I initially thought the issue was caused by route order changing this was just a trigger for the actual root cause. Which is that the TCP listener name is set to name of the first route processed by the controller so when routes get processed in a different order or the route the TCP listener is named after is deleted the TCP listener name changes and the listener is drained.

This happens here

xdsListener = buildXdsTCPListener(tcpListener.Name, tcpListener.Address, tcpListener.Port, accesslog)
where ir.TCPListener.Name is actually the name of a specific TCPRoute or TLSRoute. 

@Ronnie-personal
Copy link
Contributor

Could you confirm my understanding of 'restart Envoy Gateway'? This is for for the step #4.

This means restarting the envoy gateway controller. Which can be done by deleting the controller pods or updating the envoy gateway controller deployment in such a way that it triggers a rollout of new pods for you.

Thanks for the clarification.
Correct me if I'm wrong, so I need to locate the envoy gateway service, delete it and recreate?
If you happen to know where the envoy gateway gets created in the code stack, please let me know.
I have looked at https://github.com/envoyproxy/gateway/blob/main/test/config/gatewayclass.yaml and https://github.com/envoyproxy/gateway/blob/main/test/e2e/base/manifests.yaml. But none of them defines envoy gateway.

Thanks!

@Ronnie-personal
Copy link
Contributor

Also update on the issue which sparked this conversation, while I initially thought the issue was caused by route order changing this was just a trigger for the actual root cause. Which is that the TCP listener name is set to name of the first route processed by the controller so when routes get processed in a different order or the route the TCP listener is named after is deleted the TCP listener name changes and the listener is drained.

This happens here

xdsListener = buildXdsTCPListener(tcpListener.Name, tcpListener.Address, tcpListener.Port, accesslog)

where ir.TCPListener.Name is actually the name of a specific TCPRoute or TLSRoute.

Do we consider it as a bug? What would be the options to fix this root issue?

@arkodg
Copy link
Contributor Author

arkodg commented Jul 7, 2023

thanks for highlighting this @dboslee !
as @Ronnie-personal pointed out, we do incorporate the route name in the tcp listener (refer to #696) since the tcp listener directly maps to a cluster / destination https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/network/tcp_proxy/v3/tcp_proxy.proto#envoy-v3-api-msg-extensions-filters-network-tcp-proxy-v3-tcpproxy .
So as you've shared, if the routes are processed in a different order, the order for the tcp listeners will also change causing the hot restart https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/hot_restart .

  • Sorting pre or post in the gateway-api layer might be a good enough solution here

  • Another thing to ensure mainly for performance reasons is to prevent retranslation all together in gateway-api by ensuring the the watchable message for resources can skip updating to subscriber (gateway-api here) published by the producer (provider) if the contents of the resource list hasn't changed, even if the order of the resource list has.

will raise GH issues for each of these so this GH issue around E2E is not forgotten :)

@Ronnie-personal
Copy link
Contributor

Ronnie-personal commented Jul 8, 2023

I did a quick manual test locally using following steps,

  • I created envoy gateway and quick-start resources.
  • I saved the envoy proxy config_dump from 127.0.0.1:19000/config_dump
  • Delete the envoy gateway pod, kubectl get pods --selector=control-plane=envoy-gateway -n envoy-gateway-system
  • Another envoy gateway pod is up and running
  • Obtain config_dump and compare it with the previous one
  • The fields in the 'metadata' section are reordered, even though the actual data is exactly the same. (please see screenshot)
    image

My questions are:
Is this an expected behavior?
Should we ignore the 'metadata' section when comparing multiple config_dumps?
Which sections of the config_dump really matter for the comparison?

@arkodg
Copy link
Contributor Author

arkodg commented Jul 10, 2023

  • this is the right approach @Ronnie-personal, you can skip comparison of some fields using cmp.Diff which we use in this project, here's an example.
    from the above image, it looks like content is same, some config fields order has changed (might be due to usage of a map which lacks order), unsure if cmp.Diff handles this w/o any extra work or might require additional options like cmpopts.SortMaps to be passed to it
  • its fine to compare the entire config_dump

@Ronnie-personal
Copy link
Contributor

Ronnie-personal commented Jul 15, 2023

@arkodg
I have completed the initial coding work, I'm able to test the major logic as a standalone go code locally.
I'm not sure how to test this new e2e test under the entire envoy gateway code stack.

Here is the 'reload' test code main...Ronnie-personal:gateway:e2ereload1503
Please feel free to give a quick review.

Thanks,
Ronnie

@arkodg
Copy link
Contributor Author

arkodg commented Jul 18, 2023

hey @Ronnie-personal thanks for continuing to work on this, can you raise this as a PR so others in the community can also review, tia !

@arkodg arkodg added this to the 0.6.0-rc1 milestone Jul 31, 2023
@arkodg arkodg removed the help wanted Extra attention is needed label Jul 31, 2023
@github-actions
Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Aug 31, 2023
@arkodg arkodg modified the milestones: 0.6.0-rc1, Backlog Oct 20, 2023
@github-actions github-actions bot removed the stale label Oct 21, 2023
Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Nov 20, 2023
@Xunzhuo Xunzhuo modified the milestones: Backlog, v1.0.0-rc1 Dec 7, 2023
@github-actions github-actions bot removed the stale label Dec 7, 2023
Copy link

github-actions bot commented Jan 6, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Jan 6, 2024
@arkodg arkodg modified the milestones: v1.0.0-rc1, Backlog, v1.0.0 Feb 8, 2024
@github-actions github-actions bot removed the stale label Feb 8, 2024
Copy link

github-actions bot commented Mar 9, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Mar 9, 2024
@arkodg arkodg modified the milestones: v1.0.0, v1.1.0-rc1 Mar 28, 2024
@github-actions github-actions bot removed the stale label Mar 28, 2024
Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Apr 27, 2024
@arkodg arkodg modified the milestones: v1.1.0-rc1, v1.1.0 May 23, 2024
@github-actions github-actions bot removed the stale label May 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

No branches or pull requests

4 participants