Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Delta xDS #251

Closed
alecholmez opened this issue Dec 17, 2019 · 27 comments
Closed

Implement Delta xDS #251

alecholmez opened this issue Dec 17, 2019 · 27 comments

Comments

@alecholmez
Copy link
Contributor

I figured I could take this but I'd like to implement the Delta xDS functions in the server. I understand this will require different logic and a new subscribing mechanism, is there anything else I should be aware of @kyessenov ? Any input would be helpful here

@kyessenov
Copy link
Contributor

SG. Please make sure not to break existing non-incremental xDS, and keep in mind there's some churn in envoy with the refactors of incremental xDS subscription mechanism.

@alecholmez
Copy link
Contributor Author

@kyessenov sounds good, do you have any resources I could see involving the refactors? I haven't seen anything so far.

But will do I intended on keeping them decently compartmentalized so they didn't interfere with each other

@mattklein123
Copy link
Member

cc @lita can you loop in the various Lyft folks that might be interested in following along here?

@lita
Copy link

lita commented Dec 19, 2019

@tomwans @jyotimahapatra

@alecholmez Are you providing a spec on how you will implement this? I'm particularly interested in how you will be creating the subscribing mechanism and handling failures when an update is missed.

Lyft is very interested in using this feature in the future so I would like to involved in the process of building this.

@justincely
Copy link

justincely commented Dec 19, 2019

@lita we've got some initial specs done up for this which @alecholmez and I are planning to go over early next week. The holidays are coming up, so things might be slowed down a bit, but we'll get something out here for discussion soon so you and others can be involved and help us nail it down.

@lita
Copy link

lita commented Dec 19, 2019

Sounds great! Looking forward to reading it after the holidays!

@alecholmez
Copy link
Contributor Author

alecholmez commented Jan 2, 2020

Hi all, just wanted to provide a status update, here is a current work in progress for our proposal on the incremental xDS implementation:

https://gist.github.com/alecholmez/fa34dbe94b278defb9c2e46b40acb044

Would love feedback as well as participation from everyone! I have bullet points for things I haven't full thought out yet so wherever you see a dot just make note I haven't designed an implementation for that yet. Would love some help!

@alecholmez
Copy link
Contributor Author

@lita I've added quite a bit more to the proposal, would love to have you review again. A lot of the code snippets aren't accurate yet for the subscription section as they're just placeholders right now. But the logic would look similar.

As for the failure scenarios, from my understanding it seems incremental xDS is rather failure tolerant. I could be misunderstanding a few things but would love some help on that section if I do have things that are incorrect

@mattklein123
Copy link
Member

@alecholmez is it possible for you to move the design doc into a google doc? It's much easier for everyone to comment on, etc. Thank you.

@alecholmez
Copy link
Contributor Author

@mattklein123 sure thing, here's the l ink to the google doc I've moved everything over.

https://docs.google.com/document/d/1BCooMLuiMgWxyyQPci0j7e1KnXZqh2RtgtFYCJ5HQZU/edit?usp=sharing

Let me know if you guys don't have the ability to edit the doc.

@alecholmez
Copy link
Contributor Author

I'm going to go ahead and start on some implementation I think that'll help me flesh this out even more. Can definitely adjust the logic as the proposal gets more eyes on it. Thanks for everyones thoughts and suggestions!

@alecholmez
Copy link
Contributor Author

How does one actually tell Envoy to use incremental xDS in it's config? Is there a new option to define when you specify the xds_cluster and dynamic resource blocks in the bootstrap config? I don't see anything in the docs

@jyotimahapatra
Copy link
Contributor

I guess specifying delta_grpc will enable.

@banks
Copy link

banks commented Apr 9, 2020

Hi @alecholmez I'm really interested in your progress here. I've read through the Doc but it looks like there was still some uncertainty about what Envoy's behaviour was there.

It now seems like Envoy's implementation is complete but I can't find any concrete docs on details of that - e.g. does Envoy already support on-demand loading of Endpoints when a request for a cluster comes in?

Is this still something you're working on? Lazily populating endpoints would be a huge win for Consul as well as the efficiency gains of delta streaming.

@alecholmez
Copy link
Contributor Author

@banks sorry for being in the dark about this! I have a fork of go-control-plane where all my progress is.

Here is the link: https://github.com/alecholmez/go-control-plane/tree/incremental

Would love to have your help on this if you're willing! I can certainly add you as a contributor as well to my fork.

Envoy does support that feature now when we enable the delta xDS feature, so once we can finalize this implementation other projects can inherit that functionality

@alecholmez
Copy link
Contributor Author

@banks I forgot to mention earlier I'm in the envoy slack, feel free to DM me or give me a shoutout in any of the control plane or xDS channels. Also would be available to chat over hangouts or something anytime!

@banks
Copy link

banks commented Apr 14, 2020

@alecholmez thanks for the info! I'll check out your fork.

I'm currently researching ahead of our roadmap for future stuff so not sure if I or another team member will have a lot of time to dedicate to helping out immediately but if I can get a PoC working I think this would be a big win for our product to support this and will let you know if I hit any issues with your fork in the process!

Thanks for your effort here.

@alecholmez
Copy link
Contributor Author

Hi all, I've brought in the v3 changes into my fork and have the project successfully building again. Server unit tests are passing for incremental xDS, and I'm currently working on adding them for the snapshot cache. All my changes were targeted for v2 so I will continue my work, and when the support is fully functional, I will port the changes to v3. Currently the server is communicating with envoy, but no resources are being sent from the server (a bug on my part):

[2020-04-23 14:45:52.888][1958207][debug][config] [external/envoy/source/common/config/new_grpc_mux_impl.cc:62] Received DeltaDiscoveryResponse for type.googleapis.com/envoy.api.v2.Cluster at version 
[2020-04-23 14:45:52.888][1958207][debug][config] [external/envoy/source/common/config/delta_subscription_state.cc:114] Delta config for type.googleapis.com/envoy.api.v2.Cluster accepted with 0 resources added, 0 removed
[2020-04-23 14:45:52.894][1958207][debug][config] [external/envoy/source/common/config/new_grpc_mux_impl.cc:62] Received DeltaDiscoveryResponse for type.googleapis.com/envoy.api.v2.Cluster at version 
[2020-04-23 14:45:52.894][1958207][debug][config] [external/envoy/source/common/config/delta_subscription_state.cc:114] Delta config for type.googleapis.com/envoy.api.v2.Cluster accepted with 0 resources added, 0 removed
[2020-04-23 14:45:52.899][1958207][debug][config] [external/envoy/source/common/config/new_grpc_mux_impl.cc:62] Received DeltaDiscoveryResponse for type.googleapis.com/envoy.api.v2.Cluster at version 
[2020-04-23 14:45:52.899][1958207][debug][config] [external/envoy/source/common/config/delta_subscription_state.cc:114] Delta config for type.googleapis.com/envoy.api.v2.Cluster accepted with 0 resources added, 0 removed
[2020-04-23 14:45:52.904][1958207][debug][config] [external/envoy/source/common/config/new_grpc_mux_impl.cc:62] Received DeltaDiscoveryResponse for type.googleapis.com/envoy.api.v2.Cluster at version 

This loop will not end and I believe it is because a NACK is being sent back from the snapshot cache. I think this has something to do with how Im treating the watches/keeping state/or how i'm dealing with the nonce but I will continue to investigate and hopefully the unit tests bring something to light.

@phylake
Copy link

phylake commented May 13, 2020

All my changes were targeted for v2 so I will continue my work, and when the support is fully functional

@alecholmez is support fully functional yet? I checked out your fork but wondering if there's enough to start building on

@alecholmez
Copy link
Contributor Author

@phylake support is not functional yet, myself and friend have been working hard at this the past few weeks and we have made significant progress but have made a few realizations in the process. The fork is being updated and it'd be great if you'd check out the tests as we are focused on those currently. Would appreciate any help if you have the time! Thanks for checking out the fork 👍 , as soon as incremental is fully functional, or 90% of the way there I intend on opening a WIP PR against this repo so stay on the lookout for that.

@alecholmez
Copy link
Contributor Author

Hi all, just wanted to provide a status update: I have finally gotten all my local tests passing for the new incremental logic. I'm beginning testing with Envoy but will soon open a PR against this repo and will make changes according to everyones reviews. I have yet to go back and make performance modifications, as I planned that to be the second phase of this effort. Thanks for all the interest!

@alecholmez
Copy link
Contributor Author

Hi all, just opened the WIP PR for incremental! The integration test is functional with some odd behavior in a special case, and the unit tests are intermittently failing but I'm currently investigating that

@asishrs
Copy link

asishrs commented Sep 17, 2020

@alecholmez can you share update on this effort? Thanks a lot for putting your time on this.

@alecholmez
Copy link
Contributor Author

Sorry for keeping those following the issue in the dark! The PR is open, I have removed the WIP tag from it as i've just recently gotten the integration fully running and successful! All unit tests pass with the exception of the opaque resource handling, so it is ready for review! It's rather large so I can do whatever the maintainers need me to do to help with that ache. Otherwise it's functional and the PR is open!

@asishrs

@github-actions
Copy link

github-actions bot commented Apr 6, 2021

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Apr 6, 2021
@github-actions
Copy link

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@snowp snowp reopened this Apr 14, 2021
@snowp snowp added no stalebot and removed stale labels Apr 14, 2021
@alecholmez
Copy link
Contributor Author

I think we can close this now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants