New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds concept documentation for CiliumEndpointSlice #17430
Conversation
fd3b4f4
to
629e77c
Compare
cab7136
to
d13c98f
Compare
This pull request can only be merged once #17658 has been merged, right? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for documenting this, looks good overall. Just a few minor nits below.
A mitigation strategy is to upgrade the operator from non-CES to CES first. | ||
This will let it start creating CESs right away, while the agents are still | ||
watching CEP resources. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this needs to be a bit more explicit, at the very least something like this:
A mitigation strategy is to upgrade the operator from non-CES to CES first. | |
This will let it start creating CESs right away, while the agents are still | |
watching CEP resources. | |
A mitigation strategy is to upgrade the operator from non-CES to CES before | |
upgrading agents on each node. | |
This will let it start creating CESs right away, while the agents are still | |
watching CEP resources. |
But also the next questions are "how long is long enough to wait after upgrading the operator?" and "What are the concrete steps to ensure that this upgrade occurs in this order?
I think the second one of these questions may take some effort to think through and come up with the concrete steps, which maybe shouldn't be necessary at this point with a beta version.. But also, by the time this goes GA, I will imagine that we want full, reliable instructions in the main upgrade guide for how to implement this upgrade safely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how long is long enough to wait after upgrading the operator?
There's really no tell in current implementation, but I imagine we could either let the operator output some signal, or better, we could implement a check in the agent to see if CES store has all the CESs required to represent current CEPs, and only then start watching CES.
What are the concrete steps to ensure that this upgrade occurs in this order
I think going GA, what we can do is to employ a 2-step upgrade again here. We put the operator change into some future 1.11 ver and let it create CES by default. Then in 1.12 release, we toggle agent watch by default. If we do this, then maybe there's no point implementing the checks I said above...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh also took your suggestion, edited text :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's really no tell in current implementation, but I imagine we could either let the operator output some signal, or better, we could implement a check in the agent to see if CES store has all the CESs required to represent current CEPs, and only then start watching CES.
I think that regardless of what else we do, this kind of signal would be really nice to have for debugability. Assuming that folks did the upgrade correctly and everything is fine works alright if everything is operating correctly and there's no unexpected bugs or issues in particular environments. But introspecting the actual state at a point in time and being able to make assessments on the health of the migration can be super helpful.
What are the concrete steps to ensure that this upgrade occurs in this order
I think going GA, what we can do is to employ a 2-step upgrade again here. We put the operator change into some future 1.11 ver and let it create CES by default. Then in 1.12 release, we toggle agent watch by default. If we do this, then maybe there's no point implementing the checks I said above...
In principle I like the idea, but I would be a bit concerned about silently turning this on in a future 1.11.x bugfix release. Enabling it in v1.11.0 and finding out that there is some issue with it (eg leaking resources, GC doesn't work, too high CPU, unexpected apiserver interactions, whatever the bugs may be) is not too bad. Introducing that later in the release cycle would increase the risk for upgrades that users will otherwise expect to be low-risk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that regardless of what else we do, this kind of signal would be really nice to have for debugability.
I agree, created #17850 as a follow up item.
Enabling it in v1.11.0 and finding out that there is some issue with it (eg leaking resources, GC doesn't work, too high CPU, unexpected apiserver interactions, whatever the bugs may be) is not too bad.
Makes sense, should we enable operator piece by default now then (maybe in a quick follow up PR before 1.11 cut)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah please do follow up with a dedicated PR after we merge the core implementation and we can have that discussion / gauge the benefit vs. risk.
In current implementation, we added a delay (default: 1s) before sending | ||
out the DELETE event. This should greatly reduce the probability of | ||
connection disruption in most cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, tone:
In current implementation, we added a delay (default: 1s) before sending | |
out the DELETE event. This should greatly reduce the probability of | |
connection disruption in most cases. | |
In current implementation, Cilium adds a delay (default: 1s) before sending | |
out the DELETE event. This should greatly reduce the probability of | |
connection disruption in most cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
During this period, new endpoints might experience longer delay before | ||
their identities will propagate to remote nodes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can add another sentence after this to describe the impact here, something like "If a newly created pod is affected by this delay, then it may temporarily fail to establish connectivity to remote pods in the cluster."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
.. code-block:: shell-session | ||
|
||
kubectl get pods -n kube-system -l k8s-app=cilium | ||
NAMESPACE NAME READY STATUS RESTARTS AGE | ||
kube-system cilium-7tnkq 1/1 Running 0 1d | ||
|
||
kubectl get pods -n kube-system -l io.cilium/app=operator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if this is shell-session with combined commands + output then it needs the shell symbols:
.. code-block:: shell-session | |
kubectl get pods -n kube-system -l k8s-app=cilium | |
NAMESPACE NAME READY STATUS RESTARTS AGE | |
kube-system cilium-7tnkq 1/1 Running 0 1d | |
kubectl get pods -n kube-system -l io.cilium/app=operator | |
.. code-block:: shell-session | |
$ kubectl get pods -n kube-system -l k8s-app=cilium | |
NAMESPACE NAME READY STATUS RESTARTS AGE | |
kube-system cilium-7tnkq 1/1 Running 0 1d | |
$ kubectl get pods -n kube-system -l io.cilium/app=operator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(same below, you should be able to see the difference in the Netlify render output or by building the docs locally).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
helm install cilium |CHART_RELEASE| \\ | ||
--set enableCiliumEndpointSlice=true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this indented the same level as the first column character of parsed-literal
? It looks like the indentation is not enough, which usually causes sphinx formatting issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Yes, I think so, thanks for setting the flag :) |
|
||
Egress Gateway Won't Work with CES | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
This is a temporary issue until Github Issue #17669 is fixed. If you |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a standard way to reference github issues in the docs, I think :gh-issue:
17669should do the trick. Might need to double-check how that renders in the middle of the sentence but you can also grep around for
gh-issue` to understand how we typically reference github issues in other parts of the docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one more nit on linking the github issue for one of these limitations.
Signed-off-by: Weilong Cui <cuiwl@google.com>
Blocking PR #17658 has merged. |
See commit msg.