Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds concept documentation for CiliumEndpointSlice #17430

Merged
merged 1 commit into from Nov 12, 2021
Merged

Conversation

Weil0ng
Copy link
Contributor

@Weil0ng Weil0ng commented Sep 17, 2021

See commit msg.

@Weil0ng Weil0ng added the area/documentation Impacts the documentation, including textual changes, sphinx, or other doc generation code. label Sep 17, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Sep 17, 2021
@Weil0ng Weil0ng added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Sep 17, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Sep 17, 2021
@Weil0ng Weil0ng force-pushed the cesdoc branch 2 times, most recently from fd3b4f4 to 629e77c Compare November 3, 2021 23:04
@pchaigno pchaigno added release-note/misc This PR makes changes that have no direct user impact. and removed release-note/minor This PR changes functionality that users may find relevant to operating Cilium. labels Nov 5, 2021
@Weil0ng Weil0ng force-pushed the cesdoc branch 2 times, most recently from cab7136 to d13c98f Compare November 9, 2021 00:39
@Weil0ng Weil0ng requested a review from aanm November 9, 2021 00:54
@Weil0ng Weil0ng marked this pull request as ready for review November 9, 2021 00:54
@Weil0ng Weil0ng requested a review from a team as a code owner November 9, 2021 00:54
@Weil0ng Weil0ng requested a review from a team November 9, 2021 00:54
@pchaigno
Copy link
Member

pchaigno commented Nov 9, 2021

This pull request can only be merged once #17658 has been merged, right?

@pchaigno pchaigno added the dont-merge/blocked Another PR must be merged before this one. label Nov 9, 2021
Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for documenting this, looks good overall. Just a few minor nits below.

Comment on lines 93 to 95
A mitigation strategy is to upgrade the operator from non-CES to CES first.
This will let it start creating CESs right away, while the agents are still
watching CEP resources.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this needs to be a bit more explicit, at the very least something like this:

Suggested change
A mitigation strategy is to upgrade the operator from non-CES to CES first.
This will let it start creating CESs right away, while the agents are still
watching CEP resources.
A mitigation strategy is to upgrade the operator from non-CES to CES before
upgrading agents on each node.
This will let it start creating CESs right away, while the agents are still
watching CEP resources.

But also the next questions are "how long is long enough to wait after upgrading the operator?" and "What are the concrete steps to ensure that this upgrade occurs in this order?

I think the second one of these questions may take some effort to think through and come up with the concrete steps, which maybe shouldn't be necessary at this point with a beta version.. But also, by the time this goes GA, I will imagine that we want full, reliable instructions in the main upgrade guide for how to implement this upgrade safely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how long is long enough to wait after upgrading the operator?

There's really no tell in current implementation, but I imagine we could either let the operator output some signal, or better, we could implement a check in the agent to see if CES store has all the CESs required to represent current CEPs, and only then start watching CES.

What are the concrete steps to ensure that this upgrade occurs in this order

I think going GA, what we can do is to employ a 2-step upgrade again here. We put the operator change into some future 1.11 ver and let it create CES by default. Then in 1.12 release, we toggle agent watch by default. If we do this, then maybe there's no point implementing the checks I said above...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh also took your suggestion, edited text :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's really no tell in current implementation, but I imagine we could either let the operator output some signal, or better, we could implement a check in the agent to see if CES store has all the CESs required to represent current CEPs, and only then start watching CES.

I think that regardless of what else we do, this kind of signal would be really nice to have for debugability. Assuming that folks did the upgrade correctly and everything is fine works alright if everything is operating correctly and there's no unexpected bugs or issues in particular environments. But introspecting the actual state at a point in time and being able to make assessments on the health of the migration can be super helpful.

What are the concrete steps to ensure that this upgrade occurs in this order

I think going GA, what we can do is to employ a 2-step upgrade again here. We put the operator change into some future 1.11 ver and let it create CES by default. Then in 1.12 release, we toggle agent watch by default. If we do this, then maybe there's no point implementing the checks I said above...

In principle I like the idea, but I would be a bit concerned about silently turning this on in a future 1.11.x bugfix release. Enabling it in v1.11.0 and finding out that there is some issue with it (eg leaking resources, GC doesn't work, too high CPU, unexpected apiserver interactions, whatever the bugs may be) is not too bad. Introducing that later in the release cycle would increase the risk for upgrades that users will otherwise expect to be low-risk.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that regardless of what else we do, this kind of signal would be really nice to have for debugability.

I agree, created #17850 as a follow up item.

Enabling it in v1.11.0 and finding out that there is some issue with it (eg leaking resources, GC doesn't work, too high CPU, unexpected apiserver interactions, whatever the bugs may be) is not too bad.

Makes sense, should we enable operator piece by default now then (maybe in a quick follow up PR before 1.11 cut)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah please do follow up with a dedicated PR after we merge the core implementation and we can have that discussion / gauge the benefit vs. risk.

Comment on lines 80 to 82
In current implementation, we added a delay (default: 1s) before sending
out the DELETE event. This should greatly reduce the probability of
connection disruption in most cases.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, tone:

Suggested change
In current implementation, we added a delay (default: 1s) before sending
out the DELETE event. This should greatly reduce the probability of
connection disruption in most cases.
In current implementation, Cilium adds a delay (default: 1s) before sending
out the DELETE event. This should greatly reduce the probability of
connection disruption in most cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 90 to 91
During this period, new endpoints might experience longer delay before
their identities will propagate to remote nodes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can add another sentence after this to describe the impact here, something like "If a newly created pod is affected by this delay, then it may temporarily fail to establish connectivity to remote pods in the cluster."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 45 to 51
.. code-block:: shell-session

kubectl get pods -n kube-system -l k8s-app=cilium
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system cilium-7tnkq 1/1 Running 0 1d

kubectl get pods -n kube-system -l io.cilium/app=operator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if this is shell-session with combined commands + output then it needs the shell symbols:

Suggested change
.. code-block:: shell-session
kubectl get pods -n kube-system -l k8s-app=cilium
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system cilium-7tnkq 1/1 Running 0 1d
kubectl get pods -n kube-system -l io.cilium/app=operator
.. code-block:: shell-session
$ kubectl get pods -n kube-system -l k8s-app=cilium
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system cilium-7tnkq 1/1 Running 0 1d
$ kubectl get pods -n kube-system -l io.cilium/app=operator

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(same below, you should be able to see the difference in the Netlify render output or by building the docs locally).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 40 to 41
helm install cilium |CHART_RELEASE| \\
--set enableCiliumEndpointSlice=true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this indented the same level as the first column character of parsed-literal? It looks like the indentation is not enough, which usually causes sphinx formatting issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@Weil0ng
Copy link
Contributor Author

Weil0ng commented Nov 10, 2021

This pull request can only be merged once #17658 has been merged, right?

Yes, I think so, thanks for setting the flag :)


Egress Gateway Won't Work with CES
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is a temporary issue until Github Issue #17669 is fixed. If you
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a standard way to reference github issues in the docs, I think :gh-issue:17669should do the trick. Might need to double-check how that renders in the middle of the sentence but you can also grep around forgh-issue` to understand how we typically reference github issues in other parts of the docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one more nit on linking the github issue for one of these limitations.

Signed-off-by: Weilong Cui <cuiwl@google.com>
@Weil0ng
Copy link
Contributor Author

Weil0ng commented Nov 11, 2021

Blocking PR #17658 has merged.

@Weil0ng Weil0ng removed the dont-merge/blocked Another PR must be merged before this one. label Nov 11, 2021
@aanm aanm merged commit 0c02e32 into cilium:master Nov 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/documentation Impacts the documentation, including textual changes, sphinx, or other doc generation code. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants