Add support for tracking map pressure for NAT map #27001
Conversation
Force-pushed from 2225b01 to 37c4b20

/test
@derailed Thank you for the PR! Some random thoughts:

- By default, the GC interval for the ct map is calculated dynamically based on maxDeleteRatio (see cilium/pkg/maps/ctmap/ctmap.go line 840 at 3eaca5d). From the GC perspective that's fine, but it's not fine from the perspective of a user who wants to monitor map pressure: the count is not entirely accurate, since the dump takes time and the map keeps changing while it runs. I think we should use a fixed interval for the GC when map pressure for the ct map is enabled.
- How do we export the map pressure of the nat map regularly? As far as I know, the nat map flush doesn't happen on a regular schedule, so we can't use the same approach as for the ct map. One possible solution is a goroutine that exports the nat map's pressure on a fixed interval. Since these changes are expected to increase CPU load, they need to be done efficiently; I want to try BPF_MAP_LOOKUP_BATCH and see whether that helps (see the sketch after this list).
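To make the batch-lookup idea concrete, here is a minimal sketch of counting a map's entries with BPF_MAP_LOOKUP_BATCH, assuming cilium/ebpf v0.12+ and a kernel that supports batch lookups (5.6+). The `natKey`/`natVal` sizes are placeholders for illustration, not Cilium's real types:

```go
package main

import (
	"errors"
	"fmt"

	"github.com/cilium/ebpf"
)

// Placeholder key/value types sized for illustration only; the real NAT map
// types live in Cilium's pkg/maps.
type natKey [38]byte
type natVal [38]byte

// countEntries walks a BPF map with BPF_MAP_LOOKUP_BATCH, fetching entries in
// large chunks (one syscall per chunk instead of one per entry).
func countEntries(m *ebpf.Map) (int, error) {
	const chunk = 4096
	keys := make([]natKey, chunk)
	vals := make([]natVal, chunk)

	var cursor ebpf.MapBatchCursor
	total := 0
	for {
		n, err := m.BatchLookup(&cursor, keys, vals, nil)
		total += n // a partial final batch still returns entries
		if errors.Is(err, ebpf.ErrKeyNotExist) {
			return total, nil // reached the end of the map
		}
		if err != nil {
			return 0, fmt.Errorf("batch lookup: %w", err)
		}
	}
}
```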
@ysksuzuki I believe the challenge with a fixed interval is that it puts extra pressure on the CPU: walking these maps is very expensive. While counting the entries during the GC interval is not perfect, the significant advantage is that we gain visibility into the entry count without any extra CPU overhead, which is a really nice property. Plus, because of the dynamic GC interval, the metric is updated more frequently as pressure on the map increases, so it is refreshed essentially when the user should start paying attention to the map count.
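A rough sketch of that piggybacking idea: tally the entries inside the walk the GC performs anyway, then publish the ratio. The `dumper` interface and names are illustrative, not Cilium's actual GC API:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// dumper abstracts whatever walk the GC already performs over the map.
type dumper interface {
	DumpWithCallback(cb func(key, value []byte)) error
}

// gcAndMeasure piggybacks an entry tally on the GC walk that happens anyway,
// so map pressure is updated without a second, dedicated dump of the map.
func gcAndMeasure(m dumper, name string, maxEntries float64, pressure *prometheus.GaugeVec) error {
	count := 0
	if err := m.DumpWithCallback(func(key, value []byte) {
		count++
		// ... existing GC logic: expire/delete stale entries here ...
	}); err != nil {
		return err
	}
	// Pressure is the fill ratio in [0, 1]: entries seen / map capacity.
	pressure.WithLabelValues(name).Set(float64(count) / maxEntries)
	return nil
}
```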
I think that's why @joestringer suggests trying BPF_MAP_LOOKUP_BATCH to perform the calculation efficiently, and I'm planning to do a PoC with it. I'm not so sure how much that could reduce the CPU cost, though.

But we can already visualize the entry count with the existing metric. Shouldn't we start by writing a design document, discussing it, and aligning our expectations for this issue?
@ysksuzuki @chris Thank you both for weighing in!
@derailed Wow, some people are already working on this issue; I didn't know that. Can I see information about those efforts somewhere (the BPF_MAP_LOOKUP_BATCH prototype, and getting the counts directly from the kernel)? If you need a temporary solution in the interim, I don't want to block you. Should I unassign myself from this issue for now? Please let me know if there is anything I can do to help!
@ysksuzuki - I'll let @ti-mo and @aspsk pipe in here with the gory details. Thank you!!
Hi @ysksuzuki

For getting the counts from the kernel: there are currently two "APIs" (in quotes, because they are implemented using

Note that this is currently supported only in
Given the below and @ysksuzuki's comment, I can see that it doesn't give us much value to simply duplicate whatever we do for the existing CT GC metric and cilium_bpf_map_pressure. Now I understand the proposal to instead do the counting on a fixed interval and use batch lookups to mitigate the CPU-spike concerns.

@ysksuzuki Does it make sense to keep this PR just for the NAT map, since it doesn't have any metrics yet? Then in another PR we can propose batch lookups + fixed-interval counting.
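For the record, the follow-up proposal could look roughly like this: a goroutine that recounts the map on a fixed interval and publishes the pressure gauge. This is a sketch only; `count` would be something like the batch-lookup counter sketched earlier, and all names are hypothetical:

```go
package main

import (
	"context"
	"time"

	"github.com/cilium/ebpf"
	"github.com/prometheus/client_golang/prometheus"
)

// startPressureExporter periodically recounts the map and publishes the
// result as a pressure gauge in [0, 1]. The count function is injected so
// the efficient counting strategy (e.g. batch lookups) can be swapped out.
func startPressureExporter(ctx context.Context, m *ebpf.Map, maxEntries float64,
	gauge prometheus.Gauge, every time.Duration, count func(*ebpf.Map) (int, error)) {
	go func() {
		ticker := time.NewTicker(every)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				n, err := count(m)
				if err != nil {
					continue // transient failure; try again on the next tick
				}
				gauge.Set(float64(n) / maxEntries)
			}
		}
	}()
}
```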
Force-pushed from 2543feb to 5124171

/test
@derailed Thanks for updating the PR. I took a deep look at what this PR is doing and at the NAT GC cycle. I think the changes made in

One complication, though, with the current implementation is that internal calls to
@christarazi Great catch! So let me see if I understood this correctly: the ctmap manages the content of an associated nat map during its GC cycles.
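If I've got that right, the relationship is roughly the following sketch, under assumed names (`bpfMap` and `ctKeyForNATKey` are stand-ins, not Cilium's actual types):

```go
package main

// bpfMap is a stand-in for the real map handles; only the operations used
// by this sketch are modeled.
type bpfMap interface {
	DumpWithCallback(cb func(key, value []byte)) error
	Exists(key []byte) bool
	Delete(key []byte) error
}

// purgeOrphanNATEntries sketches the coupling described above: as part of the
// CT map's GC cycle, walk the NAT map and delete entries whose originating
// connection no longer exists in the CT map.
func purgeOrphanNATEntries(ct, nat bpfMap, ctKeyForNATKey func([]byte) []byte) (deleted int, err error) {
	err = nat.DumpWithCallback(func(key, value []byte) {
		if !ct.Exists(ctKeyForNATKey(key)) { // no matching CT entry -> orphan
			if nat.Delete(key) == nil {
				deleted++
			}
		}
	})
	return deleted, err
}
```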
Force-pushed from bf5e5da to 5cba7ed
The stats are defined here. From what I can tell, their usages are strictly only in

Yes, that's correct. However, relying solely on that would cause problems by overwriting the calculation that we will do, as discussed above, for LRU maps in general. LRU maps have evictions, and there is no way for the userspace caching side of the maps to know that an eviction occurred (it happens on the kernel/BPF side), so Cilium would not be able to decrement the count properly. This is why we rely on walking the maps to get the real count for LRUs, and hence why we need to stop LRU maps from having the automatic map pressure updates triggered, and instead allow them to be updated via

Does this make sense?
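In other words, something like the following (hypothetical names; `UpdatePressureMetricWithSize` illustrates the explicit-update path, not necessarily the real method):

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// lruPressure illustrates why cache-based counting breaks for LRU maps:
// userspace inserts and deletes can be tallied, but kernel-side evictions
// are invisible here, so a cached count only ever drifts upward.
type lruPressure struct {
	cachedCount int // drifts for LRU maps: it never sees evictions
	maxEntries  float64
	gauge       prometheus.Gauge
}

// UpdatePressureMetricWithSize (hypothetical name) lets the GC walk publish
// the true, freshly counted size, bypassing the drifting cached tally.
func (p *lruPressure) UpdatePressureMetricWithSize(n int) {
	p.cachedCount = n
	p.gauge.Set(float64(n) / p.maxEntries)
}
```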
/test
@christarazi I think I understand, but I'm still unclear on one point... From what I can tell, during the orphan call we only keep a tally for ingress/egress entries, but aren't those just 2 of the types of entries that could be in the nat map?
@derailed Ah, now I understand your question. It's basically: can there be more than 2 unique flags per NAT map key? From looking at the datapath code (

So in summary,

should be accurate. We can request @cilium/sig-datapath to have a look to confirm, but so far I'm fairly confident that this is correct.
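As an illustration of that tally: if every NAT entry carries exactly one of the two flags, then ingress + egress equals the total entry count, so the per-flag counters double as a size metric. The flag values below are made up for the sketch; the authoritative definitions live in the datapath code:

```go
package main

// Hypothetical tuple-flag values for the sketch; see the datapath headers
// for the real definitions.
const (
	tupleFlagIn  uint8 = 1 // ingress NAT entries
	tupleFlagOut uint8 = 2 // egress NAT entries
)

type natStats struct{ ingress, egress uint64 }

// tallyNATEntry sketches the orphan-callback tally discussed above: each
// entry increments exactly one of the two counters, so their sum tracks the
// total number of NAT map entries.
func tallyNATEntry(flags uint8, stats *natStats) {
	switch flags {
	case tupleFlagIn:
		stats.ingress++
	case tupleFlagOut:
		stats.egress++
	}
}
```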
@christarazi - Brilliant! Thank you for the clarification! I'll make it so. Given that we check for these specific key types in the callback, it might be best in the future to just keep a total tally, especially since these counters are only used for logging...
Force-pushed from cf6fcbe to 92b78ca

/test
Force-pushed from e41b04a to 2f587df

/test
Force-pushed from 2f587df to 400be0b

/test
@derailed It seems that most of the CI jobs are failing, presumably due to a fatal crash or something of the sort. I would suggest downloading a sysdump from one of the runs and investigating the cause. For example, this is the smoke test's sysdump: https://github.com/cilium/cilium/suites/14891492852/artifacts/848345104
Force-pushed from 400be0b to 5ce12a9
Taps into the LRU maps' GC cycle to surface a map pressure metric based on the resulting computed map size.

Add initial map pressure metrics support for the following LRU maps:
- NAT

> Note: This is a temporary solution until LRU map counts are surfaced in the kernel.

Issue: [cilium#20069](cilium#20069)

Signed-off-by: Fernand Galiana <fernand.galiana@isovalent.com>
Force-pushed from 5ce12a9 to c6da705

/test
I'm not in favor of adding anything to
All required tests passing. Merging.
…ap (@derailed)

Once this PR is merged, you can update the PR labels via:

```upstream-prs
$ for pr in 27001; do contrib/backporting/set-labels.py $pr done 1.12; done
```
Taps into the LRU maps' GC cycle to surface a map pressure metric based on the resulting computed map size.

Add initial map pressure metrics support for the following LRU maps:
- NAT

Related: #20069, #21508

Signed-off-by: Fernand Galiana <fernand.galiana@isovalent.com>