
Add support for tracking map pressure for NAT map #27001

Merged
merged 1 commit into from Aug 8, 2023

Conversation

derailed
Contributor

@derailed derailed commented Jul 21, 2023

Taps into the LRU map GC cycle to surface a map pressure metric based on the resulting computed map size.

Add initial map pressure metrics support for the following LRU maps:

  • NAT

Note: This is a temporary solution until LRU map counts are surfaced
in the kernel.

Related: #20069, #21508

Add Prometheus map pressure metrics for NAT maps

Signed-off-by: Fernand Galiana fernand.galiana@isovalent.com

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 21, 2023
@derailed
Contributor Author

/test

@ysksuzuki
Member

ysksuzuki commented Jul 22, 2023

@derailed Thank you for the PR!

Some random thoughts:

By default, the GC interval for the ct map is calculated dynamically based on maxDeleteRatio. From the GC perspective that's OK, but it's not OK from the perspective of a user who wants to monitor map pressure. The metric is also not entirely accurate, given that the dump takes time and the map keeps changing while it runs. I think we should use a fixed interval for the GC when map pressure for the ct map is enabled.

func GetInterval(maxDeleteRatio float64) (interval time.Duration) {

How do we export the map pressure of the nat map regularly? As far as I know, flushing the nat map doesn't happen regularly, so we can't use the same approach as for the ct map. One possible solution is adding a goroutine that exports the map pressure of the nat map regularly.

Since these changes are expected to increase CPU load, they need to be performed efficiently. I want to try BPF_MAP_LOOKUP_BATCH and see if we can do this efficiently.
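
For illustration only, here is a minimal sketch of the periodic-export idea under my own assumptions; natPressureExporter, countEntries, and the fill-ratio convention are hypothetical names, not the actual Cilium API:

```go
import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// natPressureExporter is a hypothetical sketch of a goroutine that reports the
// NAT map fill ratio on a fixed interval, independent of the CT GC cycle.
func natPressureExporter(ctx context.Context, interval time.Duration,
	countEntries func() (int, error), // e.g. a full dump, or a BPF_MAP_LOOKUP_BATCH walk
	maxEntries int, pressure prometheus.Gauge) {

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			n, err := countEntries()
			if err != nil {
				continue // counting is best-effort; skip this tick on error
			}
			// Express pressure as the map's fill ratio.
			pressure.Set(float64(n) / float64(maxEntries))
		}
	}
}
```

How countEntries is implemented is exactly the trade-off under discussion: a plain dump is expensive on the CPU, which is why a batched lookup is attractive.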

@christarazi
Member

christarazi commented Jul 22, 2023

By default, the GC interval for the ct map is calculated dynamically based on maxDeleteRatio. From the GC perspective that's OK, but it's not OK from the perspective of a user who wants to monitor map pressure. The metric is also not entirely accurate, given that the dump takes time and the map keeps changing while it runs. I think we should use a fixed interval for the GC when map pressure for the ct map is enabled.

@ysksuzuki I believe the challenge with a fixed interval is that it will put extra pressure on the CPU. Walking these maps is very expensive on the CPU. While counting the entries during the GC interval is not perfect, the significant advantage is that we can now have visibility into the entry count without any extra CPU overhead, which is a really nice property. Plus, due to the dynamic GC interval, updating the metric occurs as pressure increases on the map. So the metric is updated essentially when the user should start paying attention to the map count.

@ysksuzuki
Member

ysksuzuki commented Jul 22, 2023

@christarazi

I believe the challenge with a fixed interval is that it will put extra pressure on the CPU. Walking these maps is very expensive on the CPU.

I think that's why @joestringer suggests trying BPF_MAP_LOOKUP_BATCH to perform the calculation efficiently, and I'm planning to do some PoC with it. I'm not so sure how much that could reduce the CPU cost, though.

the significant advantage is that we can now have visibility into the entry count without any extra CPU overhead, which is a really nice property

But we can already visualize the entry count with cilium_datapath_conntrack_gc_entries{status="alive"}. Given that we already have a similar metric, I don't see significant value in this: it just exports the same value as cilium_datapath_conntrack_gc_entries under cilium_bpf_map_pressure, right? If we expose it as cilium_bpf_map_pressure, I want it to be more reliable. Otherwise, it's hard for the monitoring system to respond quickly when a sudden high load happens, for example.

Shouldn't we start by writing a design document, discussing it, and aligning our expectations for this issue?

@derailed
Contributor Author

derailed commented Jul 22, 2023

@ysksuzuki @chris Thank you both for weighing in!

  1. I did see that size metrics are reported for both the ct and nat maps, but the mechanic for map_pressure is a little different: this metric is registered/unregistered when a given threshold is crossed, so its presence can be used to trigger an alert, versus just reporting a map size (a rough sketch of this mechanic follows after this list). Not to mention consistency, as it uses the very same mechanic that other map types currently use.
  2. There is a correlation between the ct and nat maps, so although the GC cycles are different, sanitization is performed on the nat maps as part of the ct GC to ensure the entries stay in sync.
  3. As stated in the PR, this won't accurately represent pressure, but I thought it might be viable in the interim...
  4. I think @ti-mo has already experimented with this, as he has a decent prototype using BPF_MAP_LOOKUP_BATCH and got very good benchmark results. The eternal trade-off here is CPU vs. memory, as the map needs to be pre-allocated for that call to return the count.
  5. Also, there is great progress, per the excellent work of @aspsk, toward getting the counts directly from the kernel. That will be ideal, but it will take time to trickle down to Cilium and its customers. So we need something viable in the interim...
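
To make point 1 concrete, here is a rough, hedged sketch of the register/unregister-over-a-threshold mechanic; the names, the struct, and the 0.9 threshold are hypothetical, not the actual Cilium implementation:

```go
import "github.com/prometheus/client_golang/prometheus"

// Hypothetical threshold: the gauge is only registered while the fill ratio
// exceeds it, so the mere presence of the series can drive an alert instead
// of just reporting a size.
const pressureThreshold = 0.9

type mapPressure struct {
	registry   *prometheus.Registry
	gauge      prometheus.Gauge
	registered bool
}

func (p *mapPressure) update(entries, maxEntries int) {
	fill := float64(entries) / float64(maxEntries)
	switch {
	case fill > pressureThreshold && !p.registered:
		p.registry.MustRegister(p.gauge)
		p.registered = true
	case fill <= pressureThreshold && p.registered:
		p.registry.Unregister(p.gauge)
		p.registered = false
	}
	if p.registered {
		p.gauge.Set(fill)
	}
}
```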

@ysksuzuki
Member

@derailed Wow, some people are already working on this issue. I didn't know that. Can I see the information about those efforts somewhere? (The BPF_MAP_LOOKUP_BATCH prototype and getting the counts directly from the kernel.)

If you need a temporary solution in the interim, I don't want to block you. Should I unassign myself from this issue for now? Please let me know if there is anything I can do to help!

@derailed
Contributor Author

@ysksuzuki - I'll let @ti-mo and @aspsk pipe in here with the gory details.
As for the issue, I think you should still own it, as you are much more knowledgeable than I am in this area.
It would be great if you could review this PR and see if I missed anything for this interim solution.

Thank you!!

@aspsk
Contributor

aspsk commented Jul 26, 2023

Hi @ysksuzuki

@derailed Wow, some people are already working on this issue. I didn't know that. Can I see the information about those efforts somewhere? (The BPF_MAP_LOOKUP_BATCH prototype and getting the counts directly from the kernel.)

For getting the counts from the kernel: there are currently two "APIs" (in quotes, because they are implemented using kfuncs and thus are not formally uAPI):

  • Use a map iterator (useful when you're getting the count for all maps periodically). See an example (in C) here.

  • Use a kfunc from a BPF program; see this commit description as an example and the source of truth.

Note that this is currently supported only in bpf-next (=> in Linux >= 6.6).

Member

@christarazi christarazi left a comment


Given the below and @ysksuzuki's comment, I can see that it doesn't give us much value if we just duplicate whatever we do for the existing CT GC metric and cilium_bpf_map_pressure. Now I understand the proposal to instead do the counting on a fixed interval and use batch lookups to mitigate CPU spike concerns.

Review comment on pkg/maps/ctmap/ctmap.go (outdated, resolved)
@christarazi
Member

@ysksuzuki Does it make sense to keep this PR just for the NAT map since it doesn't have any metrics? Then in another PR, we can propose batch lookups + fixed interval counting.

@derailed derailed force-pushed the feat/map_pressure branch 2 times, most recently from 2543feb to 5124171 Compare August 2, 2023 19:46
@derailed
Contributor Author

derailed commented Aug 2, 2023

/test

@christarazi
Member

@derailed Thanks for updating the PR.

I took a deep look at what this PR is doing and the NAT GC cycle. I think the changes made in pkg/bpf/map_linux.go are good to keep as those are the foundational pieces that we need to build off of. However, the changes in pkg/maps/nat/nat.go are not needed as is. The current problem with it is that the pressure metric calculation logic relies on Flush() being called, which upon investigation is only called from the CLI in response to a user issuing cilium bpf ... nat flush. Taking a closer look at how the NAT map works, we find that the datapath will insert into the map, but deletion happens during CT map GC. The fact that deletion happens during CT map GC is a bit confusing, but the code for that is here. You'll notice that the delete is issued for the NAT map in that code as well. Conveniently, the logic is also counting the number of alive entries, which is exactly what we need to report as the pressure metric.

One complication with the current implementation, though, is that internal calls to updatePressureMetric() will now call UpdatePressureMetricWithSize(len(cache)), which will overwrite the value we would pass in from PurgeOrphanNATEntries(). So I think we need some mechanism to prevent this from happening. Thoughts? Once we do that, I think this should accurately count the entries in the NAT map.
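
One possible shape for such a mechanism, purely as a hedged sketch (the struct layout and the externalPressure field are hypothetical, not the code in this PR; only updatePressureMetric and UpdatePressureMetricWithSize are names from this discussion): give the map a flag that turns the cache-driven update into a no-op, so only the explicit call made during GC takes effect.

```go
import "github.com/prometheus/client_golang/prometheus"

// Hypothetical sketch of suppressing cache-driven pressure updates for LRU
// maps, whose live count can only be established by walking the map.
type Map struct {
	cache         map[string]struct{} // stand-in for the userspace cache
	pressureGauge prometheus.Gauge
	maxEntries    int

	// externalPressure marks maps (e.g. LRU/NAT) whose pressure is reported
	// explicitly via UpdatePressureMetricWithSize rather than from the cache.
	externalPressure bool
}

// updatePressureMetric runs on every cache mutation.
func (m *Map) updatePressureMetric() {
	if m.externalPressure {
		return // the GC reports the real count; don't overwrite it from the cache
	}
	m.UpdatePressureMetricWithSize(len(m.cache))
}

// UpdatePressureMetricWithSize sets the gauge from an explicitly computed count.
func (m *Map) UpdatePressureMetricWithSize(size int) {
	if m.pressureGauge == nil {
		return
	}
	m.pressureGauge.Set(float64(size) / float64(m.maxEntries))
}
```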

@derailed
Contributor Author

derailed commented Aug 3, 2023

@christarazi Great catch! So let me see if I understood this correctly.

The ctmap manages the content of an associated nat map during its GC cycles.
When the purgeOrphan call is made, we end up calling DumpReliablyWithCallback, which performs housekeeping on the associated nat map. Though we are tracking stats for the number of alive ingress/egress entries, these don't seem relevant when establishing map pressure? But since pressure metrics are now enabled on nat maps and we're maintaining a cache, deletes would trigger nat map pressure updates.
Or am I missing it?

@derailed derailed force-pushed the feat/map_pressure branch 4 times, most recently from bf5e5da to 5cba7ed Compare August 3, 2023 17:51
@christarazi
Member

christarazi commented Aug 3, 2023

@derailed

tracking stats for the number of alive ingress/egress entries, these don't seem relevant when establishing map pressure

The stats are defined here. From what I can tell, they are only used in PurgeOrphanNATEntries(). Therefore, they count the cleaned-up and alive entries from the NAT map. The alive entries are relevant for the map pressure metric because they tell us the live count of the NAT map.

But since pressure metrics are now enabled on nat maps and we're maintaining a cache, deletes would trigger nat map pressure updates.
Or am I missing it?

Yes, that's correct. However, relying solely on that will cause problems by overwriting the calculation that we will do, as discussed above, for LRU maps in general. This is because LRU maps have evictions, and there's no way for the userspace caching side of the maps to know that an eviction occurred (because it happens on the kernel/BPF side), so essentially Cilium would not be able to decrement the count properly. This is why we rely on walking the maps to get the real count for LRUs, and why we need to stop LRU maps from having the automatic map pressure updates triggered, and instead allow them to be updated via UpdatePressureMetricWithSize() "manually" inside PurgeOrphanNATEntries().

Does this make sense?

@derailed
Contributor Author

derailed commented Aug 3, 2023

/test

@derailed
Contributor Author

derailed commented Aug 3, 2023

@christarazi I think I understand, but am still unclear... From what I can tell, during the orphan call we only keep a tally of ingress/egress entries, but these are only two of the types of entries that could be in the nat map?
Looking at the bit flags, there could be other types of entries, and thus triggering the map pressure update call would not yield the correct size, i.e. UpdatePressureWithSize(stats.IngressAlive+stats.EgressAlive).
If this is correct, we would need to keep another tally of alive entries and make the call with that computed total.
Am I wrong about that?

@christarazi
Member

christarazi commented Aug 3, 2023

@derailed Ah, now I understand your question. It's basically: can there be more than two unique flags per NAT map key? From looking at the datapath code (bpf/lib/nat.h) for NAT and tracing the call paths, I only see TUPLE_F_OUT (egress) and TUPLE_F_IN (ingress) being set, which makes sense given that there can only be two directions and we're talking about NAT. While there are theoretically more flags that can be set, the non-directional ones are only used in the CT map keys, not the NAT map keys. It's a bit confusing because the key types are the same for both maps.

So in summary,

UpdatePressureWithSize(stats.IngressAlive+stats.EgressAlive).

should be accurate. We can request @cilium/sig-datapath to have a look to confirm, but so far I'm fairly confident that this is correct.
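
Put against the hypothetical Map sketched earlier in this thread, the hook in the CT GC path would then boil down to something like the following; PurgeOrphanNATEntries and the stats field names come from this discussion, but the body and the NatGCStats layout are illustrative assumptions only:

```go
// Hypothetical sketch: after the CT GC walks the NAT map and deletes orphaned
// entries, report the alive total as the NAT map's pressure.
type NatGCStats struct {
	IngressAlive, IngressDeleted uint32
	EgressAlive, EgressDeleted   uint32
}

func PurgeOrphanNATEntries(ctMap, natMap *Map) *NatGCStats {
	stats := &NatGCStats{}
	// ... walk the NAT map during CT GC, deleting orphaned entries and
	// tallying stats.IngressAlive / stats.EgressAlive along the way ...
	natMap.UpdatePressureMetricWithSize(int(stats.IngressAlive + stats.EgressAlive))
	return stats
}
```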

@derailed
Contributor Author

derailed commented Aug 3, 2023

@christarazi - Brilliant! Thank you for the clarification! I'll make it so.

Given that we do check for these specific key types in the callback, in the future it might be best to just keep a total tally, especially since these counters are only used for logging...

@derailed derailed force-pushed the feat/map_pressure branch 2 times, most recently from cf6fcbe to 92b78ca Compare August 3, 2023 22:27
@derailed
Contributor Author

derailed commented Aug 3, 2023

/test

@derailed derailed force-pushed the feat/map_pressure branch 3 times, most recently from e41b04a to 2f587df Compare August 7, 2023 14:36
@derailed
Contributor Author

derailed commented Aug 7, 2023

/test

@christarazi christarazi added kind/enhancement This would improve or streamline existing functionality. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. area/daemon Impacts operation of the Cilium daemon. area/metrics Impacts statistics / metrics gathering, eg via Prometheus. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. labels Aug 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 7, 2023
@christarazi christarazi changed the title Add support for tracking map pressure for LRU type maps Add support for tracking map pressure for NAT map Aug 7, 2023
@derailed
Contributor Author

derailed commented Aug 7, 2023

/test

@christarazi
Member

@derailed It seems that most of the CI is failing, presumably due to some fatal crash or something of the sort. I would suggest downloading a sysdump from one of the runs and investigating the cause. For example, this is the smoke test's sysdump: https://github.com/cilium/cilium/suites/14891492852/artifacts/848345104

Taps into LRU maps GC cycle to surface map pressure metric based on the resulting
computed map size.

Add initial map pressure metrics support for the following LRU maps:
- NAT

> Note: This is a temporary solution until lru map counts are surfaced
in the kernel

Issue: [cilium#20069](cilium#20069)

Signed-off-by: Fernand Galiana <fernand.galiana@isovalent.com>
@christarazi
Member

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 8, 2023
@ti-mo
Contributor

ti-mo commented Aug 8, 2023

I'm not in favor of adding anything to struct Map, but given the lack of an alternative and that the change is small, it's fine for now.

@ldelossa
Contributor

ldelossa commented Aug 8, 2023

All required tests passing. Merging.

@ldelossa ldelossa merged commit 89fad63 into cilium:main Aug 8, 2023
58 of 59 checks passed
@julianwiedmann julianwiedmann added affects/v1.12 This issue affects v1.12 branch affects/v1.13 This issue affects v1.13 branch affects/v1.14 This issue affects v1.14 branch labels Nov 6, 2023
derailed added a commit to derailed/cilium that referenced this pull request Nov 28, 2023
…ap (@derailed)

Once this PR is merged, you can update the PR labels via:
```upstream-prs
$ for pr in 27001; do contrib/backporting/set-labels.py $pr done 1.12; done
```