Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: DNS Cache datasource #3679

Merged
merged 4 commits into from
Nov 29, 2023

Conversation

NDStrahilevitz
Copy link
Collaborator

@NDStrahilevitz NDStrahilevitz commented Nov 9, 2023

1. Explain what the PR does

9ebacee docs: add dns cache
0b78579 feature: add a dns:ip cache
96b581f fix(dns): wrong dns argument type
0bfbe8f feature: add generic set data structure

2. Explain how to test it

An instrumentation test and in-code e2e test were added.

3. Other comments

The PR currently uses a processor on net_packet_dns event to populate the cache. This meant that I had to add an additional processEvent call in the deriveEvents stage of the pipeline.
A possible improvement in the future is to migrate (or add) a control plane signal for this cache.

The cache currently uses the second node based design proposed.

Resolves #3678

@rafaeldtinoco
Copy link
Contributor

Ill review this tomorrow, just wanted a heads up on the feature flag. Alon and I discussed about having a feature flag per "enabled feature that would enrich the proc tree" or "just enriching if that event is being traced".

I have also thought about a multi-var feature-flag like for proctree enriching... Something like --proctree enrich:socket,dns,file,bleh,blah together with all other proctree cmdline options available.

Long story short: I don't think we should have feature flag for this particular dns cache feature. Its either one that would us allow to enable many proctree enrichers (such as this one) OR none needed (tracing the event that causes the enrichment would be enough).

What is your thought ?

@NDStrahilevitz
Copy link
Collaborator Author

Long story short: I don't think we should have feature flag for this particular dns cache feature. Its either one that would us allow to enable many proctree enrichers (such as this one) OR none needed (tracing the event that causes the enrichment would be enough).

What is your thought ?

I think that since this doesn't enrich or touch the process tree (maybe only currently), it has no reason to be a part of its feature flags. I'm not sure where this would plug into the process tree (at least not directly), as the closest the process tree could get to it would be through open fds maybe, and then if those fds are network sockets you could enrich the relevant dns of the connection.

I also don't like just equating the enable flag to enabling the relevant dns event (and ideally i would have a signal eBPF program here, but I didn't want to delay the PR further), because it obfuscates how the dns cache should be enabled. If a user tries to access the dns cache but it's empty because the net_packet_dns_response event wasn't enabled in his policy, that's just kind of annoying and unintuitive. The PR is still missing docs, but if it had docs, that would be the only place where this "enabling" method could be located, it's not self documenting.

@rafaeldtinoco
Copy link
Contributor

I'm not sure where this would plug into the process tree (at least not directly), as the closest the process tree could get to it would be through open fds maybe, and then if those fds are network sockets you could enrich the relevant dns of the connection.

Having a per process list of resolved DNS entries would allow blocking network connections of a process that has solved a blacklisted DNS name (just one example, there are many). Think of the process tree as a place to save all this information (opened files, queries names, IPs communicated with, etc).

@rafaeldtinoco
Copy link
Contributor

If a user tries to access the dns cache but it's empty because the net_packet_dns_response event wasn't enabled in his policy, that's just kind of annoying and unintuitive. The PR is still missing docs, but if it had docs, that would be the only place where this "enabling" method could be located, it's not self documenting.

This is the problem we need to solve (for data sources & process tree features). I believe we should solve this now or else it might be too late when we have multiple features done.

@rafaeldtinoco
Copy link
Contributor

(and ideally i would have a signal eBPF program here, but I didn't want to delay the PR further)

We also need to discuss signal events and their future. We need to think what we really want from them. For the process tree, they are helpful for busy pipelines, for example, but couldn't live without the regular events logic in place as well. Either way, when we agreed to have them, we missed the discussion of "up to what point" and "what for".

@rafaeldtinoco
Copy link
Contributor

Alright, gave a good first look. I believe its solid. I left comments here and there mostly for the Set utils package. I believe you could have avoided string parsing for AAAA CNAME PTR and things like that if you had used the common net_packet_dns event (instead of the response, which was made for keeping signatures working mostly).

Either way, the code per se is good IMO. There are major discussions about why keep data, what for, etc, but the logic is solid and makes sense.

Copy link
Contributor

@rafaeldtinoco rafaeldtinoco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left comments

pkg/ebpf/processor.go Show resolved Hide resolved
pkg/ebpf/events_pipeline.go Show resolved Hide resolved
pkg/utils/set/set.go Outdated Show resolved Hide resolved
pkg/utils/set/set.go Outdated Show resolved Hide resolved
pkg/utils/set/set.go Outdated Show resolved Hide resolved
pkg/dnscache/dnscache.go Outdated Show resolved Hide resolved
pkg/dnscache/dnscache.go Outdated Show resolved Hide resolved
pkg/dnscache/dnscache.go Outdated Show resolved Hide resolved
@NDStrahilevitz
Copy link
Collaborator Author

I want to add configuration for the cache LRU expiry time and max size before merging.

@NDStrahilevitz
Copy link
Collaborator Author

@rafaeldtinoco I've added an alternative implementation of the cache backend using a tree with indexed nodes (tell me if there is a better/more correct way to describe it). Please review and contrast with the original implementation, which i've also iterated upon. I believe the tree cache is more correct in how it caches results, but i'm not sure if it will be good enough performance wise on the querying side. Anyway, let's decide and i'll remove the irrelevant commits before merging.

@NDStrahilevitz NDStrahilevitz force-pushed the dns_cache_datasource branch 3 times, most recently from 223b3ac to 77dd275 Compare November 20, 2023 15:15
@NDStrahilevitz NDStrahilevitz force-pushed the dns_cache_datasource branch 2 times, most recently from 31424d8 to a964997 Compare November 20, 2023 15:46
pkg/dnscache/events Outdated Show resolved Hide resolved
@rafaeldtinoco
Copy link
Contributor

I like the entire code but the node and nodednscache part.

I don't like we are trying to optimize using multiple data structures, before even knowing this is needed, and mixing the index from queries and answers (raised a thought in the code). Particularly worried about all possible queries/answers combinations.

I also think the code needs better documentation (about your intents and example of strings whenever they are being considered, small texts explaining reasonings, etc).

From what I see there are 2 possible initial ways to do this, based on the following requirements (which I believe we are on the same page):

  1. Given a DNS name, to get the resulting IN A entry (after all aliases are solved).
  2. Given an IP, to pick the initial query name that resulted to it.

From the way you implemented, I believe you tried to address lots of valid concerns:

  1. Being able to construct the full resolution path (through node links)
  2. Being memory efficient by having LRU of nodes and time evictions
  3. Allow fast queries through keys
  4. Memory deduplication through tree structure, etc.

I think that we should have a complete and simpler approach before trying all (or any) optimizations. I have created pseudo code mixed with garbage mixed with go at the following gist:

NOTE: Im not suggesting you use my code (specially because it does not work). I wanted to think so I started playing a bit, that's all. It might give you ideas if you accept my feedback. I don't think we should worry so much about access time if the LRU entry is already hashed (and even in recursive walks, like my example, it might fast enough).

So, at the end, Im suggesting that you take an simpler approach until proved something else is needed.

With all that said, I like very much everything else to be honest. Its like this part "does not fit yet" the rest, which was very good IMO.

@NDStrahilevitz
Copy link
Collaborator Author

@rafaeldtinoco I have the intuition that the simpler approach as it is will accidentally break for unexpected queries.
Also we have in a previous internal conversation assessed that having access to the query history, specifically from an IP up, is actually just as much of a requirement (both for the future dns enrichment in connect events, and for the dnscache data source results).
In addition, the memory efficiency with the current LRU implementation could easily explode (see the amount of queries you generated), especially if we only rely on expiry, so I think that is also a valid concern.

I would like to continue anyway with this approach, improving its documentation as you mentioned.

@rafaeldtinoco
Copy link
Contributor

I would like to continue anyway with this approach, improving its documentation as you mentioned.

OK, I can accept that just because you're trying hard to optimize things, but please expand tests then.

You can keep tracee opened for some hours and compress the events yaml file using gzip and do a gzip reader for the test. I would like to have many different queries tested.

I would also go for a "needle in the haystack" test, trying to find a specific complex case query (multiple aliases, multiples IPs and even a reverse lookup for the same IPs) among everything that was cached.

Also, anytime something is overwritten I would like it to be logged in debug mode.

I would feel more confident with that approach, and possibly others you have in mind.

Copy link
Contributor

@rafaeldtinoco rafaeldtinoco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving comments for us to talk. We're getting there...

pkg/ebpf/tracee.go Show resolved Hide resolved
pkg/ebpf/tracee.go Show resolved Hide resolved
tests/e2e-inst-test.sh Outdated Show resolved Hide resolved
pkg/dnscache/dnscache.go Show resolved Hide resolved
pkg/utils/set/set.go Outdated Show resolved Hide resolved
pkg/utils/set/threadsafe.go Outdated Show resolved Hide resolved
pkg/utils/set/set.go Outdated Show resolved Hide resolved
pkg/dnscache/dnscache_test.go Show resolved Hide resolved
Introduce a generic set data structure, including a thread safe version
(with a mutex).
The proto_dns argument's type was *trace.ProtoDNS instead of being
non-pointer.
This is an issue when using the event in "everything is an event mode",
since type conversion isn't done between tracee-ebpf to tracee-rules.
@NDStrahilevitz NDStrahilevitz force-pushed the dns_cache_datasource branch 2 times, most recently from b4cbc42 to 9ebacee Compare November 29, 2023 13:30
Add an optionally enabled cache in tracee which tracks dns queries and
subsequent answers through dns packet events.
Internally the cache is organized as a bidirectional graph of nodes
representing dns queries. Initial questions (query roots) are stored in
an LRU, so there is a maximum amount of query graphs possible (5000 by
default, but it is configurable).
The cache is exposed as a data source in signatures. A usage example and
e2e test with a signature is included.
Copy link
Contributor

@rafaeldtinoco rafaeldtinoco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for addressing/replying to all concerns.

@NDStrahilevitz NDStrahilevitz merged commit 3918cd8 into aquasecurity:main Nov 29, 2023
30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Data source for dns<->ip mapping
2 participants