Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Discussion] Enhanced abstraction of "file" concept in libsinsp filter/display classes for effective monitoring #1134

Closed
incertum opened this issue Jun 2, 2023 · 20 comments
Labels
kind/feature New feature or request
Milestone

Comments

@incertum
Copy link
Contributor

incertum commented Jun 2, 2023

Motivation

  • fd.name: if the fd.type represents a file or directory, the fd.name field contains the full path. If the path is not already an absolute path, a custom traversal parser can be used to derive the absolute path based on the cwd and the given arg.
  • fd.nameraw was introduced to expose the arg as given to detect path traversal attacks like ../../../etc/passwd
  • fd.name.symlink_resolved or similar name (also applicable for executable path) is discussed in [FEATURE] Resolving executable path symlink on execve #1111

Syscalls in Linux often reference files or file names using custom arguments. This can make rule writing and data analysis cumbersome, as it requires remembering the relevant file paths args for a given syscall. Many users are also unfamiliar with the evt.arg.* fields and their meaning. Enhancing our documentation can address these issues by providing clear explanations and references to the concept of files in Linux. This will improve understanding and facilitate effective rule analysis.

At the same time, enhancing the user experience by overloading the fd.name field would be beneficial.

Feature

We can introduce more "overloading" instead of creating new field classes, as suggested in #1060. This approach promotes code reusability, simplifies system maintenance, and makes ETLs simpler, along with writing downstream rules.

Besides definitely embracing the reality that the concept of files and their attributes in Linux can be intricate ... 😅 same applies to the syscall args naming conventions in the Linux kernel ...

Reading below, it seems that keeping the fd. field class is the ideal choice for encapsulating file-related attributes.

In the Linux kernel, a file is an abstract representation of a resource that can be accessed or manipulated. It can represent various types of data, including regular files, directories, symbolic links, device files, and more. 

A file descriptor is a unique identifier or reference to an open file in a process.

On the other hand, an inode (index node) is a data structure in a file system that stores metadata about a file.

A file descriptor number (or simply fd number) is a unique identifier assigned by the operating system to an open file in a process. It is an integer value that represents the file within that process.

An inode number is a unique identifier assigned to a file's metadata within a file system, uniquely identifying the file within that file system.

...

Current State + proposal of "overloading" fd.name even more:

syscalls syscall arg fcomments and suggestions for ilter field class.name abstraction in libsinsp
[open, openat, openat2, unlinkat, umount, umount2, access] evt.arg.name already mapped to fd.name, a mix of file or directory fd.type -> also fd.nameraw is available, however double check for each syscall
[chroot, mkdir, rmdir, unlink, mkdirat, open_by_handle_at] evt.arg.path overload each to fd.name
[linkat, renameat, renameat2] evt.arg.oldpath fd.name.source and overload to fd.name
[linkat, renameat, renameat2] evt.arg.newpath fd.name.target
[linkat, renameat, renameat2] evt.arg.olddir fd.name.source
[linkat, renameat, renameat2] evt.arg.newdir fd.name.target
[symlink, symlinkat] evt.arg.linkpath fd.name.target
[symlink, symlinkat] evt.arg.target fd.name.source and overload to fd.name (this is the file we create a symlink over)
[quotactl] evt.arg.quotafilepath overload to fd.name
[fchownat] evt.arg.pathname overload to fd.name
[chmod, fchmodat, fchmod] evt.arg.filename and evt.arg.fd as fchmod has no filename fd.filename, also check on fd.name and fd.nameraw behavior here to stay consistent
[connect, accept* etc] various network related fd.name is overloaded with reference indicating src->dst ip tuples
[socketpair] evt.arg.source fd.name should be ?
[socketpair] evt.arg.peer fd.name should be ?
@mstemm
Copy link
Contributor

mstemm commented Jun 5, 2023

I think I'd avoid using fd.* to represent the targets of all the syscalls that can act on file paths. Currently, fd.* fields are only valid for syscalls that create a fd, and I've used that distinction when trying to explain to users why there isn't a "fd" for syscalls like renameat2, mkdir, etc. Usually the users are asking why they can't "look up" the mkdir, etc for the process afterwards, and noting the distinction between fds and paths was useful.

I understand the appeal of just building out below the fd. namespace, but I think keeping the fd. names limited to file descriptors is a good one.

@mstemm
Copy link
Contributor

mstemm commented Jun 5, 2023

Here's a recent discussion on slack where we pointed out out that fd.* fields only make sense for syscalls on a fd: https://kubernetes.slack.com/archives/CMWH3EH32/p1685803022045059

(This isn't exactly the same, as the user was talking about spawned processes, but it's pretty close).

@Andreagit97 Andreagit97 added this to the libs-backlog milestone Jun 7, 2023
@jasondellaluce
Copy link
Contributor

@incertum What's your opinion? Are there blockers for #1060 on your end?

@incertum
Copy link
Contributor Author

No blockers.

Few additional questions from my side to help find a solution that gives the best user experience if we wish to further discuss:

Echoing Luca's question from the PR #1060 (comment):

One thing that is not too clear to me is that we already have fd.* and so having file.* could be a bit confusing, and so users should know what one does that the other doesn't and vice versa. So if there is an fd should there also always be a file? And would that be a source, target or path? Are path and the other two set at the same time? These are simply questions to let me and others better understand the feature.

If fd. field class should be limited to fds then having a file class, but not populating those for example for the open, openat, openat2 syscalls indeed could be confusing to end users who may not be aware of subtle differences between Linux concepts or prioritize ease of use over subtle differences.

I am now putting on my Data Science / Big Data hat and what would help from my perspective is to have one field that is always populated with the "file path" on disk I would want to write my detection against, so that I don't have to remember which one the one is I have to care about for each unique system call, is it the source, the target, the path? This burden appears to still not be eliminated with the proposed PR.

In summary there are 2 open discussion points:

  • field class?
  • abstraction, overloading of field names, consolidation, clarity what the field represents?

Curious about more opinions on UX?

@mstemm
Copy link
Contributor

mstemm commented Jun 14, 2023

what would help from my perspective is to have one field that is always populated with the "file path" on disk I would want to write my detection against, so that I don't have to remember which one the one is I have to care about for each unique system call, ...

That's exactly the goal here--to have a set of consistent fields so they can be used for a variety of file based syscalls. file.path is for any syscall that acts on a single path (open, mkdir, unlink, etc), and file.source/file.target is for any syscall that works on two paths (rename, symlink, etc). And file.path is populated for open/openat/openat2, etc: https://github.com/falcosecurity/libs/pull/1060/files#diff-70b8039f1cd1e5664df22d0b231919b7b042dfdf223134fb8356a4b7f4d0da02R187. (In case the direct link gets changed from a rebase, look at sinsp_filter_check_file::create_fd_checks().

@incertum
Copy link
Contributor Author

incertum commented Jun 14, 2023

If file.path is the new field for having everything also fd* in case of fd.type being a file why is it then not ok to go the reverse direction and use existing fd field class instead. In both scenarios we are blurring the boundaries of "Linux concepts". In addition file field class shall then also actually represent dir? Would we also add a file.type field which would indicate if it is a directory or a file (further replicating the fd* setup)?
Asking again because once we introduce a new field class, The Falco Project supports it, including sending clear messaging to adopters.

If we start with adding a file class fields, will we tear fd field class further apart in the future for other cases?

What about fd.nameraw? Is there going to be an unsanitized file.pathraw version?

file.path is for any syscall that acts on a single path (open, mkdir, unlink, etc), and file.source/file.target is for any syscall that works on two paths (rename, symlink, etc)

Appreciate your additional input, but I have some reservations regarding the benefits and scope in this particular context. Referencing the table I created above, see #1134 (comment)

  • Cases like [linkat, renameat, renameat2] naturally map from olddir, newdir or oldpath, newpath to source and target [I don't see olddir, newdir addressed in the PR as of today and your above statement "mkdir" suggests the intention to throughout support path also for dirs?]
  • Cases like [symlink, symlinkat] are tricky as evt.arg.linkpath should be target but there exists evt.arg.target which should be the source rather or the reverse? [I don't see them addressed thoroughly in the PR as of today]
  • Those ones are tricky too [chmod, fchmodat, fchmod] or [fchownat], as it naturally mixes file and fd anyways, evt.arg.filename and evt.arg.fd as fchmod has no filename, for fchmod we would also populate fd.name to stay consistent? [I don't see them addressed in the PR as of today]
  • Will file.path also be populated with proc.exepath?

What about overloading file.path or the field we agree on with the source or target that is typically of interest for writing rules? That in my opinion would improve usability, but if everyone says it's a terrible idea, no feelings hurt.

Thanks for working on this Mark!

@mstemm
Copy link
Contributor

mstemm commented Jun 14, 2023

If file.path is the new field for having everything also fd* in case of fd.type being a file why is it then not ok to go the reverse direction and use existing fd field class instead.
To me, file.path is a superset of fd.name--it will work for any syscall that creates fds (that is open/openat/openat_2/openat2), but also work for syscalls that act on filesystem paths. We won't be removing the fd.name field, it's just that the same information will be available in file.path. file.path would also represent a directory, in cases where the syscall acts on a directory. mkdir is an example. It takes a path argument, and file.path will contain the path argument. (But it doesn't create a fd of course).

In both scenarios we are blurring the boundaries of "Linux concepts".
I think using file.path actually helps keep the concepts different. To me, the name file.path can be used for any syscall that takes a path argument, and fd.* can continue to be used for syscalls that work with fds.

If we start with adding a file class fields, will we tear fd field class further apart in the future for other cases?
I was only planning on adding the 3 fields file.path, file.source, and file.target for now. I just wanted to cover the use case of having a single predictable set of fields that can be used for a variety of file based syscalls.

What about fd.nameraw? Is there going to be an unsanitized file.pathraw version?
For all of file.path/file.source/file.target, the path will be fully resolved, adding the thread cwd when necessary. I guess we could have a file.pathraw if we really want it for consistency, but I found that the raw paths aren't very useful for creating falco rules because you can't really do matching based on them, as the thread cwd could comprise any prefix of the path.

About syscall support:

Will file.path also be populated with proc.exepath?
I wouldn't include proc.exepath as that's about the path used for an exec and not any of these file operations.

What about overloading file.path or the field we agree on with the source or target that is typically of interest for writing rules?
I'm not sure I understand this comment. For the use case I was working on, depending on the syscall, I needed to look at the source in some cases and in the target in others. I don't think we would want to combine source and target together into a single field because you don't always know whether a rule author is interested in the source or the target.

If it helps I can schedule an in-person meeting or we can discuss it during one of the upcoming falco meetings as well.

Thank you for the detailed feedback! I want to make a change that is well-thought out that everyone can agree on as well.

@incertum
Copy link
Contributor Author

hmmm I understand your perspective, but I'm not convinced of file being the right new field class, especially it can now represent anything from file, dir to fd? We may confuse users more than it might help ...

Why not choose something like path.* field class then?

path.name // at least now it's consistent with fd.name naming, we could now even consider extending it to proc.exepath
path.type // we can now index if file or dir or fd
path.nameraw // similar to fd.nameraw and yes it has tons of value for threat detection, anything related to path traversals it is useful for
path.source // need detailed docs, see comments below
path.target // need detailed docs, see comments below

re my question: What about overloading file.path or the field we agree on with the source or target that is typically of interest for writing rules?

It appears that in many existing rules you focus just on one e.g. check out rules Create Symlink Over Sensitive Files or Create Hardlink Over Sensitive Files ... and again wearing my Data Science head you then have that one field you are interested in mostly and you can do your aggregations and stats queries. If you need source and target you query them instead of course. Needs documentation of course.

If we do it, yeah we need to cover everything in a consistent way also chmod/fchmodat/fchmod/fchownat and quotactl btw as well.

... a single predictable set of fields ...

yeah here I also still have reservations, for example for symlink:

  • evt_arg_linkpath is mapped to source currently right?
  • manpage: symlink() creates a symbolic link named linkpath which contains the string target.
  • Example rule: create_symlink and (evt.arg.target in (sensitive_file_names) or evt.arg.target in (sensitive_directory_names))

At least with the original definitions I can read up the manpages and for symlink case, because you could also think of the evt.arg.target being the "source" you create the symlink over which results in the "target" linkpath, hence it can be confusing all I am saying. We would need a concise documentation on the website for every single case similar to the table at the top, because now the manpages are not valid anymore, because we created our own story ...

@mstemm
Copy link
Contributor

mstemm commented Jun 15, 2023

Why not choose something like path.* field class then?

I don't mind changing file.* to path.* if that's the consensus but to me file.* makes more sense as a prefix as it refers to filesystem paths. I could think of other paths that wouldn't be applicable here. proc.exepath is one example. Maybe fspath for filesystem path?

I don't mind adding a field to distinguish between files and directories. But I don't think having fd as a type would make sense as a path for an open would rightly belong in both the file and fd categories. Also there will be many fds (network connections, shared memory regions) that would not be mapped to a filesystem path. If you want to distinguish between directories and files we could add a boolean file.isdir.

I can also add .nameraw, although I think it will be difficult to build rules based on it as depending on the syscall, in some cases you will have fully resolved paths and for other syscalls you will not.

I will add support for all the chmod/chown variants and quotactl (for now continuing to use file.*).

You're right about how symlink maps source/target compared to link, they are currently inconsistent. Here's the current mappings:

  • link/linkat (events LINK, LINKAT, LINK_2, LINKAT_2): man pages use terms oldpath/newpath. oldpath is currently mapped to file.source, newpath is currently mapped to file.target.
  • symlink/symlinkat (events SYMLINK, SYMLINKAT): man pages use terms linkpath/target. linkpath is currently mapped to file.source, target is currently mapped to file.target.

Here's the sysdig output for ln /tmp/origfile /tmp/hardlink (e.g. create a hard link /tmp/hardlink that points to /tmp/origfile) and ln -s /tmp/origfile /tmp/softlink (e.g. create a soft link /tmp/softlink that points to /tmp/origfile):

# sudo sysdig "evt.type in (link, linkat, symlink, symlinkat)"
72021 10:02:31.235151471 5 ln (656314.656314) > linkat
72022 10:02:31.235183585 5 ln (656314.656314) < linkat res=0 olddir=-100(AT_FDCWD) oldpath=/tmp/origfile newdir=-100(AT_FDCWD) newpath=/tmp/hardlink flags=0
104584 10:02:37.127153735 5 ln (656381.656381) > symlinkat
104585 10:02:37.127193844 5 ln (656381.656381) < symlinkat res=0 target=/tmp/origfile linkdirfd=-100(AT_FDCWD) linkpath=/tmp/softlink

So to keep things consistent and also vaguely aligned with the man pages, I'll change the link/linkat variants to map oldpath to .target and newpath to .source.

And I will definitely add documentation for how these new fields will work. It will be a standalone page and referred to from https://falco.org/docs/reference/rules/supported-fields/.

I'll comment again (here I guess) once I've made the above changes.

@mstemm
Copy link
Contributor

mstemm commented Jun 15, 2023

There's a problem with supporting file.isdir (or in general differentiating files and directories) for rename/renameat/renameat2. These can be used for either files or directories and there isn't anything in the event that can be used to distinguish between files/directories. You can't just do a stat() when coming up with a value for the field either, as that wouldn't work for capture files where the original filesystem is not available.

Given this, I think it's better to not include a file.isdir field (or determining the type of the path) at all, unless there are really compelling reasons for it.

@incertum
Copy link
Contributor Author

fspath.* is something I could live with, and not intending to keep poking the bear, but initially it was argued that for the open* syscalls it is important to distinguish that we are dealing with fds and that is why a new field class was needed specifically for syscalls around file manipulations when they are not fds, but now it's not anymore important and we even include directories (here I have my peace saying it once more).

Adding .nameraw would be appreciated for consistency and that's what it is we want to know the raw arg for data modeling, and to reiterate for some adopters it is valuable.

Re the symlinks/hardlinks thanks for checking and yes staying closer to the manpages is probably better, although personally I am not sure why they choose these conventions in the symlink case ...

Thanks for planning a detailed docs page!

Understood re the edge cases around dir/file distinction (thanks also for checking), arguably the syscall tells you if it's a dir I suppose ...

In summary, we could live with the following? @falcosecurity/libs-maintainers any objections or other suggestions? fspath.* or simply fs.* should be ok would say, and I would appreciate broader consensus. Thanks!

fspath.name
fspath.nameraw
fspath.source
fspath.target

@mstemm one more request since we are aligning more and have clarity on including dir, we shall also include umount* syscalls etc?

@mstemm
Copy link
Contributor

mstemm commented Jun 15, 2023

Sounds good to me! I'll start making the changes. If we're going to support .nameraw I think we also want .sourceraw/.targetraw so I will support those as well.

And yes I can support umount. Seems like I should support mount as well then. source will be the path to the device and target will be where the device is mounted. This matches the man page.

@incertum
Copy link
Contributor Author

Thanks Mark!

@mstemm
Copy link
Contributor

mstemm commented Jun 15, 2023

I finished all the changes above. fspath.name isn't working for fchmod/fchown as they work on fds and not paths. I should be able to fix that, but I'm going to make the change in a separate PR which I'm working on now.

@incertum
Copy link
Contributor Author

Great just left some comments in the PR, I think we can finish the review in the PR and unless someone has objections we can consider the discussion here resolved.

For above plz add code comments indicating it's a todo to be fixed, ok from my end to do it in a new PR of course.

@jasondellaluce
Copy link
Contributor

Thanks to you both for the great discussion. I think both fspath.* and fs.* would make a good UX and satisfy the use cases involved. My vote would go for fs.* because it's less verbose and in the same style line as fs.*, but I'm not gonna fight for it. I'm gonna review Mark's PR.

@incertum
Copy link
Contributor Author

@jasondellaluce would we then go with fs.path instead of fs.name? What about renaming the other ones? Just so it can be changed correctly in one clean commit. Thank you!

@jasondellaluce
Copy link
Contributor

@jasondellaluce would we then go with fs.path instead of fs.name? What about renaming the other ones? Just so it can be changed correctly in one clean commit. Thank you!

I like fs.path! But then it will be dependent on whether everyone is onboard with using fs.*.

@incertum
Copy link
Contributor Author

Great let's go with below then, it appears to be the most consistent with current conventions in mind and would allow us to build fs.* field class out in the future:

fs.path.name
fs.path.nameraw
fs.path.source
fs.path.sourceraw
fs.path.target
fs.path.targetraw

@incertum
Copy link
Contributor Author

Closing and marking this discussion as concluded. We have reached alignment and the PR is merged as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants