Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add io_uring_* syscalls #1264

Closed
wants to merge 1 commit into from
Closed

Conversation

rhatdan
Copy link
Member

@rhatdan rhatdan commented Dec 9, 2022

Fixes: containers/podman#16796

Signed-off-by: Daniel J Walsh dwalsh@redhat.com

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 9, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rhatdan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Fixes: containers/podman#16796

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
@rhatdan
Copy link
Member Author

rhatdan commented Dec 11, 2022

@giuseppe PTAL

@giuseppe
Copy link
Member

io_uring is that it is a multiplexer for multiple syscalls, so it can effectively be used to workaround a seccomp profile.

I don't see any dangerous syscall at the moment, but this could change in the future as new syscalls are supported:

enum io_uring_op {
	IORING_OP_NOP,
	IORING_OP_READV,
	IORING_OP_WRITEV,
	IORING_OP_FSYNC,
	IORING_OP_READ_FIXED,
	IORING_OP_WRITE_FIXED,
	IORING_OP_POLL_ADD,
	IORING_OP_POLL_REMOVE,
	IORING_OP_SYNC_FILE_RANGE,
	IORING_OP_SENDMSG,
	IORING_OP_RECVMSG,
	IORING_OP_TIMEOUT,
	IORING_OP_TIMEOUT_REMOVE,
	IORING_OP_ACCEPT,
	IORING_OP_ASYNC_CANCEL,
	IORING_OP_LINK_TIMEOUT,
	IORING_OP_CONNECT,
	IORING_OP_FALLOCATE,
	IORING_OP_OPENAT,
	IORING_OP_CLOSE,
	IORING_OP_FILES_UPDATE,
	IORING_OP_STATX,
	IORING_OP_READ,
	IORING_OP_WRITE,
	IORING_OP_FADVISE,
	IORING_OP_MADVISE,
	IORING_OP_SEND,
	IORING_OP_RECV,
	IORING_OP_OPENAT2,
	IORING_OP_EPOLL_CTL,
	IORING_OP_SPLICE,
	IORING_OP_PROVIDE_BUFFERS,
	IORING_OP_REMOVE_BUFFERS,
	IORING_OP_TEE,
	IORING_OP_SHUTDOWN,
	IORING_OP_RENAMEAT,
	IORING_OP_UNLINKAT,
	IORING_OP_MKDIRAT,
	IORING_OP_SYMLINKAT,
	IORING_OP_LINKAT,
	IORING_OP_MSG_RING,
	IORING_OP_FSETXATTR,
	IORING_OP_SETXATTR,
	IORING_OP_FGETXATTR,
	IORING_OP_GETXATTR,
	IORING_OP_SOCKET,
	IORING_OP_URING_CMD,
	IORING_OP_SEND_ZC,

	/* this goes last, obviously */
	IORING_OP_LAST,
};

I am not sure we should permit it by default.

Copy link
Member

@vrothberg vrothberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Docker allows these syscalls. But I prefer another head nod given the security concerns.

@Luap99
Copy link
Member

Luap99 commented Dec 12, 2022

I think this isn't future proof. The seccomp profile is deny by default, right? So when the kernel adds a new syscall we have to add it explicitly to the allow list. If io_uring adds support for a new syscall it is impossible to restrict this use if it is considered dangerous. Of course I have no idea how likely this is but when this happens we might need to disallow io_uring here which will then break users that rely on io_uring.

@giuseppe
Copy link
Member

I think this isn't future proof. The seccomp profile is deny by default, right? So when the kernel adds a new syscall we have to add it explicitly to the allow list. If io_uring adds support for a new syscall it is impossible to restrict this use if it is considered dangerous. Of course I have no idea how likely this is but when this happens we might need to disallow io_uring here which will then break users that rely on io_uring.

agreed. I think we need a new profile, we have no control over what other syscalls will be enabled in io_uring.

@rhatdan
Copy link
Member Author

rhatdan commented Dec 12, 2022

How do you envision users dealing with this?

@Luap99
Copy link
Member

Luap99 commented Dec 12, 2022

Well they would need to create their own profile or we ship one and then the user only needs to do --security-opt seccomp=/usr/share/containers/seccomp-io-uring.json for example.

Ideally the kernel would just apply the seccomp filter for the io_uring operations just like the normal syscalls. To me it fells extremely weird that you can bypass the seccomp filter like that but I have no idea about the kernel internals so I guess there must be reasons for that.

@rhatdan
Copy link
Member Author

rhatdan commented Dec 12, 2022

That would mean us shipping a separate json file or users being smart enough to figure this out on their own. (Not likely).

@rhatdan
Copy link
Member Author

rhatdan commented Dec 12, 2022

Should we add the ability to modify the seccomp file directly with

podman run --security-opt seccomp:+io_uring ...

@vrothberg
Copy link
Member

Let's explore the problem space a bit further. I wonder

  • Does the kernel give certain guarantees on which (kinds of) syscalls may be added in the future? I am sure they are aware of seccomp and won't risk a priority inversion or security hole.
  • What was the reasoning for Docker to add the syscall? Looking up their decision making may reveal some information we don't have at the moment.

@giuseppe
Copy link
Member

I've asked the Moby folks why they decided to enable io_uring: moby/moby#41223 (comment)

If it is an accepted risk from the community, then let's go forward with it but to me it still sounds like a risk to open up to syscalls we have no control over and it is also not possible to control what io_uring does from seccomp at the moment.

Should we add the ability to modify the seccomp file directly with

podman run --security-opt seccomp:+io_uring ...

seccomp customization is difficult, especially with the JSON file we use. Maybe we need to provide a series of curated profiles, since CRI-O and Podman are already using different profiles, something like --security-opt seccomp:(locked|default|build|performance).

@rhatdan
Copy link
Member Author

rhatdan commented Dec 14, 2022

I like having limited choice that we can document, and perhaps diagnose when it fails.

I don't think we should merge this PR, since we should be as secure if not more secure the Docker.

Having an easy wait to turn on io_uring for users who actually need it would be better then turning it on for everyone by default.

@rhatdan rhatdan changed the title Add io_uring_* syscalls [WIP] Add io_uring_* syscalls Dec 14, 2022
@vrothberg
Copy link
Member

Having an easy wait to turn on io_uring for users who actually need it would be better then turning it on for everyone by default.

That's really something to improve on. Changing seccomp profiles is hard when we only want to add one ore two syscalls.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow containers access to io_uring syscalls
4 participants