Sized_allocs -> Slices

Updated Oct 10, 2018

We have a struct sized_alloc, which originally was a buffer + length. Various 9ns devices use this to construct the output of a synthetic file, such as /proc/self/maps. The buffer + length combo is useful for doing things like snprintf - we want to build a file, but not go outside of the limits.

Eventually, we added another variable to track how far into the buffer we've printed/written. That was also useful for snprintf, and lowered the amount of boilerplate in 9ns device code.

At this point, we almost have slices. The struct sized_alloc has a void *buf, a buflen, and 'sofar'. Those correspond to a Go slice's buffer, capacity, and length.

The one downside to the sized_alloc is that you need to know (or guess at) the size beforehand. Using something like a slice would let us grow on demand and would simplify the code. The downside of slices is that we'd need to be careful about concurrent uses of the buffer. With the sized_alloc, the buffer never changes. With slices, the underlying buffer can change (krealloc). Most uses are OK, but this needs to be confirmed.
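
For illustration, here's a rough sketch of what a kernel slice could look like. The names (struct slice, slice_reserve) are made up, and the krealloc/MEM_* details are assumptions, not a finished design:

    struct slice {
        void *buf;          /* backing buffer; may move when grown */
        size_t capacity;    /* bytes allocated (cf. buflen) */
        size_t len;         /* bytes used so far (cf. sofar) */
    };

    /* Grow the buffer to hold at least 'need' bytes.  Returns 0 on success.
     * Callers must not hold pointers into buf across this call - that's
     * exactly the concurrency question raised above. */
    static int slice_reserve(struct slice *s, size_t need, int mem_flags)
    {
        size_t new_cap = s->capacity ? s->capacity : 64;
        void *new_buf;

        if (need <= s->capacity)
            return 0;
        while (new_cap < need)
            new_cap *= 2;
        new_buf = krealloc(s->buf, new_cap, mem_flags);
        if (!new_buf)
            return -1;
        s->buf = new_buf;
        s->capacity = new_cap;
        return 0;
    }

A slice_printf, analogous to sza_printf, would sit on top of this: reserve enough room, then vsnprintf into buf + len.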

For this project:

  • Implement slices in the kernel, being careful to define the rules of concurrency. "immutable after publication" is one such rule, but this requires thought. rcu-safe readers and mutations is another (probably better) approach. Dealing with concurrency is the hard part of this effort.
  • Implement other "non-standard" helpers, like slice_printf, similar to sza_printf.
  • Replace all sized_allocs with slices.

We have limited support for static and dynamic per-cpu variables. Linux's support is probably better, but possibly more complicated.

Additionally, we have two sources of per-cpu stuff: the old "per_cpu_info" (a.k.a. pcpui) and the percpu.h variables. We can probably merge the two.

This project might get tricky, and will involve digging into Linux's percpu headers and C files.

One potential course of action:

  • rename all of our per_cpu interfaces to match Linux's.
  • be careful of _PERCPU_VARPTR for dynamic variables. That might need special casing til the work is done.
  • move pieces of pcpui to per_cpu. using helper structs for things like routine and immed kmsgs will help. certain fields, such as owning_vcoreid and owning_proc, logically go together.
  • save the stuff that must be first (for x86) for last. Linux has DECLARE_PER_CPU_FIRST, which should do what we want.
  • bring over the linux headers and implementation

The last step might break dynamic variables for a little while. There's a lot of code in that area that we might not want to bring over from Linux.

The tricky thing is that our kernel entry points rely on GS holding the value of pcpui. That could still be the case, but debugging could be hard. Maybe use FS for a commit or two.
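
For reference, the Linux-style interface we'd be renaming to looks roughly like this (a sketch from memory, not a drop-in header):

    #include <linux/percpu.h>

    /* static per-cpu variable, one instance per core; DEFINE_PER_CPU_FIRST
     * is the variant that pins a variable at the start of the per-cpu area
     * (what the "must be first for x86" pcpui fields would use) */
    static DEFINE_PER_CPU(unsigned long, nr_traps);

    static void count_trap(void)
    {
        this_cpu_inc(nr_traps);         /* this core's copy */
    }

    static unsigned long read_traps_on(int cpu)
    {
        return per_cpu(nr_traps, cpu);  /* some other core's copy */
    }

    /* dynamic per-cpu allocation, the analogue of our _PERCPU_VARPTR users */
    static unsigned long __percpu *alloc_counters(void)
    {
        return alloc_percpu(unsigned long);     /* free with free_percpu() */
    }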

We should probably rename freeb -> block_free and freeblist -> block_list_free. There might be others like that. Likewise, a lot of the Plan 9 constants are nasty: ETHERMAXTU -> ETH_DATA_LEN, ETHERHDRSIZE -> ETH_HLEN, or something similar, preferably whatever Linux uses. We have some Linux cocci files that do the opposite rename, and those should be removed. Part of the motivation here is that ETHERMAXTU is subtly different in Akaros than it was in Plan 9, specifically regarding whether or not the header is tracked.

This is a simple project that gives you a chance to use spatch.

NICs have limitations on how big a scatter-gather (SG) transaction can be. It'd be nice not to have to linearize those blocks, and instead have some helpers. Perhaps TCP knows what the limit is (e.g. 16, as in Linux) and doesn't build blocks larger than that. Or we have a block splitter function: given a block, yank out X SG chunks into a new block.
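
Here's a very rough, hypothetical sketch of the splitter idea. The struct block fields and helpers used below (block_alloc(), block_append_extra(), block_trim_extra_front()) are assumptions about the qio API, not a tested patch, and refcounting/ownership is glossed over:

    /* Carve off up to max_sg SG chunks into a new block that the NIC can
     * take in one transaction, leaving the rest in the original. */
    struct block *block_split_sg(struct block *b, int max_sg)
    {
        struct block *first;
        int take, i;

        /* the header (rp..wp) occupies one SG slot */
        take = MIN(max_sg - 1, b->nr_extra_bufs);
        first = block_alloc(BHLEN(b), MEM_WAIT);
        memcpy(first->wp, b->rp, BHLEN(b));
        first->wp += BHLEN(b);
        b->rp = b->wp;      /* header now lives in 'first' */
        /* hand the first 'take' extra-data buffers over */
        for (i = 0; i < take; i++)
            block_append_extra(first, b->extra_data[i].base,
                               b->extra_data[i].off, b->extra_data[i].len);
        block_trim_extra_front(b, take);    /* drop them from 'b' */
        return first;
    }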

Plan 9 seems to have some rudimentary support for deactivating devices. We have ethernet function pointers called 'shutdown', but nothing calls them. You can bet none of that code has been tested either. For instance, devether->shutdown calls detach, not the specific shutdown routines...

Clean up PCI MSI/MSI-X

Updated Nov 15, 2017

Our MSI support goes as follows: when a device registers an interrupt, first we try MSI-X. If that fails, we try MSI. If that fails, we set up an old-school interrupt. The problem with this is that the driver has no control over whether or not MSI(X) is used, nor does it even know which mode it ended up with. Some drivers (e.g. r8169) want to do something based on whether or not MSI is enabled.

The fix is to change our PCI MSI APIs a little, such that the devices have more say in the matter. This is a relatively straightforward project involving only the kernel. Ideally, you should have access to a device (including Qemu) that is using MSI/MSI-X so you can test your code.
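
One hypothetical shape for the new API (names invented, similar in spirit to Linux's pci_alloc_irq_vectors()): the driver states which modes it will accept and is told which one it actually got.

    enum pci_irq_mode { PCI_IRQ_LEGACY, PCI_IRQ_MSI, PCI_IRQ_MSIX };

    int pci_setup_irq(struct pci_device *pcidev, void (*handler)(void *arg),
                      void *arg, int allowed_modes, enum pci_irq_mode *got);

    void r8169_irq(void *arg);      /* the driver's ISR */

    /* e.g. in the r8169 driver: */
    void r8169_init_irq(struct pci_device *pcidev)
    {
        enum pci_irq_mode mode;

        if (pci_setup_irq(pcidev, r8169_irq, pcidev,
                          (1 << PCI_IRQ_MSIX) | (1 << PCI_IRQ_MSI), &mode))
            pci_setup_irq(pcidev, r8169_irq, pcidev,
                          1 << PCI_IRQ_LEGACY, &mode);
        /* the driver can now branch on 'mode', e.g. skip INTx-specific
         * setup when it knows MSI/MSI-X is in use */
    }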

Update PCI info

Updated Nov 15, 2017

Our PCI info comes from a header of structs from pcidatabase.com, which isn't active anymore. http://pci-ids.ucw.cz/ maintains a formatted text list of PCI vendor/device info. We need a script or something that will convert that text list to an auto-generated C file that we can use to replace our older PCI info. We'd be open to other alternatives to easily maintaining PCI info. This is a good starter project.
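
For example, the auto-generated file might look something like this (the struct names are just a suggestion; the sample entries are from the pci.ids list):

    struct pci_device_id {
        uint16_t device;
        const char *name;
    };

    struct pci_vendor_id {
        uint16_t vendor;
        const char *name;
        const struct pci_device_id *devices;
        size_t nr_devices;
    };

    static const struct pci_device_id pci_devs_8086[] = {
        { 0x100e, "82540EM Gigabit Ethernet Controller" },
        { 0x10d3, "82574L Gigabit Network Connection" },
        /* ... */
    };

    static const struct pci_vendor_id pci_vendors[] = {
        { 0x8086, "Intel Corporation", pci_devs_8086, ARRAY_SIZE(pci_devs_8086) },
        /* ... */
    };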

Update our Go Port

Updated Nov 15, 2017

Kevin ported Go 1.3 to Akaros. Back in 2015, we passed all of the Go tests. Since then, both Go and Akaros have moved on. The first step would be to revive the old port, make sure it works with the latest Akaros, and then see what needs to be fixed to support the latest version of Go. One big change is that Go no longer has C code in its runtime, which may be difficult for our port.

Static Analysis of the Kernel

Updated Nov 15, 2017

There are a bunch of tools for static analysis, such as sparse and smatch. People use them to find bugs on Linux all the time. We should do the same. Given the OS-like nature of parlib, perhaps that code should be audited too. One interesting question is whether or not we'd need different static analysis tools and hooks due to differences between Akaros and Linux.

No XMMs from Vcore Context

Updated Nov 15, 2017

Vcore context code, which is user-level and analogous to IRQ context, can clobber XMM registers (and floats). This is because it can call into glibc. Functions like snprintf, strlen, memset, etc, all use XMMs under the hood, perhaps transitively. Until we can be sure that vcore context won't do this, we need to aggressively save and restore the FPU/XMM state every time we forcibly enter vcore context - uthread traps, vmexits, IPI/notifs, etc. This adds about 250 cycles. By comparison, a software IPI/notify costs about 2850 cycles. A raw IRQ-and-iret costs about 900 cycles.

Update lspci port

Updated Nov 15, 2017

We have a rough port of pciutils (https://github.com/brho/pciutils). It uses raw programmed I/O. This is a little dangerous, since the PCI accesses will race with legitimate accesses. The right answer is to use the pci device (#pci), instead of raw PIO.

This project should be straightforward and would be a good beginner project.

We have half-baked support for gettimeofday. We don't support multiple clock types and you can't set the date/time. For the most part, userspace communicates with the kernel in terms of TSC ticks since boot, and most interfaces use absolute time (not relative). Whether or not that is a good idea is unclear, but we should be more consistent with how the user and kernel talk about time. Perhaps all the kernel needs to know is ticks since boot.
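
For reference, the tick-to-wall-clock conversion that any redesign needs to nail down looks roughly like this (tsc_freq_hz and boot_walltime_ns are placeholders for wherever the kernel keeps those values):

    #include <stdint.h>
    #include <time.h>

    static uint64_t tsc_freq_hz;        /* measured at boot */
    static uint64_t boot_walltime_ns;   /* wall-clock time at boot */

    static uint64_t tsc2nsec(uint64_t ticks)
    {
        /* split the math so large tick counts don't overflow 64 bits */
        return (ticks / tsc_freq_hz) * 1000000000ULL +
               (ticks % tsc_freq_hz) * 1000000000ULL / tsc_freq_hz;
    }

    static void ticks_to_timespec(uint64_t ticks_since_boot, struct timespec *ts)
    {
        uint64_t now_ns = boot_walltime_ns + tsc2nsec(ticks_since_boot);

        ts->tv_sec = now_ns / 1000000000ULL;
        ts->tv_nsec = now_ns % 1000000000ULL;
    }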

This project is open-ended and will involve glibc, the kernel, and to a lesser extent the PVC alarms. It's not clear what exactly we need, so you'll need to do a little research into the various clocks and use cases before suggesting a new design. Some of this work might be done, since this project is quite old.

Memory Allocation Flags

Updated Nov 15, 2017

Many memory allocation functions take '0' for their flags argument, where the absence of MEM_WAIT means we want an atomic allocation. We should explicitly change those to MEM_ATOMIC.
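
The change itself is mechanical (struct foo here is a stand-in for any call site), which is why spatch fits well:

    static void example(void)
    {
        /* before: a bare '0' implicitly requests an atomic (non-sleeping)
         * allocation */
        struct foo *f = kmalloc(sizeof(struct foo), 0);

        /* after: say what we mean */
        struct foo *g = kmalloc(sizeof(struct foo), MEM_ATOMIC);

        (void)f;
        (void)g;
    }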

This project is a straightforward change to the kernel. You can probably use spatch to automate the change.

Overflowing Multiplication

Updated Nov 15, 2017

In our array allocators (kreallocarray() from BSD and kmalloc_array()), we perform a custom check for multiplication overflow of size_t. However, we also have a kernel helper that checks for u64 overflow. Ideally, we'd have a helper that can take any type and tell us if the multiplication will overflow; both array allocators would then use it. The helper would also provide the opportunity for arch-specific includes.
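
One way to get a type-generic check is the compiler builtin; a sketch, assuming a gcc/clang new enough to have __builtin_mul_overflow (older compilers would need the arch-specific or open-coded fallback mentioned above):

    #define mul_overflows(a, b, res) __builtin_mul_overflow((a), (b), (res))

    static void *kmalloc_array_sketch(size_t nmemb, size_t size, int mem_flags)
    {
        size_t bytes;

        if (mul_overflows(nmemb, size, &bytes))
            return NULL;
        return kmalloc(bytes, mem_flags);
    }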

This project is relatively straightforward, with the opportunity to do a little measurement and to play around with the kernel build system and CPP.

Fork/Exec

Updated Nov 15, 2017

Akaros has fork and exec syscalls, usable by SCPs only, for compatibility with existing C programs. Fork/exec is a bad model for process creation: it involves a variety of security hacks (e.g. CLOEXEC) and is in general a mess for parallelism and MCPs.

Akaros's preferred process creation method is to create the process, then make it runnable. In between, the parent can do things like set up the FD table and whatever else we want to add.
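
Roughly, from the parent's side it looks like this. The create/run syscalls exist, but the wrapper names, signatures, and the FD-passing helper below are simplified placeholders, not copied from the tree:

    int spawn_child(const char *path, char **argv, char **envp, int child_stdout)
    {
        int child;

        /* create the process, but don't make it runnable yet */
        child = proc_create(path, argv, envp);
        if (child < 0)
            return -1;
        /* parent-side setup: install one of our FDs as the child's stdout,
         * the moral equivalent of the dup2() calls between fork and exec */
        dup_fd_to_child(child, child_stdout, 1);
        /* now let it run */
        proc_run(child);
        return child;
    }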

Ideally, we'd be able to implement fork/exec under the hood with create/run. If we do it right, we can even get the kernel out of the business of argument parsing (e.g. have the parent set up the child's stack). However, exactly how we'd do this isn't clear; otherwise we probably would have done it already. The issue is the intermediate code that runs between fork and exec in the child's context. Even something as simple as tests/listen1.c has an accept9(), a few closes, and a few dups between the fork and exec.

Anyway, if someone can come up with a nice way to get fork/exec out of the kernel, that'd be great. It may be more likely that we eventually just change all code that calls fork manually before solving this problem. This project might not have a good solution.

Syscall Wrappers

Updated Nov 15, 2017

When userspace invokes a syscall (synchronous or asynchronous), it usually uses a wrapper function that takes the maximum number of arguments. We could use wrappers for the particular number of arguments, e.g. of the form syscall4(SYS_number, a, b, c, d). This goes for syscall_async() as well. While messing around in this area, you may find other related optimizations that can speed up our syscalls.
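
A sketch of the proposed form (struct syscall's num/arg0..arg5/retval fields are from memory, and submit_and_wait() is a placeholder for the existing synchronous submit path); the win is that syscall1() only marshals one argument instead of six:

    static inline long syscall1(unsigned int num, long a0)
    {
        struct syscall sysc;

        /* other bookkeeping fields (flags, ev_q, etc.) elided */
        sysc.num = num;
        sysc.arg0 = a0;
        /* arg1..arg5 intentionally left untouched */
        submit_and_wait(&sysc);
        return sysc.retval;
    }

    static inline long syscall4(unsigned int num, long a0, long a1, long a2, long a3)
    {
        struct syscall sysc;

        sysc.num = num;
        sysc.arg0 = a0;
        sysc.arg1 = a1;
        sysc.arg2 = a2;
        sysc.arg3 = a3;
        submit_and_wait(&sysc);
        return sysc.retval;
    }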

This is a straightforward project that involves working with and rebuilding glibc and other parts of userspace.

IPv6

Updated Nov 15, 2017

The Plan 9 stack has support for IPv6, though we haven't tested it in QEMU or on hardware. Likewise, most of the BSD sockets shims don't support v6 (some might; others don't). We need someone to get the IPv6 stack working, figure out how to configure and use it from userspace, and then make sure the BSD sockets shims can handle AF_INET6. Ideally, we'd continue to support v4, but that's not necessarily mandatory. We also need to make sure Drew's network stack optimizations work on v6.

This project involves the kernel, glibc, ifconfig, user tests, and maybe QEMU. Some parts of the BSD shims may require more intense coding.

We have a couple CI tidbits in place, but are lacking a full solution.

Alfonso and Kevin put together a VM image that runs Jenkins to test the kernel. We need to get this VM up and running in Gcloud and regularly testing either the master branch or a pre-master branch. Right now, the main test is "ash ifconfig; get_html". We also need a more comprehensive testing suite. This VM should also run the Go tests and ideally hook in to the Go dashboard.

Gan and Dan set up Travis, but all it can do right now is build the kernel - it can't drive any tests. And it tends to time out, since it is forced to rebuild the toolchain (slowly!) every run.

This project is relatively straightforward and does not involve heavy work in the codebase.

We're running an older GCC / glibc / binutils. Updating them to the latest version will be an exercise in patience.

Warning-Free Glibc

Updated Nov 15, 2017

Our glibc build has a lot of warnings. Often these warnings hide real problems. Although we're lucky we even have an Akaros port, it'd be great if we could build with -Werror and remove all of the warnings. That would help catch future issues.

This project involves building glibc repeatedly and probably learning a lot about how the glibc porting process works.

Rocks and User-level FDs

Updated Nov 15, 2017

This is copied from an older email.

if you take a look at _sock_findrock(), you'll see a potential mess.

(tools/compilers/gcc-glibc/glibc-2.19-akaros/sysdeps/akaros/plan9_sockets.c)

what's going on is that users of the sockets shims (socket(), connect(), bind(), etc.) have only an FD to identify a socket, and this FD gets passed to read() and write(). but from this FD, we want to perform various controls on sockets.

under the hood, plan 9 has two FDs for every conversation, which is analogous to the socket. the Rock is a user-level structure that exists per socket (made with the shims) that contains various bits of bookkeeping.

    /*
     *  since BSD programs expect to perform both control and data functions
     *  through a single fd, we need to hide enough info under a rock to
     *  be able to open the control file when we need it.
     */
    struct Rock {
        Rock *next;
        unsigned long dev;      /* inode & dev of data file */
        unsigned long inode;    /* ... */
        int domain;             /* from socket call */
        int stype;              /* ... */
        int protocol;           /* ... */
        struct sockaddr addr;   /* address from bind */
        int reserved;           /* use a priveledged port # (<1024) */
        struct sockaddr raddr;  /* peer address */
        char ctl[Ctlsize];      /* name of control file (if any) */
        int other;              /* fd of the remote end for Unix domain */
    };

_sock_findrock()'s job is to do a lookup from socket FD to Rock. if you look at the code, you'll see it's racy (the #warning may give it away). even worse, there's a linear scan of all Rocks, trying to match the given socket FD to the Rock. there's also a stat() call, to go from FD -> { device, inode }, which is then used to find the Rock.

it'd be nice to not only deal with the various races, but also get rid of the linear scan and ideally the stat call too. it's easy enough to get rid of the races on the list management. though another issue is that there can be several threads working with the same Rock. we'd need to make sure any writes to Rocks after creation are okay.

for reference, _sock_findrock() is called from accept, bind, connect, listen, getpeername, and getsockname.

one option dan mentioned is to use sys_fd2path on the data FD. that would allow us to find the ctl's name. if we did not need any other parts of the Rock, then we'd be good. not sure if this is true or not. it looks like at least the struct sockaddr addr gets used a lot.

this reminds me of some older discussions about bundling other types of information into an FD. stuff like "have a type field for something like shared memory, and then have read/write do something different". not sure if any of those uses are worthwhile. regardless, the Rock is one such type of info that we'd want to connect to the underlying file/chan.

there are other options for Rock/whatever lookup that are available since we control glibc. we can't change the type of the FD to anything other than an int (for compatibility), but we can intercept every call that sends FDs into the kernel. in fact, we already have - those are the glibc sysdeps. all functions that pass an FD into the kernel should be intercepted by our sysdeps, so that they can call our SYS_whatevers.

one option is to put more info in the FD - like both the ctl and the data FD. all non-socket functions that take FDs just ignore the top part (ctl) and operate on the bottom (data). but then we'd only have 16 bits for actual FDs, so that could be a problem (or close to one). and it doesn't have any of the other Rock info.

another option would be to have an array of Rocks, indexed by FD. every time we do something to the FD, we need to keep the Rock in sync. stuff like: zero the Rock when the FD is closed (or opened), etc. at least lookup is fast.

one pain is dup, which exposes the real issue: Rocks correspond to the underlying chan/conv; the FD is just a way to point to it. so we'd have to catch dups. but of course, we could have multiple FDs pointing to the same Rock, so perhaps the array should be Rock pointers, not Rocks. and that array would need to grow.

maybe there are options combining some of these tricks. i like the idea of using some of the upper bits of the FD for things. bit 31 is reserved, since negative values for FDs are errors. maybe we could have a flag that says if an FD has a magic struct (Rock) or not, and if so, we have a few bits for a quick lookup (maybe an index into a Rock array or array of short lists). that'd avoid the O(n) lookup and the sys_stat() call. it'd also be nice to avoid locks.

or maybe we just stick with the stat() call and just replace the Rock list with a non-racy hash table, and make sure concurrent operations on the same Rock are okay (which we need to do anyways).

either way, we'll eventually need to do something about this. it's probably not reasonable to have everyone who cares about scalability change the glibc/sockets apps to use the plan 9 network interface. though maybe that is reasonable; it's been the plan for now.
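
To make the array-of-Rock-pointers option concrete, here's a rough sketch. All names are hypothetical, and the single coarse lock is just for illustration (something less racy or RCU-ish would be nicer):

    static Rock **rock_tab;     /* grown to cover the largest socket FD seen */
    static size_t rock_tab_sz;
    static lock_t rock_lock;    /* placeholder lock type */

    static Rock *lookup_rock(int fd)
    {
        Rock *r = NULL;

        lock(&rock_lock);
        if (fd >= 0 && (size_t)fd < rock_tab_sz)
            r = rock_tab[fd];
        unlock(&rock_lock);
        return r;
    }

    /* called from socket() when a Rock is made, and from our close()/dup()
     * sysdeps to keep the table in sync with the FD space; 'r' is NULL on
     * close, and the same pointer on dup (Rocks track the chan, not the FD) */
    static int set_rock(int fd, Rock *r)
    {
        lock(&rock_lock);
        if ((size_t)fd >= rock_tab_sz) {
            size_t new_sz = rock_tab_sz ? 2 * rock_tab_sz : 128;
            Rock **new_tab;

            while (new_sz <= (size_t)fd)
                new_sz *= 2;
            new_tab = realloc(rock_tab, new_sz * sizeof(Rock *));
            if (!new_tab) {
                unlock(&rock_lock);
                return -1;
            }
            memset(new_tab + rock_tab_sz, 0,
                   (new_sz - rock_tab_sz) * sizeof(Rock *));
            rock_tab = new_tab;
            rock_tab_sz = new_sz;
        }
        rock_tab[fd] = r;
        unlock(&rock_lock);
        return 0;
    }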

Another thing that came up since this email was written is that we might want to interpose on glibc's close() call. Due to the 'multiple libcs problem', we can't just change __close(). We'd need an internal __close() and our own close(). And all internal glibc uses of close() need to be changed to __close().

We have an existing branch of tracing code that needs to be finished, called 'ttrace'. IIRC, ttrace is a ring buffer that can store events and their timestamps. You'd use it to track things like when threads sleep on semaphores, when processes are created and destroyed, etc. That way, when we want to look at why things are slow, we can get a general picture of what the system was doing at a macro level.
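
A sketch of the sort of thing ttrace provides (the layout and names here are illustrative, not the actual ttrace code): a per-core ring of timestamped events that a post-processor can merge and sort.

    struct ttrace_event {
        uint64_t tsc;       /* timestamp (TSC ticks) */
        uint32_t type;      /* e.g. TT_SEM_SLEEP, TT_PROC_CREATE */
        uint32_t coreid;
        uint64_t arg0;      /* event-specific, e.g. sem or proc pointer */
        uint64_t arg1;
    };

    #define TTRACE_RING_SZ 4096     /* power of two, events per core */

    struct ttrace_ring {
        struct ttrace_event ev[TTRACE_RING_SZ];
        unsigned long idx;          /* monotonically increasing write index */
    };

    static void ttrace_log(struct ttrace_ring *ring, uint32_t type,
                           uint64_t arg0, uint64_t arg1)
    {
        struct ttrace_event *e = &ring->ev[ring->idx++ & (TTRACE_RING_SZ - 1)];

        e->tsc = read_tsc();
        e->type = type;
        e->coreid = core_id();
        e->arg0 = arg0;
        e->arg1 = arg1;
    }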

We probably want to hook into our existing perf events code. Maybe ttrace can work for that. If not, it should be removed. One way or another, it'd be nice to trace events that aren't just perf counters. Maybe that involves eBPF too.

Character I/O

Updated Nov 15, 2017

Our terminal I/O is rather primitive. Whenever someone writes to stdout or the kernel does a printk, we basically write raw bytes to the serial port and monitor. There is very basic processing so that backspace works and certain escape characters are ignored. In the end, we can't do things like run vim on Akaros, or do anything that takes control of the full terminal 'window.'

There are two or three things I'd like here. The first is for our glibc-based apps to be able to do basic terminal control. The kernel is basically an intermediary between glibc-based programs that would run on Akaros (e.g. vim) and a Linux terminal on the other end of a connection (serial I/O for now, eventually something like ssh). It'd be nice if the kernel was involved only minimally in console management and just passed all of the other data through. Right now, the kernel lies to the user whenever there is a tc{get,set}attr call.

The second thing is that we need to continue to allow printk and other 'emergency' debugging to work at all times. The current situation is that printk is a very low-level print. We can output characters from any context (IRQ, etc), and console output and input are used 'interactively'. For instance, there is a kernel monitor (CTRL-G to drop into it, or just trigger a panic!) that polls all of the input ports (keyboard and every serial line). Whatever we come up with needs to not break this, and ideally not corrupt the formatting of the prints. So to some extent, there is a multiplexing of user I/O (which adheres to whatever the terminal protocol is) and kernel I/O (which is just raw bytes at this point).

The third thing is that it'd be nice to be able to use a monitor for output still. Right now, the kernel just writes characters into the magic CGA/VGA space. I haven't looked at it in years. Kevin built a scrollback buffer back in 2009. It'd be nice if some form of monitor I/O still worked so that you could run Akaros on a laptop. Yes, that's not our target machine, but it's very useful for debugging on hardware. I use a chromebook for this sometimes.

This is a cosmetic fix, and its benefit is convenience for users of the system. It's also not on the critical path, nor does it muck with much of the core kernel.

OOM Handling

Updated Nov 15, 2017

We don't handle running OOM nicely. We have a couple hooks for flushing the page cache, but that's it.

We need a decent OOM handler. One bare minimum solution would be to reserve a fixed amount of RAM for use in OOM situations, and then use that RAM to free up more RAM. It's one of those things where "it takes RAM to make RAM" (just like money). To kill a process, the kernel may need to alloc small bits of RAM here and there (e.g. to send kernel messages around). So we'd need to figure out who to kill (or just pick someone) and then be able to survive long enough to get that process's RAM back.
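
A minimal sketch of the reserve idea (names hypothetical, locking elided): set pages aside at boot and only dip into them from the OOM path, so the killer itself can still allocate.

    #define OOM_RESERVE_PAGES 256

    static void *oom_reserve[OOM_RESERVE_PAGES];    /* kernel addresses of pages */
    static size_t oom_reserve_nr;

    void oom_reserve_init(void)
    {
        for (size_t i = 0; i < OOM_RESERVE_PAGES; i++)
            oom_reserve[oom_reserve_nr++] = alloc_one_page();   /* placeholder allocator */
    }

    /* called only from the OOM path, when a normal allocation fails while
     * we're trying to kill a victim and reclaim its memory */
    void *oom_page_alloc(void)
    {
        if (!oom_reserve_nr)
            return NULL;
        return oom_reserve[--oom_reserve_nr];
    }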

There's a lot more to OOM than that. A full solution involves doing a bit of research into the existing methods (OOM is a broad problem) and coming up with something that fits our needs.

Our current OOM killer is "restart the OS". That works for now.

GDB on Akaros

Updated Nov 15, 2017

update: Chris did some GDB work and got it working with SCPs. Check out his stuff, my comments, and pick it up from there.

old description:

It'd be nice to be able to use GDB on programs that run on Akaros. Right now, the usual style of user debugging is printf, with the ability to set up event handlers to printf bits of info whenever a process is externally poked. Then when we want to see what a process is doing (say it appears to be deadlocked), we look at the core to see where its program counter is (from the QEMU monitor, use "info cpus"; from the Akaros monitor, use "trace coretf N" to see core N's trapframe). We have some basic ability to backtrace a user context, but it only works if the frame pointer is used, so it's not universal.

Plan 9 had a crazy debugger (called Acid, I think), so that's another place to look. But after talking with Ron, I think our best bet for now is to get a minimal GDB running. From a brief look, it seems like the place to start is to try and port the GDB server, instead of the full GDB, since that is the minimal set of things required to trace a process.

There are a couple issues here. First, we don't have PTRACE, nor is it likely we ever will. So a part of this project is determining what exactly GDB needs, and then getting those facilities built into the kernel in an Akaros-approved way. This could get quite hairy, so I'd expect to get involved at some point. For instance, we might need to set hardware breakpoints for vcores, and have those breakpoints follow vcores around as they move through the system (moderately difficult). We'll probably want an interface to read the memory (and write it!) of another process (easy on the surface, but we might want some other crazy stuff in that area).

The next issue is dealing with user threads. The kernel doesn't know about user threads; it only knows about the process's vcores (virtual cores). It'd be nice to have GDB switch between debugging the various user threads, but the kernel can't help here, other than allowing access to the address space. Maybe we'd want a 2LS (second-level scheduler, which schedules the user threads) interface for debugging too - some sort of RPC thing to find out what's going on.

While we're on the topic of another process's address space, it might be neat to have a "snooping process" or blob of code that inhabits another process's address space. Imagine a small library that exists in a reserved part of the address space. The rest of that address space is the address space of the target process. (Specifically, the snooper has its own top-level page directory (PML4), identical in contents to the target's, minus one PTE for its own code/data, and it uses the same physical pages for the intermediate page tables as the target. It's moderately easy to set up, but harder to maintain, so we'd probably make this a snapshot in time.) Then GDB or some other debugger can talk to the snooper, which has full access to the address space of the target. Then the snooper can run "RPCs" directly in the target's address space. You could actually have multiple snoopers too - they'd each have their own distinct top-level page directory.

That last bit might be a little over the top.

Block Extra Data

Updated Nov 15, 2017

A year or so ago, Drew and I sat down and added "extra data" support to Plan 9's struct block. In Plan 9, almost all I/O is done by moving struct blocks around between queues (struct queue). They are similar to mbufs and skbs. A block was originally just a payload and some pointers. Drew and I added an array of pointers, lengths, and offsets, such that a block could consist of the header (the main payload) plus a bunch of blobs of non-contiguous memory. If NICs could handle the extra data, then we could avoid extra memcpys throughout the networking stack. The driver could build big blocks from LRO and then pass them up the networking stack.

Drew and I got to a certain point with the extra data work, then got preempted for higher-priority things. The current situation is that you can CONFIG to build with or without it. With the CONFIG on, you may trigger a panic when you reach code that isn't done yet. The Go tests trigger this, for example. We basically need to sit down, figure out how the qio/block code works, implement the changes needed for block extra data, and remove all of the panics.

One of the things that has popped up in this area is that block EBDs (extra block data buffers) can have holes. Often NICs want to know how many EBDs with data there are (base && len). Note that some blocks have a refcnt'd base, but no len, so it's a little tricky.
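
A helper of the kind NICs would want might look like this (the extra_bdata field names are from memory; treat it as a sketch): count the EBDs that actually carry payload, skipping holes and refcounted bases with no length.

    static int block_nr_sg_bufs(struct block *b)
    {
        int nr = 0;

        for (int i = 0; i < b->nr_extra_bufs; i++) {
            struct extra_bdata *ebd = &b->extra_data[i];

            if (ebd->base && ebd->len)
                nr++;
        }
        /* +1 if the header itself has payload */
        return nr + (BHLEN(b) ? 1 : 0);
    }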

This is one of our half-finished projects. It's probably a decent spot to look to get familiar with the struct block I/O.