Simple mmap for PVFS2 and others
Switch branches/tags
Nothing to show
Clone or download
blewis
blewis notes
Latest commit a709a13 Sep 1, 2018
Permalink
Failed to load latest commit information.
LICENSE Added GPL v2 license text Aug 3, 2018
Makefile renamed module to shim Aug 3, 2018
NOTES notes Sep 1, 2018
README notes Sep 1, 2018
example.c cleanup Aug 31, 2018
shim.c cleanup Aug 31, 2018

README

THE SHIM FILE SYSTEM

Warning
-------

Shim is a very experimental Linux kernel module. And it breaks several
conventions. Don't use it on production machines. It can crash your system!

Rationale
-------

The shim module defines an intermediary file system for the Linux kernel that
sits between applications and an underlying file system. The shim module
implements address space operations on its files as read and write operations
on corresponding backing files in the underlying file system. Shim is designed
to provide read/write memory mapping capability for file systems that do not
natively support memory mapping. In particular, we wrote shim to allow us to
memory map files on a pvfs2 file system without modifying pvfs2 itself.
 
Shim does not enforce cache consistency between processes accessing the same
file. Instead, we leave the management of cache consistency up to the
applications by providing user space functions for cache updates after write
operations (a traditional msync function), and before read operations (a kind
of reverse msync).


Design
-------

Loading the shim.ko module enables support for the file system. Shim
does not use a backing device and is mounted similarly to ramfs and tmpfs
file systems.

Symbolic links are a special operation for shim. One uses a symbolic link to
set up a 'shadow' file in shim. Address space operations on the link are
implemented as read or write operations on the backing file. We chose this
unconvention to make shim easy to use with existing tools. When a link
is created, the backing file is opened for reading and writing by the module.

Deleting a symbolic link closes the open file descriptor to the backing file
and de-allocates data structures used by shim internally.  Although memory
mapping is the main design objective, basic read and write file operations are
supported for convenience.

New files create in a shim file system also create a new file in a default
backing file system (see module options below), and link the backing and
shim files.

Read, write, and truncate operations on shim files are passed through to
their corresponding backing files.

Writes to the backing file are updated with kernel-page granularity for now.
Future versions will include byte-addressible backing file updates.

Shim filesystems don't support directories, named pipes or other special
files (and might crash if you try to use those).


Cache consistency
-------

Shim does not enforce cache consistency between multiple processes mapping the
same file.

The standard msync call is implemented. When called by a process, mysnc causes
cached pages that have been flagged as changed to be written back to the
backing file.

Shim provides a function to explicitly refresh the page cache with data from
the backing file, replacing any data already in the cache (sort of a reverse
msync). This function is implemented using a variation of the standard read
function. Using read allows us to provide this functionality to applications
through universally-available system libraries. Normally, the standard read
function results in a pass-through read operation to the corresponding backing
file. However, if the buffer argument to the read function is NULL, there will
be no pass-through read and instead the pages in the page cache range
associated with the extent of the read call will be updated with data from the
backing file. Consider the following standard read example:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
int fd = open("/path/to/shim/file", O_RDWR);
size_t count = 555;
void *buf = (void *)malloc(count);
lseek(fd, 5000, SEEK_SET);
read(fd, buf, count);

The example reads 555 bytes starting at the 5000th byte from the file into the
buffer buf. With a standard kernel page size of 4096 bytes, this read occurs
completely in the 2nd page of the file mapping.  Compare that to the next
example which simply updates the 2nd page of the mapping in the page cache
(covering the requested read range) with data from the backing file:

lseek(fd, 5000, SEEK_SET);
read(fd, NULL, count);

It's possible that multiple applications across different computers in a
network share the same backing file.  The applications can use a combination of
msync and the above read hack 'reverse msync' to coordinate data consistency
between their respective page caches and the backing file when one or more
application is writing data.

Beware that read and msync are not generally atomic operations across more than
one kernel page. Use these operations together with a distributed locking or
consensus algorithm for consistency across distributed applications.


Example
-------

# This example creats a one megabyte backing file on the current file system,
# installs the shim module and creates a shadow file suitable for memory
# mapping in the shim file system.

dd if=/dev/zero of=backing_file bs=1M count=1
make
insmod shim.ko
mount -t shim none /mnt
ln -s backing_file /mnt

# (you can now memory map /mnt/backing_file)

umount /mnt
rmmod shim


Module and mount options
-------

Module options include

- read_ahead (nonnegative integer)
  The number of pages to read ahead when memory maps are accessed, defaults to
  1024. Set this number large for efficient block access, or keep small for
  workloads with generic r/w access.

- verbose (zero or one)
  Defaults to 0. Set to 1 to print lots of informative message in the kernel
  log.

Example:  insmod shim.ko verbose=1 read_ahead=4096


There is only one mount option; it specifies the directory that backing
files shall be created in by default. It defaults to /dev/shm.

Example:  mount -t shim none /mnt -o /tmp
Example:  mount -t shim none /mnt -o /some/backing/directory