diff --git a/doc/cephfs/index.rst b/doc/cephfs/index.rst
index 63dfad7fa3542a..ad30220b06728e 100644
--- a/doc/cephfs/index.rst
+++ b/doc/cephfs/index.rst
@@ -81,6 +81,7 @@ authentication keyring.
 .. toctree::
     :maxdepth: 1
 
+    POSIX compatibility
     CephFS Quotas
     Using Ceph with Hadoop
     libcephfs <../../api/libcephfs-java/>
diff --git a/doc/cephfs/posix.rst b/doc/cephfs/posix.rst
new file mode 100644
index 00000000000000..54c2c189a29da3
--- /dev/null
+++ b/doc/cephfs/posix.rst
@@ -0,0 +1,37 @@
+========================
+ Differences from POSIX
+========================
+
+CephFS aims to adhere to POSIX semantics wherever possible. For
+example, in contrast to many other common network file systems like
+NFS, CephFS maintains strong cache coherency across clients. The goal
+is for processes communicating via the file system to behave the same
+when they are on different hosts as when they are on the same host.
+
+However, there are a few places where CephFS diverges from strict
+POSIX semantics for various reasons:
+
+- In shared simultaneous writer situations, a write that crosses
+  object boundaries is not necessarily atomic. This means that you
+  could have writer A write "aa|aa" and writer B write "bb|bb"
+  simultaneously (where | is the object boundary), and end up with
+  "aa|bb" rather than the proper "aa|aa" or "bb|bb".
+- POSIX includes the telldir(3) and seekdir(3) functions that allow
+  you to obtain the current directory offset and seek back to it.
+  Because CephFS may refragment directories at any time, it is
+  difficult to return a stable integer offset for a directory. As
+  such, a seekdir to a non-zero offset may often work but is not
+  guaranteed to do so. A seekdir to offset 0 will always work (and is
+  equivalent to rewinddir(3)).
+- Sparse files propagate incorrectly to the stat(2) st_blocks field.
+  Because CephFS does not explicitly track which parts of a file are
+  allocated/written, the st_blocks field is always populated by the
+  file size divided by the block size. This will cause tools like
+  du(1) to overestimate consumed space. (The recursive size field,
+  maintained by CephFS, also includes file "holes" in its count.)
+- When a file is mapped into memory via mmap(2) on multiple hosts,
+  writes are not coherently propagated to other clients' caches. That
+  is, if a page is cached on host A, and then updated on host B, host
+  A's page is not coherently invalidated. (Shared writable mmap
+  appears to be quite rare--we have yet to hear any complaints about this
+  behavior, and implementing cache coherency properly is complex.)
diff --git a/doc/dev/differences-from-posix.rst b/doc/dev/differences-from-posix.rst
deleted file mode 100644
index 1cc99428fe2030..00000000000000
--- a/doc/dev/differences-from-posix.rst
+++ /dev/null
@@ -1,16 +0,0 @@
-========================
- Differences from POSIX
-========================
-
-
-Ceph does have a few places where it diverges from strict POSIX semantics for various reasons:
-
-- Sparse files propagate incorrectly to tools like df. They will only
-  use up the required space, but in df will increase the "used" space
-  by the full file size. We do this because actually keeping track of
-  the space a large, sparse file uses is very expensive.
-- In shared simultaneous writer situations, a write that crosses
-  object boundaries is not necessarily atomic. This means that you
-  could have writer A write "aa|aa" and writer B write "bb|bb"
-  simultaneously (where | is the object boundary), and end up with
-  "aa|bb" rather than the proper "aa|aa" or "bb|bb".
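
As a rough illustration of the st_blocks item in the new posix.rst, the sketch below compares the space that stat(2) reports for a file with the figure implied by its size alone. The helper names (`size_based_blocks`, `describe_allocation`) are hypothetical and not part of Ceph, the 512-byte unit is the Linux convention for st_blocks, and rounding up to a whole block is an assumption: the patch only says "file size divided by the block size".

```python
import os

# st_blocks is counted in 512-byte units on Linux; this is independent of
# the file system's own block size.
STAT_BLOCK_SIZE = 512


def size_based_blocks(size, block_size=STAT_BLOCK_SIZE):
    """Blocks implied by the file size alone.

    Rounding up to a whole block is an assumption made for this sketch.
    """
    return (size + block_size - 1) // block_size


def describe_allocation(path):
    """Return the apparent size, the stat-reported usage, and the usage
    that a purely size-based st_blocks would imply, all in bytes."""
    st = os.stat(path)
    return {
        "apparent_bytes": st.st_size,
        "reported_bytes": st.st_blocks * STAT_BLOCK_SIZE,
        "size_based_bytes": size_based_blocks(st.st_size) * STAT_BLOCK_SIZE,
    }
```

On a local file system that tracks holes, `reported_bytes` for a sparse file is typically far below `apparent_bytes`. Going by the description in the patch, on CephFS `reported_bytes` would instead be expected to track `size_based_bytes`, which is why du(1) overestimates.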