doc: Architecture, placeholder in install, and first appendix.

Signed-off-by: Tommi Virtanen <>
1 parent 0a14c75 commit e09d4a96025f58f370f30b0aa61d88b2173074e4 Tommi Virtanen committed Sep 1, 2011
Showing with 210 additions and 16 deletions.
  1. +17 −0 doc/appendix/differences-from-posix.rst
  2. +10 −0 doc/appendix/index.rst
  3. +163 −16 doc/architecture.rst
  4. +1 −0 doc/index.rst
  5. +19 −0 doc/ops/install.rst
17 doc/appendix/differences-from-posix.rst
@@ -0,0 +1,17 @@
+ Differences from POSIX
+.. todo:: delete
+Ceph diverges from strict POSIX semantics in a few places, for various reasons:
+- Sparse files are reported incorrectly by tools like df. A sparse
 file only uses up the space it actually needs, but df counts its
 full size as "used" space. We do this because actually keeping track
 of the space a large, sparse file uses is very expensive.
+- In shared simultaneous writer situations, a write that crosses
+ object boundaries is not necessarily atomic. This means that you
+ could have writer A write "aa|aa" and writer B write "bb|bb"
+ simultaneously (where | is the object boundary), and end up with
+ "aa|bb" rather than the proper "aa|aa" or "bb|bb".
10 doc/appendix/index.rst
@@ -0,0 +1,10 @@
+ Appendices
+.. toctree::
+ :glob:
+ :numbered:
+ :titlesonly:
+ *
179 doc/architecture.rst
@@ -2,26 +2,173 @@
Architecture of Ceph
-- Introduction to Ceph Project
+Ceph is a distributed network storage and file system with distributed
+metadata management and POSIX semantics.
- - High-level overview of project benefits for users (few paragraphs, mention each subproject)
- - Introduction to sub-projects (few paragraphs to a page each)
+RADOS is a reliable object store, used by Ceph, but also directly
- - RGW
- - RBD
- - Ceph
+``radosgw`` is an S3-compatible RESTful HTTP service for object
+storage, using RADOS storage.
- - Example scenarios Ceph projects are/not suitable for
- - (Very) High-Level overview of Ceph
+RBD is a Linux kernel feature that exposes RADOS storage as a block
+device. Qemu/KVM also has a direct RBD client, that avoids the kernel
- This would include an introduction to basic project terminology,
- the concept of OSDs, MDSes, and Monitors, and things like
- that. What they do, some of why they're awesome, but not how they
- work.
-- Discussion of MDS terminology, daemon types (active, standby,
- standby-replay)
+Monitor cluster
+``cmon`` is a lightweight daemon that provides consensus for
+distributed decision-making in a Ceph/RADOS cluster.
-.. todo:: write me
+It also is the initial point of contact for new clients, and will hand
+out information about the topology of the cluster, such as the
+You normally run 3 ``cmon`` daemons, on 3 separate physical machines,
+isolated from each other; for example, in different racks or rows.
+You could run just 1 instance, but that means giving up on high
+You may use the same hosts for ``cmon`` and other purposes.
+``cmon`` processes talk to each other using a Paxos_\-style
+protocol. They discover each other via the ``[mon.X] mon addr`` fields
+in ``ceph.conf``.
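As a rough illustration of the ``[mon.X] mon addr`` layout, the sketch below reads monitor addresses from an INI-style string with Python's ``configparser``. The section names, addresses, and the ``osd.0`` entry are hypothetical, and a real ``ceph.conf`` may use syntax that ``configparser`` does not accept, so treat this as a sketch rather than a supported parser.

```python
import configparser

# Hypothetical ceph.conf fragment; addresses are documentation-only examples.
SAMPLE = """\
[mon.a]
mon addr = 192.0.2.10:6789
[mon.b]
mon addr = 192.0.2.11:6789
[mon.c]
mon addr = 192.0.2.12:6789
[osd.0]
host = storage1
"""

# Only "=" separates keys from values, so the ":" in the address is kept.
parser = configparser.ConfigParser(delimiters=("=",))
parser.read_string(SAMPLE)

# Collect the "mon addr" field of every [mon.X] section.
mon_addrs = {
    section: parser[section]["mon addr"]
    for section in parser.sections()
    if section.startswith("mon.")
}

for name, addr in sorted(mon_addrs.items()):
    print(name, addr)
```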
+.. todo:: What about ``monmap``? Fact check.
+Any decision requires a majority of the ``cmon`` processes to be
+healthy and communicating with each other. For this reason, you never
+want an even number of ``cmon``\s: an even split leaves neither half
+with a majority, and an even count tolerates no more failures than
+the odd count just below it.
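The majority rule can be checked with a little arithmetic. The sketch below is plain Python, not Ceph code, and the function names are made up for illustration. It computes the smallest majority for a given number of monitors and how many of them can fail while a majority still exists.

```python
def majority(n: int) -> int:
    """Smallest number of monitors that is a strict majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Number of monitors that can fail while a majority remains."""
    return n - majority(n)


for n in range(1, 6):
    print(f"{n} monitors: majority={majority(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Going from 3 to 4 monitors raises the required majority from 2 to 3 without raising the number of tolerated failures, which is why odd counts are preferred.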
+.. _Paxos:
+.. todo:: explain monmap
+``cosd`` is the storage daemon that provides the RADOS service. It
+uses ``cmon`` for cluster membership, services object read/write/etc
+requests from clients, and peers with other ``cosd``\s for data
+The data model is fairly simple on this level. There are multiple
+named pools, and within each pool there are named objects, in a flat
+namespace (no directories). Each object has both data and metadata.
+The data for an object is a single, potentially big, series of
+bytes. The series may also be sparse: it may have holes that read as
+binary zeros and take up no actual storage.
+The metadata is an unordered set of key-value pairs. Its semantics
+are completely up to the client; for example, the Ceph filesystem uses
+metadata to store file owner etc.
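The data model described above can be pictured as nested mappings. The following is a minimal in-memory sketch, not the RADOS API: the class name, pool name, object name, and metadata key are all hypothetical, and sparseness is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """One named object: a series of bytes plus key-value metadata."""
    data: bytes = b""
    metadata: dict = field(default_factory=dict)


# Pools are named; inside each pool, object names form a flat namespace.
pools = {"example-pool": {}}

pools["example-pool"]["example-object"] = StoredObject(
    data=b"file contents",
    metadata={"owner": "1000"},  # meaning of the keys is up to the client
)

obj = pools["example-pool"]["example-object"]
print(len(obj.data), obj.metadata)
```

Python dictionaries preserve insertion order, so this sketch does not capture the unordered nature of the metadata set.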
+.. todo:: Verify that metadata is unordered.
+Underneath, ``cosd`` stores the data on a local filesystem. We
+recommend using Btrfs_, but any POSIX filesystem that has extended
+attributes should work (see :ref:`xattr`).
+.. _Btrfs:
+.. todo:: write about access control
+.. todo:: explain osdmap
+.. todo:: explain plugins ("classes")
+Ceph filesystem
+The Ceph filesystem service is provided by a daemon called
+``cmds``. It uses RADOS to store all the filesystem metadata
+(directories, file ownership, access modes, etc), and directs clients
+to access RADOS directly for the file contents.
+The Ceph filesystem aims for POSIX compatibility, except for a few
+chosen differences. See :doc:`/appendix/differences-from-posix`.
+``cmds`` can run as a single process, or it can be distributed out to
+multiple physical machines, either for high availability or for
+For high availability, the extra ``cmds`` instances can be `standby`,
+ready to take over the duties of any failed ``cmds`` that was
+`active`. This is easy because all the data, including the journal, is
+stored on RADOS. The transition is triggered automatically by
+For scalability, multiple ``cmds`` instances can be `active`, and they
+will split the directory tree into subtrees (and shards of a single
+busy directory), effectively balancing the load amongst all `active`
+Combinations of `standby` and `active` etc are possible, for example
+running 3 `active` ``cmds`` instances for scaling, and one `standby`.
+To control the number of `active` ``cmds``\es, see :doc:`/ops/grow/mds`.
+.. topic:: Status as of 2011-09:
+ Multiple `active` ``cmds`` operation is stable under normal
+ circumstances, but some failure scenarios may still cause
+ operational issues.
+.. todo:: document `standby-replay`
+.. todo:: mds.0 vs mds.alpha etc details
+``radosgw`` is a FastCGI service that provides a RESTful_ HTTP API to
+store objects and metadata. It layers on top of RADOS with its own
+data formats, and maintains its own user database, authentication,
+access control, and so on.
+.. _RESTful:
+Rados Block Device (RBD)
+In virtual machine scenarios, RBD is typically used via the ``rbd``
+network storage driver in Qemu/KVM, where the host machine uses
+``librbd`` to provide a block device service to the guest.
+Alternatively, as no direct ``librbd`` support is available in Xen,
+the Linux kernel can act as the RBD client and provide a real block
+device on the host machine, which the virtualization layer can then
+access. This is done with the command-line tool ``rbd`` (see
+The latter is also useful in non-virtualized scenarios.
+Internally, RBD stripes the device image over multiple RADOS objects,
+each typically located on a separate ``cosd``, allowing it to perform
+better than a single server could.
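To make the striping idea concrete, the sketch below maps a byte offset in the device image to an object index and an offset inside that object. It assumes simple striping with fixed-size objects; the 4 MiB object size and the function name are assumptions made for illustration, not values taken from this document.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size, for illustration only


def locate(offset: int, object_size: int = OBJECT_SIZE) -> tuple:
    """Return (object index, offset within that object) for an image offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, object_size)


# The first byte of the image lives at the start of object 0.
print(locate(0))
# A byte just past the first object lands in object 1.
print(locate(OBJECT_SIZE + 10))
```

Because consecutive ranges of the image fall into different objects, requests that touch different ranges can be served by different ``cosd`` daemons.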
+.. todo:: cephfs, cfuse, librados, libceph, librbd
+.. todo:: Summarize how much Ceph trusts the client, for what parts (security vs reliability).
+.. todo:: Example scenarios Ceph projects are/not suitable for
1 doc/index.rst
@@ -94,6 +94,7 @@ Table of Contents
+ appendix/index
Indices and tables
19 doc/ops/install.rst
@@ -12,3 +12,22 @@ mentioning all the design tradeoffs and options like journaling
locations or filesystems
At this point, either use 1 or 3 mons, point to :doc:`grow/mon`
+OSD installation
+what does btrfs give you (the journaling thing)
+.. _xattr:
+Enabling extended attributes
+how to enable xattr on ext4/3
