doc: Architecture, placeholder in install, and first appendix.
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Tommi Virtanen committed Sep 1, 2011
1 parent 0a14c75 commit e09d4a9
Showing 5 changed files with 210 additions and 16 deletions.
17 changes: 17 additions & 0 deletions doc/appendix/differences-from-posix.rst
@@ -0,0 +1,17 @@
========================
Differences from POSIX
========================

.. todo:: delete http://ceph.newdream.net/wiki/Differences_from_POSIX

Ceph diverges from strict POSIX semantics in a few places, for various reasons:

- Sparse files are reported incorrectly by tools like ``df``. A
  sparse file only uses up the space actually written, but ``df``
  counts the full file size as "used". We do this because keeping
  accurate track of the space a large, sparse file really uses would
  be very expensive.
- In shared simultaneous writer situations, a write that crosses
  object boundaries is not necessarily atomic. This means that writer
  A could write "aa|aa" and writer B write "bb|bb" simultaneously
  (where ``|`` is the object boundary), and the result could be
  "aa|bb" rather than the proper "aa|aa" or "bb|bb"; a sketch of why
  follows below.
10 changes: 10 additions & 0 deletions doc/appendix/index.rst
@@ -0,0 +1,10 @@
============
Appendices
============

.. toctree::
:glob:
:numbered:
:titlesonly:

*
179 changes: 163 additions & 16 deletions doc/architecture.rst
@@ -2,26 +2,173 @@
Architecture of Ceph
======================

- Introduction to Ceph Project
Ceph is a distributed network storage and file system with distributed
metadata management and POSIX semantics.

- High-level overview of project benefits for users (few paragraphs, mention each subproject)
- Introduction to sub-projects (few paragraphs to a page each)
RADOS is a reliable object store, used by Ceph, but also directly
accessible.

- RADOS
- RGW
- RBD
- Ceph
``radosgw`` is an S3-compatible RESTful HTTP service for object
storage, built on top of RADOS.

- Example scenarios Ceph projects are/not suitable for
- (Very) High-Level overview of Ceph
RBD is a Linux kernel feature that exposes RADOS storage as a block
device. Qemu/KVM also has a direct RBD client that avoids the kernel
overhead.

This would include an introduction to basic project terminology,
the concept of OSDs, MDSes, and Monitors, and things like
that. What they do, some of why they're awesome, but not how they
work.

- Discussion of MDS terminology, daemon types (active, standby,
standby-replay)
Monitor cluster
===============

``cmon`` is a lightweight daemon that provides consensus for
distributed decision-making in a Ceph/RADOS cluster.

.. todo:: write me
It is also the initial point of contact for new clients, and hands
out information about the topology of the cluster, such as the
``osdmap``.

You normally run 3 ``cmon`` daemons, on 3 separate physical machines,
isolated from each other; for example, in different racks or rows.

You could run just 1 instance, but that means giving up on high
availability.

You may use the same hosts for ``cmon`` and other purposes.

``cmon`` processes talk to each other using a Paxos_\-style
protocol. They discover each other via the ``[mon.X] mon addr`` fields
in ``ceph.conf``.
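
For example, a hypothetical three-monitor excerpt of ``ceph.conf``
(the addresses are placeholders)::

    [mon.0]
            mon addr = 10.0.1.101:6789
    [mon.1]
            mon addr = 10.0.1.102:6789
    [mon.2]
            mon addr = 10.0.1.103:6789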

.. todo:: What about ``monmap``? Fact check.

Any decision requires a majority of the ``cmon`` processes to be
healthy and communicating with each other. For this reason, you never
want an even number of ``cmon``\s: an even-sized cluster can split
into two equal halves with no majority on either side, and it
tolerates no more failures than the next smaller odd number.
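
The arithmetic behind this, as a plain Python illustration (not Ceph
code)::

    # A quorum needs floor(n/2) + 1 healthy monitors.
    for n in (1, 2, 3, 4, 5):
        majority = n // 2 + 1
        print("%d monitors: majority %d, tolerates %d failure(s)"
              % (n, majority, n - majority))
    # 3 monitors tolerate 1 failure; 4 monitors still tolerate only 1,
    # and a 2-2 network split leaves no majority on either side.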

.. _Paxos: http://en.wikipedia.org/wiki/Paxos_algorithm

.. todo:: explain monmap


RADOS
=====

``cosd`` is the storage daemon that provides the RADOS service. It
uses ``cmon`` for cluster membership, services object read/write/etc.
requests from clients, and peers with other ``cosd``\s for data
replication.

The data model is fairly simple on this level. There are multiple
named pools, and within each pool there are named objects, in a flat
namespace (no directories). Each object has both data and metadata.

The data for an object is a single, potentially large, series of
bytes. The series may also be sparse: it may have holes that read
back as binary zeros yet take up no actual storage.

The metadata is an unordered set of key-value pairs. Its semantics
are completely up to the client; for example, the Ceph filesystem
uses metadata to store file ownership and similar attributes.
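
As a mental model only (this is not the ``librados`` API), the layout
can be sketched as nested maps::

    # Hypothetical sketch of the RADOS data model, not librados calls.
    cluster = {
        "mypool": {                          # named pool
            "myobject": {                    # named object, flat namespace
                "data": b"some bytes",       # single, possibly sparse series
                "metadata": {"owner": "x"},  # unordered key-value pairs
            },
        },
    }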

.. todo:: Verify that metadata is unordered.

Underneath, ``cosd`` stores the data on a local filesystem. We
recommend using Btrfs_, but any POSIX filesystem that has extended
attributes should work (see :ref:`xattr`).

.. _Btrfs: http://en.wikipedia.org/wiki/Btrfs

.. todo:: write about access control

.. todo:: explain osdmap

.. todo:: explain plugins ("classes")


Ceph filesystem
===============

The Ceph filesystem service is provided by a daemon called
``cmds``. It uses RADOS to store all the filesystem metadata
(directories, file ownership, access modes, etc.), and directs clients
to access RADOS directly for the file contents.

The Ceph filesystem aims for POSIX compatibility, except for a few
chosen differences. See :doc:`/appendix/differences-from-posix`.

``cmds`` can run as a single process, or it can be distributed out to
multiple physical machines, either for high availability or for
scalability.

For high availability, the extra ``cmds`` instances can be `standby`,
ready to take over the duties of any failed ``cmds`` that was
`active`. This is easy because all the data, including the journal, is
stored on RADOS. The transition is triggered automatically by
``cmon``.

For scalability, multiple ``cmds`` instances can be `active`, and they
will split the directory tree into subtrees (and shards of a single
busy directory), effectively balancing the load amongst all `active`
servers.

Combinations of `standby` and `active` are possible; for example,
running 3 `active` ``cmds`` instances for scaling and one `standby`.

To control the number of `active` ``cmds`` instances, see :doc:`/ops/grow/mds`.

.. topic:: Status as of 2011-09:

Multiple `active` ``cmds`` operation is stable under normal
circumstances, but some failure scenarios may still cause
operational issues.

.. todo:: document `standby-replay`

.. todo:: mds.0 vs mds.alpha etc details



``radosgw``
===========

``radosgw`` is a FastCGI service that provides a RESTful_ HTTP API to
store objects and metadata. It layers on top of RADOS with its own
data formats, and maintains its own user database, authentication,
access control, and so on.

.. _RESTful: http://en.wikipedia.org/wiki/RESTful
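
A hypothetical Apache ``mod_fastcgi`` snippet, only to show the shape
of such a deployment (every path and name here is a placeholder)::

    FastCgiExternalServer /var/www/radosgw.fcgi -socket /var/run/radosgw.sock
    <VirtualHost *:80>
        ServerName rgw.example.com
        DocumentRoot /var/www
        RewriteEngine On
        # Pass the S3 Authorization header through to the FastCGI service.
        RewriteRule ^/(.*)$ /radosgw.fcgi?%{QUERY_STRING} [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
    </VirtualHost>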


Rados Block Device (RBD)
========================

In virtual machine scenarios, RBD is typically used via the ``rbd``
network storage driver in Qemu/KVM, where the host machine uses
``librbd`` to provide a block device service to the guest.

Alternatively, since there is no direct ``librbd`` support in Xen,
the Linux kernel can act as the RBD client and provide a real block
device on the host machine, which the virtualization can then
access. This is done with the command-line tool ``rbd`` (see
:doc:`/ops/rbd`).

The latter is also useful in non-virtualized scenarios.

Internally, RBD stripes the device image over multiple RADOS objects,
each typically located on a separate ``cosd``, allowing it to perform
better than a single server could.
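
As a rough illustration of the striping (plain Python; the object
naming scheme here is made up)::

    OBJECT_SIZE = 4 * 1024 * 1024   # assuming 4 MB objects

    def rbd_locate(image, byte_offset, object_size=OBJECT_SIZE):
        """Map a device byte offset to (object name, offset within it)."""
        index = byte_offset // object_size
        return ("%s.%012d" % (image, index), byte_offset % object_size)

    # Offsets 4 MB apart land in different objects, hence usually on
    # different cosds, so large I/O is spread over many servers:
    print(rbd_locate("myimage", 0))           # ('myimage.000000000000', 0)
    print(rbd_locate("myimage", 6 * 2**20))   # ('myimage.000000000001', 2097152)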


Client
======

.. todo:: cephfs, cfuse, librados, libceph, librbd


.. todo:: Summarize how much Ceph trusts the client, for what parts (security vs reliability).


TODO
====

.. todo:: Example scenarios Ceph projects are/not suitable for
1 change: 1 addition & 0 deletions doc/index.rst
@@ -94,6 +94,7 @@ Table of Contents
man/index
papers
glossary
appendix/index


Indices and tables
19 changes: 19 additions & 0 deletions doc/ops/install.rst
@@ -12,3 +12,22 @@ mentioning all the design tradeoffs and options like journaling
locations or filesystems

At this point, either use 1 or 3 mons, point to :doc:`grow/mon`

OSD installation
================

btrfs
-----

what does btrfs give you (the journaling thing)


ext4/ext3
---------

.. _xattr:

Enabling extended attributes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

how to enable xattr on ext4/3
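
A minimal sketch, assuming the OSD data lives on an ext4 filesystem
at ``/srv/osd.0`` (device and path are placeholders)::

    mount -o remount,user_xattr /srv/osd.0

or persistently, via a ``user_xattr`` option in ``/etc/fstab``::

    /dev/sda1  /srv/osd.0  ext4  user_xattr  0  0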
