From 98b718ea29fc5288894fbbae70db3c89cd65ce74 Mon Sep 17 00:00:00 2001 From: Alfredo Deza Date: Wed, 21 Feb 2018 08:39:10 -0500 Subject: [PATCH 1/4] doc quick-ceph-deploy update for newer ceph-volume API Signed-off-by: Alfredo Deza --- doc/start/quick-ceph-deploy.rst | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/doc/start/quick-ceph-deploy.rst b/doc/start/quick-ceph-deploy.rst index 50b7f307f6ef2..8855f00294314 100644 --- a/doc/start/quick-ceph-deploy.rst +++ b/doc/start/quick-ceph-deploy.rst @@ -126,11 +126,13 @@ configuration details, perform the following steps using ``ceph-deploy``. #. Add three OSDs. For the purposes of these instructions, we assume you have an unused disk in each node called ``/dev/vdb``. *Be sure that the device is not currently in use and does not contain any important data.* - ceph-deploy osd create {ceph-node}:{device} + ceph-deploy osd create --data {device} {ceph-node} For example:: - ceph-deploy osd create node1:vdb node2:vdb node3:vdb + ceph-deploy osd create --data /dev/vdb node1 + ceph-deploy osd create --data /dev/vdb node2 + ceph-deploy osd create --data /dev/vdb node3 #. Check your cluster's health. :: From bdd7a0f7fe73b305886de302efe2bedf24ac00a4 Mon Sep 17 00:00:00 2001 From: Alfredo Deza Date: Wed, 21 Feb 2018 10:14:06 -0500 Subject: [PATCH 2/4] doc/man update ceph-deploy for the new ceph-volume API Signed-off-by: Alfredo Deza --- doc/man/8/ceph-deploy.rst | 97 ++++++++++----------------------------- 1 file changed, 24 insertions(+), 73 deletions(-) diff --git a/doc/man/8/ceph-deploy.rst b/doc/man/8/ceph-deploy.rst index 6654d182f0903..36c6d1a9afd8a 100644 --- a/doc/man/8/ceph-deploy.rst +++ b/doc/man/8/ceph-deploy.rst @@ -15,11 +15,7 @@ Synopsis | **ceph-deploy** **mon** *create-initial* -| **ceph-deploy** **osd** *prepare* [*ceph-node*]:[*dir-path*] - -| **ceph-deploy** **osd** *activate* [*ceph-node*]:[*dir-path*] - -| **ceph-deploy** **osd** *create* [*ceph-node*]:[*dir-path*] +| **ceph-deploy** **osd** *create* *--data* *device* *ceph-node* | **ceph-deploy** **admin** [*admin-node*][*ceph-node*...] @@ -251,46 +247,12 @@ Subcommand ``list`` lists disk partitions and Ceph OSDs. Usage:: - ceph-deploy disk list [HOST:[DISK]] - -Here, [HOST] is hostname of the node and [DISK] is disk name or path. - -Subcommand ``prepare`` prepares a directory, disk or drive for a Ceph OSD. It -creates a GPT partition, marks the partition with Ceph type uuid, creates a -file system, marks the file system as ready for Ceph consumption, uses entire -partition and adds a new partition to the journal disk. - -Usage:: - - ceph-deploy disk prepare [HOST:[DISK]] - -Here, [HOST] is hostname of the node and [DISK] is disk name or path. - -Subcommand ``activate`` activates the Ceph OSD. It mounts the volume in a -temporary location, allocates an OSD id (if needed), remounts in the correct -location ``/var/lib/ceph/osd/$cluster-$id`` and starts ``ceph-osd``. It is -triggered by ``udev`` when it sees the OSD GPT partition type or on ceph service -start with ``ceph disk activate-all``. - -Usage:: - - ceph-deploy disk activate [HOST:[DISK]] - -Here, [HOST] is hostname of the node and [DISK] is disk name or path. - -Subcommand ``zap`` zaps/erases/destroys a device's partition table and contents. -It actually uses ``sgdisk`` and it's option ``--zap-all`` to destroy both GPT and -MBR data structures so that the disk becomes suitable for repartitioning. -``sgdisk`` then uses ``--mbrtogpt`` to convert the MBR or BSD disklabel disk to a -GPT disk. 
The ``prepare`` subcommand can now be executed which will create a new -GPT partition. - -Usage:: - - ceph-deploy disk zap [HOST:[DISK]] + ceph-deploy disk list HOST -Here, [HOST] is hostname of the node and [DISK] is disk name or path. +Subcommand ``zap`` zaps/erases/destroys a device's partition table and +contents. It actually uses ``ceph-volume lvm zap`` remotely, alternatively +allowing someone to remove the Ceph metadata from the logical volume. osd --- @@ -298,46 +260,35 @@ osd Manage OSDs by preparing data disk on remote host. ``osd`` makes use of certain subcommands for managing OSDs. -Subcommand ``prepare`` prepares a directory, disk or drive for a Ceph OSD. It -first checks against multiple OSDs getting created and warns about the -possibility of more than the recommended which would cause issues with max -allowed PIDs in a system. It then reads the bootstrap-osd key for the cluster or -writes the bootstrap key if not found. It then uses :program:`ceph-disk` -utility's ``prepare`` subcommand to prepare the disk, journal and deploy the OSD -on the desired host. Once prepared, it gives some time to the OSD to settle and -checks for any possible errors and if found, reports to the user. +Subcommand ``create`` prepares a device for Ceph OSD. It first checks against +multiple OSDs getting created and warns about the possibility of more than the +recommended which would cause issues with max allowed PIDs in a system. It then +reads the bootstrap-osd key for the cluster or writes the bootstrap key if not +found. +It then uses :program:`ceph-volume` utility's ``lvm create`` subcommand to +prepare the disk, (and journal if using filestore) and deploy the OSD on the desired host. +Once prepared, it gives some time to the OSD to start and checks for any +possible errors and if found, reports to the user. -Usage:: - - ceph-deploy osd prepare HOST:DISK[:JOURNAL] [HOST:DISK[:JOURNAL]...] +Bluestore Usage:: -Subcommand ``activate`` activates the OSD prepared using ``prepare`` subcommand. -It actually uses :program:`ceph-disk` utility's ``activate`` subcommand with -appropriate init type based on distro to activate the OSD. Once activated, it -gives some time to the OSD to start and checks for any possible errors and if -found, reports to the user. It checks the status of the prepared OSD, checks the -OSD tree and makes sure the OSDs are up and in. + ceph-deploy osd create --data DISK HOST -Usage:: +Filestore Usage:: - ceph-deploy osd activate HOST:DISK[:JOURNAL] [HOST:DISK[:JOURNAL]...] + ceph-deploy osd create --data DISK --journal JOURNAL HOST -Subcommand ``create`` uses ``prepare`` and ``activate`` subcommands to create an -OSD. - -Usage:: - ceph-deploy osd create HOST:DISK[:JOURNAL] [HOST:DISK[:JOURNAL]...] +.. note:: For other flags available, please see the man page or the --help menu + on ceph-deploy osd create -Subcommand ``list`` lists disk partitions, Ceph OSDs and prints OSD metadata. -It gets the osd tree from a monitor host, uses the ``ceph-disk-list`` output -and gets the mount point by matching the line where the partition mentions -the OSD name, reads metadata from files, checks if a journal path exists, -if the OSD is in a OSD tree and prints the OSD metadata. +Subcommand ``list`` lists devices associated to Ceph as part of an OSD. +It uses the ``ceph-volume lvm list`` output that has a rich output, mapping +OSDs to devices and other interesting information about the OSD setup. Usage:: - ceph-deploy osd list HOST:DISK[:JOURNAL] [HOST:DISK[:JOURNAL]...] 
+ ceph-deploy osd list HOST admin From c957c70f48a02a1fe3e477b24a4d9206100feaa7 Mon Sep 17 00:00:00 2001 From: Alfredo Deza Date: Wed, 21 Feb 2018 10:15:24 -0500 Subject: [PATCH 3/4] doc/rados/deployment update ceph-deploy references with new ceph-volume API Signed-off-by: Alfredo Deza --- doc/rados/deployment/ceph-deploy-osd.rst | 70 ++++++------------------ 1 file changed, 18 insertions(+), 52 deletions(-) diff --git a/doc/rados/deployment/ceph-deploy-osd.rst b/doc/rados/deployment/ceph-deploy-osd.rst index a4eb4d129d922..3994adc86426f 100644 --- a/doc/rados/deployment/ceph-deploy-osd.rst +++ b/doc/rados/deployment/ceph-deploy-osd.rst @@ -21,7 +21,7 @@ before building out a large cluster. See `Data Storage`_ for additional details. List Disks ========== -To list the disks on a node, execute the following command:: +To list the disks on a node, execute the following command:: ceph-deploy disk list {node-name [node-name]...} @@ -38,72 +38,38 @@ execute the following:: .. important:: This will delete all data. -Prepare OSDs -============ +Create OSDs +=========== Once you create a cluster, install Ceph packages, and gather keys, you -may prepare the OSDs and deploy them to the OSD node(s). If you need to -identify a disk or zap it prior to preparing it for use as an OSD, +may create the OSDs and deploy them to the OSD node(s). If you need to +identify a disk or zap it prior to preparing it for use as an OSD, see `List Disks`_ and `Zap Disks`_. :: - ceph-deploy osd prepare {node-name}:{data-disk}[:{journal-disk}] - ceph-deploy osd prepare osdserver1:sdb:/dev/ssd - ceph-deploy osd prepare osdserver1:sdc:/dev/ssd + ceph-deploy osd create --data {data-disk} {node-name} -The ``prepare`` command only prepares the OSD. On most operating -systems, the ``activate`` phase will automatically run when the -partitions are created on the disk (using Ceph ``udev`` rules). If not -use the ``activate`` command. See `Activate OSDs`_ for -details. +For example:: -The foregoing example assumes a disk dedicated to one Ceph OSD Daemon, and -a path to an SSD journal partition. We recommend storing the journal on -a separate drive to maximize throughput. You may dedicate a single drive -for the journal too (which may be expensive) or place the journal on the -same disk as the OSD (not recommended as it impairs performance). In the -foregoing example we store the journal on a partitioned solid state drive. + ceph-deploy osd create --data /dev/ssd osd-server1 -You can use the settings --fs-type or --bluestore to choose which file system -you want to install in the OSD drive. (More information by running -'ceph-deploy osd prepare --help'). +For bluestore (the default) the example assumes a disk dedicated to one Ceph +OSD Daemon. Filestore is also supported, in which case a ``--journal`` flag in +addition to ``--filestore`` needs to be used to define the Journal device on +the remote host. -.. note:: When running multiple Ceph OSD daemons on a single node, and +.. note:: When running multiple Ceph OSD daemons on a single node, and sharing a partioned journal with each OSD daemon, you should consider the entire node the minimum failure domain for CRUSH purposes, because if the SSD drive fails, all of the Ceph OSD daemons that journal to it will fail too. -Activate OSDs -============= - -Once you prepare an OSD you may activate it with the following command. 
:: - - ceph-deploy osd activate {node-name}:{data-disk-partition}[:{journal-disk-partition}] - ceph-deploy osd activate osdserver1:/dev/sdb1:/dev/ssd1 - ceph-deploy osd activate osdserver1:/dev/sdc1:/dev/ssd2 - -The ``activate`` command will cause your OSD to come ``up`` and be placed -``in`` the cluster. The ``activate`` command uses the path to the partition -created when running the ``prepare`` command. - - -Create OSDs -=========== - -You may prepare OSDs, deploy them to the OSD node(s) and activate them in one -step with the ``create`` command. The ``create`` command is a convenience method -for executing the ``prepare`` and ``activate`` command sequentially. :: - - ceph-deploy osd create {node-name}:{disk}[:{path/to/journal}] - ceph-deploy osd create osdserver1:sdb:/dev/ssd1 - -.. List OSDs -.. ========= +List OSDs +========= -.. To list the OSDs deployed on a node(s), execute the following command:: +To list the OSDs deployed on a node(s), execute the following command:: -.. ceph-deploy osd list {node-name} + ceph-deploy osd list {node-name} Destroy OSDs @@ -111,7 +77,7 @@ Destroy OSDs .. note:: Coming soon. See `Remove OSDs`_ for manual procedures. -.. To destroy an OSD, execute the following command:: +.. To destroy an OSD, execute the following command:: .. ceph-deploy osd destroy {node-name}:{path-to-disk}[:{path/to/journal}] From cc796073b61d8c0ecfb33eb234bbc995f21c58c7 Mon Sep 17 00:00:00 2001 From: Alfredo Deza Date: Wed, 21 Feb 2018 10:15:57 -0500 Subject: [PATCH 4/4] doc/rados/troubleshooting update ceph-deploy references with new ceph-voume API Signed-off-by: Alfredo Deza --- .../troubleshooting/troubleshooting-pg.rst | 74 +++++++++---------- 1 file changed, 36 insertions(+), 38 deletions(-) diff --git a/doc/rados/troubleshooting/troubleshooting-pg.rst b/doc/rados/troubleshooting/troubleshooting-pg.rst index 83a791ce2ae13..828ba799ae05f 100644 --- a/doc/rados/troubleshooting/troubleshooting-pg.rst +++ b/doc/rados/troubleshooting/troubleshooting-pg.rst @@ -5,8 +5,8 @@ Placement Groups Never Get Clean ================================ -When you create a cluster and your cluster remains in ``active``, -``active+remapped`` or ``active+degraded`` status and never achieve an +When you create a cluster and your cluster remains in ``active``, +``active+remapped`` or ``active+degraded`` status and never achieve an ``active+clean`` status, you likely have a problem with your configuration. You may need to review settings in the `Pool, PG and CRUSH Config Reference`_ @@ -26,63 +26,61 @@ Ceph daemon may cause a deadlock due to issues with the Linux kernel itself configuration, in spite of the limitations as described herein. If you are trying to create a cluster on a single node, you must change the -default of the ``osd crush chooseleaf type`` setting from ``1`` (meaning +default of the ``osd crush chooseleaf type`` setting from ``1`` (meaning ``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration file before you create your monitors and OSDs. This tells Ceph that an OSD can peer with another OSD on the same host. If you are trying to set up a -1-node cluster and ``osd crush chooseleaf type`` is greater than ``0``, -Ceph will try to peer the PGs of one OSD with the PGs of another OSD on +1-node cluster and ``osd crush chooseleaf type`` is greater than ``0``, +Ceph will try to peer the PGs of one OSD with the PGs of another OSD on another node, chassis, rack, row, or even datacenter depending on the setting. -.. 
tip:: DO NOT mount kernel clients directly on the same node as your - Ceph Storage Cluster, because kernel conflicts can arise. However, you +.. tip:: DO NOT mount kernel clients directly on the same node as your + Ceph Storage Cluster, because kernel conflicts can arise. However, you can mount kernel clients within virtual machines (VMs) on a single node. If you are creating OSDs using a single disk, you must create directories -for the data manually first. For example:: +for the data manually first. For example:: - mkdir /var/local/osd0 /var/local/osd1 - ceph-deploy osd prepare {localhost-name}:/var/local/osd0 {localhost-name}:/var/local/osd1 - ceph-deploy osd activate {localhost-name}:/var/local/osd0 {localhost-name}:/var/local/osd1 + ceph-deploy osd create --data {disk} {host} Fewer OSDs than Replicas ------------------------ -If you have brought up two OSDs to an ``up`` and ``in`` state, but you still -don't see ``active + clean`` placement groups, you may have an +If you have brought up two OSDs to an ``up`` and ``in`` state, but you still +don't see ``active + clean`` placement groups, you may have an ``osd pool default size`` set to greater than ``2``. There are a few ways to address this situation. If you want to operate your -cluster in an ``active + degraded`` state with two replicas, you can set the -``osd pool default min size`` to ``2`` so that you can write objects in +cluster in an ``active + degraded`` state with two replicas, you can set the +``osd pool default min size`` to ``2`` so that you can write objects in an ``active + degraded`` state. You may also set the ``osd pool default size`` -setting to ``2`` so that you only have two stored replicas (the original and -one replica), in which case the cluster should achieve an ``active + clean`` +setting to ``2`` so that you only have two stored replicas (the original and +one replica), in which case the cluster should achieve an ``active + clean`` state. -.. note:: You can make the changes at runtime. If you make the changes in +.. note:: You can make the changes at runtime. If you make the changes in your Ceph configuration file, you may need to restart your cluster. Pool Size = 1 ------------- -If you have the ``osd pool default size`` set to ``1``, you will only have -one copy of the object. OSDs rely on other OSDs to tell them which objects +If you have the ``osd pool default size`` set to ``1``, you will only have +one copy of the object. OSDs rely on other OSDs to tell them which objects they should have. If a first OSD has a copy of an object and there is no second copy, then no second OSD can tell the first OSD that it should have -that copy. For each placement group mapped to the first OSD (see +that copy. For each placement group mapped to the first OSD (see ``ceph pg dump``), you can force the first OSD to notice the placement groups it needs by running:: - + ceph osd force-create-pg - + CRUSH Map Errors ---------------- -Another candidate for placement groups remaining unclean involves errors +Another candidate for placement groups remaining unclean involves errors in your CRUSH map. @@ -96,10 +94,10 @@ of these states for a long time this may be an indication of a larger problem. For this reason, the monitor will warn when placement groups get "stuck" in a non-optimal state. Specifically, we check for: -* ``inactive`` - The placement group has not been ``active`` for too long +* ``inactive`` - The placement group has not been ``active`` for too long (i.e., it hasn't been able to service read/write requests). 
- -* ``unclean`` - The placement group has not been ``clean`` for too long + +* ``unclean`` - The placement group has not been ``clean`` for too long (i.e., it hasn't been able to completely recover from a previous failure). * ``stale`` - The placement group status has not been updated by a ``ceph-osd``, @@ -172,11 +170,11 @@ and things will recover. Alternatively, if there is a catastrophic failure of ``osd.1`` (e.g., disk failure), we can tell the cluster that it is ``lost`` and to cope as -best it can. +best it can. .. important:: This is dangerous in that the cluster cannot - guarantee that the other copies of the data are consistent - and up to date. + guarantee that the other copies of the data are consistent + and up to date. To instruct Ceph to continue anyway:: @@ -262,7 +260,7 @@ data, but it is ``down``. The full range of possible states include: * not queried (yet) Sometimes it simply takes some time for the cluster to query possible -locations. +locations. It is possible that there are other locations where the object can exist that are not listed. For example, if a ceph-osd is stopped and @@ -280,7 +278,7 @@ are recovered. To mark the "unfound" objects as "lost":: ceph pg 2.5 mark_unfound_lost revert|delete This the final argument specifies how the cluster should deal with -lost objects. +lost objects. The "delete" option will forget about them entirely. @@ -334,9 +332,9 @@ placement group count for pools is not useful, but you can change it `here`_. Can't Write Data ================ -If your cluster is up, but some OSDs are down and you cannot write data, +If your cluster is up, but some OSDs are down and you cannot write data, check to ensure that you have the minimum number of OSDs running for the -placement group. If you don't have the minimum number of OSDs running, +placement group. If you don't have the minimum number of OSDs running, Ceph will not allow you to write data because there is no guarantee that Ceph can replicate your data. See ``osd pool default min size`` in the `Pool, PG and CRUSH Config Reference`_ for details. @@ -442,7 +440,7 @@ In this case, we can learn from the output: * ``size_mismatch_oi``: the size stored in the object-info is different from the one read from OSD.2. The latter is 0. -You can repair the inconsistent placement group by executing:: +You can repair the inconsistent placement group by executing:: ceph pg repair {placement-group-ID} @@ -456,9 +454,9 @@ If ``read_error`` is listed in the ``errors`` attribute of a shard, the inconsistency is likely due to disk errors. You might want to check your disk used by that OSD. -If you receive ``active + clean + inconsistent`` states periodically due to -clock skew, you may consider configuring your `NTP`_ daemons on your -monitor hosts to act as peers. See `The Network Time Protocol`_ and Ceph +If you receive ``active + clean + inconsistent`` states periodically due to +clock skew, you may consider configuring your `NTP`_ daemons on your +monitor hosts to act as peers. See `The Network Time Protocol`_ and Ceph `Clock Settings`_ for additional details.
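
For reference, a minimal end-to-end sketch of the workflow these patches document. The host ``node1`` and the devices ``/dev/vdb`` and ``/dev/vdc`` are hypothetical placeholders; only flags that appear in the patched docs are used::

    # list disks on the remote host
    ceph-deploy disk list node1

    # bluestore (the default backend): one data device per OSD
    ceph-deploy osd create --data /dev/vdb node1

    # filestore: pass --filestore plus a journal device in addition to --data
    ceph-deploy osd create --filestore --data /dev/vdb --journal /dev/vdc node1

    # show which devices ceph-volume has associated with OSDs on the host
    ceph-deploy osd list node1

Note that the separate ``prepare`` and ``activate`` steps are gone: ``osd create`` drives the whole sequence through ``ceph-volume lvm create`` on the remote host, as described in the updated man page above.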