docs(managing_deis): rewrite several pages based on store
[skip ci]
carmstrong committed Oct 7, 2014
1 parent 009ddf7 commit 17e2d1f
Showing 12 changed files with 324 additions and 111 deletions.
201 changes: 201 additions & 0 deletions docs/managing_deis/add_remove_host.rst
@@ -0,0 +1,201 @@
:title: Adding/Removing Hosts
:description: Considerations for adding or removing Deis hosts.

.. _add_remove_host:

Adding/Removing Hosts
=====================

Most Deis components handle new machines just fine. Care has to be taken when removing machines from the cluster, however, since the deis-store components act as the backing store for all the stateful data Deis needs to function properly.

Note that these instructions follow the Ceph documentation for `removing monitors`_ and `removing OSDs`_. Should these instructions differ significantly from the Ceph documentation, the Ceph documentation should be followed, and a PR to update this documentation would be much appreciated.

Since Ceph uses the Paxos algorithm, a majority of monitors must always be alive to maintain quorum: 1 of 1, 2 of 3, 3 of 4, 3 of 5, 4 of 6, and so on. A five-monitor cluster, for example, can tolerate the loss of two monitors, while a four-monitor cluster can tolerate the loss of only one. It is always preferable to add a new node to the cluster before removing an old one, if possible.

This documentation will assume a running three-node Deis cluster. We will add a fourth machine to the cluster, then remove the first machine.

Inspecting health
-----------------

Before we begin, we should check the state of the Ceph cluster to be sure it's healthy. We can do this by logging into any machine in the cluster, entering a store container, and then querying Ceph:

.. code-block:: console

    core@deis-1 ~ $ nse deis-store-monitor
    groups: cannot find name for group ID 11
    root@deis-1:/# ceph -s
        cluster c3ff2017-b0a8-4c5a-be00-636560ca567d
         health HEALTH_OK
         monmap e3: 3 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0}, election epoch 8, quorum 0,1,2 deis-1,deis-2,deis-3
         osdmap e18: 3 osds: 3 up, 3 in
          pgmap v31: 960 pgs, 9 pools, 1158 bytes data, 45 objects
                16951 MB used, 31753 MB / 49200 MB avail
                     960 active+clean

We see from the ``pgmap`` that we have 960 placement groups, all of which are ``active+clean``. This is good!

Adding a node
-------------

To add a new node to your Deis cluster, simply provision a new CoreOS machine with the same etcd discovery URL specified in the cloud-config file. When the new machine comes up, it will join the etcd cluster. You can confirm this with ``fleetctl list-machines``.
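
For example, the new machine should appear in the list (machine IDs and metadata here are illustrative):

.. code-block:: console

    core@deis-1 ~ $ fleetctl list-machines
    MACHINE         IP              METADATA
    0cc3298b...     172.17.8.100    -
    6c9de253...     172.17.8.101    -
    a4b3d50c...     172.17.8.102    -
    f92a5112...     172.17.8.103    -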

Since logspout, publisher, store-monitor, and store-daemon are global units, they will be automatically started on the new node.

Once the new machine is running, we can inspect the Ceph cluster health again:

.. code-block:: console

    core@deis-1 ~ $ nse deis-store-monitor
    groups: cannot find name for group ID 11
    root@deis-1:/# ceph -s
        cluster c3ff2017-b0a8-4c5a-be00-636560ca567d
         health HEALTH_WARN clock skew detected on mon.deis-4
         monmap e4: 4 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0,deis-4=172.17.8.103:6789/0}, election epoch 12, quorum 0,1,2,3 deis-1,deis-2,deis-3,deis-4
         osdmap e22: 4 osds: 4 up, 4 in
          pgmap v43: 960 pgs, 9 pools, 1158 bytes data, 45 objects
                22584 MB used, 42352 MB / 65600 MB avail
                     960 active+clean

Note that we have:

.. code-block:: console

     monmap e4: 4 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0,deis-4=172.17.8.103:6789/0}, election epoch 12, quorum 0,1,2,3 deis-1,deis-2,deis-3,deis-4
     osdmap e22: 4 osds: 4 up, 4 in

We now have four monitors and four OSDs. Hooray!

Removing a node
---------------

When removing a node from the cluster that runs a deis-store component, you'll need to tell Ceph that both the store-daemon and store-monitor running on this host will be leaving the cluster. We're going to remove the first node in our cluster, deis-1. That machine has an IP address of ``172.17.8.100``.

Removing an OSD
~~~~~~~~~~~~~~~

Before we can tell Ceph to remove an OSD, we need the OSD ID. We can get this from etcd:

.. code-block:: console

    core@deis-2 ~ $ etcdctl get /deis/store/osds/172.17.8.100
    1

Note: In some cases, we may not know the IP or hostname of the machine we want to remove. In these cases, we can use ``ceph osd tree`` to see the current state of the cluster. This lists all the OSDs in the cluster and reports which ones are down.
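
For example (IDs and weights here are illustrative), a failed OSD would be reported as ``down``:

.. code-block:: console

    root@deis-2:/# ceph osd tree
    # id    weight  type name       up/down reweight
    -1      4       root default
    -2      1               host deis-1
    1       1                       osd.1   up      1
    -3      1               host deis-2
    0       1                       osd.0   up      1
    -4      1               host deis-3
    2       1                       osd.2   up      1
    -5      1               host deis-4
    3       1                       osd.3   up      1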

Now that we have the OSD's ID, let's remove it. We'll need a shell in any store-monitor or store-daemon container on any host in the cluster (except the one we're removing). In this example, we're on ``deis-2``.

.. code-block:: console

    core@deis-2 ~ $ nse deis-store-monitor
    groups: cannot find name for group ID 11
    root@deis-2:/# ceph osd out 1
    marked out osd.1.

This instructs Ceph to start relocating placement groups on that OSD to another host. We can watch this with ``ceph -w``:

.. code-block:: console

    root@deis-2:/# ceph -w
        cluster c3ff2017-b0a8-4c5a-be00-636560ca567d
         health HEALTH_WARN clock skew detected on mon.deis-4
         monmap e4: 4 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0,deis-4=172.17.8.103:6789/0}, election epoch 12, quorum 0,1,2,3 deis-1,deis-2,deis-3,deis-4
         osdmap e24: 4 osds: 4 up, 3 in
          pgmap v58: 960 pgs, 9 pools, 1158 bytes data, 45 objects
                16900 MB used, 31793 MB / 49200 MB avail
                     960 active+clean

    2014-10-07 17:55:11.900151 mon.0 [INF] pgmap v58: 960 pgs: 960 active+clean; 1158 bytes data, 16900 MB used, 31793 MB / 49200 MB avail; 29 B/s, 3 objects/s recovering
    2014-10-07 17:56:38.860305 mon.0 [INF] pgmap v59: 960 pgs: 960 active+clean; 1158 bytes data, 16900 MB used, 31793 MB / 49200 MB avail

We can see that the placement groups are back in a clean state. We can now stop the daemon. Since the store units are global units, we can't target a specific one to stop. Instead, we log into the host machine and instruct Docker to stop the container:

.. code-block:: console

    core@deis-1 ~ $ docker stop deis-store-daemon
    deis-store-daemon

Back inside a store container on ``deis-2``, we can finally remove the OSD:

.. code-block:: console

    core@deis-2 ~ $ nse deis-store-monitor
    groups: cannot find name for group ID 11
    root@deis-2:/# ceph osd crush remove osd.1
    removed item id 1 name 'osd.1' from crush map
    root@deis-2:/# ceph auth del osd.1
    updated
    root@deis-2:/# ceph osd rm 1
    removed osd.1

For cleanup, we should remove the OSD entry from etcd:

.. code-block:: console

    core@deis-2 ~ $ etcdctl rm /deis/store/osds/172.17.8.100

That's it! If we inspect the cluster's health, we see that there are now three OSDs again, and all of our placement groups are ``active+clean``.

.. code-block:: console

    core@deis-2 ~ $ nse deis-store-monitor
    groups: cannot find name for group ID 11
    root@deis-2:/# ceph -s
        cluster c3ff2017-b0a8-4c5a-be00-636560ca567d
         health HEALTH_WARN clock skew detected on mon.deis-4
         monmap e4: 4 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0,deis-4=172.17.8.103:6789/0}, election epoch 12, quorum 0,1,2,3 deis-1,deis-2,deis-3,deis-4
         osdmap e28: 3 osds: 3 up, 3 in
          pgmap v81: 960 pgs, 9 pools, 1158 bytes data, 45 objects
                16915 MB used, 31779 MB / 49200 MB avail
                     960 active+clean

Removing a monitor
~~~~~~~~~~~~~~~~~~

Removing a monitor is much easier. First, we remove the etcd entry so that Ceph clients no longer try to connect to this monitor:

.. code-block:: console

    $ etcdctl rm /deis/store/hosts/172.17.8.100

Within 5 seconds, confd will run on all store clients and remove the monitor from the ``ceph.conf`` configuration file.
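
You can confirm the key is gone by listing the remaining monitor entries (output here is illustrative):

.. code-block:: console

    core@deis-2 ~ $ etcdctl ls /deis/store/hosts
    /deis/store/hosts/172.17.8.101
    /deis/store/hosts/172.17.8.102
    /deis/store/hosts/172.17.8.103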

Next, we stop the container:

.. code-block:: console

    core@deis-1 ~ $ docker stop deis-store-monitor
    deis-store-monitor

Back on another host, we can again enter a store container and then remove this monitor:

.. code-block:: console

    root@deis-2:/# ceph mon remove deis-1
    2014-10-07 18:14:38.055584 7fab0d6e7700 0 monclient: hunting for new mon
    removed mon.deis-1 at 172.17.8.100:6789/0, there are now 3 monitors
    2014-10-07 18:14:38.072885 7fab0c5e4700 0 -- 172.17.8.101:0/1000361 >> 172.17.8.100:6789/0 pipe(0x7faafc007c90 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7faafc007f00).fault

Note the fault messages that follow: it is normal to see these when a Ceph client is unable to communicate with a monitor that has gone away. The important line is ``removed mon.deis-1 at 172.17.8.100:6789/0, there are now 3 monitors``.

Finally, let's check the health of the cluster:

.. code-block:: console

    root@deis-2:/# ceph -s
        cluster c3ff2017-b0a8-4c5a-be00-636560ca567d
         health HEALTH_OK
         monmap e5: 3 mons at {deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0,deis-4=172.17.8.103:6789/0}, election epoch 16, quorum 0,1,2 deis-2,deis-3,deis-4
         osdmap e28: 3 osds: 3 up, 3 in
          pgmap v91: 960 pgs, 9 pools, 1158 bytes data, 45 objects
                16927 MB used, 31766 MB / 49200 MB avail
                     960 active+clean

We're done!

.. _`removing monitors`: http://ceph.com/docs/v0.80.5/rados/operations/add-or-rm-mons/#removing-monitors
.. _`removing OSDs`: http://ceph.com/docs/v0.80.5/rados/operations/add-or-rm-osds/#removing-osds-manual

41 changes: 13 additions & 28 deletions docs/managing_deis/backing_up_data.rst
@@ -7,37 +7,20 @@ Backing up Data
 ========================
 
 While applications deployed on Deis follow the Twelve-Factor methodology and are thus stateless,
-Deis maintains platform state in two places: data containers and etcd.
+Deis maintains platform state in two places: the :ref:`Store` component, and in etcd.
 
-Data containers
+Store component
 ---------------
-Data containers are simply Docker containers that expose a volume which is shared with another container.
-The components with data containers are builder, database, logger, and registry. Since these are just
-Docker containers, they can be exported with ordinary Docker commands:
+The store component runs `Ceph`_, and is used by the :ref:`Database` and :ref:`Registry` components
+as a data store. This enables the components themselves to freely move around the cluster while
+their state is backed by store.
 
-.. code-block:: console
-
-    dev $ fleetctl ssh deis-builder.service
-    coreos $ sudo docker export deis-builder-data > /home/coreos/deis-builder-data-backup.tar
-    dev $ fleetctl ssh deis-database.service
-    coreos $ sudo docker export deis-database-data > /home/coreos/deis-database-data-backup.tar
-    dev $ fleetctl ssh deis-logger.service
-    coreos $ sudo docker export deis-logger-data > /home/coreos/deis-logger-data-backup.tar
-    dev $ fleetctl ssh deis-registry.service
-    coreos $ sudo docker export deis-registry-data > /home/coreos/deis-registry-data-backup.tar
-
-Importing looks very similar:
-
-.. code-block:: console
-
-    dev $ fleetctl ssh deis-builder.service
-    coreos $ cat /home/coreos/deis-builder-data-backup.tar | sudo docker import - deis-builder-data
-    dev $ fleetctl ssh deis-database.service
-    coreos $ cat /home/coreos/deis-database-data-backup.tar | sudo docker import - deis-database-data
-    dev $ fleetctl ssh deis-logger.service
-    coreos $ cat /home/coreos/deis-logger-data-backup.tar | sudo docker import - deis-logger-data
-    dev $ fleetctl ssh deis-registry.service
-    coreos $ cat /home/coreos/deis-registry-data-backup.tar | sudo docker import - deis-registry-data
+The store component is configured to still operate in a degraded state, and will automatically
+recover should a host fail and then rejoin the cluster. Total data loss of Ceph is only possible
+if all of the store containers are removed. However, backup of Ceph is fairly straightforward.
 
+Data in Ceph is stored on the filesystem in ``/var/lib/ceph``, and metadata information is stored
+within Ceph. Ceph provides the ability to take snapshots of storage pools with the `rados`_ command.
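
For illustration, taking and listing a pool snapshot with ``rados`` might look like the following sketch (the pool and snapshot names are hypothetical):

.. code-block:: console

    root@deis-1:/# rados -p data mksnap backup-2014-10-07
    created pool data snap backup-2014-10-07
    root@deis-1:/# rados -p data lssnap
    1    backup-2014-10-07    2014.10.07 18:00:00
    1 snaps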

Using pg_dump
-------------
@@ -46,7 +29,7 @@ dump of the database.

 .. code-block:: console
 
-    dev $ fleetctl ssh deis-database.service
+    dev $ fleetctl ssh deis-database@1.service
     coreos $ nse deis-database
     coreos $ sudo -u postgres pg_dumpall > pg_dump.sql
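
For completeness, restoring from that dump might look like the following sketch (assumes the dump file is still present inside the database container):

.. code-block:: console

    dev $ fleetctl ssh deis-database@1.service
    coreos $ nse deis-database
    coreos $ sudo -u postgres psql -f pg_dump.sql postgres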
@@ -61,3 +44,5 @@ documentation in `#683`_.

 .. _`#683`: https://github.com/coreos/etcd/issues/683
 .. _`etcd-dump`: https://github.com/AaronO/etcd-dump
+.. _`Ceph`: http://ceph.com
+.. _`rados`: http://ceph.com/docs/master/man/8/rados
2 changes: 1 addition & 1 deletion docs/managing_deis/builder_settings.rst
@@ -38,7 +38,7 @@ setting description
 /deis/controller/protocol            protocol of the controller component (set by controller)
 /deis/registry/host                  host of the registry component (set by registry)
 /deis/registry/port                  port of the registry component (set by registry)
-/deis/services/*                     application metadata (set by controller)
+/deis/services/*                     healthy application containers reported by deis/publisher
 /deis/slugbuilder/image              slugbuilder image to use (default: deis/slugbuilder:latest)
 /deis/slugrunner/image               slugrunner image to use (default: deis/slugrunner:latest)
 ==================================== ===========================================================
24 changes: 0 additions & 24 deletions docs/managing_deis/ha_database.rst

This file was deleted.

8 changes: 4 additions & 4 deletions docs/managing_deis/index.rst
@@ -1,5 +1,5 @@
 :title: Managing Deis
-:description: Step-by-step guide for operations engineers setting up a private PaaS using Deis.
+:description: Guide for operations engineers managing a private PaaS using Deis.

.. _managing_deis:

@@ -11,6 +11,8 @@ Managing Deis

 .. toctree::
 
+    add_remove_host
+    backing_up_data
     builder_settings
     cache_settings
     controller_settings
@@ -21,9 +23,7 @@ Managing Deis
     store_daemon_settings
     store_gateway_settings
     store_monitor_settings
-    managing_users
+    operational_tasks
     platform_logging
     platform_monitoring
-    backing_up_data
-    ha_database
     security_considerations
23 changes: 0 additions & 23 deletions docs/managing_deis/managing_users.rst

This file was deleted.

49 changes: 49 additions & 0 deletions docs/managing_deis/operational_tasks.rst
@@ -0,0 +1,49 @@
:title: Operational tasks
:description: Common operational tasks for your Deis cluster.

.. _operational_tasks:

Operational tasks
~~~~~~~~~~~~~~~~~

Inspecting store
================
It is sometimes helpful to query the :ref:`Store` component about the health of the Ceph cluster.
To do this, log into any machine running a ``store-monitor`` or ``store-daemon`` service, enter the
container with ``nse deis-store-monitor`` or ``nse deis-store-daemon``, and issue ``ceph -s``. This
should output the health of the cluster, like:

.. code-block:: console

        cluster 6506db0c-9eae-4bb6-a40a-95954dd3c4c3
         health HEALTH_OK
         monmap e3: 3 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0}, election epoch 8, quorum 0,1,2 deis-1,deis-2,deis-3
         osdmap e7: 3 osds: 3 up, 3 in
          pgmap v14: 192 pgs, 3 pools, 0 bytes data, 0 objects
                19378 MB used, 28944 MB / 49200 MB avail
                     192 active+clean

If you see ``HEALTH_OK``, this means everything is working as it should.
Note also ``monmap e3: 3 mons at...``, which means all three monitor containers are up and responding,
and ``osdmap e7: 3 osds: 3 up, 3 in``, which means all three daemon containers are up and running.

We can also see from the ``pgmap`` that we have 192 placement groups, all of which are ``active+clean``.
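
To fetch just the one-line summary instead of the full status, ``ceph health`` works as well:

.. code-block:: console

    root@deis-1:/# ceph health
    HEALTH_OK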

Managing users
==============

There are two classes of Deis users: normal users and administrators.

* Users can use most of the features of Deis: creating and deploying applications, adding and removing domains, and so on.
* Administrators can perform all the actions that users can, but they can also create, edit, and destroy clusters.

The first user created on a Deis installation is automatically an administrator.

Promoting users to administrators
---------------------------------

You can use the ``deis perms`` command to promote a user to an administrator:

.. code-block:: console

    $ deis perms:create john --admin
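
To verify, you can list administrators (assuming the ``perms:list`` subcommand in your version of the client; output here is illustrative):

.. code-block:: console

    $ deis perms:list --admin
    === Administrators
    admin
    john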
