Merge pull request #108 from betatim/operations-docs (Operations docs)
betatim committed Aug 6, 2018, commit c5f6c19, docs/source/day-to-day.rst (+83, -0)


Operations
----------

This section contains commands and snippets that let you inspect the state of
the cluster and perform tasks that are useful when things are broken.

To run these commands you need ``kubectl`` installed and set up on your local
laptop (see :ref:`google-cloud` for details).


Inspecting what is going on in the cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To see what is running on the cluster (and to get a feel for the normal state
of affairs) run ``kubectl get pods --all-namespaces``. This lists all pods
that are running, including service pods that we never interact with directly.

You should see at least two pods in each namespace that is associated with a
hub. The namespace and the hub name are the same, so ``staginghub`` lives in
the ``staginghub`` namespace.
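
Listing the pods of a single hub namespace might look like this (the output
columns and pod name suffixes are illustrative; yours will differ):

.. code-block:: console

    $ kubectl get pods --namespace staginghub
    NAME                     READY   STATUS    RESTARTS   AGE
    hub-77fbd96bb-dh2b5      1/1     Running   0          2d
    proxy-6549f4fbc8-8zn67   1/1     Running   0          2d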

Pods in the ``kube-system``, ``monitoring`` and ``router`` namespaces are best
left alone.

Each hub-specific namespace should contain at least two pods, for example
``hub-77fbd96bb-dh2b5`` and ``proxy-6549f4fbc8-8zn67``. Everything after
``hub-`` and ``proxy-`` changes when you restart the hub or make configuration
changes. The status of both pods should be ``Running``.

To see what a pod is printing to its terminal, run
``kubectl logs <podname> --namespace <hubname>``. This lets you see errors or
exceptions that occurred.
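
For example, to read the hub pod's logs on ``staginghub`` (the pod name suffix
is illustrative):

.. code-block:: console

    $ kubectl logs hub-77fbd96bb-dh2b5 --namespace staginghub
    $ # or stream the log output live:
    $ kubectl logs hub-77fbd96bb-dh2b5 --namespace staginghub --follow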

You can find out more about a pod by running
``kubectl describe pod <podname> --namespace <hubname>``. This gives you
information on why a pod is not running, or what it is doing while you wait
for it to start.

When someone logs in to the hub and starts their server, a new pod named
``jupyter-<username>`` appears in the hub's namespace. You can inspect it with
the usual ``kubectl`` commands.
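
A typical sequence for checking on a user's server pod (the username
``jupyter-alice`` is hypothetical) could be:

.. code-block:: console

    $ # find the user's pod among all pods in the hub's namespace
    $ kubectl get pods --namespace staginghub
    $ # why is it (not) running?
    $ kubectl describe pod jupyter-alice --namespace staginghub
    $ # what is the notebook server printing?
    $ kubectl logs jupyter-alice --namespace staginghub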


Known problems and solutions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There is a known problem in one of JupyterHub's components: sometimes the hub
misses that a user's pod has started and keeps that user waiting after they
have logged in. The symptom is a running pod for the user in the right
namespace while the login process never completes. In this case restart the
hub by running ``kubectl delete pod hub-.... --namespace <hubname>``,
replacing the ``....`` with the actual suffix of the hub pod's name. This
should not interrupt currently active users and fixes many of the things that
can go wrong.
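
A sketch of that restart, using the example hub pod name from earlier (your
pod's suffix will differ, so look it up first):

.. code-block:: console

    $ # find the current name of the hub pod
    $ kubectl get pods --namespace staginghub
    $ # delete it; the hub deployment automatically creates a fresh pod
    $ kubectl delete pod hub-77fbd96bb-dh2b5 --namespace staginghub

Deleting the pod is safe because the hub runs as a Kubernetes deployment,
which recreates the pod as soon as it disappears.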


Inspecting virtual machines
~~~~~~~~~~~~~~~~~~~~~~~~~~~

To tell how many virtual machines (or nodes) are part of the cluster, run
``kubectl get nodes``. There should always be at least one node with
``core-pool`` in its name. Once users log in and start their servers, new
nodes with ``user-pool`` in their name appear. These nodes are automatically
created and destroyed based on demand.

You can learn about a node by running ``kubectl describe node <nodename>``.
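
For example (the node names below are illustrative; the exact form depends on
how your cluster was created, but the ``core-pool``/``user-pool`` part should
appear):

.. code-block:: console

    $ kubectl get nodes
    NAME                                  STATUS   AGE
    gke-cluster-core-pool-abcd1234-wxyz   Ready    30d
    gke-cluster-user-pool-efgh5678-stuv   Ready    1h
    $ kubectl describe node gke-cluster-core-pool-abcd1234-wxyz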


Scaling up cluster before a class/workshop
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Having a cluster that automatically scales up and down based on demand is
great, but starting a new virtual machine takes a few minutes (roughly five to
nine). This makes for a poor user experience when many users log in at the
start of a class or workshop. Luckily, in this case we know when the herd is
going to arrive and can scale the cluster up just beforehand. To do this, go
to the admin panel of your hub at
``https://hub.earthdatascience.org/<hubname>/hub/admin`` and start the servers
for a large fraction of your users. This triggers the scale-up; if you do it
about 15 minutes before the start of class, your cluster should be big and
ready when students log in.

Keep in mind that unused user servers are eventually shut down, after which
the cluster shrinks back. This means you cannot use this strategy to scale up
the cluster many hours before class starts.
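
As an alternative to starting user servers, the user node pool can be resized
directly from the command line. This is a sketch only, assuming a Google
Kubernetes Engine cluster; the cluster and pool names here are assumptions,
so check them first (see :ref:`google-cloud`):

.. code-block:: console

    $ # list clusters and node pools to find the real names
    $ gcloud container clusters list
    $ gcloud container node-pools list --cluster <clustername>
    $ # resize the user pool ahead of class
    $ gcloud container clusters resize <clustername> \
        --node-pool user-pool --num-nodes 5

Note that the cluster autoscaler may remove the extra nodes again if they stay
empty, so this, too, works best shortly before class starts.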


Making changes to an existing hub
---------------------------------
