Merge pull request #108 from betatim/operations-docs (Operations docs)
betatim committed Aug 6, 2018, commit c5f6c19, docs/source/day-to-day.rst (+83, -0)


Operations
----------

This section contains commands and snippets that let you inspect the state of
the cluster and perform tasks that are useful when things are broken.

To run these commands you need ``kubectl`` installed and set up on your local
laptop (see :ref:`google-cloud` for details).


Inspecting what is going on in the cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To see what is running on the cluster (and to get a feel for the normal state
of affairs) run ``kubectl get pods --all-namespaces``. This lists all pods
that are running, including service pods that we never interact with directly.

You should see at least two pods in each namespace that is associated with a
hub. The namespace and the hub name are the same, so ``staginghub`` lives in
the ``staginghub`` namespace.
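
Listing the pods of a single hub namespace might look like this (the output
columns and pod name suffixes are illustrative; yours will differ):

.. code-block:: console

    $ kubectl get pods --namespace staginghub
    NAME                     READY   STATUS    RESTARTS   AGE
    hub-77fbd96bb-dh2b5      1/1     Running   0          2d
    proxy-6549f4fbc8-8zn67   1/1     Running   0          2d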

Pods in the ``kube-system``, ``monitoring`` and ``router`` namespaces are best
left alone.

Each hub-specific namespace should contain at least two pods, for example
``hub-77fbd96bb-dh2b5`` and ``proxy-6549f4fbc8-8zn67``. Everything after
``hub-`` and ``proxy-`` changes when you restart the hub or make configuration
changes. The status of both pods should be ``Running``.

To see what a pod is printing to its terminal, run
``kubectl logs <podname> --namespace <hubname>``. This lets you see errors or
exceptions that occurred.
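
For example, to read the hub pod's logs on ``staginghub`` (the pod name suffix
is illustrative):

.. code-block:: console

    $ kubectl logs hub-77fbd96bb-dh2b5 --namespace staginghub
    $ # or stream the log output live:
    $ kubectl logs hub-77fbd96bb-dh2b5 --namespace staginghub --follow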

You can find out more about a pod by running
``kubectl describe pod <podname> --namespace <hubname>``. This gives you
information on why a pod is not running, or what it is doing while you wait
for it to start.

When someone logs in to the hub and starts their server, a new pod named
``jupyter-<username>`` appears in the hub's namespace. You can inspect it with
the usual ``kubectl`` commands.
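
A typical sequence for checking on a user's server pod (the username
``jupyter-alice`` is hypothetical) could be:

.. code-block:: console

    $ # find the user's pod among all pods in the hub's namespace
    $ kubectl get pods --namespace staginghub
    $ # why is it (not) running?
    $ kubectl describe pod jupyter-alice --namespace staginghub
    $ # what is the notebook server printing?
    $ kubectl logs jupyter-alice --namespace staginghub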


Known problems and solutions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There is a known problem in one of JupyterHub's components: sometimes the hub
misses that a user's pod has started and keeps that user waiting after they
have logged in. The symptom is a running pod for the user in the right
namespace while the login process never completes. In this case restart the
hub by running ``kubectl delete pod hub-.... --namespace <hubname>``,
replacing the ``....`` with the actual suffix of the hub pod's name. This
should not interrupt currently active users and fixes many of the things that
can go wrong.
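
A sketch of that restart, using the example hub pod name from earlier (your
pod's suffix will differ, so look it up first):

.. code-block:: console

    $ # find the current name of the hub pod
    $ kubectl get pods --namespace staginghub
    $ # delete it; the hub deployment automatically creates a fresh pod
    $ kubectl delete pod hub-77fbd96bb-dh2b5 --namespace staginghub

Deleting the pod is safe because the hub runs as a Kubernetes deployment,
which recreates the pod as soon as it disappears.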


Inspecting virtual machines
~~~~~~~~~~~~~~~~~~~~~~~~~~~

To tell how many virtual machines (or nodes) are part of the cluster, run
``kubectl get nodes``. There should always be at least one node with
``core-pool`` in its name. Once users log in and start their servers, new
nodes with ``user-pool`` in their name appear. These nodes are automatically
created and destroyed based on demand.

You can learn about a node by running ``kubectl describe node <nodename>``.
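
For example (the node names below are illustrative; the exact form depends on
how your cluster was created, but the ``core-pool``/``user-pool`` part should
appear):

.. code-block:: console

    $ kubectl get nodes
    NAME                                  STATUS   AGE
    gke-cluster-core-pool-abcd1234-wxyz   Ready    30d
    gke-cluster-user-pool-efgh5678-stuv   Ready    1h
    $ kubectl describe node gke-cluster-core-pool-abcd1234-wxyz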


Scaling up cluster before a class/workshop
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Having a cluster that automatically scales up and down based on demand is
great, but starting a new virtual machine takes a few minutes (roughly five to
nine). This makes for a poor user experience when many users log in at the
start of a class or workshop. Luckily, in this case we know when the herd is
going to arrive and can scale the cluster up just beforehand. To do this, go
to the admin panel of your hub at
``https://hub.earthdatascience.org/<hubname>/hub/admin`` and start the servers
for a large fraction of your users. This triggers the scale-up; if you do it
about 15 minutes before the start of class, your cluster should be big and
ready when students log in.

Keep in mind that unused user servers are eventually shut down, after which
the cluster shrinks back. This means you cannot use this strategy to scale up
the cluster many hours before class starts.
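
As an alternative to starting user servers, the user node pool can be resized
directly from the command line. This is a sketch only, assuming a Google
Kubernetes Engine cluster; the cluster and pool names here are assumptions,
so check them first (see :ref:`google-cloud`):

.. code-block:: console

    $ # list clusters and node pools to find the real names
    $ gcloud container clusters list
    $ gcloud container node-pools list --cluster <clustername>
    $ # resize the user pool ahead of class
    $ gcloud container clusters resize <clustername> \
        --node-pool user-pool --num-nodes 5

Note that the cluster autoscaler may remove the extra nodes again if they stay
empty, so this, too, works best shortly before class starts.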


Making changes to an existing hub
---------------------------------
