%matplotlib inline
import matplotlib
import seaborn as sns
sns.set()
matplotlib.rcParams['figure.dpi'] = 144

# JupyterHub Notes

This Jupyter server is running as a [Docker container](https://www.docker.com/resources/what-container) in a [Kubernetes cluster](https://kubernetes.io/).  Docker containers act much like lightweight virtual machines, so you can interact with this one much like any other Linux machine.  From the terminal, you can get a Bash shell and thereby execute almost any binary.  You even have `sudo` access.

However, the container isn't quite a real virtual machine, so there are some differences that you should be aware of.

## Lack of data persistence

Only your home directory (`/home/jovyan`) is stored on a persistent disk.  When your container is restarted, the rest of the file system is reset to its initial state.  We will try to leave your container up as much as possible, but we may occasionally have to restart it.  You are welcome to install additional software, but anything installed outside of your home directory will disappear when your container is restarted.  If that software is critical to your project, you may wish to write a script that reinstalls it.
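A reinstall script could look something like the following minimal sketch.  It assumes the container uses `apt`, and the package list is purely hypothetical; substitute whatever your project needs.  By default it only prints the commands (a dry run), so nothing is installed until you ask for it.

```python
import subprocess

# Hypothetical package list; replace with whatever your project needs.
APT_PACKAGES = ["zip", "htop"]


def install_commands(packages):
    """Return the apt-get commands needed to install the given packages."""
    if not packages:
        return []
    return [
        ["sudo", "apt-get", "update"],
        ["sudo", "apt-get", "install", "-y", *packages],
    ]


def reinstall(packages, dry_run=True):
    """Print each command, and run it unless dry_run is set."""
    for command in install_commands(packages):
        print(" ".join(command))
        if not dry_run:
            # check=True stops at the first command that fails.
            subprocess.run(command, check=True)


# Dry run: only prints the commands.  Pass dry_run=False to install for real.
reinstall(APT_PACKAGES)
```

Saving a script like this somewhere under `/home/jovyan` means the script itself survives a restart, even though the software it installs does not.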

(Incidentally, this may help you if you accidentally misconfigure your container.  Stopping and restarting your Jupyter server from the control panel will reset the container to its initial state.  Please wait five minutes after stopping the container before restarting it.  Otherwise, your persistent volume may not get hooked up correctly.)

### Conda environments

The Python environment is managed by [Conda](https://conda.io/docs/).  Not only does Conda handle installing packages (with the `conda install` command), it also allows you to create independent Python environments, each with its own set of packages.

The default Python environment has a set of packages that are known to work with the lecture notebooks and miniprojects.  It lives in `/opt/conda`, and will therefore be reset by container restarts.  However, we have configured Conda such that additional environments are created in your home directory (specifically, `~/conda-envs`).  Therefore, we recommend that you create another environment if you want to install or upgrade Python packages.

To create a new Python 3.6 environment named `capstone-dev`, for example, run
```
conda create -n capstone-dev python=3.6 ipykernel
```
(The name is entirely up to you.)  This creates a minimal environment, and installs the IPython kernel, so it can be used in Jupyter notebooks.  Alternatively, you can clone the existing environment into a new environment:
```
conda create -n capstone-dev --clone data3
```

Either way, you will have a new environment, but nothing will run in there by default.  In order to use it, you need to activate it:
```
source activate capstone-dev
```
Note that activation operates per shell; you will need to reactivate this environment if you want it active in another shell.  Conda alters your shell prompt to indicate the current environment, to help you keep track.  Now any `conda` or `pip` commands will operate in this new environment, and any call of `python` will run with the packages installed in it.

To use this environment in a Jupyter notebook, we must register the kernel.  From *within the environment*, run
```
python -m ipykernel install --prefix ~/.local --name capstone-dev --display-name 'Capstone Development'
```
Both the name and display name are arbitrary, but for your sanity, we recommend that you use names related to that of the environment.  Now you can create notebooks using this kernel, or switch existing notebooks to use this kernel.
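To confirm that a notebook is really running in the new environment, you can inspect the interpreter from a notebook cell.  The sketch below uses only the standard library; the helper name `environment_info` is ours.  If the kernel registration worked as described above, we would expect the prefix to sit under `~/conda-envs`, but that depends on your setup.

```python
import sys
from pathlib import Path


def environment_info():
    """Return the interpreter path and environment prefix of this kernel."""
    prefix = Path(sys.prefix)
    return {
        "executable": sys.executable,  # path of the running Python binary
        "prefix": str(prefix),         # root directory of the environment
        "name": prefix.name,           # last path component, e.g. the env name
    }


info = environment_info()
print(info["executable"])
print(info["name"])
```

If the printed name is not the environment you expected, the notebook is probably still using the default kernel; switch kernels from the notebook's Kernel menu.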

## Accessing other ports

While your container has ports just like any other machine, these ports are not accessible to machines outside the cluster, such as your laptop.  This means that if you run a web server on port 1234, you won't be able to access it at <tt>http://<i>session</i>.tditrain.com:1234</tt>.  (After all, which container should it connect to?)  There are several ways around this.

### HTTP proxy

We are running a proxy that will connect a specific URL to a port on your container.  That web server you run on port 1234 will be available at <tt>https://<i>session</i>.tditrain.com/user/<i>username</i>/proxy/1234/</tt> (replacing *session* and *username* appropriately, of course).  This doesn't require any additional work on your part, but it comes with two significant caveats.

1. This URL is only accessible to you when you are logged in.  This is actually a plus during testing, since you don't want attackers visiting your insecure work-in-progress, but it does mean that you can't show off your work through this URL.

2. The URL the browser sees is different from the one your server expects, which matters if you use absolute URLs.  If you reference a stylesheet living at `/css/style.css`, your browser will ask for <tt>https://<i>session</i>.tditrain.com/css/style.css</tt>, but the file will actually be served from <tt>https://<i>session</i>.tditrain.com/user/<i>username</i>/proxy/1234/css/style.css</tt>.  You can avoid this problem by using relative URLs, but that may not always be practical.
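The second caveat can be reproduced with the standard library's `urljoin`, which resolves links the same way a browser does.  In this sketch, `proxy_url` is a helper of our own, and `session` and `username` are the same placeholders used above.

```python
from urllib.parse import urljoin


def proxy_url(session, username, port):
    """Build the proxied base URL for a server listening on `port`."""
    return f"https://{session}.tditrain.com/user/{username}/proxy/{port}/"


# "session" and "username" are placeholders, as in the text above.
base = proxy_url("session", "username", 1234)

# An absolute path replaces the whole path, so it escapes the proxy prefix.
print(urljoin(base, "/css/style.css"))
# https://session.tditrain.com/css/style.css

# A relative path is resolved against the proxy prefix.
print(urljoin(base, "css/style.css"))
# https://session.tditrain.com/user/username/proxy/1234/css/style.css
```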

### Port tunneling

There are a number of tools that allow port tunneling; we've had success using [`ngrok`](https://ngrok.com/).  It's installed in your container, but you will need to sign up for a (free) account yourself.  Once you're signed up, you'll need to authenticate (see [step 3 of this page](https://dashboard.ngrok.com/get-started)).

At that point, you can tunnel any port through `ngrok`.  To tunnel that website running on port 1234, run
```
ngrok http 1234
```
You will then be shown a text UI that includes a URL of the form `http://12345678.ngrok.io`.  Requests sent to that URL will be tunneled through the `ngrok` connection to your container.

With the free account, you can only make a single `ngrok` connection at a time; however, that connection can forward up to four ports.  You must use the configuration file at `~/.ngrok2/ngrok.yml` to do this.  To connect the Hadoop UI, at port 8042, as well as your website, add to this configuration file:
```
tunnels:
    website:
        proto: http
        addr: 1234
    hadoop:
        proto: http
        addr: 8042
```
and then start both connections with a single command:
```
ngrok start website hadoop
```
You will get a separate URL for each tunneled port.
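If you change ports often, you might generate the `tunnels:` block rather than typing it by hand.  The following is a minimal sketch: `tunnels_yaml` is a helper of our own, it handles only HTTP tunnels, and it enforces the four-port limit mentioned above.  It prints the block and does not touch `~/.ngrok2/ngrok.yml`, so you would still paste the result into that file yourself.

```python
MAX_FREE_TUNNELS = 4  # limit stated above for the free plan


def tunnels_yaml(ports):
    """Render a `tunnels:` block mapping tunnel names to local HTTP ports."""
    if len(ports) > MAX_FREE_TUNNELS:
        raise ValueError(f"at most {MAX_FREE_TUNNELS} tunnels are allowed")
    lines = ["tunnels:"]
    for name, port in ports.items():
        lines.append(f"    {name}:")
        lines.append("        proto: http")
        lines.append(f"        addr: {port}")
    return "\n".join(lines) + "\n"


# Reproduces the configuration shown above.
print(tunnels_yaml({"website": 1234, "hadoop": 8042}), end="")
```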

#### Setting up SSH

`ngrok` also supports raw TCP tunneling.  This can be used to connect to an SSH server running on your container, for example.  There's no need to do this, but some people find it convenient.

While the SSH server is installed, it does not start automatically.  You must start it with
```
sudo service ssh start
```
The server is configured to only accept public key authentication.  Put your public key in `~/.ssh/authorized_keys`.

Now, set up TCP tunneling to port 22.
```
ngrok tcp 22
```
You'll get a forwarding address of the form `tcp://0.tcp.ngrok.io:12345`.  On your local machine, you can connect with
```
ssh jovyan@0.tcp.ngrok.io -p 12345
```
Note that you're likely to get a different port each time you connect, so you'll likely get an unknown-host warning each time.
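Because the forwarding address changes, it can be handy to derive the `ssh` command from it instead of retyping the host and port.  This sketch uses only the standard library; `ssh_command` is a helper name of our own, and it assumes the address has the `tcp://host:port` form shown above.

```python
from urllib.parse import urlsplit


def ssh_command(forwarding_address, user="jovyan"):
    """Turn an ngrok TCP forwarding address into an ssh command line."""
    parts = urlsplit(forwarding_address)
    if parts.scheme != "tcp" or parts.hostname is None or parts.port is None:
        raise ValueError(f"not a tcp://host:port address: {forwarding_address!r}")
    return f"ssh {user}@{parts.hostname} -p {parts.port}"


print(ssh_command("tcp://0.tcp.ngrok.io:12345"))
# ssh jovyan@0.tcp.ngrok.io -p 12345
```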

## Memory limits

To avoid memory contention, each container is subject to a memory usage limit.  If a process exceeds that limit, it may be killed.  Unfortunately, this limit is not reflected in any of the OS utilities, which show the full memory available to the underlying cluster node.  These limits depend on our cloud server configuration; check with your instructor to find out the limits on your container.
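Container memory limits on Linux are typically enforced through control groups, so the limit may be readable from the cgroup files even though the usual utilities don't report it.  The sketch below tries the standard cgroup v2 and v1 locations; whether either file is exposed in this container is an assumption, so treat your instructor's answer as authoritative.

```python
from pathlib import Path

# Standard cgroup locations; whether this container exposes them is an assumption.
CGROUP_LIMIT_FILES = [
    Path("/sys/fs/cgroup/memory.max"),                    # cgroup v2
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),  # cgroup v1
]


def parse_limit(text):
    """Parse a cgroup limit value; return bytes, or None for "max" (no limit)."""
    value = text.strip()
    return None if value == "max" else int(value)


def memory_limit_bytes(candidates=CGROUP_LIMIT_FILES):
    """Return the first readable cgroup memory limit, or None if none is found."""
    for path in candidates:
        try:
            return parse_limit(path.read_text())
        except (OSError, ValueError):
            continue  # file missing or unreadable; try the next location
    return None


limit = memory_limit_bytes()
print("no cgroup limit found" if limit is None else f"{limit / 2**30:.2f} GiB")
```

Under cgroup v1, an unlimited container reports a very large number rather than `max`, so an implausibly large result also means that no limit was set at this level.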

## Downloading material

You might be wondering how to download the material from your server to your local computer.  There are a few ways to do it.

The first way is to zip the entire `datacourse` directory and download it with `scp` through `ngrok`:

1. Install the `zip` utility if it is not already installed (`sudo apt install zip` from a terminal).
2. Zip the `datacourse` directory with `zip -r datacourse.zip datacourse/`.  You may want to use the `-x` flag to exclude things like `.xml.gz` files; all the data is public, so you can re-download it from your local machine.
3. Create an `ngrok` connection over port 22.  (Follow the instructions above to set this up, and make sure to add your SSH public key to `~/.ssh/authorized_keys`.)
4. Use `scp` from your local machine to download the zip file.  The command will look something like `scp -P 10296 jovyan@0.tcp.ngrok.io:/home/jovyan/datacourse.zip .`
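If you would rather not install `zip`, the archive from step 2 can also be built with Python's standard `zipfile` module.  This is a sketch of an alternative, not the `zip` command itself: `zip_directory` is a helper of our own, and the `.xml.gz` exclusion simply mirrors the suggestion above.

```python
import zipfile
from pathlib import Path


def zip_directory(source, archive, exclude_suffixes=(".xml.gz",)):
    """Zip `source` into `archive`, skipping files with excluded suffixes."""
    source = Path(source)
    archive = Path(archive).resolve()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source.rglob("*")):
            # Skip directories and the archive itself if it lives inside source.
            if not path.is_file() or path.resolve() == archive:
                continue
            if path.name.endswith(tuple(exclude_suffixes)):
                continue
            # Keep the top-level directory name inside the archive.
            arcname = Path(source.name) / path.relative_to(source)
            zf.write(path, arcname.as_posix())
    return archive


# Example (run from /home/jovyan):
# zip_directory("datacourse", "datacourse.zip")
```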

If you don't want to zip the entire directory, there are other options:

- Use `rsync`
- Push to your own `S3` bucket
- Delete all the downloaded data for miniprojects and use the download button in the Jupyter notebook console (this will work up to a few GB).

*Copyright &copy; 2019 The Data Incubator.  All rights reserved.*