# basics


This notebook collects ideas and details that do not fit elsewhere in the
**synoptic** notebook ensemble. 


## Outline

* Introduction
* Data sources
* Jupyter notebook shortcuts
* Jupyter notebook interesting features
* On installing Python packages
* Using widgets
* Graphs
* Research notes
* JupyterHub for curriculum 


The [synoptic notebooks](https://github.com/robfatland/synoptic) 
are centered on oceanography but apply to many forms of earth system data science.
Port Cormorack is the associated JupyterHub where the synoptic notebooks were developed. If you
are working through these notebooks and one fails, the most likely cause is a difference between
your set-up and Port Cormorack's.


A related GitHub repository [pangeo-data/cmr](https://github.com/pangeo-data/cmr) includes notebooks on 
the NASA Common Metadata Repository (CMR) and is therefore concerned with access to remote sensing datasets. 


### Data sources


* [Heidi Sosik's Imaging Flow Cytometer (WHOI)](http://ifcb-data.whoi.edu/mvco)
  * [Same but API-specific](http://ifcb-data.whoi.edu/api)
* [BCO-DMO (WHOI)](https://www.bco-dmo.org/data)
* ARGO
  * [GitHub pyARGO](https://github.com/castelao/pyARGO) Rob is not convinced this is complete enough to be useful...
  * [UCSD ARGO Informational FAQ](http://www.argo.ucsd.edu/Data_FAQ.html#RorD)
  * [Coriolis map-based data selection tool](http://www.argodatamgt.org/Access-to-data/Argo-data-selection)
  * [xarray lesson that works with ARGO data](https://rabernat.github.io/research_computing/xarray.html)
* GLODAP
* AQUA
* TERRA
* OOI CA CI
* ECCO2
* LIVE OCEAN
* NANOOS
* Crowd sourcing science efforts: 
[Braindr](https://braindr.us/#/), [Whaledr](https://whale-dr.firebaseapp.com/#/play), [appstract](https://appstract.pub/#/play)



### Jupyter shortcuts


[This blog post](https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/): **really useful** Jupyter notebook shortcuts, e.g. *ESC* mode:


- a: insert a cell above
- b: insert a cell below
- m: make the cell markdown
- y: make the cell code
- dd: delete the current cell
- f: find and replace in code but not in output
- o: toggle cell output
- shift + tab (while editing a cell) shows the docstring for an object you have just typed in a code cell
  - Keep hitting tab to cycle through documentation
- ctrl + shift + - (while editing a cell) splits the current cell in two at your cursor
- shift + J or shift + down: multi-select downwards
- shift + K or shift + up: multi-select upwards
- shift + M: merge the selected cells



### Jupyter notebook interesting features


- Jupyter notebooks can be managed as open source narratives (GitHub-backed) 
- Local code can be treated as a simple package via ```import``` of a ```methods.py``` file in the local directory
- Cloud-backed JupyterHub can scale, manage authentication, access large datasets in object storage
  - with / without containers
  - ...or via **hosted environments**: SageMaker on AWS, AMLS + Azure Notebooks, Colab on Google
- Data deconstruction on NetCDF files: Learning how to get *at* the data
- Widget-based interactivity: Sliders, buttons, movie-building and playback...
- Additional rapidly emerging technologies (pursue at your peril!)
  - Anaconda (a Python distribution with a package and environment manager)
  - Kubernetes (container orchestration)
  - Helm (automates Kubernetes configuration)
  - Docker / DockerHub (containerized working environments)
  - git / GitHub (version control, open source code sharing)
  - binder 
  - Terraform
  - ...and so on...
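
As a minimal sketch of the local-package idea in the list above: any `.py` file in the notebook's working directory can be imported by name. The file contents and the function `c_to_f` are hypothetical, and the file is written to a temporary directory only so that the example is self-contained; in practice ```methods.py``` would simply sit beside the notebook.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical contents for methods.py (an assumption, not taken from the synoptic repository)
source = '''
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0
'''

# Write the file to a scratch directory and put that directory on the import path.
# A notebook's own directory is already on the path, so this step is only needed for the demo.
workdir = Path(tempfile.mkdtemp())
(workdir / "methods.py").write_text(source)
sys.path.insert(0, str(workdir))

importlib.invalidate_caches()
methods = importlib.import_module("methods")   # same effect as: import methods

boiling = methods.c_to_f(100.0)
print(boiling)
```

After editing ```methods.py```, `importlib.reload(methods)` picks up the changes without restarting the kernel.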


### On installing Python packages


To run this notebook, pandas, netcdf4, xarray and boto must be installed.
One way to get there is to run shell commands from a notebook cell, as in:

```
!conda install netcdf4 -y
!conda install xarray -y
```

However, this is only a stop-gap. It is better to give the user a terminal with sudo privileges. 
There we activate a virtual Python environment called 'dev' and install the Python packages we want
into it. This is the first step towards customizing a compute environment.


```
$ export PATH=/opt/anaconda/bin:$PATH
$ source activate dev
(dev) $ conda install networkx
```

- Done as a 'normal' user
- Installation is available to anyone using the **dev** kernel

### Using widgets

```python
# Widgets are already enabled on Port Cormorack.
# Elsewhere, uncomment the next line and run this cell if widgets do not appear.
# !jupyter nbextension enable --py widgetsnbextension --sys-prefix 
```

```python
import sys          # used to halt the program
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import networkx as nx
import random as r
from numpy import zeros
import warnings; warnings.simplefilter('ignore')

import numpy.linalg

# These imports give us the control sliders and dropdowns used in the interactive cells below
from ipywidgets import *
from traitlets import dlink
```

### Graphs

A graph can be defined by two sets: a first set of elements (vertices) and a second set (edges) that associates pairs 
of elements from the first set. 


In what follows we take advantage of the Python graph package **NetworkX**, which has been 
pre-installed on the Port Cormorack JupyterHub. The idea is to establish some usage patterns for
NetworkX in case these are of interest in the thematic geospatial analysis (particularly oceanography)
of the *synoptic* and *cmr* repositories. 
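
To make the two-set definition concrete, here is a small sketch in plain Python, independent of NetworkX. The vertex labels and edges are hypothetical.

```python
# Vertex set and edge set for a small hypothetical graph.
# Each edge is an unordered pair, so a frozenset is a natural representation.
V = {"a", "b", "c", "d"}
E = {
    frozenset(("a", "b")),
    frozenset(("b", "c")),
    frozenset(("c", "a")),
    frozenset(("c", "d")),
}

# Every edge must pair two distinct elements of V
assert all(len(edge) == 2 and edge <= V for edge in E)

# Adjacency and degree follow directly from the two sets
neighbors = {v: set() for v in V}
for edge in E:
    u, w = tuple(edge)
    neighbors[u].add(w)
    neighbors[w].add(u)

degree = {v: len(neighbors[v]) for v in V}
print(sorted(degree.items()))
```

NetworkX builds the same structure from the same two ingredients, for example with `G.add_nodes_from(V)` followed by `G.add_edges_from(tuple(e) for e in E)`.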

```python
import warnings

import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
from ipywidgets import interact, IntSlider

warnings.simplefilter('ignore')


# Draw a graph selected by type_choice, then print eigenvalues of its normalized Laplacian.
#   n       : size parameter (usually the number of vertices)
#   control : 0..100; an edge count, a probability (in percent) or a threshold, depending on the type
#   layout  : index of the node arrangement scheme
def DrawSomeGraph(type_choice, n, control, layout):

    plt.figure(figsize=(6, 6), dpi=80)
    plt.axis('off')

    probability = float(control) / 100.0

    if type_choice == 0:  # G(n,m) chooses one possible graph with n vertices and m edges at random
        m = control  # number of edges
        G = nx.gnm_random_graph(n, m)
        title_string = 'Randomly chosen from among all graphs with n vertices and m edges'
    elif type_choice == 1:
        G = nx.erdos_renyi_graph(n, probability)
        title_string = 'Erdos-Renyi random graph with probability ' + str(probability)
    elif type_choice == 2:
        # Use this graph to win a trophy at an information theory conference
        G = nx.karate_club_graph()
        title_string = 'The Karate club graph; go claim your trophy!'
    elif type_choice == 3:
        G = nx.heawood_graph()
        title_string = 'Heawood graph, a cubic graph on 14 vertices with girth 6'
    elif type_choice == 4:
        k = min(n, 5)
        G = nx.newman_watts_strogatz_graph(n, k, probability)
        title_string = 'Small world graph'
    elif type_choice == 5:
        k = min(n, 10)
        G = nx.powerlaw_cluster_graph(n, k, probability)
        title_string = 'Power law cluster graph'
    elif type_choice == 6:
        # barabasi_albert_graph requires 1 <= k < n
        k = max(1, min(n - 1, int(control * n / 100)))
        G = nx.barabasi_albert_graph(n, k)
        title_string = 'Preferential attachment graph'
    elif type_choice == 7:
        G = nx.complete_graph(n)
        title_string = 'complete graph'
    elif type_choice == 8:
        G = nx.complete_multipartite_graph(n, control)
        title_string = 'complete bipartite n, m'
    elif type_choice == 9:
        # since the size parameter is a dimension (exponent) we auto-curtail this
        G = nx.hypercube_graph(min(n, 7))
        title_string = 'Hypercube graph where n is now the dimension'
    elif type_choice == 10:
        # p >= 1 is rejected or never terminates, and values near 1 give very large graphs
        p = min(probability, 0.9)
        G = nx.random_lobster(n, p, p)
        title_string = 'random lobster with both probabilities ' + str(p)
    else:
        distance_threshold = float(control) / 100.
        G = nx.random_geometric_graph(n, distance_threshold)
        title_string = 'random geometric using a threshold of ' + str(distance_threshold)

    # Several generators do not produce exactly n nodes labelled 0..n-1, so ask the graph itself
    nodes = list(G.nodes())
    n_nodes = len(nodes)
    if n_nodes == 0:
        print('The generator returned an empty graph; nothing to draw')
        return

    # Use the layout parameter to choose an arrangement scheme
    if layout == 0:
        pos = nx.circular_layout(G)
        title_string += '    (circular)'
    elif layout == 1:
        pos = nx.random_layout(G)
        title_string += '    (random)'
    elif layout == 2:
        if type_choice == 9:
            # Multiple shells hang on a hypercube
            pos = nx.shell_layout(G)
        else:
            # Up to four equal shells plus one for any remainder,
            # e.g. 43 nodes give shells 0-10, 10-20, 20-30, 30-40 and 40-43
            nShells = 1 + sum(n_nodes > limit for limit in (5, 11, 36, 72))
            deltaShell = n_nodes // nShells
            shells = [nodes[i * deltaShell:(i + 1) * deltaShell] for i in range(nShells)]
            if n_nodes > nShells * deltaShell:
                shells.append(nodes[nShells * deltaShell:])
            pos = nx.shell_layout(G, shells)
        title_string += '    (shell)'
    elif layout == 3:
        pos = nx.spring_layout(G)
        title_string += '    (spring)'
    elif layout == 4:
        pos = nx.spectral_layout(G)
        title_string += '    (spectral)'
    else:  # there is a 'graphviz' layout but not using this 'pos =' mechanism
        pos = nx.circular_layout(G)
        title_string += '    (circular)'

    # Pad the axis limits so that nodes on the boundary are not clipped
    xs = [xy[0] for xy in pos.values()]
    ys = [xy[1] for xy in pos.values()]
    pad = 0.1
    plt.xlim(min(xs) - pad, max(xs) + pad)
    plt.ylim(min(ys) - pad, max(ys) + pad)
    plt.title(title_string)

    nx.draw_networkx(G, pos, with_labels=False, node_size=50)
    plt.show()

    # The normalized Laplacian is symmetric, so eigvalsh returns real eigenvalues in ascending order
    L = nx.normalized_laplacian_matrix(G)
    e = np.linalg.eigvalsh(L.toarray())
    e[np.abs(e) < 1.0e-14] = 0

    # Print the first nine eigenvalues and, for larger graphs, the last one
    shown = ['%.3f' % ev for ev in e[:9]]
    if n_nodes > 10: shown.append('...')
    if n_nodes > 9: shown.append('%.3f' % e[-1])
    print('Eigenvalues: ' + ', '.join(shown))
    # plt.hist(e, bins=100)  # histogram with 100 bins
    # plt.xlim(0, 2)  # eigenvalues between 0 and 2


# This is the interactive control: two dropdowns and two sliders
interact(DrawSomeGraph,
         type_choice={'Gnm random': 0,
                      'Erdos-Renyi random': 1,
                      'Karate Club': 2,
                      'Heawood Graph': 3,
                      'Small world': 4,
                      'Power law cluster': 5,
                      'Preferential attachment': 6,
                      'Complete': 7,
                      'Complete Bipartite': 8,
                      'Hypercube': 9,
                      'Random Lobster': 10,
                      'Random geometric': 11},
         n=IntSlider(min=2, max=100, step=1, value=10, continuous_update=False),
         control=IntSlider(min=0, max=100, step=1, value=50, continuous_update=False),
         layout={'circular': 0,
                 'random': 1,
                 'shell': 2,
                 'spring': 3,
                 'spectral': 4})
```
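
For reference, the eigenvalues printed under each plot belong to the normalized Laplacian of the graph. The definition below is the standard one from spectral graph theory rather than something stated in this notebook, and it assumes a graph with adjacency matrix $A$, diagonal degree matrix $D$ and no isolated vertices:

$$
\mathcal{L} = I - D^{-1/2} A \, D^{-1/2}, \qquad 0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n \le 2
$$

The multiplicity of the eigenvalue 0 equals the number of connected components, and $\lambda_n = 2$ occurs exactly when some component with at least one edge is bipartite. For the complete graph on $n$ vertices the spectrum is 0 once and $n/(n-1)$ with multiplicity $n-1$, which gives a quick sanity check on the 'Complete' option.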


### Research notes


#### xarray features


* apply operations over dimensions by name: ```x.sum('time')```
* select values by label instead of integer location: ```x.loc['2014-01-01']``` or ```x.sel(time='2014-01-01')```
* mathematical operations (e.g., ```x - y```) vectorize across multiple dimensions (array broadcasting) based on dimension names, not shape
* supports flexible split-apply-combine operations with groupby: ```x.groupby('time.dayofyear').mean()```
* database-like alignment based on coordinate labels that smoothly handles missing values
  * ```x, y = xr.align(x, y, join='outer')```
* keep track of arbitrary metadata in the form of a Python dictionary: ```x.attrs```


Note that pandas provides many of these features but does not make use of dimension names, and its core data structures are fixed-dimensional arrays.
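
The features listed above can be tried on a small synthetic array. This is a sketch under stated assumptions: the data values, the `depth` coordinate and the `units` attribute are all invented for illustration, and numpy, pandas and xarray are assumed to be installed.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical data: four daily samples at three depths
times = pd.date_range("2014-01-01", periods=4, freq="D")
depths = [0, 10, 20]
x = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("time", "depth"),
    coords={"time": times, "depth": depths},
    attrs={"units": "umol/kg"},
)

# Operate over a dimension by name
total = x.sum("time")

# Select by label instead of integer position
at_10m = x.sel(depth=10)
first_day = x.sel(time="2014-01-01")

# Broadcasting follows dimension names: (time,) minus (depth,) gives (time, depth)
profile = xr.DataArray([0.0, 1.0, 2.0], dims="depth", coords={"depth": depths})
anomaly = x.mean("depth") - profile

# Split-apply-combine with groupby
by_day = x.groupby("time.dayofyear").mean()

# Outer alignment pads labels missing from y with NaN
y = x.isel(time=slice(2, 4))
x_al, y_al = xr.align(x, y, join="outer")

print(total.values, anomaly.dims, x.attrs)
```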


#### eScience incubator case study


by Purshottam Shivraj


* This project was an attempt to streamline the workflow for optimizing physical parameters of hydrological models (cf DHSVM)
* Optimization was based on techniques like FAST, decision trees and DREAM


##### **Before** 


- Scientists manually tweak parameters for their model
- For a large number of parameters the computation would fail


##### **Plan**


Distribute the computation workload across many machines to accelerate the optimization task; three stages:


- FAST is a Fourier Amplitude Sensitivity Test, a variance-based sensitivity analysis method
  - Idiosyncrasy: the sample size required to achieve reasonable results is rather large
  - Determines the sensitivity of the physical parameters
  - Eventually narrows down the parameter space to only very sensitive parameters...
    - ...i.e. those contributing the most towards predictive power of the model (metric: Nash–Sutcliffe) 
  - The bottleneck: Computationally expensive, requires multiple iterations
- FAST results feed a decision tree (scikit-learn) that narrows the parameters further
- Decision tree results feed DREAM, an MCMC optimization method, which yields a convergence result
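
The Nash–Sutcliffe metric named above compares the model's squared error with the variance of the observations. Below is a minimal sketch of the standard formula; the function name and the sample values are hypothetical, and this is not code from the incubator project.

```python
def nash_sutcliffe(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2).

    1.0 is a perfect fit; 0.0 means the model does no better than the mean of the observations.
    """
    if len(observed) != len(simulated) or len(observed) == 0:
        raise ValueError("observed and simulated must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    if variance == 0:
        raise ValueError("observations are constant, so the efficiency is undefined")
    return 1.0 - residual / variance


# Hypothetical streamflow observations
observed = [1.0, 2.0, 3.0, 4.0]

perfect = nash_sutcliffe(observed, observed)
mean_only = nash_sutcliffe(observed, [2.5, 2.5, 2.5, 2.5])
print(perfect, mean_only)
```

In a sensitivity analysis such as FAST, each model run would be scored this way so that the parameters whose variation moves the score most can be identified.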


##### **Result** 

- 11,000 model runs of DHSVM 
- 10 physical parameters: snow_threshold, lateral conductivity of water across landscape, etc
- The most important parameters: *temperature lapse rate* and *lateral conductivity* of water across landscape
- Local machine without parallelization: 1 month
- Cloud instance (AWS) in parallel: 1 day 




### JupyterHub for curriculum

- The Port Cormorack JupyterHub is oriented towards data-driven research
- It is also, however, a potential curriculum resource, subject to these guidelines:
    - Teaching a class of **N** students will create a potential scale challenge for **N > 2** 
    - Each student should have a simple sign-on at a unique URL
    - The students' work environment should be pre-populated with content... or with instructions for doing so
    - Students start and modify notebooks only in their own persistent storage space, not in one another's environments
    - It will be necessary to constrain and monitor spend (without involving students / credit cards)
    - Some sort of recovery model should be in place should the persistent storage fail
      - The most reliable is to have the students back up to GitHub repositories: An independent system
    - Students should not wait more than one minute to connect
    - There must be a mechanism for updating course material
    - Hosting: public cloud { **AWS**, **Google Cloud**, **Azure**, etcetera }
    - Under the hood...
       - Scaling technologies { Docker, DockerHub, Kubernetes, kops, Helm, ...}
       - Replication technologies { JupyterHub, GitHub, nbgitpuller, ...} 
       - Storage { each user has persistent disk space, EFS, or object storage buckets }
       - Programming language including package manager { Python 3, Anaconda }
       - Authentication: OAuth through GitHub, Google etc 
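
One hedged way to meet the pre-population and update guidelines above is an nbgitpuller link, which pulls a Git repository into a student's home directory when the student follows it. The sketch below only builds such a link. The hub address `https://hub.example.org`, the `master` branch and the `urlpath` value are assumptions for illustration, and the query parameters follow the nbgitpuller link format as I understand it, so check them against the version installed on your hub.

```python
from urllib.parse import urlencode


def nbgitpuller_link(hub_url, repo_url, branch="master", urlpath=None):
    """Build a link that asks nbgitpuller on a JupyterHub to pull repo_url for the signed-in user."""
    query = {"repo": repo_url, "branch": branch}
    if urlpath:
        query["urlpath"] = urlpath    # where to land after the pull, e.g. a folder in the file browser
    return hub_url.rstrip("/") + "/hub/user-redirect/git-pull?" + urlencode(query)


link = nbgitpuller_link(
    "https://hub.example.org",                    # hypothetical hub address
    "https://github.com/robfatland/synoptic",
    urlpath="tree/synoptic/",
)
print(link)
```

Students who follow the same link later receive whatever has changed in the repository, which is what makes it usable as an update mechanism.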


#### Two contrasting JupyterHub deployment methods


1. A single giant server where packages and environments are installed directly onto the machine via ```pip install``` or ```conda install```. Pros: easy to set up; packages can be installed quickly as the need arises. Cons: only works for small groups; memory and compute capacity are constrained.

  * Provision a large server, say one with 48 CPUs and 192 GB of memory. Follow the steps outlined in https://jupyterhub.readthedocs.io/en/latest/installation-guide.html to install JupyterHub. Thereafter, packages and libraries can be installed on the server using the traditional ```pip install``` or ```conda install```. The resulting environments appear as **kernels** on JupyterHub. Each user is assigned a home directory on the server that corresponds to their login username (i.e. if GitHub is the authentication mechanism then the home directories should correspond exactly to the GitHub usernames).

2. A scalable deployment using container management and load balancing. Container management = Docker + Kubernetes + Helm + kops... Pros: scalable; can accommodate a large number of participants. Cons: environments are built from Docker images, so a longer lead time is needed to ensure all packages, libraries and dependencies are installed; not very improvisational.


  * See [this link](https://zero-to-jupyterhub-with-kubernetes.readthedocs.io/en/latest/)


  * Instead of the giant server (generating waste heat much of the time) the number of servers can be scaled up or 
down as required. 

  * Four technologies:

    * Kubernetes is a manager for the deployment and scaling of containers. Kubernetes clusters are the foundation of the scalable JupyterHub ecosystem: a cluster consists of a master node where the JupyterHub spawner resides and worker nodes where “pods” (essentially Docker containers) are spawned. 

    * Kubernetes clusters are provisioned on Google Cloud Platform using its native API, on Azure using Azure container services, and on AWS using one of two methods: kops (Kubernetes Operations) or a quickstart Heptio template. kops autoscales; clusters deployed with the Heptio template do NOT autoscale. 

    * Once the Kubernetes cluster is provisioned, the Kubernetes command-line tool (kubectl) needs to be installed in order to interact with the cluster. The distinction between kubectl and Kubernetes management tools like kops is that kubectl cannot provision a cluster; it is mainly used to issue commands TO an existing cluster. kubectl helps you get cluster information such as node IPs and the number of pods, or figure out whether something went wrong with the cluster. 

    * JupyterHub is installed on the Kubernetes cluster using Helm, an application manager for Kubernetes: ```helm install jupyterhub```. Once a user logs on to your scalable JupyterHub, a “pod” starts up. JupyterHub pulls a Docker image that you have created; this image contains all the packages and libraries that you want to use for your class. The path to this image is specified in a configuration file supplied when installing JupyterHub via Helm. The image can also contain *some* data, though large datasets are discouraged. 


#### Docker


Docker is also requisite. Stipulate how much CPU, memory and disk space to allocate to each user; 
this determines the size and number of machines in the cluster. For example, say you have 
150 students in your course and each student will use 4 CPUs and 6 GB of memory. At about 9 pods 
per c4.8xlarge instance on AWS, the maximum cluster size should be about 17 instances. 
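
The sizing arithmetic above can be written out as a short calculation. This is a sketch that assumes a c4.8xlarge instance offers 36 vCPUs and 60 GiB of memory and that ignores the capacity Kubernetes reserves for itself; the function name `cluster_size` is hypothetical.

```python
import math


def cluster_size(students, cpus_per_student, mem_gb_per_student, node_cpus, node_mem_gb):
    """Return (pods per node, nodes needed) when every student runs one pod at the same time."""
    pods_by_cpu = node_cpus // cpus_per_student
    pods_by_mem = int(node_mem_gb // mem_gb_per_student)
    pods_per_node = min(pods_by_cpu, pods_by_mem)    # the scarcer resource sets the limit
    if pods_per_node < 1:
        raise ValueError("a single pod does not fit on this node type")
    return pods_per_node, math.ceil(students / pods_per_node)


# Assumed c4.8xlarge capacity: 36 vCPUs, 60 GiB memory
pods_per_node, nodes = cluster_size(150, 4, 6, node_cpus=36, node_mem_gb=60)
print(pods_per_node, nodes)
```

Here CPU is the binding constraint: 36 vCPUs allow 9 pods per instance while memory would allow 10, and 150 students at 9 per instance round up to 17 instances.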



#### Democratize the process



* Create a **Deployjhub** package to auto-provision the Kubernetes cluster, install JupyterHub, and deploy the Docker image
* Train teaching assistants to spin up instances


