In [1]:
# For interactive plots, comment the next line
%pylab inline
# For interactive plots, uncomment the next line
# %pylab ipympl
import warnings
warnings.filterwarnings('ignore')

Populating the interactive namespace from numpy and matplotlib


# Introduction

> For instructions on using Jupyter notebooks, see the [README.md](../../README.md) file. 

This notebook describes some Python basics, along with specifics about the PODPAC library. Specifically, we will go over:

* How to import libraries in Python
* The structure of the PODPAC library
* Basic Python language features such as indexing and class inheritance
* Creating a MATLAB-like environment in Python using the `Numpy` and `Matplotlib` libraries
* Labeled arrays using [xarray](http://xarray.pydata.org/en/stable/)

# Importing modules
* Unlike MATLAB, Python libraries need to be `imported` before they can be used
* Imported libraries usually have a namespace
* Portions of a library can be imported on their own

## Examples

In [2]:
import podpac                     # Import PODPAC with the namespace 'podpac'
import podpac as pc               # Import PODPAC with the namespace 'pc'
from podpac import Coordinates    # Import Coordinates from PODPAC into the main namespace
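
The same three import forms work for any Python library. As a minimal sketch that runs without PODPAC installed, the standard-library `math` module (chosen here only for illustration) shows how each form affects the namespace:

```python
import math                 # access names through the 'math' namespace
import math as m            # same module, shorter namespace 'm'
from math import sqrt       # bring only 'sqrt' into the main namespace

print(math.sqrt(16.0))      # called through the full namespace
print(m.pi == math.pi)      # both names refer to the same module
print(sqrt(9.0))            # called without any namespace prefix
```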

# PODPAC library structure
PODPAC is composed of multiple sub-modules/sub-libraries. The major ones, from a user's perspective, are shown below.
<img src='../images/podpac-user-api.png' style='width:80%; margin-left:auto;margin-right:auto;' />


We can examine what's in the PODPAC library by using the `dir` function:

In [3]:
dir(podpac)

['Coordinates',
 'Node',
 'NodeException',
 'NodeTrait',
 'UnitsDataArray',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 'algorithm',
 'authentication',
 'clinspace',
 'compositor',
 'coordinates',
 'core',
 'crange',
 'data',
 'interpolators',
 'managers',
 'pipeline',
 'settings',
 'units',
 'utils',
 'version',
 'version_info']

Any name that starts and ends with a double underscore (a "dunder" name, `__<attr>__`) is an internal Python attribute and can be ignored here.
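
To hide these internal names when exploring a module, the output of `dir` can be filtered. The sketch below uses the standard-library `json` module only as a stand-in so that it runs without PODPAC; the same list comprehension should work with `dir(podpac)`. The variable name `non_dunder` is just illustrative.

```python
import json  # any module works; json is only a stand-in here

# Keep every name except the double-underscore ("dunder") attributes
non_dunder = [
    name for name in dir(json)
    if not (name.startswith("__") and name.endswith("__"))
]
print(non_dunder)
```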

In PODPAC, the top-level classes and functions are frequently used and include:
* `Coordinates`: class for defining coordinates
* `Node`: Base class for defining a PODPAC compute pipeline
* `NodeException`: The error type raised by Nodes
* `clinspace`: A helper function used to create uniformly spaced coordinates based on the number of points
* `crange`: Another helper function used to create uniformly spaced coordinates based on step size
* `settings`: A module with various settings that define caching behavior, login credentials, etc.
* `version_info`: Python dictionary giving the version of the PODPAC library

The top-level modules or sub-packages (or sub libraries) include: 
* `algorithm`: here you can find generic `Algorithm` nodes to do different types of computations
* `authentication`: this contains utilities to help authenticate users to download data
* `compositor`: here you can find nodes that help to combine multiple data sources into a single node
* `coordinates`: this module contains additional utilities related to creating coordinates
* `core`: this is where the core library is implemented, and follows the directory structure of the code
* `data`: here you can find generic `DataSource` nodes for reading and interpreting data sources
* `datalib`: here you can find domain-specific `DataSource` nodes for reading data from specific instruments, studies, and programs (not loaded by default; use `import podpac.datalib`)
* `interpolators`: this contains classes for dealing with automatic interpolation
* `pipeline`: this contains generic `Pipeline` nodes which can be used to share and re-create PODPAC processing routines

Let's look at what is available in some of these submodules:

In [4]:
# Generic Algorithm nodes
dir(podpac.algorithm)

['Algorithm',
 'Arange',
 'Arithmetic',
 'Convolution',
 'CoordData',
 'Count',
 'DayOfYear',
 'ExpandCoordinates',
 'Generic',
 'GroupReduce',
 'Kurtosis',
 'Mask',
 'Max',
 'Mean',
 'Median',
 'Min',
 'SelectCoordinates',
 'SinCoords',
 'Skew',
 'SpatialConvolution',
 'StandardDeviation',
 'Sum',
 'TimeConvolution',
 'UnaryAlgorithm',
 'Variance',
 'YearSubstituteCoordinates',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__']

In [5]:
# Generic DataSource nodes
dir(podpac.data)

['Array',
 'CSV',
 'DataSource',
 'Dataset',
 'H5PY',
 'INTERPOLATION_DEFAULT',
 'INTERPOLATION_METHODS',
 'INTERPOLATION_METHODS_DICT',
 'INTERPOLATORS',
 'INTERPOLATORS_DICT',
 'Interpolation',
 'InterpolationException',
 'PyDAP',
 'Rasterio',
 'ReprojectedSource',
 'WCS',
 'Zarr',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'interpolation_trait']

In [6]:
# Specific data libraries built into podpac
import podpac.datalib   # not loaded by default
dir(podpac.datalib)

INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials


['EGI',
 'GFS',
 'GFSLatest',
 'IntakeCatalog',
 'SMAP',
 'SMAPBestAvailable',
 'SMAPPorosity',
 'SMAPProperties',
 'SMAPSource',
 'SMAPWilt',
 'SMAP_PRODUCT_MAP',
 'TerrainTiles',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 'drought_monitor',
 'egi',
 'gfs',
 'intake',
 'nasaCMR',
 'smap',
 'smap_egi',
 'sys',
 'terraintiles']

In [7]:
# Nothing here yet
# dir(podpac.alglib)

# Basic Python language features
* Python uses zero indexing

In [8]:
alist = [1, 2, 3, 4]
alist[0] 

1
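
Zero-based indexing also shapes how negative indices and slices behave. A brief sketch using the same list:

```python
alist = [1, 2, 3, 4]

first = alist[0]       # index 0 is the first element
last = alist[-1]       # negative indices count from the end
middle = alist[1:3]    # slices include the start index but exclude the stop index

print(first, last, middle)
```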

* Python is dynamically typed: variables have no declared type and can be rebound to values of any type

In [9]:
mytype = 'is now a string'  # variable mytype is a string
mytype = 154147             # variable mytype is now an integer
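
In Python the type belongs to the value, not to the variable name. A minimal sketch using the built-in `type` function to confirm this (the names `first_type` and `second_type` are only illustrative):

```python
mytype = 'is now a string'
first_type = type(mytype)    # the value currently bound to mytype is a str

mytype = 154147              # rebinding the same name to an integer is allowed
second_type = type(mytype)   # the value is now an int

print(first_type, second_type)
```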

* Python is object oriented, supporting class inheritance

In [10]:
# define a class
class MyClass(object):  # Inherits from standard Python object (new-style classes)
    my_class_integer = 0  # Class attribute (immutable): assigning to it on an instance creates a separate instance attribute
    my_class_list = [1]   # Class attribute (mutable): in-place changes are shared amongst instances
    
    # This is the class constructor
    def __init__(self, my_class_instance_list=None):
        self.my_class_instance_list = my_class_instance_list # This is an instance variable

# Define a child class that inherits from MyClass
class MyChildClass(MyClass): 
    my_child_class_str = 'A string'  # Add a new attribute
    my_class_integer = 1  # Overwrite the value from the base class
    
# Create an instance of each class
my_class = MyClass()
my_child_class = MyChildClass()

# Demonstrate the inheritance
print("The child has the parent's attributes (and methods):")
print("\t my_child_class.my_class_integer=", my_child_class.my_class_integer)
print("\t my_child_class.my_class_list=", my_child_class.my_class_list)
print("\t my_child_class.my_class_instance_list=", my_child_class.my_class_instance_list)
print("\t my_child_class.my_child_class_str=", my_child_class.my_child_class_str)

The child has the parent's attributes (and methods):
	 my_child_class.my_class_integer= 1
	 my_child_class.my_class_list= [1]
	 my_child_class.my_class_instance_list= None
	 my_child_class.my_child_class_str= A string
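
Class attributes live on the class itself rather than being copied into each instance. The following sketch, using a hypothetical class named `Shared`, shows the practical difference between mutating a shared list in place and assigning a new value through an instance:

```python
class Shared(object):      # hypothetical class, for illustration only
    values = [1]           # mutable class attribute
    count = 0              # immutable class attribute

a = Shared()
b = Shared()

a.values.append(2)         # modifies the single list stored on the class
a.count = 5                # creates a new instance attribute on 'a' only

print(b.values)            # b sees the appended element
print(b.count)             # b still sees the class-level value
print(Shared.count)        # the class attribute itself is unchanged
```

This is why a mutable default such as a list is usually created inside `__init__` instead of at class level, unless sharing is intended.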


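* Python passes references to objects, so whether a function can change the caller's data depends on the type
    * Immutable types (int, float, str, tuple) cannot be changed in place, so they behave as if they were copied
    * Mutable types (list, dict, most objects) are shared, so in-place changes made inside a function are visible to the caller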
    
See [Python-pass-by-reference-note.ipynb](Python-pass-by-reference-note.ipynb) for more details.
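
As a short illustration of this behavior (the function name `modify` is hypothetical), an integer argument is unaffected by changes inside the function, while a list argument is modified in place:

```python
def modify(number, items):
    number += 1          # rebinds the local name; the caller's int is unchanged
    items.append(99)     # mutates the list object shared with the caller
    return number

n = 1
lst = [1, 2]
result = modify(n, lst)

print(n)        # the caller's integer keeps its original value
print(lst)      # the caller's list now contains the appended element
print(result)   # the incremented value is only available via the return
```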

# Creating a MATLAB-like environment in Python

> [**NumPy**](https://www.numpy.org/) and [**Matplotlib**](https://matplotlib.org/) libraries

Unlike MATLAB, the standard Python library does not come with array-handling and plotting capabilities. 

* For array-handling, the `numpy` Python package can be used, and is generally imported as follows:

In [11]:
import numpy as np

* For plotting, the `Matplotlib` Python package can be used, and is generally imported as follows:

In [12]:
import matplotlib.pyplot as plt

* [Numpy for Matlab users](https://docs.scipy.org/doc/numpy-1.15.0/user/numpy-for-matlab-users.html) is a useful reference for new users.
* `Matplotlib` plotting routines use nearly the same interface as MATLAB plotting routines.
* Both `Numpy` and `Matplotlib` can be imported as follows:

In [13]:
from matplotlib.pylab import *

* When using JupyterLab or an IPython console, the "IPython magic function" `%pylab` can be used.

In [14]:
%pylab

Using matplotlib backend: Qt5Agg
Populating the interactive namespace from numpy and matplotlib


* When invoked, this magic function can be instructed to use different plotting "backends", which affects how plots are displayed:
```python 
%pylab  # nothing specified will default to creating a new window for plots
%pylab inline  # this will create images (non-interactive) inside the console or JupyterLab notebook
%pylab ipympl  # this will create interactive plots inside JupyterLab notebooks
```
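
For readers coming from MATLAB, a few common idioms translate to `numpy` roughly as in the sketch below. It uses the explicit `np` namespace rather than `%pylab`, and the arrays are made-up values for illustration only.

```python
import numpy as np

x = np.linspace(0, 1, 5)       # MATLAB: x = linspace(0, 1, 5)
print(x[0], x[-1])             # MATLAB: x(1), x(end)

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])

elementwise = A * B            # MATLAB: A .* B
matrix_product = A @ B         # MATLAB: A * B
print(elementwise)
print(matrix_product)
```

Note that `*` is element-wise in `numpy`, which is the opposite of MATLAB's default for matrices.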

# Labeled arrays using [xarray](http://xarray.pydata.org/en/stable/)

PODPAC uses the Python library `xarray` for the output of PODPAC Nodes. `xarray` uses "labeled" arrays, which can be confusing to new users. A labeled array attaches a name to each dimension of the array, along with coordinates along that dimension.

For example, data in a 2-D array might have different rows related to latitudes, and different columns related to longitudes. `xarray` explicitly adds this information to the array. This has a number of advantages:

* Arrays are automatically aligned. If I store my data as latitude=rows and longitude=columns, but someone else stores it as latitude=columns and longitude=rows, then `xarray` will automatically transpose one of these arrays when doing math with them
* Arrays are automatically broadcast. If I wanted to add a 2-D array with latitude and longitude coordinates to a 3-D array with latitude, longitude, time coordinates, `xarray` will automatically broadcast the 2-D array, creating copies for each time point. 
* Operations can be done by dimension name instead of axis. To take the mean over the 'time' dimension, `xarray` allows you to specify 'time' as the axis. You no longer have to remember if it was the first, last, or a different axis in your array. 
* Data can be accessed via dimension name. Again, instead of remembering the axis, the data can be subsetted or sliced by the name. 
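
The four advantages above can be sketched with a small, made-up array. The dimension names, coordinate values, and variable names below are assumptions chosen purely for illustration:

```python
import numpy as np
import xarray as xr

# Hypothetical 2-D data: 2 latitudes (rows) by 3 longitudes (columns)
da = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=["lat", "lon"],
    coords={"lat": [10, 20], "lon": [1, 2, 3]},
)

# Operate by dimension name instead of axis number
lon_mean = da.mean("lon")          # one value per latitude

# Access data by coordinate value instead of position
row = da.sel(lat=20)               # the row where lat == 20

# Automatic alignment: the transposed copy is matched by dimension name
total = da + da.T                  # equals 2 * da, no manual transpose needed

# Automatic broadcasting: add a 1-D 'time' array to the 2-D array
t = xr.DataArray([0, 100], dims=["time"], coords={"time": [0, 1]})
cube = da + t                      # result has dims lat, lon, time

print(lon_mean.values, row.values, total.dims, cube.shape)
```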

While `xarray` offers many advantages over raw `Numpy` arrays, there are a few caveats and drawbacks. Since `xarray` automatically aligns coordinates, it is difficult to take the difference between two arrays with different times, as the following example shows:

In [15]:
import xarray as xr
# create a labeled array
a = xr.DataArray([2018.1, 2018.2], dims=['time'], coords=[['2018-01-01', '2018-01-02']])
print('a: ', a)

# create another labeled array with different time coordinate
b = xr.DataArray([2018.3, 2018.4], dims=['time'], coords=[['2018-01-03', '2018-01-04']])
print('b: ', b)

# take the difference between the two arrays
# The result is an empty array, because none of the coordinates align
print('a-b: ', a - b)

# The proper way to do this with xarray is indexing the time to remove the dimension
print('b[0]: ', b[0])  # b[0] is now a scalar

# Now we can take the difference
print('a-b[0]', a - b[0])

# or alternatively selecting the dimension by time
print('a-b[0]', a - b.sel(time='2018-01-03'))

a:  <xarray.DataArray (time: 2)>
array([2018.1, 2018.2])
Coordinates:
  * time     (time) <U10 '2018-01-01' '2018-01-02'
b:  <xarray.DataArray (time: 2)>
array([2018.3, 2018.4])
Coordinates:
  * time     (time) <U10 '2018-01-03' '2018-01-04'
a-b:  <xarray.DataArray (time: 0)>
array([], dtype=float64)
Coordinates:
  * time     (time) object 
b[0]:  <xarray.DataArray ()>
array(2018.3)
Coordinates:
    time     <U10 '2018-01-03'
a-b[0] <xarray.DataArray (time: 2)>
array([-0.2, -0.1])
Coordinates:
  * time     (time) <U10 '2018-01-01' '2018-01-02'
a-b[0] <xarray.DataArray (time: 2)>
array([-0.2, -0.1])
Coordinates:
  * time     (time) <U10 '2018-01-01' '2018-01-02'


Fortunately, if you prefer raw arrays, the raw `Numpy` array can always be accessed.

In [16]:
print(a.data, type(a.data))

[2018.1 2018.2] <class 'numpy.ndarray'>
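
Working on the raw arrays is one way to take an element-by-element difference between arrays whose time coordinates do not match. The sketch below re-creates `a` and `b` so that it is self-contained; re-attaching the coordinates of `a` afterwards is an optional step, and the names `raw_diff` and `labeled_diff` are only illustrative.

```python
import numpy as np
import xarray as xr

a = xr.DataArray([2018.1, 2018.2], dims=['time'], coords=[['2018-01-01', '2018-01-02']])
b = xr.DataArray([2018.3, 2018.4], dims=['time'], coords=[['2018-01-03', '2018-01-04']])

# Element-by-element difference on the raw arrays; the time labels are ignored
raw_diff = a.data - b.data
print(raw_diff)

# Optionally re-attach the dimension and coordinates of 'a' to the result
labeled_diff = xr.DataArray(raw_diff, dims=a.dims, coords=a.coords)
print(labeled_diff)
```

Because the labels are bypassed, it is up to the user to make sure the two arrays have the same shape and a meaningful element order.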
