# AEWS Python Notebook 08e: AEWS miscellanea

**Author**: Eric Lehmann, CSIRO Data61  
**Date**:  July 01, 2016

**Note**: The Python code below is deliberately rudimentary: priority is given to code interpretability rather than execution efficiency.

**Note**: this notebook should be accessible and viewable at [https://github.com/eric542/agdc_v2/tree/master/notebooks](https://github.com/eric542/agdc_v2/tree/master/notebooks).

## Summary

Building on the concepts introduced in the previous notebooks in this series, we work out the remaining components of the AEWS implementation $-$ see *'AEWS Python Notebook 08a'* for details of these components. The contents summary for the present notebook is given below.

**Abstract $-$** From an AEWS perspective, an adequate storage mode is to save the WQ data as a netCDF dataset that includes a time series of WQ maps. As new Landsat imagery is made available on the AGDC, and subsequently processed by the AEWS routines, new WQ maps will need to be appended to the existing netCDF time series. This notebook (08e) investigates a couple of ways this can be achieved using Python.


## Preliminaries

This (Jupyter) notebook was written for use on the NCI's VDI system, with the following pre-loaded module:

```
 $ module use /g/data/v10/public/modules/modulefiles --append
 $ module load agdc-py2-prod 
```

**NOTE**: the specific module loaded here (`agdc-py2-prod`) is different from the module loaded in earlier notebooks (`agdc-py2-dev`)! While the earlier module contained only Landsat 5 data, the `agdc-py2-prod` module links to a (different) AGDC database containing the following NBART/NBAR/PQA datasets:

* Landsat 8: 2013
* Landsat 7: 2013
* Landsat 5: 2006/2007

It is unclear whether the API functions in these 2 modules are identical or represent different versions.

**NOTE 2**: as of mid-June 2016, changes were made to the AGDC API v2.0, and the above Landsat datasets (and related API functions) can now be accessed through the module `agdc-py2-prod/1.0.3` (pre-major-change version). How long this module will remain accessible and/or when it will be replaced with the formal v2.0 API is still unclear at this time (June 2016).

In [1]:
%%html  # Definitions for some pretty text boxes...
<style>
    div.warn { background-color: #e8c9c9; border-left: 5px solid #c27070; padding: 0.5em }
    div.note { background-color: #cce0ff; border-left: 5px solid #5c85d6; padding: 0.5em }
    div.info { background-color: #ffe680; border-left: 5px solid #cca300; padding: 0.5em }
</style>

In [2]:
from __future__ import print_function   # must come first in the cell

from datetime import date, datetime, timedelta
from pprint import pprint

import numpy as np
from numpy.random import uniform
from netCDF4 import Dataset, date2num, num2date

## Updating a netCDF dataset: method 1

### Setting up the netCDF dataset for appending

New time slices can be appended to an existing netCDF dataset by defining an "unlimited" time dimension. Let's demonstrate this by first creating a test netCDF file `'test.nc'`. This dataset will have 3 dimensions: time, lat and lon, the first of which is defined as unlimited, i.e. it will grow automatically as new data is appended.

In [3]:
rootgrp = Dataset("test.nc", "w")
time = rootgrp.createDimension("time", None)   # using 'None' indicates the dimension is unlimited
lat = rootgrp.createDimension("lat", 73)
lon = rootgrp.createDimension("lon", 144)

In [4]:
print ( rootgrp )
print ( rootgrp.dimensions )
print ( "\nIs the 'time' variable unlimited?", time.isunlimited() )

<type 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    dimensions(sizes): time(0), lat(73), lon(144)
    variables(dimensions): 
    groups: 

OrderedDict([('time', <type 'netCDF4._netCDF4.Dimension'> (unlimited): name = 'time', size = 0
), ('lat', <type 'netCDF4._netCDF4.Dimension'>: name = 'lat', size = 73
), ('lon', <type 'netCDF4._netCDF4.Dimension'>: name = 'lon', size = 144
)])

Is the 'time' variable unlimited? True


We now create the `time`, `latitude`, `longitude` variables, as well as the main `temp` variable of 3D data:

In [5]:
times = rootgrp.createVariable("time","f8",("time",))
latitudes = rootgrp.createVariable("latitude","f4",("lat",))
longitudes = rootgrp.createVariable("longitude","f4",("lon",))
temp = rootgrp.createVariable("temp","f4",("time","lat","lon",))
print(temp)

<type 'netCDF4._netCDF4.Variable'>
float32 temp(time, lat, lon)
unlimited dimensions: time
current shape = (0, 73, 144)
filling on, default _FillValue of 9.96920996839e+36 used



In [6]:
rootgrp.variables

OrderedDict([('time', <type 'netCDF4._netCDF4.Variable'>
              float64 time(time)
              unlimited dimensions: time
              current shape = (0,)
              filling on, default _FillValue of 9.96920996839e+36 used),
             ('latitude', <type 'netCDF4._netCDF4.Variable'>
              float32 latitude(lat)
              unlimited dimensions: 
              current shape = (73,)
              filling on, default _FillValue of 9.96920996839e+36 used),
             ('longitude', <type 'netCDF4._netCDF4.Variable'>
              float32 longitude(lon)
              unlimited dimensions: 
              current shape = (144,)
              filling on, default _FillValue of 9.96920996839e+36 used),
             ('temp', <type 'netCDF4._netCDF4.Variable'>
              float32 temp(time, lat, lon)
              unlimited dimensions: time
              current shape = (0, 73, 144)
              filling on, default _FillValue of 9.96920996839e+36 used)])

In [7]:
latitudes

<type 'netCDF4._netCDF4.Variable'>
float32 latitude(lat)
unlimited dimensions: 
current shape = (73,)
filling on, default _FillValue of 9.96920996839e+36 used

In [8]:
print(latitudes[:])

[-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --]


The Python variable `latitudes` (for instance) is a netCDF4 `Variable` object tied to the open file `'test.nc'`, not an in-memory array. Values assigned to it are written through to the `.nc` file, and are guaranteed to be on disk once the dataset is closed.

In [9]:
latitudes[:] = np.arange(-90,91,2.5)
longitudes[:] = np.arange(-180,180,2.5)

Variables defined on the unlimited dimension grow automatically as new data is appended. Note that writing 5 time slices to `temp` also extends `times` to length 5, with the new entries left masked (i.e. not yet set):

In [10]:
print( temp.shape, times.shape )
print( times[:] )

temp[0:5,:,:] = uniform(size=(5,73,144))   # some random data

print( "\n", temp.shape, times.shape )
print( times[:] )

(0, 73, 144) (0,)
[]

 (5, 73, 144) (5,)
[-- -- -- -- --]


In [11]:
temp[0,:,:]

array([[ 0.24164586,  0.53466713,  0.89016908, ...,  0.51961076,
         0.91276568,  0.23320143],
       [ 0.68931675,  0.82504773,  0.79529351, ...,  0.78653747,
         0.69598752,  0.35384986],
       [ 0.60408092,  0.03871369,  0.94179893, ...,  0.73702049,
         0.15689316,  0.82223141],
       ..., 
       [ 0.30853307,  0.97341442,  0.57557809, ...,  0.56061757,
         0.52724916,  0.38175559],
       [ 0.74199027,  0.82460821,  0.01898139, ...,  0.34432012,
         0.10491152,  0.55185574],
       [ 0.58114558,  0.94513285,  0.18880354, ...,  0.23026513,
         0.50431281,  0.41810608]], dtype=float32)

We can also define attributes for each variable:

In [12]:
latitudes.units = "degrees north"
longitudes.units = "degrees east"
temp.units = "Z"
times.units = "hours since 0001-01-01 00:00:00.0"
times.calendar = "gregorian"

Now we can set the values of the `time` variable, with some plausible date values:

In [13]:
dates = [ datetime(2001,3,1)+n*timedelta(hours=12) for n in range(temp.shape[0]) ]
print ( "Some random dates:\n", dates )

times[:] = date2num(dates, units=times.units, calendar=times.calendar)
print ( "\nCorresponding time values (in units %s): " % times.units+"\n", times[:] )

# if we need to retrieve the dates from 'time' values:
# dates = num2date(times[:],units=times.units,calendar=times.calendar)

Some random dates:
 [datetime.datetime(2001, 3, 1, 0, 0), datetime.datetime(2001, 3, 1, 12, 0), datetime.datetime(2001, 3, 2, 0, 0), datetime.datetime(2001, 3, 2, 12, 0), datetime.datetime(2001, 3, 3, 0, 0)]

Corresponding time values (in units hours since 0001-01-01 00:00:00.0): 
 [ 17533104.  17533116.  17533128.  17533140.  17533152.]
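
The "hours since" encoding above can be cross-checked with plain `datetime` arithmetic. The sketch below (the helper name `hours_since` is hypothetical, not part of netCDF4) reproduces the 12-hour spacing, but yields values 48 hours smaller than those printed by `date2num`. The most likely explanation is that the `"gregorian"` calendar in netCDF4 is a mixed Julian/Gregorian calendar, whereas Python's `datetime` uses the proleptic Gregorian calendar; the two differ by two days around year 1. This is an interpretation, not something verified against the library source here.

```python
from datetime import datetime, timedelta

def hours_since(reference, when):
    """Elapsed hours between two datetimes (proleptic Gregorian arithmetic)."""
    return (when - reference).total_seconds() / 3600.0

reference = datetime(1, 1, 1)
dates = [datetime(2001, 3, 1) + n * timedelta(hours=12) for n in range(5)]
hours = [hours_since(reference, d) for d in dates]
print(hours)

# Difference between the first value printed by date2num in the cell above
# and the plain datetime arithmetic used here.
offset = 17533104.0 - hours[0]
print(offset)
```

In practice this suggests always decoding stored time values with `num2date` and the variable's own `units` and `calendar` attributes, rather than with hand-written arithmetic.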


Closing the netCDF4 dataset will create / write the dataset to the `'test.nc'` file:

In [14]:
rootgrp.close()

In [15]:
!ls -lh test.nc

-rw-r--r-- 1 eal599 jr4 4.3M Jul  1 09:58 test.nc


### Compressing netCDF variables

Upon closing, this dataset of 5 x 73 x 144 float32 values has a size of about 4.3MB on disk. This can reportedly be reduced by compressing the main netCDF variables. Let's see what sort of difference this makes...
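
As a rough sanity check on that figure, the raw (uncompressed) payload of the variables written so far can be worked out by hand. This is simple arithmetic on the dimensions and data types used above, nothing more:

```python
# Raw (uncompressed) payload of the variables written above, in bytes.
n_time, n_lat, n_lon = 5, 73, 144

temp_bytes = n_time * n_lat * n_lon * 4   # 'f4' -> 4 bytes per value
time_bytes = n_time * 8                   # 'f8' -> 8 bytes per value
coord_bytes = (n_lat + n_lon) * 4         # latitude and longitude, both 'f4'

payload_bytes = temp_bytes + time_bytes + coord_bytes
print("raw payload: %d bytes (about %.0f KiB)" % (payload_bytes, payload_bytes / 1024.0))
```

The payload is therefore only around 0.2MB, so most of the 4.3MB on disk is presumably storage overhead rather than data. One plausible contributor (an assumption, not verified here) is the chunk allocation used for variables defined on the unlimited dimension.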

In [16]:
rootgrp = Dataset("test.nc", "w")   # overwrites existing file
time = rootgrp.createDimension("time", None)
lat = rootgrp.createDimension("lat", 73)
lon = rootgrp.createDimension("lon", 144)

times = rootgrp.createVariable("time","f8",("time",))
latitudes = rootgrp.createVariable("latitude","f4",("lat",))
longitudes = rootgrp.createVariable("longitude","f4",("lon",))
temp = rootgrp.createVariable("temp","f4",("time","lat","lon",), zlib=True, least_significant_digit=2)
   # lossy compression: data quantized to a precision of about 0.01 (2 decimal places) before zlib compression

latitudes.units = "degrees north"
longitudes.units = "degrees east"
temp.units = "Z"
times.units = "hours since 0001-01-01 00:00:00.0"
times.calendar = "gregorian"

times[:] = date2num(dates, units=times.units, calendar=times.calendar)
latitudes[:] = np.arange(-90,91,2.5)
longitudes[:] = np.arange(-180,180,2.5)
temp[0:5,:,:] = uniform(size=(5,73,144))   # some random data

rootgrp.close()

!ls -lh test.nc

-rw-r--r-- 1 eal599 jr4 4.1M Jul  1 09:58 test.nc


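Well... not a big difference here (4.1MB versus 4.3MB), which is not too surprising since uniformly random data compresses poorly. The gain may well be larger for bigger and/or smoother datasets, so this is probably worth keeping in mind in any case.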
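
To get a feel for what `least_significant_digit=2` does to the data, here is a pure-Python sketch of the quantization rule as described in the netCDF4 documentation (`around(scale*data)/scale`, with `scale` a power of two chosen so that the requested decimal precision is retained). The helper name `quantize` is hypothetical, and this is an illustration of the rule rather than the library's own code.

```python
import math

def quantize(value, least_significant_digit):
    """Round `value` to the nearest multiple of 1/scale, where scale is the
    smallest power of two not below 10**least_significant_digit."""
    bits = int(math.ceil(math.log(10.0 ** least_significant_digit, 2)))
    scale = 2.0 ** bits
    return round(value * scale) / scale, scale

q, scale = quantize(0.24164586, 2)
print(q, scale)

# Values read back from the compressed 'temp' variable later in this notebook
# (e.g. 0.09375, 0.1796875, 0.609375) are indeed exact multiples of 1/128.
multiples = [v * 128 for v in (0.09375, 0.1796875, 0.609375)]
print(multiples)
```

With two decimal digits requested, the scale works out to 128, so the stored values keep a precision slightly finer than 0.01.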

### Appending to netCDF dataset

Once we have created the netCDF file as per above (with unlimited 'time' dimension), we can append to it as follows...

In [17]:
grp = Dataset("test.nc", "a")   # open the dataset in 'append' mode
grp

<type 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    dimensions(sizes): time(5), lat(73), lon(144)
    variables(dimensions): float64 time(time), float32 latitude(lat), float32 longitude(lon), float32 temp(time,lat,lon)
    groups: 

"Read in" the 'temp' variable from the netCDF dataset:

In [18]:
temp = grp['temp']
print( temp )
temp[4,:,:]

<type 'netCDF4._netCDF4.Variable'>
float32 temp(time, lat, lon)
    least_significant_digit: 2
    units: Z
unlimited dimensions: time
current shape = (5, 73, 144)
filling on, default _FillValue of 9.96920996839e+36 used



array([[ 0.09375  ,  0.1796875,  0.609375 , ...,  0.1484375,  0.6328125,
         0.8125   ],
       [ 0.9921875,  0.25     ,  0.5234375, ...,  0.8671875,  0.625    ,
         0.6953125],
       [ 0.4453125,  0.5234375,  0.6328125, ...,  0.390625 ,  0.53125  ,
         0.7265625],
       ..., 
       [ 0.0078125,  0.3359375,  0.3515625, ...,  0.3515625,  0.0625   ,
         0.3203125],
       [ 0.328125 ,  0.109375 ,  0.8984375, ...,  0.953125 ,  0.0078125,
         0.8046875],
       [ 0.0703125,  0.9140625,  0.40625  , ...,  0.4609375,  0.4765625,
         0.1640625]], dtype=float32)

Append new data to 'temp' variable:

In [19]:
temp[5,:,:] = np.round( 50 + uniform(size=(73,144)) )   # some new (different) data, one 2D slice
temp[5,:,:]

array([[ 50.,  50.,  50., ...,  50.,  51.,  51.],
       [ 51.,  51.,  51., ...,  51.,  51.,  51.],
       [ 50.,  51.,  51., ...,  50.,  50.,  50.],
       ..., 
       [ 51.,  50.,  50., ...,  51.,  50.,  50.],
       [ 51.,  51.,  51., ...,  50.,  50.,  51.],
       [ 50.,  51.,  50., ...,  50.,  50.,  50.]], dtype=float32)

In [20]:
time = grp['time']
time[:]

masked_array(data = [17533104.0 17533116.0 17533128.0 17533140.0 17533152.0 --],
             mask = [False False False False False  True],
       fill_value = 9.96920996839e+36)

As expected, the 'time' variable / dimension has grown by one, though the newly created index does not have a proper value assigned to it yet. Let's define it with some new date:

In [21]:
time[-1] = date2num( datetime(2001,3,1)+10*timedelta(hours=12), units=time.units, calendar=time.calendar)
time[:]

array([ 17533104.,  17533116.,  17533128.,  17533140.,  17533152.,
        17533224.])

Now, closing the netCDF dataset will again save the data to file:

In [22]:
grp.close()
!ls -lh test.nc

-rw-r--r-- 1 eal599 jr4 4.1M Jul  1 09:58 test.nc


We can load up the `.nc` file again to ensure the data has been saved properly:

In [23]:
grp = Dataset("test.nc", "r")
print( grp['temp'] )
print( grp['temp'][(4,5),:,:] )
print( "\n", grp['time'] )
print( grp['time'][:] )
grp.close()

<type 'netCDF4._netCDF4.Variable'>
float32 temp(time, lat, lon)
    least_significant_digit: 2
    units: Z
unlimited dimensions: time
current shape = (6, 73, 144)
filling on, default _FillValue of 9.96920996839e+36 used

[[[  9.37500000e-02   1.79687500e-01   6.09375000e-01 ...,   1.48437500e-01
     6.32812500e-01   8.12500000e-01]
  [  9.92187500e-01   2.50000000e-01   5.23437500e-01 ...,   8.67187500e-01
     6.25000000e-01   6.95312500e-01]
  [  4.45312500e-01   5.23437500e-01   6.32812500e-01 ...,   3.90625000e-01
     5.31250000e-01   7.26562500e-01]
  ..., 
  [  7.81250000e-03   3.35937500e-01   3.51562500e-01 ...,   3.51562500e-01
     6.25000000e-02   3.20312500e-01]
  [  3.28125000e-01   1.09375000e-01   8.98437500e-01 ...,   9.53125000e-01
     7.81250000e-03   8.04687500e-01]
  [  7.03125000e-02   9.14062500e-01   4.06250000e-01 ...,   4.60937500e-01
     4.76562500e-01   1.64062500e-01]]

 [[  5.00000000e+01   5.00000000e+01   5.00000000e+01 ...,   5.00000000e+01
     5.1

## Updating a netCDF dataset: method 2

Another method is to merge two `.nc` files with the `ncrcat` shell command (the record concatenator from the NCO toolkit). On the VDI, this command appears to be available by default in the terminal, whereas on NCI's Raijin, it becomes available after loading the following module:

```
 $ module load nco 
```

Let's see how to use this option by first creating another test dataset with 10 new time slices with dates sometime in 2002:

In [24]:
rootgrp = Dataset("test2.nc", "w")
time = rootgrp.createDimension("time", None)
lat = rootgrp.createDimension("lat", 73)
lon = rootgrp.createDimension("lon", 144)

times = rootgrp.createVariable("time","f8",("time",))
latitudes = rootgrp.createVariable("latitude","f4",("lat",))
longitudes = rootgrp.createVariable("longitude","f4",("lon",))
temp = rootgrp.createVariable("temp","f4",("time","lat","lon",), zlib=True, least_significant_digit=2)
   # lossy compression: data quantized to a precision of about 0.01 (2 decimal places) before zlib compression

latitudes.units = "degrees north"
longitudes.units = "degrees east"
temp.units = "Z"
times.units = "hours since 0001-01-01 00:00:00.0"
times.calendar = "gregorian"

temp[0:10,:,:] = np.round( 80 + uniform(size=(10,73,144)) )   # some random data
dates = [ datetime(2002,12,1)+n*timedelta(hours=12) for n in range(temp.shape[0]) ]
times[:] = date2num(dates, units=times.units, calendar=times.calendar)
latitudes[:] = np.arange(-90,91,2.5)
longitudes[:] = np.arange(-180,180,2.5)

rootgrp.close()

!ls -lh test2.nc

-rw-r--r-- 1 eal599 jr4 4.1M Jul  1 09:58 test2.nc


We now have the two `test.nc` and `test2.nc` files. We should be able to concatenate them using the following command:

In [25]:
!ncrcat test.nc test2.nc -O test_all.nc
!ls -lh test_all.nc

-rw-r--r-- 1 eal599 jr4 4.1M Jul  1 09:58 test_all.nc


This seems to have worked OK (though it's unclear why the new dataset is of similar size to the original datasets). Let's check whether this new concatenated dataset has been correctly generated:

In [26]:
grp = Dataset("test_all.nc", "r")
print( grp['temp'] )
print( grp['temp'][(4,5,6),:,:] )
print( "\n", grp['time'] )
time = grp['time']   # read from the concatenated file (not from the closed 'test2.nc' dataset)
print( num2date(time[:],units=time.units,calendar=time.calendar) )
grp.close()

<type 'netCDF4._netCDF4.Variable'>
float32 temp(time, lat, lon)
    least_significant_digit: 2
    units: Z
unlimited dimensions: time
current shape = (16, 73, 144)
filling on, default _FillValue of 9.96920996839e+36 used

[[[  9.37500000e-02   1.79687500e-01   6.09375000e-01 ...,   1.48437500e-01
     6.32812500e-01   8.12500000e-01]
  [  9.92187500e-01   2.50000000e-01   5.23437500e-01 ...,   8.67187500e-01
     6.25000000e-01   6.95312500e-01]
  [  4.45312500e-01   5.23437500e-01   6.32812500e-01 ...,   3.90625000e-01
     5.31250000e-01   7.26562500e-01]
  ..., 
  [  7.81250000e-03   3.35937500e-01   3.51562500e-01 ...,   3.51562500e-01
     6.25000000e-02   3.20312500e-01]
  [  3.28125000e-01   1.09375000e-01   8.98437500e-01 ...,   9.53125000e-01
     7.81250000e-03   8.04687500e-01]
  [  7.03125000e-02   9.14062500e-01   4.06250000e-01 ...,   4.60937500e-01
     4.76562500e-01   1.64062500e-01]]

 [[  5.00000000e+01   5.00000000e+01   5.00000000e+01 ...,   5.00000000e+01
     5.

This seems OK, with now 16 dates in total (the first few in 2001, and the last few in 2002), and a correctly concatenated 'temp' variable.
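
Outside of a notebook (where the `!` shell escape is not available), the same concatenation could be launched from a Python script with `subprocess`. The sketch below is one possible way of doing this; the helper names `build_ncrcat_cmd` and `run_ncrcat` are hypothetical, and only the `-O` (overwrite) option already used above is assumed.

```python
import subprocess

def build_ncrcat_cmd(inputs, output, overwrite=True):
    """Build the argument list for an ncrcat call (inputs first, output last)."""
    cmd = ["ncrcat"]
    if overwrite:
        cmd.append("-O")          # overwrite the output file if it already exists
    cmd.extend(inputs)
    cmd.append(output)
    return cmd

def run_ncrcat(inputs, output, overwrite=True):
    """Run ncrcat; return True on success, False if the tool is not installed.

    A non-zero exit status from ncrcat raises subprocess.CalledProcessError.
    """
    try:
        subprocess.check_call(build_ncrcat_cmd(inputs, output, overwrite))
    except OSError:               # executable not found on this system
        return False
    return True

# Only build and display the command here; run_ncrcat() would execute it.
print(" ".join(build_ncrcat_cmd(["test.nc", "test2.nc"], "test_all.nc")))
```

Passing the arguments as a list (rather than a single shell string) avoids quoting problems with unusual file names.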

As a final check, let's see if the `ncrcat` option also works with more "basic" `.nc` datasets, created without an unlimited dimension and without variable compression.

In [27]:
rootgrp = Dataset("test.nc", "w")
time = rootgrp.createDimension("time", 3)   # not unlimited!
lat = rootgrp.createDimension("lat", 73)
lon = rootgrp.createDimension("lon", 144)

times = rootgrp.createVariable("time","f8",("time",))
latitudes = rootgrp.createVariable("latitude","f4",("lat",))
longitudes = rootgrp.createVariable("longitude","f4",("lon",))
temp = rootgrp.createVariable("temp","f4",("time","lat","lon",))

temp[0:3,:,:] = np.round( 5 + uniform(size=(3,73,144)) )   # some random data
dates = [ datetime(2002,12,1)+n*timedelta(hours=1) for n in range(temp.shape[0]) ]
times[:] = date2num(dates, units="hours since 0001-01-01 00:00:00.0", calendar="gregorian")
latitudes[:] = np.arange(-90,91,2.5)
longitudes[:] = np.arange(-180,180,2.5)

rootgrp.close()
!ls -lh test.nc

-rw-r--r-- 1 eal599 jr4 131K Jul  1 09:58 test.nc


In [28]:
rootgrp = Dataset("test2.nc", "w")
time = rootgrp.createDimension("time", 2)   # not unlimited!
lat = rootgrp.createDimension("lat", 73)
lon = rootgrp.createDimension("lon", 144)

times = rootgrp.createVariable("time","f8",("time",))
latitudes = rootgrp.createVariable("latitude","f4",("lat",))
longitudes = rootgrp.createVariable("longitude","f4",("lon",))
temp = rootgrp.createVariable("temp","f4",("time","lat","lon",))

temp[0:2,:,:] = np.round( 2 + uniform(size=(2,73,144)) )   # some random data
dates = [ datetime(2003,1,1)+n*timedelta(hours=1) for n in range(temp.shape[0]) ]
times[:] = date2num(dates, units="hours since 0001-01-01 00:00:00.0", calendar="gregorian")
latitudes[:] = np.arange(-90,91,2.5)
longitudes[:] = np.arange(-180,180,2.5)

rootgrp.close()
!ls -lh test2.nc

-rw-r--r-- 1 eal599 jr4 90K Jul  1 09:58 test2.nc


In [29]:
!ncrcat test.nc test2.nc -O test_all.nc

ncrcat: ERROR no variables fit criteria for processing
ncrcat: HINT Extraction list must contain a record variable which to concatenate. A record variable is a variable defined with a record dimension. Often the record dimension, aka unlimited dimension, refers to time. For more information on creating record dimensions within existing datasets, see http://nco.sf.net/nco.html#mk_rec_dmn


OK, so it looks like, to be able to use this option, our WQ datasets will need to have the time dimension defined as unlimited (i.e. as a record dimension).