# The genutil module (general utilities)

This module contains ***cdms-aware*** general utilities that are not restricted to Earth sciences.

As a general rule, this means that axis operations can be specified by axis *name* instead of by index alone. This allows for generic code that works on data with a different number of dimensions or a different dimension order (not common in our case).

Also, genutil functions can usually work on multiple axes at the same time (e.g., averaging over multiple dimensions).

The most commonly used function from genutil is the `averager`.

Area averaging is one of the most common data reduction procedures used in climate data analysis. The `genutil` package has a powerful area averaging function, `averager()`. It gives you control over the order of operations (i.e., which dimensions are averaged over first) and over the weighting used for each axis: you can pass your own array of weights for each dimension, use the default (grid) weights, or specify equal weighting.

**Usage:**

```
result = averager(data, axis=axisoptions, weights=weightoptions, action=actionoptions, returned=returnedoptions, combinewts=combinewtsoptions)
```

Let's demonstrate some of this using some of our data.

In [1]:
import cdms2
ipsl_ps_file = cdms2.open("/global/cscratch1/sd/cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Amon/ps/gr/v20180803/ps_Amon_IPSL-CM6A-LR_historical_r1i1p1f1_gr_185001-201412.nc")
ps = ipsl_ps_file("ps", time=("2000", "2010"))
ipsl_ta_file = cdms2.open("/global/cscratch1/sd/cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Amon/ta/gr/v20180803/ta_Amon_IPSL-CM6A-LR_historical_r1i1p1f1_gr_185001-201412.nc")
ta = ipsl_ta_file("ta", time=("2000", "2010"), order='xtzy') # Just to make it hard
print("ps:", ps.shape)
print("ta:", ta.shape)

ps: (120, 143, 144)
ta: (144, 120, 19, 143)


As you can see, to average over latitude in numpy we have to pass a different axis index for `ps` and `ta`, because latitude sits at a different position in each array.

In [2]:
import numpy
ps_zonal_np = numpy.average(ps, 1)
ta_zonal_np = numpy.average(ta, 3)
print(ps_zonal_np.shape)
print(ta_zonal_np.shape)

(120, 144)
(144, 120, 19)


With genutil we do not need to worry about this: the axis is selected by name.

In [3]:
import genutil
ps_zonal_gen = genutil.averager(ps, axis='y')
ta_zonal_gen = genutil.averager(ta, axis='y')
print(ps_zonal_gen.shape)
print(ta_zonal_gen.shape)

(120, 144)
(144, 120, 19)
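
To illustrate the idea of selecting an axis by name, here is a minimal numpy-only sketch. The helper `average_by_name` and the order strings passed to it are hypothetical, invented for this illustration; this is not how `genutil` is implemented, and the arrays are small made-up stand-ins for `ps` and `ta`.

```python
import numpy


def average_by_name(data, order, name):
    """Average `data` over the axis labelled `name`.

    `order` holds one letter per dimension, e.g. 'tyx'.
    Hypothetical helper for illustration only, not part of genutil.
    """
    index = order.index(name)  # position of the named axis in this array
    return numpy.average(data, axis=index)


# Two arrays that both have a latitude ('y') axis, at different positions
ps_like = numpy.zeros((12, 5, 6))     # (time, lat, lon)
ta_like = numpy.zeros((6, 12, 3, 5))  # (lon, time, level, lat)

# The same call works for both, whatever the dimension order
print(average_by_name(ps_like, "tyx", "y").shape)
print(average_by_name(ta_like, "xtzy", "y").shape)
```

The calling code never mentions an index, so it keeps working if the dimensions are reordered.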


As mentioned earlier, an additional *bonus* of using `genutil` is that the average is properly weighted according to the axis bounds, whereas `numpy.average` applies equal weights to each latitude unless weights are passed explicitly.

In [4]:
print("Max percentage difference: {:.2f}%".format(((ps_zonal_gen-ps_zonal_np)/ps_zonal_gen).max()*100.))

Max percentage difference: 3.37%
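
The effect of latitude weighting can be sketched with numpy alone. The example below assumes a made-up grid of 10-degree latitude bands and a made-up field, and it assumes area weights proportional to the difference of the sines of the band bounds (the area of a latitude band on a sphere). It is an illustration of the principle, not a reproduction of what `genutil` computes internally.

```python
import numpy

# Hypothetical regular grid: 10-degree latitude bands from pole to pole
lat_bounds = numpy.linspace(-90.0, 90.0, 19)
lower = numpy.radians(lat_bounds[:-1])
upper = numpy.radians(lat_bounds[1:])

# Area of a latitude band on a sphere is proportional to sin(upper) - sin(lower)
area_weights = numpy.sin(upper) - numpy.sin(lower)

# Made-up field that depends on latitude only (largest near the equator)
lat_centers = 0.5 * (lat_bounds[:-1] + lat_bounds[1:])
field = numpy.cos(numpy.radians(lat_centers))

equal_mean = numpy.average(field)                       # every band counts the same
area_mean = numpy.average(field, weights=area_weights)  # polar bands count less

print("equal weights:", equal_mean)
print("area weights: ", area_mean)
```

With equal weights the small polar bands pull the mean down; with area weights they contribute in proportion to the surface they cover.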


# Usage Details

Usage:

```
result = averager(data, axis=axisoptions, weights=weightoptions, action=actionoptions, returned=returnedoptions, combinewts=combinewtsoptions)
```

#### axisoptions

    Default: 0 (the first dimension in the data you pass to the function).
        Restrictions: axisoptions has to be a string. You can pass axis='tyx', or '123', or 'x (plev)' etc., the same
        way as in order= options for variable operations, EXCEPT that '...' (i.e., an ellipsis) is not allowed. If V
        (the data passed in) is an array of type Numeric or MA, axisoptions can only be of the form '123'.

#### weightoptions

'generate' | 'weighted' | 'equal' | 'unweighted' | array | Masked Variable

    Default:

    'weighted' for Transient Variables (MVs)
    'unweighted' for MA or Numeric arrays.

Note that the default weights change depending on the type of array being operated on by averager!

'weighted' or 'generate' means the averaging uses the grid information to generate weights for that dimension.

'equal' or 'unweighted' means equal weights are used for all the grid points along that axis.

array means an array of weights can be passed, either of the same shape as the dimension being averaged over or of the same shape as V.

Masked Variable means an MV of the same shape as V can be passed.

Additional Notes:

    'generate' or 'weighted':
        The weights are generated using the bounds for the specified axis. For latitude and longitude, the weights
        are calculated using the area (see the cdms manual grid.getWeights() for more details), whereas for the other
        axes the weights are the difference between the bounds (when the bounds are available). If the bounds are
        stored in the file being read in, then those values are used. Otherwise, bounds are generated as long as
        cdms.setAutoBounds('on') is set. If cdms.setAutoBounds() is set to 'off', then an Error is raised.
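
As a rough numpy-only sketch of "weights are the difference between the bounds", consider a time axis made of calendar months. The bounds and the monthly values below are made up for illustration; the example shows the principle only, not the `genutil` internals.

```python
import numpy

# Hypothetical time bounds, in days since 1 January of a non-leap year
month_lengths = numpy.array([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])
edges = numpy.concatenate(([0], numpy.cumsum(month_lengths)))
time_bounds = numpy.column_stack((edges[:-1], edges[1:]))  # shape (12, 2)

# Weight of each cell = upper bound - lower bound (here: days in the month)
weights = time_bounds[:, 1] - time_bounds[:, 0]

monthly_values = numpy.arange(1.0, 13.0)  # made-up monthly means

print(numpy.average(monthly_values))                   # equal weights
print(numpy.average(monthly_values, weights=weights))  # weighted by month length
```

Longer months count slightly more, so the weighted annual mean differs a little from the plain mean of the twelve values.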

#### actionoptions

'average' | 'sum'

    Default: 'average'

    You can either return the weighted average or the weighted sum of the data.

#### returnedoptions

0 | 1

    Default: 0

    0 implies that the sum of weights is not returned after the averaging operation.

    1 implies that the sum of weights after the average operation is returned.


#### combinewtsoptions

0 | 1

    Default: 0

    0 implies weights passed for individual axes are not combined into one weight array for the full variable V before
    performing operation.

    1 implies weights passed for individual axes are combined into one weight array for the full variable before
    performing average or sum operations. One-dimensional weight arrays or the keywords 'weighted' or 'unweighted'
    must be passed for the axes over which the operation is to be performed.

    Additionally, weights for axes that are not being averaged or summed may also be passed in the order in which
    they appear. If the weights for the other axes are not passed, they are assumed to be equally weighted.
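
One plausible way to picture "combining" per-axis weights is an outer product that turns one-dimensional weights into a full-size weight array. The numpy sketch below assumes that interpretation; the data and weights are made up, and the actual `genutil` behaviour should be checked against its documentation.

```python
import numpy

# Made-up 2-D data with one missing value
data = numpy.ma.masked_equal(
    numpy.array([[1.0, 2.0, 999.0],
                 [4.0, 5.0, 6.0]]),
    999.0,
)
row_weights = numpy.array([1.0, 3.0])       # one weight per row (axis 0)
col_weights = numpy.array([1.0, 1.0, 2.0])  # one weight per column (axis 1)

# Assumed combination rule: full weight = row weight * column weight
full_weights = numpy.outer(row_weights, col_weights)  # shape (2, 3)

# Average over everything at once; masked cells drop out together with their weights
combined = numpy.ma.average(data, weights=full_weights)

print(full_weights)
print(combined)
```

Because the weights live on the full grid, a missing cell removes exactly its own weight from the total, independent of any ordering of the axes.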

# Multiple Axes Operations

## General Case

`genutil` lets you average over multiple axes at once:

In [5]:
ps_time_serie = genutil.averager(ps, axis='xy')
print(ps_time_serie[:12])  # First year

[98506.73694889374 98506.73692899902 98506.73696174526 98506.73694023809
 98506.73694071262 98506.73694012045 98506.73694027291 98506.73697412701
 98506.73694064411 98506.73694002412 98506.73690655541 98506.73697967327]


# Missing Values, Ordering Matters!

`genutil` deals properly with missing values.

But it is worth mentioning that when missing values are present, the **ordering of operations matters**.

Please consider the following example (pure `numpy` for simplicity)

Let's generate some masked data

In [6]:
data = numpy.ma.array([[3,4,999,7], [999,5,999,999],[1,2,5,5],[999,999,6,4.]])
data = numpy.ma.masked_equal(data, 999.)
data

masked_array(
  data=[[3.0, 4.0, --, 7.0],
        [--, 5.0, --, --],
        [1.0, 2.0, 5.0, 5.0],
        [--, --, 6.0, 4.0]],
  mask=[[False, False,  True, False],
        [ True, False,  True,  True],
        [False, False, False, False],
        [ True,  True, False, False]],
  fill_value=999.0)

Let's average over the 0 axis first

In [7]:
print("Y, X:", numpy.ma.average(numpy.ma.average(data, axis=0)))

Y, X: 4.125


Let's average over the second axis first:

In [8]:
print("X, Y:", numpy.ma.average(numpy.ma.average(data, axis=-1)))

X, Y: 4.479166666666667


Now let's average over all axes at once:

In [7]:
print("All:", numpy.ma.average(data))

All: 4.2


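As you can see, when dealing with missing data the order of operations does matter: each intermediate average is built from a different number of valid points, but the second step treats them all equally.

Fortunately, `genutil.averager` can help with this by returning the sum of the weights used in the first step, which can then be passed as weights to the second step: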

In [8]:
temp, weights = genutil.averager(data, axis=0, returned=1)
print(weights)
print(genutil.averager(temp, weights=weights))  # Correct average

[2. 3. 2. 3.]
4.2


In [9]:
temp, weights = genutil.averager(data, axis=1, returned=1)
print(weights)
print(genutil.averager(temp, weights=weights))  # Correct average

[3. 1. 4. 2.]
4.2
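
For readers without `genutil` at hand, the same two-step pattern can be sketched with `numpy.ma.average`, whose `returned=True` option also hands back the sum of weights (here, the count of unmasked values). This is a pure numpy analogue on the same small array, not a description of the `genutil` internals.

```python
import numpy

data = numpy.ma.masked_equal(
    numpy.array([[3, 4, 999, 7],
                 [999, 5, 999, 999],
                 [1, 2, 5, 5],
                 [999, 999, 6, 4.0]]),
    999.0,
)

results = []
for axis in (0, 1):
    # Step 1: average along one axis and keep the number of valid points used
    partial, counts = numpy.ma.average(data, axis=axis, returned=True)
    # Step 2: average the partial means, weighting each by its number of valid points
    two_step = numpy.ma.average(partial, weights=counts)
    results.append(float(two_step))
    print("axis", axis, "first:", two_step)

# Reference: average over all valid points at once
all_at_once = float(numpy.ma.average(data))
print("all at once:", all_at_once)
```

Once the second step is weighted by the counts from the first, both orderings agree with the all-at-once average.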


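The same value is obtained in a single call by converting the data to a cdms2 variable and averaging over both axes at once:

In [10]: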
cdms2.setAutoBounds(1)
data2 = cdms2.MV2.array(data)
genutil.averager(data2, axis='(axis_0)(axis_1)')

variable_80
masked_array(data=4.2,
             mask=False,
       fill_value=1e+20)