New API for Canvas.raster (#556)

As mentioned in holoviz/holoviews#1909, the ``Canvas.raster()`` API has not matched the rest of datashader, because that code originated in an external project (gridtools). These differences made it difficult for external tools like HoloViews to provide a consistent interface for datashading across object types. Fully rewriting this code would be a lot of work, but this PR rewrites the top-level API to be more similar to other Canvas glyph types. The changes should be nearly fully backwards compatible for now, but the previous way of calling it is now deprecated and will be removed in a later release.

API Changes:

- Renamed ``Canvas.raster(downsample_method=X)`` to ``Canvas.raster(agg=X)``: What gridtools calls "downsampling" is precisely the same concept as what datashader calls aggregation everywhere else. "reduction" is perhaps even more accurate, but arguments like that are called ``agg`` elsewhere, so I've adopted that convention here as well. ``downsample_method`` is still accepted as an alias for now, if ``agg`` is not present.

- Renamed ``Canvas.raster(upsample_method=X)`` to ``Canvas.raster(interpolate=X)``: "interpolate" seems like a better complement to "agg" than "upsample_method". ``upsample_method`` is still accepted as an alias for now, if ``interpolate`` is not present.

- The ``agg`` argument of other calls accepts any object of type ``.reductions.Reduction``, but ``raster(downsample_method=...)`` accepted only string arguments. It now accepts ``Reduction`` objects like ``rd.mean()``, extracting the "column" name, if any, and comparing it to the DataArray's name, if any (signaling an error if a "column" name is specified but doesn't match). Of course, it's not really a "column", but it's the same idea. The column name need not be provided to the agg, but if it is, it must match any name declared for the DataArray. String names are still accepted, for backwards compatibility, but will eventually be removed.

- Reduction has been changed to allow instantiation without any argument, for use with unnamed DataArrays. This change may make some messages for user errors more confusing, but additional checks have been added to alleviate that.

- Stub reduction functions have been added for three aggregations supported by ``Canvas.raster`` but not previously available in ``.reductions``: ``mode``, ``first``, and ``last``. All three are designed for use with categorical data where numerical averaging is not appropriate and an actual existing value must be returned. For now, these work only with raster, but at least ``first`` and ``last`` should be easy to implement for other glyph types. (``mode`` is more complicated because it would require unbounded buffers per pixel to hold all distinct values encountered.)

- ``Canvas.raster`` now accepts xarray Datasets (collections of aligned DataArrays), with the column argument to each reduction selecting the appropriate DataArray from the Dataset.

- Made the interpolation support in ``Canvas.trimesh`` match that of ``Canvas.raster``, using a string argument ``interpolate`` instead of a Boolean ``interp``. The Boolean is still accepted for now, but will be deleted before release once HoloViews and GeoViews master have been updated.

- The ``layer`` argument of ``Canvas.raster`` was previously, confusingly, a *1-based* integer index, but it is now an xarray coordinate. Xarray coordinates support 0-based, 1-based, or arbitrary floating-point indexing depending on how the DataArray was declared, so the behavior should be the same for arrays explicitly declared with 1-based indexing (such as multi-band Landsat images indexed with integers), but in other cases the proper coordinate will now need to be supplied.
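
The alias and name-matching rules described above for ``agg`` can be illustrated with a small standalone sketch. It does not import datashader: ``Reduction``, ``mean``, and ``mode`` below are hypothetical stand-ins for the classes in ``.reductions``, and ``resolve_agg`` is an illustrative helper rather than a function in the library. It only mirrors the order of checks in this commit, under those assumptions.

```python
class Reduction:
    """Hypothetical stand-in for datashader's Reduction base class."""
    def __init__(self, column=None):
        self.column = column


class mean(Reduction):
    pass


class mode(Reduction):
    pass


# Both legacy string names and Reduction classes map to a method name.
METHODS = {'mean': 'mean', mean: 'mean',
           'mode': 'mode', mode: 'mode'}


def resolve_agg(source_name, agg=None, downsample_method=mean()):
    """Return the resampling method name for an array called source_name."""
    # The deprecated keyword is used only when agg is not supplied.
    if agg is None:
        agg = downsample_method

    if isinstance(agg, Reduction):
        agg, column = type(agg), agg.column
        # A column named on the reduction must match the array's name.
        if column is not None and source_name != column:
            raise ValueError('DataArray name %r does not match supplied '
                             'reduction %s(%r).'
                             % (source_name, agg.__name__, column))

    if agg not in METHODS:
        raise ValueError('Invalid aggregation method: %r' % (agg,))
    return METHODS[agg]
```

Under these assumptions, ``resolve_agg('elevation', agg=mode())`` and ``resolve_agg('elevation', downsample_method='mode')`` select the same method, while ``resolve_agg('elevation', agg=mean('slope'))`` raises a ``ValueError`` because the names differ.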
holoviz · Jan 26, 2018 · 37aa852
1 parent 686eac6
Showing 7 changed files with 327 additions and 91 deletions.
diff --git a/datashader/core.py b/datashader/core.py
@@ -4,29 +4,14 @@
 import pandas as pd
 import dask.dataframe as dd
 from dask.array import Array
-from xarray import DataArray
+from xarray import DataArray, Dataset
 from collections import OrderedDict
 
 from .utils import Dispatcher, ngjit, calc_res, calc_bbox, orient_array, compute_coords, get_indices, dshape_from_pandas, dshape_from_dask, categorical_in_dtypes
-from .resampling import (resample_2d, US_NEAREST, US_LINEAR, DS_FIRST, DS_LAST,
-                         DS_MEAN, DS_MODE, DS_VAR, DS_STD, DS_MIN, DS_MAX)
+from .resampling import resample_2d
+from .utils import Expr # noqa (API import)
 
-
-class Expr(object):
-    """Base class for expression-like objects.
-
-    Implements hashing and equality checks. Subclasses should implement an
-    ``inputs`` attribute/property, containing a tuple of everything that fully
-    defines that expression.
-    """
-    def __hash__(self):
-        return hash((type(self), self.inputs))
-
-    def __eq__(self, other):
-        return type(self) is type(other) and self.inputs == other.inputs
-
-    def __ne__(self, other):
-        return not self == other
+from . import reductions as rd
 
 
 class Axis(object):
@@ -201,7 +186,7 @@ def line(self, source, x, y, agg=None):
             agg = any_rdn()
         return bypixel(source, self, Line(x, y), agg)
 
-    def trimesh(self, vertices, simplices, mesh=None, agg=None, interp=True):
+    def trimesh(self, vertices, simplices, mesh=None, agg=None, interp=True, interpolate=None):
         """Compute a reduction by pixel, mapping data to pixels as a triangle.
 
         >>> import datashader as ds
@@ -240,17 +225,29 @@ def trimesh(self, vertices, simplices, mesh=None, agg=None, interp=True):
             purposes. This dataframe is expected to have come from
             ``datashader.utils.mesh()``. If this argument is not None, the first
             two arguments are ignored.
-        interp : boolean, optional
-            Specify whether to do bilinear interpolation of the pixels within each
-            triangle. This can be thought of as a "weighted average" of the vertex
-            values. Defaults to True.
+        interpolate : str, optional default=linear
+            Method to use for interpolation between specified values. ``nearest``
+            means to use a single value for the whole triangle, and ``linear``
+            means to do bilinear interpolation of the pixels within each
+            triangle (a weighted average of the vertex values). For 
+            backwards compatibility, also accepts ``interp=True`` for ``linear``
+            and ``interp=False`` for ``nearest``.
         """
         from .glyphs import Triangles
         from .reductions import mean as mean_rdn
         from .utils import mesh as create_mesh
 
         source = mesh
 
+        # 'interp' argument is deprecated as of datashader=0.6.4
+        if interpolate is not None:
+            if interpolate == 'linear':
+                interp = True
+            elif interpolate == 'nearest':
+                interp = False
+            else:
+                raise ValueError('Invalid interpolate method: options include {}'.format(['linear','nearest']))
+
         # Validation is done inside the [pd]d_mesh utility functions
         if source is None:
             source = create_mesh(vertices, simplices)
@@ -274,9 +271,11 @@ def trimesh(self, vertices, simplices, mesh=None, agg=None, interp=True):
     def raster(self,
                source,
                layer=None,
-               upsample_method='linear',
-               downsample_method='mean',
-               nan_value=None):
+               upsample_method='linear',    # Deprecated as of datashader=0.6.4
+               downsample_method=rd.mean(), # Deprecated as of datashader=0.6.4
+               nan_value=None,
+               agg=None,
+               interpolate=None):
         """Sample a raster dataset by canvas size and bounds.
 
         Handles 2D or 3D xarray DataArrays, assuming that the last two
@@ -291,16 +290,18 @@ def raster(self,
 
         Parameters
         ----------
-        source : xarray.DataArray
-            input datasource most likely obtain from `xr.open_rasterio()`.
-        layer : int
-            source layer number : optional default=None
-        upsample_method : str, optional default=linear
-            resample mode when upsampling raster.
+        source : xarray.DataArray or xr.Dataset
+            2D or 3D labelled array (if Dataset, the agg reduction must
+            define the data variable).
+        layer : float
+            For a 3D array, value along the z dimension : optional default=None
+        interpolate : str, optional  default=linear
+            Resampling mode when upsampling raster.
             options include: nearest, linear.
-        downsample_method : str, optional default=mean
-            resample mode when downsampling raster.
-            options include: first, last, mean, mode, var, std
+        agg : Reduction, optional default=mean()
+            Resampling mode when downsampling raster.
+            options include: first, last, mean, mode, var, std, min, max
+            Also accepts string names, for backwards compatibility.
         nan_value : int or float, optional
             Optional nan_value which will be masked out when applying
             the resampling.
@@ -310,28 +311,66 @@ def raster(self,
         data : xarray.Dataset
 
         """
-        upsample_methods = dict(nearest=US_NEAREST,
-                                linear=US_LINEAR)
-
-        downsample_methods = dict(first=DS_FIRST,
-                                  last=DS_LAST,
-                                  mean=DS_MEAN,
-                                  mode=DS_MODE,
-                                  var=DS_VAR,
-                                  std=DS_STD,
-                                  min=DS_MIN,
-                                  max=DS_MAX)
-
-        if upsample_method not in upsample_methods.keys():
-            raise ValueError('Invalid upsample method: options include {}'.format(list(upsample_methods.keys())))
-        if downsample_method not in downsample_methods.keys():
-            raise ValueError('Invalid downsample method: options include {}'.format(list(downsample_methods.keys())))
+        # For backwards compatibility
+        if agg         is None: agg=downsample_method
+        if interpolate is None: interpolate=upsample_method
+
+        upsample_methods = ['nearest','linear']
+
+        downsample_methods = {'first':'first', rd.first:'first',
+                              'last':'last',   rd.last:'last',
+                              'mode':'mode',   rd.mode:'mode',
+                              'mean':'mean',   rd.mean:'mean',
+                              'var':'var',     rd.var:'var',
+                              'std':'std',     rd.std:'std',
+                              'min':'min',     rd.min:'min',
+                              'max':'max',     rd.max:'max'}
+
+        if interpolate not in upsample_methods:
+            raise ValueError('Invalid interpolate method: options include {}'.format(upsample_methods))
+
+        if not isinstance(source, (DataArray, Dataset)):
+            raise ValueError('Expected xarray DataArray or Dataset as '
+                             'the data source, found %s.'
+                             % type(source).__name__)
+
+        column = None
+        if isinstance(agg, rd.Reduction):
+            agg, column = type(agg), agg.column
+            if (isinstance(source, DataArray) and column is not None
+                and source.name != column):
+                agg_repr = '%s(%r)' % (agg.__name__, column)
+                raise ValueError('DataArray name %r does not match '
+                                 'supplied reduction %s.' %
+                                 (source.name, agg_repr))
+
+        if isinstance(source, Dataset):
+            data_vars = list(source.data_vars)
+            if column is None:
+                raise ValueError('When supplying a Dataset the agg reduction '
+                                 'must specify the variable to aggregate. '
+                                 'Available data_vars include: %r.' % data_vars)
+            elif column not in source.data_vars:
+                raise KeyError('Supplied reduction column %r not found '
+                               'in Dataset, expected one of the following '
+                               'data variables: %r.' % (column, data_vars))
+            source = source[column]
+
+        if agg not in downsample_methods.keys():
+            raise ValueError('Invalid aggregation method: options include {}'.format(list(downsample_methods.keys())))
+        ds_method = downsample_methods[agg]
+
+        if source.ndim not in [2, 3]:
+            raise ValueError('Raster aggregation expects a 2D or 3D '
+                             'DataArray, found %s dimensions' % source.ndim)
 
         res = calc_res(source)
         ydim, xdim = source.dims[-2:]
         xvals, yvals = source[xdim].values, source[ydim].values
         left, bottom, right, top = calc_bbox(xvals, yvals, res)
-        array = orient_array(source, res, layer)
+        if layer is not None:
+            source=source.sel(**{source.dims[0]: layer})
+        array = orient_array(source, res)
         dtype = array.dtype
 
         if nan_value is not None:
@@ -354,26 +393,26 @@ def raster(self,
         height_ratio = (ymax - ymin) / (self.y_range[1] - self.y_range[0])
 
         if np.isclose(width_ratio, 0) or np.isclose(height_ratio, 0):
-            raise ValueError('Canvas x_range or y_range values do not match closely-enough with the data source to be able to accurately rasterize. Please provide ranges that are more accurate.')
+            raise ValueError('Canvas x_range or y_range values do not match closely enough with the data source to be able to accurately rasterize. Please provide ranges that are more accurate.')
 
         w = int(np.ceil(self.plot_width * width_ratio))
         h = int(np.ceil(self.plot_height * height_ratio))
         cmin, cmax = get_indices(xmin, xmax, xvals, res[0])
         rmin, rmax = get_indices(ymin, ymax, yvals, res[1])
 
-        kwargs = dict(w=w, h=h, ds_method=downsample_methods[downsample_method],
-                      us_method=upsample_methods[upsample_method], fill_value=fill_value)
+        kwargs = dict(w=w, h=h, ds_method=ds_method,
+                      us_method=interpolate, fill_value=fill_value)
         if array.ndim == 2:
             source_window = array[rmin:rmax+1, cmin:cmax+1]
             if isinstance(source_window, Array):
                 source_window = source_window.compute()
-            if downsample_method in ['var', 'std']:
+            if ds_method in ['var', 'std']:
                 source_window = source_window.astype('f')
             data = resample_2d(source_window, **kwargs)
             layers = 1
         else:
             source_window = array[:, rmin:rmax+1, cmin:cmax+1]
-            if downsample_method in ['var', 'std']:
+            if ds_method in ['var', 'std']:
                 source_window = source_window.astype('f')
             arrays = []
             for arr in source_window:
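
The Dataset handling added to ``raster`` in the core.py changes above can be sketched in isolation. This is a minimal illustration, assuming a plain ``dict`` of variable name to array as a stand-in for an xarray Dataset; ``select_variable`` is a hypothetical helper, not part of datashader. It mirrors the two error cases in the diff: no column on the reduction, and a column that is not among the data variables.

```python
def select_variable(data_vars, column):
    """Pick one variable from data_vars (a dict standing in for a Dataset)."""
    names = list(data_vars)
    # A Dataset holds several arrays, so the reduction must name one.
    if column is None:
        raise ValueError('When supplying a Dataset the agg reduction '
                         'must specify the variable to aggregate. '
                         'Available data_vars include: %r.' % names)
    # The named variable must actually exist in the collection.
    if column not in data_vars:
        raise KeyError('Supplied reduction column %r not found in Dataset, '
                       'expected one of the following data variables: %r.'
                       % (column, names))
    return data_vars[column]


# Hypothetical two-band collection used only for illustration.
bands = {'red': [[1, 2], [3, 4]], 'nir': [[5, 6], [7, 8]]}
nir = select_variable(bands, 'nir')
```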

diff --git a/datashader/glyphs.py b/datashader/glyphs.py
@@ -3,8 +3,7 @@
 from toolz import memoize
 import numpy as np
 
-from .core import Expr
-from .utils import ngjit, isreal
+from .utils import ngjit, isreal, Expr
 
 
 class Glyph(Expr):

diff --git a/datashader/reductions.py b/datashader/reductions.py
@@ -6,8 +6,7 @@
 from toolz import concat, unique
 import xarray as xr
 
-from .core import Expr
-from .utils import ngjit
+from .utils import Expr, ngjit
 
 
 class Preprocess(Expr):
@@ -34,10 +33,12 @@ def apply(self, df):
 
 class Reduction(Expr):
     """Base class for per-bin reductions."""
-    def __init__(self, column):
+    def __init__(self, column=None):
         self.column = column
 
     def validate(self, in_dshape):
+        if not self.column in in_dshape.dict:
+            raise ValueError("specified column not found")
         if not isnumeric(in_dshape.measure[self.column]):
             raise ValueError("input must be numeric")
 
@@ -76,7 +77,7 @@ def __init__(self, column=None):
 
     @property
     def inputs(self):
-        return (extract(self.column),) if self.column else ()
+        return (extract(self.column),) if self.column is not None else ()
 
     def validate(self, in_dshape):
         pass
@@ -382,6 +383,111 @@ def _finalize(bases, **kwargs):
         return xr.DataArray(x, **kwargs)
 
 
+class first(Reduction):
+    """First value encountered in ``column``.
+
+    Useful for categorical data where an actual value must always be returned, 
+    not an average or other numerical calculation.
+    
+    Currently only supported for rasters, externally to this class.
+
+    Parameters
+    ----------
+    column : str
+        Name of the column to aggregate over. If the data type is floating point, 
+        ``NaN`` values in the column are skipped.
+    """
+    _dshape = dshape(Option(ct.float64))
+
+    @staticmethod 
+    def _append(x, y, agg):
+        raise NotImplementedError("first is currently implemented only for rasters")
+
+    @staticmethod 
+    def _create(shape):
+        raise NotImplementedError("first is currently implemented only for rasters")
+
+    @staticmethod
+    def _combine(aggs):
+        raise NotImplementedError("first is currently implemented only for rasters")
+
+    @staticmethod
+    def _finalize(bases, **kwargs):
+        raise NotImplementedError("first is currently implemented only for rasters")
+
+
+
+class last(Reduction):
+    """Last value encountered in ``column``.
+
+    Useful for categorical data where an actual value must always be returned, 
+    not an average or other numerical calculation.
+    
+    Currently only supported for rasters, externally to this class.
+
+    Parameters
+    ----------
+    column : str
+        Name of the column to aggregate over. If the data type is floating point, 
+        ``NaN`` values in the column are skipped.
+    """
+    _dshape = dshape(Option(ct.float64))
+
+    @staticmethod 
+    def _append(x, y, agg):
+        raise NotImplementedError("last is currently implemented only for rasters")
+
+    @staticmethod 
+    def _create(shape):
+        raise NotImplementedError("last is currently implemented only for rasters")
+
+    @staticmethod
+    def _combine(aggs):
+        raise NotImplementedError("last is currently implemented only for rasters")
+
+    @staticmethod
+    def _finalize(bases, **kwargs):
+        raise NotImplementedError("last is currently implemented only for rasters")
+
+
+
+class mode(Reduction):
+    """Mode (most common value) of all the values encountered in ``column``.
+
+    Useful for categorical data where an actual value must always be returned, 
+    not an average or other numerical calculation.
+    
+    Currently only supported for rasters, externally to this class.
+    Implementing it for other glyph types would be difficult due to potentially
+    unbounded data storage requirements to store indefinite point or line
+    data per pixel.
+
+    Parameters
+    ----------
+    column : str
+        Name of the column to aggregate over. If the data type is floating point, 
+        ``NaN`` values in the column are skipped.
+    """
+    _dshape = dshape(Option(ct.float64))
+
+    @staticmethod 
+    def _append(x, y, agg):
+        raise NotImplementedError("mode is currently implemented only for rasters")
+
+    @staticmethod 
+    def _create(shape):
+        raise NotImplementedError("mode is currently implemented only for rasters")
+
+    @staticmethod
+    def _combine(aggs):
+        raise NotImplementedError("mode is currently implemented only for rasters")
+
+    @staticmethod
+    def _finalize(bases, **kwargs):
+        raise NotImplementedError("mode is currently implemented only for rasters")
+
+
+
 class summary(Expr):
     """A collection of named reductions.