# AnnData-Dask Operations Test
A collection of example operations on AnnData objects and their constituent members, demonstrating what works and what doesn't when an AnnData is loaded with Dask DataFrames and Arrays.

* Currently only the `X`, `obs`, and `var` members are loaded in Dask mode
* Three versions of loading from the same `.h5ad` are compared:
  * `backed` mode (out-of-core, lazy pointer to an `.h5ad` on disk)
  * `dask` mode (members are lazy Dask objects)
  * "memory" mode (default, eager in-memory AnnData, made of Pandas DataFrames and Numpy ndarrays)

- [Setup](#Setup)
- [Basic `X` operations](#Basic,-working-operations)
- [`AnnData.iloc` fails](#AnnData.iloc-fails)
- [`obs`](#obs)

## Setup

In [24]:
from pathlib import Path  # needed for the path arithmetic below
import anndata
path = Path(anndata.__file__).parent.parent / 'old.h5ad'
assert path.exists()

from anndata import read_h5ad
ad_bak = b = read_h5ad(path, backed='r', dask=False)  # "backed" mode
ad_dsk = d = read_h5ad(path, backed='r', dask= True)  # Dask
ad_mem = m = read_h5ad(path)                          # in-memory

read_dataframe_legacy: <HDF5 dataset "obs": shape (100,), type "|V25"> (dask False)
read_dataframe_legacy: <HDF5 dataset "var": shape (200,), type "|V24"> (dask False)
Calling ctor: ['filename', 'filemode', 'obs', 'var', 'raw', 'dtype', 'dask']
_init_as_actual: X is None…
read_dataframe_legacy: <HDF5 dataset "obs": shape (100,), type "|V25"> (dask True)
read_dataframe_legacy: <HDF5 dataset "var": shape (200,), type "|V24"> (dask True)
Calling ctor: ['filename', 'filemode', 'obs', 'var', 'raw', 'dtype', 'dask']
_init_as_actual: X is None…
read_dataframe_legacy: <HDF5 dataset "obs": shape (100,), type "|V25"> (dask False)
read_dataframe_legacy: <HDF5 dataset "var": shape (200,), type "|V24"> (dask False)


In [25]:
from anndata.tests.utils.eq import try_eq, normalize as norm
from inspect import getfullargspec
from sys import stderr

def cmp(fn):
    '''Compare the results of running a given function on (in-memory, "backed", and "dask") AnnData.
    
    - Returns the three results, as well as pairwise comparisons between them.
    - Exceptions in evaluating the `fn` are caught and returned in place of a value.
    - A comparison is skipped if either side raised instead of returning a value.
    '''

    [arg] = getfullargspec(fn).args
    if arg != 'a':
        _fn = lambda a: fn(getattr(a, arg))
    else:
        _fn = fn

    def _try(ad):
        '''Try applying `fn` to a given AnnData
        
        Return a 2-tuple containing [the value] xor [an Exception that was raised]'''
        try:
            return (_fn(ad), None)
        except Exception as e:
            return (None, e)

    bak_val, bak_err = _try(ad_bak)
    dsk_val, dsk_err = _try(ad_dsk)
    mem_val, mem_err = _try(ad_mem)
    
    obj = {}

    def add_val_err(*args):
        for k, val, err in args:
            if err:
                stderr.write(f'{k}.err: {err}\n')
                obj[k] = { 'err': err }
            else:
                obj[k] = { 'val': val }
    
    add_val_err(
        ('bak', bak_val, bak_err),
        ('dsk', dsk_val, dsk_err),
        ('mem', mem_val, mem_err),
    )
    
    def diff(*args):
        for k, a_val, a_err, b_val, b_err in args:
            result = \
                None \
                if a_err or b_err \
                else try_eq(
                    norm(a_val), 
                    norm(b_val),
                )

            if result:
                stderr.write(f'{k}: {result}\n')
                obj[k] = result


    diff(
        ('bak_vs_mem', bak_val, bak_err, mem_val, mem_err),
        ('bak_vs_dsk', bak_val, bak_err, dsk_val, dsk_err),
        ('mem_vs_dsk', mem_val, mem_err, dsk_val, dsk_err),
    )

    return obj

def check(*fns):
    if len(fns) == 1:
        (fn,) = fns
        o = cmp(fn)
        if 'err' in o['dsk']: raise o['dsk']['err']
        if any([
            k in o and o[k]
            for o, k in [
                (o,'bak_vs_mem'),
                (o,'bak_vs_dsk'),
                (o,'mem_vs_dsk'),
                (o['dsk'],'err'),
            ]
        ]):
            return o
        else:
            return None
    
    results = [ check(fn) for fn in fns ]
    if all([ result is None for result in results ]):
        return None
    else:
        return results

## Basic, working operations

### Basic load: `X`, `obs`, `var`

In [26]:
check(
    lambda X:X,
    lambda obs:obs,
    lambda var:var,
)

### `X`: integer/range slicing
…along either/both dimensions:

In [27]:
check(
    lambda X:X[:10,   ],
    lambda X:X[:10,:  ],
    lambda X:X[:10, 10],
    lambda X:X[:  , 10],
    lambda X:X[ 10, 10],
    lambda X:X[:10,:10],
    lambda X:X[:  ,:10],
    lambda X:X[ 10,:10],
)

### Arithmetic on `X`
`backed` mode fails here because its `X` is a `SparseDataset` (a lazy wrapper around an `spmatrix` backed by an HDF5 `Group`).

`SparseDataset` supports slicing and exposes an in-memory `spmatrix` via `.value`, but otherwise doesn't support direct array manipulation.

In [28]:
check(lambda X:X*2)

bak.err: unsupported operand type(s) for *: 'SparseDataset' and 'int'
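
The shape of this failure can be reproduced with a toy class. The sketch below is not anndata's `SparseDataset`: `LazyWrapper` is a hypothetical stand-in that, like the description above, supports slicing and a `.value` accessor but defines no arithmetic operators, so Python itself raises the `TypeError`.

```python
class LazyWrapper:
    """Toy stand-in for a slice-only dataset wrapper (not anndata's class)."""

    def __init__(self, data):
        self._data = data

    def __getitem__(self, key):
        # Slicing is delegated to the wrapped data.
        return self._data[key]

    @property
    def value(self):
        # Hand back the full in-memory object.
        return self._data


w = LazyWrapper([1, 2, 3])
sliced = w[:2]  # slicing works

try:
    w * 2  # no __mul__ is defined, so Python raises TypeError
    message = None
except TypeError as e:
    message = str(e)

# Operating on the materialized value works instead.
doubled = [x * 2 for x in w.value]
print(message)
```

By analogy, arithmetic in `backed` mode would presumably need to go through `.value` (or a slice) rather than through the wrapper itself.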


## `AnnData.iloc` fails

### Single row

In [29]:
check(lambda a:a[10])

dsk.err: 'DataFrame.iloc' only supports selecting columns. It must be used like 'df.iloc[:, column_indexer]'.


NotImplementedError: 'DataFrame.iloc' only supports selecting columns. It must be used like 'df.iloc[:, column_indexer]'.

### Row range

In [30]:
check(lambda a:a[:10])

dsk.err: 'DataFrame.iloc' only supports selecting columns. It must be used like 'df.iloc[:, column_indexer]'.


NotImplementedError: 'DataFrame.iloc' only supports selecting columns. It must be used like 'df.iloc[:, column_indexer]'.

### Single column

In [31]:
check(lambda a:a[:,10])

dsk.err: 'DataFrame.iloc' only supports selecting columns. It must be used like 'df.iloc[:, column_indexer]'.


NotImplementedError: 'DataFrame.iloc' only supports selecting columns. It must be used like 'df.iloc[:, column_indexer]'.

### Column range

In [32]:
check(lambda a:a[:,:10])

dsk.err: 'DataFrame.iloc' only supports selecting columns. It must be used like 'df.iloc[:, column_indexer]'.


NotImplementedError: 'DataFrame.iloc' only supports selecting columns. It must be used like 'df.iloc[:, column_indexer]'.

## `obs`

### `.loc`

Selecting columns across all rows works:

In [33]:
check(
    lambda obs:obs.loc[:,'Prime'],
    lambda obs:obs.loc[:,['Prime']],
    lambda obs:obs.loc[:,['label','Prime']],
    lambda obs:obs.loc[:,:],
)

Dask leaves an extra dimension in several `DDF.loc` calls (i.e. it returns a 1-row DataFrame instead of a Series, or a 1-element Series instead of a scalar):

In [34]:
check(lambda obs:obs.loc['2'])

bak_vs_dsk: Series Expected type <class 'pandas.core.series.Series'>, found <class 'pandas.core.frame.DataFrame'> instead
mem_vs_dsk: Series Expected type <class 'pandas.core.series.Series'>, found <class 'pandas.core.frame.DataFrame'> instead


{'bak': {'val': label    row 2
  idx²         4
  Prime     True
  Name: 2, dtype: object},
 'dsk': {'val': Dask DataFrame Structure:
                  label   idx² Prime
  npartitions=1                     
                 object  int64  bool
                    ...    ...   ...
  Dask Name: try_loc, 3 tasks},
 'mem': {'val': label    row 2
  idx²         4
  Prime     True
  Name: 2, dtype: object},
 'bak_vs_dsk': AssertionError("Series Expected type <class 'pandas.core.series.Series'>, found <class 'pandas.core.frame.DataFrame'> instead"),
 'mem_vs_dsk': AssertionError("Series Expected type <class 'pandas.core.series.Series'>, found <class 'pandas.core.frame.DataFrame'> instead")}

In [35]:
check(lambda obs:obs.loc['2','Prime'])

bak_vs_dsk: Mismatched types: <class 'numpy.bool_'> vs. <class 'pandas.core.series.Series'>
mem_vs_dsk: Mismatched types: <class 'numpy.bool_'> vs. <class 'pandas.core.series.Series'>


{'bak': {'val': True},
 'dsk': {'val': Dask Series Structure:
  npartitions=1
      bool
       ...
  Name: Prime, dtype: bool
  Dask Name: try_loc, 3 tasks},
 'mem': {'val': True},
 'bak_vs_dsk': AssertionError("Mismatched types: <class 'numpy.bool_'> vs. <class 'pandas.core.series.Series'>"),
 'mem_vs_dsk': AssertionError("Mismatched types: <class 'numpy.bool_'> vs. <class 'pandas.core.series.Series'>")}
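
One way to reconcile these shapes, once a Dask result has been computed into pandas, might be `squeeze`. The sketch below uses plain pandas with a made-up `obs` table (the values are illustrative, not the contents of `old.h5ad`) and assumes the computed result is a one-row DataFrame or a one-element Series. A list indexer is used only to mimic the extra dimension.

```python
import pandas as pd

# Made-up stand-in for `obs`; only the column names follow the notebook.
obs = pd.DataFrame(
    {"label": ["row 0", "row 1", "row 2"], "Prime": [False, False, True]},
    index=["0", "1", "2"],
)

# A list indexer keeps the row dimension, mimicking the extra dimension above.
one_row_frame = obs.loc[["2"]]              # 1-row DataFrame
one_elem_series = obs.loc[["2"], "Prime"]   # 1-element Series

# `squeeze` drops length-1 axes, recovering the usual pandas shapes.
row = one_row_frame.squeeze(axis=0)   # Series, like obs.loc["2"]
value = one_elem_series.squeeze()     # scalar, like obs.loc["2", "Prime"]

print(type(row).__name__, type(value).__name__)
```

Whether the comparison helpers should apply such a normalization, or whether the difference should be reported as-is, is a design choice this sketch leaves open.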

Indexing rows with a list of labels is unsupported in Dask mode:

In [36]:
check(lambda obs:obs.loc[['2']])

dsk.err: 'Cannot index with list against unknown division'


KeyError: 'Cannot index with list against unknown division'

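Indexing rows with an integer label raises a `TypeError` in backed and in-memory modes (the index holds strings), but Dask mode returns a lazy DataFrame without raising: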

In [14]:
cmp(lambda obs:obs.loc[2])

bak.err: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [2] of <class 'int'>
mem.err: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [2] of <class 'int'>


{'bak': {'err': TypeError("cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [2] of <class 'int'>")},
 'dsk': {'val': Dask DataFrame Structure:
                  label   idx² Prime
  npartitions=1                     
                 object  int64  bool
                    ...    ...   ...
  Dask Name: try_loc, 3 tasks},
 'mem': {'err': TypeError("cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [2] of <class 'int'>")}}