[RFC]: add APIs for array equality to a scalar #985

@hmaarrfk

Today, there are a few cases where people might want to check whether an array is equal to a scalar.

import numpy as np

is_all_zero = np.all(arr == 0)

However, this code is terribly inefficient for out-of-memory arrays: it can force them to read through their entire contents.

I think we can do much better by defining operations that check for

  • All equal
  • All not equal

and those can be implemented in a more "streamed" fashion that allows the underlying implementation to avoid creating a full boolean array for arr == 0.
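As a rough illustration of the "streamed" idea, here is a minimal sketch in plain NumPy. The names `all_equal_scalar`, `all_not_equal_scalar` and `chunk_elems` are hypothetical and not part of any existing API. It compares one block at a time and stops at the first block that settles the answer, so the temporary boolean array stays small.

```python
import numpy as np


def all_equal_scalar(arr, value, chunk_elems=1 << 20):
    """Return True if every element of `arr` equals `value`.

    Hypothetical helper: compares one block at a time and stops at the
    first mismatch, so the temporary boolean array never holds more than
    `chunk_elems` entries.
    """
    flat = arr.reshape(-1)  # a view for contiguous input, a copy otherwise
    for start in range(0, flat.size, chunk_elems):
        if not np.all(flat[start:start + chunk_elems] == value):
            return False  # early exit: the remaining blocks are never read
    return True


def all_not_equal_scalar(arr, value, chunk_elems=1 << 20):
    """Return True if no element of `arr` equals `value` (hypothetical helper)."""
    flat = arr.reshape(-1)
    for start in range(0, flat.size, chunk_elems):
        if np.any(flat[start:start + chunk_elems] == value):
            return False  # early exit on the first match
    return True
```

An array library could presumably do better than this by walking its own storage chunks directly. The sketch only shows that the full `arr == 0` intermediate is not needed.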

Second, I would like to propose an API for "likely not equal", where the strict inequality is not guaranteed.

For data compression, we might just be interested in learning whether the dataset is worth compressing or not.
Zarr, for example, does this:
zarr-developers/zarr-python#3627

For example, consider the task of compressing an 800 MB array.
Zarr today:

  1. Checks for equality to zero
  2. If all zeros, it skips things
  3. If not zeros, it compresses things

But if you have an out-of-memory data array, it may be hard to guarantee that things are not all zero. The check likely doesn't matter, though, since modern compression algorithms can compress all zeros efficiently.

>>> import numpy as np
>>> import numcodecs
>>> a = np.zeros((100, 1024, 1024), dtype='float64')
>>> len(numcodecs.blosc.compress(a, b'zstd', 7))
67216

On my computer, the equality check takes time of roughly the same order of magnitude as the compression itself:

In [14]: %time numcodecs.blosc.compress(a, b'zstd', 7);
CPU times: user 2.04 s, sys: 903 μs, total: 2.04 s
Wall time: 266 ms

In [15]: %time np.all(a==0 );
CPU times: user 21.4 ms, sys: 109 ms, total: 130 ms
Wall time: 140 ms

So if zarr misses one of my images because I think it has a non-zero element, it's likely not the end of the world in terms of system performance. But if I have to "guarantee that the images are non-zero", that can be a costly operation that can't easily be done through my implementation "without checking every single element".
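One possible reading of the "likely not equal" contract, sketched in plain NumPy: the check may answer "not equal" without proof, but it only answers "equal" when that has actually been verified. The function name `likely_not_equal` and the `max_elems` budget are assumptions made for illustration, not a proposed signature.

```python
import numpy as np


def likely_not_equal(arr, value, max_elems=1 << 16):
    """Hypothetical one-sided check with a bounded amount of work.

    False -> every element was verified to equal `value`.
    True  -> a mismatch was found, or the budget ran out before the
             whole array could be verified.
    """
    flat = arr.reshape(-1)  # a view for contiguous input, a copy otherwise
    head = flat[:max_elems]
    if np.any(head != value):
        return True   # proven: at least one element differs
    if flat.size > max_elems:
        return True   # unverified: conservatively report "likely not equal"
    return False      # fully verified: all elements equal `value`
```

With this shape of contract, a caller that skips storage only on a `False` result never drops real data. The worst case is that an all-zero array gets compressed anyway.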

For now, without these APIs, zarr does something like:

  1. Create an empty array
  2. Check for equality with that empty array
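The two steps above can be sketched in plain NumPy as follows. This is not zarr's code. It assumes that "empty" means an array filled with the fill value (zero here), and `chunk_equals_fill` is a hypothetical name. The point is only that the comparison allocates a second full-size array and scans both.

```python
import numpy as np


def chunk_equals_fill(chunk, fill_value=0):
    """Hypothetical illustration of comparing a chunk against a reference array."""
    # Step 1: build a reference array with the same shape and dtype (full-size temporary).
    reference = np.full(chunk.shape, fill_value, dtype=chunk.dtype)
    # Step 2: element-wise equality against that reference.
    return bool(np.array_equal(chunk, reference))
```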

Alternative:

I could likely check numpy metadata like strides and create my own fastpath for this:

>>> zero = np.zeros(1)
>>> a, zero = np.broadcast_arrays(a, zero)
>>> zero.strides
(0, 0, 0)

but this seems more like a hack, and it would implicitly redefine the "equality operation" as "likely equal", which isn't correct.
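For concreteness, a stride-based fast path of the kind described above might look like the sketch below. The helper names `is_broadcast_scalar` and `all_equal_fastpath` are hypothetical. It only recognizes plain NumPy views in which every axis longer than 1 has stride 0, and falls back to a full scan otherwise.

```python
import numpy as np


def is_broadcast_scalar(arr):
    """True if every axis longer than 1 has stride 0, i.e. one stored element."""
    return all(s == 0 for s, n in zip(arr.strides, arr.shape) if n > 1)


def all_equal_fastpath(arr, value):
    """Hypothetical fast path: compare once when the array is a broadcast scalar."""
    if arr.size and is_broadcast_scalar(arr):
        return bool(arr.flat[0] == value)  # all indices alias the same element
    return bool(np.all(arr == value))      # general path: full scan
```

Usage, under the same assumptions:

```python
z = np.broadcast_to(np.zeros(1), (100, 1024, 1024))
all_equal_fastpath(z, 0)  # takes the single-comparison branch
```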

Labels: Needs Discussion, RFC