Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-1569: [C++] Kernel functions for determining monotonicity (ascending or descending) for well-ordered types #11937

Closed
wants to merge 9 commits into from

Conversation

mbrobbel
Copy link
Contributor

@mbrobbel mbrobbel commented Dec 13, 2021

Initially I tried to implement this as a ScalarAggregateFunction (as suggested in the issue), however given that there is no way to express sensitivity to order, it's currently not possible to correctly implement the ScalarAggregator::MergeFrom function. This is now implemented as a VectorFunction.

I'm still working on supporting more types:

  • Booleans
  • Signed and unsigned integer types
  • Floating point types
    • Implementation
    • Tests
  • Temporal types
    • Implementation
    • Tests
  • Decimal types

Todo:

  • Documentation

I'll create follow-up JIRA issues to:

  • Add support for half floats, missing temporal types (INTERVAL_DAY_TIME and INTERVAL_MONTH_DAY_NANO) and decimal types.
  • Add support for String arrays (via lexicographical order):
Input: utf8
[
  "a",
  "b",
  "c"
]

Output: struct<increasing:boolean, strictly_decreasing:boolean, decreasing:boolean, strictly_decreasing:boolean>
{increasing: true, strictly_increasing: true, decreasing: false, strictly_decreasing: false}
  • Implement this function for list arrays with well-ordered element types:
Input: list<uint8>
[
  [1, 2, 3],
  [3, 3, 2],
  [1, 2, 1, 2]
]

Output list<struct<increasing:boolean, strictly_decreasing:boolean, decreasing:boolean, strictly_decreasing:boolean>>
[
  {increasing: true, strictly_increasing: true, decreasing: false, strictly_decreasing: false},
  {increasing: false, strictly_increasing: false, decreasing: true, strictly_decreasing: false},
  {increasing: false, stirctly_increasing: false, decreasing: false, strictly_decreasing: false}
]

cc @bkietz

@github-actions
Copy link

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments for now

cpp/src/arrow/compute/api_vector.h Outdated Show resolved Hide resolved
cpp/src/arrow/compute/api_vector.h Outdated Show resolved Hide resolved
@mbrobbel mbrobbel marked this pull request as ready for review December 16, 2021 09:37
// Approximately equal within some error bound (epsilon).
(options.floating_approximate &&
(fabs(current - next) <=
static_cast<typename DataType::c_type>(options.epsilon))) ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There exists support for floating-point comparisons, maybe you can reuse this here.
Also, this check is not considering sign bit for special cases such as: signed zeros and signed Inf.
Ex. { -0.0, 0.0 } != { 0.0, -0.0 }.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou Should we consider signed zero/Inf here? Not sure if sorting function does it. In any case, consistency is desired and can be resolved in a follow-up JIRA.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorting doesn't, AFAIR. Signed zeros are considered equal, I'm not sure there's any particular reason to deviate from that (what are the use cases for this kernel?).

template <typename DataType>
enable_if_not_floating_point<DataType, bool> isnan(
const util::optional<typename DataType::c_type>& opt) {
return false;
Copy link
Contributor

@edponce edponce Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, isnan() should only be used in floating-point-enabled functions, so maybe you can function overload (using enable-if magic) the code blocks that use isnan() in a more generic manner.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment below suggests a change that would allow you to get rid of the enable_if_not_floating_point<...> isnan() variant.

auto options = IsMonotonicState::Get(ctx);

// Check batch size
if (batch.values.size() != 1) {
Copy link
Contributor

@edponce edponce Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, the number of arguments to a function are validated in the compute layer mechanism. When the IsMonotonic function is registered, it specifies a single input argument, so this should already be guaranteed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Directly invocation of this kernel is not possible through the public API, however internally this function could be invoked directly and skip those checks. @bkietz what do you suggest?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, this check is trivial, it doesn't hurt having it.

cpp/src/arrow/compute/kernels/vector_is_monotonic.cc Outdated Show resolved Hide resolved
// Return early if there are NaNs, zero elements or one element in the array.
// And return early if there are only nulls.
if (array.length() <= 1 || array.null_count() == array.length()) {
if (std::any_of(array.begin(), array.end(), isnan<DataType>)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything in this functions seems general enough to handle most data types, except for isnan. Maybe specialize this code block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would that look like?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After more careful thought, what we want is for only floating-point types to do the std::any() check. You can use TypeTraits for those that have a c_type defined to get a type_id variable which can be checked during runtime.

if (array.length() <= 1 || array.null_count() == array.length()) {
  auto type_id = TypeTraits<DataType>::type_singleton();
  if (!is_floating(type_id) || std::any_of(array.begin(), array.end(), std::isnan)) {
    return IsMonotonicOutput(false, false, false, false, out);
  } else {
  ...
  }
}

is_floating(type_id) is defined here.

P.S. I did not ran this code, but something along these lines should work.

/// Use max value of element type as the value of nulls.
/// Inf for floating point numbers.
USE_MAX_VALUE
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ordering of nulls and NaNs were also discussed in the sorting function. I would expect that IsMonotonic and sorting function are consistent. That is, if a sort operation is performed first, then the corresponding IsMonotonic should result in true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that a sort before invoking this kernel should result in true for the corresponding check. However I feel the null handling variants are a bit confusing: AtStart defines NaN > null and AtEnd defines NaN < null. Also, the sorting kernel can ignore equality, but this kernels considers it to check if values are unique (strictly increasing/decreasing).

I think if we want to allow users to define order of unordered values (both for sorting and this kernel) we need something like this:

bool compare_nulls = false; // default: any null results in false outputs (or error in case of sort)
bool compare_nans = false; // default: any nan results in false outputs (or error in case of sort)

// these are not needed when sorting
bool nulls_equal = false; // when nulls are compared, are they considered equal?
bool nans_equal = false; // when nans are compared, are they considered equal?

// when both nulls and nans are compared
enum Ordering { Less, Equal, Greater }
Ordering nan_compared_with_null; // when comparing nulls and nans, what ordering should be used?

@edponce
Copy link
Contributor

edponce commented Dec 16, 2021

Some general comments:

  • IsMonotonic needs to be consistent with corresponding sorting functions, such that IsMonotonic(input) == (Sort(input) == input).
  • Currently IsMonotonic outputs a struct describing the monotonic properties of the data. What are your thoughts on having a convenience wrapper function that receives FunctionOptions with a single requested monotonic behavior? For example, IsMonotonic(input, MonotonicOptions.StrictlyIncreasing). This would provide a more readable API for client code and bypass the need to unpack the StructScalar output to check the corresponding monotonic behavior of interest.

@mbrobbel
Copy link
Contributor Author

mbrobbel commented Dec 17, 2021

* `IsMonotonic` needs to be consistent with corresponding [sorting functions](https://arrow.apache.org/docs/cpp/compute.html#sorts-and-partitions), such that `IsMonotonic(input) == (Sort(input) == input)`.

I agree.

* Currently `IsMonotonic` outputs a struct describing the monotonic properties of the data. What your thoughts on having a convenience wrapper function that receives `FunctionOptions` with a single requested monotonic behavior? For example, `IsMonotonic(input, MonotonicOptions.StrictlyIncreasing)`. This would provide a more readable API for client code and bypass the need to unpack the `StructScalar` output to check the corresponding monotonic behavior of interest.

I initially set it up like that but @bkietz suggested to output a struct scalar instead (like the min/max kernel).

public:
enum NullHandling {
/// Ignore nulls.
IGNORE_NULLS,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the other enum names, use only IGNORE, since enum NullHandling already specifies this is for nulls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had that initially, but IGNORE caused compilation issues on Windows.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, nevermind.


// Safety:
// - Made sure that the input datum is an array.
const std::shared_ptr<ArrayData>& array_data = input.array();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use auto array_data = input.array().


template <typename DataType>
enable_if_not_floating_point<DataType> IsMonotonicCheck(
const typename DataType::c_type& current, const typename DataType::c_type& next,
Copy link
Contributor

@edponce edponce Dec 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reviewing this PR, I think implementations of IsMonotonicCheck can be categorized as follows based on DataType:

  • have c_type (e.g., primitive numeric, datetime, timestamp)
    • c_type is floating-point
    • c_type is not floating-point
  • do not have c_type (binary, string, intervals)
    • custom implementation for binary and string
    • custom implementation for intervals

There would be at least 4 type-specific implementations of IsMonotonicCheck, where the
the enable_if for this case would be of the form (pseudocode) enable_if_has_c_type and enable_if_not_floating_point

return Status::OK();
}

template <typename DataType>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: ArrowType would be more appropriate than DataType.

}

template <typename DataType>
Status IsMonotonic(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this version of IsMonotonic requires DataType having a c_type, this needs to be guarded with enable_if_has_c_type. You would need to make other versions for binary/string and interval types.

// Return early if there are NaNs, zero elements or one element in the array.
// And return early if there are only nulls.
if (array.length() <= 1 || array.null_count() == array.length()) {
if (std::any_of(array.begin(), array.end(), isnan<DataType>)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After more careful thought, what we want is for only floating-point types to do the std::any() check. You can use TypeTraits for those that have a c_type defined to get a type_id variable which can be checked during runtime.

if (array.length() <= 1 || array.null_count() == array.length()) {
  auto type_id = TypeTraits<DataType>::type_singleton();
  if (!is_floating(type_id) || std::any_of(array.begin(), array.end(), std::isnan)) {
    return IsMonotonicOutput(false, false, false, false, out);
  } else {
  ...
  }
}

is_floating(type_id) is defined here.

P.S. I did not ran this code, but something along these lines should work.

@amol-
Copy link
Contributor

amol- commented Mar 30, 2023

Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍

@amol- amol- closed this Mar 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants