New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-1569: [C++] Kernel functions for determining monotonicity (ascending or descending) for well-ordered types #11937
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few comments for now
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
// Approximately equal within some error bound (epsilon). | ||
(options.floating_approximate && | ||
(fabs(current - next) <= | ||
static_cast<typename DataType::c_type>(options.epsilon))) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There exists support for floating-point comparisons, maybe you can reuse this here.
Also, this check is not considering sign bit for special cases such as: signed zeros and signed Inf.
Ex. { -0.0, 0.0 } != { 0.0, -0.0 }
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou Should we consider signed zero/Inf here? Not sure if sorting function does it. In any case, consistency is desired and can be resolved in a follow-up JIRA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorting doesn't, AFAIR. Signed zeros are considered equal, I'm not sure there's any particular reason to deviate from that (what are the use cases for this kernel?).
template <typename DataType> | ||
enable_if_not_floating_point<DataType, bool> isnan( | ||
const util::optional<typename DataType::c_type>& opt) { | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, isnan()
should only be used in floating-point-enabled functions, so maybe you can function overload (using enable-if magic) the code blocks that use isnan()
in a more generic manner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comment below suggests a change that would allow you to get rid of the enable_if_not_floating_point<...> isnan()
variant.
auto options = IsMonotonicState::Get(ctx); | ||
|
||
// Check batch size | ||
if (batch.values.size() != 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, the number of arguments to a function are validated in the compute layer mechanism. When the IsMonotonic
function is registered, it specifies a single input argument, so this should already be guaranteed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Directly invocation of this kernel is not possible through the public API, however internally this function could be invoked directly and skip those checks. @bkietz what do you suggest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, this check is trivial, it doesn't hurt having it.
// Return early if there are NaNs, zero elements or one element in the array. | ||
// And return early if there are only nulls. | ||
if (array.length() <= 1 || array.null_count() == array.length()) { | ||
if (std::any_of(array.begin(), array.end(), isnan<DataType>)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything in this functions seems general enough to handle most data types, except for isnan
. Maybe specialize this code block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would that look like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After more careful thought, what we want is for only floating-point types to do the std::any()
check. You can use TypeTraits
for those that have a c_type
defined to get a type_id
variable which can be checked during runtime.
if (array.length() <= 1 || array.null_count() == array.length()) {
auto type_id = TypeTraits<DataType>::type_singleton();
if (!is_floating(type_id) || std::any_of(array.begin(), array.end(), std::isnan)) {
return IsMonotonicOutput(false, false, false, false, out);
} else {
...
}
}
is_floating(type_id)
is defined here.
P.S. I did not ran this code, but something along these lines should work.
/// Use max value of element type as the value of nulls. | ||
/// Inf for floating point numbers. | ||
USE_MAX_VALUE | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ordering of nulls and NaNs were also discussed in the sorting function. I would expect that IsMonotonic
and sorting function are consistent. That is, if a sort operation is performed first, then the corresponding IsMonotonic
should result in true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree that a sort before invoking this kernel should result in true for the corresponding check. However I feel the null handling variants are a bit confusing: AtStart
defines NaN > null
and AtEnd
defines NaN < null
. Also, the sorting kernel can ignore equality, but this kernels considers it to check if values are unique (strictly increasing/decreasing).
I think if we want to allow users to define order of unordered values (both for sorting and this kernel) we need something like this:
bool compare_nulls = false; // default: any null results in false outputs (or error in case of sort)
bool compare_nans = false; // default: any nan results in false outputs (or error in case of sort)
// these are not needed when sorting
bool nulls_equal = false; // when nulls are compared, are they considered equal?
bool nans_equal = false; // when nans are compared, are they considered equal?
// when both nulls and nans are compared
enum Ordering { Less, Equal, Greater }
Ordering nan_compared_with_null; // when comparing nulls and nans, what ordering should be used?
Some general comments:
|
I agree.
I initially set it up like that but @bkietz suggested to output a struct scalar instead (like the min/max kernel). |
public: | ||
enum NullHandling { | ||
/// Ignore nulls. | ||
IGNORE_NULLS, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the other enum names, use only IGNORE
, since enum NullHandling
already specifies this is for nulls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had that initially, but IGNORE caused compilation issues on Windows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, nevermind.
|
||
// Safety: | ||
// - Made sure that the input datum is an array. | ||
const std::shared_ptr<ArrayData>& array_data = input.array(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use auto array_data = input.array()
.
|
||
template <typename DataType> | ||
enable_if_not_floating_point<DataType> IsMonotonicCheck( | ||
const typename DataType::c_type& current, const typename DataType::c_type& next, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reviewing this PR, I think implementations of IsMonotonicCheck
can be categorized as follows based on DataType
:
- have
c_type
(e.g., primitive numeric, datetime, timestamp)c_type
is floating-pointc_type
is not floating-point
- do not have
c_type
(binary, string, intervals)- custom implementation for
binary
andstring
- custom implementation for intervals
- custom implementation for
There would be at least 4 type-specific implementations of IsMonotonicCheck
, where the
the enable_if
for this case would be of the form (pseudocode) enable_if_has_c_type and enable_if_not_floating_point
return Status::OK(); | ||
} | ||
|
||
template <typename DataType> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: ArrowType
would be more appropriate than DataType
.
} | ||
|
||
template <typename DataType> | ||
Status IsMonotonic(KernelContext* ctx, const ExecBatch& batch, Datum* out) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this version of IsMonotonic
requires DataType
having a c_type
, this needs to be guarded with enable_if_has_c_type
. You would need to make other versions for binary/string
and interval types.
// Return early if there are NaNs, zero elements or one element in the array. | ||
// And return early if there are only nulls. | ||
if (array.length() <= 1 || array.null_count() == array.length()) { | ||
if (std::any_of(array.begin(), array.end(), isnan<DataType>)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After more careful thought, what we want is for only floating-point types to do the std::any()
check. You can use TypeTraits
for those that have a c_type
defined to get a type_id
variable which can be checked during runtime.
if (array.length() <= 1 || array.null_count() == array.length()) {
auto type_id = TypeTraits<DataType>::type_singleton();
if (!is_floating(type_id) || std::any_of(array.begin(), array.end(), std::isnan)) {
return IsMonotonicOutput(false, false, false, false, out);
} else {
...
}
}
is_floating(type_id)
is defined here.
P.S. I did not ran this code, but something along these lines should work.
Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍 |
Initially I tried to implement this as a
ScalarAggregateFunction
(as suggested in the issue), however given that there is no way to express sensitivity to order, it's currently not possible to correctly implement theScalarAggregator::MergeFrom
function. This is now implemented as aVectorFunction
.I'm still working on supporting more types:
Decimal typesTodo:
I'll create follow-up JIRA issues to:
INTERVAL_DAY_TIME
andINTERVAL_MONTH_DAY_NANO
) and decimal types.cc @bkietz