-
Notifications
You must be signed in to change notification settings - Fork 3.9k
GH-37059: [C++][Python] Add rolling_* functions #37060
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
|
|
@js8544 ping me when this is ready for the initial reviews. |
Thanks in advance! I'm quite busy with my day job this week. Will deal with this over the weekend. |
511aea4 to
2ea7bbe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current visit_valid must have 1 arg and visit_null must have no args. However, I want to pass VisitRightNull and VisitLeftNull (both having 0 args) as visit_valid, so I need to use InvokeWithRequiredArgs for all the lambdas passed.
cpp/src/arrow/compute/registry.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to order it alphabetically
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe [&] should be replaced by [visit_both_not_null = std::move(visit_both_not_null)] so the r-value parameters are moved into the local lambdas. As they are not shared, they can be moved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arr0_it and arr1_it would be captured by & of course as they are shared among the different lambdas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think capturing by reference is better here because the lambda won't need to create a variable for the func. See https://cppinsights.io/s/690fcf31 for an example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind. Nothing changes in the final assembly anyway. https://godbolt.org/z/Edaqefsc9
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you call the template parameter ArrowType or DataType you can use it in place of ArgType and OutType removing the need for the aliasing and different names for a type that is the same.
That would also free up ArgType and OutType to refer to the primitive C types and remove the need for ArgValue and OutValue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if you think some Windows might need to differentiate here, keep it as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I would prefer to keep this because Mean has a different out type than arg type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why double instead of ArgValue here? To avoid having to cast sum to double in GetValue? But how does that trade off with the error accumulation due to sums and subs on the double?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to avoid overflow. Errors would only occur with large values, which are also susceptible to overflow errors. I think overflow is the worse problem here. Also, to improve accuracy, we can later switch to Kahan Sum or Pairwise Sum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two can come from Window::... as well I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, and a large amount of template parameter thanks to this. I've updated the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it's possible to pass IgnoreNulls as a template parameter and de-duplicate the implementation by adding a few if constexpr (IgnoreNulls) in the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this but there will be so many if constexprs that it hurts readability. The main problem is one needs to keep window_valid_count and the other needs window_null_count.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I've gave up on some if constexpr-based implementations in the past because of this as well.
felipecrv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but you will need a commiter's approval to merge it :)
mapleFU
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thanks for great code! I may ask a stupid question that, how do Prod handles value like NaN, 0? are they regarded as "invalid" like null?
There's no special handling of them. So they are dealt with the same way C++ deals with them. |
Thanks! @pitrou would you mind taking a look? |
TEST(TestRollingProd, ZeroHandling) {
for (std::string func : {"rolling_prod", "rolling_prod_checked"}) {
for (auto ty : NumericTypes()) {
if (ty == uint8() || ty == int8()) { // out of bounds
continue;
}
RollingOptions options(/*window_length=*/3, 3, true);
CheckRolling(func, ArrayFromJSON(ty, "[1, 2, 0, 4, 5, 6, 10]"),
ArrayFromJSON(ty, "[1, 2, 0, 0, 0, 120, 300]"), &options);
}
}
}Yeah, I mean when handling window with |
|
I see. They do require some special handling. |
|
I've add special handling of NaN and zero for rolling_prod and rolling_prod_checked, along with a test case for them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid division by zero! Clever.
|
Thanks, a clever fix! Would "sum" or mean also suffer from nan? should we fixing them using the same way? |
|
Fixed for sum and mean. Also I noticed that the default options for |
mapleFU
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with arrow compute, but the code looks clean and good to me! Thanks
7e83d8d to
093b8aa
Compare
|
@pitrou This PR is also ready for review. Would you mind having a look? Thanks! |
093b8aa to
f36b14b
Compare
|
@felipecrv @mapleFU Hi, since both of you have become committers (congrats btw), would you mind have another look at this PR? You've approved this before but it wasn't merged. |
I will take another look before merging. Soon-ish. |
|
Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer. |
Rationale for this change
Add rolling_* functions.
What changes are included in this PR?
Added their implementations, tests and python ports.
Are these changes tested?
Yes.
Are there any user-facing changes?
No.