ARROW-17412: [C++] AsofJoin multiple keys and types #13880
Conversation
A macOS job failed on timeout of the
Thanks for doing this @rtpsw!
@rok In principle I think it makes sense to support date32/64 and perhaps time32/64? (Or really, any type that supports ordering and distance.) In practice, we found the predominant use case is either a Timestamp type or some other thing that gets encoded into an int/long (for example, the number of trading days since a certain point). I would prefer not to add support for date32/64 and time32/64 just yet, mostly because we are working towards an internal product release and would prefer to focus on issues related to that, if that is ok.
I am not sure and might seek help from the maintainers.
@rtpsw At a high level the changes make sense to me. My two main questions are
Sure. If there is a cleaner way to code this, I'd be happy to do so.
I'll try to explain a bit more. One observation was that the Presumably, the
Benchmark results comparing the baseline code of this PR with its latest are as follows:
Thanks @rtpsw for the benchmark numbers. Do you think the regressions are caused by key hashing?
No, because the benchmark was written prior to this feature and must invoke the fast path. I suspect it is overhead in hot spots such as
The latest commit removes the
For posterity, I also made an attempt to improve performance by refactoring
The code diff for this benchmark is:
I played around a bit and I'm not entirely sure about the StopProducing logic. StopProducing is typically a request from the user to cancel a plan. You should not need to call it when an error occurs. Today, errors propagate to the sink, where the sink makes the single StopProducing call to abort the entire plan. The only other place that StopProducing should be called internally is something like a "LIMIT 10" (fetch) node.
In the future, @save-buffer has proposed some cleanup here (#13848) which will change how errors are handled so we won't have that propagation (the idea, I think, was that some node might want to "handle" an error, yet I've not seen any real proposal for what this might be) and an error will just immediately stop the plan.
Also, you shouldn't be calling StopProducing when you've finished.
All of this is somewhat theoretical; we don't yet really have enough extensive testing of abort scenarios. Anyways, I've proposed an alternate handling here: 2a88750
I think the queue clearing and pushing false is probably a bit redundant, since the destructor does this, so that part isn't really needed. However, the change that makes it so StopProducing is never called by this node (and gets rid of the concern about marking finished twice) might be useful.
void StopProducing() override {
  // avoid finishing twice, to prevent "Plan was destroyed before finishing" error
  if (finished_.state() == FutureState::PENDING) {
    finished_.MarkFinished();
  }
}
This seems suspicious. What is happening here that we need this check?
The original code just called finished_.MarkFinished() and, in a debugging session, I observed a failure I haven't figured out, with the documented message as well as the message "Future already marked finished". I needed some way to get runs to complete, and this led me to the code change here to avoid finishing twice. I don't know enough about the stopping logic myself to say what a clean solution is here; probably both the original code and the code change here are not as clean as we'd like.
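(Editorial aside, not part of the original thread: the snippet below is a minimal sketch of one way to make the finish step idempotent, so that neither an error path nor normal completion can mark the future twice. NodeCompletion and MarkFinishedOnce are hypothetical names; this is not the PR's code nor the handling proposed in 2a88750.)

#include <atomic>
#include <utility>

#include "arrow/status.h"
#include "arrow/util/future.h"

// Hypothetical helper: the first caller wins and later calls are no-ops,
// so an error path and a completion path can both "finish" the node safely.
class NodeCompletion {
 public:
  explicit NodeCompletion(arrow::Future<> finished) : finished_(std::move(finished)) {}

  void MarkFinishedOnce(arrow::Status status = arrow::Status::OK()) {
    bool expected = false;
    if (already_finished_.compare_exchange_strong(expected, true)) {
      finished_.MarkFinished(std::move(status));
    }
  }

 private:
  arrow::Future<> finished_;
  std::atomic<bool> already_finished_{false};
};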
->GetValues<int64_t>(1)[latest_ref_row_];
OnType GetLatestTime() const {
  auto data = queue_.UnsyncFront()->column_data(time_col_index_);
  switch (time_type_id_) {
I played around with this briefly, trying to come up with anything more clever, and failed. I had missed the distinction between time_value and key_value and thought you were just coercing to uint64_t in both cases. I think the switch case is growing on me :)
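(Editorial aside, not part of the original thread: below is a rough sketch of the kind of per-type dispatch being discussed, widening different physical layouts of the on-column to a single int64_t. The function name GetOnValue, the supported cases, and the buffer index are illustrative assumptions, not the PR's actual code.)

#include <cstdint>

#include "arrow/array/data.h"
#include "arrow/type.h"

// Illustrative only: read the "on" value at `row` from a column whose
// physical type varies, normalizing everything to int64_t.
int64_t GetOnValue(const arrow::ArrayData& data, int64_t row,
                   arrow::Type::type time_type_id) {
  switch (time_type_id) {
    case arrow::Type::INT64:
    case arrow::Type::TIMESTAMP:
      // 64-bit storage can be read directly; buffer 1 holds the values.
      return data.GetValues<int64_t>(1)[row];
    case arrow::Type::INT32:
    case arrow::Type::DATE32:
      // 32-bit storage is widened to the common 64-bit representation.
      return static_cast<int64_t>(data.GetValues<int32_t>(1)[row]);
    default:
      // A real implementation validates supported types up front.
      return 0;
  }
}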
This proposed change passes my local tests, so I included it.
I recently read this key-map doc, and in particular this section in it, which describes a practical limit of ~16 million inserted keys. Thinking about similar implications here, the latest version of this PR also has a practical limit on the number of inserted keys due to the potential for collisions; specifically, these collisions are currently not arbitrated, i.e., if a collision occurs then the result is undefined. The probability of a collision after inserting 16 million (2^{24}) keys (using the slow path) can be estimated via the square approximation of the birthday problem as 2^{2*24} / (2 * 2^{64}) = 2^{-17}, which some applications (or people) won't see as negligible. Given the above, I see several options we have:
@westonpace, WDYT?
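(Editorial aside, not part of the original thread: the 2^{-17} estimate above follows from the birthday-problem square approximation p ≈ n^2 / (2d); the short program below simply reproduces that arithmetic.)

#include <cmath>
#include <cstdio>

// Back-of-the-envelope check of the collision estimate cited above:
// p ~= n^2 / (2 * d) for n inserted keys hashed into d possible values.
int main() {
  const double n = std::pow(2.0, 24);  // ~16 million inserted keys
  const double d = std::pow(2.0, 64);  // 64-bit hash space
  const double p = (n * n) / (2.0 * d);
  std::printf("p ~= 2^%g\n", std::log2(p));  // prints: p ~= 2^-17
  return 0;
}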
ping @westonpace
I think this is good to go at this point. We can work out details and performance as we go but this is a substantial improvement to the capability and those can be done as follow-on.
Let's do 1 for now. I think this PR is probably ready to merge and it makes sense to chase those things down as follow-ups. Otherwise it might be too much to keep track of. Long term it sounds like 3 would be simpler if we could show it worked well. That being said, don't we only need to worry about collisions within the on field's tolerance? Or do we have to worry about a collision anywhere in the dataset?
@icexelloss are you good with merging this? Or are there still changes you would like to see?
Will do.
It is normal for an on-key value of one row to be equal to that of the previous one, and the
@rtpsw I left some questions since the code looks a bit different from the last time I looked. Otherwise looks good to me.
LGTM
Thanks for approving. Give me some time to add the doc per item 1 before merging.
Created https://issues.apache.org/jira/browse/ARROW-17653 for item 3.
Benchmark runs are scheduled for baseline = 7475605 and contender = 99b57e8. 99b57e8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
See https://issues.apache.org/jira/browse/ARROW-17412 Lead-authored-by: Yaron Gvili <rtpsw@hotmail.com> Co-authored-by: rtpsw <rtpsw@hotmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>