GH-34235: [Python] Add join_asof binding #34234

Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the corresponding format? (In the case of PARQUET issues on JIRA the title also supports a PARQUET-prefixed format.)
I'm struggling to run the tests locally as I can't get Arrow to build on an M1 Mac.
Force-pushed d095ec9 to cd00736
Can you try adding
Force-pushed 16f4b26 to 22d6c83
@AlenkaF I believe that this is ready for a round of review when you have time.
Force-pushed 22d6c83 to 2e678f4
Hi @judahrand, thank you for your contribution! ⭐ The code LGTM 👍 I would maybe add tests to
Done

The Appveyor failure looks like a timeout?

Yeah, I see the Appveyor build is failing on other PRs also.

@AlenkaF Do you need me to do anything else on this PR?
This looks good to me, thanks!
Will wait for Joris to have time to look at it also.
Just one more thing: could you rebase to latest master to get AppVeyor working? (related issue was closed: #34296)
Force-pushed d65ea5a to 62ed964
@judahrand thanks for working on this! Before diving into the details, I have two general comments:
by : str or list[str]
    The columns from current dataset that should be used as the by keys
    of the join operation left side.
tolerance : int
If the "on" key is a timestamp column, what value can be used here? (not an int?)
This is a good question actually... I'm not 100% sure how this is intended to work. The C++ implementation exclusively accepts an `int64_t` for the tolerance. It simply states that it will use the same units as the `on` column... it is unclear what that means. I'd assumed it meant the resolution of the timestamp in a timestamp case.
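To make the "same units as the `on` column" reading concrete, here is a toy pure-Python model of one common asof-match convention (match backwards within a tolerance). This is an illustration of the units question, not the Arrow implementation, and the helper name is made up:

```python
# Toy model (NOT the Arrow implementation): an int64 tolerance is
# interpreted in the raw storage units of the "on" column, whatever
# resolution that column happens to have.

def within_tolerance(left_on: int, right_on: int, tolerance: int) -> bool:
    """Match if right_on is at or before left_on, within `tolerance` units."""
    return 0 <= left_on - right_on <= tolerance

# If the "on" column is timestamp('s'), these raw values are seconds,
# so a tolerance of 60 means "up to 60 seconds earlier":
print(within_tolerance(1_000_060, 1_000_000, tolerance=60))  # True

# The same raw difference stored as timestamp('ms') is only 60 ms;
# the integer tolerance is unit-agnostic, which is the ambiguity above:
print(within_tolerance(1_000_060, 1_000_000, tolerance=30))  # False
```

Under this reading, changing the timestamp resolution of the `on` column changes the physical meaning of the same integer tolerance.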
python/pyarrow/_exec_plan.pyx
Outdated
# By default expose all columns on both left and right table
if isinstance(left_operand, Table):
    left_columns = left_operand.column_names
elif isinstance(left_operand, Dataset):
    left_columns = left_operand.schema.names
else:
    raise TypeError("Unsupported left join member type")
Those `left_columns`/`right_columns` are not really used, except for checking the column collisions? What happens if we don't check for this here in Cython and there actually is a column collision? Does the C++ implementation give an error for that as well?
I believe the reason I added this in is that it was causing the C++ implementation to segfault.
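A Python-side guard of this kind could look roughly like the following. This is a hypothetical sketch (the helper name and message are made up, not the actual Cython code): fail fast on overlapping non-key output columns instead of letting the C++ node crash.

```python
# Hypothetical pre-flight check (not the actual pyarrow code): reject
# column-name collisions between the two inputs before building the plan.

def check_collisions(left_cols, right_cols, keys):
    """Raise if a non-key column name appears in both inputs."""
    collisions = (set(left_cols) & set(right_cols)) - set(keys)
    if collisions:
        raise ValueError(
            f"Columns {sorted(collisions)} appear in both join inputs")

check_collisions(["on", "colA"], ["on", "colB"], keys=["on"])  # fine
# check_collisions(["on", "x"], ["on", "x"], keys=["on"])  # would raise
```

Raising a `ValueError` in Python keeps the failure mode user-facing, which is the motivation described above for doing the check before reaching the C++ implementation.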
The `left_columns`/`right_columns` variables were also used to filter out the 'special' Dataset columns which we get back if the operands are datasets. This isn't currently an issue because of the temporary conversion to Tables, given the lack of ScanNodeOptions.
OK, I see. I added ScanNodeOptions, so you should be able to update this now.
python/pyarrow/tests/test_table.py
Outdated
"colC": [1., 3., 5.] | ||
}) | ||
|
||
r = t1.join_asof(t2, "colA", "col2", 1, "colB", "col3") |
Some additional test case ideas to ensure good coverage:
- A test where the left/right column names are the same, so you can rely on not having to specify right_on/by
- A test where the `by` keys is a list of columns instead of a single one (and what happens if passing an empty list?)
> A test where the left/right column names are the same, so you can rely on not having to specify right_on/by

This is now tested.

> A test where the by keys is a list of columns instead of a single one

This is now tested.

> and what happens if passing an empty list?

It seems like it just doesn't perform the join over any partitions; this is also now tested.
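The multi-column `by` case can be pictured as partitioning both inputs on a composite key before doing the per-partition asof match. A rough pure-Python sketch (made-up helper, not the Arrow code):

```python
from collections import defaultdict

def partition_by(rows, by_keys):
    """Group rows (dicts) by the tuple of their `by` key values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in by_keys)].append(row)
    return groups

rows = [
    {"id": 1, "site": "a", "year": 2020},
    {"id": 1, "site": "b", "year": 2021},
    {"id": 2, "site": "a", "year": 2022},
]

# by=["id", "site"] yields three singleton partitions here:
parts = partition_by(rows, ["id", "site"])
print(sorted(parts))  # [(1, 'a'), (1, 'b'), (2, 'a')]
```

Note that in this toy model an empty `by` list puts every row under the single key `()`; that need not match the actual C++ node, which (as observed above) appeared not to perform the join over any partitions in that case.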
@judahrand I merged #34102, so you should be able to add a
@jorisvandenbossche @AlenkaF ping?

@jorisvandenbossche @AlenkaF ping?
Again, apologies for the slow review! (and thanks for continuing to ping us)
I added some comments on the docstring in acero.pyx (the options class), but most of them should also apply on the docstrings in table/dataset methods.
Perform an asof join between this table and another one.
I would still like to see a bit more expanded explanation (apart from the individual keyword explanations) about what an asof join exactly is.
Something indicating it does 1) an inexact join, 2) on a sorted dataset, 3) potentially first joining on other attributes, and 4) is typically useful for time series data that are not perfectly aligned. Are those the most relevant characteristics?
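Those four characteristics can be made concrete with a small pure-Python model of the semantics (a toy sketch with made-up names, not the Arrow implementation): for each left row, take the latest right row in the same `by` group whose `on` value is at or before the left one, within a tolerance, and always emit one output row per left row.

```python
import bisect

def asof_join_sketch(left, right, on, by, tolerance):
    """Toy asof join over lists of dicts sorted by their `on` value:
    inexact (tolerance-bounded) match on `on`, exact match on `by`."""
    # Index the right side: by-group -> sorted list of (on, row).
    groups = {}
    for row in right:
        groups.setdefault(row[by], []).append((row[on], row))

    out = []
    for lrow in left:
        candidates = groups.get(lrow[by], [])
        keys = [k for k, _ in candidates]
        i = bisect.bisect_right(keys, lrow[on]) - 1  # latest on <= left on
        match = {}
        if i >= 0 and lrow[on] - candidates[i][0] <= tolerance:
            match = candidates[i][1]
        out.append({**match, **lrow})  # left columns win; left rows all kept
    return out

left = [{"id": 1, "t": 10}, {"id": 1, "t": 25}]
right = [{"id": 1, "t": 9, "v": "a"}, {"id": 1, "t": 20, "v": "b"}]
print(asof_join_sketch(left, right, on="t", by="id", tolerance=5))
```

Because every left row produces exactly one output row (matched or not), this behaves like a left join with inexact keys, which is why the sortedness of the `on` column matters.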
python/pyarrow/table.pxi
Outdated
>>> t2 = pa.Table.from_pandas(df2).sort_by('year')

>>> t1.join_asof(
...     t2, on='year', by='id', tolerance=1
The fact that there is no repetition of the values in the by "id" key in the example data makes it difficult to see what exactly happens with the "on" key.
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
These descriptions are mostly lifted from the Pandas `merge_asof` docs. https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html#pandas.merge_asof
This example now shows the behavior of duplicate values in a `by` predicate as well as how `tolerance` works.
Force-pushed e37f75d to df46f9e
@jorisvandenbossche I believe I've dealt with all the feedback 😄
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 681be03. There were 9 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.