Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-34235: [Python] Add join_asof binding #34234

Merged
merged 36 commits into from
Mar 15, 2024

Conversation

judahrand
Copy link
Contributor

@judahrand judahrand commented Feb 17, 2023

@judahrand judahrand marked this pull request as draft February 17, 2023 12:27
@github-actions
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@judahrand judahrand changed the title [Python] Add join_asof binding GH-34235: [Python] Add join_asof binding Feb 17, 2023
@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #34235 has been automatically assigned in GitHub to PR creator.

@judahrand
Copy link
Contributor Author

I'm struggling to run the tests locally as I can't get Arrow to build on an M1 Mac.

Undefined symbols for architecture arm64:
  "testing::Matcher<std::__1::basic_string_view<char, std::__1::char_traits<char> > const&>::Matcher(char const*)", referenced from:
      testing::Matcher<std::__1::basic_string_view<char, std::__1::char_traits<char> > const&> testing::internal::MatcherCastImpl<std::__1::basic_string_view<char, std::__1::char_traits<char> > const&, char const*>::CastImpl<true>(char const* const&, std::__1::integral_constant<bool, true>, std::__1::integral_constant<bool, true>) in array_test.cc.o
      testing::Matcher<std::__1::basic_string_view<char, std::__1::char_traits<char> > const&> testing::internal::MatcherCastImpl<std::__1::basic_string_view<char, std::__1::char_traits<char> > const&, char const*>::CastImpl<true>(char const* const&, std::__1::integral_constant<bool, true>, std::__1::integral_constant<bool, true>) in array_binary_test.cc.o
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [src/arrow/CMakeFiles/arrow-array-test.dir/build.make:208: build/debug/arrow-array-test] Error 1
make[1]: *** [CMakeFiles/Makefile2:1714: src/arrow/CMakeFiles/arrow-array-test.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 54%] Built target arrow-public-api-test

@judahrand judahrand force-pushed the python/join_asof branch 2 times, most recently from d095ec9 to cd00736 Compare February 17, 2023 19:47
@AlenkaF
Copy link
Member

AlenkaF commented Feb 20, 2023

I'm struggling to run the tests locally as I can't get Arrow to build on an M1 Mac.

Can you try adding -DGTest_SOURCE="BUNDLED" CMake flag? See #14917

@judahrand judahrand marked this pull request as ready for review February 20, 2023 11:08
@judahrand
Copy link
Contributor Author

@AlenkaF I believe that this is ready for a round of review when you have time.

@AlenkaF
Copy link
Member

AlenkaF commented Feb 20, 2023

Hi @judahrand , thank you for your contribution! ⭐

The code LGTM 👍 I would maybe add tests to test_table.py also, as in https://github.com/apache/arrow/pull/12452/files#diff-72bd29ba764c85f18fbf9e74898b72d47ed8a1b04be879c1a5b2a59382e2eaef

@judahrand
Copy link
Contributor Author

Hi @judahrand , thank you for your contribution! ⭐

The code LGTM 👍 I would maybe add tests to test_table.py also, as in https://github.com/apache/arrow/pull/12452/files#diff-72bd29ba764c85f18fbf9e74898b72d47ed8a1b04be879c1a5b2a59382e2eaef

Done

@judahrand
Copy link
Contributor Author

The Appveyor failure looks like a timeout?

@AlenkaF
Copy link
Member

AlenkaF commented Feb 21, 2023

Yeah, I see the Appveyor build is failing on other PRs also.

@judahrand
Copy link
Contributor Author

@AlenkaF Do you need me to do anything else on this PR?

Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, thanks!
Will wait for Joris to have time to look at it also.

@AlenkaF
Copy link
Member

AlenkaF commented Feb 27, 2023

Just one more thing: could you rebase to latest master to get AppVeyor working? (related issue was closed: #34296)

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Feb 27, 2023

@judahrand thanks for working on this!

Before diving into the details, I have two general comments:

  • We are in the middle of refactoring how we expose the Acero / ExecPlan features, and I have a PR that exposes the Declaration object and ExecNodeOptions subclasses in Python (GH-33976: [Python] Initial bindings for acero Declaration and ExecNodeOptions classes #34102). Once that is merged, it should be the goal that also the asof join could be exposed by adding an AsofJoinNodeOptions class in pyarrow.
    Ideally, I would prefer that we can do this directly, but I know that the mentioned PR isn't merged yet (I hope it can be merged in one of the coming days, though). Anyway, most of the work you are doing in this PR is needed anyway (public API, docs, tests, etc), it's only the part in _exec_plan.pyx that would change a bit.

  • I think we should try to do a better job of explaining what the "asof" exactly is and does in the docstring (I also noted that AsOfJoin is missing in the C++ user guide (will open an issue about that), although it has some reference docs), since I think this is generally not a very well known join type: what is the difference with a normal join? What is the difference between the "on" and "by" join keys?

by : str or list[str]
The columns from current dataset that should be used as the by keys
of the join operation left side.
tolerance : int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the "on" key is a timestamp column, what value can be used here? (not an int?)

Copy link
Contributor Author

@judahrand judahrand Mar 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good question actually... I'm not 100% sure how this is intended to work. The C++ implementation exclusively accepts an int64_t for the tolerance. It simply states that it will use the same units as the on column... it is unclear what that means. I'd assumed it meant the resolution of the timestamp in a timestamp case.

python/pyarrow/_dataset.pyx Outdated Show resolved Hide resolved
Comment on lines 456 to 462
# By default expose all columns on both left and right table
if isinstance(left_operand, Table):
left_columns = left_operand.column_names
elif isinstance(left_operand, Dataset):
left_columns = left_operand.schema.names
else:
raise TypeError("Unsupported left join member type")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those left_columns/right_columns is not really used, except for checking the column collisions?
What happens if we don't check for this here in cython and there actually is a column collision? Does the C++ implementation give an error for that as well?

Copy link
Contributor Author

@judahrand judahrand Mar 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the reason I added this in is that it was causing the C++ implementation to segfault.

Copy link
Contributor Author

@judahrand judahrand Mar 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The left_columns/right_columns variables were also used to filter out the 'special' Dataset columns which we get back if the operands are datasets. This isn't currently an issue due to the temporary conversion to Tables due to the lack of ScanNodeOptions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see. I added ScanNodeOptions, so you should be able to update this now.

python/pyarrow/table.pxi Outdated Show resolved Hide resolved
python/pyarrow/tests/test_dataset.py Outdated Show resolved Hide resolved
"colC": [1., 3., 5.]
})

r = t1.join_asof(t2, "colA", "col2", 1, "colB", "col3")
Copy link
Member

@jorisvandenbossche jorisvandenbossche Feb 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional test case ideas to ensure good coverage:

  • A test where the left/right column names are the same, so you can rely on not having to specify right_on/by
  • A test where the by keys is a list of columns instead of a single one (and what happens if passing an empty list?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A test where the left/right column names are the same, so you can rely on not having to specify right_on/by

This is now tested.

A test where the by keys is a list of columns instead of a single one

This is now tested.

and what happens if passing an empty list?

It seems like it just doesn't perform the join over any partitions - this is also now tested.

@jorisvandenbossche
Copy link
Member

@judahrand I merged #34102, so you should be able to add a AsofJoinNodeOptions subclass, and then update _perform_join_asof to use that options class together with Declaration (similar as I am doing in #3440)

@judahrand
Copy link
Contributor Author

judahrand commented Jan 5, 2024

@jorisvandenbossche @AlenkaF ping?

1 similar comment
@judahrand
Copy link
Contributor Author

@jorisvandenbossche @AlenkaF ping?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again apologies for the slow review! (and thanks for keep pinging us)

I added some comments on the docstring in acero.pyx (the options class), but most of them should also apply on the docstrings in table/dataset methods.

python/pyarrow/_acero.pyx Outdated Show resolved Hide resolved
python/pyarrow/_acero.pyx Outdated Show resolved Hide resolved
python/pyarrow/_acero.pyx Outdated Show resolved Hide resolved
python/pyarrow/_acero.pyx Show resolved Hide resolved
python/pyarrow/_acero.pyx Outdated Show resolved Hide resolved
python/pyarrow/_acero.pyx Outdated Show resolved Hide resolved
Comment on lines +4864 to +4865
Perform an asof join between this table and another one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would still like to see a bit more expanded explanation (apart from the individual keyword explanations) about what and asof join exactly is.

Something indicating it does 1) an inexact join, 2) on a sorted dataset, 3) potentially first joining on other attributes, and 4) typically useful for time series data that are not perfectly aligned. Are that the most relevant characteristics?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>>> t2 = pa.Table.from_pandas(df2).sort_by('year')

>>> t1.join_asof(
... t2, on='year', by='id', tolerance=1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that there is no repetition of the vaues in the by "id" key in the example data makes that it is difficult to see what exactly happens with the "on" key?

Copy link
Contributor Author

@judahrand judahrand Mar 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python/pyarrow/tests/test_exec_plan.py Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Feb 29, 2024
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 4, 2024
This example now shows the behavior of duplicate values in a `by`
predicate as well as how `tolerance` works.
@judahrand
Copy link
Contributor Author

@jorisvandenbossche I believe I've dealt will all the feedback 😄

@jorisvandenbossche jorisvandenbossche merged commit 681be03 into apache:main Mar 15, 2024
35 checks passed
@jorisvandenbossche jorisvandenbossche removed the awaiting change review Awaiting change review label Mar 15, 2024
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 681be03.

There were 9 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

jorisvandenbossche added a commit that referenced this pull request Mar 19, 2024
Small follow-up on #34234 fixing the marker for a newly added test, fixing the minimal builds
* GitHub Issue: #34235

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Python] Add join_asof binding
5 participants