GH-34235: [Python] Add join_asof binding #34234

Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the corresponding format? (In the case of PARQUET issues on JIRA the title also supports a PARQUET-prefixed format.)
I'm struggling to run the tests locally as I can't get Arrow to build on an M1 Mac.
Force-pushed d095ec9 to cd00736
Can you try adding
Force-pushed 16f4b26 to 22d6c83
@AlenkaF I believe that this is ready for a round of review when you have time.
Force-pushed 22d6c83 to 2e678f4
Hi @judahrand, thank you for your contribution! ⭐ The code LGTM 👍 I would maybe add tests to
Done

The Appveyor failure looks like a timeout?

Yeah, I see the Appveyor build is failing on other PRs also.

@AlenkaF Do you need me to do anything else on this PR?
This looks good to me, thanks!
Will wait for Joris to have time to look at it also.
Just one more thing: could you rebase to latest master to get AppVeyor working? (related issue was closed: #34296)
Force-pushed d65ea5a to 62ed964
@judahrand thanks for working on this! Before diving into the details, I have two general comments:
by : str or list[str]
    The columns from current dataset that should be used as the by keys
    of the join operation left side.
tolerance : int
If the "on" key is a timestamp column, what value can be used here? (not an int?)
This is a good question actually... I'm not 100% sure how this is intended to work. The C++ implementation exclusively accepts an `int64_t` for the tolerance. It simply states that it will use the same units as the `on` column... it is unclear what that means. I'd assumed it meant the resolution of the timestamp in a timestamp case.
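To make the "same units as the `on` column" reading concrete, here is a toy pure-Python model of one common asof-match convention (match backwards within a tolerance). This is an illustration of the units question, not the Arrow implementation, and the helper name is made up:

```python
# Toy model (NOT the Arrow implementation): an int64 tolerance is
# interpreted in the raw storage units of the "on" column, whatever
# resolution that column happens to have.

def within_tolerance(left_on: int, right_on: int, tolerance: int) -> bool:
    """Match if right_on is at or before left_on, within `tolerance` units."""
    return 0 <= left_on - right_on <= tolerance

# If the "on" column is timestamp('s'), these raw values are seconds,
# so a tolerance of 60 means "up to 60 seconds earlier":
print(within_tolerance(1_000_060, 1_000_000, tolerance=60))  # True

# The same raw difference stored as timestamp('ms') is only 60 ms;
# the integer tolerance is unit-agnostic, which is the ambiguity above:
print(within_tolerance(1_000_060, 1_000_000, tolerance=30))  # False
```

Under this reading, changing the timestamp resolution of the `on` column changes the physical meaning of the same integer tolerance.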
python/pyarrow/_exec_plan.pyx
Outdated
# By default expose all columns on both left and right table
if isinstance(left_operand, Table):
    left_columns = left_operand.column_names
elif isinstance(left_operand, Dataset):
    left_columns = left_operand.schema.names
else:
    raise TypeError("Unsupported left join member type")
Those `left_columns`/`right_columns` are not really used, except for checking the column collisions? What happens if we don't check for this here in Cython and there actually is a column collision? Does the C++ implementation give an error for that as well?
I believe the reason I added this in is that it was causing the C++ implementation to segfault.
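A Python-side guard of this kind could look roughly like the following. This is a hypothetical sketch (the helper name and message are made up, not the actual Cython code): fail fast on overlapping non-key output columns instead of letting the C++ node crash.

```python
# Hypothetical pre-flight check (not the actual pyarrow code): reject
# column-name collisions between the two inputs before building the plan.

def check_collisions(left_cols, right_cols, keys):
    """Raise if a non-key column name appears in both inputs."""
    collisions = (set(left_cols) & set(right_cols)) - set(keys)
    if collisions:
        raise ValueError(
            f"Columns {sorted(collisions)} appear in both join inputs")

check_collisions(["on", "colA"], ["on", "colB"], keys=["on"])  # fine
# check_collisions(["on", "x"], ["on", "x"], keys=["on"])  # would raise
```

Raising a `ValueError` in Python keeps the failure mode user-facing, which is the motivation described above for doing the check before reaching the C++ implementation.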
The `left_columns`/`right_columns` variables were also used to filter out the 'special' Dataset columns which we get back if the operands are datasets. This isn't currently an issue because of the temporary conversion to Tables, given the lack of ScanNodeOptions.
OK, I see. I added ScanNodeOptions, so you should be able to update this now.
python/pyarrow/tests/test_table.py
Outdated
"colC": [1., 3., 5.] | ||
}) | ||
|
||
r = t1.join_asof(t2, "colA", "col2", 1, "colB", "col3") |
Some additional test case ideas to ensure good coverage:
- A test where the left/right column names are the same, so you can rely on not having to specify right_on/by
- A test where the `by` keys is a list of columns instead of a single one (and what happens if passing an empty list?)
> A test where the left/right column names are the same, so you can rely on not having to specify right_on/by

This is now tested.

> A test where the by keys is a list of columns instead of a single one

This is now tested.

> and what happens if passing an empty list?

It seems like it just doesn't perform the join over any partitions; this is also now tested.
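The multi-column `by` case can be pictured as partitioning both inputs on a composite key before doing the per-partition asof match. A rough pure-Python sketch (made-up helper, not the Arrow code):

```python
from collections import defaultdict

def partition_by(rows, by_keys):
    """Group rows (dicts) by the tuple of their `by` key values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in by_keys)].append(row)
    return groups

rows = [
    {"id": 1, "site": "a", "year": 2020},
    {"id": 1, "site": "b", "year": 2021},
    {"id": 2, "site": "a", "year": 2022},
]

# by=["id", "site"] yields three singleton partitions here:
parts = partition_by(rows, ["id", "site"])
print(sorted(parts))  # [(1, 'a'), (1, 'b'), (2, 'a')]
```

Note that in this toy model an empty `by` list puts every row under the single key `()`; that need not match the actual C++ node, which (as observed above) appeared not to perform the join over any partitions in that case.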
@judahrand I merged #34102, so you should be able to add a
@jorisvandenbossche @AlenkaF ping?

@jorisvandenbossche @AlenkaF ping?
Again, apologies for the slow review! (and thanks for continuing to ping us)
I added some comments on the docstring in acero.pyx (the options class), but most of them should also apply on the docstrings in table/dataset methods.
Perform an asof join between this table and another one.
I would still like to see a bit more expanded explanation (apart from the individual keyword explanations) about what an asof join exactly is.
Something indicating it does 1) an inexact join, 2) on a sorted dataset, 3) potentially first joining on other attributes, and 4) is typically useful for time series data that are not perfectly aligned. Are those the most relevant characteristics?
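Those four characteristics can be made concrete with a small pure-Python model of the semantics (a toy sketch with made-up names, not the Arrow implementation): for each left row, take the latest right row in the same `by` group whose `on` value is at or before the left one, within a tolerance, and always emit one output row per left row.

```python
import bisect

def asof_join_sketch(left, right, on, by, tolerance):
    """Toy asof join over lists of dicts sorted by their `on` value:
    inexact (tolerance-bounded) match on `on`, exact match on `by`."""
    # Index the right side: by-group -> sorted list of (on, row).
    groups = {}
    for row in right:
        groups.setdefault(row[by], []).append((row[on], row))

    out = []
    for lrow in left:
        candidates = groups.get(lrow[by], [])
        keys = [k for k, _ in candidates]
        i = bisect.bisect_right(keys, lrow[on]) - 1  # latest on <= left on
        match = {}
        if i >= 0 and lrow[on] - candidates[i][0] <= tolerance:
            match = candidates[i][1]
        out.append({**match, **lrow})  # left columns win; left rows all kept
    return out

left = [{"id": 1, "t": 10}, {"id": 1, "t": 25}]
right = [{"id": 1, "t": 9, "v": "a"}, {"id": 1, "t": 20, "v": "b"}]
print(asof_join_sketch(left, right, on="t", by="id", tolerance=5))
```

Because every left row produces exactly one output row (matched or not), this behaves like a left join with inexact keys, which is why the sortedness of the `on` column matters.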
python/pyarrow/table.pxi
Outdated
>>> t2 = pa.Table.from_pandas(df2).sort_by('year')

>>> t1.join_asof(
...     t2, on='year', by='id', tolerance=1
The fact that there is no repetition of the values in the by "id" key in the example data makes it difficult to see what exactly happens with the "on" key.
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
These descriptions are mostly lifted from the Pandas `merge_asof` docs. https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html#pandas.merge_asof
This example now shows the behavior of duplicate values in a `by` predicate as well as how `tolerance` works.
Force-pushed e37f75d to df46f9e
@jorisvandenbossche I believe I've dealt with all the feedback 😄
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 681be03. There were 9 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.