[C++] Support joining tables with non-key fields as list #32504

Open
asfimport opened this issue Jul 26, 2022 · 5 comments

Comments


asfimport commented Jul 26, 2022

I am trying to join two Arrow tables where some columns are of the list&lt;float&gt; data type. My join columns/keys are primitive data types, but some of my non-join columns are of type list&lt;float&gt;. PyArrow's join() cannot join such a table, although pandas can. It says

ArrowInvalid: Data type list<item: float> is not supported in join non-key field

when I execute this piece of code

joined_table = table_1.join(table_2, ['k1', 'k2', 'k3'])
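
For reference, here is a self-contained version of the snippet above (the column names and values are made up; only the key/non-key types are meant to match what I described):

    import pyarrow as pa

    # Join key "k1" is a primitive type; the non-key column "v1" is list<float>.
    table_1 = pa.table({
        "k1": pa.array([1, 2], type=pa.int64()),
        "v1": pa.array([[1.0, 2.0], [3.0]], type=pa.list_(pa.float32())),
    })
    table_2 = pa.table({
        "k1": pa.array([1, 2], type=pa.int64()),
        "v2": pa.array(["a", "b"], type=pa.string()),
    })

    # Raises ArrowInvalid: Data type list<item: float> is not supported in join non-key field
    joined_table = table_1.join(table_2, ["k1"])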

A Stack Overflow response pointed out that Arrow currently cannot handle non-fixed-width types for joins. Can this be fixed? Or is it intentional?

Reporter: Jayjeet Chakraborty / @JayjeetAtGithub

Note: This issue was originally created as ARROW-17216. Please see the migration documentation for further details.

Carlos Maltzahn:
@JayjeetAtGithub and I are willing to help implement support for joining tables with lists in non-key values. But we might need some help on where to start.

Weston Pace / @westonpace:
That'd be great. The starting point would be src/arrow/compute/exec/hash_join_node.cc. This is where you'll find the check itself that is currently failing, but it is not where most of the join logic lives. Fair warning: the hash-join node has been a bit of a staging ground for performance-critical Arrow compute, and so it relies on a number of utilities not used elsewhere. As such, this node has a pretty high learning curve at the moment (though my hope is that this machinery will be spread more diffusely throughout the engine in the future).

As of the 9.0.0 release (still pending) there are two implementations of hash join. The basic implementation (HashJoinImpl) is backed by std::unordered_map and can be found in src/arrow/compute/exec/hash_join.h. A newer version (SwissJoin) extends HashJoinImpl, is backed by a custom hash map, and is found in src/arrow/compute/exec/swiss_join.h. I'd recommend testing and adding support to the newer version, as the work required is going to be similar between the two. Note that the basic version supports dictionary types while the newer version does not (we just fall back to the basic version when needed), so falling back to the basic version is an option if the newer version proves to be trouble.

Support for types here is mostly gated by support in some of the alternate views/encodings used by the hash join. One of these is a non-owning ArrayData view called KeyColumnArray, which is in src/arrow/compute/light_array.h. This view does not currently support nested data. Note that ArraySpan is pretty similar (see ARROW-17257) and does support nested types (I think), so maybe it makes sense to tackle ARROW-17257 as part of this.

The second significant thing is RowTableImpl in src/arrow/compute/row/row_internal.h. This implements a row-major encoding for Arrow data. During the hash-join operation, the build data is placed into a table in this row-major form. Then, during materialization, it is converted back to a column-major form.
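
To make that concrete, here is a toy illustration of the round trip (plain Python, not the actual RowTableImpl API; all names are invented for the sketch):

    # Column-major build-side data: one list per column.
    build_columns = {"k1": [1, 2, 3], "payload": [10.0, 20.0, 30.0]}
    names = list(build_columns)

    # "Encode" into a row-major table: one tuple per row.
    row_table = list(zip(*(build_columns[n] for n in names)))  # [(1, 10.0), (2, 20.0), (3, 30.0)]

    # "Materialize" the rows matched by the probe side back into column-major form.
    matched_row_ids = [0, 2]
    output_columns = {n: [row_table[i][j] for i in matched_row_ids]
                      for j, n in enumerate(names)}
    # {'k1': [1, 3], 'payload': [10.0, 30.0]}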

On top of those two key elements there are a number of other utilities like ExecBatchBuilder, RowArray (which should maybe be renamed to RowTable), RowArrayAccessor, RowArrayMerge, and the hashing utilities themselves (there are two versions of these as well; I'm pretty sure the older implementation uses arrow/util/hashing.h and I know the newer version uses arrow/compute/exec/key_hash.h), etc.

So I would probably start by looking at the unit tests that exist for those utilities and encodings (this reminded me that I had some unit tests I had forgotten to push for ARROW-17022, so I will try to get those up today) and try to get these utilities working with nested types. Some of these utilities could probably use more unit tests as well. Once the utilities are working with nested types you can enable them for the join itself and see what breaks.

CC @michalursa and @save-buffer as they are more knowledgeable in this area and might have some additional input / advice.

Jayjeet Chakraborty / @JayjeetAtGithub:
Thanks a lot for the detailed pointers @westonpace.

Aldrin Montana / @drin:
I am currently working on scalar hash functions, which might help with understanding the hashing interfaces:

  • A way to interface with hashing functions from key_hash.h: scalar_hash.cc#L119

  • A way to interface with hashing function from hashing.h: scalar_hash.cc#L160

As for ArraySpan vs KeyColumnArray, they're both ultimately views into the same buffers (from ArrayData). I don't see anything that makes it difficult to support nested types in either, but I also don't see anything that explicitly supports nested types (unless ArraySpan does so via some nuanced templates). I came across this issue while looking to see if anyone was working on convenient nested type support for ArraySpan, since it currently is a view into ArrayData buffers, but I don't see convenient functions in the spirit of Array-level functions such as StructArray::field().

I will hopefully make more headway on this tomorrow. So, at least as far as nested type support via ArraySpan and/or KeyColumnArray goes, I might be able to put together some sort of tutorial-style cookbook on how to access the data appropriately for nested data types.
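
In the meantime, the public PyArrow API already shows the shape that such access takes for nested data; a quick sketch (I believe the C++ equivalents are ListArray::values()/value_offsets() and StructArray::field(), not anything on ArraySpan or KeyColumnArray yet):

    import pyarrow as pa

    list_arr = pa.array([[1.0, 2.0], None, [3.0]], type=pa.list_(pa.float32()))
    print(list_arr.offsets)    # int32 offsets into the child values array
    print(list_arr.values)     # raw flattened child array: [1, 2, 3]
    print(list_arr.flatten())  # child values with offsets/nulls taken into account

    struct_arr = pa.array(
        [{"x": 1, "y": [1.0]}, {"x": 2, "y": [2.0, 3.0]}],
        type=pa.struct([("x", pa.int32()), ("y", pa.list_(pa.float32()))]),
    )
    print(struct_arr.field("x"))  # child array for field "x"
    print(struct_arr.field("y"))  # nested list child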

Aldrin Montana / @drin:
Okay, revisiting this: I commented on ARROW-17257 about how ARROW-8991 is related, and for now KeyColumnArray is essentially a flattened array. In ARROW-8991, I will have code for flattening arrays to be able to hash them. I think once that is done, perhaps it can be used to support this.
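
As a rough illustration of the flatten-then-hash idea (plain Python over the public PyArrow API, not the key_hash.h implementation): hash the flattened child values once, then combine them per row using the list offsets:

    import pyarrow as pa

    arr = pa.array([[1.0, 2.0], [3.0], []], type=pa.list_(pa.float32()))
    offsets = arr.offsets.to_pylist()  # [0, 2, 3, 3]
    child = arr.values.to_pylist()     # flattened values: [1.0, 2.0, 3.0]

    # Hash every flattened child value once, then combine per list using the offsets.
    child_hashes = [hash(v) for v in child]
    row_hashes = [
        hash(tuple(child_hashes[offsets[i]:offsets[i + 1]]))
        for i in range(len(arr))
    ]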
