C++: add hash table classes for fixed-byte-width and variable-length primitive arrays #15400

@asfimport

Description

Some of the most important in-memory analytical routines are:

  • unique
  • contains / is-in
  • match (see base::match in R or pandas.match)
  • dictionary-encode (aka "factorize" as I call it)
  • frequency-table (unique + observed frequencies)

At their lowest level these all involve either iterative hash table construction or construct-then-sweep (for the routines involving multiple arrays, e.g. contains/match).
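To make the two patterns concrete, here is a minimal sketch for a single fixed-width type (`int64_t`). It assumes `std::unordered_map` as a stand-in for a purpose-built hash table, and the names `Encoded`, `DictionaryEncode`, and `Match` are hypothetical, not part of any Arrow API. `DictionaryEncode` shows iterative construction (yielding unique values, dictionary indices, and frequencies in one pass); `Match` shows construct-then-sweep over two arrays.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical result type: distinct values in first-seen order, one
// dictionary index per input element, and one count per distinct value.
struct Encoded {
  std::vector<int64_t> dictionary;
  std::vector<int32_t> indices;
  std::vector<int64_t> counts;
};

// Iterative construction: the table grows while the input is scanned once.
// "unique" is `dictionary`, "dictionary-encode" is `indices`, and
// "frequency-table" is `dictionary` paired with `counts`.
Encoded DictionaryEncode(const std::vector<int64_t>& values) {
  std::unordered_map<int64_t, int32_t> slots;  // value -> dictionary index
  Encoded out;
  out.indices.reserve(values.size());
  for (int64_t v : values) {
    auto [it, inserted] =
        slots.try_emplace(v, static_cast<int32_t>(out.dictionary.size()));
    if (inserted) {
      out.dictionary.push_back(v);
      out.counts.push_back(0);
    }
    out.indices.push_back(it->second);
    ++out.counts[it->second];
  }
  return out;
}

// Construct-then-sweep: build the table from `haystack` first, then sweep
// `needles` through it. Returns the first position of each needle in
// `haystack`, or -1 if absent; "contains / is-in" is simply `result != -1`.
std::vector<int32_t> Match(const std::vector<int64_t>& needles,
                           const std::vector<int64_t>& haystack) {
  std::unordered_map<int64_t, int32_t> first_pos;
  for (std::size_t i = 0; i < haystack.size(); ++i) {
    first_pos.try_emplace(haystack[i], static_cast<int32_t>(i));
  }
  std::vector<int32_t> out;
  out.reserve(needles.size());
  for (int64_t v : needles) {
    auto it = first_pos.find(v);
    out.push_back(it == first_pos.end() ? -1 : it->second);
  }
  return out;
}
```

For example, `DictionaryEncode({7, 3, 7, 9, 3, 7})` gives the dictionary `{7, 3, 9}` with indices `{0, 1, 0, 2, 1, 0}` and counts `{3, 2, 1}`, and `Match({3, 4, 7}, {7, 3, 9})` gives `{1, -1, 0}`. Null handling and templating over other fixed-width types are left out of this sketch.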

Hashing more complex Arrow structures (e.g. structs or lists-of-structs) will require some more thought, but performing these operations on fixed-byte-width types and lists thereof (e.g. strings as a List of bytes) is fairly straightforward and can be used to craft more complex hash-table-based routines.
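The variable-length case can be sketched in the same way. The example below assumes a simplified column layout (an offsets vector plus one contiguous byte buffer) and hashes each value as a `std::string_view` slice of that buffer, so no per-value copies are made. `BinaryColumn` and `UniqueFirstPositions` are hypothetical names for illustration, and `std::unordered_map` again stands in for a dedicated hash table.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified variable-length column: value i occupies
// data[offsets[i] .. offsets[i + 1]). No null bitmap in this sketch.
struct BinaryColumn {
  std::vector<int32_t> offsets;  // length() + 1 entries
  std::string data;              // concatenated value bytes

  std::size_t length() const {
    return offsets.empty() ? 0 : offsets.size() - 1;
  }

  // Non-owning view of value i; valid only while `data` is alive.
  std::string_view value(std::size_t i) const {
    return std::string_view(
        data.data() + offsets[i],
        static_cast<std::size_t>(offsets[i + 1] - offsets[i]));
  }
};

// "unique" over variable-length values: returns the position of the first
// occurrence of each distinct value, in first-seen order.
std::vector<int32_t> UniqueFirstPositions(const BinaryColumn& col) {
  // Keys are views into col.data, so the table must not outlive `col`.
  std::unordered_map<std::string_view, int32_t> seen;
  std::vector<int32_t> first_positions;
  for (std::size_t i = 0; i < col.length(); ++i) {
    auto [it, inserted] =
        seen.try_emplace(col.value(i), static_cast<int32_t>(i));
    if (inserted) {
      first_positions.push_back(it->second);
    }
  }
  return first_positions;
}
```

With offsets `{0, 3, 6, 9, 9, 12}` over the buffer `"foobarfoobaz"` (the values `foo`, `bar`, `foo`, an empty string, and `baz`), `UniqueFirstPositions` returns `{0, 1, 3, 4}`. Keying on views rather than owned strings keeps the table small, at the cost of tying its lifetime to the column's buffer.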

Reporter: Wes McKinney / @wesm
Assignee: Antoine Pitrou / @pitrou

Note: This issue was originally created as ARROW-32. Please see the migration documentation for further details.
