C++: add hash table classes for fixed-byte-width and variable-length primitive arrays

Some of the most important in-memory analytical routines are:

- unique
- contains / is-in
- match (see base::match in R or pandas.match)
- dictionary-encode (what I call "factorize")
- frequency-table (unique + observed frequencies)

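At their lowest level, these all involve either iterative hash table construction (inserting values one at a time while scanning a single array) or construct-then-sweep: building a table from one array, then probing it with another (for the routines involving multiple arrays, e.g. contains/match).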
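To make the two patterns concrete, here is a minimal sketch using `std::unordered_map` on `int64_t` values. The names `DictionaryEncoded`, `DictionaryEncode`, and `Match` are hypothetical and are not Arrow APIs; the 0-based positions and the `-1` sentinel for "no match" are assumptions of this sketch (R's `match` is 1-based and returns `NA`). It assumes a C++17 compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical result type for this sketch; not an Arrow API.
struct DictionaryEncoded {
  std::vector<int64_t> dictionary;  // distinct values, in order of first appearance
  std::vector<int32_t> indices;     // for each input slot, its position in `dictionary`
};

// Iterative construction: the table grows while the input is scanned once.
inline DictionaryEncoded DictionaryEncode(const std::vector<int64_t>& values) {
  DictionaryEncoded out;
  std::unordered_map<int64_t, int32_t> table;  // value -> dictionary code
  out.indices.reserve(values.size());
  for (int64_t v : values) {
    const auto next_code = static_cast<int32_t>(out.dictionary.size());
    // try_emplace inserts only if `v` is not already present.
    auto [it, inserted] = table.try_emplace(v, next_code);
    if (inserted) out.dictionary.push_back(v);
    out.indices.push_back(it->second);
  }
  // Invariant: one dictionary entry per distinct key in the table.
  assert(out.dictionary.size() == table.size());
  return out;
}

// Construct-then-sweep: build the table from `haystack`, then probe it
// with every element of `needles`. Returns the position of the first
// occurrence in `haystack`, or -1 (sentinel chosen for this sketch).
inline std::vector<int32_t> Match(const std::vector<int64_t>& needles,
                                  const std::vector<int64_t>& haystack) {
  std::unordered_map<int64_t, int32_t> table;  // value -> first position
  for (std::size_t i = 0; i < haystack.size(); ++i) {
    table.try_emplace(haystack[i], static_cast<int32_t>(i));
  }
  std::vector<int32_t> result;
  result.reserve(needles.size());
  for (int64_t v : needles) {
    auto it = table.find(v);
    result.push_back(it == table.end() ? -1 : it->second);
  }
  return result;
}
```

In this sketch, "unique" is the `dictionary` member of the encode result, and "contains" is `Match(...) != -1` per slot. A frequency table would follow the same iterative pattern with a counter as the mapped value.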

Hashing more complex Arrow structures (e.g. structs or lists-of-structs) will require some more thought, but performing these operations on fixed-byte-width types and lists thereof (e.g. strings as List<UInt8>) is fairly straightforward and can be used to craft more complex hash-table-based routines.
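For the variable-length case, a rough sketch of "unique" over strings stored as an offsets buffer plus a concatenated byte buffer might look like the following. `BinaryColumn` and `UniqueIndices` are hypothetical names made up for illustration, not Arrow classes, and the sketch ignores nulls. It hashes `std::string_view` slices of the data buffer so no value bytes are copied (C++17).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a variable-length byte array: value i
// occupies bytes [offsets[i], offsets[i + 1]) of `data`.
struct BinaryColumn {
  std::vector<int32_t> offsets;  // number of values + 1 entries
  std::string data;              // concatenated value bytes

  std::size_t length() const { return offsets.size() - 1; }

  // Non-owning view of the i-th value; valid while `data` is alive.
  std::string_view Value(std::size_t i) const {
    const auto begin = static_cast<std::size_t>(offsets[i]);
    const auto end = static_cast<std::size_t>(offsets[i + 1]);
    return std::string_view(data).substr(begin, end - begin);
  }
};

// Returns the slot index of the first occurrence of each distinct value.
inline std::vector<std::size_t> UniqueIndices(const BinaryColumn& col) {
  assert(!col.offsets.empty());  // a valid column has at least one offset
  std::unordered_set<std::string_view> seen;  // views point into col.data
  std::vector<std::size_t> firsts;
  for (std::size_t i = 0; i < col.length(); ++i) {
    if (seen.insert(col.Value(i)).second) firsts.push_back(i);
  }
  return firsts;
}
```

The fixed-byte-width case can reuse the same shape by treating each value as a view of constant length, which is one reason the two cases can plausibly share most of a hash table implementation.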

**Reporter**: [Wes McKinney](https://issues.apache.org/jira/browse/ARROW-32) / @wesm
**Assignee**: [Antoine Pitrou](https://issues.apache.org/jira/browse/ARROW-32) / @pitrou

<sub>**Note**: *This issue was originally created as [ARROW-32](https://issues.apache.org/jira/browse/ARROW-32). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>
