Some of the most important in-memory analytical routines are:
- unique
- contains / is-in
- match (see base::match in R or pandas.match)
- dictionary-encode (aka "factorize" as I call it)
- frequency-table (unique + observed frequencies)
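At their lowest level, these all involve either iterative hash table construction (inserting values into the table as a single array is scanned) or construct-then-sweep (for the routines involving multiple arrays, e.g. contains/match: build a hash table from one array, then sweep over the other and probe the table).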
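To make the two patterns concrete, here is a minimal sketch in plain Python, using a `dict` as the hash table. The function names and the `-1` "not found" sentinel are assumptions for illustration only (R's `base::match` returns `NA` and is 1-based); this is not Arrow's implementation or API.

```python
def factorize(values):
    """Iterative hash table construction: one pass, inserting as we go.

    Returns (uniques, codes, counts), which covers unique,
    dictionary-encode and frequency-table in a single scan.
    """
    table = {}    # value -> index into `uniques`
    uniques = []  # distinct values in order of first appearance
    codes = []    # dictionary index for every input element
    counts = []   # observed frequency of each unique value
    for v in values:
        idx = table.get(v)
        if idx is None:
            idx = len(uniques)
            table[v] = idx
            uniques.append(v)
            counts.append(0)
        codes.append(idx)
        counts[idx] += 1
    return uniques, codes, counts


def match(needles, haystack):
    """Construct-then-sweep: build the table from `haystack`, then probe it.

    Returns, for each needle, the position of its first occurrence in
    `haystack`, or -1 if absent (hypothetical sentinel, 0-based).
    """
    table = {}
    for i, v in enumerate(haystack):
        table.setdefault(v, i)  # keep the first occurrence only
    return [table.get(v, -1) for v in needles]


def is_in(needles, haystack):
    """contains / is-in expressed in terms of match."""
    return [pos >= 0 for pos in match(needles, haystack)]
```

With this shape, `unique` is the first element of `factorize(...)`, and the frequency table is the pairing of `uniques` with `counts`. Only the multi-array routines need the second, probing pass.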
Hashing more complex Arrow structures (e.g. structs or lists-of-structs) will require some more thought. Performing these operations on fixed-byte-width types and lists thereof (e.g. strings as List) is fairly straightforward, though, and can be used to craft more complex hash-table-based routines.
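As a rough illustration of why lists of fixed-width values are straightforward to hash, the sketch below assumes a variable-length column stored as an offsets array plus a contiguous byte buffer, and uses each value's byte slice as the hash key. The function name and argument names are hypothetical, and the layout is an assumption made for the example rather than something specified in this issue.

```python
def unique_var_width(offsets, data):
    """Return the indices of the first occurrence of each distinct value.

    Assumed layout: value i occupies data[offsets[i]:offsets[i + 1]],
    so `offsets` has one more entry than there are values.
    """
    seen = set()  # byte slices already inserted into the hash table
    first_indices = []
    for i in range(len(offsets) - 1):
        # The raw bytes of the slice serve directly as the hash key.
        key = bytes(data[offsets[i]:offsets[i + 1]])
        if key not in seen:
            seen.add(key)
            first_indices.append(i)
    return first_indices


# Four strings packed back to back: "foo", "bar", "foo", "baz"
data = b"foobarfoobaz"
offsets = [0, 3, 6, 9, 12]
first = unique_var_width(offsets, data)
```

Nested types such as structs have no single contiguous byte range per value, which is one reason they need more thought than this.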
Reporter: Wes McKinney / @wesm
Assignee: Antoine Pitrou / @pitrou
Note: This issue was originally created as ARROW-32. Please see the migration documentation for further details.