Description
Describe the enhancement requested
Currently, there is no type alias for the dictionary type.
arrow/python/pyarrow/types.pxi
Lines 4739 to 4793 in 1deb740
    cdef dict _type_aliases = {
        'null': null,
        'bool': bool_,
        'boolean': bool_,
        'i1': int8,
        'int8': int8,
        'i2': int16,
        'int16': int16,
        'i4': int32,
        'int32': int32,
        'i8': int64,
        'int64': int64,
        'u1': uint8,
        'uint8': uint8,
        'u2': uint16,
        'uint16': uint16,
        'u4': uint32,
        'uint32': uint32,
        'u8': uint64,
        'uint64': uint64,
        'f2': float16,
        'halffloat': float16,
        'float16': float16,
        'f4': float32,
        'float': float32,
        'float32': float32,
        'f8': float64,
        'double': float64,
        'float64': float64,
        'string': string,
        'str': string,
        'utf8': string,
        'binary': binary,
        'large_string': large_string,
        'large_str': large_string,
        'large_utf8': large_string,
        'large_binary': large_binary,
        'date32': date32,
        'date64': date64,
        'date32[day]': date32,
        'date64[ms]': date64,
        'time32[s]': time32('s'),
        'time32[ms]': time32('ms'),
        'time64[us]': time64('us'),
        'time64[ns]': time64('ns'),
        'timestamp[s]': timestamp('s'),
        'timestamp[ms]': timestamp('ms'),
        'timestamp[us]': timestamp('us'),
        'timestamp[ns]': timestamp('ns'),
        'duration[s]': duration('s'),
        'duration[ms]': duration('ms'),
        'duration[us]': duration('us'),
        'duration[ns]': duration('ns'),
        'month_day_nano_interval': month_day_nano_interval(),
    }
Given that there currently seem to be optimized kernels (*) only for `dictionary[int32,string]`, I'd suggest adding only this one as a type alias for now. Having a string alias is nice, particularly when one wants to save table schemas as config files.
(*) I noticed horrible performance when trying to load CSV data using `column_types={"col": pa.dictionary(pa.int64(), pa.string())}`, `column_types={"col": pa.dictionary(pa.int16(), pa.string())}`, or `column_types={"col": pa.dictionary(pa.uint32(), pa.string())}`. Only `int32` seems to behave as expected, performance-wise. If this is a bug, I can open another issue.
Some points to discuss
- Should other aliases be added?
Component(s)
Python