Add a type alias for pa.dictionary(pa.int32(), pa.string()) #35209

@randolf-scholz

Describe the enhancement requested

Currently, there is no type alias for the dictionary type.

arrow/python/pyarrow/types.pxi

Lines 4739 to 4793 in 1deb740

cdef dict _type_aliases = {
    'null': null,
    'bool': bool_,
    'boolean': bool_,
    'i1': int8,
    'int8': int8,
    'i2': int16,
    'int16': int16,
    'i4': int32,
    'int32': int32,
    'i8': int64,
    'int64': int64,
    'u1': uint8,
    'uint8': uint8,
    'u2': uint16,
    'uint16': uint16,
    'u4': uint32,
    'uint32': uint32,
    'u8': uint64,
    'uint64': uint64,
    'f2': float16,
    'halffloat': float16,
    'float16': float16,
    'f4': float32,
    'float': float32,
    'float32': float32,
    'f8': float64,
    'double': float64,
    'float64': float64,
    'string': string,
    'str': string,
    'utf8': string,
    'binary': binary,
    'large_string': large_string,
    'large_str': large_string,
    'large_utf8': large_string,
    'large_binary': large_binary,
    'date32': date32,
    'date64': date64,
    'date32[day]': date32,
    'date64[ms]': date64,
    'time32[s]': time32('s'),
    'time32[ms]': time32('ms'),
    'time64[us]': time64('us'),
    'time64[ns]': time64('ns'),
    'timestamp[s]': timestamp('s'),
    'timestamp[ms]': timestamp('ms'),
    'timestamp[us]': timestamp('us'),
    'timestamp[ns]': timestamp('ns'),
    'duration[s]': duration('s'),
    'duration[ms]': duration('ms'),
    'duration[us]': duration('us'),
    'duration[ns]': duration('ns'),
    'month_day_nano_interval': month_day_nano_interval(),
}

Given that there currently seem to be optimized kernels (*) only for `dictionary[int32,string]`, I'd suggest adding only this one as a type alias for now. Having a string alias is nice, particularly when one wants to save table schemas as config files.

(*) I noticed horrible performance when trying to load CSV data using `column_types={"col": pa.dictionary(pa.int64(), pa.string())}`, `column_types={"col": pa.dictionary(pa.int16(), pa.string())}`, or `column_types={"col": pa.dictionary(pa.uint32(), pa.string())}`. Only int32 seems to behave as expected, performance-wise. If this is a bug, I can open another issue.

Some points to discuss

  • Should other aliases be added?

Component(s)

Python
