Similarly, as @lidavidm mentioned in ARROW-14569, it seems we need to implement a cast from dictionary to extension.
The reason this is needed for Parquet is that, currently, only dictionaries with string/binary values can be stored as-is in Parquet. All other dictionary types are stored as materialized values (still with some Parquet encoding, of course, but not necessarily dictionary encoding, and in any case not a direct mapping between Arrow's dictionary type and Parquet's dictionary encoding without an additional conversion). See ARROW-6140.
Thanks to some help I got from @jorisvandenbossche, I can create DictionaryArrays with an ExtensionType (on just the dictionary, on the DictionaryArray itself, or on both). However, these extended DictionaryArrays can't be written to Parquet files.
To start, let's set up my minimal reproducer ExtensionType, this time with an explicit ExtensionArray:
A non-extended DictionaryArray could be built like this:
I can write it to a file and read it back, though the fact that it comes back as a non-DictionaryArray might be part of the problem. Is some decision being made that the array of indices is too short to warrant dictionary encoding?
Anyway, the next step is to make a DictionaryArray with ExtensionTypes. In this example, I'm giving both the dictionary and the outer DictionaryArray itself an extension type:
This can't be written to a Parquet file:
My first thought was that maybe the data used in the dictionary has to be simple (it's usually strings). So how about making the outer DictionaryArray extended, but leaving the inner dictionary non-extended? The type definitions are now inline.
I can write this, but it comes back as a non-extended type, perhaps because it is read back as a non-DictionaryArray carrying the (non-extended) type of the original's dictionary.
Okay, since there are four possibilities here, what about making the dictionary an ExtensionType while the outer DictionaryArray is not?
Nope, can't write this, either:
I'm pretty sure I aligned all the types correctly. Perhaps only one of these cases should be the officially supported way to do it, but there ought to be some way to get the annotations into a Parquet file and read them back (other than un-dictionary-encoding the array).
Reporter: Jim Pivarski / @jpivarski
Related issues:
Note: This issue was originally created as ARROW-14525. Please see the migration documentation for further details.