[Python] pa.array() doesn't respect specified dictionary type #23469

Closed
asfimport opened this issue Nov 14, 2019 · 5 comments
This might be related to ARROW-6548 and others dealing with all NaN columns. When creating a dictionary array, even when fully specifying the desired type, this type is not respected when the data contains only NaNs:

import pandas as pd
import pyarrow as pa

# This may look a little artificial, but it easily occurs when processing
# categorical data in batches and a particular batch contains only NaNs
ser = pd.Series([None, None]).astype('object').astype('category')
typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), ordered=False)
pa.array(ser, type=typ).type

results in


DictionaryType(dictionary<values=null, indices=int8, ordered=0>)

which means that one cannot e.g. serialize batches of categoricals if the possibility of all-NaN batches exists, even when trying to enforce that each batch has the same schema (because the schema is not respected).

I understand that inferring the type in this case would be difficult, but I'd imagine that a fully specified type should be respected in this case?

In the meantime, is there a workaround to manually create a dictionary array of the desired type containing only NaNs?

Reporter: Thomas Buhrmann / @buhrmann
Assignee: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-7168. Please see the migration documentation for further details.

Thomas Buhrmann / @buhrmann:
Ok, I think I found a workaround for converting an all-NaN categorical pd.Series to a dictionary array:

import numpy as np
import pandas as pd
import pyarrow as pa

# Should be astype('string'), but pandas doesn't preserve NaNs
ser = pd.Series([np.nan, np.nan]).astype('object').astype('category')

arr = pa.DictionaryArray.from_arrays(
    indices=-np.ones(len(ser), dtype=ser.cat.codes.dtype),
    dictionary=np.array([], dtype='str'),
    mask=np.ones(len(ser), dtype='bool'),
    ordered=ser.cat.ordered)

print(arr.type)
pd.Series(arr.to_pandas())

which produces:


dictionary<values=string, indices=int8, ordered=0>

0    NaN
1    NaN
dtype: category
Categories (0, object): []

i.e. the 'str' value_type is now respected and the roundtrip produces the correct result.

Thomas Buhrmann / @buhrmann:
While I'm at it, in case somebody faces the same problem: to safely convert pandas categoricals to Arrow while ensuring a constant type across batches, something like the following works:

import numpy as np
import pandas as pd
import pyarrow as pa

def categorical_to_arrow(ser, known_categories=None, ordered=None):
    """Safely create a pa.array from a categorical pd.Series.

    Args:
        ser (pd.Series): should be of CategoricalDtype
        known_categories (np.array): force known categories. If None, and
            the Series doesn't have any values to infer them from, will use
            an empty array of the same dtype as the categories attribute
            of the Series
        ordered (bool): whether categories should be ordered
    """
    n = len(ser)
    all_nan = ser.isna().sum() == n
       
    # Enforce provided categories, use the original ones, or enforce
    # the correct value_type if Arrow would otherwise change it to 'null'
    if isinstance(known_categories, np.ndarray):
        dictionary = known_categories
    elif all_nan:
        # value_type may be known, but Arrow doesn't understand 'object' dtype
        value_type = ser.cat.categories.dtype
        if value_type == 'object':
            value_type = 'str'
        dictionary = np.array([], dtype=value_type)
    else:
        dictionary = ser.cat.categories
        
    # Allow overwriting of ordered attribute
    if ordered is None:
        ordered = ser.cat.ordered

    if all_nan:
        return pa.DictionaryArray.from_arrays(
            indices=-np.ones(n, dtype=ser.cat.codes.dtype),
            dictionary=dictionary,
            mask=np.ones(n, dtype='bool'),
            ordered=ordered)
    else:
        return pa.DictionaryArray.from_arrays(
            indices=ser.cat.codes,
            dictionary=dictionary,
            ordered=ordered,
            from_pandas=True
        )

This seems to be the only(?) way to control the resulting dictionary type. E.g.:

# String categories with and without non-NaN values
sers = [
    pd.Series([None, None]).astype('object').astype('category'),
    pd.Series(['a', None, None]).astype('category')
]

# The categorical types we may want
known_categories = [
    None,
    np.array(['a', 'b', 'c'], dtype='str'),
    np.array([1, 2, 3], dtype='int8')
]

# Convert each series with each of the desired category types
for ser in sers:
    for cats in known_categories:
        arr = categorical_to_arrow(ser, known_categories=cats)
        ser2 = pd.Series(arr.to_pandas())
        print(f"Series: {list(ser)} | Known categories: {cats}")
        print(f"Dictionary type: {arr.type}")
        print(f"Roundtripped Series: \n{ser2}", "\n")

which produces:


Series: [nan, nan] | Known categories: None
Dictionary type: dictionary<values=string, indices=int8, ordered=0>
Roundtripped Series: 
0    NaN
1    NaN
dtype: category
Categories (0, object): [] 

Series: [nan, nan] | Known categories: ['a' 'b' 'c']
Dictionary type: dictionary<values=string, indices=int8, ordered=0>
Roundtripped Series: 
0    NaN
1    NaN
dtype: category
Categories (3, object): [a, b, c] 

Series: [nan, nan] | Known categories: [1 2 3]
Dictionary type: dictionary<values=int8, indices=int8, ordered=0>
Roundtripped Series: 
0    NaN
1    NaN
dtype: category
Categories (3, int64): [1, 2, 3] 

Series: ['a', nan, nan] | Known categories: None
Dictionary type: dictionary<values=string, indices=int8, ordered=0>
Roundtripped Series: 
0      a
1    NaN
2    NaN
dtype: category
Categories (1, object): [a] 

Series: ['a', nan, nan] | Known categories: ['a' 'b' 'c']
Dictionary type: dictionary<values=string, indices=int8, ordered=0>
Roundtripped Series: 
0      a
1    NaN
2    NaN
dtype: category
Categories (3, object): [a, b, c] 

Series: ['a', nan, nan] | Known categories: [1 2 3]
Dictionary type: dictionary<values=int8, indices=int8, ordered=0>
Roundtripped Series: 
0      1
1    NaN
2    NaN
dtype: category
Categories (3, int64): [1, 2, 3] 

(the last example would correspond to a recoding of the categories, but that'd be a usage problem...)

Joris Van den Bossche / @jorisvandenbossche:
@buhrmann thanks for the report. When passing a type like that, I agree it should be honoured.

Some other observations:

Also when it's not all-NaN, the specified type gets ignored:

In [19]: cat = pd.Categorical(['a', 'b']) 

In [20]: typ = pa.dictionary(index_type=pa.int8(), value_type=pa.int64(), ordered=False)  

In [21]: pa.array(cat, type=typ) 
Out[21]: 
<pyarrow.lib.DictionaryArray object at 0x7ff87b6a50b8>

-- dictionary:
  [
    "a",
    "b"
  ]
-- indices:
  [
    0,
    1
  ]

In [22]: pa.array(cat, type=typ).type  
Out[22]: DictionaryType(dictionary<values=string, indices=int8, ordered=0>)

So I suppose it's a more general problem, not specifically related to this all-NaN case (it only appears for you in this case, as otherwise the specified type and the type from the data will probably match).

In the example above, we should probably raise an error if the specified type is not compatible (string vs. int categories).

Thomas Buhrmann / @buhrmann:
Yes, that's right. I didn't notice it silently 'failing' in other cases because I usually construct the type explicitly to match.

I guess it should be a relatively easy fix since, as shown above, one can construct an all-NaN DictionaryArray using from_arrays() with negative indices, an np.array of the desired type as the dictionary, and the mask set. I haven't checked under the hood why using -1 as indices works without setting from_pandas=True, so I'm not sure this is the best way to create the array, but it seems to work in practice.

Antoine Pitrou / @pitrou:
Issue resolved by pull request #5866
