[Python] pa.array() doesn't respect specified dictionary type #23469

Closed
asfimport opened this issue Nov 14, 2019 · 5 comments
This might be related to ARROW-6548 and others dealing with all NaN columns. When creating a dictionary array, even when fully specifying the desired type, this type is not respected when the data contains only NaNs:

import pandas as pd
import pyarrow as pa

# This may look a little artificial, but it easily occurs when processing
# categorical data in batches and a particular batch contains only NaNs
ser = pd.Series([None, None]).astype('object').astype('category')
typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), ordered=False)
pa.array(ser, type=typ).type

results in


DictionaryType(dictionary<values=null, indices=int8, ordered=0>)

which means that one cannot e.g. serialize batches of categoricals if the possibility of all-NaN batches exists, even when trying to enforce that each batch has the same schema (because the schema is not respected).

I understand that inferring the type in this case would be difficult, but I'd imagine that a fully specified type should be respected in this case?

In the meantime, is there a workaround to manually create a dictionary array of the desired type containing only NaNs?

Reporter: Thomas Buhrmann / @buhrmann
Assignee: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-7168. Please see the migration documentation for further details.

Thomas Buhrmann / @buhrmann:
Ok, I think I found a workaround for converting an all-NaN categorical pd.Series to a dictionary array:

import numpy as np
import pandas as pd
import pyarrow as pa

# Should be astype('string'), but pandas doesn't preserve NaNs
ser = pd.Series([np.nan, np.nan]).astype('object').astype('category')

arr = pa.DictionaryArray.from_arrays(
    indices=-np.ones(len(ser), dtype=ser.cat.codes.dtype),
    dictionary=np.array([], dtype='str'),
    mask=np.ones(len(ser), dtype='bool'),
    ordered=ser.cat.ordered)

print(arr.type)
pd.Series(arr.to_pandas())

which produces:


dictionary<values=string, indices=int8, ordered=0>

0    NaN
1    NaN
dtype: category
Categories (0, object): []

i.e. the 'str' value_type is now respected and the roundtrip produces the correct result.

Thomas Buhrmann / @buhrmann:
While I'm at it, in case somebody faces the same problem: to safely convert pandas categoricals to Arrow while ensuring a constant type across batches, something like the following works:

import numpy as np
import pandas as pd
import pyarrow as pa

def categorical_to_arrow(ser, known_categories=None, ordered=None):
    """Safely create a pa.array from a categorical pd.Series.

    Args:
        ser (pd.Series): should be of CategoricalDtype
        known_categories (np.array): force known categories. If None, and
            the Series doesn't have any values to infer them from, will use
            an empty array of the same dtype as the categories attribute
            of the Series
        ordered (bool): whether categories should be ordered
    """
    n = len(ser)
    all_nan = ser.isna().sum() == n
       
    # Enforce provided categories, use the original ones, or enforce
    # the correct value_type if Arrow would otherwise change it to 'null'
    if isinstance(known_categories, np.ndarray):
        dictionary = known_categories
    elif all_nan:
        # value_type may be known, but Arrow doesn't understand 'object' dtype
        value_type = ser.cat.categories.dtype
        if value_type == 'object':
            value_type = 'str'
        dictionary = np.array([], dtype=value_type)
    else:
        dictionary = ser.cat.categories
        
    # Allow overwriting of ordered attribute
    if ordered is None:
        ordered = ser.cat.ordered

    if all_nan:
        return pa.DictionaryArray.from_arrays(
            indices=-np.ones(n, dtype=ser.cat.codes.dtype),
            dictionary=dictionary,
            mask=np.ones(n, dtype='bool'),
            ordered=ordered)
    else:
        return pa.DictionaryArray.from_arrays(
            indices=ser.cat.codes,
            dictionary=dictionary,
            ordered=ordered,
            from_pandas=True
        )

This seems to be the only(?) way to control the resulting dictionary type. E.g.:

# String categories with and without non-NaN values
sers = [
    pd.Series([None, None]).astype('object').astype('category'),
    pd.Series(['a', None, None]).astype('category')
]

# The categorical types we may want
known_categories = [
    None,
    np.array(['a', 'b', 'c'], dtype='str'),
    np.array([1, 2, 3], dtype='int8')
]

# Convert each series with each of the desired category types
for ser in sers:
    for cats in known_categories:
        arr = categorical_to_arrow(ser, known_categories=cats)
        ser2 = pd.Series(arr.to_pandas())
        print(f"Series: {list(ser)} | Known categories: {cats}")
        print(f"Dictionary type: {arr.type}")
        print(f"Roundtripped Series: \n{ser2}", "\n")

which produces:


Series: [nan, nan] | Known categories: None
Dictionary type: dictionary<values=string, indices=int8, ordered=0>
Roundtripped Series: 
0    NaN
1    NaN
dtype: category
Categories (0, object): [] 

Series: [nan, nan] | Known categories: ['a' 'b' 'c']
Dictionary type: dictionary<values=string, indices=int8, ordered=0>
Roundtripped Series: 
0    NaN
1    NaN
dtype: category
Categories (3, object): [a, b, c] 

Series: [nan, nan] | Known categories: [1 2 3]
Dictionary type: dictionary<values=int8, indices=int8, ordered=0>
Roundtripped Series: 
0    NaN
1    NaN
dtype: category
Categories (3, int64): [1, 2, 3] 

Series: ['a', nan, nan] | Known categories: None
Dictionary type: dictionary<values=string, indices=int8, ordered=0>
Roundtripped Series: 
0      a
1    NaN
2    NaN
dtype: category
Categories (1, object): [a] 

Series: ['a', nan, nan] | Known categories: ['a' 'b' 'c']
Dictionary type: dictionary<values=string, indices=int8, ordered=0>
Roundtripped Series: 
0      a
1    NaN
2    NaN
dtype: category
Categories (3, object): [a, b, c] 

Series: ['a', nan, nan] | Known categories: [1 2 3]
Dictionary type: dictionary<values=int8, indices=int8, ordered=0>
Roundtripped Series: 
0      1
1    NaN
2    NaN
dtype: category
Categories (3, int64): [1, 2, 3] 

(the last example would correspond to a recoding of the categories, but that'd be a usage problem...)

Joris Van den Bossche / @jorisvandenbossche:
@buhrmann thanks for the report. When passing a type like that, I agree it should be honoured.

Some other observations:

Also when it's not all-NaN, the specified type gets ignored:

In [19]: cat = pd.Categorical(['a', 'b']) 

In [20]: typ = pa.dictionary(index_type=pa.int8(), value_type=pa.int64(), ordered=False)  

In [21]: pa.array(cat, type=typ) 
Out[21]: 
<pyarrow.lib.DictionaryArray object at 0x7ff87b6a50b8>

-- dictionary:
  [
    "a",
    "b"
  ]
-- indices:
  [
    0,
    1
  ]

In [22]: pa.array(cat, type=typ).type  
Out[22]: DictionaryType(dictionary<values=string, indices=int8, ordered=0>)

So I suppose it's a more general problem, not specifically related to this all-NaN case (it only appears for you in this case, as otherwise the specified type and the type from the data will probably match).

In the example above, we should probably raise an error if the specified type is not compatible (string vs. int categories).

Thomas Buhrmann / @buhrmann:
Yes, that's right. I didn't notice it silently 'failing' in other cases because I usually construct the type explicitly to match.

I guess it should be a relatively easy fix since, as shown above, one can construct an all-NaN DictionaryArray using from_arrays() with negative indices, an np.array of the desired type as the dictionary, and the mask set. I haven't checked under the hood why using -1 as indices works without setting from_pandas=True, so I'm not sure this is the best way to create the array, but it seems to work in practice.

Antoine Pitrou / @pitrou:
Issue resolved by pull request #5866
