[Python] pa.array() doesn't respect specified dictionary type #23469
Comments
Thomas Buhrmann / @buhrmann:

# Should be astype('string'), but pandas doesn't preserve NaNs
ser = pd.Series([np.nan, np.nan]).astype('object').astype('category')
arr = pa.DictionaryArray.from_arrays(
    indices=-np.ones(len(ser), dtype=ser.cat.codes.dtype),
    dictionary=np.array([], dtype='str'),
    mask=np.ones(len(ser), dtype='bool'),
    ordered=ser.cat.ordered)
print(arr.type)
pd.Series(arr.to_pandas())

which produces:

i.e. the 'str' value_type is now respected and the round trip produces the correct result.
Thomas Buhrmann / @buhrmann:

def categorical_to_arrow(ser, known_categories=None, ordered=None):
    """Safely create a pa.array from a categorical pd.Series.

    Args:
        ser (pd.Series): should be of CategoricalDtype.
        known_categories (np.array): force known categories. If None, and
            the Series doesn't have any values to infer them from, an
            empty array of the same dtype as the categories attribute
            of the Series is used.
        ordered (bool): whether categories should be ordered.
    """
    n = len(ser)
    all_nan = ser.isna().sum() == n

    # Enforce provided categories, use the original ones, or enforce
    # the correct value_type if Arrow would otherwise change it to 'null'
    if isinstance(known_categories, np.ndarray):
        dictionary = known_categories
    elif all_nan:
        # value_type may be known, but Arrow doesn't understand 'object' dtype
        value_type = ser.cat.categories.dtype
        if value_type == 'object':
            value_type = 'str'
        dictionary = np.array([], dtype=value_type)
    else:
        dictionary = ser.cat.categories

    # Allow overwriting of the ordered attribute
    if ordered is None:
        ordered = ser.cat.ordered

    if all_nan:
        return pa.DictionaryArray.from_arrays(
            indices=-np.ones(n, dtype=ser.cat.codes.dtype),
            dictionary=dictionary,
            mask=np.ones(n, dtype='bool'),
            ordered=ordered)
    else:
        return pa.DictionaryArray.from_arrays(
            indices=ser.cat.codes,
            dictionary=dictionary,
            ordered=ordered,
            from_pandas=True)

This seems to be the only (?) way to have control over the resulting dictionary type. E.g.:

# String categories with and without non-NaN values
sers = [
    pd.Series([None, None]).astype('object').astype('category'),
    pd.Series(['a', None, None]).astype('category')
]

# The categorical types we may want
known_categories = [
    None,
    np.array(['a', 'b', 'c'], dtype='str'),
    np.array([1, 2, 3], dtype='int8')
]

# Convert each series with each of the desired category types
for ser in sers:
    for cats in known_categories:
        arr = categorical_to_arrow(ser, known_categories=cats)
        ser2 = pd.Series(arr.to_pandas())
        print(f"Series: {list(ser)} | Known categories: {cats}")
        print(f"Dictionary type: {arr.type}")
        print(f"Roundtripped Series: \n{ser2}", "\n")

which produces:

(The last example would correspond to a recoding of the categories, but that'd be a usage problem...)
Joris Van den Bossche / @jorisvandenbossche:

Some other observations: also when it's not all-NaN, the specified type gets ignored:

In [19]: cat = pd.Categorical(['a', 'b'])

In [20]: typ = pa.dictionary(index_type=pa.int8(), value_type=pa.int64(), ordered=False)

In [21]: pa.array(cat, type=typ)
Out[21]:
<pyarrow.lib.DictionaryArray object at 0x7ff87b6a50b8>
-- dictionary:
  [
    "a",
    "b"
  ]
-- indices:
  [
    0,
    1
  ]

In [22]: pa.array(cat, type=typ).type
Out[22]: DictionaryType(dictionary<values=string, indices=int8, ordered=0>)

So I suppose it's a more general problem, not specifically related to this all-NaN case (it only appears for you in this case, since otherwise the specified type and the type inferred from the data will probably match). In the example above, we should probably raise an error if the specified type is not compatible (string vs. int categories).
Thomas Buhrmann / @buhrmann:

I guess it should be a relatively easy fix, since, as I show above, one can construct an all-NaN DictionaryArray using from_arrays() with negative indices, an np.array of the desired type as the dictionary, and the mask set. I haven't checked under the hood why using -1 as indices works without setting from_pandas=True, so I'm not sure this is the best way to create the array, but it seems to work in practice...
Antoine Pitrou / @pitrou:

This might be related to ARROW-6548 and others dealing with all-NaN columns.

Original issue description:

When creating a dictionary array, even when fully specifying the desired type, this type is not respected when the data contains only NaNs:
results in
This means one cannot, e.g., serialize batches of categoricals when all-NaN batches are possible, even when trying to enforce the same schema for each batch (because the schema is not respected).
I understand that inferring the type in this case would be difficult, but I'd imagine that a fully specified type should be respected in this case?
In the meantime, is there a workaround to manually create a dictionary array of the desired type containing only NaNs?
Reporter: Thomas Buhrmann / @buhrmann
Assignee: Joris Van den Bossche / @jorisvandenbossche
PRs and other links:
Note: This issue was originally created as ARROW-7168. Please see the migration documentation for further details.