Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-2432: [Python] Fix Pandas decimal type conversion with None values #1878

Conversation

@BryanCutler
Copy link
Member

BryanCutler commented Apr 10, 2018

This fixes conversion of Pandas decimal types to Arrow with None values. Previously, if the type was specified, an error would occur when checking if the object was Decimal. If the type was not specified, a segmentation fault would occur when attempting to find the max precision and scale.

Added new tests which include None values for both the above cases.

@BryanCutler BryanCutler requested a review from cpcloud Apr 10, 2018
@BryanCutler

This comment has been minimized.

Copy link
Member Author

BryanCutler commented Apr 10, 2018

Ping @cpcloud and @pitrou, please take a look when you can, thanks!

@@ -184,14 +184,15 @@ Status DecimalMetadata::Update(int32_t suggested_precision, int32_t suggested_sc
}

Status DecimalMetadata::Update(PyObject* object) {
DCHECK(PyDecimal_Check(object)) << "Object is not a Python Decimal";
bool is_decimal = PyDecimal_Check(object);

This comment has been minimized.

Copy link
@pitrou

pitrou Apr 10, 2018

Contributor

I don't think it's ok to do this in optimized build. DecimalMetadata expects you to pass a decimal object. @cpcloud may confirm.

This comment has been minimized.

Copy link
@BryanCutler

BryanCutler Apr 10, 2018

Author Member

This isn't necessary because I added a check before calling Update but it does prevent a seg fault if for some reason it's called with non Decimal objects - which is not nice to get. If it's hurts an optimization though, I can remove it.

This comment has been minimized.

Copy link
@pitrou

pitrou Apr 10, 2018

Contributor

Right now we are doing the check twice in optimized builds, which is not nice IMHO. DecimalMetadata::Update is a private API so it's up to the caller to provide appropriate input.

This comment has been minimized.

Copy link
@BryanCutler

BryanCutler Apr 11, 2018

Author Member

So you mean remove PyDecimal_Check all together? This is only called when the type is not specified by the user and then yes, it will end doing 2 passes over the objects and checks both times if they are decimal. It might be possible to do less checks on the second pass if we keep a list of which ones are decimal objects, but I'm not sure that would be worth it.

This comment has been minimized.

Copy link
@pitrou

pitrou Apr 11, 2018

Contributor

Fair enough, we can optimize later if we find it too slow. The conversion itself is very slow anyway :-)

@@ -743,7 +743,9 @@ Status NumPyConverter::ConvertDecimals() {

if (type_ == NULLPTR) {
for (PyObject* object : objects) {
RETURN_NOT_OK(max_decimal_metadata.Update(object));
if (!internal::PandasObjectIsNull(object)) {

This comment has been minimized.

Copy link
@pitrou

pitrou Apr 10, 2018

Contributor

Do we care about accepting other NULL-like objects such as float('nan')? Otherwise object != Py_None is a much faster check.

This comment has been minimized.

Copy link
@BryanCutler

BryanCutler Apr 10, 2018

Author Member

I'm not sure, is it possible to get NaNs from operations on Decimals? Or is that something the user might mix in somehow?

This comment has been minimized.

Copy link
@cpcloud

cpcloud Apr 10, 2018

Contributor

Python decimal objects can be nan, unfortunately:

>>> import decimal
>>> decimal.Decimal('nan')
Decimal('NaN')

This comment has been minimized.

Copy link
@BryanCutler

BryanCutler Apr 10, 2018

Author Member

Seems like it could be NaN also:

In [5]: s1 = pd.Series([Decimal('1.0'), Decimal('2.0')])

In [6]: s2 = pd.Series([Decimal('2.0'), None])

In [7]: s1 / s2
Out[7]: 
0    0.5
1    NaN
dtype: object
RETURN_NOT_OK(max_decimal_metadata.Update(object));
if (!internal::PandasObjectIsNull(object)) {
RETURN_NOT_OK(max_decimal_metadata.Update(object));
}
}

type_ =

This comment has been minimized.

Copy link
@pitrou

pitrou Apr 10, 2018

Contributor

By the way, what happens here if all items are None? Do we have a test for that?

This comment has been minimized.

Copy link
@BryanCutler

BryanCutler Apr 10, 2018

Author Member

I'll add that

This comment has been minimized.

Copy link
@BryanCutler

BryanCutler Apr 10, 2018

Author Member

done

Decimal128 value;
RETURN_NOT_OK(internal::DecimalFromPythonDecimal(object, decimal_type, &value));
RETURN_NOT_OK(builder.Append(value));
} else if (is_decimal == 0 && internal::PandasObjectIsNull(object)) {

This comment has been minimized.

Copy link
@pitrou

pitrou Apr 10, 2018

Contributor

Same question as above: do we care about other NULL-like values than simply None?

series = pd.Series(data)
_check_series_roundtrip(series, type_=pa.decimal128(12, 5))

def test_decimal_with_None_infer_type(self):

This comment has been minimized.

Copy link
@pitrou

pitrou Apr 10, 2018

Contributor

Can you also check that the expected type is inferred?

This comment has been minimized.

Copy link
@BryanCutler

BryanCutler Apr 10, 2018

Author Member

sure

BryanCutler added 2 commits Apr 10, 2018
_check_series_roundtrip(s, type_=pa.binary())
# Infer type from bytearrays
_check_series_roundtrip(s)

This comment has been minimized.

Copy link
@pitrou

pitrou Apr 11, 2018

Contributor

But you should pass expected_pa_type here.

This comment has been minimized.

Copy link
@BryanCutler

BryanCutler Apr 11, 2018

Author Member

ooops, right - thanks for catching that!

@pitrou
pitrou approved these changes Apr 11, 2018
Copy link
Contributor

pitrou left a comment

LGTM.

@pitrou pitrou closed this in 9ad8602 Apr 12, 2018
@BryanCutler

This comment has been minimized.

Copy link
Member Author

BryanCutler commented Apr 12, 2018

Thanks @pitrou!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.