-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-9017: [C++][Python] Refactor scalar bindings #7519
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice!
Could we add a is_valid
attribute to the python scalar as well? Now the only way to check for a null value is to do .as_py() is None
?
Might not be related to the changes in this PR, but reviewing this triggered me to test scalar casting:
In [2]: s = pa.scalar(pd.Timestamp("2012-01-01"))
...:
...: import pyarrow.compute as pc
...: pc.cast(s, pa.timestamp('ns'))
../src/arrow/compute/kernels/scalar_cast_temporal.cc:130: Check failed: (batch[0].kind()) == (Datum::ARRAY)
...
Aborted (core dumped)
python/pyarrow/tests/test_scalars.py
Outdated
|
||
def test_null_equality(): | ||
assert (pa.NA == pa.NA) is pa.NA | ||
assert (pa.NA == 1) is pa.NA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know to what extent we want to fully work out the scalars (so can certainly be a follow-up), but so the typed null scalars (not pa.NA), should probably behave the same as pa.NA, eg when it comes to equality (pa.NA == 1
gives pa.NA, but pa.scalar(None, pa.int64()) == 1
gives False)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great question, I assume it should follow the same semantics as Null
does.
Definitely!
I think the cast kernel doesn't support scalars yet, on the other hand the scalars have custom cc @bkietz |
Added. |
I'm considering to apply cython.freelist on the scalar extension classes. @wesm @pitrou @jorisvandenbossche do you have any positive experience with it? I assume we should measure its impact before using it, so I can defer it to a follow-up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice work. Two pain points:
- you should add tests for the C++ changes
- the null equality tests look like a nuisance for regular Python usage (
__eq__
should return a boolean)
Ok, at a quick glance, it seems that null container tests work properly regardless: >>> s = set()
>>> s.add(pa.scalar(None))
>>> s
{<pyarrow.NullScalar: None>}
>>> pa.scalar(None) in s
True
>>> s.add(pa.scalar(None, pa.int64()))
>>> s.add(pa.scalar(12, pa.int64()))
>>> s
{<pyarrow.Int64Scalar: 12>,
<pyarrow.NullScalar: None>,
<pyarrow.Int64Scalar: None>}
>>> pa.scalar(None, pa.int64()) in s
True
>>> pa.scalar(None, pa.int32()) in s
False >>> l = [pa.scalar(None)]
>>> pa.scalar(None) in l
True
>>> pa.scalar(None, pa.int64()) in l
False |
The reason that those scalars return null on equality check and not True/False, it to ensure consistent behaviour between arrays and scalars (to ensure |
The problem is that it breaks Python semantics in potentially annoying places: >>> import pyarrow as pa
>>> na = pa.scalar(None)
>>> na in [5]
True |
We could "fix" that one by raising in |
Yes, we could. That may have other annoying implications, though (such as |
return str(self.as_py()) | ||
|
||
def equals(self, Scalar other): | ||
return self.wrapped.get().Equals(other.unwrap().get()[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the C++ Equals
for ==
might give some potentially "surprising" behaviours, eg pa.scalar(1, type=pa.int64()) == pa.scalar(1, type=pa.int32())
gives False (while pa.scalar(1, type=pa.int64()) == 1
or pa.scalar(1, type=pa.int64()) == np.int32(1)
gives True).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although, also for arrays this currently doesn't work. But for arrays it gives an error about no kernel matching the types (which is more informative than "silently" giving False)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the options .as_py()
conversion from the equality check, so now a scalar can be equal to another scalar only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the options .as_py() conversion from the equality check, so now a scalar can be equal to another scalar only.
Sounds good!
python/pyarrow/scalar.pxi
Outdated
----- | ||
Localized timestamps will currently be returned as UTC (pandas's native | ||
representation). Timezone-naive data will be implicitly interpreted as | ||
UTC. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is fully correct? (or at least a bit misleading/confusing)
Timezone-naive data simply have no timezone, I am not aware of places in arrow where we implicitly assume UTC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing. Copied from pa.array()
's docstring, so we mey need to update it there as well.
Addressed the review comments and changed the equality checks to be strict, only allowing to compare scalars with scalars of the same type. One exception could be to compare invalid scalar values to @pitrou @jorisvandenbossche it should be ready for another review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me this is fine to go. I think @pitrou also asked for some C++ tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks okay to me, it does look like adding a small amount of C++ test coverage for the things that changed there would be good
@pitrou added the C++ tests. |
Looks like the PR needs rebasing and fixing for the latest union changes. |
@pitrou updated. I assume now we propagate the null values from the selected child array, please double check the union tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much! I haven't reviewed the Python changes again, so the following comments are only about the C++ changes.
ASSERT_TRUE(third->Equals(scalar_beta)); | ||
|
||
ASSERT_OK_AND_ASSIGN(auto fourth, arr.GetScalar(3)); | ||
ASSERT_TRUE(fourth->Equals(MakeNullScalar(ty))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... interesting. Shouldn't it be MakeNullScalar(utf8())
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think GetScalar()
should return a Scalar with the same type as the Array, so in case of unions with the union type rather than the type selected by the union type id (the union scalar's value is another scalar with the type id).
I don't have a strong opinion on this though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. Yes, you're probably right. A pity we lose information about the value type, then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have the value type in UnionScalar.value.type
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, perhaps check that in the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So for UnionScalar we have two is_valid
flags, une for the union scalar and one for the underlying value scalar. Currently MakeNullScalar()
sets the validity flag for the union scalar and leaves the value scalar uninitialized.
What do you think, shall we initialize the value scalar as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It the value scalar is not a nullptr, then it would seem better, unless that's complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can ensure to have an explicit type to the value scalar from array.GetScalar()
but not when a null union scalar is directly constructed, so MakeNullScalar(union_type)
won't know which type to choose for the underlying value scalar.
+1. Thanks @kszucs for this work! |
TODOs:
Closes ARROW-9017, ARROW-9153