ARROW-9017: [C++][Python] Refactor scalar bindings #7519

kszucs · 2020-06-22T20:17:13Z

TODOs:

split PRs into two, one with simplified python to arrow conversions with benchmarks
implement union scalar on the C++ side
store the index value for dictionary scalar
more tests

Closes ARROW-9017, ARROW-9153

github-actions · 2020-06-22T20:31:46Z

https://issues.apache.org/jira/browse/ARROW-9153

python/pyarrow/tests/test_misc.py

jorisvandenbossche

Really nice!

Could we add an is_valid attribute to the Python scalar as well? Right now the only way to check for a null value is .as_py() is None?

Might not be related to the changes in this PR, but reviewing this triggered me to test scalar casting:

In [2]: s = pa.scalar(pd.Timestamp("2012-01-01")) 
   ...:  
   ...: import pyarrow.compute as pc 
   ...: pc.cast(s, pa.timestamp('ns'))  
../src/arrow/compute/kernels/scalar_cast_temporal.cc:130:  Check failed: (batch[0].kind()) == (Datum::ARRAY) 
...
Aborted (core dumped)

python/pyarrow/_dataset.pyx

python/pyarrow/tests/test_parquet.py

python/pyarrow/util.py

python/pyarrow/scalar.pxi

jorisvandenbossche · 2020-06-25T13:32:21Z

python/pyarrow/tests/test_scalars.py

+
+def test_null_equality():
+    assert (pa.NA == pa.NA) is pa.NA
+    assert (pa.NA == 1) is pa.NA


I don't know how far we want to go in fully working out the scalars (so this can certainly be a follow-up), but the typed null scalars (as opposed to pa.NA) should probably behave the same as pa.NA, e.g. when it comes to equality: pa.NA == 1 gives pa.NA, but pa.scalar(None, pa.int64()) == 1 gives False.

Great question, I assume it should follow the same semantics as Null does.

github-actions · 2020-06-25T13:46:56Z

https://issues.apache.org/jira/browse/ARROW-9017

kszucs · 2020-06-25T14:37:54Z

Really nice!

Could we add an is_valid attribute to the Python scalar as well? Right now the only way to check for a null value is .as_py() is None?

Definitely!

Might not be related to the changes in this PR, but reviewing this triggered me to test scalar casting:

In [2]: s = pa.scalar(pd.Timestamp("2012-01-01")) 
   ...:  
   ...: import pyarrow.compute as pc 
   ...: pc.cast(s, pa.timestamp('ns'))  
../src/arrow/compute/kernels/scalar_cast_temporal.cc:130:  Check failed: (batch[0].kind()) == (Datum::ARRAY) 
...
Aborted (core dumped)

I think the cast kernel doesn't support scalars yet. On the other hand, the scalars have a custom CastTo implementation, which we may want to remove in favor of compute.Cast?

cc @bkietz

kszucs · 2020-06-26T12:21:00Z

Could we add an is_valid attribute to the Python scalar as well? Right now the only way to check for a null value is .as_py() is None?

Added.

kszucs · 2020-06-26T12:43:44Z

I'm considering applying cython.freelist to the scalar extension classes.

@wesm @pitrou @jorisvandenbossche do you have any positive experience with it?

I assume we should measure its impact before using it, so I can defer it to a follow-up.

pitrou

Very nice work. Two pain points:

you should add tests for the C++ changes
the null equality tests look like a nuisance for regular Python usage (__eq__ should return a boolean)

python/pyarrow/includes/libarrow.pxd

python/pyarrow/lib.pxd

python/pyarrow/scalar.pxi

python/pyarrow/tests/test_scalars.py

pitrou · 2020-06-30T13:54:21Z

Ok, at a quick glance, it seems that null container tests work properly regardless:

>>> s = set()
>>> s.add(pa.scalar(None))
>>> s
{<pyarrow.NullScalar: None>}
>>> pa.scalar(None) in s
True
>>> s.add(pa.scalar(None, pa.int64()))
>>> s.add(pa.scalar(12, pa.int64()))
>>> s
{<pyarrow.Int64Scalar: 12>,
 <pyarrow.NullScalar: None>,
 <pyarrow.Int64Scalar: None>}
>>> pa.scalar(None, pa.int64()) in s
True
>>> pa.scalar(None, pa.int32()) in s
False

>>> l = [pa.scalar(None)]
>>> pa.scalar(None) in l
True
>>> pa.scalar(None, pa.int64()) in l
False

jorisvandenbossche · 2020-07-01T07:31:01Z

the null equality tests look like a nuisance for regular Python usage (__eq__ should return a boolean)

The reason those scalars return null on an equality check, rather than True/False, is to ensure consistent behaviour between arrays and scalars (so that (arr1 == arr2)[0] and arr1[0] == arr2[0] give the same answer).
Our element-wise equality propagates null, so IMO the scalar comparison should as well.
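The consistency argument can be illustrated with a minimal pure-Python sketch. It does not use pyarrow: `eq_scalar` and `eq_elementwise` are hypothetical helpers, and `None` stands in for a null value.

```python
def eq_scalar(a, b):
    """Null-propagating equality: the result is null if either side is null."""
    if a is None or b is None:
        return None
    return a == b


def eq_elementwise(xs, ys):
    """Element-wise equality over two equal-length lists (hypothetical helper)."""
    return [eq_scalar(x, y) for x, y in zip(xs, ys)]


arr1 = [1, None, 3]
arr2 = [1, 2, 4]

compared = eq_elementwise(arr1, arr2)

# Comparing first and then indexing agrees with indexing first and then
# comparing, including at the null position.
for i in range(len(arr1)):
    assert compared[i] == eq_scalar(arr1[i], arr2[i])
```

If the scalar comparison returned False at the null position instead, the two orders of operation would disagree for index 1.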

pitrou · 2020-07-01T07:36:30Z

The problem is that it breaks Python semantics in potentially annoying places:

>>> import pyarrow as pa
>>> na = pa.scalar(None)
>>> na in [5]
True
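A pure-Python sketch of why this happens, assuming the null scalar's `__eq__` returns a null-like object instead of a bool. `NullLike` is a hypothetical stand-in, not the pyarrow class.

```python
class NullLike:
    """Hypothetical stand-in for a null scalar whose == propagates null."""

    def __eq__(self, other):
        # Propagate "null" instead of returning True or False.
        return self

    def __hash__(self):
        return 0

    def __repr__(self):
        return "NA"


na = NullLike()

# list.__contains__ compares with == and then takes the truth value of the
# result. NullLike defines neither __bool__ nor __len__, so the returned
# object is truthy and the membership test succeeds.
found = na in [5]
print(found)
```

Any code path that implicitly converts the comparison result to a bool (`in`, `if`, `list.index`, `list.count`) is affected in the same way.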

jorisvandenbossche · 2020-07-01T07:38:51Z

We could "fix" that one by raising in __bool__ (meaning: it will at least give an error instead of silently returning a wrong answer)

pitrou · 2020-07-01T07:53:12Z

Yes, we could. That may have other annoying implications, though (such as __contains__ not working anymore). I've started a ML discussion.
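A sketch of the trade-off discussed in this exchange, again in pure Python with a hypothetical `StrictNull` class rather than the real pyarrow scalar: raising in `__bool__` turns the silent wrong answer into an error, but membership tests against other values stop working.

```python
class StrictNull:
    """Hypothetical null scalar that refuses to be used as a boolean."""

    def __eq__(self, other):
        # Still propagate null on comparison.
        return self

    def __hash__(self):
        return 0

    def __bool__(self):
        raise TypeError("truth value of a null scalar is ambiguous")


na = StrictNull()

try:
    na in [5]
    outcome = "no error"
except TypeError:
    outcome = "TypeError"

# Containers check identity before calling ==, so looking up the very same
# object still succeeds without ever evaluating __bool__.
same_object = na in [na]
```

Here `outcome` ends up as `"TypeError"` while `same_object` is `True`, so containment only works when the container holds the identical object.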

python/pyarrow/scalar.pxi

jorisvandenbossche · 2020-07-01T08:01:55Z

python/pyarrow/scalar.pxi

+        return str(self.as_py())
+
+    def equals(self, Scalar other):
+        return self.wrapped.get().Equals(other.unwrap().get()[0])


Using the C++ Equals for == might give some potentially "surprising" behaviours, eg pa.scalar(1, type=pa.int64()) == pa.scalar(1, type=pa.int32()) gives False (while pa.scalar(1, type=pa.int64()) == 1 or pa.scalar(1, type=pa.int64()) == np.int32(1) gives True).

Although, also for arrays this currently doesn't work. But for arrays it gives an error about no kernel matching the types (which is more informative than "silently" giving False)

Removed the optional .as_py() conversion from the equality check, so now a scalar can be equal to another scalar only.

Removed the optional .as_py() conversion from the equality check, so now a scalar can be equal to another scalar only.

Sounds good!
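A minimal sketch of the strict comparison described in this thread, in pure Python. `TypedScalar` and its `type_name` field are hypothetical and only mirror the idea that both the type and the value must match; this is not the C++ `Equals` implementation.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class TypedScalar:
    """Hypothetical typed scalar: a type name plus a Python value."""

    type_name: str
    value: Any

    def equals(self, other):
        # Only another scalar of the same type with the same value is equal;
        # plain Python values are never considered equal.
        if not isinstance(other, TypedScalar):
            return False
        return self.type_name == other.type_name and self.value == other.value


a = TypedScalar("int64", 1)
b = TypedScalar("int32", 1)
c = TypedScalar("int64", 1)

same_type = a.equals(c)        # same type, same value
different_type = a.equals(b)   # same value, different type
plain_value = a.equals(1)      # not a scalar at all
```

With this rule `same_type` is `True`, while `different_type` and `plain_value` are both `False`.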

jorisvandenbossche · 2020-07-01T08:11:34Z

python/pyarrow/scalar.pxi

+    -----
+    Localized timestamps will currently be returned as UTC (pandas's native
+    representation). Timezone-naive data will be implicitly interpreted as
+    UTC.


I don't think this is fully correct? (or at least a bit misleading/confusing)
Timezone-naive data simply have no timezone, I am not aware of places in arrow where we implicitly assume UTC?

Removing. Copied from pa.array()'s docstring, so we may need to update it there as well.

kszucs · 2020-07-02T14:38:05Z

Addressed the review comments and changed the equality checks to be strict, only allowing scalars to be compared with scalars of the same type. One exception could be comparing invalid scalar values to pa.NA, but even if we choose that behavior, it should be implemented on the C++ side.

@pitrou @jorisvandenbossche it should be ready for another review.

jorisvandenbossche

For me this is fine to go. I think @pitrou also asked for some C++ tests?

python/pyarrow/tests/test_array.py

wesm

This looks okay to me. It does look like adding a small amount of C++ test coverage for the things that changed there would be good.

kszucs · 2020-07-06T13:22:57Z

@pitrou added the C++ tests.

pitrou · 2020-07-06T13:26:52Z

Looks like the PR needs rebasing and fixing for the latest union changes.

kszucs · 2020-07-06T15:29:38Z

@pitrou updated. I assume now we propagate the null values from the selected child array, please double check the union tests.

pitrou

Thank you very much! I haven't reviewed the Python changes again, so the following comments are only about the C++ changes.

cpp/src/arrow/scalar.cc

cpp/src/arrow/scalar_test.cc

pitrou · 2020-07-06T15:47:09Z

cpp/src/arrow/scalar_test.cc

+  ASSERT_TRUE(third->Equals(scalar_beta));
+
+  ASSERT_OK_AND_ASSIGN(auto fourth, arr.GetScalar(3));
+  ASSERT_TRUE(fourth->Equals(MakeNullScalar(ty)));


Hmm... interesting. Shouldn't it be MakeNullScalar(utf8())?

I think GetScalar() should return a Scalar with the same type as the Array, so for unions it should carry the union type rather than the child type selected by the type id (the union scalar's value is another scalar, of the type selected by the type id).

I don't have a strong opinion on this though.

Ah, I see. Yes, you're probably right. A pity we lose information about the value type, then.

We have the value type in UnionScalar.value.type.

Ah, perhaps check that in the test?

So for UnionScalar we have two is_valid flags, one for the union scalar and one for the underlying value scalar. Currently MakeNullScalar() sets the validity flag for the union scalar and leaves the value scalar uninitialized.

What do you think, shall we initialize the value scalar as well?

If the value scalar is not a nullptr, then it would seem better, unless that's complicated.

We can ensure the value scalar has an explicit type when it comes from array.GetScalar(), but not when a null union scalar is constructed directly, so MakeNullScalar(union_type) won't know which type to choose for the underlying value scalar.
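The asymmetry between the two construction paths can be sketched in pure Python. Everything here is hypothetical (`ValueScalar`, `UnionScalarSketch`, `make_null_union`, `get_scalar_from_array`); it only models the two validity flags discussed above and is not the Arrow C++ API.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class ValueScalar:
    """Hypothetical child scalar held inside a union scalar."""

    type_name: str
    value: Any = None
    is_valid: bool = False


@dataclass
class UnionScalarSketch:
    """Hypothetical union scalar with its own validity flag."""

    union_type: Tuple[str, ...]
    value: Optional[ValueScalar] = None
    is_valid: bool = False


def make_null_union(union_type):
    # Constructed directly: no child type can be chosen, so the inner
    # value scalar stays unset.
    return UnionScalarSketch(union_type)


def get_scalar_from_array(union_type, child_type, py_value):
    # Taken from an array slot: the type id selects the child type, so the
    # inner scalar can always carry an explicit type, even when it is null.
    inner = ValueScalar(child_type, py_value, py_value is not None)
    return UnionScalarSketch(union_type, inner, inner.is_valid)


ty = ("int64", "utf8")
null_union = make_null_union(ty)
from_array = get_scalar_from_array(ty, "utf8", None)
```

In this sketch `from_array.value.type_name` is still `"utf8"` for a null slot, whereas `null_union.value` is `None` because no child type was available.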

cpp/src/arrow/scalar_test.cc

wesm · 2020-07-06T22:15:26Z

+1. Thanks @kszucs for this work!

kszucs force-pushed the ARROW-9153 branch from 2a74140 to f55343a Compare June 24, 2020 20:35

kszucs changed the title ~~ARROW-9153: [C++][Python] Refactor scalar bindings [WIP]~~ ARROW-9153: [C++][Python] Refactor scalar bindings Jun 24, 2020

kszucs commented Jun 24, 2020

python/pyarrow/tests/test_misc.py

jorisvandenbossche reviewed Jun 25, 2020

jorisvandenbossche changed the title ~~ARROW-9153: [C++][Python] Refactor scalar bindings~~ ARROW-9017: [C++][Python] Refactor scalar bindings Jun 25, 2020

kszucs force-pushed the ARROW-9153 branch from 47b1d7e to aaf2c00 Compare June 26, 2020 12:16

kszucs requested a review from pitrou June 30, 2020 11:50

pitrou requested changes Jun 30, 2020

kszucs force-pushed the ARROW-9153 branch from 977c4cc to 51ff10d Compare June 30, 2020 17:24

jorisvandenbossche reviewed Jul 1, 2020

fsaintjacques mentioned this pull request Jul 2, 2020

ARROW-9108: [C++][Dataset] Add supports for missing type in Statistics to Scalar conversion #7623

Closed

jorisvandenbossche approved these changes Jul 3, 2020

python/pyarrow/tests/test_array.py

wesm reviewed Jul 3, 2020

kszucs added 4 commits July 6, 2020 17:02

refactor scalars

1193587

fic decimal and dictionary scalars

ee102e4

union

fce3844

deprecations

72152cb

kszucs added 19 commits July 6, 2020 17:02

address review comments

fac8f2e

update dictionary type to support the new dictionaly scalars

e0dae6c

KeyError

66c01a8

numpy test

974590b

remove magical comparison

39035b0

remove commented code

ed77d89

use base binary scalar

99e4abc

base list scalar

cd78337

support converting from pandas values

72e5dec

fix test cases

68b19d4

remove eq overload from nullscalar

494fd3a

fix gandiva tests

ba13538

force numpy type in test

8b5e148

remove leftovers

6a8ee03

GetScalar tests

3cfb993

tests

b856240

dense

a294d67

add DictionaryScala::GetEncodedValue

a5f3ea4

update union changes

6329d78

kszucs force-pushed the ARROW-9153 branch from 8140e16 to 6329d78 Compare July 6, 2020 15:29

pitrou reviewed Jul 6, 2020

kszucs added 3 commits July 6, 2020 18:05

address review comments

1e6aba3

check union scalars' underlying scalar value

ea56b10

fix ubsan

de11dd5

wesm closed this in 98f10dc Jul 6, 2020

This was referenced Jul 6, 2020

[Python] Refactor the Scalar classes #25134

Closed

[Python] Add bindings for StructScalar #25260

Closed

[Python] Revert Array.equals changes + expose comparison ops in compute #25518

Closed
