ARROW-9017: [C++][Python] Refactor scalar bindings #7519

Closed
wants to merge 33 commits into from

kszucs
Member

@kszucs kszucs commented Jun 22, 2020

TODOs:

  • split PRs into two, one with simplified python to arrow conversions with benchmarks
  • implement union scalar on the C++ side
  • store the index value for dictionary scalar
  • more tests

Closes ARROW-9017, ARROW-9153

@kszucs kszucs changed the title from "ARROW-9153: [C++][Python] Refactor scalar bindings" to "[WIP] ARROW-9153: [C++][Python] Refactor scalar bindings" Jun 24, 2020
Member

@jorisvandenbossche jorisvandenbossche left a comment

Really nice!

Could we add an is_valid attribute to the Python scalar as well? Right now the only way to check for a null value is .as_py() is None.

Might not be related to the changes in this PR, but reviewing this triggered me to test scalar casting:

In [2]: s = pa.scalar(pd.Timestamp("2012-01-01")) 
   ...:  
   ...: import pyarrow.compute as pc 
   ...: pc.cast(s, pa.timestamp('ns'))  
../src/arrow/compute/kernels/scalar_cast_temporal.cc:130:  Check failed: (batch[0].kind()) == (Datum::ARRAY) 
...
Aborted (core dumped)

def test_null_equality():
    assert (pa.NA == pa.NA) is pa.NA
    assert (pa.NA == 1) is pa.NA
Member

I don't know to what extent we want to fully work out the scalars (so this can certainly be a follow-up), but the typed null scalars (not pa.NA) should probably behave the same as pa.NA, e.g. when it comes to equality (pa.NA == 1 gives pa.NA, but pa.scalar(None, pa.int64()) == 1 gives False).

Member Author

Great question, I assume it should follow the same semantics as Null does.

@jorisvandenbossche jorisvandenbossche changed the title from "ARROW-9153: [C++][Python] Refactor scalar bindings" to "ARROW-9017: [C++][Python] Refactor scalar bindings" Jun 25, 2020
@kszucs
Member Author

kszucs commented Jun 25, 2020

Really nice!

Could we add an is_valid attribute to the Python scalar as well? Right now the only way to check for a null value is .as_py() is None.

Definitely!

Might not be related to the changes in this PR, but reviewing this triggered me to test scalar casting:

In [2]: s = pa.scalar(pd.Timestamp("2012-01-01")) 
   ...:  
   ...: import pyarrow.compute as pc 
   ...: pc.cast(s, pa.timestamp('ns'))  
../src/arrow/compute/kernels/scalar_cast_temporal.cc:130:  Check failed: (batch[0].kind()) == (Datum::ARRAY) 
...
Aborted (core dumped)

I think the cast kernel doesn't support scalars yet. On the other hand, the scalars have a custom CastTo implementation, which we may want to remove in favor of compute.Cast?

cc @bkietz

@kszucs
Member Author

kszucs commented Jun 26, 2020

Could we add an is_valid attribute to the Python scalar as well? Right now the only way to check for a null value is .as_py() is None.

Added.

@kszucs
Member Author

kszucs commented Jun 26, 2020

I'm considering applying cython.freelist to the scalar extension classes.

@wesm @pitrou @jorisvandenbossche do you have any positive experience with it?

I assume we should measure its impact before using it, so I can defer it to a follow-up.

@kszucs kszucs requested a review from pitrou June 30, 2020 11:50
Member

@pitrou pitrou left a comment

Very nice work. Two pain points:

  1. you should add tests for the C++ changes
  2. the null equality tests look like a nuisance for regular Python usage (__eq__ should return a boolean)

@pitrou
Member

pitrou commented Jun 30, 2020

Ok, at a quick glance, it seems that null container tests work properly regardless:

>>> s = set()                                                                                                                                                                  
>>> s.add(pa.scalar(None))                                                                                                                                                     
>>> s                                                                                                                                                                          
{<pyarrow.NullScalar: None>}
>>> pa.scalar(None) in s                                                                                                                                                       
True
>>> s.add(pa.scalar(None, pa.int64()))                                                                                                                                         
>>> s.add(pa.scalar(12, pa.int64()))                                                                                                                                           
>>> s                                                                                                                                                                          
{<pyarrow.Int64Scalar: 12>,
 <pyarrow.NullScalar: None>,
 <pyarrow.Int64Scalar: None>}
>>> pa.scalar(None, pa.int64()) in s                                                                                                                                           
True
>>> pa.scalar(None, pa.int32()) in s                                                                                                                                           
False
>>> l = [pa.scalar(None)]                                                                                                                                                      
>>> pa.scalar(None) in l                                                                                                                                                       
True
>>> pa.scalar(None, pa.int64()) in l                                                                                                                                           
False

@jorisvandenbossche
Member

the null equality tests look like a nuisance for regular Python usage (__eq__ should return a boolean)

The reason those scalars return null on an equality check, rather than True/False, is to ensure consistent behaviour between arrays and scalars (so that (arr1 == arr2)[0] and arr1[0] == arr2[0] give the same answer).
Our element-wise equality propagates null, and so IMO the scalar should too.
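
To make the consistency argument concrete, here is a minimal pure-Python sketch. It is not pyarrow code: `NA`, `scalar_eq`, and `elementwise_eq` are hypothetical names, and `None` stands in for a null element. It only shows the property being asked for, namely that indexing the element-wise result and comparing the indexed elements agree.

```python
# Hypothetical stand-ins, not pyarrow API: None marks a null element and
# NA is the sentinel used for a null comparison result.
NA = None


def scalar_eq(left, right):
    """Compare two scalar values, propagating null instead of returning a bool."""
    if left is None or right is None:
        return NA
    return left == right


def elementwise_eq(arr1, arr2):
    """Element-wise equality over two equal-length lists, propagating null."""
    return [scalar_eq(a, b) for a, b in zip(arr1, arr2)]


arr1 = [None, 1, 2]
arr2 = [1, 1, 3]

result = elementwise_eq(arr1, arr2)

# (arr1 == arr2)[i] and arr1[i] == arr2[i] agree at every position,
# including the null one.
consistent = all(
    result[i] is scalar_eq(arr1[i], arr2[i]) for i in range(len(arr1))
)
```

If the scalar comparison returned False for the null slot instead, the first position would disagree with the element-wise result.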

@pitrou
Member

pitrou commented Jul 1, 2020

The problem is that it breaks Python semantics in potentially annoying places:

>>> import pyarrow as pa                                                                                                                                                 
>>> na = pa.scalar(None)                                                                                                                                                 
>>> na in [5]                                                                                                                                                            
True
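
A rough pure-Python sketch of why this happens, assuming the null result is an ordinary object that does not define `__bool__` (the class names below are hypothetical and are not the pyarrow implementation): the list membership test takes the truth value of whatever `==` returns, and a plain object is truthy.

```python
class NullResult:
    """Hypothetical null-like comparison result; it defines no __bool__, so it is truthy."""


NA = NullResult()


class NullPropagatingScalar:
    """Hypothetical scalar whose == returns NA rather than a bool."""

    def __eq__(self, other):
        return NA

    # Defining __eq__ disables the inherited __hash__; restore identity hashing.
    __hash__ = object.__hash__


na = NullPropagatingScalar()

# The list membership test compares each item by identity, then by ==, and
# takes the truth value of the == result. NA is truthy, so this is True.
found = na in [5]
```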

@jorisvandenbossche
Member

We could "fix" that one by raising in __bool__ (meaning: it will at least give an error instead of silently returning a wrong answer)

@pitrou
Member

pitrou commented Jul 1, 2020

Yes, we could. That may have other annoying implications, though (such as __contains__ not working anymore). I've started a ML discussion.
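
The trade-off discussed here can be sketched in plain Python. This is an illustration with hypothetical classes, not what pyarrow does: once the null result raises in `__bool__`, a membership test against a different object raises instead of silently returning True, while a test against the very same object still succeeds because lists check identity first.

```python
class StrictNull:
    """Hypothetical null result that refuses to be coerced to a bool."""

    def __bool__(self):
        raise TypeError("truth value of a null comparison is ambiguous")


NA = StrictNull()


class NullPropagatingScalar:
    """Hypothetical scalar whose == returns NA rather than a bool."""

    def __eq__(self, other):
        return NA

    # Defining __eq__ disables the inherited __hash__; restore identity hashing.
    __hash__ = object.__hash__


na = NullPropagatingScalar()

# Membership against another object needs bool(na == 5), which now raises.
try:
    na in [5]
    outcome = "returned"
except TypeError:
    outcome = "raised"

# Membership against the same object short-circuits on identity and
# never evaluates the truth value of ==.
same_object_found = na in [na]
```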

        return str(self.as_py())

    def equals(self, Scalar other):
        return self.wrapped.get().Equals(other.unwrap().get()[0])
Member

Using the C++ Equals for == might give some potentially "surprising" behaviours, eg pa.scalar(1, type=pa.int64()) == pa.scalar(1, type=pa.int32()) gives False (while pa.scalar(1, type=pa.int64()) == 1 or pa.scalar(1, type=pa.int64()) == np.int32(1) gives True).

Member

Although, also for arrays this currently doesn't work. But for arrays it gives an error about no kernel matching the types (which is more informative than "silently" giving False)

Member Author

Removed the optional .as_py() conversion from the equality check, so now a scalar can be equal to another scalar only.

Member

Removed the optional .as_py() conversion from the equality check, so now a scalar can be equal to another scalar only.

Sounds good!

-----
Localized timestamps will currently be returned as UTC (pandas's native
representation). Timezone-naive data will be implicitly interpreted as
UTC.
Member

I don't think this is fully correct? (or at least a bit misleading/confusing)
Timezone-naive data simply have no timezone, I am not aware of places in arrow where we implicitly assume UTC?

Member Author

Removing. Copied from pa.array()'s docstring, so we may need to update it there as well.

@kszucs
Member Author

kszucs commented Jul 2, 2020

Addressed the review comments and changed the equality checks to be strict, allowing comparison only between scalars of the same type. One exception could be comparing invalid scalar values to pa.NA, but even if we choose that behavior, it should be implemented on the C++ side.
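
As a rough illustration of what "strict" means here, the following pure-Python sketch uses a hypothetical `TypedScalar` class (not the pyarrow `Scalar`). Returning `NotImplemented` for non-scalar operands is an assumption made for the sketch, not something this thread confirms about the actual bindings.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True, eq=False)
class TypedScalar:
    """Hypothetical stand-in for a typed scalar; not the pyarrow class."""

    type_name: str
    value: Any

    def equals(self, other: "TypedScalar") -> bool:
        # Strict check: both the type and the value must match.
        return self.type_name == other.type_name and self.value == other.value

    def __eq__(self, other):
        # Assumption for this sketch: plain Python values never compare equal.
        if not isinstance(other, TypedScalar):
            return NotImplemented
        return self.equals(other)

    def __hash__(self):
        return hash((self.type_name, self.value))


same = TypedScalar("int64", 1) == TypedScalar("int64", 1)
different_type = TypedScalar("int64", 1) == TypedScalar("int32", 1)
against_python_int = TypedScalar("int64", 1) == 1
```

With this shape, `1 == TypedScalar("int64", 1)` also evaluates to False, because both operands decline the comparison and Python falls back to identity.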

@pitrou @jorisvandenbossche it should be ready for another review.

Member

@jorisvandenbossche jorisvandenbossche left a comment

For me this is fine to go. I think @pitrou also asked for some C++ tests?

Member

@wesm wesm left a comment

This looks okay to me. It does look like adding a small amount of C++ test coverage for the things that changed there would be good.

@kszucs
Member Author

kszucs commented Jul 6, 2020

@pitrou added the C++ tests.

@pitrou
Member

pitrou commented Jul 6, 2020

Looks like the PR needs rebasing and fixing for the latest union changes.

@kszucs
Member Author

kszucs commented Jul 6, 2020

@pitrou updated. I assume we now propagate the null values from the selected child array; please double check the union tests.

Member

@pitrou pitrou left a comment

Thank you very much! I haven't reviewed the Python changes again, so the following comments are only about the C++ changes.

ASSERT_TRUE(third->Equals(scalar_beta));

ASSERT_OK_AND_ASSIGN(auto fourth, arr.GetScalar(3));
ASSERT_TRUE(fourth->Equals(MakeNullScalar(ty)));
Member

Hmm... interesting. Shouldn't it be MakeNullScalar(utf8())?

Member Author

@kszucs kszucs Jul 6, 2020

I think GetScalar() should return a Scalar with the same type as the Array. For unions, that means the union type rather than the type selected by the union type id (the union scalar's value is itself a scalar, of the type selected by the type id).

I don't have a strong opinion on this though.

Member

Ah, I see. Yes, you're probably right. A pity we lose information about the value type, then.

Member Author

We have the value type in UnionScalar.value.type.

Member

Ah, perhaps check that in the test?

Member Author

Adding.

Member Author

So for UnionScalar we have two is_valid flags, one for the union scalar and one for the underlying value scalar. Currently MakeNullScalar() sets the validity flag for the union scalar and leaves the value scalar uninitialized.

What do you think, shall we initialize the value scalar as well?

Member

If the value scalar is not a nullptr, then it would seem better, unless that's complicated.

Member Author

We can ensure the value scalar has an explicit type when it comes from array.GetScalar(), but not when a null union scalar is constructed directly, so MakeNullScalar(union_type) won't know which type to choose for the underlying value scalar.
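
The two construction paths discussed in this thread can be modelled with a small Python sketch. All names below (`ValueScalarSketch`, `UnionScalarSketch`, the two helper functions, and the type strings) are hypothetical and only model the data layout. They are not the C++ `UnionScalar` or any pyarrow API.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ValueScalarSketch:
    """Hypothetical inner scalar carrying the child type and its own validity."""

    type_name: str
    value: Any = None
    is_valid: bool = False


@dataclass
class UnionScalarSketch:
    """Hypothetical union scalar: its own validity flag plus a nested value scalar."""

    union_type: str
    value: Optional[ValueScalarSketch]
    is_valid: bool


def null_from_array_slot(union_type: str, child_type: str) -> UnionScalarSketch:
    # Coming from an array slot, the selected child type is known, so the
    # nested value scalar can be a typed null.
    child = ValueScalarSketch(type_name=child_type)
    return UnionScalarSketch(union_type, value=child, is_valid=False)


def null_from_type_only(union_type: str) -> UnionScalarSketch:
    # With only the union type there is no way to pick a child type,
    # so the nested value scalar is left unset.
    return UnionScalarSketch(union_type, value=None, is_valid=False)


from_slot = null_from_array_slot("union<utf8, int64>", "utf8")
from_type = null_from_type_only("union<utf8, int64>")
```

In this model the value type survives on `from_slot.value.type_name`, while `from_type.value` stays empty, which is the asymmetry described above.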

@wesm
Member

wesm commented Jul 6, 2020

+1. Thanks @kszucs for this work!
