Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] ListArray.from_arrays does not check validity of input arrays #22527

Closed
asfimport opened this issue Aug 5, 2019 · 8 comments
Closed

Comments

@asfimport
Copy link

From #4979 (comment).

When creating a ListArray from offsets and values in python, there is no validation of the offsets that it starts with 0 and ends with the length of the array (but is that required? the docs seem to indicate that: https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type ("The first value in the offsets array is 0, and the last element is the length of the values array.").

The array you get "seems" ok (the repr), but on conversion to python or flattened arrays, things go wrong:

In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 

In [62]: a
Out[62]: 
<pyarrow.lib.ListArray object at 0x7fdd9c468678>
[
  [
    1,
    2
  ],
  [
    3,
    4
  ]
]

In [63]: a.flatten()
Out[63]: 
<pyarrow.lib.Int64Array object at 0x7fdd9cbfe9e8>
[
  0,   # <--- includes the 0
  1,
  2,
  3,
  4
]

In [64]: a.to_pylist()
Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <--includes more elements as garbage

Calling validate manually correctly raises:

In [65]: a.validate()
...
ArrowInvalid: Final offset invariant not equal to values length: 10!=5

In C++ the main constructors are not safe, and as the caller you need to ensure that the data is correct or call a safe (slower) constructor. But do we want to use the unsafe / fast constructors without validation in Python as default as well? Or should we do a call to validate here?

A quick search seems to indicate that pa.Array.from_buffers does validation, but other from_arrays method don't seem to explicitly do this.

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Joris Van den Bossche / @jorisvandenbossche

PRs and other links:

Note: This issue was originally created as ARROW-6132. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
@wesm @xhochy What do you think? I think there is value in spending a few CPU cycles validating inputs in the Python APIs.

@asfimport
Copy link
Author

Wes McKinney / @wesm:
I agree, I think adding validation here is OK

@asfimport
Copy link
Author

Uwe Korn / @xhochy:
+1, not getting segfaults or delayed errors on Python APIs is essential.

@asfimport
Copy link
Author

Joris Van den Bossche / @jorisvandenbossche:
DictionaryArray.from_arrays has a safe=True/False argument (with safe=True as default) to disable the validity checking.
Although it is not exactly the same under the hood (DictionaryArray does not use the ValidateArray method), for users it is similar functionality, so I could also add such a keyword to ListArray.from_arrays as well (didn't do that yet in the PR #5029).

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
Well, DictionaryArray.from_arrays(safe=True) is really expensive: it will walk all indices and check they are within bounds. ValidateArray doesn't do such a thing.

@asfimport
Copy link
Author

Joris Van den Bossche / @jorisvandenbossche:
I was actually just planning to open an issue for that: should ValidateArray check the indices of a DictionaryArray?

Not knowingly in detail how ValidateArray is used internally and what its purpose is, I would expect that from the name of that function, but from your response it might not?

ValidateArray for a ListArray also does walk all offsets and check they are consistent.

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
My expectation is that ValidateArray is O(1), not O(n) in array size.

Perhaps we need a separate ValidateArrayData that digs deeper...

Edit: oh, you're right about ValidateArray(ListArray)...

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
Issue resolved by pull request 5029
#5029

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants