-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-15479 : [C++] Cast fixed size list to compatible fixed size list type (other values type, other field name) #14181
Conversation
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename pull request title in the following format?
or
See also: |
|
MacOS and Travis CI failures look irrelevant. |
LGTM, thank you for working on the issue! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
We generally add C++ tests as well, those would go in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_cast_test.cc
python/pyarrow/tests/test_compute.py
Outdated
# Different field name and different type. | ||
cast_type = pa.list_(pa.field("element", value_type), 2) | ||
fsl = pa.FixedSizeListArray.from_arrays( | ||
pa.array([1, 2, 3, 4, 5, 6]), 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
specify the type of the source array explicitly here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to test with nulls as well, both at the list level and the child array level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And one more thing, it would be good to test with non-default cast options in such a way that the child array cast fails if the options aren't passed properly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that I have addressed the points. (1st and 2nd in Python test and 3rd in C++ test)
Sorry, I am new to the project so still learning the jargon (and might have missed to address the point) :)
Thanks @lidavidm! Will try to address these by weekend (probably). |
@lidavidm have addressed mosts of the review. PTAL. Thank you :) (I don't think CI failures are relevant.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, just a couple final comments
} | ||
|
||
TEST(Cast, FSLToFSL) { | ||
CheckFSLToFSL({int32(), float32(), int64()}, "[[0, 1], [2, 3], [null, 5]]"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CheckFSLToFSL({int32(), float32(), int64()}, "[[0, 1], [2, 3], [null, 5]]"); | |
CheckFSLToFSL({int32(), float32(), int64()}, "[[0, 1], [2, 3], [null, 5], null]"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last null
would lead to an error as we test with fixed_size_list
of 2
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this should work? This is a null list not a list of nulls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit confused. Can you point to some documentation? Thank you very much :)!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are different:
>>> [[1, 2], [None, None]]
>>> [[1, 2], None]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
>>> pyarrow.array([[1, 2], None, [None, None]], pyarrow.list_(pyarrow.int64(), 2))
<pyarrow.lib.FixedSizeListArray object at 0x7ff44632bb20>
[
[
1,
2
],
null,
[
null,
null
]
]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing that. It actually seems to have caught a bug. (Not sure why this is happens)
arr = pa.array([[1, 2], [3, 4], [None, 6], None], pa.list_(pa.int64(), 2))
print(arr[1:])
v = arr[1:].cast(my_type)
print(v)
Results in:
[
[
3,
4
],
[
null,
6
],
null
]
[
[
3,
4
],
[
null,
6
],
[
null,
null # extra null
]
]
Surprisingly this doesn't happen when we have a 0
offset.
arr = pa.array([[1, 2], [3, 4], [None, 6], None], pa.list_(pa.int64(), 2))
print(arr)
v = arr.cast(my_type)
print(v)
[
[
1,
2
],
[
3,
4
],
[
null,
6
],
null
]
[
[
1,
2
],
[
3,
4
],
[
null,
6
],
null
]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, in that case, we should probably check what happens when the child array also has an offset…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left comments above. We should test for offsets here too.
|
||
TEST(Cast, FSLToFSL) { | ||
CheckFSLToFSL({int32(), float32(), int64()}, "[[0, 1], [2, 3], [null, 5]]"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also just CheckCastFails(...)
to test what happens with a different type in C++? (A lot of development doesn't run the Python tests, so it's faster to catch things in C++ where possible.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Have added a something similar to CheckCastFails
. (Couldn't use exact as we raise TypeError instead of Invalid).
} | ||
|
||
TEST(Cast, FSLToFSL) { | ||
CheckFSLToFSL({int32(), float32(), int64()}, "[[0, 1], [2, 3], [null, 5]]"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, in that case, we should probably check what happens when the child array also has an offset…
const ArraySpan& in_array = batch[0].array; | ||
std::shared_ptr<ArrayData> values = in_array.child_data[0].ToArrayData(); | ||
// Take care of data if input are is a view. | ||
values = values->Slice(in_array.offset * in_size, in_array.length * in_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to take into account the child array's offset, too
} | ||
|
||
TEST(Cast, FSLToFSL) { | ||
CheckFSLToFSL({int32(), float32(), int64()}, "[[0, 1], [2, 3], [null, 5]]"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left comments above. We should test for offsets here too.
ARROW_SCOPED_TRACE("src_type = ", src_type->ToString(), | ||
", dest_type = ", dest_type->ToString()); | ||
auto src_array = ArrayFromJSON(src_type, json_data); | ||
CheckCast(src_array, ArrayFromJSON(dest_type, json_data)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we should also test casting slices of the input and comparing it to slices of the output here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also slice the child array, construct the parent array (with nulls, etc.) and test that as well. That should probably be a separate test/helper function. There's a helper function called TweakValidityBit that is useful for these sorts of things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think CheckCast covers these interesting cases already (which ultimately calls the following function).
arrow/cpp/src/arrow/compute/kernels/test_util.cc
Lines 106 to 172 in 6cc37cf
void CheckScalar(std::string func_name, const DatumVector& inputs, Datum expected_datum, | |
const FunctionOptions* options) { | |
CheckScalarNonRecursive(func_name, inputs, expected_datum, options); | |
if (expected_datum.is_scalar()) return; | |
ASSERT_TRUE(expected_datum.is_array()) | |
<< "CheckScalar is only implemented for scalar/array expected values"; | |
auto expected = expected_datum.make_array(); | |
// check for at least 1 array, and make sure the others are of equal length | |
bool has_array = false; | |
for (const auto& input : inputs) { | |
if (input.is_array()) { | |
ASSERT_EQ(input.array()->length, expected->length()); | |
has_array = true; | |
} | |
} | |
ASSERT_TRUE(has_array) << "Must have at least 1 array input to have an array output"; | |
// Check all the input scalars | |
for (int64_t i = 0; i < expected->length(); ++i) { | |
CheckScalar(func_name, GetScalars(inputs, i), *expected->GetScalar(i), options); | |
} | |
// Since it's a scalar function, calling it on sliced inputs should | |
// result in the sliced expected output. | |
const auto slice_length = expected->length() / 3; | |
if (slice_length > 0) { | |
CheckScalarNonRecursive(func_name, SliceArrays(inputs, 0, slice_length), | |
expected->Slice(0, slice_length), options); | |
CheckScalarNonRecursive(func_name, SliceArrays(inputs, slice_length, slice_length), | |
expected->Slice(slice_length, slice_length), options); | |
CheckScalarNonRecursive(func_name, SliceArrays(inputs, 2 * slice_length), | |
expected->Slice(2 * slice_length), options); | |
} | |
// Should also work with an empty slice | |
CheckScalarNonRecursive(func_name, SliceArrays(inputs, 0, 0), expected->Slice(0, 0), | |
options); | |
// Ditto with ChunkedArray inputs | |
if (slice_length > 0) { | |
DatumVector chunked_inputs; | |
chunked_inputs.reserve(inputs.size()); | |
for (const auto& input : inputs) { | |
if (input.is_array()) { | |
auto ar = input.make_array(); | |
auto ar_chunked = std::make_shared<ChunkedArray>( | |
ArrayVector{ar->Slice(0, slice_length), ar->Slice(slice_length)}); | |
chunked_inputs.push_back(ar_chunked); | |
} else { | |
chunked_inputs.push_back(input.scalar()); | |
} | |
} | |
ArrayVector expected_chunks{expected->Slice(0, slice_length), | |
expected->Slice(slice_length)}; | |
ASSERT_OK_AND_ASSIGN(Datum out, | |
CallFunction(func_name, GetDatums(chunked_inputs), options)); | |
ValidateOutput(out); | |
auto chunked = out.chunked_array(); | |
(void)chunked; | |
AssertDatumsEqual(std::make_shared<ChunkedArray>(expected_chunks), out); | |
} | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have also added test with TweakValidityBit
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't cover a case like this:
auto children = ArrayFromJSON(int16(), "[1, 2, null, 4, 5, null]");
children = children->Slice(2);
// Omitting null bitmap here, but we should test that as well
auto fsl = std::make_shared<FixedSizeListArray>(fixed_size_list(int16(), 2), 2, children);
// fsl is the list [[null, 4], [5, null]]
e.g. in Python:
>>> pyarrow.FixedSizeListArray.from_arrays(pyarrow.array([1, 2, None, 4, 5, None]).slice(2), type=pyarrow.list_(pyarrow.int64(), 2))
<pyarrow.lib.FixedSizeListArray object at 0x7f44bc40f0a0>
[
[
null,
4
],
[
5,
null
]
]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reference, @lidavidm. Will update the tests soon :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have updated the test and hopefully covered the interesting case.
PTAL :)
Gentle ping @lidavidm! Thank you :) |
Benchmark runs are scheduled for baseline = 398519a and contender = f4e4f99. f4e4f99 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
…t type (other values type, other field name) (apache#14181) Add kernel for FSL to FSL casting. Authored-by: kshitij12345 <kshitijkalambarkar@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
Add kernel for FSL to FSL casting.