Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-15479 : [C++] Cast fixed size list to compatible fixed size list type (other values type, other field name) #14181

Merged
merged 16 commits into from
Sep 28, 2022

Conversation

kshitij12345
Copy link
Contributor

@kshitij12345 kshitij12345 commented Sep 20, 2022

Add kernel for FSL to FSL casting.

@github-actions
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@kshitij12345 kshitij12345 changed the title support casting fsl to fsl ARROW-15479 : [C++] Cast fixed size list to compatible fixed size list type (other values type, other field name) Sep 20, 2022
@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@kshitij12345 kshitij12345 marked this pull request as ready for review September 20, 2022 21:23
@kshitij12345
Copy link
Contributor Author

cc: @AlenkaF @jorisvandenbossche

@kshitij12345
Copy link
Contributor Author

MacOS and Travis CI failures look irrelevant.

@AlenkaF
Copy link
Member

AlenkaF commented Sep 21, 2022

LGTM, thank you for working on the issue!
Would need more C++ eyes on it though.
@lidavidm would you mind having a look at this PR please?

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

We generally add C++ tests as well, those would go in https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_cast_test.cc

# Different field name and different type.
cast_type = pa.list_(pa.field("element", value_type), 2)
fsl = pa.FixedSizeListArray.from_arrays(
pa.array([1, 2, 3, 4, 5, 6]), 2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

specify the type of the source array explicitly here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to test with nulls as well, both at the list level and the child array level

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And one more thing, it would be good to test with non-default cast options in such a way that the child array cast fails if the options aren't passed properly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that I have addressed the points. (1st and 2nd in Python test and 3rd in C++ test)

Sorry, I am new to the project so still learning the jargon (and might have missed to address the point) :)

@kshitij12345
Copy link
Contributor Author

kshitij12345 commented Sep 21, 2022

Thanks @lidavidm!

Will try to address these by weekend (probably).

@kshitij12345
Copy link
Contributor Author

@lidavidm have addressed mosts of the review. PTAL. Thank you :) (I don't think CI failures are relevant.)

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, just a couple final comments

}

TEST(Cast, FSLToFSL) {
CheckFSLToFSL({int32(), float32(), int64()}, "[[0, 1], [2, 3], [null, 5]]");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
CheckFSLToFSL({int32(), float32(), int64()}, "[[0, 1], [2, 3], [null, 5]]");
CheckFSLToFSL({int32(), float32(), int64()}, "[[0, 1], [2, 3], [null, 5], null]");

Copy link
Contributor Author

@kshitij12345 kshitij12345 Sep 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last null would lead to an error as we test with fixed_size_list of 2.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this should work? This is a null list not a list of nulls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit confused. Can you point to some documentation? Thank you very much :)!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are different:

>>> [[1, 2], [None, None]]
>>> [[1, 2], None]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>>> pyarrow.array([[1, 2], None, [None, None]], pyarrow.list_(pyarrow.int64(), 2))
<pyarrow.lib.FixedSizeListArray object at 0x7ff44632bb20>
[
  [
    1,
    2
  ],
  null,
  [
    null,
    null
  ]
]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that. It actually seems to have caught a bug. (Not sure why this is happens)

arr = pa.array([[1, 2], [3, 4], [None, 6], None], pa.list_(pa.int64(), 2))
print(arr[1:])
v = arr[1:].cast(my_type)
print(v)

Results in:


[
  [
    3,
    4
  ],
  [
    null,
    6
  ],
  null
]


[
  [
    3,
    4
  ],
  [
    null,
    6
  ],
  [
    null,
    null  # extra null
  ]
]

Surprisingly this doesn't happen when we have a 0 offset.

arr = pa.array([[1, 2], [3, 4], [None, 6], None], pa.list_(pa.int64(), 2))
print(arr)
v = arr.cast(my_type)
print(v)
[
  [
    1,
    2
  ],
  [
    3,
    4
  ],
  [
    null,
    6
  ],
  null
]

[
  [
    1,
    2
  ],
  [
    3,
    4
  ],
  [
    null,
    6
  ],
  null
]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, in that case, we should probably check what happens when the child array also has an offset…

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left comments above. We should test for offsets here too.


TEST(Cast, FSLToFSL) {
CheckFSLToFSL({int32(), float32(), int64()}, "[[0, 1], [2, 3], [null, 5]]");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also just CheckCastFails(...) to test what happens with a different type in C++? (A lot of development doesn't run the Python tests, so it's faster to catch things in C++ where possible.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Have added a something similar to CheckCastFails. (Couldn't use exact as we raise TypeError instead of Invalid).

}

TEST(Cast, FSLToFSL) {
CheckFSLToFSL({int32(), float32(), int64()}, "[[0, 1], [2, 3], [null, 5]]");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, in that case, we should probably check what happens when the child array also has an offset…

const ArraySpan& in_array = batch[0].array;
std::shared_ptr<ArrayData> values = in_array.child_data[0].ToArrayData();
// Take care of data if input are is a view.
values = values->Slice(in_array.offset * in_size, in_array.length * in_size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to take into account the child array's offset, too

}

TEST(Cast, FSLToFSL) {
CheckFSLToFSL({int32(), float32(), int64()}, "[[0, 1], [2, 3], [null, 5]]");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left comments above. We should test for offsets here too.

ARROW_SCOPED_TRACE("src_type = ", src_type->ToString(),
", dest_type = ", dest_type->ToString());
auto src_array = ArrayFromJSON(src_type, json_data);
CheckCast(src_array, ArrayFromJSON(dest_type, json_data));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we should also test casting slices of the input and comparing it to slices of the output here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also slice the child array, construct the parent array (with nulls, etc.) and test that as well. That should probably be a separate test/helper function. There's a helper function called TweakValidityBit that is useful for these sorts of things

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think CheckCast covers these interesting cases already (which ultimately calls the following function).

void CheckScalar(std::string func_name, const DatumVector& inputs, Datum expected_datum,
const FunctionOptions* options) {
CheckScalarNonRecursive(func_name, inputs, expected_datum, options);
if (expected_datum.is_scalar()) return;
ASSERT_TRUE(expected_datum.is_array())
<< "CheckScalar is only implemented for scalar/array expected values";
auto expected = expected_datum.make_array();
// check for at least 1 array, and make sure the others are of equal length
bool has_array = false;
for (const auto& input : inputs) {
if (input.is_array()) {
ASSERT_EQ(input.array()->length, expected->length());
has_array = true;
}
}
ASSERT_TRUE(has_array) << "Must have at least 1 array input to have an array output";
// Check all the input scalars
for (int64_t i = 0; i < expected->length(); ++i) {
CheckScalar(func_name, GetScalars(inputs, i), *expected->GetScalar(i), options);
}
// Since it's a scalar function, calling it on sliced inputs should
// result in the sliced expected output.
const auto slice_length = expected->length() / 3;
if (slice_length > 0) {
CheckScalarNonRecursive(func_name, SliceArrays(inputs, 0, slice_length),
expected->Slice(0, slice_length), options);
CheckScalarNonRecursive(func_name, SliceArrays(inputs, slice_length, slice_length),
expected->Slice(slice_length, slice_length), options);
CheckScalarNonRecursive(func_name, SliceArrays(inputs, 2 * slice_length),
expected->Slice(2 * slice_length), options);
}
// Should also work with an empty slice
CheckScalarNonRecursive(func_name, SliceArrays(inputs, 0, 0), expected->Slice(0, 0),
options);
// Ditto with ChunkedArray inputs
if (slice_length > 0) {
DatumVector chunked_inputs;
chunked_inputs.reserve(inputs.size());
for (const auto& input : inputs) {
if (input.is_array()) {
auto ar = input.make_array();
auto ar_chunked = std::make_shared<ChunkedArray>(
ArrayVector{ar->Slice(0, slice_length), ar->Slice(slice_length)});
chunked_inputs.push_back(ar_chunked);
} else {
chunked_inputs.push_back(input.scalar());
}
}
ArrayVector expected_chunks{expected->Slice(0, slice_length),
expected->Slice(slice_length)};
ASSERT_OK_AND_ASSIGN(Datum out,
CallFunction(func_name, GetDatums(chunked_inputs), options));
ValidateOutput(out);
auto chunked = out.chunked_array();
(void)chunked;
AssertDatumsEqual(std::make_shared<ChunkedArray>(expected_chunks), out);
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have also added test with TweakValidityBit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't cover a case like this:

auto children = ArrayFromJSON(int16(), "[1, 2, null, 4, 5, null]");
children = children->Slice(2);
// Omitting null bitmap here, but we should test that as well
auto fsl = std::make_shared<FixedSizeListArray>(fixed_size_list(int16(), 2), 2, children);
// fsl is the list [[null, 4], [5, null]]

e.g. in Python:

>>> pyarrow.FixedSizeListArray.from_arrays(pyarrow.array([1, 2, None, 4, 5, None]).slice(2), type=pyarrow.list_(pyarrow.int64(), 2))
<pyarrow.lib.FixedSizeListArray object at 0x7f44bc40f0a0>
[
  [
    null,
    4
  ],
  [
    5,
    null
  ]
]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reference, @lidavidm. Will update the tests soon :)

Copy link
Contributor Author

@kshitij12345 kshitij12345 Sep 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have updated the test and hopefully covered the interesting case.

PTAL :)

@kshitij12345
Copy link
Contributor Author

Gentle ping @lidavidm! Thank you :)

@lidavidm lidavidm merged commit f4e4f99 into apache:master Sep 28, 2022
@ursabot
Copy link

ursabot commented Sep 29, 2022

Benchmark runs are scheduled for baseline = 398519a and contender = f4e4f99. f4e4f99 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.68% ⬆️0.14%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.43% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] f4e4f993 ec2-t3-xlarge-us-east-2
[Failed] f4e4f993 test-mac-arm
[Failed] f4e4f993 ursa-i9-9960x
[Finished] f4e4f993 ursa-thinkcentre-m75q
[Finished] 398519a0 ec2-t3-xlarge-us-east-2
[Failed] 398519a0 test-mac-arm
[Failed] 398519a0 ursa-i9-9960x
[Finished] 398519a0 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Oct 17, 2022
…t type (other values type, other field name) (apache#14181)

Add kernel for FSL to FSL casting.

Authored-by: kshitij12345 <kshitijkalambarkar@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants