ARROW-11673 - [C++] Casting dictionary type to use different index type #10721
int64_t len = 1000;
auto val_arr = rand.ArrayOf(int32(), len, /*null_probability=*/0.01);
ASSERT_OK_AND_ASSIGN(auto arr2, DictionaryEncode(val_arr));
// check unsafe indices. Cannot validate this array because ValidateOutput throws an
Sorry - now that I look at the JIRA again, it seems having unsafe casts isn't useful for this case? What do you think? If it produces invalid output (presumably, negative and out of bounds indices?), then it seems kind of pointless.
I was also thinking about this. OTOH, I was wondering whether ValidateOutput should check the validity of the dictionary array in a DictionaryType. 🤔
It looks like it does - are you saying it shouldn't?
I don't think ValidateOutput checks the validity of the ArrayData::dictionary field. That's why ValidateOutput passes for unsafe values, AFAIU.
It looks like it does:
arrow/cpp/src/arrow/array/validate.cc
Lines 565 to 572 in dbeed52
Status Visit(const DictionaryType& type) {
  const Status indices_status =
      CheckBounds(*type.index_type(), 0, data.dictionary->length - 1);
  if (!indices_status.ok()) {
    return Status::Invalid("Dictionary indices invalid: ", indices_status.ToString());
  }
  return ValidateArrayFull(*data.dictionary);
}
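For readers without the Arrow sources at hand, the bounds check quoted above can be sketched in standalone C++. This is only an illustration of the idea, not Arrow's implementation: the `Indices` struct and the `CheckDictionaryIndices` function are hypothetical names, and the error text merely imitates the message in the snippet.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the indices of a dictionary array.
// valid[i] == false marks a null slot whose index value is ignored.
struct Indices {
  std::vector<int64_t> values;
  std::vector<bool> valid;
};

// Returns an empty string when every non-null index lies in
// [0, dictionary_length - 1]; otherwise describes the first offending slot.
std::string CheckDictionaryIndices(const Indices& indices, int64_t dictionary_length) {
  assert(indices.values.size() == indices.valid.size());
  for (std::size_t i = 0; i < indices.values.size(); ++i) {
    if (!indices.valid[i]) continue;  // null slots are not bounds-checked
    const int64_t idx = indices.values[i];
    if (idx < 0 || idx >= dictionary_length) {
      return "Dictionary indices invalid: index " + std::to_string(idx) +
             " at position " + std::to_string(i) + " is out of bounds";
    }
  }
  return std::string();
}
```

With a dictionary of length 3, indices such as 0, 1 and 2 pass, while 3 or -1 in a non-null slot is reported as invalid.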
I think there's no way for a primitive array to be validated once the cast is complete (unsafely, in this instance).
We're casting from dictionary to dictionary here though.
Anyway, the point stands: this is an unsafe cast that generates an invalid array (and will almost always do so), so is this case worth supporting?
I think I agree with you. It's not worth testing that. I will remove it. :-)
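To make the concern in this thread concrete, here is a minimal standalone sketch of narrowing dictionary indices from int32 to int8. It assumes that an unsafe cast simply wraps the value to its low 8 bits; the thread does not confirm how Arrow's unsafe path behaves. The helper names `NarrowIndicesSafe` and `WrapToInt8` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Safe narrowing: returns false (leaving *out partially filled) as soon as
// an index does not fit in int8_t.
bool NarrowIndicesSafe(const std::vector<int32_t>& in, std::vector<int8_t>* out) {
  assert(out != nullptr);
  out->clear();
  for (int32_t v : in) {
    if (v < std::numeric_limits<int8_t>::min() ||
        v > std::numeric_limits<int8_t>::max()) {
      return false;
    }
    out->push_back(static_cast<int8_t>(v));
  }
  return true;
}

// Unsafe narrowing (assumed behaviour): keep only the low 8 bits and
// reinterpret them as a signed value.
int8_t WrapToInt8(int32_t v) {
  const int32_t low = v & 0xFF;  // 0..255
  return static_cast<int8_t>(low >= 128 ? low - 256 : low);
}
```

Under that assumption, an index of 200 wraps to a negative value, and an index of 300 wraps to 44, which is in bounds but points at the wrong dictionary entry. A dictionary built from 1000 random int32 values will usually have far more than 127 entries, so most narrowed indices would be wrong in one of these two ways.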
@lidavidm is this a known test failure?
Looks like this enables something that didn't work before?
Ah! indeed! thanks...
I think this is good now, thanks! (Just one nit about an unused variable)
Thanks for taking this.
check_cast(dictionary(int8(), int16()), dictionary(int8(), int16()),
           "[1, 2, 3, 1, null, 3]");
check_cast(dictionary(int8(), int16()), dictionary(int32(), int64()),
           "[1, 2, 3, 1, null, 3]");
check_cast(dictionary(int32(), utf8()), dictionary(int8(), utf8()),
           R"(["a", "b", "a", null])");
Add tests casting from int to/from float and signed to/from unsigned integers.
I ran int to/from float and unsigned (with Safe and Unsafe options) and it works correctly. LGTM.
@@ -1930,29 +1930,23 @@ TEST(Cast, DictTypeToAnotherDict) {
  check_cast(dictionary(int32(), utf8()), dictionary(int8(), utf8()),
             R"(["a", "b", "a", null])");

  // check float types (NOTE: ArrayFromJSON doesnt work for float value dictionary types)
What was the error here? Would be nice to file a JIRA for it
Done! ARROW-13381
The error from ArrayFromJSON(dictionary(..., floatXX()), ...) is:
NotImplemented: JSON conversion to dictionary<values=float, indices=int8, ordered=0> not implemented
Nevertheless, you can Cast successfully to float values with:
auto arr = ArrayFromJSON(dictionary(int8(), int32()), "[1, 2, 3, 1, null, 3]");
ASSERT_OK_AND_ASSIGN(auto casted, Cast(arr, dictionary(int8(), float32()), CastOptions::Safe()));
There is also a DictArrayFromJSON, but this requires explicit index values:
auto arr = DictArrayFromJSON(dictionary(int8(), float32()), "[0, 1, 2, 3, 4, 5]", "[1, 2, 3, 1, null, 3]");
@nirandaperera I submitted a PR that enables using ArrayFromJSON for dictionaries with floating-point values. If that PR passes and gets merged, you can change the test accordingly and remove the error comment.
@@ -1930,7 +1930,7 @@ TEST(Cast, DictTypeToAnotherDict) {
   check_cast(dictionary(int32(), utf8()), dictionary(int8(), utf8()),
              R"(["a", "b", "a", null])");

-  // check float types (NOTE: ArrayFromJSON doesnt work for float value dictionary types)
+  // check float types (TODO: ARROW-13381 ArrayFromJSON doesnt work for float value dictionary types)
-  // check float types (TODO: ARROW-13381 ArrayFromJSON doesnt work for float value dictionary types)
+  // check float types
+  // TODO(ARROW-13381): ArrayFromJSON doesnt work for float value dictionary types
Needs to be linted.
I made the suggested changes and I think this is ready now.
bcc0a02 to 26ffdb3
Thanks Niranda! Merging on green.
This PR adds casting from one dictionary type to another dictionary type for both scalars and arrays.
ex: