Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-11673 - [C++] Casting dictionary type to use different index type #10721

Closed
wants to merge 14 commits into from

Conversation

nirandaperera
Copy link
Contributor

@nirandaperera nirandaperera commented Jul 14, 2021

This PR adds casting from one dictionary type to anther dictionary type for both scalars and arrays :
ex:

dictionary(int8(), int16()) --> dictionary(int32(), int64())

@github-actions
Copy link

int64_t len = 1000;
auto val_arr = rand.ArrayOf(int32(), len, /*null_probability=*/0.01);
ASSERT_OK_AND_ASSIGN(auto arr2, DictionaryEncode(val_arr));
// check unsafe indices. Cannot validate this array because ValidateOutput throws an
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry - now that I look at the JIRA again, it seems having unsafe casts isn't useful for this case? What do you think? If it produces invalid output (presumably, negative and out of bounds indices?), then it seems kind of pointless.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also thinking about this. OTOH I was thinking whether the ValidataOutput should check the validity of dictionary array in DictionaryType. 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it does - are you saying it shouldn't?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think ValidateOutput checks the validity of ArrayData::dictionary field. That's why ValidateOutput passes for unsafe valies AFAIU. :thin

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it does:

Status Visit(const DictionaryType& type) {
const Status indices_status =
CheckBounds(*type.index_type(), 0, data.dictionary->length - 1);
if (!indices_status.ok()) {
return Status::Invalid("Dictionary indices invalid: ", indices_status.ToString());
}
return ValidateArrayFull(*data.dictionary);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's no way a primitive array to be validated once the casting is complete (unsafely in this instance).

Copy link
Member

@lidavidm lidavidm Jul 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're casting from dictionary to dictionary here though.

Anyways, the point stands: this is an unsafe cast that generates an invalid array (and will mostly always do so), so is this case worth supporting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I agree with you. It's not worth testing that. I will remove it. :-)

@nirandaperera
Copy link
Contributor Author

@lidavidm is this a known test failure? [ FAILED ] Expression.ReplaceFieldsWithKnownValues
https://github.com/apache/arrow/pull/10721/checks?check_run_id=3078144913

@lidavidm
Copy link
Member

Expected 'ReplaceFieldsWithKnownValues( KnownFieldValues{{{"dict_str", dict_i32}}}, expr)' to fail with NotImplemented, but got OK

Looks like this enables something that didn't work before?

  ASSERT_OK_AND_ASSIGN(auto expr, field_ref("dict_str").Bind(*kBoringSchema));
  Datum dict_i32{
      DictionaryScalar::Make(MakeScalar<int32_t>(0), ArrayFromJSON(int32(), R"([3])"))};
  // Unsupported cast dictionary(int32(), int32()) -> dictionary(int32(), utf8())
  ASSERT_RAISES(NotImplemented, ReplaceFieldsWithKnownValues(
                                    KnownFieldValues{{{"dict_str", dict_i32}}}, expr));
  // Unsupported cast dictionary(int8(), utf8()) -> dictionary(int32(), utf8())
  dict_str = Datum{
      DictionaryScalar::Make(MakeScalar<int8_t>(0), ArrayFromJSON(utf8(), R"(["a"])"))};
  ASSERT_RAISES(NotImplemented, ReplaceFieldsWithKnownValues(
                                    KnownFieldValues{{{"dict_str", dict_str}}}, expr));

@nirandaperera
Copy link
Contributor Author

Expected 'ReplaceFieldsWithKnownValues( KnownFieldValues{{{"dict_str", dict_i32}}}, expr)' to fail with NotImplemented, but got OK

Looks like this enables something that didn't work before?

  ASSERT_OK_AND_ASSIGN(auto expr, field_ref("dict_str").Bind(*kBoringSchema));
  Datum dict_i32{
      DictionaryScalar::Make(MakeScalar<int32_t>(0), ArrayFromJSON(int32(), R"([3])"))};
  // Unsupported cast dictionary(int32(), int32()) -> dictionary(int32(), utf8())
  ASSERT_RAISES(NotImplemented, ReplaceFieldsWithKnownValues(
                                    KnownFieldValues{{{"dict_str", dict_i32}}}, expr));
  // Unsupported cast dictionary(int8(), utf8()) -> dictionary(int32(), utf8())
  dict_str = Datum{
      DictionaryScalar::Make(MakeScalar<int8_t>(0), ArrayFromJSON(utf8(), R"(["a"])"))};
  ASSERT_RAISES(NotImplemented, ReplaceFieldsWithKnownValues(
                                    KnownFieldValues{{{"dict_str", dict_str}}}, expr));

Ah! indeed! thanks...

@lidavidm
Copy link
Member

I think this is good now, thanks! (Just one nit about an unused variable)

@lidavidm
Copy link
Member

lidavidm commented Jul 15, 2021

I take that back, it's used after all. I had local changes, there's an unused variable 🙂

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this.

Comment on lines 1926 to 1934
check_cast(dictionary(int8(), int16()), dictionary(int8(), int16()),
"[1, 2, 3, 1, null, 3]");
check_cast(dictionary(int8(), int16()), dictionary(int32(), int64()),
"[1, 2, 3, 1, null, 3]");
check_cast(dictionary(int32(), utf8()), dictionary(int8(), utf8()),
R"(["a", "b", "a", null])");

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add tests casting from int to/from float and signed to/from unsigned integers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran int to/from float and unsigned (with Safe and Unsafe options) and it works correctly. LGTM.

@@ -1930,29 +1930,23 @@ TEST(Cast, DictTypeToAnotherDict) {
check_cast(dictionary(int32(), utf8()), dictionary(int8(), utf8()),
R"(["a", "b", "a", null])");

// check float types (NOTE: ArrayFromJSON doesnt work for float value dictionary types)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the error here? Would be nice to file a JIRA for it

Copy link
Contributor Author

@nirandaperera nirandaperera Jul 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! ARROW-13381

Copy link
Contributor

@edponce edponce Jul 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error is from ArrayFromJSON(dictionary(..., floatXX()), ...) is:

NotImplemented: JSON conversion to dictionary<values=float, indices=int8, ordered=0> not implemented

Nevertheless, you can Cast successfully to float values with

auto arr = ArrayFromJSON(dictionary(int8(), int32()), "[1, 2, 3, 1, null, 3]");
ASSERT_OK_AND_ASSIGN(auto casted, Cast(arr, dictionary(int8(), float32()), CastOptions::Safe()));

There is also a DictArrayFromJSON but this requires explicit index values:

auto arr = DictArrayFromJSON(dictionary(int8(), float32()), "[0, 1, 2, 3, 4, 5]", "[1, 2, 3, 1, null, 3]");

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nirandaperera I submitted this PR that enables using ArrayFromJSON for dictionaries with floating-point values. If PR passes and gets merged, you can change test accordingly and remove error comment.

@@ -1930,7 +1930,7 @@ TEST(Cast, DictTypeToAnotherDict) {
check_cast(dictionary(int32(), utf8()), dictionary(int8(), utf8()),
R"(["a", "b", "a", null])");

// check float types (NOTE: ArrayFromJSON doesnt work for float value dictionary types)
// check float types (TODO: ARROW-13381 ArrayFromJSON doesnt work for float value dictionary types)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// check float types (TODO: ARROW-13381 ArrayFromJSON doesnt work for float value dictionary types)
// check float types
// TODO(ARROW-13381): ArrayFromJSON doesnt work for float value dictionary types

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be linted.

@nirandaperera
Copy link
Contributor Author

I made the suggested changes and I think this is ready now

Copy link
Member

@kszucs kszucs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Niranda! Merging on green.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants