Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-13573: [C++] Support dictionaries natively in case_when #11022

Closed
wants to merge 21 commits into from

Conversation

lidavidm
Copy link
Member

@lidavidm lidavidm commented Aug 27, 2021

This supports dictionaries 'natively', that is, dictionaries are no longer always unpacked. (If mixed dictionary and non-dictionary arguments are given, then they will be unpacked.)

For scalar conditions, the output will have the dictionary of whichever input is selected (or no dictionary if the output is null). For array conditions, we unify the dictionaries as we select elements.

@github-actions
Copy link

@lidavidm lidavidm marked this pull request as draft August 27, 2021 21:37
@lidavidm lidavidm force-pushed the arrow-13573 branch 4 times, most recently from 9746270 to 7d3aeae Compare August 30, 2021 20:57
@lidavidm lidavidm marked this pull request as ready for review August 31, 2021 14:48
@lidavidm lidavidm marked this pull request as draft August 31, 2021 16:27
@lidavidm lidavidm force-pushed the arrow-13573 branch 3 times, most recently from f6a7a85 to a93ce99 Compare August 31, 2021 19:09
@lidavidm lidavidm marked this pull request as ready for review August 31, 2021 20:20
@lidavidm
Copy link
Member Author

One thought: we could have all dictionary types use the variable-width type implementation, meaning we'd always unify dictionaries. This would behave a little more consistently.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments. I haven't looked fully at the implementation yet.

@@ -282,6 +294,163 @@ class DictionaryBuilderBase : public ArrayBuilder {
return indices_builder_.AppendEmptyValues(length);
}

Status AppendScalar(const Scalar& scalar, int64_t n_repeats) override {
if (!scalar.type->Equals(type())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want to do this check every append or should this be left to callers?

switch (dict_ty.index_type()->id()) {
case Type::UINT8: {
const auto& value = dict.GetView(
internal::checked_cast<const UInt8Scalar&>(*dict_scalar.value.index).value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if dict has a null at this index?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, there should be better testing for nulls in general, I'll amend that.

array.buffers[0], array.offset + offset, std::min(array.length, length),
[&](int64_t position) { return Append(dict.GetView(values[position])); },
[&]() { return AppendNull(); });
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to factor this out to avoid repetition? For example:

template <IndexType>
struct SliceAppender {
  const IndexType* values;

  Status operator()(const ArrayData& array, int64_t offset, int64_t length) {
    return VisitBitBlocks(
      array.buffers[0], array.offset + offset, length,
      [&](int64_t position) {
        if (dict.IsNull(values[position])) return AppendNull();
        return Append(dict.GetView(values[position]));
      },
      [&]() { return AppendNull(); });  }
    );
  }
}

case Type::UINT8:
  return SliceAppender{array.GetValues<uint8_t>(1) + offset}(array, offset, length);
// ...

cpp/src/arrow/array/builder_dict.h Show resolved Hide resolved
case Type::UINT8: {
const uint8_t* values = array.GetValues<uint8_t>(1) + offset;
return VisitBitBlocks(
array.buffers[0], array.offset + offset, std::min(array.length, length),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other AppendArraySlice implementations don't check that length is in bounds, so std::min doesn't seem necessary here.

auto values1 = make_list(ArrayFromJSON(int32(), "[0, 2, 2, 3, 4]"), values1_backing);
auto values2 = make_list(ArrayFromJSON(int32(), "[0, 1, 2, 2, 4]"), values2_backing);

CheckScalarNonRecursive(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this calling CheckScalarNonRecursive and not CheckScalar? Leave a comment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scalar variant of the kernel will not produce the same dictionary indices so the values do not compare equal. I'll add a comment to that effect.

auto values1_null = DictArrayFromJSON(type, "[null, null, null, null]", dict1);
auto values2_null = DictArrayFromJSON(type, "[null, null, null, null]", dict2);
auto values1 = DictArrayFromJSON(type, "[0, null, 3, 1]", dict1);
auto values2 = DictArrayFromJSON(type, "[2, 1, null, 0]", dict2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason, it looks like the nulls in the indices are placed at the same indices as the nulls in the respective dictionaries.

DictArrayFromJSON(type, "[0, 0, 2, 2]", dict1));

// If we can't map values from a dictionary, then raise an error
// Unmappable value is in the else clause
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious: why don't we unify dictionaries instead? It would sound more useful to me. I don't see any reason for the first input to have a particular status, is there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had mostly tried to emulate the R/dplyr behavior as closely as possible: #10724 (comment)

But unification is honestly probably easier to implement for us, so I can switch to that instead.


// ...or optionally, emit null

// TODO: this is not implemented yet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what this TODO is for. Emitting a null when some option is enabled?

@@ -1058,6 +1062,109 @@ void AddFSBinaryIfElseKernel(const std::shared_ptr<IfElseFunction>& scalar_funct
DCHECK_OK(scalar_function->AddKernel(std::move(kernel)));
}

// Given a reference dictionary, computes indices to map dictionary values from a
// comparison dictionary to the reference.
class DictionaryRemapper {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we already have DictionaryUnifier for this? Or am I misunderstanding?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for if we don't want unification, however, I think we might want to just unify dictionaries always.

IIRC, what DictionaryUnifier was missing was a way to compute a transposition map without adding new values to the internal memo table.

@lidavidm
Copy link
Member Author

lidavidm commented Sep 7, 2021

Changes:

  • We always unify dictionaries now.
  • Since this generates fresh dictionaries, to make testing easier, the dictionary variant of the kernel is compared against the non-dictionary variants.
  • Refactored the various changes in the dictionary builders; handle nulls in dictionaries by emitting null indices (note that this means that we won't generate dictionaries with nulls even if we get such dictionaries in the inputs)

@lidavidm
Copy link
Member Author

lidavidm commented Sep 9, 2021

It looks like the RTools 40 test failure/crash is real; I'm going to need to figure out how to replicate this properly. (So far I've had little success with a VM, unfortunately.)

@lidavidm
Copy link
Member Author

It looks like there's still 2 Windows failures to look into (a segfault in MinGW/32, which is hopefully more debuggable, and a failure to run one of the examples in RTools35), though the RTools40 crash is no more.

@pitrou
Copy link
Member

pitrou commented Sep 13, 2021

Hmm... did you make sense of the RTools 3.5 CI failure? I can't find the actual error in the logs :-/

@lidavidm
Copy link
Member Author

I'm setting up my Windows VM again since it expires every few months now :/ but it seems like the dataset example crashed when it was run, looking here: https://github.com/apache/arrow/pull/11022/checks?check_run_id=3589141006#step:12:395

@pitrou
Copy link
Member

pitrou commented Sep 13, 2021

Is that example expected to be impacted by this PR? Otherwise, perhaps we should just restart the build...

@lidavidm
Copy link
Member Author

I do not expect it to be impacted but it was also failing in the last couple builds. (That said, I think it wasn't failing before I turned off the unity build?)

@lidavidm
Copy link
Member Author

And I did finally get the R package built on Windows - write_dataset causes R to crash, so I'll need to dig…

@lidavidm
Copy link
Member Author

The CI is passing here, for the first time in quite a while.

else
export ARROW_S3=ON
export ARROW_WITH_RE2=ON
# Without this, some compute functionality segfaults
export CMAKE_UNITY_BUILD=OFF
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean it segfaults during compilation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It segfaults in the tests. I wasn't really able to debug this on Windows; it disappears once you build with debuginfo.

if (index_scalar.is_valid && dict.IsValid(index)) {
const auto& value = dict.GetView(index);
for (int64_t i = 0; i < n_repeats; i++) {
ARROW_RETURN_NOT_OK(Append(value));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, but it sounds like offering a two-step API on DictionaryBuilder would allow for performance improvements:

/// Ensure `value` is in the dict, and return its index, but doesn't append it
Result<int64_t> Encode(c_type value);
/// Append the given dictionary index
Status AppendIndex(int64_t index);
Status AppendIndices(int64_t index, int64_t nrepeats);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed ARROW-14042.

}
EXPECT_OK_AND_ASSIGN(Datum expected, CallFunction(func_name, decoded_args));

if (actual.type()->id() == Type::DICTIONARY) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it would be nice if the caller actually said whether the output is supposed to be dictionary-encoded or not. Otherwise there could be silent regressions where the output type of a kernel changes from one version to another.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean add an options struct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CheckDictionary could accept an argument saying if the expected output is dictionary-encoded or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see what you mean now - will do.

for (auto index : {"null", "2", "1", "0"}) {
auto scalar = DictScalarFromJSON(type, index, dict);
auto expected_index = ScalarFromJSON(int32(), index);
AssertScalarsEqual(*DictionaryScalar::Make(expected_index, expected_dictionary),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also call scalar->ValidateFull()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thank you !

@pitrou pitrou closed this in 87e2ad5 Sep 21, 2021
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
This supports dictionaries 'natively', that is, dictionaries are no longer always unpacked. (If mixed dictionary and non-dictionary arguments are given, then they will be unpacked.)

For scalar conditions, the output will have the dictionary of whichever input is selected (or no dictionary if the output is null). For array conditions, we unify the dictionaries as we select elements.

Closes apache#11022 from lidavidm/arrow-13573

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants