ARROW-13573: [C++] Support dictionaries natively in case_when #11022

lidavidm · 2021-08-27T21:36:37Z

This supports dictionaries 'natively', that is, dictionaries are no longer always unpacked. (If mixed dictionary and non-dictionary arguments are given, then they will be unpacked.)

For scalar conditions, the output will have the dictionary of whichever input is selected (or no dictionary if the output is null). For array conditions, we unify the dictionaries as we select elements.
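As a rough illustration of "unify the dictionaries as we select elements", here is a minimal stdlib-only sketch. `ToyDictArray` and `SelectUnified` are hypothetical names invented for this example; they are not Arrow types or functions, and the real kernel works on Arrow buffers and builders rather than `std::vector`.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Toy stand-in for a dictionary-encoded string array (NOT arrow::DictionaryArray).
struct ToyDictArray {
  std::vector<std::string> dictionary;
  std::vector<std::optional<int32_t>> indices;  // nullopt models a null slot
};

// Picks a[i] where cond[i] is true and b[i] otherwise. The output dictionary
// is built on the fly from the values that are actually selected.
ToyDictArray SelectUnified(const std::vector<bool>& cond, const ToyDictArray& a,
                           const ToyDictArray& b) {
  assert(cond.size() == a.indices.size() && cond.size() == b.indices.size());
  ToyDictArray out;
  std::unordered_map<std::string, int32_t> memo;  // value -> index in out.dictionary
  for (size_t i = 0; i < cond.size(); ++i) {
    const ToyDictArray& src = cond[i] ? a : b;
    if (!src.indices[i].has_value()) {
      out.indices.push_back(std::nullopt);
      continue;
    }
    const std::string& value = src.dictionary[*src.indices[i]];
    auto it = memo.find(value);
    if (it == memo.end()) {
      // First time this value is selected: give it the next output index.
      it = memo.emplace(value, static_cast<int32_t>(out.dictionary.size())).first;
      out.dictionary.push_back(value);
    }
    out.indices.push_back(it->second);
  }
  return out;
}
```

In this sketch the output dictionary only contains values that were selected, so its order and contents can differ from both input dictionaries.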

github-actions · 2021-08-27T21:36:58Z

https://issues.apache.org/jira/browse/ARROW-13573

lidavidm · 2021-08-31T20:20:37Z

One thought: we could have all dictionary types use the variable-width type implementation, meaning we'd always unify dictionaries. This would behave a little more consistently.

pitrou

Some comments. I haven't looked fully at the implementation yet.

pitrou · 2021-09-06T14:18:12Z

cpp/src/arrow/array/builder_dict.h

@@ -282,6 +294,163 @@ class DictionaryBuilderBase : public ArrayBuilder {
    return indices_builder_.AppendEmptyValues(length);
  }

+  Status AppendScalar(const Scalar& scalar, int64_t n_repeats) override {
+    if (!scalar.type->Equals(type())) {


Do we really want to do this check every append or should this be left to callers?

pitrou · 2021-09-06T14:20:43Z

cpp/src/arrow/array/builder_dict.h

+    switch (dict_ty.index_type()->id()) {
+      case Type::UINT8: {
+        const auto& value = dict.GetView(
+            internal::checked_cast<const UInt8Scalar&>(*dict_scalar.value.index).value);


What happens if dict has a null at this index?

Hmm, there should be better testing for nulls in general, I'll amend that.

pitrou · 2021-09-06T14:28:34Z

cpp/src/arrow/array/builder_dict.h

+            array.buffers[0], array.offset + offset, std::min(array.length, length),
+            [&](int64_t position) { return Append(dict.GetView(values[position])); },
+            [&]() { return AppendNull(); });
+      }


Is it possible to factor this out to avoid repetition? For example:

    template <typename IndexType>
    struct SliceAppender {
      const IndexType* values;

      Status operator()(const ArrayData& array, int64_t offset, int64_t length) {
        return VisitBitBlocks(
            array.buffers[0], array.offset + offset, length,
            [&](int64_t position) {
              if (dict.IsNull(values[position])) return AppendNull();
              return Append(dict.GetView(values[position]));
            },
            [&]() { return AppendNull(); });
      }
    };

    case Type::UINT8:
      return SliceAppender<uint8_t>{array.GetValues<uint8_t>(1) + offset}(array, offset, length);
    // ...
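The idea behind this suggestion (write the loop once as a template over the index type, and let the `switch` only choose the type) can be sketched without Arrow as follows. Everything here (`IndexKind`, `ToyDictionary`, `AppendSlice`, `AppendSliceDispatch`) is a hypothetical stand-in made up for illustration, and a plain `std::vector<bool>` replaces the validity bitmap that `VisitBitBlocks` would walk.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

enum class IndexKind { kUInt8, kInt32 };

// Toy dictionary: nullopt models a null dictionary entry.
using ToyDictionary = std::vector<std::optional<std::string>>;

// One templated routine replaces a hand-written loop per index width.
template <typename IndexType>
void AppendSlice(const ToyDictionary& dict, const IndexType* indices,
                 const std::vector<bool>& validity, int64_t offset, int64_t length,
                 std::vector<std::optional<std::string>>* out) {
  for (int64_t i = offset; i < offset + length; ++i) {
    // A slot is null if the index is null OR it points at a null dictionary entry.
    if (!validity[i] || !dict[indices[i]].has_value()) {
      out->push_back(std::nullopt);
    } else {
      out->push_back(dict[indices[i]]);
    }
  }
}

// The switch only picks the index type; the loop body is shared.
void AppendSliceDispatch(IndexKind kind, const ToyDictionary& dict, const void* indices,
                         const std::vector<bool>& validity, int64_t offset,
                         int64_t length, std::vector<std::optional<std::string>>* out) {
  assert(offset >= 0 && length >= 0 &&
         static_cast<size_t>(offset + length) <= validity.size());
  switch (kind) {
    case IndexKind::kUInt8:
      AppendSlice(dict, static_cast<const uint8_t*>(indices), validity, offset, length, out);
      return;
    case IndexKind::kInt32:
      AppendSlice(dict, static_cast<const int32_t*>(indices), validity, offset, length, out);
      return;
  }
}
```

Adding another index width then costs one extra `case` line rather than another copy of the loop.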

pitrou · 2021-09-06T14:29:17Z

cpp/src/arrow/array/builder_dict.h

+      case Type::UINT8: {
+        const uint8_t* values = array.GetValues<uint8_t>(1) + offset;
+        return VisitBitBlocks(
+            array.buffers[0], array.offset + offset, std::min(array.length, length),


Other AppendArraySlice implementations don't check that length is in bounds, so std::min doesn't seem necessary here.

pitrou · 2021-09-06T14:43:25Z

cpp/src/arrow/compute/kernels/scalar_if_else_test.cc

+  auto values1 = make_list(ArrayFromJSON(int32(), "[0, 2, 2, 3, 4]"), values1_backing);
+  auto values2 = make_list(ArrayFromJSON(int32(), "[0, 1, 2, 2, 4]"), values2_backing);
+
+  CheckScalarNonRecursive(


Why is this calling CheckScalarNonRecursive and not CheckScalar? Leave a comment?

The scalar variant of the kernel will not produce the same dictionary indices so the values do not compare equal. I'll add a comment to that effect.
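To make the point about indices concrete: two dictionary-encoded arrays can hold the same logical values while using different dictionaries and indices, so an index-by-index comparison fails even though the decoded values match. The sketch below uses hypothetical toy types (`ToyDictArray`, `Decode`, `LogicallyEqual`) and is not how Arrow's test utilities are implemented.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Toy stand-in for a dictionary-encoded string array (NOT an Arrow type).
struct ToyDictArray {
  std::vector<std::string> dictionary;
  std::vector<std::optional<int32_t>> indices;  // nullopt models a null slot
};

// Expands the array into plain values, dropping the encoding.
std::vector<std::optional<std::string>> Decode(const ToyDictArray& array) {
  std::vector<std::optional<std::string>> values;
  values.reserve(array.indices.size());
  for (const auto& index : array.indices) {
    if (!index.has_value()) {
      values.push_back(std::nullopt);
      continue;
    }
    assert(*index >= 0 && static_cast<size_t>(*index) < array.dictionary.size());
    values.push_back(array.dictionary[*index]);
  }
  return values;
}

// Compares what the arrays mean, not how they happen to be encoded.
bool LogicallyEqual(const ToyDictArray& a, const ToyDictArray& b) {
  return Decode(a) == Decode(b);
}
```

For example, dictionary `["x", "y"]` with indices `[0, 1, null]` and dictionary `["y", "x"]` with indices `[1, 0, null]` decode to the same values but have different indices.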

pitrou · 2021-09-06T14:49:31Z

cpp/src/arrow/compute/kernels/scalar_if_else_test.cc

+  auto values1_null = DictArrayFromJSON(type, "[null, null, null, null]", dict1);
+  auto values2_null = DictArrayFromJSON(type, "[null, null, null, null]", dict2);
+  auto values1 = DictArrayFromJSON(type, "[0, null, 3, 1]", dict1);
+  auto values2 = DictArrayFromJSON(type, "[2, 1, null, 0]", dict2);


For some reason, it looks like the nulls in the indices are placed at the same indices as the nulls in the respective dictionaries.

pitrou · 2021-09-06T14:55:04Z

cpp/src/arrow/compute/kernels/scalar_if_else_test.cc

+      DictArrayFromJSON(type, "[0, 0, 2, 2]", dict1));
+
+  // If we can't map values from a dictionary, then raise an error
+  // Unmappable value is in the else clause


I'm curious: why don't we unify dictionaries instead? It would sound more useful to me. I don't see any reason for the first input to have a particular status, is there?

I had mostly tried to emulate the R/dplyr behavior as closely as possible: #10724 (comment)

But unification is honestly probably easier to implement for us, so I can switch to that instead.

pitrou · 2021-09-06T14:55:40Z

cpp/src/arrow/compute/kernels/scalar_if_else_test.cc

+
+  // ...or optionally, emit null
+
+  // TODO: this is not implemented yet


I'm not sure I understand what this TODO is for. Emitting a null when some option is enabled?

pitrou · 2021-09-06T14:58:56Z

cpp/src/arrow/compute/kernels/scalar_if_else.cc

@@ -1058,6 +1062,109 @@ void AddFSBinaryIfElseKernel(const std::shared_ptr<IfElseFunction>& scalar_funct
  DCHECK_OK(scalar_function->AddKernel(std::move(kernel)));
 }

+// Given a reference dictionary, computes indices to map dictionary values from a
+// comparison dictionary to the reference.
+class DictionaryRemapper {


Don't we already have DictionaryUnifier for this? Or am I misunderstanding?

This is for the case where we don't want unification. However, I think we might want to just unify dictionaries always.

IIRC, what DictionaryUnifier was missing was a way to compute a transposition map without adding new values to the internal memo table.
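For readers unfamiliar with the distinction, here is a minimal sketch of what "compute a transposition map without adding new values" could look like. `ComputeTransposeMap` and `kUnmappable` are hypothetical names for this example only; this is neither the `DictionaryRemapper` from the diff nor the `DictionaryUnifier` API.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Sentinel for comparison values that do not occur in the reference dictionary.
constexpr int32_t kUnmappable = -1;

// For each entry of `comparison`, returns its index in `reference`, or
// kUnmappable. The reference dictionary is only read, never extended.
std::vector<int32_t> ComputeTransposeMap(const std::vector<std::string>& reference,
                                         const std::vector<std::string>& comparison) {
  std::unordered_map<std::string, int32_t> ref_index;
  for (size_t i = 0; i < reference.size(); ++i) {
    const bool inserted = ref_index.emplace(reference[i], static_cast<int32_t>(i)).second;
    assert(inserted && "reference dictionary values are expected to be unique");
    (void)inserted;
  }
  std::vector<int32_t> transpose;
  transpose.reserve(comparison.size());
  for (const auto& value : comparison) {
    auto it = ref_index.find(value);
    transpose.push_back(it == ref_index.end() ? kUnmappable : it->second);
  }
  return transpose;
}
```

A unifying approach would instead append the unmapped value to the combined dictionary and return its new index, which is why no error or null path is needed once dictionaries are always unified.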

lidavidm · 2021-09-07T16:05:06Z

Changes:

We always unify dictionaries now.
Since this generates fresh dictionaries, to make testing easier, the dictionary variant of the kernel is compared against the non-dictionary variants.
Refactored the various changes in the dictionary builders; handle nulls in dictionaries by emitting null indices (note that this means that we won't generate dictionaries with nulls even if we get such dictionaries in the inputs)
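The null handling described in the last item can be pictured with a small stdlib-only sketch: any index that points at a null dictionary entry becomes a null index, and the resulting dictionary contains no nulls. `ToyDictArray` and `MoveNullsToIndices` are hypothetical names, and the real change goes through the dictionary builders rather than a standalone pass like this one.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Toy dictionary array whose dictionary may itself contain nulls.
struct ToyDictArray {
  std::vector<std::optional<std::string>> dictionary;
  std::vector<std::optional<int32_t>> indices;
};

// Rewrites the array so that nulls live only in the indices.
ToyDictArray MoveNullsToIndices(const ToyDictArray& in) {
  ToyDictArray out;
  // remap[i] is the new index of old dictionary entry i, or -1 if it was null.
  std::vector<int32_t> remap(in.dictionary.size(), -1);
  for (size_t i = 0; i < in.dictionary.size(); ++i) {
    if (in.dictionary[i].has_value()) {
      remap[i] = static_cast<int32_t>(out.dictionary.size());
      out.dictionary.push_back(in.dictionary[i]);
    }
  }
  for (const auto& index : in.indices) {
    if (!index.has_value()) {
      out.indices.push_back(std::nullopt);
      continue;
    }
    assert(*index >= 0 && static_cast<size_t>(*index) < remap.size());
    if (remap[*index] < 0) {
      out.indices.push_back(std::nullopt);  // pointed at a null dictionary entry
    } else {
      out.indices.push_back(remap[*index]);
    }
  }
  return out;
}
```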

lidavidm · 2021-09-09T13:37:13Z

It looks like the RTools 40 test failure/crash is real; I'm going to need to figure out how to replicate this properly. (So far I've had little success with a VM, unfortunately.)

lidavidm · 2021-09-13T14:03:43Z

It looks like there's still 2 Windows failures to look into (a segfault in MinGW/32, which is hopefully more debuggable, and a failure to run one of the examples in RTools35), though the RTools40 crash is no more.

pitrou · 2021-09-13T17:34:31Z

Hmm... did you make sense of the RTools 3.5 CI failure? I can't find the actual error in the logs :-/

lidavidm · 2021-09-13T17:35:44Z

I'm setting up my Windows VM again since it expires every few months now :/ but it seems like the dataset example crashed when it was run, looking here: https://github.com/apache/arrow/pull/11022/checks?check_run_id=3589141006#step:12:395

pitrou · 2021-09-13T17:56:54Z

Is that example expected to be impacted by this PR? Otherwise, perhaps we should just restart the build...

lidavidm · 2021-09-13T17:58:03Z

I do not expect it to be impacted but it was also failing in the last couple builds. (That said, I think it wasn't failing before I turned off the unity build?)

lidavidm · 2021-09-13T21:18:56Z

And I did finally get the R package built on Windows - write_dataset causes R to crash, so I'll need to dig…

lidavidm · 2021-09-16T18:24:15Z

The CI is passing here, for the first time in quite a while.

pitrou · 2021-09-20T13:14:29Z

ci/scripts/PKGBUILD

  else
    export ARROW_S3=ON
    export ARROW_WITH_RE2=ON
+    # Without this, some compute functionality segfaults
+    export CMAKE_UNITY_BUILD=OFF


You mean it segfaults during compilation?

It segfaults in the tests. I wasn't really able to debug this on Windows; it disappears once you build with debuginfo.

pitrou · 2021-09-20T13:22:55Z

cpp/src/arrow/array/builder_dict.h

+    if (index_scalar.is_valid && dict.IsValid(index)) {
+      const auto& value = dict.GetView(index);
+      for (int64_t i = 0; i < n_repeats; i++) {
+        ARROW_RETURN_NOT_OK(Append(value));


Not for this PR, but it sounds like offering a two-step API on DictionaryBuilder would allow for performance improvements:

    /// Ensure `value` is in the dict, and return its index, but doesn't append it
    Result<int64_t> Encode(c_type value);
    /// Append the given dictionary index
    Status AppendIndex(int64_t index);
    Status AppendIndices(int64_t index, int64_t nrepeats);

I filed ARROW-14042.
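A toy version of the proposed two-step API shows where the saving would come from: the memo-table lookup happens once in `Encode`, and repeated appends only push an integer. `ToyDictionaryBuilder` is a hypothetical class written for this sketch; it returns plain values instead of `Result`/`Status` and does not reflect whatever was eventually implemented for ARROW-14042.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

class ToyDictionaryBuilder {
 public:
  // Ensures `value` is in the dictionary and returns its index, without
  // appending anything to the indices.
  int64_t Encode(const std::string& value) {
    auto it = memo_.find(value);
    if (it != memo_.end()) return it->second;
    const int64_t index = static_cast<int64_t>(dictionary_.size());
    dictionary_.push_back(value);
    memo_.emplace(value, index);
    return index;
  }

  // Appends an already-encoded index once.
  void AppendIndex(int64_t index) { AppendIndices(index, 1); }

  // Appends an already-encoded index `nrepeats` times with no hashing.
  void AppendIndices(int64_t index, int64_t nrepeats) {
    assert(index >= 0 && index < static_cast<int64_t>(dictionary_.size()));
    assert(nrepeats >= 0);
    indices_.insert(indices_.end(), static_cast<size_t>(nrepeats), index);
  }

  const std::vector<std::string>& dictionary() const { return dictionary_; }
  const std::vector<int64_t>& indices() const { return indices_; }

 private:
  std::vector<std::string> dictionary_;
  std::unordered_map<std::string, int64_t> memo_;
  std::vector<int64_t> indices_;
};
```

Compared with the loop in the diff above, which calls `Append(value)` once per repeat, this shape hashes the value once regardless of `n_repeats`.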

pitrou · 2021-09-20T13:29:47Z

cpp/src/arrow/compute/kernels/test_util.cc

+  }
+  EXPECT_OK_AND_ASSIGN(Datum expected, CallFunction(func_name, decoded_args));
+
+  if (actual.type()->id() == Type::DICTIONARY) {


Hmm, it would be nice if the caller actually said whether the output is supposed to be dictionary-encoded or not. Otherwise there could be silent regressions where the output type of a kernel changes from one version to another.

Do you mean add an options struct?

CheckDictionary could accept an argument saying if the expected output is dictionary-encoded or not.

Ah I see what you mean now - will do.

pitrou · 2021-09-20T13:30:41Z

cpp/src/arrow/ipc/json_simple_test.cc

+  for (auto index : {"null", "2", "1", "0"}) {
+    auto scalar = DictScalarFromJSON(type, index, dict);
+    auto expected_index = ScalarFromJSON(int32(), index);
+    AssertScalarsEqual(*DictionaryScalar::Make(expected_index, expected_dictionary),


Also call scalar->ValidateFull()?

pitrou

+1, thank you!

This supports dictionaries 'natively', that is, dictionaries are no longer always unpacked. (If mixed dictionary and non-dictionary arguments are given, then they will be unpacked.)

For scalar conditions, the output will have the dictionary of whichever input is selected (or no dictionary if the output is null). For array conditions, we unify the dictionaries as we select elements.

Closes apache#11022 from lidavidm/arrow-13573

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>

github-actions bot added the Component: C++ label Aug 27, 2021

lidavidm marked this pull request as draft August 27, 2021 21:37

lidavidm force-pushed the arrow-13573 branch 4 times, most recently from 9746270 to 7d3aeae Compare August 30, 2021 20:57

lidavidm marked this pull request as ready for review August 31, 2021 14:48

lidavidm marked this pull request as draft August 31, 2021 16:27

lidavidm force-pushed the arrow-13573 branch 3 times, most recently from f6a7a85 to a93ce99 Compare August 31, 2021 19:09

lidavidm marked this pull request as ready for review August 31, 2021 20:20

pitrou reviewed Sep 6, 2021

lidavidm force-pushed the arrow-13573 branch from a93ce99 to 0759290 Compare September 7, 2021 16:02

lidavidm force-pushed the arrow-13573 branch from 2b7f761 to 891bfb6 Compare September 8, 2021 21:49

lidavidm force-pushed the arrow-13573 branch from 5484a35 to 4e8d34a Compare September 13, 2021 13:02

lidavidm force-pushed the arrow-13573 branch from d171c73 to 2bdee00 Compare September 16, 2021 13:24

lidavidm added 3 commits September 16, 2021 12:55

ARROW-13573: [C++] Add DictScalarFromJSON

06b428d

ARROW-13573: [C++] Check that dictionary array has dictionary

2ec1132

ARROW-13573: [C++] Handle simple dictionary cases

e3b7f93

lidavidm added 16 commits September 16, 2021 12:55

ARROW-13573: [C++] Transpose dictionaries in case_when

7a57c91

ARROW-13573: [C++] Handle nested dictionaries

17230ee

ARROW-13691: [C++] Rebase

16fe210

ARROW-13573: [C++] Always unify dictionaries

8e0e333

ARROW-13573: [C++] Handle nulls before unifying, refactor

60ffb02

ARROW-13573: [C++] Test dictionaries with nulls

a10888c

ARROW-13573: [C++] Address feedback

5cbe6d5

ARROW-13573: [C++] Add a direct test of dispatch

345388f

ARROW-13573: [C++] Fix mistakes

8abb93f

ARROW-13573: [C++] Fix undefined behavior

45563d1

ARROW-13573: [C++] See if turning off unity builds fixes R CI

5fc1a1f

ARROW-13573: [C++] Try bumping timeout

f2a0a9e

ARROW-13573: [C++] Should fix MinGW32

a5b6078

ARROW-13573: [C++] Make CMAKE_UNITY_BUILD depend on the rtools version

29a2f87

ARROW-13573: [C++] RTools40 build is very slow without unity build

d81773d

ARROW-13573: [C++] Add clarifying comments

15a64a2

lidavidm force-pushed the arrow-13573 branch from 2bdee00 to 15a64a2 Compare September 16, 2021 16:55

pitrou reviewed Sep 20, 2021

lidavidm added 2 commits September 20, 2021 11:15

ARROW-13573: [C++] Address feedback

26230a4

ARROW-13573: [C++] Explicitly indicate when we expect dictionary-encoded results in tests

ba39d83

pitrou approved these changes Sep 21, 2021

pitrou closed this in 87e2ad5 Sep 21, 2021

jeroen mentioned this pull request Oct 22, 2021

Apache arrow 6.0.0 r-windows/rtools-packages#230

Closed

This was referenced Sep 29, 2021

[C++] Support dictionaries directly in case_when kernel #29220

Closed

[C++] Improve performance on dictionaries for 'case_when' kernel #29639

Open

