GH-34890: [C++][Python] Add a no-op kernel for dictionary_encode(dictionary) (#38349)

Added a no-op kernel for convenience as discussed in the issue.
* Closes: #34890

Lead-authored-by: Jin Shang <shangjin1997@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
3 people committed Dec 11, 2023
1 parent dff3068 commit e502728
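
The behavior this commit describes can be illustrated with a small pure-Python model. This is only a sketch under stated assumptions, not Arrow's implementation: `ToyDictionaryArray` and `toy_dictionary_encode` are hypothetical names invented for illustration, nulls are ignored, and the real kernel is registered in C++ as shown in the diff. It shows the two paths: plain input is encoded into indices plus a dictionary of distinct values, while input that is already dictionary-encoded is returned unchanged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyDictionaryArray:
    """Hypothetical stand-in for a dictionary-encoded array (not an Arrow class)."""

    indices: tuple
    dictionary: tuple


def toy_dictionary_encode(values):
    """Toy model of dictionary_encode semantics; not Arrow's implementation."""
    # No-op path (the case this commit adds): already-encoded input
    # is handed back unchanged.
    if isinstance(values, ToyDictionaryArray):
        return values

    # Encoding path: give each distinct value an index in first-seen order.
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(positions)
        indices.append(positions[value])
    return ToyDictionaryArray(tuple(indices), tuple(positions))


plain = ["a", "b", "c", "a", "b", "c", "a", "b", "c"]
encoded = toy_dictionary_encode(plain)
encoded_again = toy_dictionary_encode(encoded)
```

Here `encoded_again` is the same object as `encoded`, which mirrors the idea behind the new C++ test: encoding a dictionary array yields the same dictionary and the same indices.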
Showing 4 changed files with 20 additions and 6 deletions.
13 changes: 8 additions & 5 deletions cpp/src/arrow/compute/kernels/vector_hash.cc
@@ -718,8 +718,9 @@ const DictionaryEncodeOptions* GetDefaultDictionaryEncodeOptions() {

 const FunctionDoc dictionary_encode_doc(
     "Dictionary-encode array",
-    ("Return a dictionary-encoded version of the input array."), {"array"},
-    "DictionaryEncodeOptions");
+    ("Return a dictionary-encoded version of the input array.\n"
+     "This function does nothing if the input is already a dictionary array."),
+    {"array"}, "DictionaryEncodeOptions");

// ----------------------------------------------------------------------
// This function does not use any hashing utilities
@@ -803,9 +804,11 @@ void RegisterVectorHash(FunctionRegistry* registry) {
GetDefaultDictionaryEncodeOptions());
AddHashKernels<DictEncodeAction>(dict_encode.get(), base, DictEncodeOutput);

-  // Calling dictionary_encode on dictionary input not supported, but if it
-  // ends up being needed (or convenience), a kernel could be added to make it
-  // a no-op
+  auto no_op = [](KernelContext*, const ExecSpan& span, ExecResult* out) {
+    out->value = span[0].array.ToArrayData();
+    return Status::OK();
+  };
+  DCHECK_OK(dict_encode->AddKernel({Type::DICTIONARY}, OutputType(FirstType), no_op));

DCHECK_OK(registry->AddFunction(std::move(dict_encode)));
}
9 changes: 9 additions & 0 deletions cpp/src/arrow/compute/kernels/vector_hash_test.cc
@@ -687,6 +687,15 @@ TEST_F(TestHashKernel, DictEncodeIntervalMonth) {
{0, 0, 1, 0, 2});
}

+TEST_F(TestHashKernel, DictEncodeDictInput) {
+  // Dictionary encode a dictionary is a no-op
+  auto dict_ty = dictionary(int32(), utf8());
+  auto dict = ArrayFromJSON(utf8(), R"(["a", "b", "c"])");
+  auto indices = ArrayFromJSON(int32(), "[0, 1, 2, 0, 1, 2, 0, 1, 2]");
+  auto input = std::make_shared<DictionaryArray>(dict_ty, indices, dict);
+  CheckDictEncode(input, dict, indices);
+}
+
TEST_F(TestHashKernel, DictionaryUniqueAndValueCounts) {
auto dict_json = "[10, 20, 30, 40]";
auto dict = ArrayFromJSON(int64(), dict_json);
3 changes: 2 additions & 1 deletion docs/source/cpp/compute.rst
@@ -1675,7 +1675,8 @@ Associative transforms
| | | Temporal, Binary- and String-like | | |
+-------------------+-------+-----------------------------------+-------------+-------+

-* \(1) Output is ``Dictionary(Int32, input type)``.
+* \(1) Output is ``Dictionary(Int32, input type)``. It is a no-op if input is
+  already a Dictionary array.

* \(2) Duplicates are removed from the output while the original order is
maintained.
1 change: 1 addition & 0 deletions python/pyarrow/tests/test_compute.py
@@ -1781,6 +1781,7 @@ def test_dictionary_decode():

assert array == dictionary_array_decode
assert array == pc.dictionary_decode(array)
+    assert pc.dictionary_encode(dictionary_array) == dictionary_array


def test_cast():