Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-554: [C++] Add functions to unify dictionary types and arrays #3165

Closed
wants to merge 1 commit into from
Closed
Changes from all commits
Commits
File filter...
Filter file types
Jump to…
Jump to file or symbol
Failed to load files and symbols.
+484 −85
Diff settings

Always

Just for now

@@ -31,13 +31,16 @@
#include "arrow/test-common.h"
#include "arrow/test-util.h"
#include "arrow/type.h"
#include "arrow/util/checked_cast.h"
#include "arrow/util/decimal.h"

namespace arrow {

using std::string;
using std::vector;

using internal::checked_cast;

// ----------------------------------------------------------------------
// Dictionary tests

@@ -740,4 +743,63 @@ TEST(TestDictionary, FromArray) {
ASSERT_RAISES(Invalid, DictionaryArray::FromArrays(dict_type, indices4, &arr4));
}

TEST(TestDictionary, TransposeBasic) {
std::shared_ptr<Array> arr, out, expected;

This comment has been minimized.

Copy link
@fsaintjacques

fsaintjacques Dec 19, 2018

Contributor

Can you type them DictionaryArray at the declaration to avoid all casts?

This comment has been minimized.

Copy link
@pitrou

pitrou Dec 19, 2018

Author Contributor

Well, other APIs take Array pointers, not DictionaryArray.


auto dict = ArrayFromJSON(utf8(), "[\"A\", \"B\", \"C\"]");
auto dict_type = dictionary(int16(), dict);
auto indices = ArrayFromJSON(int16(), "[1, 2, 0, 0]");
// ["B", "C", "A", "A"]
ASSERT_OK(DictionaryArray::FromArrays(dict_type, indices, &arr));

// Transpose to same index type
{
auto out_dict = ArrayFromJSON(utf8(), "[\"Z\", \"A\", \"C\", \"B\"]");
auto out_dict_type = dictionary(int16(), out_dict);

const std::vector<int32_t> transpose_map{1, 3, 2};
ASSERT_OK(internal::checked_cast<const DictionaryArray&>(*arr).Transpose(
default_memory_pool(), out_dict_type, transpose_map, &out));

auto expected_indices = ArrayFromJSON(int16(), "[3, 2, 1, 1]");
ASSERT_OK(DictionaryArray::FromArrays(out_dict_type, expected_indices, &expected));
AssertArraysEqual(*out, *expected);
}

// Transpose to other type
{
auto out_dict = ArrayFromJSON(utf8(), "[\"Z\", \"A\", \"C\", \"B\"]");
auto out_dict_type = dictionary(int8(), out_dict);

const std::vector<int32_t> transpose_map{1, 3, 2};
ASSERT_OK(internal::checked_cast<const DictionaryArray&>(*arr).Transpose(
default_memory_pool(), out_dict_type, transpose_map, &out));

auto expected_indices = ArrayFromJSON(int8(), "[3, 2, 1, 1]");
ASSERT_OK(DictionaryArray::FromArrays(out_dict_type, expected_indices, &expected));
AssertArraysEqual(*expected, *out);
}
}

TEST(TestDictionary, TransposeNulls) {
std::shared_ptr<Array> arr, out, expected;

auto dict = ArrayFromJSON(utf8(), "[\"A\", \"B\", \"C\"]");
auto dict_type = dictionary(int16(), dict);
auto indices = ArrayFromJSON(int16(), "[1, 2, null, 0]");
// ["B", "C", null, "A"]
ASSERT_OK(DictionaryArray::FromArrays(dict_type, indices, &arr));

auto out_dict = ArrayFromJSON(utf8(), "[\"Z\", \"A\", \"C\", \"B\"]");
auto out_dict_type = dictionary(int16(), out_dict);

const std::vector<int32_t> transpose_map{1, 3, 2};
ASSERT_OK(internal::checked_cast<const DictionaryArray&>(*arr).Transpose(
default_memory_pool(), out_dict_type, transpose_map, &out));

auto expected_indices = ArrayFromJSON(int16(), "[3, 2, null, 1]");
ASSERT_OK(DictionaryArray::FromArrays(out_dict_type, expected_indices, &expected));
AssertArraysEqual(*expected, *out);
}

} // namespace arrow
@@ -33,6 +33,7 @@
#include "arrow/util/bit-util.h"
#include "arrow/util/checked_cast.h"
#include "arrow/util/decimal.h"
#include "arrow/util/int-util.h"
#include "arrow/util/logging.h"
#include "arrow/util/macros.h"
#include "arrow/visitor.h"
@@ -663,6 +664,66 @@ std::shared_ptr<Array> DictionaryArray::dictionary() const {
return dict_type_->dictionary();
}

template <typename InType, typename OutType>
static Status TransposeDictIndices(MemoryPool* pool, const ArrayData& in_data,
const std::shared_ptr<DataType>& type,
const std::vector<int32_t>& transpose_map,
std::shared_ptr<Array>* out) {
using in_c_type = typename InType::c_type;
using out_c_type = typename OutType::c_type;

std::shared_ptr<Buffer> out_buffer;
RETURN_NOT_OK(AllocateBuffer(pool, in_data.length * sizeof(out_c_type), &out_buffer));
// Null bitmap is unchanged
auto out_data = ArrayData::Make(type, in_data.length, {in_data.buffers[0], out_buffer},
in_data.null_count);
internal::TransposeInts(in_data.GetValues<in_c_type>(1),
out_data->GetMutableValues<out_c_type>(1), in_data.length,
transpose_map.data());

This comment has been minimized.

Copy link
@wesm

wesm Dec 19, 2018

Member

Note: this is a specific case of the "take" function, a la ndarray.take or "from" verb in J. Here we have

out = in TAKE transpose_map

https://issues.apache.org/jira/browse/ARROW-772

This comment has been minimized.

Copy link
@pitrou

pitrou Dec 19, 2018

Author Contributor

Yes, though this one has transparent up/downcasting as well (which we probably don't want in the general case of a "take" function).

*out = MakeArray(out_data);
return Status::OK();
}

Status DictionaryArray::Transpose(MemoryPool* pool, const std::shared_ptr<DataType>& type,
const std::vector<int32_t>& transpose_map,
std::shared_ptr<Array>* out) const {
DCHECK_EQ(type->id(), Type::DICTIONARY);
const auto& out_dict_type = checked_cast<const DictionaryType&>(*type);

// XXX We'll probably want to make this operation a kernel when we
// implement dictionary-to-dictionary casting.
auto in_type_id = dict_type_->index_type()->id();
auto out_type_id = out_dict_type.index_type()->id();

This comment has been minimized.

Copy link
@wesm

wesm Dec 19, 2018

Member

Do you think it's worth the (minor) expense of checking for no-op case where transpose_map is range(dictionary_size)? This covers the case where the unified dictionary type is an extended version of the current type, e.g. as the result of a delta dictionary

Perhaps let's create a follow up JIRA so as to not hold up this patch

This comment has been minimized.

Copy link
@pitrou

pitrou Dec 19, 2018

Author Contributor

There is already an XXX for that in type.h. I agree it could be beneficial in many cases (such as when you are simply augmenting an existing dictionary).

This comment has been minimized.

#define TRANSPOSE_IN_OUT_CASE(IN_INDEX_TYPE, OUT_INDEX_TYPE) \
case OUT_INDEX_TYPE::type_id: \
return TransposeDictIndices<IN_INDEX_TYPE, OUT_INDEX_TYPE>(pool, *data(), type, \
transpose_map, out);

#define TRANSPOSE_IN_CASE(IN_INDEX_TYPE) \
case IN_INDEX_TYPE::type_id: \
switch (out_type_id) { \
TRANSPOSE_IN_OUT_CASE(IN_INDEX_TYPE, Int8Type) \
TRANSPOSE_IN_OUT_CASE(IN_INDEX_TYPE, Int16Type) \
TRANSPOSE_IN_OUT_CASE(IN_INDEX_TYPE, Int32Type) \
TRANSPOSE_IN_OUT_CASE(IN_INDEX_TYPE, Int64Type) \
default: \
return Status::NotImplemented("unexpected index type"); \
}

switch (in_type_id) {
TRANSPOSE_IN_CASE(Int8Type)
TRANSPOSE_IN_CASE(Int16Type)
TRANSPOSE_IN_CASE(Int32Type)
TRANSPOSE_IN_CASE(Int64Type)
default:
return Status::NotImplemented("unexpected index type");
}

#undef TRANSPOSE_IN_OUT_CASE
#undef TRANSPOSE_IN_CASE
}

// ----------------------------------------------------------------------
// Implement Array::Accept as inline visitor

@@ -422,6 +422,9 @@ class ARROW_EXPORT NumericArray : public PrimitiveArray {

value_type Value(int64_t i) const { return raw_values()[i]; }

// For API compatibility with BinaryArray etc.
value_type GetView(int64_t i) const { return Value(i); }

protected:
using PrimitiveArray::PrimitiveArray;
};
@@ -442,6 +445,8 @@ class ARROW_EXPORT BooleanArray : public PrimitiveArray {
i + data_->offset);
}

bool GetView(int64_t i) const { return Value(i); }

protected:
using PrimitiveArray::PrimitiveArray;
};
@@ -802,14 +807,31 @@ class ARROW_EXPORT DictionaryArray : public Array {
/// This function does the validation of the indices and input type. It checks if
/// all indices are non-negative and smaller than the size of the dictionary
///
/// \param[in] type a data type containing a dictionary
/// \param[in] type a dictionary type
/// \param[in] indices an array of non-negative signed
/// integers smaller than the size of the dictionary
/// \param[out] out the resulting DictionaryArray instance
static Status FromArrays(const std::shared_ptr<DataType>& type,
const std::shared_ptr<Array>& indices,
std::shared_ptr<Array>* out);

/// \brief Transpose this DictionaryArray
///
/// This method constructs a new dictionary array with the given dictionary type,
/// transposing indices using the transpose map.
/// The type and the transpose map are typically computed using
/// DictionaryType::Unify.
///
/// \param[in] pool a pool to allocate the array data from
/// \param[in] type a dictionary type
/// \param[in] transpose_map a vector transposing this array's indices
/// into the target array's indices
/// \param[out] out the resulting DictionaryArray instance
Status Transpose(MemoryPool* pool, const std::shared_ptr<DataType>& type,
const std::vector<int32_t>& transpose_map,
std::shared_ptr<Array>* out) const;
// XXX Do we also want an unsafe in-place Transpose?

This comment has been minimized.

Copy link
@wesm

wesm Dec 19, 2018

Member

I think we do want that but how we handle it may require some care. This also came up in the context of Arrow Flight where the dictionaries are unknown at the start of an RPC session

https://issues.apache.org/jira/browse/ARROW-3144

I would suggest that we address both cases through a single design, i.e. a MutableDictionaryType subtype or something similar.

One complexity that this introduces (which we need to be prepared for anyway) is the some code will need to check whether type->dictionary() is null or not (if it were null, it would mean that some data has been used inappropriately before the dictionary has been resolved), but I think that's OK.

This comment has been minimized.

Copy link
@wesm

wesm Dec 21, 2018

Member

I left a comment in ARROW-3144 directing to this discussion


std::shared_ptr<Array> indices() const;
std::shared_ptr<Array> dictionary() const;

Oops, something went wrong.
ProTip! Use n and p to navigate between commits in a pull request.
You can’t perform that action at this time.