Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-554: [C++] Add functions to unify dictionary types and arrays #3165

Closed
wants to merge 1 commit into from

Conversation

Projects
None yet
5 participants
@pitrou
Copy link
Contributor

pitrou commented Dec 12, 2018

No description provided.

@pitrou pitrou added the WIP label Dec 12, 2018

@pitrou pitrou force-pushed the pitrou:ARROW-554-conform-dicts branch 2 times, most recently from 28e55ce to cf6f8e2 Dec 12, 2018

Show resolved Hide resolved cpp/src/arrow/builder-dict.cc Outdated

@pitrou pitrou force-pushed the pitrou:ARROW-554-conform-dicts branch 3 times, most recently from f961288 to a72a759 Dec 17, 2018

@pitrou pitrou changed the title [WIP] ARROW-554: [C++] Add functions to conform dictionaries ARROW-554: [C++] Add functions to conform dictionaries Dec 18, 2018

@pitrou pitrou removed the WIP label Dec 18, 2018

@pitrou

This comment has been minimized.

Copy link
Contributor Author

pitrou commented Dec 18, 2018

This PR is ready for review now.

@pitrou pitrou force-pushed the pitrou:ARROW-554-conform-dicts branch from a72a759 to 7d2579b Dec 19, 2018

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Dec 19, 2018

Codecov Report

Merging #3165 into master will increase coverage by 1.19%.
The diff coverage is 89.71%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3165      +/-   ##
==========================================
+ Coverage   86.44%   87.63%   +1.19%     
==========================================
  Files         508      446      -62     
  Lines       70077    66052    -4025     
==========================================
- Hits        60576    57885    -2691     
+ Misses       9398     8167    -1231     
+ Partials      103        0     -103
Impacted Files Coverage Δ
cpp/src/arrow/visitor_inline.h 96.89% <ø> (ø) ⬆️
cpp/src/arrow/util/hashing.h 99.25% <ø> (ø) ⬆️
cpp/src/arrow/type.h 87.01% <ø> (ø) ⬆️
python/pyarrow/tests/test_types.py 96.89% <100%> (ø) ⬆️
cpp/src/arrow/type-test.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/util/int-util-test.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/type.cc 92.5% <100%> (+0.08%) ⬆️
cpp/src/arrow/util/int-util.cc 99.6% <100%> (+0.02%) ⬆️
cpp/src/arrow/array.h 98.44% <50%> (-0.77%) ⬇️
cpp/src/arrow/array/builder_dict.cc 65.15% <74.62%> (+11.3%) ⬆️
... and 87 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 944b9e3...7d2579b. Read the comment docs.

@fsaintjacques
Copy link
Contributor

fsaintjacques left a comment

LGTM

@@ -740,4 +743,63 @@ TEST(TestDictionary, FromArray) {
ASSERT_RAISES(Invalid, DictionaryArray::FromArrays(dict_type, indices4, &arr4));
}

TEST(TestDictionary, TransposeBasic) {
std::shared_ptr<Array> arr, out, expected;

This comment has been minimized.

Copy link
@fsaintjacques

fsaintjacques Dec 19, 2018

Contributor

Can you type them DictionaryArray at the declaration to avoid all casts?

This comment has been minimized.

Copy link
@pitrou

pitrou Dec 19, 2018

Author Contributor

Well, other APIs take Array pointers, not DictionaryArray.

std::vector<int32_t> transpose_map = {1111, 2222, 3333, 4444, 5555, 6666, 7777};
std::vector<int64_t> dest(src.size());

TransposeInts(src.data(), dest.data(), 6, transpose_map.data());

This comment has been minimized.

Copy link
@fsaintjacques

fsaintjacques Dec 19, 2018

Contributor

src.size() instead of 6, just in case.

This comment has been minimized.

Copy link
@pitrou

pitrou Dec 19, 2018

Author Contributor

It may emit a cast warning, though :-/

template <typename InputInt, typename OutputInt>
void TransposeInts(const InputInt* src, OutputInt* dest, int64_t length,
const int32_t* transpose_map) {
while (length >= 4) {

This comment has been minimized.

Copy link
@fsaintjacques

fsaintjacques Dec 19, 2018

Contributor

Clang does this unrolling naturally, gcc does not :( https://godbolt.org/z/ZHobe1

This comment has been minimized.

Copy link
@pitrou

pitrou Dec 19, 2018

Author Contributor

Neither does MSVC. The compiler generally doesn't have enough context to know whether unrolling is a good idea (I'm surprised than clang does it at -O2, btw). I'd rather unroll explicitly where we have that context.

while (length >= 4) {
dest[0] = static_cast<OutputInt>(transpose_map[src[0]]);
dest[1] = static_cast<OutputInt>(transpose_map[src[1]]);
dest[2] = static_cast<OutputInt>(transpose_map[src[2]]);

This comment has been minimized.

Copy link
@fsaintjacques

fsaintjacques Dec 19, 2018

Contributor

Aren't you concerned by truncation due to incompatible size, maybe a static_assert somewhere?

This comment has been minimized.

Copy link
@pitrou

pitrou Dec 19, 2018

Author Contributor

What would we static_assert? Indeed it's up to the caller to ensure the input is ok (it is by construction in case of dictionary unification).

This comment has been minimized.

Copy link
@fsaintjacques

fsaintjacques Dec 19, 2018

Contributor

sizeof(in_type) <= sizeof(out_type)
is_signed(in_type) == is_signed(out_type)

This comment has been minimized.

Copy link
@pitrou

pitrou Dec 19, 2018

Author Contributor

Ah, but the converse is supported as well.

@bkietz

bkietz approved these changes Dec 19, 2018

/// one per input type. Each integer vector represents the transposition
/// of input type indices into unified type indices.
// XXX Should we return something special (an empty transpose map?) when
// the transposition is the identity function?

This comment has been minimized.

Copy link
@bkietz

bkietz Dec 19, 2018

Contributor

Is there an instance where the first transposition isn't the identity function? If not then maybe we could just return types.size()-1 transpositions

This comment has been minimized.

Copy link
@pitrou

pitrou Dec 19, 2018

Author Contributor

Not in this implementation, but the XXX is about the more general case (all transpositions could actually be the identity function).

@wesm

This comment has been minimized.

Copy link
Member

wesm commented Dec 19, 2018

Giving this a look before merging, bear with me

@wesm
Copy link
Member

wesm left a comment

Thank you for kicking off this work. This can be a nuanced topic -- there's a number of follow up items that I've noted in comments. I'm going to go ahead and merge this patch and then we can create follow up JIRAs for discussion and work. A number of these items of follow up work will be forced by the development of other things, so no specific urgency

in_data.null_count);
internal::TransposeInts(in_data.GetValues<in_c_type>(1),
out_data->GetMutableValues<out_c_type>(1), in_data.length,
transpose_map.data());

This comment has been minimized.

Copy link
@wesm

wesm Dec 19, 2018

Member

Note: this is a specific case of the "take" function, a la ndarray.take or "from" verb in J. Here we have

out = in TAKE transpose_map

https://issues.apache.org/jira/browse/ARROW-772

This comment has been minimized.

Copy link
@pitrou

pitrou Dec 19, 2018

Author Contributor

Yes, though this one has transparent up/downcasting as well (which we probably don't want in the general case of a "take" function).

// implement dictionary-to-dictionary casting.
auto in_type_id = dict_type_->index_type()->id();
auto out_type_id = out_dict_type.index_type()->id();

This comment has been minimized.

Copy link
@wesm

wesm Dec 19, 2018

Member

Do you think it's worth the (minor) expense of checking for no-op case where transpose_map is range(dictionary_size)? This covers the case where the unified dictionary type is an extended version of the current type, e.g. as the result of a delta dictionary

Perhaps let's create a follow up JIRA so as to not hold up this patch

This comment has been minimized.

Copy link
@pitrou

pitrou Dec 19, 2018

Author Contributor

There is already an XXX for that in type.h. I agree it could be beneficial in many cases (such as when you are simply augmenting an existing dictionary).

This comment has been minimized.

Status Transpose(MemoryPool* pool, const std::shared_ptr<DataType>& type,
const std::vector<int32_t>& transpose_map,
std::shared_ptr<Array>* out) const;
// XXX Do we also want an unsafe in-place Transpose?

This comment has been minimized.

Copy link
@wesm

wesm Dec 19, 2018

Member

I think we do want that but how we handle it may require some care. This also came up in the context of Arrow Flight where the dictionaries are unknown at the start of an RPC session

https://issues.apache.org/jira/browse/ARROW-3144

I would suggest that we address both cases through a single design, i.e. a MutableDictionaryType subtype or something similar.

One complexity that this introduces (which we need to be prepared for anyway) is the some code will need to check whether type->dictionary() is null or not (if it were null, it would mean that some data has been used inappropriately before the dictionary has been resolved), but I think that's OK.

This comment has been minimized.

Copy link
@wesm

wesm Dec 21, 2018

Member

I left a comment in ARROW-3144 directing to this discussion

dict_types.push_back(checked_cast<const DictionaryType*>(type));
}

// XXX Should we check the ordered flag?

This comment has been minimized.

Copy link
@wesm

wesm Dec 19, 2018

Member

This is a bit tricky. We should consider it in a follow up JIRA.

You could conceivably have a sequence of dictionaries like

a < b < c
a < b < c < d
a < b < c < d < e < f

This could result from a sequence of ordered delta dictionaries. You could argue in this case that the output type should also be ordered.

This comment has been minimized.

Copy link
@pitrou

pitrou Dec 19, 2018

Author Contributor

IMHO the "ordered" flag doesn't seem to make sense. A dictionary is simply a linear lookup table, so it's always ordered in a sense.

This comment has been minimized.

Copy link
@wesm

This comment has been minimized.

Copy link
@bkietz

bkietz Dec 20, 2018

Contributor

Minor optimization: if dictionaries have an ordering, that could be used to merge them without building a memo table.

This comment has been minimized.

Copy link
@pitrou

pitrou Dec 20, 2018

Author Contributor

@bkietz I'm not sure I understand your suggestion. "ordered" here does not mean "sorted", just that the dictionary's order is semantically meaningful.

This comment has been minimized.

Copy link
@bkietz

bkietz Dec 20, 2018

Contributor
A   0     1 2 3 4 5       6

B     7 8 1   3 4   10 11 6

U   0 7 8 1 2 3 4 5 10 11 6

This comment has been minimized.

Copy link
@pitrou

pitrou Dec 20, 2018

Author Contributor

Right, but how does the algorithm look like and why would it more efficient than a series of hash lookups?

This comment has been minimized.

Copy link
@bkietz

bkietz Dec 20, 2018

Contributor

Sorry, you're right; talking about optimization is not justified when there is not a version which preserves order using a hash table. I'll write an example of each and benchmark them

This comment has been minimized.

Copy link
@wesm

wesm Dec 20, 2018

Member

I would say it's not worth spending time right now on handling multiple "ordered" dictionaries (where the only case where "ordered" would be preserved is when all the dictionaries are prefixes of a common dictionary)

This comment has been minimized.

Copy link
@wesm

Status DictionaryType::Unify(MemoryPool* pool, const std::vector<const DataType*>& types,
std::shared_ptr<DataType>* out_type,
std::vector<std::vector<int32_t>>* out_transpose_maps) {

This comment has been minimized.

Copy link
@wesm

wesm Dec 19, 2018

Member

We should also consider the case where we want to conform a dictionary to an other, without unification. For example:

[a, b, c, a, b, c], dictionary [a, b, c]

conform to dictionary [a, c, d]

yields

[a, null, c, a, null, c]

This isn't unification so much as conformance. This was what was meant by my comment "Conform the dictionary indices to reference a common dictionary" in ARROW-554. The use case would be implementing casts between dictionary types -- this would include a safe and unsafe mode (safe would error if a value is made null)

Since this use case is not addressed in this patch (unless I missed it), let us create a follow up JIRA

This comment has been minimized.

Copy link
@pitrou

pitrou Dec 19, 2018

Author Contributor

Hmm, I see. I have misinterpreted the original JIRA then. In any case, I was thinking of the CSV use case, where we'll be unifying dictionaries at the end.

This comment has been minimized.

auto t2 = dictionary(int8(), ArrayFromJSON(int32(), "[3, 4, 5, 6]"));
auto t3 = dictionary(int16(), ArrayFromJSON(int32(), "[3, 4, 5, 6]"));
auto t4 = dictionary(int8(), ArrayFromJSON(int16(), "[3, 4, 5, 6]"));
auto t5 = dictionary(int8(), ArrayFromJSON(int32(), "[3, 4, 7, 6]"));

This comment has been minimized.

Copy link
@wesm

wesm Dec 19, 2018

Member

I'm quite glad we have this function now =) We might consider adding even more testing sugar so that we can write jsarray::int64("[3, 4, 5, 6]") or some such

This comment has been minimized.

Copy link
@bkietz

bkietz Dec 20, 2018

Contributor

"[3, 1, 4, 2]"_array(int64())

ty1 = pa.dictionary(pa.float32(), pa.array([1.0, 2.0]), ordered=True)
assert ty1.index_type == pa.float32()
ty1 = pa.dictionary(pa.int8(), pa.array([1.0, 2.0]), ordered=True)
assert ty1.index_type == pa.int8()

This comment has been minimized.

Copy link
@wesm

@wesm wesm changed the title ARROW-554: [C++] Add functions to conform dictionaries ARROW-554: [C++] Add functions to unify dictionary types and arrays Dec 19, 2018

@wesm wesm closed this in f66fa80 Dec 19, 2018

@pitrou pitrou deleted the pitrou:ARROW-554-conform-dicts branch Dec 19, 2018

@wesm

This comment has been minimized.

Copy link
Member

wesm commented Dec 21, 2018

OK I think I have all the follow up items accounted for in JIRA

cav71 added a commit to cav71/arrow that referenced this pull request Dec 22, 2018

ARROW-554: [C++] Add functions to unify dictionary types and arrays
Author: Antoine Pitrou <antoine@python.org>

Closes apache#3165 from pitrou/ARROW-554-conform-dicts and squashes the following commits:

7d2579b <Antoine Pitrou> ARROW-554:  Add functions to conform dictionaries
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.