ARROW-427: [C++] Implement dictionary array type #268

Closed
wants to merge 7 commits into
from

Projects

None yet

2 participants

@wesm
Member
wesm commented Jan 4, 2017 edited

I thought some about this and thought that it made sense to store the reference to the dictionary values themselves in the data type object, similar to CategoricalDtype in pandas. This will be at least adequate for the Feather file format merge.

In the IPC metadata, there is no explicit dictionary type -- an array can be dictionary encoded or not. On JIRA we've discussed adding a dictionary type flag indicating whether or not the dictionary values/categories are ordered (also called "ordinal") or unordered (also called "nominal"). That hasn't been done yet.

cpp/src/arrow/type.h
+
+ private:
+ // Must be an integer type (not currently checked)
+ Type::type index_type_;
@wesm
wesm Jan 4, 2017 Member

Could instead store std::shared_ptr<DataType> here, which would make certain things easier.

@xhochy

Looking good!

cpp/src/arrow/array.h
+
+ // Alternate ctor; other attributes (like null count) are inherited from the
+ // passed indices array
+ DictionaryArray(
@xhochy
xhochy Jan 4, 2017 Member

What's the use-case for this ctor? Looks like it should resemble some kind of casting.

@wesm
wesm Jan 4, 2017 Member

It's possible that the dictionary indices get transported in a PrimitiveArray object -- I am thinking in visitor pattern scenarios -- so this spares manual unpacking the buffers / metadata from the indices.

cpp/src/arrow/type.h
+
+ private:
+ // Must be an integer type (not currently checked)
+ Type::type index_type_;
wesm added some commits Jan 4, 2017
@wesm wesm Add rudimentary DictionaryType and DictionaryArray implementation for…
… discussion

Change-Id: I9f7b89382479f5e0cd4f2cc7c21e17e4f26fd87b
b52b3a7
@wesm wesm Refactor, compose shared_ptr<DataType> in DictionaryType
Change-Id: I412d5a54c958fc5c4f2368aaee51f84cd6a2f609
17c70de
@wesm wesm Implement PrettyPrint for DictionaryArray
Change-Id: I9b5c25b915dd51cb8e755182bf67e1f8a8a32f68
b06eb86
@wesm
Member
wesm commented Jan 6, 2017

I'm going to write some more tests for the DictionaryArray class itself and leave the IPC, JSON, and pandas.Categorical implementation parts to follow on patches

@wesm wesm Add tests, implementation for DictionaryArray::Equals and RangeEquals
Change-Id: I01f802ce0ced9d98474c6b652d428f800910f923
9efe46b
@wesm wesm changed the title from ARROW-427: [C++] WIP: Implement dictionary array type to ARROW-427: [C++] Implement dictionary array type Jan 6, 2017
@wesm wesm Implement rudimentary DictionaryArray::Validate
Change-Id: Ie2d97a8072f61a44214236f749cfc50318395a71
9a4edb5
@xhochy

Any special reason to get rid of them?

Member
wesm replied Jan 6, 2017

We aren't using the codes anywhere; I figure we should wait until we are providing some kind of ABI stability contract before we set them to immutable values.

@wesm wesm Revert T::Equals(const T& other) to EqualsExact to appease clang
Change-Id: I9344ee59cf2df56fa3468112d084c0e114e8fb65
a6c2896
@wesm
Member
wesm commented Jan 6, 2017

This is ready to go once the build is green

@wesm wesm cpplint
Change-Id: I7592f8c6858711af1aaed2cc368eeee653cc72a8
5ce3701
@xhochy
xhochy approved these changes Jan 6, 2017 View changes

+1, LGTM.

@asfgit asfgit pushed a commit that closed this pull request Jan 6, 2017
@wesm wesm ARROW-427: [C++] Implement dictionary array type
I thought some about this and thought that it made sense to store the reference to the dictionary values themselves in the data type object, similar to `CategoricalDtype` in pandas. This will be at least adequate for the Feather file format merge.

In the IPC metadata, there is no explicit dictionary type -- an array can be dictionary encoded or not. On JIRA we've discussed adding a dictionary type flag indicating whether or not the dictionary values/categories are ordered (also called "ordinal") or unordered (also called "nominal"). That hasn't been done yet.

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #268 from wesm/ARROW-427 and squashes the following commits:

5ce3701 [Wes McKinney] cpplint
a6c2896 [Wes McKinney] Revert T::Equals(const T& other) to EqualsExact to appease clang
9a4edb5 [Wes McKinney] Implement rudimentary DictionaryArray::Validate
9efe46b [Wes McKinney] Add tests, implementation for DictionaryArray::Equals and RangeEquals
b06eb86 [Wes McKinney] Implement PrettyPrint for DictionaryArray
17c70de [Wes McKinney] Refactor, compose shared_ptr<DataType> in DictionaryType
b52b3a7 [Wes McKinney] Add rudimentary DictionaryType and DictionaryArray implementation for discussion
74685f3
@asfgit asfgit closed this in 74685f3 Jan 6, 2017
@wesm wesm deleted the wesm:ARROW-427 branch Jan 6, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment