Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-6870: [C#] Added Dictionary Array Types #5644

Closed
wants to merge 2 commits into from

Conversation

dparu-dev
Copy link

Added Dictionary Array types and builders that correspond to many of the existing Array Types that represent the arrays as dictionaries

@github-actions
Copy link

@emkornfield
Copy link
Contributor

@eerhardt do you think you will have time to review?

@eerhardt
Copy link
Contributor

Sorry, I missed this originally. I will review it today.

@eerhardt
Copy link
Contributor

@dparu-dev - thanks for your contribution! This is great work that will help out the .NET community using Arrow.

Forgive my ignorance, I haven't looked deeply into the "Dictionary" pieces in Arrow. But can you explain why we have each individual type of DictionaryArray?

  • Date32DictionaryArray
  • UInt16DictionaryArray
  • StringDictionaryArray
  • etc

I'm looking at the other languages' implementations: C++, Java, Python. And they only have a single DictonaryArray type. Not one for each underlying type. Why would .NET need to be different here?

@dparu-dev
Copy link
Author

The specific types are basically just to make accessing builders for them easier for the most part. However StringDictionaryArray requires a separate implementation to PrimativeDictionaryArray because strings are of a variable length and so we need to maintain offsets to where they start/end in the underlying buffer.

Or I should say at least I thought we did. If the other implementations just have the one type it is entirely possible I've messed up somewhere/not understood something.

@emkornfield
Copy link
Contributor

@eerhardt if it it helps helps dictionary Arrays are really just a composition of two Arrays. The dictionary array for a type, and an Integer array that indexes into the dictionaries. I'm not sure what the best way of modeling this in C# is, but this is one area where I think the Java codebase could use some improvement (need to refresh myself on the C++ code base).

From the IPC perspective each dictionary encoding column is tied to a dictionary with a specific "ID".

@eerhardt
Copy link
Contributor

if it it helps helps dictionary Arrays are really just a composition of two Arrays

I think that makes me an even firmer believer that we shouldn't have separate DictionaryArray classes for each type, but instead have a single DictionaryArray class that composes the underlying Array (which can be any concrete type).

That is what the C++ implementation does:

class ARROW_EXPORT DictionaryArray : public Array {
 public:
  using TypeClass = DictionaryType;

  explicit DictionaryArray(const std::shared_ptr<ArrayData>& data);

  DictionaryArray(const std::shared_ptr<DataType>& type,
                  const std::shared_ptr<Array>& indices,
                  const std::shared_ptr<Array>& dictionary);

To create a DictionaryArray - you pass in 3 things:

  • The type
  • The Indices array
  • The underlying "Data" array

@eerhardt
Copy link
Contributor

The specific types are basically just to make accessing builders for them easier for the most part.

For this, instead users would make a "builder" for the underlying arrays. Say they were making a "String" dictionary, they would use a StringArray.Builder. That would be used to build up the "values". Then they would use a IntXXXArray.Builder to build up the indices. (note that the indices appear to be able to be any signed integer type, according to the C++ code)

@dparu-dev
Copy link
Author

That is a fair view to take and I can alter things so we only have a single DictionaryArray type.

My rationale though was that by just gluing existing array builders together we require users to deal with ensuring they don't include the same value multiple times themselves and by taking this approach you can treat building a DictionaryArray in the same way you would treat building a more typical Dictionary.

@emkornfield
Copy link
Contributor

There are multiple different ways one might want to construct the actual dictionary batches (mostly involving the data-structure one uses for deduplication, Java has a couple of examples).

@wesm
Copy link
Member

wesm commented Nov 8, 2019

Thoughts about a way forward on this project?

@emkornfield
Copy link
Contributor

@eerhardt @dparu-dev ping?

@eerhardt
Copy link
Contributor

eerhardt commented Feb 3, 2020

Before this can be merged I think my above comments should be addressed. I don't think we should introduce all these new types when the other implementations are able to do it with a single DictionaryArray type.

@wesm
Copy link
Member

wesm commented Apr 1, 2020

I agree with @eerhardt's feedback -- it seems like having a simpler container would be better. I understand the desire to be able to pass either dictionary-encoded or non-dictionary-encoded versions of the same array into a function without that function knowing about whether it's dictionary-encoded or not, if that's important in C# then there may be some work to do around interfaces.

I'm going to close this PR for the sake of our overflowing PR queue. @dparu-dev when you've addressed the feedback please reopen this PR. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants