-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-6870: [C#] Added Dictionary Array Types #5644
Conversation
@eerhardt do you think you will have time to review? |
Sorry, I missed this originally. I will review it today. |
@dparu-dev - thanks for your contribution! This is great work that will help out the .NET community using Arrow. Forgive my ignorance, I haven't looked deeply into the "Dictionary" pieces in Arrow. But can you explain why we have each individual type of DictionaryArray?
I'm looking at the other languages' implementations: C++, Java, Python. And they only have a single |
The specific types are basically just to make accessing builders for them easier for the most part. However StringDictionaryArray requires a separate implementation to PrimativeDictionaryArray because strings are of a variable length and so we need to maintain offsets to where they start/end in the underlying buffer. Or I should say at least I thought we did. If the other implementations just have the one type it is entirely possible I've messed up somewhere/not understood something. |
@eerhardt if it it helps helps dictionary Arrays are really just a composition of two Arrays. The dictionary array for a type, and an Integer array that indexes into the dictionaries. I'm not sure what the best way of modeling this in C# is, but this is one area where I think the Java codebase could use some improvement (need to refresh myself on the C++ code base). From the IPC perspective each dictionary encoding column is tied to a dictionary with a specific "ID". |
I think that makes me an even firmer believer that we shouldn't have separate DictionaryArray classes for each type, but instead have a single DictionaryArray class that composes the underlying Array (which can be any concrete type). That is what the C++ implementation does: class ARROW_EXPORT DictionaryArray : public Array {
public:
using TypeClass = DictionaryType;
explicit DictionaryArray(const std::shared_ptr<ArrayData>& data);
DictionaryArray(const std::shared_ptr<DataType>& type,
const std::shared_ptr<Array>& indices,
const std::shared_ptr<Array>& dictionary); To create a DictionaryArray - you pass in 3 things:
|
For this, instead users would make a "builder" for the underlying arrays. Say they were making a "String" dictionary, they would use a |
That is a fair view to take and I can alter things so we only have a single DictionaryArray type. My rationale though was that by just gluing existing array builders together we require users to deal with ensuring they don't include the same value multiple times themselves and by taking this approach you can treat building a DictionaryArray in the same way you would treat building a more typical Dictionary. |
There are multiple different ways one might want to construct the actual dictionary batches (mostly involving the data-structure one uses for deduplication, Java has a couple of examples). |
Thoughts about a way forward on this project? |
@eerhardt @dparu-dev ping? |
Before this can be merged I think my above comments should be addressed. I don't think we should introduce all these new types when the other implementations are able to do it with a single |
I agree with @eerhardt's feedback -- it seems like having a simpler container would be better. I understand the desire to be able to pass either dictionary-encoded or non-dictionary-encoded versions of the same array into a function without that function knowing about whether it's dictionary-encoded or not, if that's important in C# then there may be some work to do around interfaces. I'm going to close this PR for the sake of our overflowing PR queue. @dparu-dev when you've addressed the feedback please reopen this PR. Thanks! |
Added Dictionary Array types and builders that correspond to many of the existing Array Types that represent the arrays as dictionaries