Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support BYTES type for group-by expression #5708

Merged
merged 1 commit into from Jul 21, 2020

Conversation

Jackie-Jiang
Copy link
Contributor

Description

Support BYTES type for group-by expression

Changes:

  • Add ValueToIdMap (on-the-fly dictionary) for BYTES type
  • Re-order the operation to save the per-value switch case in NoDictionaryMultiColumnGroupKeyGenerator. generateKeysForBlock()
  • Add type specific group key iterator in NoDictionaryMultiColumnGroupKeyGenerator for performance improvement
  • Enhance NoDictionaryGroupKeyGeneratorTest to test BYTES type

@@ -163,84 +164,201 @@ public int getCurrentGroupKeyUpperBound() {

@Override
public Iterator<GroupKey> getUniqueGroupKeys() {
return new GroupKeyIterator(_groupKeyMap);
switch (_dataType) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may sound like an over-optimization but adding type specific GroupKeyGenerator has the potential to add overhead for runtime dispatch which Java doesn't handle well.

What is the advantage of having a type specific next method like the following for Double

@Override
    public GroupKey next() {
      Double2IntMap.Entry entry = _iterator.next();
      _groupKey._groupId = entry.getIntValue();
      _groupKey._stringKey = Double.toString(entry.getDoubleKey());
      return _groupKey;
    }

v/s the existing generic method

@Override
    public GroupKey next() {
      Map.Entry<Object, Integer> entry = _iterator.next();
      _groupKey._groupId = entry.getValue();
      _groupKey._stringKey = entry.getKey().toString();
      return _groupKey;
    }

Both are doing toString()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main difference is from the entry-set and iterator implementation of the fastutil maps. For an example, Int2IntMap.entrySet() is marked deprecated and the class suggest using the type-specific method instead. Also, we are able to use the fastIterator() provided by the FastEntrySet, and also the unboxed values to reduce the garbage, and the unnecessary boxing/unboxing.

@Jackie-Jiang Jackie-Jiang merged commit 0927e15 into apache:master Jul 21, 2020
@Jackie-Jiang Jackie-Jiang deleted the group_by_bytes branch July 21, 2020 22:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants