Skip to content

[CALCITE-4102] Some improvements to aggregate related operations#2053

Merged
hsyuan merged 1 commit intoapache:masterfrom
liyafan82:fly_0702_opt
Jul 29, 2020
Merged

[CALCITE-4102] Some improvements to aggregate related operations#2053
hsyuan merged 1 commit intoapache:masterfrom
liyafan82:fly_0702_opt

Conversation

@liyafan82
Copy link
Copy Markdown
Contributor

  1. Improve the use of data structures
  2. Extract common methods
  3. Avoid expensive operations

@liyafan82 liyafan82 force-pushed the fly_0702_opt branch 2 times, most recently from 1c39e51 to 5b3e872 Compare July 2, 2020 06:07
@zinking
Copy link
Copy Markdown
Contributor

zinking commented Jul 2, 2020

looks all equivalent rewrites for the refactoring,
any explanation for the improvements

* Checks if all bit sets contain a particular bit.
*/
public static boolean allContain(Collection<ImmutableBitSet> bitSets,
int bit) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do not format like this

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised. Thank you for your kind reminder.

@liyafan82
Copy link
Copy Markdown
Contributor Author

looks all equivalent rewrites for the refactoring,
any explanation for the improvements

Sure. Thanks a lot for your feedback. Below are some explanations:

  1. We extract a common method for ImmutableBitSet, which is repeated multiple times in the code base. We also find it useful in our code base. This makes the code simpler and easier to maintain.
  2. In Aggregate#getRollup, we replace the LinkedHashSet with an ArrayList, because we can make sure that the newly added element must be distinct (according to the definition of a rollup). This makes adding data faster (LinkedHashSet takes O(logn) time, whereas ArrayList takes O(1)).
  3. When checking if a group set is a roll up, we replace the except operation with cardinality minus, which is much faster. We can make this optimization, because when a set contains another, the cardinality of the difference is the difference of the cardiality.
  4. In Util#firstDuplicate method, we replace the HashMap with a HashSet, because the value in the HashMap is not really used. This makes the logic clearer.

@yanlin-Lynn
Copy link
Copy Markdown
Contributor

make sense.
Can you update the commit log to describe the content of the PR shortly,
and add your name in the end.

See the guide in https://calcite.apache.org/develop/#contributing

@liyafan82
Copy link
Copy Markdown
Contributor Author

make sense.
Can you update the commit log to describe the content of the PR shortly,
and add your name in the end.

See the guide in https://calcite.apache.org/develop/#contributing

@yanlin-Lynn Thanks a lot for your kind reminder.
I have updated the commit message. Please check it.

@julianhyde
Copy link
Copy Markdown
Contributor

FWIW, I intentionally used HashMap rather than HashSet. HashSet uses a HashMap under the covers. I figured I'd save an object allocation. Yours is a bit clearer, a bit less efficient.

ret is not a great name for a variable; also declare it as List<Integer> not ArrayList<Integer>.

@liyafan82
Copy link
Copy Markdown
Contributor Author

FWIW, I intentionally used HashMap rather than HashSet. HashSet uses a HashMap under the covers. I figured I'd save an object allocation. Yours is a bit clearer, a bit less efficient.

Thanks a lot for your feedback. It is clever to HashMap to avoid object allocation.
Fortunately, the value object for HashSet is declared static, so it is allocated once, and shared by all HashSet objects.

ret is not a great name for a variable; also declare it as List<Integer> not ArrayList<Integer>.

Good suggestion. I have renamed ret to rollUpBits.

// Each subsequent items must be a subset with one fewer bit than the
// previous item
set.addAll(g.except(bitSet).toList());
ImmutableBitSet diff = g.except(bitSet);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there be duplicate values since Set is changed to List?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chunweilei Thanks a lot for your feedback.
According to the definition of Rollup (https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/rel/core/Aggregate.java#L508-L513), there can be no duplicate.

@hsyuan
Copy link
Copy Markdown
Member

hsyuan commented Jul 15, 2020

FWIW, I intentionally used HashMap rather than HashSet. HashSet uses a HashMap under the covers. I figured I'd save an object allocation. Yours is a bit clearer, a bit less efficient.

Thanks a lot for your feedback. It is clever to HashMap to avoid object allocation.
Fortunately, the value object for HashSet is declared static, so it is allocated once, and shared by all HashSet objects.

I think what @julianhyde means is that the HashSet holds a pointer to HashMap, with HashSet object header, it will use 16 bytes more than bare HashMap.

…a Fan)

1. Extract a common method for ImmutableBitSet
2. Refine the use of data structures
3. Replace expensive operations with cheap ones
@liyafan82
Copy link
Copy Markdown
Contributor Author

FWIW, I intentionally used HashMap rather than HashSet. HashSet uses a HashMap under the covers. I figured I'd save an object allocation. Yours is a bit clearer, a bit less efficient.

Thanks a lot for your feedback. It is clever to HashMap to avoid object allocation.
Fortunately, the value object for HashSet is declared static, so it is allocated once, and shared by all HashSet objects.

I think what @julianhyde means is that the HashSet holds a pointer to HashMap, with HashSet object header, it will use 16 bytes more than bare HashMap.

@hsyuan Thanks for your kind reminder. I have reverted the code to use HashMap, and made it clear in the comment that this is by design. BTW, why it is 16 more bytes?

@hsyuan
Copy link
Copy Markdown
Member

hsyuan commented Jul 16, 2020

I have reverted the code to use HashMap, and made it clear in the comment that this is by design. BTW, why it is 16 more bytes?

Suppose in Hotspot 64-bit VM, with COOPS enabled, the object header is 12 bytes, the pointer to HashMap uses 4 bytes. So it is 16 bytes.

@liyafan82
Copy link
Copy Markdown
Contributor Author

I have reverted the code to use HashMap, and made it clear in the comment that this is by design. BTW, why it is 16 more bytes?

Suppose in Hotspot 64-bit VM, with COOPS enabled, the object header is 12 bytes, the pointer to HashMap uses 4 bytes. So it is 16 bytes.

Awesome! Thanks for your clarification.

Copy link
Copy Markdown
Contributor

@yanlin-Lynn yanlin-Lynn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@yanlin-Lynn yanlin-Lynn added the LGTM-will-merge-soon Overall PR looks OK. Only minor things left. label Jul 20, 2020
@hsyuan hsyuan merged commit ce2ae64 into apache:master Jul 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

LGTM-will-merge-soon Overall PR looks OK. Only minor things left.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants