Support default star-tree #5147

Merged
merged 1 commit into from
May 12, 2020

Jackie-Jiang
Contributor

Support generating default star-tree config with the following rules:

  • All dictionary-encoded single-value dimensions (including date-time columns) with cardinality less than or equal to the threshold (10000) will be included in the split order, sorted by cardinality in descending order
  • The time column (if it exists and is dictionary-encoded) will be appended to the split order as the last element
  • Use COUNT(*) and SUM for all numeric metrics as function column pairs
  • Use default value (10000) for max leaf records
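
The rules above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the `Column` class, the method names, and the `COUNT(*)` / `SUM(col)` string rendering are hypothetical stand-ins for the schema and segment metadata that the real builder reads.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class DefaultStarTreeSketch {
  // Both defaults are 10000 according to the PR description.
  static final int CARDINALITY_THRESHOLD = 10_000;
  static final int MAX_LEAF_RECORDS = 10_000;

  enum FieldType { DIMENSION, DATE_TIME, TIME, METRIC }

  // Hypothetical stand-in for per-column schema and segment metadata.
  static final class Column {
    final String name;
    final FieldType type;
    final boolean singleValue;
    final boolean hasDictionary;
    final int cardinality;
    final boolean numeric;

    Column(String name, FieldType type, boolean singleValue, boolean hasDictionary,
        int cardinality, boolean numeric) {
      this.name = name;
      this.type = type;
      this.singleValue = singleValue;
      this.hasDictionary = hasDictionary;
      this.cardinality = cardinality;
      this.numeric = numeric;
    }
  }

  // Rules 1 and 2: qualifying dimensions by descending cardinality, time column last.
  static List<String> splitOrder(List<Column> columns) {
    List<Column> dimensions = new ArrayList<>();
    String timeColumn = null;
    for (Column c : columns) {
      if (!c.hasDictionary) {
        continue; // only dictionary-encoded columns can be split on
      }
      if (c.type == FieldType.TIME) {
        timeColumn = c.name; // no cardinality check for the time column
      } else if ((c.type == FieldType.DIMENSION || c.type == FieldType.DATE_TIME)
          && c.singleValue && c.cardinality <= CARDINALITY_THRESHOLD) {
        dimensions.add(c);
      }
    }
    dimensions.sort(Comparator.comparingInt((Column c) -> c.cardinality).reversed());
    List<String> order = new ArrayList<>();
    for (Column c : dimensions) {
      order.add(c.name);
    }
    if (timeColumn != null) {
      order.add(timeColumn);
    }
    return order;
  }

  // Rule 3: COUNT(*) plus one SUM per numeric metric.
  static List<String> functionColumnPairs(List<Column> columns) {
    List<String> pairs = new ArrayList<>();
    pairs.add("COUNT(*)");
    for (Column c : columns) {
      if (c.type == FieldType.METRIC && c.numeric) {
        pairs.add("SUM(" + c.name + ")");
      }
    }
    return pairs;
  }
}
```

In this sketch a dimension with cardinality above 10000, a multi-value dimension, or a dimension without a dictionary is simply left out of the split order, while a dictionary-encoded time column is always appended regardless of its cardinality.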

@codecov-io

codecov-io commented Mar 12, 2020

Codecov Report

Merging #5147 into master will increase coverage by 0.05%.
The diff coverage is 71.56%.

@@             Coverage Diff              @@
##             master    #5147      +/-   ##
============================================
+ Coverage     65.90%   65.95%   +0.05%     
  Complexity       12       12              
============================================
  Files          1052     1055       +3     
  Lines         54170    54150      -20     
  Branches       8078     8063      -15     
============================================
+ Hits          35702    35717      +15     
+ Misses        15819    15783      -36     
- Partials       2649     2650       +1     
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...e/pinot/broker/api/resources/PinotBrokerDebug.java | 76.66% <ø> (ø) | 0.00 <0.00> (ø) |
| .../BrokerResourceOnlineOfflineStateModelFactory.java | 55.81% <ø> (ø) | 0.00 <0.00> (ø) |
| .../pinot/broker/broker/helix/HelixBrokerStarter.java | 71.97% <ø> (ø) | 0.00 <0.00> (ø) |
| ...thandler/SingleConnectionBrokerRequestHandler.java | 92.68% <ø> (ø) | 0.00 <0.00> (ø) |
| ...rg/apache/pinot/broker/routing/RoutingManager.java | 79.61% <ø> (-1.54%) | 0.00 <0.00> (ø) |
| ...ting/instanceselector/InstanceSelectorFactory.java | 71.42% <ø> (ø) | 0.00 <0.00> (ø) |
| ...er/routing/segmentpruner/SegmentPrunerFactory.java | 83.33% <ø> (ø) | 0.00 <0.00> (ø) |
| ...outing/segmentselector/SegmentSelectorFactory.java | 60.00% <ø> (ø) | 0.00 <0.00> (ø) |
| ...oker/routing/timeboundary/TimeBoundaryManager.java | 87.50% <ø> (ø) | 0.00 <0.00> (ø) |
| ...mmon/assignment/InstanceAssignmentConfigUtils.java | 67.50% <ø> (ø) | 0.00 <0.00> (?) |

... and 169 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 00fcb1d...f2c78e0.

List<StarTreeV2BuilderConfig> starTreeV2BuilderConfigs = new ArrayList<>(starTreeIndexConfigs.size());
for (StarTreeIndexConfig starTreeIndexConfig : starTreeIndexConfigs) {
starTreeV2BuilderConfigs.add(StarTreeV2BuilderConfig.fromIndexConfig(starTreeIndexConfig));
if (indexingConfig.isEnableDefaultStarTree()) {
Contributor

If both are defined, you should pay attention to the specific star-tree definition requested. This will also keep backward compatibility.

Contributor Author

This won't cause a backward-compatibility issue because the default star-tree is not enabled by default.
But this is a great question, and we should not ignore the explicitly configured star-tree. Modified the code to generate star-trees for both configs (if both are defined, generate both the default and the customized ones).
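
A minimal sketch of that behavior, assuming configs can be represented as plain strings: the class, the method, and the `<default>` label are hypothetical, and placing the explicit configs before the default one is an assumption made here, not something stated in this thread.

```java
import java.util.ArrayList;
import java.util.List;

public class StarTreeConfigMergeSketch {
  // Hypothetical label standing in for the generated default config.
  static final String DEFAULT_CONFIG = "<default>";

  // Explicit configs are always kept; the default config is added on top
  // only when the flag is set, so existing tables are unaffected.
  static List<String> configsToBuild(List<String> explicitConfigs, boolean enableDefaultStarTree) {
    List<String> result = new ArrayList<>();
    if (explicitConfigs != null) {
      result.addAll(explicitConfigs);
    }
    if (enableDefaultStarTree) {
      result.add(DEFAULT_CONFIG);
    }
    return result;
  }
}
```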

@mcvsubbu
Contributor

Also, please be sure to add documentation

@Jackie-Jiang
Contributor Author

Also, please be sure to add documentation

Documentation for default star-tree config is inside StarTreeV2BuilderConfig. Will add more documentation in gitbook later

}

public void setEnableDefaultStarTree(boolean enableDefaultStarTree) {
_enableDefaultStarTree = enableDefaultStarTree;
Contributor

Another way to add this config may be to add a "mode" field in StarTreeIndexConfig. If mode is set to "auto", then ignore the other settings. Otherwise, follow the other settings.

Contributor Author

I prefer the current way so that the default star-tree can be enabled by one boolean flag. Don't want to change the existing StarTreeIndexConfig because adding a mode seems more confusing to me.

Member

@Jackie-Jiang will be good to get user feedback on this. @elonazoulay - you have used this feature - your inputs will be valuable.

@@ -145,6 +146,14 @@ public void setOnHeapDictionaryColumns(List<String> onHeapDictionaryColumns) {
_onHeapDictionaryColumns = onHeapDictionaryColumns;
}

public boolean isEnableDefaultStarTree() {
Member

isDefaultStarTreeEnabled

Contributor Author

This has to match the variable name (key of the config), and I prefer the config to be "enableDefaultStarTree": true
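
To illustrate why the accessor name matters, here is a simplified sketch of bean-style property naming, where the key is derived by stripping the `is`/`get` prefix and lower-casing the first letter. How Pinot actually maps this config is not shown in this thread, so treat the helper below as a hypothetical illustration of the convention only.

```java
public class ConfigKeySketch {
  // Simplified bean-style rule: drop "is"/"get", then lower-case the first letter.
  static String propertyName(String getterName) {
    String stem;
    if (getterName.startsWith("is")) {
      stem = getterName.substring(2);
    } else if (getterName.startsWith("get")) {
      stem = getterName.substring(3);
    } else {
      return getterName; // not an accessor; leave unchanged
    }
    if (stem.isEmpty()) {
      return getterName;
    }
    return Character.toLowerCase(stem.charAt(0)) + stem.substring(1);
  }
}
```

Under this rule `isEnableDefaultStarTree` yields the key `enableDefaultStarTree`, whereas the suggested `isDefaultStarTreeEnabled` would yield `defaultStarTreeEnabled`.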

switch (fieldSpec.getFieldType()) {
case DIMENSION:
case DATE_TIME:
ColumnMetadata columnMetadata = segmentMetadata.getColumnMetadataFor(column);
Member

why are we checking the cardinality threshold for date_time but not for time?

Contributor Author

Here I assume time will be included in most queries, in the range filter or the group-by, so I decided to always include the time column as the last dimension to split on.
For DATE_TIME, I assume the query pattern should be similar to other dimensions, so use the same rule for them.
Updated the comments for this.

Another way is to just treat all of them the same, but IMO always putting time column last should suit a wider range of use cases.

break;
case TIME:
columnMetadata = segmentMetadata.getColumnMetadataFor(column);
if (columnMetadata.hasDictionary()) {
Member

we cannot generate star tree on a column that's not dictionary encoded?

Contributor Author

Yes, the dimension must be dictionary encoded because we only store dictionary id in star-tree.
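
As a generic illustration of what "dictionary id" means here (not Pinot's dictionary implementation), the sketch below builds a sorted dictionary of distinct values and replaces each value with its index. The class and method names are hypothetical, and the sorted layout is an assumption chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

public class DictionarySketch {
  // Distinct values in sorted order; the list index serves as the dictionary id.
  static List<String> buildDictionary(List<String> values) {
    return new ArrayList<>(new TreeSet<>(values));
  }

  // Replace each raw value with its id, found by binary search in the sorted dictionary.
  static int[] encode(List<String> values, List<String> dictionary) {
    int[] ids = new int[values.size()];
    for (int i = 0; i < values.size(); i++) {
      ids[i] = Collections.binarySearch(dictionary, values.get(i));
    }
    return ids;
  }
}
```

In this sketch the dictionary size is the column's cardinality, and a structure that stores only the integer ids cannot represent a column that has no dictionary.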

}
break;
case METRIC:
if (fieldSpec.getDataType().isNumeric()) {
Member

what about hyperloglog

Contributor Author

For a BYTES column, there is no way to tell from the metadata what type of data it represents. Like all other aggregation types (such as MIN, MAX, etc.), DISTINCTCOUNTHLL needs to be configured through the customized star-tree config.

@Jackie-Jiang
Contributor Author

@mcvsubbu @kishoreg Addressed the comments, please take another look

@mcvsubbu
Contributor

@Jackie-Jiang did you get feedback from @elonazoulay as suggested by Kishore on the table config part? Or, are we going ahead without the feedback?

@Jackie-Jiang
Contributor Author

@Jackie-Jiang did you get feedback from @elonazoulay as suggested by Kishore on the table config part? Or, are we going ahead without the feedback?

No I haven't. @elonazoulay Does the default config (described in the pr message) look good to you?

@kishoreg
Member

kishoreg commented May 9, 2020

please rebase and push

@Jackie-Jiang Jackie-Jiang merged commit 25bc1b5 into apache:master May 12, 2020
@Jackie-Jiang Jackie-Jiang deleted the star_tree_default_config branch May 12, 2020 18:25