Use sorted index based filtering only for dictionary encoded column #6288
only for sorted column with dictionary
Codecov Report
@@ Coverage Diff @@
## master #6288 +/- ##
==========================================
+ Coverage 66.44% 74.01% +7.56%
==========================================
Files 1075 1252 +177
Lines 54773 61203 +6430
Branches 8168 8864 +696
==========================================
+ Hits 36396 45300 +8904
+ Misses 15700 12988 -2712
- Partials 2677 2915 +238
(Optional) IMO it is cleaner if we check the dictionary when initializing the data source metadata, since we don't support raw sorted index. In ImmutableDataSource.java, you can change the constructor of ImmutableDataSourceMetadata to _sorted = columnMetadata.isSorted() && columnMetadata.hasDictionary(). When we support raw sorted index in the future, we can change it back. Wdyt?
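To make the suggestion above concrete, here is a minimal sketch of folding the dictionary check into the metadata at construction time. The `ColumnMeta` interface, the `DataSourceMeta` class, and the `column` helper are hypothetical stand-ins written for illustration; they are not Pinot's actual `ImmutableDataSourceMetadata` or column metadata classes, and only the `_sorted` assignment is taken from the comment.

```java
final class SortedFlagAtInit {
  /** Hypothetical stand-in for per-column segment metadata. */
  interface ColumnMeta {
    boolean isSorted();

    boolean hasDictionary();
  }

  /** Hypothetical stand-in for the data source metadata built from ColumnMeta. */
  static final class DataSourceMeta {
    private final boolean _sorted;

    DataSourceMeta(ColumnMeta columnMetadata) {
      // Report "sorted" only when a dictionary exists, because the sorted
      // index is built only for dictionary encoded columns.
      _sorted = columnMetadata.isSorted() && columnMetadata.hasDictionary();
    }

    boolean isSorted() {
      return _sorted;
    }
  }

  /** Illustration-only helper that builds a ColumnMeta from two flags. */
  static ColumnMeta column(boolean sorted, boolean hasDictionary) {
    return new ColumnMeta() {
      @Override
      public boolean isSorted() {
        return sorted;
      }

      @Override
      public boolean hasDictionary() {
        return hasDictionary;
      }
    };
  }

  public static void main(String[] args) {
    System.out.println(new DataSourceMeta(column(true, true)).isSorted());
    System.out.println(new DataSourceMeta(column(true, false)).isSorted());
  }
}
```

With this shape, callers that only ask `isSorted()` would never pick a sorted index path for a raw column, at the cost of the flag no longer reflecting the physical sort order of the data alone.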
  Predicate.Type predicateType = predicateEvaluator.getPredicateType();
  if (predicateType == Predicate.Type.RANGE) {
- if (dataSource.getDataSourceMetadata().isSorted()) {
+ if (dataSource.getDataSourceMetadata().isSorted() && (dataSource.getDictionary() != null)) {
(nit)
- if (dataSource.getDataSourceMetadata().isSorted() && (dataSource.getDictionary() != null)) {
+ if (dataSource.getDataSourceMetadata().isSorted() && dataSource.getDictionary() != null) {
done
@@ -59,7 +65,7 @@ public static BaseFilterOperator getLeafFilterOperator(PredicateEvaluator predic
  } else if (predicateType == Predicate.Type.REGEXP_LIKE) {
  return new ScanBasedFilterOperator(predicateEvaluator, dataSource, numDocs);
  } else {
- if (dataSource.getDataSourceMetadata().isSorted()) {
+ if (dataSource.getDataSourceMetadata().isSorted() && (dataSource.getDictionary() != null)) {
(nit)
- if (dataSource.getDataSourceMetadata().isSorted() && (dataSource.getDictionary() != null)) {
+ if (dataSource.getDataSourceMetadata().isSorted() && dataSource.getDictionary() != null) {
done
Currently we build the sorted index only if the column is dictionary encoded. However, the isSorted flag in the on-disk segment metadata is written from the pre-index stats collector, so for a sorted column without a dictionary the segment metadata still marks the column as sorted:
properties.setProperty(getKeyFor(column, IS_SORTED), String.valueOf(columnIndexCreationInfo.isSorted()));
During query processing, when we create the filter operator, we check the data source metadata to see whether the column is sorted and, if so, create a sorted index based filter operator. Using this operator on a sorted raw column leads to the following error stack, since we end up using a raw value based predicate evaluator with a dictionary based filter operator.
The solution is to add a check on the data source to see whether the column is dictionary encoded.
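As a rough illustration of the guard described above, the sketch below models the operator choice with two flags. `ColumnInfo`, `OperatorKind`, and `chooseOperator` are hypothetical names invented for this example, not Pinot classes, and falling straight back to a scan is a simplifying assumption; the real operator selection in the PR's diff has more branches than shown here.

```java
final class SortedFilterSelection {
  /** Hypothetical summary of a column: sort flag from metadata, plus dictionary presence. */
  static final class ColumnInfo {
    final boolean sorted;
    final boolean hasDictionary;

    ColumnInfo(boolean sorted, boolean hasDictionary) {
      this.sorted = sorted;
      this.hasDictionary = hasDictionary;
    }
  }

  /** Hypothetical labels for the two operator families discussed in the description. */
  enum OperatorKind {
    SORTED_INDEX,
    SCAN
  }

  static OperatorKind chooseOperator(ColumnInfo column) {
    // The sorted flag alone is not enough: metadata may mark a raw column as
    // sorted even though no sorted index was built for it.
    if (column.sorted && column.hasDictionary) {
      return OperatorKind.SORTED_INDEX;
    }
    // Simplification: anything else is scanned in this sketch.
    return OperatorKind.SCAN;
  }

  public static void main(String[] args) {
    System.out.println(chooseOperator(new ColumnInfo(true, true)));
    System.out.println(chooseOperator(new ColumnInfo(true, false)));
  }
}
```

The key case is the second call in `main`: a column flagged as sorted but stored raw no longer reaches the dictionary based operator.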