New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CARBONDATA-3548]Polygon expression processing using unknown expression and filtering performance improvement #3616
Conversation
Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/263/ |
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1966/ |
buildExpression(ranges); | ||
} | ||
|
||
private boolean rangeBinarySearch(List<Long[]> ranges, long searchForNumber) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use collections.binarysearch()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did the binary seach on ranges directly instead of expanding ranges to values in another list and doing collections.binarysearch() on the resultant one. Tried to avoid the unnecessary expansion of ranges. It can reduce the search complexity to order of ln(number of list of ranges).
} | ||
|
||
@Override | ||
public ExpressionResult evaluate(RowIntf value) { | ||
throw new UnsupportedOperationException("Operation not supported for Polygon expression"); | ||
if (rangeBinarySearch(ranges, (Long) value.getVal(0))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Data is already sorted. So may be in future can take advantage of it , instead of checking each row exist in a range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah Agreed. Will think of it in future.
LGTM |
Why is this PR needed?
This PR improves the query processing performance of in_polygon UDF.
What changes were proposed in this PR?
At present, PolygonExpression processing leverages the existing InExpression. PolygonExpression internally creates a InExpression as a child to it. InExpression is constructed/build from the result of Quad tree algorithm. Algorithm returns the list of ranges(with each range having min and max Id for that range). And this list is a sorted one.
InExpression constitute of 2 childs. One child is a columnExpression(for geohash column) and the other is a ListExpression( with List of LiternalExpressions. One LiteralExpression for each Id returned from algo).
Problems associated with this approach:
Modifications with this PR:
Instead we can use UnknownExpression with RowLevelFilterResolverImpl and RowLevelFilterExecuterImpl processing. And override evaluate() method to do the binary searchon the list of ranges directly. This will significanly inprove the polygon filter query performance. And Polygon filter expression type is not required anymore at Carbon-Core module.
Does this PR introduce any user interface change?
Is any new testcase added?