-
Notifications
You must be signed in to change notification settings - Fork 979
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRILL-1328: Support table statistics #425
Conversation
@Override | ||
public void add() { | ||
if (work.obj != null) { | ||
com.clearspring.analytics.stream.cardinality.HyperLogLog hll = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this HLL package (com.clearspring.*) acceptable from a licensing perspective ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From [1], it is released under Apache 2.0 license [1] .
[1] https://github.com/addthis/stream-lib
[2] https://github.com/addthis/stream-lib/blob/master/LICENSE.txt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see...I suppose if @jacques-n has checked the licensing part I am good with it.
@vkorukanti have you had a chance to test this with metadata caching ? Since the row count stats is stored in both the metadata cache and the .drill.stats file, I am not sure where it is getting retrieved from. |
If you could clarify what we want to do with the ObjectHolder, I am ok with getting this patch merged in and address remaining issues incrementally. The benefits of having the framework for computing NDV and nullability statistics is substantial. |
ObjectHolder should be used cautiously. That is why it is marked as deprecated. I'm pretty sure it was deprecated when we first added it. In order to hold complex values that can change size in aggregation, we have to hold them in the Java heap because vectors are immutable. This is why we have ObjectHolder, basically allows us variable width mutable values in aggregation. It makes sense to use it here as well. |
@amansinha100 Thanks for reviewing, I haven't tested when the table has parquet metadata cache and stats. I will test it before posting the updated patch. |
Tested with parquet metadata cache. Currently if you do a count(*) on parquet table having metadata cache, we get that count from cache. For rowcount in planning: if a stats table exists we get the count from stats table, otherwise we get it from groupscan which in parquet case gets it from the metadata cache if exists. |
Updated patch. Also added a brief design and usage doc to JIRA. Waiting for CALCITE-1241. |
"REPLACE", | ||
"SCHEMAS", | ||
"TABLES", | ||
"FILES", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like this came up before where a list of non reserved keyword might result in some ambiguous queries. See DRILL-2116.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also DRILL-3875.
Just want to clarify. There are TODOs and somethings may not work in certain cases. Consider this patch as alpha feature. Needs more work to make it a full fledged feature. |
Sounds good. I agree that some of my questions can be ignored :) |
As a matter of good convention can we add a JIRA number to the TODOs? I don't think it is necessary to create a JIRA for each one, but even a follow up JIRA that says "address shortcomings of statistics" and a JIRA number we can search the code for would be useful to make sure we don't lose track of them. |
@Override | ||
public void setup() { | ||
work = new ObjectHolder(); | ||
work.obj = new com.clearspring.analytics.stream.cardinality.HyperLogLog(10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's magical about 10? Should this be configurable?
PRE: Add "append" concept to directory write. * This is so stats can be stored in [table].stats.drill and be appended to be writing a new file into the directory. FUNCS: Statistics functions as UDFs: Currently using FieldReader to ensure consistent output type so that Unpivot doesn't get confused. All stats columns should be Nullable, so that stats functions can return NULL when N/A. * custom versions of "count" that always return BigInt * HyperLogLog based NDV that returns BigInt that works only on VarChars * HyperLogLog with binary output that only works on VarChars OPS: Updated protobufs for new ops OPS: Implemented StatisticsAggregate OPS: Implemented StatisticsUnpivot ANALYZE: AnalyzeTable functionality * JavaCC syntax more-or-less copied from LucidDB. * (Basic) AnalyzePrule: DrillAnalyzeRel -> UnpivotPrel and StatsAggPrel ANALYZE: Add getMetadataTable() to AbstractSchema USAGE: Change field access in QueryWrapper USAGE: Add getDrillTable() to DrillScanRelBase and ScanPrel * since ScanPrel does not inherit from DrillScanRelBase, this requires adding a DrillTable to the constructor * This is done so that a custom ReflectiveRelMetadataProvider can access the DrillTable associated with Logical/Physical scans. USAGE: Attach DrillTableMetadata to DrillTable. * DrillTableMetadata represents the data scanned from a corresponding ".stats.drill" table * In order to avoid doing query execution right after the ".stats.drill" table is found, metadata is not actually collected until the MaterializationVisitor is used. ** Currently, the metadata source must be a string (so that a SQL query can be created). Doing this with a table is probably more complicated. ** Query is set up to extract only the most recent statistics results for each column. USAGE: Configure DrillJoinRelBase to use NDV metadata when available. USAGE: attach metadata to table USAGE: implement optiq provider
Patch attached to the JIRA is seems to be useful for generating table stats and using them for query planning. I rebased the patch to latest master, fixed few issues and added few tests.
It still needs work to make it a full fledged feature, but I think the current state of the patch is good enough to commit and make improvements/fixes later.
@jinfengni and @amansinha100 : Could you please review the patch?