Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRILL-1328: Support table statistics #425

Closed
wants to merge 1 commit into from

Conversation

vkorukanti
Copy link
Member

Patch attached to the JIRA is seems to be useful for generating table stats and using them for query planning. I rebased the patch to latest master, fixed few issues and added few tests.

It still needs work to make it a full fledged feature, but I think the current state of the patch is good enough to commit and make improvements/fixes later.

@jinfengni and @amansinha100 : Could you please review the patch?

@Override
public void add() {
if (work.obj != null) {
com.clearspring.analytics.stream.cardinality.HyperLogLog hll =

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this HLL package (com.clearspring.*) acceptable from a licensing perspective ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From [1], it is released under Apache 2.0 license [1] .

[1] https://github.com/addthis/stream-lib
[2] https://github.com/addthis/stream-lib/blob/master/LICENSE.txt

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see...I suppose if @jacques-n has checked the licensing part I am good with it.

@amansinha100
Copy link

@vkorukanti have you had a chance to test this with metadata caching ? Since the row count stats is stored in both the metadata cache and the .drill.stats file, I am not sure where it is getting retrieved from.

@amansinha100
Copy link

If you could clarify what we want to do with the ObjectHolder, I am ok with getting this patch merged in and address remaining issues incrementally. The benefits of having the framework for computing NDV and nullability statistics is substantial.

@jacques-n
Copy link
Contributor

ObjectHolder should be used cautiously. That is why it is marked as deprecated. I'm pretty sure it was deprecated when we first added it. In order to hold complex values that can change size in aggregation, we have to hold them in the Java heap because vectors are immutable. This is why we have ObjectHolder, basically allows us variable width mutable values in aggregation. It makes sense to use it here as well.

@vkorukanti
Copy link
Member Author

@amansinha100 Thanks for reviewing, I haven't tested when the table has parquet metadata cache and stats. I will test it before posting the updated patch.

@vkorukanti
Copy link
Member Author

Tested with parquet metadata cache. Currently if you do a count(*) on parquet table having metadata cache, we get that count from cache. For rowcount in planning: if a stats table exists we get the count from stats table, otherwise we get it from groupscan which in parquet case gets it from the metadata cache if exists.

@vkorukanti
Copy link
Member Author

Updated patch. Also added a brief design and usage doc to JIRA. Waiting for CALCITE-1241.

"REPLACE",
"SCHEMAS",
"TABLES",
"FILES",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this came up before where a list of non reserved keyword might result in some ambiguous queries. See DRILL-2116.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also DRILL-3875.

@vkorukanti
Copy link
Member Author

Just want to clarify. There are TODOs and somethings may not work in certain cases. Consider this patch as alpha feature. Needs more work to make it a full fledged feature.

@sudheeshkatkam
Copy link
Contributor

Sounds good. I agree that some of my questions can be ignored :)

@jaltekruse
Copy link
Contributor

As a matter of good convention can we add a JIRA number to the TODOs? I don't think it is necessary to create a JIRA for each one, but even a follow up JIRA that says "address shortcomings of statistics" and a JIRA number we can search the code for would be useful to make sure we don't lose track of them.

@Override
public void setup() {
work = new ObjectHolder();
work.obj = new com.clearspring.analytics.stream.cardinality.HyperLogLog(10);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's magical about 10? Should this be configurable?

PRE: Add "append" concept to directory write.

* This is so stats can be stored in [table].stats.drill and be appended to be writing a new file into the directory.

FUNCS: Statistics functions as UDFs:
Currently using FieldReader to ensure consistent output type so that Unpivot doesn't get confused. All stats columns should be Nullable, so that stats functions can return NULL when N/A.
* custom versions of "count" that always return BigInt
* HyperLogLog based NDV that returns BigInt that works only on VarChars
* HyperLogLog with binary output that only works on VarChars

OPS: Updated protobufs for new ops

OPS: Implemented StatisticsAggregate

OPS: Implemented StatisticsUnpivot

ANALYZE: AnalyzeTable functionality
* JavaCC syntax more-or-less copied from LucidDB.
* (Basic) AnalyzePrule: DrillAnalyzeRel -> UnpivotPrel and StatsAggPrel

ANALYZE: Add getMetadataTable() to AbstractSchema

USAGE: Change field access in QueryWrapper

USAGE: Add getDrillTable() to DrillScanRelBase and ScanPrel
* since ScanPrel does not inherit from DrillScanRelBase, this requires adding a DrillTable to the constructor
* This is done so that a custom ReflectiveRelMetadataProvider can access the DrillTable associated with Logical/Physical scans.

USAGE: Attach DrillTableMetadata to DrillTable.
* DrillTableMetadata represents the data scanned from a corresponding ".stats.drill" table
* In order to avoid doing query execution right after the ".stats.drill" table is found, metadata is not actually collected until the MaterializationVisitor is used.
** Currently, the metadata source must be a string (so that a SQL query can be created). Doing this with a table is probably more complicated.
** Query is set up to extract only the most recent statistics results for each column.

USAGE: Configure DrillJoinRelBase to use NDV metadata when available.

USAGE: attach metadata to table

USAGE: implement optiq provider
KazydubB pushed a commit to KazydubB/drill that referenced this pull request Sep 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants