DRILL-1328: Support table statistics #425

vkorukanti · 2016-03-11T18:39:35Z

Patch attached to the JIRA is seems to be useful for generating table stats and using them for query planning. I rebased the patch to latest master, fixed few issues and added few tests.

It still needs work to make it a full fledged feature, but I think the current state of the patch is good enough to commit and make improvements/fixes later.

@jinfengni and @amansinha100 : Could you please review the patch?

amansinha100 · 2016-03-22T16:10:29Z

exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StatisticsAggrFunctions.java

+    @Override
+    public void add() {
+      if (work.obj != null) {
+        com.clearspring.analytics.stream.cardinality.HyperLogLog hll =


Is this HLL package (com.clearspring.*) acceptable from a licensing perspective ?

From [1], it is released under Apache 2.0 license [1] .

[1] https://github.com/addthis/stream-lib
[2] https://github.com/addthis/stream-lib/blob/master/LICENSE.txt

I see...I suppose if @jacques-n has checked the licensing part I am good with it.

amansinha100 · 2016-03-26T18:20:26Z

@vkorukanti have you had a chance to test this with metadata caching ? Since the row count stats is stored in both the metadata cache and the .drill.stats file, I am not sure where it is getting retrieved from.

amansinha100 · 2016-03-26T18:25:09Z

If you could clarify what we want to do with the ObjectHolder, I am ok with getting this patch merged in and address remaining issues incrementally. The benefits of having the framework for computing NDV and nullability statistics is substantial.

jacques-n · 2016-03-26T18:40:26Z

ObjectHolder should be used cautiously. That is why it is marked as deprecated. I'm pretty sure it was deprecated when we first added it. In order to hold complex values that can change size in aggregation, we have to hold them in the Java heap because vectors are immutable. This is why we have ObjectHolder, basically allows us variable width mutable values in aggregation. It makes sense to use it here as well.

vkorukanti · 2016-03-26T18:49:43Z

@amansinha100 Thanks for reviewing, I haven't tested when the table has parquet metadata cache and stats. I will test it before posting the updated patch.

vkorukanti · 2016-05-18T17:36:09Z

Tested with parquet metadata cache. Currently if you do a count(*) on parquet table having metadata cache, we get that count from cache. For rowcount in planning: if a stats table exists we get the count from stats table, otherwise we get it from groupscan which in parquet case gets it from the metadata cache if exists.

vkorukanti · 2016-05-18T18:58:13Z

Updated patch. Also added a brief design and usage doc to JIRA. Waiting for CALCITE-1241.

sudheeshkatkam · 2016-05-18T23:15:51Z

exec/java-exec/src/main/codegen/data/Parser.tdd

+    "REPLACE",
+    "SCHEMAS",
+    "TABLES",
+    "FILES",


Something like this came up before where a list of non reserved keyword might result in some ambiguous queries. See DRILL-2116.

Also DRILL-3875.

vkorukanti · 2016-05-18T23:21:09Z

Just want to clarify. There are TODOs and somethings may not work in certain cases. Consider this patch as alpha feature. Needs more work to make it a full fledged feature.

sudheeshkatkam · 2016-05-18T23:35:08Z

Sounds good. I agree that some of my questions can be ignored :)

jaltekruse · 2016-05-18T23:42:26Z

As a matter of good convention can we add a JIRA number to the TODOs? I don't think it is necessary to create a JIRA for each one, but even a follow up JIRA that says "address shortcomings of statistics" and a JIRA number we can search the code for would be useful to make sure we don't lose track of them.

sudheeshkatkam · 2016-05-19T00:16:33Z

exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StatisticsAggrFunctions.java

+    @Override
+    public void setup() {
+      work = new ObjectHolder();
+      work.obj = new com.clearspring.analytics.stream.cardinality.HyperLogLog(10);


What's magical about 10? Should this be configurable?

PRE: Add "append" concept to directory write. * This is so stats can be stored in [table].stats.drill and be appended to be writing a new file into the directory. FUNCS: Statistics functions as UDFs: Currently using FieldReader to ensure consistent output type so that Unpivot doesn't get confused. All stats columns should be Nullable, so that stats functions can return NULL when N/A. * custom versions of "count" that always return BigInt * HyperLogLog based NDV that returns BigInt that works only on VarChars * HyperLogLog with binary output that only works on VarChars OPS: Updated protobufs for new ops OPS: Implemented StatisticsAggregate OPS: Implemented StatisticsUnpivot ANALYZE: AnalyzeTable functionality * JavaCC syntax more-or-less copied from LucidDB. * (Basic) AnalyzePrule: DrillAnalyzeRel -> UnpivotPrel and StatsAggPrel ANALYZE: Add getMetadataTable() to AbstractSchema USAGE: Change field access in QueryWrapper USAGE: Add getDrillTable() to DrillScanRelBase and ScanPrel * since ScanPrel does not inherit from DrillScanRelBase, this requires adding a DrillTable to the constructor * This is done so that a custom ReflectiveRelMetadataProvider can access the DrillTable associated with Logical/Physical scans. USAGE: Attach DrillTableMetadata to DrillTable. * DrillTableMetadata represents the data scanned from a corresponding ".stats.drill" table * In order to avoid doing query execution right after the ".stats.drill" table is found, metadata is not actually collected until the MaterializationVisitor is used. ** Currently, the metadata source must be a string (so that a SQL query can be created). Doing this with a table is probably more complicated. ** Query is set up to extract only the most recent statistics results for each column. USAGE: Configure DrillJoinRelBase to use NDV metadata when available. USAGE: attach metadata to table USAGE: implement optiq provider

…tAnd (apache#425)

amansinha100 reviewed Mar 22, 2016
View reviewed changes

vkorukanti force-pushed the DRILL-1328-r1 branch from 079d109 to 8ef7fd7 Compare May 18, 2016 18:55

sudheeshkatkam reviewed May 18, 2016
View reviewed changes

sudheeshkatkam reviewed May 19, 2016
View reviewed changes

vkorukanti force-pushed the DRILL-1328-r1 branch from 8ef7fd7 to 6bcdf3c Compare May 19, 2016 00:34

gparai pushed a commit to gparai/drill that referenced this pull request Oct 10, 2018

MD-4818: Fixed handling of complex nested boolean operators in elemen…

603f6a6

…tAnd (apache#425)

karthik-man pushed a commit to karthik-man/drill that referenced this pull request Mar 14, 2019

MD-4818: Fixed handling of complex nested boolean operators in elemen…

ca9e8d0

…tAnd (apache#425)

arina-ielchiieva closed this Jul 1, 2019

KazydubB pushed a commit to KazydubB/drill that referenced this pull request Sep 25, 2019

MD-4818: Fixed handling of complex nested boolean operators in elemen…

d31a78c

…tAnd (apache#425)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DRILL-1328: Support table statistics #425

DRILL-1328: Support table statistics #425

vkorukanti commented Mar 11, 2016

amansinha100 Mar 22, 2016

vkorukanti Mar 22, 2016

amansinha100 Mar 26, 2016

amansinha100 commented Mar 26, 2016

amansinha100 commented Mar 26, 2016

jacques-n commented Mar 26, 2016

vkorukanti commented Mar 26, 2016

vkorukanti commented May 18, 2016

vkorukanti commented May 18, 2016

sudheeshkatkam May 18, 2016

sudheeshkatkam May 18, 2016

vkorukanti commented May 18, 2016

sudheeshkatkam commented May 18, 2016

jaltekruse commented May 18, 2016

sudheeshkatkam May 19, 2016

DRILL-1328: Support table statistics #425

DRILL-1328: Support table statistics #425

Conversation

vkorukanti commented Mar 11, 2016

amansinha100 Mar 22, 2016

Choose a reason for hiding this comment

vkorukanti Mar 22, 2016

Choose a reason for hiding this comment

amansinha100 Mar 26, 2016

Choose a reason for hiding this comment

amansinha100 commented Mar 26, 2016

amansinha100 commented Mar 26, 2016

jacques-n commented Mar 26, 2016

vkorukanti commented Mar 26, 2016

vkorukanti commented May 18, 2016

vkorukanti commented May 18, 2016

sudheeshkatkam May 18, 2016

Choose a reason for hiding this comment

sudheeshkatkam May 18, 2016

Choose a reason for hiding this comment

vkorukanti commented May 18, 2016

sudheeshkatkam commented May 18, 2016

jaltekruse commented May 18, 2016

sudheeshkatkam May 19, 2016

Choose a reason for hiding this comment