Conversation
…IVE-22940-sketches-fns
jcamachor
left a comment
There was a problem hiding this comment.
I have left a few comments. Cool progress.
I think there is enough content here to create two small wiki pages: one about data sketches integration, and another one about BI acceleration in Hive using them. Please, create a JIRA or label existing ones with TODOC so we do not forget about it.
ql/src/java/org/apache/hadoop/hive/ql/exec/DataSketchesFunctions.java
Outdated
Show resolved
Hide resolved
...rg/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveRewriteCountDistinctToDataSketches.java
Outdated
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java
Outdated
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java
Outdated
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java
Outdated
Show resolved
Hide resolved
…1-rewrite-distinct
|
|
||
| HIVE_OPTIMIZE_BI_REWRITE_COUNT_DISTINCT_SKETCH( | ||
| "hive.optimize.bi.rewrite.countdistinct.sketch", "hll", | ||
| new StringSet("hll", "cpc", "theta"), |
There was a problem hiding this comment.
Can we limit the algorithm choices to a single one for the time being?
The reason I am asking this is that this will not work with materialized views. Since we are not storing in the SQL view definition the algorithm that we used to generate the column, if the property value changes, this would lead to errors.
The multi-algorithm supports needs a little bit more work. One option would be to store this information in the MV table properties so we know how to interpret them when HS2 needs to load them (and thus parse them). What do you think?
There was a problem hiding this comment.
I don't think that would be neccessary
- it works correctly - even if we have 1 algo the interesting behaviour is still there - which is: if the rewrite is enabled the created MV will be a rewritten one
- it won't get applied for different modes/etc so it doesn't lead to errors at all...
I've added a test(sketches_materialized_view_sketchtype.q)
which shows how it works when there is an MV for HLL ; in case the mode is not HLL the MV is ignored and computed directly
I think the real meaning of the MV should not change(I think we agree on this); we have 2 choices here:
- ignore all rewriting during MV construction/rebuild - so this rewrite may not happen for an MV - and users have to use the expanded form of the sketch stuff to create an MV for that purpose
- save a conf overlay alongside with the MV
I think addressing this is outside of the scope of this change
There was a problem hiding this comment.
I understand for a single algorithm it will work. However, consider the following scenario:
- A user enables BI mode and algorithm
hll. - The user creates a MV with count distinct. The MV has stored the count distinct field using
hll. The SQL statement still has count distinct. - We change default algorithm to
cpcand restart HS2. Thus, when the MV is loaded by HS2, the count distinct is transformed tocpc. - The user runs a query with count distinct, which transforms to
cpc, matches the MV... but fails at deserialization time because the sketch stored for the MV ishll.
That is why I suggested we could limit the options for algorithms till we have proper support. The risk I see if we do not do that now is that if anyone creates MVs using the different default algorithms, we will not have any way to distinguish between them anymore.
From the two choices that you mention above, I was suggesting the second option, since the main goal of the whole effort is to be able to use these algorithms seamlessly with the MVs. I agree it can be outside of the scope of this change, but let's limit the algorithm choices till then?
There was a problem hiding this comment.
I was not thinking about restarting HS2
sure...we can limit it to one - but if this incorrect behaviour does exists - then I think it could also be triggered with the main bi mode switch as well:
- in one case there will be a sketch there
- in the other some integer value
There was a problem hiding this comment.
I've tried it out - I didn't seen any exceptions the MV match for a plain count(distinct id) didn't happened....
when I've changed the default algo no exceptions happened; but matches were made incorrectly - so there could be dragons...
I've removed cpc/theta for now...we can add it back later
There was a problem hiding this comment.
You are right, failure can still happen when the sketch is stored and the mode changes.
Thanks for making the changes in any case. Let's check in this patch and give priority to the overlay issue, it should not be too difficult to address and will fix all these issues.
No description provided.