Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug](materialized-view) check groupby in agg-bottom-plan when rewrite agg query by mv #34274

Merged
merged 5 commits into from
May 15, 2024

Conversation

Z-SWEI
Copy link
Contributor

@Z-SWEI Z-SWEI commented Apr 29, 2024

Proposed changes

Issue Number: close #34155

check groupby in agg-bottom-plan when rewrite and rollup agg query by mv

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@doris-robot
Copy link

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@Z-SWEI
Copy link
Contributor Author

Z-SWEI commented Apr 29, 2024

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 41812 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 174e3a73c156c7dbfffbcbb585b36c5c636190af, data reload: false

------ Round 1 ----------------------------------
q1	17611	4326	4177	4177
q2	2029	181	182	181
q3	10470	1250	1171	1171
q4	10191	723	813	723
q5	7509	2700	2744	2700
q6	214	136	134	134
q7	1039	630	611	611
q8	9307	2131	2109	2109
q9	9594	6786	6801	6786
q10	9405	3840	4003	3840
q11	435	251	249	249
q12	532	220	228	220
q13	18256	3188	3236	3188
q14	286	235	226	226
q15	508	497	468	468
q16	505	407	403	403
q17	1010	752	709	709
q18	8502	7901	7748	7748
q19	3509	1522	1537	1522
q20	646	321	312	312
q21	5246	4058	4223	4058
q22	344	277	278	277
Total cold run time: 117148 ms
Total hot run time: 41812 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4517	4343	4352	4343
q2	381	266	276	266
q3	3230	2990	2758	2758
q4	1861	1620	1630	1620
q5	5565	5490	5544	5490
q6	213	125	122	122
q7	2354	1983	2001	1983
q8	3268	3418	3342	3342
q9	8858	8837	8822	8822
q10	3942	3827	3876	3827
q11	594	490	503	490
q12	786	613	630	613
q13	16107	3111	3223	3111
q14	314	286	276	276
q15	535	486	485	485
q16	489	454	435	435
q17	1732	1495	1457	1457
q18	7771	7468	7445	7445
q19	1632	1529	1586	1529
q20	2011	1778	1757	1757
q21	11477	4775	4824	4775
q22	546	476	479	476
Total cold run time: 78183 ms
Total hot run time: 55422 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 185787 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 174e3a73c156c7dbfffbcbb585b36c5c636190af, data reload: false

query1	918	359	337	337
query2	6448	2416	2347	2347
query3	6644	204	213	204
query4	23542	21350	21122	21122
query5	4081	443	418	418
query6	272	180	180	180
query7	4593	296	287	287
query8	261	192	183	183
query9	8470	2374	2353	2353
query10	427	245	256	245
query11	14681	14176	14098	14098
query12	129	91	90	90
query13	1649	381	403	381
query14	10594	7486	8148	7486
query15	247	160	166	160
query16	8119	255	265	255
query17	1814	560	533	533
query18	2103	279	261	261
query19	340	146	150	146
query20	96	91	85	85
query21	195	124	122	122
query22	5000	4942	4909	4909
query23	33880	33160	33269	33160
query24	8526	3001	2980	2980
query25	560	366	368	366
query26	695	156	146	146
query27	2056	318	333	318
query28	5614	2040	2045	2040
query29	838	582	607	582
query30	239	156	148	148
query31	987	732	704	704
query32	90	52	52	52
query33	614	243	237	237
query34	867	471	495	471
query35	743	682	662	662
query36	1042	939	902	902
query37	105	66	66	66
query38	3102	3027	2968	2968
query39	1590	1545	1532	1532
query40	195	121	121	121
query41	41	39	38	38
query42	108	93	95	93
query43	581	536	539	536
query44	1048	716	733	716
query45	277	255	260	255
query46	1066	701	710	701
query47	1958	1863	1878	1863
query48	369	290	287	287
query49	828	383	389	383
query50	786	384	376	376
query51	6872	6827	6701	6701
query52	104	89	92	89
query53	340	273	270	270
query54	273	237	235	235
query55	83	74	74	74
query56	235	213	218	213
query57	1194	1107	1146	1107
query58	228	202	210	202
query59	3227	3048	3124	3048
query60	251	230	229	229
query61	90	85	87	85
query62	593	437	440	437
query63	305	272	268	268
query64	8158	7117	7094	7094
query65	3073	2996	3033	2996
query66	778	335	326	326
query67	15405	15181	14944	14944
query68	5258	546	553	546
query69	472	297	294	294
query70	1154	1100	1124	1100
query71	375	272	267	267
query72	7868	2758	2421	2421
query73	708	326	323	323
query74	6540	6105	6161	6105
query75	3357	2653	2672	2653
query76	2855	969	929	929
query77	391	259	262	259
query78	11008	10193	10317	10193
query79	3748	529	522	522
query80	2163	434	430	430
query81	531	215	219	215
query82	967	96	98	96
query83	299	171	173	171
query84	266	88	81	81
query85	1990	306	259	259
query86	485	305	277	277
query87	3347	3075	3104	3075
query88	4608	2411	2413	2411
query89	488	389	371	371
query90	2074	183	183	183
query91	122	99	95	95
query92	59	46	50	46
query93	5420	535	511	511
query94	1304	189	180	180
query95	405	302	297	297
query96	594	271	275	271
query97	3146	2913	2966	2913
query98	239	231	215	215
query99	1189	879	855	855
Total cold run time: 284067 ms
Total hot run time: 185787 ms

}
finalOutputExpressions.add(new Alias(rollupedExpression));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final output should contains both aggregate function and group by dimension, this may lost group by dimension out put and change original agg output

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final output should contains both aggregate function and group by dimension, this may lost group by dimension out put and change original agg output

I think maybe line 261 does the job you mentioned.
This PR does not modify the logic in rewriteQueryByView().
The key change is line 428 in topPlanSplitToGroupAndFunction() that prevent the loss of groupby in child agg for rules like MaterializedViewProjectAggregateRule.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get your point, the change may change the original agg output order.
such as query is as following:

SELECT sum(o_totalprice), o_orderpriority
FROM orders
GROUP BY o_orderstatus, o_orderpriority;

after mv rewreite will be

SELECT o_orderpriority, sum(o_totalprice)
FROM orders
GROUP BY o_orderstatus, o_orderpriority;

Maybe we should add another check as following
WDYT?

        // if top plan doesn't use group by but bottom agg has group by, should also check
        if (!queryGroupByExpressions.isEmpty() && queryTopPlanGroupBySet.isEmpty()) {
            // todo check as your code in line 242 to 262
        }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, I will try

}
finalOutputExpressions.add(new Alias(rollupedExpression));
}
// add project to guarantee group by column ref is slot reference,
// this is necessary because physical createHash will need slotReference later
Copy link
Contributor

@seawinde seawinde May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add the following logic here is better, WDYT?

// if top plan doesn't use group by but bottom agg has group by, should also check
        if (!queryGroupByExpressions.isEmpty() && queryTopPlanGroupBySet.isEmpty()) {
            // todo check as code in line 263 to 278
        }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried here but caused duplicated code about rewrittenGroupByExpression, so I just add the compensate groupby to queryExpressions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicated code can extract a common method and call the method in different place.
queryExpressions means the expression query used readly, if you add compensate groupby to queryExpressions, this may change the queryExpressions original meaning.

@Z-SWEI Z-SWEI requested a review from seawinde May 8, 2024 07:01
@@ -284,6 +285,26 @@ protected Plan rewriteQueryByView(MatchMode matchMode,
}
// add project to guarantee group by column ref is slot reference,
// this is necessary because physical createHash will need slotReference later
for (Expression expression : queryChildrenGroupBySet) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can add if condition as following for performance
if (!queryGroupByExpressions.isEmpty() && queryTopPlanGroupBySet.isEmpty()) {
do your check logic
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

有没有可能queryTopPlanGroupBySet不为空,但是也没有包含全部的groupby字段 这种情况?
比如 select sum(o_totalprice), a from orders group by o_orderdate, a, b
这种情况的话if (!queryGroupByExpressions.isEmpty() && queryTopPlanGroupBySet.isEmpty())是不是筛查不出来

Copy link
Contributor

@seawinde seawinde May 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, this check condition should be if top group by doesn't equals bottom group by, we should add check

@@ -426,7 +447,8 @@ private Pair<Set<? extends Expression>, Set<? extends Expression>> topPlanSplitT
topGroupByExpressions.add(expression);
}
});
return Pair.of(topGroupByExpressions, topFunctionExpressions);
groupByExpressionSet.removeAll(topGroupByExpressions);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

topPlanSplitToGroupAndFunction means get the group by and agg fuctions expression pair from top plan, this change the measing of the method
May be we can get the result of method topPlanSplitToGroupAndFunction and then do your removeAll logic

}
NamedExpression groupByExpression = rewrittenGroupByExpression instanceof NamedExpression
? (NamedExpression) rewrittenGroupByExpression : new Alias(rewrittenGroupByExpression);
finalGroupExpressions.add(groupByExpression);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if doesn't pass when check, we return null as the code line 302, this is useful
We should not add the groupByExpression to finalGroupExpressions like code line 306,
finalGroupExpressions means mv rewrite result out put

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

感觉得留在finalGroupExpressions里边吧。比如像下面这个select语句在改写的时候,logicalProject里边不包含groupby的字段,这个时候不把child里边的logicalAggregate的groupby o_orderdate留着,那改写成功以后这个groupby不就被丢掉了吗?我理解错了吗

select sum(o_totalprice) from orders group by o_orderdate

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, this should add to the finalGroupExpressions.
your test case is nice, can you add the test case to regression-test/suites/nereids_rules_p0/mv/agg_with_roll_up/aggregate_with_roll_up.groovy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'd like to try.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After add the test case to aggregate_with_roll_up.groovy
you can run the command as following to re generate aggregate_with_roll_up.out:
./run-regression-test.sh --run -d nereids_rules_p0/mv/agg_with_roll_up -s aggregate_with_roll_up -forceGenOut
see regresson-test detail in https://doris.apache.org/community/developer-guide/regression-testing

@Z-SWEI Z-SWEI requested a review from seawinde May 13, 2024 08:47
@seawinde
Copy link
Contributor

run buildall

Copy link
Contributor

@seawinde seawinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

PR approved by anyone and no changes requested.

@seawinde
Copy link
Contributor

Next We should make sure that required pipeline tests all pass

Copy link
Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label May 15, 2024
@morrySnow morrySnow merged commit 016814f into apache:master May 15, 2024
26 of 28 checks passed
yiguolei pushed a commit that referenced this pull request May 15, 2024
…y mv (#34274)

check groupby in agg-bottom-plan when rewrite and rollup agg query by mv
morrySnow pushed a commit that referenced this pull request Aug 20, 2024
…terialized view (#38958)

This is brought by #34274

if mv def is 
select o_orderdate from  orders group by o_orderdate;

query is as followiing, the result is wrong.
select 1 from orders group by o_orderdate;
seawinde added a commit to seawinde/doris that referenced this pull request Aug 21, 2024
…terialized view (apache#38958)

This is brought by apache#34274

if mv def is
select o_orderdate from  orders group by o_orderdate;

query is as followiing, the result is wrong.
select 1 from orders group by o_orderdate;
dataroaring pushed a commit that referenced this pull request Aug 23, 2024
…terialized view (#38958)

This is brought by #34274

if mv def is 
select o_orderdate from  orders group by o_orderdate;

query is as followiing, the result is wrong.
select 1 from orders group by o_orderdate;
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by one committer. dev/3.0.0-merged reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] rewrite by mv roll up bug? when query rewrite by materialized view
5 participants