Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-4717][VL] Adapting the bind reference of agg that contains subquery in agg expressions #4705

Merged
merged 3 commits into from
Feb 19, 2024

Conversation

liujiayi771
Copy link
Contributor

@liujiayi771 liujiayi771 commented Feb 18, 2024

What changes were proposed in this pull request?

When aggregateExpressions includes a subquery, Spark's PlanAdaptiveSubqueries Rule will transform the subquery within the final aggregation. The aggregateFunction in the aggregateExpressions of the final aggregation will be cloned, resulting in creating new aggregateFunction objects. The inputAggBufferAttributes will also generate new AttributeReference instances with larger exprId, which leads to a failure in binding with the output of the partial aggregation. We need to adapt to this situation; when encountering a failure to bind, it is necessary to allow the binding of inputAggBufferAttribute with the same name but different exprId.

Fixes #4717.

How was this patch tested?

Add a test case in VeloxAggregateFunctionsSuite.

Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Copy link

Run Gluten Clickhouse CI

@liujiayi771
Copy link
Contributor Author

@PHILO-HE Could you help to review? This bug causes some agg fallback.

@liujiayi771
Copy link
Contributor Author

Before this PR:
image
After this PR:
image

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

@PHILO-HE
Copy link
Contributor

Seems there are some relevant test failure. Please fix it. Thanks!

Copy link

Run Gluten Clickhouse CI

Copy link
Contributor

@ulysses-you ulysses-you left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems aggregate buffer attribute name is not unique, e.g., two sum aggregate functions in one aggregate SELECT sum(select c1 from t), sum(select c2 from t) FROM t. How about use aggregate buffer attribute index ?

@liujiayi771
Copy link
Contributor Author

It seems aggregate buffer attribute name is not unique, e.g., two sum aggregate functions in one aggregate SELECT sum(select c1 from t), sum(select c2 from t) FROM t. How about use aggregate buffer attribute index ?

I just switched to using index, can you help review the latest code?

@liujiayi771
Copy link
Contributor Author

Seems there are some relevant test failure. Please fix it. Thanks!

These failed UTs include some other scenarios that need to be addressed. The current code has been modified to be more comprehensive, and these failed UTs used to be fallback.

// exprId.
val attrsWithSameName = originalInputAttributes.collect {
case a if a.name == attr.name => a
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if originalInputAttributes (grouping keys) has an attribute naing sum,count, etc ? Can we add a test case for that case ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bad case; perhaps we should remove the group key in front of originalInputAttributes, because the output of the final agg's child is always the group key ++ inputAggBufferAttributes.

Copy link

Run Gluten Clickhouse CI

|""".stripMargin)(
df => assert(getExecutedPlan(df).count(_.isInstanceOf[HashAggregateExecTransformer]) == 2))

runQueryAndCompare("""
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ulysses-you Add a new test case.

@ulysses-you
Copy link
Contributor

lgtm if test pass

@ulysses-you ulysses-you merged commit ca18322 into apache:main Feb 19, 2024
16 of 19 checks passed
@liujiayi771 liujiayi771 deleted the subquery-in-agg branch February 19, 2024 09:10
@GlutenPerfBot
Copy link
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

query log/native_4705_time.csv log/native_master_02_18_2024_40486295d_time.csv difference percentage
q1 35.43 34.87 -0.563 98.41%
q2 24.31 24.17 -0.147 99.40%
q3 38.86 38.03 -0.831 97.86%
q4 35.32 37.83 2.510 107.11%
q5 70.01 72.33 2.322 103.32%
q6 8.13 8.29 0.160 101.96%
q7 82.61 84.22 1.603 101.94%
q8 84.66 85.03 0.372 100.44%
q9 125.51 120.47 -5.032 95.99%
q10 44.30 43.05 -1.243 97.19%
q11 20.37 20.10 -0.270 98.68%
q12 27.96 30.20 2.248 108.04%
q13 45.09 44.24 -0.850 98.12%
q14 20.35 19.94 -0.410 97.98%
q15 29.23 27.56 -1.675 94.27%
q16 13.12 14.18 1.057 108.05%
q17 101.64 102.14 0.500 100.49%
q18 151.05 146.95 -4.098 97.29%
q19 12.63 12.62 -0.009 99.93%
q20 26.39 27.97 1.584 106.00%
q21 224.44 224.64 0.203 100.09%
q22 13.53 17.31 3.782 127.96%
total 1234.95 1236.16 1.212 100.10%

@PHILO-HE PHILO-HE changed the title [VL] Adapting the bind reference of agg that contains subquery in agg expressions [GLUTEN-4717][VL] Adapting the bind reference of agg that contains subquery in agg expressions Feb 20, 2024
Copy link

#4717

@PHILO-HE
Copy link
Contributor

Thanks for your great work! I just created an issue for possible reference in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CORE] Failed to bind reference for xxx
4 participants