
[VL] Flushable distinct agg caused correctness issue #4421

Closed · Yohahaha opened this issue Jan 16, 2024 · 5 comments · Fixed by #4443
Labels: bug, triage

Comments

@Yohahaha (Contributor)

Backend

VL (Velox)

Bug description

TPC-DS q38 contains a distinct aggregation. When the partial distinct aggregation is flushed, the final result is incorrect.

[Attached screenshots: query plan with flush triggered vs. not triggered]

expect: [26056]
actual: [26131]
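For context, a minimal sketch of the failing query shape (a hypothetical Scala repro, not the literal q38 text; the table and column names are made up, and whether a flush actually triggers depends on the partial-aggregation memory settings):

```scala
// Hypothetical minimal repro of the q38 shape: a pure count(DISTINCT ...)
// whose dedup aggregation could be made flushable. For this data the
// correct answer is exactly 26056.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("flushable-distinct-repro")
  .getOrCreate()

spark.range(0, 10000000L)
  .selectExpr("id % 26056 AS c_customer_sk")
  .createOrReplaceTempView("t")

// Spark expands count(DISTINCT x) into roughly:
//   Aggregate(keys = x) [partial] -> Exchange(x) -> Aggregate(keys = x) [dedup]
//   -> Aggregate [partial count] -> Exchange [single] -> Aggregate [final count]
// The dedup step feeds partial count directly. If it is converted to a
// flushable aggregation, a key flushed early can be emitted again later,
// and partial count counts it more than once.
spark.sql("SELECT count(DISTINCT c_customer_sk) FROM t").show()
```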

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

Yohahaha added the bug and triage labels on Jan 16, 2024
@rui-mo (Contributor) commented Jan 16, 2024

It seems the aggregation before partial_count should not be flushable. Can we rely fully on whether the data is sent directly to an exchange to decide whether a flushable aggregation can be generated? cc @zhztheplayer

@Yohahaha (Contributor, Author)

When the aggregate function list is empty, we must not use a flushable agg, as it may produce duplicate rows.
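A minimal sketch of the guard this implies, written against vanilla Spark's HashAggregateExec for illustration (the actual FlushableAggregateRule operates on Gluten's transformer nodes):

```scala
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.aggregate.HashAggregateExec

// When the aggregate-expression list is empty, the node is a pure
// distinct aggregation: its only contract is that each group key is
// emitted once. A flush can re-emit a key, so such nodes must stay
// non-flushable unless a downstream re-aggregation de-duplicates them.
def mayConvertToFlushable(plan: SparkPlan): Boolean = plan match {
  case agg: HashAggregateExec if agg.aggregateExpressions.isEmpty => false
  case _: HashAggregateExec                                       => true
  case _                                                          => false
}
```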

@zhztheplayer (Member) commented Jan 17, 2024

Thank you for reporting, @Yohahaha. I think this is very likely similar to the possible corner case I mentioned in #4312 (comment):

“if partial aggregation is already handling data that is partitioned by the exchange (aggregate) keys”

We should enhance FlushableAggregateRule to handle this case.

Our CI tests with result comparisons probably don't process enough data to trigger a flush. I'll look into that as well.

@zhztheplayer (Member) commented Jan 17, 2024

“When the aggregate function list is empty, we must not use a flushable agg, as it may produce duplicate rows.”

Flushing on a distinct aggregation can be useful, e.g., flushing on the scan side and then doing the final distinct on the reducers. The problem here is that Spark has specialized logic for count(distinct); thought of generally, though, it is about the distribution of the aggregate keys. If there is an exchange on the exit side of the aggregation, and that exchange is enforced to redistribute the rows populated for the same aggregate group on different mappers to the same reducer, then the partial aggregation can be considered safe to flush.
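A hedged sketch of that safety condition, again written against vanilla Spark operators rather than the actual FlushableAggregateRule code:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.physical.HashPartitioning
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec

// Flushing a partial aggregation is safe only if its output is hashed
// on the aggregate keys by a shuffle on the exit side: then all
// fragments flushed for one group, on whichever mappers, meet on the
// same reducer, where the final aggregation merges them back together.
def flushedFragmentsAreMerged(groupingKeys: Seq[Expression],
                              exitSide: SparkPlan): Boolean =
  exitSide match {
    case s: ShuffleExchangeExec =>
      s.outputPartitioning match {
        case HashPartitioning(exprs, _) =>
          // Every shuffle key must be one of the aggregate keys.
          exprs.forall(e => groupingKeys.exists(_.semanticEquals(e)))
        case _ => false
      }
    case _ => false
  }
```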

@zhztheplayer (Member)

Update: the issue should impact Spark versions >= 3.3.
