[SPARK-46683] Write a subquery generator that generates subqueries permutations to increase testing coverage #44599

Closed
wants to merge 122 commits into from

andylam-db
Contributor

@andylam-db andylam-db commented Jan 4, 2024

What changes were proposed in this pull request?

This PR creates a suite, GeneratedSubquerySuite, that generates SQL with variations of subqueries. These variations include:

  1. The location of the subquery in the main query (SELECT, FROM, WHERE)
  2. Whether the subquery is correlated, if it is in SELECT or WHERE.
  3. The type of subquery predicate, if it is in WHERE.
  4. Whether the subquery has a DISTINCT projection.
  5. The operators in the subquery: currently there are JOINS, SET OPS, LIMIT and AGGREGATE (sum, count, groupby, no-groupby).

The suite generates SQL queries, which are then run against Postgres using Docker integration tests.
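As a rough illustration of how such variations multiply, the following is a minimal Python sketch, not the actual GeneratedSubquerySuite (which lives in Spark's Scala test code). The table names `outer_t` and `inner_t`, the columns `a` and `b`, the list of predicate types, and the two simplified operator templates are all assumptions made for this example. It only builds SQL strings and does not run them against Postgres.

```python
from itertools import product

# Hypothetical table names, chosen only for illustration.
OUTER, INNER = "outer_t", "inner_t"

LOCATIONS = ["SELECT", "FROM", "WHERE"]
# Assumed predicate types; the PR description does not enumerate them.
PREDICATES = ["IN", "NOT IN", "EXISTS", "NOT EXISTS"]
# Simplified stand-ins for two of the operator variations listed above.
OPERATORS = {
    "LIMIT": "SELECT {d}{t}.a FROM {t}{w} LIMIT 1",
    "AGGREGATE": "SELECT {d}COUNT({t}.a) FROM {t}{w}",
}


def subquery(operator, distinct, correlated):
    """Build the inner query for one combination of variations."""
    d = "DISTINCT " if distinct else ""
    w = f" WHERE {INNER}.b = {OUTER}.b" if correlated else ""
    return OPERATORS[operator].format(d=d, t=INNER, w=w)


def generate_queries():
    """Yield one SQL string per valid combination of variations."""
    for loc, op, distinct in product(LOCATIONS, OPERATORS, [False, True]):
        # Correlation applies only to subqueries in SELECT or WHERE.
        for correlated in ([False] if loc == "FROM" else [False, True]):
            sub = subquery(op, distinct, correlated)
            if loc == "SELECT":
                yield f"SELECT {OUTER}.a, ({sub}) AS sub FROM {OUTER}"
            elif loc == "FROM":
                yield f"SELECT * FROM ({sub}) AS sub"
            else:
                # The predicate type varies only for subqueries in WHERE.
                for pred in PREDICATES:
                    lhs = "" if "EXISTS" in pred else f"{OUTER}.a "
                    yield (f"SELECT {OUTER}.a FROM {OUTER} "
                           f"WHERE {lhs}{pred} ({sub})")


queries = list(generate_queries())
```

Even with only two operator templates, this sketch yields 44 queries, and each additional operator or predicate type multiplies that number.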

Why are the changes needed?

There are many subquery correctness issues, ranging from very old bugs to new ones introduced by ongoing work on subquery correlation optimization, especially in the areas of COUNT bugs and null behavior.

To increase test coverage and robustness in this area, we want to write a subquery generator that writes variations of subqueries, producing SQL tests.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This PR adds tests. N/A.

Was this patch authored or co-authored using generative AI tooling?

No.

@andylam-db andylam-db changed the title Generate subqueries for correctness testing [SPARK-46683] Write a subquery generator that generates subqueries of different variations to increase testing coverage Jan 11, 2024
@andylam-db andylam-db marked this pull request as ready for review January 11, 2024 22:35
@andylam-db
Contributor Author

@cloud-fan @jchen5 Open for reviews! I listed some potential issues in the PR description:

  1. There are a lot of queries. There are 14 files, each containing ~300 queries. Adding more variations (such as new operators like WINDOW, OFFSET) would only increase the number of queries. This might slow down tests significantly.
  2. A small change in the SubquerySQLGeneratorSuite might introduce a lot of golden file changes. This could make the suite very difficult to maintain.

@andylam-db andylam-db changed the title [SPARK-46683] Write a subquery generator that generates subqueries of different variations to increase testing coverage [SPARK-46683] Write a subquery generator that generates subqueries permutations to increase testing coverage Jan 12, 2024
@andylam-db
Contributor Author

@cloud-fan @jchen5 Changed it to not produce golden files; it is now just a normal test suite.

@andylam-db
Contributor Author

@cloud-fan Can we merge? Thanks!

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in bc889c8 Jan 23, 2024