Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TopK Fuzz Tests 🐝 #7749

Closed
alamb opened this issue Oct 5, 2023 · 10 comments · Fixed by #7772
Closed

TopK Fuzz Tests 🐝 #7749

alamb opened this issue Oct 5, 2023 · 10 comments · Fixed by #7772
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@alamb
Copy link
Contributor

alamb commented Oct 5, 2023

Is your feature request related to a problem or challenge?

After #7721 a SortExec with a limit will use a special TopK . We have basic unit tests, but I think the coverage could be improved, specifically with Fuzz testing

Describe the solution you'd like

What I would like is a new fuzz test to be added to the the existing fuzz cases: https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/tests/fuzz_cases

The structure of SortTest in https://github.com/apache/arrow-datafusion/blob/e95a24b5a260e0e2f603d52682d36cce192676f8/datafusion/core/tests/fuzz_cases/sort_fuzz.rs#L111 might be a good one to follow

The basic outline would be:

  1. Create an input with several columns (integers, strings, floats)
  2. Reorder the input randomly
  3. Divide the input up multiple batches using make_staggered_batches
  4. Run a query like SELECT * FROM t ORDER BY <col(s)> LIMIT <N> and collect the output
  5. Compute the expected result programmatically (e.g. by sort the data, prior to creating RecordBatches)
  6. Ensure the output matches the expected result

Input size: 1000 rows

Parameters to vary

  1. sort cols: (int), (string), (float), (int, string), (string, int), etc.
  2. N: 1, 10, 100, 300 (aka how many are kept)

Bonus points
make it easy to add new columns / types (e.g. like string dictionary)

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added the enhancement New feature or request label Oct 5, 2023
@alamb
Copy link
Contributor Author

alamb commented Oct 5, 2023

I think this would be a nice beginner project for someone who was familiar with Rust, but not familiar with DataFusion / working with RecordBatches / arrays. There are reasonable existing examples to follow as well

BTW you can run the fuzz tests with:

cargo test --test fuzz

@Tangruilin
Copy link
Contributor

/assign me

@Tangruilin
Copy link
Contributor

@alamb

@Tangruilin
Copy link
Contributor

I have looked this, it seems that make_staggered_batches can only get i64 cols, but the tests seems request cols with (integers, strings, floats). Maybe I need to wirte a fn make_staggered_batches myself?

@alamb
Copy link
Contributor Author

alamb commented Oct 6, 2023

Thanks @Tangruilin

I have looked this, it seems that make_staggered_batches can only get i64 cols, but the tests seems request cols with (integers, strings, floats). Maybe I need to wirte a fn make_staggered_batches myself?

Sorry, I think I meant use stagger_batch: https://github.com/apache/arrow-datafusion/blob/85046001da91d535f7ea417911cd51944f9820f4/test-utils/src/lib.rs#L72-L76

@Tangruilin
Copy link
Contributor

when i get the result with (string), (float).etc cols. I found that batches_to_vec can only work for i32. May be I need a another way to convert record batches to vec, are there any suggestions?

@Tangruilin
Copy link
Contributor

@alamb maybe i can add a enum to support other type? If it is ok, i will do it.

@alamb
Copy link
Contributor Author

alamb commented Oct 8, 2023

when i get the result with (string), (float).etc cols. I found that batches_to_vec can only work for i32. May be I need a another way to convert record batches to vec, are there any suggestions?

I would actually recommend doing it the otherway around:

  1. Create the data with a Vec
  2. Figure out what the expected answer is
  3. Convert the Vec to RecordBatch

Does that make sense?

@Tangruilin
Copy link
Contributor

when i get the result with (string), (float).etc cols. I found that batches_to_vec can only work for i32. May be I need a another way to convert record batches to vec, are there any suggestions?

I would actually recommend doing it the otherway around:

  1. Create the data with a Vec
  2. Figure out what the expected answer is
  3. Convert the Vec to RecordBatch

Does that make sense?

For 2 > The expected answer will be three types. I consider to use Generics to solve it, but different type need to use different arrow Array, so it seems can not solve.

@Tangruilin
Copy link
Contributor

when i get the result with (string), (float).etc cols. I found that batches_to_vec can only work for i32. May be I need a another way to convert record batches to vec, are there any suggestions?

I would actually recommend doing it the otherway around:

  1. Create the data with a Vec
  2. Figure out what the expected answer is
  3. Convert the Vec to RecordBatch

Does that make sense?

For 2 > The expected answer will be three types. I consider to use Generics to solve it, but different type need to use different arrow Array, so it seems can not solve.

I have a idea. I will try it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants