TopK Fuzz Tests 🐝 #7749

alamb · 2023-10-05T20:17:55Z

Is your feature request related to a problem or challenge?

After #7721, a SortExec with a limit will use a special TopK operator. We have basic unit tests, but I think the coverage could be improved, specifically with fuzz testing.

Describe the solution you'd like

What I would like is a new fuzz test added to the existing fuzz cases: https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/tests/fuzz_cases

The structure of SortTest in https://github.com/apache/arrow-datafusion/blob/e95a24b5a260e0e2f603d52682d36cce192676f8/datafusion/core/tests/fuzz_cases/sort_fuzz.rs#L111 might be a good one to follow

The basic outline would be:

Create an input with several columns (integers, strings, floats)
Reorder the input randomly
Divide the input up into multiple batches using make_staggered_batches
Run a query like SELECT * FROM t ORDER BY <col(s)> LIMIT <N> and collect the output
Compute the expected result programmatically (e.g. by sorting the data prior to creating RecordBatches)
Ensure the output matches the expected result
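
The data-generation and expected-result steps of the outline can be sketched with plain Rust values, before any RecordBatch is involved. This is only a minimal sketch under assumptions: the `Row` struct, the field names `i`, `s`, `f`, and the tiny `Lcg` generator (standing in for the `rand` crate) are hypothetical names chosen for illustration and are not part of DataFusion.

```rust
/// One input row with the three column types from the outline.
/// (Hypothetical struct; the field names are assumptions.)
#[derive(Debug, Clone, PartialEq)]
struct Row {
    i: i64,
    s: String,
    f: f64,
}

/// Tiny deterministic pseudo-random generator (an LCG), used here only
/// so the sketch needs no external crates. A real test would likely use `rand`.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Fisher-Yates shuffle driven by the generator above.
fn shuffle<T>(items: &mut [T], rng: &mut Lcg) {
    for i in (1..items.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        items.swap(i, j);
    }
}

/// Build `n` rows with unique values in every column.
fn make_rows(n: usize) -> Vec<Row> {
    (0..n)
        .map(|k| Row {
            i: k as i64,
            s: format!("row-{k:04}"),
            f: k as f64 * 0.5,
        })
        .collect()
}

/// Expected rows for `ORDER BY i LIMIT limit`: sort a copy, keep the first `limit`.
fn expected_top_k_by_int(rows: &[Row], limit: usize) -> Vec<Row> {
    let mut sorted = rows.to_vec();
    sorted.sort_by_key(|r| r.i);
    sorted.truncate(limit);
    sorted
}
```

Because every value is unique, the expected result does not depend on how ties are broken, so it stays the same no matter how the input was shuffled.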

Input size: 1000 rows

Parameters to vary

sort cols: (int), (string), (float), (int, string), (string, int), etc.
N: 1, 10, 100, 300 (aka how many are kept)
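
One way to cover these combinations is to enumerate them and build one query per combination. The sketch below is an assumption-laden illustration: the `SortCol` enum and the column names `i`, `s`, `f` are hypothetical, and only the table name `t` and the limit values come from the description above.

```rust
/// Columns that can appear in the ORDER BY clause (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq)]
enum SortCol {
    Int,
    Str,
    Float,
}

impl SortCol {
    /// Assumed SQL column name for each variant.
    fn name(self) -> &'static str {
        match self {
            SortCol::Int => "i",
            SortCol::Str => "s",
            SortCol::Float => "f",
        }
    }
}

/// The values of N listed above.
const LIMITS: [usize; 4] = [1, 10, 100, 300];

/// The sort column combinations listed above (extend as needed).
fn sort_column_sets() -> Vec<Vec<SortCol>> {
    vec![
        vec![SortCol::Int],
        vec![SortCol::Str],
        vec![SortCol::Float],
        vec![SortCol::Int, SortCol::Str],
        vec![SortCol::Str, SortCol::Int],
    ]
}

/// Build the SQL text for one (sort columns, limit) combination.
fn build_query(cols: &[SortCol], limit: usize) -> String {
    let order_by = cols
        .iter()
        .map(|c| c.name())
        .collect::<Vec<_>>()
        .join(", ");
    format!("SELECT * FROM t ORDER BY {order_by} LIMIT {limit}")
}

/// Every query the fuzz test would run: each column set with each limit.
fn all_queries() -> Vec<String> {
    let mut queries = Vec::new();
    for cols in sort_column_sets() {
        for &limit in LIMITS.iter() {
            queries.push(build_query(&cols, limit));
        }
    }
    queries
}
```

Keeping the column sets in one function means a new type (for example a string dictionary column) needs only a new enum variant and a new entry in the list.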

Bonus points
make it easy to add new columns / types (e.g. like string dictionary)

Describe alternatives you've considered

No response

Additional context

No response


alamb · 2023-10-05T20:19:22Z

I think this would be a nice beginner project for someone who was familiar with Rust, but not familiar with DataFusion / working with RecordBatches / arrays. There are reasonable existing examples to follow as well

BTW you can run the fuzz tests with:

cargo test --test fuzz

Tangruilin · 2023-10-06T09:21:12Z

/assign me

Tangruilin · 2023-10-06T09:21:28Z

@alamb

Tangruilin · 2023-10-06T10:32:34Z

I have looked at this. It seems that make_staggered_batches can only produce i64 columns, but the test seems to require columns of (integers, strings, floats). Maybe I need to write a make_staggered_batches function myself?

alamb · 2023-10-06T11:32:48Z

Thanks @Tangruilin

> I have looked at this. It seems that make_staggered_batches can only produce i64 columns, but the test seems to require columns of (integers, strings, floats). Maybe I need to write a make_staggered_batches function myself?

Sorry, I think I meant to use stagger_batch: https://github.com/apache/arrow-datafusion/blob/85046001da91d535f7ea417911cd51944f9820f4/test-utils/src/lib.rs#L72-L76
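
To illustrate the idea of staggering (splitting one input into consecutive pieces of uneven size), here is a rough sketch on a plain slice. It is not the test-utils function linked above: `stagger_rows`, `max_chunk`, and the small `Lcg` generator are hypothetical names invented for this example, and the real helper operates on a RecordBatch rather than a slice.

```rust
/// Tiny deterministic pseudo-random generator (an LCG), used only to keep
/// this sketch free of external crates.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Split `rows` into consecutive chunks whose sizes are pseudo-random
/// values in `1..=max_chunk`, preserving the original row order.
fn stagger_rows<T: Clone>(rows: &[T], max_chunk: usize, rng: &mut Lcg) -> Vec<Vec<T>> {
    assert!(max_chunk > 0, "max_chunk must be at least 1");
    let mut chunks = Vec::new();
    let mut remaining = rows;
    while !remaining.is_empty() {
        // Never ask for more rows than are left.
        let upper = max_chunk.min(remaining.len());
        let size = 1 + (rng.next_u64() % upper as u64) as usize;
        let (head, tail) = remaining.split_at(size);
        chunks.push(head.to_vec());
        remaining = tail;
    }
    chunks
}
```

Concatenating the chunks gives back the original input, so staggering changes only the batch boundaries an operator sees, not the data.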

Tangruilin · 2023-10-08T12:07:04Z

When I get the result with (string), (float), etc. columns, I found that batches_to_vec only works for i32. Maybe I need another way to convert record batches to a Vec. Are there any suggestions?

Tangruilin · 2023-10-08T12:08:24Z

@alamb maybe I can add an enum to support other types? If that is OK, I will do it.

alamb · 2023-10-08T12:48:29Z

> When I get the result with (string), (float), etc. columns, I found that batches_to_vec only works for i32. Maybe I need another way to convert record batches to a Vec. Are there any suggestions?

I would actually recommend doing it the other way around:

1. Create the data with a Vec
2. Figure out what the expected answer is
3. Convert the Vec to RecordBatch

Does that make sense?
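
As a rough illustration of computing the expected answer on plain Rust data, the sketch below sorts row structs with a comparator that walks a list of sort columns, so mixed column types need no generics. Everything named here (`Row`, `SortCol`, `compare_rows`, `expected_top_k`) is hypothetical and not taken from DataFusion, and the sketch assumes the data contains no NaN or null values.

```rust
use std::cmp::Ordering;

/// One row holding an integer, a string, and a float column (hypothetical).
#[derive(Debug, Clone, PartialEq)]
struct Row {
    i: i64,
    s: String,
    f: f64,
}

/// Which column to sort on (hypothetical).
#[derive(Debug, Clone, Copy)]
enum SortCol {
    Int,
    Str,
    Float,
}

/// Compare two rows column by column, in the order given by `cols`.
fn compare_rows(a: &Row, b: &Row, cols: &[SortCol]) -> Ordering {
    for col in cols {
        let ord = match col {
            SortCol::Int => a.i.cmp(&b.i),
            SortCol::Str => a.s.cmp(&b.s),
            // total_cmp gives f64 a total order; NaN handling is out of scope here.
            SortCol::Float => a.f.total_cmp(&b.f),
        };
        if ord != Ordering::Equal {
            return ord;
        }
    }
    Ordering::Equal
}

/// Expected rows for `ORDER BY <cols> LIMIT limit`, ascending.
fn expected_top_k(rows: &[Row], cols: &[SortCol], limit: usize) -> Vec<Row> {
    let mut sorted = rows.to_vec();
    sorted.sort_by(|a, b| compare_rows(a, b, cols));
    sorted.truncate(limit);
    sorted
}
```

If several rows tie on the sort key, the rows a query keeps at the LIMIT boundary may legitimately differ from this stable sort, so a test would probably want unique sort keys or a comparison of the sort columns only.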

Tangruilin · 2023-10-08T15:15:50Z

>> When I get the result with (string), (float), etc. columns, I found that batches_to_vec only works for i32. Maybe I need another way to convert record batches to a Vec. Are there any suggestions?

> I would actually recommend doing it the other way around:
>
> 1. Create the data with a Vec
> 2. Figure out what the expected answer is
> 3. Convert the Vec to RecordBatch
>
> Does that make sense?

For step 2: the expected answer will have three types. I considered using generics to solve it, but different types need different Arrow arrays, so that does not seem to work.

Tangruilin · 2023-10-08T16:11:18Z

>>> When I get the result with (string), (float), etc. columns, I found that batches_to_vec only works for i32. Maybe I need another way to convert record batches to a Vec. Are there any suggestions?

>> I would actually recommend doing it the other way around:
>>
>> 1. Create the data with a Vec
>> 2. Figure out what the expected answer is
>> 3. Convert the Vec to RecordBatch
>>
>> Does that make sense?

> For step 2: the expected answer will have three types. I considered using generics to solve it, but different types need different Arrow arrays, so that does not seem to work.

I have an idea. I will try it.

alamb added the enhancement New feature or request label Oct 5, 2023

alamb added the good first issue Good for newcomers label Oct 5, 2023

alamb mentioned this issue Oct 5, 2023

Optimize "ORDER BY + LIMIT" queries for speed / memory with special TopK operator #7721

Merged

alamb assigned Tangruilin Oct 6, 2023

alamb mentioned this issue Oct 6, 2023

Minor: improve documentation to stagger_batch #7754

Merged

Tangruilin mentioned this issue Oct 8, 2023

[test] add fuzz test for topk #7772

Merged

alamb closed this as completed in #7772 Oct 21, 2023
