TopK Fuzz Tests 🐝 #7749
Comments
I think this would be a nice beginner project for someone who is familiar with Rust, but not familiar with DataFusion / working with
BTW, you can run the fuzz tests with: cargo test --test fuzz
/assign me
I have looked at this. It seems that make_staggered_batches can only produce i64 columns, but the tests seem to require columns of several types (integers, strings, floats). Maybe I need to write a make_staggered_batches function myself?
Thanks @Tangruilin
Sorry, I think I meant use
When I get the result for string, float, etc. columns, I found that batches_to_vec only works for i32. Maybe I need another way to convert record batches to a vec. Are there any suggestions?
@alamb maybe I can add an enum to support other types? If that is OK, I will do it.
I would actually recommend doing it the other way around:
Does that make sense?
For 2: the expected answer will come in three types. I considered using generics to solve this, but each type needs a different Arrow array, so that does not seem to work.
I have an idea. I will try it.
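To make the enum idea from this thread concrete, here is a minimal sketch using only the standard library. The `Column` type and `top_k_indices` method are hypothetical names invented for illustration; they are not part of DataFusion or Arrow, and a real test would hold Arrow arrays rather than plain vectors. It also is not necessarily the approach recommended above, only one way to picture how a single test helper could cover integers, floats, and strings.

```rust
use std::cmp::Ordering;

/// One column of test data. In a real test each variant would correspond to a
/// different Arrow array type; plain vectors are an assumption of this sketch.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Int(Vec<i64>),
    Float(Vec<f64>),
    Str(Vec<String>),
}

impl Column {
    /// Row indices of the first `k` rows when ordered ascending by this column,
    /// i.e. the expected row order for `ORDER BY col LIMIT k`.
    fn top_k_indices(&self, k: usize) -> Vec<usize> {
        // Sort row indices with a per-type comparator.
        fn sorted<T>(v: &[T], cmp: impl Fn(&T, &T) -> Ordering) -> Vec<usize> {
            let mut idx: Vec<usize> = (0..v.len()).collect();
            // `sort_by` is stable, so ties keep their input order.
            idx.sort_by(|&a, &b| cmp(&v[a], &v[b]));
            idx
        }
        let mut idx = match self {
            Column::Int(v) => sorted(v, |a, b| a.cmp(b)),
            // `total_cmp` gives floats a total order (NaN included).
            Column::Float(v) => sorted(v, |a, b| a.total_cmp(b)),
            Column::Str(v) => sorted(v, |a, b| a.cmp(b)),
        };
        idx.truncate(k);
        idx
    }
}
```

The per-type dispatch lives in one `match`, so adding a new column type means adding one variant and one arm.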
Is your feature request related to a problem or challenge?
After #7721 a SortExec with a limit will use a special TopK. We have basic unit tests, but I think the coverage could be improved, specifically with fuzz testing.
Describe the solution you'd like
What I would like is a new fuzz test to be added to the existing fuzz cases: https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/tests/fuzz_cases
The structure of SortTest in https://github.com/apache/arrow-datafusion/blob/e95a24b5a260e0e2f603d52682d36cce192676f8/datafusion/core/tests/fuzz_cases/sort_fuzz.rs#L111 might be a good one to follow.
The basic outline would be:
make_staggered_batches
SELECT * FROM t ORDER BY <col(s)> LIMIT <N> and collect the output
Input size: 1000 rows
Parameters to vary
Bonus points
make it easy to add new columns / types (e.g. like string dictionary)
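The general shape of such a fuzz test can be sketched with the standard library alone. Everything below is an assumption made for illustration: `Lcg`, `stagger`, `streaming_top_k`, and `fuzz_case` are hypothetical names, `stagger` merely stands in for the real make_staggered_batches (whose signature is not reproduced here), and the heap-based function is a stand-in for the operator under test, not DataFusion's TopK. The idea is to split the input into randomly sized batches, compute the limited sort in a streaming fashion, and compare against a plain sort followed by truncation.

```rust
use std::collections::BinaryHeap;

/// Tiny deterministic pseudo-random generator (an LCG), used so the sketch
/// needs no external crates. A real fuzz test would likely use a proper RNG.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Split `rows` into batches of random length between 1 and `max_len`.
/// Hypothetical stand-in for make_staggered_batches; `max_len` must be > 0.
fn stagger(rows: &[i64], max_len: usize, rng: &mut Lcg) -> Vec<Vec<i64>> {
    let mut batches = Vec::new();
    let mut rest = rows;
    while !rest.is_empty() {
        let len = (1 + rng.next_u64() as usize % max_len).min(rest.len());
        let (head, tail) = rest.split_at(len);
        batches.push(head.to_vec());
        rest = tail;
    }
    batches
}

/// Stand-in for the code under test: consume batches one at a time and keep
/// only the `k` smallest values in a max-heap.
fn streaming_top_k(batches: &[Vec<i64>], k: usize) -> Vec<i64> {
    let mut heap: BinaryHeap<i64> = BinaryHeap::new();
    for &v in batches.iter().flatten() {
        if heap.len() < k {
            heap.push(v);
        } else if let Some(mut worst) = heap.peek_mut() {
            // The root is the largest kept value; replace it if `v` is smaller.
            if v < *worst {
                *worst = v;
            }
        }
    }
    heap.into_sorted_vec() // ascending order
}

/// Run one fuzz case and report whether the streaming result matches the
/// baseline of sorting all rows and truncating to `k`.
fn fuzz_case(seed: u64, num_rows: usize, k: usize) -> bool {
    let mut rng = Lcg(seed);
    let rows: Vec<i64> = (0..num_rows)
        .map(|_| (rng.next_u64() % 100) as i64 - 50)
        .collect();
    let batches = stagger(&rows, 64, &mut rng);

    let mut expected = rows.clone();
    expected.sort_unstable();
    expected.truncate(k);

    streaming_top_k(&batches, k) == expected
}
```

A driver would call `fuzz_case` over a range of seeds and limits (for example 1000 rows with limits both below and above the row count) and assert that every case returns true. Keeping the seed as an explicit parameter makes a failing case reproducible.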
Describe alternatives you've considered
No response
Additional context
No response