Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[test] add fuzz test for topk #7772

Merged
merged 2 commits into from
Oct 21, 2023
Merged

Conversation

Tangruilin
Copy link
Contributor

Which issue does this PR close?

Closes #7749 .

Rationale for this change

Add fuzz test for topk

What changes are included in this PR?

  1. This PR add i32 type fuzz test for topk;
  2. This PR add fn to get f64 and string batched but now not use(for the fn batches_to_vec can only work for i32)

TODO: This PR is not completed. I need some suggestions that if i need to modify fn batches_to_vec

Are these changes tested?

this PR is a test

Are there any user-facing changes?

no

@github-actions github-actions bot added the core Core DataFusion crate label Oct 8, 2023
@Tangruilin Tangruilin force-pushed the test#7749#add_fuzz_test branch 4 times, most recently from f63c3fc to 4144b44 Compare October 9, 2023 16:10
@Tangruilin
Copy link
Contributor Author

@alamb Please review it.

@Tangruilin
Copy link
Contributor Author

@alamb Please review it.

Thanks

@alamb
Copy link
Contributor

alamb commented Oct 9, 2023

Thank you @Tangruilin -- I plan to review this tomorrow. I look forward to it!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Tangruilin

TODO: This PR is not completed. I need some suggestions that if i need to modify fn batches_to_vec

I am not quite sure what you mean here. I think if you can make a TopKScenario type structure, it maybe easier to figure out if you need to modify batches_to_vec

@@ -138,6 +162,44 @@ impl SortTest {
self
}

async fn run_with_params(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps naming this run_with_limit would make it easier to understand what is happening

}

#[tokio::test]
async fn test_sort_topk_i32() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you think about encapsulating the limit data and expected value calculation in a structure?

So this test might look like

let size: usize = ...; // pick a random size
let scenario = TopKScenario::new()
   // tell the scenario to sort by one column
   .with_sort_column(["i32_column"])
   // specify a limit of 10 rows
    .with_limit(10);

// stagger the batches in the scenario
scenario.stagger()

let collected = SortTest::new()
  // call Scenario::batches to get the input batches
  .with_input(scenario.batches());
// run the test
  .run_with_limit("t", scenario.sort_cols(), scenario.limit()).await;

// The scenario handles calculting expected output (as it knows the sort column and limit)
let expected = scenario.expected()
let actual =  batches_to_vec(&collected);
assert_eq!(actual, &expected); 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A good suggestion! I'll do it tomorrow and push it

@alamb alamb marked this pull request as draft October 10, 2023 16:02
@alamb
Copy link
Contributor

alamb commented Oct 10, 2023

Mark as draft to signify this PR is not waiting on feedback anymore

Copy link
Contributor

@Weijun-H Weijun-H left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @Tangruilin

use std::sync::Arc;
use test_utils::{batches_to_vec, partitions_to_sorted_vec};
use test_utils::{batches_to_vec, partitions_to_sorted_vec, stagger_batch};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const KB: u64 = 1 << 10;

for (batch_size, should_spill) in [(5, false), (20000, true), (1000000, true)] {
SortTest::new()
.with_int32_batches(batch_size)
.with_pool_size(10240)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.with_pool_size(10240)
.with_pool_size(10 * KB)

for (batch_size, should_spill) in [(5, false), (20000, false), (1000000, true)] {
SortTest::new()
.with_int32_batches(batch_size)
.with_pool_size(102400)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.with_pool_size(102400)
.with_pool_size(100 * KB)

@Tangruilin
Copy link
Contributor Author

I run ci/scripts/rust_clippy.sh on my mac but get
image
but the ci is
image

the result is not some, I'm confused @alamb

@alamb
Copy link
Contributor

alamb commented Oct 10, 2023

I run ci/scripts/rust_clippy.sh on my mac but get

You may have to do:

  1. rustup update to get the latest rust version
  2. merge up from origin/main to get the latest DataFusion code

@Tangruilin
Copy link
Contributor Author

I run ci/scripts/rust_clippy.sh on my mac but get

You may have to do:

  1. rustup update to get the latest rust version
  2. merge up from origin/main to get the latest DataFusion code

except that. I found that the rust should be stable —— run rustup update && rust default stable then it is solved. Are there docs for this. If not, maybe I can add it.

@alamb
Copy link
Contributor

alamb commented Oct 13, 2023

except that. I found that the rust should be stable —— run rustup update && rust default stable then it is solved. Are there docs for this. If not, maybe I can add it.

I think it is implicit in the Testing setup: part:

rustup update stable DataFusion uses the latest stable release of rust

I agree this could be clearer in the documentation. Perhaps you can help here

@Tangruilin
Copy link
Contributor Author

I have tried some ways, but there is not a prefect solution. I will update the PR tonight

Signed-off-by: reilly <tang.ruilin@foxmail.com>
@Tangruilin Tangruilin marked this pull request as ready for review October 16, 2023 15:35
@Tangruilin Tangruilin requested a review from alamb October 16, 2023 15:38
@alamb
Copy link
Contributor

alamb commented Oct 16, 2023

Thanks @Tangruilin -- I plan to review this more carefully tomorrow

@Tangruilin
Copy link
Contributor Author

Thanks @Tangruilin -- I plan to review this more carefully tomorrow

Don't forget this~~~

@alamb
Copy link
Contributor

alamb commented Oct 19, 2023

I am sorry 😢 -- I have not forgotten, but I am backed up on reviews:

Screenshot 2023-10-19 at 6 14 21 AM

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @Tangruilin and @Weijun-H -- this very nicely done. I am sorry for my delay in reviewing. I think this PR adds additional coverage and is structured to make future improvements straghtforward.

🙏 great job

datafusion_expr::col(topk_scenario.col_name).sort(true, true)
])
.unwrap()
.limit(0, Some(topk_scenario.limit))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified the plan here has the expected TopK element:

| physical_plan                                              | GlobalLimitExec: skip=0, fetch=10                    |
|                                                            |   SortExec: TopK(fetch=10), expr=[x@0 ASC]           |
|                                                            |     MemoryExec: partitions=1, partition_sizes=[38]   |

@alamb alamb merged commit 5bdc9af into apache:main Oct 21, 2023
22 checks passed
@alamb
Copy link
Contributor

alamb commented Oct 21, 2023

BTW I am working on an extesion of this test to support multiple columns

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TopK Fuzz Tests 🐝
3 participants