You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
An underflow panic can sometimes be triggered by setting max_row_group_size to less than execution.batch_size.
thread 'tokio-runtime-worker' panicked at datafusion/core/src/datasource/file_format/parquet.rs:886:33:
attempt to subtract with overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
To Reproduce
I found this by trying a wide range of max_row_group_sizes
#[tokio::main]asyncfnmain() -> Result<()>{// create local execution contextlet ctx = SessionContext::new();let testdata = "benchmarks/data/tpch_sf10/lineitem";let filename = &format!("{testdata}/part-0.parquet");// define the query using the DataFrame traitlet df = ctx
.read_parquet(filename,ParquetReadOptions::default()).await?
.limit(0,Some(200_000))?;println!("{}", df.clone().count().await?);for row_group_size in(1..8193).step_by(283).rev(){println!("row group size: {}", row_group_size);println!("Writing without parallelism!");let row_group_path = format!("/tmp/{}.parquet", row_group_size);letmut options = TableParquetOptions::default();
options.global.max_row_group_size = row_group_size;
options.global.allow_single_file_parallelism = false;
df.clone().write_parquet(&row_group_path,DataFrameWriteOptions::new().with_single_file_output(true),Some(options),).await.unwrap();println!("Writing with parallelism!");let row_group_path = format!("/tmp/para_{}.parquet", row_group_size);letmut options = TableParquetOptions::default();
options.global.max_row_group_size = row_group_size;
options.global.allow_single_file_parallelism = true;
df.clone().write_parquet(&row_group_path,DataFrameWriteOptions::new().with_single_file_output(true),Some(options),).await.unwrap();}Ok(())}
Expected behavior
No combination of max_row_group_size and execution.batch_size should lead to panic
Additional context
Extremely tiny max_row_group sizes can cause a stack overflow error even if parallel_parquet writer is disabled. E.g. a max row group size of 1. We may want to raise a configuration validation error for absurdly small row group sizes.
This happens even if release mode:
row group size: 1
Writing without parallelism!
thread 'tokio-runtime-worker' has overflowed its stack
fatal runtime error: stack overflow
Aborted (core dumped)
The text was updated successfully, but these errors were encountered:
@matthewmturner the panic message is different so I'm not sure... are you changing any other configs? If you are able to post a minimal reproducer of your panic I can compare if they are the same.
I will edit the description here to show the panic message I see.
@devinjdangelo i am going to work on one, but i am moving this weekend so i might not have time in the next few days. Will try to get one by early next week.
@matthewmturner Ok sounds good. I will fix the logic error causing the panic in this issue, and it is possible that it will fix your panic if the same logic error is triggering both.
Describe the bug
An underflow panic can sometimes be triggered by setting max_row_group_size to less than execution.batch_size.
To Reproduce
I found this by trying a wide range of max_row_group_sizes
Expected behavior
No combination of max_row_group_size and execution.batch_size should lead to panic
Additional context
Extremely tiny max_row_group sizes can cause a stack overflow error even if parallel_parquet writer is disabled. E.g. a max row group size of 1. We may want to raise a configuration validation error for absurdly small row group sizes.
This happens even if release mode:
The text was updated successfully, but these errors were encountered: