Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NaNs can break parquet statistics #255

Closed
crepererum opened this issue May 4, 2021 · 0 comments · Fixed by #256
Closed

NaNs can break parquet statistics #255

crepererum opened this issue May 4, 2021 · 0 comments · Fixed by #256
Labels
bug parquet Changes to the parquet crate

Comments

@crepererum
Copy link
Contributor

Describe the bug
NaN can occur in parquet statistics and override all other possible values. This is very similar to PARQUET-1225 which was filed for the C++ implementation.

To Reproduce
Add the following tests:

#[test]
fn test_float_statistics_nan_middle() {
    let stats = statistics_roundtrip::<FloatType>(&[1.0, f32::NAN, 2.0]);
    assert!(stats.has_min_max_set());
    if let Statistics::Float(stats) = stats {
        assert_eq!(stats.min(), &1.0);
        assert_eq!(stats.max(), &2.0);
    } else {
        panic!("expecting Statistics::Float");
    }
}

#[test]
fn test_float_statistics_nan_start() {
    let stats = statistics_roundtrip::<FloatType>(&[f32::NAN, 1.0, 2.0]);
    assert!(stats.has_min_max_set());
    if let Statistics::Float(stats) = stats {
        assert_eq!(stats.min(), &1.0);
        assert_eq!(stats.max(), &2.0);
    } else {
        panic!("expecting Statistics::Float");
    }
}

#[test]
fn test_float_statistics_nan_only() {
    let stats = statistics_roundtrip::<FloatType>(&[f32::NAN, f32::NAN]);
    assert!(!stats.has_min_max_set());
    assert!(matches!(stats, Statistics::Float(_)));
}

fn statistics_roundtrip<T: DataType>(values: &[<T as DataType>::T]) -> Statistics {
    let page_writer = get_test_page_writer();
    let props = Arc::new(WriterProperties::builder().build());
    let mut writer = get_test_column_writer::<T>(page_writer, 0, 0, props);
    writer.write_batch(values, None, None).unwrap();

    let (_bytes_written, _rows_written, metadata) = writer.close().unwrap();
    if let Some(stats) = metadata.statistics() {
        stats.clone()
    } else {
        panic!("metadata missing statistics");
    }
}

Note that while the tests are written for f32/float, this also applies to f64/double.

Expected behavior
NaNs should be ignored during stats calculation. If only NaNs are present then min and max value should be unset.

Additional context
Tested commit was 8f030db53d9eda901c82db9daf94339fc447d0db.

@crepererum crepererum added the bug label May 4, 2021
crepererum added a commit to crepererum/arrow-rs that referenced this issue May 4, 2021
nevi-me pushed a commit that referenced this issue May 5, 2021
@jorgecarleitao jorgecarleitao added arrow Changes to the arrow crate parquet Changes to the parquet crate and removed arrow Changes to the arrow crate labels May 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants