How to enable statistics for string columns? #5270

Closed
jonashaag opened this issue Jan 2, 2024 · 11 comments
Labels
bug, parquet (Changes to the parquet crate)

Comments

@jonashaag

Describe the bug

I'm using the https://github.com/pacman82/odbc2parquet library, which is based on this crate.

I observe that statistics like min/max are not written for string columns:

In [4]: pq.ParquetFile("/tmp/o2p").metadata.row_group(0).column(1)
Out[4]:
<pyarrow._parquet.ColumnChunkMetaData object at 0x1033c1080>
  file_offset: 1123
  file_path:
  physical_type: BYTE_ARRAY
  num_values: 100
  path_in_schema: XXX
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x103476070>
      has_min_max: False
      min: None
      max: None
      null_count: None
      distinct_count: None
      num_values: 100
      physical_type: BYTE_ARRAY
      logical_type: String
      converted_type (legacy): UTF8
  compression: ZSTD
  encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
  has_dictionary_page: True
  dictionary_page_offset: 394
  data_page_offset: 938
  total_compressed_size: 729
  total_uncompressed_size: 2993

Relevant code: https://github.com/pacman82/odbc2parquet/blob/b571cad6fae1b58e1aab8348f14b32f20d6ec165/src/query/parquet_writer.rs#L47
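
For reference, a minimal sketch (my own illustration, not the odbc2parquet code itself) of how statistics generation is configured through WriterProperties in this crate; statistics are enabled by default, but can also be requested explicitly, and the column path "XXX" below is just a stand-in for the real column from the metadata dump above:

use parquet::file::properties::{EnabledStatistics, WriterProperties};
use parquet::schema::types::ColumnPath;

fn example_props() -> WriterProperties {
    // Statistics are on by default; this sketch only shows where the knobs
    // live: globally and per column. "XXX" stands in for the real column path.
    WriterProperties::builder()
        .set_statistics_enabled(EnabledStatistics::Chunk)
        .set_column_statistics_enabled(ColumnPath::from("XXX"), EnabledStatistics::Page)
        .build()
}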

To Reproduce

Use odbc2parquet to download any table that contains a string column.

Expected behavior

Should have min/max statistics.

Additional context

@tustvold
Contributor

tustvold commented Jan 2, 2024

Perhaps you might provide a reproducer for this, or at the very least an example file exhibiting this property. I wonder if this might be a quirk of pyarrow...

The following test passes:

#[test]
fn test_byte_array_statistics_full() {
    let schema = Arc::new(
        types::Type::group_type_builder("schema")
            .with_fields(vec![Arc::new(
                types::Type::primitive_type_builder("col1", Type::BYTE_ARRAY)
                    .with_converted_type(ConvertedType::UTF8)
                    .with_logical_type(Some(LogicalType::String))
                    .build()
                    .unwrap(),
            )])
            .build()
            .unwrap(),
    );
    let props = Arc::new(WriterProperties::builder().build());

    let mut buffer = Vec::with_capacity(1024);
    let mut writer = SerializedFileWriter::new(&mut buffer, schema, props).unwrap();

    let mut row_group = writer.next_row_group().unwrap();
    let mut column = row_group.next_column().unwrap().unwrap();
    column
        .typed::<ByteArrayType>()
        .write_batch(
            &[
                ByteArray::from(b"123".to_vec()),
                ByteArray::from(b"abc".to_vec()),
            ],
            None,
            None,
        )
        .unwrap();
    column.close().unwrap();
    row_group.close().unwrap();

    let metadata = writer.close().unwrap();
    let col = &metadata.row_groups[0].columns[0];
    let col_meta = col.meta_data.as_ref().unwrap();
    let col_stats = col_meta.statistics.as_ref().unwrap();
    assert_eq!(col_stats.min.as_ref().unwrap(), b"123");
    assert_eq!(col_stats.max.as_ref().unwrap(), b"abc");
}

@jonashaag
Author

In [3]: pq.ParquetFile("/tmp/no_stats.parquet").metadata.row_group(0).column(0)
Out[3]:
<pyarrow._parquet.ColumnChunkMetaData object at 0x105bf7100>
  ...
  statistics:
    <pyarrow._parquet.Statistics object at 0x125cf6200>
      has_min_max: False
      min: None
      max: None
      ...

In [5]: table = pq.read_table("/tmp/no_stats.parquet")

# Re-write table with PyArrow
In [6]: with pq.ParquetWriter("/tmp/with_stats.parquet", schema=table.schema) as w:
   ...:     w.write_table(table)
   ...:

In [7]: pq.ParquetFile("/tmp/with_stats.parquet").metadata.row_group(0).column(0)
Out[7]:
<pyarrow._parquet.ColumnChunkMetaData object at 0x105660680>
  ...
  statistics:
    <pyarrow._parquet.Statistics object at 0x125cf5da0>
      has_min_max: True
      min: 01
      max: 01
      ...

no_stats.parquet.zip

@jonashaag
Author

I'm pretty sure it's not a quirk of PyArrow. I realized that filtering on string columns with Polars/DuckDB is much slower for Parquet files created with the odbc2parquet tool than for files created with PyArrow.
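
To illustrate why the missing statistics hurt filter performance, here is a rough sketch of the row-group pruning a query engine can do when min/max values are present, written against this crate's reader API; the function and its arguments are illustrative, not taken from Polars or DuckDB:

use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;

// Returns true if the first row group can be skipped for a predicate
// `col0 == needle`, based solely on the column chunk's min/max statistics.
fn can_skip_first_row_group(path: &str, needle: &[u8]) -> parquet::errors::Result<bool> {
    let reader = SerializedFileReader::new(File::open(path)?)?;
    let column = reader.metadata().row_group(0).column(0);
    Ok(match column.statistics() {
        // If the needle lies outside [min, max], the row group cannot match
        // and never has to be decoded.
        Some(stats) if stats.has_min_max_set() => {
            needle < stats.min_bytes() || needle > stats.max_bytes()
        }
        // Without min/max (as in the files discussed here) every row group
        // has to be read, which is consistent with the slower filtering.
        _ => false,
    })
}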

@tustvold
Contributor

tustvold commented Jan 2, 2024

OK, so the statistics are present in the file:

#[test]
fn test_example() {
    let file = File::open("/home/raphael/Downloads/no_stats.parquet").unwrap();
    let reader = SerializedFileReader::new(file).unwrap();
    let metadata = reader.metadata().row_group(0).column(0);
    let col_stats = metadata.statistics().unwrap();
    println!("{col_stats:?}");
}
ByteArray({min: Some(ByteArray { data: "01" }), max: Some(ByteArray { data: "01" }), distinct_count: None, null_count: 0, min_max_deprecated: false, min_max_backwards_compatible: false, max_value_exact: false, min_value_exact: false})

So the question is now why pyarrow is unhappy with those statistics. I vaguely remember some bug/limitation in pyarrow related to this; let me see if I can dig it out.

is much slower than on files created with PyArrow

This could be for a very large number of reasons.

Edit: I can't find the exact issue I am looking for, but these are all related.

Perhaps @mapleFU might have the relevant information to hand?

@pacman82

pacman82 commented Jan 2, 2024

Hello @tustvold,

I do not know if this helps to clear things up, or causes more confusion, but here is my attempt to reproduce the issue directly with odbc2parquet.

See: https://github.com/pacman82/odbc2parquet/blob/42295d22b4ac442acdb70fea9c18752c830de6e1/tests/integration.rs#L3973

#[test]
fn write_statistics_for_text_columns() {
    // Setup table for test
    let table_name = "WriteStatisticsForTextColumns";
    let mut table = TableMssql::new(table_name, &["VARCHAR(10)"]);
    table.insert_rows_as_text(&[["aaa"], ["zzz"]]);
    let query = format!("SELECT a FROM {table_name}");

    let command = Command::cargo_bin("odbc2parquet")
        .unwrap()
        .args([
            "query",
            "--connection-string",
            MSSQL,
            "-", // Use `-` to explicitly write to stdout
            &query,
        ])
        .assert()
        .success();

    // Then
    let bytes = Bytes::from(command.get_output().stdout.clone());
    let reader = SerializedFileReader::new(bytes).unwrap();
    let stats = reader.metadata().row_group(0).column(0).statistics().unwrap();
    assert_eq!("aaa", str::from_utf8(stats.min_bytes()).unwrap());
    assert_eq!("zzz", str::from_utf8(stats.max_bytes()).unwrap());
}

The above code executes and the test passes. This suggests that, when reading the file with the Rust parquet crate, the statistics are present. Yet they seem to be written differently from what the Python stack expects.

Best, Markus

@tustvold
Contributor

tustvold commented Jan 2, 2024

Yet they seem to be written differently to what the Python stack seems to expect.

Yes, I think we need the python/arrow-cpp developers to weigh in here as to what is going on.

@Jefffrey
Contributor

Jefffrey commented Jan 2, 2024

I think this is actually due to #5158 only having recently been merged in.

Some reproduction code:

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result;
use parquet::file::reader::FileReader;
use parquet::file::serialized_reader::SerializedFileReader;
use std::fs::File;

fn main() -> Result<()> {
    println!("checking file from issue");
    let no_stats_path = "/home/jeffrey/Downloads/no_stats.parquet";
    let file = File::open(no_stats_path)?;
    let reader = SerializedFileReader::new(file)?;
    dbg!(reader.metadata().file_metadata().column_order(0));

    println!("checking file after rewritten by pyarrow");
    let with_stats_path = "/home/jeffrey/Downloads/with_stats.parquet";
    let file = File::open(with_stats_path)?;
    let reader = SerializedFileReader::new(file)?;
    dbg!(reader.metadata().file_metadata().column_order(0));

    println!("rewriting file from issue with latest parquet-rs");
    let file = File::open(no_stats_path)?;
    let mut reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
    let batch = reader.next().unwrap()?;

    let new_with_stats_path = "/home/jeffrey/Downloads/new_with_stats.parquet";
    let file = File::create(new_with_stats_path)?;
    let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
    writer.write(&batch)?;
    writer.close()?;

    println!("checking file after rewritten by latest parquet-rs");
    let file = File::open(new_with_stats_path)?;
    let reader = SerializedFileReader::new(file)?;
    dbg!(reader.metadata().file_metadata().column_order(0));

    Ok(())
}

And the output:

checking file from issue
[parquet/./examples/read_parquet.rs:30] reader.metadata().file_metadata().column_order(0) = UNDEFINED
checking file after rewritten by pyarrow
[parquet/./examples/read_parquet.rs:36] reader.metadata().file_metadata().column_order(0) = TYPE_DEFINED_ORDER(
    UNSIGNED,
)
rewriting file from issue with latest parquet-rs
checking file after rewritten by latest parquet-rs
[parquet/./examples/read_parquet.rs:52] reader.metadata().file_metadata().column_order(0) = TYPE_DEFINED_ORDER(
    UNSIGNED,
)

We can see that the original no_stats.parquet (from #5270 (comment)) has an undefined column order. After following the steps to rewrite that file using pyarrow, we can see that it is now populated.

Now, running with the latest master of parquet-rs and rewriting that original file as was done for pyarrow, we can see that the column order is defined.

When I check this new parquet file, written by the latest parquet-rs master branch, in pyarrow, the statistics now come through:

>>> pq.ParquetFile("/home/jeffrey/Downloads/new_with_stats.parquet").metadata.row_group(0).column(0)
<pyarrow._parquet.ColumnChunkMetaData object at 0x7f2bf82d92b0>
  file_offset: 61
  file_path:
  physical_type: BYTE_ARRAY
  num_values: 1
  path_in_schema: x
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7f2bf82d94e0>
      has_min_max: True
      min: 01
      max: 01
      null_count: None
      distinct_count: None
      num_values: 1
      physical_type: BYTE_ARRAY
      logical_type: String
      converted_type (legacy): UTF8
  compression: UNCOMPRESSED
  encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 24
  total_compressed_size: 57
  total_uncompressed_size: 57
>>>
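
As an aside (an assumption about the reader-side logic, not something taken from pyarrow's source): a consumer can only trust BYTE_ARRAY min/max values if the file declares how they were ordered, so an undefined column order is a plausible reason to ignore them. Roughly:

use parquet::basic::ColumnOrder;
use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;

// Sketch of the check a cautious reader might apply before trusting the
// min/max statistics of the first column.
fn stats_usable(path: &str) -> parquet::errors::Result<bool> {
    let reader = SerializedFileReader::new(File::open(path)?)?;
    let order = reader.metadata().file_metadata().column_order(0);
    // With ColumnOrder::UNDEFINED the comparison semantics of min/max are
    // unknown, which would explain pyarrow reporting has_min_max: False.
    Ok(!matches!(order, ColumnOrder::UNDEFINED))
}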

Could you test again from the arrow-rs master branch to see if this resolves the issue?

This fix should come in arrow-rs release 50.0.0 (not included in 49.0.0, which is the current latest on crates.io); see the tracking issue for the release: #5234

@tustvold
Contributor

I believe this is now fixed and released in arrow 50.0.0. Please feel free to reopen if I am mistaken.

@pacman82

Hello @tustvold,

thanks for the information. I'll try to see if the behavior has now vanished in odbc2parquet.

@pacman82

Yep, I can now also read statistics in Python, with the files written by odbc2parquet.

@pacman82

Thanks!

@tustvold added the parquet (Changes to the parquet crate) label on Mar 1, 2024