
[Bug]: Parquet null metadata not correctly evaluated #2349

@gcassfull

What happened?

I built a data lake on AWS Athena, using Flow as the library that writes the Parquet files. The files are written without errors, but the Athena reader cannot read them correctly.

I tested locally with pyarrow.parquet, which can read the file, so the Parquet file itself seemed valid. Inspecting it, however, I found that the columns' null metadata is not populated correctly.

In the "How to reproduce?" section I have added PHP code that generates a simple file and a Python script that checks the metadata.

I have already fixed this locally. I am reporting it, as the contribution guidelines ask, so that I am allowed to open the pull request.

How to reproduce?

PHP code that creates a simple file:

<?php
require __DIR__ . '/vendor/autoload.php';

use Flow\Parquet\Writer;
use Flow\Parquet\ParquetFile\Schema;
use Flow\Parquet\ParquetFile\Schema\FlatColumn;

$schema = Schema::with(
    FlatColumn::string('col_all_null'),
    FlatColumn::string('col_all_string'),
    FlatColumn::string('col_mixed'),
);

$rows = [
    ['col_all_null' => null, 'col_all_string' => 'a', 'col_mixed' => 'x'],
    ['col_all_null' => null, 'col_all_string' => 'b', 'col_mixed' => null],
    ['col_all_null' => null, 'col_all_string' => 'c', 'col_mixed' => 'z'],
];

$path = __DIR__ . '/flow_minimal.parquet';
if (file_exists($path)) unlink($path);

(new Writer())->write($path, $schema, $rows);
echo "written: $path\n";



Python code to read the parquet file and verify metadata:


"""Verify null_count metadata of a flow-php parquet file."""
import sys
import pyarrow.parquet as pq

PATH = sys.argv[1] if len(sys.argv) > 1 else "flow_minimal.parquet"

EXPECTED = {
    "col_all_null":   {"null_count": 3, "min": None, "max": None},
    "col_all_string": {"null_count": 0, "min": "a",  "max": "c"},
    "col_mixed":      {"null_count": 1, "min": "x",  "max": "z"},
}

f = pq.ParquetFile(PATH)
ok = True

for rg_i in range(f.metadata.num_row_groups):
    rg = f.metadata.row_group(rg_i)
    for c_i in range(rg.num_columns):
        col = rg.column(c_i)
        path = col.path_in_schema
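        st = col.statistics  # None when the writer stored no statistics at all
        exp = EXPECTED.get(path)

        # Guard against missing statistics so the check fails instead of crashing.
        actual = {
            "null_count": st.null_count if st else None,
            "min": st.min if st else None,
            "max": st.max if st else None,
        }
        passed = exp == actual
        ok &= passed
        status = "OK" if passed else "FAIL"
        has_null_count = st.has_null_count if st else False
        print(f"[{status}] {path}: expected={exp} actual={actual} has_null_count={has_null_count}")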

sys.exit(0 if ok else 1)
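
The values in EXPECTED can be derived directly from the three sample rows. The following is a minimal sketch of that derivation using only the Python standard library. `ROWS` and `expected_stats` are hypothetical names introduced here for illustration and are not part of Flow or pyarrow. It assumes ASCII string values, for which Python's string ordering agrees with the byte-wise ordering used for Parquet string statistics.

```python
"""Derive the expected column statistics from the sample rows (illustrative helper)."""

# Same rows as in the PHP reproduction script.
ROWS = [
    {"col_all_null": None, "col_all_string": "a", "col_mixed": "x"},
    {"col_all_null": None, "col_all_string": "b", "col_mixed": None},
    {"col_all_null": None, "col_all_string": "c", "col_mixed": "z"},
]


def expected_stats(rows):
    """Return {column: {"null_count", "min", "max"}} computed from plain dicts."""
    stats = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        present = [v for v in values if v is not None]
        stats[name] = {
            "null_count": len(values) - len(present),
            # min/max are undefined for an all-null column, hence None
            "min": min(present) if present else None,
            "max": max(present) if present else None,
        }
    return stats


if __name__ == "__main__":
    for column, stat in expected_stats(ROWS).items():
        print(column, stat)
```

Computing the expectations from the rows rather than hard-coding them may make it easier to extend the reproduction with more columns or row groups.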

Playground snippet

No response

Data required to reproduce bug locally

Run the PHP code to create the flow_minimal.parquet file, then run the Python script in an environment with pyarrow installed and check its output:

python3 check_parquet_nulls.py flow_minimal.parquet

Version

0.35

Relevant error output

After running `python3 check_parquet_nulls.py flow_minimal.parquet` you get the errors shown.
