What happened?
I built a data lake on AWS Athena and use Flow as the library that writes the Parquet files. The files are written without errors, but the Athena reader cannot read them correctly.
Locally, pyarrow.parquet can read the file, so it looked valid. Inspecting it, however, I found that the columns' null metadata (the `null_count` statistics) is not populated correctly.
In the "How to reproduce?" section I've added PHP code that generates a simple file and Python code that checks the metadata.
I have already fixed this locally. I'm reporting it, as the guidelines ask, so that I can open a pull request.
How to reproduce?
PHP code that creates a simple file:
<?php
require __DIR__ . '/vendor/autoload.php';
use Flow\Parquet\Writer;
use Flow\Parquet\ParquetFile\Schema;
use Flow\Parquet\ParquetFile\Schema\FlatColumn;
$schema = Schema::with(
    FlatColumn::string('col_all_null'),
    FlatColumn::string('col_all_string'),
    FlatColumn::string('col_mixed'),
);
$rows = [
    ['col_all_null' => null, 'col_all_string' => 'a', 'col_mixed' => 'x'],
    ['col_all_null' => null, 'col_all_string' => 'b', 'col_mixed' => null],
    ['col_all_null' => null, 'col_all_string' => 'c', 'col_mixed' => 'z'],
];
$path = __DIR__ . '/flow_minimal.parquet';
if (file_exists($path)) {
    unlink($path);
}
(new Writer())->write($path, $schema, $rows);
echo "written: $path\n";
Python code to read the parquet file and verify metadata:
"""Verify null_count metadata of a flow-php parquet file."""
import sys
import pyarrow.parquet as pq

PATH = sys.argv[1] if len(sys.argv) > 1 else "flow_minimal.parquet"
EXPECTED = {
    "col_all_null": {"null_count": 3, "min": None, "max": None},
    "col_all_string": {"null_count": 0, "min": "a", "max": "c"},
    "col_mixed": {"null_count": 1, "min": "x", "max": "z"},
}

f = pq.ParquetFile(PATH)
ok = True
for rg_i in range(f.metadata.num_row_groups):
    rg = f.metadata.row_group(rg_i)
    for c_i in range(rg.num_columns):
        col = rg.column(c_i)
        path = col.path_in_schema
        st = col.statistics
        exp = EXPECTED.get(path)
        actual = {
            "null_count": st.null_count,
            "min": st.min,
            "max": st.max,
        }
        passed = exp == actual
        ok &= passed
        status = "OK" if passed else "FAIL"
        print(f"[{status}] {path}: expected={exp} actual={actual} has_null_count={st.has_null_count}")
sys.exit(0 if ok else 1)
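For reference, the expected values in the check script follow directly from the three rows written by the PHP code. The sketch below derives them in plain Python, without pyarrow, assuming the usual Parquet convention that `null_count` counts the null values and min/max are taken over the non-null values only. `column_stats` is a hypothetical helper written for this illustration; it is not part of flow or pyarrow and does not describe how either library computes statistics.

```python
# Rows mirror the PHP reproduction above.
ROWS = [
    {"col_all_null": None, "col_all_string": "a", "col_mixed": "x"},
    {"col_all_null": None, "col_all_string": "b", "col_mixed": None},
    {"col_all_null": None, "col_all_string": "c", "col_mixed": "z"},
]


def column_stats(rows, column):
    """Hypothetical helper: null_count plus min/max over non-null values."""
    values = [row[column] for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "null_count": len(values) - len(non_null),
        # An all-null column has no min/max at all.
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


for name in ROWS[0]:
    print(name, column_stats(ROWS, name))
```

The three dictionaries it prints should match the `EXPECTED` table in the check script, which is what a reader relying on column statistics would expect to find in the file metadata.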
Playground snippet
No response
Data required to reproduce bug locally
Run the PHP code to create the flow_minimal.parquet file. Then, in an environment with pyarrow installed, run the Python script and check its output:
python3 check_parquet_nulls.py flow_minimal.parquet
Version
0.35
Relevant error output
After running `python3 check_parquet_nulls.py flow_minimal.parquet` you get the errors shown.