PARQUET-758: Add files with Float16 column #40

benibus · 2023-10-11T21:18:55Z

These files depend on acceptance of the Float16 type proposal: apache/parquet-format#184

They should be useful for testing several cases across Parquet implementations:

- Basic binary representations of standard values, +/- zeros, and NaN
- Comparisons between finite values
- Exclusion of NaNs from statistics min/max
- Normalizing min/max values when only zeros are present
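The last two cases can be sketched in plain Python. This is only an illustration of the behavior the files are meant to exercise, not any implementation's actual code: the helper name `float16_min_max` is hypothetical, and it assumes the statistics are stored as 2-byte little-endian IEEE 754 half-precision values, with NaNs skipped, a zero minimum written as -0.0, and a zero maximum written as +0.0.

```python
import math
import struct


def float16_min_max(values):
    """Return (min, max) as 2-byte little-endian float16, or None if no usable value.

    Hypothetical helper for illustration only.
    """
    # Nulls and NaNs do not take part in min/max.
    usable = [v for v in values if v is not None and not math.isnan(v)]
    if not usable:
        return None
    lo, hi = min(usable), max(usable)
    # +0.0 and -0.0 compare equal, so pin the signs explicitly:
    # a zero minimum becomes -0.0 and a zero maximum becomes +0.0.
    if lo == 0.0:
        lo = -0.0
    if hi == 0.0:
        hi = 0.0
    # "<e" is the struct format for a little-endian half-precision float.
    return struct.pack("<e", lo), struct.pack("<e", hi)


nan = float("nan")
print(float16_min_max([None, 0.0, nan]))
print(float16_min_max([None, 1.0, -2.0, nan, 0.0, -1.0, -0.0, 2.0]))
```

Under those assumptions, a column holding only a null, +0.0, and NaN yields a minimum of -0.0 and a maximum of +0.0, even though -0.0 never appears in the data.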

The files were generated with:

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Column with a null, a positive zero, and a NaN only.
t1 = pa.Table.from_arrays(
    [pa.array([None,
               np.float16(0.0),
               np.float16(np.nan)], type=pa.float16())],
    names=["x"])
# Column with a null, a NaN, both zeros, and nonzero finite values.
t2 = pa.Table.from_arrays(
    [pa.array([None,
               np.float16(1.0),
               np.float16(-2.0),
               np.float16(np.nan),
               np.float16(0.0),
               np.float16(-1.0),
               np.float16(-0.0),
               np.float16(2.0)],
              type=pa.float16())],
    names=["x"])

pq.write_table(t1, "float16_zeros_and_nans.parquet")
pq.write_table(t2, "float16_nonzeros_and_nans.parquet")

# Read back the file metadata to inspect the column chunk statistics.
m1 = pq.read_metadata("float16_zeros_and_nans.parquet")
m2 = pq.read_metadata("float16_nonzeros_and_nans.parquet")

print(m1.row_group(0).column(0))
print(m2.row_group(0).column(0))

Output:

<pyarrow._parquet.ColumnChunkMetaData object at 0x7f24d48c4d60>
  file_offset: 72
  file_path: 
  physical_type: FIXED_LEN_BYTE_ARRAY
  num_values: 3
  path_in_schema: x
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7f24d48c4ea0>
      has_min_max: True
      min: b'\x00\x80'
      max: b'\x00\x00'
      null_count: 1
      distinct_count: None
      num_values: 2
      physical_type: FIXED_LEN_BYTE_ARRAY
      logical_type: Float16
      converted_type (legacy): NONE
  compression: SNAPPY
  encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 24
  total_compressed_size: 68
  total_uncompressed_size: 64
<pyarrow._parquet.ColumnChunkMetaData object at 0x7f24d48c4d60>
  file_offset: 84
  file_path: 
  physical_type: FIXED_LEN_BYTE_ARRAY
  num_values: 8
  path_in_schema: x
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7f24d48c4e50>
      has_min_max: True
      min: b'\x00\xc0'
      max: b'\x00@'
      null_count: 1
      distinct_count: None
      num_values: 7
      physical_type: FIXED_LEN_BYTE_ARRAY
      logical_type: Float16
      converted_type (legacy): NONE
  compression: SNAPPY
  encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 34
  total_compressed_size: 80
  total_uncompressed_size: 76
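
The raw `min`/`max` bytes above are easier to check once decoded. A minimal sketch, assuming the two bytes are a little-endian IEEE 754 half-precision value; the helper `decode_float16` is hypothetical and uses only the standard library:

```python
import struct


def decode_float16(raw: bytes) -> float:
    """Decode a 2-byte little-endian half-precision value (hypothetical helper)."""
    (value,) = struct.unpack("<e", raw)
    return value


# (min, max) byte strings copied from the statistics printed above.
stats = {
    "float16_zeros_and_nans.parquet": (b"\x00\x80", b"\x00\x00"),
    "float16_nonzeros_and_nans.parquet": (b"\x00\xc0", b"\x00@"),
}

for name, (lo, hi) in stats.items():
    print(name, decode_float16(lo), decode_float16(hi))
# float16_zeros_and_nans.parquet -0.0 0.0
# float16_nonzeros_and_nans.parquet -2.0 2.0
```

In both files the NaN is absent from the statistics, and the zeros-only file reports -0.0 as its minimum and +0.0 as its maximum.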

tustvold · 2023-10-31T20:13:45Z

apache/parquet-format#184 has now been merged. Is this waiting on anything further?

I'm mainly wondering what I should be doing with apache/arrow-rs#5003

pitrou · 2023-11-09T14:27:03Z

data/README.md

+| float16_nonzeros_and_nans.parquet | Float16 (logical type) column with NaNs and nonzero finite min/max values |
+| float16_zeros_and_nans.parquet    | Float16 (logical type) column with NaNs and zeros as min/max values |


Can you perhaps show how the file was generated, or what the data looks like, in the same spirit as was done for "NaN in stats" below?

pitrou · 2023-11-09T14:27:45Z

@benibus Can you please update this and make it ready for review? It would be better to merge this soon, so that the file can be used for integration testing in implementation PRs.

benibus · 2023-11-09T15:36:53Z

@pitrou Extended the docs in the README and marked as ready to review.

I believe these files should be sufficient for our purposes - including apache/arrow-rs#5003 (sorry about the wait, @tustvold... that was my bad).

data/README.md

pitrou

Thanks @benibus !

Add generated Float16 files

307a8bf

Jefffrey mentioned this pull request Oct 30, 2023

Parquet: read/write f16 for Arrow apache/arrow-rs#5003

Merged

pitrou reviewed Nov 9, 2023

Add detailed description to docs

486e264

benibus marked this pull request as ready for review November 9, 2023 15:29

pitrou reviewed Nov 9, 2023

Try adding internal link

1496206

pitrou approved these changes Nov 9, 2023

pitrou changed the title ~~Add files with Float16 column~~ PARQUET-758: Add files with Float16 column Nov 9, 2023

pitrou merged commit 506afff into apache:master Nov 9, 2023

benibus mentioned this pull request Nov 16, 2023

[C++][Go][Parquet] Utilize new parquet-testing files in Float16 tests apache/arrow#38751

Closed

asfimport mentioned this pull request Jun 23, 2024

[Format] HALF precision FLOAT Logical type apache/parquet-format#317

Closed
