PARQUET-758: Add files with Float16 column #40

Merged: 3 commits, Nov 9, 2023

benibus
Contributor

@benibus benibus commented Oct 11, 2023

These files are dependent on the Float16 type proposal's acceptance: apache/parquet-format#184

They should be useful for testing several cases across Parquet implementations:

  • Basic binary representations of standard values, +/- zeros, and NaN
  • Comparisons between finite values
  • Exclusion of NaNs from statistics min/max
  • Normalizing min/max values when only zeros are present

Generated with:

import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

# Only a null, a zero, and a NaN: min/max must be derived from the zero alone.
t1 = pa.Table.from_arrays(
    [pa.array([None,
               np.float16(0.0),
               np.float16(np.nan)], type=pa.float16())],
    names=["x"])
# Nonzero finite values mixed with a null, a NaN, and both signed zeros.
t2 = pa.Table.from_arrays(
    [pa.array([None,
               np.float16(1.0),
               np.float16(-2.0),
               np.float16(np.nan),
               np.float16(0.0),
               np.float16(-1.0),
               np.float16(-0.0),
               np.float16(2.0)],
              type=pa.float16())],
    names=["x"])

pq.write_table(t1, "float16_zeros_and_nans.parquet")
pq.write_table(t2, "float16_nonzeros_and_nans.parquet")

m1 = pq.read_metadata("float16_zeros_and_nans.parquet")
m2 = pq.read_metadata("float16_nonzeros_and_nans.parquet")

print(m1.row_group(0).column(0))
print(m2.row_group(0).column(0))

Output:

<pyarrow._parquet.ColumnChunkMetaData object at 0x7f24d48c4d60>
  file_offset: 72
  file_path: 
  physical_type: FIXED_LEN_BYTE_ARRAY
  num_values: 3
  path_in_schema: x
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7f24d48c4ea0>
      has_min_max: True
      min: b'\x00\x80'
      max: b'\x00\x00'
      null_count: 1
      distinct_count: None
      num_values: 2
      physical_type: FIXED_LEN_BYTE_ARRAY
      logical_type: Float16
      converted_type (legacy): NONE
  compression: SNAPPY
  encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 24
  total_compressed_size: 68
  total_uncompressed_size: 64
<pyarrow._parquet.ColumnChunkMetaData object at 0x7f24d48c4d60>
  file_offset: 84
  file_path: 
  physical_type: FIXED_LEN_BYTE_ARRAY
  num_values: 8
  path_in_schema: x
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7f24d48c4e50>
      has_min_max: True
      min: b'\x00\xc0'
      max: b'\x00@'
      null_count: 1
      distinct_count: None
      num_values: 7
      physical_type: FIXED_LEN_BYTE_ARRAY
      logical_type: Float16
      converted_type (legacy): NONE
  compression: SNAPPY
  encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 34
  total_compressed_size: 80
  total_uncompressed_size: 76
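
For readers checking the statistics above by hand, here is a minimal sketch that decodes the `min`/`max` bytes and recomputes them from the input values. It assumes the two-byte values are little-endian IEEE 754 half-precision, and it infers the zero normalization (min written as -0.0, max as +0.0) from the output shown. `decode_float16` and `float_min_max` are hypothetical helper names, not part of pyarrow or any Parquet library.

```python
import math
import struct


def decode_float16(raw: bytes) -> float:
    """Decode a 2-byte little-endian half-precision value (assumed layout)."""
    (value,) = struct.unpack("<e", raw)
    return value


def float_min_max(values):
    """Hypothetical helper: min/max with nulls and NaNs excluded, zeros normalized.

    Returns None when no comparable value remains.
    """
    comparable = [v for v in values if v is not None and not math.isnan(v)]
    if not comparable:
        return None
    lo, hi = min(comparable), max(comparable)
    # -0.0 == 0.0 in Python, so these checks catch a zero of either sign.
    if lo == 0.0:
        lo = -0.0  # a zero min is written as negative zero
    if hi == 0.0:
        hi = 0.0   # a zero max is written as positive zero
    return lo, hi


nan = float("nan")
zeros_and_nans = [None, 0.0, nan]
nonzeros_and_nans = [None, 1.0, -2.0, nan, 0.0, -1.0, -0.0, 2.0]

for values in (zeros_and_nans, nonzeros_and_nans):
    lo, hi = float_min_max(values)
    # Re-encode to compare against the raw bytes printed in the metadata.
    print(struct.pack("<e", lo), struct.pack("<e", hi))
```

Under these assumptions the re-encoded bytes should match the `min` and `max` fields of the two column chunks, which gives other implementations a quick cross-check when reading the files.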

@tustvold

apache/parquet-format#184 has now been merged. Is this waiting on anything further?

I'm mainly wondering what I should be doing with apache/arrow-rs#5003

data/README.md Outdated
Comment on lines 48 to 49
| float16_nonzeros_and_nans.parquet | Float16 (logical type) column with NaNs and nonzero finite min/max values |
| float16_zeros_and_nans.parquet | Float16 (logical type) column with NaNs and zeros as min/max values |
Member

Can you perhaps show how the file was generated, or what the data looks like, in the same spirit as was done for "NaN in stats" below?

@pitrou
Member

pitrou commented Nov 9, 2023

@benibus Can you please update this and make it ready for review? It would be better to merge this soon, so that the file can be used for integration testing in implementation PRs.

@benibus benibus marked this pull request as ready for review November 9, 2023 15:29
@benibus
Contributor Author

benibus commented Nov 9, 2023

@pitrou Extended the docs in the README and marked as ready to review.

I believe these files should be sufficient for our purposes - including apache/arrow-rs#5003 (sorry about the wait, @tustvold... that was my bad).

Member

@pitrou pitrou left a comment

Thanks @benibus !

@pitrou pitrou changed the title Add files with Float16 column PARQUET-758: Add files with Float16 column Nov 9, 2023
@pitrou pitrou merged commit 506afff into apache:master Nov 9, 2023