parquet: improve BOOLEAN writing logic and report error on encoding fail #443

Merged: 5 commits, Jun 16, 2021

Commits on Jun 10, 2021

  1. improve BOOLEAN writing logic and report error on encoding fail

    When writing BOOLEAN data, writing more than 2048 rows will overflow
    the hard-coded 256-byte buffer set for the bit-writer in the
    PlainEncoder (256 bytes hold 2048 bits). Once this occurs, further
    attempts to write to the encoder fail because capacity is exceeded,
    but the errors are silently ignored.
    
    This fix improves the error detection and reporting at the point of
    encoding and modifies the logic for bit_writing (BOOLEANS). The
    bit_writer is initially allocated 256 bytes (as at present), then each
    time the capacity is exceeded the capacity is incremented by another
    256 bytes.
    
    This certainly resolves the current problem, but it's not exactly a
    great fix because the capacity of the bit_writer could now grow
    substantially.
    
    Other data types seem to have a more sophisticated mechanism for writing
    data which doesn't involve growing or having a fixed size buffer. It
    would be desirable to make the BOOLEAN type use this same mechanism if
    possible, but that level of change is more intrusive and probably
    requires greater knowledge of the implementation than I possess.
    
    resolves: apache#349
    garyanaplan committed Jun 10, 2021
    db8af9b
  2. only manipulate the bit_writer for BOOLEAN data

    Tacky, but I can't think of a better way to do this without
    specialization.
    garyanaplan committed Jun 10, 2021
    a4ec2d7
  3. better isolation of changes

    Remove the byte tracking from the PlainEncoder and use the existing
    bytes_written() method in BitWriter.
    
    This is neater.
    garyanaplan committed Jun 10, 2021
    eb11b2b
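
The growth strategy described in the Jun 10 commits can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `SketchBitWriter`, `put_bool`, `extend`, and `encode_bools` are hypothetical names invented here, and only `bytes_written()` is named in the commit messages. It assumes a writer that refuses a write once its buffer is full, and an encoder that grows the buffer by 256 bytes on failure and returns an error if the retry also fails, rather than ignoring it.

```rust
/// Hypothetical stand-in for a bit writer: a fixed-capacity byte buffer
/// that packs one boolean per bit (least significant bit first) and
/// refuses writes once it is full.
pub struct SketchBitWriter {
    buffer: Vec<u8>,
    max_bytes: usize,
    bits_written: usize,
}

impl SketchBitWriter {
    pub fn new(max_bytes: usize) -> Self {
        SketchBitWriter {
            buffer: vec![0; max_bytes],
            max_bytes,
            bits_written: 0,
        }
    }

    /// Number of bytes used so far, rounded up to a whole byte.
    pub fn bytes_written(&self) -> usize {
        (self.bits_written + 7) / 8
    }

    /// Current capacity in bytes.
    pub fn capacity(&self) -> usize {
        self.max_bytes
    }

    /// Grow the buffer by `increment` bytes, keeping existing contents.
    pub fn extend(&mut self, increment: usize) {
        self.max_bytes += increment;
        self.buffer.resize(self.max_bytes, 0);
    }

    /// Write one boolean. Returns false, without writing, when the buffer is full.
    pub fn put_bool(&mut self, value: bool) -> bool {
        if self.bits_written >= self.max_bytes * 8 {
            return false;
        }
        if value {
            self.buffer[self.bits_written / 8] |= 1u8 << (self.bits_written % 8);
        }
        self.bits_written += 1;
        true
    }
}

/// Growth step in bytes, matching the 256 bytes mentioned in the commit message.
const GROWTH_BYTES: usize = 256;

/// Encode booleans, growing the writer in 256-byte steps when a write is
/// refused and reporting a failed retry as an error instead of dropping it.
pub fn encode_bools(writer: &mut SketchBitWriter, values: &[bool]) -> Result<(), String> {
    for &value in values {
        if !writer.put_bool(value) {
            writer.extend(GROWTH_BYTES);
            if !writer.put_bool(value) {
                return Err("unable to put boolean value".to_string());
            }
        }
    }
    Ok(())
}
```

With an initial capacity of 256 bytes, the 2049th value is the first one refused, which triggers a single extension to 512 bytes. As the first commit message notes, nothing here bounds how far the buffer can grow.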

Commits on Jun 14, 2021

  1. add test for boolean writer

    The test ensures that we can write > 2048 rows to a parquet file and
    that when we read the data back, it finishes without hanging (defined as
    taking < 5 seconds).
    
    If we don't want that extra complexity, we could remove the
    thread/channel stuff and just try to read the file and let the test
    runner terminate hanging tests.
    garyanaplan committed Jun 14, 2021
    06d9a33
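
The "thread/channel stuff" mentioned in the test commit could look something like the sketch below. It is an assumption about the general shape, not the test from the PR: `run_with_timeout` is a hypothetical helper, and the closure passed to it stands in for writing more than 2048 BOOLEAN rows and reading them back. It uses only `std::sync::mpsc` and `std::thread` from the standard library.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `work` on a separate thread and wait at most `timeout` for its result.
/// Returns None if the work did not finish in time (or if it panicked, since
/// the sender is then dropped without sending).
pub fn run_with_timeout<T, F>(work: F, timeout: Duration) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (sender, receiver) = mpsc::channel();
    thread::spawn(move || {
        // The send fails only if the receiver was dropped after a timeout,
        // in which case nobody is waiting for the result any more.
        let _ = sender.send(work());
    });
    receiver.recv_timeout(timeout).ok()
}
```

A test would then assert that `run_with_timeout(read_back, Duration::from_secs(5))` returns `Some(..)`, where `read_back` is whatever closure performs the read. Note that a worker that really does hang is never cancelled; it is simply left running until the test process exits, which is one reason the commit message floats the simpler alternative of letting the test runner terminate hanging tests.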

Commits on Jun 15, 2021

  1. fix capacity calculation error in bool encoding

    The values.len() reports the number of values to be encoded and so must
    be divided by 8 (bits in a byte) to determine the effect on the byte
    capacity of the bit_writer.
    garyanaplan committed Jun 15, 2021
    da8c665
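
The arithmetic behind this last fix can be illustrated with a short sketch. The function names are hypothetical and not taken from the PR. The commit message says only "divided by 8"; rounding up to a whole byte is an assumption made here so that a partial final byte is still counted.

```rust
/// Bytes required to hold `num_values` booleans packed one per bit.
/// Rounds up, so 1 to 8 values need 1 byte, 9 to 16 need 2, and so on.
pub fn bytes_for_bools(num_values: usize) -> usize {
    (num_values + 7) / 8
}

/// How many bytes the buffer falls short by if `num_values` more booleans
/// are written. Simplification: treats `bytes_written` as whole bytes and
/// ignores any spare bits left in a partially filled last byte.
pub fn shortfall_bytes(capacity: usize, bytes_written: usize, num_values: usize) -> usize {
    let required = bytes_written + bytes_for_bools(num_values);
    required.saturating_sub(capacity)
}
```

For example, 2048 values fit exactly in the initial 256 bytes, while 2049 values need 257 bytes and so exceed the initial capacity by one. Comparing `values.len()` directly against a byte capacity would overestimate the space needed by a factor of eight.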