feat: Allow reading from stdin with schema inference #10

Merged (2 commits) on Apr 12, 2023

Commits on Mar 6, 2023

  1. feat(csv2parquet): allow reading from piped input

    Piped input does not support Seek out of the box, but Seek is
    required to infer the schema.
    To work around this, we buffer the input if and only if the input
    file does not support Seek.
    Only the lines actually used to infer the schema are buffered,
    which allows reading files larger than memory.
    This works because the arrow crate only seeks twice:
    1. To check whether seek is supported at the start
    2. To reset to the start of the file after schema inference
    
    The seekable buffer wrapper is only used when necessary, so there
    should be no performance penalty for currently supported use cases.
    
    Use cases:
    ```sh
    cat test.csv | csv2parquet /dev/stdin test.parquet
    zstdcat test.csv.zst | csv2parquet /dev/stdin test.parquet
    ```
    
    Resolves domoritz#3
    
    feat: refactor SeekableReader into arrow-tools lib crate
    
    Also refactor schema matching to make it less verbose by using map_err
    instead of match; see json2parquet for before/after
    corneliusroemer committed Mar 6, 2023
    b472f89
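
The buffering workaround described in the commit message above can be illustrated with a minimal sketch. This is not the PR's actual `SeekableReader`: the names `RewindOnce` and `probe_then_reread` are hypothetical, and the sketch assumes the consumer only ever seeks to `SeekFrom::Start(0)` (once as a no-op check before reading, once to rewind after the probe). The exact seek calls made by the arrow crate are not shown in the source, so treat that as an assumption. The helper also uses `map_err` rather than a `match` to attach context to an error, in the spirit of the refactor the message mentions.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical wrapper around a non-seekable reader. It records every byte
/// read until the first real rewind, then replays the recording before
/// continuing with the live input.
struct RewindOnce<R: Read> {
    inner: R,
    buffer: Vec<u8>, // bytes consumed while probing (e.g. schema inference)
    pos: usize,      // replay position inside `buffer`
    recording: bool, // true until the first rewind after some data was read
}

impl<R: Read> RewindOnce<R> {
    fn new(inner: R) -> Self {
        Self { inner, buffer: Vec::new(), pos: 0, recording: true }
    }
}

impl<R: Read> Read for RewindOnce<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        if self.recording {
            // Probe phase: pass bytes through and remember them.
            let n = self.inner.read(out)?;
            self.buffer.extend_from_slice(&out[..n]);
            return Ok(n);
        }
        if self.pos < self.buffer.len() {
            // Replay phase: serve the remembered bytes first.
            let n = out.len().min(self.buffer.len() - self.pos);
            out[..n].copy_from_slice(&self.buffer[self.pos..self.pos + n]);
            self.pos += n;
            if self.pos == self.buffer.len() {
                // Replay finished: release the memory held by the recording.
                self.buffer = Vec::new();
                self.pos = 0;
            }
            return Ok(n);
        }
        // Streaming phase: nothing is buffered any more.
        self.inner.read(out)
    }
}

impl<R: Read> Seek for RewindOnce<R> {
    fn seek(&mut self, target: SeekFrom) -> io::Result<u64> {
        match target {
            SeekFrom::Start(0) if self.recording => {
                // A seek before any read is a no-op; otherwise stop recording
                // and start replaying from the beginning.
                if !self.buffer.is_empty() {
                    self.recording = false;
                }
                self.pos = 0;
                Ok(0)
            }
            _ => Err(io::Error::new(
                io::ErrorKind::Unsupported,
                "only a single rewind to the start is supported",
            )),
        }
    }
}

/// Hypothetical driver: read `probe_len` bytes, rewind, then read everything.
fn probe_then_reread(input: &[u8], probe_len: usize) -> io::Result<(Vec<u8>, Vec<u8>)> {
    let mut reader = RewindOnce::new(input);
    reader.seek(SeekFrom::Start(0))?; // seek-support check, nothing read yet
    let mut head = vec![0u8; probe_len];
    reader
        .read_exact(&mut head)
        .map_err(|e| io::Error::new(e.kind(), format!("schema probe failed: {e}")))?;
    reader.seek(SeekFrom::Start(0))?; // rewind after the probe
    let mut all = Vec::new();
    reader.read_to_end(&mut all)?;
    Ok((head, all))
}
```

Only the probed prefix is ever held in memory, which is why this approach can handle inputs larger than RAM. Any seek other than the rewind returns an error instead of silently misbehaving.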

Commits on Apr 7, 2023

  1. 11e869c