
Written Parquet file way bigger than input files  #1627

Closed
@stevenliebregt

Description


Which part is this question about
The parquet file writer usage

Describe your question
Hi, I'm looking into whether Parquet and Arrow could fit a use case of mine, but I've run into a strange issue for which I can find no answer in the documentation. I have two input files in txt format, where each record spans 4 lines. I have a parser that reads them just fine, and I want to convert that format to a Parquet file. The two input files combined are around 600 MB, but when I write them to a Parquet file, the resulting file is nearly 5 GB, and the process consumes around 6-7 GB of memory while writing. I have turned on compression.

let message_type = "
    message Schema {
        REQUIRED BINARY id (UTF8);
        REQUIRED BINARY header (UTF8);
        REQUIRED BINARY sequence (UTF8);
        REQUIRED BINARY quality (UTF8);
    }
";

let schema = Arc::new(parse_message_type(message_type).unwrap());
let props = Arc::new(
    WriterProperties::builder()
        .set_compression(Compression::SNAPPY)
        .build(),
);

This is my Rust configuration for the writer.
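
The write loop itself is not shown above. As a point of reference only, here is a minimal sketch of how the schema and properties might feed into the low-level SerializedFileWriter; the output file name, the sample values, and the batching below are placeholders, and the exact column-writer methods vary between parquet crate versions.

use std::fs::File;

use parquet::data_type::{ByteArray, ByteArrayType};
use parquet::file::writer::SerializedFileWriter;

// Continues from the `schema` and `props` defined above.
// "records.parquet" and the sample values are placeholders.
let file = File::create("records.parquet").unwrap();
let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();

// One batch of values per column, in schema order: id, header, sequence, quality.
let columns: Vec<Vec<ByteArray>> = vec![
    vec![ByteArray::from("record-1")],
    vec![ByteArray::from("example header")],
    vec![ByteArray::from("ACGT")],
    vec![ByteArray::from("!!!!")],
];

let mut row_group = writer.next_row_group().unwrap();
for values in &columns {
    // next_column() yields one writer per column, in schema order.
    let mut col = row_group.next_column().unwrap().expect("column writer");
    col.typed::<ByteArrayType>().write_batch(values, None, None).unwrap();
    col.close().unwrap();
}
row_group.close().unwrap();
writer.close().unwrap();

This sketch writes everything as a single row group; in real code the records would be buffered and flushed in batches, and how many records go into each row group affects both memory use and file size.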

Labels: question (Further information is requested)
