Description
Which part is this question about?
The parquet file writer usage
Describe your question
Hi, I'm looking into whether Parquet and Arrow could fit a use case of mine, but I've run into a strange issue for which I can find no answer in the documentation. I have two input files in txt format, where each record spans 4 lines. I have a parser that reads them just fine, and I want to convert that format to a Parquet file. The two input files are around 600 MB combined, but when I write them to a Parquet file, the resulting file is nearly 5 GB, and the writer also consumes around 6 to 7 GB of memory while running. I have turned on compression.
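For reference, a reader for this kind of 4-line record format could look like the sketch below; the Record type, its field names, and read_records are illustrative names (they mirror the schema that follows), not my actual parser.

use std::io::{BufRead, BufReader, Read};

// Illustrative record type; the field names mirror the Parquet schema below.
struct Record {
    id: String,
    header: String,
    sequence: String,
    quality: String,
}

// Streams the input as records of 4 consecutive lines. For brevity this
// sketch silently stops on I/O errors instead of reporting them.
fn read_records(input: impl Read) -> impl Iterator<Item = Record> {
    let mut lines = BufReader::new(input).lines();
    std::iter::from_fn(move || {
        Some(Record {
            id: lines.next()?.ok()?,
            header: lines.next()?.ok()?,
            sequence: lines.next()?.ok()?,
            quality: lines.next()?.ok()?,
        })
    })
}

The schema and writer properties I use are below.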
let message_type = "
    message Schema {
        REQUIRED BINARY id (UTF8);
        REQUIRED BINARY header (UTF8);
        REQUIRED BINARY sequence (UTF8);
        REQUIRED BINARY quality (UTF8);
    }
";
let schema = Arc::new(parse_message_type(message_type).unwrap());
let props = Arc::new(
    WriterProperties::builder()
        .set_compression(Compression::SNAPPY)
        .build(),
);
This is my Rust configuration for the writer.
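For completeness, a minimal end-to-end writer built around this configuration could look like the sketch below, reusing Record and read_records from the sketch above. The output path, input file name, and batch size are assumptions for illustration, and the column-writer calls follow recent versions of the parquet crate (older releases expose a slightly different next_column API). The idea is to flush one bounded row group at a time, so the writer's memory use is proportional to the batch size rather than to the whole input.

use std::{fs::File, sync::Arc};

use parquet::{
    basic::Compression,
    data_type::{ByteArray, ByteArrayType},
    file::{properties::WriterProperties, writer::SerializedFileWriter},
    schema::parser::parse_message_type,
};

// `Record` and `read_records` are defined in the parser sketch above.

fn write_parquet(
    records: impl Iterator<Item = Record>,
) -> Result<(), Box<dyn std::error::Error>> {
    let message_type = "
        message Schema {
            REQUIRED BINARY id (UTF8);
            REQUIRED BINARY header (UTF8);
            REQUIRED BINARY sequence (UTF8);
            REQUIRED BINARY quality (UTF8);
        }
    ";
    let schema = Arc::new(parse_message_type(message_type)?);
    let props = Arc::new(
        WriterProperties::builder()
            .set_compression(Compression::SNAPPY)
            .build(),
    );
    // Output path is illustrative.
    let file = File::create("records.parquet")?;
    let mut writer = SerializedFileWriter::new(file, schema, props)?;

    // Assumption: flush a row group every BATCH_SIZE records so memory
    // stays bounded instead of buffering the entire input first.
    const BATCH_SIZE: usize = 50_000;
    let mut batch: Vec<Record> = Vec::with_capacity(BATCH_SIZE);
    for record in records {
        batch.push(record);
        if batch.len() == BATCH_SIZE {
            flush_row_group(&mut writer, &batch)?;
            batch.clear();
        }
    }
    if !batch.is_empty() {
        flush_row_group(&mut writer, &batch)?;
    }
    writer.close()?;
    Ok(())
}

// Writes one row group with the four UTF8 columns in schema order.
fn flush_row_group(
    writer: &mut SerializedFileWriter<File>,
    batch: &[Record],
) -> Result<(), Box<dyn std::error::Error>> {
    let mut row_group = writer.next_row_group()?;
    let columns: [Vec<ByteArray>; 4] = [
        batch.iter().map(|r| ByteArray::from(r.id.as_str())).collect(),
        batch.iter().map(|r| ByteArray::from(r.header.as_str())).collect(),
        batch.iter().map(|r| ByteArray::from(r.sequence.as_str())).collect(),
        batch.iter().map(|r| ByteArray::from(r.quality.as_str())).collect(),
    ];
    for values in &columns {
        let mut column = row_group.next_column()?.expect("schema defines 4 columns");
        column.typed::<ByteArrayType>().write_batch(values, None, None)?;
        column.close()?;
    }
    row_group.close()?;
    Ok(())
}

// Usage (file name is illustrative):
// let input = File::open("reads_1.txt")?;
// write_parquet(read_records(input))?;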