Written Parquet file way bigger than input files #1627

Closed
stevenliebregt opened this issue Apr 28, 2022 · 3 comments
Labels
question Further information is requested

Comments

@stevenliebregt

Which part is this question about
The parquet file writer usage

Describe your question
Hi, I'm looking into whether parquet and arrow could fit a use case of mine, but I've run into a strange issue for which I can find no answer in the documentation. I have two input files in txt format, where each record spans 4 lines. I have a parser that reads them just fine, and I want to convert that format to a Parquet file. The two input files are around 600MB combined, but when I write them to a Parquet file, the resulting file is nearly 5GB, and the process also consumes around 6-7GB of memory while writing. I have turned on compression.

use std::sync::Arc;

use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;
use parquet::schema::parser::parse_message_type;

// Schema: every record is four required UTF-8 string fields.
let message_type = "
    message Schema {
        REQUIRED BINARY id (UTF8);
        REQUIRED BINARY header (UTF8);
        REQUIRED BINARY sequence (UTF8);
        REQUIRED BINARY quality (UTF8);
    }
";

let schema = Arc::new(parse_message_type(message_type).unwrap());
let props = Arc::new(
    WriterProperties::builder()
        .set_compression(Compression::SNAPPY)
        .build(),
);

This is my Rust configuration for the writer.
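For context, a minimal sketch of how such a schema and properties are typically handed to the crate's SerializedFileWriter (the file name is a placeholder and the per-column write loop is only outlined, since the row-group API details vary between crate versions):

use std::fs::File;
use parquet::file::writer::SerializedFileWriter;

// `schema` and `props` are the Arc-wrapped values built above.
let file = File::create("records.parquet").unwrap();
let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();

// For each batch of parsed records: start a row group with
// writer.next_row_group(), write the id / header / sequence / quality
// values through the per-column writers, then close the row group.

writer.close().unwrap();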

@stevenliebregt added the question label on Apr 28, 2022
@tustvold
Contributor

Some ideas to try:

  • Disable dictionary encoding for columns that don't have repeated values
  • Use writer version 2, which has better string encoding
  • Represent the id / sequence as an integral type instead of a variable length string
  • Try without snappy, as compression may not always yield benefits
  • Maybe try writing the data using something like pyarrow to determine if this is something specific to the Rust implementation

Without the data it is hard to say for sure what is going on, but ignoring compression, Parquet has at least a 4-byte length overhead per string value, so with lots of small strings that overhead adds up quickly.
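A rough sketch of what those property tweaks could look like with the parquet crate's WriterProperties builder (the column paths match the schema above; treat this as a sketch to adapt, not a drop-in fix):

use std::sync::Arc;

use parquet::basic::Compression;
use parquet::file::properties::{WriterProperties, WriterVersion};
use parquet::schema::types::ColumnPath;

let props = Arc::new(
    WriterProperties::builder()
        // Writer version 2 enables the newer data page format and encodings.
        .set_writer_version(WriterVersion::PARQUET_2_0)
        // Dictionary encoding only pays off when values repeat, so turn it
        // off for columns that are effectively unique.
        .set_column_dictionary_enabled(ColumnPath::from("id"), false)
        .set_column_dictionary_enabled(ColumnPath::from("sequence"), false)
        // Compare file sizes with and without snappy.
        .set_compression(Compression::UNCOMPRESSED)
        .build(),
);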

@stevenliebregt
Author

Thanks for the answer, I'll give those ideas a try. If I find it's a problem specific to Rust, I'll create an issue.

@Dandandan
Contributor

Also, try zstd, which often gives quite a bit better compression than snappy.
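For example, assuming the parquet crate is built with its zstd feature enabled (newer crate versions take a level argument, e.g. Compression::ZSTD(ZstdLevel::default())):

use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

let props = WriterProperties::builder()
    .set_compression(Compression::ZSTD)
    .build();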
