Getting started with Parquet development
If you are completely new to Parquet, we recommend starting with these documents:
- The striping and assembly algorithms from the Dremel paper (what Parquet is based on)
- Dremel made simple with Parquet, for a better understanding of Parquet and especially of what repetition and definition levels are
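As a rough illustration of what those documents cover, here is a minimal sketch of striping and assembly for a single column. It assumes a schema in the spirit of the Dremel paper, with an optional group `links` containing a repeated integer field `forward`, so the maximum definition level is 2 and the maximum repetition level is 1. The function names `stripe_forward` and `assemble_forward` and the dictionary-based record layout are hypothetical and are not part of any Parquet library.

```python
# Sketch only: one column (links.forward) of an assumed schema
#   optional group links { repeated int64 forward; }
# Definition level: 0 = links missing, 1 = links present but forward empty,
#                   2 = a forward value is present.
# Repetition level: 0 = first value of a new record, 1 = another value in the same list.


def stripe_forward(records):
    """Turn records into (value, repetition_level, definition_level) triples."""
    triples = []
    for record in records:
        links = record.get("links")
        if links is None:
            # The optional group is absent: nothing on the path is defined.
            triples.append((None, 0, 0))
            continue
        forward = links.get("forward") or []
        if not forward:
            # The group exists but the repeated field has no values.
            triples.append((None, 0, 1))
            continue
        for index, value in enumerate(forward):
            triples.append((value, 0 if index == 0 else 1, 2))
    return triples


def assemble_forward(triples):
    """Rebuild the records from the triples produced by stripe_forward."""
    records = []
    for value, repetition, definition in triples:
        if repetition == 0:
            # Repetition level 0 always starts a new record.
            records.append({} if definition == 0 else {"links": {"forward": []}})
        if definition == 2:
            records[-1]["links"]["forward"].append(value)
    return records


records = [
    {"links": {"forward": [20, 40, 60]}},
    {"links": {"forward": []}},
    {},
]

triples = stripe_forward(records)
for triple in triples:
    print(triple)

# The levels carry enough information to restore the original nesting.
assert assemble_forward(triples) == records
```

Note that the values alone would not distinguish an empty list from a missing group. The definition level is what preserves that difference, and the repetition level is what marks record boundaries.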
Encodings and types
If you are looking for a description of Parquet encodings, please follow this link.
To understand how Parquet represents rich logical types, read this.
There are already working implementations in other languages. We find them useful for checking that we are doing things right, or when we are stuck understanding how a particular feature is supposed to work.
parquet-format is the official specification repository, containing the Thrift definitions for the data structures within a Parquet file. This spec is referenced by any library that implements Parquet.
parquet-mr is the official Java implementation. It is somewhat over-engineered, but it is the most stable.
fastparquet is probably the best implementation for Python, and it is extremely easy to follow. It was also our library of choice for working with the Parquet format (before parquet-dotnet was created, of course :) )
parquet-cpp is a C++ implementation that, in our experience, struggles with both code quality and compatibility. We would not recommend looking at it if you are new to Parquet.
3rd Party Libraries
Snappy Sharp is used to compress and decompress data with the Snappy algorithm.