-
Notifications
You must be signed in to change notification settings - Fork 421
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2362: Clarify parquet encoding #217
Conversation
Would you mind first create an issue like: https://issues.apache.org/jira/browse/PARQUET-2299 or use MINOR? |
This change looks good to me.
Does this means value in data page is same as position in dictionary page? 🤔 Also cc @wgtmac @gszadovszky |
I will create a related issue once my JIRA account request is approved.
I think so. The data page contains dictionary code (i.e. offset in dictionary page ) |
Encodings.md
Outdated
@@ -32,7 +32,7 @@ intended to be the simplest encoding. Values are encoded back to back. | |||
|
|||
The plain encoding is used whenever a more efficient encoding can not be used. It | |||
stores the data in the following format: | |||
- BOOLEAN: [Bit Packed](#BITPACKED), LSB first | |||
- BOOLEAN: [Bit Packed](#BITPACKED) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe LSB first is not descriptive enough but at least we shall not remove it. What it wanted to say that for BOOLEAN
s it stores the bits from the least significant bit to the most significant one, instead of what is specified for BIT_PACKED
. It practically means that the BOOLEAN
bits are just written from left to right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, LSB first is pretty understandable and leaving it out makes the spec ambiguous.
Maybe it's only how I read it, but for me, LSB first pretty clearly conveys that "the first value will be stored in the least significant bit", so If I have the byte 0b00000001
, this byte encodes 8 values, with the first one being 1 and all others being 0.
Maybe it helps spelling out ot more clearly, e.g.,
LSB first, i.e., the first value is stored in the least significant bit.
but for someone writing bit fiddling logic - which someone writing an encoder/decoder has to do - LSB first
should be descriptive enough on its own IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL. Thanks a lot!
> Dictionary page format: the entries in the dictionary - in dictionary order - using the plain encoding. The dictionary entries are not sorted (or at least not always sorted). > There is no padding between values (except for the last byte) which is padded with 0s. Minor change. Signed-off-by: Letian Jiang <letian.jiang@outlook.com>
04e78b8
to
233a864
Compare
Made some updates. Please take another look. @mapleFU @gszadovszky @JFinis |
> Dictionary page format: the entries in the dictionary - in dictionary order - using the plain encoding. The dictionary entries are not sorted (or at least not always sorted). > There is no padding between values (except for the last byte) which is padded with 0s. Minor change. Signed-off-by: Letian Jiang <letian.jiang@outlook.com>
233a864
to
ce99bf3
Compare
The dictionary entries are not sorted (or at least not always sorted).
Minor change.
Jira
Commits