Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2362: Clarify parquet encoding #217

Merged
merged 1 commit into from
Oct 14, 2023

Conversation

letian-jiang
Copy link
Contributor

@letian-jiang letian-jiang commented Oct 3, 2023

Dictionary page format: the entries in the dictionary - in dictionary order - using the plain encoding.

The dictionary entries are not sorted (or at least not always sorted).

There is no padding between values (except for the last byte) which is padded with 0s.

Minor change.

Jira

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@mapleFU
Copy link
Member

mapleFU commented Oct 3, 2023

Would you mind first create an issue like: https://issues.apache.org/jira/browse/PARQUET-2299 or use MINOR?

@mapleFU
Copy link
Member

mapleFU commented Oct 3, 2023

There is no padding between values (except for the last byte) which is padded with 0s.

This change looks good to me.

Dictionary page format: the entries in the dictionary - in dictionary order - using the plain encoding.

Does this means value in data page is same as position in dictionary page? 🤔

Also cc @wgtmac @gszadovszky

@letian-jiang
Copy link
Contributor Author

Would you mind first create an issue like: https://issues.apache.org/jira/browse/PARQUET-2299 or use MINOR?

I will create a related issue once my JIRA account request is approved.

Does this means value in data page is same as position in dictionary page? 🤔

I think so. The data page contains dictionary code (i.e. offset in dictionary page )

Encodings.md Outdated
@@ -32,7 +32,7 @@ intended to be the simplest encoding. Values are encoded back to back.

The plain encoding is used whenever a more efficient encoding can not be used. It
stores the data in the following format:
- BOOLEAN: [Bit Packed](#BITPACKED), LSB first
- BOOLEAN: [Bit Packed](#BITPACKED)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe LSB first is not descriptive enough but at least we shall not remove it. What it wanted to say that for BOOLEANs it stores the bits from the least significant bit to the most significant one, instead of what is specified for BIT_PACKED. It practically means that the BOOLEAN bits are just written from left to right.

Copy link
Contributor

@JFinis JFinis Oct 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, LSB first is pretty understandable and leaving it out makes the spec ambiguous.

Maybe it's only how I read it, but for me, LSB first pretty clearly conveys that "the first value will be stored in the least significant bit", so If I have the byte 0b00000001, this byte encodes 8 values, with the first one being 1 and all others being 0.

Maybe it helps spelling out ot more clearly, e.g.,

LSB first, i.e., the first value is stored in the least significant bit.

but for someone writing bit fiddling logic - which someone writing an encoder/decoder has to do - LSB first should be descriptive enough on its own IMHO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL. Thanks a lot!

@letian-jiang letian-jiang changed the title clarify parquet encoding PARQUET-2299: Clarify parquet encoding Oct 6, 2023
@letian-jiang letian-jiang marked this pull request as ready for review October 6, 2023 19:28
letian-jiang added a commit to letian-jiang/parquet-format that referenced this pull request Oct 6, 2023
> Dictionary page format: the entries in the dictionary - in dictionary order - using the plain encoding.

The dictionary entries are not sorted (or at least not always sorted).

> There is no padding between values (except for the last byte) which is padded with 0s.

Minor change.

Signed-off-by: Letian Jiang <letian.jiang@outlook.com>
@letian-jiang
Copy link
Contributor Author

Made some updates. Please take another look. @mapleFU @gszadovszky @JFinis

@letian-jiang letian-jiang changed the title PARQUET-2299: Clarify parquet encoding PARQUET-2362: Clarify parquet encoding Oct 7, 2023
> Dictionary page format: the entries in the dictionary - in dictionary order - using the plain encoding.

The dictionary entries are not sorted (or at least not always sorted).

> There is no padding between values (except for the last byte) which is padded with 0s.

Minor change.

Signed-off-by: Letian Jiang <letian.jiang@outlook.com>
@wgtmac wgtmac merged commit 77949ba into apache:master Oct 14, 2023
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants