Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-1011: [FORMAT] fix typo and mistakes in Layout.md #673

Closed
wants to merge 1 commit into from

Conversation

@cloud-fan
Copy link
Contributor

commented May 11, 2017

according to the specification in the Null bitmaps section:

Bitmaps are to be initialized to be all unset at allocation time (this includes padding).

null bitmaps should always be padded with 0

| 00001011 | 0 (padding) | unspecified |
|Byte 0 (validity bitmap) | Bytes 1-63 |
|-------------------------|-----------------------|
| 00001011 | 0 (padding) |

This comment has been minimized.

Copy link
@wesm

wesm May 11, 2017

Member

IIRC this was a deliberate choice so that one can rely upon accurate null counts using a byte-wise hardware popcount instruction. Though I'm not sure how safe that is in general since we can do:

array->Slice(0, 1)

which would yield a slice of length 1, whose additional bytes should not be inspected in algorithms. In an IPC setting, the padding to byte 63 should be all zeros (see e.g. https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L220), so I think this is fine. We will want to make sure in the IPC writers that we do not write possibly unspecified in-memory bytes from the validity bitmap, instead truncating the buffer to the effective byte range and writing zero padding. It might be worth adding some language to this document about this.

Per our prior discussion about padding -- since this document is about what data gets written to stream or file format, it may be good to be more rigid about zero-ing out the padding bytes that go on the wire. This places a slight additional burden on the writers, though

This comment has been minimized.

Copy link
@cloud-fan

cloud-fan May 12, 2017

Author Contributor

I think ideally we only need to use zero padding for null bitmap in a word(8 bytes) boundary, which means the previous code is accurate.

However, to make the rule simpler, I think it also makes sense to require all the padding bytes to be 0 for null bitmap. This is also what we specified in the "Null bitmaps" section:

A 1 (set bit) for index j indicates that the value is not null, while a 0 (bit not set) indicates that it is null. Bitmaps are to be initialized to be all unset at allocation time (this includes padding).

@wesm

This comment has been minimized.

Copy link
Member

commented May 11, 2017

Can you open a JIRA? The build failure is unrelated here

@cloud-fan cloud-fan changed the title [FORMAT] fix typo and mistakes in Layout.md [ARROW-1011][FORMAT] fix typo and mistakes in Layout.md May 12, 2017

@wesm

This comment has been minimized.

Copy link
Member

commented May 12, 2017

We use the "ARROW-1011:" style for the PR titles in our merge script, if you could change that. Thanks!

@wesm
wesm approved these changes May 12, 2017
Copy link
Member

left a comment

+1. I will comment in the other JIRAs about padding when I can to clarify my concerns there

@julienledem
Copy link
Member

left a comment

+1

@cloud-fan cloud-fan changed the title [ARROW-1011][FORMAT] fix typo and mistakes in Layout.md ARROW-1011: [FORMAT] fix typo and mistakes in Layout.md May 13, 2017

@asfgit asfgit closed this in 99ff240 May 14, 2017

jeffknupp added a commit to jeffknupp/arrow that referenced this pull request Jun 3, 2017
ARROW-1011: [FORMAT] fix typo and mistakes in Layout.md
according to the specification in the `Null bitmaps` section:
> Bitmaps are to be initialized to be all unset at allocation time (this includes padding).

null bitmaps should always be padded with 0

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#673 from cloud-fan/first and squashes the following commits:

8566f7c [Wenchen Fan] fix typo and mistakes in Layout.md
pcmoritz added a commit to pcmoritz/arrow that referenced this pull request Jun 11, 2017
ARROW-1011: [FORMAT] fix typo and mistakes in Layout.md
according to the specification in the `Null bitmaps` section:
> Bitmaps are to be initialized to be all unset at allocation time (this includes padding).

null bitmaps should always be padded with 0

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#673 from cloud-fan/first and squashes the following commits:

8566f7c [Wenchen Fan] fix typo and mistakes in Layout.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.