Format: Are the nulls bits 0 or 1 for null values? #15479

Closed
asfimport opened this issue Mar 11, 2016 · 6 comments

Comments

@asfimport

As brought up by Dan Robinson on the mailing list (thank you for catching this!), the format documents are inconsistent with the imported ValueVectors code in how they represent nulls. Since I drafted these format documents initially, I'll take the blame for the inconsistency:

  • Drill / ValueVectors uses the value 0 for null data, and 1 for non-null data
  • The format document currently states the opposite (values are null if the bit is set)

I can see arguments both ways, but one argument for the ValueVectors style is that values must be explicitly set to be non-null, versus uninitialized values being accidentally interpreted as being non-null. When initializing a bitmap, one can memset the bits to 0, then set them to 1 when non-null values are appended during construction.
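
As a rough illustration of that construction pattern, here is a minimal Python sketch. The `ValidityBitmapBuilder` class and its method names are hypothetical, and the least-significant-bit-first numbering within each byte is an assumption made for the example, not something stated in this issue.

```python
class ValidityBitmapBuilder:
    """Hypothetical builder: bit i is 1 when slot i holds a non-null value."""

    def __init__(self, capacity):
        # bytearray(n) is zero-filled, the Python analogue of
        # memset(buf, 0, n), so every slot starts out as null.
        self.bits = bytearray((capacity + 7) // 8)
        self.length = 0

    def append(self, is_valid):
        if self.length >= len(self.bits) * 8:
            raise IndexError("bitmap is full")
        if is_valid:
            # Assumed bit order: least-significant bit first within a byte.
            self.bits[self.length // 8] |= 1 << (self.length % 8)
        self.length += 1

    def is_valid(self, i):
        return bool((self.bits[i // 8] >> (i % 8)) & 1)


builder = ValidityBitmapBuilder(capacity=8)
for value in [10, None, 30]:
    # Only non-null appends flip a bit; nulls need no extra work.
    builder.append(value is not None)
```

Slots that were never appended still read as null here, which is the safety property argued for above.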

Reporter: Wes McKinney / @wesm
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-62. Please see the migration documentation for further details.

@asfimport

Jacques Nadeau / @jacques-n:
I consider the bitmap to be a validity map as opposed to a null map. I've also seen a couple places where it is nice to zero out values that are null using the zero in the bitmap without a condition... although I can't remember where we took advantage of this previously.
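
One possible reading of "zero out values that are null ... without a condition" is a branch-free mask, sketched below in Python. The function name `zero_out_nulls` is hypothetical, and the validity bits are held one per list element rather than packed, purely for readability.

```python
def zero_out_nulls(values, validity):
    """Return values with null slots forced to 0, without a per-element branch.

    validity holds one int per slot (1 = valid, 0 = null); it is an
    unpacked stand-in for the bitmap, used here as an assumption.
    """
    # -1 acts as an all-ones mask and -0 is 0, so `v & -bit`
    # either keeps v unchanged or clears it.
    return [v & -bit for v, bit in zip(values, validity)]


values = [7, 99, -3, 42]   # slots 1 and 3 hold arbitrary leftover data
validity = [1, 0, 1, 0]
cleaned = zero_out_nulls(values, validity)
```

With nulls encoded as 1 instead, the mask would first have to be inverted, which is the extra step the validity-map convention avoids.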

@asfimport

Dan Robinson / @danrobinson:
Being able to bitwise-& against the null bitmask definitely seems nice, although (returning to my other idea from the e-mail list) if the spec required values in nulled slots to be zeroed out, you wouldn't even have to do this.

@asfimport

Dan Robinson / @danrobinson:
For whatever it's worth: it seems PostgreSQL uses 0 in a null bitmap to indicate null values (http://www.postgresql.org/docs/8.0/static/storage-page-layout.html) while MySQL and SQL Server use 1 (https://dev.mysql.com/doc/internals/en/null-bitmap.html, http://www.sqlpassion.at/archive/2011/06/29/the-mystery-of-the-null-bitmap-mask/). And of course Drill uses 0, while Numpy uses 1. So there does not seem to be an established convention yet. IMHO I guess I think the validity-map approach that uses 0 is a little more elegant.

@asfimport

Wes McKinney / @wesm:
Since we already have production code (i.e. Drill) using 0 as null, and it's consistent with Postgres, I'm inclined to stick with that.

I expect that the null bitmap will also be used in practice in conjunction with evaluated predicates, so in aggregations you will include only values that pass the predicate and are not null. If nulls are 1, then you need to use included[i] & ~nulls[i] rather than included[i] & valid[i].
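
To make the comparison concrete, here is a small Python sketch of a filtered sum under both conventions. The `masked_sum` helper and the sample data are hypothetical, and each bitmap is modeled as a plain integer whose bit i describes slot i.

```python
values = [5, 8, 13, 21]

included = 0b1011          # predicate passed for slots 0, 1 and 3
valid = 0b0111             # validity map: slots 0, 1 and 2 are non-null
nulls = ~valid & 0b1111    # the same information with nulls encoded as 1


def masked_sum(values, mask):
    """Sum the values whose bit is set in mask (hypothetical helper)."""
    return sum(v for i, v in enumerate(values) if (mask >> i) & 1)


# Validity convention: a single AND combines the two bitmaps.
sum_validity = masked_sum(values, included & valid)

# Nulls-are-1 convention: the null map must be inverted first.
sum_nullmap = masked_sum(values, included & ~nulls)
```

Both expressions select the same slots; the difference is only the extra inversion needed when set bits mean null.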

@asfimport

Wes McKinney / @wesm:
See patch #34

@asfimport

Wes McKinney / @wesm:
Issue resolved by pull request 34
#34
