Skip to content

ORC-471: [C++] StructColumnWriter: no more rows than defined#368

Closed
rixed wants to merge 1 commit intoapache:masterfrom
rixed:master
Closed

ORC-471: [C++] StructColumnWriter: no more rows than defined#368
rixed wants to merge 1 commit intoapache:masterfrom
rixed:master

Conversation

@rixed
Copy link
Copy Markdown
Contributor

@rixed rixed commented Feb 23, 2019

When a struct row is null, the column reader expect the children to have
no row at all. So I guess the clients of the lib must not write those
rows when writing and ORC file (or the reader will fail with too litle data
to read). So, use the null counter (that has to be computed for stats
anyway) to shorten the children batches when writing them.

Makes it then possible to nest structures arbitrarily deep.

@rixed rixed changed the title ORC-471: StructColumnWriter: no more rows than defined ORC-471: [C++] StructColumnWriter: no more rows than defined Feb 23, 2019
@xndai
Copy link
Copy Markdown
Contributor

xndai commented Feb 26, 2019

I don't think this is correct. Look at StructColumnReader::next(). It always calls child readers with numValues. You should be able to verify that through a unit test.

Now I think you have a point that when struct column is NULL, we actually don't need to record any values for the child columns. But unfortunately such change would break compatibility, and needs to be seriously evaluated. As of today, null struct will also record all its children as null.

@rixed
Copy link
Copy Markdown
Contributor Author

rixed commented Feb 27, 2019

Ok, so structs set all their children to NULL. That was my first idea but I got really strange results with empty rows being read back after null rows. But now that you confirm that the API is indeed that struct children are supposed to be set (to null) on null rows, I can look for the bug in (hopefully) the right direction.

So, looking closer to what is happening at decoding time, it appears that the BooleanRleDecoder of a child row will receive the parent notNull vector (known as incomingMask in ColumnReader::next), and will skip over the NULL row when decoding its own notNull vector.
This is very good, but the equivalent mechanism seems to be missing from the ColumnWriter, which is always passed nullptr as its parent notNull. Therefore, I'm reading additional null rows in the children after a null row in the parent.

And indeed, if I add an additional argument to ColumnWriter::add for the incomingMask, and propagate the notNull from StructColumnWriter to children, then it also fixes the issue.

I updated this PR in that sense. Please have another look.
Maybe my understanding of the API is still wrong though.

When a struct row is null, the column reader for its children skips over
the null rows when decoding its own null vector. But currently no
ColumnWriters take into consideration its parent notNull when encoding its
own notNull vector. This result in additional empty rows being read out
from nested structures after a null row.

This patch fixes this, by changing ColumnWriter::add so that
StructColumnWriter propagates its notNull vector to its children so that
they can skip over null rows when encoding their own notNull.
@xndai
Copy link
Copy Markdown
Contributor

xndai commented Feb 27, 2019

Can you please add a unit test to demonstrate the problem? That would be very helpful for understanding the issue.

@rixed
Copy link
Copy Markdown
Contributor Author

rixed commented Feb 27, 2019

Liborc test infrastructure is rather primitive.
In my own project I have a code generator that produce an ORC encoder/decoder based on some type description and then test that it can encode then decode and retrieve the original data.
You should maybe add something like this in liborc if you want to be able to detect encoding bugs that happens only with deep compound types (for some quite small values of "deep"). As much as I'd like to contribute a more exhaustive testing framework for the C++ side of liborc, it's very low on my employer list of priorities :)

The simplest test that revealed the problem was using this type (orc notation) struct<name:array<tinyint>> with this dataset:

{"name": [1, 2]}
null
{"name": [3, 4]}
{"name": [5, 6]}

With liborc 1.5.4, the result was:

{"name": [1, 2]}
null
{"name": null}
{"name": [3, 4]}

Here is the corresponding orc file:

$ xxd /tmp/not_ok.orc
00000000: 4f52 4305 0000 ffb0 0500 00ff b007 0000  ORC.............
00000010: 0000 0205 0000 fffc 0f00 00fa 0102 0304  ................
00000020: 0506 4a00 00e3 62e3 6010 6090 60e5 02d1  ..J...b.`.`.`...
00000030: 8c60 9a09 48b3 81f9 4c60 3e23 90e6 1262  .`..H...L`>#...b
00000040: 01a9 4326 a598 dd7d 4300 3900 000a 1a0a  ..C&...}C.9.....
00000050: 0408 0350 010a 0408 0050 000a 0c08 0612  ...P.....P......
00000060: 0608 0210 0c18 2a50 00a2 0000 e360 1670  ......*P.....`.p
00000070: 97e2 e260 1660 9090 57d0 d060 5112 e4e0  ...`.`..W..`Q...
00000080: 1162 6494 62c9 4bcc 4d55 60d0 6030 6050  .bd.b.K.MU`.`0`P
00000090: e2e6 e012 6264 8272 3838 1821 2c03 162b  ....bd.r88.!,..+
000000a0: 160e e600 4620 c910 c060 c5c3 c126 c4c6  ....F ...`...&..
000000b0: c124 c023 a115 c0e0 c0e0 c108 0008 5410  .$.#..........T.
000000c0: 0118 8080 0422 0200 0b28 1f30 0682 f403  ....."...(.0....
000000d0: 034f 5243 17                             .ORC.

With the patched liborc the result was OK and here is the corresponding ORC file:

$ xxd /tmp/ok.orc
00000000: 4f52 4305 0000 ffb0 0500 00ff e005 0000  ORC.............
00000010: 0002 0500 00ff fc0f 0000 fa01 0203 0405  ................
00000020: 0652 0000 e362 e360 1060 9060 e502 d18c  .R...b.`.`.`....
00000030: 609a 094a 3308 3081 f98c 409a 4b88 05a4  `..J3.0...@.K...
00000040: 0e48 3281 4920 5b8a d9dd 3704 0039 0000  .H2.I [...7..9..
00000050: 0a1a 0a04 0803 5001 0a04 0800 5000 0a0c  ......P.....P...
00000060: 0806 1206 0802 100c 182a 5000 a200 00e3  .........*P.....
00000070: 6016 f092 e2e2 6016 6090 9053 d0d1 6051  `.....`.`..S..`Q
00000080: 12e4 e011 6264 9462 c94b cc4d 5560 d060  ....bd.b.K.MU`.`
00000090: 3060 50e2 e6e0 1262 6482 7238 3818 212c  0`P....bd.r88.!,
000000a0: 0316 2b16 0ee6 0046 20c9 10c0 60c5 c3c1  ..+....F ...`...
000000b0: 26c4 c6c1 24c0 23a1 15c0 e0c0 e0c1 0800  &...$.#.........
000000c0: 0854 1001 1880 8004 2202 000c 281f 3006  .T......"...(.0.
000000d0: 82f4 0303 4f52 4317                      ....ORC.

You can easily reverse each of those files with xxd -r.

The code that wrote those ORC is too big to be posted here (as I said, it's generated from the type description). But I will have a look tomorrow if I can quickly extract from that mass of generated code something that's small enough to be palatable, that could allow you to single-step into the code at least.

@rixed
Copy link
Copy Markdown
Contributor Author

rixed commented Feb 27, 2019

Here is something that should allow you to single-step the code and/or notice my error in how I write those orc files:

https://gist.github.com/rixed/35fc959f25ed4991870e8fbf972ab78e

@xndai
Copy link
Copy Markdown
Contributor

xndai commented Feb 28, 2019

Ok, thanks for providing a test case. I will look into that in the next day or two.

@xndai
Copy link
Copy Markdown
Contributor

xndai commented Feb 28, 2019

Thanks @rixed Now I understand the problem. Your fix looks fine except that we need a unit test to cover this scenario. Please take a look at orc/c++/test/TestWriter.cc. You can add a test case with schema like struct<struct>. With simple test data like you mentioned, we can demonstrate the issue and also validate your fix.

Thanks again for reporting this problem.

@rixed
Copy link
Copy Markdown
Contributor Author

rixed commented Mar 1, 2019

Thank you for confirming that it's indeed a bug in liborc.
You need a unit test but I do not. :-p

@rixed rixed closed this Mar 1, 2019
@xndai
Copy link
Copy Markdown
Contributor

xndai commented Mar 1, 2019

What I mean is for this pull request to be accepted, you will need a corresponding uint test created in orc c++.

@rixed
Copy link
Copy Markdown
Contributor Author

rixed commented Mar 1, 2019

That's indeed how I understood it.

What I meant is: I have reported a bug, proposed a patch, and offered an explanation about why the c++ testing suite failed to detect such a trivial bug. I think I've done enough. Adding another unit test to comply with whatever organisation guidelines is not worth my time. I will therefore wait until a committer who cares about that lib comes up with his own fix, and until then uses my own version.

Happy hacking!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants