ORC-471: [C++] StructColumnWriter: no more rows than defined #368
rixed wants to merge 1 commit into apache:master from
|
I don't think this is correct. Look at StructColumnReader::next(). It always calls child readers with numValues. You should be able to verify that through a unit test. Now I think you have a point that when a struct column is NULL, we actually don't need to record any values for the child columns. But unfortunately such a change would break compatibility and needs to be seriously evaluated. As of today, a null struct also records all its children as null. |
|
Ok, so structs set all their children to NULL. That was my first idea, but I got really strange results, with empty rows being read back after null rows. But now that you confirm that the API is indeed that struct children are supposed to be set (to null) on null rows, I can look for the bug in (hopefully) the right direction. So, looking closer at what is happening at decoding time, it appears that the BooleanRleDecoder of a child row will receive the parent notNull vector (known as [...]). And indeed, if I add an additional argument to ColumnWriter::add for the incomingMask, and propagate the notNull from StructColumnWriter to children, then it also fixes the issue. I updated this PR in that sense. Please have another look. |
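The decoding behaviour described in this comment can be illustrated with a small standalone sketch. It is not liborc code: `decodeNotNull` is a hypothetical stand-in for a child's present-stream decoder, and the assumption is simply that a row masked out by the parent is reported as null without consuming a stored bit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a child's present-stream decoder (not liborc API).
// `bits` is the child's stored present stream; `incomingMask` is the parent's
// notNull vector, or nullptr when the parent has no nulls.
std::vector<char> decodeNotNull(const std::vector<char>& bits,
                                const std::vector<char>* incomingMask,
                                std::size_t numValues) {
  std::vector<char> notNull(numValues, 0);
  std::size_t pos = 0;  // index of the next unread stored bit
  for (std::size_t i = 0; i < numValues; ++i) {
    if (incomingMask != nullptr && !(*incomingMask)[i]) {
      continue;  // parent row is null: child row is null, no bit is consumed
    }
    assert(pos < bits.size());  // the stream must hold a bit for every live row
    notNull[i] = bits[pos++];
  }
  return notNull;
}
```

With a parent mask of `{1, 0, 1}`, a writer that stored one bit per row (`{1, 0, 1}`) is decoded as `{1, 0, 0}`: the bit meant for the null row is consumed by the following row, which then reads back as empty. A writer that stored bits only for live rows (`{1, 1}`) is decoded as `{1, 0, 1}`.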
When a struct row is null, the column readers for its children skip over the null rows when decoding their own null vectors. But currently no ColumnWriter takes its parent's notNull into consideration when encoding its own notNull vector. This results in additional empty rows being read out of nested structures after a null row. This patch fixes this by changing ColumnWriter::add so that StructColumnWriter propagates its notNull vector to its children, which can then skip over null rows when encoding their own notNull.
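A minimal sketch of the writer-side idea, under stated assumptions: `encodePresent` and `combineMask` are hypothetical helpers, not the actual `ColumnWriter::add` signature from the patch. The only point shown is that a child emits present bits for the rows its parent marks as notNull, and that a nested struct would hand a combined mask further down (the combining step is an assumption, not something the description spells out).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not liborc API): present bits a child stores for one
// batch. `notNull` is the child's own vector; `incomingMask` is the parent's
// notNull vector, or nullptr when the parent has no nulls.
std::vector<char> encodePresent(const std::vector<char>& notNull,
                                const std::vector<char>* incomingMask) {
  assert(incomingMask == nullptr || incomingMask->size() == notNull.size());
  std::vector<char> bits;
  for (std::size_t i = 0; i < notNull.size(); ++i) {
    if (incomingMask != nullptr && !(*incomingMask)[i]) {
      continue;  // parent row is null: emit nothing for this row
    }
    bits.push_back(notNull[i] ? 1 : 0);
  }
  return bits;
}

// Assumed propagation rule for nested structs: a row stays live for the
// grandchildren only if it is notNull here and in every ancestor.
std::vector<char> combineMask(const std::vector<char>& notNull,
                              const std::vector<char>* incomingMask) {
  std::vector<char> mask(notNull.size(), 0);
  for (std::size_t i = 0; i < notNull.size(); ++i) {
    bool parentLive = incomingMask == nullptr || (*incomingMask)[i];
    mask[i] = (parentLive && notNull[i]) ? 1 : 0;
  }
  return mask;
}
```

For a parent mask `{1, 0, 1}` and a child vector `{0, 0, 1}`, `encodePresent` yields `{0, 1}` (one bit per live parent row) and `combineMask` yields `{0, 0, 1}` for the next nesting level.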
|
Can you please add a unit test to demonstrate the problem? That would be very helpful for understanding the issue. |
|
Liborc test infrastructure is rather primitive. The simplest test that revealed the problem was using this type (orc notation): [...] With liborc 1.5.4, the result was: [...] Here is the corresponding orc file: [...] With the patched liborc the result was OK and here is the corresponding ORC file: [...] You can easily reverse each of those files with [...] The code that wrote those ORC files is too big to be posted here (as I said, it's generated from the type description). But I will have a look tomorrow if I can quickly extract from that mass of generated code something that's small enough to be palatable, that could allow you to single-step into the code at least. |
|
Here is something that should allow you to single-step the code and/or notice my error in how I write those orc files: https://gist.github.com/rixed/35fc959f25ed4991870e8fbf972ab78e |
|
Ok, thanks for providing a test case. I will look into that in the next day or two. |
|
Thanks @rixed. Now I understand the problem. Your fix looks fine except that we need a unit test to cover this scenario. Please take a look at orc/c++/test/TestWriter.cc. You can add a test case with a schema like struct<struct>. With simple test data like you mentioned, we can demonstrate the issue and also validate your fix. Thanks again for reporting this problem. |
|
Thank you for confirming that it's indeed a bug in liborc. |
|
What I mean is that for this pull request to be accepted, you will need to add a corresponding unit test to the ORC C++ tests. |
|
That's indeed how I understood it. What I meant is: I have reported a bug, proposed a patch, and offered an explanation of why the C++ test suite failed to detect such a trivial bug. I think I've done enough. Adding another unit test to comply with whatever organisation guidelines is not worth my time. I will therefore wait until a committer who cares about that lib comes up with his own fix, and until then use my own version. Happy hacking! |
When a struct row is null, the column reader expects the children to have
no row at all. So I guess the clients of the lib must not write those
rows when writing an ORC file (or the reader will fail with too little data
to read). So, use the null counter (that has to be computed for stats
anyway) to shorten the children batches when writing them.
This then makes it possible to nest structures arbitrarily deep.