Fix parquet definition levels #511
@hohav could you please point your test repro at my branch, write the file, and then confirm whether parquet tools sees the correct output? I don't have the Java parquet tools, so it'll be quicker if you do that. Thanks
Codecov Report
@@ Coverage Diff @@
## master #511 +/- ##
==========================================
- Coverage 82.74% 82.71% -0.04%
==========================================
Files 165 166 +1
Lines 45686 45827 +141
==========================================
+ Hits 37805 37905 +100
- Misses 7881 7922 +41
Hi @hohav, I've looked at all 3 versions, and I see that while this PR fixes the definition issue, it doesn't address your specific issue because of the writer bug that I indicated in #385. I see that you also tested against
arrow = { git = "https://github.com/apache/arrow-rs.git" }
parquet = { git = "https://github.com/apache/arrow-rs.git" }
I might not have been clear enough; I had meant for you to test against
arrow = { git = "https://github.com/nevi-me/arrow-rs.git", branch = "parquet-fix-levels" }
parquet = { git = "https://github.com/nevi-me/arrow-rs.git", branch = "parquet-fix-levels" }
It's not an issue though, as I've cloned your repro repo and installed the parquet-mr tools to check what you're seeing. The solution is two-fold: this PR, and then computing stats in #512 so that we avoid the column writer computing these stats. I don't have enough bandwidth to drive #512; would you be able to help with it? I think the main tasks would be:
It would be a good exercise in digging into the arrow compute code. CC @alamb @jorgecarleitao for a review.
- non-null primitive should have def = 0, was misinterpreting the spec
- list increments 1 if not null, or 2 if null
This fixes these issues, and updates the tests
4fecc32 to 4885c32
I don't fully grok the definition level details so I can't really say if this is correct or not, but I reviewed the tests and skimmed the code and it seems reasonable to me. 👍
parquet/src/arrow/arrow_writer.rs
let file = get_temp_file("test_arrow_writer_list_non_null.parquet", &[]);
let mut writer = ArrowWriter::try_new(file, Arc::new(schema), None).unwrap();
writer.write(&batch).unwrap();
writer.close().unwrap();
I noticed other tests like arrow_writer_binary also read data back from parquet and validate the results (and thus confirm data survives the roundtrip).
Would it make sense to have the same test here as well?
@alamb it's because fn roundtrip was added after these tests existed (and I think the reader was lagging behind the writer), so we never expanded it to them. I've now done so, removing quite a bit of duplication.
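For readers unfamiliar with the pattern, here is a minimal roundtrip sketch. It is a std-only toy model and does not use the parquet crate or the `roundtrip` helper from this PR: the names `Row`, `Column`, `write` and `read` are hypothetical. It assumes a nullable list of nullable `i32` items (max definition level 3, max repetition level 1), encodes the rows into levels, decodes them again, and checks that the data survives.

```rust
// Toy model: one column holding a nullable list of nullable i32 items.
// Assumed definition levels: 0 = list is null, 1 = list is empty,
// 2 = item is null, 3 = item is present.
type Row = Option<Vec<Option<i32>>>;

#[derive(Debug, Default, PartialEq)]
struct Column {
    def: Vec<i16>,
    rep: Vec<i16>,
    values: Vec<i32>,
}

/// Encode rows into definition levels, repetition levels and non-null values.
fn write(rows: &[Row]) -> Column {
    let mut col = Column::default();
    for row in rows {
        match row {
            None => {
                col.def.push(0);
                col.rep.push(0);
            }
            Some(items) if items.is_empty() => {
                col.def.push(1);
                col.rep.push(0);
            }
            Some(items) => {
                for (i, item) in items.iter().enumerate() {
                    // rep = 0 starts a new row, rep = 1 continues the current list.
                    col.rep.push(if i == 0 { 0 } else { 1 });
                    match item {
                        None => col.def.push(2),
                        Some(v) => {
                            col.def.push(3);
                            col.values.push(*v);
                        }
                    }
                }
            }
        }
    }
    col
}

/// Decode the levels back into rows.
fn read(col: &Column) -> Vec<Row> {
    let mut rows: Vec<Row> = Vec::new();
    let mut values = col.values.iter();
    for (&def, &rep) in col.def.iter().zip(&col.rep) {
        if rep == 0 {
            rows.push(if def == 0 { None } else { Some(Vec::new()) });
        }
        if def >= 2 {
            let item = if def == 3 { values.next().copied() } else { None };
            if let Some(Some(items)) = rows.last_mut() {
                items.push(item);
            }
        }
    }
    rows
}

fn main() {
    let rows: Vec<Row> = vec![
        Some(vec![Some(1), None, Some(2)]),
        None,
        Some(vec![]),
        Some(vec![Some(3)]),
    ];
    let col = write(&rows);
    // The roundtrip check: what was written must be read back unchanged.
    assert_eq!(read(&col), rows);
    println!("def = {:?}, rep = {:?}", col.def, col.rep);
}
```

A roundtrip assertion like this catches level bugs that a write-only test would miss, because a wrong definition level typically turns a null list into an empty one (or the reverse) on the way back.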
Looks much nicer 👍
Hi @nevi-me, thanks again for taking a look at this. I did test against both your branch and master, though it's possible I messed up while juggling so many variants. I can take a stab at #512 when I find time, but would that also address the crash in
Which issue does this PR close?
Relates to #385, but does not close it.
Rationale for this change
While investigating #385, I noticed that there was a discrepancy between the max definition levels calculated in parquet::arrow::levels and what the parquet type system emits. So I kept digging, and found that my interpretation of the rules had errors. This fixes these issues, and updates the tests.
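The corrected rules from the commit message can be sketched as a small calculation. This is an illustrative model only, not the code in parquet::arrow::levels: the `Field` enum and `max_def_level` function are hypothetical names, struct fields are left out, and "null"/"not null" is read here as nullable/non-nullable. Under those assumptions, a non-nullable primitive contributes 0, a nullable primitive contributes 1, and a list contributes 1 if non-nullable or 2 if nullable, on top of its item.

```rust
/// Toy schema node; not the parquet or arrow crate's own types.
#[derive(Debug)]
enum Field {
    Primitive { nullable: bool },
    List { nullable: bool, item: Box<Field> },
}

/// Maximum definition level of the leaf column under `field`.
fn max_def_level(field: &Field) -> i16 {
    match field {
        // A non-nullable primitive adds nothing; a nullable one adds 1.
        Field::Primitive { nullable } => *nullable as i16,
        // A list adds 1 for its repeated level, plus 1 more if the list
        // itself may be null, and then whatever its item contributes.
        Field::List { nullable, item } => {
            let own = if *nullable { 2 } else { 1 };
            own + max_def_level(item)
        }
    }
}

fn main() {
    let required_int = Field::Primitive { nullable: false };
    let nullable_list_of_nullable_ints = Field::List {
        nullable: true,
        item: Box::new(Field::Primitive { nullable: true }),
    };
    println!("{:?} -> {}", required_int, max_def_level(&required_int));
    println!(
        "{:?} -> {}",
        nullable_list_of_nullable_ints,
        max_def_level(&nullable_list_of_nullable_ints)
    );
}
```

Comparing such a calculated maximum with the level the parquet schema reports for the same column is one way to spot the kind of discrepancy described above.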
What changes are included in this PR?
Described above
Are there any user-facing changes?
No