ARROW-8289: [Rust] Parquet Arrow writer with nested support
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable.
___

This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet):

* writing primitives except for booleans and binary
* nested structs
* null values (via definition levels)

It does not yet support:

- Boolean arrays (have to be handled differently from numeric values)
- Binary arrays
- Dictionary arrays
- Union arrays (are they even possible?)

So far I have added only one test: it writes a file with a nested schema, which I then checked by reading it with pyarrow.

```jupyter
# schema of test_complex.parquet

a: int32 not null
b: int32
c: struct<d: double, e: struct<f: float>> not null
  child 0, d: double
  child 1, e: struct<f: float>
      child 0, f: float
```
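To illustrate how nulls in the nested struct map to definition levels, here is a minimal, std-only sketch for the leaf column `c.e.f` of the schema above. It is not the code in this PR: `RowE` and `def_levels_for_f` are hypothetical names, and the sketch assumes `c` is required while `e` and `f` are optional, so the maximum definition level for `f` is 2. There are no lists in the schema, so repetition levels are not modeled.

```rust
// Hypothetical, std-only model of the leaf column `c.e.f`.
// Assumption: `c` is required (not null), `e` and `f` are optional,
// so each defined optional ancestor adds 1 to the definition level.

/// One row's view of `c.e`:
/// - `None`          => `e` is null
/// - `Some(None)`    => `e` is present, `f` is null
/// - `Some(Some(v))` => `f` holds the value `v`
type RowE = Option<Option<f32>>;

/// Returns (definition levels, non-null leaf values) for column `c.e.f`.
fn def_levels_for_f(rows: &[RowE]) -> (Vec<i16>, Vec<f32>) {
    let mut levels = Vec::with_capacity(rows.len());
    let mut values = Vec::new();
    for row in rows {
        match row {
            // `e` is null: no optional field on the path is defined.
            None => levels.push(0),
            // `e` is defined but `f` is null.
            Some(None) => levels.push(1),
            // Both `e` and `f` are defined; only now is a value stored.
            Some(Some(v)) => {
                levels.push(2);
                values.push(*v);
            }
        }
    }
    (levels, values)
}

fn main() {
    let rows: [RowE; 4] = [Some(Some(1.5)), Some(None), None, Some(Some(4.0))];
    let (levels, values) = def_levels_for_f(&rows);
    println!("definition levels: {:?}", levels);
    println!("leaf values: {:?}", values);
}
```

For these four rows the levels come out as `[2, 1, 0, 2]` and only two values (`1.5` and `4.0`) are stored, which is why the writer needs the levels to place nulls when the column is read back.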

This PR potentially addresses:

* https://issues.apache.org/jira/browse/ARROW-8289
* https://issues.apache.org/jira/browse/ARROW-8423
* https://issues.apache.org/jira/browse/ARROW-8424
* https://issues.apache.org/jira/browse/ARROW-8425

I would like to propose either opening new JIRAs for the unsupported items listed above, or renaming the last three JIRAs to cover them.

___

**Help Needed**

I'm implementing the definition and repetition levels from first principles, based on an old Parquet post on the Twitter engineering blog. I have probably misunderstood some of the concepts, so I would appreciate help with:

* Checking if my logic is correct
* Guidance or suggestions on how to more efficiently extract levels from arrays
* Adding tests. I suspect we need a lot of them; so far we only test writing one batch, so I don't know how paging behaves when writing a large enough file

I also don't know whether the various encodings (dictionary, RLE, etc.) and compression settings are applied automagically, or whether we need to enable them explicitly.

CC @sunchao @sadikovi @andygrove @paddyhoran

Might be of interest to @mcassels @maxburke

Closes #7319 from nevi-me/arrow-parquet-writer

Lead-authored-by: Neville Dipale <nevilledips@gmail.com>
Co-authored-by: Max Burke <max@urbanlogiq.com>
Co-authored-by: Andy Grove <andygrove73@gmail.com>
Co-authored-by: Max Burke <maxburke@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
4 people committed Aug 13, 2020
1 parent 3d0a9d5 commit 80a9c02
Showing 5 changed files with 692 additions and 5 deletions.
2 changes: 1 addition & 1 deletion rust/arrow/src/array/mod.rs
@@ -115,7 +115,7 @@ pub use self::array::StructArray;
pub use self::null::NullArray;
pub use self::union::UnionArray;

-pub(crate) use self::array::make_array;
+pub use self::array::make_array;

pub type BooleanArray = PrimitiveArray<BooleanType>;
pub type Int8Array = PrimitiveArray<Int8Type>;