ARROW-5123: [Rust] Parquet derive for simple structs #4140
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master    #4140      +/-   ##
==========================================
- Coverage   87.78%   87.74%    -0.05%
==========================================
  Files         758      758
  Lines       92394    92193      -201
  Branches     1251     1251
==========================================
- Hits        81112    80893      -219
- Misses      11165    11179       +14
- Partials      117      121        +4
```

Continue to review full report at Codecov.
Thanks for the PR @xrl! Will take a look at it soon.
I'm now adding support for casting. In any case, to be clear, the following now works automatically:
Are there other logical types that would be useful in this preliminary release? Timestamps scratch my itch, since I'm translating records from Postgres over to Parquet and our app uses a lot of timezone-free timestamps.
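For context on the date handling discussed here: Parquet stores DATE values as an `i32` count of days since the Unix epoch, which is what the derived code computes via chrono's `signed_duration_since`. A dependency-free sketch of the same conversion (using Howard Hinnant's `days_from_civil` algorithm; the function name and types here are mine, not part of this PR):

```rust
// Days since 1970-01-01 for a proleptic Gregorian date, mirroring what the
// derive emits via chrono:
//   rec.when.signed_duration_since(chrono::NaiveDate::from_ymd(1970, 1, 1))
//       .num_days() as i32
// Algorithm: Howard Hinnant's days_from_civil.
fn days_from_civil(y: i32, m: u32, d: u32) -> i32 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = (if y >= 0 { y } else { y - 399 }) / 400;
    let yoe = (y - era * 400) as i64; // year of era, in [0, 399]
    let mp = if m > 2 { m - 3 } else { m + 9 }; // March-based month, in [0, 11]
    let doy = ((153 * mp + 2) / 5 + d - 1) as i64; // day of year, in [0, 365]
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    (era as i64 * 146_097 + doe - 719_468) as i32
}

fn main() {
    assert_eq!(days_from_civil(1970, 1, 1), 0);
    assert_eq!(days_from_civil(2000, 1, 1), 10_957);
    println!("ok");
}
```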
Thanks @xrl! I have some comments on the PR. Still reading it, so may have more coming. :)
```
syn::PathArguments::AngleBracketed(angle_args) => {
    let mut gen_args_iter = angle_args.args.iter();
    let first_arg = gen_args_iter.next().unwrap();
    assert!(gen_args_iter.next().is_none());
```
Can these be simplified to this?

```
let first_arg = &angle_args.args[0];
```
I want to make sure there is only one generic argument to the type; it's just a precondition assertion so the code doesn't mysteriously break later in the code generation. We don't support things like `Map<&str, bool>`, for example. Perhaps it should be a different kind of error? This is the only `assert!` in the code, so I think I should switch to the more standard `unimplemented!()`.
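A small stand-in sketch of the precondition this discussion settles on: check for exactly one generic argument up front and fail loudly with `unimplemented!` on unsupported shapes like `Map<&str, bool>`. The enum and function here are hypothetical stand-ins, not syn's real AST types:

```rust
// Hypothetical stand-in for syn's generic-argument list; the point is the
// precondition check, not the real AST types.
#[derive(Debug, PartialEq)]
enum GenericArg {
    Type(&'static str),
}

fn single_generic_arg(args: &[GenericArg]) -> &GenericArg {
    // Precondition: exactly one generic argument, e.g. Vec<T> or Option<T>.
    // Multi-argument types like Map<&str, bool> are not supported.
    if args.len() != 1 {
        unimplemented!("expected exactly one generic argument, got {}", args.len());
    }
    &args[0]
}

fn main() {
    let args = [GenericArg::Type("bool")];
    assert_eq!(*single_generic_arg(&args), GenericArg::Type("bool"));
    println!("ok");
}
```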
In that case you can use this:

```
assert!(angle_args.args.len() == 1);
let first_arg = &angle_args.args[0];
```

It's better than using an iterator.
Good point.
Great work! As long as this has sufficient documentation on how to use the feature, it should be good to merge.
Thanks @xrl. Some more comments.
Force-pushed from ed180da to 85fe336.
@emkornfield yes, this is still relevant. I need to address some final points and make sure this still compiles with the latest parquet-rs. I have been using this in production without a hitch and it's time to push the work over the finish line.
@sunchao can you take another look at this PR? :)
@sunchao ping :)
Sorry @xrl - didn't see your comment earlier. Will take a look this week.
Thanks @xrl for continuing work on this, and sorry for the late review. I think this overall looks pretty good! I just have some cosmetic comments plus one suggestion on the API. Let me know what you think. :)
rust/parquet_derive/Cargo.toml (Outdated)

```
[package]
name = "parquet_derive"
version = "0.13.0"
```
nit: this version is out of date.
rust/parquet_derive_test/Cargo.toml (Outdated)

```
[package]
name = "parquet_derive_test"
version = "0.13.0"
```
Same - the version is out of date. We need to keep it the same as the Arrow version.
```
(quote! {
    impl#generics RecordWriter<#derived_for#generics> for &[#derived_for#generics] {
        fn write_to_row_group(&self, row_group_writer: &mut Box<parquet::file::writer::RowGroupWriter>) -> Result<(), parquet::errors::ParquetError> {
```
nit: long line.
```
use super::super::file::writer::RowGroupWriter;

pub trait RecordWriter<T> {
    fn write_to_row_group(
```
Thinking about whether we can have a higher-level API, so that users do not need to directly manipulate row groups. Instead, could we pass a file writer, like the following?

```
fn write(&self, file: &mut dyn FileWriter) -> Result<()>;
```

This would internally write a row group for each call.
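A minimal sketch of how that suggestion could compose with what the derive already generates: `write` opens one row group per call and delegates to `write_to_row_group`. All types below are simplified stand-ins, not the real parquet-rs API:

```rust
// Stand-ins for parquet's writers, only to show the layering of the two APIs.
#[derive(Default)]
struct RowGroupWriter {
    rows: usize,
}

#[derive(Default)]
struct FileWriter {
    row_groups: Vec<RowGroupWriter>,
}

impl FileWriter {
    fn next_row_group(&mut self) -> &mut RowGroupWriter {
        self.row_groups.push(RowGroupWriter::default());
        self.row_groups.last_mut().unwrap()
    }
}

trait RecordWriter {
    // What the derive already generates.
    fn write_to_row_group(&self, rg: &mut RowGroupWriter);

    // The proposed higher-level entry point: one row group per call.
    fn write(&self, file: &mut FileWriter) {
        let rg = file.next_row_group();
        self.write_to_row_group(rg);
    }
}

impl RecordWriter for &[bool] {
    fn write_to_row_group(&self, rg: &mut RowGroupWriter) {
        rg.rows += self.len();
    }
}

fn main() {
    let mut file = FileWriter::default();
    let records: &[bool] = &[true, false, true];
    records.write(&mut file);
    assert_eq!(file.row_groups.len(), 1);
    assert_eq!(file.row_groups[0].rows, 3);
    println!("ok");
}
```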
```
fn write_to_row_group(
    &self,
    row_group_writer: &mut Box<RowGroupWriter>,
) -> Result<(), ParquetError>;
```
nit: we can import and use the result type from parquet-rs, so this can just be `Result<()>`.
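For illustration, the alias the reviewer refers to looks like this: parquet-rs defines a crate-level `Result` specialized to `ParquetError`, so signatures can shorten from `Result<(), ParquetError>` to `Result<()>`. The definitions below are a simplified sketch, not the crate's actual code:

```rust
// Simplified sketch of a crate-local Result alias, in the style of
// parquet-rs's errors module.
#[derive(Debug)]
struct ParquetError(String);

type Result<T> = std::result::Result<T, ParquetError>;

// With the alias in scope, trait methods can return plain Result<()>.
fn write_to_row_group() -> Result<()> {
    Ok(())
}

fn main() {
    assert!(write_to_row_group().is_ok());
    println!("ok");
}
```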
```
let when = Field::from(&fields[0]);
assert_eq!(when.writer_snippet().to_string(), (quote!{
    {
        let vals: Vec<_> = records.iter().map(|rec| rec.henceforth.signed_duration_since(chrono::NaiveDate::from_ymd(1970, 1, 1)).num_days() as i32).collect();
```
nit: long lines - let's keep them within 90 chars.
````
/// }
/// ```
///
#[proc_macro_derive(ParquetRecordWriter)]
````
Instead of `ParquetRecordWriter`, what do you think of making it shorter, such as `ParquetWrite`, `ParquetSerialize`, etc.?
Yeah, the trick is that `ParquetRecordWriter` does both serialization (conversion to column types from a variety of input types) and writing out batches of data. I'll think about this more, but I'm leaning towards `ParquetWrite` or `ParquetWriter`.
`ParquetWriter` 👍
```
///
/// Can only generate writers for basic structs, for example:
///
/// struct Record {
```
nit: quotes around these?
Hey all, any update on this? It would be great to start using this.
@bryantbiggs this feature is still near and dear to my heart; my team uses it every day with great success. I will have to carve out some time to address the PR feedback. Are you open to trying this code out in your project? Could you give any feedback?
Hello @xrl - my apologies for the delayed response, just coming back around to this project. I created a branch off of yours to start making the recommended changes and will be running this on our current project. Any input/feedback is appreciated - I'd love to get this merged in if possible: master...clowdhaus:parquet_derive
also @sunchao ☝️
@bryantbiggs thanks for the assist! I have not had the bandwidth to work on this more (and the parquet writer/schema derives have been working great for us). Happy to merge in your work and keep this ball rolling.
It looks like this PR still has some activity. We're trying to close out stale PRs, but it would be great to see this work completed at some point, so I'll leave this open :)
Yes, sorry - I was poking at this but got stuck at this release issue. Any insights @wesm or @andygrove on release tags:
Force-pushed from 98b6906 to 76c65ef.
@sunchao given the age of this PR, I'd like to propose merging it if CI is green; we can make further changes in separate PRs. I suspect that if people start using the functionality we'd be able to get more eyes on the code.
@nevi-me sounds good to me - let's do that.
@kszucs I might need help :( I believe I updated
Will try to take a look at it, also cc @kou
Could you apply 0001-Update-parquet_derive-version.txt? (I don't know why I can't push to
It's because the repo was forked under the organisation :( Thanks, I see what I missed now. I've pushed your patch.
I've merged this now; the CI failures were unrelated. Thanks for the contribution, and apologies that we've taken this long to get this merged @xrl @bryantbiggs. I'll try to find some time to look at the comments that weren't addressed on this PR.
A rebase and significant rewrite of sunchao/parquet-rs#197

Big improvement: I now use a more natural nested enum style; it helps break out what the patterns of data types are. The rest of the broad strokes still apply.

Goal
===

Writing many columns to a file is a chore. If you can put your values into a struct which mirrors the schema of your file, this `derive(ParquetRecordWriter)` will write out all the fields, in the order in which they are defined, to a row_group.

How to Use
===

```
extern crate parquet;
#[macro_use] extern crate parquet_derive;

#[derive(ParquetRecordWriter)]
struct ACompleteRecord<'a> {
    pub a_bool: bool,
    pub a_str: &'a str,
}
```

RecordWriter trait
===

This is the new trait which `parquet_derive` will implement for your structs.

```
use super::RowGroupWriter;

pub trait RecordWriter<T> {
    fn write_to_row_group(&self, row_group_writer: &mut Box<RowGroupWriter>);
}
```

How does it work?
===

The `parquet_derive` crate adds code-generating functionality to the Rust compiler. The code generation takes Rust syntax and emits additional syntax. This macro expansion works on Rust 1.15+ stable. This is a dynamic plugin, loaded by the machinery in cargo. Users don't have to do any special `build.rs` steps or anything like that; it's automatic by including `parquet_derive` in their project. The `parquet_derive/src/Cargo.toml` has a section saying as much:

```
[lib]
proc-macro = true
```

The Rust struct tagged with `#[derive(ParquetRecordWriter)]` is provided to the `parquet_record_writer` function in `parquet_derive/src/lib.rs`. The `syn` crate parses the struct from a string representation to an AST (a recursive enum value). The AST contains all the values I care about when generating a `RecordWriter` impl:

- the name of the struct
- the lifetime variables of the struct
- the fields of the struct

The fields of the struct are translated from AST to a flat `FieldInfo` struct. It has the bits I care about for writing a column: `field_name`, `field_lifetime`, `field_type`, `is_option`, `column_writer_variant`.

The code then does the equivalent of templating to build the `RecordWriter` implementation. The templating functionality is provided by the `quote` crate. At a high level the template for `RecordWriter` looks like:

```
impl RecordWriter for $struct_name {
    fn write_row_group(..) {
        $(
            { $column_writer_snippet }
        )
    }
}
```

this template is then added under the struct definition, ending up something like:

```
struct MyStruct {
}
impl RecordWriter for MyStruct {
    fn write_row_group(..) {
        { write_col_1(); };
        { write_col_2(); }
    }
}
```

and finally _THIS_ is the code passed to rustc. It's just code now, fully expanded and standalone. If a user ever changes their `struct MyValue` definition, the `ParquetRecordWriter` will be regenerated. There are no intermediate values to version control or worry about.

Viewing the Derived Code
===

To see the generated code before it's compiled, one very useful bit is to install `cargo expand` ([more info on gh](https://github.com/dtolnay/cargo-expand)), then you can do:

```
$WORK_DIR/parquet-rs/parquet_derive_test
cargo expand --lib > ../temp.rs
```

then you can dump the contents:

```
struct DumbRecord {
    pub a_bool: bool,
    pub a2_bool: bool,
}
impl RecordWriter<DumbRecord> for &[DumbRecord] {
    fn write_to_row_group(
        &self,
        row_group_writer: &mut Box<parquet::file::writer::RowGroupWriter>,
    ) {
        let mut row_group_writer = row_group_writer;
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        };
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a2_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        }
    }
}
```

now I need to write out all the combinations of types we support and make sure it writes out data.

Procedural Macros
===

The `parquet_derive` crate can ONLY export the derivation functionality. No traits, nothing else. The derive crate cannot host test cases. It's kind of like a "dummy" crate which is only used by the compiler, never the code.

The parent crate cannot use the derivation functionality, which is important because it means test code cannot be in the parent crate. This forces us to have a third crate, `parquet_derive_test`.

I'm open to being wrong on any one of these finer points. I had to bang on this for a while to get it to compile!

Potentials For Better Design
===

- [x] Recursion could be limited by generating the code as "snippets" instead of one big `quote!` AST generator. Or so I think. It might be nicer to push generating each column's writing code to another loop.
- [x] ~~It would be nicer if I didn't have to be so picky about data going into the `write_batch` function. Is it possible we could make a version of the function which accepts `Into<DataType>` or similar? This would greatly simplify this derivation code as it would not need to enumerate all the supported types. Something like `write_generic_batch(&[impl Into<DataType>])` would be neat.~~ (not tackling in this generation of the plugin)
- [x] ~~Another idea to improve writing columns: could we have a write function for `Iterator`s? I already have a `Vec<DumbRecord>`; if I could just write a mapping for accessing the one value, we could skip the whole intermediate vec for `write_batch`. Should have some significant memory advantages.~~ (not tackling in this generation of the plugin, it's a bigger parquet-rs enhancement)
- [x] ~~It might be worthwhile to derive a parquet schema directly from a struct definition. That should stamp out opportunities for type errors.~~ (moved to apache#203)

Status
===

I have successfully integrated this work with my own data exporter (takes postgres/couchdb and outputs a single parquet file). I think this code is worth including in the project, with the caveat that it only generates simplistic `RecordWriter`s. As people start to use it we can add code generation for more complex, nested structs. We can convert the nested matching style to a fancier looping style. But for now, this explicit nesting is easier to debug and understand (to me at least!).

Closes apache#4140 from xrl/parquet_derive

Lead-authored-by: Xavier Lange <xrlange@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Co-authored-by: Bryant Biggs <bryantbiggs@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
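The templating step described above can be illustrated with a dependency-free toy. The real parquet_derive uses `quote!` on a `syn` AST; this string-based version is only a sketch of the shape, with names of my own choosing:

```rust
// Toy version of the templating step: stitch per-column writer snippets into
// one RecordWriter impl. parquet_derive does this with quote!, not strings,
// but the structure is the same.
fn generate_impl(struct_name: &str, column_snippets: &[&str]) -> String {
    let body: String = column_snippets
        .iter()
        .map(|snippet| format!("        {{ {} }}\n", snippet))
        .collect();
    format!(
        "impl RecordWriter for {} {{\n    fn write_row_group(..) {{\n{}    }}\n}}",
        struct_name, body
    )
}

fn main() {
    let generated = generate_impl("MyStruct", &["write_col_1();", "write_col_2();"]);
    assert!(generated.contains("impl RecordWriter for MyStruct"));
    assert!(generated.contains("{ write_col_1(); }"));
    println!("{}", generated);
}
```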