Migrated from a previous GitHub issue (which saw a lot of comments, but at a rough transition time in the project): sunchao/parquet-rs#197
Goal
===
Writing many columns to a file is a chore. If you can put your values into a struct which mirrors the schema of your file, this derive(ParquetRecordWriter) will write out all the fields, in the order in which they are defined, to a row group.
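To make the goal concrete, here is a std-only sketch of what the derive produces. The `RecordWriter` trait and the row-group type below are simplified stand-ins for illustration, not the parquet-rs API:

```rust
// Std-only sketch of what the derive automates; the real parquet-rs types
// (RowGroupWriter, ColumnWriter) are replaced by a toy collector here.
#[derive(Debug)]
struct MyValue {
    id: i64,
    flag: bool,
}

// Toy stand-in for a row group: one Vec of stringified cells per column.
#[derive(Default)]
struct MockRowGroup {
    columns: Vec<Vec<String>>,
}

trait RecordWriter {
    fn write_to_row_group(&self, row_group: &mut MockRowGroup);
}

// This impl is the kind of thing #[derive(ParquetRecordWriter)] generates:
// one column per field, written in definition order.
impl RecordWriter for &[MyValue] {
    fn write_to_row_group(&self, row_group: &mut MockRowGroup) {
        row_group
            .columns
            .push(self.iter().map(|v| v.id.to_string()).collect());
        row_group
            .columns
            .push(self.iter().map(|v| v.flag.to_string()).collect());
    }
}

fn main() {
    let records = vec![
        MyValue { id: 1, flag: true },
        MyValue { id: 2, flag: false },
    ];
    let mut rg = MockRowGroup::default();
    (&records[..]).write_to_row_group(&mut rg);
    assert_eq!(rg.columns.len(), 2); // one column per struct field
    assert_eq!(rg.columns[0], vec!["1", "2"]);
    println!("{:?}", rg.columns);
}
```

The point of the derive is that the hand-written `impl` above disappears: you only declare the struct.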
The parquet_derive crate adds code-generating functionality to the Rust compiler. The code generation takes Rust syntax and emits additional syntax. This macro expansion works on Rust 1.15+ stable. It's a dynamic plugin, loaded by the machinery in cargo; users don't have to do any special build.rs steps or anything like that, it happens automatically by including parquet_derive in their project. parquet_derive/Cargo.toml has a section saying as much:
```toml
[lib]
proc-macro = true
```
The Rust struct tagged with #[derive(ParquetRecordWriter)] is provided to the parquet_record_writer function in parquet_derive/src/lib.rs. The syn crate parses the struct from its string representation to an AST (a recursive enum value). The AST contains all the values I care about when generating a RecordWriter impl:
- the name of the struct
- the lifetime variables of the struct
- the fields of the struct
The fields of the struct are translated from the AST to a flat FieldInfo struct. It has the bits I care about for writing a column: field_name, field_lifetime, field_type, is_option, column_writer_variant.
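A std-only sketch of that flattening step follows. FieldInfo's shape is taken from the issue text; the type-to-variant mapping and the string-based `Option` handling are illustrative stand-ins for what syn-based code would do:

```rust
// Sketch of the AST-to-FieldInfo flattening step (types as strings instead
// of syn tokens, for a self-contained example).
#[derive(Debug, PartialEq)]
struct FieldInfo {
    field_name: String,
    field_lifetime: Option<String>, // lifetimes elided in this sketch
    field_type: String,
    is_option: bool,
    column_writer_variant: String,
}

// Map a Rust field type to the ColumnWriter variant its values would use.
fn column_writer_variant(field_type: &str) -> String {
    match field_type {
        "bool" => "BoolColumnWriter",
        "i32" => "Int32ColumnWriter",
        "i64" => "Int64ColumnWriter",
        "f32" => "FloatColumnWriter",
        "f64" => "DoubleColumnWriter",
        other => panic!("unsupported type: {}", other),
    }
    .to_string()
}

fn field_info(name: &str, ty: &str) -> FieldInfo {
    // `Option<T>` unwraps to T and flips is_option.
    let (inner, is_option) = match ty
        .strip_prefix("Option<")
        .and_then(|t| t.strip_suffix('>'))
    {
        Some(inner) => (inner, true),
        None => (ty, false),
    };
    FieldInfo {
        field_name: name.to_string(),
        field_lifetime: None,
        field_type: inner.to_string(),
        is_option,
        column_writer_variant: column_writer_variant(inner),
    }
}

fn main() {
    let info = field_info("a_bool", "Option<bool>");
    assert!(info.is_option);
    assert_eq!(info.column_writer_variant, "BoolColumnWriter");
    println!("{:?}", info);
}
```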
The code then does the equivalent of templating to build the RecordWriter implementation. The templating functionality is provided by the quote crate. At a high level, the template emits one column-writing block per field, and the result is spliced in under the struct definition.
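The original quote! snippet didn't survive the issue migration; as a rough sketch of its shape, using plain string formatting in place of quote! (all names here are illustrative):

```rust
// String-based sketch of the RecordWriter template: one write block per
// column. The real macro builds tokens with quote!, not strings.
fn render_record_writer(struct_name: &str, fields: &[(&str, &str)]) -> String {
    let mut body = String::new();
    for (field_name, variant) in fields {
        body.push_str(&format!(
            "    {{
        let vals: Vec<_> = self.iter().map(|x| x.{name}).collect();
        let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
        if let ColumnWriter::{variant}(ref mut typed) = column_writer {{
            typed.write_batch(&vals[..], None, None).unwrap();
        }}
        row_group_writer.close_column(column_writer).unwrap();
    }}\n",
            name = field_name,
            variant = variant
        ));
    }
    format!(
        "impl RecordWriter<{n}> for &[{n}] {{\n  fn write_to_row_group(...) {{\n{body}  }}\n}}",
        n = struct_name,
        body = body
    )
}

fn main() {
    let code = render_record_writer("DumbRecord", &[("a_bool", "BoolColumnWriter")]);
    assert!(code.contains("impl RecordWriter<DumbRecord>"));
    assert!(code.contains("x.a_bool"));
    println!("{}", code);
}
```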
The fully expanded result is exactly the code passed to rustc: it's just code now, standalone. If a user ever changes their struct MyValue definition, the RecordWriter impl is regenerated. There are no intermediate values to version control or worry about.
Viewing the Derived Code
To see the generated code before it's compiled, one very useful bit is to install cargo expand (more info on its GitHub page), then run `cargo expand` in the crate that uses the derive. For example, this struct:
```rust
struct DumbRecord {
    pub a_bool: bool,
    pub a2_bool: bool,
}
```
expands to:

```rust
impl RecordWriter<DumbRecord> for &[DumbRecord] {
    fn write_to_row_group(
        &self,
        row_group_writer: &mut Box<parquet::file::writer::RowGroupWriter>,
    ) {
        let mut row_group_writer = row_group_writer;
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        };
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a2_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        }
    }
}
```
Now I need to write out all the combinations of types we support and make sure it writes out data.
Procedural Macros
The parquet_derive crate can ONLY export the derivation functionality: no traits, nothing else. The derive crate cannot host test cases; it's kind of like a "dummy" crate which is only used by the compiler, never by user code.
The parent crate cannot use the derivation functionality, which matters because it means test code cannot live in the parent crate. This forces us to have a third crate, parquet_derive_test.
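The resulting three-crate layout looks like this (directory names from the issue; the comments are my reading of the roles):

```
parquet/              # parent crate: RecordWriter trait, column writers
parquet_derive/       # proc-macro crate: exports only the derive
parquet_derive_test/  # depends on both; hosts the tests the derive crate can't
```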
I'm open to being wrong on any one of these finer points. I had to bang on this for a while to get it to compile!
Potentials For Better Design
- [x] Recursion could be limited by generating the code as "snippets" instead of one big quote! AST generator. Or so I think. It might be nicer to push generating each column's writing code into another loop.
- [x] It would be nicer if I didn't have to be so picky about the data going into the write_batch function. Could we make a version of the function which accepts Into<DataType> or similar? This would greatly simplify the derivation code, as it would not need to enumerate all the supported types. Something like write_generic_batch(&[impl Into<DataType>]) would be neat. (Not tackling in this generation of the plugin.)
- [x] Another idea for improving column writing: could we have a write function for Iterators? I already have a Vec<DumbRecord>; if I could just write a mapping for accessing the one value, we could skip the whole intermediate vec for write_batch. That should have some significant memory advantages. (Not tackling in this generation of the plugin; it's a bigger parquet-rs enhancement.)
- [x] It might be worthwhile to derive a parquet schema directly from a struct definition. That should stamp out opportunities for type errors. (Moved to #203.)
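The Into<DataType> idea above could look roughly like this, with toy types standing in for parquet's (DataType, write_generic_batch, and the variants are all hypothetical stand-ins, not parquet-rs API):

```rust
// Toy sketch of the Into<DataType>-style write function proposed above.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Bool(bool),
    Int64(i64),
}

impl From<bool> for DataType {
    fn from(v: bool) -> Self {
        DataType::Bool(v)
    }
}

impl From<i64> for DataType {
    fn from(v: i64) -> Self {
        DataType::Int64(v)
    }
}

// One entry point for every column type: the derive would no longer need to
// enumerate ColumnWriter variants, just convert and hand the batch over.
fn write_generic_batch<T: Into<DataType> + Clone>(vals: &[T]) -> Vec<DataType> {
    vals.iter().cloned().map(Into::into).collect()
}

fn main() {
    let out = write_generic_batch(&[true, false]);
    assert_eq!(out, vec![DataType::Bool(true), DataType::Bool(false)]);
    println!("{:?}", out);
}
```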
Status
I have successfully integrated this work with my own data exporter (takes postgres/couchdb and outputs a single parquet file).
I think this code is worth including in the project, with the caveat that it only generates simplistic RecordWriters. As people start to use it, we can add code generation for more complex, nested structs.
Reporter: Xavier Lange
PRs and other links:
Note: This issue was originally created as ARROW-5123. Please see the migration documentation for further details.