
ARROW-5123: [Rust] Parquet derive for simple structs #4140

Closed
wants to merge 36 commits into from

Conversation

@xrl (Contributor) commented Apr 11, 2019

A rebase and significant rewrite of sunchao/parquet-rs#197

Big improvement: I now use a more natural nested enum style; it helps break out the patterns of supported data types. The rest of the broad strokes still apply.

Goal

Writing many columns to a file is a chore. If you can put your values into a struct which mirrors the schema of your file, this `derive(ParquetRecordWriter)` will write out all the fields, in the order in which they are defined, to a row group.

How to Use

```
extern crate parquet;
#[macro_use] extern crate parquet_derive;

#[derive(ParquetRecordWriter)]
struct ACompleteRecord<'a> {
  pub a_bool: bool,
  pub a_str: &'a str,
}
```

RecordWriter trait

This is the new trait which parquet_derive will implement for your structs.

```
use super::RowGroupWriter;

pub trait RecordWriter<T> {
  fn write_to_row_group(&self, row_group_writer: &mut Box<RowGroupWriter>);
}
```
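To make the trait concrete, here is a hand-written impl of the shape the derive generates, written against a simplified, std-only stand-in for the row group writer. `MockRowGroupWriter` and `write_bool_batch` are hypothetical names for illustration, not the parquet-rs API:

```rust
// A simplified stand-in for parquet's RowGroupWriter: it just records
// the batches it is handed, one Vec per column.
#[derive(Default)]
struct MockRowGroupWriter {
    bool_batches: Vec<Vec<bool>>,
}

impl MockRowGroupWriter {
    fn write_bool_batch(&mut self, vals: Vec<bool>) {
        self.bool_batches.push(vals);
    }
}

// The trait from above, with the writer type swapped for the mock.
trait RecordWriter<T> {
    fn write_to_row_group(&self, row_group_writer: &mut MockRowGroupWriter);
}

struct Record {
    a_bool: bool,
    a2_bool: bool,
}

// Roughly what #[derive(ParquetRecordWriter)] expands to: one batch per
// field, emitted in declaration order.
impl RecordWriter<Record> for &[Record] {
    fn write_to_row_group(&self, row_group_writer: &mut MockRowGroupWriter) {
        row_group_writer.write_bool_batch(self.iter().map(|r| r.a_bool).collect());
        row_group_writer.write_bool_batch(self.iter().map(|r| r.a2_bool).collect());
    }
}

fn main() {
    let records = vec![
        Record { a_bool: true, a2_bool: false },
        Record { a_bool: false, a2_bool: false },
    ];
    let mut writer = MockRowGroupWriter::default();
    (&records[..]).write_to_row_group(&mut writer);
    // Column-major output: one batch per field.
    assert_eq!(writer.bool_batches, vec![vec![true, false], vec![false, false]]);
}
```

The point of the derive is exactly this mechanical transposition from row-major structs to column-major batches.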

How does it work?

The `parquet_derive` crate adds code-generating functionality to the Rust compiler. The code generation takes Rust syntax and emits additional syntax. This macro expansion works on Rust 1.15+ stable. It is a dynamic plugin, loaded by the machinery in cargo; users don't have to do any special `build.rs` steps, it's automatic once `parquet_derive` is included in their project. The `parquet_derive/Cargo.toml` has a section saying as much:

```
[lib]
proc-macro = true
```

The Rust struct tagged with `#[derive(ParquetRecordWriter)]` is provided to the `parquet_record_writer` function in `parquet_derive/src/lib.rs`. The `syn` crate parses the struct from a string representation to an AST (a recursive enum value). The AST contains all the values I care about when generating a `RecordWriter` impl:

  • the name of the struct
  • the lifetime variables of the struct
  • the fields of the struct

The fields of the struct are translated from AST to a flat `FieldInfo` struct. It has the bits I care about for writing a column: `field_name`, `field_lifetime`, `field_type`, `is_option`, `column_writer_variant`.
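As a sketch, that flat shape might look like the following, with field names following the description above (the derive's real internal type may differ):

```rust
// Hypothetical FieldInfo: one per struct field, flattened from the syn AST.
#[derive(Debug)]
struct FieldInfo {
    field_name: String,
    field_lifetime: Option<String>, // e.g. Some("'a") for a `&'a str` field
    field_type: String,             // e.g. "bool" or "&str"
    is_option: bool,                // true for Option<T> fields (nullable columns)
    column_writer_variant: String,  // e.g. "BoolColumnWriter"
}

fn main() {
    // What `a_str: &'a str` from the example above might flatten to.
    let a_str = FieldInfo {
        field_name: "a_str".to_string(),
        field_lifetime: Some("'a".to_string()),
        field_type: "&str".to_string(),
        is_option: false,
        column_writer_variant: "ByteArrayColumnWriter".to_string(),
    };
    assert!(!a_str.is_option);
    println!("{:?}", a_str);
}
```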

The code then does the equivalent of templating to build the `RecordWriter` implementation. The templating functionality is provided by the `quote` crate. At a high level, the template for `RecordWriter` looks like:

```
impl RecordWriter for $struct_name {
  fn write_row_group(..) {
    $({
      $column_writer_snippet
    })
  }
}
```

This template is then added under the struct definition, ending up something like:

```
struct MyStruct {
}
impl RecordWriter for MyStruct {
  fn write_row_group(..) {
    {
       write_col_1();
    };
    {
       write_col_2();
    }
  }
}
```

and finally THIS is the code passed to rustc. It's just code now, fully expanded and standalone. If a user ever changes their `struct MyValue` definition, the `ParquetRecordWriter` will be regenerated. There are no intermediate values to version control or worry about.

Viewing the Derived Code

To see the generated code before it's compiled, one very useful trick is to install `cargo expand` (more info at https://github.com/dtolnay/cargo-expand); then you can do:

```
cd $WORK_DIR/parquet-rs/parquet_derive_test
cargo expand --lib > ../temp.rs
```

then you can dump the contents:

```
struct DumbRecord {
    pub a_bool: bool,
    pub a2_bool: bool,
}
impl RecordWriter<DumbRecord> for &[DumbRecord] {
    fn write_to_row_group(
        &self,
        row_group_writer: &mut Box<parquet::file::writer::RowGroupWriter>,
    ) {
        let mut row_group_writer = row_group_writer;
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        };
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a2_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        }
    }
}
```

Now I need to write out all the combinations of types we support and make sure it writes out data.

Procedural Macros

The `parquet_derive` crate can ONLY export the derivation functionality. No traits, nothing else. The derive crate cannot host test cases. It's kind of like a "dummy" crate which is only used by the compiler, never by runtime code.

The parent crate cannot use the derivation functionality, which is important because it means test code cannot live in the parent crate. This forces us to have a third crate, `parquet_derive_test`.

I'm open to being wrong on any one of these finer points. I had to bang on this for a while to get it to compile!

Potentials For Better Design

  • Recursion could be limited by generating the code as "snippets" instead of one big `quote!` AST generator. Or so I think. It might be nicer to push generating each column's writing code into its own loop.
  • It would be nicer if I didn't have to be so picky about data going into the `write_batch` function. Could we make a version of the function which accepts `Into<DataType>` or similar? This would greatly simplify the derivation code, as it would not need to enumerate all the supported types. Something like `write_generic_batch(&[impl Into<DataType>])` would be neat. (Not tackling in this generation of the plugin.)
  • Another idea for improving column writing: could we have a write function for `Iterator`s? I already have a `Vec<DumbRecord>`; if I could just write a mapping for accessing the one value, we could skip the whole intermediate vec for `write_batch`. That should have some significant memory advantages. (Not tackling in this generation of the plugin; it's a bigger parquet-rs enhancement.)
  • It might be worthwhile to derive a parquet schema directly from a struct definition. That should stamp out opportunities for type errors. (Moved to #203.)
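The `Into<DataType>` idea from the second bullet can be sketched with plain std; `ParquetValue` and `write_generic_batch` are hypothetical names for illustration, not parquet-rs API:

```rust
// Hypothetical unified value type a generic write_batch could accept.
#[derive(Debug, PartialEq)]
enum ParquetValue {
    Bool(bool),
    Int64(i64),
}

impl From<bool> for ParquetValue {
    fn from(b: bool) -> Self { ParquetValue::Bool(b) }
}
impl From<i64> for ParquetValue {
    fn from(i: i64) -> Self { ParquetValue::Int64(i) }
}

// One generic entry point instead of one write path per physical type:
// the derive would no longer need to enumerate every supported type.
fn write_generic_batch<T: Into<ParquetValue>>(vals: Vec<T>) -> Vec<ParquetValue> {
    vals.into_iter().map(Into::into).collect()
}

fn main() {
    let bools = write_generic_batch(vec![true, false]);
    assert_eq!(bools, vec![ParquetValue::Bool(true), ParquetValue::Bool(false)]);

    let ints = write_generic_batch(vec![1i64, 2]);
    assert_eq!(ints, vec![ParquetValue::Int64(1), ParquetValue::Int64(2)]);
}
```

The trade-off is an extra enum wrapping per value, which is why the real write path keeps per-type column writers.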

Status

I have successfully integrated this work with my own data exporter (takes postgres/couchdb and outputs a single parquet file).

I think this code is worth including in the project, with the caveat that it only generates simplistic `RecordWriter`s. As people start to use it, we can add code generation for more complex, nested structs. We can convert the nested matching style to a fancier looping style. But for now, this explicit nesting is easier to debug and understand (to me at least!).

@xrl (Contributor, Author) commented Apr 11, 2019

@sadikovi @sunchao I hope we can pick up where we left off on the last PR 😆

@codecov-io commented Apr 11, 2019

Codecov Report

Merging #4140 into master will decrease coverage by 0.04%.
The diff coverage is n/a.


```
@@            Coverage Diff             @@
##           master    #4140      +/-   ##
==========================================
- Coverage   87.78%   87.74%   -0.05%
==========================================
  Files         758      758
  Lines       92394    92193     -201
  Branches     1251     1251
==========================================
- Hits        81112    80893     -219
- Misses      11165    11179      +14
- Partials      117      121       +4

Impacted Files                          Coverage Δ
go/arrow/math/int64_avx2_amd64.go       0% <0%> (-100%) ⬇️
go/arrow/memory/memory_avx2_amd64.go    0% <0%> (-100%) ⬇️
go/arrow/math/float64_avx2_amd64.go     0% <0%> (-100%) ⬇️
go/arrow/math/uint64_avx2_amd64.go      0% <0%> (-100%) ⬇️
go/arrow/memory/memory_amd64.go         28.57% <0%> (-14.29%) ⬇️
go/arrow/math/math_amd64.go             31.57% <0%> (-5.27%) ⬇️
js/src/ipc/metadata/json.ts             92.39% <0%> (-4.35%) ⬇️
cpp/src/arrow/csv/column-builder.cc     95.32% <0%> (-1.76%) ⬇️
cpp/src/parquet/arrow/reader.cc         84.15% <0%> (-1.48%) ⬇️
cpp/src/plasma/thirdparty/ae/ae.c       70.75% <0%> (-0.95%) ⬇️
... and 16 more
```

Last update 9d333f4...45348e7.

@sunchao (Member) commented Apr 11, 2019

Thanks for the PR @xrl! Will take a look at it soon.

@sunchao sunchao changed the title [ARROW-5123][RUST] Parquet derive for simple structs ARROW-5123: [Rust] Parquet derive for simple structs Apr 11, 2019
@xrl (Contributor, Author) commented Apr 12, 2019

I'm now adding support for casting `NaiveDateTime` to an `i64` to support TIMESTAMP_MILLIS. This is a feature that could use some design work or feedback. I think I can support other timestamp types too, but `NaiveDateTime` is the "most accurate" because it assumes UTC, which I think is the most compatible with how parquet treats timestamps?

In any case, to be clear, the following now works automatically:

```
#[derive(ParquetRecordWriter)]
struct MyStruct {
  timestamp: NaiveDateTime
}
```

Are there other logical types that would be useful in this preliminary release? Timestamps scratch my itch, since I'm translating records from postgres over to parquet and our app uses a lot of timezone-free timestamps.
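For reference, TIMESTAMP_MILLIS is just milliseconds since the Unix epoch; the conversion chrono performs can be sketched std-only using Howard Hinnant's civil-days algorithm (this assumes UTC, like `NaiveDateTime`; the function names here are illustrative, not chrono's API):

```rust
// Days between a civil date and 1970-01-01 (Howard Hinnant's algorithm).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = (if y >= 0 { y } else { y - 399 }) / 400;
    let yoe = y - era * 400;
    let doy = (153 * (if m > 2 { m - 3 } else { m + 9 }) + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146097 + doe - 719468
}

// Milliseconds since the Unix epoch, i.e. the TIMESTAMP_MILLIS value.
fn timestamp_millis(y: i64, mo: i64, d: i64, h: i64, mi: i64, s: i64) -> i64 {
    (((days_from_civil(y, mo, d) * 24 + h) * 60 + mi) * 60 + s) * 1000
}

fn main() {
    assert_eq!(timestamp_millis(1970, 1, 1, 0, 0, 0), 0);
    assert_eq!(timestamp_millis(1970, 1, 1, 0, 0, 1), 1_000);
    // 2019-04-11 is 17,997 days after the epoch.
    assert_eq!(timestamp_millis(2019, 4, 11, 0, 0, 0), 17_997 * 86_400_000);
}
```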

@sunchao (Member) left a comment
Thanks @xrl! I have some comments on the PR. Still reading it, so may have more coming. :)

Review comments on rust/parquet_derive/README.md, rust/parquet_derive/Cargo.toml, rust/parquet_derive/src/lib.rs, and rust/parquet_derive/src/parquet_field.rs.
```
syn::PathArguments::AngleBracketed(angle_args) => {
    let mut gen_args_iter = angle_args.args.iter();
    let first_arg = gen_args_iter.next().unwrap();
    assert!(gen_args_iter.next().is_none());
```
@sunchao (Member) commented:
Can these be simplified to the following?

```
let first_arg = &angle_args.args[0];
```

@xrl (Contributor, Author) replied:

I want to make sure there is only one generic argument to the type; it's just a precondition assertion so the code doesn't mysteriously break further along in the code generation. We don't support things like `Map<&str, bool>`, for example. Perhaps it should be a different kind of error? This is the only `assert!` in the code, so I think I should switch to the more standard `unimplemented!()`.

@sunchao (Member) commented Apr 18, 2019:
In that case you can use this:

```
assert!(angle_args.args.len() == 1);
let first_arg = &angle_args.args[0];
```

It's better than using an iterator.

@xrl (Contributor, Author) replied:

Good point.

@sadikovi (Contributor) left a comment
Great work! As long as this has sufficient documentation on how to use the feature, it should be good to merge.

Review comments on rust/parquet_derive_test/src/lib.rs, rust/parquet_derive_test/Cargo.toml, and dev/release/00-prepare.sh.
@sunchao (Member) left a comment

Thanks @xrl. Some more comments.

Review comments on rust/parquet_derive/src/lib.rs, rust/parquet_derive/src/parquet_field.rs, and rust/parquet_derive_test/src/lib.rs.
@kszucs kszucs force-pushed the master branch 2 times, most recently from ed180da to 85fe336 Compare July 22, 2019 19:29
@emkornfield (Contributor) commented:
@xrl @sunchao is this change still relevant? It doesn't look like there has been any update in 6 months.

@xrl (Contributor, Author) commented Oct 19, 2019

@emkornfield yes, this is still relevant. I need to address some final points and make sure this still compiles with the latest parquet-rs. I have been using this in production without a hitch and it's time to push the work over the finish line.

@xrl (Contributor, Author) commented Oct 24, 2019

@sunchao can you take another look at this PR? :)

@xrl (Contributor, Author) commented Nov 29, 2019

@sunchao ping :)

@sunchao (Member) commented Dec 2, 2019

Sorry @xrl - didn't see your comment earlier. Will take a look this week.

@sunchao (Member) left a comment
Thanks @xrl for continuing work on this, and sorry for the late review. I think this overall looks pretty good! I just have some cosmetic comments plus one suggestion on the API. Let me know what you think. :)


```
[package]
name = "parquet_derive"
version = "0.13.0"
```

@sunchao (Member): nit, this version is out of date.


```
[package]
name = "parquet_derive_test"
version = "0.13.0"
```

@sunchao (Member): same, the version is out of date. We need to keep it the same as the Arrow version.


```
(quote! {
impl#generics RecordWriter<#derived_for#generics> for &[#derived_for#generics] {
fn write_to_row_group(&self, row_group_writer: &mut Box<parquet::file::writer::RowGroupWriter>) -> Result<(), parquet::errors::ParquetError> {
```

@sunchao (Member): nit, long line.

```
use super::super::file::writer::RowGroupWriter;

pub trait RecordWriter<T> {
fn write_to_row_group(
```

@sunchao (Member): Thinking whether we can have a higher-level API, so that users do not need to directly manipulate row groups. Instead, could we pass a file writer, like the following?

```
fn write(&self, file: &mut dyn FileWriter) -> Result<()>;
```

This internally would write a row group for each call.

```
fn write_to_row_group(
&self,
row_group_writer: &mut Box<RowGroupWriter>,
) -> Result<(), ParquetError>;
```

@sunchao (Member): nit, we can import and use the result type from parquet-rs so this can just be `Result<()>`.

```
let when = Field::from(&fields[0]);
assert_eq!(when.writer_snippet().to_string(),(quote!{
{
let vals : Vec<_> = records.iter().map(|rec| rec.henceforth.signed_duration_since(chrono::NaiveDate::from_ymd(1970, 1, 1)).num_days() as i32).collect();
```

@sunchao (Member): nit, long lines; let's keep them within 90 chars.

```
/// }
/// ```
///
#[proc_macro_derive(ParquetRecordWriter)]
```

@sunchao (Member): Instead of ParquetRecordWriter, what do you think of making it shorter, such as ParquetWrite, ParquetSerialize, etc.?

@xrl (Contributor, Author) replied:
Yeah, the trick is that ParquetRecordWriter does both serialization (conversion to column types from a variety of input types) and writing out batches of data. I'll think about this more, but I'm leaning towards ParquetWrite or ParquetWriter.

A contributor replied: ParquetWriter 👍

Review comment on rust/parquet_derive/src/parquet_field.rs.
```
///
/// Can only generate writers for basic structs, for example:
///
/// struct Record {
```

@sunchao (Member): nit, quotes around these?

@bryantbiggs (Contributor) commented:

Hey all, any update on this? It would be great to start using this.

@xrl (Contributor, Author) commented Jan 21, 2020

@bryantbiggs this feature is still near and dear to my heart; my team uses it every day with great success. I will have to carve out some time to address the PR feedback.

Are you open to trying this code out in your project? Could you give any feedback?

@bryantbiggs (Contributor) commented:

Hello @xrl, my apologies for the delayed response; I'm just coming back around to this project. I created a branch off of yours to start making the recommended changes and will be running this on our current project. Any input/feedback is appreciated. I'd love to get this merged in if possible: master...clowdhaus:parquet_derive

@bryantbiggs (Contributor) commented:

Also @sunchao ☝️

@xrl (Contributor, Author) commented Mar 15, 2020

@bryantbiggs thanks for the assist! I have not had the bandwidth to work on this more (and parquet writer/schema derive have been working great for us). Happy to merge in your work and keep this ball rolling.

@wesm (Member) commented Apr 1, 2020

It looks like this PR still has some activity. We're trying to close out stale PRs, but it would be great to see this work completed at some point, so I'll leave this open :)

@bryantbiggs (Contributor) commented:

Yes, sorry, I was poking at this but got stuck on this release issue. Any insights @wesm or @andygrove on release tags:

```
Failure: test_version_post_tag(PrepareTest)
/home/runner/work/arrow/arrow/dev/release/00-prepare-test.rb:320:in `test_version_post_tag'
     317:       prepare("VERSION_PRE_TAG",
     318:               "VERSION_POST_TAG")
     319:     end
  => 320:     assert_equal([
     321:                    {
     322:                      path: "c_glib/configure.ac",
     323:                      hunks: [
/home/runner/work/arrow/arrow/dev/release/00-prepare-test.rb:36:in `block (2 levels) in setup'
/home/runner/work/arrow/arrow/dev/release/00-prepare-test.rb:31:in `chdir'
/home/runner/work/arrow/arrow/dev/release/00-prepare-test.rb:31:in `block in setup'
/opt/hostedtoolcache/Ruby/2.6.5/x64/lib/ruby/2.6.0/tmpdir.rb:93:in `mktmpdir'
/home/runner/work/arrow/arrow/dev/release/00-prepare-test.rb:28:in `setup'
<[{:hunks=>
   [["-m4_define([arrow_glib_version], 1.0.0)",
     "+m4_define([arrow_glib_version], 2.0.0-SNAPSHOT)"]],
  :path=>"c_glib/configure.ac"},
 {:hunks=>[["-version = '1.0.0'", "+version = '2.0.0-SNAPSHOT'"]],
  :path=>"c_glib/meson.build"},
 {:hunks=>[["-pkgver=1.0.0", "+pkgver=1.0.0.9000"]],
  :path=>"ci/scripts/PKGBUILD"},
 {:hunks=>
   [["-set(ARROW_VERSION \"1.0.0\")",
     "+set(ARROW_VERSION \"2.0.0-SNAPSHOT\")"]],
```

@nevi-me (Contributor) commented Sep 13, 2020

@sunchao given the age of this PR, I'd like to propose merging it if CI is green; we can make further changes in separate PRs. I suspect that if people start using the functionality we'd be able to get more eyes on the code.

@sunchao (Member) commented Sep 13, 2020

@nevi-me sounds good to me - let's do that.

@nevi-me (Contributor) commented Sep 14, 2020

@kszucs I might need help :( I believe I updated dev/release/00-prepare-test.rb correctly, but I'm still getting a test failure. I'm on Windows, so I'm not sure how I can test locally.

@kszucs (Member) commented Sep 14, 2020

Will try to take a look at it, also cc @kou

@kou (Member) commented Sep 14, 2020

Could you apply 0001-Update-parquet_derive-version.txt? (I don't know why I can't push to tureus:parquet_derive.)

```
$ git am < 0001-Update-parquet_derive-version.txt
```

@nevi-me (Contributor) commented Sep 14, 2020

> Could you apply 0001-Update-parquet_derive-version.txt? (I don't know why I can't push to tureus:parquet_derive.)
>
> $ git am < 0001-Update-parquet_derive-version.txt

It's because the repo was forked under the organisation :(

Thanks, I see what I missed now. I've pushed your patch.

@nevi-me nevi-me closed this in 90e474d Sep 14, 2020
@nevi-me (Contributor) commented Sep 14, 2020

I've merged this now, the CI failures were unrelated.

Thanks for the contribution, and apologies that we've taken this long to get this merged @xrl @bryantbiggs. I'll try to find some time to look at the comments that weren't addressed on this PR.

emkornfield pushed a commit to emkornfield/arrow that referenced this pull request Oct 16, 2020
Closes apache#4140 from xrl/parquet_derive

Lead-authored-by: Xavier Lange <xrlange@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Co-authored-by: Bryant Biggs <bryantbiggs@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
A rebase and significant rewrite of sunchao/parquet-rs#197

Big improvement: I now use a more natural nested enum style, it helps break out what patterns of data types are . The rest of the broad strokes still apply.

Goal
===

Writing many columns to a file is a chore. If you can put your values in to a struct which mirrors the schema of your file, this `derive(ParquetRecordWriter)` will write out all the fields, in the order in which they are defined, to a row_group.

How to Use
===

```
extern crate parquet;
#[macro_use] extern crate parquet_derive;

#[derive(ParquetRecordWriter)]
struct ACompleteRecord<'a> {
  pub a_bool: bool,
  pub a_str: &'a str,
}
```

RecordWriter trait
===

This is the new trait which `parquet_derive` will implement for your structs.

```
use super::RowGroupWriter;

pub trait RecordWriter<T> {
  fn write_to_row_group(&self, row_group_writer: &mut Box<RowGroupWriter>);
}
```

How does it work?
===

The `parquet_derive` crate adds code generating functionality to the rust compiler. The code generation takes rust syntax and emits additional syntax. This macro expansion works on rust 1.15+ stable. This is a dynamic plugin, loaded by the machinery in cargo. Users don't have to do any special `build.rs` steps or anything like that, it's automatic by including `parquet_derive` in their project. The `parquet_derive/src/Cargo.toml` has a section saying as much:

```
[lib]
proc-macro = true
```

The rust struct tagged with `#[derive(ParquetRecordWriter)]` is provided to the `parquet_record_writer` function in `parquet_derive/src/lib.rs`. The `syn` crate parses the struct from a string-representation to a AST (a recursive enum value). The AST contains all the values I care about when generating a `RecordWriter` impl:

 - the name of the struct
 - the lifetime variables of the struct
 - the fields of the struct

The fields of the struct are translated from AST to a flat `FieldInfo` struct. It has the bits I care about for writing a column: `field_name`, `field_lifetime`, `field_type`, `is_option`, `column_writer_variant`.
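As a sketch (the real definition lives in `parquet_derive/src/lib.rs` and its types may differ), `FieldInfo` can be pictured as a plain struct carrying those five pieces of information:

```rust
// Illustrative only: field names follow the description above; the
// actual types used in parquet_derive may differ.
#[derive(Debug, Clone)]
struct FieldInfo {
    field_name: String,             // e.g. "a_bool"
    field_lifetime: Option<String>, // e.g. Some("'a") for `&'a str`
    field_type: String,             // e.g. "bool" or "&str"
    is_option: bool,                // true for Option<T> fields
    column_writer_variant: String,  // e.g. "BoolColumnWriter"
}

fn main() {
    let info = FieldInfo {
        field_name: "a_bool".to_string(),
        field_lifetime: None,
        field_type: "bool".to_string(),
        is_option: false,
        column_writer_variant: "BoolColumnWriter".to_string(),
    };
    assert!(!info.is_option);
    println!("{:?}", info);
}
```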

The code then does the equivalent of templating to build the `RecordWriter` implementation. The templating functionality is provided by the `quote` crate. At a high-level the template for `RecordWriter` looks like:

```
impl RecordWriter for $struct_name {
  fn write_to_row_group(..) {
    $({
      $column_writer_snippet
    })
  }
}
```
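The same idea can be reduced to plain string templating (`quote!` works on token streams rather than strings, but the shape is the same; this sketch is illustrative, not the real generator):

```rust
// Generate the per-column writer snippet for one field, as a string.
// quote! does this with token streams; strings keep the sketch std-only.
fn column_snippet(field_name: &str) -> String {
    format!(
        "{{ let vals: Vec<_> = self.iter().map(|x| x.{0}).collect(); /* write_batch(&vals) */ }}",
        field_name
    )
}

fn main() {
    let fields = ["a_bool", "a2_bool"];
    // One snippet per field, concatenated into the method body.
    let body: String = fields.iter().map(|f| column_snippet(f)).collect();
    assert!(body.contains("x.a_bool"));
    assert!(body.contains("x.a2_bool"));
    println!("{}", body);
}
```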

This template is then expanded under the struct definition, ending up with something like:

```
struct MyStruct {
}
impl RecordWriter for MyStruct {
  fn write_to_row_group(..) {
    {
       write_col_1();
    };
   {
       write_col_2();
   }
  }
}
```

and finally _THIS_ is the code passed to rustc. It's just code now, fully expanded and standalone. If a user ever changes their `struct MyValue` definition, the `ParquetRecordWriter` impl will be regenerated. There are no intermediate values to version control or worry about.

Viewing the Derived Code
===

To see the generated code before it's compiled, it is very useful to install `cargo expand` ([more info on gh](https://github.com/dtolnay/cargo-expand)); then you can do:

```
cd $WORK_DIR/parquet-rs/parquet_derive_test
cargo expand --lib > ../temp.rs
```

then you can view the contents:

```
struct DumbRecord {
    pub a_bool: bool,
    pub a2_bool: bool,
}
impl RecordWriter<DumbRecord> for &[DumbRecord] {
    fn write_to_row_group(
        &self,
        row_group_writer: &mut Box<parquet::file::writer::RowGroupWriter>,
    ) {
        let mut row_group_writer = row_group_writer;
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        };
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a2_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        }
    }
}
```

Now I need to write out all the combinations of types we support and make sure it writes out data correctly.
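The shape of the generated impl (a trait implemented for a slice of records) can be exercised without the parquet crate at all. Here is a std-only mock where the "row group" is just a Vec of boolean columns; everything below is illustrative, not the real parquet API:

```rust
// Simplified stand-in for parquet's RowGroupWriter: each column is
// collected as a flat Vec<bool>.
#[derive(Default)]
struct MockRowGroup {
    columns: Vec<Vec<bool>>,
}

trait RecordWriter<T> {
    fn write_to_row_group(&self, row_group: &mut MockRowGroup);
}

struct DumbRecord {
    pub a_bool: bool,
    pub a2_bool: bool,
}

// Mirrors the derived impl above: one projection per field, in order.
impl<'a> RecordWriter<DumbRecord> for &'a [DumbRecord] {
    fn write_to_row_group(&self, row_group: &mut MockRowGroup) {
        row_group.columns.push(self.iter().map(|x| x.a_bool).collect());
        row_group.columns.push(self.iter().map(|x| x.a2_bool).collect());
    }
}

fn main() {
    let records = vec![
        DumbRecord { a_bool: true, a2_bool: false },
        DumbRecord { a_bool: true, a2_bool: true },
    ];
    let mut rg = MockRowGroup::default();
    (&records[..]).write_to_row_group(&mut rg);
    assert_eq!(rg.columns, vec![vec![true, true], vec![false, true]]);
}
```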

Procedural Macros
===

The `parquet_derive` crate can ONLY export the derivation functionality. No traits, nothing else. The derive crate cannot host test cases. It's kind of like a "dummy" crate which is only used by the compiler, never by application code.

The parent crate cannot use the derivation functionality, which is important because it means test code cannot be in the parent crate. This forces us to have a third crate, `parquet_derive_test`.
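A sketch of how the test crate might reference the other two (the paths here are illustrative, not the repository's actual layout):

```
# parquet_derive_test/Cargo.toml (illustrative)
[dependencies]
parquet = { path = "../parquet" }
parquet_derive = { path = "../parquet_derive" }
```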

I'm open to being wrong on any one of these finer points. I had to bang on this for a while to get it to compile!

Potentials For Better Design
===

 - [x] Recursion could be limited by generating the code as "snippets" instead of one big `quote!` AST generator. Or so I think. It might be nicer to push generating each column's writing code to another loop.
 - [X] ~~It would be nicer if I didn't have to be so picky about data going in to the `write_batch` function. Is it possible we could make a version of the function which accept `Into<DataType>` or similar? This would greatly simplify this derivation code as it would not need to enumerate all the supported types. Something like `write_generic_batch(&[impl Into<DataType>])` would be neat.~~ (not tackling in this generation of the plugin)
 - [X] ~~Another idea to improving writing columns, could we have a write function for `Iterator`s? I already have a `Vec<DumbRecord>`, if I could just write a mapping for accessing the one value, we could skip the whole intermediate vec for `write_batch`. Should have some significant memory advantages.~~ (not tackling in this generation of the plugin, it's a bigger parquet-rs enhancement)
 - [X] ~~It might be worthwhile to derive a parquet schema directly from a struct definition. That should stamp out opportunities for type errors.~~ (moved to apache#203)

Status
===

I have successfully integrated this work with my own data exporter (takes postgres/couchdb and outputs a single parquet file).

I think this code is worth including in the project, with the caveat that it only generates simplistic `RecordWriter`s. As people start to use it, we can add code generation for more complex, nested structs. We can convert the nested matching style to a fancier looping style. But for now, this explicit nesting is easier to debug and understand (to me at least!).

Closes apache#4140 from xrl/parquet_derive

Lead-authored-by: Xavier Lange <xrlange@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Co-authored-by: Bryant Biggs <bryantbiggs@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>