[Go] Document pqarrow.FileWriter.WriteBuffered & pqarrow.FileWriter.Write performance-wise #36095

Closed
candiduslynx opened this issue Jun 15, 2023 · 13 comments · Fixed by #36163

Comments

@candiduslynx
Contributor

Describe the enhancement requested

A follow-up to cloudquery/filetypes#203.

Basically, we saw a drastic improvement after switching to buffered mode, so maybe making Write buffered by default and introducing a WriteUnbuffered func that behaves like the current Write would be a better way?

Component(s)

Go

@candiduslynx
Contributor Author

cc: @zeroshade

@zeroshade
Member

I'd be curious why WriteBuffered was more performant than unbuffered, and whether you'd be better served by just using https://pkg.go.dev/bufio#Writer as the underlying io.Writer rather than using WriteBuffered for parquet (or by using larger / different record sizes).

In general, unbuffered writing is going to be significantly more memory efficient unless you're essentially trying to collect multiple records into a single row group, or otherwise need to write to multiple columns before actually writing them to the parquet file.
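
For reference, here's a minimal sketch of that suggestion, wrapping the destination in a bufio.Writer before constructing the pqarrow writer (the helper name, buffer size, and v13 module path are illustrative assumptions, not taken from the benchmark code):

```go
package parquetbuf

import (
	"bufio"
	"os"

	"github.com/apache/arrow/go/v13/arrow"
	"github.com/apache/arrow/go/v13/parquet"
	"github.com/apache/arrow/go/v13/parquet/pqarrow"
)

// newBufferedFileWriter wraps the output file in a bufio.Writer before
// handing it to pqarrow. Note that bufio only batches the bytes going to
// the OS; it doesn't change how records are grouped into parquet row
// groups, which is what Write vs WriteBuffered controls.
func newBufferedFileWriter(schema *arrow.Schema, path string) (*pqarrow.FileWriter, *bufio.Writer, error) {
	f, err := os.Create(path)
	if err != nil {
		return nil, nil, err
	}
	bw := bufio.NewWriterSize(f, 1<<20) // 1 MiB buffer; size is arbitrary
	fw, err := pqarrow.NewFileWriter(schema, bw,
		parquet.NewWriterProperties(), pqarrow.DefaultWriterProps())
	if err != nil {
		f.Close()
		return nil, nil, err
	}
	// The caller must fw.Close(), then bw.Flush(), then close the file.
	return fw, bw, nil
}
```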

@candiduslynx
Contributor Author

candiduslynx commented Jun 15, 2023

@zeroshade we already used bufio.Writer in the benchmark: https://github.com/cloudquery/filetypes/blob/main/parquet/write_read_test.go#L86

I think the main issue is that each record gets its own row group, which adds considerable overhead for a large number of records.
In our benchmark tests (see the PR description) we generated 10K records and then wrote them.

I opened this issue to discuss the performance and, possibly, to make the better-performing option the default or to fix the default option.
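
For concreteness, a sketch of the two write loops being compared; fw is a *pqarrow.FileWriter and recs a slice of small records (the function names are made up):

```go
import (
	"github.com/apache/arrow/go/v13/arrow"
	"github.com/apache/arrow/go/v13/parquet/pqarrow"
)

// Write starts at least one new row group per call, so N one-row records
// produce N row groups: low memory use, high per-group overhead.
func writeUnbuffered(fw *pqarrow.FileWriter, recs []arrow.Record) error {
	for _, rec := range recs {
		if err := fw.Write(rec); err != nil {
			return err
		}
	}
	return fw.Close()
}

// WriteBuffered accumulates rows in a buffered row group (flushed at
// MaxRowGroupLength or on Close), so small records get aggregated into
// far fewer row groups at the cost of holding them in memory.
func writeBuffered(fw *pqarrow.FileWriter, recs []arrow.Record) error {
	for _, rec := range recs {
		if err := fw.WriteBuffered(rec); err != nil {
			return err
		}
	}
	return fw.Close()
}
```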

@zeroshade
Member

That makes sense to me, you'd want more than 10k rows per row-group. Personally I don't want to change the default out from under people given the semantic difference that would happen by changing this default.

@candiduslynx
Contributor Author

> That makes sense to me, you'd want more than 10k rows per row-group. Personally I don't want to change the default out from under people given the semantic difference that would happen by changing this default.

@zeroshade I tried with 100 records (each with a single row), and the difference still persists:

go test \
  -test.run=BenchmarkWrite \
  -test.bench=BenchmarkWrite \
  -test.count 10 -test.benchmem -test.benchtime 100x
Write
goos: darwin
goarch: arm64
pkg: github.com/cloudquery/filetypes/v3/parquet
BenchmarkWrite-10            100           3579741 ns/op         5923408 B/op      45296 allocs/op
BenchmarkWrite-10            100           3579718 ns/op         5923500 B/op      45296 allocs/op
BenchmarkWrite-10            100           3542423 ns/op         5923485 B/op      45296 allocs/op
BenchmarkWrite-10            100           3539510 ns/op         5923228 B/op      45294 allocs/op
BenchmarkWrite-10            100           3582708 ns/op         5923466 B/op      45295 allocs/op
BenchmarkWrite-10            100           3631385 ns/op         5923999 B/op      45297 allocs/op
BenchmarkWrite-10            100           3615708 ns/op         5923475 B/op      45296 allocs/op
BenchmarkWrite-10            100           3573869 ns/op         5923664 B/op      45297 allocs/op
BenchmarkWrite-10            100           3625930 ns/op         5923538 B/op      45296 allocs/op
BenchmarkWrite-10            100           3643602 ns/op         5923328 B/op      45294 allocs/op
PASS
ok      github.com/cloudquery/filetypes/v3/parquet      4.667s
WriteBuffered
goos: darwin
goarch: arm64
pkg: github.com/cloudquery/filetypes/v3/parquet
BenchmarkWrite-10            100           1063981 ns/op         1288338 B/op      17082 allocs/op
BenchmarkWrite-10            100           1067547 ns/op         1288197 B/op      17083 allocs/op
BenchmarkWrite-10            100           1034933 ns/op         1288312 B/op      17082 allocs/op
BenchmarkWrite-10            100           1044867 ns/op         1288250 B/op      17081 allocs/op
BenchmarkWrite-10            100           1020637 ns/op         1288325 B/op      17081 allocs/op
BenchmarkWrite-10            100           1013818 ns/op         1288170 B/op      17083 allocs/op
BenchmarkWrite-10            100           1028132 ns/op         1288359 B/op      17083 allocs/op
BenchmarkWrite-10            100           1012284 ns/op         1288004 B/op      17082 allocs/op
BenchmarkWrite-10            100           1009228 ns/op         1288231 B/op      17082 allocs/op
BenchmarkWrite-10            100           1038570 ns/op         1288419 B/op      17082 allocs/op
PASS
ok      github.com/cloudquery/filetypes/v3/parquet      2.183s

@zeroshade
Member

The difference makes sense to me; as you pointed out, it's likely due to the overhead of having many separate small row groups rather than one big row group. One suggestion I'd make is to gather the records into an arrow.Table and try using WriteTable (with some large chunk size) to see what the performance is like, as it uses buffered writing internally.
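
Something along the lines of this sketch (the 64K chunk size is an arbitrary illustration):

```go
import (
	"github.com/apache/arrow/go/v13/arrow"
	"github.com/apache/arrow/go/v13/arrow/array"
	"github.com/apache/arrow/go/v13/parquet/pqarrow"
)

// writeAsTable gathers the accumulated records into one arrow.Table and
// lets WriteTable slice it into row groups of up to chunkSize rows each.
func writeAsTable(fw *pqarrow.FileWriter, schema *arrow.Schema, recs []arrow.Record) error {
	tbl := array.NewTableFromRecords(schema, recs)
	defer tbl.Release()
	return fw.WriteTable(tbl, 64*1024) // chunkSize: rows per row group
}
```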

My personal opinion is still that I wouldn't necessarily want to change the default behavior out from under people. We could, however, improve the documentation and comments on the respective functions.

Alternatively, another solution might be a Write method that chooses buffered or non-buffered writing based on whether the consumer has called NewRowGroup or NewBufferedRowGroup themselves, writes up to the max row-group length, and returns how many rows were written, how many were left over, or some other indication. That would give consumers more granular control over the writing.
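
Roughly that kind of granular control is already possible by calling NewBufferedRowGroup manually; a hypothetical sketch (writeCapped and maxRows are made-up names):

```go
// writeCapped starts a fresh buffered row group whenever adding the next
// record would push the current group past maxRows.
func writeCapped(fw *pqarrow.FileWriter, recs []arrow.Record, maxRows int64) error {
	var rows int64
	for _, rec := range recs {
		if rows > 0 && rows+rec.NumRows() > maxRows {
			fw.NewBufferedRowGroup() // flush current group, begin a new one
			rows = 0
		}
		if err := fw.WriteBuffered(rec); err != nil {
			return err
		}
		rows += rec.NumRows()
	}
	return fw.Close()
}
```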

@kou kou changed the title Make WriteBuffered a default [Go] Make WriteBuffered a default Jun 15, 2023
@candiduslynx
Contributor Author

I think the doc change would be fine, so that users will know about the different modes and can choose a write mode with enough info in mind.

@zeroshade
Member

Would you be up for adding a PR for this? 😄

@candiduslynx
Contributor Author

🤷‍♂️
Do you have specific contents you want to put there?
Should it be for both Write & WriteBuffered methods?

@zeroshade
Member

We should definitely include updated docs for both Write and WriteBuffered (I believe WriteBuffered currently doesn't actually have a doc string associated with it).

As for specific contents, I'd say go for mentioning the pros and cons of one over the other when dealing with record batches. If your record batches are significantly large (i.e. you want row groups to have roughly the same layout as your records), then Write makes sense. Alternatively, if you're getting lots of smaller records that you want aggregated into a larger row group, then you'll need WriteBuffered, which will have higher memory usage but better write performance (and likely better read performance too).
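
For example, doc strings along these lines would capture that trade-off (a paraphrased sketch, not the wording that eventually merged in #36163):

```go
// Write writes the record to the parquet file. Each call starts at least
// one new row group, so memory usage stays low; best when records are
// large and should map roughly one-to-one onto row groups.

// WriteBuffered appends the record's rows to the current buffered row
// group, flushing only when MaxRowGroupLength is reached or the writer is
// closed; best when many small records should be aggregated into fewer,
// larger row groups, at the cost of higher memory usage.
```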

@candiduslynx
Contributor Author

I think for small records (with a small number of rows) the buffered version has a smaller memory footprint too, as shown by our tests.
I'll try to perform some additional testing on the number of rows it takes for the two to match, so that can be included in the docs as well.

@candiduslynx candiduslynx changed the title [Go] Make WriteBuffered a default [Go] Document pqarrow.FileWriter.WriteBuffered & pqarrow.FileWriter.Write performance-wise Jun 19, 2023
@candiduslynx
Contributor Author

I've checked that even at 100 rows/record the buffered mode behaves better than the simple write; I wonder if the scale needs to be in the 1000s for them to match.

@zeroshade
Member

I'd expect it would most likely need to be thousands or even tens of thousands of rows for non-buffered to be better.

zeroshade pushed a commit that referenced this issue Jun 20, 2023
### Rationale for this change

Docs to help people decide on the best-performing option.

### What changes are included in this PR?

Doc change only.

### Are these changes tested?

N/A

### Are there any user-facing changes?

Doc

* Closes: #36095

Authored-by: candiduslynx <candiduslynx@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
@zeroshade zeroshade added this to the 13.0.0 milestone Jun 20, 2023