[Go] Document pqarrow.FileWriter.WriteBuffered & pqarrow.FileWriter.Write performance-wise #36095

Closed
candiduslynx opened this issue Jun 15, 2023 · 13 comments · Fixed by #36163

Comments

@candiduslynx
Contributor

Describe the enhancement requested

A follow-up to cloudquery/filetypes#203.

Basically, we saw a drastic improvement after switching to buffered mode, so maybe making Write buffered by default and introducing a WriteUnbuffered func that behaves like the current Write would be a better way?

Component(s)

Go

@candiduslynx
Contributor Author

cc: @zeroshade

@zeroshade
Member

I'd be curious why WriteBuffered was more performant than unbuffered, and whether you'd be better served by just using https://pkg.go.dev/bufio#Writer as the underlying io.Writer rather than using WriteBuffered for parquet (or by using larger / different record sizes).

In general, unbuffered writing is going to be significantly more memory efficient unless you're essentially trying to collect multiple records into a single row group, or otherwise need to write to multiple columns before actually writing them to the parquet file.
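
For reference, here's a minimal sketch of that suggestion, wrapping the destination in a bufio.Writer before constructing the pqarrow writer (the helper name, buffer size, and v13 module path are illustrative assumptions, not taken from the benchmark code):

```go
package parquetbuf

import (
	"bufio"
	"os"

	"github.com/apache/arrow/go/v13/arrow"
	"github.com/apache/arrow/go/v13/parquet"
	"github.com/apache/arrow/go/v13/parquet/pqarrow"
)

// newBufferedFileWriter wraps the output file in a bufio.Writer before
// handing it to pqarrow. Note that bufio only batches the bytes going to
// the OS; it doesn't change how records are grouped into parquet row
// groups, which is what Write vs WriteBuffered controls.
func newBufferedFileWriter(schema *arrow.Schema, path string) (*pqarrow.FileWriter, *bufio.Writer, error) {
	f, err := os.Create(path)
	if err != nil {
		return nil, nil, err
	}
	bw := bufio.NewWriterSize(f, 1<<20) // 1 MiB buffer; size is arbitrary
	fw, err := pqarrow.NewFileWriter(schema, bw,
		parquet.NewWriterProperties(), pqarrow.DefaultWriterProps())
	if err != nil {
		f.Close()
		return nil, nil, err
	}
	// The caller must fw.Close(), then bw.Flush(), then close the file.
	return fw, bw, nil
}
```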

@candiduslynx
Contributor Author

candiduslynx commented Jun 15, 2023

@zeroshade we already used bufio.Writer in the benchmark: https://github.com/cloudquery/filetypes/blob/main/parquet/write_read_test.go#L86

I think the main issue is that each record gets its own row group, which adds considerable overhead for a large number of records.
In our benchmark tests (see the PR description) we generated 10K records and then wrote them.

I opened this issue to discuss the performance and, possibly, to make the better-performing option the default or to fix the default option.
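
For concreteness, a sketch of the two write loops being compared; fw is a *pqarrow.FileWriter and recs a slice of small records (the function names are made up):

```go
import (
	"github.com/apache/arrow/go/v13/arrow"
	"github.com/apache/arrow/go/v13/parquet/pqarrow"
)

// Write starts at least one new row group per call, so N one-row records
// produce N row groups: low memory use, high per-group overhead.
func writeUnbuffered(fw *pqarrow.FileWriter, recs []arrow.Record) error {
	for _, rec := range recs {
		if err := fw.Write(rec); err != nil {
			return err
		}
	}
	return fw.Close()
}

// WriteBuffered accumulates rows in a buffered row group (flushed at
// MaxRowGroupLength or on Close), so small records get aggregated into
// far fewer row groups at the cost of holding them in memory.
func writeBuffered(fw *pqarrow.FileWriter, recs []arrow.Record) error {
	for _, rec := range recs {
		if err := fw.WriteBuffered(rec); err != nil {
			return err
		}
	}
	return fw.Close()
}
```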

@zeroshade
Member

That makes sense to me, you'd want more than 10k rows per row-group. Personally I don't want to change the default out from under people given the semantic difference that would happen by changing this default.

@candiduslynx
Contributor Author

> That makes sense to me, you'd want more than 10k rows per row-group. Personally I don't want to change the default out from under people given the semantic difference that would happen by changing this default.

@zeroshade I tried with 100 records (each with a single row), and the difference still persists:

go test \
  -test.run=BenchmarkWrite \
  -test.bench=BenchmarkWrite \
  -test.count 10 -test.benchmem -test.benchtime 100x
Write
goos: darwin
goarch: arm64
pkg: github.com/cloudquery/filetypes/v3/parquet
BenchmarkWrite-10            100           3579741 ns/op         5923408 B/op      45296 allocs/op
BenchmarkWrite-10            100           3579718 ns/op         5923500 B/op      45296 allocs/op
BenchmarkWrite-10            100           3542423 ns/op         5923485 B/op      45296 allocs/op
BenchmarkWrite-10            100           3539510 ns/op         5923228 B/op      45294 allocs/op
BenchmarkWrite-10            100           3582708 ns/op         5923466 B/op      45295 allocs/op
BenchmarkWrite-10            100           3631385 ns/op         5923999 B/op      45297 allocs/op
BenchmarkWrite-10            100           3615708 ns/op         5923475 B/op      45296 allocs/op
BenchmarkWrite-10            100           3573869 ns/op         5923664 B/op      45297 allocs/op
BenchmarkWrite-10            100           3625930 ns/op         5923538 B/op      45296 allocs/op
BenchmarkWrite-10            100           3643602 ns/op         5923328 B/op      45294 allocs/op
PASS
ok      github.com/cloudquery/filetypes/v3/parquet      4.667s
WriteBuffered
goos: darwin
goarch: arm64
pkg: github.com/cloudquery/filetypes/v3/parquet
BenchmarkWrite-10            100           1063981 ns/op         1288338 B/op      17082 allocs/op
BenchmarkWrite-10            100           1067547 ns/op         1288197 B/op      17083 allocs/op
BenchmarkWrite-10            100           1034933 ns/op         1288312 B/op      17082 allocs/op
BenchmarkWrite-10            100           1044867 ns/op         1288250 B/op      17081 allocs/op
BenchmarkWrite-10            100           1020637 ns/op         1288325 B/op      17081 allocs/op
BenchmarkWrite-10            100           1013818 ns/op         1288170 B/op      17083 allocs/op
BenchmarkWrite-10            100           1028132 ns/op         1288359 B/op      17083 allocs/op
BenchmarkWrite-10            100           1012284 ns/op         1288004 B/op      17082 allocs/op
BenchmarkWrite-10            100           1009228 ns/op         1288231 B/op      17082 allocs/op
BenchmarkWrite-10            100           1038570 ns/op         1288419 B/op      17082 allocs/op
PASS
ok      github.com/cloudquery/filetypes/v3/parquet      2.183s

@zeroshade
Member

The difference makes sense to me; as you pointed out, it's likely due to the overhead of having many separate small row groups rather than one big row group. One suggestion I'd make is to gather the records into an arrow.Table and try using WriteTable (with some large chunk size) to see what the performance is like, as it uses buffered writing internally.
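
Something along the lines of this sketch (the 64K chunk size is an arbitrary illustration):

```go
import (
	"github.com/apache/arrow/go/v13/arrow"
	"github.com/apache/arrow/go/v13/arrow/array"
	"github.com/apache/arrow/go/v13/parquet/pqarrow"
)

// writeAsTable gathers the accumulated records into one arrow.Table and
// lets WriteTable slice it into row groups of up to chunkSize rows each.
func writeAsTable(fw *pqarrow.FileWriter, schema *arrow.Schema, recs []arrow.Record) error {
	tbl := array.NewTableFromRecords(schema, recs)
	defer tbl.Release()
	return fw.WriteTable(tbl, 64*1024) // chunkSize: rows per row group
}
```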

My personal opinion is still that I wouldn't necessarily want to change the default behavior out from under people. We could, however, improve the documentation and comments on the respective functions.

Alternatively, another solution might be a Write method that chooses buffered or non-buffered writing based on whether the consumer has called NewRowGroup or NewBufferedRowGroup themselves, writes up to the max row-group length, and returns how many rows were written, how many were left over, or some other indication. That would give consumers more granular control over the writing.
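
Roughly that kind of granular control is already possible by calling NewBufferedRowGroup manually; a hypothetical sketch (writeCapped and maxRows are made-up names):

```go
// writeCapped starts a fresh buffered row group whenever adding the next
// record would push the current group past maxRows.
func writeCapped(fw *pqarrow.FileWriter, recs []arrow.Record, maxRows int64) error {
	var rows int64
	for _, rec := range recs {
		if rows > 0 && rows+rec.NumRows() > maxRows {
			fw.NewBufferedRowGroup() // flush current group, begin a new one
			rows = 0
		}
		if err := fw.WriteBuffered(rec); err != nil {
			return err
		}
		rows += rec.NumRows()
	}
	return fw.Close()
}
```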

@kou kou changed the title Make WriteBuffered a default [Go] Make WriteBuffered a default Jun 15, 2023
@candiduslynx
Contributor Author

I think the doc change would be fine, so that users will know about the different modes and can choose a write mode with enough info in mind.

@zeroshade
Member

Would you be up for adding a PR for this? 😄

@candiduslynx
Contributor Author

🤷‍♂️
Do you have specific contents you want to put there?
Should it be for both Write & WriteBuffered methods?

@zeroshade
Member

We should definitely include updated docs for both Write and WriteBuffered (I believe WriteBuffered currently doesn't actually have a doc string associated with it).

As for specific contents, I'd say go for mentioning the pros and cons of one over the other when dealing with record batches. If your record batches are significantly large (i.e. you want row groups to have roughly the same layout as your records), then Write makes sense. Alternatively, if you're getting lots of smaller records that you want aggregated into a larger row group, then you'll need WriteBuffered, which will have higher memory usage but better write performance (and likely better read performance too).
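
For example, doc strings along these lines would capture that trade-off (a paraphrased sketch, not the wording that eventually merged in #36163):

```go
// Write writes the record to the parquet file. Each call starts at least
// one new row group, so memory usage stays low; best when records are
// large and should map roughly one-to-one onto row groups.

// WriteBuffered appends the record's rows to the current buffered row
// group, flushing only when MaxRowGroupLength is reached or the writer is
// closed; best when many small records should be aggregated into fewer,
// larger row groups, at the cost of higher memory usage.
```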

@candiduslynx
Contributor Author

I think for small records (with a small number of rows) the buffered version has a smaller memory footprint too, as shown by our tests.
I'll try to perform some additional testing on the number of rows it takes for the two to match, so that can be included in the docs as well.

@candiduslynx candiduslynx changed the title [Go] Make WriteBuffered a default [Go] Document pqarrow.FileWriter.WriteBuffered & pqarrow.FileWriter.Write performance-wise Jun 19, 2023
@candiduslynx
Contributor Author

I've checked that even at 100 rows/record the buffered mode behaves better than the simple write; I wonder if the scale needs to be in the 1000s for them to match.

@zeroshade
Member

I'd expect it would most likely need to be thousands or even tens of thousands of rows for non-buffered to be better.

zeroshade pushed a commit that referenced this issue Jun 20, 2023
### Rationale for this change

Docs to help people decide on the best-performing option.

### What changes are included in this PR?

Doc change only.

### Are these changes tested?

N/A

### Are there any user-facing changes?

Doc

* Closes: #36095

Authored-by: candiduslynx <candiduslynx@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
@zeroshade zeroshade added this to the 13.0.0 milestone Jun 20, 2023