[Go] Document `pqarrow.FileWriter.WriteBuffered` & `pqarrow.FileWriter.Write` performance-wise #36095
cc: @zeroshade
In general, the unbuffered writing is going to be significantly more memory efficient unless you're trying to essentially collect multiple records into a single row group, or otherwise need to write to multiple columns before actually writing it to the parquet file.
@zeroshade I think the main issue is that each record gets its own row group, which adds quite an overhead for a large number of records. I opened this issue to discuss the performance and, possibly, to make the better-performing option the default or to fix the default option.
That makes sense to me; you'd want more than 10k rows per row group. Personally, I don't want to change the default out from under people, given the semantic difference that changing it would cause.
@zeroshade I tried with 100 records (each with a single row), and the difference still persists (benchmarking `Write` vs. `WriteBuffered`):

```shell
go test \
  -test.run=BenchmarkWrite \
  -test.bench=BenchmarkWrite \
  -test.count 10 -test.benchmem -test.benchtime 100x
```
The difference makes sense to me; as you pointed out, it's likely due to the overhead of having many separate small row groups rather than one big row group. One suggestion I'd make is to gather the records into a larger batch before writing them out. My personal opinion is still that I wouldn't necessarily want to change the default behavior out from under people. We could, however, improve the documentation and comments on the respective functions. Alternately, another solution might be a Write method that chooses buffered or non-buffered writing based on whether a consumer explicitly starts a buffered row group first.
I think the doc change would be fine, so that users will know about the different modes and can choose a write mode with enough info in mind.
Would you be up for adding a PR for this? 😄 |
🤷♂️ |
We should definitely include updated docs for both Write and WriteBuffered (I believe WriteBuffered currently doesn't actually have a doc string associated with it). As for specific contents, I say go for mentioning the pros and cons of one over the other when dealing with record batches. If your record batches are significantly large (i.e. you want row groups to be roughly the same layout as your records), then the non-buffered Write is likely the better choice.
I think for small records (with a small number of rows) the buffered version will have a smaller memory footprint, too, as shown by our tests.
I've checked that even at 100 rows/record the buffered mode behaves better than the simple write; I wonder if the scale needs to be in the 1000s for them to match.
I'd expect it to need to be thousands or even tens of thousands of rows for non-buffered to come out ahead, most likely.
### Rationale for this change

Docs to help people decide on the best-performing option.

### What changes are included in this PR?

Doc change only.

### Are these changes tested?

N/A

### Are there any user-facing changes?

Doc

* Closes: #36095

Authored-by: candiduslynx <candiduslynx@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Describe the enhancement requested

A follow-up to cloudquery/filetypes#203. Basically, we saw a drastic improvement after switching to a buffered mode, so maybe making `Write` be buffered and introducing a `WriteUnbuffered` func to perform as the current `Write` is a better way?

Component(s)

Go