
use bufio.Writer instead of binary.Write to reduce memory footprint while serializing #103

Merged

Conversation

thanhpk (Contributor) commented Mar 16, 2022

While writing (or reading) a very large bitset (a few GB) to a file, memory usage doubles, which can crash the program.
This happens because the whole set is converted into another large []byte slice (through binary.Write).
This pull request fixes the issue by not converting the whole set ([]uint64) into a []byte at once; instead it converts the items one by one and pushes them to a buffered stream (see the sketch after the refs below).
Memory usage stays the same after calling WriteTo or ReadFrom.

Refs:
bufio is the recommended way to read and write buffered data. (https://pkg.go.dev/bufio#pkg-overview)
Looping while reading from a bufio.Reader is also quite common. (https://pkg.go.dev/bufio#example-Scanner.Bytes)
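
A minimal sketch of the approach (illustrative only, not the exact diff in this PR; the helper name and byte order are assumptions):

package sketch

import (
	"bufio"
	"encoding/binary"
	"io"
)

// writeWordsBuffered is a hypothetical helper showing the idea: stream each
// uint64 through a bufio.Writer with one small reusable 8-byte scratch buffer
// instead of building a []byte the size of the whole set.
func writeWordsBuffered(stream io.Writer, words []uint64) (int64, error) {
	writer := bufio.NewWriter(stream)
	item := make([]byte, 8) // scratch space for a single serialized uint64
	var written int64
	for _, w := range words {
		binary.LittleEndian.PutUint64(item, w) // byte order chosen only for this sketch
		n, err := writer.Write(item)
		written += int64(n)
		if err != nil {
			return written, err
		}
	}
	// Flush pushes whatever is still buffered to the underlying stream.
	return written, writer.Flush()
}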

lemire (Member) commented Mar 16, 2022

Your claim, as I understand it, is that binary.Write and binary.Read allocate buffers and are inefficient. Nothing in your analysis supports this. You are not providing any evidence that the new code will be more performant. You claim "current implementation bufio.Reader is more memory efficient than binary.Read for large set" but you do not provide evidence, or even a rationale, for why that might be.

You need to provide strong and detailed evidence. Our code is default textbook code. Before we adopt something more complicated, we need to have a deep understanding and much evidence.

lemire closed this Mar 16, 2022

thanhpk (Contributor, Author) commented Mar 17, 2022

Looking at the implementation of binary.Write, it is clear that it allocates another []byte of the same size as the input:
https://github.com/golang/go/blob/66865363f017a8d4cb0b07d84a3a6117fcf1cd30/src/encoding/binary/binary.go#L341
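
As a quick standalone check (not part of this PR), the extra allocation can be observed with runtime.ReadMemStats; writing to io.Discard still forces binary.Write to build its internal buffer:

package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"runtime"
)

func main() {
	data := make([]uint64, 1<<24) // 128 MiB worth of uint64s
	runtime.GC()
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	// io.Discard throws the bytes away, yet binary.Write still allocates
	// a []byte covering the entire slice before writing it.
	if err := binary.Write(io.Discard, binary.LittleEndian, data); err != nil {
		panic(err)
	}
	runtime.ReadMemStats(&after)
	// Grows by roughly len(data)*8 bytes, i.e. the size of the temporary buffer.
	fmt.Println("bytes allocated by the call:", after.TotalAlloc-before.TotalAlloc)
}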

I ran a benchmark to see how much binary.Write allocates compared to a bufio.Writer loop:

$ go test -bench=. -benchmem
goos: linux
goarch: amd64
pkg: github.com/thanhpk/testbitset
cpu: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
BenchmarkBinaryWrite-12    	       1	2727863238 ns/op	3200014368 B/op	      12 allocs/op
BenchmarkBufioWriter-12    	       1	3783645077 ns/op	1600008360 B/op	       9 allocs/op
PASS
ok  	github.com/thanhpk/testbitset	6.922s

As you can see, the binary.Write version is faster but uses 2x the memory. Note that this is only true for large data.
Here is the code for the benchmark:

package main

import (
	"bufio"
	"encoding/binary"
	"os"
	"testing"
)

func BenchmarkBinaryWrite(b *testing.B) {
	f, err := os.OpenFile("./binary.dat", os.O_WRONLY|os.O_CREATE, 0600)
	checkErr(err)
	writer := bufio.NewWriter(f)

	data := make([]uint64, 200000000) // 1.5GB

	for i := 0; i < b.N; i++ {
		err = binary.Write(writer, binary.LittleEndian, data)
		checkErr(err)
	}

	err = writer.Flush()
	checkErr(err)

	err = os.Remove("./binary.dat")
	checkErr(err)
}

func BenchmarkBufioWriter(b *testing.B) {
	f, err := os.OpenFile("./buffio.dat", os.O_WRONLY|os.O_CREATE, 0600)
	checkErr(err)

	writer := bufio.NewWriter(f)

	data := make([]uint64, 200000000) // 1.5GB

	for i := 0; i < b.N; i++ {
		var item = make([]byte, 8) // for serializing uint64
		for _, x := range data {
			binary.LittleEndian.PutUint64(item, x)
			_, err := writer.Write(item)
			checkErr(err)
		}
	}

	err = writer.Flush()
	checkErr(err)

	err = os.Remove("./buffio.dat")
	checkErr(err)
}

func checkErr(err error) {
	if err != nil {
		panic(err)
	}
}

lemire (Member) commented Mar 17, 2022

Thanks.

@josharian : can you comment?

lemire reopened this Mar 17, 2022
josharian commented Mar 17, 2022

I'm fighting some fires on other fronts, so this is based on only a cursory look...

Yes, binary.Write allocates the entire buffer up front. This is arguably a bug: It could do internal batching for large slices. I'm not sure how much excitement there'd be upstream for fixing it.

The proposed replacement seems fine, in broad strokes. bufio.Writer doesn't really add all that much, since the relevant sizes can be calculated in advance; you could manage a single buffer yourself just as easily. Or if the only goal is to avoid gigantic allocs, you could call binary.Write chunk by chunk. But this also seems reasonable enough.

Let me know if you'd like me to take a more careful look at the code.
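
For concreteness, the chunk-by-chunk alternative mentioned above could look roughly like this (a sketch, not code from this PR; the helper name and chunk size are arbitrary):

package sketch

import (
	"encoding/binary"
	"io"
)

// writeChunked is a hypothetical illustration of the chunk-by-chunk idea:
// each binary.Write call only materializes chunkWords*8 bytes (512 KiB here)
// instead of a temporary buffer the size of the entire set.
func writeChunked(w io.Writer, words []uint64) error {
	const chunkWords = 64 * 1024
	for start := 0; start < len(words); start += chunkWords {
		end := start + chunkWords
		if end > len(words) {
			end = len(words)
		}
		if err := binary.Write(w, binary.LittleEndian, words[start:end]); err != nil {
			return err
		}
	}
	return nil
}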

bitset.go Outdated
// binary.Write for large set
writer := bufio.NewWriter(stream)
var item = make([]byte, 8) // for serializing uint64
var n = binary.Size(uint64(0)) // number of bytes written

lemire (Member) commented:

That seems odd, doesn't it? You assume that the size is 8, and then you next compute it? Do I get this right?
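
One way to address this (an illustration only, with assumed names, not the PR's final code): derive the word size once and reuse it for both the scratch buffer and the byte accounting.

package sketch

import "encoding/binary"

// wordBytes computes the serialized size of a uint64 once, so the scratch
// buffer and any byte counting share the same value instead of assuming 8
// in one place and recomputing it in another.
var wordBytes = binary.Size(uint64(0)) // 8

func newScratch() []byte {
	return make([]byte, wordBytes)
}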

bitset.go Outdated
if err != nil {
return int64(n+nn), err
}
n += nn

lemire (Member) commented:

Why would you do all of these additions?

bitset.go Outdated
reader := bufio.NewReader(stream)
i := 0
var item = make([]byte, 8) // one uint64
for {

lemire (Member) commented:

Why not go with the standard...

for i:=0 ....
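
For illustration, the conventional indexed loop could read roughly like this (a sketch with assumed names, not the PR's final code):

package sketch

import (
	"bufio"
	"encoding/binary"
	"io"
)

// readWordsIndexed is a hypothetical sketch of the indexed reading loop:
// io.ReadFull reads exactly one 8-byte word per iteration, and the index
// bounds the loop instead of an open-ended for {} with manual bookkeeping.
func readWordsIndexed(stream io.Reader, words []uint64) (int64, error) {
	reader := bufio.NewReader(stream)
	item := make([]byte, 8) // scratch space for one uint64
	var n int64
	for i := 0; i < len(words); i++ {
		nn, err := io.ReadFull(reader, item)
		n += int64(nn)
		if err != nil {
			return n, err
		}
		words[i] = binary.LittleEndian.Uint64(item)
	}
	return n, nil
}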

lemire (Member) commented Mar 17, 2022

@thanhpk Let us push this forward. Please have a look at my comments.

@josharian

Yes, binary.Write allocates the entire buffer up front. This is arguably a bug: It could do internal batching for large slices. I'm not sure how much excitement there'd be upstream for fixing it.

It is disappointing code.

josharian commented:

I suspect that the upstream response is: "binary.Write is not meant for giant data structures; for very large things you probably want enough control that you should write it yourself". But don't take my word for it: File an issue. I don't recall having seen anything specifically about this; the closest I recall are golang/go#27757 and golang/go#27403.

thanhpk (Contributor, Author) commented Mar 18, 2022

@lemire I fixed the issues you commented on.

lemire merged commit 117140e into bits-and-blooms:master Mar 18, 2022
lemire (Member) commented Mar 18, 2022

Merged.
