
Less write lock contention, better read timeout handling #97

Merged
merged 6 commits on Mar 6, 2023

Conversation

@magik6k (Collaborator) commented Mar 1, 2023

This PR aims to improve/fix filecoin-project/lotus#8362

  • Reduces contention on the write lock - it is now held only while we're actually sending data; response marshaling happens asynchronously
  • Makes sure the read deadline is reset even if a single frame takes a long time to read
  • Makes frame unmarshaling not block data reading unless we're severely backlogged
  • Adds pprof labels to goroutines handling jsonrpc calls, so if we need to debug this further it's possible to tell clients apart in goroutine dumps (a sketch of the labeling pattern follows below).
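
For illustration, a minimal sketch of the goroutine-labeling pattern referred to in the last bullet, using the standard runtime/pprof API; the package name, label keys, and handleClient/serveRequests functions are illustrative, not necessarily what this PR uses:

```go
package rpcdemo // illustrative package name

import (
	"context"
	"runtime/pprof"
)

// handleClient is a hypothetical per-connection entry point. pprof.Do attaches
// the given labels to the current goroutine and to goroutines spawned inside
// the callback, so individual clients can be told apart in goroutine dumps.
func handleClient(ctx context.Context, remoteAddr string) {
	labels := pprof.Labels("jsonrpc-role", "handler", "jsonrpc-remote", remoteAddr)
	pprof.Do(ctx, labels, func(ctx context.Context) {
		serveRequests(ctx) // hypothetical request loop for this client
	})
}

func serveRequests(ctx context.Context) {
	// ... read frames, dispatch calls ...
}
```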

@magik6k force-pushed the fix/less-write-lk-contention branch from 2010bd6 to 07ab412 on March 1, 2023 at 12:48
@magik6k force-pushed the fix/less-write-lk-contention branch from 00105b3 to ae30cc0 on March 1, 2023 at 22:35
@magik6k changed the title from "handler: Less write lock contention" to "Less write lock contention, better read timeout handling" on Mar 2, 2023
rpc_test.go
for n := range ch {
	fmt.Println("received")
	if n != prevN+1 {
		panic("bad order")


Consider using testing.T methods to fail from this goroutine instead of panicking.

@magik6k (Collaborator, Author) replied:

Technically those shouldn't be used in a goroutine that isn't running the test; panic was good enough here.
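
For reference, a common alternative (not what this PR ended up doing) is to hand failures back to the test goroutine over a channel so that only the goroutine running the test calls the testing.T fail methods; a minimal self-contained sketch with illustrative names:

```go
package rpctest // illustrative

import (
	"fmt"
	"testing"
)

func TestOrderedDelivery(t *testing.T) {
	// Buffered so sends below never block even if the checker exits early.
	ch := make(chan int, 3)
	errCh := make(chan error, 1)

	// The checking goroutine never touches testing.T; it only reports over errCh.
	go func() {
		prevN := 0
		for n := range ch {
			if n != prevN+1 {
				errCh <- fmt.Errorf("bad order: got %d after %d", n, prevN)
				return
			}
			prevN = n
		}
		errCh <- nil
	}()

	for i := 1; i <= 3; i++ {
		ch <- i
	}
	close(ch)

	// The test goroutine is the only one calling Fatal.
	if err := <-errCh; err != nil {
		t.Fatal(err)
	}
}
```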

	// json.NewDecoder(r).Decode would read the whole frame as well, so we might as well do it
	// with ReadAll, which should be much faster.
	// Use an autoResetReader in case the read takes a long time.
	buf, err := io.ReadAll(c.autoResetReader(r)) // todo: buffer pool


Ready to merge without the buffer pool?

@magik6k (Collaborator, Author) replied:

Yeah, that's not really the point of the PR, and can be done separately
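
For what it's worth, here is a sketch of the kind of follow-up the `// todo` refers to: reusing frame buffers through a sync.Pool instead of letting io.ReadAll allocate a fresh slice per frame. The package, pool, and function names are illustrative; this is not code from the PR.

```go
package rpcdemo // illustrative

import (
	"bytes"
	"io"
	"sync"
)

// framePool hands out reusable buffers for incoming frames.
var framePool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// readFrame drains r into a pooled buffer. The caller must call putFrame
// once it is done with the returned buffer's bytes.
func readFrame(r io.Reader) (*bytes.Buffer, error) {
	buf := framePool.Get().(*bytes.Buffer)
	buf.Reset()
	if _, err := buf.ReadFrom(r); err != nil {
		framePool.Put(buf)
		return nil, err
	}
	return buf, nil
}

func putFrame(buf *bytes.Buffer) { framePool.Put(buf) }
```

The usual caveat applies: very large frames can pin large buffers in the pool, so a size cap before Put is often added.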

	}

	// got the whole frame, so we can start reading the next one in the background
	go c.nextMessage()
@Kubuxu (Contributor) commented Mar 6, 2023:

The fact that we are starting goroutines only to kill them and start again bothers me a bit, but it shouldn't matter apart from making goroutine numbers annoying to deal with.
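
To make the trade-off concrete, a small sketch of the persistent-reader shape this comment alludes to, where no goroutines are created and torn down per frame; readFrame, handle, and the package name are hypothetical stand-ins, not this repo's API:

```go
package rpcdemo // illustrative

// readLoop is the "one persistent reader goroutine" alternative: a single
// goroutine owns all reads and pushes complete frames onto a channel, instead
// of each handler spawning the reader for the next frame.
func readLoop(readFrame func() ([]byte, error), frames chan<- []byte) error {
	defer close(frames)
	for {
		frame, err := readFrame()
		if err != nil {
			return err // e.g. the peer closed the connection
		}
		frames <- frame
	}
}

// handleLoop is a matching persistent consumer that unmarshals and dispatches
// frames off the read path.
func handleLoop(frames <-chan []byte, handle func([]byte)) {
	for frame := range frames {
		handle(frame)
	}
}
```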

func (r *deadlineResetReader) Read(p []byte) (n int, err error) {
	n, err = r.r.Read(p)
	if time.Since(r.lastReset) > onReadDeadlineResetInterval {
		log.Warnw("slow/large read, resetting deadline while reading the frame", "since", time.Since(r.lastReset), "n", n, "err", err, "p", len(p))

Should this be Warn? It isn't actionable by the user and will happen during any request which requires more than 5s to send.

@magik6k (Collaborator, Author) replied:

Arguably, whenever any RPC takes that long to transfer, something is broken, so the user should either open an issue or investigate their networking.
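
For context, a minimal self-contained sketch of the deadline-reset idea under discussion, assuming the wrapped reader is a net.Conn; the constant name mirrors the excerpt above and the 5s value matches the interval mentioned in the review comment, but the field names and the exact reset call are illustrative rather than the PR's code:

```go
package rpcdemo // illustrative

import (
	"net"
	"time"
)

const onReadDeadlineResetInterval = 5 * time.Second

type deadlineResetReader struct {
	conn      net.Conn      // underlying connection whose read deadline gets extended
	timeout   time.Duration // how far into the future each reset pushes the deadline
	lastReset time.Time
}

func (r *deadlineResetReader) Read(p []byte) (n int, err error) {
	n, err = r.conn.Read(p)
	// If a single frame is taking a long time to arrive, keep pushing the read
	// deadline forward so a slow-but-alive peer isn't cut off mid-frame.
	if time.Since(r.lastReset) > onReadDeadlineResetInterval {
		if derr := r.conn.SetReadDeadline(time.Now().Add(r.timeout)); derr != nil && err == nil {
			err = derr
		}
		r.lastReset = time.Now()
	}
	return n, err
}
```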

@Kubuxu (Contributor) left a comment:

locking looks correct

Successfully merging this pull request may close these issues.

wdPoSt scheduler notifs channel closes when lotus-miner is under heavy load