
pss: Improve pressure backstop queue handling - no mutex #1695

Merged · 7 commits · Sep 2, 2019

Conversation

@kortatu (Contributor) commented Aug 27, 2019

Implementation of parallelization of forwarded messages, using the solution proposed by @zelig based only on channels.
We have parallelized the message processing inside the main processing loop:

go func() {
	for slot := range p.outbox.process {
		go func(slot int) {
			msg := p.outbox.msg(slot)
			sent := p.forwardFunc(msg.msg)
			if sent {
				// free the outbox slot
				p.outbox.free(slot)
				...
			} else {
				// if we failed to send to anyone, re-insert message in the send-queue
				p.outbox.reenqueue(slot)
			}
		}(slot)
	}
}()

All functions and variables related to the outbox are encapsulated in a new outbox type:

type outbox struct {
	queue   []*outboxMsg
	slots   chan int
	process chan int
	quitC   chan struct{}
}
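
For context, here is a minimal sketch of how such an outbox could be initialized, with the slots channel pre-filled so it acts as a counting semaphore, together with a free consistent with the loop above (newOutbox is an illustrative name and the details are assumptions, not necessarily the PR's exact code):

// Hypothetical constructor: pre-fill slots with every index so that
// enqueue can claim up to capacity slots before reporting a full outbox.
func newOutbox(capacity int) *outbox {
	o := &outbox{
		queue:   make([]*outboxMsg, capacity),
		slots:   make(chan int, capacity),
		process: make(chan int), // unbuffered: enqueue hands slots directly to the processing loop
		quitC:   make(chan struct{}),
	}
	for i := 0; i < capacity; i++ {
		o.slots <- i
	}
	return o
}

// free returns a slot index to the pool once its message has been sent.
func (o *outbox) free(slot int) {
	o.slots <- slot
}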

When a new message is received for forwarding, we call outbox.enqueue():

func (o *outbox) enqueue(outboxmsg *outboxMsg) error {
	// first we try to obtain a slot in the outbox
	select {
	case slot := <-o.slots:
		o.queue[slot] = outboxmsg
		metrics.GetOrRegisterGauge("pss.outbox.len", nil).Update(int64(o.len()))
		// we send this message slot to process
		select {
		case o.process <- slot:
		case <-o.quitC:
		}
		return nil
	default:
		metrics.GetOrRegisterCounter("pss.enqueue.outbox.full", nil).Inc(1)
		return errors.New("outbox full")
	}
}

Note also that with this implementation the enqueue method blocks on the outbox.process channel, so some tests (that didn't start the ps processing loop) have been modified to pull from that channel.
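
For example, a test that never starts the processing loop could drain outbox.process in a helper goroutine so that enqueue doesn't block (an illustrative sketch, not the PR's actual test code):

// Drain the process channel in place of the real processing loop,
// so enqueue can hand off slots even though pss was never started.
go func() {
	for {
		select {
		case <-ps.outbox.process:
		case <-ps.outbox.quitC:
			return
		}
	}
}()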

A benchmark (BenchmarkMessageProcessing) has been added to compare performance with the mutex implementation (PR #1680).

We also added a new pluggable forward function in pss for testing purposes.

This PR relates to issue #1654.

@zelig (Member) left a comment:

looks so much neater than #1680, no?
minor suggestions only

@nolash (Contributor) commented Aug 28, 2019

The benchmark results for the channels are rather peculiar.

$ git checkout issue-1654-channels
Switched to branch 'issue-1654-channels'
Your branch is up to date with 'epiclabs/issue-1654-channels'.
[lash@sostenuto swarm]$ go test -v -cpu 4 ./pss/ -run ^$ -bench MessageProcessing
goos: linux
goarch: amd64
pkg: github.com/ethersphere/swarm/pss
BenchmarkMessageProcessing/FailProb0.00-4         	       1	2095323190 ns/op	172160096 B/op	 1794805 allocs/op
BenchmarkMessageProcessing/FailProb0.01-4         	1000000000	         0.77 ns/op	       0 B/op	       0 allocs/op
BenchmarkMessageProcessing/FailProb0.05-4         	50000000	        20.9 ns/op	       2 B/op	       0 allocs/op
PASS
ok  	github.com/ethersphere/swarm/pss	63.340s
$ git checkout issue-1654
Switched to branch 'issue-1654'
Your branch is up to date with 'epiclabs/issue-1654'.
[lash@sostenuto swarm]$ go test -v -cpu 4 ./pss/ -run ^$ -bench MessageProcessing
goos: linux
goarch: amd64
pkg: github.com/ethersphere/swarm/pss
BenchmarkMessageProcessing/0.00_-4         	       1	1220816527 ns/op	163138096 B/op	 1699684 allocs/op
BenchmarkMessageProcessing/0.01_-4         	       1	1028150022 ns/op	79193224 B/op	 1516835 allocs/op
BenchmarkMessageProcessing/0.05_-4         	       2	 510930503 ns/op	38372388 B/op	  747695 allocs/op
PASS
ok  	github.com/ethersphere/swarm/pss	4.505s

@kortatu (Contributor, Author) commented Aug 28, 2019

The benchmark results for the channels are rather peculiar.

In fact, I don't fully understand the Go benchmark tests. I finally got something more stable with a benchtime of 2s:
(issue-1654-channels)

9:21 $ go test -v -cpu 4 -bench=BenchmarkMessageProcessing -benchtime 2s -run=^$
goos: linux
goarch: amd64
pkg: github.com/ethersphere/swarm/pss
BenchmarkMessageProcessing/FailProb0.00-4               3000000000               0.25 ns/op            0 B/op          0 allocs/op
BenchmarkMessageProcessing/FailProb0.01-4               3000000000               0.25 ns/op            0 B/op          0 allocs/op
BenchmarkMessageProcessing/FailProb0.05-4               1000000000               1.62 ns/op            0 B/op          0 allocs/op

(issue-1654)

16:02 $ go test -v -cpu 4 -bench=BenchmarkMessageProcessing -benchtime 5s   -run=^$ 
goos: linux
goarch: amd64
pkg: github.com/ethersphere/swarm/pss
BenchmarkMessageProcessing/0.00_-4              3000000000               0.48 ns/op            0 B/op          0 allocs/op
BenchmarkMessageProcessing/0.01_-4              2000000000               0.32 ns/op            0 B/op          0 allocs/op
BenchmarkMessageProcessing/0.05_-4              2000000000               0.30 ns/op            0 B/op          0 allocs/op

I don't think there is much difference in performance between the two implementations, although times vary a lot between runs of the benchmark tests.

pss/pss.go Outdated
if err != nil {
log.Error(err.Error())
metrics.GetOrRegisterCounter("pss.forward.err", nil).Inc(1)
// In any case, if the message

Contributor:
Does this comment belong here?

kortatu (Author):
Removed comment

kortatu (Author):
Fixed in 1b91b66

pss/pss_test.go Outdated
select {
case outmsg = <-ps.outbox:
default:
if len(processed) > 1 {

Contributor:
superfluous

kortatu (Author):
Removed processed slice and checks on it. Fixed in 1b91b66

pss/pss_test.go Outdated
Data: []byte{0x66, 0x6f, 0x6f},
outboxCapacity := 2

processed := make([]*PssMsg, 0)

Contributor:
no need for this, just time out on successC instead

kortatu (Author):
Fixed in 1b91b66

ps.Stop()

// finish processing message
procChan <- struct{}{}

Contributor:
Shouldn't we here instead just check if the channel is closed, if that's what the test is about? The error state of a test should not be a panic:

select {
case _, ok := <-procChan:
	if !ok {
		t.Fatal(...)
	}
default:
}

kortatu (Author):
The idea was closing ps/outbox in the middle of processing a message and avoiding panics (because of closed channels). Not sure how to test that.

Contributor:
Which channel did you want to test for panic on close?

kortatu (Author) commented Aug 28, 2019:

Well, the process and slots channels, because they are the ones that could be used by a routine processing a message (process if the forwarding failed, in reenqueue(); slots if it succeeded, in free())

Contributor:

I'm not sure this should be tested. But if the problem is the attempt to re-enqueue on failure after the channel is closed, then maybe the re-enqueue should instead check the quit channel with higher priority, and if it's closed, not even try to re-enqueue. Or am I missing the point?
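
Something along those lines could look like this (a sketch of the suggested change, assuming the outbox fields shown earlier; not code from the PR):

// Check quitC with priority, so a stopped outbox never blocks
// or re-enqueues; then fall back to the normal blocking send.
func (o *outbox) reenqueue(slot int) {
	select {
	case <-o.quitC:
		return
	default:
	}
	select {
	case o.process <- slot:
	case <-o.quitC:
	}
}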

kortatu (Author):

Maybe we shouldn't test that, but in the early stages of the implementation I had panics when closing, so I added this test to feel safe about it.
I can remove it completely.

kortatu (Author):

Test removed

pss/pss_test.go Outdated
}

func benchmarkMessageProcessing(b *testing.B, failProb float32) {
rand.Seed(0)

Contributor:
Please add the b.N loop

Contributor:
That could be the reason for the strange results

kortatu (Author):
Ok, added. Now it seems more stable.
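
For reference, the standard shape of a Go benchmark with the b.N loop that the review asked for (a generic sketch with the setup and per-message work elided; not the PR's actual benchmark code):

func BenchmarkMessageProcessing(b *testing.B) {
	for _, failProb := range []float32{0.00, 0.01, 0.05} {
		b.Run(fmt.Sprintf("FailProb%.2f", failProb), func(b *testing.B) {
			// ...set up pss and the pluggable forward function here...
			b.ResetTimer()
			// the measured work must run b.N times for stable results
			for i := 0; i < b.N; i++ {
				// ...enqueue and process one message per iteration...
			}
		})
	}
}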

Member:
this benchmark is mysterious, please rethink :)

kortatu (Author):
Fixed in 1b91b66. Now it is more stable, with these values:

$ go test -v -bench=BenchmarkMessageProcessing -run=^$
goos: linux
goarch: amd64
pkg: github.com/ethersphere/swarm/pss
BenchmarkMessageProcessing/FailProb0.00-4                      1        2915534902 ns/op        168677016 B/op   1752294 allocs/op
BenchmarkMessageProcessing/FailProb0.01-4                      1        1072452793 ns/op        92210520 B/op    1668221 allocs/op
BenchmarkMessageProcessing/FailProb0.05-4                      2         758722005 ns/op        89287224 B/op    1703359 allocs/op
PASS
ok      github.com/ethersphere/swarm/pss        6.974s

Contributor:
@zelig comments like that aren't helpful. If you have objections then please share them concisely.

pss/pss_test.go Outdated
succesForward := func(msg *PssMsg) bool {
roll := rand.Float32()
if failProb > roll {
failed++

Member:
race on incrementing

kortatu (Author):
I will remove the failed counter as we are not using it anymore.

kortatu (Author):
Removed counter in commit dad34db

@kortatu kortatu requested a review from nolash August 28, 2019 10:23
pss/pss_test.go Outdated
procChan<- struct{}{}
procChan<- struct{}{}
procChan <- struct{}{}
procChan <- struct{}{}
//Must wait a bit for the routines processing the messages to free the slots
time.Sleep(1 * time.Millisecond)

Contributor:
I really don't like sleeps in tests.

kortatu (Author):
Me neither, but it's the only way I've found to be sure the processing routine finishes freeing the slot. Maybe I should loop with a timeout instead.

kortatu (Author):
Sleep removed. Instead, I used a timed wait on the slots channel:

// There should be a slot again in the outbox
select {
case <-ps.outbox.slots:
case <-time.After(2 * time.Second):
	t.Fatalf("timeout waiting for a free slot")
}

@zelig (Member) left a comment:

looks good to me. Any meaningful comparison with benchmarks?
Do u expect a performance difference?

@kortatu (Contributor, Author) commented Aug 28, 2019

looks good to me. Any meaningful comparison with benchmarks?
Do u expect a performance difference?

No, I've been testing both branches and I think performance-wise there is no difference. But code-wise this one is much better.

@nolash (Contributor) commented Aug 28, 2019

If we're going with this one, please remove the benchmark after we're done.

@kortatu (Contributor, Author) commented Aug 28, 2019

If we're going with this one, please remove the benchmark after we're done.

Don't you think it is interesting to have a benchmark in case a future change deteriorates the performance of the outbox processing?

@nolash (Contributor) commented Aug 28, 2019

Don't you think it is interesting to have a benchmark in case a future change deteriorates the performance of the outbox processing?

I don't know, really. It's a very shallow benchmark, and I seriously doubt we will be changing this part of the code anytime soon.

@nolash nolash added this to In review (includes Documentation) in Swarm Core - Sprint planning Aug 30, 2019
@nolash nolash moved this from In review (includes Documentation) to Done in Swarm Core - Sprint planning Aug 30, 2019
@nolash nolash moved this from Done to In review (includes Documentation) in Swarm Core - Sprint planning Aug 30, 2019
@nolash nolash removed this from In review (includes Documentation) in Swarm Core - Sprint planning Aug 30, 2019
@kortatu (Contributor, Author) commented Aug 30, 2019

I don't know, really. It's a very shallow benchmark, and I seriously doubt we will be changing this part of the code anytime soon.

Benchmark test removed

@nolash nolash merged commit 9c8262f into ethersphere:master Sep 2, 2019
@kortatu kortatu deleted the issue-1654-channels branch September 11, 2019 07:36
@skylenet skylenet added this to the 0.5.0 milestone Sep 17, 2019
chadsr added a commit to chadsr/swarm that referenced this pull request Sep 23, 2019
* 'master' of github.com:ethersphere/swarm:
  pss: Modularize crypto and remove Whisper. Step 1 - isolate whisper code (ethersphere#1698)
  pss: Improve pressure backstop queue handling - no mutex (ethersphere#1695)
  cmd/swarm-snapshot: if 2 nodes to create snapshot use connectChain (ethersphere#1709)
  network: Add API for Capabilities (ethersphere#1675)
  pss: fixed flaky test that was using a global variable instead of a local one (ethersphere#1702)
  pss: Port tests to `network/simulation` (ethersphere#1682)
  storage: fix hasherstore seen check to happen when error is nil (ethersphere#1700)
  vendor: upgrade go-ethereum to 1.9.2 (ethersphere#1689)
  bzzeth: initial support for bzz-eth protocol (ethersphere#1571)
  network/stream: terminate runUpdateSyncing on peer quit (ethersphere#1696)
  all: first working SWAP version (ethersphere#1554)
  version: update to v0.5.0 unstable (ethersphere#1694)
  chunk, storage: storage with multi chunk Set method (ethersphere#1684)
  chunk, storage: add HasMulti to chunk.Store (ethersphere#1686)
  chunk, shed, storage: chunk.Store GetMulti method (ethersphere#1691)
  api, chunk: progress bar support (ethersphere#1649)