
pss: Improve pressure backstop queue handling - no mutex #1695

Merged · 7 commits · Sep 2, 2019

Conversation

@kortatu (Contributor) commented Aug 27, 2019

Implementation of parallelization of forwarded messages, using the solution proposed by @zelig based only on channels.
We have parallelized the message processing inside the main processing loop:

go func() {
	for slot := range p.outbox.process {
		go func(slot int) {
			msg := p.outbox.msg(slot)
			sent := p.forwardFunc(msg.msg)
			if sent {
				// free the outbox slot
				p.outbox.free(slot)
				...
			} else {
				// if we failed to send to anyone, re-insert message in the send-queue
				p.outbox.reenqueue(slot)
			}
		}(slot)
	}
}()

All functions and variables related to the outbox are encapsulated in a new outbox type:

type outbox struct {
	queue   []*outboxMsg
	slots   chan int
	process chan int
	quitC   chan struct{}
}
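
For context, here is a minimal sketch of how such an outbox could be initialized, with the slots channel pre-filled so it acts as a counting semaphore, together with a free consistent with the loop above (newOutbox is an illustrative name and the details are assumptions, not necessarily the PR's exact code):

// Hypothetical constructor: pre-fill slots with every index so that
// enqueue can claim up to capacity slots before reporting a full outbox.
func newOutbox(capacity int) *outbox {
	o := &outbox{
		queue:   make([]*outboxMsg, capacity),
		slots:   make(chan int, capacity),
		process: make(chan int), // unbuffered: enqueue hands slots directly to the processing loop
		quitC:   make(chan struct{}),
	}
	for i := 0; i < capacity; i++ {
		o.slots <- i
	}
	return o
}

// free returns a slot index to the pool once its message has been sent.
func (o *outbox) free(slot int) {
	o.slots <- slot
}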

When a new message is received for forwarding, we call outbox.enqueue():

func (o *outbox) enqueue(outboxmsg *outboxMsg) error {
	// first we try to obtain a slot in the outbox
	select {
	case slot := <-o.slots:
		o.queue[slot] = outboxmsg
		metrics.GetOrRegisterGauge("pss.outbox.len", nil).Update(int64(o.len()))
		// we send this message slot to process
		select {
		case o.process <- slot:
		case <-o.quitC:
		}
		return nil
	default:
		metrics.GetOrRegisterCounter("pss.enqueue.outbox.full", nil).Inc(1)
		return errors.New("outbox full")
	}
}

Note also that with this implementation the enqueue method blocks on the outbox.process channel, so some tests (that didn't start the ps processing loop) have been modified to pull from that channel.
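
For example, a test that never starts the processing loop could drain outbox.process in a helper goroutine so that enqueue doesn't block (an illustrative sketch, not the PR's actual test code):

// Drain the process channel in place of the real processing loop,
// so enqueue can hand off slots even though pss was never started.
go func() {
	for {
		select {
		case <-ps.outbox.process:
		case <-ps.outbox.quitC:
			return
		}
	}
}()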

A benchmark (BenchmarkMessageProcessing) has been added to compare performance with the mutex implementation (PR #1680).

We also added a new pluggable forward function in pss for testing purposes.

This PR relates to issue #1654.

@zelig (Member) left a comment:

looks so much neater than #1680, no?
minor suggestions only

@nolash (Contributor) commented Aug 28, 2019

The benchmark results for the channels are rather peculiar.

$ git checkout issue-1654-channels
Switched to branch 'issue-1654-channels'
Your branch is up to date with 'epiclabs/issue-1654-channels'.
[lash@sostenuto swarm]$ go test -v -cpu 4 ./pss/ -run ^$ -bench MessageProcessing
goos: linux
goarch: amd64
pkg: github.com/ethersphere/swarm/pss
BenchmarkMessageProcessing/FailProb0.00-4         	       1	2095323190 ns/op	172160096 B/op	 1794805 allocs/op
BenchmarkMessageProcessing/FailProb0.01-4         	1000000000	         0.77 ns/op	       0 B/op	       0 allocs/op
BenchmarkMessageProcessing/FailProb0.05-4         	50000000	        20.9 ns/op	       2 B/op	       0 allocs/op
PASS
ok  	github.com/ethersphere/swarm/pss	63.340s
$ git checkout issue-1654
Switched to branch 'issue-1654'
Your branch is up to date with 'epiclabs/issue-1654'.
[lash@sostenuto swarm]$ go test -v -cpu 4 ./pss/ -run ^$ -bench MessageProcessing
goos: linux
goarch: amd64
pkg: github.com/ethersphere/swarm/pss
BenchmarkMessageProcessing/0.00_-4         	       1	1220816527 ns/op	163138096 B/op	 1699684 allocs/op
BenchmarkMessageProcessing/0.01_-4         	       1	1028150022 ns/op	79193224 B/op	 1516835 allocs/op
BenchmarkMessageProcessing/0.05_-4         	       2	 510930503 ns/op	38372388 B/op	  747695 allocs/op
PASS
ok  	github.com/ethersphere/swarm/pss	4.505s

@kortatu (Contributor, Author) commented Aug 28, 2019

The benchmark results for the channels are rather peculiar.

In fact, I don't fully understand the Go benchmark tests. I finally got something more stable with a benchtime of 2s:
(issue-1654-channels)

9:21 $ go test -v -cpu 4 -bench=BenchmarkMessageProcessing -benchtime 2s -run=^$
goos: linux
goarch: amd64
pkg: github.com/ethersphere/swarm/pss
BenchmarkMessageProcessing/FailProb0.00-4               3000000000               0.25 ns/op            0 B/op          0 allocs/op
BenchmarkMessageProcessing/FailProb0.01-4               3000000000               0.25 ns/op            0 B/op          0 allocs/op
BenchmarkMessageProcessing/FailProb0.05-4               1000000000               1.62 ns/op            0 B/op          0 allocs/op

(issue-1654)

16:02 $ go test -v -cpu 4 -bench=BenchmarkMessageProcessing -benchtime 5s   -run=^$ 
goos: linux
goarch: amd64
pkg: github.com/ethersphere/swarm/pss
BenchmarkMessageProcessing/0.00_-4              3000000000               0.48 ns/op            0 B/op          0 allocs/op
BenchmarkMessageProcessing/0.01_-4              2000000000               0.32 ns/op            0 B/op          0 allocs/op
BenchmarkMessageProcessing/0.05_-4              2000000000               0.30 ns/op            0 B/op          0 allocs/op

I don't think there is much difference in performance between the two implementations, although times vary a lot between runs of the benchmark tests.

pss/pss.go Outdated
if err != nil {
log.Error(err.Error())
metrics.GetOrRegisterCounter("pss.forward.err", nil).Inc(1)
// In any case, if the message

Contributor:
Does this comment belong here?

kortatu (Author):
Removed comment

kortatu (Author):
Fixed in 1b91b66

pss/pss_test.go Outdated
select {
case outmsg = <-ps.outbox:
default:
if len(processed) > 1 {

Contributor:
superfluous

kortatu (Author):
Removed processed slice and checks on it. Fixed in 1b91b66

pss/pss_test.go Outdated
Data: []byte{0x66, 0x6f, 0x6f},
outboxCapacity := 2

processed := make([]*PssMsg, 0)

Contributor:
no need for this, just time out on successC instead

kortatu (Author):
Fixed in 1b91b66

ps.Stop()

// finish processing message
procChan <- struct{}{}

Contributor:
Shouldn't we here instead just check if the channel is closed, if that's what the test is about? The error state of a test should not be a panic:

select {
case _, ok := <-procChan:
	if !ok {
		t.Fatal(...)
	}
default:
}

kortatu (Author):
The idea was closing ps/outbox in the middle of processing a message and avoiding panics (because of closed channels). Not sure how to test that.

Contributor:
Which channel did you want to test for panic on close?

kortatu (Author) commented Aug 28, 2019:

Well, the process and slots channels, because they are the ones that could be used by a routine processing a message (process if the forwarding failed, in reenqueue(); slots if it succeeded, in free())

Contributor:

I'm not sure this should be tested. But if the problem is the attempt to re-enqueue on failure after the channel is closed, then maybe the re-enqueue should instead check the quit channel with higher priority, and if it's closed, not even try to re-enqueue. Or am I missing the point?
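
Something along those lines could look like this (a sketch of the suggested change, assuming the outbox fields shown earlier; not code from the PR):

// Check quitC with priority, so a stopped outbox never blocks
// or re-enqueues; then fall back to the normal blocking send.
func (o *outbox) reenqueue(slot int) {
	select {
	case <-o.quitC:
		return
	default:
	}
	select {
	case o.process <- slot:
	case <-o.quitC:
	}
}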

kortatu (Author):

Maybe we shouldn't test that, but in the early stages of the implementation I had panics when closing, so I added this test to feel safe about it.
I can remove it completely.

kortatu (Author):

Test removed

pss/pss_test.go Outdated
}

func benchmarkMessageProcessing(b *testing.B, failProb float32) {
rand.Seed(0)

Contributor:
Please add the b.N loop

Contributor:
That could be the reason for the strange results

kortatu (Author):
Ok, added. Now it seems more stable.
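
For reference, the standard shape of a Go benchmark with the b.N loop that the review asked for (a generic sketch with the setup and per-message work elided; not the PR's actual benchmark code):

func BenchmarkMessageProcessing(b *testing.B) {
	for _, failProb := range []float32{0.00, 0.01, 0.05} {
		b.Run(fmt.Sprintf("FailProb%.2f", failProb), func(b *testing.B) {
			// ...set up pss and the pluggable forward function here...
			b.ResetTimer()
			// the measured work must run b.N times for stable results
			for i := 0; i < b.N; i++ {
				// ...enqueue and process one message per iteration...
			}
		})
	}
}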

Member:
this benchmark is mysterious, please rethink :)

kortatu (Author):
Fixed in 1b91b66. Now it is more stable, with these values:

$ go test -v -bench=BenchmarkMessageProcessing -run=^$
goos: linux
goarch: amd64
pkg: github.com/ethersphere/swarm/pss
BenchmarkMessageProcessing/FailProb0.00-4                      1        2915534902 ns/op        168677016 B/op   1752294 allocs/op
BenchmarkMessageProcessing/FailProb0.01-4                      1        1072452793 ns/op        92210520 B/op    1668221 allocs/op
BenchmarkMessageProcessing/FailProb0.05-4                      2         758722005 ns/op        89287224 B/op    1703359 allocs/op
PASS
ok      github.com/ethersphere/swarm/pss        6.974s

Contributor:
@zelig comments like that aren't helpful. If you have objections then please share them concisely.

pss/pss_test.go Outdated
succesForward := func(msg *PssMsg) bool {
roll := rand.Float32()
if failProb > roll {
failed++

Member:
race on incrementing

kortatu (Author):
I will remove the failed counter as we are not using it anymore.

kortatu (Author):
Removed counter in commit dad34db

@kortatu kortatu requested a review from nolash August 28, 2019 10:23
pss/pss_test.go Outdated
procChan<- struct{}{}
procChan<- struct{}{}
procChan <- struct{}{}
procChan <- struct{}{}
//Must wait a bit for the routines processing the messages to free the slots
time.Sleep(1 * time.Millisecond)

Contributor:
I really don't like sleeps in tests.

kortatu (Author):
Me neither, but it's the only way I've found to be sure the processing routine finishes freeing the slot. Maybe I should loop with a timeout instead.

kortatu (Author):
Sleep removed. Instead, I used a timed wait on the slots channel:

// There should be a slot again in the outbox
select {
case <-ps.outbox.slots:
case <-time.After(2 * time.Second):
	t.Fatalf("timeout waiting for a free slot")
}

@zelig (Member) left a comment:

looks good to me. Any meaningful comparison with benchmarks?
Do u expect a performance difference?

@kortatu (Contributor, Author) commented Aug 28, 2019

looks good to me. Any meaningful comparison with benchmarks?
Do u expect a performance difference?

No, I've been testing both branches and I think performance-wise there is no difference. But code-wise this one is much better.

@nolash (Contributor) commented Aug 28, 2019

If we're going with this one, please remove the benchmark after we're done.

@kortatu (Contributor, Author) commented Aug 28, 2019

If we're going with this one, please remove the benchmark after we're done.

Don't you think it is interesting to have a benchmark in case a future change deteriorates the performance of the outbox processing?

@nolash (Contributor) commented Aug 28, 2019

Don't you think it is interesting to have a benchmark in case a future change deteriorates the performance of the outbox processing?

I don't know, really. It's a very shallow benchmark, and I seriously doubt we will be changing this part of the code anytime soon.

@nolash nolash added this to In review (includes Documentation) in Swarm Core - Sprint planning Aug 30, 2019
@nolash nolash moved this from In review (includes Documentation) to Done in Swarm Core - Sprint planning Aug 30, 2019
@nolash nolash moved this from Done to In review (includes Documentation) in Swarm Core - Sprint planning Aug 30, 2019
@nolash nolash removed this from In review (includes Documentation) in Swarm Core - Sprint planning Aug 30, 2019
@kortatu (Contributor, Author) commented Aug 30, 2019

I don't know, really. It's a very shallow benchmark, and I seriously doubt we will be changing this part of the code anytime soon.

Benchmark test removed

@nolash nolash merged commit 9c8262f into ethersphere:master Sep 2, 2019
@kortatu kortatu deleted the issue-1654-channels branch September 11, 2019 07:36
@skylenet skylenet added this to the 0.5.0 milestone Sep 17, 2019
chadsr added a commit to chadsr/swarm that referenced this pull request Sep 23, 2019
* 'master' of github.com:ethersphere/swarm:
  pss: Modularize crypto and remove Whisper. Step 1 - isolate whisper code (ethersphere#1698)
  pss: Improve pressure backstop queue handling - no mutex (ethersphere#1695)
  cmd/swarm-snapshot: if 2 nodes to create snapshot use connectChain (ethersphere#1709)
  network: Add API for Capabilities (ethersphere#1675)
  pss: fixed flaky test that was using a global variable instead of a local one (ethersphere#1702)
  pss: Port tests to `network/simulation` (ethersphere#1682)
  storage: fix hasherstore seen check to happen when error is nil (ethersphere#1700)
  vendor: upgrade go-ethereum to 1.9.2 (ethersphere#1689)
  bzzeth: initial support for bzz-eth protocol (ethersphere#1571)
  network/stream: terminate runUpdateSyncing on peer quit (ethersphere#1696)
  all: first working SWAP version (ethersphere#1554)
  version: update to v0.5.0 unstable (ethersphere#1694)
  chunk, storage: storage with multi chunk Set method (ethersphere#1684)
  chunk, storage: add HasMulti to chunk.Store (ethersphere#1686)
  chunk, shed, storage: chunk.Store GetMulti method (ethersphere#1691)
  api, chunk: progress bar support (ethersphere#1649)