feat: weighted candidate scoring w/ biased random selection #253

Merged: 5 commits merged into main from rvagg/weighted-rank, Jun 1, 2023

Conversation

@rvagg (Member) commented May 25, 2023

  • Removed the failure counting and suspension
  • Incorporated failure and success counting, with exponential moving average (EMA) (α=0.5)
  • Store and manipulate connect time as milliseconds rather than Duration for the EMA (α=0.5)
  • Accumulate an EMA (α=0.8) of all connect times ("overall_ema" below); a higher α means newer times turn the ship more slowly
  • Added scoring for candidate comparison
    • Connect time uses an exp(-λx) curve, where λ is 1/overall_ema, giving us a [0,1] range (curve shape below for an overall_ema of 50, with values from [0,250] for illustration).
    • Success is the plain EMA, so is already [0,1]
    • Graphsync VerifiedDeal is a 1, with a weight multiplier of 3
    • Graphsync FastRetrieval is a 1, with a weight multiplier of 2 (important, but Verified beats it)
  • Candidate comparison involves a dice-roll: line up two candidates, scale their combined scores so they sum to 1, roll a [0,1] dice, and if it lands within the first candidate's portion then that candidate wins (a sketch of the scoring and dice-roll follows below).
    • Happily, because we're essentially building a sort algorithm and the prioritywaitqueue basically does a per-candidate sort routine, we don't need to line them up to do the picking; we do it on an a:b comparison basis. It means there's more dice rolling involved (one per a:b comparison rather than one after lining them all up), but I believe that averages out over the set.
  • Session config has α and weight options. There's a DefaultConfig() which should be used and some sugar utility methods to modify the config according to needs.
  • Added a WithoutRandomness() config modifier that sets up a fixed dice roll of 0.5 so the best is always selected. This is useful for tests of course.
  • Added a bunch of tests to exercise the scoring and ranking, with various permutations of combined metrics.

[Screenshot: the exp(-λx) score curve for an overall_ema of 50, connect times from 0 to 250 ms]
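
To make the arithmetic concrete, here is a minimal Go sketch of the scoring and pairwise dice-roll as described above. The names (`candidate`, `score`, `chooseBetween`, `ema`) and structure are illustrative only, not the actual session API; the α values, the exp(-λx) curve and the weight multipliers follow the list above.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// ema folds a new sample into an exponential moving average; a higher alpha
// keeps more of the previous value, so new samples move the average slowly.
func ema(alpha, prev, next float64) float64 {
	return alpha*prev + (1-alpha)*next
}

// candidate carries the per-SP metrics the session tracks (illustrative).
type candidate struct {
	connectTimeMs float64 // per-SP connect-time EMA (α=0.5), in milliseconds
	successEMA    float64 // per-SP success EMA (α=0.5), already in [0,1]
	verifiedDeal  bool    // graphsync VerifiedDeal flag
	fastRetrieval bool    // graphsync FastRetrieval flag
}

// score combines the metrics into a single weighted value.
// overallEMA is the EMA (α=0.8) of connect times across all SPs.
func score(c candidate, overallEMA float64) float64 {
	// exp(-λx) with λ = 1/overallEMA maps connect time onto a [0,1] range:
	// e.g. with overallEMA=50ms, a 50ms connect scores exp(-1) ≈ 0.37.
	s := math.Exp(-c.connectTimeMs / overallEMA)
	s += c.successEMA // already [0,1]
	if c.verifiedDeal {
		s += 1.0 * 3 // weight multiplier of 3
	}
	if c.fastRetrieval {
		s += 1.0 * 2 // weight multiplier of 2
	}
	return s
}

// chooseBetween is the a:b dice-roll: scale the two scores so they sum to 1
// and pick a with probability score(a)/(score(a)+score(b)). A fixed roll of
// 0.5 (the WithoutRandomness idea) always picks the higher-scoring candidate.
func chooseBetween(a, b candidate, overallEMA, roll float64) candidate {
	sa, sb := score(a, overallEMA), score(b, overallEMA)
	if roll*(sa+sb) <= sa {
		return a
	}
	return b
}

func main() {
	overallEMA := 50.0
	overallEMA = ema(0.8, overallEMA, 80) // fold a new 80ms connect time into the overall EMA

	fast := candidate{connectTimeMs: 20, successEMA: 0.9, verifiedDeal: true}
	slow := candidate{connectTimeMs: 200, successEMA: 0.4}
	winner := chooseBetween(fast, slow, overallEMA, rand.Float64())
	fmt.Printf("winner: %+v\n", winner)
}
```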

In #92 we still have ttfb and bandwidth. I think the approach with both of these would be the same as the one with connect time (overall EMA, make a [0,1] curve, plot per-SP EMA on that curve).

There's also concurrency which we're still keeping a strict lid on. We should talk about if/how to incorporate that in scoring.

@codecov-commenter commented May 25, 2023

Codecov Report

Merging #253 (882af7d) into main (871d92a) will increase coverage by 0.05%.
The diff coverage is 82.97%.

Additional details and impacted files


@@            Coverage Diff             @@
##             main     #253      +/-   ##
==========================================
+ Coverage   71.48%   71.54%   +0.05%     
==========================================
  Files          69       69              
  Lines        6043     6126      +83     
==========================================
+ Hits         4320     4383      +63     
- Misses       1484     1504      +20     
  Partials      239      239              
| Impacted Files | Coverage Δ |
|---|---|
| pkg/session/nilstate.go | 0.00% <0.00%> (ø) |
| pkg/session/session.go | 75.86% <ø> (+3.73%) ⬆️ |
| pkg/session/config.go | 50.00% <53.06%> (-11.12%) ⬇️ |
| pkg/session/state.go | 94.68% <97.72%> (-0.29%) ⬇️ |
| pkg/lassie/lassie.go | 61.15% <100.00%> (-0.32%) ⬇️ |
| pkg/retriever/parallelpeerretriever.go | 98.00% <100.00%> (+2.92%) ⬆️ |
| pkg/retriever/prioritywaitqueue/prioritywaitqueue.go | 93.26% <100.00%> (+0.87%) ⬆️ |
| pkg/retriever/retriever.go | 86.69% <100.00%> (+0.40%) ⬆️ |

... and 3 files with indirect coverage changes

@rvagg rvagg marked this pull request as draft May 25, 2023 13:20
@rvagg (Member, Author) commented May 25, 2023

> Happily, because we're essentially building a sort algorithm and the prioritywaitqueue basically does a per-candidate sort routine, we don't need to line them up to do the picking

aaaactually this isn't true. I just realised why this might be flaky: the a:b comparison sort-esque action of the prioritywaitqueue requires determinism ("is there one better than me? I'll check them all"). If they are all doing that, the randomness introduces the possibility that they all conclude they are the worst candidate and shouldn't run, leading to deadlock.

I've seen deadlock-like action in tests a couple of times and thought I ironed it out, but a CI failure here kind of looks like it, which is strange because WithoutRandomness() was supposed to fix that.

Converted to draft until I sort that out. Probably means finally changing prioritywaitqueue to do a line-them-all-up-and-pick-one.

@hannahhoward (Collaborator) commented:

Agree on the randomness bugs. I've long thought the architecture of the priority wait queue would eventually run into problems. I think you just need one more coordinating goroutine; essentially, that's what would choose whom to run each time the queue frees up, picking from all the possibilities. I know you've been aiming to avoid that for as long as possible, but maybe now we're there.
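
As an illustration of the shape being suggested here, a minimal Go sketch of a single coordinating goroutine that owns the wait list and releases one waiter at a time; none of these names come from the actual prioritywaitqueue package, it's just the pattern.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// coordinator owns the set of waiting tasks. Workers never compare themselves
// pairwise; they register a release channel and block until it is closed.
type coordinator struct {
	enqueue chan chan struct{}                // waiters register a release channel
	done    chan struct{}                     // a running task signals completion
	pick    func(waiting []chan struct{}) int // the single "choose one from many" decision
}

func (c *coordinator) run() {
	var waiting []chan struct{}
	running := false
	maybeStart := func() {
		if running || len(waiting) == 0 {
			return
		}
		i := c.pick(waiting) // can be weighted/random; only one place rolls the dice
		close(waiting[i])    // release the chosen waiter
		waiting = append(waiting[:i], waiting[i+1:]...)
		running = true
	}
	for {
		select {
		case w := <-c.enqueue:
			waiting = append(waiting, w)
			maybeStart()
		case <-c.done:
			running = false
			maybeStart()
		}
	}
}

func main() {
	c := &coordinator{
		enqueue: make(chan chan struct{}),
		done:    make(chan struct{}),
		pick:    func(w []chan struct{}) int { return rand.Intn(len(w)) },
	}
	go c.run()

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			release := make(chan struct{})
			c.enqueue <- release
			<-release // block until the coordinator picks us
			fmt.Println("task", id, "running")
			c.done <- struct{}{}
		}(i)
	}
	wg.Wait()
}
```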

@hannahhoward (Collaborator) left a comment:

Generally, this is awesome and the right direction. If we eventually get to cross-protocol choices, we'll probably want to reduce the graphsync weights (and maybe add some protocol prioritizations).
One amazing thing to get to would be, instead of protocol racing, having everyone run through the priority wait queue with a parallelization of >1. Then again, I have no idea how that works with protocols like Bitswap that don't have "whole request" logic. TBD and far away.

However, before we do anything further (and I think we need to focus on TTFB next, followed by bandwidth), I think that we need to:

  1. Resolve the decision making process so it's "pick one from many" rather than "pick one of two".
  2. Focus on how we record what decisions were made and why, so we can see how they affect performance when deployed.

Anyway though, this seems good and we should probably merge it if we can get the tests reliably passing.

Review threads (resolved): pkg/retriever/retriever.go · pkg/session/config.go (two threads) · pkg/session/state.go (outdated)
@rvagg rvagg force-pushed the rvagg/weighted-rank branch 2 times, most recently from c4d32b9 to 65fef84 on May 29, 2023 at 09:48
@rvagg rvagg marked this pull request as ready for review May 29, 2023 09:55
@rvagg (Member, Author) commented May 29, 2023

Good to go

  • Updated PWQ to compare all waiting candidates and run the "best" next.
  • Replaced CompareStorageProviders with ChooseNextProvider
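
For reference, a small Go sketch of what "pick one from many" weighted selection can look like once all waiting candidates are compared together (hypothetical names, not the actual ChooseNextProvider signature): each waiting candidate gets a score and one is drawn with probability proportional to it.

```go
package main

import (
	"fmt"
	"math/rand"
)

// chooseNext draws one index from scores with probability proportional to
// the score. roll is a [0,1) random value; a no-randomness mode would simply
// take the index of the maximum score instead.
func chooseNext(scores []float64, roll float64) int {
	var total float64
	for _, s := range scores {
		total += s
	}
	target := roll * total
	for i, s := range scores {
		target -= s
		if target < 0 {
			return i
		}
	}
	return len(scores) - 1 // guard against floating-point drift
}

func main() {
	scores := []float64{4.2, 1.1, 0.6} // e.g. from the weighted scoring described earlier
	fmt.Println("chose candidate", chooseNext(scores, rand.Float64()))
}
```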

@rvagg rvagg force-pushed the rvagg/weighted-rank branch 2 times, most recently from 57c9b10 to 1765385 on May 29, 2023 at 10:08
@hannahhoward (Collaborator) left a comment:

I've left a few remaining comments, but IMHO this is LGTM.

Review threads (outdated, resolved): pkg/retriever/prioritywaitqueue/prioritywaitqueue.go · pkg/session/state.go
```go
for ii, p := range peers {
	ind[ii] = ii
	gsmd, _ := mda[ii].(*metadata.GraphsyncFilecoinV1)
	scores[ii] = spt.scoreProvider(p, gsmd)
}
```
A Collaborator commented:

Just something to consider: maybe at some point we start squaring these values to reduce randomness (or perhaps make that its own adjustable parameter).
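
(For illustration: raw scores of 3 and 1 bias selection 3:1 toward the first candidate, while squaring them to 9 and 1 biases it 9:1, so less of the dice-roll lands on the weaker candidate.)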

@rvagg (Member, Author) replied:

I'm hoping we can come up with ways to know what to do here without guessing too much; I'm trying modelling for now, but live insights and A/B testing might be better.

* add Duration property to FirstByte event
* fix http retriever ttfb, duration and speed metrics
* add MockSession to track metric reporting
* ttfb & bandwidth contribute to scoring
@rvagg rvagg merged commit 9622d8b into main Jun 1, 2023
15 of 16 checks passed
@rvagg rvagg deleted the rvagg/weighted-rank branch June 1, 2023 09:26