feat: weighted candidate scoring w/ biased random selection #253

Merged: 5 commits merged into main from rvagg/weighted-rank, Jun 1, 2023

Conversation

@rvagg (Member) commented May 25, 2023

  • Removed the failure counting and suspension
  • Incorporated failure and success counting, with exponential moving average (EMA) (α=0.5)
  • Store and manipulate connect time as milliseconds rather than Duration for the EMA (α=0.5)
  • Accumulate an EMA (α=0.8) of all connect times ("overall_ema" below); a higher α means newer times turn the ship more slowly
  • Added scoring for candidate comparison
    • Connect time uses an exp(-λx) curve, where λ is 1/overall_ema, giving us a [0,1] range (curve shape below for an overall_ema of 50, with values from [0,250] for illustration).
    • Success is the plain EMA, so is already [0,1]
    • Graphsync VerifiedDeal is a 1, with a weight multiplier of 3
    • Graphsync FastRetrieval is a 1, with a weight multiplier of 2 (important, but Verified beats it)
  • Candidate comparison involves a dice-roll: line up two candidates, scale their combined scores so they sum to 1, roll a [0,1] dice, and if it lands within the first candidate's portion then that candidate wins (a sketch of the scoring and dice-roll follows below).
    • Happily, because we're essentially building a sort algorithm and the prioritywaitqueue basically does a per-candidate sort routine, we don't need to line them up to do the picking; we do it on an a:b comparison basis. It means there's more dice rolling involved (one per a:b comparison rather than one after lining them all up), but I believe that averages out over the set.
  • Session config has α and weight options. There's a DefaultConfig() which should be used and some sugar utility methods to modify the config according to needs.
  • Added a WithoutRandomness() config modifier that sets up a fixed dice roll of 0.5 so the best is always selected. This is useful for tests of course.
  • Added a bunch of tests to exercise the scoring and ranking, with various permutations of combined metrics.

[Screenshot: the exp(-λx) score curve for an overall_ema of 50, connect times from 0 to 250 ms]
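
To make the arithmetic concrete, here is a minimal Go sketch of the scoring and pairwise dice-roll as described above. The names (`candidate`, `score`, `chooseBetween`, `ema`) and structure are illustrative only, not the actual session API; the α values, the exp(-λx) curve and the weight multipliers follow the list above.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// ema folds a new sample into an exponential moving average; a higher alpha
// keeps more of the previous value, so new samples move the average slowly.
func ema(alpha, prev, next float64) float64 {
	return alpha*prev + (1-alpha)*next
}

// candidate carries the per-SP metrics the session tracks (illustrative).
type candidate struct {
	connectTimeMs float64 // per-SP connect-time EMA (α=0.5), in milliseconds
	successEMA    float64 // per-SP success EMA (α=0.5), already in [0,1]
	verifiedDeal  bool    // graphsync VerifiedDeal flag
	fastRetrieval bool    // graphsync FastRetrieval flag
}

// score combines the metrics into a single weighted value.
// overallEMA is the EMA (α=0.8) of connect times across all SPs.
func score(c candidate, overallEMA float64) float64 {
	// exp(-λx) with λ = 1/overallEMA maps connect time onto a [0,1] range:
	// e.g. with overallEMA=50ms, a 50ms connect scores exp(-1) ≈ 0.37.
	s := math.Exp(-c.connectTimeMs / overallEMA)
	s += c.successEMA // already [0,1]
	if c.verifiedDeal {
		s += 1.0 * 3 // weight multiplier of 3
	}
	if c.fastRetrieval {
		s += 1.0 * 2 // weight multiplier of 2
	}
	return s
}

// chooseBetween is the a:b dice-roll: scale the two scores so they sum to 1
// and pick a with probability score(a)/(score(a)+score(b)). A fixed roll of
// 0.5 (the WithoutRandomness idea) always picks the higher-scoring candidate.
func chooseBetween(a, b candidate, overallEMA, roll float64) candidate {
	sa, sb := score(a, overallEMA), score(b, overallEMA)
	if roll*(sa+sb) <= sa {
		return a
	}
	return b
}

func main() {
	overallEMA := 50.0
	overallEMA = ema(0.8, overallEMA, 80) // fold a new 80ms connect time into the overall EMA

	fast := candidate{connectTimeMs: 20, successEMA: 0.9, verifiedDeal: true}
	slow := candidate{connectTimeMs: 200, successEMA: 0.4}
	winner := chooseBetween(fast, slow, overallEMA, rand.Float64())
	fmt.Printf("winner: %+v\n", winner)
}
```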

In #92 we still have ttfb and bandwidth. I think the approach with both of these would be the same as the one with connect time (overall EMA, make a [0,1] curve, plot per-SP EMA on that curve).

There's also concurrency which we're still keeping a strict lid on. We should talk about if/how to incorporate that in scoring.

@codecov-commenter commented May 25, 2023

Codecov Report

Merging #253 (882af7d) into main (871d92a) will increase coverage by 0.05%.
The diff coverage is 82.97%.

Additional details and impacted files


@@            Coverage Diff             @@
##             main     #253      +/-   ##
==========================================
+ Coverage   71.48%   71.54%   +0.05%     
==========================================
  Files          69       69              
  Lines        6043     6126      +83     
==========================================
+ Hits         4320     4383      +63     
- Misses       1484     1504      +20     
  Partials      239      239              
| Impacted Files | Coverage Δ |
|---|---|
| pkg/session/nilstate.go | 0.00% <0.00%> (ø) |
| pkg/session/session.go | 75.86% <ø> (+3.73%) ⬆️ |
| pkg/session/config.go | 50.00% <53.06%> (-11.12%) ⬇️ |
| pkg/session/state.go | 94.68% <97.72%> (-0.29%) ⬇️ |
| pkg/lassie/lassie.go | 61.15% <100.00%> (-0.32%) ⬇️ |
| pkg/retriever/parallelpeerretriever.go | 98.00% <100.00%> (+2.92%) ⬆️ |
| pkg/retriever/prioritywaitqueue/prioritywaitqueue.go | 93.26% <100.00%> (+0.87%) ⬆️ |
| pkg/retriever/retriever.go | 86.69% <100.00%> (+0.40%) ⬆️ |

... and 3 files with indirect coverage changes

@rvagg rvagg marked this pull request as draft May 25, 2023 13:20
@rvagg (Member, Author) commented May 25, 2023

> Happily, because we're essentially building a sort algorithm and the prioritywaitqueue basically does a per-candidate sort routine, we don't need to line them up to do the picking

aaaactually this isn't true. I just realised why this might be flaky: the a:b comparison sort-esque action of the prioritywaitqueue requires determinism ("is there one better than me? I'll check them all"). If they are all doing that, the randomness introduces the possibility that they all conclude they are the worst candidate and shouldn't run, leading to deadlock.

I've seen deadlock-like action in tests a couple of times and thought I ironed it out, but a CI failure here kind of looks like it, which is strange because WithoutRandomness() was supposed to fix that.

Converted to draft until I sort that out. Probably means finally changing prioritywaitqueue to do a line-them-all-up-and-pick-one.

@hannahhoward (Collaborator) commented:

Agree on the randomness bugs. I've long thought the architecture of the priority wait queue would eventually run into problems. I think you just need one more coordinating goroutine; essentially, that's what would choose whom to run each time the queue frees up, picking from all the possibilities. I know you've been aiming to avoid that for as long as possible, but maybe now we're there.
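
As an illustration of the shape being suggested here, a minimal Go sketch of a single coordinating goroutine that owns the wait list and releases one waiter at a time; none of these names come from the actual prioritywaitqueue package, it's just the pattern.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// coordinator owns the set of waiting tasks. Workers never compare themselves
// pairwise; they register a release channel and block until it is closed.
type coordinator struct {
	enqueue chan chan struct{}                // waiters register a release channel
	done    chan struct{}                     // a running task signals completion
	pick    func(waiting []chan struct{}) int // the single "choose one from many" decision
}

func (c *coordinator) run() {
	var waiting []chan struct{}
	running := false
	maybeStart := func() {
		if running || len(waiting) == 0 {
			return
		}
		i := c.pick(waiting) // can be weighted/random; only one place rolls the dice
		close(waiting[i])    // release the chosen waiter
		waiting = append(waiting[:i], waiting[i+1:]...)
		running = true
	}
	for {
		select {
		case w := <-c.enqueue:
			waiting = append(waiting, w)
			maybeStart()
		case <-c.done:
			running = false
			maybeStart()
		}
	}
}

func main() {
	c := &coordinator{
		enqueue: make(chan chan struct{}),
		done:    make(chan struct{}),
		pick:    func(w []chan struct{}) int { return rand.Intn(len(w)) },
	}
	go c.run()

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			release := make(chan struct{})
			c.enqueue <- release
			<-release // block until the coordinator picks us
			fmt.Println("task", id, "running")
			c.done <- struct{}{}
		}(i)
	}
	wg.Wait()
}
```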

@hannahhoward (Collaborator) left a comment:

Generally, this is awesome and the right direction. If we eventually get to cross-protocol choices, we'll probably want to reduce the graphsync weights (and maybe add some protocol prioritizations).
One amazing thing to get to would be, instead of protocol racing, having everyone run through the priority wait queue with a parallelization of >1. Then again, I have no idea how that works with protocols like Bitswap that don't have "whole request" logic. TBD and far away.

However, before we do anything further (and I think we need to focus on TTFB next, followed by bandwidth), I think that we need to:

  1. Resolve the decision making process so it's "pick one from many" rather than "pick one of two".
  2. Focus on how we record what decisions were made and why, so we can see how they affect performance when deployed.

Anyway though, this seems good and we should probably merge it if we can get the tests reliably passing.

Review threads (resolved): pkg/retriever/retriever.go · pkg/session/config.go (two threads) · pkg/session/state.go (outdated)
@rvagg rvagg force-pushed the rvagg/weighted-rank branch 2 times, most recently from c4d32b9 to 65fef84 on May 29, 2023 at 09:48
@rvagg rvagg marked this pull request as ready for review May 29, 2023 09:55
@rvagg (Member, Author) commented May 29, 2023

Good to go

  • Updated PWQ to compare all waiting candidates and run the "best" next.
  • Replaced CompareStorageProviders with ChooseNextProvider
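
For reference, a small Go sketch of what "pick one from many" weighted selection can look like once all waiting candidates are compared together (hypothetical names, not the actual ChooseNextProvider signature): each waiting candidate gets a score and one is drawn with probability proportional to it.

```go
package main

import (
	"fmt"
	"math/rand"
)

// chooseNext draws one index from scores with probability proportional to
// the score. roll is a [0,1) random value; a no-randomness mode would simply
// take the index of the maximum score instead.
func chooseNext(scores []float64, roll float64) int {
	var total float64
	for _, s := range scores {
		total += s
	}
	target := roll * total
	for i, s := range scores {
		target -= s
		if target < 0 {
			return i
		}
	}
	return len(scores) - 1 // guard against floating-point drift
}

func main() {
	scores := []float64{4.2, 1.1, 0.6} // e.g. from the weighted scoring described earlier
	fmt.Println("chose candidate", chooseNext(scores, rand.Float64()))
}
```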

@rvagg rvagg force-pushed the rvagg/weighted-rank branch 2 times, most recently from 57c9b10 to 1765385 on May 29, 2023 at 10:08
@hannahhoward (Collaborator) left a comment:

I've left a few remaining comments, but IMHO this is LGTM.

Review threads (outdated, resolved): pkg/retriever/prioritywaitqueue/prioritywaitqueue.go · pkg/session/state.go
```go
for ii, p := range peers {
	ind[ii] = ii
	gsmd, _ := mda[ii].(*metadata.GraphsyncFilecoinV1)
	scores[ii] = spt.scoreProvider(p, gsmd)
}
```
A Collaborator commented:

Just something to consider: maybe at some point we start squaring these values to reduce randomness (or perhaps make that its own adjustable parameter).
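
(For illustration: raw scores of 3 and 1 bias selection 3:1 toward the first candidate, while squaring them to 9 and 1 biases it 9:1, so less of the dice-roll lands on the weaker candidate.)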

@rvagg (Member, Author) replied:

I'm hoping we can come up with ways to know what to do here without guessing too much; I'm trying modelling for now, but live insights and A/B testing might be better.

* add Duration property to FirstByte event
* fix http retriever ttfb, duration and speed metrics
* add MockSession to track metric reporting
* ttfb & bandwidth contribute to scoring
@rvagg rvagg merged commit 9622d8b into main Jun 1, 2023
15 of 16 checks passed
@rvagg rvagg deleted the rvagg/weighted-rank branch June 1, 2023 09:26