Reduce search multithreading overhead #1892
```rust
use std::io::BufRead;
use std::iter;
use rand::seq::SliceRandom;
use rand::thread_rng;

fn run(n: usize, m: usize, total: usize) {
    // res[k - 1] counts how often the busiest shard held exactly k of the m sampled results
    let mut res = vec![0; m];
    let mut basevec = Vec::new();
    for s in 0..n {
        basevec.extend(iter::repeat(s).take(m));
    }
    for _ in 0..total {
        basevec.shuffle(&mut thread_rng());
        let mut cnts = vec![0; n];
        for e in &basevec[0..m] {
            cnts[*e] += 1;
        }
        res[*cnts.iter().max().unwrap() - 1] += 1;
    }
    // cumulative probability that the busiest shard holds at most k results
    let mut prob: f64 = 0.0;
    for (v, c) in res.iter().enumerate() {
        prob += *c as f64 / total as f64;
        println!("{} => {:.5}", v + 1, prob * 100.0);
    }
    println!("=====");
}

fn main() {
    loop {
        let mut buf = String::new();
        std::io::stdin().lock().read_line(&mut buf).ok();
        let v: Vec<_> = buf
            .split_whitespace()
            .map(|s| s.parse::<usize>().unwrap())
            .collect();
        run(v[0], v[1], v[2]);
    }
}
``` |
[Plot: n shards, top m requested; x-axis is the number of shards, one curve per m value.] We see it is tightest for the bound 10, which also seems to be one of the most common requests |
I tried to do back-of-the-envelope calculations for this and failed miserably because order statistics is hard :) However, I did find this, and they suggest using
|
@royjacobson if you like math you're welcome to join me! 🎩 I tried using it (I studied probability theory for a year!! 😆), but it quickly became evident that there is no way to verify your results. Besides, writing an experiment is faster and more reliable |
The next optimization is for structured search. Approaching this problem very simply and trying to find a very reliable upper bound. Given a query with Now, imagine we have found Let's be pessimistic and assume our shard x already has the highest number of hits, so all others will be lower. Next we want to find how for a query For this I've written a simple script:

```rust
use std::collections::HashMap;
use rand::{thread_rng, Rng};

fn run(n: u32) {
    let total = 300_000;
    // map: max per-shard count -> all observed min per-shard counts
    let mut out = HashMap::<u32, Vec<u32>>::new();
    for _ in 0..total {
        // random total number of results, 1..=2000
        let m: usize = thread_rng().gen::<usize>() % 2000usize + 1;
        let mut bv = vec![0u32; m];
        for v in &mut bv {
            *v = thread_rng().gen::<u32>() % n; // assign each result to a shard
        }
        let mut cnts = vec![0u32; n as usize];
        for v in bv {
            cnts[v as usize] += 1;
        }
        // let vavg = cnts.iter().sum::<u32>() / cnts.len() as u32;
        let vmax = *cnts.iter().max().unwrap();
        let vmin = *cnts.iter().min().unwrap();
        if vmax > 100 {
            continue;
        }
        out.entry(vmax).or_insert_with(Vec::new).push(vmin);
    }
    // for every observed max, take the 1st percentile of the min values
    let mut ps: Vec<(u32, u32)> = vec![];
    for (k, mut v) in out {
        v.sort();
        let i = (v.len() as f32 * 0.01f32).floor();
        ps.push((k, v[i as usize]));
    }
    ps.sort();
    let vs: Vec<u32> = ps.iter().map(|(_, v)| *v).collect();
    println!("{:?}", vs);
}
``` |
True. It's not so much about the tightest formula now, but about what assumptions we can generally make, what techniques we can use, and what logical pitfalls there are. For example, optimization (1) can only be applied after check (2), because our formulas for (1) only hold if every shard really has at least |
|
Your approach is very good! Did you think of it yourself, or did you read it in a paper? Seems like something worth publishing in the academic literature. |
It would take ages in Python, etc. So I can only write them in Go/C++/Rust, and I
That is true and I thought about it, but the first shard to finish has to make the decision on its own. If we improve the estimation for shards k... (k reliable datapoints) and onward, it won't help much, because our first shard will keep ruining everything
Thanks 😄 I didn't; I tried searching for quite some time but didn't find much. There certainly must be something on that topic on arXiv. Anyway, I was just thinking about what really obvious optimizations we can apply |
|
I can re-write them. The graphs are a one-liner in Python. Let me finish my research first 🤣
To some extent. If the bound is I will think about this; it requires a more complicated experiment |
I do not understand your last comment. Suppose we want to receive |
You forgot that k is at most k :) So for n=16 and m=160, we can be 99% sure that each shard has 16 entries if we found ~40 or more entries (that's the blue line). Hence we care only about values < 40. Let's say we found 30 entries; then with 99% confidence each shard has at least ~8 entries, which takes us to a bound of 128 with 99%. Hence shrinking 30 down is already risky. Yet it is also extremely unlikely that every other shard lies on the lower bound; my model just doesn't take that into account. Instead of calculating the expected 99th percentile Regarding max variance for knn from experiment (1), I'm not sure if the values technically hold for the |
That is the 99th percentile bound for But I don't yet fully understand what conclusions we can make from But how does leaving out a random value on all shards affect the sum in the end? Edit: i.e., doesn't the sum estimation imply we will have outliers that balance it? We need to simulate that one |
I am sorry, I am like 10 steps behind you 😄 |
I don't know if this helps or not, but I asked this question on Math StackExchange:
This is my second script. A pure experiment repeated a few million times. First, randomize the total number of results (namely m), then randomly assign each element (0...m) to one of the shards (0..n). Calculate min k, avg k, max k and store the pair (max k -> min k). At the end, group by max k (the first element of the pair) and sort by min k in each group, so we get a P.S. There is a low limit on m (10k), which makes the experiment pessimistic. Also, I assume that m is evenly distributed over the range 0-10k, which is obviously not true; I assume in most cases there will be either very little data (specific queries) or very much data (broad queries). I tried generating in the range 0-200 and my lower bound still holds, but it's much tighter |
You are simulating the balls-into-bins problem:
Attaching what I think is a relevant article. Specifically for Now, you correctly stated that upon sampling a bin (shard) we should be pessimistic, hence we can assume this shard has the highest number of results; therefore for a sample of where |
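To make the balls-into-bins bound concrete, here is a small sketch (my own illustration, not code from this thread; the function names are mine): under that model, the hit count of one fixed shard is Binomial(m, 1/n), and we can search for the largest k the shard reaches with 99% probability.

```rust
/// Probability mass of Binomial(m, p) at k, computed in log space
/// to avoid factorial overflow.
fn binom_pmf(m: u64, p: f64, k: u64) -> f64 {
    let mut log = 0.0f64; // ln C(m, k)
    for i in 0..k {
        log += ((m - i) as f64).ln() - ((i + 1) as f64).ln();
    }
    (log + k as f64 * p.ln() + (m - k) as f64 * (1.0 - p).ln()).exp()
}

/// Largest k such that P(X < k) <= risk: with probability >= 1 - risk,
/// a single shard holds at least k of the m uniformly spread results.
fn shard_lower_bound(m: u64, n: u64, risk: f64) -> u64 {
    let p = 1.0 / n as f64;
    let mut below = 0.0; // running P(X < k)
    for k in 0..=m {
        let next = below + binom_pmf(m, p, k); // P(X < k + 1)
        if next > risk {
            return k;
        }
        below = next;
    }
    m
}

fn main() {
    // m = 160 results over n = 16 shards at 1% risk, as in the discussion above
    println!("99% lower bound: {}", shard_lower_bound(160, 16, 0.01));
}
```

Note this bounds a single shard; holding it for all n shards simultaneously would need a union bound (risk / n). The conditional estimate discussed in the thread (given the count observed on the first shard) is a different, tighter model.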
Orange is the formula, blue is my experiment. You can see the bound is much stricter (but it's actually of the same slope, so the formula and the experiment reaffirm each other). I've also mainly googled the balls-into-bins problem and the resulting binomial distribution. The problem is that most formulas are limits: with infinitely large input parameters the odds become infinitely small. We neither have infinitely large numbers, nor do we need the odds to be infinitely small; we can risk 1% in exchange for shaving off many more elements. Most of those formulas were probably not meant to be used on small numbers. Either way, we have found a decent approach. I'll start implementing it; the formula and numbers can be tuned later |
|
I can't decide on the following issue: how to re-fetch documents? My first and main goal is to cut pointless serialization time. The most straightforward approach would be finding all search results and serializing only the first ones, up to the probabilistic bound. Since we already got the ids of the following ones, it'd be a pity not to store them as well. With that information, we don't actually need to repeat the query in an unlucky case; we just need to fetch the remaining documents. However, it's not clear how to perform the second hop atomically. Hint: it's not possible, because we don't know ahead of time whether it's the last hop. We can fetch documents outside of the transactional world. However, as a result, the search command becomes non-atomic: it could theoretically return incorrect results. Alternatively, we need to repeat the whole query, which is not cool. With a fast re-fetch approach we could even decide to lower the probability bound to ~90-95 |
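A minimal sketch of the "store the remaining ids" idea from the comment above (all names here are hypothetical, not Dragonfly's API): the first hop serializes documents only up to the probabilistic bound but keeps the overflow ids, so an unlucky follow-up can fetch by id instead of repeating the whole query.

```rust
// Hypothetical types for illustration; not actual Dragonfly code.
struct PartialResult {
    serialized: Vec<String>, // documents already serialized for the reply
    overflow_ids: Vec<u64>,  // ids we found but did not serialize
}

fn first_hop(all_ids: &[u64], bound: usize, fetch: impl Fn(u64) -> String) -> PartialResult {
    let (head, tail) = all_ids.split_at(bound.min(all_ids.len()));
    PartialResult {
        serialized: head.iter().map(|&id| fetch(id)).collect(),
        overflow_ids: tail.to_vec(),
    }
}

fn second_hop(res: &PartialResult, missing: usize, fetch: impl Fn(u64) -> String) -> Vec<String> {
    // As discussed above, this runs outside the transactional world,
    // so documents may have changed since the first hop.
    res.overflow_ids.iter().take(missing).map(|&id| fetch(id)).collect()
}

fn main() {
    let fetch = |id: u64| format!("doc{}", id); // stand-in for a real document fetch
    let r = first_hop(&[1, 2, 3, 4, 5], 3, fetch);
    assert_eq!(r.serialized.len(), 3);
    let extra = second_hop(&r, 2, fetch);
    assert_eq!(extra, vec!["doc4", "doc5"]);
}
```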
If it's a rare enough scenario, do we really need to fetch the remaining results? |
It's up to 1% likely - that's not zero. I really think we should adhere to correctness. In many search systems there is an agreement that returning fewer items than requested indicates that there are no more - I remember always checking this myself when working with SQL databases. I have one more idea 💡 Let's introduce write epochs to indices. If no writes happened between the query and the follow-up, it is safe to return more documents from the same result set. Otherwise we need to perform the full query again |
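The write-epoch idea could look roughly like this (a sketch with invented names, not actual Dragonfly code): each index keeps a counter bumped on every write, and a stored result set is only reusable for a follow-up fetch if the epoch has not moved since the first hop.

```rust
// Hypothetical index with a write epoch; illustration only.
struct Index {
    epoch: u64,
    // ... documents would live here
}

impl Index {
    fn write(&mut self) {
        self.epoch += 1; // every mutation invalidates outstanding result sets
    }
}

/// A follow-up may serve more documents from the old result set
/// only if no writes happened in between.
fn can_reuse(index: &Index, snapshot_epoch: u64) -> bool {
    index.epoch == snapshot_epoch
}

fn main() {
    let mut idx = Index { epoch: 0 };
    let snap = idx.epoch; // taken at query time
    assert!(can_reuse(&idx, snap)); // no writes: safe to return more documents
    idx.write();
    assert!(!can_reuse(&idx, snap)); // a write happened: re-run the full query
}
```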
+1 to not returning fewer results than requested unless there aren't enough |
@dranikpg if you feel strongly about returning all the results (and I agree with your arguments), then let's pay in latency. We should be fine with 1% of queries incurring twice the latency + CPU cost. |
Okay, I can keep it as a to-do with the write epochs |
Damn, I still need to return all ids with scores to verify knn selected the most optimal ones; I can't blindly chop them off 😵 |
Ah, in your analysis you focused on recall, but you also have the top-k requirement. So now the question is how much to return between |
Yes, because without a top-k query there is no variance - I always select m/n documents from each shard (if there are enough on that shard); with top-k queries I sort them beforehand |
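The no-variance case above can be stated in one line (my own sketch and naming): without top-k scoring, a query for `limit` documents can deterministically take a fixed quota from each of n shards, so there is nothing probabilistic to model.

```rust
/// Smallest per-shard quota q such that q * n_shards >= limit
/// (assuming every shard has at least q matching documents).
fn per_shard_quota(limit: usize, n_shards: usize) -> usize {
    (limit + n_shards - 1) / n_shards // ceiling division
}

fn main() {
    assert_eq!(per_shard_quota(10, 4), 3); // 3 * 4 = 12 >= 10
    assert_eq!(per_shard_quota(16, 16), 1);
}
```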
so maybe |
My preliminary results with my prototype DF: 8 threads, 300k entries. Resulting RPS:
The difference grows with larger limit values. For limits 5 and 10 the gap is suspiciously big 🤔 |
I haven't touched the KNN optimization yet. I need to benchmark it first and see how much sense it makes
|
Currently, document iteration order is shard-sequential: first all documents are listed from shard 0, then from shard 1, etc.
Alternatively, we can list documents interleaved: the first document from each shard, then the second document from each shard, etc. This will allow us to reduce the number of elements we fetch from each shard.
Now if we need to fetch two items, we can fetch only one from each shard -> and we have enough.
Reducing the number of fetched items is probabilistic. For example, on 16 shards, we can assume it's very unlikely that even a single shard contains more than half of the documents. If that really is the case, we are forced to perform one more hop. The specific weights need to be tuned.
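A sketch of the interleaved (round-robin) order described above (illustration only, not Dragonfly code): with this ordering, answering a query for `limit` documents only needs roughly the first `limit / n_shards` entries from each shard's list.

```rust
/// Round-robin interleaving of per-shard result lists:
/// first document of every shard, then second of every shard, etc.
fn interleave(shards: &[Vec<u64>]) -> Vec<u64> {
    let max_len = shards.iter().map(|s| s.len()).max().unwrap_or(0);
    let mut out = Vec::new();
    for i in 0..max_len {
        for shard in shards {
            if let Some(&doc) = shard.get(i) {
                out.push(doc);
            }
        }
    }
    out
}

fn main() {
    // Two shards with two docs each; a query with limit = 2 only needs
    // one document from each shard under this ordering.
    let shards = vec![vec![10, 11], vec![20, 21]];
    let order = interleave(&shards);
    assert_eq!(order, vec![10, 20, 11, 21]);
    assert_eq!(order[..2], [10, 20]);
}
```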