New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Issue/46 LogMergePolicy #49
Conversation
Changes Unknown when pulling d46d453 on currymj:issue/46 into * on fulmicoton:master*. |
fn compute_merge_candidates(&self, segments: &[SegmentMeta]) -> Vec<MergeCandidate> { | ||
// what about when segments is empty?? | ||
// take log of size of each segment | ||
// log2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return no merges if the array is empty.
let size_sorted_log_tuples: Vec<_> = size_sorted_tuples.iter() | ||
.map(|x| {(x.0, (x.1 as f64).log2())}).collect(); | ||
|
||
let (first_ind, first_score) = size_sorted_log_tuples[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comment above
I put a bunch of comments. About unit test, you could make calls with very specific values. (no segment, X segment on the same level, X segment on level 0, X segments on level 2, segments below the lowest level threshold) |
Changes Unknown when pulling 5cf154c on currymj:issue/46 into * on fulmicoton:master*. |
let mut current_max_log_size = first_score; | ||
let mut levels = Vec::new(); | ||
levels.push(Vec::new()); | ||
levels.last_mut().unwrap().push(first_ind); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe avoid the unwrap and the three lines with.
let mut levels = vec!(vec!(first_ind));
// log2 | ||
let mut size_sorted_tuples = segments.iter() | ||
.map(|x| {x.num_docs}) | ||
.enumerate() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can carry around SegmentId directly (as opposed to using usize). They are copyable and relatively small.
|
||
let result = levels.iter() | ||
.map(|ind_vec| { | ||
MergeCandidate(ind_vec.iter() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to trigger a merge only when we exceed a given a number of element in the same level. (probably a number between 6 and 8...)
2-way merge are very inefficient.
levels.push(Vec::new()); | ||
levels.last_mut().unwrap().push(first_ind); | ||
for &(ind, score) in (&size_sorted_log_tuples).iter().skip(1) { | ||
if score < (current_max_log_size - LEVEL_LOG_SIZE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a hidden flaw to doing the exponential thing all the way.
In a normal setting, people will roughly produce fresh segment that have the same size in average. For the wikipedia
example, with 2.5 GB of heap and 8 threads for instance, my segments count around 100K documents.
But users may have accidentally (or because of a timer constraint) committed segments with tiny number of documents. They might end up with segment 1, 3, 9.
With the current implementation, these segment will never be merged. We can avoid that by having a value for the lowest level. The value could be customizable by the user. In the wikipedia example, I would probably want to set it to 130K.
Lucene has a default of 1000, and lets the user set it on its own. Another approach might be to define it as the
highest newest segment size. In any case, could you add such a property, to your merge policy and we'll work later on designing how to make it settable by the user later?
@currymj Sorry I am new to the code review functionality. I wrote those comments weeks ago and apparently did not finalize the review. Let me know if those make sense. |
This is all helpful advice! I was just digging through the Lucene code to try to reverse engineer how/why they use the lowest level setting, but your explanation makes a lot of sense. I don't think I will find it too difficult to add this and to just copy the logic from Lucene for adding a level cutoff whenever the floor is reached, etc. |
Changes Unknown when pulling eee0b22 on currymj:issue/46 into * on fulmicoton:master*. |
Cool. I'll test that at home on indexing wikipedia. It will probably shave off a lot of time over the current merge policy. If I have time, I'll make a visualization of how it behaves. |
Right now the values for min merge size and segment size are very small hardcoded constants. They should probably be configurable somewhere -- maybe a LogMergePolicy instance should contain these parameters? |
@@ -6,6 +6,7 @@ pub struct LogMergePolicy; | |||
use std::f64; | |||
|
|||
const LEVEL_LOG_SIZE: f64 = 0.75; | |||
const MIN_SEGMENT_SIZE: u32 = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a limit in number of documents? If so I think the default value is a bit low. What about 1000 for instance?
SegmentMeta::new(SegmentId::generate_random(), 2), | ||
SegmentMeta::new(SegmentId::generate_random(), 2)]; | ||
let result_list = LogMergePolicy::default().compute_merge_candidates(&test_input); | ||
assert!(result_list.len() == 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this checks that there is a merge candidate, but we should probably check the number of segments in the merge candidate as well.
fn test_log_merge_policy_empty() { | ||
let y = Vec::new(); | ||
let result_list = LogMergePolicy::default().compute_merge_candidates(&y); | ||
assert!(result_list.len() == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a tiny detail, so no need to fix necessarily, but assert_eq
or is_empty
would have been nicer here.
SegmentMeta::new(SegmentId::generate_random(), 10), | ||
SegmentMeta::new(SegmentId::generate_random(), 10)]; | ||
let result_list = LogMergePolicy::default().compute_merge_candidates(&test_input); | ||
assert!(result_list.len() == 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the name of the unit test. It might be cool to test that merge is not trigger for only a pair of segments too. (MIN_MERGE_SIZE)
Implements a LogMergePolicy as mentioned in #46.
I would like to add some tests. However, I'm not totally clear on what might need to be tested. Thoughts?