Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Issue/46 LogMergePolicy #49

Merged
merged 9 commits into from Nov 4, 2016
Merged

Conversation

currymj
Copy link
Contributor

@currymj currymj commented Oct 27, 2016

Implements a LogMergePolicy as mentioned in #46.

I would like to add some tests. However, I'm not totally clear on what might need to be tested. Thoughts?

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling d46d453 on currymj:issue/46 into * on fulmicoton:master*.

fn compute_merge_candidates(&self, segments: &[SegmentMeta]) -> Vec<MergeCandidate> {
// what about when segments is empty??
// take log of size of each segment
// log2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return no merges if the array is empty.

let size_sorted_log_tuples: Vec<_> = size_sorted_tuples.iter()
.map(|x| {(x.0, (x.1 as f64).log2())}).collect();

let (first_ind, first_score) = size_sorted_log_tuples[0];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment above

@fulmicoton
Copy link
Collaborator

I put a bunch of comments.

About unit test, you could make calls with very specific values. (no segment, X segment on the same level, X segment on level 0, X segments on level 2, segments below the lowest level threshold)

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling 5cf154c on currymj:issue/46 into * on fulmicoton:master*.

let mut current_max_log_size = first_score;
let mut levels = Vec::new();
levels.push(Vec::new());
levels.last_mut().unwrap().push(first_ind);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe avoid the unwrap and the three lines with.
let mut levels = vec!(vec!(first_ind));

// log2
let mut size_sorted_tuples = segments.iter()
.map(|x| {x.num_docs})
.enumerate()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can carry around SegmentId directly (as opposed to using usize). They are copyable and relatively small.


let result = levels.iter()
.map(|ind_vec| {
MergeCandidate(ind_vec.iter()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to trigger a merge only when we exceed a given a number of element in the same level. (probably a number between 6 and 8...)
2-way merge are very inefficient.

levels.push(Vec::new());
levels.last_mut().unwrap().push(first_ind);
for &(ind, score) in (&size_sorted_log_tuples).iter().skip(1) {
if score < (current_max_log_size - LEVEL_LOG_SIZE) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a hidden flaw to doing the exponential thing all the way.

In a normal setting, people will roughly produce fresh segment that have the same size in average. For the wikipedia
example, with 2.5 GB of heap and 8 threads for instance, my segments count around 100K documents.

But users may have accidentally (or because of a timer constraint) committed segments with tiny number of documents. They might end up with segment 1, 3, 9.

With the current implementation, these segment will never be merged. We can avoid that by having a value for the lowest level. The value could be customizable by the user. In the wikipedia example, I would probably want to set it to 130K.

Lucene has a default of 1000, and lets the user set it on its own. Another approach might be to define it as the
highest newest segment size. In any case, could you add such a property, to your merge policy and we'll work later on designing how to make it settable by the user later?

@fulmicoton
Copy link
Collaborator

@currymj Sorry I am new to the code review functionality. I wrote those comments weeks ago and apparently did not finalize the review. Let me know if those make sense.

@currymj
Copy link
Contributor Author

currymj commented Nov 3, 2016

This is all helpful advice! I was just digging through the Lucene code to try to reverse engineer how/why they use the lowest level setting, but your explanation makes a lot of sense. I don't think I will find it too difficult to add this and to just copy the logic from Lucene for adding a level cutoff whenever the floor is reached, etc.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling eee0b22 on currymj:issue/46 into * on fulmicoton:master*.

@fulmicoton
Copy link
Collaborator

Cool. I'll test that at home on indexing wikipedia. It will probably shave off a lot of time over the current merge policy. If I have time, I'll make a visualization of how it behaves.

@currymj
Copy link
Contributor Author

currymj commented Nov 4, 2016

Right now the values for min merge size and segment size are very small hardcoded constants. They should probably be configurable somewhere -- maybe a LogMergePolicy instance should contain these parameters?

@@ -6,6 +6,7 @@ pub struct LogMergePolicy;
use std::f64;

const LEVEL_LOG_SIZE: f64 = 0.75;
const MIN_SEGMENT_SIZE: u32 = 2;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a limit in number of documents? If so I think the default value is a bit low. What about 1000 for instance?

SegmentMeta::new(SegmentId::generate_random(), 2),
SegmentMeta::new(SegmentId::generate_random(), 2)];
let result_list = LogMergePolicy::default().compute_merge_candidates(&test_input);
assert!(result_list.len() == 1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this checks that there is a merge candidate, but we should probably check the number of segments in the merge candidate as well.

fn test_log_merge_policy_empty() {
let y = Vec::new();
let result_list = LogMergePolicy::default().compute_merge_candidates(&y);
assert!(result_list.len() == 0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a tiny detail, so no need to fix necessarily, but assert_eq or is_empty would have been nicer here.

SegmentMeta::new(SegmentId::generate_random(), 10),
SegmentMeta::new(SegmentId::generate_random(), 10)];
let result_list = LogMergePolicy::default().compute_merge_candidates(&test_input);
assert!(result_list.len() == 1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the name of the unit test. It might be cool to test that merge is not trigger for only a pair of segments too. (MIN_MERGE_SIZE)

@fulmicoton fulmicoton merged commit eee0b22 into quickwit-oss:master Nov 4, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants