# new interval based cost function #2972

Merged

merged 3 commits into apache:master May 17, 2016

## Conversation


### xvrl commented May 16, 2016 • edited

Our current segment balancing algorithm uses a cost function which has some unintended effects on the distribution of segments in the cluster. While it keeps the total number of segments roughly balanced across nodes, it does not evenly distribute segments within a given interval. This leads to many queries hitting only a fraction of the historical nodes in the cluster.

This is a proposal to revamp the cost function to address those shortcomings and improve overall segment distribution.

TL;DR: here's how segment distribution has improved in the days since we rolled out this improved cost function to one of our Druid clusters (each dot is a segment, each color represents a different data source).

### Issues with our existing cost function

Here is a breakdown of the issues with the current cost function:

1. Segments tend to form clusters with large gaps in between

If you take two segments roughly 30 days apart, the existing gapPenalty causes the cost of putting a segment anywhere between those two to be higher than adding a segment that overlaps with them. This causes the balancing to converge to form clusters of segments ~30 days apart, as seen below.

2. Segments for recent time intervals tend to be distributed unevenly

The recencyPenalty, in combination with the gapPenalty, makes the cost function non-trivial for recent segments and causes strange gaps of data within the most recent 7 days worth of data.

3. The base joint cost between two segments is tied to segment byte sizes. This causes the cost to be affected by differing levels of compression, rather than being tied to the actual cost of scanning a segment.

### Improved cost function

Our improved cost function is designed to be simple and have the following properties:

1. Our cost function should reflect the cost of scanning a segment.
2. The joint cost should reflect the likelihood of segments being scanned simultaneously, i.e. the joint cost of querying two time-slices should decrease based on how far apart they are.
3. Our cost function should be additive, i.e. given three intervals A, B, and C, cost(A, B union C) = cost(A, B) + cost(A, C)

To satisfy 1. we will assume that the cost of scanning a unit time-slice of data is constant. This means our cost function will only depend on segment intervals.

Note: In practice, this means that if all segments in a cluster have the same segment granularity, then the cpu time spent scanning each partition should be roughly constant. Ensuring the unit of computation is constant is also good practice to reduce variance in segment scan times across the cluster.

To model the joint cost of querying two time-slices and satisfy 2., we use an exponential decay exp(-lambda * t), where lambda defines the rate at which the interaction decays, and t is the relative time difference between the two time-slices. Currently lambda is set to give the decay a half-life of 24 hours.
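The relationship between the half-life and lambda can be sketched as follows (illustrative only; the class name is hypothetical, and the actual Druid code uses FastMath):

```java
// Sketch: deriving lambda from the desired half-life.
// With HALF_LIFE = 24 hours, lambda = ln(2) / 24, so the interaction between
// two time-slices 24 hours apart is half that of two coincident time-slices.
class HalfLifeSketch
{
  static final double HALF_LIFE = 24.0;                 // hours
  static final double LAMBDA = Math.log(2) / HALF_LIFE; // decay rate

  static double decay(double deltaHours)
  {
    return Math.exp(-LAMBDA * deltaHours);
  }
}
```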

Since we assume the cost of querying each time-slice is constant, we can compute the joint cost of two segments by simply integrating over the segment intervals, i.e. for two segments X and Y covering intervals [x_0, x_1) and [y_0, y_1) respectively, our joint cost becomes:

cost(X, Y) = \int_{x_0}^{x_1} \int_{y_0}^{y_1} e^{-\lambda |x - y|} \, dy \, dx

This has the nice property of being additive as in 3. and of having a closed form solution.
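For intuition, the double integral can be approximated by brute-force numerical integration (an illustrative sketch, not the actual implementation, which uses the closed-form solution); this form also makes the additivity property easy to verify:

```java
// Sketch: the joint cost as a numerical double integral of
// exp(-lambda * |x - y|) over the two segment intervals (times in hours).
// Midpoint-rule quadrature; demonstrates cost(A, B union C) = cost(A, B) + cost(A, C).
class JointCostSketch
{
  static final double LAMBDA = Math.log(2) / 24.0; // half-life of 24 hours

  static double jointCost(double x0, double x1, double y0, double y1)
  {
    final int n = 1000;
    final double dx = (x1 - x0) / n;
    final double dy = (y1 - y0) / n;
    double sum = 0;
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        final double x = x0 + (i + 0.5) * dx;
        final double y = y0 + (j + 0.5) * dy;
        sum += Math.exp(-LAMBDA * Math.abs(x - y)) * dx * dy;
      }
    }
    return sum;
  }
}
```

Splitting the second interval into adjacent pieces leaves the total unchanged, which is the additivity property 3. above.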

One nice side-effect of additivity is that re-indexing the same data at different segment granularities (e.g. going from hourly segments to daily segments) will not affect the joint cost, assuming the cost of scanning a unit of time stays the same per shard.

Segments of the same datasource are of course more likely to be queried simultaneously, so we multiply the cost by a constant factor (2 in this case) if the two segments are from the same datasource. This ensures that if two datasources are distributed equally, we are more likely to spread them across servers.
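Applying that multiplier can be sketched as follows (the class and constant names are hypothetical; only the factor of 2 comes from the PR description):

```java
// Sketch: weighting the joint cost when both segments belong to the same datasource.
class DataSourceWeight
{
  static final double SAME_DATASOURCE_MULTIPLIER = 2.0; // factor from the PR description

  static double weightedCost(double jointCost, boolean sameDataSource)
  {
    return sameDataSource ? SAME_DATASOURCE_MULTIPLIER * jointCost : jointCost;
  }
}
```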

Hopefully this all makes sense, but comments are of course welcome.

Regarding performance, the new cost function is more expensive to compute than the existing one, but thanks to the optimizations in #2910 it will not be worse than what we have in our current release, and maybe even a little bit faster still.

The results of rolling out this new cost function speak for themselves. You can see how segment distribution has massively improved in our Druid cluster for the problem cases depicted above.

Addresses issues with balancing of segments in the existing cost function
- gapPenalty led to clusters of segments ~30 days apart
- recencyPenalty caused imbalance among recent segments
- size-based cost could be skewed by compression

New cost function is purely based on segment intervals:
- assumes each time-slice of a partition is a constant cost
- cost is additive, i.e. cost(A, B union C) = cost(A, B) + cost(A, C)
- cost decays exponentially based on distance between time-slices
force-pushed the updated-balancing-costfunction branch from 1c08dd5 to 9d9b3eb May 16, 2016
added this to the 0.9.1 milestone May 16, 2016
```java
static final double HALF_LIFE = 24.0; // cost function half-life in hours
```

### fjy May 16, 2016

it'd be really nice to have some comments about what everything means and how the algo works

### xvrl May 16, 2016

half life is by definition ln(2) / lambda i.e. the time difference that will make the joint cost go down by half

### fjy commented May 16, 2016

 👍 if we have more comments about how the algorithm works

### xvrl commented May 16, 2016

 @fjy added a few more comments + links to the PR description in the code

```java
return intervalCost(y0, y0, y1) +
       intervalCost(beta, beta, gamma) +
       // cost of exactly overlapping intervals of size beta
       2 * (beta + FastMath.exp(-beta) - 1);
```

### drcrallen May 17, 2016

Can this be a constant?

### drcrallen May 17, 2016

Or I suppose this is just the solution to the integral?

Can the comment be described as such?

### xvrl May 17, 2016 • edited

@drcrallen this can't be a constant, since beta depends on interval start / end. And yes, this is the solution to
\int_0^{\beta} \int_{0}^{\beta} e^{-|x-y|} \, dx \, dy = 2 \cdot (\beta + e^{-\beta} - 1)
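The identity can be checked numerically (an illustrative sketch; the class name is hypothetical and this is not code from the PR):

```java
// Sketch: verify that the double integral of exp(-|x - y|) over [0, beta]^2
// matches the closed form 2 * (beta + exp(-beta) - 1) via midpoint quadrature.
class OverlapIdentityCheck
{
  static double numeric(double beta)
  {
    final int n = 2000;
    final double d = beta / n;
    double sum = 0;
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        sum += Math.exp(-Math.abs((i + 0.5) * d - (j + 0.5) * d)) * d * d;
      }
    }
    return sum;
  }

  static double closedForm(double beta)
  {
    return 2 * (beta + Math.exp(-beta) - 1);
  }
}
```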

### drcrallen May 17, 2016

I meant the constant out front, before I realized it's just the solution to the integral.

```java
final long gapMillis = gapMillis(segment1.getInterval(), segment2.getInterval());
```

### nishantmonu51 May 17, 2016

you can probably also delete static method gapMillis now

### xvrl May 17, 2016

I already removed it.

### nishantmonu51 commented May 17, 2016

 👍

removed the Discuss label May 17, 2016
merged commit e79284d into apache:master May 17, 2016
1 check passed
deleted the updated-balancing-costfunction branch May 17, 2016
added a commit to metamx/druid that referenced this issue May 18, 2016
* new interval based cost function


* comments and formatting

* add more comments to explain the calculation
mentioned this pull request May 18, 2016
added a commit to metamx/druid that referenced this issue May 18, 2016
Backport balancing improvements apache#2910 apache#2964 apache#2972
added Improvement and removed Bug labels May 20, 2016

### sascha-coenen commented May 23, 2016

 Awesome work! Can't wait to migrate. As a little teaser, could you maybe let me know how much impact this new balancing strategy has on the query performance?

mentioned this pull request May 24, 2016

### leventov commented Jun 23, 2017

 @xvrl could you comment on why you used FastMath instead of plain Math?

mentioned this pull request Jun 23, 2017

### xvrl commented Jul 10, 2017

 @leventov I used FastMath simply because it was faster than Math and I was ok with the precision trade-offs for this particular use-case. The objective was to keep the performance of the balancing comparable to what it was prior to #2972.

### leventov commented Jul 11, 2017

 It's interesting because according to https://issues.apache.org/jira/browse/MATH-1422 "FastMath" may not actually be faster, but OTOH it is supposed to be more precise. FYI @dgolitsyn

### xvrl commented Jul 11, 2017

 @leventov I only used FastMath.exp(), and in this particular case I do remember it being faster. It is quite possible that as time goes by, with better hardware and improved JIT this may not hold forever. If you have data showing it does not help much, I wouldn't have any objection in removing the fastmath dependency.

mentioned this pull request Dec 21, 2019
mentioned this pull request Apr 14, 2022