
new interval based cost function #2972

Merged
3 commits merged into master on May 17, 2016

Conversation


@xvrl xvrl commented May 16, 2016

Our current segment balancing algorithm uses a cost function which has some unintended effects on the distribution of segments in the cluster. While it keeps the total number of segments roughly balanced across nodes, it does not evenly distribute segments within a given interval. This leads to many queries hitting only a fraction of the historical nodes in the cluster.

This is a proposal to revamp the cost function to address those shortcomings and improve overall segment distribution.

TL;DR here's how segment distribution has improved in the days since we rolled out this improved cost function to one of our Druid clusters (each dot is a segment, each color represents a different data source).

[chart: per-server segment distribution improving over the days after rollout]

Issues with our existing cost function

Here is a breakdown of the issues with the current cost function

  1. Segments tend to form clusters with large gaps in between

    If you take two segments roughly 30 days apart, the existing gapPenalty makes the cost of placing a third segment anywhere between them higher than the cost of adding a segment that overlaps one of them. Balancing therefore converges to clusters of segments roughly 30 days apart, as seen below.

    [chart: segments clustered roughly 30 days apart]

  2. Segments for recent time intervals tend to be distributed unevenly

    The recencyPenalty, in combination with the gapPenalty, makes the cost function non-trivial for recent segments and causes strange gaps within the most recent 7 days of data.

    [chart: uneven distribution of recent segments]

  3. The base joint cost between two segments is tied to segment byte sizes. Cost is therefore affected by different levels of compression, rather than by the actual cost of scanning a segment.

Improved cost function

Our improved cost function is designed to be simple and have the following properties:

  1. The cost function should reflect the cost of scanning a segment.
  2. The joint cost should reflect the likelihood of segments being scanned simultaneously, i.e. the joint cost of querying two time-slices should decrease based on how far apart they are.
  3. Our cost function should be additive, i.e. given three intervals A, B, and C, cost(A, B union C) = cost(A, B) + cost(A, C)

To satisfy 1. we will assume that the cost of scanning a unit time-slice of data is constant. This means our cost function will only depend on segment intervals.

Note: In practice, this means that if all segments in a cluster have the same segment granularity, then the cpu time spent scanning each partition should be roughly constant. Ensuring the unit of computation is constant is also good practice to reduce variance in segment scan times across the cluster.

To model the joint cost of querying two time-slices and satisfy 2., we use an exponential decay exp(-lambda * t), where lambda defines the rate at which the interaction decays, and t is the relative time difference between the two time-slices. Currently lambda is set to give the decay a half-life of 24 hours.
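As a small sketch of the decay term described above (the class and method names here are illustrative, not Druid's actual implementation), a 24-hour half-life fixes lambda at ln(2) / 24:

```java
// Illustrative sketch of the exponential-decay interaction term; names are
// hypothetical and not Druid's actual implementation.
public class DecaySketch {
    // half-life of 24 hours => lambda = ln(2) / 24
    static final double HALF_LIFE_HOURS = 24.0;
    static final double LAMBDA = Math.log(2) / HALF_LIFE_HOURS;

    // interaction between two unit time-slices |tHours| hours apart;
    // equals 1 at zero distance and halves every 24 hours
    static double decay(double tHours) {
        return Math.exp(-LAMBDA * Math.abs(tHours));
    }
}
```

With this lambda, two time-slices 24 hours apart interact half as strongly as overlapping ones, and slices 48 hours apart a quarter as strongly.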

Since we assume the cost of querying each time-slice is constant, we can compute the joint cost of two segments by simply integrating over the segment intervals, i.e. for two segments X and Y covering intervals [x_0, x_1) and [y_0, y_1) respectively, our joint cost becomes:

cost(X, Y) = \int_{x_0}^{x_1} \int_{y_0}^{y_1} e^{-\lambda |x - y|} \, dy \, dx

This has the nice property of being additive as in 3. and of having a closed form solution.

One nice side-effect of additivity is that re-indexing the same data at different segment granularities (e.g. going from hourly segments to daily segments) will not affect the joint cost, assuming the cost of scanning a unit of time stays the same per shard.
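To make the additivity property concrete, here is a numerical sketch (hypothetical names, working in units where lambda = 1): it approximates the double integral with midpoint sums, so we can check both additivity and the closed form 2 * (beta + exp(-beta) - 1) for exactly overlapping intervals of length beta.

```java
// Numerical sketch of the joint cost integral, i.e. the integral of
// exp(-|x - y|) over [x0, x1) x [y0, y1), with time in units of 1/lambda.
// Illustrative only; Druid uses a closed-form solution instead.
public class JointCostSketch {
    static double cost(double x0, double x1, double y0, double y1) {
        final int n = 1000; // midpoint-rule resolution per axis
        double dx = (x1 - x0) / n, dy = (y1 - y0) / n, sum = 0;
        for (int i = 0; i < n; i++) {
            double x = x0 + (i + 0.5) * dx;
            for (int j = 0; j < n; j++) {
                double y = y0 + (j + 0.5) * dy;
                sum += Math.exp(-Math.abs(x - y));
            }
        }
        return sum * dx * dy;
    }
}
```

Splitting Y = [0, 2) into [0, 1) and [1, 2) leaves the joint cost with X = [0, 1) unchanged up to discretization error, which is exactly why re-indexing at a different granularity does not move segments around.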

Segments of the same datasources are of course more likely to be queried simultaneously, so we multiply the cost by a constant factor (2 in this case) if the two segments are from the same data source. This ensures that if two datasources are distributed equally, we are more likely to spread them across servers.
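A hypothetical sketch of that weighting (the constant factor of 2 is from the description above; the names are illustrative, not Druid's exact code):

```java
// Hypothetical sketch: inflate the joint cost when two segments belong to
// the same datasource, so the balancer prefers spreading them across servers.
public class DataSourceWeightSketch {
    static final double SAME_DATASOURCE_MULTIPLIER = 2.0;

    static double weightedJointCost(double baseCost, String ds1, String ds2) {
        return ds1.equals(ds2) ? SAME_DATASOURCE_MULTIPLIER * baseCost : baseCost;
    }
}
```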

Hopefully this all makes sense, but comments are of course welcome.

Regarding performance, the new cost function is more expensive to compute than the existing one, but thanks to the optimizations in #2910 it will be no worse than our current release, and may even be slightly faster.

The results of rolling out this new cost function speak for themselves. You can see how segment distribution has massively improved in our Druid cluster for the problem cases depicted above.

[chart: recent-segment distribution after rollout]

[chart: per-server segment distribution after rollout]

Addresses issues with balancing of segments in the existing cost function
- `gapPenalty` led to clusters of segments ~30 days apart
- `recencyPenalty` caused imbalance among recent segments
- size-based cost could be skewed by compression

New cost function is purely based on segment intervals:
- assumes each time-slice of a partition is a constant cost
- cost is additive, i.e. cost(A, B union C) = cost(A, B) + cost(A, C)
- cost decays exponentially based on distance between time-slices
@xvrl xvrl force-pushed the updated-balancing-costfunction branch from 1c08dd5 to 9d9b3eb Compare May 16, 2016
@fjy fjy added this to the 0.9.1 milestone May 16, 2016
```java
    return 0;
  }
}

static final double HALF_LIFE = 24.0; // cost function half-life in hours
```
@fjy (Contributor), May 16, 2016:

it'd be really nice to have some comments about what everything means and how the algo works

@xvrl (Member, Author), May 16, 2016:

Half-life is by definition ln(2) / lambda, i.e. the time difference that makes the joint cost go down by half.

@fjy (Contributor) commented May 16, 2016

👍 if we have more comments about how the algorithm works

@xvrl (Member, Author) commented May 16, 2016

@fjy added a few more comments + links to the PR description in the code

```java
return intervalCost(y0, y0, y1) +
       intervalCost(beta, beta, gamma) +
       // cost of exactly overlapping intervals of size beta
       2 * (beta + FastMath.exp(-beta) - 1);
```
@drcrallen (Contributor), May 17, 2016:
Can this be a constant?

@drcrallen (Contributor), May 17, 2016:
Or I suppose this is just the solution to the integral?

Can the comment be described as such?

@xvrl (Member, Author), May 17, 2016:
@drcrallen this can't be a constant, since beta depends on interval start / end. And yes, this is the solution to
\int_0^{\beta} \int_0^{\beta} e^{-|x-y|} \, dx \, dy = 2 (\beta + e^{-\beta} - 1)


@xvrl (Member, Author), May 17, 2016:
I'll add some comments

@drcrallen (Contributor), May 17, 2016:
I meant the constant out front, before I realized it's just the equation solution.

```java
{
  final long gapMillis = gapMillis(segment1.getInterval(), segment2.getInterval());
```
@nishantmonu51 (Member), May 17, 2016:
You can probably also delete the static method gapMillis now.

@xvrl (Member, Author), May 17, 2016:
I already removed it.

@nishantmonu51 (Member) commented May 17, 2016

👍

@drcrallen drcrallen removed the Discuss label May 17, 2016
@drcrallen drcrallen merged commit e79284d into apache:master May 17, 2016
1 check passed
@drcrallen drcrallen deleted the updated-balancing-costfunction branch May 17, 2016
xvrl added a commit to metamx/druid that referenced this issue May 18, 2016
* new interval based cost function


* comments and formatting

* add more comments to explain the calculation
drcrallen added a commit to metamx/druid that referenced this issue May 18, 2016
@xvrl xvrl added Improvement and removed Bug labels May 20, 2016
@sascha-coenen commented May 23, 2016

Awesome work! Can't wait to migrate. As a little teaser, could you maybe let me know how much impact this new balancing strategy has on the query performance?

@leventov (Member) commented Jun 23, 2017

@xvrl could you comment on why you used FastMath instead of plain Math?

@xvrl (Member, Author) commented Jul 10, 2017

@leventov I used FastMath simply because it was faster than Math and I was ok with the precision trade-offs for this particular use-case. The objective was to keep the performance of the balancing comparable to what it was prior to #2972.

@leventov (Member) commented Jul 11, 2017

It's interesting, because according to https://issues.apache.org/jira/browse/MATH-1422, FastMath may not actually be faster, but OTOH it is supposed to be more precise.

FYI @dgolitsyn

@xvrl (Member, Author) commented Jul 11, 2017

@leventov I only used FastMath.exp(), and in this particular case I do remember it being faster. It is quite possible that, with better hardware and an improved JIT, this may not hold forever. If you have data showing it does not help much, I wouldn't have any objection to removing the FastMath dependency.
