
feat(relay): Implement decaying rule [TET-607] #1692

Merged
merged 18 commits on Jan 9, 2023

Conversation

@iambriccardo (Member) commented Dec 13, 2022:

A decaying function interpolates the value of the sample rate based on a set of parameters (e.g. start, end, fromSampleRate). The rationale behind the decaying function is to be able to systematically change the sample rate based on multiple parameters. This should be especially helpful with the behavior of latest releases, where the number of transactions grows gradually rather than immediately, and we want the sample rate to inversely follow that trend.

There are currently two types of decaying functions:

  • constant → returns the sampleRate irrespective of the time of evaluation.

    {
      ...
      "decayingFn": {
        "type": "constant"
      }
      ...
    }
  • linear → returns a sample rate that decreases linearly from sampleRate to decayedSampleRate based on the time of evaluation.

    {
      ...
      "decayingFn": {
        "function": "linear",
        "decayedSampleRate": 0.6
      }
      ...
    }

By default, if no decaying function is specified, the constant function will be used, which returns the sampleRate at any point within the specified time range.

Both decaying functions require a time range, which is taken from the pre-existing timeRange field:

{
  ...
  "timeRange": {
    "start": "2022-10-21 18:50:25+00:00",
    "end": "2022-10-21 19:50:25+00:00"
  },
  "decayingFn": {
    "function": "linear",
    "decayedSampleRate": 0.6
  }
  ...
}
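
For illustration, here is a minimal sketch (not the code from this PR) of the linear interpolation described above, assuming now falls inside the closed [start, end] range and using chrono for timestamps:

use chrono::{DateTime, Utc};

/// Sketch only: interpolates linearly between `sample_rate` (at `start`) and
/// `decayed_sample_rate` (at `end`). Names are illustrative, not necessarily
/// the ones used in relay-sampling.
fn linear_sample_rate(
    start: DateTime<Utc>,
    end: DateTime<Utc>,
    now: DateTime<Utc>,
    sample_rate: f64,
    decayed_sample_rate: f64,
) -> f64 {
    let total = (end - start).num_seconds() as f64;
    let elapsed = (now - start).num_seconds() as f64;
    let progress = (elapsed / total).clamp(0.0, 1.0);
    sample_rate + (decayed_sample_rate - sample_rate) * progress
}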

Depending on which decayingFn is used, the timeRange check will be different:

  • constant → a valid time range can be open or closed (i.e. both bounds can be set, or at most one bound can be left open) and, if both are set, start < end.
  • linear → a valid time range must be closed and start < end.

If at least one of the required conditions is not met, the rule will be considered inactive and won't be matched (see the sketch below).
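
For illustration, a rough sketch of the per-function time-range check described above (not the code from this PR; chrono types assumed):

use chrono::{DateTime, Utc};

/// Sketch only: checks whether a rule's time range is acceptable for the
/// chosen decaying function, following the rules listed above.
fn time_range_is_valid(
    is_linear: bool,
    start: Option<DateTime<Utc>>,
    end: Option<DateTime<Utc>>,
) -> bool {
    match (start, end) {
        // A closed interval always requires start < end.
        (Some(start), Some(end)) => start < end,
        // Open or half-open intervals are only acceptable for the constant function.
        _ => !is_linear,
    }
}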

Implementation

The implementation of decaying rules introduces the idea of an ActiveRule, which is a rule that is returned only if it is active. A rule is active if it satisfies a series of conditions that are checked in relation to the decayingFn used.

An ActiveRule has a get_sample_rate(now) method which returns the sample rate at the point in time now.
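
As a rough sketch of that shape (names follow the PR description; the internals are illustrative and reuse the linear_sample_rate sketch from above):

use chrono::{DateTime, Utc};

/// Illustrative evaluator behind an active rule: either a constant sample rate
/// or a linear decay between two points in time.
#[derive(Debug, Clone, Copy)]
enum SampleRateEvaluator {
    Constant {
        sample_rate: f64,
    },
    Linear {
        start: DateTime<Utc>,
        end: DateTime<Utc>,
        sample_rate: f64,
        decayed_sample_rate: f64,
    },
}

/// Illustrative: a rule that passed the activity checks for its decaying
/// function and can be asked for a sample rate at a point in time.
pub struct ActiveRule {
    evaluator: SampleRateEvaluator,
}

impl ActiveRule {
    pub fn get_sample_rate(&self, now: DateTime<Utc>) -> f64 {
        match self.evaluator {
            SampleRateEvaluator::Constant { sample_rate } => sample_rate,
            SampleRateEvaluator::Linear {
                start,
                end,
                sample_rate,
                decayed_sample_rate,
            } => linear_sample_rate(start, end, now, sample_rate, decayed_sample_rate),
        }
    }
}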

@iambriccardo iambriccardo changed the title feat(relay): Implement decaying rule feat(relay): Implement decaying rule [TET-607] Dec 14, 2022
@iambriccardo iambriccardo marked this pull request as ready for review December 14, 2022 13:53
@jjbayer (Member) left a comment:

The general concept makes sense to me, but I have some comments on the code structure.

// We round to 2 digits in order to avoid high-precision sample rates that are more difficult
// to work with. Rounding is performed following the nearest integer.
let sample_rate_difference =
    f64::round((from_sample_rate - self.external_params.base_sample_rate) * 100.0) / 100.0;
Member:

Why are rounded sample rates easier to work with?

Member Author:

It is easier to debug and in general I thought we were going to keep 2 decimal digits.

Member:

By multiplying sample_rate_difference * inverse_progress, you potentially get a lot more decimals again, right? If we want to do 1% steps of decay (e.g. 0.7, 0.69, 0.68, ...) I would round when the computation is done, e.g. in get_sample_rate.

Member Author:

What you are saying is correct but I did that on purpose.

I had at first rounded only the sample rate difference because it was behaving quite weirdly, in the sense that I was getting 0.7 - 0.2 = 0.4999999 due to floating-point approximation, I think. And this was going to skew the whole calculation because we started it with an imprecise number.

I think that we could simplify everything and round at the end, with some more loss propagated through multiplications but that is fine. The only problem I see is that we will have weird numbers in tests if we choose for example the 50% progress time.
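
For reference, a tiny standalone demonstration of the floating-point behavior mentioned above:

fn main() {
    // 0.7 and 0.2 have no exact binary representation, so the difference is
    // not exactly 0.5.
    let diff = 0.7_f64 - 0.2_f64;
    println!("{diff}"); // prints 0.49999999999999994
    // Rounding to two decimal digits recovers the "expected" value.
    println!("{}", (diff * 100.0).round() / 100.0); // prints 0.5
}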

Member:

Do increased precision and floating point glitches really make this more difficult to work with? In the end, you can just make your "correct" calculation and use the result. If you ever have to debug print the value and receive 0.49999..., it is still clear what the value is. Besides, interpolation will give you all sorts of values anyway.

Also, please do not round to integer percentages. We need more precision than that, considering that a realistic uniform sample rate is 1%. You can, of course, round your rule targets, but please do not round here.

I would suggest removing complexity and not rounding anywhere in Relay.

Member Author:

Great! Thanks for the explanation Jan!

relay-sampling/src/lib.rs — outdated review threads (resolved)
/// want to inversely reduce the sample rate in the interval (from - base_sample_rate).
/// Thus, in the 70% case, we would take only 30% of the (from - base_sample_rate) which
/// is equal to 0.06.
fn execute_linear_decay(&self, from_sample_rate: f64) -> f64 {
Member:

nit: There are variables in this function that begin with start_ and end_, so to clarify intent I would rename from_sample_rate to start_sample_rate, and create an alias let end_sample_rate = self.external_params.base_sample_rate.


/// A struct containing the set of external params required by a decaying function.
#[derive(Clone, Copy, Debug)]
struct DecayingFunctionExternalParams {
Member:

I feel that the types involved in this logic are too tightly coupled to the serialization format, in that we differentiate between internal and external params.

In order to decouple them, can we construct something like that from the SamplingRule, e.g.

struct LinearDecay {
    // whatever model parameters a linear function needs. In theory only two (`(k, d)` for `y = kx + d`),
    // but it might make sense to store `(start, end)` timestamps and sample rates as well.
    start_time: DateTime<Utc>,
    end_time: DateTime<Utc>,
    start_rate: f64,
    end_rate: f64,
}

enum DecayFunc {
    // ...
    Linear(LinearDecay),
}

impl SamplingRule {
   fn get_decay_fn(&self) -> DecayFunc {
       // construct DecayFunc from rule here. We can also cache it on the sample rule
   }
}
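
A possible way to fill in that constructor (a sketch building on the suggestion above, not code from this PR; the Constant variant of DecayFunc and the exact field names are assumptions):

impl SamplingRule {
    /// Sketch: build the decay helper from the rule's deserialized fields.
    /// Falls back to a constant decay when the rule is not linear or the time
    /// range is not fully specified.
    fn get_decay_fn(&self) -> DecayFunc {
        match (self.decaying_fn, self.time_range.start, self.time_range.end) {
            (DecayingFunction::Linear { decayed_sample_rate }, Some(start), Some(end)) => {
                DecayFunc::Linear(LinearDecay {
                    start_time: start,
                    end_time: end,
                    start_rate: self.sample_rate,
                    end_rate: decayed_sample_rate,
                })
            }
            // Assumed variant: DecayFunc::Constant(f64).
            _ => DecayFunc::Constant(self.sample_rate),
        }
    }
}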

@iambriccardo (Member Author) commented Dec 19, 2022:

How would you deserialize this? I wanted to keep my change as small as possible on the existing rule payload.

Member Author:

Your proposal makes total sense; indeed, it is the simplest way of doing it, but it requires removing the timeRange, which we can certainly do, though we need to be careful about backward compatibility.

Member:

I would leave the existing rule payload as-is, and create these helper structs independently.

Member Author:

This would end up creating duplication within the JSON. Do we want that? And how do we treat the TimeRange, since it won't be needed anymore if we keep intervals within the functions themselves?

///
/// A decaying function is responsible of decaying the sample rate from a value to another following
/// a given curve.
#[derive(Debug, Clone, Copy, PartialEq, Serialize, Deserialize)]
Member:

Suggested change
#[derive(Debug, Clone, Copy, PartialEq, Serialize, Deserialize)]
#[derive(Default, Debug, Clone, Copy, PartialEq, Serialize, Deserialize)]

relay-sampling/src/lib.rs — outdated review thread (resolved)
Comment on lines 396 to 401
impl Default for DecayingFunction {
    fn default() -> Self {
        Self::Constant
    }
}

Member:

Suggested change
impl Default for DecayingFunction {
    fn default() -> Self {
        Self::Constant
    }
}

if let Some(params) = self.get_decaying_function_params(now) {
    self.decaying_fn.call(params)
} else {
    self.sample_rate
Member:

Now that sample_rate is not the base sample rate anymore, I believe that instead of falling back to this value we should make sure that invalid rules do not match any events. We already have the is_active mechanism for that; should we extend that function so it returns false for invalid rules?

if !rule.is_active() {
    return false;

Member Author:

@jan-auer this is a good point; with the new implementation, we would need to decide what to do in case of invalid params. We could:

  1. Implement a mechanism that automatically falls back either to decayedSampleRate or sampleRate based on the function type.
  2. Automatically make the rule invalid by validating the params before get_sample_rate
  3. Propagate an error, but this is generally not recommended and, looking at the current implementation, it is not done anywhere

Just to clarify, this can happen for example if timeRange has start but not end, irrespective of whether we specify a decaying function or not.

Member:

We could write a custom deserializer for SamplingConfig::rules, which removes structurally invalid rules and logs an error for them. But get_decaying_function_params also depends on now, and that part of the "validation" should go into is_active IMO.
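
A sketch of such a lenient deserializer (hypothetical; assumes serde_json as an intermediate representation and a relay_log-style error macro):

use serde::{Deserialize, Deserializer};

/// Hypothetical: deserialize the rule list leniently, dropping entries that
/// fail to parse instead of rejecting the whole sampling config.
fn deserialize_rules<'de, D>(deserializer: D) -> Result<Vec<SamplingRule>, D::Error>
where
    D: Deserializer<'de>,
{
    let raw = Vec::<serde_json::Value>::deserialize(deserializer)?;
    Ok(raw
        .into_iter()
        .filter_map(|value| match serde_json::from_value::<SamplingRule>(value) {
            Ok(rule) => Some(rule),
            Err(error) => {
                relay_log::error!("dropping structurally invalid sampling rule: {}", error);
                None
            }
        })
        .collect())
}

This would be wired up with something like #[serde(deserialize_with = "deserialize_rules")] on SamplingConfig::rules.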

Member Author:

I think that first we would need to clarify whether we want to allow open intervals or not.

If we wanted to do it better, I would redefine the timeRange to allow only closed intervals and then automatically use the constant decaying function if none is specified. Then, for each function, build custom extra validation and fallback mechanism(s).

I don't know, though, whether there is a proper rationale behind the decision to allow open intervals. For DS we don't need it (as of now).

relay-sampling/src/lib.rs, relay-server/src/utils/dynamic_sampling.rs — outdated review threads (resolved)
CHANGELOG.md Outdated

- Add support for `limits.keepalive_timeout` configuration. ([#1645](https://github.com/getsentry/relay/pull/1645))
- Add support for decaying functions in sampling rules. ([#1692](https://github.com/getsentry/relay/pull/1692))
Member:

Suggested change
- Add support for decaying functions in sampling rules. ([#1692](https://github.com/getsentry/relay/pull/1692))
- Add support for decaying functions in dynamic sampling rules. ([#1692](https://github.com/getsentry/relay/pull/1692))

@@ -427,57 +399,95 @@ impl SamplingRule {
///
Member:

Let's update this doc comment.

Member:

^ One line above, it says Returns whether the sampling rule is active. We should change it to "Returns an ActiveRule if the rule is active", or something like that.

end: Some(end),
} = self.time_range
{
if self.sample_rate > decayed_sample_rate && start < end && now >= start {
Member:

Because time_range.contains is not called in this branch anymore, we need to also check that now is < end (feel free to change <= to < and vice versa).

Suggested change
if self.sample_rate > decayed_sample_rate && start < end && now >= start {
if self.sample_rate > decayed_sample_rate && start <= now && now < end {

We do not need to check start < end explicitly then, because it follows from start <= now < end.

Member:

Apparently this was not caught by any test case. Maybe we should add one?

Member Author:

Actually this was done on purpose, but for the previous implementation; now we should explicitly verify it like you said, because the check is in is_active.

/// A sampling rule that has been successfully matched and that contains all the required data
/// to return the sample rate.
#[derive(Debug, Clone, Copy)]
pub struct MatchingRule {
Member:

ActiveRule might actually be a better name now that I think of it, because it is created by is_active.

Suggested change
pub struct MatchingRule {
pub struct ActiveRule {

/// A struct representing the evaluation context of a sample rate.
#[derive(Debug, Clone, Copy)]
enum SampleRateEvaluator {
Member:

We usually declare types before using them, so in this case, declare SampleRateEvaluator, then MatchingRule, then SamplingRule. See internal docs.

Member Author:

Thanks!

@jjbayer (Member) left a comment:

Couple of nitpicks, apart from those this looks good to me!


Comment on lines +128 to +130
// For consistency reasons we take a snapshot in time and use that time across all code that
// requires it.
let now = Utc::now();
Member:

Good call!

    pub fn get_sample_rate(&self, now: DateTime<Utc>) -> f64 {
        self.evaluator.evaluate(now)
    }
}
Member:

nit: Because SamplingRule uses ActiveRule, I would change the declaration order to

SampleRateEvaluator
ActiveRule
SamplingRule

);

assert_eq!(
    prepare_and_get_sampling_rule(1.0, EventType::Transaction, &project_state, now)
        .unwrap()
        .unwrap()
        .sample_rate,
    0.7
    0.44999999999999996
Member:

This looks flaky, should we rather do something like (expected - result).abs() < eps instead? See https://rust-lang.github.io/rust-clippy/master/#float_cmp
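
For instance, the assertion could become something like this (a sketch of the approximate comparison, with an arbitrary tolerance):

let sample_rate = prepare_and_get_sampling_rule(1.0, EventType::Transaction, &project_state, now)
    .unwrap()
    .unwrap()
    .sample_rate;
let expected = 0.45;
assert!(
    (sample_rate - expected).abs() < 1e-9,
    "sample rate {sample_rate} differs from expected {expected}"
);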

Member Author:

Very good call! I totally overlooked the precision, especially considering that it can differ from architecture to architecture.
