
feat(metrics): Split buckets into partitions (dry run) [INGEST-1472] #1425

Merged
merged 21 commits into master from feat/traffic-steering-dry-run on Aug 29, 2022

Conversation

@jjbayer (Member) commented Aug 18, 2022

Measure what distributions we would obtain if we split flush buckets not only by project, but also by partition, which is determined by a configurable number of partitions and hashing the bucket key.

#skip-changelog
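The idea of the dry run can be sketched as follows. This is a minimal, self-contained illustration, not Relay's actual code: `BucketKey` and `partition_for` are hypothetical simplifications, and std's `DefaultHasher` stands in for the bucket-key hasher (whose portability is exactly what the review discussion below is about).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a metrics bucket key (project + metric name).
#[derive(Hash)]
struct BucketKey {
    project_key: u64,
    metric_name: &'static str,
}

/// Assign a bucket to one of `partition_count` partitions by hashing its key.
fn partition_for(key: &BucketKey, partition_count: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish() % partition_count
}

fn main() {
    let keys = [
        BucketKey { project_key: 1, metric_name: "d:transactions/duration" },
        BucketKey { project_key: 1, metric_name: "c:transactions/count" },
        BucketKey { project_key: 2, metric_name: "d:transactions/duration" },
    ];
    // Group buckets by (project, partition) and count them, mirroring the
    // distribution the dry run would measure.
    let mut groups: BTreeMap<(u64, u64), usize> = BTreeMap::new();
    for key in &keys {
        *groups.entry((key.project_key, partition_for(key, 4))).or_default() += 1;
    }
    println!("{groups:?}");
}
```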

.entry(key.project_key)
.or_default()
.push(HashedBucket {
hashed_key: key.as_integer_lossy(), // TODO: Do we need a more reliable hasher?
jjbayer (Member, Author):

Not sure if we can rely on this hasher, this comment seems pretty clear:

// XXX: The way this hasher is used may be platform-dependent. If we want to produce the
// same hash across platforms, the `deterministic_hash` crate may be useful.

@untitaker (Member) commented Aug 18, 2022:

I think I looked into this, didn't find any great, fast crates, and then @jan-auer pointed out that we already have FNV for this.

Considering we're doing sharding of Kafka topics by org id at some point, I think it's time well spent to look into a hashing function that is deterministic and portable (i.e. can be replicated in Python), and I am not sure this function here fulfills either of those criteria.

jjbayer (Member, Author):
we already have FNV for this.

Good point, will replace with FnvHasher.

Member:

FnvHasher

If you have the time, ensure we're picking a hashing function that is easily usable in Python in any case. While portability is not required for this story, we will need it in future tasks for traffic steering.

jjbayer (Member, Author):

There's a Python impl for FNV, and it looks simple enough to write ourselves if necessary:
https://pypi.org/project/fnvhash/
https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function

// XXX: The way this hasher is used may be platform-dependent. If we want to produce the
// same hash across platforms, the `deterministic_hash` crate may be useful.

// TODO(jjbayer): Use FnvHasher here
Member:

As discussed, FNV is fine and easy to implement; let's do it.
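FNV-1a is indeed only a few lines. A sketch of the 64-bit variant follows, using the published FNV offset basis and prime; the same loop is trivially replicated in Python, which is what makes the hash portable. The assertions use standard FNV-1a test vectors.

```rust
/// FNV-1a, 64-bit: XOR each byte into the state, then multiply by the
/// FNV prime, starting from the FNV offset basis.
fn fnv1a_64(data: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &byte in data {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

fn main() {
    // Standard FNV-1a 64-bit test vectors.
    assert_eq!(fnv1a_64(b""), 0xcbf29ce484222325);
    assert_eq!(fnv1a_64(b"a"), 0xaf63dc4c8601ec8c);
    println!("{:x}", fnv1a_64(b"hello"));
}
```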

relay_statsd::metric!(
histogram(MetricHistograms::BucketsPerBatch) = batch.len() as f64,
partition_key = partition_tag.as_str(),
batch_index = format!("{i}").as_str(),
Member:
Please don't tag this, as the cardinality is theoretically unbounded. Instead, add a histogram metric counting the number of batches within a partition, i.e. metric!(histogram(..) = capped_batches.len());
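A minimal sketch of the suggested approach, assuming a simplified `Bucket` type: split buckets greedily under a byte cap (roughly what CappedBucketIter does) and report only the resulting batch count, which has bounded cardinality, instead of tagging each batch index.

```rust
/// Hypothetical bucket with a pre-computed serialized size.
struct Bucket {
    size_bytes: usize,
}

/// Greedily split buckets into batches of at most `max_flush_bytes` each.
/// The *count* of batches is the single low-cardinality value to record
/// in a histogram, rather than an unbounded per-batch-index tag.
fn count_batches(buckets: &[Bucket], max_flush_bytes: usize) -> usize {
    let mut batches = 0;
    let mut current = 0;
    for bucket in buckets {
        if current > 0 && current + bucket.size_bytes > max_flush_bytes {
            batches += 1;
            current = 0;
        }
        current += bucket.size_bytes;
    }
    if current > 0 {
        batches += 1;
    }
    batches
}

fn main() {
    let buckets: Vec<Bucket> = [300, 300, 300, 500]
        .iter()
        .map(|&s| Bucket { size_bytes: s })
        .collect();
    // With a 600-byte cap: [300+300], [300], [500] -> 3 batches.
    assert_eq!(count_batches(&buckets, 600), 3);
    // A single histogram sample per flush would then record this value.
}
```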

let capped_batches =
CappedBucketIter::new(buckets.into_iter(), self.config.max_flush_bytes);
let partition_tag = match partition_key {
Some(partition_key) => format!("{partition_key}"),
Member:
I think we prefer to use .to_string() instead of format!("{..}")


f();

*METRICS_CLIENT.write() = old_client;
Member:
I think you might have those changes locally, but this needs to be done per-thread

@jjbayer jjbayer requested a review from untitaker August 29, 2022 13:50
@@ -20,7 +20,7 @@ relay-system = { path = "../relay-system" }
serde = { version = "1.0.114", features = ["derive"] }
serde_json = "1.0.55"
failure = "0.1.8"
crc32fast = "1.2.1"
fnv = "1.0.7"
jjbayer (Member, Author):
This dependency was already in Cargo.lock.

// Create a 64-bit hash of the bucket key using FnvHasher.
// This is used for partition key computation and statsd logging.
fn hash64(&self) -> u64 {
let mut hasher = FnvHasher::default();
jjbayer (Member, Author):
With fnv::FnvHasher we can auto-derive Hash; with hash32, we cannot.
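The point about deriving Hash can be illustrated without the fnv crate by implementing std's Hasher trait by hand; `BucketKey` here is a hypothetical simplification of the real key. One caveat: the derived Hash impl feeds std-specific byte sequences (e.g. length prefixes for strings) into the hasher, so reproducing the exact value in Python would additionally require agreeing on a canonical byte encoding of the key.

```rust
use std::hash::{Hash, Hasher};

/// Minimal FNV-1a hasher implementing std's Hasher trait, roughly what
/// the fnv crate provides. Because it plugs into std::hash, any type
/// with #[derive(Hash)] can be fed into it.
struct Fnv1a64(u64);

impl Default for Fnv1a64 {
    fn default() -> Self {
        Fnv1a64(0xcbf2_9ce4_8422_2325) // FNV offset basis
    }
}

impl Hasher for Fnv1a64 {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
        }
    }
}

/// Hypothetical simplified bucket key; Hash comes for free via derive.
#[derive(Hash)]
struct BucketKey {
    project_key: u64,
    metric_name: String,
}

fn hash64(key: &BucketKey) -> u64 {
    let mut hasher = Fnv1a64::default();
    key.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let key = BucketKey {
        project_key: 42,
        metric_name: "d:transactions/duration".into(),
    };
    // Same input always yields the same 64-bit hash.
    assert_eq!(hash64(&key), hash64(&key));
    println!("{:x}", hash64(&key));
}
```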

@jjbayer jjbayer marked this pull request as ready for review August 29, 2022 13:52
@jjbayer jjbayer requested a review from a team August 29, 2022 13:52
@jjbayer jjbayer merged commit 8baa00c into master Aug 29, 2022
@jjbayer jjbayer deleted the feat/traffic-steering-dry-run branch August 29, 2022 14:13
jjbayer added a commit that referenced this pull request Aug 31, 2022
Convert the dry run implemented in #1425 into an actual batching
mechanism that splits metrics buckets into logical partitions.

The partition_key has to be passed through ProjectCache, EnvelopeManager
and UpstreamRelay to be set as a header on the outgoing envelope
request.