feat(download): Measure how long it takes to download sources #483
Conversation
- now tracking the initial GET request
- tracking throughput and duration to stream the source's content
- now retrying the initial GET request and the stream if either of those fail
@@ -101,28 +101,26 @@ macro_rules! metric {
        .send();
    })
}};
(timer($id:expr), $block:block $(, $k:expr => $v:expr)* $(,)?) => {{
The diff is a little messy here, but an unused kind of timer, `metric!(timer(..))`, has been removed since it isn't being used anywhere in the codebase, as far as I can tell.
// histograms
(histogram($id:expr) = $value:expr $(, $k:expr => $v:expr)* $(,)?) => {{
    use $crate::metrics::_pred::*;
    $crate::metrics::with_client(|client| {
-       client.time_with_tags($id, $value)
+       client.histogram_with_tags($id, $value)
            $(.with_tag($k, $v))*
            .send();
    })
}};
It looks like statsd directly supports histograms. The `metrics!` macro has been extended to include that functionality in this PR, but it appears that symbolicator is already emitting `StacktraceMetrics` as a histogram via `metrics!(timer_raw(...))`, which uses `time_with_tags`. Was there an explicit decision made not to use statsd's `histogram_with_tags()` for histograms?
I think adding the explicit histogram is great! Mostly historical and a resistance to change, I think.
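To illustrate the macro arm being discussed, here is a minimal, self-contained sketch of a `metric!`-style macro with a dedicated histogram arm. The `Client`/`MetricBuilder` types and the exact wire format are illustrative stand-ins, not symbolicator's or cadence's real API:

```rust
// Toy statsd-style client; real code would send over UDP instead of
// returning the formatted line.
struct Client;

impl Client {
    fn histogram_with_tags(&self, id: &str, value: u64) -> MetricBuilder {
        // `|h` is the statsd histogram metric type.
        MetricBuilder { line: format!("{}:{}|h", id, value) }
    }
}

struct MetricBuilder {
    line: String,
}

impl MetricBuilder {
    fn with_tag(mut self, k: &str, v: &str) -> Self {
        self.line.push_str(&format!("|#{}:{}", k, v));
        self
    }
    fn send(self) -> String {
        self.line
    }
}

// Histogram arm mirroring the shape in the diff above: a value plus any
// number of optional `key => value` tags.
macro_rules! metric {
    (histogram($id:expr) = $value:expr $(, $k:expr => $v:expr)* $(,)?) => {{
        Client.histogram_with_tags($id, $value)
            $(.with_tag($k, $v))*
            .send()
    }};
}

fn main() {
    let line = metric!(histogram("source.download.size") = 42, "source" => "http");
    assert_eq!(line, "source.download.size:42|h|#source:http");
    println!("{}", line);
}
```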
@@ -61,37 +61,74 @@ impl HttpDownloader {
    &self,
    file_source: HttpRemoteDif,
    destination: PathBuf,
) -> Result<DownloadStatus, DownloadError> {
This needs a bit of an explanation.
In all non-Sentry, non-local FS sources only the initial GET request is/was being retried. In contrast, downloads from Sentry sources retry both the initial GET request and the stream attempts, which is what the diff in this file attempts to mimic. See:
symbolicator/crates/symbolicator/src/services/download/sentry.rs
Lines 236 to 314 in 919aafb
pub async fn download_source(
    &self,
    file_source: SentryRemoteDif,
    destination: PathBuf,
) -> Result<DownloadStatus, DownloadError> {
    let retries = future_utils::retry(|| {
        self.download_source_once(file_source.clone(), destination.clone())
    });
    match retries.await {
        Ok(DownloadStatus::NotFound) => {
            log::debug!(
                "Did not fetch debug file from {:?}: {:?}",
                file_source.url(),
                DownloadStatus::NotFound
            );
            Ok(DownloadStatus::NotFound)
        }
        Ok(status) => {
            log::debug!(
                "Fetched debug file from {:?}: {:?}",
                file_source.url(),
                status
            );
            Ok(status)
        }
        Err(err) => {
            log::debug!(
                "Failed to fetch debug file from {:?}: {}",
                file_source.url(),
                err
            );
            Err(err)
        }
    }
}

async fn download_source_once(
    &self,
    file_source: SentryRemoteDif,
    destination: PathBuf,
) -> Result<DownloadStatus, DownloadError> {
    let request = self
        .client
        .get(file_source.url())
        .header("User-Agent", USER_AGENT)
        .bearer_auth(&file_source.source.token)
        .send();
    let download_url = file_source.url();
    let source = RemoteDif::from(file_source);
    let request = future_utils::measure_source_download(
        "service.download.download_source",
        source.source_metric_key(),
        m::result,
        request,
    );
    match request.await {
        Ok(response) => {
            if response.status().is_success() {
                log::trace!("Success hitting {}", download_url);
                let stream = response.bytes_stream().map_err(DownloadError::Reqwest);
                super::download_stream(source, stream, destination).await
            } else {
                log::trace!(
                    "Unexpected status code from {}: {}",
                    download_url,
                    response.status()
                );
                Ok(DownloadStatus::NotFound)
            }
        }
        Err(e) => {
            log::trace!("Skipping response from {}: {}", download_url, e);
            Ok(DownloadStatus::NotFound) // must be wrong type
        }
    }
}
Is there any reason why the others (HTTP, S3, GCS) aren't retrying both the initial GET request as well as the stream? If not, I may change those to do the same.
This was only fixed in Sentry in response to some production issue as well. For Sentry we know this should be 100% reliable, but k8s sometimes shuts down servers we are downloading from so we retry the streaming part as well a few times. For other servers it is less clear how big the impact would be of doing this work, but it certainly won't do any harm.
I've removed retry-related logic from this PR to keep it focused. A new PR, branched off of the changes in this one, extends retrying of both the header and the stream to all sources: #485. It's a pretty naive approach right now, but I'd love your feedback on the diff there as well, @flub.
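The retry behavior discussed above can be sketched with a bounded-retry helper. This is a synchronous simplification for illustration; the real `future_utils::retry` operates on futures, and its attempt count and backoff strategy may differ:

```rust
// Retry a fallible operation up to `max_attempts` times, returning the
// first success or the last error. Hypothetical sketch, not the real helper.
fn retry<T, E>(max_attempts: usize, mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) => {
                attempt += 1;
                if attempt >= max_attempts {
                    return Err(err);
                }
                // A real implementation would likely sleep/backoff here.
            }
        }
    }
}

fn main() {
    let mut calls = 0;
    // Fails twice, then succeeds on the third attempt.
    let result: Result<&str, &str> = retry(3, || {
        calls += 1;
        if calls < 3 { Err("transient") } else { Ok("downloaded") }
    });
    assert_eq!(result, Ok("downloaded"));
    assert_eq!(calls, 3);
}
```

Wrapping both the initial GET and the streaming phase in such a helper is what gives Sentry sources their resilience to servers shutting down mid-download.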
@@ -235,6 +235,102 @@ where
    }
}

/// A guard to [`measure`] the amount of time it takes to download a source. This guard is also
This was "heavily inspired" by `MeasureGuard`, a few lines above in the same file. As far as I can tell, fairly invasive changes would need to be made to the existing `MeasureGuard` for it to work for the specific use case this PR needed. Some examples: I was unable to figure out how to add an arbitrary number of tags to the metric it logs, it doesn't support histograms out of the box, and throughput logging would have involved even more extensive changes to the struct. Given that, I opted to just use it as a loose template for a new type of `MeasureGuard` for source downloads.
The existing `MeasureGuard` and `measure` are a (perhaps not most successful) attempt at creating something that generically emits a `futures.done` metric. I think it's fine not to use it on anything that doesn't yet emit that metric, it's a bit weird maybe, I don't know.
So I think creating this custom for timing downloads is fine, but I would probably move it to the `services::download` module instead of here. Also, I would make it more specific, because it doesn't need the strange way of being generic with `m::something` indirections.
Done: I've moved `MeasureSourceDownload` to `services::download` and renamed it to an even wordier `MeasureSourceDownloadGuard`. Per your other suggestions, the API has also been slimmed down to remove some of the unneeded flexibility inherited from the original `MeasureGuard` that isn't needed for this particular measure guard's use case.
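A guard of the kind described above can be sketched as a struct that records a start time and a byte count, and emits its metrics when consumed. Field and method names follow the discussion but are illustrative, not symbolicator's exact API; emission is modeled by returning a formatted string:

```rust
use std::time::Instant;

// Illustrative sketch of a source-download measuring guard.
struct MeasureSourceDownloadGuard {
    task_name: &'static str,
    source_name: &'static str,
    creation_time: Instant,
    bytes_transferred: Option<u64>,
}

impl MeasureSourceDownloadGuard {
    fn new(task_name: &'static str, source_name: &'static str) -> Self {
        Self {
            task_name,
            source_name,
            creation_time: Instant::now(),
            bytes_transferred: None,
        }
    }

    /// Accumulate bytes, initializing the counter on first use and
    /// saturating instead of overflowing.
    fn add_bytes_transferred(&mut self, additional_bytes: u64) {
        let bytes = self.bytes_transferred.get_or_insert(0);
        *bytes = bytes.saturating_add(additional_bytes);
    }

    /// Consume the guard and "emit" the duration/throughput metrics;
    /// the real guard would call into the metrics client here.
    fn done(self, status: &'static str) -> String {
        let duration = self.creation_time.elapsed();
        format!(
            "{} source:{} status:{} bytes:{:?} millis:{}",
            self.task_name,
            self.source_name,
            status,
            self.bytes_transferred,
            duration.as_millis(),
        )
    }
}

fn main() {
    let mut guard = MeasureSourceDownloadGuard::new("source.download.stream", "http");
    guard.add_bytes_transferred(1024);
    guard.add_bytes_transferred(512);
    let line = guard.done("ok");
    assert!(line.contains("bytes:Some(1536)"));
    println!("{}", line);
}
```

Consuming `self` in `done` is what makes "terminated" a one-shot transition: the guard cannot be reused after its metrics are emitted.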
lgtm!
I guess retrying the whole download vs only the initial connection is fine.
However I can’t really answer your question regarding the time_raw vs histogram.
self.bytes_transferred = self
    .bytes_transferred
    .and_then(|old_count| old_count.checked_add(additional_bytes).or(Some(old_count)))
    .or(Some(additional_bytes));
`Option` has `get_or_insert_default` (https://doc.rust-lang.org/std/option/enum.Option.html#method.get_or_insert_default), which might be super useful here.
Also, I wouldn't worry about using `checked_add` here.
Thank you for this! I've done something a little different here: `get_or_insert(Default::default())` is being used over the suggested `get_or_insert_default()`, since the latter method hasn't hit stable yet.
Could I bug you to elaborate on why you suppose `checked_add` isn't needed? For now I've opted to use `saturating_add` courtesy of Floris's suggestion, but perhaps I'm being overly cautious here when I don't need to be.
pub fn add_bytes_transferred(&mut self, additional_bytes: u64) {
    self.bytes_transferred = self
        .bytes_transferred
        .and_then(|old_count| old_count.checked_add(additional_bytes).or(Some(old_count)))
Doesn't a saturating add make more sense rather than sticking with the previous value if it overflows?
let duration = self.creation_time.elapsed().as_secs();
metric!(
    histogram(self.task_name) = duration,
I only begrudgingly accepted that we were abusing timers as histograms, but now we're doing the opposite and I'm confused. Why is this not a timer?
Also, all existing time values in Datadog are recorded as milliseconds; this one is now recorded as seconds, which will be confusing as well.
Spoke to @flub about this on the side, and it turns out I misunderstood things here. It looks like timers (`metric!(timer(...))`) under the hood are simply histograms, so it makes more sense to just use that here, instead of shoving the duration into a histogram. The diff has been updated to do exactly that.
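The "timers are histograms" point comes down to the statsd wire protocol: both are distribution metrics, differing only in their type suffix (`ms` vs `h`) and implied units. A tiny illustration of the two datagram shapes (the formatting helper here is hypothetical, not a real client API):

```rust
// Format a statsd-style datagram: name, value, and metric type suffix.
fn format_metric(name: &str, value: u64, kind: &str) -> String {
    format!("{}:{}|{}", name, value, kind)
}

fn main() {
    // A timer records a duration in milliseconds with the `ms` type...
    assert_eq!(
        format_metric("download.duration", 250, "ms"),
        "download.duration:250|ms"
    );
    // ...while a histogram records an arbitrary value with the `h` type.
    // The aggregation on the server side is the same distribution math.
    assert_eq!(
        format_metric("download.size", 4096, "h"),
        "download.size:4096|h"
    );
}
```

This is also why the milliseconds-vs-seconds concern above matters: the server aggregates whatever number arrives, so mixed units across metrics silently skew dashboards.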
}

/// Marks the download as terminated.
pub fn done(mut self, status: &'static str) {
Can this take `&Result<_, _>` instead of relying on the indirections?
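The suggestion above amounts to deriving the status tag directly from a borrowed `Result` rather than threading a status-extraction callback through the guard. A minimal sketch of that shape (names and tag values are illustrative):

```rust
// Derive a metric status tag from any Result, without needing to know
// the concrete success or error types.
fn status_of<T, E>(result: &Result<T, E>) -> &'static str {
    match result {
        Ok(_) => "ok",
        Err(_) => "error",
    }
}

fn main() {
    let ok: Result<u32, &str> = Ok(7);
    let err: Result<u32, &str> = Err("boom");
    assert_eq!(status_of(&ok), "ok");
    assert_eq!(status_of(&err), "error");
}
```

Because the function only borrows the `Result`, the caller can still return it afterwards, which is convenient when the measurement wraps the download future's output.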
pub fn measure_source_download<'a, S, F>(
    task_name: &'a str,
    source_name: &'a str,
    get_status: S,
This wrapper shouldn't need `get_status`; it's `Result` for all of the downloaders.
///
/// An additional tag for the source name is also added to the metric.
pub fn measure_source_download<'a, S, F>(
    task_name: &'a str,
I think this is the same in all callers as well, so you could also remove this parameter?
- grab a direct mutable reference to the total byte count and update that
- saturating add clamps the result to the maximum value of the type being used
…r a specific purpose
Additional changes worth noting:
let bytes = self.bytes_transferred.get_or_insert(Default::default());
*bytes = bytes.saturating_add(additional_bytes);
Might as well use `get_or_insert(0)`, since it's a numeric value.
Since this is bytes that we add, we will fill up storage sooner than we wrap around a `u64`. But the `saturating_add` is a lot easier to read than the `checked_add` you had before; that was my main concern.
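The difference between the two overflow strategies discussed here is easy to demonstrate directly:

```rust
fn main() {
    let near_max = u64::MAX - 1;

    // saturating_add clamps at the type's maximum value...
    assert_eq!(near_max.saturating_add(5), u64::MAX);
    // ...while checked_add signals overflow by returning None.
    assert_eq!(near_max.checked_add(5), None);
    // Both behave identically when no overflow occurs.
    assert_eq!(10u64.saturating_add(5), 15);

    // The pattern from the diff: initialize-on-first-use plus saturating add.
    let mut bytes_transferred: Option<u64> = None;
    let bytes = bytes_transferred.get_or_insert(0);
    *bytes = bytes.saturating_add(1024);
    assert_eq!(bytes_transferred, Some(1024));
}
```

With a `u64` byte counter, saturation would only kick in after roughly 16 exabytes, so the clamped value is effectively unreachable in practice; the saturating form just removes the `and_then`/`or` dance from the earlier version.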
where
    F: 'a + Future<Output = Result<T, E>>,
{
    let guard = MeasureSourceDownloadGuard::new("source.download", source_name);
Maybe name this `source.connect`? Because download = connect + stream :-D
Oh! Good point, it is done. What do you think about adding a common segment for those two as well, so it's easy to tell that they're related? I.e. `source.download.connect` and `source.download.stream`?
Actually I'll make that change and we can update the tag later if there are any issues with it
Just to be safe: @flub you left quite a bit of feedback. Are the newer changes in line with what you were thinking of?
LGTM, apologies for my reviewing failures yesterday!
This begins measuring requests and download throughput related to source-fetching in order to determine what a good set of default timeouts for such requests should be.
The changes can be broken down into three major parts:
- `MeasureSourceDownload`, which is a modified version of `MeasureGuard` focused on source downloads
- measuring the initial GET request for each source via `MeasureSourceDownload`
- measuring the throughput and duration of `download_stream` via `MeasureSourceDownload`

Source downloads from the local filesystem have been skipped, as it is unlikely that any meaningful or useful information will come from logging information about those.
Open questions and notes will be left directly on the diff.