Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(download): Measure how long it takes to download sources #483

Merged
merged 18 commits into from
Jul 9, 2021

Conversation

relaxolotl
Copy link
Contributor

This begins measuring requests and download throughput related to source-fetching in order to determine what a good set of default timeouts for such requests should be.

The changes can be broken down into three major parts:

  1. The addition of MeasureSourceDownload, which is a modified version of MeasureGuard focused on source downloads
  2. Logging stream throughput and duration in download_stream via MeasureSourceDownload
  3. Logging the initial GET request's duration in every type of download source

Source downloads from the local filesystem have been skipped as it is unlikely that any meaningful or useful information will come from logging information about those.

Open questions and notes will be left directly on the diff.

@relaxolotl relaxolotl requested review from flub and Swatinem July 7, 2021 02:48
@@ -101,28 +101,26 @@ macro_rules! metric {
.send();
})
}};
(timer($id:expr), $block:block $(, $k:expr => $v:expr)* $(,)?) => {{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diff is a little messy here, but an unused type of timer here metrics!(timer(..)) has been removed since it currently isn't being used anywhere in the codebase, as far as I can tell.

Comment on lines +115 to 123
// histograms
(histogram($id:expr) = $value:expr $(, $k:expr => $v:expr)* $(,)?) => {{
use $crate::metrics::_pred::*;
$crate::metrics::with_client(|client| {
client.time_with_tags($id, $value)
client.histogram_with_tags($id, $value)
$(.with_tag($k, $v))*
.send();
})
}};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like statsd directly supports histograms.

The metrics! macro has been extended to include that functionality in this PR, but it appears that symbolicator is already emitting StacktraceMetrics as a histogram via metrics!(timer_raw(...)), which uses time_with_tags.

Was there an explicit decision made not to use statsd's histogram_with_tags() for histograms?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding the explicit histogram is great! mostly historical and a resistance to change I think

@@ -61,37 +61,74 @@ impl HttpDownloader {
&self,
file_source: HttpRemoteDif,
destination: PathBuf,
) -> Result<DownloadStatus, DownloadError> {
Copy link
Contributor Author

@relaxolotl relaxolotl Jul 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a bit of an explanation.

In all non-Sentry, non-local FS sources only the initial GET request is/was being retried. In contrast, downloads from Sentry sources retry both the initial GET request and the stream attempts, which is what the diff in this file attempts to mimic. See:

pub async fn download_source(
&self,
file_source: SentryRemoteDif,
destination: PathBuf,
) -> Result<DownloadStatus, DownloadError> {
let retries = future_utils::retry(|| {
self.download_source_once(file_source.clone(), destination.clone())
});
match retries.await {
Ok(DownloadStatus::NotFound) => {
log::debug!(
"Did not fetch debug file from {:?}: {:?}",
file_source.url(),
DownloadStatus::NotFound
);
Ok(DownloadStatus::NotFound)
}
Ok(status) => {
log::debug!(
"Fetched debug file from {:?}: {:?}",
file_source.url(),
status
);
Ok(status)
}
Err(err) => {
log::debug!(
"Failed to fetch debug file from {:?}: {}",
file_source.url(),
err
);
Err(err)
}
}
}
async fn download_source_once(
&self,
file_source: SentryRemoteDif,
destination: PathBuf,
) -> Result<DownloadStatus, DownloadError> {
let request = self
.client
.get(file_source.url())
.header("User-Agent", USER_AGENT)
.bearer_auth(&file_source.source.token)
.send();
let download_url = file_source.url();
let source = RemoteDif::from(file_source);
let request = future_utils::measure_source_download(
"service.download.download_source",
source.source_metric_key(),
m::result,
request,
);
match request.await {
Ok(response) => {
if response.status().is_success() {
log::trace!("Success hitting {}", download_url);
let stream = response.bytes_stream().map_err(DownloadError::Reqwest);
super::download_stream(source, stream, destination).await
} else {
log::trace!(
"Unexpected status code from {}: {}",
download_url,
response.status()
);
Ok(DownloadStatus::NotFound)
}
}
Err(e) => {
log::trace!("Skipping response from {}: {}", download_url, e);
Ok(DownloadStatus::NotFound) // must be wrong type
}
}
}

Is there any reason why the others (HTTP, S3, GCS) aren't retrying both the initial GET request as well as the stream? If not, I may change those to do the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was only fixed in Sentry in response to some production issue as well. For Sentry we know this should be 100% reliable, but k8s sometimes shuts down servers we are downloading from so we retry the streaming part as well a few times. For other servers it is less clear how big the impact would be of doing this work, but it certainly won't do any harm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed retry-related logic in this PR to keep it focused. A new PR branched off of the changes in this one has been opened which extends retrying on both the header and the stream to all sources: #485. It's a pretty naive approach right now, but I'd love your feedback on the diff there as well @flub.

@@ -235,6 +235,102 @@ where
}
}

/// A guard to [`measure`] the amount of time it takes to download a source. This guard is also
Copy link
Contributor Author

@relaxolotl relaxolotl Jul 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was "heavily inspired" by MeasureGuard a few lines above in the same file. As far as I can tell, fairly invasive changes would need to be made to the existing MeasureGuard to work for the specific use case this PR needed.

Some examples: I was unable to figure out how to add an arbitrary number of tags to the metric it logs. It also doesn't support histograms out of the box. Throughput logging would have involved even more extensive changes to the struct.

Given what was mentioned above, I opted to just use it as a loose template for a new type of MeasureGuard for source downloads.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing MeasureGuard and measure are a (perhaps not most successful) attempt at creating something that generically emits a futures.done metric. I think it's fine not to use it on anything that doesn't yet emit that metric, it's a bit weird maybe, I don't know.

So I think creating this custom for timing downloads is fine, but would probably move it to the services::download module instead of here. Also, I would make it more specific because it doesn't need the strange way of being generic with m::something indirections.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: I've moved MeasureSourceDownload to services::download and renamed it to an even wordier MeasureSourceDownloadGuard. Per your other suggestions the API has also been slimmed down to remove some of the unneeded flexibility inherited from the original MeasureGuard that isn't needed for this particular measure guard's use case.

@relaxolotl relaxolotl marked this pull request as ready for review July 7, 2021 03:09
@relaxolotl relaxolotl requested a review from a team July 7, 2021 03:09
Copy link
Member

@Swatinem Swatinem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!
I guess retrying the whole download vs only the initial connection is fine.
However I can’t really answer your question regarding the time_raw vs histogram.

crates/symbolicator/src/utils/futures.rs Outdated Show resolved Hide resolved
Comment on lines 270 to 273
self.bytes_transferred = self
.bytes_transferred
.and_then(|old_count| old_count.checked_add(additional_bytes).or(Some(old_count)))
.or(Some(additional_bytes));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Option has get_or_insert_default https://doc.rust-lang.org/std/option/enum.Option.html#method.get_or_insert_default which might be super useful here.
Also, I wouldn’t worry about using checked_add here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this! I've done something a little different here: get_or_insert(Default::default()) is being used over the suggeted get_or_insert_default() since the latter method hasn't hit stable yet.

Could I bug you to elaborate on why you suppose checked_add isn't needed? For now I've opted to use saturated_add courtesy of Floris's suggestion, but perhaps I'm being overly cautious here when I don't need to be.

pub fn add_bytes_transferred(&mut self, additional_bytes: u64) {
self.bytes_transferred = self
.bytes_transferred
.and_then(|old_count| old_count.checked_add(additional_bytes).or(Some(old_count)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't a saturating add make more sense rather than sticking with the previous value if it overflows?


let duration = self.creation_time.elapsed().as_secs();
metric!(
histogram(self.task_name) = duration,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only begrudgingly accepted that we were abusing timers as historgrams but now we're doing the opposite and I'm confused. Why is this not a timer?
Also all existing time values in datadog are recorded as miliseconds, this one is now recorded as seconds that will be confusing as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spoke to @flub about this on the side, and it turns out I misunderstood things here. It looks like timers (metric!(timer(...))) under-the-hood are simply histograms, so it makes more sense to just use that here, instead of shoving the duration into a histogram. The diff has been updated to do exactly that.

@@ -235,6 +235,102 @@ where
}
}

/// A guard to [`measure`] the amount of time it takes to download a source. This guard is also
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing MeasureGuard and measure are a (perhaps not most successful) attempt at creating something that generically emits a futures.done metric. I think it's fine not to use it on anything that doesn't yet emit that metric, it's a bit weird maybe, I don't know.

So I think creating this custom for timing downloads is fine, but would probably move it to the services::download module instead of here. Also, I would make it more specific because it doesn't need the strange way of being generic with m::something indirections.

}

/// Marks the download as terminated.
pub fn done(mut self, status: &'static str) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this take &Result<_, _> instead of relying on the indirections?

Comment on lines +115 to 123
// histograms
(histogram($id:expr) = $value:expr $(, $k:expr => $v:expr)* $(,)?) => {{
use $crate::metrics::_pred::*;
$crate::metrics::with_client(|client| {
client.time_with_tags($id, $value)
client.histogram_with_tags($id, $value)
$(.with_tag($k, $v))*
.send();
})
}};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding the explicit histogram is great! mostly historical and a resistance to change I think

pub fn measure_source_download<'a, S, F>(
task_name: &'a str,
source_name: &'a str,
get_status: S,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wrapper shouldn't need get_status, it's Result for all of the downloaders

@@ -61,37 +61,74 @@ impl HttpDownloader {
&self,
file_source: HttpRemoteDif,
destination: PathBuf,
) -> Result<DownloadStatus, DownloadError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was only fixed in Sentry in response to some production issue as well. For Sentry we know this should be 100% reliable, but k8s sometimes shuts down servers we are downloading from so we retry the streaming part as well a few times. For other servers it is less clear how big the impact would be of doing this work, but it certainly won't do any harm.

///
/// An additional tag for the source name is also added to the metric.
pub fn measure_source_download<'a, S, F>(
task_name: &'a str,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the same in all callers as well, so you could also remove this parameter?

@relaxolotl
Copy link
Contributor Author

Additional changes worth noting:

  • I've changed the tag names on the new metrics being recorded to better match the existing ones based off of a quick skim through Datadog:
    • Initial GET: "source.download.duration"
    • Stream duration: "source.stream.duration"
    • Stream throughput: "source.stream.throughput"
  • The guard currently takes a task name and simply appends ".duration" or ".throughput" when reporting their respective metrics. Users of the guard are simply expected to just provide root tag that doesn't mention exactly what's being mentioned.

Comment on lines 273 to 274
let bytes = self.bytes_transferred.get_or_insert(Default::default());
*bytes = bytes.saturating_add(additional_bytes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well use get_or_insert(0), since its a numeric value.

Since this is bytes that we add, we will overflow storage sooner than wrapping around a u64. But the saturating_add is a lot easier to read than the checked_add you had before, that was my main concern.

where
F: 'a + Future<Output = Result<T, E>>,
{
let guard = MeasureSourceDownloadGuard::new("source.download", source_name);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe name this source.connect? because download = connect + stream :-D

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! Good point, it is done. What do you think about adding a common segment for those two as well so it's easy to know that they're related? ie source.download.connect and source.download.stream?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I'll make that change and we can update the tag later if there are any issues with it

@relaxolotl
Copy link
Contributor Author

Just to be safe: @flub you left quite a bit of feedback. Are the newer changes in line with what you were thinking of?

@relaxolotl relaxolotl requested a review from flub July 9, 2021 02:46
Copy link
Contributor

@flub flub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, apologies for my reviewing failures yesterday!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants