feat(download): Measure how long it takes to download sources #483
Conversation
- now tracking the initial GET request
- tracking throughput and duration to stream the source's content
- now retrying the initial GET request and the stream if either of those fail
@@ -101,28 +101,26 @@ macro_rules! metric {
        .send();
    })
}};
(timer($id:expr), $block:block $(, $k:expr => $v:expr)* $(,)?) => {{
The diff is a little messy here, but an unused kind of timer, `metric!(timer(..))`, has been removed since it isn't being used anywhere in the codebase, as far as I can tell.
// histograms
(histogram($id:expr) = $value:expr $(, $k:expr => $v:expr)* $(,)?) => {{
    use $crate::metrics::_pred::*;
    $crate::metrics::with_client(|client| {
-       client.time_with_tags($id, $value)
+       client.histogram_with_tags($id, $value)
            $(.with_tag($k, $v))*
            .send();
    })
}};
It looks like statsd directly supports histograms. The `metrics!` macro has been extended to include that functionality in this PR, but it appears that symbolicator is already emitting `StacktraceMetrics` as a histogram via `metrics!(timer_raw(...))`, which uses `time_with_tags`. Was there an explicit decision made not to use statsd's `histogram_with_tags()` for histograms?
I think adding the explicit histogram is great! Mostly historical and a resistance to change, I think.
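To illustrate the macro arm being discussed, here is a minimal, self-contained sketch of a `metric!`-style macro with a dedicated histogram arm. The `Client`/`MetricBuilder` types and the exact wire format are illustrative stand-ins, not symbolicator's or cadence's real API:

```rust
// Toy statsd-style client; real code would send over UDP instead of
// returning the formatted line.
struct Client;

impl Client {
    fn histogram_with_tags(&self, id: &str, value: u64) -> MetricBuilder {
        // `|h` is the statsd histogram metric type.
        MetricBuilder { line: format!("{}:{}|h", id, value) }
    }
}

struct MetricBuilder {
    line: String,
}

impl MetricBuilder {
    fn with_tag(mut self, k: &str, v: &str) -> Self {
        self.line.push_str(&format!("|#{}:{}", k, v));
        self
    }
    fn send(self) -> String {
        self.line
    }
}

// Histogram arm mirroring the shape in the diff above: a value plus any
// number of optional `key => value` tags.
macro_rules! metric {
    (histogram($id:expr) = $value:expr $(, $k:expr => $v:expr)* $(,)?) => {{
        Client.histogram_with_tags($id, $value)
            $(.with_tag($k, $v))*
            .send()
    }};
}

fn main() {
    let line = metric!(histogram("source.download.size") = 42, "source" => "http");
    assert_eq!(line, "source.download.size:42|h|#source:http");
    println!("{}", line);
}
```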
@@ -61,37 +61,74 @@ impl HttpDownloader {
    &self,
    file_source: HttpRemoteDif,
    destination: PathBuf,
) -> Result<DownloadStatus, DownloadError> {
This needs a bit of an explanation.
In all non-Sentry, non-local FS sources only the initial GET request is/was being retried. In contrast, downloads from Sentry sources retry both the initial GET request and the stream attempts, which is what the diff in this file attempts to mimic. See:
symbolicator/crates/symbolicator/src/services/download/sentry.rs
Lines 236 to 314 in 919aafb
pub async fn download_source(
    &self,
    file_source: SentryRemoteDif,
    destination: PathBuf,
) -> Result<DownloadStatus, DownloadError> {
    let retries = future_utils::retry(|| {
        self.download_source_once(file_source.clone(), destination.clone())
    });
    match retries.await {
        Ok(DownloadStatus::NotFound) => {
            log::debug!(
                "Did not fetch debug file from {:?}: {:?}",
                file_source.url(),
                DownloadStatus::NotFound
            );
            Ok(DownloadStatus::NotFound)
        }
        Ok(status) => {
            log::debug!(
                "Fetched debug file from {:?}: {:?}",
                file_source.url(),
                status
            );
            Ok(status)
        }
        Err(err) => {
            log::debug!(
                "Failed to fetch debug file from {:?}: {}",
                file_source.url(),
                err
            );
            Err(err)
        }
    }
}

async fn download_source_once(
    &self,
    file_source: SentryRemoteDif,
    destination: PathBuf,
) -> Result<DownloadStatus, DownloadError> {
    let request = self
        .client
        .get(file_source.url())
        .header("User-Agent", USER_AGENT)
        .bearer_auth(&file_source.source.token)
        .send();
    let download_url = file_source.url();
    let source = RemoteDif::from(file_source);
    let request = future_utils::measure_source_download(
        "service.download.download_source",
        source.source_metric_key(),
        m::result,
        request,
    );
    match request.await {
        Ok(response) => {
            if response.status().is_success() {
                log::trace!("Success hitting {}", download_url);
                let stream = response.bytes_stream().map_err(DownloadError::Reqwest);
                super::download_stream(source, stream, destination).await
            } else {
                log::trace!(
                    "Unexpected status code from {}: {}",
                    download_url,
                    response.status()
                );
                Ok(DownloadStatus::NotFound)
            }
        }
        Err(e) => {
            log::trace!("Skipping response from {}: {}", download_url, e);
            Ok(DownloadStatus::NotFound) // must be wrong type
        }
    }
}
Is there any reason why the others (HTTP, S3, GCS) aren't retrying both the initial GET request as well as the stream? If not, I may change those to do the same.
This was only fixed in Sentry in response to some production issue as well. For Sentry we know this should be 100% reliable, but k8s sometimes shuts down servers we are downloading from so we retry the streaming part as well a few times. For other servers it is less clear how big the impact would be of doing this work, but it certainly won't do any harm.
I've removed retry-related logic from this PR to keep it focused. A new PR, branched off of the changes in this one, extends retrying of both the header and the stream to all sources: #485. It's a pretty naive approach right now, but I'd love your feedback on the diff there as well, @flub.
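The retry behavior discussed above can be sketched with a bounded-retry helper. This is a synchronous simplification for illustration; the real `future_utils::retry` operates on futures, and its attempt count and backoff strategy may differ:

```rust
// Retry a fallible operation up to `max_attempts` times, returning the
// first success or the last error. Hypothetical sketch, not the real helper.
fn retry<T, E>(max_attempts: usize, mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) => {
                attempt += 1;
                if attempt >= max_attempts {
                    return Err(err);
                }
                // A real implementation would likely sleep/backoff here.
            }
        }
    }
}

fn main() {
    let mut calls = 0;
    // Fails twice, then succeeds on the third attempt.
    let result: Result<&str, &str> = retry(3, || {
        calls += 1;
        if calls < 3 { Err("transient") } else { Ok("downloaded") }
    });
    assert_eq!(result, Ok("downloaded"));
    assert_eq!(calls, 3);
}
```

Wrapping both the initial GET and the streaming phase in such a helper is what gives Sentry sources their resilience to servers shutting down mid-download.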
@@ -235,6 +235,102 @@ where
    }
}

/// A guard to [`measure`] the amount of time it takes to download a source. This guard is also
This was "heavily inspired" by `MeasureGuard`, a few lines above in the same file. As far as I can tell, fairly invasive changes would need to be made to the existing `MeasureGuard` for it to work for the specific use case this PR needed. Some examples: I was unable to figure out how to add an arbitrary number of tags to the metric it logs, it doesn't support histograms out of the box, and throughput logging would have involved even more extensive changes to the struct. Given that, I opted to just use it as a loose template for a new type of `MeasureGuard` for source downloads.
The existing `MeasureGuard` and `measure` are a (perhaps not most successful) attempt at creating something that generically emits a `futures.done` metric. I think it's fine not to use it on anything that doesn't yet emit that metric, it's a bit weird maybe, I don't know.
So I think creating this custom for timing downloads is fine, but I would probably move it to the `services::download` module instead of here. Also, I would make it more specific, because it doesn't need the strange way of being generic with `m::something` indirections.
Done: I've moved `MeasureSourceDownload` to `services::download` and renamed it to an even wordier `MeasureSourceDownloadGuard`. Per your other suggestions, the API has also been slimmed down to remove some of the unneeded flexibility inherited from the original `MeasureGuard` that isn't needed for this particular measure guard's use case.
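A guard of the kind described above can be sketched as a struct that records a start time and a byte count, and emits its metrics when consumed. Field and method names follow the discussion but are illustrative, not symbolicator's exact API; emission is modeled by returning a formatted string:

```rust
use std::time::Instant;

// Illustrative sketch of a source-download measuring guard.
struct MeasureSourceDownloadGuard {
    task_name: &'static str,
    source_name: &'static str,
    creation_time: Instant,
    bytes_transferred: Option<u64>,
}

impl MeasureSourceDownloadGuard {
    fn new(task_name: &'static str, source_name: &'static str) -> Self {
        Self {
            task_name,
            source_name,
            creation_time: Instant::now(),
            bytes_transferred: None,
        }
    }

    /// Accumulate bytes, initializing the counter on first use and
    /// saturating instead of overflowing.
    fn add_bytes_transferred(&mut self, additional_bytes: u64) {
        let bytes = self.bytes_transferred.get_or_insert(0);
        *bytes = bytes.saturating_add(additional_bytes);
    }

    /// Consume the guard and "emit" the duration/throughput metrics;
    /// the real guard would call into the metrics client here.
    fn done(self, status: &'static str) -> String {
        let duration = self.creation_time.elapsed();
        format!(
            "{} source:{} status:{} bytes:{:?} millis:{}",
            self.task_name,
            self.source_name,
            status,
            self.bytes_transferred,
            duration.as_millis(),
        )
    }
}

fn main() {
    let mut guard = MeasureSourceDownloadGuard::new("source.download.stream", "http");
    guard.add_bytes_transferred(1024);
    guard.add_bytes_transferred(512);
    let line = guard.done("ok");
    assert!(line.contains("bytes:Some(1536)"));
    println!("{}", line);
}
```

Consuming `self` in `done` is what makes "terminated" a one-shot transition: the guard cannot be reused after its metrics are emitted.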
lgtm!
I guess retrying the whole download vs only the initial connection is fine.
However I can’t really answer your question regarding the time_raw vs histogram.
self.bytes_transferred = self
    .bytes_transferred
    .and_then(|old_count| old_count.checked_add(additional_bytes).or(Some(old_count)))
    .or(Some(additional_bytes));
`Option` has `get_or_insert_default` (https://doc.rust-lang.org/std/option/enum.Option.html#method.get_or_insert_default), which might be super useful here.
Also, I wouldn't worry about using `checked_add` here.
Thank you for this! I've done something a little different here: `get_or_insert(Default::default())` is being used over the suggested `get_or_insert_default()`, since the latter method hasn't hit stable yet.
Could I bug you to elaborate on why you suppose `checked_add` isn't needed? For now I've opted to use `saturating_add` courtesy of Floris's suggestion, but perhaps I'm being overly cautious here when I don't need to be.
pub fn add_bytes_transferred(&mut self, additional_bytes: u64) {
    self.bytes_transferred = self
        .bytes_transferred
        .and_then(|old_count| old_count.checked_add(additional_bytes).or(Some(old_count)))
Doesn't a saturating add make more sense rather than sticking with the previous value if it overflows?
let duration = self.creation_time.elapsed().as_secs();
metric!(
    histogram(self.task_name) = duration,
I only begrudgingly accepted that we were abusing timers as histograms, but now we're doing the opposite and I'm confused. Why is this not a timer?
Also, all existing time values in Datadog are recorded as milliseconds; this one is now recorded as seconds, which will be confusing as well.
Spoke to @flub about this on the side, and it turns out I misunderstood things here. It looks like timers (`metric!(timer(...))`) under the hood are simply histograms, so it makes more sense to just use that here, instead of shoving the duration into a histogram. The diff has been updated to do exactly that.
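The "timers are histograms" point comes down to the statsd wire protocol: both are distribution metrics, differing only in their type suffix (`ms` vs `h`) and implied units. A tiny illustration of the two datagram shapes (the formatting helper here is hypothetical, not a real client API):

```rust
// Format a statsd-style datagram: name, value, and metric type suffix.
fn format_metric(name: &str, value: u64, kind: &str) -> String {
    format!("{}:{}|{}", name, value, kind)
}

fn main() {
    // A timer records a duration in milliseconds with the `ms` type...
    assert_eq!(
        format_metric("download.duration", 250, "ms"),
        "download.duration:250|ms"
    );
    // ...while a histogram records an arbitrary value with the `h` type.
    // The aggregation on the server side is the same distribution math.
    assert_eq!(
        format_metric("download.size", 4096, "h"),
        "download.size:4096|h"
    );
}
```

This is also why the milliseconds-vs-seconds concern above matters: the server aggregates whatever number arrives, so mixed units across metrics silently skew dashboards.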
}

/// Marks the download as terminated.
pub fn done(mut self, status: &'static str) {
Can this take `&Result<_, _>` instead of relying on the indirections?
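The suggestion above amounts to deriving the status tag directly from a borrowed `Result` rather than threading a status-extraction callback through the guard. A minimal sketch of that shape (names and tag values are illustrative):

```rust
// Derive a metric status tag from any Result, without needing to know
// the concrete success or error types.
fn status_of<T, E>(result: &Result<T, E>) -> &'static str {
    match result {
        Ok(_) => "ok",
        Err(_) => "error",
    }
}

fn main() {
    let ok: Result<u32, &str> = Ok(7);
    let err: Result<u32, &str> = Err("boom");
    assert_eq!(status_of(&ok), "ok");
    assert_eq!(status_of(&err), "error");
}
```

Because the function only borrows the `Result`, the caller can still return it afterwards, which is convenient when the measurement wraps the download future's output.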
pub fn measure_source_download<'a, S, F>(
    task_name: &'a str,
    source_name: &'a str,
    get_status: S,
This wrapper shouldn't need `get_status`; it's `Result` for all of the downloaders.
///
/// An additional tag for the source name is also added to the metric.
pub fn measure_source_download<'a, S, F>(
    task_name: &'a str,
I think this is the same in all callers as well, so you could also remove this parameter?
- grab a direct mutable reference to the total byte count and update that
- saturating add clamps the result to the maximum value of the type being used
…r a specific purpose
Additional changes worth noting:
let bytes = self.bytes_transferred.get_or_insert(Default::default());
*bytes = bytes.saturating_add(additional_bytes);
Might as well use `get_or_insert(0)`, since it's a numeric value.
Since this is bytes that we add, we will fill up storage sooner than we wrap around a `u64`. But the `saturating_add` is a lot easier to read than the `checked_add` you had before; that was my main concern.
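The difference between the two overflow strategies discussed here is easy to demonstrate directly:

```rust
fn main() {
    let near_max = u64::MAX - 1;

    // saturating_add clamps at the type's maximum value...
    assert_eq!(near_max.saturating_add(5), u64::MAX);
    // ...while checked_add signals overflow by returning None.
    assert_eq!(near_max.checked_add(5), None);
    // Both behave identically when no overflow occurs.
    assert_eq!(10u64.saturating_add(5), 15);

    // The pattern from the diff: initialize-on-first-use plus saturating add.
    let mut bytes_transferred: Option<u64> = None;
    let bytes = bytes_transferred.get_or_insert(0);
    *bytes = bytes.saturating_add(1024);
    assert_eq!(bytes_transferred, Some(1024));
}
```

With a `u64` byte counter, saturation would only kick in after roughly 16 exabytes, so the clamped value is effectively unreachable in practice; the saturating form just removes the `and_then`/`or` dance from the earlier version.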
where
    F: 'a + Future<Output = Result<T, E>>,
{
    let guard = MeasureSourceDownloadGuard::new("source.download", source_name);
Maybe name this `source.connect`? Because download = connect + stream :-D
Oh! Good point, it is done. What do you think about adding a common segment for those two as well, so it's easy to tell that they're related? I.e. `source.download.connect` and `source.download.stream`?
Actually I'll make that change and we can update the tag later if there are any issues with it
Just to be safe: @flub you left quite a bit of feedback. Are the newer changes in line with what you were thinking of?
LGTM, apologies for my reviewing failures yesterday!
This begins measuring requests and download throughput related to source-fetching in order to determine what a good set of default timeouts for such requests should be.
The changes can be broken down into three major parts:
- `MeasureSourceDownload`, which is a modified version of `MeasureGuard` focused on source downloads
- measuring the initial GET request for each source via `MeasureSourceDownload`
- measuring the throughput and duration of `download_stream` via `MeasureSourceDownload`

Source downloads from the local filesystem have been skipped, as it is unlikely that any meaningful or useful information will come from logging information about those.
Open questions and notes will be left directly on the diff.