Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dns.lookup.duration should report error.type #92837

Closed
lmolkova opened this issue Sep 29, 2023 · 10 comments
Closed

dns.lookup.duration should report error.type #92837

lmolkova opened this issue Sep 29, 2023 · 10 comments
Assignees
Milestone

Comments

@lmolkova
Copy link

Description

Currently dns.lookup.duration only reports duration of he successful lookups, but failed lookups are quite interesting as well (count, rate, duration, failure reason).

Having error.type attribute on the dns.lookup.duration metric and reporting it when DNS lookup fails would provide all the details.

Adding an attribute later is considered a breaking change on OTel, since it increases the number of time series in the metric and breaks alerts and dashboards (so creating it as a bug).

Reproduction Steps

N/A

s_lookupDuration.Record(duration.TotalSeconds, KeyValuePair.Create("dns.question.name", (object?)hostName));

Expected behavior

The metric should have error.type attribute nad report duration of both, failed and successfully lookups.

Actual behavior

Metric only reports successful lookups

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

No response

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Sep 29, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Sep 29, 2023
@MihaZupan MihaZupan added area-System.Net and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Sep 29, 2023
@ghost
Copy link

ghost commented Sep 29, 2023

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

Currently dns.lookup.duration only reports duration of he successful lookups, but failed lookups are quite interesting as well (count, rate, duration, failure reason).

Having error.type attribute on the dns.lookup.duration metric and reporting it when DNS lookup fails would provide all the details.

Adding an attribute later is considered a breaking change on OTel, since it increases the number of time series in the metric and breaks alerts and dashboards (so creating it as a bug).

Reproduction Steps

N/A

s_lookupDuration.Record(duration.TotalSeconds, KeyValuePair.Create("dns.question.name", (object?)hostName));

Expected behavior

The metric should have error.type attribute nad report duration of both, failed and successfully lookups.

Actual behavior

Metric only reports successful lookups

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

No response

Author: lmolkova
Assignees: -
Labels:

area-System.Net, untriaged

Milestone: -

@MihaZupan
Copy link
Member

MihaZupan commented Sep 29, 2023

Currently dns.lookup.duration only reports duration of he successful lookups

Where are you seeing that? We're using the same code for both success & failure cases, and have tests for the failure case here

public static async Task ResolveInvalidHostName_MetricsRecorded()

Having error.type attribute on the dns.lookup.duration metric and reporting it when DNS lookup fails would provide all the details.

What sort of information would you expect to see there (we may have a limited amount of info from the OS)?

@lmolkova
Copy link
Author

Where are you seeing that? We're using the same code for both success & failure cases, and have tests for the failure case here

Thanks for the clarification! So it records all durations regardless of the result - I didn't realize that.

In this case, users are just not able to separate successful attempts from failed ones.
I'd like to see an exception type captured in error.type (or, if there is any enum-like list of DNS errors this would also be a great option).

@antonfirsov antonfirsov changed the title dns.lookup.duration metric for failed lookups dns.lookup.duration should report error.type Oct 3, 2023
@antonfirsov antonfirsov removed the untriaged New issue has not been triaged by the area owner label Oct 3, 2023
@antonfirsov antonfirsov added this to the 8.0.0 milestone Oct 3, 2023
@antonfirsov
Copy link
Member

@karelz after discussing with @noahfalk and @lmolkova, we think the addition of error.type is something we should get into 8.0:

  • Attribute additions are breaking changes according to otel, making the negative impact higher if we add it in future releases.
  • It's easy / low risk to implement

@antonfirsov
Copy link
Member

Linking the related aspnet issue: dotnet/aspnetcore#51029

@MihaZupan
Copy link
Member

Attribute additions are breaking changes according to otel, making the negative impact higher if we add it in future releases

To what degree are changes to emitted values for existing tags considered breaking?
E.g. if we had a managed implementation of DNS, would we be free to report different information, or would we want to mimic the values we'll end up using here?

@antonfirsov
Copy link
Member

Adding attributes may break existing Prometheus queries. I hope changes in the query result because of different values in the attributes are not considered to be breaking. @lmolkova do you have a comprehensive answer to this?

@lmolkova
Copy link
Author

lmolkova commented Oct 3, 2023

Imagine someone writes the following query:

avg(rate(dns_query_duration_count{question="foo.io"}[1m]))

At first, there was no error.type attribute and then this query returned an average number of DNS lookups per minute to foo.io.

Let's say we do ~100 of them: 90 are fine and 10 fail.

Now, we add error.type attribute.
From what used to be 1 time series (mind other attributes), we now have two of them:

  • dns_query_duration_count{error_type="fail"} has avg rate of 10
  • dns_query_duration_count{error_type=""} (or no error_type) has avg rate of 90

The average of 2 of them is 50.

I.e. we went from an average 100 to an average 50 by adding a new attribute. If someone set up an alert on this, then this alert is no longer valid. Users would also be surprised about a new range of this value.

Here's how the graph looks like (drop corresponds to adding new attribute)
image

and this is individual time series
image

So, certain queries and operators are sensitive to new attribute addition, but in some cases we can tolerate/justify it.
E.g. if we now only reported successful lookups, adding error.type would probably not be breaking since we'd start to report additional time series about things that didn't happen before.

Instead of justifying this change and inevitably shuffling things around in the future, we'd have much easier time updating it now.

@lmolkova
Copy link
Author

lmolkova commented Oct 3, 2023

To what degree are changes to emitted values for existing tags considered breaking?

TL;DR
New/different values of tags are mostly fine (unless you change units or do other drastic changes), but

  1. adding new tags (i.e. new time series) is definitely a problem
  2. breaking down tag value into multiple ones or grouping multiple values into one would be similar to adding a new tag (changes a number of time series).
  3. renaming tag value can be breaking (e.g. if someone filters by this specific value).
  4. adding a new tag value that has never been reported before (a new class of errors that never happened) should be fine

So we should in general avoid 1-3, but 2-3 should be less impactful than 1.

@ghost ghost added in-pr There is an active PR which will close this issue when it is merged and removed in-pr There is an active PR which will close this issue when it is merged labels Oct 12, 2023
@antonfirsov
Copy link
Member

antonfirsov commented Oct 16, 2023

Fixed by #93414 in release/8.0, and by #93254 in main (9.0).

@ghost ghost locked as resolved and limited conversation to collaborators Nov 15, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants