Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust the DNS lookup duration metric #93254

Merged
merged 5 commits into from
Oct 12, 2023

Conversation

antonfirsov
Copy link
Member

@antonfirsov antonfirsov commented Oct 9, 2023

This is a necessary change to ensure the metric is in sync with the spec and the OTel recommendations. An 8.0 backport is planned to avoid future breaking changes.

@ghost
Copy link

ghost commented Oct 9, 2023

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

This is a necessary change to ensure the metric is in sync with the spec and the OTel recommendations. An 8.0 backport is planned to avoid future breaking changes.

Author: antonfirsov
Assignees: antonfirsov
Labels:

area-System.Net

Milestone: -

Comment on lines 614 to 622
private static string GetErrorType(Exception exception) => (exception as SocketException)?.SocketErrorCode switch
{
SocketError.HostNotFound => "host_not_found",
SocketError.TryAgain => "try_again",
SocketError.AddressFamilyNotSupported => "address_family_not_supported",
SocketError.NoRecovery => "no_recovery",

_ => exception.GetType().Name
};
Copy link
Member Author

@antonfirsov antonfirsov Oct 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I picked the codes we can realistically get from the PAL by looking at our mapping in the Unix PAL, and getaddrinfo return values on Windows.
There are some other codes that may theoretically occur, but I wouldn't expect to see them in practice. Ie. seeing InvalidArgument , TypeNotFound or SocketNotSupported means a programming error in our PAL rather than anything else if I'm not mistaken.

Copy link
Member

@JamesNK JamesNK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change looks good. Details about the error attribute need to be added to dotnet/docs#37213 and/or the OTEL semantic conventions

}
}
}

private static string GetErrorType(Exception exception) => (exception as SocketException)?.SocketErrorCode switch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These strings are specifically for metrics, right? Please add a comment on the method to that effect. Where do these result strings come from? Are they strings dictated by otel, or are you just snake_casing the names from the enum?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These strings are specifically for metrics, right?

Yes.

Where do these result strings come from?

They are defined by us, by mapping enum values to snake-cased strings which is an OTel convention. We do the same in HTTP metrics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which is an OTel convention

Why don't we do the same for the _ => case then? i.e. it seems like our general goal here is to take any SocketError value or Exception name, and snake case it, and we're then just optimizing a few cases that we expect to be most common by hardcoding the transformation. In the name of robustness, shouldn't we be dynamically creating the snak_cased version of the resulting SocketError if there is one (outside of the range we've hardcoded) or for the exception type?

Copy link
Member Author

@antonfirsov antonfirsov Oct 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noticed in #93254 (comment), this should be actually the full name of the exception, I will correct it.

Why don't we do the same for the _ => case then?

The specs don't require us to camel-case the exception names (would be very weird with full names). Tagging @lmolkova for a comprehensive explanation.

dynamically creating the snak_cased version of the resulting SocketError

The SocketError-s that can practically appear here are coming from a limited set. I picked the codes we can realistically get from the PAL by looking at our mapping in the Unix PAL, and getaddrinfo return values on Windows.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And when that mapping is augmented in the future, we're going to remember to update this random method as well?

Copy link
Member Author

@antonfirsov antonfirsov Oct 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And when that mapping is augmented in the future

That would only happen if getaddrinfo implementations would introduce new error codes which I find very unlikely. Implementing dynamic mapping here to prematurely solve that problem feels like overengineering. Users would see the value host_not_found in 99+% of the practical cases, sometimes try_again maybe.

If you think the mapping is problematic in the form I'm proposing, I would rather prefer to remove it, and just report the exception name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like our general goal here is to take any SocketError value or Exception name, and snake case it

I'd frame the goal as disambiguating the most common sources of error that users would appreciate not having grouped together when analyzing failures. I don't think it is a goal that every possible SocketError has been disambiguated, that new future SocketError values are automatically disambiguated, or that exception names are transformed into snake case. The hard-coded list Anton has here seems fine to me. If we ever changed it we'd probably want that to be an intentional choice so we can warn users that their queries might need to be updated.

@karelz karelz added this to the 9.0.0 milestone Oct 10, 2023
Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I commented on the other issue that I do think FullName would be preferable for exception types for consistency across other .NET metrics.

@antonfirsov
Copy link
Member Author

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@antonfirsov
Copy link
Member Author

CI failures are unrelated (eg. #93413).

@lewing
Copy link
Member

lewing commented Oct 12, 2023

I opened #93418 to revert this because it broke the build

@karelz
Copy link
Member

karelz commented Oct 13, 2023

Looks like it didn't break the build after all - see explanation in #93418 (comment)

carlossanlop pushed a commit that referenced this pull request Oct 13, 2023
…istency (#93414)

* Rename `server.socket.address` to `network.peer.address` (#93255)

* Report Exception.FullName for http.client.request.duration/error.type (#93322)

* Adjust the DNS lookup duration metric (#93254)
@ghost ghost locked as resolved and limited conversation to collaborators Nov 12, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants