
Add new (hopefully easier) telemetry mechanism #68261

Merged · 1 commit into dotnet:main · Jun 13, 2023

Conversation

@ToddGrun (Contributor)

The intent here is to provide a telemetry mechanism that is easy to use and, hopefully, provides enough functionality that most telemetry users can use it directly.

The most common mechanism Roslyn currently uses for telemetry (that I've found) is the Logger.Log call. This call aggregates multiple loggers through the ILogger interface and funnels the logging call to each implementation (TelemetryLogger being one). I wanted to separate the telemetry calls out from this interface because I needed to add a minimum time threshold for some of the telemetry logging. Separating these calls out also makes it explicit that these calls are logging telemetry.

Additionally, the existing telemetry calls through ILogger don't support aggregating values. The new implementation supports this through VSTelemetry's histogram support (based on OpenTelemetry). There is yet another telemetry mechanism in Roslyn that does support histogram/count calls; the intent is to switch callers of that implementation over to this new one (I can't yet, though, as I haven't added support for the Count part of the API).

The aggregated support sends histogram telemetry events every 30 minutes or upon process exit. For targeted notification support, non-aggregated telemetry events are sent when a minimum time threshold is met.

Telemetry dashboards consuming these telemetry events are here: https://dataexplorer.azure.com/dashboards/31bfd6e9-daa4-4273-acdb-e02cb156c289 (work in progress)
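For reference, a minimal usage sketch of the new surface area. The LogBlockTimeAggregated call mirrors the call sites in the diffs below (where `priority` comes from the surrounding method); the threshold-taking LogBlockTime overload is an assumption based on the description above, not copied from the change:

```csharp
// Sketch only -- the aggregated call matches the diffs in this PR; the exact
// shape of the threshold-based, non-aggregated overload is assumed.
using (TelemetryLogging.LogBlockTimeAggregated(FunctionId.CodeRefactoring_Summary, "Pri" + (int)priority))
{
    // Work being measured; its duration is folded into a histogram that is
    // posted every 30 minutes rather than producing one event per call.
}

// Hypothetical non-aggregated variant: emits an individual event only when
// the block takes longer than the supplied minimum threshold.
using (TelemetryLogging.LogBlockTime(FunctionId.RequestDiagnostics_Delay, "Pri" + (int)priority, minThresholdMs: 250))
{
    // Work being measured.
}
```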

The dotnet-issue-labeler bot added the Area-IDE and untriaged (Issues and PRs which have not yet been triaged by a lead) labels on May 18, 2023.
@@ -122,6 +123,7 @@ static ProjectCodeRefactoringProvider.ExtensionInfo GetExtensionInfo(ExportCodeR
Func<string, IDisposable?> addOperationScope,
CancellationToken cancellationToken)
{
using (TelemetryLogging.LogBlockTimeAggregated(FunctionId.CodeRefactoring_Summary, "Pri" + (int)priority))
@mavasani (Member) · May 19, 2023:

So, I like the fact that you have added a lot of different fine-grained aggregated log blocks for different parameter combinations (such as specific priority, fix vs. refactoring, specific provider, etc.). The tricky part for someone new looking at this telemetry might be figuring out which ones to dig into for investigating their specific scenario. I think eventually adding a document, or even a brief summary somewhere, about the different types of events logged in each space (diagnostics, quick actions, etc.) with a small description would be helpful. Basically, documentation from an investigation perspective, not the actual implementation perspective. #Resolved

@ToddGrun (Contributor, Author):

I don't disagree, but it seems to me that such documentation would be very easy to get out of date. I think the dashboards are a nice piece of "documentation" from this perspective as they give the actual rundowns of the telemetry events that exist and are being fired.

var (allDiagnostics, upToDate) = await _diagnosticService.TryGetDiagnosticsForSpanAsync(
document, range, GetShouldIncludeDiagnosticPredicate(document, priorityProvider),
includeSuppressedDiagnostics: false, priorityProvider, DiagnosticKind.All, isExplicit: false, cancellationToken).ConfigureAwait(false);
var telemetryContext = "Pri" + (int)priorityProvider.Priority + "." + nameof(GetMostSevereFixAsync);
@mavasani (Member) · May 19, 2023:

I see that you added the name of the method here in the telemetry context, but not in other places. I am fine with either approach (appending the containing method name or not), but we probably want to be consistent? #Resolved

@ToddGrun (Contributor, Author):

Definitely agree that consistency is good, but from my viewpoint the most important thing is that the telemetry data, when viewed, is informative and can be reasoned about. I tried to name the events with that in mind in this PR, but I don't have the context you have in this area, so I'm definitely willing to change any of the event names if you think different names would be more informative. Not sure if you have looked at the dashboard with this information, as that might give a more holistic view of how the events relate.

(Member):

same comment about interpolated strings and interpolated string handlers.

@@ -216,6 +217,9 @@ public async Task<bool> TryGetAsync(ArrayBuilder<DiagnosticData> list, Cancellat
using var _2 = ArrayBuilder<AnalyzerWithState>.GetInstance(out var semanticSpanBasedAnalyzers);
using var _3 = ArrayBuilder<AnalyzerWithState>.GetInstance(out var semanticDocumentBasedAnalyzers);

var telemetryContext = "Pri" + (int)_priorityProvider.Priority;
using var _4 = TelemetryLogging.LogBlockTimeAggregated(FunctionId.RequestDiagnostics_Summary, telemetryContext);
@mavasani (Member) · May 19, 2023:

I am wondering if this will lead to duplication with the telemetry logged in the code fix service around the single call to diagnosticService.GetDiagnosticsForSpanAsync. The log block there is dedicated to the lightbulb Ctrl+Dot path, while the block here covers all of the following paths: lightbulb Ctrl+Dot, margin lightbulb background analysis, tagger, and all LSP pull diagnostics calls. If we want to log the telemetry here, you definitely need to add more parameterization to the context so we have different blocks for these code paths. The parameters to include would be _range, _incrementalAnalysis, and _diagnosticKind (maybe we want to log separate aggregated blocks per diagnostic kind and also one accumulated block that is agnostic of the diagnostic kind). #Resolved

(Member):

Oh, I am not sure of the performance penalty we pay by logging multiple aggregated blocks here. If it is expensive, then a single summary block per each code path would be fine, but we definitely need more context than just the priority here.

@ToddGrun (Contributor, Author):

I think I'd like to leave this to a follow-up PR where we can discuss what you would like to see here. Note that every piece of data that we use to build up the event name multiplies the potential number of event names, so including something like range in the name isn't desirable.

@@ -61,6 +62,37 @@ private PerformanceQueue GetQueue(bool forSpanAnalysis)

public void AddSnapshot(IEnumerable<AnalyzerPerformanceInfo> snapshot, int unitCount, bool forSpanAnalysis)
{
foreach (var perfInfo in snapshot)
@mavasani (Member) · May 19, 2023:

I presume the intent for this PR is to add aggregate logging for these snapshots, and a later PR will then delete our existing code that does custom aggregation of these snapshots and logs them in the background? #Resolved

@ToddGrun (Contributor, Author):

Our existing logging support in this area was a bit too complicated for me to tackle to start with. I feel like the old support can be mostly removed, but it's something I would like to discuss further with you outside this PR.

{
private TelemetryLogger? _telemetryLogger;
@mavasani (Member) · May 19, 2023:

Just to confirm: The fact that the new telemetry logger is only hooked up in the remote workspace and not host workspace is not a limitation of the logger but just the need for this PR, correct? #Resolved

@ToddGrun (Contributor, Author):

I'm not clear on this question, can you clarify? TelemetryLogger.Create is called both from RemoteWorkspaceTelemetryService and from VisualStudioWorkspaceTelemetryService.

_isDisposed = false;
_lock = new object();

_ = PostCollectedTelemetryAsync();
@sharwell (Member) · May 19, 2023:

❗ Asynchronous operations must be tracked either by the life of the calling method or through asynchronous operation listeners. Fire-and-forget methods always fall into the latter category. However, since this is a long-lived operation that won't terminate as expected by our tracking mechanism, the preferred strategy is to use an AsyncBatchingWorkQueue, which will report that it's complete whenever the queue is empty. #Resolved

@ToddGrun (Contributor, Author):

Didn't know about this, thanks! Switched to use this in commit 2.
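For readers unfamiliar with the pattern being suggested, here is a rough sketch of hanging the periodic flush off an AsyncBatchingWorkQueue rather than a fire-and-forget loop. The constructor shape is recalled from the Roslyn helper and may not match exactly; the class name, listener wiring, and flush method are placeholders rather than the PR's actual code:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.CodeAnalysis.Shared.TestHooks;   // IAsynchronousOperationListener (namespace approximate)
using Roslyn.Utilities;                          // AsyncBatchingWorkQueue (namespace approximate)

// Rough sketch, not the PR's implementation. The non-generic queue runs the
// batch callback once per delay window whenever at least one item is pending,
// and reports idle to the asynchronous operation listener when it is empty.
internal sealed class TelemetryPoster
{
    private readonly AsyncBatchingWorkQueue _postTelemetryQueue;

    public TelemetryPoster(IAsynchronousOperationListener listener, CancellationToken cancellationToken)
    {
        _postTelemetryQueue = new AsyncBatchingWorkQueue(
            TimeSpan.FromMinutes(30),          // collection period from the PR description
            PostCollectedTelemetryAsync,       // batch callback that flushes aggregated histograms
            listener,
            cancellationToken);
    }

    // Any log call can poke the queue; pending work is coalesced by the queue itself.
    public void EnsureTelemetryWillPost()
        => _postTelemetryQueue.AddWork();

    private ValueTask PostCollectedTelemetryAsync(CancellationToken cancellationToken)
    {
        // Actual flush of aggregated histograms elided.
        return default;
    }
}
```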

ToddGrun marked this pull request as ready for review on May 22, 2023 at 17:42
ToddGrun requested review from a team as code owners on May 22, 2023 at 17:42
@@ -112,6 +113,8 @@ private async Task InvokeAsync()
{
try
{
using var _ = TelemetryLogging.LogBlockTimeAggregated(FunctionId.SuggestedAction_Application_Summary, "Total");
@CyrusNajmabadi (Member) · May 22, 2023:

how would we know what suggested action this adds up to? or are we just wanting to get top level P numbers of invocation in general? #Resolved

@ToddGrun (Contributor, Author):

Aggregated collection won't tell us anything about what is being tracked other than a histogram of times and what is captured in the event name. So, the intent here is to capture a histogram of all "Total" costs of invoking a suggested action. If we find out that we are having perf issues in this area, then I could see us adding a TelemetryLogging.LogBlockTime call with a threshold time for firing individual events.

@@ -110,17 +111,22 @@ private partial class SuggestedActionsSource : IAsyncSuggestedActionsSource
// then pass it continuously from one priority group to the next.
var lowPriorityAnalyzers = new ConcurrentSet<DiagnosticAnalyzer>();

using var _2 = TelemetryLogging.LogBlockTimeAggregated(FunctionId.SuggestedAction_Summary, "Total");
@CyrusNajmabadi (Member) · May 22, 2023:

what's the diff between this and SuggestedAction_Application_Summary? Is it so we can measure what time is spent in the source, prior to actually doing computation at each pri level? #Resolved

@ToddGrun (Contributor, Author):

SuggestedAction_Summary is intended to measure the time to calculate the actions at each priority, whereas SuggestedAction_Application_Summary is intended to measure the time to apply the changes for an invoked suggested action.

@ToddGrun (Contributor, Author):

> Okey dokey, keep the cadence at 30 minutes, or would not doing it on shutdown change your opinion of that frequency?

Keeping at 30 minutes and removed disposal code.

ToddGrun requested review from a team as code owners on May 30, 2023 at 18:38
@ToddGrun (Contributor, Author):

hrm, I seem to have done something wrong rebasing against main (I did so as this branch was based off a main from a couple of weeks ago)

@ToddGrun (Contributor, Author) · Jun 1, 2023:

> hrm, I seem to have done something wrong rebasing against main (I did so as this branch was based off a main from a couple of weeks ago)

Fixed the rebase via a force push. Not sure what went wrong, but that problem is solved. The tests were failing because some test code requires AsyncBatchingWorkQueue to be completely empty to progress. I now no longer automatically queue a new item once I've processed the old one, instead waiting to do so until the log manager's GetLog method is called.

/// <summary>
/// Indicates whether <see cref="_postTelemetryQueue"/> has work already added to it.
/// </summary>
private int _workAdded = 0;
(Member):

not sure why this is needed. you can always just add more work to the queue. you're using the no-type-arg version, so it will be a no-op if there's already an item in the queue.

// The queue doesn't have any work pending. Add an item so PostCollectedTelemetryAsync
// will get fired after the collection period.
_postTelemetryQueue.AddWork();
}
(Member):

this doesn't seem necessary.

PostCollectedTelemetry();

// Reset to indicate the queue doesn't have any work pending.
_workAdded = 0;
(Member):

also not necessary. this is how the queue works to begin with. in the no-type-arg case you can think of it just as a beacon that pulses after the time limit has elapsed, as long as at least one work item was added to it.
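A minimal illustration of the suggestion, assuming the field names from the snippets above (this is a sketch of the proposed simplification, not code from the PR):

```csharp
// The _workAdded tracking isn't needed: calling AddWork while an item is
// already pending is effectively a no-op for the no-type-arg queue, so every
// log call can simply do this unconditionally.
_postTelemetryQueue.AddWork();
```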

if (elapsed >= _minThresholdMs)
{
const string Name = nameof(Name);
const string Value = nameof(Value);
(Member):

second time i've seen these constants. Place in common location?

return;

const string Name = nameof(Name);
const string Value = nameof(Value);
(Member):

another copy :)
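One possible shape for the suggested shared location; the class name and placement are hypothetical, while the constant names come from the snippets above:

```csharp
// Hypothetical shared home for the duplicated telemetry property-name constants.
internal static class TelemetryLoggingConstants
{
    public const string Name = nameof(Name);
    public const string Value = nameof(Value);
}
```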

namespace Microsoft.CodeAnalysis.Telemetry
{
[InterpolatedStringHandler]
internal readonly struct TelemetryLoggingInterpolatedStringHandler
(Member):

nice.
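For readers unfamiliar with the feature, here is a self-contained sketch of the interpolated-string-handler pattern; it is a generic illustration, not Roslyn's TelemetryLoggingInterpolatedStringHandler. The compiler lowers `$"..."` into AppendLiteral/AppendFormatted calls on the handler, and the out-bool constructor lets those calls be skipped entirely when logging is disabled, so the event name is never built unless it will actually be used:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Text;

[InterpolatedStringHandler]
internal readonly struct LazyLogMessageHandler
{
    private readonly StringBuilder? _builder;

    // shouldAppend = false tells the compiler to skip all Append* calls,
    // so the message is never formatted when logging is disabled.
    public LazyLogMessageHandler(int literalLength, int formattedCount, bool enabled, out bool shouldAppend)
    {
        _builder = enabled ? new StringBuilder(literalLength) : null;
        shouldAppend = enabled;
    }

    public void AppendLiteral(string value) => _builder!.Append(value);
    public void AppendFormatted<T>(T value) => _builder!.Append(value);
    public string GetFormattedText() => _builder?.ToString() ?? string.Empty;
}

internal static class SketchLogger
{
    // The "enabled" argument is forwarded into the handler's constructor by
    // the compiler via InterpolatedStringHandlerArgument.
    public static void Log(bool enabled, [InterpolatedStringHandlerArgument("enabled")] LazyLogMessageHandler message)
    {
        if (enabled)
            Console.WriteLine(message.GetFormattedText());
    }
}

internal static class Usage
{
    public static void Main()
    {
        var priority = 2;

        // The interpolation below is only materialized when the first
        // argument is true; otherwise no string is ever built.
        SketchLogger.Log(enabled: true, $"Pri{priority}.GetMostSevereFixAsync");
    }
}
```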

PerformAnalysis_Summary = 721,
RequestDiagnostics_Delay = 730,
RequestDiagnostics_Summary = 731,
SuggestedAction_Delay = 740,
(Member):

blank lines between the sections.

@CyrusNajmabadi (Member) left a review comment:

minor suggestions/requests. None of them really blocking. Ok for this to go in once you're satisfied the suggestions are handled to your liking.

@ToddGrun (Contributor, Author) · Jun 8, 2023:

/azp run

@azure-pipelines:

Azure Pipelines successfully started running 4 pipeline(s).

@ToddGrun (Contributor, Author) · Jun 8, 2023:

/azp run

@azure-pipelines:

Azure Pipelines successfully started running 4 pipeline(s).

@ToddGrun (Contributor, Author) · Jun 8, 2023:

/azp run

@azure-pipelines:

Pull request contains merge conflicts.

@ToddGrun (Contributor, Author) · Jun 9, 2023:

/azp run

@azure-pipelines:

Azure Pipelines successfully started running 4 pipeline(s).

ToddGrun merged commit 698af92 into dotnet:main on Jun 13, 2023
24 checks passed
ghost added this to the Next milestone on Jun 13, 2023
RikkiGibson modified the milestone from Next to 17.7 P3 on Jun 28, 2023