
Add new (hopefully easier) telemetry mechanism #68261

Merged · 1 commit into dotnet:main · Jun 13, 2023

Conversation

@ToddGrun (Contributor)

The intent here is to provide a telemetry mechanism that is easy to use and, hopefully, provides enough functionality that most telemetry users can use it directly.

The most common mechanism Roslyn currently uses for telemetry (that I've found) is the Logger.Log call. This call aggregates multiple loggers through the ILogger interface and funnels the logging call to each implementation (TelemetryLogger being one). I wanted to separate the telemetry calls out from this interface because I needed to add a minimum time threshold for some of the telemetry logging. Separating these calls out also makes it explicit that these calls are logging telemetry.

Additionally, the existing telemetry calls through ILogger don't support aggregating values. The new implementation supports this through VSTelemetry's histogram support (based on OpenTelemetry). There is yet another telemetry mechanism in Roslyn that does support histogram/count calls; the intent is to switch callers of that implementation over to this new one (I can't yet, though, as I haven't added support for the Count part of the API).

The aggregated support sends histogram telemetry events every 30 minutes or upon process exit. For targeted notification support, non-aggregated telemetry events are sent when a minimum time threshold is met.

Telemetry dashboards consuming these telemetry events are here: https://dataexplorer.azure.com/dashboards/31bfd6e9-daa4-4273-acdb-e02cb156c289 (work in progress)
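For reference, a minimal usage sketch of the new surface area. The LogBlockTimeAggregated call mirrors the call sites in the diffs below (where `priority` comes from the surrounding method); the threshold-taking LogBlockTime overload is an assumption based on the description above, not copied from the change:

```csharp
// Sketch only -- the aggregated call matches the diffs in this PR; the exact
// shape of the threshold-based, non-aggregated overload is assumed.
using (TelemetryLogging.LogBlockTimeAggregated(FunctionId.CodeRefactoring_Summary, "Pri" + (int)priority))
{
    // Work being measured; its duration is folded into a histogram that is
    // posted every 30 minutes rather than producing one event per call.
}

// Hypothetical non-aggregated variant: emits an individual event only when
// the block takes longer than the supplied minimum threshold.
using (TelemetryLogging.LogBlockTime(FunctionId.RequestDiagnostics_Delay, "Pri" + (int)priority, minThresholdMs: 250))
{
    // Work being measured.
}
```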

The dotnet-issue-labeler bot added the Area-IDE and untriaged (Issues and PRs which have not yet been triaged by a lead) labels on May 18, 2023.
@@ -122,6 +123,7 @@ static ProjectCodeRefactoringProvider.ExtensionInfo GetExtensionInfo(ExportCodeR
Func<string, IDisposable?> addOperationScope,
CancellationToken cancellationToken)
{
using (TelemetryLogging.LogBlockTimeAggregated(FunctionId.CodeRefactoring_Summary, "Pri" + (int)priority))
@mavasani (Member) · May 19, 2023:

So, I like the fact that you have added a lot of different fine-grained aggregated log blocks for different parameter combinations (such as specific priority, fix vs. refactoring, specific provider, etc.). The tricky part for someone new looking at this telemetry might be figuring out which ones to dig into for investigating their specific scenario. I think eventually adding a document, or even a brief summary somewhere, about the different types of events logged in each space (diagnostics, quick actions, etc.) with a small description would be helpful. Basically, documentation from an investigation perspective, not the actual implementation perspective. #Resolved

@ToddGrun (Contributor, Author):

I don't disagree, but it seems to me that such documentation would be very easy to get out of date. I think the dashboards are a nice piece of "documentation" from this perspective as they give the actual rundowns of the telemetry events that exist and are being fired.

var (allDiagnostics, upToDate) = await _diagnosticService.TryGetDiagnosticsForSpanAsync(
document, range, GetShouldIncludeDiagnosticPredicate(document, priorityProvider),
includeSuppressedDiagnostics: false, priorityProvider, DiagnosticKind.All, isExplicit: false, cancellationToken).ConfigureAwait(false);
var telemetryContext = "Pri" + (int)priorityProvider.Priority + "." + nameof(GetMostSevereFixAsync);
@mavasani (Member) · May 19, 2023:

I see that you added the name of the method here in the telemetry context, but not in other places. I am fine with either approach (appending the containing method name or not), but we probably want to be consistent? #Resolved

@ToddGrun (Contributor, Author):

Definitely agree that consistency is good, but from my viewpoint the most important thing is that the telemetry data, when viewed, is informative and can be reasoned about. I tried to name the events with that in mind in this PR, but I don't have the context you have in this area, so I'm definitely willing to change any of the event names if you think different names would be more informative. Not sure if you have looked at the dashboard with this information, as that might give a more holistic view of how the events relate.

(Member):

same comment about interpolated strings and interpolated string handlers.

@@ -216,6 +217,9 @@ public async Task<bool> TryGetAsync(ArrayBuilder<DiagnosticData> list, Cancellat
using var _2 = ArrayBuilder<AnalyzerWithState>.GetInstance(out var semanticSpanBasedAnalyzers);
using var _3 = ArrayBuilder<AnalyzerWithState>.GetInstance(out var semanticDocumentBasedAnalyzers);

var telemetryContext = "Pri" + (int)_priorityProvider.Priority;
using var _4 = TelemetryLogging.LogBlockTimeAggregated(FunctionId.RequestDiagnostics_Summary, telemetryContext);
@mavasani (Member) · May 19, 2023:

I am wondering if this will lead to duplication with the telemetry logged in the code fix service around the single call to diagnosticService.GetDiagnosticsForSpanAsync. The log block there is dedicated to the lightbulb Ctrl+Dot path, while the block here covers all of the following paths: lightbulb Ctrl+Dot, margin lightbulb background analysis, tagger, and all LSP pull diagnostics calls. If we want to log the telemetry here, you definitely need to add more parameterization to the context so we have different blocks for these code paths. The parameters to include would be _range, _incrementalAnalysis, and _diagnosticKind (maybe we want to log separate aggregated blocks per diagnostic kind and also one accumulated block that is agnostic of the diagnostic kind). #Resolved

(Member):

Oh, I am not sure of the performance penalty we pay by logging multiple aggregated blocks here. If it is expensive, then a single summary block per each code path would be fine, but we definitely need more context than just the priority here.

@ToddGrun (Contributor, Author):

I think I'd like to leave this to a follow-up PR where we can discuss what you would like to see here. Note that every piece of data that we use to build up the event name multiplies the potential number of event names, so including something like range in the name isn't desirable.

@@ -61,6 +62,37 @@ private PerformanceQueue GetQueue(bool forSpanAnalysis)

public void AddSnapshot(IEnumerable<AnalyzerPerformanceInfo> snapshot, int unitCount, bool forSpanAnalysis)
{
foreach (var perfInfo in snapshot)
@mavasani (Member) · May 19, 2023:

I presume the intent for this PR is to add aggregate logging for these snapshots, and a later PR will then delete our existing code that does custom aggregation of these snapshots and logs them in the background? #Resolved

@ToddGrun (Contributor, Author):

Our existing logging support in this area was a bit too complicated for me to tackle to start with. I feel like the old support can be mostly removed, but it's something I would like to discuss further with you outside this PR.

{
private TelemetryLogger? _telemetryLogger;
@mavasani (Member) · May 19, 2023:

Just to confirm: The fact that the new telemetry logger is only hooked up in the remote workspace and not host workspace is not a limitation of the logger but just the need for this PR, correct? #Resolved

@ToddGrun (Contributor, Author):

I'm not clear on this question, can you clarify? TelemetryLogger.Create is called both from RemoteWorkspaceTelemetryService and from VisualStudioWorkspaceTelemetryService.

_isDisposed = false;
_lock = new object();

_ = PostCollectedTelemetryAsync();
@sharwell (Member) · May 19, 2023:

❗ Asynchronous operations must be tracked either by the life of the calling method or through asynchronous operation listeners. Fire-and-forget methods always fall into the latter category. However, since this is a long-lived operation that won't terminate as expected by our tracking mechanism, the preferred strategy is to use an AsyncBatchingWorkQueue, which will report that it's complete whenever the queue is empty. #Resolved

@ToddGrun (Contributor, Author):

Didn't know about this, thanks! Switched to use this in commit 2.
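For readers unfamiliar with the pattern being suggested, here is a rough sketch of hanging the periodic flush off an AsyncBatchingWorkQueue rather than a fire-and-forget loop. The constructor shape is recalled from the Roslyn helper and may not match exactly; the class name, listener wiring, and flush method are placeholders rather than the PR's actual code:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.CodeAnalysis.Shared.TestHooks;   // IAsynchronousOperationListener (namespace approximate)
using Roslyn.Utilities;                          // AsyncBatchingWorkQueue (namespace approximate)

// Rough sketch, not the PR's implementation. The non-generic queue runs the
// batch callback once per delay window whenever at least one item is pending,
// and reports idle to the asynchronous operation listener when it is empty.
internal sealed class TelemetryPoster
{
    private readonly AsyncBatchingWorkQueue _postTelemetryQueue;

    public TelemetryPoster(IAsynchronousOperationListener listener, CancellationToken cancellationToken)
    {
        _postTelemetryQueue = new AsyncBatchingWorkQueue(
            TimeSpan.FromMinutes(30),          // collection period from the PR description
            PostCollectedTelemetryAsync,       // batch callback that flushes aggregated histograms
            listener,
            cancellationToken);
    }

    // Any log call can poke the queue; pending work is coalesced by the queue itself.
    public void EnsureTelemetryWillPost()
        => _postTelemetryQueue.AddWork();

    private ValueTask PostCollectedTelemetryAsync(CancellationToken cancellationToken)
    {
        // Actual flush of aggregated histograms elided.
        return default;
    }
}
```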

ToddGrun marked this pull request as ready for review on May 22, 2023 at 17:42
ToddGrun requested review from a team as code owners on May 22, 2023 at 17:42
@@ -112,6 +113,8 @@ private async Task InvokeAsync()
{
try
{
using var _ = TelemetryLogging.LogBlockTimeAggregated(FunctionId.SuggestedAction_Application_Summary, "Total");
@CyrusNajmabadi (Member) · May 22, 2023:

how would we know what suggested action this adds up to? or are we just wanting to get top level P numbers of invocation in general? #Resolved

@ToddGrun (Contributor, Author):

Aggregated collection won't tell us anything about what is being tracked other than a histogram of times and what is captured in the event name. So, the intent here is to capture a histogram of all "Total" costs of invoking a suggested action. If we find out that we are having perf issues in this area, then I could see us adding a TelemetryLogging.LogBlockTime call with a threshold time for firing individual events.

@@ -110,17 +111,22 @@ private partial class SuggestedActionsSource : IAsyncSuggestedActionsSource
// then pass it continuously from one priority group to the next.
var lowPriorityAnalyzers = new ConcurrentSet<DiagnosticAnalyzer>();

using var _2 = TelemetryLogging.LogBlockTimeAggregated(FunctionId.SuggestedAction_Summary, "Total");
@CyrusNajmabadi (Member) · May 22, 2023:

what's the diff between this and SuggestedAction_Application_Summary? Is it so we can measure what time is spent in the source, prior to actually doing computation at each pri level? #Resolved

@ToddGrun (Contributor, Author):

SuggestedAction_Summary is intended to measure the time to calculate the actions at each priority, whereas SuggestedAction_Application_Summary is intended to measure the time to apply the changes for an invoked suggested action.

@ToddGrun (Contributor, Author):

> Okey dokey, keep the cadence at 30 minutes, or would not doing it on shutdown change your opinion of that frequency?

Keeping at 30 minutes and removed disposal code.

ToddGrun requested review from a team as code owners on May 30, 2023 at 18:38
@ToddGrun (Contributor, Author):

hrm, I seem to have done something wrong rebasing against main (I did so as this branch was based off a main from a couple of weeks ago)

@ToddGrun (Contributor, Author) · Jun 1, 2023:

> hrm, I seem to have done something wrong rebasing against main (I did so as this branch was based off a main from a couple of weeks ago)

Fixed the rebase via a force push. Not sure what went wrong, but that problem is solved. The tests were failing because some test code requires AsyncBatchingWorkQueue to be completely empty to progress. I now no longer automatically queue a new item once I've processed the old one, instead waiting to do so until the log manager's GetLog method is called.

/// <summary>
/// Indicates whether <see cref="_postTelemetryQueue"/> has work already added to it.
/// </summary>
private int _workAdded = 0;
(Member):

not sure why this is needed. you can always just add more work to the queue. you're using the no-type-arg version, so it will be a no-op if there's already an item in the queue.

// The queue doesn't have any work pending. Add an item so PostCollectedTelemetryAsync
// will get fired after the collection period.
_postTelemetryQueue.AddWork();
}
(Member):

this doesn't seem necessary.

PostCollectedTelemetry();

// Reset to indicate the queue doesn't have any work pending.
_workAdded = 0;
(Member):

also not necessary. this is how the queue works to begin with. in the no-type-arg case you can think of it just as a beacon that pulses after the time limit has elapsed, as long as at least one work item was added to it.
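A minimal illustration of the suggestion, assuming the field names from the snippets above (this is a sketch of the proposed simplification, not code from the PR):

```csharp
// The _workAdded tracking isn't needed: calling AddWork while an item is
// already pending is effectively a no-op for the no-type-arg queue, so every
// log call can simply do this unconditionally.
_postTelemetryQueue.AddWork();
```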

if (elapsed >= _minThresholdMs)
{
const string Name = nameof(Name);
const string Value = nameof(Value);
(Member):

second time i've seen these constants. Place in common location?

return;

const string Name = nameof(Name);
const string Value = nameof(Value);
(Member):

another copy :)
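One possible shape for the suggested shared location; the class name and placement are hypothetical, while the constant names come from the snippets above:

```csharp
// Hypothetical shared home for the duplicated telemetry property-name constants.
internal static class TelemetryLoggingConstants
{
    public const string Name = nameof(Name);
    public const string Value = nameof(Value);
}
```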

namespace Microsoft.CodeAnalysis.Telemetry
{
[InterpolatedStringHandler]
internal readonly struct TelemetryLoggingInterpolatedStringHandler
(Member):

nice.
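For readers unfamiliar with the feature, here is a self-contained sketch of the interpolated-string-handler pattern; it is a generic illustration, not Roslyn's TelemetryLoggingInterpolatedStringHandler. The compiler lowers `$"..."` into AppendLiteral/AppendFormatted calls on the handler, and the out-bool constructor lets those calls be skipped entirely when logging is disabled, so the event name is never built unless it will actually be used:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Text;

[InterpolatedStringHandler]
internal readonly struct LazyLogMessageHandler
{
    private readonly StringBuilder? _builder;

    // shouldAppend = false tells the compiler to skip all Append* calls,
    // so the message is never formatted when logging is disabled.
    public LazyLogMessageHandler(int literalLength, int formattedCount, bool enabled, out bool shouldAppend)
    {
        _builder = enabled ? new StringBuilder(literalLength) : null;
        shouldAppend = enabled;
    }

    public void AppendLiteral(string value) => _builder!.Append(value);
    public void AppendFormatted<T>(T value) => _builder!.Append(value);
    public string GetFormattedText() => _builder?.ToString() ?? string.Empty;
}

internal static class SketchLogger
{
    // The "enabled" argument is forwarded into the handler's constructor by
    // the compiler via InterpolatedStringHandlerArgument.
    public static void Log(bool enabled, [InterpolatedStringHandlerArgument("enabled")] LazyLogMessageHandler message)
    {
        if (enabled)
            Console.WriteLine(message.GetFormattedText());
    }
}

internal static class Usage
{
    public static void Main()
    {
        var priority = 2;

        // The interpolation below is only materialized when the first
        // argument is true; otherwise no string is ever built.
        SketchLogger.Log(enabled: true, $"Pri{priority}.GetMostSevereFixAsync");
    }
}
```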

PerformAnalysis_Summary = 721,
RequestDiagnostics_Delay = 730,
RequestDiagnostics_Summary = 731,
SuggestedAction_Delay = 740,
(Member):

blank lines between the sections.

@CyrusNajmabadi (Member) left a review comment:

minor suggestions/requests. None of them really blocking. Ok for this to go in once you're satisfied the suggestions are handled to your liking.

@ToddGrun (Contributor, Author) · Jun 8, 2023:

/azp run

@azure-pipelines:

Azure Pipelines successfully started running 4 pipeline(s).

@ToddGrun (Contributor, Author) · Jun 8, 2023:

/azp run

@azure-pipelines:

Azure Pipelines successfully started running 4 pipeline(s).

@ToddGrun (Contributor, Author) · Jun 8, 2023:

/azp run

@azure-pipelines:

Pull request contains merge conflicts.

@ToddGrun (Contributor, Author) · Jun 9, 2023:

/azp run

@azure-pipelines:

Azure Pipelines successfully started running 4 pipeline(s).

ToddGrun merged commit 698af92 into dotnet:main on Jun 13, 2023
24 checks passed
ghost added this to the Next milestone on Jun 13, 2023
RikkiGibson modified the milestone from Next to 17.7 P3 on Jun 28, 2023