
Azure Monitor exporter timing out and throwing errors #862

Closed
acksmaggart opened this issue Feb 21, 2020 · 11 comments
Labels
azure (Microsoft Azure), bug

Comments

@acksmaggart

Problem

I am using the OpenCensus Azure extension to try to write trace information to App Insights. However, of the two to three dozen requests I have made to my server, only one is showing up in App Insights, and I keep getting timeout errors on my end and 500 errors from IIS on the other end. I am assuming that I'm just doing something wrong, but I can't figure out what it is.

Environment

I am on macOS.
Python version:

3.8.0 (default, Jan  8 2020, 13:35:00)
[Clang 10.0.1 (clang-1001.0.46.4)]

Package Versions:

opencensus==0.7.7
opencensus-context==0.1.1
opencensus-ext-azure==1.0.2
Flask==1.1.1
Flask-Cors==3.0.8

To Reproduce

Server code:

import os
import time
import random

from flask import Flask, jsonify, request
from flask_cors import CORS
from opencensus.ext.azure.trace_exporter import AzureExporter
from opencensus.trace.samplers import ProbabilitySampler
from opencensus.trace.tracer import Tracer

azure_exporter = AzureExporter(connection_string='InstrumentationKey=************')
tracer = Tracer(exporter=azure_exporter, sampler=ProbabilitySampler(1.0))

app = Flask(__name__)
CORS(app)

@app.route('/')
def handle_request():
    with tracer.span(name="handler.respond"):
        to_sleep = random.random() * 2
        time.sleep(to_sleep)
    return "done"

if __name__ == "__main__":
    app.run(debug=True, host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))

Client code:

$ curl localhost:8080

Expected Behavior

No timeouts or 500 errors, and every request should produce a new trace in App Insights (since I am using a probability sampler with probability=1.0).

Actual Behavior

I am seeing this error every few seconds:

Transient client side error HTTPSConnectionPool(host='dc.services.visualstudio.com', port=443): Read timed out. (read timeout=10.0).

and when I send a request I get this html back:

[Screenshot: HTML error page returned by the server]

Maybe this is just an App Insights server outage, but I'm guessing that I'm doing something wrong. Let me know if there is other info that would be helpful.

@acksmaggart
Author

Also, I've tried setting timeout=60 on the exporter, but that didn't help.

@lzchen
Contributor

lzchen commented Feb 26, 2020

@MaxTaggart
Are you still having this issue? The problem might have been on the ingestion side (Azure Monitor backend). Try it again and see if you are still getting these errors.

@lzchen
Contributor

lzchen commented Mar 27, 2020

@MaxTaggart Any updates on this?

@lzchen
Contributor

lzchen commented Apr 7, 2020

Closing due to inactivity.

@lzchen lzchen closed this as completed Apr 7, 2020
@acksmaggart
Author

acksmaggart commented Apr 9, 2020

@lzchen Yes, sorry for the radio silence. We are still getting those errors. I ran a 3-day test on our web service, and there are long stretches (anywhere from 1 to 20 hours) during which exception information isn't being pushed into Azure App Insights, and I'm seeing the error message above in my traces.

Transient client side error HTTPSConnectionPool(host='dc.services.visualstudio.com', port=443): Read timed out. (read timeout=10.0).

Although, since that message shows up in my traces, at least some trace information is making it into App Insights.

Also, every time I go to the OpenCensus documentation I see a banner implying that I should be using "OpenTelemetry" instead. Is OpenCensus still the Microsoft-endorsed telemetry library for Python, or should I be using something else?

@lzchen
Contributor

lzchen commented Apr 9, 2020

Although that does mean the trace information is making it into App Insights at least.

Just to clarify: are you able to see trace information in App Insights while the error message is being shown? I want to determine whether the telemetry corresponding to the error message is being sent to App Insights, or whether the trace information you are seeing comes from successful exports that do not generate the error message.

Also, every time I go to the OpenCensus documentation I see a banner implying that I should be using "OpenTelemetry" instead. Is OpenCensus still the Microsoft-endorsed telemetry library for Python, or should I be using something else?

Good question! The OpenCensus Azure exporter is currently Microsoft's officially supported APM solution for Python applications. We are also investing heavily in OpenTelemetry for Python as the future of vendor-neutral APM. However, OpenTelemetry is still in beta and probably will not reach GA until sometime next year, so Microsoft recommends using OpenCensus in production environments. We will have a migration plan for customers on OpenCensus once OpenTelemetry goes GA. If you want to try OpenTelemetry yourself, though, feel free to do so; we would love the feedback! :)

@lzchen
Contributor

lzchen commented Apr 9, 2020

@MaxTaggart
This issue seems to be common when the endpoint health (Application Insights backend) is degraded: too many requests are being sent to overloaded storage clusters. We don't have an SLA on ingestion request latency, so you should implement appropriate timeout strategies. Try setting the network timeout configuration in AzureExporter to something greater (the default is 10.0 seconds).

azure_exporter = AzureExporter(connection_string='InstrumentationKey=************', timeout=30.0)
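Beyond raising the timeout, the "appropriate timeout strategies" mentioned above can be sketched as a generic retry wrapper with exponential backoff. This is not the exporter's actual code, just an illustration; `send` and `flaky_send` are hypothetical stand-ins for whatever call performs the HTTP export:

```python
import time

def send_with_backoff(send, payload, max_attempts=4, base_delay=1.0):
    """Retry a transiently failing send with exponential backoff.

    `send` is any callable that raises TimeoutError on a transient
    failure (e.g. a read timeout) and returns normally on success.
    """
    for attempt in range(max_attempts):
        try:
            return send(payload)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Demo: a sender that times out twice, then succeeds.
calls = {"n": 0}
def flaky_send(payload):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("read timed out")
    return "ok"

result = send_with_backoff(flaky_send, {"trace": "..."}, base_delay=0.01)
print(result)  # → ok
```

The backoff spreads retries out so an overloaded ingestion endpoint isn't hammered with immediate re-sends.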

@acksmaggart
Author

Thanks for the follow-up! I will try adjusting the timeout. I currently also have my sampling rate set to 1.0 for testing, but I could dial that down too.

Just to clarify, are you able to see trace information in App insights when the error message is shown?

Yes, the error message appears in the traces table in the App Insights logs, so the exception message is being sent successfully to App Insights, just not the dependencies or requests data.
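For reference, dialing the sampling rate down simply reduces the fraction of traces that get exported, which lowers the load on the ingestion endpoint. A minimal sketch of the idea in plain Python (not the library's internals, just what a probability sampler does):

```python
import random

def should_sample(rate):
    """Keep a trace with probability `rate` (0.0 drops everything, 1.0 keeps everything)."""
    return random.random() < rate

random.seed(0)  # fixed seed so the demo is repeatable
kept = sum(should_sample(0.1) for _ in range(10_000))
print(kept)  # roughly 1,000 of the 10,000 traces would be exported
```

With `ProbabilitySampler(1.0)`, as in the reproduction above, every request produces an export, so any ingestion delay is felt on every request.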

@jonasmiederer

I am currently facing the same problem. Did you find a solution for that, @MaxTaggart?

@SanthoshMedide

I am facing the same issue and adding a timeout is not resolving it. Is there any solution yet?

@lzchen
Contributor

lzchen commented Jun 11, 2020

@SanthoshMedide
Yes, this is not an SDK issue; it is an ingestion endpoint delay (see this comment). When this message appears, your telemetry is deemed "failed retryable", and the SDK should attempt to send it again once the ingestion service is no longer backed up. You should eventually see your telemetry in App Insights.
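A toy illustration of what "failed retryable" implies (an assumption about the general buffer-and-resend pattern, not the SDK's actual implementation): items that fail with a transient error are kept in a local buffer and retried on the next flush.

```python
from collections import deque

def flush(buffer, transmit):
    """Try to send everything in the buffer; re-queue retryable failures."""
    retained = deque()
    while buffer:
        item = buffer.popleft()
        try:
            transmit(item)
        except TimeoutError:
            retained.append(item)  # keep for the next flush cycle
    buffer.extend(retained)

buf = deque(["t1", "t2"])

# First flush: the endpoint is timing out, so everything is retained.
def endpoint_down(_item):
    raise TimeoutError("read timed out")
flush(buf, endpoint_down)
print(len(buf))  # → 2

# Later flush: the endpoint has recovered, so the buffer drains.
delivered = []
flush(buf, delivered.append)
print(len(buf), delivered)  # → 0 ['t1', 't2']
```

This is why telemetry that coincides with the error message still shows up later, once the ingestion service recovers.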
