
Azure Monitor exporter timing out and throwing errors #862

Closed
acksmaggart opened this issue Feb 21, 2020 · 11 comments
Labels
azure (Microsoft Azure), bug

Comments

@acksmaggart

Problem

I am using the OpenCensus Azure extension to try to write trace information to App Insights. However, of the two to three dozen requests I have made to my server, only one is showing up in App Insights, and I keep getting timeout errors on my end and 500 errors from IIS on the other end. I am assuming that I'm just doing something wrong, but I can't figure out what it is.

Environment

I am on macOS.
Python version:

3.8.0 (default, Jan  8 2020, 13:35:00)
[Clang 10.0.1 (clang-1001.0.46.4)]

Package Versions:

opencensus==0.7.7
opencensus-context==0.1.1
opencensus-ext-azure==1.0.2
Flask==1.1.1
Flask-Cors==3.0.8

To Reproduce

Server code:

import os
import time
import random

from flask import Flask, jsonify, request
from flask_cors import CORS
from opencensus.ext.azure.trace_exporter import AzureExporter
from opencensus.trace.samplers import ProbabilitySampler
from opencensus.trace.tracer import Tracer

azure_exporter = AzureExporter(connection_string='InstrumentationKey=************')
tracer = Tracer(exporter=azure_exporter, sampler=ProbabilitySampler(1.0))

app = Flask(__name__)
CORS(app)

@app.route('/')
def handle_request():
    with tracer.span(name="handler.respond"):
        to_sleep = random.random() * 2
        time.sleep(to_sleep)
    return "done"

if __name__ == "__main__":
    app.run(debug=True, host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))

Client code:

$ curl localhost:8080

Expected Behavior

No timeouts or 500 errors, and every request should produce a new trace in App Insights (since I am using a probability sampler with probability=1.0).

Actual Behavior

I am seeing this error every few seconds:

Transient client side error HTTPSConnectionPool(host='dc.services.visualstudio.com', port=443): Read timed out. (read timeout=10.0).

and when I send a request I get this html back:

[Screenshot: HTML error page returned by the server]

Maybe this is just an App Insights server outage, but I'm guessing that I'm doing something wrong. Let me know if there is other info that would be helpful.

@acksmaggart
Author

Also, I've tried setting timeout=60 on the exporter, but that didn't help.

@lzchen
Contributor

lzchen commented Feb 26, 2020

@MaxTaggart
Are you still having this issue? The problem might have been on the ingestion side (Azure Monitor backend). Try it again and see if you are still getting these errors.

@lzchen
Contributor

lzchen commented Mar 27, 2020

@MaxTaggart Any updates on this?

@lzchen
Contributor

lzchen commented Apr 7, 2020

Closing due to inactivity.

@lzchen lzchen closed this as completed Apr 7, 2020
@acksmaggart
Author

acksmaggart commented Apr 9, 2020

@lzchen Yes, sorry for the radio silence. We are still getting those errors. I ran a 3-day test on our web service, and there are long stretches (anywhere from 1 to 20 hours) during which exception information isn't being pushed into Azure App Insights, and I'm seeing the error message above in my traces.

Transient client side error HTTPSConnectionPool(host='dc.services.visualstudio.com', port=443): Read timed out. (read timeout=10.0).

Although, since that message shows up in my traces, at least some trace information is making it into App Insights.

Also, every time I go to the OpenCensus documentation I see a banner implying that I should be using "OpenTelemetry" instead. Is OpenCensus still the Microsoft-endorsed telemetry library for Python, or should I be using something else?

@lzchen
Contributor

lzchen commented Apr 9, 2020

Although that does mean the trace information is making it into App Insights at least.

Just to clarify: are you able to see trace information in App Insights while the error message is being shown? I want to determine whether the telemetry corresponding to the error message is being sent to App Insights, or whether the trace information you are seeing comes from successful exports that do not generate the error message.

Also, every time I go to the OpenCensus documentation I see a banner implying that I should be using "OpenTelemetry" instead. Is OpenCensus still the Microsoft-endorsed telemetry library for Python, or should I be using something else?

Good question! The OpenCensus Azure exporter is currently Microsoft's officially supported APM solution for Python applications. We are also investing heavily in OpenTelemetry for Python as the future of vendor-neutral APM. However, OpenTelemetry is still in beta and probably will not reach GA until sometime next year, so Microsoft recommends using OpenCensus in production environments. We will have a migration plan for customers on OpenCensus once OpenTelemetry goes GA. If you want to try OpenTelemetry yourself, though, feel free to do so; we would love the feedback! :)

@lzchen
Contributor

lzchen commented Apr 9, 2020

@MaxTaggart
This issue seems to be common when the endpoint health (Application Insights backend) is degraded: too many requests are being sent to overloaded storage clusters. We don't have an SLA on ingestion request latency, so you should implement appropriate timeout strategies. Try setting the network timeout configuration in AzureExporter to something greater (the default is 10.0 seconds).

azure_exporter = AzureExporter(connection_string='InstrumentationKey=************', timeout=30.0)
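Beyond raising the timeout, the "appropriate timeout strategies" mentioned above can be sketched as a generic retry wrapper with exponential backoff. This is not the exporter's actual code, just an illustration; `send` and `flaky_send` are hypothetical stand-ins for whatever call performs the HTTP export:

```python
import time

def send_with_backoff(send, payload, max_attempts=4, base_delay=1.0):
    """Retry a transiently failing send with exponential backoff.

    `send` is any callable that raises TimeoutError on a transient
    failure (e.g. a read timeout) and returns normally on success.
    """
    for attempt in range(max_attempts):
        try:
            return send(payload)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Demo: a sender that times out twice, then succeeds.
calls = {"n": 0}
def flaky_send(payload):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("read timed out")
    return "ok"

result = send_with_backoff(flaky_send, {"trace": "..."}, base_delay=0.01)
print(result)  # → ok
```

The backoff spreads retries out so an overloaded ingestion endpoint isn't hammered with immediate re-sends.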

@acksmaggart
Author

Thanks for the follow-up! I will try adjusting the timeout. I currently also have my sampling rate set to 1.0 for testing, but I could dial that down too.

Just to clarify, are you able to see trace information in App insights when the error message is shown?

Yes, the error message appears in the traces table in the App Insights logs, so the exception message is being sent successfully to App Insights, just not the dependencies or requests data.
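For reference, dialing the sampling rate down simply reduces the fraction of traces that get exported, which lowers the load on the ingestion endpoint. A minimal sketch of the idea in plain Python (not the library's internals, just what a probability sampler does):

```python
import random

def should_sample(rate):
    """Keep a trace with probability `rate` (0.0 drops everything, 1.0 keeps everything)."""
    return random.random() < rate

random.seed(0)  # fixed seed so the demo is repeatable
kept = sum(should_sample(0.1) for _ in range(10_000))
print(kept)  # roughly 1,000 of the 10,000 traces would be exported
```

With `ProbabilitySampler(1.0)`, as in the reproduction above, every request produces an export, so any ingestion delay is felt on every request.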

@jonasmiederer

I am currently facing the same problem. Did you find a solution for that, @MaxTaggart?

@SanthoshMedide

I am facing the same issue and adding a timeout is not resolving it. Is there any solution yet?

@lzchen
Contributor

lzchen commented Jun 11, 2020

@SanthoshMedide
Yes, this is not an SDK issue; it is an ingestion endpoint delay (see this comment). When this message appears, your telemetry is deemed "failed retryable", and the SDK should attempt to send it again once the ingestion service is no longer backed up. You should eventually see your telemetry in App Insights.
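A toy illustration of what "failed retryable" implies (an assumption about the general buffer-and-resend pattern, not the SDK's actual implementation): items that fail with a transient error are kept in a local buffer and retried on the next flush.

```python
from collections import deque

def flush(buffer, transmit):
    """Try to send everything in the buffer; re-queue retryable failures."""
    retained = deque()
    while buffer:
        item = buffer.popleft()
        try:
            transmit(item)
        except TimeoutError:
            retained.append(item)  # keep for the next flush cycle
    buffer.extend(retained)

buf = deque(["t1", "t2"])

# First flush: the endpoint is timing out, so everything is retained.
def endpoint_down(_item):
    raise TimeoutError("read timed out")
flush(buf, endpoint_down)
print(len(buf))  # → 2

# Later flush: the endpoint has recovered, so the buffer drains.
delivered = []
flush(buf, delivered.append)
print(len(buf), delivered)  # → 0 ['t1', 't2']
```

This is why telemetry that coincides with the error message still shows up later, once the ingestion service recovers.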
