APM agent crashes in Celery at random times during the day #823

Closed
lfrodrigues opened this issue May 8, 2020 · 21 comments · Fixed by #838
Comments

@lfrodrigues

I'm not sure how to reproduce it, but I've had this issue in production for several months. The crash happens randomly throughout the day, 20-50 times (we run 400k+ tasks per day).

The crash is:

TransportException: HTTP 400: {"accepted":0,"errors":[{"message":"gzip: invalid header"}]}

  File "elasticapm/transport/base.py", line 208, in _flush
    self.send(data)
  File "elasticapm/transport/http.py", line 102, in send
    raise TransportException(message, data, print_trace=print_trace)

Environment (please complete the following information)

  • OS: Ubuntu 18.04
  • Python version: 3.6
  • Framework and version: Django 2.2 Celery 4.4.2
  • APM Server version: 7.6.0
  • Agent version: 5.3.2
@basepi
Contributor

basepi commented May 8, 2020

Hmmm, I wonder how we're managing to send an invalid gzip header? Very odd. We'll see if we can track this down next week.

Can you do a `pip freeze` so we can see all your Python package versions?

@lfrodrigues
Author

Here you are:

africastalking==1.1.5
amqp==2.5.2
apipkg==1.5
appdirs==1.4.3
asgiref==3.2.7
asn1crypto==1.3.0
astroid==2.3.3
attrs==19.3.0
Babel==2.8.0
backcall==0.1.0
bandit==1.6.2
beautifulsoup4==4.8.2
billiard==3.6.3.0
black==19.10b0
block-disposable-email==1.0.1
boto==2.44.0
boto3==1.4.2
botocore==1.4.93
Brotli==1.0.7
cachetools==4.1.0
cairocffi==1.0.2
CairoSVG==2.4.2
celery==4.4.2
certifi==2020.4.5.1
cffi==1.14.0
chardet==3.0.4
click==7.1.1
contextlib2==0.6.0.post1
coreapi==2.3.3
coreschema==0.0.4
coverage==5.0.4
cryptography==2.6.1
cssselect2==0.3.0
cycler==0.10.0
decorator==4.4.2
defusedxml==0.6.0
distlib==0.3.0
Django==2.2
django-anymail==7.1.0
django-datatables-view==1.17.0
django-disposable-email-checker==1.2.1
django-extensions==2.2.9
django-phonenumber-field==4.0.0
django-prometheus==2.0.0
django-rest-knox==4.0.1
django-rest-logger==1.0.3
django-storages==1.5.1
django-stubs==1.5.0
django-webtest==1.9.7
djangorestframework==3.9.4
djangorestframework-stubs==1.1.0
dnspython==1.16.0
dnsq==1.1.6
docopt==0.6.2
docutils==0.16
dodgy==0.2.1
elastic-apm==5.3.2
execnet==1.7.1
expiringdict==1.2.0
facebook-sdk==3.1.0
factory-boy==2.12.0
Faker==4.0.2
filelock==3.0.12
flanker==0.9.11
flower==0.9.4
gitdb==4.0.2
GitPython==3.1.0
google-api-core==1.17.0
google-auth==1.14.1
google-cloud-core==1.3.0
google-cloud-translate==2.0.0
googleapis-common-protos==1.51.0
grpcio==1.28.1
gunicorn==20.0.4
html5lib==1.0.1
humanize==0.5.1
idna==2.9
importlib-metadata==1.6.0
importlib-resources==1.4.0
ipython==7.11.1
ipython-genutils==0.2.0
isort==4.3.21
itypes==1.2.0
jedi==0.16.0
Jinja2==2.11.2
jmespath==0.9.5
joblib==0.14.1
JSON-log-formatter==0.2.0
kiwisolver==1.2.0
kombu==4.6.8
lazy-object-proxy==1.4.3
MarkupSafe==1.1.1
matplotlib==3.2.1
mccabe==0.6.1
messagebird==1.2.0
mock==3.0.5
more-itertools==8.2.0
mypy==0.770
mypy-extensions==0.4.3
numpy==1.16.2
oauthlib==3.1.0
packaging==20.3
pandas==0.24.2
parso==0.6.2
pathspec==0.7.0
pbr==5.4.4
pep8-naming==0.4.1
pexpect==4.8.0
phonenumbers==8.12.2
pickleshare==0.7.5
Pillow==6.0.0
pipenv==2018.11.26
pluggy==0.13.1
ply==3.11
prometheus-client==0.7.1
prompt-toolkit==2.0.10
prospector==1.2.0
protobuf==3.11.3
psycopg2-binary==2.8.5
ptpython==2.0.6
ptyprocess==0.6.0
py==1.8.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycodestyle==2.4.0
pycparser==2.20
pydocstyle==5.0.2
pyflakes==2.1.1
Pygments==2.6.1
PyJWT==1.7.1
pylint==2.4.4
pylint-celery==0.3
pylint-django==2.0.12
pylint-flask==0.6
pylint-plugin-utils==0.6
pyparsing==2.4.7
Pyphen==0.9.5
pytest==5.3.4
pytest-cov==2.8.1
pytest-django==3.8.0
pytest-forked==1.1.3
pytest-pythonpath==0.7.3
pytest-xdist==1.31.0
python-dateutil==2.8.1
pytz==2019.3
pyxero==0.9.1
PyYAML==5.3.1
raven==6.10.0
regex==2020.4.4
requests==2.23.0
requests-oauthlib==1.3.0
requirements-detector==0.6
responses==0.10.9
rsa==4.0
s3transfer==0.1.13
schema==0.7.2
scikit-learn==0.21.2
scipy==1.4.1
scorecardpy==0.1.7
setoptconf==0.2.0
six==1.14.0
smmap==3.0.1
snowballstemmer==2.0.0
soupsieve==2.0
sqlparse==0.3.1
stevedore==1.32.0
text-unidecode==1.3
tinycss2==1.0.2
tld==0.12
toml==0.10.0
tornado==6.0.4
traitlets==4.3.3
typed-ast==1.4.1
typing==3.7.4.1
typing-extensions==3.7.4.2
Unidecode==1.1.1
uritemplate==3.0.1
urllib3==1.25.9
vine==1.3.0
virtualenv==20.0.18
virtualenv-clone==0.5.4
waitress==1.4.3
wcwidth==0.1.9
WeasyPrint==47
webencodings==0.5.1
WebOb==1.8.6
WebTest==2.0.34
whitenoise==5.0.1
wrapt==1.11.2
xgboost==0.90
zipp==3.1.0
zipstream==1.1.4

@basepi
Contributor

basepi commented May 11, 2020

I wonder if #783 fixed this? Can you try upgrading to 5.6.0 and see if the crashes go away?

@lfrodrigues
Author

Sure, just did. I'll let you know.

@lfrodrigues
Author

lfrodrigues commented May 12, 2020

The crash is still happening randomly; since I enabled it last night I've gotten around 50 crashes. I'm also seeing a few new errors since I deployed 5.6.0.

I'm getting lots of Celery data in APM, so I assume everything is configured correctly.
Any idea what could be happening, or how I can help debug this?

Here are the multiple crashes I have in sentry:

TransportException: HTTP 403: {"error":"forbidden request: Agent remote configuration is disabled. Configure the `apm-server.kibana` section in apm-server.yml to enable it. If you are using a RUM agent, you also need to configure the `apm-server.rum` section. If you are not using remote configuration, you can safely ignore this error."}

  File "elasticapm/transport/base.py", line 227, in _flush
    self.send(data)
  File "elasticapm/transport/http.py", line 105, in send
    raise TransportException(message, data, print_trace=print_trace)
TransportException: HTTP 400: 400 Bad Request
  File "elasticapm/transport/base.py", line 227, in _flush
    self.send(data)
  File "elasticapm/transport/http.py", line 105, in send
    raise TransportException(message, data, print_trace=print_trace)
TransportException: HTTP 400: {"accepted":0,"errors":[{"message":"gzip: invalid header"}]}

  File "elasticapm/transport/base.py", line 227, in _flush
    self.send(data)
  File "elasticapm/transport/http.py", line 105, in send
    raise TransportException(message, data, print_trace=print_trace)

@basepi
Contributor

basepi commented May 12, 2020

I'm pretty certain these errors are around remote config, not the normal data sending to the APM Server. When you say it crashes, does the Celery worker actually stop or restart? Or do you just see an error log? Crashing is definitely not behavior we want, so I'm going to figure out how to make these errors less intrusive and more useful.

As a band-aid fix, assuming you're not actually using the remote config feature, you can disable it by setting `CENTRAL_CONFIG` to `False`. I'm betting that will stop the errors until I can get a fix in.
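
For reference, with the Django integration that looks like the snippet below (a minimal sketch; `MyService` is a placeholder, and setting the `ELASTIC_APM_CENTRAL_CONFIG` environment variable should work too, assuming the agent's usual `ELASTIC_APM_` env-var prefix):

```python
# settings.py -- stop the agent from polling the APM Server for
# remote (central) configuration.
ELASTIC_APM = {
    "SERVICE_NAME": "MyService",  # placeholder name
    "CENTRAL_CONFIG": False,
}
```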

@basepi
Contributor

basepi commented May 12, 2020

One other question: what are you using for your celery pool? Are you using prefork or one of the thread options?

@lfrodrigues
Author

We get a crash report in Sentry, but I'm not sure what exactly is crashing. I'm adding two more errors I'm seeing below, in case it helps.

We're using prefork.
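
(For readers following along: the pool is chosen at worker startup. A minimal sketch of the prefork setup under discussion; the app name `proj` and broker URL are hypothetical, and the equivalent CLI flag is `--pool=prefork`.)

```python
from celery import Celery

app = Celery("proj", broker="amqp://localhost")  # hypothetical app and broker

# prefork (the default) runs tasks in a pool of forked child processes,
# each of which Celery may terminate and replace over time.
app.conf.worker_pool = "prefork"  # alternatives: "solo", "threads", "gevent", "eventlet"
```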

ConnectionResetError: [Errno 104] Connection reset by peer
  File "urllib3/response.py", line 437, in _error_catcher
    yield
  File "urllib3/response.py", line 515, in read
    data = self._fp.read() if not fp_closed else b""
  File "http/client.py", line 459, in read
    s = self.fp.read()
  File "python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)

ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
  File "elasticapm/transport/base.py", line 227, in _flush
    self.send(data)
  File "elasticapm/transport/http.py", line 96, in send
    body = response.read()
  File "urllib3/response.py", line 541, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "urllib3/response.py", line 455, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
timeout: timed out
  File "urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
    # Permission is hereby granted, free of charge, to any person obtaining a copy
  File "urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "raven/breadcrumbs.py", line 390, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "http/client.py", line 1331, in getresponse
    response.begin()
  File "http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "http/client.py", line 258, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)

ReadTimeoutError: HTTPConnectionPool(host='localhost', port=8200): Read timed out. (read timeout=5)
  File "elasticapm/transport/http.py", line 82, in send
    "POST", url, body=data, headers=headers, timeout=self._timeout, preload_content=False
  File "urllib3/poolmanager.py", line 336, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "elasticapm/instrumentation/packages/base.py", line 205, in call_if_sampling
    return wrapped(*args, **kwargs)
  File "urllib3/connectionpool.py", line 725, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "urllib3/util/retry.py", line 403, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "urllib3/packages/six.py", line 735, in reraise
    raise value
  File "urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "urllib3/connectionpool.py", line 428, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "urllib3/connectionpool.py", line 336, in _raise_timeout
    self, url, "Read timed out. (read timeout=%s)" % timeout_value

TransportException: Unable to reach APM Server: HTTPConnectionPool(host='localhost', port=8200): Read timed out. (read timeout=5) (url: http://localhost:8200/intake/v2/events)
  File "elasticapm/transport/base.py", line 227, in _flush
    self.send(data)
  File "elasticapm/transport/http.py", line 95, in send
    raise TransportException(message, data, print_trace=print_trace)

@lfrodrigues
Author

No, we're not using central config; I disabled it as suggested.

@basepi
Contributor

basepi commented May 12, 2020

Are there any errors present in the APM Server logs? I wonder if it's overloaded or needs to be tuned. Those errors don't appear to be anything on the agent side; they could be network issues or APM Server load issues.

I'm still a bit mystified about the 400 errors that you're seeing. I'm going to do some more digging.

@basepi
Contributor

basepi commented May 12, 2020

Do you have any sort of proxy that could be modifying headers? The strangest piece of those errors is the fact that they're intermittent. We gzip-compress all data that we send to the APM server to reduce payload sizes, and I can't imagine why most data would be getting through fine but some would have invalid headers...

@lfrodrigues
Author

No, nothing in the logs. It can't be the network, since on this particular server I installed the APM Server locally.

No proxy either; Celery runs on the same server as the APM Server, and I'm using the minimal default settings in Django to get APM working:

ELASTIC_APM = {'SERVICE_NAME': 'MyService', 'CENTRAL_CONFIG': False}

@basepi
Contributor

basepi commented May 13, 2020

On the timeouts, it may be that your APM Server is a touch overloaded and thus responding slowly. You can raise that timeout period from the 5-second default with the `SERVER_TIMEOUT` setting.
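
In the Django settings dict that would look something like this (a sketch; the 10s value matches what the reporter tries below):

```python
# settings.py -- give a slow or overloaded APM Server more time to respond
ELASTIC_APM = {
    "SERVICE_NAME": "MyService",  # placeholder name
    "CENTRAL_CONFIG": False,
    "SERVER_TIMEOUT": "10s",      # default is "5s"
}
```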

I'm still trying to track down the source of the gzip errors. One of the APM Server devs pointed out that golang will throw that error in two situations:

The error is not related to any HTTP headers. Go throws it for:

1. A malformed "magic number" sequence: https://github.com/golang/go/blob/cb11c981df7b4dc40550ab71cc097c25d24d7a71/src/compress/gzip/gunzip.go#L186
2. A missing null byte in the Name or Comment field of the header: https://github.com/golang/go/blob/cb11c981df7b4dc40550ab71cc097c25d24d7a71/src/compress/gzip/gunzip.go#L146

I haven't been able to figure out how we would botch either of those, especially intermittently like you're seeing. It may be a bug in Python's gzip library... anyway, I'm still digging.
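
To make the first failure mode concrete, here's a self-contained sketch: a gzip stream starts with the two-byte magic number `\x1f\x8b`, and clobbering even the first byte makes any decoder reject the whole payload, which is consistent with a partially written buffer causing intermittent failures.

```python
import gzip

payload = gzip.compress(b'{"span": {"id": "abc"}}')
assert payload[:2] == b"\x1f\x8b"  # the gzip magic number Go checks first

corrupted = b"\x00" + payload[1:]  # simulate a clobbered first byte
try:
    gzip.decompress(corrupted)
except OSError as exc:  # gzip.BadGzipFile on Python >= 3.8
    print("decoder rejects it:", exc)  # e.g. "Not a gzipped file"
```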

@lfrodrigues
Author

This specific server is massively over-provisioned, with a load average consistently below the available cores, so I find it strange that it would be a server timeout. I increased the timeout to 10s and will let you know if anything changes.

@lfrodrigues
Author

Maybe all these issues are related to Celery workers exiting when the APM agent wasn't expecting it?

Here's another crash from today.

SystemExit: 155
  File "billiard/pool.py", line 292, in __call__
    sys.exit(self.workloop(pid=pid))
  File "billiard/pool.py", line 283, in exit
    return _exit(status)

ValueError: close timed out
  File "celery/utils/dispatch/signal.py", line 288, in send
    response = receiver(signal=self, sender=sender, **named)
  File "elasticapm/contrib/celery/__init__.py", line 80, in worker_shutdown
    client.close()
  File "elasticapm/base.py", line 296, in close
    manager.stop_thread()
  File "elasticapm/transport/base.py", line 260, in close
    raise ValueError("close timed out")

[Screenshot attached: 2020-05-14 at 15:19]

@basepi
Contributor

basepi commented May 14, 2020

That could very well be the case. I could see that causing most of the issues that you're seeing -- in fact, if the celery worker shuts down while a message is being written to the gzip buffer in memory, it's possible that could be causing the corruption that results in the original error you reported.

Perhaps it would be worth turning off `worker_max_tasks_per_child` temporarily to see if that resolves the errors? Then we could investigate ways to make the workers shut down more gracefully, if that is indeed the issue.
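
For context, the setting in question looks like this (a sketch; the app name is hypothetical). Celery's default is no limit, so removing the setting disables recycling entirely:

```python
from celery import Celery

app = Celery("proj")  # hypothetical app name

# Recycle each prefork child after it has run 100 tasks. Every recycle
# terminates a process, potentially while the APM transport thread is
# mid-flush, which is the suspected trigger in this thread.
app.conf.worker_max_tasks_per_child = 100

# To test the theory: delete the line above (default is None, i.e. no limit).
```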

@lfrodrigues
Author

Disabled; I'll let you know tomorrow.

@lfrodrigues
Author

It looks like the worker theory is correct.

I'm left with one crash in Sentry, happening between 1am and 2am, that I hadn't even noticed before; it's related to the size of the event.

We're inserting around 700k new rows into a table between 1am and 2am. Is there a smart way to prevent sending this huge SQL statement to the APM Server?

dropping flushed data due to transport failure back-off
TransportException: HTTP 400: {"accepted":0,"errors":[{"message":"event exceeded the permitted size.","document":"{\"span\": {\"id\": \"01f00bedcb0e54eb\", \"transaction_id\": \"f1559737ffb09161\", \"trace_id\": \"f0ed1884c22422ee3af54fb9d9761d24\", \"parent_id\": \"f1559737ffb09161\", \"INFO ON THE SQL WITH LOTS OF STUFF...
  File "elasticapm/transport/base.py", line 227, in _flush
    self.send(data)
  File "elasticapm/transport/http.py", line 105, in send
    raise TransportException(message, data, print_trace=print_trace)
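
One possible client-side workaround (an editor's sketch only, assuming the agent's `PROCESSORS` hook and the usual span layout of `context.db.statement`; the 10,000-character cap and module path are made up) is a custom processor that trims long DB statements before spans are queued:

```python
# myapp/apm_processors.py (hypothetical module)
from elasticapm.conf.constants import SPAN
from elasticapm.processors import for_events

MAX_STATEMENT_LEN = 10_000  # arbitrary cap for illustration

@for_events(SPAN)
def truncate_db_statement(client, event):
    # Trim oversized SQL in the span context so the intake payload stays
    # under the APM Server's per-event size limit.
    db = event.get("context", {}).get("db")
    if db and isinstance(db.get("statement"), str):
        db["statement"] = db["statement"][:MAX_STATEMENT_LEN]
    return event
```

Note that setting `PROCESSORS` replaces the agent's default list, so the built-in sanitizers would need to be listed alongside `myapp.apm_processors.truncate_db_statement`.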

@basepi
Contributor

basepi commented May 18, 2020

Yes, we should definitely truncate that. Can you create a separate issue for the SQL truncation? I want to keep this issue focused on making the transport shut down more gracefully.

@lfrodrigues
Author

Opened #827

basepi added a commit to basepi/apm-agent-python that referenced this issue May 27, 2020
Previously these would show up as big tracebacks in the logs which
was ugly, considering all sorts of things can go wrong when suddenly
closing the transport (such as when a celery worker terminates)

Fixes elastic#823
@basepi
Contributor

basepi commented May 27, 2020

Opened #838 to address these errors and make them more descriptive.
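
The shape of the fix, as described in the commit message above (a sketch of the pattern only, not the actual diff in #838): catch anything raised while the transport closes and log it, rather than letting a traceback escape the Celery `worker_shutdown` signal handler.

```python
import logging

logger = logging.getLogger("elasticapm")

def close_client_quietly(client):
    # All sorts of things can go wrong when a worker terminates
    # mid-flush, so log the failure instead of re-raising it.
    try:
        client.close()
    except Exception:
        logger.warning("error while closing elasticapm client", exc_info=True)
```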

basepi added a commit that referenced this issue Jun 5, 2020
* Catch and log any exceptions while closing the transport

Previously these would show up as big tracebacks in the logs which
was ugly, considering all sorts of things can go wrong when suddenly
closing the transport (such as when a celery worker terminates)

Fixes #823

* Remove trailing comma

I don't know how this happened, I think black added it? But then it
didn't re-add it this time when I removed it. Probably more of
black's magic comma stuff that has been added recently

* Add to changelog
romulorosa pushed a commit to romulorosa/apm-agent-python that referenced this issue Oct 15, 2020
beniwohli pushed a commit to beniwohli/apm-agent-python that referenced this issue Sep 14, 2021