APM agent crashes in Celery at random times during the day #823

Closed
lfrodrigues opened this issue May 8, 2020 · 21 comments · Fixed by #838
Comments

@lfrodrigues

I'm not sure how to reproduce it, but I've had this issue in production for several months. The crash happens randomly throughout the day, 20-50 times (we run 400k+ tasks per day).

The crash is:

TransportException: HTTP 400: {"accepted":0,"errors":[{"message":"gzip: invalid header"}]}

  File "elasticapm/transport/base.py", line 208, in _flush
    self.send(data)
  File "elasticapm/transport/http.py", line 102, in send
    raise TransportException(message, data, print_trace=print_trace)

Environment (please complete the following information)

  • OS: Ubuntu 18.04
  • Python version: 3.6
  • Framework and version: Django 2.2 Celery 4.4.2
  • APM Server version: 7.6.0
  • Agent version: 5.3.2
@basepi
Contributor

basepi commented May 8, 2020

Hmmm, I wonder how we're managing to send an invalid gzip header? Very odd. We'll see if we can track this down next week.

Can you do a `pip freeze` so we can see all your Python package versions?

@lfrodrigues
Author

Here you are:

africastalking==1.1.5
amqp==2.5.2
apipkg==1.5
appdirs==1.4.3
asgiref==3.2.7
asn1crypto==1.3.0
astroid==2.3.3
attrs==19.3.0
Babel==2.8.0
backcall==0.1.0
bandit==1.6.2
beautifulsoup4==4.8.2
billiard==3.6.3.0
black==19.10b0
block-disposable-email==1.0.1
boto==2.44.0
boto3==1.4.2
botocore==1.4.93
Brotli==1.0.7
cachetools==4.1.0
cairocffi==1.0.2
CairoSVG==2.4.2
celery==4.4.2
certifi==2020.4.5.1
cffi==1.14.0
chardet==3.0.4
click==7.1.1
contextlib2==0.6.0.post1
coreapi==2.3.3
coreschema==0.0.4
coverage==5.0.4
cryptography==2.6.1
cssselect2==0.3.0
cycler==0.10.0
decorator==4.4.2
defusedxml==0.6.0
distlib==0.3.0
Django==2.2
django-anymail==7.1.0
django-datatables-view==1.17.0
django-disposable-email-checker==1.2.1
django-extensions==2.2.9
django-phonenumber-field==4.0.0
django-prometheus==2.0.0
django-rest-knox==4.0.1
django-rest-logger==1.0.3
django-storages==1.5.1
django-stubs==1.5.0
django-webtest==1.9.7
djangorestframework==3.9.4
djangorestframework-stubs==1.1.0
dnspython==1.16.0
dnsq==1.1.6
docopt==0.6.2
docutils==0.16
dodgy==0.2.1
elastic-apm==5.3.2
execnet==1.7.1
expiringdict==1.2.0
facebook-sdk==3.1.0
factory-boy==2.12.0
Faker==4.0.2
filelock==3.0.12
flanker==0.9.11
flower==0.9.4
gitdb==4.0.2
GitPython==3.1.0
google-api-core==1.17.0
google-auth==1.14.1
google-cloud-core==1.3.0
google-cloud-translate==2.0.0
googleapis-common-protos==1.51.0
grpcio==1.28.1
gunicorn==20.0.4
html5lib==1.0.1
humanize==0.5.1
idna==2.9
importlib-metadata==1.6.0
importlib-resources==1.4.0
ipython==7.11.1
ipython-genutils==0.2.0
isort==4.3.21
itypes==1.2.0
jedi==0.16.0
Jinja2==2.11.2
jmespath==0.9.5
joblib==0.14.1
JSON-log-formatter==0.2.0
kiwisolver==1.2.0
kombu==4.6.8
lazy-object-proxy==1.4.3
MarkupSafe==1.1.1
matplotlib==3.2.1
mccabe==0.6.1
messagebird==1.2.0
mock==3.0.5
more-itertools==8.2.0
mypy==0.770
mypy-extensions==0.4.3
numpy==1.16.2
oauthlib==3.1.0
packaging==20.3
pandas==0.24.2
parso==0.6.2
pathspec==0.7.0
pbr==5.4.4
pep8-naming==0.4.1
pexpect==4.8.0
phonenumbers==8.12.2
pickleshare==0.7.5
Pillow==6.0.0
pipenv==2018.11.26
pluggy==0.13.1
ply==3.11
prometheus-client==0.7.1
prompt-toolkit==2.0.10
prospector==1.2.0
protobuf==3.11.3
psycopg2-binary==2.8.5
ptpython==2.0.6
ptyprocess==0.6.0
py==1.8.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycodestyle==2.4.0
pycparser==2.20
pydocstyle==5.0.2
pyflakes==2.1.1
Pygments==2.6.1
PyJWT==1.7.1
pylint==2.4.4
pylint-celery==0.3
pylint-django==2.0.12
pylint-flask==0.6
pylint-plugin-utils==0.6
pyparsing==2.4.7
Pyphen==0.9.5
pytest==5.3.4
pytest-cov==2.8.1
pytest-django==3.8.0
pytest-forked==1.1.3
pytest-pythonpath==0.7.3
pytest-xdist==1.31.0
python-dateutil==2.8.1
pytz==2019.3
pyxero==0.9.1
PyYAML==5.3.1
raven==6.10.0
regex==2020.4.4
requests==2.23.0
requests-oauthlib==1.3.0
requirements-detector==0.6
responses==0.10.9
rsa==4.0
s3transfer==0.1.13
schema==0.7.2
scikit-learn==0.21.2
scipy==1.4.1
scorecardpy==0.1.7
setoptconf==0.2.0
six==1.14.0
smmap==3.0.1
snowballstemmer==2.0.0
soupsieve==2.0
sqlparse==0.3.1
stevedore==1.32.0
text-unidecode==1.3
tinycss2==1.0.2
tld==0.12
toml==0.10.0
tornado==6.0.4
traitlets==4.3.3
typed-ast==1.4.1
typing==3.7.4.1
typing-extensions==3.7.4.2
Unidecode==1.1.1
uritemplate==3.0.1
urllib3==1.25.9
vine==1.3.0
virtualenv==20.0.18
virtualenv-clone==0.5.4
waitress==1.4.3
wcwidth==0.1.9
WeasyPrint==47
webencodings==0.5.1
WebOb==1.8.6
WebTest==2.0.34
whitenoise==5.0.1
wrapt==1.11.2
xgboost==0.90
zipp==3.1.0
zipstream==1.1.4

@basepi
Contributor

basepi commented May 11, 2020

I wonder if #783 fixed this? Can you try upgrading to 5.6.0 and see if the crashes go away?

@lfrodrigues
Author

Sure, just did. I'll let you know.

@lfrodrigues
Author

lfrodrigues commented May 12, 2020

The crash is still happening randomly; since I enabled it last night I've gotten around 50 crashes. I'm also seeing a few new errors since I deployed 5.6.0.

I'm getting lots of Celery data in APM, so I assume everything is configured correctly.
Any idea what could be happening, or how I can help debug this?

Here are the multiple crashes I have in sentry:

TransportException: HTTP 403: {"error":"forbidden request: Agent remote configuration is disabled. Configure the `apm-server.kibana` section in apm-server.yml to enable it. If you are using a RUM agent, you also need to configure the `apm-server.rum` section. If you are not using remote configuration, you can safely ignore this error."}

  File "elasticapm/transport/base.py", line 227, in _flush
    self.send(data)
  File "elasticapm/transport/http.py", line 105, in send
    raise TransportException(message, data, print_trace=print_trace)
TransportException: HTTP 400: 400 Bad Request
  File "elasticapm/transport/base.py", line 227, in _flush
    self.send(data)
  File "elasticapm/transport/http.py", line 105, in send
    raise TransportException(message, data, print_trace=print_trace)
TransportException: HTTP 400: {"accepted":0,"errors":[{"message":"gzip: invalid header"}]}

  File "elasticapm/transport/base.py", line 227, in _flush
    self.send(data)
  File "elasticapm/transport/http.py", line 105, in send
    raise TransportException(message, data, print_trace=print_trace)

@basepi
Contributor

basepi commented May 12, 2020

I'm pretty certain these errors are around remote config, not the normal data sending to the APM Server. When you say it crashes, does the Celery worker actually stop or restart? Or do you just see an error log? Crashing is definitely not behavior we want, so I'm going to figure out how to make these errors less intrusive and more useful.

As a band-aid fix, assuming you're not actually using the remote config feature, you can disable it by setting `CENTRAL_CONFIG` to `False`. I'm betting that will stop the errors until I can get a fix in.
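
For reference, with the Django integration that looks like the snippet below (a minimal sketch; `MyService` is a placeholder, and setting the `ELASTIC_APM_CENTRAL_CONFIG` environment variable should work too, assuming the agent's usual `ELASTIC_APM_` env-var prefix):

```python
# settings.py -- stop the agent from polling the APM Server for
# remote (central) configuration.
ELASTIC_APM = {
    "SERVICE_NAME": "MyService",  # placeholder name
    "CENTRAL_CONFIG": False,
}
```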

@basepi
Contributor

basepi commented May 12, 2020

One other question: what are you using for your celery pool? Are you using prefork or one of the thread options?

@lfrodrigues
Author

We get a crash report in Sentry, but I'm not sure what exactly is crashing. I'm adding two more errors I'm seeing below, in case it helps.

We're using prefork.
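
(For readers following along: the pool is chosen at worker startup. A minimal sketch of the prefork setup under discussion; the app name `proj` and broker URL are hypothetical, and the equivalent CLI flag is `--pool=prefork`.)

```python
from celery import Celery

app = Celery("proj", broker="amqp://localhost")  # hypothetical app and broker

# prefork (the default) runs tasks in a pool of forked child processes,
# each of which Celery may terminate and replace over time.
app.conf.worker_pool = "prefork"  # alternatives: "solo", "threads", "gevent", "eventlet"
```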

ConnectionResetError: [Errno 104] Connection reset by peer
  File "urllib3/response.py", line 437, in _error_catcher
    yield
  File "urllib3/response.py", line 515, in read
    data = self._fp.read() if not fp_closed else b""
  File "http/client.py", line 459, in read
    s = self.fp.read()
  File "python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)

ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
  File "elasticapm/transport/base.py", line 227, in _flush
    self.send(data)
  File "elasticapm/transport/http.py", line 96, in send
    body = response.read()
  File "urllib3/response.py", line 541, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "urllib3/response.py", line 455, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
timeout: timed out
  File "urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
    # Permission is hereby granted, free of charge, to any person obtaining a copy
  File "urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "raven/breadcrumbs.py", line 390, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "http/client.py", line 1331, in getresponse
    response.begin()
  File "http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "http/client.py", line 258, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)

ReadTimeoutError: HTTPConnectionPool(host='localhost', port=8200): Read timed out. (read timeout=5)
  File "elasticapm/transport/http.py", line 82, in send
    "POST", url, body=data, headers=headers, timeout=self._timeout, preload_content=False
  File "urllib3/poolmanager.py", line 336, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "elasticapm/instrumentation/packages/base.py", line 205, in call_if_sampling
    return wrapped(*args, **kwargs)
  File "urllib3/connectionpool.py", line 725, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "urllib3/util/retry.py", line 403, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "urllib3/packages/six.py", line 735, in reraise
    raise value
  File "urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "urllib3/connectionpool.py", line 428, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "urllib3/connectionpool.py", line 336, in _raise_timeout
    self, url, "Read timed out. (read timeout=%s)" % timeout_value

TransportException: Unable to reach APM Server: HTTPConnectionPool(host='localhost', port=8200): Read timed out. (read timeout=5) (url: http://localhost:8200/intake/v2/events)
  File "elasticapm/transport/base.py", line 227, in _flush
    self.send(data)
  File "elasticapm/transport/http.py", line 95, in send
    raise TransportException(message, data, print_trace=print_trace)

@lfrodrigues
Author

No, we're not using central config; I disabled it as suggested.

@basepi
Contributor

basepi commented May 12, 2020

Are there any errors present in the APM Server logs? I wonder if it's overloaded or needs to be tuned. Those errors don't appear to be anything on the agent side; they could be network issues or APM Server load issues.

I'm still a bit mystified about the 400 errors that you're seeing. I'm going to do some more digging.

@basepi
Contributor

basepi commented May 12, 2020

Do you have any sort of proxy that could be modifying headers? The strangest piece of those errors is the fact that they're intermittent. We gzip-compress all data that we send to the APM server to reduce payload sizes, and I can't imagine why most data would be getting through fine but some would have invalid headers...

@lfrodrigues
Author

No, nothing in the logs. It can't be the network, since on this particular server I installed the APM Server locally.

No proxy either; Celery runs on the same server as the APM Server, and I'm using the minimal default settings in Django to get APM working:

ELASTIC_APM = {'SERVICE_NAME': 'MyService', 'CENTRAL_CONFIG': False}

@basepi
Contributor

basepi commented May 13, 2020

On the timeouts, it may be that your APM Server is a touch overloaded and thus responding slowly. You can raise that timeout period from the 5-second default with the `SERVER_TIMEOUT` setting.
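
In the Django settings dict that would look something like this (a sketch; the 10s value matches what the reporter tries below):

```python
# settings.py -- give a slow or overloaded APM Server more time to respond
ELASTIC_APM = {
    "SERVICE_NAME": "MyService",  # placeholder name
    "CENTRAL_CONFIG": False,
    "SERVER_TIMEOUT": "10s",      # default is "5s"
}
```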

I'm still trying to track down the source of the gzip errors. One of the APM Server devs pointed out that golang will throw that error in two situations:

The error is not related to any HTTP headers. Go throws it for:

1. A malformed "magic number" sequence: https://github.com/golang/go/blob/cb11c981df7b4dc40550ab71cc097c25d24d7a71/src/compress/gzip/gunzip.go#L186
2. A missing null byte in the Name or Comment field of the header: https://github.com/golang/go/blob/cb11c981df7b4dc40550ab71cc097c25d24d7a71/src/compress/gzip/gunzip.go#L146

I haven't been able to figure out how we would botch either of those, especially intermittently like you're seeing. It may be a bug in Python's gzip library... anyway, I'm still digging.
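
To make the first failure mode concrete, here's a self-contained sketch: a gzip stream starts with the two-byte magic number `\x1f\x8b`, and clobbering even the first byte makes any decoder reject the whole payload, which is consistent with a partially written buffer causing intermittent failures.

```python
import gzip

payload = gzip.compress(b'{"span": {"id": "abc"}}')
assert payload[:2] == b"\x1f\x8b"  # the gzip magic number Go checks first

corrupted = b"\x00" + payload[1:]  # simulate a clobbered first byte
try:
    gzip.decompress(corrupted)
except OSError as exc:  # gzip.BadGzipFile on Python >= 3.8
    print("decoder rejects it:", exc)  # e.g. "Not a gzipped file"
```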

@lfrodrigues
Author

This specific server is massively over-provisioned, with a load average consistently below the available cores, so I find it strange that it would be a server timeout. I increased the timeout to 10s and will let you know if anything changes.

@lfrodrigues
Author

Maybe all these issues are related to Celery workers exiting when the APM agent wasn't expecting it?

Here's another crash from today.

SystemExit: 155
  File "billiard/pool.py", line 292, in __call__
    sys.exit(self.workloop(pid=pid))
  File "billiard/pool.py", line 283, in exit
    return _exit(status)

ValueError: close timed out
  File "celery/utils/dispatch/signal.py", line 288, in send
    response = receiver(signal=self, sender=sender, **named)
  File "elasticapm/contrib/celery/__init__.py", line 80, in worker_shutdown
    client.close()
  File "elasticapm/base.py", line 296, in close
    manager.stop_thread()
  File "elasticapm/transport/base.py", line 260, in close
    raise ValueError("close timed out")

[Screenshot attached: 2020-05-14 at 15:19]

@basepi
Contributor

basepi commented May 14, 2020

That could very well be the case. I could see that causing most of the issues that you're seeing -- in fact, if the celery worker shuts down while a message is being written to the gzip buffer in memory, it's possible that could be causing the corruption that results in the original error you reported.

Perhaps it would be worth turning off `worker_max_tasks_per_child` temporarily to see if that resolves the errors? Then we could investigate ways to make the workers shut down more gracefully, if that is indeed the issue.
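
For context, the setting in question looks like this (a sketch; the app name is hypothetical). Celery's default is no limit, so removing the setting disables recycling entirely:

```python
from celery import Celery

app = Celery("proj")  # hypothetical app name

# Recycle each prefork child after it has run 100 tasks. Every recycle
# terminates a process, potentially while the APM transport thread is
# mid-flush, which is the suspected trigger in this thread.
app.conf.worker_max_tasks_per_child = 100

# To test the theory: delete the line above (default is None, i.e. no limit).
```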

@lfrodrigues
Author

Disabled; I'll let you know tomorrow.

@lfrodrigues
Author

It looks like the worker theory is correct.

I'm left with one crash in Sentry, happening between 1am and 2am, that I hadn't even noticed before; it's related to the size of the event.

We're inserting around 700k new rows into a table between 1am and 2am. Is there a smart way to prevent sending this huge SQL statement to the APM Server?

dropping flushed data due to transport failure back-off
TransportException: HTTP 400: {"accepted":0,"errors":[{"message":"event exceeded the permitted size.","document":"{\"span\": {\"id\": \"01f00bedcb0e54eb\", \"transaction_id\": \"f1559737ffb09161\", \"trace_id\": \"f0ed1884c22422ee3af54fb9d9761d24\", \"parent_id\": \"f1559737ffb09161\", \"INFO ON THE SQL WITH LOTS OF STUFF...
  File "elasticapm/transport/base.py", line 227, in _flush
    self.send(data)
  File "elasticapm/transport/http.py", line 105, in send
    raise TransportException(message, data, print_trace=print_trace)
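
One possible client-side workaround (an editor's sketch only, assuming the agent's `PROCESSORS` hook and the usual span layout of `context.db.statement`; the 10,000-character cap and module path are made up) is a custom processor that trims long DB statements before spans are queued:

```python
# myapp/apm_processors.py (hypothetical module)
from elasticapm.conf.constants import SPAN
from elasticapm.processors import for_events

MAX_STATEMENT_LEN = 10_000  # arbitrary cap for illustration

@for_events(SPAN)
def truncate_db_statement(client, event):
    # Trim oversized SQL in the span context so the intake payload stays
    # under the APM Server's per-event size limit.
    db = event.get("context", {}).get("db")
    if db and isinstance(db.get("statement"), str):
        db["statement"] = db["statement"][:MAX_STATEMENT_LEN]
    return event
```

Note that setting `PROCESSORS` replaces the agent's default list, so the built-in sanitizers would need to be listed alongside `myapp.apm_processors.truncate_db_statement`.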

@basepi
Contributor

basepi commented May 18, 2020

Yes, we should definitely truncate that. Can you create a separate issue for the SQL truncation? I want to keep this issue focused on making the transport shut down more gracefully.

@lfrodrigues
Author

Opened #827

basepi added a commit to basepi/apm-agent-python that referenced this issue May 27, 2020
Previously these would show up as big tracebacks in the logs which
was ugly, considering all sorts of things can go wrong when suddenly
closing the transport (such as when a celery worker terminates)

Fixes elastic#823
@basepi
Contributor

basepi commented May 27, 2020

Opened #838 to address these errors and make them more descriptive.
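
The shape of the fix, as described in the commit message above (a sketch of the pattern only, not the actual diff in #838): catch anything raised while the transport closes and log it, rather than letting a traceback escape the Celery `worker_shutdown` signal handler.

```python
import logging

logger = logging.getLogger("elasticapm")

def close_client_quietly(client):
    # All sorts of things can go wrong when a worker terminates
    # mid-flush, so log the failure instead of re-raising it.
    try:
        client.close()
    except Exception:
        logger.warning("error while closing elasticapm client", exc_info=True)
```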

basepi added a commit that referenced this issue Jun 5, 2020
* Catch and log any exceptions while closing the transport

Previously these would show up as big tracebacks in the logs which
was ugly, considering all sorts of things can go wrong when suddenly
closing the transport (such as when a celery worker terminates)

Fixes #823

* Remove trailing comma

I don't know how this happened, I think black added it? But then it
didn't re-add it this time when I removed it. Probably more of
black's magic comma stuff that has been added recently

* Add to changelog
romulorosa pushed a commit to romulorosa/apm-agent-python that referenced this issue Oct 15, 2020
beniwohli pushed a commit to beniwohli/apm-agent-python that referenced this issue Sep 14, 2021