Add Geoprocessing Request Timeout #2751

Merged

arottersman merged 1 commit into develop from arr/celery-timeout-experiment on Apr 3, 2018

Conversation

@arottersman commented Apr 2, 2018

Overview

We've twice witnessed a state on production where the geoprocessing service times out on every response and must be stopped and restarted to get back into a good state. Because there was no timeout on the geoprocessing task requests, the tasks would run longer than the celery hard task time limit (300s). Exceeding this limit would put the celery service into its own bad state, and it, too, would need a restart (perhaps related: celery/celery#2451).

While we determine the underlying cause of the geoprocessing failures, add a timeout on the geoprocessing request that is shorter than the celery hard task time limit. (Request timeouts are probably best practice anyway; the celery docs even warn specifically against performing I/O in a task without a timeout.) This way a rollbar error will notify us of a bad state, instead of an email from our client saying all analyze and model runs time out.

Connects #2738

Demo

[Screenshot: 2018-04-02 at 12:40:54 PM]

[Screenshot: 2018-04-02 at 12:40:34 PM]

Testing Instructions

  • Pull this branch, and run ./scripts/debugcelery.sh
  • I've added a commit that drops the timeout down to one second (see the sketch after these instructions). I'll drop this commit before merging so the timeout is 200 seconds. Draw a big shape so the operation isn't cached and it takes the geoprocessing service longer than a second to chew through an analyze task. Confirm you see the error caught in the celery logs.
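For reference, the temporary change in that testing commit looks roughly like the sketch below (illustrative only; the merged version keeps timeout=200):

    # Hypothetical sketch of the temporary testing commit (dropped before merge):
    # a one-second request timeout makes any uncached analyze run trip it.
    import json

    import requests


    def post_geoprocessing_for_testing(geop_url, data):
        return requests.post(geop_url,
                             data=json.dumps(data),
                             headers={'Content-Type': 'application/json'},
                             timeout=1)  # restored to 200 before merging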

@@ -232,10 +232,13 @@ def geoprocess(endpoint, data, retry=None):
     try:
         response = requests.post(geop_url,
                                  data=json.dumps(data),
-                                 headers={'Content-Type': 'application/json'})
+                                 headers={'Content-Type': 'application/json'},
+                                 timeout=200)
Contributor:
I tried to find where we set the Celery hard timeout to 300, but couldn't find it at a quick glance. Is it possible to set this value as a function of that setting since the relationship is important?

Author:

Yes. We don't set it currently; it has a default of 300. I will set it explicitly, and add a request timeout setting based on it.
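A minimal sketch of that relationship in Django settings, with illustrative names (the setting names in the final commit may differ):

    # Sketch: derive the HTTP request timeout from the Celery hard task limit,
    # so the request always gives up before the worker is hard-killed.
    CELERY_TASK_TIME_LIMIT = 300                       # hard task limit, in seconds
    REQUEST_TIMEOUT = CELERY_TASK_TIME_LIMIT - 100     # passed to requests.post(timeout=...)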

     except ConnectionError as exc:
         if retry is not None:
             retry(exc=exc)
+    except Timeout as exc:
+        raise Exception('Geoprocessing service timed out: {}'.format(exc))
Contributor:
Is there a way to report the actual exception contents in the job table but restrict what gets sent to the client? The timeout exception reports internal hosts and ports, which, while not crucial to protect, feels like an unnecessary leak. If we have to choose between leaking that exception detail and not capturing it, the qualitative description you provide is likely more meaningful than the Timeout exception contents anyway.

Author:
I'll look into splitting our job error from what gets sent back to the client, though I'm not sure we're set up for that. We could also write details we don't want to send to the client to the celery log instead. I agree they're not even that helpful, so I may just remove the printing of the exception entirely.
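One possible shape for that split, sketched under the assumption that the details only need to reach the celery log (helper name and messages are illustrative, not the project's actual code):

    import json
    import logging

    import requests
    from requests.exceptions import Timeout

    logger = logging.getLogger(__name__)


    def post_geoprocessing(geop_url, data, timeout=200):
        """Hypothetical helper: log full timeout details server-side,
        surface only a qualitative message to the job/client."""
        try:
            return requests.post(geop_url,
                                 data=json.dumps(data),
                                 headers={'Content-Type': 'application/json'},
                                 timeout=timeout)
        except Timeout as exc:
            # Internal hosts/ports stay in the celery log.
            logger.error('Geoprocessing request timed out: %s', exc)
            # The error stored on the Job and sent to the client stays generic.
            raise Exception('Geoprocessing service timed out.')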

@arottersman force-pushed the arr/celery-timeout-experiment branch 2 times, most recently from 300add8 to 6af8e2f on April 2, 2018 at 18:29
@mmcfarland (Contributor) left a comment:
Changes look good. I played around with the REQUEST timeout and it was able to recover from Geop timeout errors and continue receiving tasks.

I did notice, however, that if you set CELERY_TASK_TIME_LIMIT to a low value to force a timeout, the process throws in our save_job_error method and at that point does not continue to receive new jobs:

  File "/opt/app/apps/core/tasks.py", line 42, in save_job_error
    .format(e.errno, e.strerror))
AttributeError: 'IntegrityError' object has no attribute 'errno'

If I remove the failing line containing errno, it seems to pick up again (when not using debugcelery). It may be sufficient to just fix the save_job_error method for the timeout error, but we may also want to think about our 5 minute timeout value and 3.5 minute geop timeout value. Would it be beneficial to reduce these?
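(As an aside, one defensive way to keep that log line from throwing for exception types without errno/strerror would be something like the sketch below; the diff arottersman posts further down takes a different route and catches IntegrityError explicitly.)

    # Hypothetical helper: format an exception for the job-error log without
    # assuming it has errno/strerror (IntegrityError, for instance, has neither).
    def describe_exception(e):
        if hasattr(e, 'errno') and hasattr(e, 'strerror'):
            return '{0}: {1}'.format(e.errno, e.strerror)
        return repr(e)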

@@ -138,6 +138,9 @@ def get_env_setting(setting):
 }
 CELERY_TASK_DEFAULT_EXCHANGE = 'tasks'
 CELERY_TASK_DEFAULT_ROUTING_KEY = "task.%s" % STACK_COLOR
+
+CELERY_TASK_TIME_LIMIT = 300
Contributor:

It looks like v4, which we're on now, has no default set for the time limit:
http://docs.celeryproject.org/en/latest/userguide/configuration.html#task-time-limit

So setting this alone may have an impact, although it looks like the prefix should be CELERYD, according to
http://docs.celeryproject.org/en/latest/userguide/configuration.html#new-lowercase-settings

The setting you have appears to have an effect when tweaked, so it's likely backwards-compatible.

Author:

I'll add the D at the end to be safe.

It's weird that there's no default set. The error we were seeing on staging spoke of a HardTimeLimit of 300 sec being exceeded.

Author:

It doesn't appear to work for me if I add the D...were you able to set it that way? If not, I'm going to open an issue in celery. Could be a mistake in the docs...

Contributor:

Nope, but it definitely works the way you have it set now, so it seems good to leave as-is.
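A likely explanation for why the CELERY_ prefix works here, assuming the project uses the standard Celery 4 + Django integration (the app name below is illustrative):

    # celery.py -- typical Celery 4 + Django wiring (an assumption about this project).
    # With namespace='CELERY', Django settings prefixed CELERY_ map onto the new
    # lowercase names (CELERY_TASK_TIME_LIMIT -> task_time_limit), whereas the old
    # CELERYD_TASK_TIME_LIMIT spelling would not match the namespace and is ignored.
    from celery import Celery

    app = Celery('mmw')  # illustrative app name
    app.config_from_object('django.conf:settings', namespace='CELERY')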

@arottersman (Author):

Nice, I didn't notice that save_job_error error when I was initially playing with the hard task time limit. I think we should add a commit to fix that issue, but also keep the request timeouts.

Re: how long the timeouts should be. The geoprocessing service has its own time limit of 59 sec, so it seems like we could be safe with a little more than that: maybe a task time limit of 100 sec and a request timeout of 90? I don't think anything takes longer than geoprocessing.
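In settings terms, that proposal would look roughly like this (values taken from the discussion; the request-timeout name is illustrative):

    # The geoprocessing service's own limit is 59 sec, so both values just need
    # comfortable headroom above it while staying far below the old 300 sec.
    CELERY_TASK_TIME_LIMIT = 100    # hard task limit, seconds
    GEOP_REQUEST_TIMEOUT = 90       # requests.post(..., timeout=GEOP_REQUEST_TIMEOUT)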

@arottersman (Author):

Re: the save_job_error error. When we exceed the hard time limit, we try to save a Job that doesn't have a traceback, a column on which there is a NOT NULL constraint. This throws an IntegrityError, which doesn't have an errno attribute, which in turn crashes our except block.

I don't think, however, that the hanging always has to do with a failure in save_job_error. Sometimes it seems the job never even makes it to save_job_error at all. After this message my worker entered its horribly stuck state:

[2018-04-03 11:40:25,852: INFO/MainProcess] Received task: apps.geoprocessing_api.tasks.analyze_pointsource[c818276b-08a1-4c8f-8db1-9390c9f7a01d]
[2018-04-03 11:40:26,095: ERROR/MainProcess] Task fbaeedf3-ec08-4f4d-9036-25dbfb4d2730 run from job 4875 raised exception: TimeLimitExceeded(1,)
None
[2018-04-03 11:40:26,184: ERROR/MainProcess] Task handler raised error: TimeLimitExceeded(1,)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/billiard/pool.py", line 658, in on_hard_timeout
    raise TimeLimitExceeded(job._timeout)
TimeLimitExceeded: TimeLimitExceeded(1,)
[2018-04-03 11:40:26,186: ERROR/MainProcess] Hard time limit (1s) exceeded for apps.modeling.geoprocessing.run[fbaeedf3-ec08-4f4d-9036-25dbfb4d2730]
[2018-04-03 11:40:26,221: ERROR/MainProcess] Task fbaeedf3-ec08-4f4d-9036-25dbfb4d2730 run from job 4875 raised exception: TimeLimitExceeded(1,)
None
[2018-04-03 11:40:27,366: ERROR/MainProcess] Process 'ForkPoolWorker-1' pid:10117 exited with 'signal 9 (SIGKILL)'
[2018-04-03 11:40:27,488: DEBUG/MainProcess] TaskPool: Apply <function _fast_trace_task at 0x7f3277201c80> (args:('apps.modeling.geoprocessing.run', '6a4ddfe8-55bf-4415-9682-7c34240a4ce1', {'origin': 'gen3201@app', 'lang': 'py', 'task': 'apps.modeling.geoprocessing.run', 'group': None, 'root_id': '6a4ddfe8-55bf-4415-9682-7c34240a4ce1', u'delivery_info': {u'priority': 0, u'redelivered': None, u'routing_key': 'Black', u'exchange': u''}, 'expires': None, u'correlation_id': '6a4ddfe8-55bf-4415-9682-7c34240a4ce1', 'retries': 0, 'timelimit': [None, None], 'argsrepr': '(\'soil\', {\'polygon\': [\'{ "type": "MultiPolygon", "coordinates": [ [ [ [ -82.488346990747843, 38.472211785968319 ], [ -82.476862970189657, 38.472211785968319 ], [ -82.476862970189657, 38.481202179740976 ], [ -82.488346990747843, 38.481202179740976 ], [ -82.488346990747843, 38.472211785968319 ] ] ] ] }\']}, None)', 'eta': None, 'parent_id': None, u'reply_to': '47c326ee-66c8-3795-b218-6012e9a1097f', 'id': '6a4ddfe8-55bf-4415-9682-7c34240a4ce1', 'kwargsrepr': '{}'}, '[["soil", {"polygon": ["{ \\"type\\": \\"MultiPolygon\\", \\"coordinates\\": [ [ [ [... kwargs:{})

If we're interested in addressing the save_job_error problems, I was thinking of something like the diff below. I don't think it's the underlying problem, though; I think we need to avoid hitting the hard timeout in the first place. I'm inclined to fix up this PR to have lower request and hard timeouts, and call it a day.

diff --git a/src/mmw/apps/core/tasks.py b/src/mmw/apps/core/tasks.py
index 09c90e9..282692a 100644
--- a/src/mmw/apps/core/tasks.py
+++ b/src/mmw/apps/core/tasks.py
@@ -4,6 +4,7 @@ from __future__ import unicode_literals
 from __future__ import absolute_import

 from django.utils.timezone import now
+from django.db import IntegrityError
 from celery import shared_task
 from apps.core.models import Job

@@ -31,10 +32,14 @@ def save_job_error(request, exc, traceback, job_id):
     try:
         job = Job.objects.get(id=job_id)
         job.error = exc
-        job.traceback = traceback
+        job.traceback = traceback or ''
         job.delivered_at = now()
         job.status = 'failed'
         job.save()
+    except IntegrityError as e:
+        logger.error('Failed to save job error status. Job will appear hung. \
+                     Job Id: {0}'.format(job.id))
+        logger.error('IntegrityError: {}'.format(e))
     except Exception as e:
         logger.error('Failed to save job error status. Job will appear hung. \
                      Job Id: {0}'.format(job.id))

What the integrity error ends up looking like:

[2018-04-03 11:00:02,366: ERROR/MainProcess] Task ca7b5415-6914-4f21-97c5-3a390e5339cf run from job 4858 raised exception: TimeLimitExceeded(1,)
None
[2018-04-03 11:00:02,516: ERROR/MainProcess] Failed to save job error status. Job will appear hung.                      Job Id: 4858
[2018-04-03 11:00:02,518: ERROR/MainProcess] IntegrityError: null value in column "traceback" violates not-null constraint
DETAIL:  Failing row contains (4858, 88111856-0df4-4397-a790-6a82cc4bc157, { "type": "MultiPolygon", "coordinates": [ [ [ [ -82.26219177246..., 2018-04-03 15:00:00.679+00, , 2018-04-03 15:00:02.507433+00, TimeLimitExceeded(1,), null, failed, 1).

@mmcfarland (Contributor) left a comment:

I think the new timeouts are good.

Commit message:

If a task takes longer than the Celery HARD_TASK_TIME_LIMIT
default of 300 sec, celery enters a non-responsive state. Set a timeout
on all http requests we make in tasks so we don't enter this
state while waiting for a response. This will also better alert
us when there are "silent" errors with the geoprocessing service,
i.e. when it is consistently timing out without other errors.
@arottersman force-pushed the arr/celery-timeout-experiment branch from 25599ab to 000e5e9 on April 3, 2018 at 20:47
@arottersman (Author):

Thanks for all your help with this. Going to merge.

@arottersman merged commit 46b5247 into develop on Apr 3, 2018
@arottersman deleted the arr/celery-timeout-experiment branch on April 3, 2018 at 21:27