Add Geoprocessing Request Timeout #2751

Merged

arottersman merged 1 commit into develop from arr/celery-timeout-experiment on Apr 3, 2018

Conversation

@arottersman commented Apr 2, 2018

Overview

We've twice witnessed a state on production where the geoprocessing service times out on every response and must be stopped and restarted to get back into a good state. Because there was no timeout on the geoprocessing task requests, the tasks would run longer than the celery hard task time limit (300s). Exceeding this limit would put the celery service into its own bad state, and it, too, would need a restart (perhaps related: celery/celery#2451).

While we determine the underlying cause of the geoprocessing failures, add a timeout on the geoprocessing request that is shorter than the celery hard task time limit. (Request timeouts are probably best practice anyway; the celery docs even warn specifically against performing I/O in a task without a timeout.) This way a rollbar error will notify us of a bad state, instead of an email from our client saying all analyze and model runs time out.

Connects #2738

Demo

[Screenshot: 2018-04-02 at 12:40:54 PM]

[Screenshot: 2018-04-02 at 12:40:34 PM]

Testing Instructions

  • Pull this branch, and run ./scripts/debugcelery.sh
  • I've added a commit that drops the timeout down to one second (see the sketch after these instructions). I'll drop this commit before merging so the timeout is 200 seconds. Draw a big shape so the operation isn't cached and it takes the geoprocessing service longer than a second to chew through an analyze task. Confirm you see the error caught in the celery logs.
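For reference, the temporary change in that testing commit looks roughly like the sketch below (illustrative only; the merged version keeps timeout=200):

    # Hypothetical sketch of the temporary testing commit (dropped before merge):
    # a one-second request timeout makes any uncached analyze run trip it.
    import json

    import requests


    def post_geoprocessing_for_testing(geop_url, data):
        return requests.post(geop_url,
                             data=json.dumps(data),
                             headers={'Content-Type': 'application/json'},
                             timeout=1)  # restored to 200 before merging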

@@ -232,10 +232,13 @@ def geoprocess(endpoint, data, retry=None):
     try:
         response = requests.post(geop_url,
                                  data=json.dumps(data),
-                                 headers={'Content-Type': 'application/json'})
+                                 headers={'Content-Type': 'application/json'},
+                                 timeout=200)
Contributor:
I tried to find where we set the Celery hard timeout to 300, but couldn't find it at a quick glance. Is it possible to set this value as a function of that setting since the relationship is important?

Author:

Yes. We don't set it currently; it has a default of 300. I will set it explicitly, and add a request timeout setting based on it.
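A minimal sketch of that relationship in Django settings, with illustrative names (the setting names in the final commit may differ):

    # Sketch: derive the HTTP request timeout from the Celery hard task limit,
    # so the request always gives up before the worker is hard-killed.
    CELERY_TASK_TIME_LIMIT = 300                       # hard task limit, in seconds
    REQUEST_TIMEOUT = CELERY_TASK_TIME_LIMIT - 100     # passed to requests.post(timeout=...)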

     except ConnectionError as exc:
         if retry is not None:
             retry(exc=exc)
+    except Timeout as exc:
+        raise Exception('Geoprocessing service timed out: {}'.format(exc))
Contributor:
Is there a way to report the actual exception contents in the job table but restrict what gets sent to the client? The timeout exception reports internal hosts and ports, which, while not crucial to protect, feels like an unnecessary leak. If we have to choose between leaking that exception detail and not capturing it, the qualitative description you provide is likely more meaningful than the Timeout exception contents anyway.

Author:
I'll look into splitting our job error from what gets sent back to the client, though I'm not sure we're set up for that. We could also write details we don't want to send to the client to the celery log instead. I agree they're not even that helpful, so I may just remove the printing of the exception entirely.
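One possible shape for that split, sketched under the assumption that the details only need to reach the celery log (helper name and messages are illustrative, not the project's actual code):

    import json
    import logging

    import requests
    from requests.exceptions import Timeout

    logger = logging.getLogger(__name__)


    def post_geoprocessing(geop_url, data, timeout=200):
        """Hypothetical helper: log full timeout details server-side,
        surface only a qualitative message to the job/client."""
        try:
            return requests.post(geop_url,
                                 data=json.dumps(data),
                                 headers={'Content-Type': 'application/json'},
                                 timeout=timeout)
        except Timeout as exc:
            # Internal hosts/ports stay in the celery log.
            logger.error('Geoprocessing request timed out: %s', exc)
            # The error stored on the Job and sent to the client stays generic.
            raise Exception('Geoprocessing service timed out.')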

@arottersman force-pushed the arr/celery-timeout-experiment branch 2 times, most recently from 300add8 to 6af8e2f on April 2, 2018 at 18:29
@mmcfarland (Contributor) left a comment:
Changes look good. I played around with the REQUEST timeout and it was able to recover from Geop timeout errors and continue receiving tasks.

I did notice, however, that if you set CELERY_TASK_TIME_LIMIT to a low value to force a timeout, the process throws in our save_job_error method and at that point does not continue to receive new jobs:

  File "/opt/app/apps/core/tasks.py", line 42, in save_job_error
    .format(e.errno, e.strerror))
AttributeError: 'IntegrityError' object has no attribute 'errno'

If I remove the failing line containing errno, it seems to pick up again (when not using debugcelery). It may be sufficient to just fix the save_job_error method for the timeout error, but we may also want to think about our 5 minute timeout value and 3.5 minute geop timeout value. Would it be beneficial to reduce these?
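(As an aside, one defensive way to keep that log line from throwing for exception types without errno/strerror would be something like the sketch below; the diff arottersman posts further down takes a different route and catches IntegrityError explicitly.)

    # Hypothetical helper: format an exception for the job-error log without
    # assuming it has errno/strerror (IntegrityError, for instance, has neither).
    def describe_exception(e):
        if hasattr(e, 'errno') and hasattr(e, 'strerror'):
            return '{0}: {1}'.format(e.errno, e.strerror)
        return repr(e)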

@@ -138,6 +138,9 @@ def get_env_setting(setting):
 }
 CELERY_TASK_DEFAULT_EXCHANGE = 'tasks'
 CELERY_TASK_DEFAULT_ROUTING_KEY = "task.%s" % STACK_COLOR
+
+CELERY_TASK_TIME_LIMIT = 300
Contributor:

It looks like v4, which we're on now, has no default set for the time limit:
http://docs.celeryproject.org/en/latest/userguide/configuration.html#task-time-limit

So setting this alone may have an impact, although it looks like the prefix should be CELERYD, according to
http://docs.celeryproject.org/en/latest/userguide/configuration.html#new-lowercase-settings

The setting you have appears to have an effect when tweaked, so it's likely backwards-compatible.

Author:

I'll add the D at the end to be safe.

It's weird that there's no default set. The error we were seeing on staging spoke of a HardTimeLimit of 300 sec being exceeded.

Author:

It doesn't appear to work for me if I add the D...were you able to set it that way? If not, I'm going to open an issue in celery. Could be a mistake in the docs...

Contributor:

Nope, but it definitely works the way you have it set now, so it seems good to leave as-is.
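A likely explanation for why the CELERY_ prefix works here, assuming the project uses the standard Celery 4 + Django integration (the app name below is illustrative):

    # celery.py -- typical Celery 4 + Django wiring (an assumption about this project).
    # With namespace='CELERY', Django settings prefixed CELERY_ map onto the new
    # lowercase names (CELERY_TASK_TIME_LIMIT -> task_time_limit), whereas the old
    # CELERYD_TASK_TIME_LIMIT spelling would not match the namespace and is ignored.
    from celery import Celery

    app = Celery('mmw')  # illustrative app name
    app.config_from_object('django.conf:settings', namespace='CELERY')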

@arottersman (Author):

Nice, I didn't notice that save_job_error error when I was initially playing with the hard task time limit. I think we should add a commit to fix that issue, but also keep the request timeouts.

Re: how long the timeouts should be. The geoprocessing service has its own time limit of 59 sec, so it seems like we could be safe with a little more than that: maybe a task time limit of 100 sec and a request timeout of 90? I don't think anything takes longer than geoprocessing.
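In settings terms, that proposal would look roughly like this (values taken from the discussion; the request-timeout name is illustrative):

    # The geoprocessing service's own limit is 59 sec, so both values just need
    # comfortable headroom above it while staying far below the old 300 sec.
    CELERY_TASK_TIME_LIMIT = 100    # hard task limit, seconds
    GEOP_REQUEST_TIMEOUT = 90       # requests.post(..., timeout=GEOP_REQUEST_TIMEOUT)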

@arottersman (Author):

Re: the save_job_error error. When we exceed the hard time limit, we try to save a Job that doesn't have a traceback, a column on which there is a NOT NULL constraint. This throws an IntegrityError, which doesn't have an errno attribute, which in turn crashes our except block.

I don't think, however, that the hanging always has to do with a failure in save_job_error. Sometimes it seems the job never even makes it to save_job_error at all. After this message my worker entered its horribly stuck state:

[2018-04-03 11:40:25,852: INFO/MainProcess] Received task: apps.geoprocessing_api.tasks.analyze_pointsource[c818276b-08a1-4c8f-8db1-9390c9f7a01d]
[2018-04-03 11:40:26,095: ERROR/MainProcess] Task fbaeedf3-ec08-4f4d-9036-25dbfb4d2730 run from job 4875 raised exception: TimeLimitExceeded(1,)
None
[2018-04-03 11:40:26,184: ERROR/MainProcess] Task handler raised error: TimeLimitExceeded(1,)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/billiard/pool.py", line 658, in on_hard_timeout
    raise TimeLimitExceeded(job._timeout)
TimeLimitExceeded: TimeLimitExceeded(1,)
[2018-04-03 11:40:26,186: ERROR/MainProcess] Hard time limit (1s) exceeded for apps.modeling.geoprocessing.run[fbaeedf3-ec08-4f4d-9036-25dbfb4d2730]
[2018-04-03 11:40:26,221: ERROR/MainProcess] Task fbaeedf3-ec08-4f4d-9036-25dbfb4d2730 run from job 4875 raised exception: TimeLimitExceeded(1,)
None
[2018-04-03 11:40:27,366: ERROR/MainProcess] Process 'ForkPoolWorker-1' pid:10117 exited with 'signal 9 (SIGKILL)'
[2018-04-03 11:40:27,488: DEBUG/MainProcess] TaskPool: Apply <function _fast_trace_task at 0x7f3277201c80> (args:('apps.modeling.geoprocessing.run', '6a4ddfe8-55bf-4415-9682-7c34240a4ce1', {'origin': 'gen3201@app', 'lang': 'py', 'task': 'apps.modeling.geoprocessing.run', 'group': None, 'root_id': '6a4ddfe8-55bf-4415-9682-7c34240a4ce1', u'delivery_info': {u'priority': 0, u'redelivered': None, u'routing_key': 'Black', u'exchange': u''}, 'expires': None, u'correlation_id': '6a4ddfe8-55bf-4415-9682-7c34240a4ce1', 'retries': 0, 'timelimit': [None, None], 'argsrepr': '(\'soil\', {\'polygon\': [\'{ "type": "MultiPolygon", "coordinates": [ [ [ [ -82.488346990747843, 38.472211785968319 ], [ -82.476862970189657, 38.472211785968319 ], [ -82.476862970189657, 38.481202179740976 ], [ -82.488346990747843, 38.481202179740976 ], [ -82.488346990747843, 38.472211785968319 ] ] ] ] }\']}, None)', 'eta': None, 'parent_id': None, u'reply_to': '47c326ee-66c8-3795-b218-6012e9a1097f', 'id': '6a4ddfe8-55bf-4415-9682-7c34240a4ce1', 'kwargsrepr': '{}'}, '[["soil", {"polygon": ["{ \\"type\\": \\"MultiPolygon\\", \\"coordinates\\": [ [ [ [... kwargs:{})

If we're interested in addressing the save_job_error problems, I was thinking of something like the diff below. I don't think it's the underlying problem, though; I think we need to avoid hitting the hard timeout in the first place. I'm inclined to fix up this PR to have lower request and hard timeouts, and call it a day.

diff --git a/src/mmw/apps/core/tasks.py b/src/mmw/apps/core/tasks.py
index 09c90e9..282692a 100644
--- a/src/mmw/apps/core/tasks.py
+++ b/src/mmw/apps/core/tasks.py
@@ -4,6 +4,7 @@ from __future__ import unicode_literals
 from __future__ import absolute_import

 from django.utils.timezone import now
+from django.db import IntegrityError
 from celery import shared_task
 from apps.core.models import Job

@@ -31,10 +32,14 @@ def save_job_error(request, exc, traceback, job_id):
     try:
         job = Job.objects.get(id=job_id)
         job.error = exc
-        job.traceback = traceback
+        job.traceback = traceback or ''
         job.delivered_at = now()
         job.status = 'failed'
         job.save()
+    except IntegrityError as e:
+        logger.error('Failed to save job error status. Job will appear hung. \
+                     Job Id: {0}'.format(job.id))
+        logger.error('IntegrityError: {}'.format(e))
     except Exception as e:
         logger.error('Failed to save job error status. Job will appear hung. \
                      Job Id: {0}'.format(job.id))

What the integrity error ends up looking like:

[2018-04-03 11:00:02,366: ERROR/MainProcess] Task ca7b5415-6914-4f21-97c5-3a390e5339cf run from job 4858 raised exception: TimeLimitExceeded(1,)
None
[2018-04-03 11:00:02,516: ERROR/MainProcess] Failed to save job error status. Job will appear hung.                      Job Id: 4858
[2018-04-03 11:00:02,518: ERROR/MainProcess] IntegrityError: null value in column "traceback" violates not-null constraint
DETAIL:  Failing row contains (4858, 88111856-0df4-4397-a790-6a82cc4bc157, { "type": "MultiPolygon", "coordinates": [ [ [ [ -82.26219177246..., 2018-04-03 15:00:00.679+00, , 2018-04-03 15:00:02.507433+00, TimeLimitExceeded(1,), null, failed, 1).

@mmcfarland (Contributor) left a comment:

I think the new timeouts are good.

Commit message:

If a task takes longer than the Celery HARD_TASK_TIME_LIMIT
default of 300 sec, celery enters a non-responsive state. Set a timeout
on all http requests we make in tasks so we don't enter this
state while waiting for a response. This will also better alert
us when there are "silent" errors with the geoprocessing service,
i.e. when it is consistently timing out without other errors.
@arottersman force-pushed the arr/celery-timeout-experiment branch from 25599ab to 000e5e9 on April 3, 2018 at 20:47
@arottersman (Author):

Thanks for all your help with this. Going to merge.

@arottersman merged commit 46b5247 into develop on Apr 3, 2018
@arottersman deleted the arr/celery-timeout-experiment branch on April 3, 2018 at 21:27