Add Geoprocessing Request Timeout #2751
Conversation
```diff
@@ -232,10 +232,13 @@ def geoprocess(endpoint, data, retry=None):
     try:
         response = requests.post(geop_url,
                                  data=json.dumps(data),
-                                 headers={'Content-Type': 'application/json'})
+                                 headers={'Content-Type': 'application/json'},
+                                 timeout=200)
```
I tried to find where we set the Celery hard timeout to 300, but couldn't find it at a quick glance. Is it possible to set this value as a function of that setting since the relationship is important?
Yes, we don't set it currently; it has a default of 300. I will set it explicitly, and set a request timeout setting based on it.
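For reference, a minimal sketch of what deriving the request timeout from the Celery limit could look like in settings; `GEOP_REQUEST_TIMEOUT` and the 100-second margin are hypothetical, not taken from this PR:

```python
# Explicitly pin the Celery hard task time limit (the implicit default is 300s).
CELERY_TASK_TIME_LIMIT = 300

# Hypothetical setting: derive the geoprocessing request timeout from the task
# limit so a slow request always fails before Celery hard-kills the task.
GEOP_REQUEST_TIMEOUT = CELERY_TASK_TIME_LIMIT - 100

# The task would then pass timeout=settings.GEOP_REQUEST_TIMEOUT to requests.post().
```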
```diff
     except ConnectionError as exc:
         if retry is not None:
             retry(exc=exc)
+    except Timeout as exc:
+        raise Exception('Geoprocessing service timed out: {}'.format(exc))
```
Is there a way to report the actual exception contents in the job table but restrict what gets sent to the client? The timeout exception reports internal hosts and ports, which, while not crucial to protect, does feel like an unnecessary leak. It's likely that the qualitative description you provide is more meaningful than the `Timeout` exception contents, in the case where we have to choose between leaking and not capturing that exception detail.
I'll look into splitting our job error and what gets sent back to the client, though I'm not sure we're set up for that. We could also write details we don't want to send to the client to the celery log instead. I agree they're not even that helpful, so I may just remove the printing of the exception entirely.
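A sketch of the "write details to the celery log" option; the helper name and logger setup are hypothetical and only illustrate the split between what gets logged server-side and what gets raised to the client:

```python
import json
import logging

import requests
from requests.exceptions import Timeout

logger = logging.getLogger(__name__)


def post_to_geop(geop_url, data):
    """Hypothetical helper: log the detailed Timeout, surface a generic error."""
    try:
        return requests.post(geop_url,
                             data=json.dumps(data),
                             headers={'Content-Type': 'application/json'},
                             timeout=200)
    except Timeout as exc:
        # The full exception text (internal hosts and ports) stays in the worker log.
        logger.error('Geoprocessing request timed out: %s', exc)
        # Only the qualitative description propagates to the job table and client.
        raise Exception('Geoprocessing service timed out')
```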
300add8 to 6af8e2f
Changes look good. I played around with the REQUEST timeout and it was able to recover from Geop timeout errors and continue receiving tasks.

I did notice, however, that if you set `CELERY_TASK_TIME_LIMIT` to a low value to force a timeout, the process throws in our `save_job_error` method and at that point does not continue to receive new jobs:

```
File "/opt/app/apps/core/tasks.py", line 42, in save_job_error
    .format(e.errno, e.strerror))
AttributeError: 'IntegrityError' object has no attribute 'errno'
```

If I remove the failing line containing `errno`, it seems to pick up again (when not using debugcelery). It may be sufficient to just fix the `save_job_error` method for the timeout error, but we may also want to think about our 5 minute timeout value and 3.5 minute geop timeout value. Would it be beneficial to reduce these?
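The `AttributeError` above comes from formatting `e.errno` and `e.strerror`, which `IntegrityError` doesn't define. A hedged sketch of one defensive variant of that line (the helper name is hypothetical, and this is not the fix ultimately proposed in this thread):

```python
def format_job_error(e):
    # Hypothetical helper: fall back gracefully when an exception (for example
    # an IntegrityError) has no errno/strerror attributes.
    return ('Failed to save job error status. Job will appear hung. '
            '({0}: {1})'.format(getattr(e, 'errno', 'n/a'),
                                getattr(e, 'strerror', str(e))))
```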
src/mmw/mmw/settings/base.py (Outdated)

```diff
@@ -138,6 +138,9 @@ def get_env_setting(setting):
 }
 CELERY_TASK_DEFAULT_EXCHANGE = 'tasks'
 CELERY_TASK_DEFAULT_ROUTING_KEY = "task.%s" % STACK_COLOR
+
+CELERY_TASK_TIME_LIMIT = 300
```
It looks like v4, which we're on now, has no default set for the time limit:
http://docs.celeryproject.org/en/latest/userguide/configuration.html#task-time-limit

So setting this alone may have an impact, although it looks like the prefix should be CELERYD according to
http://docs.celeryproject.org/en/latest/userguide/configuration.html#new-lowercase-settings

The setting you have appears to have an effect when tweaked, so it's likely backwards-compatible.
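One plausible explanation, assuming the project wires Celery to Django settings with the common `namespace='CELERY'` pattern (an assumption, not verified against this repo): with that namespace, `CELERY_TASK_TIME_LIMIT` has its prefix stripped and is lowercased into the Celery 4 setting `task_time_limit`, which would explain why the `CELERY_`-prefixed name takes effect while a `CELERYD_` one would not. A sketch:

```python
from celery import Celery

# App name assumed for illustration.
app = Celery('mmw')

# With namespace='CELERY', Django settings such as CELERY_TASK_TIME_LIMIT are
# mapped onto the new lowercase Celery 4 names (here, task_time_limit).
app.config_from_object('django.conf:settings', namespace='CELERY')
```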
I'll add the `D` at the end to be safe.

That's weird that there's no set default. The error we were seeing on staging spoke of a HardTimeLimit of 300sec being exceeded.
It doesn't appear to work for me if I add the `D`... were you able to set it that way? If not, I'm going to open an issue in celery. Could be a mistake in the docs...
Nope, but definitely works the way you have it set now, so seems good to leave.
Nice, I didn't notice that.

Re: how long the timeouts should be: the geoprocessing service has its own time limit of 59sec. It seems like we could be safe with a little more than that, maybe a task time limit of 100sec and a request timeout of 90? I don't think there's anything that takes longer than geoprocessing.
re: just fixing the `save_job_error` method. I don't think, however, the hanging always has to do with a failure in `save_job_error`.

If we're interested in addressing the `IntegrityError`, something like this:

```diff
diff --git a/src/mmw/apps/core/tasks.py b/src/mmw/apps/core/tasks.py
index 09c90e9..282692a 100644
--- a/src/mmw/apps/core/tasks.py
+++ b/src/mmw/apps/core/tasks.py
@@ -4,6 +4,7 @@ from __future__ import unicode_literals
 from __future__ import absolute_import

 from django.utils.timezone import now
+from django.db import IntegrityError
 from celery import shared_task

 from apps.core.models import Job
@@ -31,10 +32,14 @@ def save_job_error(request, exc, traceback, job_id):
     try:
         job = Job.objects.get(id=job_id)
         job.error = exc
-        job.traceback = traceback
+        job.traceback = traceback or ''
         job.delivered_at = now()
         job.status = 'failed'
         job.save()
+    except IntegrityError as e:
+        logger.error('Failed to save job error status. Job will appear hung. \
+                     Job Id: {0}'.format(job.id))
+        logger.error('IntegrityError: {}'.format(e))
     except Exception as e:
         logger.error('Failed to save job error status. Job will appear hung. \
                      Job Id: {0}'.format(job.id))
```

What the integrity error ends up looking like:
I think the new timeouts are good.
If a task takes longer than the Celery HARD_TASK_TIME_LIMIT default of 300sec, celery enters a non-responsive state. Set a timeout on all HTTP requests we make in tasks so we don't enter this state while waiting for a response. This will also better alert us when there are "silent" errors with the geoprocessing service, i.e. when it is consistently timing out without other errors.
25599ab to 000e5e9
Thanks for all your help with this. Going to merge.
Overview
We've witnessed a state twice on production where the geoprocessing service times out on every response and must be stopped and started to get back into a good state. Because there was no timeout on the geoprocessing task requests, the tasks would take longer than the celery hard task timeout limit (300s). Exceeding this timeout would put the celery service in its own bad state, and it, too, would need a restart (perhaps related: celery/celery#2451).

While we are determining the underlying cause of the geoprocessing failures, add a timeout on the geoprocessing request that is less than the celery hard task timeout (request timeouts are probably best practice anyway; the celery docs even warn specifically about the danger of not putting a timeout on I/O operations in a task). This way a rollbar error will notify us of a bad state instead of an email from our client saying all analyze and model runs time out.
Connects #2738
Demo
Testing Instructions
./scripts/debugcelery.sh