
BROKER_CONNECTION_TIMEOUT not working. #4627

Open
Raghavsalotra opened this issue Mar 29, 2018 · 10 comments

@Raghavsalotra commented Mar 29, 2018

Checklist

  • I am using celery version 4.1.0
  • I have verified that the issue exists against the master branch of Celery.

Steps to reproduce

Make the RabbitMQ broker unreachable, then call Task.delay(). The call should block only for the number of seconds given by BROKER_CONNECTION_TIMEOUT, but instead it gets stuck indefinitely.
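
For illustration, a minimal reproduction sketch along those lines (the module name, broker URL, and example task are assumptions, not taken from the report):

# Sketch: point Celery at a RabbitMQ broker that is not running and call .delay().
from celery import Celery

app = Celery('repro', broker='amqp://guest@localhost:5672//')
app.conf.broker_connection_timeout = 4.0  # expected to bound the connection attempt

@app.task
def add(x, y):
    return x + y

if __name__ == '__main__':
    # With the broker stopped, this should fail after ~4 seconds but hangs instead.
    add.delay(2, 2)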

Expected behavior

The .delay() call should raise a timeout exception if it is unable to connect to the broker.

Actual behavior

It is stuck waiting for an indefinite time, or possibly for the default timeout defined in Kombu, which Celery uses under the hood to connect to the RabbitMQ broker.

@clokep (Contributor) commented Mar 29, 2018

I've never been able to get a real answer on whether this is the expected behavior or not (see #4328 (comment)), but I did document that this happens in #4606.

We were able to fix this by monkey patching the kombu Connection with the changes from celery/kombu@0f4da8d and then setting BROKER_TRANSPORT_OPTIONS to have fewer retries (although the number of retries you want is fully dependent on your application!)

@frankwiles commented Mar 30, 2018

I'm seeing this as well.

@russoz commented Apr 16, 2018

From http://docs.celeryproject.org/en/latest/userguide/configuration.html#std:setting-broker_connection_timeout

broker_connection_timeout
Default: 4.0.

The default timeout in seconds before we give up establishing a connection to the AMQP server. This setting is disabled when using gevent.

broker_connection_retry
Default: Enabled.

Automatically try to re-establish the connection to the AMQP broker if lost.

The time between retries is increased for each retry, and is not exhausted before broker_connection_max_retries is exceeded.

broker_connection_max_retries
Default: 100.

Maximum number of retries before we give up re-establishing a connection to the AMQP broker.

If this is set to 0 or None, we’ll retry forever.

Sorry if my question is too obvious, but have you guys tried to disable broker_connection_retry?
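
For reference, a minimal sketch of what disabling it could look like (app name and broker URL are assumptions):

from celery import Celery

app = Celery('example', broker='amqp://guest@localhost:5672//')
# With retries disabled, a connection failure should surface as an error
# instead of being retried up to broker_connection_max_retries times.
app.conf.broker_connection_retry = False
app.conf.broker_connection_timeout = 4.0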

@drmclean commented Jun 13, 2018

I have also had this exact issue with celery and RabbitMQ, namely:

.apply_async() hangs forever when the broker (RabbitMQ) is down.

The indefinite hang is due to the retry_over_time() function in the Kombu library (kombu/utils/functional.py: retry_over_time()).

retry_over_time() takes a number of arguments:

def retry_over_time(fun, catch, args=[], kwargs={}, errback=None,
                    max_retries=None, interval_start=2, interval_step=2,
                    interval_max=30, callback=None):

max_retries defaults to None, which means infinite retries (hence the infinite hang).

These arguments are set using broker_transport_options in:

kombu/connection.py: Connection.default_channel

    @property
    def default_channel(self):
        """Default channel.

        Created upon access and closed when the connection is closed.

        Note:
            Can be used for automatic channel handling when you only need one
            channel, and also it is the channel implicitly used if
            a connection is passed instead of a channel, to functions that
            require a channel.
        """
        conn_opts = {}
        transport_opts = self.transport_options
        if transport_opts:
            if 'max_retries' in transport_opts:
                conn_opts['max_retries'] = transport_opts['max_retries']
            if 'interval_start' in transport_opts:
                conn_opts['interval_start'] = transport_opts['interval_start']
            if 'interval_step' in transport_opts:
                conn_opts['interval_step'] = transport_opts['interval_step']
            if 'interval_max' in transport_opts:
                conn_opts['interval_max'] = transport_opts['interval_max']

        # make sure we're still connected, and if not refresh.
        self.ensure_connection(**conn_opts)
        if self._default_channel is None:
            self._default_channel = self.channel()
        return self._default_channel

You can fix the indefinite hang by setting broker_transport_options in the celery config.

BROKER_TRANSPORT_OPTIONS = {"max_retries": 3, "interval_start": 0, "interval_step": 0.2, "interval_max": 0.5}

In this example the connection will be retried 3 times, starting immediately, waiting 0.2 seconds longer before each subsequent attempt, and never waiting more than 0.5 seconds between attempts, so the retries happen at:

+0 seconds
(interval = 0 + 0.2 = 0.2 seconds)
+0.2 seconds
(interval = 0.2 + 0.2 = 0.4 seconds)
+0.6 seconds
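
A small sketch of that arithmetic (a simplified model of the schedule described above, not kombu's actual implementation):

def retry_offsets(max_retries, interval_start, interval_step, interval_max):
    # Cumulative times (in seconds) at which each retry is attempted.
    offsets, elapsed, interval = [], 0.0, interval_start
    for _ in range(max_retries):
        offsets.append(round(elapsed, 1))
        interval = min(interval + interval_step, interval_max)
        elapsed += interval
    return offsets

print(retry_offsets(3, 0, 0.2, 0.5))  # [0.0, 0.2, 0.6]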

I think setting broker_transport_options this way is the correct short-term solution for the problem, unless anyone knows of a better fix?

------------------------------------------------------------------------------------

tl;dr
Put this in your celery config to fix the issue immediately:

BROKER_TRANSPORT_OPTIONS = {"max_retries": 3, "interval_start": 0, "interval_step": 0.2, "interval_max": 0.5}
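
With Celery 4's lowercase setting names, the equivalent configuration would look something like this sketch (app name and broker URL are assumptions):

from celery import Celery

app = Celery('example', broker='amqp://guest@localhost:5672//')
app.conf.broker_transport_options = {
    'max_retries': 3,
    'interval_start': 0,
    'interval_step': 0.2,
    'interval_max': 0.5,
}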

@auvipy (Member) commented Jun 14, 2018

Maybe we could close this then? Should the solution be added to the documentation?

@drmclean commented Jun 14, 2018

Yes, but also I think the documentation is actually incorrect (or at least very misleading).

Looking here: http://docs.celeryproject.org/en/latest/userguide/configuration.html#std:setting-task_publish_retry

The docs suggest that you can set the publish retry behaviour using task_publish_retry and task_publish_retry_policy.

The linked page (http://docs.celeryproject.org/en/latest/userguide/calling.html#calling-retry) even explicitly mentions the problems that could occur and the ramifications:

For example, the default policy correlates to:

add.apply_async((2, 2), retry=True, retry_policy={
    'max_retries': 3,
    'interval_start': 0,
    'interval_step': 0.2,
    'interval_max': 0.2,
})
the maximum time spent retrying will be 0.4 seconds. It’s set relatively short by default because a connection failure could lead to a retry pile effect if the broker connection is down – For example, many web server processes waiting to retry, blocking other incoming requests.

But those settings don't get applied and you have to use broker_transport_options instead, unless I misunderstood the intention of those settings?

Perhaps we could alter the docs to be more explicit about which settings affect which connections, and, even better, add an overview page with some friendly documentation that covers the differences between:

broker_transport_options

broker_connection_timeout
broker_connection_max_retries

task_publish_retry
task_publish_retry_policy

etc., and when to use each?

What do you think?

@clokep (Contributor) commented Jun 14, 2018

I tried to improve these a bit in #4606, but I hadn't come across retry_policy before. That still doesn't seem like the best fix, though, since it is meant to be used at task-call time? I guess it depends on whether you want all your tasks to use the same retry policy or not.
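
For comparison, a sketch of the global counterpart to the per-call retry_policy shown in the docs excerpt above (setting names from the docs; per this thread, neither appears to govern the initial broker connection):

from celery import Celery

app = Celery('example', broker='amqp://guest@localhost:5672//')
# Global publish retry policy, applied to every task instead of per call.
app.conf.task_publish_retry = True
app.conf.task_publish_retry_policy = {
    'max_retries': 3,
    'interval_start': 0,
    'interval_step': 0.2,
    'interval_max': 0.2,
}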

@drmclean commented Jun 14, 2018

retry_policy isn't a fix though. It doesn't get applied even though the docs suggest that it does.

I think the docs ought to be updated to explicitly state that task_publish_retry and task_publish_retry_policy don't affect how Celery will attempt to retry (or not) the publishing of new tasks using apply_async().

@auvipy auvipy removed this from the v4.3 milestone Nov 17, 2018

@rsyring commented Jan 29, 2019

I'm running into the problem of infinite connection retries when first starting Celery:

Using a bad port number deliberately for demonstration purposes: amqp://guest@localhost:5679//

$ strace -f -e trace=network pytest -x
strace: Process 11274 attached
...
socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 11
connect(11, {sa_family=AF_INET, sin_port=htons(5679), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
getsockopt(11, SOL_SOCKET, SO_ERROR, [ECONNREFUSED], [4]) = 0
socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 11
connect(11, {sa_family=AF_INET, sin_port=htons(5679), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
getsockopt(11, SOL_SOCKET, SO_ERROR, [ECONNREFUSED], [4]) = 0
socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 11
connect(11, {sa_family=AF_INET, sin_port=htons(5679), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
getsockopt(11, SOL_SOCKET, SO_ERROR, [ECONNREFUSED], [4]) = 0
socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 11
connect(11, {sa_family=AF_INET, sin_port=htons(5679), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
getsockopt(11, SOL_SOCKET, SO_ERROR, [ECONNREFUSED], [4]) = 0

The key to me was seeing "sin_port=htons(5679)" repeated over and over again, indicating a repeated attempt to connect to the broker.

I have the config set:

        'broker_transport_options': {'max_retries': 3}

But it does not seem to apply. The connection keeps retrying for at least 10 minutes (that's the longest I was willing to wait).
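
One way to check whether the retry limits are honoured at all is to pass them straight to Kombu, bypassing the Celery config; a rough diagnostic sketch (reusing the deliberately bad port from above):

from kombu import Connection

# If this raises after a few quick attempts, the limits work at the Kombu level
# and the question becomes how Celery forwards broker_transport_options to it.
with Connection('amqp://guest@localhost:5679//', connect_timeout=4) as conn:
    conn.ensure_connection(max_retries=3, interval_start=0, interval_step=0.2)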

@Jamim (Contributor) commented Feb 6, 2019

Hello @auvipy,
Just a reminder that this bug is currently missing a milestone.

@auvipy auvipy added this to the v5.0.0 milestone Feb 6, 2019
@auvipy auvipy modified the milestones: v5.0.0, 4.7 May 10, 2019
@auvipy auvipy modified the milestones: 4.7, 4.6 Aug 23, 2019
@auvipy auvipy modified the milestones: 4.6, 4.5 Feb 26, 2020