Silent failing task on large arg size #318

Closed
simonz05 opened this Issue Feb 8, 2011 · 21 comments


simonz05 commented Feb 8, 2011

My test case is a file containing 7000 chars of Lorem Ipsum text. The contents of the file are sent as a single argument to an arbitrary task.

$ wc text.md
103 1058 7000 text.md

Redis is configured as both the result backend and the broker. The case fails with celery v2.2.0 and kombu v1.0.2. I stepped through the code on the client side, from apply_async until the task is published on the 'ae.undeliver' queue. The task is successfully added to the 'ae.undeliver' list, but it never shows up as a unique entry in the backend, as it should.

The same tasks are successfully executed when the argument size is reduced to about a third of the text file. This suggests a hidden limit on argument size, which does not appear to be intentional, since no error is raised.
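
For reference, a minimal sketch of the client-side call (the task name and module are placeholders, not the actual project code):

    from tasks import process_text  # hypothetical task module

    with open("text.md") as f:
        text = f.read()  # ~7000 chars of Lorem Ipsum

    # the message is published and lands on the 'ae.undeliver' list,
    # but the task never executes and no error is raised
    process_text.apply_async((text,))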

Owner

ask commented Feb 11, 2011

Could you please provide a sample project reproducing this issue?

It would save me a lot of time :)

simonz05 commented

https://github.com/simonz05/issue318

I created a small project which reproduces the issue. I've added a readme file which explains the steps to install it and reproduce the problem.

Owner

ask commented Mar 27, 2011

I haven't had the time to look at this properly yet, mainly because I don't have another host around.

I will be installing Ubuntu on VMware soon to track down the issue.

simonz05 commented

Thanks for the update.

I could set up a user for you on two hosts running Ubuntu 10.04 if you are interested. I could even set up my test case again and save you the trouble. Obviously, you might want to be in control of the environment, but it's an open offer, so let me know.

Owner

ask commented Apr 2, 2011

That would be great. Maybe you have Skype? (ask@celeryproject.org)

simonz05 commented Apr 2, 2011

I'll see if I have some time to set up the hosts tomorrow and will contact you by mail when it's done. I'm usually on Skype with my regular email (simonz05@gmail.com), so we can probably arrange something when it suits you.


andymccurdy commented

I encountered a very similar, if not the same, problem today. The problem is in Kombu's pyredis module, specifically setting the Redis socket to non-blocking with sock.setblocking(0).

The socket fails reading the response of BRPOP on line 210 of pyredis.py, raising a ConnectionError, which ends up raising Empty() on line 212. It might be nice to add a log entry when Empty is raised, to give some indication of failure for future debugging.

The error is: "Error while reading from socket: Resource temporarily unavailable". This typically happens when you read from a non-blocking socket before the data you expect has arrived. As simonz05 suggests, it only happens on larger payloads, presumably because the message is broken up into multiple TCP packets. That also explains why it's harder (if not impossible) to reproduce when Redis is running locally, since the transfer doesn't have to travel across the wire.
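
To make that concrete, here is a stripped-down sketch of the failure mode (my illustration, not Kombu's actual code; assumes a Redis server reachable on localhost:6379):

    import errno
    import socket

    sock = socket.create_connection(("localhost", 6379))
    sock.setblocking(0)  # what pyredis.py does to the connection
    sock.sendall("*1\r\n$4\r\nPING\r\n")  # any Redis command works
    try:
        reply = sock.recv(4096)  # may run before the reply has arrived
    except socket.error, exc:
        if exc.errno == errno.EAGAIN:
            # "Resource temporarily unavailable" -- pyredis.py turns this
            # into a ConnectionError and then Empty(), so the message is
            # silently dropped instead of being retried
            print "EAGAIN: reply not ready yet"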

It looks like you're jumping through some hoops to get your event loop to play well with Redis. I'm open to making redis-py play better with eventio if that'd make this easier, though it will probably take a little time to research what's best and to try to maintain backwards compatibility.

Let's discuss in IRC.

Owner

ask commented Apr 6, 2011

@simonz05 and @andymccurdy

Could you try applying this patch to kombu/transport/pyredis.py?

http://pastie.org/1762753

The setblocking calls are actually leftovers from a solution that didn't use select/kqueue/epoll.
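
The transport's event loop already waits for the socket to become readable via eventio (select/kqueue/epoll), so the socket can simply stay in blocking mode. A rough sketch of the resulting read pattern (my paraphrase, not the actual diff):

    import select

    def read_response(sock, timeout):
        # the socket stays blocking; select() provides the timeout that
        # setblocking(0) was (incorrectly) being used to emulate
        ready, _, _ = select.select([sock], [], [], timeout)
        if not ready:
            return None  # nothing arrived in time; caller raises Empty
        return sock.recv(4096)  # safe: select says data is waiting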

Owner

ask commented Apr 6, 2011

I just managed to reproduce this myself by calling this task:

@task
def foo(text):
    print(text)

and calling it with a megabyte of data:

 >>> foo.apply_async(("x" * 1024 * 1024, ))

Removing the setblocking calls solves the issue and doesn't appear to have any other side effects.

Owner

ask commented Apr 6, 2011

While testing I managed to crash Redis though :)

[65478] 06 Apr 13:03:49 - 4 clients connected (0 slaves), 969392 bytes in use
[65478] 06 Apr 13:03:49 - Accepted 127.0.0.1:49892
[65478] 06 Apr 13:03:49 # === ASSERTION FAILED ===
[65478] 06 Apr 13:03:49 # ==> t_list.c:820 'ln != listFirst(clients)' is not true
[65478] 06 Apr 13:03:49 # (forcing SIGSEGV in order to print the stack trace)
[65478] 06 Apr 13:03:49 # ======= Ooops! Redis 2.2.2 got signal: -11- =======
[65478] 06 Apr 13:03:49 # redis_version:2.2.2
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:kqueue
process_id:65478
uptime_in_seconds:1518
uptime_in_days:0
lru_clock:185358
used_cpu_sys:0.51
used_cpu_user:0.48
used_cpu_sys_childrens:0.00
used_cpu_user_childrens:0.00
connected_clients:5
connected_slaves:0
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
used_memory:1284128
used_memory_human:1.22M
used_memory_rss:7507968
mem_fragmentation_ratio:5.85
use_tcmalloc:0
loading:0
aof_enabled:0
changes_since_last_save:93
bgsave_in_progress:0
last_save_time:1302086311
bgrewriteaof_in_progress:0
total_connections_received:152
total_commands_processed:794
expired_keys:0
evicted_keys:0
keyspace_hits:133
keyspace_misses:451
hash_max_zipmap_entries:512
hash_max_zipmap_value:64
pubsub_channels:1
pubsub_patterns:0
vm_enabled:0
role:master
allocation_stats:6=1,8=157,9=399,10=231,11=1,12=369,13=291,14=393,15=380,16=17480,17=152,18=12,19=32,20=6,21=2,22=2,23=93,2
[65478] 06 Apr 13:03:49 # 1 redis-server 0x00000001000294ef 0x0 + 4295136495
[65478] 06 Apr 13:03:49 # 2 ??? 0x00007fff5fbfe420 0x0 + 140734799799328
[65478] 06 Apr 13:03:49 # 3 redis-server 0x0000000100019932 0x0 + 4295072050
[65478] 06 Apr 13:03:49 # 4 redis-server 0x000000010001a0c4 0x0 + 4295073988
[65478] 06 Apr 13:03:49 # 5 redis-server 0x0000000100007921 0x0 + 4294998305
[65478] 06 Apr 13:03:49 # 6 redis-server 0x000000010000f1a8 0x0 + 4295029160
[65478] 06 Apr 13:03:49 # 7 redis-server 0x000000010000f280 0x0 + 4295029376
[65478] 06 Apr 13:03:49 # 8 redis-server 0x0000000100001891 0x0 + 4294973585
[65478] 06 Apr 13:03:49 # 9 redis-server 0x0000000100001b9e 0x0 + 4294974366
[65478] 06 Apr 13:03:49 # 10 redis-server 0x00000001000071b8 0x0 + 4294996408
[65478] 06 Apr 13:03:49 # 11 redis-server 0x00000001000008e4 0x0 + 4294969572
Segmentation fault

simonz05 commented Apr 7, 2011


Hi, I just briefly had some time to test this patch yesterday and didn't see any different behavior, but I assumed I did something wrong. In any case, I'm happy to see you were able to fix the issue with kombu commit 348babd, so thanks for taking the time to look at it.

Simon.

Owner

ask commented Apr 16, 2011

@simonz05 confirms on IRC that the issue has not occurred since applying the patch. Reopen if similar symptoms are experienced.

ask closed this Apr 16, 2011

steeve reopened this Aug 10, 2012

Owner

steeve commented Aug 10, 2012

I'm having the same issue with Celery 3.0.5 and AMQP.
Note that this doesn't happen on localhost, only when the broker is on another machine.

I'm running Mountain Lion, and RabbitMQ 2.8.5 runs on Debian.

Code to reproduce:

from celery import Celery

# placeholder URL; the broker must run on a remote machine to reproduce
BROKER_URL = "amqp://guest:guest@192.168.1.5:5672//"

celery = Celery('tasks', broker=BROKER_URL)
celery.conf.CELERYD_POOL = "solo"

@celery.task
def one(data):
    print "Data size is %d" % len(data)
    return None

if __name__ == "__main__":
    one.delay("A" * 1024 * 90)

one never gets called.

Owner

steeve commented Aug 10, 2012

After investigating, it seems the socket ends up completely out of sync: when the AMQP client tries to read the next frame header, it reads the previous message's payload instead.

I've also tried sending big amounts of data using Kombu alone, and it works fine, so I believe the bug is in Celery.
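
To illustrate why this desync surfaces as the framing error below (a sketch of the invariant, not amqp's actual code): every AMQP 0-9-1 frame must end with the octet 0xCE, so a reader that starts mid-payload trips the end-of-frame check on arbitrary bytes:

    FRAME_END = '\xce'  # every AMQP 0-9-1 frame ends with this octet

    def check_frame_end(octet):
        # if earlier reads left payload bytes in the socket buffer, the
        # byte read here is arbitrary data and the check fails, e.g.
        # "Framing Error, received 0x00 while expecting 0xce"
        if octet != FRAME_END:
            raise ValueError("Framing Error, received 0x%02x "
                             "while expecting 0xce" % ord(octet))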

Owner

steeve commented Aug 10, 2012

I'm also seeing the exact same symptoms as the original report (a ConnectionError, "Resource temporarily unavailable", making Kombu believe the socket is empty when it isn't).

Owner

steeve commented Aug 20, 2012

Okay, this is much worse than I thought. On 3.0.6, if you do:

    for x in xrange(50):
        one.delay("A" * 1024)

It will crash, and in some cases triggers this:

[2012-08-20 18:30:46,406: ERROR/MainProcess] Unrecoverable error: AMQPError('Framing Error, received 0x00 while expecting 0xce', None, None, None, '')
Traceback (most recent call last):
  File "/Users/steeve/projects/celery/celery/worker/__init__.py", line 353, in start
    component.start()
  File "/Users/steeve/projects/celery/celery/worker/consumer.py", line 369, in start
    self.consume_messages()
  File "/Users/steeve/projects/celery/celery/worker/consumer.py", line 453, in consume_messages
    readers[fileno](fileno, event)
  File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 195, in drain_nowait
    self.drain_events(timeout=0)
  File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 191, in drain_events
    return self.transport.drain_events(self.connection, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 84, in drain_events
    return connection.drain_events(**kwargs)
  File "/Users/steeve/projects/instance/lib/python2.7/site-packages/amqp/connection.py", line 264, in drain_events
    chanmap, None, timeout=timeout)
  File "/Users/steeve/projects/instance/lib/python2.7/site-packages/amqp/connection.py", line 322, in _wait_multiple
    channel, method_sig, args, content = read_timeout(timeout)
  File "/Users/steeve/projects/instance/lib/python2.7/site-packages/amqp/connection.py", line 296, in read_timeout
    return self.method_reader.read_method()
  File "/Users/steeve/projects/instance/lib/python2.7/site-packages/amqp/method_framing.py", line 186, in read_method
    raise m
AMQPError: Framing Error, received 0x00 while expecting 0xce

Owner

ask commented Aug 20, 2012

@steeve: I start the worker with:

$ celery worker -l info -b amqplib://

and then:

$ celery shell
>>>  _=[accept_anything.delay('A' * 1024) for i in xrange(50)]

and I have no issues.

I can even do:

>>>  _=[accept_anything.delay('A' * 1024) for i in xrange(10000)]

So there must be something in your environment causing this (a setting, or something else).

Owner

steeve commented Aug 21, 2012

@ask, this doesn't happen on localhost; you have to run the broker on another machine to trigger it.

steeve referenced this issue Aug 21, 2012

Closed

Framing Error #922

Owner

steeve commented Aug 24, 2012

Okay, I've done some more investigating:

tasks.py:

from celery import Celery

celery = Celery()

@celery.task(ignore_result=True)
def big(data):
    return len(data)

In the shell:

$ celery shell -b amqplib://guest:guest@192.168.1.5:5672/celery -A tasks
In [1]: big.delay("A" * 1024 * 1000)
Out[1]: <AsyncResult: 39c5c9a3-d094-4665-a888-c2e906bf8ccb>

This works:

$ celery worker -b amqplib://guest:guest@192.168.1.5:5672/celery -A tasks -l info -P gevent

This doesn't:

$ celery worker -b amqplib://guest:guest@192.168.1.5:5672/celery -A tasks -l info -P solo

It doesn't work with any pool other than gevent, and it also fails if celery.conf.CELERYD_POOL is set after app creation, even when the pool is gevent.
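
In other words (a minimal sketch of the failing pattern; the working runs above select the pool only via -P on the command line):

    from celery import Celery

    celery = Celery()

    # setting the pool here, after the app object is created, breaks
    # delivery of large messages -- even when the pool is gevent
    celery.conf.CELERYD_POOL = "gevent"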

Owner

ask commented Aug 29, 2012

@steeve: your amqplib-related issue has been fixed in kombu 2.4.4 (or amqp 0.9.3 if pyamqp:// is used).

ask closed this Aug 29, 2012

Owner

steeve commented Aug 30, 2012

Yep, I can confirm it's fixed! Great work @ask!
