Loss of data/jobs with no possibility of exceptions or logs #228

peterjmit · 2015-01-27T02:21:37Z

I am working on an application that processes ~25k jobs per day, and we have started to notice that some jobs are not being processed or queued. More problematic is that these failed (or missing) jobs have no corresponding log entries in Symfony (via https://github.com/michelsalib/BCCResqueBundle), php-fpm, nginx or redis.

After some investigation, I believe this is being caused by the CredisException suppression in Resque_Redis.

`Resque_Redis` and `Credis_Client`

Inside Resque_Redis all method calls are handled magically by __call, and supported methods are proxied to Credis_Client::__call.

Credis_Client::__call has a few failure states, represented in part by a CredisException. This can happen while connecting to redis (Credis_Client::connect) or during the call to the corresponding redis method.

If a CredisException is raised, Resque_Redis::__call will return false. This is less than ideal (it makes it harder for users of this library to detect and deal with redis errors), but it is ok as long as all upstream usages of Resque_Redis::__call check for a false return value.
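
To make the suppression pattern concrete, here is a minimal, self-contained sketch. It is not the library's actual code: `FailingClient` and `SuppressingProxy` are hypothetical stand-ins for Credis_Client and Resque_Redis, and a plain RuntimeException stands in for CredisException.

```php
<?php
// Hypothetical stand-in for Credis_Client: every call fails,
// as if Redis were unreachable.
class FailingClient
{
    public function __call($name, $args)
    {
        throw new RuntimeException("connection refused while calling $name");
    }
}

// Hypothetical stand-in for Resque_Redis: proxies every call to the
// driver and swallows any exception the driver raises.
class SuppressingProxy
{
    private $driver;

    public function __construct($driver)
    {
        $this->driver = $driver;
    }

    public function __call($name, $args)
    {
        try {
            return call_user_func_array(array($this->driver, $name), $args);
        } catch (RuntimeException $e) {
            // The error is discarded here; the caller only ever sees false.
            return false;
        }
    }
}

$redis = new SuppressingProxy(new FailingClient());
$result = $redis->rpush('queue:default', '{"class":"SomeJob"}');
var_dump($result); // bool(false)
```

The caller gets no message, no stack trace and no log entry, only a false that is easy to overlook.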

Unfortunately this brings me to my specific issue of losing data in Resque::enqueue. Resque::enqueue calls Resque_Job::create, which in turn calls Resque::push.

Resque::push()

public static function push($queue, $item)
{
    self::redis()->sadd('queues', $queue);
    $length = self::redis()->rpush('queue:' . $queue, json_encode($item));
    if ($length < 1) {
        return false;
    }
    return true;
}

The variable name $length is misleading here: if redis/credis raises an exception during this call, $length is actually false, indicating a failure state due to an exception rather than an incorrect queue length. Again, this is ok if the caller of this function checks for a false return value, which at this point could indicate either an exception or an erroneously empty queue.
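
The reason the `$length < 1` check catches both cases is PHP's loose comparison: when a boolean is compared with an integer, both sides are converted to booleans, so `false < 1` evaluates to true. The sketch below illustrates this; `pushSucceeded` and `describeResult` are hypothetical helpers written for this example, not functions from the library.

```php
<?php
// Hypothetical helper mirroring the check at the end of Resque::push().
function pushSucceeded($length)
{
    if ($length < 1) {
        return false;
    }
    return true;
}

// Hypothetical helper showing how a strict comparison can tell a
// suppressed error (false) apart from an unexpected integer length.
function describeResult($length)
{
    if ($length === false) {
        return 'error';
    }
    return $length < 1 ? 'unexpected length' : 'ok';
}

var_dump(pushSucceeded(3));     // bool(true)
var_dump(pushSucceeded(false)); // bool(false)
var_dump(pushSucceeded(0));     // bool(false)

echo describeResult(false), "\n"; // error
echo describeResult(0), "\n";     // unexpected length
echo describeResult(3), "\n";     // ok
```

Since a successful RPUSH returns the list length after the push, a result below 1 should not normally occur; in practice the false branch is the one that matters.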

Resque_Job::create()

If we remove the code that deals with job status/input validation we have

public static function create($queue, $class, $args = null, $monitor = false)
{
    $id = md5(uniqid('', true));
    Resque::push($queue, array(
        'class' => $class,
        'args'  => array($args),
        'id'    => $id,
    ));

    return $id;
}

Clearly we fail to check the return value of Resque::push() here. Resque_Job::create is the only usage of Resque::push(). This means that if we experience any kind of Redis or Credis error during job creation, the end user will be none the wiser.

Solutions?

Raise an exception in Resque::push if Resque_Redis::rpush returns false
Remove suppression of CredisException inside Resque_Redis
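
As a rough illustration of the first option, the sketch below raises an exception when rpush reports failure. Everything in it is an assumption made for the example: `QueuePushException`, `pushOrFail` and `AlwaysFalseRedis` are hypothetical names, and the stub simply imitates a proxy that has swallowed an error.

```php
<?php
// Hypothetical exception type, introduced only for this sketch.
class QueuePushException extends RuntimeException
{
}

// Variant of push() that refuses to fail silently. $redis is any object
// whose rpush() returns the new list length, or false on failure.
function pushOrFail($redis, $queue, array $item)
{
    $redis->sadd('queues', $queue);
    $length = $redis->rpush('queue:' . $queue, json_encode($item));
    if ($length === false || $length < 1) {
        throw new QueuePushException("Could not push job to queue '$queue'");
    }
    return $length;
}

// Stub that behaves like a proxy that suppressed an exception.
class AlwaysFalseRedis
{
    public function sadd($key, $member)
    {
        return false;
    }

    public function rpush($key, $value)
    {
        return false;
    }
}

$caught = false;
try {
    pushOrFail(new AlwaysFalseRedis(), 'default', array(
        'class' => 'SomeJob',
        'args'  => array(array()),
        'id'    => 'abc123',
    ));
} catch (QueuePushException $e) {
    $caught = true;
    echo $e->getMessage(), "\n"; // Could not push job to queue 'default'
}
```

With this shape, a caller such as Resque_Job::create would no longer return a job ID for a job that was never queued, because the exception interrupts it first.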

The only reason I have hesitated to open a pull request, at the very least to remove the exception suppression, is that I do not know whether it is there for a specific reason or use case.

Notes.

Anyone evaluating the use of this library in their application should be very wary of it due to the possibility of silent data loss as of c335bc3 (or v1.2).

danhunsaker · 2015-01-27T02:36:45Z

There's a patch around someplace to return false instead of the generated job ID if job creation fails, but this project has been stalled for about a year, so it hasn't been pulled in. Would exceptions be better? Perhaps, but they wouldn't be consistent with the majority of how the project interfaces with external code. Either way, at the moment I'd feel better recommending illuminate/queue for queue management (the Redis driver operates very similarly to how PHP-Resque does) if you're looking for something actively maintained.

peterjmit · 2015-01-27T02:43:30Z

@danhunsaker thank you for the speedy reply. Do you have contribution rights to this repo or is it only @chrisboulton?

It would be really useful (for future teams/users) to mark this repo as abandoned if possible to avoid others running into this issue with no hope of resolution (save for an existing or new fork)

http://seld.be/notes/composer-1-0-alpha9

danhunsaker · 2015-01-27T02:47:13Z

It's not abandoned, just yet. Just stalled for a while. I believe Chris was planning to open push access to the repo at some point soon, when he has time to properly review applicants and get the selected individuals up to speed on the direction the project is intended to move forward. At such time, progress will likely leap forward, and PHP-Resque will again become a powerful contender in the queue engine landscape.

peterjmit · 2015-01-27T02:54:10Z

Ah ok, thanks for the update. In the meantime, is there anything we can do (other than this issue) to make users aware of this problem (and the potentially lengthy timeline for a fix)?

Users who do not do their due diligence when assessing packages for inclusion in their projects will not likely be aware of the problem here - and when (google) searching for background queuing solutions this lib (and the symfony bundle) are high up if not first in the results.

hussfelt · 2015-01-27T06:16:06Z

@peterjmit Nice work! +1! Thanks for sharing, I always wondered where that problem came from... :-)

chrisboulton · 2015-02-02T20:24:05Z

Thanks for spending the time tracking this down - I'm definitely eager to get something in place for this.

I feel like surfacing an exception is the correct solution here - do we want to refactor everything to surface something like Resque_RedisException whenever an exception related to the Redis connection appears?

danhunsaker · 2015-02-02T20:27:10Z

Unless there's a chance we'd have a non-connection Redis exception to raise. If there is, I'd call it Resque_RedisConnectionException. It might also be good to attempt to reconnect and retry the requested command first, only throwing the exception if that fails.

It would also be good, at that point, to raise exceptions on other errors we currently do not, for consistency's sake. This is a non-BC change, though, so versioning will be affected. Returning false instead of raising an exception would be a decent patch for 1.2.1, while the exception would be better overall, but require tagging under 1.3.
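
The reconnect-and-retry idea suggested above could look roughly like the following sketch. It is only an illustration under stated assumptions: `callWithRetry` and `RedisUnavailableException` are hypothetical names, the operation and the reconnect step are passed in as closures, and nothing here reflects how the project eventually implemented it.

```php
<?php
// Hypothetical exception type, introduced only for this sketch.
class RedisUnavailableException extends RuntimeException
{
}

// Runs $operation; on failure calls $reconnect and tries again, and only
// throws once $maxAttempts attempts have failed.
function callWithRetry($operation, $reconnect, $maxAttempts = 2)
{
    $lastError = null;
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            return call_user_func($operation);
        } catch (Exception $e) {
            $lastError = $e;
            if ($attempt < $maxAttempts) {
                call_user_func($reconnect);
            }
        }
    }
    // Keep the original error attached as the previous exception.
    throw new RedisUnavailableException(
        'Redis command failed after ' . $maxAttempts . ' attempts',
        0,
        $lastError
    );
}

// Simulated command: fails on the first call, succeeds on the second.
$calls = 0;
$reconnects = 0;
$length = callWithRetry(
    function () use (&$calls) {
        $calls++;
        if ($calls === 1) {
            throw new RuntimeException('connection lost');
        }
        return 1; // pretend rpush returned a list length of 1
    },
    function () use (&$reconnects) {
        $reconnects++;
    }
);

var_dump($length, $calls, $reconnects); // int(1) int(2) int(1)
```

One caveat with retrying a write such as rpush: if the first attempt reached Redis but the reply was lost, the retry could enqueue the same job twice, so this trades silent loss for possible duplicates.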

chrisboulton · 2015-02-02T20:38:38Z

Credis doesn't really seem to differentiate between what is connection related and what's just normal error related - looks like CredisException is thrown regardless so we probably can't differentiate either (hence the generic name)

danhunsaker · 2015-02-02T22:08:06Z

Sounds good, then. :-)

peterjmit · 2015-02-03T03:58:48Z

@danhunsaker @chrisboulton We deployed a similar fix (which has typically yet to report any errors since we looked into it) peterjmit@e11af37. Having an exception specific to redis is definitely an improvement

raajeshnew · 2015-07-21T11:03:35Z

@peterjmit can you please clarify whether your issue was that some jobs were not picked up by threads, and whether you were able to fix it? I am facing a similar situation: if I queue 30 jobs to be processed in the default queue, somewhere around 20-25 get picked up and processed, and the remaining ones just don't get picked up at all, and the queue state is probably cold.

peterjmit · 2015-07-21T14:59:51Z

@raajeshnew we have not actually tracked down the source of the issue, and we are still facing problems with losing data - your experience is good to know (and will hopefully help us figure things out!)

peterjmit mentioned this issue Jan 27, 2015

Warn users of potential data loss michelsalib/BCCResqueBundle#121

Closed

peterjmit mentioned this issue Jan 27, 2015

Possible issue Resque/Redis interaction, jobs not beeing queued/executed #149

Closed

chrisboulton mentioned this issue Feb 2, 2015

Add beforeEnqueue hook #212

Merged

chrisboulton mentioned this issue Feb 2, 2015

Surface Redis exceptions instead of silently returning false #229

Merged

This was referenced Jul 8, 2015

If tripod is unable to submit a job to resque, retry after a backoff period techfromsage/tripod-php#72

Closed

90% - Detects/Retries a failed request to resque techfromsage/tripod-php#79

Merged
