Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loss of data/jobs with no possibility of exceptions or logs #228

Open
peterjmit opened this issue Jan 27, 2015 · 12 comments
Open

Loss of data/jobs with no possibility of exceptions or logs #228

peterjmit opened this issue Jan 27, 2015 · 12 comments

Comments

@peterjmit
Copy link

I am working on an application that is processing ~25k jobs per day, and we have started to notice that some jobs are not being processed or queued, more problematic is that these failed (or missing) jobs have no corresponding log entries in Symfony (via https://github.com/michelsalib/BCCResqueBundle), php-fpm, nginx or redis.

After some investigation, I believe this is being caused by the CredisException suppression in Resque_Redis.

Resque_Redis and Credis_Client

Inside Resque_Redis all methods calls are handled magically by __call and supported methods are proxied to Credis_Client::__call.

Credis_Client::_call has a few failure states represented in part by a CredisException, this can happen during connection to redis (Credis_Client::connect) or during the call to the corresponding redis method.

If a CredisException is raised, Resque_Redis::_call will return false, this is less than ideal (as it is harder for users of this library to detect/deal with redis errors). However it is ok as long as all upstream usages of Resque_Redis::_call check for a false return value.

Unfortunately this brings me to my specific issue of losing data in Resque::enqueue. Resque::enqueue calls Resque_Job::createwhich in turn calls Resque::push.

Resque::push()

public static function push($queue, $item)
{
    self::redis()->sadd('queues', $queue);
    $length = self::redis()->rpush('queue:' . $queue, json_encode($item));
    if ($length < 1) {
        return false;
    }
    return true;
}

The variable $length is confusing/misleading here, if there is an exception raised by redis/credis during this call $length is actually false and is indicative of a failure state due to an exception (rather than an incorrect queue length). Again this is ok if the usage of this function checks for a false return value which at this point could indicate an exception, or an erroneously empty queue.

Resque_Job::create()

If we remove the code that deals with job status/input validation we have

public static function create($queue, $class, $args = null, $monitor = false)
{
    $id = md5(uniqid('', true));
    Resque::push($queue, array(
        'class' => $class,
        'args'  => array($args),
        'id'    => $id,
    ));

    return $id;
}

Clearly we fail to check the return value of Resque::push() here. Job::create is the only usage of Resque::push(). This means if we experience any kind of Redis error, or Credis error during Job creation the end user will be none the wiser.

Solutions?

  1. Raise an exception in Resque::push if Resque_Redis::rpush returns false
  2. Remove suppression of CredisException inside Resque_Redis

The only reason I have hesitated to open a pull request, to at the very least remove the exception suppression is that I do not know if it is there for a specific reason/use case for it being there.

Notes.

Any one evaluating the use of this library in their application should be very wary it due to the possibility of silent data loss as of c335bc3 (or v1.2)

@danhunsaker
Copy link
Contributor

There's a patch around someplace to return false instead of the generated job ID if job creation fails, but this project has been stalled for about a year, so it hasn't been pulled in. Would exceptions be better? Perhaps, but they wouldn't be consistent with the majority of how the project interfaces with external code. Either way, at the moment I'd feel better recommending illuminate/queue for queue management (the Redis driver operates very similarly to how PHP-Resque does) if you're looking for something actively maintained.

@peterjmit
Copy link
Author

@danhunsaker thank you for the speedy reply. Do you have contribution rights to this repo or is it only @chrisboulton?

It would be really useful (for future teams/users) to mark this repo as abandoned if possible to avoid others running into this issue with no hope of resolution (save for an existing or new fork)

http://seld.be/notes/composer-1-0-alpha9

@danhunsaker
Copy link
Contributor

It's not abandoned, just yet. Just stalled for a while. I believe Chris was planning to open push access to the repo at some point soon, when he has time to properly review applicants and get the selected individuals up to speed on the direction the project is intended to move forward. At such time, progress will likely leap forward, and PHP-Resque will again become a powerful contender in the queue engine landscape.

@peterjmit
Copy link
Author

Ah ok, thanks for the update. In the meantime is there anything we can do (other than this issue) to make users aware of this issue (and the potential lengthy timeline for a fix).

Users who do not do their due diligence when assessing packages for inclusion in their projects will not likely be aware of the problem here - and when (google) searching for background queuing solutions this lib (and the symfony bundle) are high up if not first in the results.

@hussfelt
Copy link

@peterjmit Nice work! +1! Thanks for sharing, I always wondered where that problem came from... :-)

@chrisboulton
Copy link
Owner

Thanks for spending the time tracking this down - I'm definitely eager to get something in place for this.

I feel like surfacing an exception is the correct solution here - do we want to refactor everything to surface something like Resque_RedisException whenever an exception related to the Redis connection appears?

@danhunsaker
Copy link
Contributor

Unless there's a chance we'd have a non-connection Redis exception to raise. If there is, I'd call it Resque_RedisConnectionException. It might also be good to attempt to reconnect and retry the requested command first, only throwing the exception if that fails.

It would also be good, at that point, to raise exceptions on other errors we currently do not, for consistency's sake. This is a non-BC change, though, so versioning will be affected. Returning false instead of raising an exception would be a decent patch for 1.2.1, while the exception would be better overall, but require tagging under 1.3.

@chrisboulton
Copy link
Owner

Credis doesn't really seem to differentiate between what is connection related and what's just normal error related - looks like CredisException is thrown regardless so we probably can't differentiate either (hence the generic name)

@danhunsaker
Copy link
Contributor

Sounds good, then. :-)

@peterjmit
Copy link
Author

@danhunsaker @chrisboulton We deployed a similar fix (which has typically yet to report any errors since we looked into it) peterjmit@e11af37. Having an exception specific to redis is definitely an improvement

@raajeshnew
Copy link

@peterjmit can you pls clarify if your issue was that there were jobs that was not picked by threads and were you able to fix it. I am facing a similar situation where if i q 30 jobs to be processed in default q some where around 20 - 25 gers picked up and processed and remaining just doesnt get picked up at all and the q state probably is cold

@peterjmit
Copy link
Author

@raajeshnew we have not actually tracked down the source of the issue, and we are still facing problems with losing data - your experience is good to know (and will hopefully help us figure things out!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants