
Conversation

@hughsw (Collaborator) commented Mar 26, 2020

These changes fix the problems reported in #189 and originally in #78. They do so by managing the job id in the Job object itself rather than in Redis. Various simplifications emerge. Note that with these changes, the only purpose of the bq:name:id key in Redis is to support the newestJob field returned by checkHealth().

A couple of tests fail because they rely either on the incrementing numerical properties of the old job-id creation or on the idea that user-provided job ids are special. Whether user-provided ids should be special isn't clear to me, so I haven't tried to fix those tests yet.

@hughsw (Collaborator, Author) commented Mar 26, 2020

I don't understand the test failure involving the stack trace contents.

@hughsw (Collaborator, Author) commented Mar 26, 2020

So the core change of this PR breaks the default id ordering that supports the job-creation-throughput estimation discussed in the documentation for checkHealth(). It also does away with the separation between user-provided ids and default ids.

The separation of user-provided ids from default ids is exercised by the remaining id-based test failure: Queue Health Check should not report the latest job for custom job ids.

AFAICT, keeping user-provided ids separate from the default (automatic) ids is an artifact of the old way of generating ids in Redis rather than a bona fide feature. The documented behavior has changed, and the non-user-provided and ordering semantics of checkHealth().newestJob are no longer supported.

I would like to change the documentation, and change the test to show that user-provided ids do show up in newestJob...

Actually, I would prefer to get rid of the bq:name:id key entirely and have the newestJob field be documented as deprecated and return null, 0, or MAX_SAFE_INTEGER. But I want v2 very badly.

One possible adjustment: we could make checkHealth().newestJob simply a counter of created jobs, with no connection to job ids or to the default-vs-user separation. This would continue to support the obvious use case of seeing how many jobs have been created and estimating throughput. The attribute's name would then be a misnomer; it should be something like .jobCount.

This adjustment would also make explicit that the Redis key bq:name:id is maintained solely for this job-counting throughput estimation. Such semantics can of course be implemented by the client. To me this seems like a lot of code (in Lua, no less) to support a counting feature that can be implemented entirely in the client.
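A client-side counter along these lines could be as small as the following sketch. The names (withJobCounter, getJobCount) are hypothetical, not bee-queue API; the only assumption about the wrapped queue is that it exposes createJob():

```javascript
// Hypothetical client-side job counter: wrap createJob() and count
// calls, giving the throughput-estimation use case without any Redis
// key or Lua support. withJobCounter/getJobCount are illustrative names.
function withJobCounter(queue) {
  let jobCount = 0;
  const originalCreateJob = queue.createJob.bind(queue);
  queue.createJob = (data) => {
    jobCount += 1;                 // count every created job
    return originalCreateJob(data);
  };
  queue.getJobCount = () => jobCount;
  return queue;
}
```

Sampling getJobCount() at two points in time yields a creation-rate estimate, entirely in the client.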

@bradvogel commented
Can you share more about how generating the job id from calling Redis' incr command leads to Bee Queue losing track of jobs?

@hughsw (Collaborator, Author) commented Mar 27, 2020

Please see #189, which reproduces the problem.

Specifically, to your question: Redis increments the id and pushes the job onto the waiting queue atomically. The worker can then pop the job off the waiting queue and finish it before the job's id has been set in the queue's job container (which happens asynchronously with the worker running the job). So when the job finishes, the queue does not emit a succeeded event, because it doesn't yet know about the job.

@bradvogel commented
I see now - thanks. Just to confirm - the only symptom of this is that events won't be emitted for those jobs, right?

I'll let @skeggse or @LewisJEllis review.

@hughsw (Collaborator, Author) commented Mar 28, 2020

Correct. And I believe only the Queue events won't be emitted. That is, the job <message> events do not rely on the Queue knowing about the job, so those events are not affected by the race condition.

BTW, I am considering an alternate approach to solving the problem: make a separate Redis call to get the new id from INCR, as suggested in the analysis in #78. This entails the extra overhead of two Redis calls, but it has some simplicity, and is much less disruptive vis-à-vis the checkHealth().newestJob semantics.
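The two-call alternative might look like the following sketch, under assumed key names (bq:&lt;name&gt;:id, bq:&lt;name&gt;:waiting) and with any Redis client exposing incr/lpush; createJobTwoCalls is an illustrative name, not bee-queue API:

```javascript
// Sketch of the two-call alternative: call 1 reserves the id with INCR,
// the queue registers the job locally, then call 2 pushes it to the
// waiting list -- so no worker can finish the job before the queue
// knows its id. Key names are assumptions for illustration.
async function createJobTwoCalls(queue, redis, data) {
  const id = String(await redis.incr(`bq:${queue.name}:id`)); // call 1: reserve the id
  const job = { id, data };
  queue.jobs.set(id, job);                                    // queue learns the job first
  await redis.lpush(`bq:${queue.name}:waiting`, id);          // call 2: job becomes runnable
  return job;
}
```

The ordering is the whole point: the job is registered locally between the two Redis calls, before it is ever visible to a worker.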

I'm still struggling with intermittent test failures in a Docker environment that suggest there are lurking race conditions in some of the test logic (or conceivably the library).

@hughsw (Collaborator, Author) commented Mar 28, 2020

There may be some Queue assumptions causing test failures for either approach (this PR or a separate INCR call). The existing behavior is that Redis has the saved job in play before the Queue has the job.id in its activeJobs set. The new behavior is that the Queue has the job.id in its activeJobs set before Redis has the job entered into its state. Thus, different constraints apply to what can be done...

There's more for me to learn before I can untangle the implications and formulate a solution...

@hughsw (Collaborator, Author) commented Mar 28, 2020

See also #194, which implements the Docker test harness I've been using and documents two intermittent test failures I've seen.

@hughsw (Collaborator, Author) commented Apr 7, 2020

This proposed fix is obsoleted by #197, so I'm closing it.
