
Conversation

@hughsw (Collaborator) commented Mar 26, 2020

These changes fix the problems reported in #189 and originally in #78. They do so by managing the job id in the Job object itself rather than in Redis. Various simplifications emerge. Note that with these changes, the only purpose of the bq:name:id key in Redis is to support the newestJob field returned by checkHealth().

A couple of tests fail because they rely either on the incrementing numerical properties of the old job-id creation or on the idea that user-provided job ids are special. Whether user-provided ids should be special isn't clear to me, so I haven't tried to fix those tests yet.

@hughsw (Collaborator, Author) commented Mar 26, 2020

I don't understand the test failure involving the stack trace contents.

@hughsw (Collaborator, Author) commented Mar 26, 2020

So the core change of this PR breaks the default id ordering that supports the job-creation-throughput estimation discussed in the documentation for checkHealth(). It also does away with the separation between user-provided ids and default ids.

The separation of user-provided ids from default ids is exercised by the remaining id-based test failure: Queue Health Check should not report the latest job for custom job ids.

AFAICT, keeping user-provided ids separate from the default (automatic) ids is an artifact of the old way of generating ids in Redis rather than a bona fide feature. The documented behavior has changed, and the non-user-provided and ordering semantics of checkHealth().newestJob are no longer supported.

I would like to change the documentation, and change the test to show that user-provided ids do show up in newestJob...

Actually, I would prefer to get rid of the bq:name:id key entirely and have the newestJob field be documented as deprecated and return null, 0, or MAX_SAFE_INTEGER. But I want v2 very badly.

One possible adjustment: we could make checkHealth().newestJob simply a counter of created jobs, with no connection to job ids or to the default-vs-user separation. This would continue to support the obvious use case of seeing how many jobs have been created and estimating throughput. The attribute's name would then be a misnomer; it should be something like .jobCount.

This adjustment would also make explicit that the Redis key bq:name:id is maintained solely for this job-counting throughput estimation. Such semantics can of course be implemented by the client. To me this seems like a lot of code (in Lua, no less) to support a counting feature that can be implemented entirely in the client.
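A client-side counter along these lines could be as small as the following sketch. The names (withJobCounter, getJobCount) are hypothetical, not bee-queue API; the only assumption about the wrapped queue is that it exposes createJob():

```javascript
// Hypothetical client-side job counter: wrap createJob() and count
// calls, giving the throughput-estimation use case without any Redis
// key or Lua support. withJobCounter/getJobCount are illustrative names.
function withJobCounter(queue) {
  let jobCount = 0;
  const originalCreateJob = queue.createJob.bind(queue);
  queue.createJob = (data) => {
    jobCount += 1;                 // count every created job
    return originalCreateJob(data);
  };
  queue.getJobCount = () => jobCount;
  return queue;
}
```

Sampling getJobCount() at two points in time yields a creation-rate estimate, entirely in the client.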

@bradvogel commented
Can you share more about how generating the job id from calling Redis' incr command leads to Bee Queue losing track of jobs?

@hughsw (Collaborator, Author) commented Mar 27, 2020

Please see #189, which reproduces the problem.

Specifically, to your question: Redis increments the id and pushes the job onto the waiting queue atomically. The worker can then pop the job off the waiting queue and finish it before the job's id has been set in the queue's job container (which happens asynchronously with the worker running the job). So when the job finishes, the queue does not emit a succeeded event, because it doesn't yet know about the job.

@bradvogel commented
I see now - thanks. Just to confirm - the only symptom of this is that events won't be emitted for those jobs, right?

I'll let @skeggse or @LewisJEllis review.

@hughsw (Collaborator, Author) commented Mar 28, 2020

Correct. And I believe only the Queue events won't be emitted. That is, the job <message> events do not rely on the Queue knowing about the job, so those events are not affected by the race condition.

BTW, I am considering an alternate approach to solving the problem: make a separate Redis call to get the new id from INCR, as suggested in the analysis in #78. This entails the extra overhead of two Redis calls, but it has some simplicity, and is much less disruptive vis-à-vis the checkHealth().newestJob semantics.
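The two-call alternative might look like the following sketch, under assumed key names (bq:&lt;name&gt;:id, bq:&lt;name&gt;:waiting) and with any Redis client exposing incr/lpush; createJobTwoCalls is an illustrative name, not bee-queue API:

```javascript
// Sketch of the two-call alternative: call 1 reserves the id with INCR,
// the queue registers the job locally, then call 2 pushes it to the
// waiting list -- so no worker can finish the job before the queue
// knows its id. Key names are assumptions for illustration.
async function createJobTwoCalls(queue, redis, data) {
  const id = String(await redis.incr(`bq:${queue.name}:id`)); // call 1: reserve the id
  const job = { id, data };
  queue.jobs.set(id, job);                                    // queue learns the job first
  await redis.lpush(`bq:${queue.name}:waiting`, id);          // call 2: job becomes runnable
  return job;
}
```

The ordering is the whole point: the job is registered locally between the two Redis calls, before it is ever visible to a worker.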

I'm still struggling with intermittent test failures in a Docker environment that suggest there are lurking race conditions in some of the test logic (or conceivably the library).

@hughsw (Collaborator, Author) commented Mar 28, 2020

There may be some Queue assumptions causing test failures for either approach (this PR or a separate INCR call). The existing behavior is that Redis has the saved job in play before the Queue has the job.id in its activeJobs set. The new behavior is that the Queue has the job.id in its activeJobs set before Redis has the job entered into its state. Thus, different constraints apply to what can be done...

There's more for me to learn before I can untangle the implications and formulate a solution...

@hughsw (Collaborator, Author) commented Mar 28, 2020

See also #194, which implements the Docker test harness I've been using and documents two intermittent test failures I've seen.

@hughsw (Collaborator, Author) commented Apr 7, 2020

This proposed fix is obsoleted by #197, so I'm closing it.
