journal: fix flush by age and in-flight byte tracking #31392
jenkins test make check arm64 |
LGTM
@dillaman Interestingly, the arm64 make check failed in the librbd unittest, though I can't see the exact test because the output is truncated.
In my environment it reproduced when running unittest_librbd with RBD_FEATURES=109. It hung here:
If it does not reproduce in your environment I can try to debug. |
In the logs I see that when it hangs, "already closed or overflowed" is not followed by "JournalRecorder: handle_advance_object_set", so it looks like the "notify_handler" is not fired.
@trociny I was able to hit it, thanks |
Should be fixed now. |
@dillaman still there is an issue. I was observing this crash in a test:
Also, it was hit by rbd_mirror.sh test, for "laggy" image, which is created with
It gets stuck. Interestingly, when I add |
4b9f4b8 to 0d70963
Found a brand new race condition in the code that I guess my batching fixes exposed. I will run through teuthology just to double-check that nothing else shakes out. |
9443f9c to 01cbbea
Use move semantics and RAII to control the locking of the per-object recorder lock. Signed-off-by: Jason Dillaman <dillaman@redhat.com>
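The locking pattern named in this commit message can be illustrated with a minimal sketch. It is not the Ceph code: `ObjectRecorderSketch`, `Locker`, `lock()`, `append()` and `size()` are hypothetical names, and the sketch only assumes that the per-object lock is a `std::mutex` whose ownership travels in a movable `std::unique_lock`.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-object recorder; not the Ceph class.
class ObjectRecorderSketch {
public:
  using Locker = std::unique_lock<std::mutex>;

  // Acquire the per-object lock and return ownership to the caller.
  Locker lock() { return Locker(m_lock); }

  // The caller hands lock ownership in by move. The parameter's destructor
  // releases the mutex on every return path (RAII), so no manual unlock
  // call is needed.
  void append(Locker locker, int entry) {
    assert(locker.owns_lock() && locker.mutex() == &m_lock);
    m_entries.push_back(entry);
  }

  // Takes the lock itself; this would block forever if append() leaked it.
  std::size_t size() {
    Locker locker(m_lock);
    return m_entries.size();
  }

private:
  std::mutex m_lock;
  std::vector<int> m_entries;
};
```

A call such as `recorder.append(recorder.lock(), 42);` makes the hand-off of the lock visible at the call site, which is the main appeal of passing the lock object rather than locking inside the callee.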
The flush by age was always causing an immediate flush due to a backwards comparison. Additionally, the in-flight byte tracker was never decremented which caused premature closure of the journal object. Finally, there was a potential race condition between closing the object and in-flight notification callbacks executing. Now we keep the lock held for both closed and overflow callbacks to prevent the small chance of a race. Fixes: https://tracker.ceph.com/issues/42598 Signed-off-by: Jason Dillaman <dillaman@redhat.com>
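The first two bugs described in this commit message can be sketched in isolation. This is an illustration under stated assumptions, not the actual patch: `should_flush_by_age` and `InFlightBytesSketch` are hypothetical names, and the idea that the byte counter is compared against a fixed limit is an assumption made for the example.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

using Clock = std::chrono::steady_clock;

// Hypothetical helper: flush only once the oldest pending append has waited
// at least flush_age. With the comparison reversed (age <= flush_age), a
// brand-new entry would satisfy the test and be flushed immediately.
inline bool should_flush_by_age(Clock::time_point oldest_append,
                                Clock::time_point now,
                                std::chrono::milliseconds flush_age) {
  return (now - oldest_append) >= flush_age;
}

// Hypothetical in-flight byte counter (the limit is an assumption).
class InFlightBytesSketch {
public:
  explicit InFlightBytesSketch(std::uint64_t limit) : m_limit(limit) {}

  // Called when an append is sent.
  void sent(std::uint64_t bytes) { m_in_flight += bytes; }

  // Called when an append completes. Omitting this decrement makes the
  // counter grow without bound, so the limit is eventually hit even when
  // nothing is actually outstanding.
  void completed(std::uint64_t bytes) {
    assert(m_in_flight >= bytes);
    m_in_flight -= bytes;
  }

  bool over_limit() const { return m_in_flight > m_limit; }
  std::uint64_t in_flight() const { return m_in_flight; }

private:
  std::uint64_t m_limit;
  std::uint64_t m_in_flight = 0;
};
```

In this sketch, sending 80 bytes, completing them, and sending another 80 leaves 80 bytes in flight against a limit of 100. Without the decrement the counter would read 160 and the limit would be reported as exceeded.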
While the old set is being closed, additional IO can be queued up within the old, closed objects while the in-flight IO settles. It's therefore possible that the queued IO that is transferred from the old set to the new set causes an immediate overflow of the new set. Signed-off-by: Jason Dillaman <dillaman@redhat.com>
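The scenario in this commit message can be modeled with a toy sketch. Everything here is hypothetical: `ObjectSetSketch`, its `capacity` field and `transfer()` are illustrative names, entries are reduced to byte counts, and the real journal tracks overflow per object rather than with a single counter.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical model of an object set with an assumed size limit.
struct ObjectSetSketch {
  explicit ObjectSetSketch(std::size_t cap) : capacity(cap) {}

  // Returns false when the entry does not fit, i.e. the set has overflowed.
  bool append(std::size_t bytes) {
    if (used + bytes > capacity) {
      return false;
    }
    used += bytes;
    entries.push_back(bytes);
    return true;
  }

  std::size_t capacity;
  std::size_t used = 0;
  std::vector<std::size_t> entries;
};

// Moves queued entries into new_set in order. Returns true if the new set
// overflowed before the queue was drained; the entries that did not fit stay
// in `queued` so the caller can advance to yet another set.
inline bool transfer(std::deque<std::size_t>& queued,
                     ObjectSetSketch& new_set) {
  while (!queued.empty()) {
    if (!new_set.append(queued.front())) {
      return true;
    }
    queued.pop_front();
  }
  return false;
}
```

The point of the sketch is that the caller has to check the return value of the transfer itself: enough IO can pile up behind the closing set that the replacement set is full before any new append arrives.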
I am pretty confident that I've been chasing an unrelated failure [1] that was introduced into the master branch [2] and is affecting the dynamic features test. It seems to fail consistently on Ubuntu 18 but not on other distros. The failure is related to CRC errors causing the messenger layer to get into a connect/resend/fail CRC/disconnect spin loop. [1] http://pulpito.ceph.com/jdillaman-2019-11-19_16:32:12-rbd-wip-42598-distro-basic-smithi/
Build failures from PR #31672 |
jenkins test make check |
jenkins test make check arm64 |
@dillaman Just FYI (as it is not related to the PR), when testing it locally and running something like
|
@trociny @dillaman ( Referring to https://tracker.ceph.com/issues/42953#change-175846 ) Would a successful nautilus backport of this PR fix the following "make check" failure that we occasionally see in nautilus?
|
Not sure -- but the original issue in that linked tracker was a crash, not a hang so I'd lean towards no. |
The flush by age was always causing an immediate flush due to a
backwards comparison. Additionally, the in-flight byte tracker was
never decremented which caused premature closure of the journal
object. Finally, there was a potential race condition between
closing the object and in-flight notification callbacks executing.
Fixes: https://tracker.ceph.com/issues/42598
Signed-off-by: Jason Dillaman <dillaman@redhat.com>