@rzarzynski
Contributor

Fixes: https://tracker.ceph.com/issues/61504

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@rzarzynski rzarzynski requested a review from a team as a code owner June 6, 2023 17:31
@rzarzynski
Contributor Author

We might not need #51936 after this.

Contributor

@Matan-B Matan-B left a comment

LGTM, let's verify we no longer hit this issue in the suite's rbd api tests before merging.
I scheduled a build in Shaman.

Contributor

@athanatos athanatos left a comment

This also appears to fix a bug where we were calling complete_watcher on all in-progress notifies?
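The difference raised in this comment can be illustrated with a minimal, hypothetical model. It is not crimson's Watch class: WatchModel, ack_completing_all, ack_single and the vector that stands in for complete_watcher() calls are all illustrative assumptions. It contrasts an ack that completes every in-progress notify with one that completes only the acked notify.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical model of a watch tracking in-progress notifies.
// Names are illustrative only and do not match crimson's API.
class WatchModel {
public:
  void start_notify(std::uint64_t notify_id) { in_progress.insert(notify_id); }

  // Problematic shape: an ack for ONE notify completes EVERY in-progress notify.
  void ack_completing_all(std::uint64_t /*notify_id*/) {
    for (auto id : in_progress) {
      completed.push_back(id);  // stands in for a complete_watcher() call
    }
    in_progress.clear();
  }

  // Intended shape: only the acked notify is completed and forgotten.
  void ack_single(std::uint64_t notify_id) {
    if (in_progress.erase(notify_id) == 1) {
      completed.push_back(notify_id);  // stands in for a complete_watcher() call
    }
  }

  bool in_flight(std::uint64_t notify_id) const {
    return in_progress.count(notify_id) != 0;
  }
  const std::vector<std::uint64_t>& completed_ids() const { return completed; }

private:
  std::set<std::uint64_t> in_progress;
  std::vector<std::uint64_t> completed;
};
```

With two notifies in flight, the first variant completes both on a single ack, while the second leaves the unacked notify pending.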

Copy link
Contributor

@Matan-B Matan-B left a comment

[ RUN ] TestLibRBD.QuiesceWatchTimeout is crashing the OSDs.

 0# gsignal in /lib64/libc.so.6
 1# abort in /lib64/libc.so.6
  ..
 4# crimson::osd::Watch::notify_ack(unsigned long, ceph::buffer::v15_2_0::list const&) in ceph-osd

See rbd_api_tests_old_format/rbd_api_tests. I will also try to look at it.
https://pulpito.ceph.com/matan-2023-06-08_06:36:16-crimson-rados-wip-matanb-crimson-testing-bug-61504-distro-crimson-smithi/

Edit: On the positive side, rbd_python_api_tests_old_format and rbd_python_api_tests look good! It seems that this issue was responsible for the recent regression in the rbd tests.

});
logger().info("{} notify_id={}", __func__, notify_id);
const auto it = in_progress_notifies.find(notify_id);
assert(it != std::end(in_progress_notifies));
Contributor

@Matan-B Matan-B Jun 8, 2023

This resolves the TestLibRBD.QuiesceWatchTimeout for me locally, what do you think?

Suggested change
assert(it != std::end(in_progress_notifies));
if (it == std::end(in_progress_notifies)) {
  logger().debug("Watch::notify_ack gid={} cookie={} notify_id={} already completed",
                 get_watcher_gid(),
                 get_cookie(),
                 notify_id);
  return seastar::now();
}

Contributor

Hmm, I don't think we should be getting a notify ack for something that is not in the in_progress_notifies list. Perhaps make it an error() log and debug it after this PR merges?

Contributor Author

Switched to logging an error.
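The pattern agreed on here (log an error for an unknown notify_id and return early instead of asserting) can be sketched as a standalone model. This is a hypothetical sketch, not the merged crimson code: TolerantWatchModel, the std::cerr logging, the unexpected_acks counter and the "ack arrives after the notify timed out" scenario are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <set>

// Hypothetical model (not crimson code): an ack may arrive after the
// notify it refers to has already finished, e.g. because it timed out.
class TolerantWatchModel {
public:
  void start_notify(std::uint64_t notify_id) { in_progress.insert(notify_id); }

  // Models a notify timing out: it is dropped without any ack.
  void timeout_notify(std::uint64_t notify_id) { in_progress.erase(notify_id); }

  // Returns true if the ack matched an in-progress notify. An unknown id is
  // logged as an error and ignored instead of aborting the process as a
  // failed assert() would.
  bool notify_ack(std::uint64_t gid, std::uint64_t cookie,
                  std::uint64_t notify_id) {
    auto it = in_progress.find(notify_id);
    if (it == in_progress.end()) {
      std::cerr << "notify_ack gid=" << gid << " cookie=" << cookie
                << " notify_id=" << notify_id << " not in progress\n";
      ++unexpected_acks;
      return false;
    }
    in_progress.erase(it);
    return true;
  }

  unsigned unexpected_acks = 0;  // lets a caller or test observe the error path

private:
  std::set<std::uint64_t> in_progress;
};
```

Keeping a visible error log (rather than debug level) matches the reviewer's intent: the unexpected ack no longer crashes the OSD, but it stays noticeable so it can be investigated later.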

in_progress_notifies.clear();
return seastar::now();
});
logger().info("{} notify_id={}", __func__, notify_id);
Contributor

nit:

Suggested change
logger().info("{} notify_id={}", __func__, notify_id);
logger().debug("{} gid={} cookie={} notify_id={}",
               __func__, get_watcher_gid(), get_cookie(), notify_id);

Replaced the assert with an error log entry, but ultimately
this commit should be reverted.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Contributor

@Matan-B Matan-B left a comment

I will open a follow-up tracker for the suppressed issue once this is merged.

@Matan-B
Contributor

Matan-B commented Jun 13, 2023

Failures seem unrelated:

7301383, 7301385, 7301394, 7301396: https://tracker.ceph.com/issues/61651 (new tracker)
7301384 https://tracker.ceph.com/issues/59165
7301382 local_shared_foreign_ptr.h:75: Assertion ptr && *ptr failed
7301386 TestInternal.PoolMetadataConfApply freezes

https://pulpito.ceph.com/matan-2023-06-12_17:29:11-crimson-rados-wip-matanb-crimson-testing-bug-61504-v3-distro-crimson-smithi/

Merging based on the rbd_python_api_tests/_old_format passes; there were also instances of rbd_api_tests/_old_format passing.

@Matan-B Matan-B merged commit 3f1e887 into ceph:main Jun 13, 2023
@Matan-B
Contributor

Matan-B commented Jun 13, 2023

Follow-up tracker: https://tracker.ceph.com/issues/61652

3 participants