Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mgr/rbd_support: support scheduling long-running background operations #29054

Merged
merged 11 commits into from Jul 27, 2019

Conversation

dillaman
Copy link

@dillaman dillaman commented Jul 15, 2019

Certain RBD operations might take a long (blocking) amount of time so it would be preferable if we offered a way to schedule these tasks to run in the background. We would expose the ability to manage these background tasks via the MGR's "rbd_support" module via "ceph" CLI commands similar to the following:

ceph rbd task add flatten <image-spec>
ceph rbd task add remove <image-spec>
ceph rbd task add trash remove <image-id-spec>
ceph rbd task add migration execute <image-spec>
ceph rbd task add migration commit <image-spec>
ceph rbd task add migration abort <image-spec>
ceph rbd task cancel <task-id>
ceph rbd task list

The ceph-csi will be one of the first to utilize these new background tasks to offload its workload and support proper re-starting of in-progress events.

@dillaman dillaman changed the title mgr/rbd_support: support scheduling long-running background operations [DNM] mgr/rbd_support: support scheduling long-running background operations Jul 15, 2019
@dillaman dillaman added the DNM label Jul 15, 2019
@rjfd
Copy link
Contributor

rjfd commented Jul 16, 2019

@dillaman FYI, I opened #29048 yesterday to integrate progress mgr module events in the dashboard.

@dillaman dillaman force-pushed the wip-40621 branch 2 times, most recently from 7c1868c to b1bba53 Compare July 19, 2019 19:45
src/pybind/mgr/rbd_support/module.py Outdated Show resolved Hide resolved
src/pybind/mgr/rbd_support/module.py Outdated Show resolved Hide resolved
src/pybind/mgr/rbd_support/module.py Outdated Show resolved Hide resolved
src/pybind/mgr/rbd_support/module.py Show resolved Hide resolved
@dillaman
Copy link
Author

dillaman commented Jul 23, 2019

@rjfd No rush, but when you get a chance, can you review my changes to the progress? I just want to ensure we are on the same page for how to record the events to support the dashboard. Right now, here is an example:

$ ceph progress json
{
  "events": [],
  "completed": [
    {
      "id": "b586a7a8-3100-4819-9a1c-8dbe08bb7b75",
      "message": "Removing image rbd_tasks/50459f17-f529-4d2e-98de-4798d9c547b7",
      "refs": {
        "origin": "rbd_support",
        "action": "remove",
        "pool_name": "rbd_tasks",
        "pool_namespace": "",
        "image_name": "50459f17-f529-4d2e-98de-4798d9c547b7"
      },
      "started_at": 1563900987.7811384,
      "finished_at": 1563900989.6917968,
      "failed": true,
      "failure_message": "Operation canceled"
    },
    {
      "id": "8f144eaf-deb9-46d9-aad4-1f95746ab8d2",
      "message": "Removing image rbd/image",
      "refs": {
        "origin": "rbd_support",
        "action": "remove",
        "pool_name": "rbd",
        "pool_namespace": "",
        "image_name": "image"
      },
      "started_at": 1563901014.1282082,
      "finished_at": 1563901038.7095985
    }
  ]
}

Copy link
Contributor

@rjfd rjfd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

progress module changes look good, thanks!

@dillaman dillaman changed the title [DNM] mgr/rbd_support: support scheduling long-running background operations mgr/rbd_support: support scheduling long-running background operations Jul 24, 2019
@dillaman dillaman removed the DNM label Jul 24, 2019
@dillaman
Copy link
Author

dillaman commented Jul 24, 2019

@rjfd ... and if the dashboard wants to translate the rbd_support progress events to something it knows about, the ev.refs["action"] field now includes the action type [1].

[1] 8774317#diff-0eb9731567425c7d4afb3332c908c4fcR67-R72

@dillaman dillaman force-pushed the wip-40621 branch 2 times, most recently from e9ece2a to a7399bd Compare July 24, 2019 12:46
@rjfd
Copy link
Contributor

rjfd commented Jul 24, 2019

@rjfd ... and if the dashboard wants to translate the rbd_support progress events to something it knows about, the ev.refs["action"] field now includes the action type [1].

[1] 8774317#diff-0eb9731567425c7d4afb3332c908c4fcR67-R72

That's great! This way the problem I had with knowing the "event type" is solved with the pair refs["origin"], refs["action"].

Jason Dillaman added 2 commits July 24, 2019 09:31
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Copy link
Contributor

@trociny trociny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dillaman There are some issues when testing on python2.7.

src/pybind/mgr/rbd_support/module.py Outdated Show resolved Hide resolved
src/pybind/mgr/rbd_support/module.py Outdated Show resolved Hide resolved
src/pybind/mgr/rbd_support/module.py Outdated Show resolved Hide resolved
src/test/pybind/test_rbd.py Outdated Show resolved Hide resolved
@trociny
Copy link
Contributor

trociny commented Jul 25, 2019

And yet another thing noticed. When the test failed for some reason while there were tasks in the mgr in-memory queue, and then the pool was deleted, the task remained int the mgr in-memory queue until it was restarted.

This callback can be used to track progress and also to attempt to cancel
the operation while it's in-progress by returning a negative error code
from the callback.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman added 6 commits July 25, 2019 08:58
This allows the error to be directly caught instead of attempting
to parse the OSError.errno

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Fixes: http://tracker.ceph.com/issues/40621
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
The failed events can also include a failure message to indicate
the reason for the failure.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
This dict will include an origin key fixed to 'rbd_support' as well
as pool and image references.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
@dillaman
Copy link
Author

The new rbd_tasks teuthology test is still failing [1]. Will investigate.

[1] http://pulpito.ceph.com/jdillaman-2019-07-25_10:51:34-rbd-wip-jd-testing-distro-basic-smithi/

The 'ceph' CLI will duplicate commands within teuthology to test
the MONs idempotency. This shouldn't be required for the MGR module,
but we can keep a fixed set of completed tests to handle this
possible command replay.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
@trociny
Copy link
Contributor

trociny commented Jul 26, 2019

And just as note, I suppose this this a bug in progress module, but when displaying a task progress (either with ceph progress or ceph status) the output looks like below:

  progress:
    Removing image test_task/XXXMG (since 00h 00m 54s)
      [========================......]

And this since 00h 00m 54s looks confusing to me. Shouldn't it show the task start time? Or does it mean task duration? Then I would use e.g. for 00h 00m 54s , not since ....

@rjfd
Copy link
Contributor

rjfd commented Jul 26, 2019

And this since 00h 00m 54s looks confusing to me. Shouldn't it show the task start time? Or does it mean task duration? Then I would use e.g. for 00h 00m 54s , not since ....

Yes, it's the duration. And I agree "since ..." is misleading.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
@dillaman
Copy link
Author

I added a tracker ticket [1] for the unrelated progress issue.

[1] http://tracker.ceph.com/issues/40976

Copy link
Contributor

@trociny trociny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@trociny
Copy link
Contributor

trociny commented Jul 26, 2019

Just cosmetic. When remove is canceled, we have in the mgr log:

2019-07-26T14:23:17.338+0300 7f59a97d1700 -1 librbd::TrimRequest: trim encountered an error: (125) Operation canceled
2019-07-26T14:23:17.338+0300 7f59a8fd0700 -1 librbd::image::RemoveRequest: 0x56096bf6f1e0 handle_trim_image: failed to remove some object(s): (125) Operation canceled

It does not look very bad, as I suppose for a user from the error message it is clear that it is not actually an error but just cancel, but I can imagine it could be improved if you think it is worth doing.

@dillaman
Copy link
Author

It does not look very bad, as I suppose for a user from the error message it is clear that it is not actually an error but just cancel, but I can imagine it could be improved if you think it is worth doing.

It could be improved in a later PR I think (throughout since it will affect more than just removal I'd suspect).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants