Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle Ceph/RBD snapshot delete #3615

Merged
merged 1 commit into from Dec 8, 2019

Conversation

GabrielBrascher
Copy link
Member

Description

When deleting volume snapshots, only records in the database are deleted, and snapshots are not deleted on the main storage.

Fixes: #3586

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

How Has This Been Tested?

Configuration:

snapshot.backup.to.secondary =false
kvm.snapshot.enabled =true
  1. Create snapshots stored on Ceph/RBD primary storage.
  2. Delete and assert that it has been removed on the RBD pool.

I was able to reproduce the issue #3586. After applying the packages with this fix, snapshots were properly removed from DB and deleted from the RBD storage pool.

@rohityadavcloud
Copy link
Member

@GabrielBrascher I don't have ceph env to run test against, can you describe testing you've done cc @andrijapanicsb
@blueorangutan package

@blueorangutan
Copy link

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-303

@rohityadavcloud
Copy link
Member

@blueorangutan test

@blueorangutan
Copy link

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@GabrielBrascher
Copy link
Member Author

@rhtyd I tested the deletion of a ROOT volume snapshot via UI.

Prior to this implementation, the snapshot DB entry was removed; however, the snapshot kept on the RBD. This happened due to the fact that the StorageSystemSnapshotStrategy was returning CANT_HANDLE, the chosen strategy then was XenserverSnapshotStrategy; the XenServer strategy does not consider KVM + Ceph and did removed the snapshot only on DB.

This fix allows the Ceph strategy (wich extends the StorageSystemSnapshotStrategy) to return StrategyPriority.HIGH for deleting snapshots stored on RBD. Therefore, the deleteSnapshot method of StorageSystemSnapshotStrategy will be executed and the RBD snapshot removed.

In order to verify, one can delete the snapshot. Assert that it was indeed removed on DB and check if the RBD snapshot was deleted on Ceph (rbd snap list <volume id on the Ceph storage pool>).

rbd snap list 4324d448-b6e9-468f-b86c-016bbcc702b3
SNAPID NAME                                    SIZE TIMESTAMP                
    65 5832c1a4-a525-4389-a261-9fdd70b94ca6 205 MiB Fri Sep 27 13:46:13 2019 
    66 dc3a5fea-4613-4bd6-b2fc-c5ae313ca9c2 205 MiB Fri Sep 27 13:46:23 2019 
    67 411dbdb3-b250-482b-851e-6d79ffd331d5 205 MiB Fri Sep 27 14:41:32 2019 
    68 04c5aa37-7f8a-413f-bbb3-1927b34b7f9a 205 MiB Fri Sep 27 14:46:41 2019 
    69 7f174546-7dff-40d2-aba5-db1a46deda65 205 MiB Fri Sep 27 14:46:49 2019 
    72 a253ddca-3ea3-4f46-8a80-4b391a2bc2a0 205 MiB Tue Oct  1 04:39:45 2019 
    81 17da61f7-5a20-4761-9c94-1192053d0aac 205 MiB Wed Oct  2 23:46:39 2019 

All the snapshots above were removed on DB but not on the Ceph. After applying the fix, I was able to delete snapshots propperly.

@andrijapanicsb
Copy link
Contributor

I'll be testing this one @rhtyd @GabrielBrascher . thx for the fix.

@andrijapanicsb
Copy link
Contributor

andrijapanicsb commented Oct 3, 2019

Tested with snapshot.backup.to.secondary=true - since this is also a needed test.

Snaps ARE deleted on Ceph (manual, still waiting for scheduled test result...)
Snaps are NOT deleted on Secondary Storage (manual, still waiting for scheduled test result...)

Picture=1000 words/etc. - please see bellow - multiple garbage collection runs and 0 snaps found for cleanup on SS

image

This was working always fine (not sure when it got broken...)

@GabrielBrascher

@GabrielBrascher
Copy link
Member Author

Thanks for testing @andrijapanicsb. Good that RBD looks fine.

That secondary storage snapshot deletion was also broken then, good catch; not sure when it got broken as well neither where but I have somthing in mind that might be causing this issue. I will be reproducing and debugging it on my test env.

@andrijapanicsb
Copy link
Contributor

sure - thx @GabrielBrascher.

FTR, scheduled snaps are also not removed from SS, but are removed from fine from RBD

@rohityadavcloud
Copy link
Member

Let's do a regression test
@blueorangutan test

@blueorangutan
Copy link

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

Trillian test result (tid-402)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 39219 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3615-t402-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 77 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@andrijapanicsb
Copy link
Contributor

@rhtyd @GabrielBrascher this seems to be an issue with any storage and any? hypervisor (KVM and XS tested) even on 4.11(.2) releases.

Make snap (XS plus NFS), delete snap - snap is left on Primary Storage AND on Secondary Storage.
Same goes for scheduled snap policy. "Primary" store ref row is deleted from the snapshots_store_ref but not really deleted on Primary Store. "image" store ref row is NOT deleted from snapshots_store_ref and image is NOT deleted on Secondary Store (after muiltiple runs of storage clean up thread i.e. storage.cleanup.delay=60, storage.cleanup.delay=60.

Can you guys give this a bit more attention, so it's properly fixed for any kind of Primary Storage?
This is now leaving a pile of garbage everywhere, as it seems. Thx.

/CC @PaulAngus

@GabrielBrascher
Copy link
Member Author

@andrijapanicsb thanks for the feedback. It is a bigger issue than we thought.
I was going to update this PR today adding a commit for CephSnapshotStrategy fixing the issues with secondary storage backup. However, now that you commented I will take a deeper check considering not only KVM+Ceph execution flow.

It looks that the XenserverSnapshotStrategy is aiming to handle multiple cases, not only XenServer (e.g. there is also a check considering RBD cases for that strategy). One approach might be simplifying the different cases by separating the cases (RBD / Xen primary).

@andrijapanicsb
Copy link
Contributor

@GabrielBrascher yes, you are right, that XenserverSnapshotStrategy is indeed handling also KVM stuff.

Listen, let me open a new issue for this if that is OK for you - I want that new one to be a blocker - i.e. people using 4.11.x and 4.13 will now have tons of garbage on their primary and secondary storage.

@andrijapanicsb
Copy link
Contributor

@GabrielBrascher I've opened a new, general issue, #3646 but you feel free to check it out and perhaps work only Ceph here in this PR - up to you - just let's make sure everything is fixed - I'm happy to help with testing.

@rohityadavcloud
Copy link
Member

Is this ready for merging given what it says on the tin @andrijapanicsb @GabrielBrascher ?

@andrijapanicsb
Copy link
Contributor

This one, as tested back then, looks good.
There's another PR for fixing the garbage problem.

@rohityadavcloud rohityadavcloud merged commit 93aad24 into apache:master Dec 8, 2019
ustcweizhou pushed a commit to ustcweizhou/cloudstack that referenced this pull request Feb 28, 2020
When deleting volume snapshots, only records in the database are deleted, and snapshots are not deleted on the main storage.

Fixes: apache#3586
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

rbd snapshot delete failed
4 participants