snapshot: don't schedule next snapshot job for a removed volume #8735
When the management server starts, it starts the snapshot scheduler. If a volume snapshot policy exists for a volume that no longer exists, scheduling the next snapshot job can hit an SQL constraint issue, which stops the management server from starting its remaining components and results in HTTP 503 errors. Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
@blueorangutan package |
@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
clgtm
server/src/main/java/com/cloud/storage/snapshot/SnapshotSchedulerImpl.java
What if the volume is removed manually after the snapshot policy is created? It may be better to add a constraint to the table.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@             Coverage Diff              @@
##               4.18    #8735      +/-   ##
============================================
- Coverage     13.16%   13.16%   -0.01%
- Complexity     9203     9205       +2
============================================
  Files          2724     2724
  Lines        258137   258153      +16
  Branches      40235    40236       +1
============================================
- Hits          33989    33987       -2
- Misses       219841   219860      +19
+ Partials       4307     4306       -1
☔ View full report in Codecov by Sentry.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8831 |
I agree, but this code is still good, and anybody could use it in older versions (backport it). I don't think it hurts in any way, even with the constraint. If that works, a cascading delete would be useful in this case ;)
@DaanHoogland @rohityadavcloud |
code lgtm
@blueorangutan package |
@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
Makes sense to add it. |
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8841 |
@weizhouapache I like your idea; could you propose that as a separate PR? For now I've added changes to remove the schedule when the volume is missing. Requesting re-review, cc @weizhouapache @DaanHoogland @sureshanaparti @JoaoJandre
@blueorangutan package
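For readers following along, the behaviour discussed so far (don't schedule the next job for a removed volume, and drop the stale schedule when the volume is missing) can be sketched roughly as below. This is only an illustration under stated assumptions, not the actual change to SnapshotSchedulerImpl.java: the class `SnapshotScheduleSketch`, the `Policy` type, and the in-memory collections standing in for the database tables are all hypothetical names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names come from the CloudStack code base.
public class SnapshotScheduleSketch {

    /** Illustrative stand-in for a snapshot policy row. */
    public static final class Policy {
        public final long id;
        public final long volumeId;

        public Policy(long id, long volumeId) {
            this.id = id;
            this.volumeId = volumeId;
        }
    }

    // In-memory stand-ins for the volume and schedule tables (assumption).
    private final Set<Long> activeVolumeIds = new HashSet<>();
    private final Map<Long, Long> nextRunByPolicyId = new HashMap<>();

    public void addVolume(long volumeId) {
        activeVolumeIds.add(volumeId);
    }

    public void removeVolume(long volumeId) {
        activeVolumeIds.remove(volumeId);
    }

    public boolean hasSchedule(long policyId) {
        return nextRunByPolicyId.containsKey(policyId);
    }

    /**
     * Returns the next run time in epoch millis, or null when nothing was
     * scheduled because the policy or its volume is missing.
     */
    public Long scheduleNext(Policy policy, long nowMillis, long intervalMillis) {
        if (policy == null) {
            return null;
        }
        if (!activeVolumeIds.contains(policy.volumeId)) {
            // Volume is gone: drop any stale schedule instead of failing later.
            nextRunByPolicyId.remove(policy.id);
            return null;
        }
        long nextRun = nowMillis + intervalMillis;
        nextRunByPolicyId.put(policy.id, nextRun);
        return nextRun;
    }
}
```

The point of returning null rather than throwing is that one orphaned policy should not stop the scheduler, and with it the rest of the startup sequence, from proceeding.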
@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
code lgtm
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8928 |
@blueorangutan test |
@weizhouapache a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
[SF] Trillian test result (tid-9467)
CLGTM
clgtm
@JoaoJandre since you're the RM who has called for a freeze, could you merge this? |
I tried to test this manually but was not able to reproduce the issue. @rohityadavcloud, could you share the steps to reproduce this?
This is an edge case that is not easy to reproduce, but it was found in a production customer environment, @JoaoJandre. To reproduce it, you can manually delete from the DB a volume that has a snapshot policy.
@rohityadavcloud I was able to test it by deleting the volume on the DB while the MGMT was down. This really is an edge case, but the PR does not introduce any regressions and is adding a layer of safety to the snapshot policy process, so I don't see any issues in merging. |
…he#8735) * snapshot: don't schedule next snapshot job for a removed volume When management server starts, it starts the snapshot scheduler. In case there is a volume snapshot policy which exists for a volume which does not exist, it can cause SQL constraint issue and cause the management server to break from starting its various components and cause HTTP 503 error. Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com> * remove schedule on missing volume --------- Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>