snapshot: don't schedule next snapshot job for a removed volume #8735

rohityadavcloud · 2024-03-01T14:31:16Z

When the management server starts, it starts the snapshot scheduler. If a volume snapshot policy exists for a volume that no longer exists, scheduling the next snapshot job can cause an SQL constraint error, which stops the management server from starting its other components and results in HTTP 503 errors.
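To make the idea concrete, here is a minimal sketch of a "skip when the volume is missing" guard. It is not the code from this PR: `MissingVolumeGuardSketch`, `Policy`, `addVolume` and the in-memory set standing in for the volume table are all hypothetical names chosen for illustration.

```java
import java.util.Date;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

// Simplified, hypothetical model; these are not CloudStack's real classes.
final class MissingVolumeGuardSketch {

    // Stand-in for a snapshot policy row that points at a volume by id.
    static final class Policy {
        final long id;
        final long volumeId;
        final long intervalMillis;

        Policy(long id, long volumeId, long intervalMillis) {
            this.id = id;
            this.volumeId = volumeId;
            this.intervalMillis = intervalMillis;
        }
    }

    // Stand-in for the volume table: ids of volumes that still exist.
    private final Set<Long> existingVolumeIds = new HashSet<>();

    void addVolume(long volumeId) {
        existingVolumeIds.add(volumeId);
    }

    // Returns the next run time, or empty when nothing should be scheduled.
    Optional<Date> scheduleNextSnapshotJob(Policy policy, Date now) {
        if (policy == null) {
            return Optional.empty();
        }
        if (!existingVolumeIds.contains(policy.volumeId)) {
            // The volume is gone: skip instead of writing a schedule entry
            // that would reference a non-existent volume.
            return Optional.empty();
        }
        return Optional.of(new Date(now.getTime() + policy.intervalMillis));
    }

    public static void main(String[] args) {
        MissingVolumeGuardSketch scheduler = new MissingVolumeGuardSketch();
        scheduler.addVolume(1L);
        Date now = new Date(0L);
        // Policy 10 points at an existing volume, policy 11 at a missing one.
        System.out.println(scheduler.scheduleNextSnapshotJob(new Policy(10L, 1L, 3_600_000L), now).isPresent()); // true
        System.out.println(scheduler.scheduleNextSnapshotJob(new Policy(11L, 2L, 3_600_000L), now).isPresent()); // false
    }
}
```

The point of the guard is that one orphaned policy no longer aborts the whole startup path; it is simply skipped.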

Types of changes

Breaking change (fix or feature that would cause existing functionality to change)
New feature (non-breaking change which adds functionality)
Bug fix (non-breaking change which fixes an issue)
Enhancement (improves an existing feature and functionality)
Cleanup (Code refactoring and cleanup, that may add test cases)
build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

Major
Minor

Bug Severity

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

When the management server starts, it starts the snapshot scheduler. If a volume snapshot policy exists for a volume that no longer exists, scheduling the next snapshot job can cause an SQL constraint error, which stops the management server from starting its other components and results in HTTP 503 errors.

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

rohityadavcloud · 2024-03-01T14:32:13Z

@blueorangutan package

blueorangutan · 2024-03-01T14:34:04Z

@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

DaanHoogland

clgtm

server/src/main/java/com/cloud/storage/snapshot/SnapshotSchedulerImpl.java

weizhouapache · 2024-03-01T14:44:57Z

What if the volume is removed manually after the snapshot policy is created?

It may be better to add a constraint to the table snapshot_policy to prevent the volume from being removed manually if there are snapshot policies linked to it.

codecov · 2024-03-01T15:18:59Z

Codecov Report

Attention: Patch coverage is 0%, with 4 lines in your changes missing coverage. Please review.

Project coverage is 13.16%. Comparing base (9bd359a) to head (5ceddcd).
Report is 6 commits behind head on 4.18.

Files	Patch %	Lines
.../cloud/storage/snapshot/SnapshotSchedulerImpl.java	0.00%	4 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               4.18    #8735      +/-   ##
============================================
- Coverage     13.16%   13.16%   -0.01%     
- Complexity     9203     9205       +2     
============================================
  Files          2724     2724              
  Lines        258137   258153      +16     
  Branches      40235    40236       +1     
============================================
- Hits          33989    33987       -2     
- Misses       219841   219860      +19     
+ Partials       4307     4306       -1


blueorangutan · 2024-03-01T15:37:10Z

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8831

DaanHoogland · 2024-03-04T08:12:11Z

> What if the volume is removed manually after the snapshot policy is created?
>
> It may be better to add a constraint to the table snapshot_policy to prevent the volume from being removed manually if there are snapshot policies linked to it.

I agree, but this code is still good, and anybody could backport it to older versions. I don't think it hurts in any way, even with the constraint.

If that works, a cascading delete would be useful in this case ;)
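The two schema-level options raised here behave differently: a constraint that blocks removal versus a delete that cascades to the policies. The following is only an in-memory Java illustration of that difference, not SQL and not CloudStack's schema; `VolumePolicyLinkSketch` and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// In-memory illustration only; a real fix would live in the database schema.
final class VolumePolicyLinkSketch {
    private final Set<Long> volumeIds = new HashSet<>();
    // policy id -> volume id
    private final Map<Long, Long> policyToVolume = new HashMap<>();

    void addVolume(long volumeId) {
        volumeIds.add(volumeId);
    }

    void addPolicy(long policyId, long volumeId) {
        if (!volumeIds.contains(volumeId)) {
            throw new IllegalArgumentException("no such volume: " + volumeId);
        }
        policyToVolume.put(policyId, volumeId);
    }

    // "Restrict" behaviour: refuse to remove a volume that policies still reference.
    void removeVolumeRestrict(long volumeId) {
        if (policyToVolume.containsValue(volumeId)) {
            throw new IllegalStateException("volume " + volumeId + " still has snapshot policies");
        }
        volumeIds.remove(volumeId);
    }

    // "Cascade" behaviour: removing the volume also removes its policies.
    void removeVolumeCascade(long volumeId) {
        policyToVolume.values().removeIf(v -> v == volumeId);
        volumeIds.remove(volumeId);
    }

    boolean hasVolume(long volumeId) {
        return volumeIds.contains(volumeId);
    }

    int policyCount() {
        return policyToVolume.size();
    }
}
```

With the restrict variant the orphaned policy can never appear; with the cascade variant the policy disappears together with the volume. Either way, a scheduler-side check remains harmless as an extra safety net.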

weizhouapache · 2024-03-04T08:27:20Z

> > What if the volume is removed manually after the snapshot policy is created?
> > It may be better to add a constraint to the table snapshot_policy to prevent the volume from being removed manually if there are snapshot policies linked to it.
>
> I agree, but this code is still good, and anybody could backport it to older versions. I don't think it hurts in any way, even with the constraint.
>
> If that works, a cascading delete would be useful in this case ;)

@DaanHoogland
Yes, with this PR the management server will start without any issue, even if the volume is removed manually from the database.

@rohityadavcloud
code lgtm

weizhouapache

code lgtm

weizhouapache · 2024-03-04T08:28:27Z

@blueorangutan package

blueorangutan · 2024-03-04T08:30:04Z

@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

rajujith · 2024-03-04T08:43:27Z

> What if the volume is removed manually after the snapshot policy is created?
>
> It may be better to add a constraint to the table snapshot_policy to prevent the volume from being removed manually if there are snapshot policies linked to it.

Makes sense to add it.

blueorangutan · 2024-03-04T09:27:26Z

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8841

rohityadavcloud · 2024-03-13T17:19:26Z

@weizhouapache I like your idea. Could you propose that as a separate PR?

For now I've added changes to remove the schedule when volume is missing. Requesting re-review cc @weizhouapache @DaanHoogland @sureshanaparti @JoaoJandre
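As a rough illustration of "remove the schedule when the volume is missing", the sketch below walks all policies the way a scheduler might at startup and drops any schedule whose volume has disappeared. It is an assumption-laden model, not the change in this PR: `StaleScheduleCleanupSketch`, `rescheduleAll` and the in-memory maps are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of startup rescheduling; not CloudStack's real code.
final class StaleScheduleCleanupSketch {
    private final Set<Long> existingVolumeIds = new HashSet<>();
    // policy id -> volume id
    private final Map<Long, Long> policyToVolume = new HashMap<>();
    // policy id -> next run time in epoch milliseconds
    private final Map<Long, Long> schedules = new HashMap<>();

    void addVolume(long volumeId) {
        existingVolumeIds.add(volumeId);
    }

    void removeVolume(long volumeId) {
        existingVolumeIds.remove(volumeId);
    }

    void addPolicy(long policyId, long volumeId) {
        policyToVolume.put(policyId, volumeId);
    }

    boolean hasSchedule(long policyId) {
        return schedules.containsKey(policyId);
    }

    // Reschedules every policy and returns the ids of the policies that were
    // skipped (and had any existing schedule removed) because their volume is missing.
    List<Long> rescheduleAll(long nowMillis, long intervalMillis) {
        List<Long> skipped = new ArrayList<>();
        for (Map.Entry<Long, Long> entry : policyToVolume.entrySet()) {
            long policyId = entry.getKey();
            long volumeId = entry.getValue();
            if (!existingVolumeIds.contains(volumeId)) {
                // Drop the stale schedule so it is not picked up again later.
                schedules.remove(policyId);
                skipped.add(policyId);
                continue;
            }
            schedules.put(policyId, nowMillis + intervalMillis);
        }
        return skipped;
    }
}
```

Removing the stale entry, rather than only skipping it, means the same orphaned policy is not retried on every scheduler pass.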

@blueorangutan package

blueorangutan · 2024-03-13T17:20:03Z

@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

weizhouapache

code lgtm

blueorangutan · 2024-03-13T18:34:04Z

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8928

weizhouapache · 2024-03-13T19:52:03Z

@blueorangutan test

blueorangutan · 2024-03-13T19:54:04Z

@weizhouapache a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

blueorangutan · 2024-03-14T07:23:51Z

[SF] Trillian test result (tid-9467)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 39891 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8735-t9467-kvm-centos7.zip
Smoke tests completed. 109 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_08_arping_in_ssvm	`Failure`	5.18	test_diagnostics.py

JoaoJandre

CLGTM

DaanHoogland

clgtm

rohityadavcloud · 2024-03-18T12:30:57Z

@JoaoJandre since you're the RM who has called for a freeze, could you merge this?

JoaoJandre · 2024-03-18T12:35:25Z

I tried to test this manually but was not able to reproduce the issue, @rohityadavcloud could you share the steps to reproduce this?

rohityadavcloud · 2024-03-19T08:20:46Z

This is an edge case that is not easy to reproduce, but it was found in a production customer environment, @JoaoJandre. To reproduce it, manually delete from the database a volume that has a snapshot policy.

JoaoJandre · 2024-03-19T12:01:48Z

> This is an edge case that is not easy to reproduce, but it was found in a production customer environment, @JoaoJandre. To reproduce it, manually delete from the database a volume that has a snapshot policy.

@rohityadavcloud I was able to test it by deleting the volume in the database while the management server was down. This really is an edge case, but the PR does not introduce any regressions and adds a layer of safety to the snapshot policy process, so I don't see any issues in merging.

…he#8735)

* snapshot: don't schedule next snapshot job for a removed volume

When the management server starts, it starts the snapshot scheduler. If a volume snapshot policy exists for a volume that no longer exists, scheduling the next snapshot job can cause an SQL constraint error, which stops the management server from starting its other components and results in HTTP 503 errors.

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

* remove schedule on missing volume

---------

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

rohityadavcloud added this to the 4.18.2.0 milestone Mar 1, 2024

boring-cyborg bot added the component:storage label Mar 1, 2024

rohityadavcloud added type:bug and Severity:Critical and removed component:storage labels Mar 1, 2024

boring-cyborg bot added the component:storage label Mar 1, 2024

rohityadavcloud assigned DaanHoogland and rajujith Mar 1, 2024

rohityadavcloud requested review from DaanHoogland, rajujith, nvazquez and weizhouapache March 1, 2024 14:31

rohityadavcloud unassigned DaanHoogland and rajujith Mar 1, 2024

DaanHoogland approved these changes Mar 1, 2024

server/src/main/java/com/cloud/storage/snapshot/SnapshotSchedulerImpl.java (outdated, resolved)

rohityadavcloud marked this pull request as draft March 1, 2024 17:10

weizhouapache approved these changes Mar 4, 2024

remove schedule on missing volume

5ceddcd

rohityadavcloud requested review from weizhouapache, DaanHoogland and sureshanaparti March 13, 2024 17:17

rohityadavcloud marked this pull request as ready for review March 13, 2024 17:18

rohityadavcloud requested a review from JoaoJandre March 13, 2024 17:19

rohityadavcloud assigned JoaoJandre Mar 13, 2024

weizhouapache approved these changes Mar 13, 2024

rohityadavcloud added the status:ready-for-merge label Mar 14, 2024

JoaoJandre approved these changes Mar 14, 2024

DaanHoogland approved these changes Mar 15, 2024

JoaoJandre merged commit 720407b into apache:4.18 Mar 19, 2024
24 of 27 checks passed

rohityadavcloud deleted the snapshot-http-503 branch March 20, 2024 10:18
