Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

snapshot: don't schedule next snapshot job for a removed volume #8735

Merged
merged 2 commits into from
Mar 19, 2024

Conversation

rohityadavcloud
Copy link
Member

When management server starts, it starts the snapshot scheduler. In case there is a volume snapshot policy which exists for a volume which does not exist, it can cause SQL constraint issue and cause the management server to break from starting its various components and cause HTTP 503 error.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

When management server starts, it starts the snapshot scheduler. In case
there is a volume snapshot policy which exists for a volume which does
not exist, it can cause SQL constraint issue and cause the management
server to break from starting its various components and cause HTTP 503
error.

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
@rohityadavcloud
Copy link
Member Author

@blueorangutan package

@blueorangutan
Copy link

@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm

@weizhouapache
Copy link
Member

what if the volume is removed manually after snapshot policy is created ?

it may be better to add constraint to the table snapshot_policy, to prevent the volume to be removed manually, if there are snapshot policies linked to it.

Copy link

codecov bot commented Mar 1, 2024

Codecov Report

Attention: Patch coverage is 0% with 4 lines in your changes are missing coverage. Please review.

Project coverage is 13.16%. Comparing base (9bd359a) to head (5ceddcd).
Report is 6 commits behind head on 4.18.

Files Patch % Lines
.../cloud/storage/snapshot/SnapshotSchedulerImpl.java 0.00% 4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               4.18    #8735      +/-   ##
============================================
- Coverage     13.16%   13.16%   -0.01%     
- Complexity     9203     9205       +2     
============================================
  Files          2724     2724              
  Lines        258137   258153      +16     
  Branches      40235    40236       +1     
============================================
- Hits          33989    33987       -2     
- Misses       219841   219860      +19     
+ Partials       4307     4306       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8831

@rohityadavcloud rohityadavcloud marked this pull request as draft March 1, 2024 17:10
@DaanHoogland
Copy link
Contributor

what if the volume is removed manually after snapshot policy is created ?

it may be better to add constraint to the table snapshot_policy, to prevent the volume to be removed manually, if there are snapshot policies linked to it.

I agree, but this code is still good and anybody could use it in older versions (backport it) I don't think it hurts in any way even with the constraint.

If that works a cascading delete would be usefull in this case ;)

@weizhouapache
Copy link
Member

what if the volume is removed manually after snapshot policy is created ?
it may be better to add constraint to the table snapshot_policy, to prevent the volume to be removed manually, if there are snapshot policies linked to it.

I agree, but this code is still good and anybody could use it in older versions (backport it) I don't think it hurts in any way even with the constraint.

If that works a cascading delete would be usefull in this case ;)

@DaanHoogland
yes, with PR, the mgmt server will be started without any issue , even if volume is removed manually from database.

@rohityadavcloud
code lgtm

Copy link
Member

@weizhouapache weizhouapache left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code lgtm

@weizhouapache
Copy link
Member

@blueorangutan package

@blueorangutan
Copy link

@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@rajujith
Copy link
Collaborator

rajujith commented Mar 4, 2024

what if the volume is removed manually after snapshot policy is created ?

it may be better to add constraint to the table snapshot_policy, to prevent the volume to be removed manually, if there are snapshot policies linked to it.

Makes sense to add it.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8841

@rohityadavcloud
Copy link
Member Author

@weizhouapache I like your idea, could you propose that as a separate PR.

For now I've added changes to remove the schedule when volume is missing. Requesting re-review cc @weizhouapache @DaanHoogland @sureshanaparti @JoaoJandre

@blueorangutan package

@blueorangutan
Copy link

@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

Copy link
Member

@weizhouapache weizhouapache left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code lgtm

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8928

@weizhouapache
Copy link
Member

@blueorangutan test

@blueorangutan
Copy link

@weizhouapache a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

[SF] Trillian test result (tid-9467)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 39891 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8735-t9467-kvm-centos7.zip
Smoke tests completed. 109 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_08_arping_in_ssvm Failure 5.18 test_diagnostics.py

Copy link
Contributor

@JoaoJandre JoaoJandre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLGTM

Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm

@rohityadavcloud
Copy link
Member Author

rohityadavcloud commented Mar 18, 2024

@JoaoJandre since you're the RM who has called for a freeze, could you merge this?

@JoaoJandre
Copy link
Contributor

I tried to test this manually but was not able to reproduce the issue, @rohityadavcloud could you share the steps to reproduce this?

@rohityadavcloud
Copy link
Member Author

This is an edge case not easy to reproduce but found in a prod customer env @JoaoJandre. To reproduce you can delete a volume from the db manually which has a snapshot policy.

@JoaoJandre
Copy link
Contributor

This is an edge case not easy to reproduce but found in a prod customer env @JoaoJandre. To reproduce you can delete a volume from the db manually which has a snapshot policy.

@rohityadavcloud I was able to test it by deleting the volume on the DB while the MGMT was down. This really is an edge case, but the PR does not introduce any regressions and is adding a layer of safety to the snapshot policy process, so I don't see any issues in merging.

@JoaoJandre JoaoJandre merged commit 720407b into apache:4.18 Mar 19, 2024
24 of 27 checks passed
@rohityadavcloud rohityadavcloud deleted the snapshot-http-503 branch March 20, 2024 10:18
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Mar 21, 2024
…he#8735)

* snapshot: don't schedule next snapshot job for a removed volume

When management server starts, it starts the snapshot scheduler. In case
there is a volume snapshot policy which exists for a volume which does
not exist, it can cause SQL constraint issue and cause the management
server to break from starting its various components and cause HTTP 503
error.

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

* remove schedule on missing volume

---------

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants