Snapshot removal and storage cleanup logs #8031

Merged
merged 4 commits into from Oct 16, 2023

Conversation

hsato03
Collaborator

@hsato03 hsato03 commented Oct 3, 2023

Description

The snapshot removal and storage cleanup logs contain little information about these processes, which makes troubleshooting difficult.

This PR adds new log messages and rewrites the existing ones to make troubleshooting easier.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

I decreased the value of the storage.cleanup.interval configuration and deleted a snapshot. Then I checked the management server logs.

@harikrishna-patnala
Contributor

@blueorangutan package

@blueorangutan

@harikrishna-patnala a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

Contributor

@harikrishna-patnala harikrishna-patnala left a comment

Code LGTM

@codecov

codecov bot commented Oct 4, 2023

Codecov Report

Merging #8031 (6aa5f52) into main (2e9b3d8) will increase coverage by 0.45%.
Report is 32 commits behind head on main.
The diff coverage is 7.69%.

@@             Coverage Diff              @@
##               main    #8031      +/-   ##
============================================
+ Coverage     28.57%   29.03%   +0.45%     
- Complexity    29784    30281     +497     
============================================
  Files          5100     5101       +1     
  Lines        358565   358728     +163     
  Branches      52316    52353      +37     
============================================
+ Hits         102464   104159    +1695     
+ Misses       241968   240269    -1699     
- Partials      14133    14300     +167     
Flag Coverage Δ
simulator-marvin-tests 24.96% <7.69%> (+0.64%) ⬆️
uitests 4.84% <ø> (-0.02%) ⬇️
unit-tests 14.51% <0.00%> (+<0.01%) ⬆️

Files Coverage Δ
...oudstack/storage/snapshot/SnapshotServiceImpl.java 35.27% <14.28%> (+0.30%) ⬆️
...ain/java/com/cloud/storage/StorageManagerImpl.java 25.91% <6.66%> (-0.26%) ⬇️

... and 221 files with indirect coverage changes

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7222

@@ -364,7 +364,7 @@ protected Void deleteSnapshotCallback(AsyncCallbackDispatcher<SnapshotServiceImp
         SnapshotResult res = null;
         try {
             if (result.isFailed()) {
-                s_logger.debug("delete snapshot failed" + result.getResult());
+                s_logger.debug(String.format("Failed to delete snapshot [%s] due to: [%s].", snapshot.getUuid(), result.getResult()));
Member

When logging a uuid/id, prefix it with [id=. This is how we log it in other places as well.

Contributor

Actually, there is no consensus on that.
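For illustration, the two message styles discussed in this thread can be put side by side. This is only a sketch: the class and method names are hypothetical, and the exact shape of the "[id=" variant is an assumption based on the review comment, not a format taken from the CloudStack code base.

```java
public class SnapshotLogFormat {
    // Style used in this PR's diff: each value is wrapped in square brackets.
    static String bracketStyle(String snapshotUuid, String reason) {
        return String.format("Failed to delete snapshot [%s] due to: [%s].", snapshotUuid, reason);
    }

    // Style suggested in the review (assumed shape): the identifier carries an "id=" prefix.
    static String idPrefixStyle(String snapshotUuid, String reason) {
        return String.format("Failed to delete snapshot [id=%s] due to: [%s].", snapshotUuid, reason);
    }

    public static void main(String[] args) {
        String uuid = "example-uuid"; // hypothetical value, for illustration only
        System.out.println(bracketStyle(uuid, "storage unreachable"));
        System.out.println(idPrefixStyle(uuid, "storage unreachable"));
    }
}
```

Both variants carry the same information; the difference is only whether the bracketed value is labeled.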

s_logger.debug(String.format("Verifying if snapshot [%s] is in destroying state in any image data store.", snapshotUuid));
SnapshotInfo snapshotInfo = snapshotFactory.getSnapshot(snapshot.getId(), DataStoreRole.Image);

if (snapshotInfo != null) {
Member

If snapshot doesn't exist, snapshotInfo will also be null.

Collaborator Author

@hsato03 hsato03 Oct 10, 2023

You are right. However, snapshotInfo only retrieves snapshots for the Image data store role. If a snapshot is in the destroying state in the Primary data store and its entry in the Image data store has already been cleaned up, snapshotInfo will be null and an NPE will be thrown.

Screenshot from 2023-08-09 16-28-32

The snapshots with IDs 18 and 19 were cleaned up in the Image data store but still remain in the destroying state in the Primary data store.
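The situation described above can be sketched in isolation. The example below is a minimal illustration, not CloudStack code: the class, the in-memory map, and the method names are hypothetical stand-ins for the snapshot/data-store lookup, and the state is modeled as a plain string.

```java
import java.util.HashMap;
import java.util.Map;

public class ImageStoreLookupSketch {
    enum Role { PRIMARY, IMAGE }

    // Hypothetical in-memory stand-in for the snapshot/data-store reference table.
    private final Map<String, String> stateByIdAndRole = new HashMap<>();

    void put(long snapshotId, Role role, String state) {
        stateByIdAndRole.put(snapshotId + ":" + role, state);
    }

    // Returns null when the snapshot has no entry for the requested role.
    String findState(long snapshotId, Role role) {
        return stateByIdAndRole.get(snapshotId + ":" + role);
    }

    // Guarded check: a missing Image entry yields false instead of dereferencing null.
    boolean isDestroyingInImageStore(long snapshotId) {
        String state = findState(snapshotId, Role.IMAGE);
        return state != null && state.equals("Destroying");
    }

    public static void main(String[] args) {
        ImageStoreLookupSketch refs = new ImageStoreLookupSketch();
        // Snapshot 18 still exists on primary storage; its Image entry is already gone.
        refs.put(18L, Role.PRIMARY, "Destroying");
        System.out.println(refs.findState(18L, Role.IMAGE));    // prints "null"
        System.out.println(refs.isDestroyingInImageStore(18L)); // prints "false"
    }
}
```

Calling a method directly on the result of the Image-only lookup would throw a NullPointerException for snapshot 18, which is why the null check matters here.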

Member

As I understand it, the code below already takes care of that. We don't need to check snapshot and snapshotInfo separately; a check on snapshotInfo alone should be enough, and the code above (https://github.com/apache/cloudstack/pull/8031/files#diff-cf68e132ad7711cfa9250cda5819c09f51cba9933aa595cd82ad5a02b4edbe64R1319-R1325) can be removed.

    @Override
    public SnapshotInfo getSnapshot(long snapshotId, DataStoreRole role, boolean retrieveAnySnapshotFromVolume) {
        SnapshotVO snapshot = snapshotDao.findById(snapshotId);
        if (snapshot == null) {
            return null;
        }
        SnapshotDataStoreVO snapshotStore = snapshotStoreDao.findBySnapshot(snapshotId, role);
        if (snapshotStore == null) {
            if (!retrieveAnySnapshotFromVolume) {
                return null;
            }
            snapshotStore = snapshotStoreDao.findByVolume(snapshotId, snapshot.getVolumeId(), role);
            if (snapshotStore == null) {
                return null;
            }
        }
        DataStore store = storeMgr.getDataStore(snapshotStore.getDataStoreId(), role);
        SnapshotObject so = SnapshotObject.getSnapshotObject(snapshot, store);
        return so;
    }

Collaborator Author

The code you suggested removing is used to get the snapshot uuid and to log it in places that are outside the scope of snapshotInfo and in cases where snapshotInfo is null.

Contributor

@hsato03, if I understand correctly, you do a DB query for the sole purpose of logging. That seems a stretch; an operator can do this out of band. If it is wanted at all, it should be possible to turn it off, for example behind an if (LOGGER.isDebugEnabled()) {} guard.

Collaborator Author

@hsato03 hsato03 Oct 11, 2023

@DaanHoogland I made the changes to check if debug is enabled. By the way, the error log can no longer use the snapshot uuid.
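The guard discussed in this thread can be illustrated with a small, self-contained sketch. It is not the code merged in this PR: to avoid external dependencies it uses java.util.logging, where `isLoggable(Level.FINE)` plays the role of `isDebugEnabled()`, and `findSnapshotUuid` is a hypothetical stand-in for the DB query.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class DebugGuardSketch {
    private static final Logger LOGGER = Logger.getLogger(DebugGuardSketch.class.getName());

    // Counts how often the (hypothetical) expensive lookup runs.
    static int lookups = 0;

    // Hypothetical stand-in for a DB query that fetches the snapshot UUID.
    static String findSnapshotUuid(long snapshotId) {
        lookups++;
        return "uuid-of-" + snapshotId;
    }

    static void logCleanup(long snapshotId) {
        // Only pay for the lookup when debug-level messages are enabled.
        if (LOGGER.isLoggable(Level.FINE)) {
            LOGGER.fine(String.format(
                    "Verifying if snapshot [%s] is in destroying state in any image data store.",
                    findSnapshotUuid(snapshotId)));
        }
    }

    public static void main(String[] args) {
        LOGGER.setLevel(Level.INFO); // debug-level logging disabled: lookup is skipped
        logCleanup(18L);
        LOGGER.setLevel(Level.FINE); // debug-level logging enabled: lookup runs once
        logCleanup(18L);
        System.out.println("lookups performed: " + lookups);
    }
}
```

The trade-off mentioned in the reply follows directly: anything logged at a higher level, such as an error message, sits outside the guard and so cannot rely on the value fetched inside it.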

hsato03 and others added 2 commits October 10, 2023 17:49
Co-authored-by: dahn <daan.hoogland@gmail.com>
@DaanHoogland
Contributor

@blueorangutan package

@blueorangutan

@DaanHoogland a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@DaanHoogland
Contributor

@blueorangutan test

@blueorangutan

@DaanHoogland a [SF] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-7943)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 44936 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8031-t7943-kvm-centos7.zip
Smoke tests completed. 111 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test                                      Result    Time (s)  Test File
test_08_upgrade_kubernetes_ha_cluster     Failure   628.45    test_kubernetes_clusters.py
test_03_deploy_vm_wrong_checksum          Error     39.60     test_templates.py
test_09_list_templates_download_details   Failure   0.05      test_templates.py

@DaanHoogland
Contributor

@GutoVeronezi @vishesh92 Are you all right with this PR?

Contributor

@GutoVeronezi GutoVeronezi left a comment

CLGTM

@DaanHoogland DaanHoogland merged commit e437d10 into apache:main Oct 16, 2023
23 of 26 checks passed

6 participants