
Enable over provisioning for SharedMountPoint primary storages #8481

Merged 3 commits into apache:4.18 on Feb 9, 2024

Conversation

GutoVeronezi
Contributor

@GutoVeronezi GutoVeronezi commented Jan 9, 2024

Description

ACS does not allow operators to set the configuration storage.overprovisioning.factor for SharedMountPoint primary storages, displaying the following message when one tries to change it:

[screenshot of the error message]

Also, the global setting does not impact SharedMountPoint primary storages, which means that over-provisioning is not applied properly for this storage type.

Furthermore, the metrics present the SharedMountPoint storage as if the global over-provisioning factor were applied; however, in a scenario where the over-provisioning factor is 2, the allocation threshold is 80%, and the metrics show that the storage has 40% allocated, one cannot allocate to it anymore.

This PR intends to enable over-provisioning for SharedMountPoint primary storages.
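
To make the inconsistency concrete, here is a minimal sketch of the arithmetic described above, with hypothetical values (not actual ACS code):

public class OverProvisioningInconsistency {
    public static void main(String[] args) {
        double totalCapacity = 100.0;         // GB, hypothetical physical pool capacity
        double overProvisioningFactor = 2.0;  // storage.overprovisioning.factor
        double allocated = 80.0;              // GB already allocated
        double disableThreshold = 0.80;       // allocation threshold from the scenario

        // What the metrics display: the over-provisioning factor is applied.
        double shownFraction = allocated / (totalCapacity * overProvisioningFactor);
        System.out.printf("Metrics show %.0f%% allocated%n", shownFraction * 100); // 40%

        // What the allocator effectively checked for SharedMountPoint before this
        // fix: the factor was ignored, so the pool already looked full.
        double allocatorFraction = allocated / totalCapacity;
        System.out.printf("Allocator sees %.0f%% allocated%n", allocatorFraction * 100); // 80%
        System.out.println("Pool usable: " + (allocatorFraction < disableThreshold)); // false
    }
}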

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

In an environment where the global over-provisioning factor was 1, the allocation threshold was 80%, and the SharedMountPoint primary storage had 80% allocated, I was able to change the over-provisioning factor of the SharedMountPoint primary storage and allocate a new volume to it, as, with the over-provisioning applied, ACS considered it to be only 40% allocated.
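
For illustration, the numbers in that test work out as follows (a hedged sketch with a hypothetical pool size, not actual ACS code):

public class TestScenarioMath {
    public static void main(String[] args) {
        double totalCapacity = 100.0; // GB, hypothetical pool size
        double allocated = 80.0;      // 80% allocated, as in the test
        double threshold = 0.80;      // allocation threshold

        // With the global factor of 1, the pool is at the threshold: allocation is blocked.
        System.out.println(allocated / (totalCapacity * 1.0) < threshold); // false

        // After raising the pool's over-provisioning factor to 2, ACS considers the
        // pool only 40% allocated, so a new volume can be allocated.
        System.out.println(allocated / (totalCapacity * 2.0) < threshold); // true
    }
}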

@GutoVeronezi
Contributor Author

@blueorangutan package

@blueorangutan

@GutoVeronezi a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.


codecov bot commented Jan 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (76aff0f) 13.12% compared to head (a7179f0) 29.40%.
Report is 8 commits behind head on 4.18.

Additional details and impacted files
@@              Coverage Diff              @@
##               4.18    #8481       +/-   ##
=============================================
+ Coverage     13.12%   29.40%   +16.27%     
- Complexity     9142    35417    +26275     
=============================================
  Files          2720     5348     +2628     
  Lines        257744   414514   +156770     
  Branches      40182    70287    +30105     
=============================================
+ Hits          33839   121883    +88044     
- Misses       219615   276663    +57048     
- Partials       4290    15968    +11678     
Flag                     Coverage Δ
simulator-marvin-tests   24.64% <100.00%> (?)
uitests                   4.39% <ø> (?)
unit-tests               16.48% <100.00%> (?)

Flags with carried forward coverage won't be shown.


@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8257

Contributor

@sureshanaparti sureshanaparti left a comment


code changes lgtm, needs testing with SharedMountPoint pools.

@DaanHoogland
Contributor

Code looks good. I have a worry, though: overprovisioning on storage in general is a dangerous thing. On CPU and memory, any errors due to overprovisioning are likely not to be fatal, but on storage they can easily lead to data loss. When storage is allocated but not used, we can apply overprovisioning, but once parts of it get used, those parts should not be considered in overprovisioning. I don't think our model supports such behaviour. Any ideas, @GutoVeronezi @sureshanaparti? (Sorry, I don't know what the wisdom is in respect to this.)

@JoaoJandre
Contributor

@GutoVeronezi could you target 4.18.2?

Contributor

@JoaoJandre JoaoJandre left a comment


CLGTM, didn't test it

@weizhouapache
Member

Code looks good. I have a worry, though: overprovisioning on storage in general is a dangerous thing. On CPU and memory, any errors due to overprovisioning are likely not to be fatal, but on storage they can easily lead to data loss. When storage is allocated but not used, we can apply overprovisioning, but once parts of it get used, those parts should not be considered in overprovisioning. I don't think our model supports such behaviour. Any ideas, @GutoVeronezi @sureshanaparti? (Sorry, I don't know what the wisdom is in respect to this.)

@DaanHoogland
I believe most CloudStack users use thin provisioning and storage overprovisioning. It has never been a critical issue, and there has been no data loss either. If the storage is almost full, users should consider adding more storage before reaching full disk space.

@GutoVeronezi GutoVeronezi changed the base branch from main to 4.18 January 10, 2024 19:18
@JoaoJandre JoaoJandre added this to the 4.18.2.0 milestone Jan 10, 2024
@GutoVeronezi
Contributor Author

GutoVeronezi commented Jan 10, 2024

@GutoVeronezi could you target 4.18.2?

@JoaoJandre, done.

Code looks good. I have a worry, though: overprovisioning on storage in general is a dangerous thing. On CPU and memory, any errors due to overprovisioning are likely not to be fatal, but on storage they can easily lead to data loss. When storage is allocated but not used, we can apply overprovisioning, but once parts of it get used, those parts should not be considered in overprovisioning. I don't think our model supports such behaviour. Any ideas, @GutoVeronezi @sureshanaparti? (Sorry, I don't know what the wisdom is in respect to this.)

@DaanHoogland
Sure, dealing with over-provisioning without caution can be harmful. Before changing the overprovisioning configurations, operators should understand how the environment is being consumed. Furthermore, it is important to have active monitoring over the environment; this way, it is possible to take actions, like the one @weizhouapache mentioned in #8481 (comment), to avoid problems.

The configuration cluster.storage.capacity.notificationthreshold makes ACS notify the operators when the storage reaches the capacity threshold (for memory and CPU we have similar configurations). I am not sure if we have much more to do from the ACS perspective, but we can think about ways to improve this feature.

@GutoVeronezi
Contributor Author

@blueorangutan package

@blueorangutan

@GutoVeronezi a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@weizhouapache
Member

@GutoVeronezi could you target 4.18.2?

@JoaoJandre, done.

Code looks good. I have a worry, though: overprovisioning on storage in general is a dangerous thing. On CPU and memory, any errors due to overprovisioning are likely not to be fatal, but on storage they can easily lead to data loss. When storage is allocated but not used, we can apply overprovisioning, but once parts of it get used, those parts should not be considered in overprovisioning. I don't think our model supports such behaviour. Any ideas, @GutoVeronezi @sureshanaparti? (Sorry, I don't know what the wisdom is in respect to this.)

@DaanHoogland
Sure, dealing with over-provisioning without caution can be harmful. Before changing the overprovisioning configurations, operators should understand how the environment is being consumed. Furthermore, it is important to have active monitoring over the environment; this way, it is possible to take actions, like the one @weizhouapache mentioned in #8481 (comment), to avoid problems.

The configuration cluster.storage.capacity.notificationthreshold makes ACS notify the operators when the storage reaches the capacity threshold (for memory and CPU we have similar configurations). I am not sure if we have much more to do from the ACS perspective, but we can think about ways to improve this feature.

Yes @GutoVeronezi, there are global settings to (1) notify users, or (2) disable new volume allocation, if allocated capacity or actual used capacity reaches a threshold.

The thresholds are around 75% or 85%, so users have enough time to get more storage or clean the storage.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8280

@DaanHoogland
Contributor

@GutoVeronezi could you target 4.18.2?

@JoaoJandre, done.

Code looks good. I have a worry, though: overprovisioning on storage in general is a dangerous thing. On CPU and memory, any errors due to overprovisioning are likely not to be fatal, but on storage they can easily lead to data loss. When storage is allocated but not used, we can apply overprovisioning, but once parts of it get used, those parts should not be considered in overprovisioning. I don't think our model supports such behaviour. Any ideas, @GutoVeronezi @sureshanaparti? (Sorry, I don't know what the wisdom is in respect to this.)

@DaanHoogland
Sure, dealing with over-provisioning without caution can be harmful. Before changing the overprovisioning configurations, operators should understand how the environment is being consumed. Furthermore, it is important to have active monitoring over the environment; this way, it is possible to take actions, like the one @weizhouapache mentioned in #8481 (comment), to avoid problems.
The configuration cluster.storage.capacity.notificationthreshold makes ACS notify the operators when the storage reaches the capacity threshold (for memory and CPU we have similar configurations). I am not sure if we have much more to do from the ACS perspective, but we can think about ways to improve this feature.

Yes @GutoVeronezi, there are global settings to (1) notify users, or (2) disable new volume allocation, if allocated capacity or actual used capacity reaches a threshold.

The thresholds are around 75% or 85%, so users have enough time to get more storage or clean the storage.

OK, I thought only the overprovisioned threshold would be signalled. I don't know of an extra guard on the actual threshold, but then again, we would expect operators to do their due diligence. Sorry for the over-paranoia.

@weizhouapache
Member

Yes @GutoVeronezi, there are global settings to (1) notify users, or (2) disable new volume allocation, if allocated capacity or actual used capacity reaches a threshold.
The thresholds are around 75% or 85%, so users have enough time to get more storage or clean the storage.

OK, I thought only the overprovisioned threshold would be signalled. I don't know of an extra guard on the actual threshold, but then again, we would expect operators to do their due diligence. Sorry for the over-paranoia.

no worries @DaanHoogland
I think this is a frequently asked question for many users.

mysql> select name,default_value,description from configuration where name like "cluster.storage%" or name like "pool.storage%";
+----------------------------------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| name                                                     | default_value | description                                                                                                                                                      |
+----------------------------------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| cluster.storage.allocated.capacity.notificationthreshold | 0.75          | Percentage (as a value between 0 and 1) of allocated storage utilization above which alerts will be sent about low storage available.                            |
| cluster.storage.capacity.notificationthreshold           | 0.75          | Percentage (as a value between 0 and 1) of storage utilization above which alerts will be sent about low storage available.                                      |
| cluster.storage.operations.exclude                       | false         | Exclude cluster from storage operations                                                                                                                          |
| pool.storage.allocated.capacity.disablethreshold         | 0.85          | Percentage (as a value between 0 and 1) of allocated storage utilization above which allocators will disable using the pool for low allocated storage available. |
| pool.storage.capacity.disablethreshold                   | 0.85          | Percentage (as a value between 0 and 1) of storage utilization above which allocators will disable using the pool for low storage available.                     |
+----------------------------------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
5 rows in set (0.00 sec)
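
As a rough sketch of how these settings interact with an over-provisioning factor (illustrative only, with hypothetical values; not the actual CloudStack implementation):

public class ThresholdGuardSketch {
    // Default values from the configuration table above.
    static final double NOTIFY_THRESHOLD = 0.75;  // cluster.storage.allocated.capacity.notificationthreshold
    static final double DISABLE_THRESHOLD = 0.85; // pool.storage.allocated.capacity.disablethreshold

    public static void main(String[] args) {
        double capacityGb = 1000.0;           // hypothetical pool capacity
        double allocatedGb = 900.0;           // hypothetical allocation
        double overProvisioningFactor = 2.0;  // pool-level factor enabled by this PR

        double fraction = allocatedGb / (capacityGb * overProvisioningFactor); // 0.45

        if (fraction > DISABLE_THRESHOLD) {
            System.out.println("Allocators stop using this pool");
        } else if (fraction > NOTIFY_THRESHOLD) {
            System.out.println("Alert: low storage available");
        } else {
            System.out.println("Pool usable"); // 0.45 is below both thresholds
        }
    }
}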

@GutoVeronezi
Contributor Author

Aside from the third-party testing, do we still need anything else on this one?

cc: @DaanHoogland @weizhouapache

@DaanHoogland
Contributor

I don't know, I didn't test it @GutoVeronezi

@weizhouapache
Member

Aside from the third-party testing, do we still need anything else on this one?

cc: @DaanHoogland @weizhouapache

@GutoVeronezi
code looks ok to me.

@slavkap
can you please review and test it?

@DaanHoogland
Contributor

I don't know, I didn't test it @GutoVeronezi

Sorry, I didn't read very well and answered hastily. Yes @GutoVeronezi, CLGTM.

@GutoVeronezi
Contributor Author

@blueorangutan package

@blueorangutan

@GutoVeronezi a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8591

Contributor

@slavkap slavkap left a comment


@weizhouapache, sorry for the late response!
Unfortunately, I don't have an environment with a shared mount point, but the code LGTM.

@weizhouapache
Member

@weizhouapache, sorry for the late response! Unfortunately, I don't have an environment with a shared mount point, but the code LGTM.

thanks @slavkap

@weizhouapache weizhouapache merged commit 2729ee1 into apache:4.18 Feb 9, 2024
73 checks passed
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Feb 20, 2024
…e#8481)

* Enable over provisioning for SharedMountPoint primary storages

* Fix unit tests

* Fix typos and small adjusts

---------

Co-authored-by: Daniel Augusto Veronezi Salvador <gutoveronezi@apache.org>