
Enable over provisioning for SharedMountPoint primary storages #8481

Merged 3 commits into apache:4.18 on Feb 9, 2024

Conversation

GutoVeronezi
Contributor

@GutoVeronezi GutoVeronezi commented Jan 9, 2024

Description

ACS does not allow operators to set the configuration storage.overprovisioning.factor for SharedMountPoint primary storages, displaying the following message when one tries to change it:

[screenshot of the error message]

Also, the global setting does not impact SharedMountPoint primary storages, which means that over-provisioning is not applied properly for this storage type.

Furthermore, the metrics present the SharedMountPoint storage as if the global over-provisioning factor were applied; however, in a scenario where the over-provisioning factor is 2, the allocation threshold is 80%, and the metrics show that the storage has 40% allocated, one cannot allocate to it anymore.

This PR intends to enable over-provisioning for SharedMountPoint primary storages.
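
To make the inconsistency concrete, here is a minimal sketch of the arithmetic described above, with hypothetical values (not actual ACS code):

public class OverProvisioningInconsistency {
    public static void main(String[] args) {
        double totalCapacity = 100.0;         // GB, hypothetical physical pool capacity
        double overProvisioningFactor = 2.0;  // storage.overprovisioning.factor
        double allocated = 80.0;              // GB already allocated
        double disableThreshold = 0.80;       // allocation threshold from the scenario

        // What the metrics display: the over-provisioning factor is applied.
        double shownFraction = allocated / (totalCapacity * overProvisioningFactor);
        System.out.printf("Metrics show %.0f%% allocated%n", shownFraction * 100); // 40%

        // What the allocator effectively checked for SharedMountPoint before this
        // fix: the factor was ignored, so the pool already looked full.
        double allocatorFraction = allocated / totalCapacity;
        System.out.printf("Allocator sees %.0f%% allocated%n", allocatorFraction * 100); // 80%
        System.out.println("Pool usable: " + (allocatorFraction < disableThreshold)); // false
    }
}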

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

In an environment where the global over-provisioning factor was 1, the allocation threshold was 80%, and the SharedMountPoint primary storage had 80% allocated, I was able to change the over-provisioning factor of the SharedMountPoint primary storage and allocate a new volume to it, as, with the over-provisioning applied, ACS considered it to be only 40% allocated.
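
For illustration, the numbers in that test work out as follows (a hedged sketch with a hypothetical pool size, not actual ACS code):

public class TestScenarioMath {
    public static void main(String[] args) {
        double totalCapacity = 100.0; // GB, hypothetical pool size
        double allocated = 80.0;      // 80% allocated, as in the test
        double threshold = 0.80;      // allocation threshold

        // With the global factor of 1, the pool is at the threshold: allocation is blocked.
        System.out.println(allocated / (totalCapacity * 1.0) < threshold); // false

        // After raising the pool's over-provisioning factor to 2, ACS considers the
        // pool only 40% allocated, so a new volume can be allocated.
        System.out.println(allocated / (totalCapacity * 2.0) < threshold); // true
    }
}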

@GutoVeronezi
Contributor Author

@blueorangutan package

@blueorangutan

@GutoVeronezi a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.


codecov bot commented Jan 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (76aff0f) 13.12% compared to head (a7179f0) 29.40%.
Report is 8 commits behind head on 4.18.

Additional details and impacted files
@@              Coverage Diff              @@
##               4.18    #8481       +/-   ##
=============================================
+ Coverage     13.12%   29.40%   +16.27%     
- Complexity     9142    35417    +26275     
=============================================
  Files          2720     5348     +2628     
  Lines        257744   414514   +156770     
  Branches      40182    70287    +30105     
=============================================
+ Hits          33839   121883    +88044     
- Misses       219615   276663    +57048     
- Partials       4290    15968    +11678     
Flag                     Coverage Δ
simulator-marvin-tests   24.64% <100.00%> (?)
uitests                   4.39% <ø> (?)
unit-tests               16.48% <100.00%> (?)

Flags with carried forward coverage won't be shown.


@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8257

Contributor

@sureshanaparti sureshanaparti left a comment


code changes lgtm, needs testing with SharedMountPoint pools.

@DaanHoogland
Contributor

Code looks good. I have a worry, though: overprovisioning on storage in general is a dangerous thing. On CPU and memory, any errors due to overprovisioning are likely not to be fatal, but on storage they can easily lead to data loss. When storage is allocated but not used, we can apply overprovisioning, but once parts of it get used, those parts should not be considered in overprovisioning. I don't think our model supports such behaviour. Any ideas, @GutoVeronezi @sureshanaparti? (Sorry, I don't know what the wisdom is in respect to this.)

@JoaoJandre
Contributor

@GutoVeronezi could you target 4.18.2?

Contributor

@JoaoJandre JoaoJandre left a comment


CLGTM, didn't test it

@weizhouapache
Member

Code looks good. I have a worry, though: overprovisioning on storage in general is a dangerous thing. On CPU and memory, any errors due to overprovisioning are likely not to be fatal, but on storage they can easily lead to data loss. When storage is allocated but not used, we can apply overprovisioning, but once parts of it get used, those parts should not be considered in overprovisioning. I don't think our model supports such behaviour. Any ideas, @GutoVeronezi @sureshanaparti? (Sorry, I don't know what the wisdom is in respect to this.)

@DaanHoogland
I believe most CloudStack users use thin provisioning and storage overprovisioning. It has never been a critical issue, and there has been no data loss either. If the storage is almost full, users should consider adding more storage before reaching full disk space.

@GutoVeronezi GutoVeronezi changed the base branch from main to 4.18 January 10, 2024 19:18
@JoaoJandre JoaoJandre added this to the 4.18.2.0 milestone Jan 10, 2024
@GutoVeronezi
Contributor Author

GutoVeronezi commented Jan 10, 2024

@GutoVeronezi could you target 4.18.2?

@JoaoJandre, done.

Code looks good. I have a worry, though: overprovisioning on storage in general is a dangerous thing. On CPU and memory, any errors due to overprovisioning are likely not to be fatal, but on storage they can easily lead to data loss. When storage is allocated but not used, we can apply overprovisioning, but once parts of it get used, those parts should not be considered in overprovisioning. I don't think our model supports such behaviour. Any ideas, @GutoVeronezi @sureshanaparti? (Sorry, I don't know what the wisdom is in respect to this.)

@DaanHoogland
Sure, dealing with over-provisioning without caution can be harmful. Before changing the overprovisioning configurations, operators should understand how the environment is being consumed. Furthermore, it is important to have active monitoring over the environment; this way, it is possible to take actions, like the one @weizhouapache mentioned in #8481 (comment), to avoid problems.

The configuration cluster.storage.capacity.notificationthreshold makes ACS notify the operators when the storage reaches the capacity threshold (for memory and CPU we have similar configurations). I am not sure if we have much more to do from the ACS perspective, but we can think about ways to improve this feature.

@GutoVeronezi
Contributor Author

@blueorangutan package

@blueorangutan

@GutoVeronezi a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@weizhouapache
Member

@GutoVeronezi could you target 4.18.2?

@JoaoJandre, done.

Code looks good. I have a worry, though: overprovisioning on storage in general is a dangerous thing. On CPU and memory, any errors due to overprovisioning are likely not to be fatal, but on storage they can easily lead to data loss. When storage is allocated but not used, we can apply overprovisioning, but once parts of it get used, those parts should not be considered in overprovisioning. I don't think our model supports such behaviour. Any ideas, @GutoVeronezi @sureshanaparti? (Sorry, I don't know what the wisdom is in respect to this.)

@DaanHoogland
Sure, dealing with over-provisioning without caution can be harmful. Before changing the overprovisioning configurations, operators should understand how the environment is being consumed. Furthermore, it is important to have active monitoring over the environment; this way, it is possible to take actions, like the one @weizhouapache mentioned in #8481 (comment), to avoid problems.

The configuration cluster.storage.capacity.notificationthreshold makes ACS notify the operators when the storage reaches the capacity threshold (for memory and CPU we have similar configurations). I am not sure if we have much more to do from the ACS perspective, but we can think about ways to improve this feature.

Yes @GutoVeronezi, there are global settings to (1) notify users, or (2) disable new volume allocation, if allocated capacity or actual used capacity reaches a threshold.

The thresholds are around 75% or 85%, so users have enough time to get more storage or clean the storage.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8280

@DaanHoogland
Contributor

@GutoVeronezi could you target 4.18.2?

@JoaoJandre, done.

Code looks good. I have a worry, though: overprovisioning on storage in general is a dangerous thing. On CPU and memory, any errors due to overprovisioning are likely not to be fatal, but on storage they can easily lead to data loss. When storage is allocated but not used, we can apply overprovisioning, but once parts of it get used, those parts should not be considered in overprovisioning. I don't think our model supports such behaviour. Any ideas, @GutoVeronezi @sureshanaparti? (Sorry, I don't know what the wisdom is in respect to this.)

@DaanHoogland
Sure, dealing with over-provisioning without caution can be harmful. Before changing the overprovisioning configurations, operators should understand how the environment is being consumed. Furthermore, it is important to have active monitoring over the environment; this way, it is possible to take actions, like the one @weizhouapache mentioned in #8481 (comment), to avoid problems.
The configuration cluster.storage.capacity.notificationthreshold makes ACS notify the operators when the storage reaches the capacity threshold (for memory and CPU we have similar configurations). I am not sure if we have much more to do from the ACS perspective, but we can think about ways to improve this feature.

Yes @GutoVeronezi, there are global settings to (1) notify users, or (2) disable new volume allocation, if allocated capacity or actual used capacity reaches a threshold.

The thresholds are around 75% or 85%, so users have enough time to get more storage or clean the storage.

OK, I thought only the overprovisioned threshold would be signalled. I don't know of an extra guard on the actual threshold, but then again, we would expect operators to do their due diligence. Sorry for the over-paranoia.

@weizhouapache
Member

Yes @GutoVeronezi, there are global settings to (1) notify users, or (2) disable new volume allocation, if allocated capacity or actual used capacity reaches a threshold.
The thresholds are around 75% or 85%, so users have enough time to get more storage or clean the storage.

OK, I thought only the overprovisioned threshold would be signalled. I don't know of an extra guard on the actual threshold, but then again, we would expect operators to do their due diligence. Sorry for the over-paranoia.

no worries @DaanHoogland
I think this is a frequently asked question for many users.

mysql> select name,default_value,description from configuration where name like "cluster.storage%" or name like "pool.storage%";
+----------------------------------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| name                                                     | default_value | description                                                                                                                                                      |
+----------------------------------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| cluster.storage.allocated.capacity.notificationthreshold | 0.75          | Percentage (as a value between 0 and 1) of allocated storage utilization above which alerts will be sent about low storage available.                            |
| cluster.storage.capacity.notificationthreshold           | 0.75          | Percentage (as a value between 0 and 1) of storage utilization above which alerts will be sent about low storage available.                                      |
| cluster.storage.operations.exclude                       | false         | Exclude cluster from storage operations                                                                                                                          |
| pool.storage.allocated.capacity.disablethreshold         | 0.85          | Percentage (as a value between 0 and 1) of allocated storage utilization above which allocators will disable using the pool for low allocated storage available. |
| pool.storage.capacity.disablethreshold                   | 0.85          | Percentage (as a value between 0 and 1) of storage utilization above which allocators will disable using the pool for low storage available.                     |
+----------------------------------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
5 rows in set (0.00 sec)
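
As a rough sketch of how these settings interact with an over-provisioning factor (illustrative only, with hypothetical values; not the actual CloudStack implementation):

public class ThresholdGuardSketch {
    // Default values from the configuration table above.
    static final double NOTIFY_THRESHOLD = 0.75;  // cluster.storage.allocated.capacity.notificationthreshold
    static final double DISABLE_THRESHOLD = 0.85; // pool.storage.allocated.capacity.disablethreshold

    public static void main(String[] args) {
        double capacityGb = 1000.0;           // hypothetical pool capacity
        double allocatedGb = 900.0;           // hypothetical allocation
        double overProvisioningFactor = 2.0;  // pool-level factor enabled by this PR

        double fraction = allocatedGb / (capacityGb * overProvisioningFactor); // 0.45

        if (fraction > DISABLE_THRESHOLD) {
            System.out.println("Allocators stop using this pool");
        } else if (fraction > NOTIFY_THRESHOLD) {
            System.out.println("Alert: low storage available");
        } else {
            System.out.println("Pool usable"); // 0.45 is below both thresholds
        }
    }
}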

@GutoVeronezi
Contributor Author

Aside from the third-party testing, do we still need anything else on this one?

cc: @DaanHoogland @weizhouapache

@DaanHoogland
Contributor

I don't know, I didn't test it @GutoVeronezi

@weizhouapache
Member

Aside from the third-party testing, do we still need anything else on this one?

cc: @DaanHoogland @weizhouapache

@GutoVeronezi
code looks ok to me.

@slavkap
can you please review and test it?

@DaanHoogland
Contributor

I don't know, I didn't test it @GutoVeronezi

Sorry, I didn't read very well and answered hastily. Yes @GutoVeronezi, CLGTM.

@GutoVeronezi
Contributor Author

@blueorangutan package

@blueorangutan

@GutoVeronezi a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8591

Contributor

@slavkap slavkap left a comment


@weizhouapache, sorry for the late response!
Unfortunately, I don't have an environment with a shared mount point, but the code LGTM.

@weizhouapache
Member

@weizhouapache, sorry for the late response! Unfortunately, I don't have an environment with a shared mount point, but the code LGTM.

thanks @slavkap

@weizhouapache weizhouapache merged commit 2729ee1 into apache:4.18 Feb 9, 2024
73 checks passed
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Feb 20, 2024
…e#8481)

* Enable over provisioning for SharedMountPoint primary storages

* Fix unit tests

* Fix typos and small adjusts

---------

Co-authored-by: Daniel Augusto Veronezi Salvador <gutoveronezi@apache.org>