New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability to limit the maximum number of CPU cores used for queries #291
Comments
Heavy queries could result in the lack of CPU resources for processing the current data ingestion stream. Prevent this by delaying queries' execution until free resources are available for data ingestion. Expose `vm_search_delays_total` metric, which may be used in for alerting when there is no enough CPU resources for data ingestion and/or for executing heavy queries. Updates #291
Heavy queries could result in the lack of CPU resources for processing the current data ingestion stream. Prevent this by delaying queries' execution until free resources are available for data ingestion. Expose `vm_search_delays_total` metric, which may be used in for alerting when there is no enough CPU resources for data ingestion and/or for executing heavy queries. Updates #291
FYI, VictoriaMetrics implements a mechanism for prioritizing data ingestion over querying starting from v1.38.0. |
FYI, the release v1.39.0 should improve further the prioritization of data ingestion over heavy queries. |
…lable - Maintain a separate worker pool per each part type (in-memory, file, big and small). Previously a shared pool was used for merging all the part types. A single merge worker could merge parts with mixed types at once. For example, it could merge simultaneously an in-memory part plus a big file part. Such a merge could take hours for big file part. During the duration of this merge the in-memory part was pinned in memory and couldn't be persisted to disk under the configured -inmemoryDataFlushInterval . Another common issue, which could happen when parts with mixed types are merged, is uncontrolled growth of in-memory parts or small parts when all the merge workers were busy with merging big files. Such growth could lead to significant performance degradataion for queries, since every query needs to check ever growing list of parts. This could also slow down the registration of new time series, since VictoriaMetrics searches for the internal series_id in the indexdb for every new time series. The third issue is graceful shutdown duration, which could be very long when a background merge is running on in-memory parts plus big file parts. This merge couldn't be interrupted, since it merges in-memory parts. A separate pool of merge workers per every part type elegantly resolves both issues: - In-memory parts are merged to file-based parts in a timely manner, since the maximum size of in-memory parts is limited. - Long-running merges for big parts do not block merges for in-memory parts and small parts. - Graceful shutdown duration is now limited by the time needed for flushing in-memory parts to files. Merging for file parts is instantly canceled on graceful shutdown now. - Deprecate -smallMergeConcurrency command-line flag, since the new background merge algorithm should automatically self-tune according to the number of available CPU cores. - Deprecate -finalMergeDelay command-line flag, since it wasn't working correctly. It is better to run forced merge when needed - https://docs.victoriametrics.com/#forced-merge - Tune the number of shards for pending rows and items before the data goes to in-memory parts and becomes visible for search. This improves the maximum data ingestion rate and the maximum rate for registration of new time series. This should reduce the duration of data ingestion slowdown in VictoriaMetrics cluster on e.g. re-routing events, when some of vmstorage nodes become temporarily unavailable. - Prevent from possible "sync: WaitGroup misuse" panic on graceful shutdown. This is a follow-up for fa566c6 . Thanks @misutoth to for the inspiration at #5212 Updates #5190 Updates #3790 Updates #3551 Updates #3337 Updates #3425 Updates #3647 Updates #3641 Updates #648 Updates #291
…lable - Maintain a separate worker pool per each part type (in-memory, file, big and small). Previously a shared pool was used for merging all the part types. A single merge worker could merge parts with mixed types at once. For example, it could merge simultaneously an in-memory part plus a big file part. Such a merge could take hours for big file part. During the duration of this merge the in-memory part was pinned in memory and couldn't be persisted to disk under the configured -inmemoryDataFlushInterval . Another common issue, which could happen when parts with mixed types are merged, is uncontrolled growth of in-memory parts or small parts when all the merge workers were busy with merging big files. Such growth could lead to significant performance degradataion for queries, since every query needs to check ever growing list of parts. This could also slow down the registration of new time series, since VictoriaMetrics searches for the internal series_id in the indexdb for every new time series. The third issue is graceful shutdown duration, which could be very long when a background merge is running on in-memory parts plus big file parts. This merge couldn't be interrupted, since it merges in-memory parts. A separate pool of merge workers per every part type elegantly resolves both issues: - In-memory parts are merged to file-based parts in a timely manner, since the maximum size of in-memory parts is limited. - Long-running merges for big parts do not block merges for in-memory parts and small parts. - Graceful shutdown duration is now limited by the time needed for flushing in-memory parts to files. Merging for file parts is instantly canceled on graceful shutdown now. - Deprecate -smallMergeConcurrency command-line flag, since the new background merge algorithm should automatically self-tune according to the number of available CPU cores. - Deprecate -finalMergeDelay command-line flag, since it wasn't working correctly. It is better to run forced merge when needed - https://docs.victoriametrics.com/#forced-merge - Tune the number of shards for pending rows and items before the data goes to in-memory parts and becomes visible for search. This improves the maximum data ingestion rate and the maximum rate for registration of new time series. This should reduce the duration of data ingestion slowdown in VictoriaMetrics cluster on e.g. re-routing events, when some of vmstorage nodes become temporarily unavailable. - Prevent from possible "sync: WaitGroup misuse" panic on graceful shutdown. This is a follow-up for fa566c6 . Thanks @misutoth to for the inspiration at #5212 Updates #5190 Updates #3790 Updates #3551 Updates #3337 Updates #3425 Updates #3647 Updates #3641 Updates #648 Updates #291
FYI, VictoriaMetrics supports This allows limiting the number of CPU cores, which can be used by a single query. VictoriaMetrics also provides an ability to configure the maximum number of concurrent queries via Closing the feature request as done. |
Is your feature request related to a problem? Please describe.
A single heavy query in VictoriaMetrics can occupy all the available CPU cores. This is good for returning query results ASAP. But this can negatively affect data ingestion pipeline, which may starve for CPU resources while heavy queries are performed. This is seen in production.
Describe the solution you'd like
It would be great adding
-search.maxCPUs
command-line flag for limiting the maximum number of CPUs that can be used by query pipeline. Then we can leave the guaranteed number of CPU cores for data ingestion pipeline. For instance, if a system has 64 CPU cores, then-search.maxCPUs=60
would guarantee that at least 4 CPU cores will be always available for data ingestion path.Describe alternatives you've considered
An alternative is to reduce
-search.maxConcurrentRequests
. This reduces the number of concurrently running queries, leaving more chances for ingestion path to get the required CPU resources. Note that-search.maxConcurrentRequests
doesn't limit the number of CPU cores that can be used by queries, since a single heavy query can take all the available CPU cores as described above.The text was updated successfully, but these errors were encountered: