Commit
dcache-qos: rework verifier operation handling so most of it is in memory (as in resilience)

Motivation:

In adapting the original resilience code to the qos services, we decided to move away from completely in-memory processing (with lossy to-file checkpointing of the operation map) to using an underlying database to hold the operations. This was done in such a way that (a) operations were held in memory up to a cache capacity limit and were replenished from the database store, and, as a consequence, (b) the state of each operation was updated (written through) to the database during its lifecycle.

In testing the soon-to-be-posted rule engine extension, I noticed that this design was not scaling efficiently. Part of the problem was a mistake in the service's message receiver, causing message backlog (this bug will be fixed downstream), but the second part of the problem was due to database pressure. What is more, the constant updating of the underlying postgresql tables causes heavy fragmentation/holes and requires autovacuum to run more frequently.

Modification:

This patch moves us back in the direction of what resilience used to do. All operations are held in memory for their entire lifecycle. (16 GiB was usually sufficient for Resilience and this should be no different; in fact, JFR profiling reveals a stable memory footprint so far not exceeding 1 GiB.)

The RDBMS store is still used for recovery, but it now holds a reduced operation descriptor with many state-related fields eliminated. The operation is written once to the database and persists until the entire verification sequence is completed, at which point it is eliminated (via a batch delete). Moreover, only operations for `cache location` or `qos modified` message types are stored, since these originate outside of the qos components (scans, on the other hand, need not store their operations, as they are repeating and based on queries).
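The write-once, batch-delete lifecycle described above can be illustrated with a minimal sketch. This is not the code from the patch: every class, method, and field name below (`OperationManagerSketch`, `RecoveryStore`, `remainingSteps`, and so on) is hypothetical, and the `InMemoryStore` merely stands in for the RDBMS table so that the example runs on its own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names do not correspond to the actual dCache classes.
class OperationManagerSketch {
    enum MessageType { CACHE_LOCATION, QOS_MODIFIED, POOL_SCAN, SYSTEM_SCAN }

    /** Full operation; its mutable state lives only in memory. */
    static final class Operation {
        final String pnfsId;
        final MessageType type;
        int remainingSteps;

        Operation(String pnfsId, MessageType type, int remainingSteps) {
            this.pnfsId = pnfsId;
            this.type = type;
            this.remainingSteps = remainingSteps;
        }

        /** Only externally originated messages need a recovery record. */
        boolean isPersisted() {
            return type == MessageType.CACHE_LOCATION || type == MessageType.QOS_MODIFIED;
        }
    }

    /** Stand-in for the table holding the reduced operation descriptor. */
    interface RecoveryStore {
        void insert(String pnfsId, MessageType type);
        void deleteBatch(List<String> pnfsIds);
    }

    /** Toy store that counts calls so the write pattern is observable. */
    static final class InMemoryStore implements RecoveryStore {
        final Set<String> rows = new HashSet<>();
        int writes;
        int batchDeletes;

        @Override
        public void insert(String pnfsId, MessageType type) {
            rows.add(pnfsId);
            writes++;
        }

        @Override
        public void deleteBatch(List<String> pnfsIds) {
            rows.removeAll(pnfsIds);
            batchDeletes++;
        }
    }

    private final Map<String, Operation> operations = new HashMap<>();
    private final List<String> completed = new ArrayList<>();
    private final RecoveryStore store;

    OperationManagerSketch(RecoveryStore store) {
        this.store = store;
    }

    /** Registers the operation in memory; writes it to the store at most once. */
    synchronized void submit(Operation op) {
        if (operations.putIfAbsent(op.pnfsId, op) == null && op.isPersisted()) {
            store.insert(op.pnfsId, op.type);
        }
    }

    /** Records one finished verification step; never touches the store. */
    synchronized void stepDone(String pnfsId) {
        Operation op = operations.get(pnfsId);
        if (op != null && --op.remainingSteps <= 0) {
            operations.remove(pnfsId);
            if (op.isPersisted()) {
                completed.add(pnfsId);
            }
        }
    }

    /** Removes all completed operations from the store in a single batch. */
    synchronized void flushCompleted() {
        if (!completed.isEmpty()) {
            store.deleteBatch(new ArrayList<>(completed));
            completed.clear();
        }
    }

    synchronized int size() {
        return operations.size();
    }
}
```

In this sketch intermediate state changes never reach the store, so each persisted operation costs one insert plus a share of one batch delete, which is the property that reduces update pressure on the tables.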
The overhaul largely involves code elimination, but this refactoring also reworks the central components: what was called the `operation map` and is now called the `operation manager`, and the internal queues. The new queueing schema replaces the single-threaded clock algorithm with independent thread pools for each queue type. The queue types are also now configurable through the spring.xml. Extra properties have been added to control thread pool sizes, etc., for this new setup.

Result:

Under rule-engine extension testing, the re-implementation performs much better, particularly in terms of the turnover of VOIDed operations, which on system or pool scans normally constitute the majority. It is recommended this change accompany the introduction of the rule engine extensions (to follow).

Target: master
Patch: https://rb.dcache.org/r/14063/
Requires-notes: yes (eventually, for 9.2)
Acked-by: Tigran
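As an illustration of the per-queue thread pools mentioned under Modification, here is a minimal sketch of one independent executor per queue type, sized from a configuration map. It is only an assumption about the general shape: the queue type names (`READY`, `RUNNING`, `CANCELED`), the class name, and the size map standing in for the new properties are all hypothetical and are not taken from the patch.

```java
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; queue type names and sizes are illustrative only.
class QueuePoolsSketch {
    enum QueueType { READY, RUNNING, CANCELED }

    private final Map<QueueType, ExecutorService> pools = new EnumMap<>(QueueType.class);

    /** Creates one fixed-size pool per configured queue type. */
    QueuePoolsSketch(Map<QueueType, Integer> poolSizes) {
        for (Map.Entry<QueueType, Integer> entry : poolSizes.entrySet()) {
            QueueType type = entry.getKey();
            String prefix = type.name().toLowerCase(Locale.ROOT);
            AtomicInteger counter = new AtomicInteger();
            pools.put(type, Executors.newFixedThreadPool(entry.getValue(),
                    r -> new Thread(r, prefix + "-" + counter.incrementAndGet())));
        }
    }

    /** Runs the task on the pool belonging to the given queue type. */
    <T> Future<T> submit(QueueType type, Callable<T> task) {
        ExecutorService pool = pools.get(type);
        if (pool == null) {
            throw new IllegalArgumentException("no pool configured for " + type);
        }
        return pool.submit(task);
    }

    /** Stops accepting work and waits for each pool to drain. */
    boolean shutdown(long timeout, TimeUnit unit) throws InterruptedException {
        for (ExecutorService pool : pools.values()) {
            pool.shutdown();
        }
        boolean terminated = true;
        for (ExecutorService pool : pools.values()) {
            terminated &= pool.awaitTermination(timeout, unit);
        }
        return terminated;
    }
}
```

Because each queue type owns its threads, a backlog in one queue cannot stall the others, which a single thread cycling over all queues cannot guarantee.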