From 403d231b3b492a216eb64620ff6f412d6d0147d5 Mon Sep 17 00:00:00 2001 From: Salman Rana Date: Mon, 22 Feb 2021 16:15:20 -0500 Subject: [PATCH] Adaptive GC Threading MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit **_For background and results see https://github.com/eclipse/omr/issues/5829_** > Using overhead data (busy/stall times for managing and synchronizing threads), Adaptive Threading aims to identify sub-optimal/detrimental parallelism and continuously adjusts the GC thread count to reach optimal parallelism. _Adaptive Threading changes are implemented at the various phases of GC as follows, during:_ 1. **Pre-collection _(during to task/thread dispatch)_:** to adjust thread count based on the previous cycle’s recommendation 2. **Collection:** to gather data for parallelization overhead for the running cycle 3. **Post-collection _(after worker threads are suspended)_:** to project/calculate optimal thread count and give recommendation for the next cycle based on the current GC that’s completed Adaptive threading will be enabled by default. The user may choose to enable/disable adaptive threading through the `-XX:[+-]AdaptiveThreading` options. However Adaptive Threading is ignored, even if it is enabled, when GC thread count is forced (e.g user specifics Xgcthreads). The user can also specify upper thread limit for adaptive threading using `-Xgcmaxthreads` option. _The specifics of the implementation are as follows:_ - Thread count associated with a parallel task instance, indicates the projected optimal number of threads to complete the task. - Drives Adaptive Threading - changes with each dispatch of the task, based on observations made when previously completing the same task - Currently, adaptive threading is only used with _Scavenger_, hence this is only applies to tasks of type `MM_ParallelScavengeTask` so a recommended thread count is provided with a new `ParallelScavengeTask` instance (i.e when scavenge is run) - `getRecommendedWorkingThreads` introduced to `MM_Task` base class, base implementation returns `UDATA_MAX` (signifies no adaptive threading), overridden by `ParallelScavengeTask` to return adaptive thread count. during collection): - ` _workerScavengeStartTime` and `_workerScavengeEndTime` - Thread local task (scavenge) start/end timestamps taken when worker thread starts/competes a task (scavenge). This is used to determine the time it takes a worker to start collection task from the time a cycle starts and how long a worker waits for others when it is completed its task. - `_adjustedSyncStallTime` - Similar to the existing `_SyncStallTime` stat **except** it accounts for Critical Section duration. That is, it subtracts critical section duration from stall time of synced thread. This is because the stall time being added from critical section is independent of number of the threads being synchronized. This independent stall time can not be adjusted for, hence it must be ignored. This is relevant for `syncAndRealeaseSingle/Master` APIs as they are the only APIs that record a stall time with critical sections, without these, `_SyncStallTime` == `_adjustedSyncStallTime` - existing `addToSyncStallTime` method extended to update `_adjustedSyncStallTime` in addition to `_SyncStallTime`. `addToSyncStallTime` now takes `criticalSectionDuration` (defaults to 0) as an input, a value for it passed when a critical section is executed prior to updating stall stats . - `_notifyStallTime ` - Used to record the stall times resulting from notifying other waiting threads. Note, this stat is not inclusive, it only records notify times relevant to adaptive threading. - We are more concerned about recording 'notify_all' rather than 'notify_one' as 'notify_all' is dependent on the number of threads being notified. - Introduced `calculateRecommendedWorkingThreads` routine - Called at the end of each successful scavenge to project optimal threads for next scavenge - Implements Adaptive Model _(see background for more info)_ - Calculates averages of stall stats and uses them as inputs to adaptive model, projected optimal thread count output is stored to new member `_recommendedThreads`. - Timestamps taken for worker thread scavenge start/end time - `notifyStallTime` recorded for `notify_all` on `_scanCacheMonitor` _(some instances are ignored as they are not relevant to adaptive threading such as backout cases)_ - `MergeThreadGCStats` updated to merge new scavenger stats (discussed above) - trace point added here to breakdown thread local stats for adaptive threading for Adaptive Threading - Dispatcher now queries the task for the recommended thread count when determining the number of threads to release from thread pool to complete the task - Ensures thread count is bounded properly to respect max thread count, either user provided adaptiveThreadCount or default max. - `_syncCriticalSectionStartTime` & `_syncCriticalSectionDuration` introduced to record Critical Section times for adjusted stall time - `_syncCriticalSectionStartTime` recorded when all threads are synced and the released thread is about to exit sync api to execute critical section - `_syncCriticalSectionDuration` is recored when thread executing critical section is about to release the synced thread (indicating critical section is complete). As synced threads are released, they update their stall time with the newly set critical section duration. - `notifyStallTime` recorded for `notify_all` on synced threads Signed-off-by: Salman Rana --- gc/base/GCExtensionsBase.hpp | 21 ++++ gc/base/ParallelDispatcher.cpp | 18 +++ gc/base/ParallelTask.cpp | 12 ++ gc/base/ParallelTask.hpp | 6 + gc/base/Task.hpp | 2 + gc/base/j9mm.tdf | 7 ++ gc/base/standard/ConcurrentScavengeTask.hpp | 2 +- gc/base/standard/ParallelScavengeTask.cpp | 26 ++++- gc/base/standard/ParallelScavengeTask.hpp | 8 +- gc/base/standard/Scavenger.cpp | 119 +++++++++++++++++++- gc/base/standard/Scavenger.hpp | 14 ++- gc/stats/ScavengerStats.cpp | 9 ++ gc/stats/ScavengerStats.hpp | 21 +++- 13 files changed, 253 insertions(+), 12 deletions(-) diff --git a/gc/base/GCExtensionsBase.hpp b/gc/base/GCExtensionsBase.hpp index f00c3ba7540..1ce13505ea6 100644 --- a/gc/base/GCExtensionsBase.hpp +++ b/gc/base/GCExtensionsBase.hpp @@ -517,6 +517,13 @@ class MM_GCExtensionsBase : public MM_BaseVirtual { bool enableSplitHeap; /**< true if we are using gencon with -Xgc:splitheap (we will fail to boostrap if we can't allocate both ranges) */ double aliasInhibitingThresholdPercentage; /**< percentage of threads that can be blocked before copy cache aliasing is inhibited (set through aliasInhibitingThresholdPercentage=) */ + /* Start of variables relating to Adaptive Threading */ + bool adaptiveGCThreading; /**< Flag to indicate whether the Scavenger Adaptive Threading Optimization is enabled*/ + float adaptiveThreadingSensitivityFactor; /**< Used by Adaptive Model to determine sensitivity/tolerance to stalling, higher number translates to less stall being tolerated (set through adaptiveThreadingSensitivityFactor=) */ + float adaptiveThreadingWeightActiveThreads; /**< Weight given to current active threads when averaging projected threads with current active threads (set through adaptiveThreadingWeightActiveThreads=) */ + float adaptiveThreadBooster; /**< Used to boost calculated thread count, gives opportunity for low thread count to grow. */ + /* End of variables relating to Adaptive Threading */ + enum HeapInitializationSplitHeapSection { HEAP_INITIALIZATION_SPLIT_HEAP_UNKNOWN = 0, HEAP_INITIALIZATION_SPLIT_HEAP_TENURE, @@ -1093,6 +1100,16 @@ class MM_GCExtensionsBase : public MM_BaseVirtual { MMINLINE void setConcurrentGlobalGCInProgress(bool inProgress) { _concurrentGlobalGCInProgress = inProgress; } #endif /* OMR_GC_MODRON_SCAVENGER */ + /** + * Determine whether Adaptive Threading is enabled. AdaptiveGCThreading flag + * is not sufficient; Adaptive threading must be ignored if GC thread count is forced. + * @return TRUE if adaptive threading routines can proceed, FALSE otherwise + */ + MMINLINE bool adaptiveThreadigEnabled() + { + return (adaptiveGCThreading && !gcThreadCountForced); + } + /** * Returns TRUE if an object is old, FALSE otherwise. * @param objectPtr Pointer to an object @@ -1595,6 +1612,10 @@ class MM_GCExtensionsBase : public MM_BaseVirtual { , dnssMinimumContraction(0.0) , enableSplitHeap(false) , aliasInhibitingThresholdPercentage(0.20) + , adaptiveGCThreading(true) + , adaptiveThreadingSensitivityFactor(1.0f) + , adaptiveThreadingWeightActiveThreads(0.50f) + , adaptiveThreadBooster(0.85f) , splitHeapSection(HEAP_INITIALIZATION_SPLIT_HEAP_UNKNOWN) #endif /* OMR_GC_MODRON_SCAVENGER */ , globalMaximumContraction(0.05) /* by default, contract must be at most 5% of the committed heap */ diff --git a/gc/base/ParallelDispatcher.cpp b/gc/base/ParallelDispatcher.cpp index 65a08a757e1..1136477f2f3 100644 --- a/gc/base/ParallelDispatcher.cpp +++ b/gc/base/ParallelDispatcher.cpp @@ -457,6 +457,24 @@ MM_ParallelDispatcher::recomputeActiveThreadCountForTask(MM_EnvironmentBase *env * available and ready to run). */ uintptr_t taskActiveThreadCount = OMR_MIN(_activeThreadCount, threadCount); + + /* Account for Adaptive Threading. RecommendedWorkingThreads will not be set (will return UDATA_MAX) if: + * + * 1) User forced a thread count (e.g Xgcthreads) + * 2) Adaptive threading flag is not set (-XX:-AdaptiveGCThreading) + * 3) or simply the task wasn't recommended a thread count (currently only recommended for STW Scavenge Tasks) + */ + if (task->getRecommendedWorkingThreads() != UDATA_MAX) { + /* Bound the recommended thread count. Determine the upper bound for the thread count, + * This will either be the user specified gcMaxThreadCount (-XgcmaxthreadsN) or else default max + */ + taskActiveThreadCount = OMR_MIN(_threadCount, task->getRecommendedWorkingThreads()); + + _activeThreadCount = taskActiveThreadCount; + + Trc_MM_ParallelDispatcher_recomputeActiveThreadCountForTask_useCollectorRecommendedThreads(task->getRecommendedWorkingThreads(), taskActiveThreadCount); + } + task->setThreadCount(taskActiveThreadCount); return taskActiveThreadCount; } diff --git a/gc/base/ParallelTask.cpp b/gc/base/ParallelTask.cpp index 469453768fc..1ee46f23714 100644 --- a/gc/base/ParallelTask.cpp +++ b/gc/base/ParallelTask.cpp @@ -214,6 +214,14 @@ MM_ParallelTask::synchronizeGCThreadsAndReleaseSingleThread(MM_EnvironmentBase * void MM_ParallelTask::releaseSynchronizedGCThreads(MM_EnvironmentBase *env) { + OMRPORT_ACCESS_FROM_OMRPORT(env->getPortLibrary()); + + if (_syncCriticalSectionStartTime != 0) { + /* Critical section complete, synced threads are about to be released. Record the duration. */ + _syncCriticalSectionDuration = (omrtime_hires_clock() - _syncCriticalSectionStartTime); + _syncCriticalSectionStartTime = 0; + } + if (1 == _totalThreadCount) { _synchronized = false; return; @@ -225,7 +233,11 @@ MM_ParallelTask::releaseSynchronizedGCThreads(MM_EnvironmentBase *env) omrthread_monitor_enter(_synchronizeMutex); _synchronizeCount = 0; _synchronizeIndex += 1; + uint64_t notifyStartTime = omrtime_hires_clock(); omrthread_monitor_notify_all(_synchronizeMutex); + + addToNotifyStallTime(env, notifyStartTime, omrtime_hires_clock()); + omrthread_monitor_exit(_synchronizeMutex); } diff --git a/gc/base/ParallelTask.hpp b/gc/base/ParallelTask.hpp index bf382fc9f07..9d83dfbf315 100644 --- a/gc/base/ParallelTask.hpp +++ b/gc/base/ParallelTask.hpp @@ -48,6 +48,9 @@ class MM_ParallelTask : public MM_Task */ private: protected: + uint64_t _syncCriticalSectionStartTime; /**< Timestamp taken when a critical section of the task starts execution. */ + uint64_t _syncCriticalSectionDuration; /**< The time, in hi-res ticks, it took to execute the lastest critical section. */ + bool _synchronized; const char *_syncPointUniqueId; uintptr_t _syncPointWorkUnitIndex; /**< The _workUnitIndex of the first thread to sync. All threads should have the same index once sync'ed. */ @@ -84,6 +87,7 @@ class MM_ParallelTask : public MM_Task _totalThreadCount = threadCount; } MMINLINE virtual uintptr_t getThreadCount() { return _totalThreadCount; } + MMINLINE virtual void addToNotifyStallTime(MM_EnvironmentBase *env, uint64_t startTime, uint64_t endTime) {} virtual bool isSynchronized(); @@ -92,6 +96,8 @@ class MM_ParallelTask : public MM_Task */ MM_ParallelTask(MM_EnvironmentBase *env, MM_ParallelDispatcher *dispatcher) : MM_Task(env, dispatcher) + ,_syncCriticalSectionStartTime(0) + ,_syncCriticalSectionDuration(0) ,_synchronized(false) ,_syncPointUniqueId(NULL) ,_syncPointWorkUnitIndex(0) diff --git a/gc/base/Task.hpp b/gc/base/Task.hpp index ac58239796e..cd084e8876a 100644 --- a/gc/base/Task.hpp +++ b/gc/base/Task.hpp @@ -65,6 +65,8 @@ class MM_Task : public MM_BaseVirtual virtual void run(MM_EnvironmentBase *env) = 0; virtual void cleanup(MM_EnvironmentBase *env); + virtual uintptr_t getRecommendedWorkingThreads() { return UDATA_MAX; } + /** * Single call setup routine for tasks invoked by the main thread before the task is dispatched. */ diff --git a/gc/base/j9mm.tdf b/gc/base/j9mm.tdf index af86c07c100..0d50266ab17 100644 --- a/gc/base/j9mm.tdf +++ b/gc/base/j9mm.tdf @@ -928,3 +928,10 @@ TraceEvent=Trc_MM_Scavenger_percolate_delegate Overhead=1 Level=1 Group=percolat TraceEntry=Trc_MM_MemorySubSpaceUniSpace_getHeapFreeMaximumHeuristicMultiplier Overhead=1 Level=1 Group=resize Template="Trc_MM_MemorySubSpaceUniSpace_getHeapFreeMaximumHeuristicMultiplier Maximum free multiplier = %zu" TraceEntry=Trc_MM_MemorySubSpaceUniSpace_getHeapFreeMinimumHeuristicMultiplier Overhead=1 Level=1 Group=resize Template="Trc_MM_MemorySubSpaceUniSpace_getHeapFreeMinimumHeuristicMultiplier Minimum free multiplier = %zu" + +TraceEntry=Trc_MM_Scavenger_calculateRecommendedWorkingThreads_entry Overhead=1 Level=1 Group=adaptivethread Template="[%u] MM_Scavenger_calculateRecommendedWorkingThreads entry" +TraceExit=Trc_MM_Scavenger_calculateRecommendedWorkingThreads_exitOverflow Overhead=1 Level=1 Group=adaptivethread Template="Scavenge ignored for recommending threads (Irregular stalling with SyncAndReleaseMaster)" +TraceExit=Trc_MM_Scavenger_calculateRecommendedWorkingThreads_setRecommendedThreads Overhead=1 Level=1 Group=adaptivethread Template="ScavengeTime: %5llu Avg.StallTime: %5llu (%.2f%%) Threads [Current: %2u Ideal: %.2f Weighted: %.2f Boosted: %.2f Recommend: %2u]" +TraceEvent=Trc_MM_Scavenger_calculateRecommendedWorkingThreads_averageStallBreakDown Overhead=1 Level=1 Group=adaptivethread Template="Average for: %u threads -> avgTimeToStartCollection: %5llu avgTimeIdleAfterCollection: %5llu avgScanStallTime: %5llu avgSyncStallTime: %5llu avgNotifyStallTime: %5llu" +TraceEvent=Trc_MM_Scavenger_calculateRecommendedWorkingThreads_threadStallBreakDown Overhead=1 Level=1 Group=adaptivethread Template="Thread: %2u -> timeToStartCollection: %5llu scanStall: %5llu syncStall: %5llu notifyStall: %5llu" +TraceEvent=Trc_MM_ParallelDispatcher_recomputeActiveThreadCountForTask_useCollectorRecommendedThreads noEnv Overhead=1 Level=1 Group=adaptivethread Template="Attempting to use Task Recommended Threads: %u -> Adjusting to Bounds: %u" diff --git a/gc/base/standard/ConcurrentScavengeTask.hpp b/gc/base/standard/ConcurrentScavengeTask.hpp index 09b5bc606f5..bd1b7f5a37b 100644 --- a/gc/base/standard/ConcurrentScavengeTask.hpp +++ b/gc/base/standard/ConcurrentScavengeTask.hpp @@ -69,7 +69,7 @@ class MM_ConcurrentScavengeTask : public MM_ParallelScavengeTask MM_Scavenger *scavenger, ConcurrentAction action, MM_CycleState *cycleState) : - MM_ParallelScavengeTask(env, dispatcher, scavenger, cycleState) + MM_ParallelScavengeTask(env, dispatcher, scavenger, cycleState, UDATA_MAX) , _bytesScanned(0) , _action(action) { diff --git a/gc/base/standard/ParallelScavengeTask.cpp b/gc/base/standard/ParallelScavengeTask.cpp index 366cb96737b..05d28048044 100644 --- a/gc/base/standard/ParallelScavengeTask.cpp +++ b/gc/base/standard/ParallelScavengeTask.cpp @@ -88,7 +88,15 @@ MM_ParallelScavengeTask::synchronizeGCThreadsAndReleaseMain(MM_EnvironmentBase * bool result = MM_ParallelTask::synchronizeGCThreadsAndReleaseMain(env, id); uint64_t endTime = omrtime_hires_clock(); - env->_scavengerStats.addToSyncStallTime(startTime, endTime); + if (result) { + /* Released thread will exit and execute critical section, recored the start time */ + _syncCriticalSectionStartTime = endTime; + _syncCriticalSectionDuration = 0; + } + + /* _syncCriticalSectionDuration must be set now, this thread's stall time is at least the duration of critical section */ + Assert_MM_true((endTime - startTime) >= _syncCriticalSectionDuration); + env->_scavengerStats.addToSyncStallTime(startTime, endTime, _syncCriticalSectionDuration); return result; } @@ -102,11 +110,25 @@ MM_ParallelScavengeTask::synchronizeGCThreadsAndReleaseSingleThread(MM_Environme bool result = MM_ParallelTask::synchronizeGCThreadsAndReleaseSingleThread(env, id); uint64_t endTime = omrtime_hires_clock(); - env->_scavengerStats.addToSyncStallTime(startTime, endTime); + if (result) { + /* Released thread will exit and execute critical section, recored the start time */ + _syncCriticalSectionStartTime = endTime; + _syncCriticalSectionDuration = 0; + } + + /* _syncCriticalSectionDuration must be set now, this thread's stall time is at least the duration of critical section */ + Assert_MM_true((endTime - startTime) >= _syncCriticalSectionDuration); + env->_scavengerStats.addToSyncStallTime(startTime, endTime, _syncCriticalSectionDuration); return result; } +void +MM_ParallelScavengeTask::addToNotifyStallTime(MM_EnvironmentBase *env, uint64_t startTime, uint64_t endTime) +{ + env->_scavengerStats.addToNotifyStallTime(startTime, endTime); +} + #endif /* J9MODRON_TGC_PARALLEL_STATISTICS */ #endif /* defined(OMR_GC_MODRON_SCAVENGER) */ diff --git a/gc/base/standard/ParallelScavengeTask.hpp b/gc/base/standard/ParallelScavengeTask.hpp index 431c9bc8f26..1d6ba7f3c56 100644 --- a/gc/base/standard/ParallelScavengeTask.hpp +++ b/gc/base/standard/ParallelScavengeTask.hpp @@ -50,6 +50,7 @@ class MM_ParallelScavengeTask : public MM_ParallelTask protected: MM_Scavenger *_collector; MM_CycleState *_cycleState; /**< Collection cycle state active for the task */ + uintptr_t _recommendedThreads; /**< Collector recommended threads for the task */ public: virtual UDATA getVMStateID() { return OMRVMSTATE_GC_SCAVENGE; }; @@ -77,15 +78,20 @@ class MM_ParallelScavengeTask : public MM_ParallelTask * @see MM_ParallelTask::synchronizeGCThreadsAndReleaseSingleThread */ virtual bool synchronizeGCThreadsAndReleaseSingleThread(MM_EnvironmentBase *env, const char *id); + + virtual void addToNotifyStallTime(MM_EnvironmentBase *env, uint64_t startTime, uint64_t endTime); #endif /* J9MODRON_TGC_PARALLEL_STATISTICS */ + virtual uintptr_t getRecommendedWorkingThreads() { return _recommendedThreads; }; + /** * Create a ParallelScavengeTask object. */ - MM_ParallelScavengeTask(MM_EnvironmentBase *env, MM_ParallelDispatcher *dispatcher, MM_Scavenger *collector,MM_CycleState *cycleState) : + MM_ParallelScavengeTask(MM_EnvironmentBase *env, MM_ParallelDispatcher *dispatcher, MM_Scavenger *collector,MM_CycleState *cycleState, uintptr_t recommendedThreads) : MM_ParallelTask(env, dispatcher) ,_collector(collector) ,_cycleState(cycleState) + ,_recommendedThreads(recommendedThreads) { _typeId = __FUNCTION__; }; diff --git a/gc/base/standard/Scavenger.cpp b/gc/base/standard/Scavenger.cpp index efeadcbea4a..fd15317990c 100644 --- a/gc/base/standard/Scavenger.cpp +++ b/gc/base/standard/Scavenger.cpp @@ -448,8 +448,14 @@ MM_Scavenger::mainSetupForGC(MM_EnvironmentStandard *env) void MM_Scavenger::workerSetupForGC(MM_EnvironmentStandard *env) { + OMRPORT_ACCESS_FROM_OMRPORT(env->getPortLibrary()); + clearThreadGCStats(env, true); + /* This thread just started the scavenge task, record the timestamp. + * This must be done after clearThreadGCStats or else the timestamp will be cleared. */ + env->_scavengerStats._workerScavengeStartTime = omrtime_hires_clock(); + /* Clear local language-specific stats */ _delegate.workerSetupForGC_clearEnvironmentLangStats(env); @@ -479,6 +485,88 @@ MM_Scavenger::calculateMaxCacheCount(uintptr_t activeMemorySize) return 5 * (activeMemorySize / (_extensions->scavengerScanCacheMaximumSize + _extensions->scavengerScanCacheMinimumSize)); } +void +MM_Scavenger::calculateRecommendedWorkingThreads(MM_EnvironmentStandard *env) +{ + if (!_extensions->adaptiveThreadigEnabled() || _extensions->isConcurrentScavengerEnabled()) { + return; + } + + Trc_MM_Scavenger_calculateRecommendedWorkingThreads_entry(env->getLanguageVMThread(), _extensions->scavengerStats._gcCount); + + if (_isRememberedSetInOverflowAtTheBeginning || _extensions->scavengerStats._causedRememberedSetOverflow) { + /* Scavenge cycle ignored for recommending threads. Scavenger had overflows, this will skew the model, + * as this is the not the normal case for stalling ("Irregular" stalling with SyncAndReleaseMaster, + * this is not the norm hence we don't have to adjust for it) */ + Trc_MM_Scavenger_calculateRecommendedWorkingThreads_exitOverflow(env->getLanguageVMThread()); + return; + } + + OMRPORT_ACCESS_FROM_OMRPORT(env->getPortLibrary()); + + uintptr_t totalThreads = _dispatcher->activeThreadCount(); + + /* Calculate the average time it takes the worker threads to start collection and avgerage time workers are idle waiting for task cleanup + * Calculated as (Sum_WorkerStartTime(t1 + t2 + ... + tn) - (n * collection_start)) / n */ + uint64_t avgTimeToStartCollection = omrtime_hires_delta((_extensions->scavengerStats._startTime * totalThreads), _extensions->scavengerStats._workerScavengeStartTime, OMRPORT_TIME_DELTA_IN_MICROSECONDS) / totalThreads; + uint64_t avgTimeIdleAfterCollection = omrtime_hires_delta(_extensions->scavengerStats._workerScavengeEndTime, (_extensions->scavengerStats._endTime * totalThreads), OMRPORT_TIME_DELTA_IN_MICROSECONDS) / totalThreads; + + /* Calculate average stall times */ + uint64_t avgScanStallTime = omrtime_hires_delta(0, (_extensions->scavengerStats._workStallTime + _extensions->scavengerStats._completeStallTime), OMRPORT_TIME_DELTA_IN_MICROSECONDS) / totalThreads; + uint64_t avgSyncStallTime = omrtime_hires_delta(0, (_extensions->scavengerStats._adjustedSyncStallTime), OMRPORT_TIME_DELTA_IN_MICROSECONDS) / totalThreads; + uint64_t avgNotifyStallTime = omrtime_hires_delta(0, (_extensions->scavengerStats._notifyStallTime), OMRPORT_TIME_DELTA_IN_MICROSECONDS) / totalThreads; + + Trc_MM_Scavenger_calculateRecommendedWorkingThreads_averageStallBreakDown(env->getLanguageVMThread(), totalThreads, avgTimeToStartCollection, avgTimeIdleAfterCollection, avgScanStallTime, avgSyncStallTime, avgNotifyStallTime); + + uint64_t totalStallTime = avgTimeToStartCollection + avgTimeIdleAfterCollection + avgScanStallTime + avgSyncStallTime + avgNotifyStallTime; + uint64_t scavengeTotalTime = omrtime_hires_delta(_extensions->scavengerStats._startTime, _extensions->scavengerStats._endTime, OMRPORT_TIME_DELTA_IN_MICROSECONDS); + + /* This Adaptive Threading Model aims to determine the efficiency of a cycle and predict the optimal GC thread count based on current number of threads, + * directly proportional to busy time and inversely proportional to idle times. + * + * The model can be expressed as a continues function, it's derived by finding a minimum of the folowing GC time function (used to project duration of GC for m threads, + * with oberved buys/stall times while performing GC with n threads): + * + * Time GC (m,n,b,s) = b * (n/m) + s * (m/n)^x + * Where m = number of threads for which total GC time (duration) is projected + * n = number of utilized threads (number of worker threads started + main thread) for the GC cycle used to observe s and b + * s = stall time per thread = average observed collection stall time for n threads = SUM(Stall time of n threads) / n + * b = busy time per thread = average collection busy time for n threads + * X = Stall Overhead Sensitivity/Tolerance, a model constant used to help model non-linear dependency of stall times on GC thread count + * + * + * Solving this function results in the model implementation expressed by expression 1 below. Expression 1 in combination with expression 2 gives up a complete implementation of the model + * + * (1) Number of Optimal Threads = m(n,b,s) = n * (B/X*s)^(1/X+1) + * (2) floor(((m(n,b,s) + H) * (1 - W)) + (n * W)) + * + * Where W and H are constants, W = Weighted average factor, H = Thread Booster + * + * Expression (1) can be simplified to be written in terms of % stall and working threads as follows: + * + * b/s = (1/%stall) - 1 + * + * ------------------------------------------------------------------ + * | (1) m(n,b,s) = m(n,%stall) = n * ((1/x)*(1/%stall - 1)^(1/(x+1))| + * ------------------------------------------------------------------- + */ + float percentStall = ((float) totalStallTime) / ((float) scavengeTotalTime); + float sensitivityFactor = _extensions->adaptiveThreadingSensitivityFactor; + float powerExponent = 1.0f / (sensitivityFactor + 1.0f); + float stallComponent = (1.0f / percentStall) - 1.0f; + float powerBase = (1.0f / sensitivityFactor) * stallComponent; + + float idealThreads = totalThreads * powf(powerBase, powerExponent); + float adjustedAverage = MM_Math::weightedAverage((float)totalThreads, idealThreads, _extensions->adaptiveThreadingWeightActiveThreads); + _recommendedThreads = (uintptr_t)(adjustedAverage + _extensions->adaptiveThreadBooster); + + if (_recommendedThreads < 2) { + _recommendedThreads = 2; + } + + Trc_MM_Scavenger_calculateRecommendedWorkingThreads_setRecommendedThreads(env->getLanguageVMThread(), scavengeTotalTime, totalStallTime, (percentStall*100), totalThreads, idealThreads, adjustedAverage, (adjustedAverage + _extensions->adaptiveThreadBooster), _recommendedThreads); +} + /** * Run a scavenge. */ @@ -486,7 +574,7 @@ void MM_Scavenger::scavenge(MM_EnvironmentBase *envBase) { MM_EnvironmentStandard *env = MM_EnvironmentStandard::getEnvironment(envBase); - MM_ParallelScavengeTask scavengeTask(env, _dispatcher, this, env->_cycleState); + MM_ParallelScavengeTask scavengeTask(env, _dispatcher, this, env->_cycleState, _recommendedThreads); _dispatcher->run(env, &scavengeTask); /* remove all scan caches temporary allocated in Heap */ @@ -741,6 +829,12 @@ MM_Scavenger::mergeGCStatsBase(MM_EnvironmentBase *env, MM_ScavengerStats *final finalGCStats->_workStallCount += scavStats->_workStallCount; finalGCStats->_completeStallCount += scavStats->_completeStallCount; _extensions->scavengerStats._syncStallCount += scavStats->_syncStallCount; + + /* Adaptive Threading Stats */ + finalGCStats->_workerScavengeStartTime += scavStats->_workerScavengeStartTime; + finalGCStats->_workerScavengeEndTime += scavStats->_workerScavengeEndTime; + finalGCStats->_notifyStallTime += scavStats->_notifyStallTime; + finalGCStats->_adjustedSyncStallTime += scavStats->_adjustedSyncStallTime; } @@ -754,11 +848,21 @@ MM_Scavenger::mergeThreadGCStats(MM_EnvironmentBase *env) MM_ScavengerStats *scavStats = &env->_scavengerStats; + /* This thread is just about to complete the scavenge task, record the timestamp. + * This must be done before mergeGCStatsBase or else the timestamp won't be mereged as needed by adaptive threading. */ + env->_scavengerStats._workerScavengeEndTime = omrtime_hires_clock(); mergeGCStatsBase(env, &_extensions->incrementScavengerStats, scavStats); /* Merge language specific statistics. No known interesting data per increment - they are merged directly to aggregate cycle stats */ _delegate.mergeGCStats_mergeLangStats(env); + uint64_t timeToStartCollection = omrtime_hires_delta(_extensions->scavengerStats._startTime, scavStats->_workerScavengeStartTime, OMRPORT_TIME_DELTA_IN_MICROSECONDS); + uint64_t scanStall = omrtime_hires_delta(0, (scavStats->_workStallTime + scavStats->_completeStallTime), OMRPORT_TIME_DELTA_IN_MICROSECONDS); + uint64_t syncStall = omrtime_hires_delta(0, (scavStats->_adjustedSyncStallTime), OMRPORT_TIME_DELTA_IN_MICROSECONDS); + uint64_t notifyStallTime = omrtime_hires_delta(0, (scavStats->_notifyStallTime), OMRPORT_TIME_DELTA_IN_MICROSECONDS); + + Trc_MM_Scavenger_calculateRecommendedWorkingThreads_threadStallBreakDown(env->getLanguageVMThread(), env->getWorkerID(), timeToStartCollection, scanStall, syncStall, notifyStallTime); + omrthread_monitor_exit(_extensions->gcStatsMutex); /* record the thread-specific parallelism stats in the trace buffer. This aprtially duplicates info in -Xtgc:parallel */ @@ -2117,11 +2221,15 @@ MM_Scavenger::getNextScanCache(MM_EnvironmentStandard *env) bool doneFlag = false; volatile uintptr_t doneIndex = _doneIndex; + OMRPORT_ACCESS_FROM_OMRPORT(env->getPortLibrary()); + if (checkAndSetShouldYieldFlag(env)) { flushBuffersForGetNextScanCache(env); omrthread_monitor_enter(_scanCacheMonitor); if (0 != _waitingCount) { + uint64_t notifyStartTime = omrtime_hires_clock(); omrthread_monitor_notify_all(_scanCacheMonitor); + env->_scavengerStats.addToNotifyStallTime(notifyStartTime, omrtime_hires_clock()); } omrthread_monitor_exit(_scanCacheMonitor); return NULL; @@ -2161,10 +2269,6 @@ MM_Scavenger::getNextScanCache(MM_EnvironmentStandard *env) env->_scavengerStats._acquireScanListCount += 1; #endif /* J9MODRON_TGC_PARALLEL_STATISTICS */ -#if defined(OMR_SCAVENGER_TRACE) || defined(J9MODRON_TGC_PARALLEL_STATISTICS) - OMRPORT_ACCESS_FROM_OMRPORT(env->getPortLibrary()); -#endif /* OMR_SCAVENGER_TRACE || J9MODRON_TGC_PARALLEL_STATISTICS */ - while (!doneFlag && !shouldAbortScanLoop(env)) { while (_cachedEntryCount > 0) { cache = getNextScanCacheFromList(env); @@ -2198,7 +2302,9 @@ MM_Scavenger::getNextScanCache(MM_EnvironmentStandard *env) if (shouldDoFinalNotify(env)) { _waitingCount = 0; _doneIndex += 1; + uint64_t notifyStartTime = omrtime_hires_clock(); omrthread_monitor_notify_all(_scanCacheMonitor); + env->_scavengerStats.addToNotifyStallTime(notifyStartTime, omrtime_hires_clock()); } } else { while((0 == _cachedEntryCount) && (doneIndex == _doneIndex) && !shouldAbortScanLoop(env)) { @@ -4092,6 +4198,9 @@ MM_Scavenger::mainThreadGarbageCollect(MM_EnvironmentBase *envBase, MM_AllocateD _extensions->scavengerStats._endTime = omrtime_hires_clock(); if(scavengeCompletedSuccessfully(env)) { + + calculateRecommendedWorkingThreads(env); + /* Merge sublists in the remembered set (if necessary) */ _extensions->rememberedSet.compact(env); diff --git a/gc/base/standard/Scavenger.hpp b/gc/base/standard/Scavenger.hpp index 38e9e5319fb..42e85e0fffb 100644 --- a/gc/base/standard/Scavenger.hpp +++ b/gc/base/standard/Scavenger.hpp @@ -100,6 +100,7 @@ class MM_Scavenger : public MM_Collector bool _cachedSemiSpaceResizableFlag; uintptr_t _minTenureFailureSize; uintptr_t _minSemiSpaceFailureSize; + uintptr_t _recommendedThreads; /** Number of threads recommended to the dispatcher for the Scavenge task */ MM_CycleState _cycleState; /**< Embedded cycle state to be used as the main cycle state for GC activity */ MM_CollectionStatisticsStandard _collectionStatistics; /** Common collect stats (memory, time etc.) */ @@ -326,8 +327,7 @@ class MM_Scavenger : public MM_Collector /* Check last few LSB of the object address for probability 1/16 */ return (0 == ((uintptr_t)objectPtr & 0x78)); } - - + void deepScanOutline(MM_EnvironmentStandard *env, omrobjectptr_t objectPtr, uintptr_t priorityFieldOffset1, uintptr_t priorityFieldOffset2); MMINLINE bool scavengeRememberedObject(MM_EnvironmentStandard *env, omrobjectptr_t objectPtr); @@ -555,6 +555,15 @@ class MM_Scavenger : public MM_Collector bool canCalcGCStats(MM_EnvironmentStandard *env); void calcGCStats(MM_EnvironmentStandard *env); + /** + * The implementation of Adaptive Threading. This routine is called at the + * end of each successful scavenge to determine the optimal number of threads for + * the subsequent cycle. This is based on the completed cycle's stall/busy stats (adaptive model). + * This function set's _recommendedThreads, which in turn get's used when dispatching + * the next cycle's scavege task. + */ + void calculateRecommendedWorkingThreads(MM_EnvironmentStandard *env); + void scavenge(MM_EnvironmentBase *env); bool scavengeCompletedSuccessfully(MM_EnvironmentStandard *env); virtual void mainThreadGarbageCollect(MM_EnvironmentBase *env, MM_AllocateDescription *allocDescription, bool initMarkMap = false, bool rebuildMarkBits = false); @@ -901,6 +910,7 @@ class MM_Scavenger : public MM_Collector , _expandTenureOnFailedAllocate(true) , _minTenureFailureSize(UDATA_MAX) , _minSemiSpaceFailureSize(UDATA_MAX) + , _recommendedThreads(UDATA_MAX) , _cycleState() , _collectionStatistics() , _cachedEntryCount(0) diff --git a/gc/stats/ScavengerStats.cpp b/gc/stats/ScavengerStats.cpp index 25aaac50913..44da39ddc6c 100644 --- a/gc/stats/ScavengerStats.cpp +++ b/gc/stats/ScavengerStats.cpp @@ -66,6 +66,10 @@ MM_ScavengerStats::MM_ScavengerStats() ,_depthDeepestStructure(0) ,_copyScanUpdates(0) #endif /* J9MODRON_TGC_PARALLEL_STATISTICS */ + ,_workerScavengeStartTime(0) + ,_workerScavengeEndTime(0) + ,_notifyStallTime(0) + ,_adjustedSyncStallTime(0) ,_avgInitialFree(0) ,_avgTenureBytes(0) ,_avgTenureBytesDeviation(0) @@ -195,6 +199,11 @@ MM_ScavengerStats::clear(bool firstIncrement) _slotsCopied = 0; _slotsScanned = 0; + _adjustedSyncStallTime = 0; + _notifyStallTime = 0; + _workerScavengeEndTime = 0; + _workerScavengeStartTime = 0; + #if defined(OMR_GC_CONCURRENT_SCAVENGER) _readObjectBarrierCopy = 0; _readObjectBarrierUpdate = 0; diff --git a/gc/stats/ScavengerStats.hpp b/gc/stats/ScavengerStats.hpp index ee603efa8eb..15c5c1424f8 100644 --- a/gc/stats/ScavengerStats.hpp +++ b/gc/stats/ScavengerStats.hpp @@ -75,6 +75,7 @@ class MM_ScavengerStats uintptr_t _failedFlipCount; uintptr_t _failedFlipBytes; uintptr_t _tenureAge; + /* The following start/end times are not used as thread local, but to record total cycle duration, done only by main thread. Ideally, these should be moved to the collector. */ uint64_t _startTime; uint64_t _endTime; #if defined(J9MODRON_TGC_PARALLEL_STATISTICS) @@ -98,6 +99,17 @@ class MM_ScavengerStats uintptr_t _copyScanUpdates; #endif /* J9MODRON_TGC_PARALLEL_STATISTICS */ + /* Stats Used Specifically for Adaptive Threading */ + uint64_t _workerScavengeStartTime; /**< Timestamp taken when worker starts the scavenge task */ + uint64_t _workerScavengeEndTime; /**< Timestamp taken when worker completes the scavenge task */ + + /* The time, in hi-res ticks, the thread spent stalled notifying other + * threads during scavenge. Note: this is not all inclusive, it records notify + * stall times only relevant to adaptive threading (e.g doesn't include backout cases) + */ + uint64_t _notifyStallTime; + uint64_t _adjustedSyncStallTime; /**< The time, in hi-res ticks, the thread spent stalled at a sync point ADJUSTED to account for critical section time */ + /* Average (weighted) number of bytes free after a collection and * average number of bytes promoted by a collection. Used by * concurrent collector to trigger concurrent when scavenger enabled. @@ -182,12 +194,19 @@ class MM_ScavengerStats } MMINLINE void - addToSyncStallTime(uint64_t startTime, uint64_t endTime) + addToSyncStallTime(uint64_t startTime, uint64_t endTime, uint64_t criticalSectionDuration = 0) { _syncStallCount += 1; _syncStallTime += (endTime - startTime); + _adjustedSyncStallTime += ((endTime - startTime) - criticalSectionDuration); } + MMINLINE void + addToNotifyStallTime(uint64_t startTime, uint64_t endTime) + { + _notifyStallTime += (endTime - startTime); + } + /** * Get the total stall time * @return the time in hi-res ticks