cpuminer: Significantly optimize mining workers. #2977

davecgh · 2022-07-27T23:58:42Z

This significantly optimizes the CPU miner that is useful for mining on testnet.

The existing CPU mining code prior to this change updated the relevant fields in the header and called its own hash method, which internally makes use of the existing writer-interface-based serialization and shared buffers to prevent a lot of allocations. That makes sense in most places in the code, but it is really bad for concurrent performance in a tight mining loop.

Thus, this modifies the CPU miner workers to instead remove all mutex contention by serializing the header once outside of the loop, directly updating the relevant bytes in the serialized data, using the allocation-free hash method to compute the hash from the serialized bytes directly, and finally updating the block template header when/if a solution is found.

It also now makes use of the much more efficient primitives and uint256 packages instead of big integers to further reduce allocations and speed up the calculations.
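To illustrate why a fixed-size integer type helps here, the following sketch compares a hash against a target using four 64-bit limbs instead of big integers. The `u256` type and its methods are hypothetical stand-ins written for this example, not the API of the uint256 package, and the hash is assumed to be interpreted as a little-endian number:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// u256 is a hypothetical 256-bit unsigned integer stored as four 64-bit
// limbs, least significant limb first.  It is a plain value type, so
// comparisons need no heap allocations.
type u256 [4]uint64

// fromLittleEndian interprets 32 bytes as a little-endian 256-bit integer.
func fromLittleEndian(b [32]byte) u256 {
	var n u256
	for i := range n {
		n[i] = binary.LittleEndian.Uint64(b[i*8:])
	}
	return n
}

// lessOrEqual reports whether a <= b by comparing limbs starting from the
// most significant one.
func (a u256) lessOrEqual(b u256) bool {
	for i := 3; i >= 0; i-- {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return true
}

func main() {
	// Hypothetical target with only the low 32 bits of the top limb set.
	target := u256{0, 0, 0, 0x00000000ffffffff}

	var hash [32]byte
	hash[27] = 0x01 // a value below the target

	fmt.Println("meets target:", fromLittleEndian(hash).lessOrEqual(target))
}
```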

The net result is a major speedup when mining with a single core and nearly linear scaling with multiple cores.

It also includes a separate commit that reworks the speed stat tracking to make it more accurate as well as significantly reduce its lock contention.

A high level overview of the changes is as follows:

- Each worker is given its own stats state to independently update
  - Each worker now has a unique ID which is used to store and clean up said state
- The elapsed hashing time per worker is now tracked
- The individual speed stats are updated via lock-free atomic primitives versus the previous single mutex that all workers shared
- Each worker now periodically updates its individual speed stats every fixed number of nonces instead of relying on a ticker
- The speed monitor goroutine periodically sums the results from all of the individual worker stats
- Each worker now only serializes the header once per template
- Each worker independently directly updates the relevant bytes in the serialized header
- The hashing process is now allocation free
- The relevant checks use the much more efficient newer primitives and uint256 packages

Concretely, prior to these changes, I was only seeing around 400-570 kh/s with 1 or 2 cores, and mining on more cores actually made the performance worse due to all of the contention. With these changes, I'm seeing around 1.2 Mh/s on a single core and over 10 Mh/s with 10 cores.

This reworks the speed stat tracking to make it more accurate as well as significantly reduce lock contention.

A high level overview of the changes is as follows:

- Each worker is given its own stats state to independently update
- Each worker now has a unique ID which is used to store and clean up said state
- The elapsed hashing time per worker is now tracked
- The individual speed stats are updated via lock-free atomic primitives versus the previous single mutex that all workers shared
- Each worker now periodically updates its individual speed stats every fixed number of nonces instead of relying on a ticker
- The speed monitor goroutine periodically sums the results from all of the individual worker stats

Some local test mining shows that prior to these changes the reported hash speed varied pretty wildly between 43 kh/s and 427 kh/s, with even wilder swings with more workers. With the changes, the reporting is more in line with what I would expect without additional changes and comes in a much tighter (and a bit higher due to a bit of reduced lock contention) range of 408 kh/s to 568 kh/s.

internal/mining/cpuminer/cpuminer.go

The existing code prior to this change updated the relevant fields in the header and called its own hash method, which internally makes use of the existing writer-interface-based serialization and shared buffers to prevent a lot of allocations. That makes sense in most places in the code, but it is really bad for concurrent performance in a tight mining loop.

Thus, this significantly optimizes the CPU miner workers to instead remove all mutex contention by serializing the header once outside of the loop, directly updating the relevant bytes in the serialized data, using the allocation-free hash method to compute the hash from the serialized bytes directly, and finally updating the block template header when/if a solution is found.

It also now makes use of the much more efficient primitives and uint256 packages instead of big integers to further reduce allocations and speed up the calculations.

The net result is a major speedup when mining with a single core and nearly linear scaling with multiple cores.

Concretely, prior to these changes, I was only seeing around 400-570 kh/s with 1 or 2 cores, and mining on more cores actually made the performance worse due to all of the contention. With these changes, I'm seeing around 1.2 Mh/s on a single core and over 10 Mh/s with 10 cores.

matheusd

Unrelated to this PR, but dcrctl generate 1 doesn't report on the speed. Found out after not seeing the H/s report :P

I can confirm the significant speedup in mining using this PR.

matheusd · 2022-07-28T14:16:43Z

internal/mining/cpuminer/cpuminer.go

-		// new value.
-		littleEndian.PutUint64(header.ExtraData[:], extraNonce+enOffset)
+		// Update the extra nonce in the serialized header bytes directly.
+		const enSerOffset = 144


double checked

matheusd · 2022-07-28T14:21:38Z

internal/mining/cpuminer/cpuminer.go

+				const bitsOffset = 116
+				const timestampOffset = 136


double checked
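The same check can be reproduced by summing the sizes of the fields that precede each one. The sketch below assumes the field order and sizes of the Decred block header as listed in the `fields` table. That table and the `offsetOf` helper are written for this example and are not taken from the PR:

```go
package main

import "fmt"

// fields lists the assumed header fields in serialization order along with
// their sizes in bytes.
var fields = []struct {
	name string
	size int
}{
	{"Version", 4}, {"PrevBlock", 32}, {"MerkleRoot", 32}, {"StakeRoot", 32},
	{"VoteBits", 2}, {"FinalState", 6}, {"Voters", 2}, {"FreshStake", 1},
	{"Revocations", 1}, {"PoolSize", 4}, {"Bits", 4}, {"SBits", 8},
	{"Height", 4}, {"Size", 4}, {"Timestamp", 4}, {"Nonce", 4},
	{"ExtraData", 32}, {"StakeVersion", 4},
}

// offsetOf returns the byte offset of the named field in the serialized
// header, or -1 if the field is unknown.
func offsetOf(name string) int {
	offset := 0
	for _, f := range fields {
		if f.name == name {
			return offset
		}
		offset += f.size
	}
	return -1
}

func main() {
	for _, name := range []string{"Bits", "Timestamp", "ExtraData"} {
		fmt.Printf("%s offset: %d\n", name, offsetOf(name))
	}
}
```

Under those assumed sizes, the computed offsets agree with the `bitsOffset`, `timestampOffset`, and `enSerOffset` constants in the diff.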

davecgh · 2022-07-28T15:20:40Z

> Unrelated to this PR, but dcrctl generate 1 doesn't report on the speed. Found out after not seeing the H/s report :P

It could probably be updated now due to the speed stat tracking changes, but that was actually intentional since it didn't report properly for discrete mining and the HashesPerSecond method on the cpu miner calls it out:

// HashesPerSecond returns the number of hashes per second the normal mode
// mining process is performing.  0 is returned if the miner is not currently
// mining anything in normal mining mode.
//
// This function is safe for concurrent access.
func (m *CPUMiner) HashesPerSecond() float64 {

chappjc

Looks good. Can confirm massive hashrate improvement.

davecgh added this to the 1.8.0 milestone Jul 27, 2022


davecgh force-pushed the cpuminer_optimize branch from 60bccef to 7176ea9 Compare July 28, 2022 02:43

davecgh force-pushed the cpuminer_optimize branch from 7176ea9 to 5c49ffb Compare July 28, 2022 02:50

davecgh added the optimization label Jul 28, 2022

matheusd approved these changes Jul 28, 2022

chappjc approved these changes Jul 28, 2022

davecgh merged commit 5c49ffb into decred:master Jul 30, 2022

davecgh deleted the cpuminer_optimize branch July 30, 2022 09:08
