Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cpuminer: Significantly optimize mining workers. #2977

Merged
merged 2 commits into from
Jul 30, 2022

Conversation

davecgh
Copy link
Member

@davecgh davecgh commented Jul 27, 2022

This significantly optimizes the CPU miner that is useful for mining on testnet.

The existing CPU mining code prior to this change updated the relevant fields in the header and called its own hash method which internally makes use of the existing writer interface based serialization and shared buffers to prevent a lot of allocations. That makes sense most places in the code, but it is really bad for concurrent performance in a tight mining loop.

Thus, this modifies the CPU miner workers to instead remove all mutex contention by serializing the header once outside of the loop, directly updating the relevant bytes in the serialized data, using the allocation-free hash method to compute the hash from the serialized bytes directly, and finally updating the block template header when/if a solution is found.

It also now makes use of the much more efficient primitives and uint256 packages instead of big integers to further reduce allocations and speed up the calculations.

The net result is a major speedup when mining with a single core and nearly linear scalar with multiple cores.

It also includes a separate commit that reworks the speed stat tracking to make it more accurate as well as significantly reduce its lock contention.

A high level overview of the changes is as follows:

  • Each worker is given its own stats state to independently update
    • Each worker now has a unique ID which is used to store and clean up said state
  • The elapsed hashing time per worker is now tracked
  • The individual speed stats are updated via lock-free atomic primitives versus the previous single mutex that all workers shared
  • Each worker now periodically updates its individual speed stats every fixed number of nonces instead of relying on a ticker
  • The speed monitor goroutine periodically sums the results from all of the individual worker stats
  • Each worker now only serializes the header once per template
  • Each worker independently directly updates the relevant bytes in the serialized header
  • The hashing process is now allocation free
  • The relevant checks use the much more efficient newer primtivies and uint256 packages

Concretely, prior to these changes, I was only seeing around 400-570 kh/s with 1 or 2 cores and mining on more cores actually made the performance worse due to the all of the contention. With these changes, I'm seeing around 1.2Mh/s on a single core and over 10MH/s with 10 cores.

This reworks the speed stat tracking to make it more accurate as well as
significantly reduce lock contention.

A high level overview of the changes is as follows:

- Each worker is given its own stats state to independently update
  - Each worker now has a unique ID which is used to store and clean up
    said state
- The elapsed hashing time per worker is now tracked
- The individual speed stats are updated via lock-free atomic primitives
  versus the previous single mutex that all workers shared
- Each worker now periodically updates its individual speed stats every
  fixed number of nonces instead of relying on a ticker
- The speed monitor goroutine periodically sums the results from all of
  the individual worker stats

Some local test mining shows that prior to these changes the reported
hash speed varies pretty wildly between 43 kh/s and 427 kh/s and more
wild swings with more workers.  With the changes, the reporting is more
in line with what I would expect without additional changes and comes in
a much tigher (and a bit higher due to a bit of reduced lock contention)
range of 408 kh/s and 568 kh/s.
@davecgh davecgh added this to the 1.8.0 milestone Jul 27, 2022
The existing code prior to this change updated the relevant fields in
the header and called its own hash method which internally makes use of
the existing writer interface based serialization and shared buffers to
prevent a lot of allocations.  That makes sense most places in the code,
but it is really bad for concurrent performance in a tight mining loop.

Thus, this significantly optimizes the CPU miner workers to instead
remove all mutex contention by serializing the header once outside of
the loop, directly updating the relevant bytes in the serialized data,
using the allocation-free hash method to compute the hash from the
serialized bytes directly, and finally updating the block template
header when/if a solution is found.

It also now makes use of the much more efficient primitives and uint256
packages instead of big integers to further reduce allocations and speed
up the calculations.

The net result is a major speedup when mining with a single core and
nearly linear scalar with multiple cores.

Concretely, prior to these changes, I was only seeing around 400-570
kh/s with 1 or 2 cores and mining on more cores actually made the
performance worse due to the all of the contention.  With these changes,
I'm seeing around 1.2Mh/s on a single core and over 10MH/s with
10 cores.
Copy link
Member

@matheusd matheusd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR, but dcrctl generate 1 doesn't report on the speed. Found out after not seeing the H/s report :P

I can confirm the significant speedup in mining using this PR.

// new value.
littleEndian.PutUint64(header.ExtraData[:], extraNonce+enOffset)
// Update the extra nonce in the serialized header bytes directly.
const enSerOffset = 144
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double checked

Comment on lines +338 to +339
const bitsOffset = 116
const timestampOffset = 136
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double checked

@davecgh
Copy link
Member Author

davecgh commented Jul 28, 2022

Unrelated to this PR, but dcrctl generate 1 doesn't report on the speed. Found out after not seeing the H/s report :P

It could probably be updated now due to the speed stat tracking changes, but that was actually intentional since it didn't report properly for discrete mining and the HashesPerSecond method on the cpu miner calls it out:

// HashesPerSecond returns the number of hashes per second the normal mode
// mining process is performing.  0 is returned if the miner is not currently
// mining anything in normal mining mode.
//
// This function is safe for concurrent access.
func (m *CPUMiner) HashesPerSecond() float64 {

Copy link
Member

@chappjc chappjc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Can confirm massive hashrate improvement.

@davecgh davecgh merged commit 5c49ffb into decred:master Jul 30, 2022
@davecgh davecgh deleted the cpuminer_optimize branch July 30, 2022 09:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants