AMD max. 4 GB per allocation workaround for 8GB cards #1977
Conversation
jean-m-cyr
commented
Mar 23, 2020
•
edited
- Run AMD in split DAG memory mode such that individual memory allocations do not exceed 4GB.
- DAG memory is allocated in two equal size parts. One for even index entries, and one for odd index entries.
- Update binary kernels to support split DAG
Will review but I have a question.
Tested on Polaris and Radeon VII
@AndreaLanfranchi There is no way to guarantee that two allocated blocks will be adjacent. Other users, such as a desktop GUI, can be allocating and freeing concurrently.
```c
g_dag = (__global hash128_t const*) _g_dag0; \
if (idx & 1) \
    g_dag = (__global hash128_t const*) _g_dag1; \
```
```diff
-g_dag = (__global hash128_t const*) _g_dag0; \
-if (idx & 1) \
-    g_dag = (__global hash128_t const*) _g_dag1; \
+if (!(idx & 1)) \
+    g_dag = (__global hash128_t const*) _g_dag0; \
+else \
+    g_dag = (__global hash128_t const*) _g_dag1; \
```
This should save an address translation and is semantically similar to the same test in DAG generation.
You could do it that way, but I tried it and saw no perceptible speed difference on a 480. There is no need for translation; _g_dag0 and _g_dag1 are already in the GPU context.
As far as I understand, this change imposes the split regardless of whether or not it's necessary. If we only had to maintain the .cl (source) kernel, a simple preprocessor directive would solve the problem. I understand the maintenance of binary files is a PITA.
I had it working with a compiler directive to control split vs. non-split mode, including for binary kernels.
- no measured speed difference between split and non-split.
- would double the number of binary kernels
I can hardly believe it: a conditional plus an index "re-index" for every thread is something. If @ddobreff is OK with the test, I'm also OK with it.
Voids the need for #1969
Non-split mode OpenCL (screenshot)
Split mode OpenCL (screenshot)
Split mode is actually faster!
I'm puzzled... anyway, I won't investigate. AMD and its driver weirdness have lost any interest for me. Bottom line: good job @jean-m-cyr
Where can we test ethminer with those changes on CUDA cards... GTX 1070/1080?
Changes are related to AMD OpenCL; Nvidia is not affected.
Nvidia cards have also been losing hashrate since November... the bigger the DAG, the less speed the cards deliver. The 1070, that is!