This repository has been archived by the owner on Apr 24, 2022. It is now read-only.

AMD max. 4 GB per allocation workaround for 8GB cards #1977

Merged
merged 2 commits into from Mar 24, 2020

Conversation

jean-m-cyr
Contributor

@jean-m-cyr jean-m-cyr commented Mar 23, 2020

  • Run AMD in split-DAG memory mode so that individual memory allocations do not exceed 4 GB.
  • DAG memory is allocated in two equal-size parts: one for even-index entries and one for odd-index entries.
  • Update binary kernels to support split DAG.
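The even/odd split described above can be sketched in plain C (hypothetical names, not the PR's actual code): entry i lives in half (i & 1) at offset i / 2, so neither buffer ever exceeds the 4 GB per-allocation limit.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch of split-DAG addressing; g_dag0/g_dag1 and
 * DAG_ENTRIES are assumptions for the example, not the PR's code. */
#define DAG_ENTRIES 8u

static uint32_t g_dag0[DAG_ENTRIES / 2]; /* even-index entries */
static uint32_t g_dag1[DAG_ENTRIES / 2]; /* odd-index entries  */

static uint32_t* dag_entry(uint32_t idx)
{
    /* The low bit selects the half-buffer... */
    uint32_t* half = (idx & 1) ? g_dag1 : g_dag0;
    /* ...and the remaining bits are the offset within that half. */
    return &half[idx >> 1];
}
```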

@AndreaLanfranchi
Collaborator

Will review, but I have a question.
Do we really need to split the DAG? Has anyone tried issuing two allocations to see whether the resulting pointers point to two adjacent memory locations? If they do, it would only be a problem of allocation, and the rest of the code would stay the same.

@ddobreff
Collaborator

Tested on Polaris and Radeon VII with the 19.30 OpenCL driver and it worked; the log was posted in a DM to Jean M. Cyr.

@jean-m-cyr
Contributor Author

jean-m-cyr commented Mar 24, 2020

@AndreaLanfranchi There's no way to guarantee that two allocated blocks will be adjacent. Other users, such as a desktop GUI, can be allocating and freeing concurrently.

Comment on lines +239 to +241
g_dag = (__global hash128_t const*) _g_dag0; \
if (idx & 1) \
g_dag = (__global hash128_t const*) _g_dag1; \
Collaborator


Suggested change
-    g_dag = (__global hash128_t const*) _g_dag0; \
-    if (idx & 1) \
-        g_dag = (__global hash128_t const*) _g_dag1; \
+    if (!(idx & 1)) \
+        g_dag = (__global hash128_t const*) _g_dag0; \
+    else \
+        g_dag = (__global hash128_t const*) _g_dag1; \

This should save an address translation, and it is semantically similar to the same test in DAG generation.
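The two macro styles under discussion can be compared in a plain-C sketch (names like select_* and the *_buf arrays are illustrative, not from the PR). Both pick the same half-buffer for any index; the only difference is whether the even case is an unconditional default or an explicit branch.

```c
#include <stdint.h>

/* Hypothetical buffers standing in for _g_dag0 / _g_dag1. */
static int g_dag0_buf[1];
static int g_dag1_buf[1];

/* Style 1: assign the even half by default, override for odd. */
static int* select_default_then_override(uint32_t idx)
{
    int* g_dag = g_dag0_buf;
    if (idx & 1)
        g_dag = g_dag1_buf;
    return g_dag;
}

/* Style 2 (the suggested change): explicit if/else on the low bit. */
static int* select_if_else(uint32_t idx)
{
    int* g_dag;
    if (!(idx & 1))
        g_dag = g_dag0_buf;
    else
        g_dag = g_dag1_buf;
    return g_dag;
}
```

Whether either form is faster in the kernel depends on what the compiler emits; as noted below, no perceptible difference was measured.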

Contributor Author


You could do it that way, but I tried it and saw no perceptible speed difference on a 480. There is no need for translation; _g_dag0 and _g_dag1 are already in the GPU context.

@AndreaLanfranchi
Collaborator

AndreaLanfranchi commented Mar 24, 2020

As far as I understand, this change imposes the split regardless of whether or not it's necessary.
There will surely be a decrease in hashrate for, say, private chains, or other ethash-like chains, where DAG_SIZE < MAX_ALLOC_SIZE.

If we only had to maintain the .cl (source) kernel, a simple preprocessor directive would solve the problem; I understand that maintaining the binary files is a PITA.

Contributor Author

@jean-m-cyr jean-m-cyr left a comment


I had it working with a compiler directive to control split vs. non-split mode, including for binary kernels, but:

  • There was no measured speed difference between split and non-split.
  • It would double the number of binary kernels.

@AndreaLanfranchi
Collaborator

no measured speed difference between split vs. non-split.

I can hardly believe it: a conditional plus an index "re-index" for every thread is something.
Anyway, I have to take your word for it, as I don't have any AMD card to test on.

If @ddobreff is ok with the test I'm also ok with it.

@AndreaLanfranchi
Collaborator

Voids the need for #1969

@jean-m-cyr
Contributor Author

no measured speed difference between split vs. non-split.

I can hardly believe it: a conditional plus an index "re-index" for every thread is something.
Anyway, I have to take your word for it, as I don't have any AMD card to test on.

If @ddobreff is ok with the test I'm also ok with it.

Non split mode opencl

 m 11:33:58 ethminer 0:00 A0 43.60 Mh - cl0 27.74 47C 62% A0, cu1 15.86 60C 31% A0

Split mode opencl

 m 11:30:06 ethminer 0:13 A4 44.76 Mh - cl0 28.90 60C 62% A2, cu1 15.86 60C 30% A2

Split mode is actually faster!

@AndreaLanfranchi
Collaborator

I'm puzzled... anyway, I won't investigate. AMD and its driver weirdness have lost all interest for me.

Bottom line : good job @jean-m-cyr

@joaogti36

Where can we test ethminer with these changes on CUDA cards... GTX 1070/1080?

@ddobreff
Collaborator

ddobreff commented Apr 3, 2020

The changes are related to AMD OpenCL; NVIDIA is not affected.

@joaogti36

NVIDIA cards have also been losing hashrate since November... the bigger the DAG, the less speed the cards get... the 1070, that is!
