Skip to content
This repository has been archived by the owner on Apr 24, 2022. It is now read-only.

Mixed multiple AMD+NVIDIA fix #770

Merged
merged 1 commit into from
Feb 28, 2018

Conversation

Davesmacer
Copy link
Contributor

@Davesmacer Davesmacer commented Feb 19, 2018

I managed to get my hashrate working for all the cards, I'm currently working on fixing hwinfo monitors which are not working for the AMD cards. Please refer to #725 last comments to see the problem and solution.

Please don't close the PR until I fix monitors.

@Davesmacer
Copy link
Contributor Author

Davesmacer commented Feb 19, 2018

Hey @smurfy, do you know if adl overdrive API and sysfs use the same index to poll devices? I'm guessing ADL API start first AMD with 0 index and ignores the NVIDIA, but I don't think sysfs is the same. I just don't know how sysfs polling works... Question is to see how I should tackle the hwinfo fix, by saving right index or by changing it before different API calls...

@smurfy
Copy link
Collaborator

smurfy commented Feb 20, 2018

Both go 0 to n of the same type.

so if you have 5 nvidia and 3 amd you have:
0-7 miner gpus
0-4 id nvidia nvml
0-2 id amd syfs/adl

sysfs then maps to all amd cards in sysfs. but the interface uses indexes starting at 0 even if you have nvidia cards which are for example card0-card4 in the file system. then 0 sysfs internally map to card5

Adl should be the same, only that adl returns a bunch of logical cards and which then gets iterated to get the actual physical cards.

@Davesmacer
Copy link
Contributor Author

Thank you, I think I got it...

@Davesmacer
Copy link
Contributor Author

Davesmacer commented Feb 20, 2018

It is showing correctly the monitors with the last commit, but they seem to be in different order. Monitor for card 0 shows in card 1. I've been working on that, but couldn't manage to solve it yet. @smurfy any ideas? Maybe ADL uses different enumeration or something...

Edit: I figured with an old image that the issue was there before with -G option, I just didn't pay attention. I'm trying to figure it out anyway

@jean-m-cyr
Copy link
Contributor

@Davesmacer Just saw this go by in gitter, may be of interrest:

"to be able to match sysfs first you have to match opencl device to pci device
its not that hard to do it"

@Davesmacer
Copy link
Contributor Author

Davesmacer commented Feb 20, 2018

I doing the mapping to PCI actually! But @smurfy told me he had already the mapping for sysfs. I'm trying to figure that out to make a correct mapping for ADL... It is very simple to save PCI id given the card index (0...n), but then, mapping that to ADL format is the tricky part. I'll check on gitter to see if something of help comes out...

@smurfy
Copy link
Collaborator

smurfy commented Feb 20, 2018

i have method to get pci:id on sysfs :) we need a method to get pci:id from opencl

@smurfy
Copy link
Collaborator

smurfy commented Feb 20, 2018

the mapping need to be done in all 3 implementations.

@Davesmacer
Copy link
Contributor Author

oh i see! and you also have that method in nvml right? I'll do it in all 3 wrappers then...

@smurfy
Copy link
Collaborator

smurfy commented Feb 20, 2018

yes, nvml has a mapping between nvml device id + cuda device id based on pci-id.
The same need to be added for adl, sysfs and nvml id to opencl device id based on pci-id.
The fetching of pci-ids with opencl ive posted on the other issue.
Nvml already has pci-id
For adl and sysfs i can post pci-id fetching

@smurfy
Copy link
Collaborator

smurfy commented Feb 20, 2018

commited and pushed pci-id fetching

@Davesmacer
Copy link
Contributor Author

Awesome @smurfy!!! I'll do the matching in the wrappers so we have a common ID for the program. Should be ready at night. Thank you

@Davesmacer
Copy link
Contributor Author

All right guys! It is working for ADL on Windows now, I just feel like it can be improved somehow before I pass the same code to the other wrappers to map OpenCL ID... I believe it got a little messy when I had to make calls to OpenCL from the wrapper. I guess it would be nicer to write a wrapper method in cl2 or something...
Also, I'm not sure if I used the macros correctly, I am still getting used to c++

@Davesmacer
Copy link
Contributor Author

Davesmacer commented Feb 21, 2018

I'm checking the tests... I'll commit when fixed. Don't review yet haha

@Davesmacer
Copy link
Contributor Author

All right! I got it working for ADL, It is really simple to pass the mapping to the other wrappers (nvml and sysfs), but I wanted you to check if it can be improved before passing the code to the other wrappers...

I tried fixing warnings on Linux, but the "casting between pointer-to-function and pointer-to-object" warning is practically a compiler issue... I couldn't find any workaroud to solve it

Copy link
Collaborator

@smurfy smurfy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awsome Job so far,
In my opinion the mapping array gathering should be in the wrappers, like with cuda->nvml. So putting it inside the ADL file is good.

As mentioned in my comments, the actual use of the mapping should be done in the CALLER, not in the wrapper itself. all wrapper functions should use its own id, because for example nvml need to support opencl and cuda.

@@ -180,7 +245,8 @@ int wrap_adl_get_tempC(wrap_adl_handle *adlh, int gpuindex, unsigned int *tempC)
return -1;

ADLTemperature *temperature = new ADLTemperature();
rc = adlh->adlOverdrive5TemperatureGet(adlh->phys_logi_device_id[gpuindex], 0, temperature);
int adlidx = adlh->opencl_adl_device_id[gpuindex];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gpuindex should be always adl id, the caller should use the mapping array if needed

@@ -197,7 +263,8 @@ int wrap_adl_get_fanpcnt(wrap_adl_handle *adlh, int gpuindex, unsigned int *fanp

ADLFanSpeedValue *fan = new ADLFanSpeedValue();
fan->iSpeedType = 1;
rc = adlh->adlOverdrive5FanSpeedGet(adlh->phys_logi_device_id[gpuindex], 0, fan);
int adlidx = adlh->opencl_adl_device_id[gpuindex];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gpuindex should be always adl id, the caller should use the mapping array if needed

@@ -213,7 +280,8 @@ int wrap_adl_get_power_usage(wrap_adl_handle *adlh, int gpuindex, unsigned int*
return -1;
}
int power = 0;
rc = adlh->adl2Overdrive6CurrentPowerGet(adlh->context, adlh->phys_logi_device_id[gpuindex], 0, &power);
int adlidx = adlh->opencl_adl_device_id[gpuindex];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gpuindex should be always adl id, the caller should use the mapping array if needed

@@ -263,26 +273,29 @@ class Farm: public FarmFace
unsigned int tempC = 0, fanpcnt = 0, powerW = 0;
if (hwInfo.deviceIndex >= 0) {
if (hwInfo.deviceType == HwMonitorInfoType::NVIDIA && nvmlh) {
wrap_nvml_get_tempC(nvmlh, hwInfo.deviceIndex, &tempC);
wrap_nvml_get_fanpcnt(nvmlh, hwInfo.deviceIndex, &fanpcnt);
int typeidx = hwInfo.deviceIndex % m_cuda_count;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the mapping here ocl->nvml, but we need a additional field in hwInfo, if the deviceIndex source is CUDA or OpenCL. In case of cuda i would also use the cuda->nvml mapping to be on the safe side.

}
}
else if (hwInfo.deviceType == HwMonitorInfoType::AMD && adlh) {
wrap_adl_get_tempC(adlh, hwInfo.deviceIndex, &tempC);
wrap_adl_get_fanpcnt(adlh, hwInfo.deviceIndex, &fanpcnt);
int typeidx = hwInfo.deviceIndex % m_opencl_count;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do the ocl->adl mapping here

}
}
#if defined(__linux)
// Overwrite with sysfs data if present
if (hwInfo.deviceType == HwMonitorInfoType::AMD && sysfsh) {
wrap_amdsysfs_get_tempC(sysfsh, hwInfo.deviceIndex, &tempC);
wrap_amdsysfs_get_fanpcnt(sysfsh, hwInfo.deviceIndex, &fanpcnt);
int typeidx = hwInfo.deviceIndex % m_opencl_count;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do the ocl->sysfs mapping here

@Davesmacer
Copy link
Contributor Author

Got it!! I'll be making the changes and replying the ADL mapping to NVML and SYSFS wrappers later this afternoon.

@Davesmacer
Copy link
Contributor Author

Davesmacer commented Feb 21, 2018

All right!! @smurfy ADL mapping with OpenCL is working with the proper changes... Please review so I can pass the code to all the wrappers... Should be very easy

Copy link
Collaborator

@smurfy smurfy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrappers look good, and the current OPENCL AMD Stuff should work with sysfs.
For nvidia we need the additional fetching of pci-id via opencl for that.

Also i hope i made clear the reasoning behind the need of physical/logical GPU stuff in stupid ADL ^^

@@ -192,12 +255,12 @@ int wrap_adl_get_tempC(wrap_adl_handle *adlh, int gpuindex, unsigned int *tempC)
int wrap_adl_get_fanpcnt(wrap_adl_handle *adlh, int gpuindex, unsigned int *fanpcnt)
{
wrap_adlReturn_t rc;
if (gpuindex < 0 || gpuindex >= adlh->adl_gpucount)
if (gpuindex < 0 || gpuindex >= adlh->log_gpucount){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use adl_gpucount not log_gpucount. the adl count contain only physical devices and not all logical devices which can be a LOT. All other wrappers use the physical index so if you have 5 cards index 0-4 works.
If we use logical 0-50 or so would be valid indexes, but get the same values till we are on the next physical device. like 0-7 or so is first card...8-14 the next.
The logical count also varies per physical card. it depends on its connectors and stuff.

ADLFanSpeedValue *fan = new ADLFanSpeedValue();
fan->iSpeedType = 1;
rc = adlh->adlOverdrive5FanSpeedGet(adlh->phys_logi_device_id[gpuindex], 0, fan);
rc = adlh->adlOverdrive5FanSpeedGet(gpuindex, 0, fan);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

phys_logi_device_id maps to the first logical adapter so this needs to be used too

@@ -140,6 +160,33 @@ return NULL;
if (adapterID == last_adapter) {
continue;
}
#if ETH_ETHASHCL
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be AFTER the logical to physical loop, not inside of it.
The logical gpus are just an annoyance and we should only use the physical ones.
(i also had a bug in the beginning, because i thought we get the actual physical gpus, but got logical. so i added the mapping after the first implementation)

So you can do a for (unsigned gpuindex =0; gpuindex<adlh->adl_gpucount;gpuindex++) and use adlh->devs[adlh->phys_logi_device_id[gpuindex]].iBusNumber

return -1;
}
int power = 0;
rc = adlh->adl2Overdrive6CurrentPowerGet(adlh->context, adlh->phys_logi_device_id[gpuindex], 0, &power);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, keep the physical mapping in place please

@Davesmacer
Copy link
Contributor Author

I'm on it @smurfy... I got rid of phys->log mapping because it was getting me the wrong index. I have to see how to make proper mapping between opencl->phys and then phys->log. I guess that doing another loop after the phys->log mapping as you pointed out would solve that. Thank you

@Davesmacer
Copy link
Contributor Author

I did the correction on ADL. I'll work on the other wrappers this night!

adapterIndex, adlh->devs[i].iBusNumber, adlh->devs[i].iDeviceNumber, adlh->devs[i].iFunctionNumber,
j, (int)topology.pcie.bus, (int)topology.pcie.device, (int)topology.pcie.function);
printf("[DEBUG] - ADL GPU[%d]%d,%d,%d matches OpenCL GPU[%d]%d,%d,%d\n",
adapterIndex, adlh->devs[i].iBusNumber, adlh->devs[i].iDeviceNumber, adlh->devs[i].iFunctionNumber,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug uses not phys to logi :)


int wrap_adl_get_tempC(wrap_adl_handle *adlh, int gpuindex, unsigned int *tempC)
{
wrap_adlReturn_t rc;
if (gpuindex < 0 || gpuindex >= adlh->adl_gpucount)
if (gpuindex < 0 || gpuindex >= adlh->log_gpucount){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still using log_gpucount instead adl_gpucount here.(and in other functions too)

@jean-m-cyr
Copy link
Contributor

@smurfy @Davesmacer This is all getting fairly confusing! What is the guiding philosophy now with respect to device indices??? Is there a unique per card index... or two sets of indices for CUDA vs. CL... or, 3 sets by how you retrieve card info?

@smurfy
Copy link
Collaborator

smurfy commented Feb 22, 2018

@jean-m-cyr

its fairly simple:

  • The wrapper library functions all use the card index of the driver api. (ADL here has internal mappings because the adl api only returns a list of all logical gpus, we are only interested in physical cards)
  • Since opencl and cuda device indices can differ from the driver ones the wrapper also provide mappings to and from the driver one.
  • The caller (farm.h) decides what mapping to use based on miner thread/type
  • ADL and SYSFS only need opencl mappings
  • NVML needs to provide both cuda and opencl

@jean-m-cyr
Copy link
Contributor

@Davesmacer This is a rather long chain of commits. Do you know about rebase squash, and push --force?

@Davesmacer Davesmacer force-pushed the mixed-fix branch 2 times, most recently from 6325d28 to ac11b3d Compare February 23, 2018 01:25
@Davesmacer
Copy link
Contributor Author

@jean-m-cyr I finally got some time to rebase last couple of days of failed commits haha. Sorry to reply until now, but I have to keep attending regular work.

@smurfy Please check now. I believe all the code is there:
-ADL mapping with OpenCL works like a charm as I'm using it right now on my mixed rig.
-NVML mapping with CUDA is working as expected, however, as all my Nvidia cards have the same performance, I can't really tell.
-NVML mapping with OpenCL is there, but I can't test that because I only can run OpenCL with the option "--opencl-platform 2". This happens because OpenCL platform 0 is listing as my Intel Graphics, and while attempting to run with all platforms it fails to mine with Intel Graphics. It would be nice to have a "--opencl-platforms 1 2" option
-SYSFS mapping with OpenCL code is there, but as I can't compile it I commented all the lines I added so anyone can test it and fix any compiling issues. Please help me with this

@Davesmacer
Copy link
Contributor Author

Why not many will be running mixed rigs in Linux? In my case, I just buy whatever card with good efficiency that comes to my suppliers on a reasonable price. With video cards getting sold out every day, I can't really decide which brand to buy...

@SnowLeopard71
Copy link
Contributor

I'll be able to test once the update is merged.
Maybe many settle for running mixed rigs on windows instead of linux due to difficulty in installing both nvidia and amd drivers correctly on linux?
@Davesmacer Want to open another issue to discuss problems overclocking your cards?

@Davesmacer
Copy link
Contributor Author

Oh I get it, I didn't even manage to try configuring both drivers, I switched to Windows after not being able to OC the Nvidia cards... I guess I could, but I don't think that is an ethminer issue.

@nachitox
Copy link

Quick steps to configure NVIDIA and AMD on Ubuntu. You can overclock later with nvidia-settings

  1. Select integrated graphic card on BIOS
  2. Enter as root in console mode (not xorg)
  3. Mount disk: mount -o rw,remount /
  4. Install NVIDIA drivers: ./NVIDIA-Linux-x86_64-375.26.run --no-opengl-files
  5. Install AMD drivers: ./amdgpu-pro-install --headless --opencl=legacy,rocm
  6. Configure /etc/X11/xorg.conf
Section "ServerLayout"
    Identifier "layout"
    Screen 0 "intel"
    Screen 1 "nvidia"
EndSection

Section "Device"
    Identifier "intel"
    Driver "intel"
    BusID "PCI:0@0:2:0"
    Option "AccelMethod" "SNA"
EndSection

Section "Screen"
    Identifier "intel"
    Device "intel"
EndSection

Section "Device"
    Identifier "nvidia"
    Driver "nvidia"
    BusID "PCI:1@0:0:0"
    Option "ConstrainCursor" "off"
    Option "Coolbits" "28"
EndSection

Section "Screen"
    Identifier "nvidia"
    Device "nvidia"
    Option "AllowEmptyInitialConfiguration" "on"
    Option "IgnoreDisplayDevices" "CRT"
EndSection

@Davesmacer
Copy link
Contributor Author

Davesmacer commented Feb 26, 2018

Hello everyone!

@nachitox thank you! I'll try to configure both GPU's when I switch to Linux. On the meantime I'm stuck on Winshit.

@smurfy the mapping using those undocumented tokens seemed to work ok, it actually mapped devices on the same order that OCL gave them. I don't know if this is only my machine or if they are always in the same order for NVML and OCL. However, when trying to actually mine with OCL and my Nvidia cards, the program stopped abruptly as if some uncaught exception raised after line 690 of CLMiner:

ETHCL_LOG("Creating mining buffer");

I reviewed the code and noticed nothing strange. I think the bug is in the lines I added/changed in CLMiner.cpp. I'll dig deeper into this later tonight, but if you have any ideas of what this might be I'll be grateful.

@smurfy
Copy link
Collaborator

smurfy commented Feb 26, 2018

Sorry, i'm a bit busy at the moment but i will rebase and fix all remaining stuff here.
Probably within this week

@smurfy smurfy force-pushed the mixed-fix branch 3 times, most recently from e5a9106 to f5ceb15 Compare February 26, 2018 23:25
@smurfy
Copy link
Collaborator

smurfy commented Feb 26, 2018

Ok, found some time :) Since i do not have a mixed setup to test with i cant test it.
I tested:

  • nvml opencl/cuda (only one card)
  • sysfs opencl (3 cards)

@smurfy smurfy force-pushed the mixed-fix branch 3 times, most recently from 358bee8 to 750f592 Compare February 26, 2018 23:43
@smurfy
Copy link
Collaborator

smurfy commented Feb 27, 2018

After some thinking while trying to sleep, i think the m_cuda_count/m_opencl_count approach is wrong and does not really work. especially with custom device selection. In my opinion the deviceIndex should be set correctly with the Miner class. I just need to figure out how to set it. I think the correct value is already there somewhere :)

Samples of correct values:

4 Nvidia cuda cards: index 0-3
3 Opencl cards: index 0-2
4 nvidia cuda + 3 amd opencl: index 0-3 for cuda, 0-2 opencl

4 nvidia with device selection --cuda-devices 1 3: index 1 3

Especially the last sample will not work with current version. Since i can test this without mixed cards. I will do that tonight.

@chfast
Copy link
Contributor

chfast commented Feb 27, 2018

I have no read all of the comments, but what do you think about enumerating devices globally, without distinguishing the vendor?

@smurfy
Copy link
Collaborator

smurfy commented Feb 27, 2018

@chfast well, the enumeration of devices will be done only once per wrapper on miner startup.

  • nvml: nvml api devices, opencl devices with limit to nvidia, cuda devices
  • sysfs: detect linux cards hwmon, opencl devices with limit to amd
  • adl: adl api device, opencl devices with limit to amd.

Only double code is opencl get devices + count. Its around 10 lines. (1:1 on adl/sysfs, slight modified on nvml) Which can be probably moved to wrapper. everything else is quite wrapper api specific.

opencl and cuda fetching is to provide mapping arrays between internal api device index and opencl/cuda device index, which can and is different depending on operating system and possible enabled/disabled devices.

@Davesmacer
Copy link
Contributor Author

Davesmacer commented Feb 27, 2018

I just pulled the last version, I couldn't build. Hunter throws an error while configuring new OpenSSL lib:

image

I believe is something related with a file named NUL in the OpenSSL lib. Windows get some errors with those files.
Any ideas?

@smurfy
Copy link
Collaborator

smurfy commented Feb 27, 2018

Any ideas?

i think you need perl installed

@smurfy
Copy link
Collaborator

smurfy commented Feb 27, 2018

Ok, i rebased again and added a fix of the device ids. Tested with AMD on linux only.
It worked great even with --opencl-device 0 2 and it showed the correct temp/fan.

So hopefully this PR is now complete. I have some HWMON optimisation in my mind but i will do them with a separate PR.

@Davesmacer
Copy link
Contributor Author

Works like a charm on Windows 10 with 7 Nvidia and 2 AMD. Also with 18 Nvidia shows correct monitors. It also works specifying devices with --opencl-devices and cuda-devices! You're awesome @smurfy!

@jean-m-cyr
Copy link
Contributor

Looks like this might be it! Time for:

git rebase -i HEAD~n

…cted cards.

Multiple AMD+NVIDIA HW monitors fix

Help corrected in -X option. Some minor bugs fixed. Working on switched monitors values

Help minor change

Added methods to get pci-id

WrapADL is working now and showing correctly the monitors for my mixed rig with two AMD's.

Lost include fix

Fixing OpenCL index mapping in all the wrappers. Merge from master

Farm.h mappings

Mapping NVML with OCL

Some cleanup and refactoring

Using the device index, needs testing
@smurfy
Copy link
Collaborator

smurfy commented Feb 28, 2018

Rebased and squashed, ready to merge in my opinion :)

@jean-m-cyr jean-m-cyr self-requested a review February 28, 2018 19:42
@jean-m-cyr jean-m-cyr merged commit b81dcde into ethereum-mining:master Feb 28, 2018
@Davesmacer Davesmacer deleted the mixed-fix branch February 28, 2018 20:17
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants