This repository has been archived by the owner on Apr 24, 2022. It is now read-only.

ethminer crashing #596

Closed
ajayaks opened this issue Jan 19, 2018 · 44 comments

Comments

@ajayaks

ajayaks commented Jan 19, 2018

Hi,
We are running ethminer with 4 GTX 1070 Ti cards and an MSI Z270 motherboard. After about 1 hour of mining, ethminer crashes and throws the error below.

"CUDA error in func 'ethash_cuda_miner:: search' at line 300 : unspecified launch failure"

Please suggest.

@fastaprilia

I would see this issue regularly when the 1070 gets too hot. Try backing off your overclocking or at least set your max temps to 60C.

@kronem

kronem commented Jan 20, 2018

I have this same error with the latest release. I rolled back to the 0.12.0 release, which never crashes. My GPU temps average around 55C, so I'm not pushing the cards too hard.
I am running 8 Gigabyte 1070 G1's.

@AndreaLanfranchi
Collaborator

From 0.12 to 0.13.rc9 there has been a massive change in job switching and in the calls to the GPU kernels.

Thus I suggest adopting 0.13.rc9 and lowering your OC settings. You will surely get the same (or even better) hashrate with lower GPU stress (less power consumption and heat production) and a far more stable hashrate as detected by the pool.

@AndreaLanfranchi
Collaborator

In general: OC settings for 0.12 may turn out to be too high for 0.13.rc9.

@DLS-bau

DLS-bau commented Jan 23, 2018

0.13 doesn't give a higher effective hashrate than 0.12, even at the same clocks. It's the reported hashrate that seems more stable. Factor in the crashes and you get a lower hashrate than even Claymore with its fee included.
No, the cards aren't running too hot; 0.13 is simply broken.

@jean-m-cyr
Contributor

jean-m-cyr commented Jan 23, 2018

0.13 doesn't give higher effective hashrate than 0.12.

It would take more than 24 hours of running two identically configured miners against the same workload for you to make that claim.

My take on these types of crashes: overclocking... period. Software that doesn't crash when not overclocked can't be blamed for crashes when overclocked. It's that simple! You want to push your GPUs and buses beyond their limits, fine... your call. Don't blame the software.

BTW. It is entirely possible that this problem can be mitigated in software. Make it happen at default clocking!

@satori-q3a

I've been running v0.13 for half a day now on two rigs and haven't had any problems. In fact, Nanopool is reporting a slightly higher hash rate than with v0.12, but that may be subjective or due to the ebb and flow of the pool tide.

Tuning is a compromise between high hash rates, power levels, and running stably. My cards use Micron memory and I've settled on settings geared more towards stability...

Nvidia 1070 (Micron)... GPU +80, memory +900, power -30 (70%) gives me 30 MH/s, 104 watts, and 65C with the MCU running at 100%.
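When comparing tunings like the one above, a hashrate-per-watt figure makes the trade-off concrete. This is just arithmetic on the numbers quoted, not anything ethminer itself reports:

```python
def efficiency_mh_per_watt(hashrate_mh: float, power_w: float) -> float:
    """Mining efficiency in MH/s per watt of board power."""
    return hashrate_mh / power_w

# The 1070 settings quoted above: 30 MH/s at 104 W
print(round(efficiency_mh_per_watt(30, 104), 3))  # 0.288 MH/s per watt
```

A more aggressive tune that adds a few MH/s but much more power can easily come out behind on this metric.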

@ZiDanRO

ZiDanRO commented Jan 24, 2018

This error appears on one or two of my rigs

CUDA error in func 'ethash_cuda_miner:: search' at line 300 : unspecified launch failure

I notice it happens 1, 2, or 14 hours after starting ethminer, but if I close it and start it again (only the program), in 99% of cases it runs without problems for days. I've also tried lowering the OC for a while, with the same behavior. I also have identical rigs without this error.

I think it has something to do with the memory allocation at the beginning, and if certain conditions are fulfilled it crashes. It's hard to find where the problem is. Anyway, it started with 0.13.rc1.

@ajayaks
Author

ajayaks commented Jan 24, 2018

Just saw the 0.13.0 release; I hope this issue has been fixed in it. Will verify and update.

@kronem

kronem commented Jan 24, 2018

I moved up to 0.13.0 rc9 and it's been running stable with no issues on two rigs, a total of 9 1070's. It also appears to have a better hash rate than previously.

@ddobreff
Collaborator

Compared to the previous rc1-rc7, rc9 has a significant improvement in share rate; the hashrate remains the same, but reported-to-effective is on par, or effective is a bit higher. Lower your OC settings by at least -100 on memory for stability.

@AndreaLanfranchi
Collaborator

0.13 doesn't give higher effective hashrate than 0.12

As we're talking about 0.13.0rc9, there is an empirical demonstration that this is actually possible. Having lowered the job switch time, each of your GPUs has slightly more time to hash a job and suffers only minor dips in hashrate, as depicted by the output.
Also, the average "effective" hashrate as reported by the pool has far less variance from the reported hashrate. Thus your overall performance has improved.

This anyway is MY experience all with NVIDIA (1050 ti, 1060 and 1070).

@jackyfd

jackyfd commented Jan 24, 2018

The pre-built 0.13.0 binary works well on my end, but when I build the binary from the 0.13.0 source, it crashes on startup.

I am using VS 2017 Community with the Chinese language pack, and I can see some incorrectly encoded words in the console. I can't be sure it is related to the crash.

@jackyfd

jackyfd commented Jan 24, 2018

Confirmed that a reinstall of VS 2017 with the English language pack does not help.

@jean-m-cyr
Contributor

@ZiDanRO

I notice it happens in 1-2-14 hours after starting ethminer, but if i close it and start again (only the program) in 99% of the cases it runs without problems for days.

Another interesting data point. Are you saying that once you restart the program after such a failure, it never happens again on the same rig?

@ZiDanRO

ZiDanRO commented Jan 24, 2018 via email

@kronem

kronem commented Jan 24, 2018

Need to withdraw my earlier comment. After a restart of my system, it is crashing after 5-10 minutes of mining. Repeated restarts are not helping. I switched to Claymore with the exact same overclock settings and have no issues.

How can the problem be overclocking when I am mining with the same settings with another miner???

@AndreaLanfranchi
Collaborator

How can the problem be overclocking when I am mining with the same settings with another miner???

Different CUDA kernel implementations may be the answer.

@jean-m-cyr
Contributor

jean-m-cyr commented Jan 24, 2018

How can the problem be overclocking when I am mining with the same settings with another miner???

Ok, legitimate question.

I spent 20 years working closely with the silicon designers at Broadcom. Here's what they taught me:

There is no such thing as digital logic! Everything is analog. We like to think of a flip-flop or a memory cell as being either a 0 or a 1, but in fact this is just a convenient way of thinking for those of us who don't have to deal with high-speed silicon design. In reality, what we really have is the probability of a 1 being read back as a 1, and the same for a 0. Designers choose the 'default' clock rate a chip will run at such that the probability of error is so low it can be considered 0 for all intents and purposes. As you increase the clock rate, the probability of error increases. When we overclock, we are effectively tuning a dial that controls that probability of error.

A single-bit GPU error can have an almost infinite range of effects, from a pixel being the wrong color for one frame in a video game to a misinterpreted CUDA instruction causing a bus fault! It can happen anywhere: in the CUDA instruction pipeline, in the DAG memory, at the PCIe interface... It can be caused by specific sequences of CUDA instructions that may or may not exist in any given version of a program.

These are not the types of phenomena that are diagnosable or correctable at the host software level. Sure, we could take the nuclear approach, implement some kind of watchdog, and demand reboot privilege for when the mine gets gummed up, but in the end all I'm interested in is the number of ka-chings I see at the end of the day. My humble 4-card miner at +600 mem transfer offset never crashes, and I'm OK with that!

@satori-q3a

I wonder, do they still teach Digital Logic Design? At its heart, logic elements are really just analog transistors that are fine-tuned to switch at specific voltage levels and to ignore noise on the line propagated by other logic elements in the system. And array processors, which CUDA hides from programmers, are especially susceptible to noise because of the high density of logic elements and the massive interactions among them.

oops... I digress...

@fastaprilia

Hope this is helpful -

My experience going from 13rc5 to 13.0 was that 13.0 is far more stable than 13rc5. There appears to be some sort of timing issue that surfaces during the search and can cause an illegal access error; the issue may be in the Nvidia software itself (I am on Nvidia 390.65 and Win10, all Pascal chips, no OpenCL hashing). I can force the timing issue more reliably by setting the CUDA parameters well above the defaults.

I started on 13.dev0 so I have no comparison between 12 vs 13.

In my environment, right after the DAG is built, my hashing rate skyrockets momentarily, well above what the card is capable of sustaining. I'm talking about the miner reporting 90 Mh/s on a card that I can push to sustain 40-45 Mh/s. If the miner is going to lock up, it is usually during this spike. I am running with --cuda-parallel-hash 8 --cuda-streams 16 --cuda-block-size 256 --cuda-grid-size 8192. If I double the grid size it will fail consistently and reasonably soon.

With the current settings it seems to be running well (beyond 15 hours at this point). I did have to back off the clocking a little bit between rc5 and 13.0 but my share rates are overall improved.
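For a rough sense of what those flags control: block size times grid size gives the number of GPU threads per kernel launch, and — assuming parallel-hash means hashes computed per thread, which is my reading of the --help text rather than a confirmed detail of the kernel — the product approximates the work queued per launch. A sketch of that arithmetic:

```python
def launch_work(block_size: int, grid_size: int, parallel_hash: int):
    """Threads per kernel launch, and an *assumed* hash count per launch
    (assumes each thread computes parallel_hash hashes; unverified)."""
    threads = block_size * grid_size
    return threads, threads * parallel_hash

# The settings quoted above:
# --cuda-block-size 256 --cuda-grid-size 8192 --cuda-parallel-hash 8
threads, hashes = launch_work(256, 8192, 8)
print(threads, hashes)  # 2097152 2097152*8 = 16777216
```

Doubling the grid size doubles both numbers, which is consistent with the report above that larger launches push an overclocked card over the edge sooner.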

@jean-m-cyr
Contributor

jean-m-cyr commented Jan 24, 2018

@fastaprilia

--cuda-streams 16

Try --cuda-streams 1.
I'm not sure why this parameter even exists. In CUDA, streams were introduced to support interleaving host-to-GPU and GPU-to-host data transfers. The CUDA miner does nearly zero such data transfers, so there's no benefit to increasing streams; in fact, higher stream counts will slow your job switch time a little.

@jean-m-cyr
Contributor

@satori-q3a

I wonder, do they still teach Digital Logic Design?

Yeah, they do. I'm not worried for the future. I've already passed the torch on to very capable young engineers.

@kronem

kronem commented Jan 24, 2018

"--cuda-parallel-hash 8 --cuda-streams 16 --cuda-block-size 256 --cuda-grid-size 8192"

I'm not using any of these settings. What is the effect of each? What should I set mine to, with Windows 10 and 8 nvidia 1070's?

@jean-m-cyr
Contributor

jean-m-cyr commented Jan 24, 2018

@kronem Hard to say... not fluent in Windows. I go with the defaults and --cuda-streams 1 on my 1060s.

@kronem

kronem commented Jan 25, 2018

I've been running claymore for the last 8 hours and so far ethminer outperforms it. Need it to be stable though, otherwise it is pointless.

@satori-q3a

Actually, I've seen that error with ethminer, but only on the workstation. I blame it on Chrome with GPU acceleration enabled not playing nice with CUDA, but I hate to disable GPU accel because a page like the GDAX exchange will run a CPU core at 100%. So I'm aware that when the desktop goes blank for half a second, ethminer probably glitched too.

The dedicated mining rig uses the Intel HD graphics for the desktop, but I never use the miner for anything else, and ethminer keeps ticking until I do periodic maintenance on the rig.

@aleqx

aleqx commented Jan 25, 2018

@jean-m-cyr you have my appreciation and gratitude for your work on this. I also like that we have similar backgrounds (I also worked in FPGA and ASIC design). I have been using ethminer for quite a while, since before 0.12, and I tried a lot of 0.13 dev code along the way (even buggy builds). I have access to hundreds of GTX 1070 cards hosted in a temperature-controlled (very low temp) environment. Mining Ethereum is a bad choice in terms of profitability for the GTX 1070, but I like that I can keep the cards cool and the fans running at low speed (longer life), since the GPU itself is mostly doing memory transfers and is hardly stressed; that's also why reducing TDP or using zero overclocking on the GPU doesn't affect ETH hashrates.

I have mixed feelings about the changes you guys made in 0.13. My problem is that I've been testing so much 0.13 dev code over the past 3-4 weeks that I can't tell anymore whether 0.13 is better, mostly because 0.13 crashes more than 0.12 and I had to tone down my memory overclocking to get it stable. Sadly, I now get slightly lower hashrates than before, though I need to do more testing (and I now have little time for it).

ethminer.org does report slightly more stable hashrates (it's not drastically better), but it also reports a higher rate of stale shares that I didn't have with 0.12. I used to have 2% stales. Now I get 3%, knocking on 4%, grrr.

Also, is it just me, or does CUDA 9.1 + the 390.12 driver (Linux) perform better? You say the cards are more stressed with the new 0.13 code, but they seem to respond quicker when I query them with nvidia-smi while they are mining. With the 384.* drivers and CUDA 8, some cards would almost hang if you sent them an nvidia-smi query while they were hashing. I think the new CUDA 9.1 and driver improve this aspect. Not sure if anyone else had such issues (you get to see all sorts of things when you deal with hundreds of cards), but this aspect alone is why I may keep 0.13 with CUDA 9.1 and the 390.12 drivers.

I always disliked Claymore's miner. I saw many claims that its pool-reported hashrate is better than ethminer's, but that was absolutely never the case for me. The stale share rate in Claymore was also way higher in my case, up to 6% (from 2%). The miner's own reported hashrate is indeed quite a bit higher in Claymore, but that's misleading...

Finally, ka-chings are great, but a watchdog would actually be incredibly useful and would also give you more ka-chings: you wouldn't have to reboot or restart manually (or via scripts and watchdogs written externally). Lots of fatal errors reported by ethminer should actually be just warnings (not even a GPU restart is needed, e.g. Xid 31). I kept hoping I'd get some time to implement it myself and contribute it, but alas, it doesn't look like I'll be able to in the near future.

@jean-m-cyr
Contributor

jean-m-cyr commented Jan 25, 2018

@aleqx Lots to respond to...

I assume all of this is about the final, not the rc's. Many of the rc's had significant problems, and I'll take credit for most of that.

I really never ran release 12, so I've no basis for comparison. All I know is that when I came to this a month or so ago, r13 was underway and, looking at it purely from the CUDA perspective, glaring performance issues were evident. Based on what I was getting with the early 13 releases, the final is a big improvement in effective hash rate. I don't know what I would have gotten with r12. I've mostly used Claymore for comparison, since everyone seems to think it's some kind of magical gold standard!

I didn't notice any difference switching over to 9.1 and 390.12 drivers, but I was pretty busy sorting other things at the time.

I'm hearing so much demand for this watchdog thing that I've actually given it a little thought! But since I'm not seeing any of these restarts, and I don't want to push my cards until I do, I'm not sure how I'd go about testing an implementation. Not much incentive... all is running smoothly on my tiny miner.

@AndreaLanfranchi
Collaborator

I go with the defaults and --cuda-streams 1 on my 1060s.

Thank you @jean-m-cyr
I must say that with cuda-streams set to 1, I see much smaller dips in hashrate when several different jobs get pushed from the pool.

On one test rig with 6 x GTX 1050 Ti, which averages 86.5 Mh/s in total, I used to see the hashrate dip to 84, 83, or even 81 for a few seconds when multiple jobs were received.
With streams=1 it never goes below 86.1.

I wonder why the default value for streams is not 1.

@fastaprilia

Thank you @jean-m-cyr
I must say that with cuda-streams set to 1, I see much smaller dips in hashrate when several different jobs get pushed from the pool.

Ditto. Might even be able to turn the clocks up and tinker with the other parameters a little more. Thank you.

@jean-m-cyr
Contributor

jean-m-cyr commented Jan 25, 2018

Wonder why streams is not set as default value to 1

I wonder why it's even an option. It makes no sense for an app like a miner to use any value other than 1. So much urban legend around this thing... more is not always better!

@aleqx

aleqx commented Jan 25, 2018

Regarding the watchdog: push your memory overclocking and watch your kernel log (dmesg or /var/log/kern.log) for Xid messages reported by NVRM. Here's an example of Xid 31. Note that it also reports the PCI bus id of the affected GPU:

Jan 25 18:15:26 node13 kernel: [382330.121233] NVRM: Xid (PCI:0000:05:00): 31, Ch 00000013, engmask 00000101, intr 10000000

On windows, I don't know (event viewer I guess).

Here's some of my knowledge acquired through blood and pain ... lots of pain.

  • Most of the errors you get at high OC should be Xid 31, which is rather harmless. You would be fine if the code just skipped to the next share (no driver restart should be needed). Right now I just relaunch ethminer, so time is wasted on reconnecting (which can take ages in ethminer because of these bugs: Failed to resolve server for a long time #259 and "Could not resolve host" - even when using IP addresses, even on same LAN #624) and reloading the DAG, etc.
  • You may get Xid 32 and Xid 8. These usually leave the card in an incorrect state and a driver restart is needed (nvidia-smi -i gpuID -r or rmmod nvidia* on Linux suffices); a machine reboot is not needed. On Linux, bloody Nvidia doesn't provide cmd-line tools that allow restarting a single GPU if any other GPU is in use (hell, they don't even allow you to reset GPU0 when all GPUs are idle and no GPU app is running, not even X). Some mining software does restart all GPUs via CUDA when an error is encountered. I think ccminer (on GitHub - tpruvot's or klaust's) can do so, but I haven't used it much so can't confirm. I use a few closed-source ones that do recover from Xid 8 and Xid 32 without exiting (e.g. bminer).
    NOTE: Sometimes it MAY look like relaunching ethminer works; after the relaunch ethminer will show a hashrate, and you may even get a few shares accepted (usually they are rejected), but after a while ethminer starts reporting insane hashrates, e.g.:
    m 16:31:57|miner.1 Speed 134.52 Mh/s gpu/0 134.52 [A310+1:R3+0:F0] Time: 11:14
    (that's for a single GTX 1070) and all submitted shares are then rejected. A driver restart is required to get the card out of this state.
  • You may also get Xid 13 and Xid 43 (they usually come in pairs: first a bunch of 13s, then one 43). Similar story to Xid 8 and 32 above. Sometimes it's fully recoverable by a simple relaunch of ethminer; other times a driver restart is needed. Distinguishing the two cases may be possible from the kernel error report (e.g. Channel ID 00000009 intr 00020000, which are not always the same).
  • You may also get Xid 61 and Xid 62. These are very weird. When these happen, ethminer doesn't always detect an error and continues to hash happily. However, listen to this: the hashrate is reduced (sometimes slightly, sometimes by as much as 40%) and as soon as you quit ethminer, the card becomes unusable and hangs all other cards! You can restart the affected GPUs manually via nvidia-smi -i ${gpuid} -r without a machine reboot.
  • The deadliest is Xid 79 (GPU fell off the bus), when the card truly goes away and the driver can't see it anymore. A system restart is the only way to bring it back. Sometimes (rarely) it comes back still dead, and you have to actually cycle the power to the rig... talk about the chip getting into unpredictable states, eh?

In summary: most Xids are recoverable either directly or via a driver restart, which should not require exiting the miner (might even allow you to keep the DAG).
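The recovery matrix above can be condensed into a small log-classifying helper. This is a hypothetical sketch (the names and the Xid-to-action mapping come from the list above, not from any official tool); feed it lines from dmesg or /var/log/kern.log:

```python
import re

# Per the list above: 31 is harmless; 8/32 need a driver restart (13/43
# sometimes do too); 61/62 need a per-GPU reset; 79 needs a reboot.
XID_ACTIONS = {
    31: "ignore (skip to next share)",
    8: "driver restart",
    32: "driver restart",
    13: "relaunch miner, else driver restart",
    43: "relaunch miner, else driver restart",
    61: "reset GPU via nvidia-smi -r",
    62: "reset GPU via nvidia-smi -r",
    79: "reboot (GPU fell off the bus)",
}

XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:]+)\): (\d+)")

def classify(line: str):
    """Return (pci_bus_id, xid, suggested_action), or None for non-Xid lines."""
    m = XID_RE.search(line)
    if not m:
        return None
    bus, xid = m.group(1), int(m.group(2))
    return bus, xid, XID_ACTIONS.get(xid, "unknown Xid: investigate")
```

A real watchdog would tail the log, map the PCI bus id back to a CUDA device index, and trigger the suggested recovery instead of just reporting it.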

@jean-m-cyr
Contributor

@aleqx Good stuff. Filed for future reference. Thank you.

@aleqx

aleqx commented Jan 25, 2018

Xid errors reference from Nvidia: http://docs.nvidia.com/deploy/xid-errors/index.html

@aleqx

aleqx commented Jan 26, 2018

One further comment: I definitely get a lower hashrate with --cuda-streams 1 than with the default --cuda-streams 2... this is the ethminer-reported hashrate (not yet tested against the pool, but I don't expect it to differ). About 1 MH/s lower, in fact (which is about 3% in my case). Increasing streams beyond 2 doesn't improve the hashrate.

@aleqx

aleqx commented Jan 26, 2018

Also, changing grid size or block size makes no difference in hashrate.

But I've been meaning to ask @jean-m-cyr, especially given the new changes he made: would increasing cuda-streams and/or cuda-block-size and/or cuda-grid-size and/or cuda-parallel-hash put less stress on the GPU (less context switching, offloading, etc.), or does it make no difference? It may affect overclocking potential.

Put differently, if changing either of those made no difference to hashrate, how would you change each of them to achieve the least stress on the gpu and memory?

@jean-m-cyr
Contributor

jean-m-cyr commented Jan 26, 2018

@aleqx All of these tuning parameters are mostly relevant for gamers, where the amount of data pushed back and forth between host and GPU is often high, and where the diversity of thread functions is also high. We don't have that in mining. We have a single thread type that runs a single calculation, where the only things that go back and forth are the job header hash, once per new job, and a few bytes each time a solution is found. That's why we get away with using 1x PCIe.

cuda-streams are meant to let the developer break up work where contentious PCIe access is a problem. It isn't a problem for hashing, and using more than one cuda-stream only means that we have to stop and restart more streams instead of just one. This can only be done sequentially, so it lengthens the switch time.

Nvidia is not fond of mining; they know where their bread and butter is, HPC and gaming, so you'll find all CUDA features targeted at and optimized for those environments. I'm not sure that any of these parameters will lower GPU memory stress. A hash calculation takes a fixed number of calculations and a fixed number of accesses to the DAG memory. It's hard to imagine how you'd get around that.

Again, GPU's hash at a fixed rate. The only thing that affects the measured hash count is how long you stall the GPU to switch jobs (discounting any power and thermal throttling). The shorter it takes, the closer you get to the GPU's actual hash rate, the more power you burn, etc...

There is always the possibility of improving the GPU's hash rate and power efficiency through CUDA code improvements, but none of that has happened recently.
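The "fixed hash rate minus stall time" argument above can be put into a toy formula. The numbers below are purely illustrative, not measurements from any rig:

```python
def effective_mh(raw_mh: float, stall_ms_per_switch: float,
                 switches_per_min: float) -> float:
    """Measured hashrate after subtracting the time the GPU sits idle
    while jobs are switched (ignores power/thermal throttling)."""
    stalled_s = stall_ms_per_switch * switches_per_min / 1000.0
    return raw_mh * (60.0 - stalled_s) / 60.0

# e.g. a 30 MH/s card stalling 100 ms on each of 6 job switches per minute
print(effective_mh(30.0, 100.0, 6.0))  # ~29.7 MH/s, i.e. ~1% lost to switching
```

Halving the stall per switch recovers half of that gap, which is the shape of improvement the 0.13 job-switching changes were after.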

@aleqx

aleqx commented Jan 26, 2018

But the hashrate is definitely higher with --cuda-streams 2 instead of --cuda-streams 1... you should try it. Also, increasing --cuda-parallel-hash from 4 to 8, or lowering it from 4 to 2, will decrease the hashrate, but any other value (3..7) seems not to affect the hashrate.

I'm not yet familiar with GPU architectures or programming, but why would ethminer provide all those --cuda-* options if (according to you) they do nothing? Don't they affect the switching time at all?

Wouldn't cuda-parallel-hash 3 result in less stress (3 instead of 4 parallel hashes being computed)?

EDIT: thanks for the explanations. Very educational. It's great to have you contributing to this project.

@jean-m-cyr
Contributor

@aleqx Actually, your GPUs are doing 1000s of hashes in parallel.

I get pretty much the same hash rate with =1, =2, and =4. Hard to say exactly, when the averaged difference is less than 0.1%. What I do see is an increase in the standard deviation of the hash rate with higher values.

Can you quantify your claim a little? I'm not denying it, I just need more specific data to better understand.

@kronem

kronem commented Jan 26, 2018

I switched to 0.13.0 24 hours ago and it has been stable on two rigs with no issues. The hash rate and ETH earned have been the highest per GPU since Jan 14th. I am running a total of 9 Nvidia 1070's, with a 287.64 Mh/s average hash rate and 0.00350 mined per card.

Fingers crossed that on a reboot I don't have any issues. FYI, I didn't change any settings in the startup batch file.

@aleqx

aleqx commented Jan 26, 2018

Can you quantify your claim a little?

Sure, I did so earlier in this thread, #596 (comment), where I said I lose ~1 MH/s out of 32 MH/s if I use cuda-streams 1 instead of cuda-streams 2, on a GTX 1070. That's quite a bit.

Wrt cuda-parallel-hash, I was talking about the description given in --help: "Define how many hashes to calculate in a kernel, can be scaled to achieve better performance. Default=4"... For this one, values between 3 and 7 give the same hashrate, but 1, 2, or 8 give a lower hashrate. I was curious whether 3 instead of 4 puts less stress on the GPU.

@aleqx

aleqx commented Feb 1, 2018

I added more (useful) info to my driver errors post above: #596 (comment)... hopefully someone can code a proper watchdog inside ethminer.

@DeadManWalkingTO
Contributor

After #757 (which added the --exit parameter to exit whenever an error occurs), you can use a watchdog.

Here is my ETHminerWatchDogDmW Windows7/8/10 [32/64] & Linux (Any Dist/Any Ver/Any Arch) (#735).

Try it with the latest ethminer version and give feedback, please.
Thank you!
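With --exit in place, an external supervisor can be as small as a restart loop. A minimal sketch, assuming a hypothetical ethminer command line (substitute your own binary path, pool URL, and flags):

```python
import subprocess
import time

# Hypothetical invocation; replace the pool URL and options with your own.
ETHMINER_CMD = ["ethminer", "--exit", "-U",
                "-P", "stratum+tcp://0xWALLET@eu1.ethermine.org:4444"]

def supervise(cmd, pause_s=5.0, max_runs=None):
    """Run cmd and relaunch it whenever it exits; with --exit, ethminer
    terminates on errors instead of hanging, so exiting means 'restart me'."""
    runs, code = 0, None
    while max_runs is None or runs < max_runs:
        code = subprocess.call(cmd)   # blocks until the miner exits
        runs += 1
        time.sleep(pause_s)           # brief pause before relaunching
    return code

# supervise(ETHMINER_CMD)  # runs until killed
```

This only papers over crashes; for Xid 61/62/79-style failures described earlier in the thread, a driver reset or reboot step would still be needed between relaunches.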
