How to Deal with Intermittent Bugs
The intermittent bug is a cousin of the 50-foot-invisible-scorpion-from-outer-space kind of bug. This nightmare occurs so rarely that it is hard to observe, yet often enough that it can't be ignored. You can't debug because you can't find it.
Although after 8 hours you will start to doubt it, the intermittent bug has to obey the same laws of logic everything else does. What makes it hard is that it occurs only under unknown conditions. Try to record the circumstances under which the bug does occur, so that you can guess what the variability really is. The condition may be related to data values, such as ‘This only happens when we enter Wyoming as a value.’ If that is not the source of variability, the next suspect should be improperly synchronized concurrency.
Try, try, try to reproduce the bug in a controlled way. If you can't reproduce it, set a trap for it by building a logging system, a special one if you have to, that can log what you guess you need when it really does occur. Resign yourself to that if the bug only occurs in production and not at your whim, this may be a long process. The hints that you get from the log may not provide the solution but may give you enough information to improve the logging. The improved logging system may take a long time to be put into production. Then, you have to wait for the bug to reoccur to get more information. This cycle can go on for some time.
The stupidest intermittent bug I ever created was in a multi-threaded implementation of a functional programming language for a class project. I had very carefully ensured correct concurrent evaluation of the functional program, good utilization of all the CPUs available (eight, in this case). I simply forgot to synchronize the garbage collector. The system could run a long time, often finishing whatever task I began, before anything noticeable went wrong. I'm ashamed to admit I had begun to question the hardware before my mistake dawned on me.
At work we recently had an intermittent bug that took us several weeks to find. We have multi-threaded application servers in Java™ behind Apache™ web servers. To maintain fast page turns, we do all I/O in small set of four separate threads that are different than the page-turning threads. Every once in a while these would apparently get ‘stuck’ and cease doing anything useful, so far as our logging allowed us to tell, for hours. Since we had four threads, this was not in itself a giant problem - unless all four got stuck. Then the queues emptied by these threads would quickly fill up all available memory and crash our server. It took us about a week to figure this much out, and we still didn't know what caused it, when it would happen, or even what the threads where doing when they got ‘stuck’.
This illustrates some risk associated with third-party software. We were using a licensed piece of code that removed HTML tags from text. Although we had the source code (thank goodness!) we had not studied it carefully until by turning up the logging on our servers we finally realized that the email threads were getting stuck in this problematic licensed code.
The program performed well except on some long and unusual kinds of texts. On these texts, the code was quadratic or worse. This means that the processing time was proportional to the square of the length of the text. Had these texts occurred commonly, we would have found the bug right away. If they had never occurred at all, we would never have had a problem. As it happens, it took us weeks to finally understand and resolve the problem.