-
Notifications
You must be signed in to change notification settings - Fork 374
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test for and fix (*vm).Interrupt data race #14
Conversation
The race occurs due to a read on vm.interrupt without locking. The write locks it, but it appears the lock is only used to set/clear the interrupt value. The test is a bit of a tricky thing since it may not detect it 100% of the time, but so far it hasn't failed to spot the race. The race detector spits out the following: ================== WARNING: DATA RACE Write at 0x00c420081841 by goroutine 7: github.com/dop251/goja.(*vm).Interrupt() /go/src/github.com/dop251/goja/vm.go:299 +0x96 github.com/dop251/goja.(*Runtime).Interrupt() /go/src/github.com/dop251/goja/runtime.go:781 +0x68 github.com/dop251/goja.testInterruptRace.func3() /go/src/github.com/dop251/goja/interrupt_test.go:23 +0x99 Previous read at 0x00c420081841 by goroutine 6: [failed to restore the stack] Goroutine 7 (running) created at: github.com/dop251/goja.testInterruptRace() /go/src/github.com/dop251/goja/interrupt_test.go:24 +0x204 github.com/dop251/goja.TestInterruptRace() /go/src/github.com/dop251/goja/interrupt_test.go:41 +0x44 testing.tRunner() /usr/local/go/src/testing/testing.go:610 +0xc9 Goroutine 6 (running) created at: testing.(*T).Run() /usr/local/go/src/testing/testing.go:646 +0x52f testing.RunTests.func1() /usr/local/go/src/testing/testing.go:793 +0xb9 testing.tRunner() /usr/local/go/src/testing/testing.go:610 +0xc9 testing.RunTests() /usr/local/go/src/testing/testing.go:799 +0x4ba testing.(*M).Run() /usr/local/go/src/testing/testing.go:743 +0x12f main.main() github.com/dop251/goja/_test/_testmain.go:544 +0x1b8 ==================
The interrupt bool is checked both inside and outside of a lock during writes, resulting in a data race. This change replaces the bool with a uint32 and adds an accessor method, interrupted(), that returns whether an interrupt has been set (value irrelevant to this since it's only ever read/written after acquiring a lock). The interrupt is now signalled by flipping the interrupt integer to a non-zero value. This can only be done via atomic methods -- all reads and writes to that integer must go through the atomic package.
The race condition here (as well as not using a channel) is quite deliberate. I may be missing something, but here is my rationale: in the worst case scenario (i.e. when the race condition occurs) we may read false when it's actually true. So what? We interrupt one step later. Is it a problem? I don't think so. On the other hand this check sits in the most critical spot of the VM. Replacing it with a function call, introducing locking, or using a channel would impact the performance quite significantly. |
not 100% sure, but so far as i know, race conditions are allowed to do anything. not restricted to "things which make any kind of sense or would be otherwise possible". So, you could read the wrong value, sure. Or you could get a runtime panic. Or you could read a value which is neither true nor false. Or just about anything else. The compiler is allowed to generate code which will explode spectacularly in such cases; it's not required that the result will definitely be a successful read of either false or true. |
I have trouble understanding the merits of a deliberate data race. It is, under no circumstance, acceptable to have an interrupt with a data race -- it's something that must behave predictably. More generally, it isn't acceptable to have a data race at all -- intended or not, this invokes undefined behavior. Also, although you say it'll affect performance significantly -- and there's no data to back that up just yet -- but "faster than a data race" shouldn't be a requirement for fixing a data race. The data race should be fixed, one way or another, and then someone can invest time in recouping any lost performance if it's considered a problem. I'd argue that performance gains should be easier to make elsewhere than around an interrupt check. So, in this case, I'd urge you to prioritize correctness over performance for this. If you need performance with the interrupt still and it somehow happens that goja is optimized to the point that only the interrupt can be improved, please investigate options that don't involve data races. |
I'm sorry @nilium but what do you mean by "correctness" in this particular case? Correct behaviour would be to interrupt the VM at some point after Interrupt() is called which is exactly what happens (even in your test case). Note that the race condition is only on the interrupt flag, not on the value. If, as @seebs suggests, it is possible to get a panic I'll fix it of course. So far I could not find any reference to indicate it's the case. |
In the context of a data race, the program's behavior becomes undefined. It is not reasonable to speak of correct behavior when the behavior is not defined in the first place. A program invoking undefined behavior can behave in any way at all; any (possible) guarantee of correctness has been lost. For a taste of the effects of such "benign" data races, I recommend reading this article. Yes, in practice, the VM will probably be interrupted at some point after calling Interrupt(), and the program will probably not crash, but that is not the critical point here. To quote from the aforementioned article:
As for the question of performance, my intuition is that the cost of the reflect-heavy code in the VM completely dwarfs the cost of an |
@dop251 If I run a program with the race detector on and goja triggers a series of error messages in the output, that should be enough to justify fixing it. It should also be alarming, not something to be content with (much less deliberate, which is questionable at best). That means the program is not correct because it's -- as it suggests -- a race. If nothing else, what seebs is referring to is that anything that isn't covered by the Go memory model, is effectively undefined and may result in anything from incorrect behavior to reindeers to panics (this is because it's a data race -- Go can't define what happens in those scenarios). Unless you can amend the race detector and spec, goja is incorrect. And, per the doc:
Anyhow, I ran the benchmarks included in goja to see if this would have any impact. I can see no significant difference in the results on my laptop outside of the fibonacci benchmark, which I assume is broken since it's only managed one run. The empty loop may have suffered the most, with a difference of about 300ns (or about 0.3 microseconds), which isn't significant enough for me to say the interrupt fix has "impact[ed] performance quite significantly." The master columns refer to master as it is now, the patched columns refer to the benchmark times with a correct interrupt:
So, again, I urge you to prioritize fixing the data race instead of trying to rationalize it as benign. I'm happy to trade 300ns for sane code, but I would never be happy to gain 300ns at the cost of a data race. In go, there are no benign data races, and goja having one in its interrupt code is not acceptable when it should be safe to interrupt a VM. |
The fibonacci benchmark is not broken, @nilium, it just takes that long to compute and is the most representative test to illustrate the performance impact. The only real danger here is that the compiler optimises the check away because it can prove the flag can't change in the current thread. When that happens:
Until such time, even the go memory model document you're referring to contains the following code:
without any indication that anything bad can happen (other than it occasionally printing 2 then 0). To sum this up, I will generally reject any PR that has a clear negative impact on performance (regardless of how small) unless it fixes a real problem. If this policy upsets or offends anyone, I'm sorry. |
Wow. Just. Wow. This is thoroughly into the category of "I am now absolutely terrified to run any code with your name on it." Undefined behavior means undefined behavior, dude. If you won't listen when several experienced developers point out that something is a giant glaring red flag and absolutely unambiguously an error of a kind which is dangerous, and you'll chase performance at the cost of correctness? Just. Wow. |
@seebs @dop251 @nilium @AndreiCalin if you want to discuss it, write me on telegram ( |
@dop251 Please consider merging the PR. The performance impact is absolutely minimal and correct, clean code should always be chosen over undefined behaviour. |
@chriswessels we are talking about alternative implementation in #16. check it out. |
This patch adds a test for a data race under (*vm).Interrupt, where
an interrupt flag (boolean) is set to true inside of a lock but read
outside of a lock. Since the lock is only taken at the end of (*vm).run
to clear the interrupt value and un-set the interrupt flag, it doesn't
make sense to lock for every check on the interrupt flag.
I assume there's a reason for not using a channel here (even though
it'd be simpler), but the most minimal change to be made right now is
to replace the bool with a uint32 (or some other integer) and
read/write it via atomics.
The patch adds an (*vm).interrupted method to check the interrupt flag
atomically and stores it only via atomics. The type change to uint32
also ensures that there are no other uses of the interrupt bool field
(since a uint32 isn't ever treated as a bool and vice-versa). The lock
remains in place to serialize setting and clearing the interrupt value.
To test the race, checkout the first commit (73d6439) and run
go test -race -run TestInterruptRace
-- this should be enough totrigger it.