Test for and fix (*vm).Interrupt data race #14

nilium · 2017-02-14T17:31:19Z

This patch adds a test for a data race under (*vm).Interrupt, where
an interrupt flag (boolean) is set to true inside of a lock but read
outside of a lock. Since the lock is only taken at the end of (*vm).run
to clear the interrupt value and un-set the interrupt flag, it doesn't
make sense to lock for every check on the interrupt flag.

I assume there's a reason for not using a channel here (even though
it'd be simpler), but the most minimal change to be made right now is
to replace the bool with a uint32 (or some other integer) and
read/write it via atomics.
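
For comparison, a channel-based check could look roughly like the sketch below. This is only an illustration of the alternative mentioned above, not code from goja: `pollInterrupt` and `runSteps` are hypothetical names, and the loop body stands in for executing one VM instruction.

```go
package main

import "fmt"

// pollInterrupt reports whether an interrupt value is pending on ch.
// The default case makes the receive non-blocking. (Hypothetical helper.)
func pollInterrupt(ch <-chan interface{}) (interface{}, bool) {
	select {
	case v := <-ch:
		return v, true
	default:
		return nil, false
	}
}

// runSteps executes up to max placeholder steps, checking for an interrupt
// before each one. It returns the interrupt value (nil if none arrived)
// and the number of steps completed.
func runSteps(ch <-chan interface{}, max int) (interface{}, int) {
	for i := 0; i < max; i++ {
		if v, ok := pollInterrupt(ch); ok {
			return v, i
		}
		// A real VM would execute one instruction here.
	}
	return nil, max
}

func main() {
	ch := make(chan interface{}, 1) // buffered so the sender never blocks
	ch <- "stop"
	v, steps := runSteps(ch, 1000)
	fmt.Println(v, steps)
}
```

The channel carries the interrupt value itself, so no separate flag or lock is needed, at the cost of a select on every step.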

The patch adds an (*vm).interrupted method to check the interrupt flag
atomically and stores it only via atomics. The type change to uint32
also ensures that there are no other uses of the interrupt bool field
(since a uint32 isn't ever treated as a bool and vice-versa). The lock
remains in place to serialize setting and clearing the interrupt value.
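
A minimal sketch of the pattern described above, assuming a simplified stand-in for the vm type. The `machine` type, its field names, and the `run` loop are hypothetical and only model the interrupt handling, not goja's actual code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// machine is a hypothetical stand-in for goja's vm type; only the
// interrupt-related state is modelled.
type machine struct {
	interrupt     uint32      // 0 = clear, non-zero = interrupted; accessed via atomics only
	interruptLock sync.Mutex  // serializes setting and clearing interruptVal
	interruptVal  interface{} // only touched while holding interruptLock
}

// Interrupt stores the value under the lock and raises the flag atomically.
func (m *machine) Interrupt(v interface{}) {
	m.interruptLock.Lock()
	m.interruptVal = v
	atomic.StoreUint32(&m.interrupt, 1)
	m.interruptLock.Unlock()
}

// interrupted reports whether the flag is set, without taking the lock.
func (m *machine) interrupted() bool {
	return atomic.LoadUint32(&m.interrupt) != 0
}

// run spins until interrupted, then clears the flag and the value under
// the lock. It returns the interrupt value and the number of steps taken.
func (m *machine) run() (interface{}, int) {
	steps := 0
	for !m.interrupted() {
		steps++ // a real VM would execute one instruction here
	}
	m.interruptLock.Lock()
	v := m.interruptVal
	m.interruptVal = nil
	atomic.StoreUint32(&m.interrupt, 0)
	m.interruptLock.Unlock()
	return v, steps
}

func main() {
	m := &machine{}
	go func() {
		time.Sleep(10 * time.Millisecond)
		m.Interrupt("stop")
	}()
	v, _ := m.run()
	fmt.Println("interrupted with:", v)
}
```

Because the flag is a uint32, any leftover code that treats it as a bool fails to compile, which is the safety net mentioned above.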

To test the race, check out the first commit (73d6439) and run
go test -race -run TestInterruptRace -- this should be enough to
trigger it.

The race occurs due to a read on vm.interrupt without locking. The write locks it, but it appears the lock is only used to set/clear the interrupt value.

The test is a bit of a tricky thing since it may not detect it 100% of the time, but so far it hasn't failed to spot the race. The race detector spits out the following:

==================
WARNING: DATA RACE
Write at 0x00c420081841 by goroutine 7:
  github.com/dop251/goja.(*vm).Interrupt()
      /go/src/github.com/dop251/goja/vm.go:299 +0x96
  github.com/dop251/goja.(*Runtime).Interrupt()
      /go/src/github.com/dop251/goja/runtime.go:781 +0x68
  github.com/dop251/goja.testInterruptRace.func3()
      /go/src/github.com/dop251/goja/interrupt_test.go:23 +0x99

Previous read at 0x00c420081841 by goroutine 6:
  [failed to restore the stack]

Goroutine 7 (running) created at:
  github.com/dop251/goja.testInterruptRace()
      /go/src/github.com/dop251/goja/interrupt_test.go:24 +0x204
  github.com/dop251/goja.TestInterruptRace()
      /go/src/github.com/dop251/goja/interrupt_test.go:41 +0x44
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:610 +0xc9

Goroutine 6 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:646 +0x52f
  testing.RunTests.func1()
      /usr/local/go/src/testing/testing.go:793 +0xb9
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:610 +0xc9
  testing.RunTests()
      /usr/local/go/src/testing/testing.go:799 +0x4ba
  testing.(*M).Run()
      /usr/local/go/src/testing/testing.go:743 +0x12f
  main.main()
      github.com/dop251/goja/_test/_testmain.go:544 +0x1b8
==================

The interrupt bool is written while holding a lock but read without one, resulting in a data race. This change replaces the bool with a uint32 and adds an accessor method, interrupted(), that returns whether an interrupt has been set (the interrupt value itself is unaffected, since it's only ever read or written after acquiring the lock). The interrupt is now signalled by flipping the interrupt integer to a non-zero value. This can only be done via atomic methods -- all reads and writes to that integer must go through the atomic package.

dop251 · 2017-02-14T21:48:28Z

The race condition here (as well as not using a channel) is quite deliberate. I may be missing something, but here is my rationale: in the worst case scenario (i.e. when the race condition occurs) we may read false when it's actually true. So what? We interrupt one step later. Is it a problem? I don't think so. On the other hand this check sits in the most critical spot of the VM. Replacing it with a function call, introducing locking, or using a channel would impact the performance quite significantly.

seebs · 2017-02-14T22:51:57Z

not 100% sure, but so far as i know, race conditions are allowed to do anything. not restricted to "things which make any kind of sense or would be otherwise possible". So, you could read the wrong value, sure. Or you could get a runtime panic. Or you could read a value which is neither true nor false. Or just about anything else. The compiler is allowed to generate code which will explode spectacularly in such cases; it's not required that the result will definitely be a successful read of either false or true.

nilium · 2017-02-14T23:00:58Z

I have trouble understanding the merits of a deliberate data race. It is, under no circumstance, acceptable to have an interrupt with a data race -- it's something that must behave predictably. More generally, it isn't acceptable to have a data race at all -- intended or not, this invokes undefined behavior.

Also, you say it'll affect performance significantly -- and there's no data to back that up just yet -- but "faster than a data race" shouldn't be a requirement for fixing a data race. The data race should be fixed, one way or another, and then someone can invest time in recouping any lost performance if it's considered a problem. I'd argue that performance gains should be easier to make elsewhere than around an interrupt check.

So, in this case, I'd urge you to prioritize correctness over performance for this. If you need performance with the interrupt still and it somehow happens that goja is optimized to the point that only the interrupt can be improved, please investigate options that don't involve data races.

dop251 · 2017-02-14T23:08:00Z

I'm sorry @nilium but what do you mean by "correctness" in this particular case? Correct behaviour would be to interrupt the VM at some point after Interrupt() is called which is exactly what happens (even in your test case). Note that the race condition is only on the interrupt flag, not on the value.

If, as @seebs suggests, it is possible to get a panic I'll fix it of course. So far I could not find any reference to indicate it's the case.

acln0 · 2017-02-15T00:04:04Z

In the context of a data race, the program's behavior becomes undefined. It is not reasonable to speak of correct behavior when the behavior is not defined in the first place. A program invoking undefined behavior can behave in any way at all; any (possible) guarantee of correctness has been lost. For a taste of the effects of such "benign" data races, I recommend reading this article.

Yes, in practice, the VM will probably be interrupted at some point after calling Interrupt(), and the program will probably not crash, but that is not the critical point here. To quote from the aforementioned article:

The bottom line. Just say No to “benign” races. Even if you still think that that particular data race is 100% safe (which I doubt), it’s still formally incorrect, fragile during code maintenance and produces noise under race detection tools.

As for the question of performance, my intuition is that the cost of the reflect-heavy code in the VM completely dwarfs the cost of an atomic.LoadUint32 on every loop iteration. On this matter, however, you shouldn't believe a word I say, but benchmark and profile the code yourself.
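
One way to get a first number is a micro-benchmark along the lines of the sketch below. It compares a plain bool read with atomic.LoadUint32 in an isolated loop, so it says nothing about goja's VM as a whole; `benchPlain`, `benchAtomic`, and `measure` are hypothetical names, and the plain read is race-free here only because nothing writes the flag concurrently.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"testing"
)

var (
	plainFlag  bool   // read directly; nothing writes it concurrently in this sketch
	atomicFlag uint32 // read through sync/atomic
	sink       int    // receives the loop result so the work is not discarded
)

// benchPlain measures a loop that checks a plain bool on every iteration.
func benchPlain(b *testing.B) {
	n := 0
	for i := 0; i < b.N; i++ {
		if !plainFlag {
			n++
		}
	}
	sink = n
}

// benchAtomic measures the same loop with an atomic load of a uint32.
func benchAtomic(b *testing.B) {
	n := 0
	for i := 0; i < b.N; i++ {
		if atomic.LoadUint32(&atomicFlag) == 0 {
			n++
		}
	}
	sink = n
}

// measure runs both benchmarks outside of "go test" via testing.Benchmark.
func measure() (plain, atom testing.BenchmarkResult) {
	return testing.Benchmark(benchPlain), testing.Benchmark(benchAtomic)
}

func main() {
	plain, atom := measure()
	fmt.Println("plain bool read:  ", plain)
	fmt.Println("atomic.LoadUint32:", atom)
}
```

The numbers depend heavily on the CPU and compiler version, so running goja's own benchmarks before and after the change is still the more meaningful comparison.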

nilium · 2017-02-15T00:06:25Z

@dop251 If I run a program with the race detector on and goja triggers a series of error messages in the output, that should be enough to justify fixing it. It should also be alarming, not something to be content with (much less deliberate, which is questionable at best). That means the program is not correct because it's -- as it suggests -- a race.

If nothing else, what seebs is referring to is that anything that isn't covered by the Go memory model is effectively undefined and may result in anything from incorrect behavior to reindeers to panics (this is because it's a data race -- Go can't define what happens in those scenarios). Unless you can amend the race detector and spec, goja is incorrect. And, per the doc:

If you must read the rest of this document to understand the behavior of your program, you are being too clever.

Don't be clever.

Anyhow, I ran the benchmarks included in goja to see if this would have any impact. I can see no significant difference in the results on my laptop outside of the fibonacci benchmark, which I assume is broken since it's only managed one run. The empty loop may have suffered the most, with a difference of about 300ns (or about 0.3 microseconds), which isn't significant enough for me to say the interrupt fix has "impact[ed] performance quite significantly." The first two columns are iteration counts and the last two are times per operation; within each pair, the master column refers to master as it is now and the patched column to the run with a correct interrupt:

                      |     master |        patched |              master |            patched
----------------------+------------+----------------+---------------------+-------------------
BenchmarkCompile      |        500 |            500 |       2739791 ns/op |      2752266 ns/op
BenchmarkGoReflectGet |    1000000 |         500000 |          2572 ns/op |         2570 ns/op
BenchmarkPut          |   30000000 |       30000000 |          46.2 ns/op |         49.0 ns/op
BenchmarkPutStr       |   30000000 |       30000000 |          39.9 ns/op |         41.8 ns/op
BenchmarkGet          |  100000000 |      100000000 |          16.2 ns/op |         16.2 ns/op
BenchmarkGetStr       |  100000000 |      100000000 |          12.5 ns/op |         10.9 ns/op
BenchmarkToString1    | 2000000000 |     2000000000 |          0.63 ns/op |         0.65 ns/op
BenchmarkToString2    |  500000000 |     1000000000 |          3.20 ns/op |         3.06 ns/op
BenchmarkConv         | 2000000000 |     2000000000 |          0.32 ns/op |         0.33 ns/op
BenchmarkArrayGetStr  |   50000000 |       50000000 |          23.9 ns/op |         27.8 ns/op
BenchmarkArrayGet     |   20000000 |       20000000 |          66.4 ns/op |         72.6 ns/op
BenchmarkArrayPut     |  100000000 |      100000000 |          11.3 ns/op |         11.9 ns/op
BenchmarkToUTF8String | 1000000000 |      500000000 |          2.88 ns/op |         3.26 ns/op
BenchmarkAdd          |   50000000 |       50000000 |          32.5 ns/op |         32.5 ns/op
BenchmarkAddString    |    5000000 |        5000000 |           260 ns/op |          259 ns/op
BenchmarkVmNOP2       |  200000000 |      200000000 |          6.51 ns/op |         6.65 ns/op
BenchmarkVmNOP1       |   50000000 |       50000000 |          37.9 ns/op |         37.8 ns/op
BenchmarkVmNOP        |  100000000 |      100000000 |          11.8 ns/op |         16.9 ns/op
BenchmarkVm1          |   30000000 |       30000000 |          50.3 ns/op |         56.8 ns/op
BenchmarkFib          |          1 |              1 |    6836786055 ns/op |   7250626136 ns/op
BenchmarkEmptyLoop    |     100000 |         100000 |         16830 ns/op |        17151 ns/op
BenchmarkVMAdd        |  100000000 |      100000000 |          22.6 ns/op |         22.5 ns/op

So, again, I urge you to prioritize fixing the data race instead of trying to rationalize it as benign. I'm happy to trade 300ns for sane code, but I would never be happy to gain 300ns at the cost of a data race. In Go, there are no benign data races, and goja having one in its interrupt code is not acceptable when it should be safe to interrupt a VM.

dop251 · 2017-02-15T09:43:46Z

The fibonacci benchmark is not broken, @nilium, it just takes that long to compute and is the most representative test to illustrate the performance impact.

The only real danger here is that the compiler optimises the check away because it can prove the flag can't change in the current thread. When that happens:

1. I will be very happy for the go community (myself included).
2. TestInterrupt will fail making the problem immediately obvious.

Until such time, even the go memory model document you're referring to contains the following code:

var a, b int

func f() {
	a = 1
	b = 2
}

func g() {
	print(b)
	print(a)
}

func main() {
	go f()
	g()
}

without any indication that anything bad can happen (other than it occasionally printing 2 then 0).

To sum this up, I will generally reject any PR that has a clear negative impact on performance (regardless of how small) unless it fixes a real problem. If this policy upsets or offends anyone, I'm sorry.

seebs · 2017-02-15T13:50:03Z

Wow.

Just. Wow.

This is thoroughly into the category of "I am now absolutely terrified to run any code with your name on it." Undefined behavior means undefined behavior, dude. If you won't listen when several experienced developers point out that something is a giant glaring red flag and absolutely unambiguously an error of a kind which is dangerous, and you'll chase performance at the cost of correctness? Just. Wow.

kirillDanshin · 2017-02-15T14:04:12Z

@seebs @dop251 @nilium @AndreiCalin
we can fix it without negative impact on performance.

if you want to discuss it, write me on telegram (@kirilldanshin) or create an issue.

chriswessels · 2017-02-15T17:18:51Z

@dop251 Please consider merging the PR. The performance impact is absolutely minimal and correct, clean code should always be chosen over undefined behaviour.

kirillDanshin · 2017-02-15T17:32:59Z

@chriswessels we are talking about alternative implementation in #16. check it out.

nilium added 2 commits February 14, 2017 09:13

dop251 closed this Feb 15, 2017

nilium mentioned this pull request Feb 15, 2017

(*vm).Interrupt causes a data race #16

Closed

nilium mentioned this pull request Feb 23, 2017

Test for and fix (*vm).Interrupt data race #18

Closed

zupa-hu mentioned this pull request Jun 18, 2019

Runtime.Interrupt() never succeeds #97

Closed

tsedgwick added a commit to tsedgwick/goja that referenced this pull request Feb 24, 2021

REALMC-7948: support int64 types (dop251#14)

50329e5

tsedgwick added a commit to tsedgwick/goja that referenced this pull request Feb 24, 2021

REALMC-7948: support int64 types (dop251#14)

8a40ee0

tsedgwick added a commit to tsedgwick/goja that referenced this pull request May 5, 2021

REALMC-7948: support int64 types (dop251#14)

ec934f2
