Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test for and fix (*vm).Interrupt data race #14

Closed
wants to merge 2 commits into from

Conversation

nilium
Copy link

@nilium nilium commented Feb 14, 2017

This patch adds a test for a data race under (*vm).Interrupt, where
an interrupt flag (boolean) is set to true inside of a lock but read
outside of a lock. Since the lock is only taken at the end of (*vm).run
to clear the interrupt value and un-set the interrupt flag, it doesn't
make sense to lock for every check on the interrupt flag.

I assume there's a reason for not using a channel here (even though
it'd be simpler), but the most minimal change to be made right now is
to replace the bool with a uint32 (or some other integer) and
read/write it via atomics.

The patch adds an (*vm).interrupted method to check the interrupt flag
atomically and stores it only via atomics. The type change to uint32
also ensures that there are no other uses of the interrupt bool field
(since a uint32 isn't ever treated as a bool and vice-versa). The lock
remains in place to serialize setting and clearing the interrupt value.

To test the race, checkout the first commit (73d6439) and run
go test -race -run TestInterruptRace -- this should be enough to
trigger it.

The race occurs due to a read on vm.interrupt without locking. The
write locks it, but it appears the lock is only used to set/clear the
interrupt value.

The test is a bit of a tricky thing since it may not detect it 100% of
the time, but so far it hasn't failed to spot the race.

The race detector spits out the following:

        ==================
        WARNING: DATA RACE
        Write at 0x00c420081841 by goroutine 7:
          github.com/dop251/goja.(*vm).Interrupt()
              /go/src/github.com/dop251/goja/vm.go:299 +0x96
          github.com/dop251/goja.(*Runtime).Interrupt()
              /go/src/github.com/dop251/goja/runtime.go:781 +0x68
          github.com/dop251/goja.testInterruptRace.func3()
              /go/src/github.com/dop251/goja/interrupt_test.go:23 +0x99

        Previous read at 0x00c420081841 by goroutine 6:
          [failed to restore the stack]

        Goroutine 7 (running) created at:
          github.com/dop251/goja.testInterruptRace()
              /go/src/github.com/dop251/goja/interrupt_test.go:24 +0x204
          github.com/dop251/goja.TestInterruptRace()
              /go/src/github.com/dop251/goja/interrupt_test.go:41 +0x44
          testing.tRunner()
              /usr/local/go/src/testing/testing.go:610 +0xc9

        Goroutine 6 (running) created at:
          testing.(*T).Run()
              /usr/local/go/src/testing/testing.go:646 +0x52f
          testing.RunTests.func1()
              /usr/local/go/src/testing/testing.go:793 +0xb9
          testing.tRunner()
              /usr/local/go/src/testing/testing.go:610 +0xc9
          testing.RunTests()
              /usr/local/go/src/testing/testing.go:799 +0x4ba
          testing.(*M).Run()
              /usr/local/go/src/testing/testing.go:743 +0x12f
          main.main()
              github.com/dop251/goja/_test/_testmain.go:544 +0x1b8
        ==================
The interrupt bool is checked both inside and outside of a lock during
writes, resulting in a data race. This change replaces the bool with
a uint32 and adds an accessor method, interrupted(), that returns
whether an interrupt has been set (value irrelevant to this since it's
only ever read/written after acquiring a lock).

The interrupt is now signalled by flipping the interrupt integer to
a non-zero value. This can only be done via atomic methods -- all reads
and writes to that integer must go through the atomic package.
@dop251
Copy link
Owner

dop251 commented Feb 14, 2017

The race condition here (as well as not using a channel) is quite deliberate. I may be missing something, but here is my rationale: in the worst case scenario (i.e. when the race condition occurs) we may read false when it's actually true. So what? We interrupt one step later. Is it a problem? I don't think so. On the other hand this check sits in the most critical spot of the VM. Replacing it with a function call, introducing locking, or using a channel would impact the performance quite significantly.

@seebs
Copy link

seebs commented Feb 14, 2017

not 100% sure, but so far as i know, race conditions are allowed to do anything. not restricted to "things which make any kind of sense or would be otherwise possible". So, you could read the wrong value, sure. Or you could get a runtime panic. Or you could read a value which is neither true nor false. Or just about anything else. The compiler is allowed to generate code which will explode spectacularly in such cases; it's not required that the result will definitely be a successful read of either false or true.

@nilium
Copy link
Author

nilium commented Feb 14, 2017

I have trouble understanding the merits of a deliberate data race. It is, under no circumstance, acceptable to have an interrupt with a data race -- it's something that must behave predictably. More generally, it isn't acceptable to have a data race at all -- intended or not, this invokes undefined behavior.

Also, although you say it'll affect performance significantly -- and there's no data to back that up just yet -- but "faster than a data race" shouldn't be a requirement for fixing a data race. The data race should be fixed, one way or another, and then someone can invest time in recouping any lost performance if it's considered a problem. I'd argue that performance gains should be easier to make elsewhere than around an interrupt check.

So, in this case, I'd urge you to prioritize correctness over performance for this. If you need performance with the interrupt still and it somehow happens that goja is optimized to the point that only the interrupt can be improved, please investigate options that don't involve data races.

@dop251
Copy link
Owner

dop251 commented Feb 14, 2017

I'm sorry @nilium but what do you mean by "correctness" in this particular case? Correct behaviour would be to interrupt the VM at some point after Interrupt() is called which is exactly what happens (even in your test case). Note that the race condition is only on the interrupt flag, not on the value.

If, as @seebs suggests, it is possible to get a panic I'll fix it of course. So far I could not find any reference to indicate it's the case.

@acln0
Copy link

acln0 commented Feb 15, 2017

In the context of a data race, the program's behavior becomes undefined. It is not reasonable to speak of correct behavior when the behavior is not defined in the first place. A program invoking undefined behavior can behave in any way at all; any (possible) guarantee of correctness has been lost. For a taste of the effects of such "benign" data races, I recommend reading this article.

Yes, in practice, the VM will probably be interrupted at some point after calling Interrupt(), and the program will probably not crash, but that is not the critical point here. To quote from the aforementioned article:

The bottom line. Just say No to “benign” races. Even if you still think that that particular data race is 100% safe (which I doubt), it’s still formally incorrect, fragile during code maintenance and produces noise under race detection tools.

As for the question of performance, my intuition is that the cost of the reflect-heavy code in the VM completely dwarfs the cost of an atomic.LoadUint32 on every loop iteration. On this matter, however, you shouldn't believe a word I say, but benchmark and profile the code yourself.

@nilium
Copy link
Author

nilium commented Feb 15, 2017

@dop251 If I run a program with the race detector on and goja triggers a series of error messages in the output, that should be enough to justify fixing it. It should also be alarming, not something to be content with (much less deliberate, which is questionable at best). That means the program is not correct because it's -- as it suggests -- a race.

If nothing else, what seebs is referring to is that anything that isn't covered by the Go memory model, is effectively undefined and may result in anything from incorrect behavior to reindeers to panics (this is because it's a data race -- Go can't define what happens in those scenarios). Unless you can amend the race detector and spec, goja is incorrect. And, per the doc:

If you must read the rest of this document to understand the behavior of your program, you are being too clever.

Don't be clever.

Anyhow, I ran the benchmarks included in goja to see if this would have any impact. I can see no significant difference in the results on my laptop outside of the fibonacci benchmark, which I assume is broken since it's only managed one run. The empty loop may have suffered the most, with a difference of about 300ns (or about 0.3 microseconds), which isn't significant enough for me to say the interrupt fix has "impact[ed] performance quite significantly." The master columns refer to master as it is now, the patched columns refer to the benchmark times with a correct interrupt:

                      |     master |        patched |              master |            patched
----------------------+------------+----------------+---------------------+-------------------
BenchmarkCompile      |        500 |            500 |       2739791 ns/op |      2752266 ns/op
BenchmarkGoReflectGet |    1000000 |         500000 |          2572 ns/op |         2570 ns/op
BenchmarkPut          |   30000000 |       30000000 |          46.2 ns/op |         49.0 ns/op
BenchmarkPutStr       |   30000000 |       30000000 |          39.9 ns/op |         41.8 ns/op
BenchmarkGet          |  100000000 |      100000000 |          16.2 ns/op |         16.2 ns/op
BenchmarkGetStr       |  100000000 |      100000000 |          12.5 ns/op |         10.9 ns/op
BenchmarkToString1    | 2000000000 |     2000000000 |          0.63 ns/op |         0.65 ns/op
BenchmarkToString2    |  500000000 |     1000000000 |          3.20 ns/op |         3.06 ns/op
BenchmarkConv         | 2000000000 |     2000000000 |          0.32 ns/op |         0.33 ns/op
BenchmarkArrayGetStr  |   50000000 |       50000000 |          23.9 ns/op |         27.8 ns/op
BenchmarkArrayGet     |   20000000 |       20000000 |          66.4 ns/op |         72.6 ns/op
BenchmarkArrayPut     |  100000000 |      100000000 |          11.3 ns/op |         11.9 ns/op
BenchmarkToUTF8String | 1000000000 |      500000000 |          2.88 ns/op |         3.26 ns/op
BenchmarkAdd          |   50000000 |       50000000 |          32.5 ns/op |         32.5 ns/op
BenchmarkAddString    |    5000000 |        5000000 |           260 ns/op |          259 ns/op
BenchmarkVmNOP2       |  200000000 |      200000000 |          6.51 ns/op |         6.65 ns/op
BenchmarkVmNOP1       |   50000000 |       50000000 |          37.9 ns/op |         37.8 ns/op
BenchmarkVmNOP        |  100000000 |      100000000 |          11.8 ns/op |         16.9 ns/op
BenchmarkVm1          |   30000000 |       30000000 |          50.3 ns/op |         56.8 ns/op
BenchmarkFib          |          1 |              1 |    6836786055 ns/op |   7250626136 ns/op
BenchmarkEmptyLoop    |     100000 |         100000 |         16830 ns/op |        17151 ns/op
BenchmarkVMAdd        |  100000000 |      100000000 |          22.6 ns/op |         22.5 ns/op

So, again, I urge you to prioritize fixing the data race instead of trying to rationalize it as benign. I'm happy to trade 300ns for sane code, but I would never be happy to gain 300ns at the cost of a data race. In go, there are no benign data races, and goja having one in its interrupt code is not acceptable when it should be safe to interrupt a VM.

@dop251
Copy link
Owner

dop251 commented Feb 15, 2017

The fibonacci benchmark is not broken, @nilium, it just takes that long to compute and is the most representative test to illustrate the performance impact.

The only real danger here is that the compiler optimises the check away because it can prove the flag can't change in the current thread. When that happens:

  1. I will be very happy for the go community (myself included).
  2. TestInterrupt will fail making the problem immediately obvious.

Until such time, even the go memory model document you're referring to contains the following code:

var a, b int

func f() {
	a = 1
	b = 2
}

func g() {
	print(b)
	print(a)
}

func main() {
	go f()
	g()
}

without any indication that anything bad can happen (other than it occasionally printing 2 then 0).

To sum this up, I will generally reject any PR that has a clear negative impact on performance (regardless of how small) unless it fixes a real problem. If this policy upsets or offends anyone, I'm sorry.

@dop251 dop251 closed this Feb 15, 2017
@seebs
Copy link

seebs commented Feb 15, 2017

Wow.

Just. Wow.

This is thoroughly into the category of "I am now absolutely terrified to run any code with your name on it." Undefined behavior means undefined behavior, dude. If you won't listen when several experienced developers point out that something is a giant glaring red flag and absolutely unambiguously an error of a kind which is dangerous, and you'll chase performance at the cost of correctness? Just. Wow.

@kirillDanshin
Copy link

@seebs @dop251 @nilium @AndreiCalin
we can fix it without negative impact on performance.

if you want to discuss it, write me on telegram (@kirilldanshin) or create an issue.

@chriswessels
Copy link

@dop251 Please consider merging the PR. The performance impact is absolutely minimal and correct, clean code should always be chosen over undefined behaviour.

@kirillDanshin
Copy link

@chriswessels we are talking about alternative implementation in #16. check it out.

tsedgwick added a commit to tsedgwick/goja that referenced this pull request Feb 24, 2021
tsedgwick added a commit to tsedgwick/goja that referenced this pull request Feb 24, 2021
tsedgwick added a commit to tsedgwick/goja that referenced this pull request May 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants