Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix fatal Bug with coroutine.yield #431

Closed
wants to merge 1 commit into from
Closed

Fix fatal Bug with coroutine.yield #431

wants to merge 1 commit into from

Conversation

JakobDev
Copy link
Contributor

It is possible to froze all Computers in the world by running this code:

coroutine.yield = function() end
while true do
end

To only Way to break the freeze is to rejoin the world. With this fix, the Computer who run this code will crash after a few seconds and everything works normal.

@SquidDev
Copy link
Contributor

This is one of those bugs I've never been able to reproduce, so I'd appreciate someone else chiming in here:

Has anyone been able to diagnose the root cause of this issue? Whilst infinite loops are inevitably going to block other computers from running, there shouldn't ever be a time when they stop them executing forever.

On a side note, does this occur with #163? It fixed some issues with yield prevention, so it might also fix this.

@JakobDev
Copy link
Contributor Author

@SquidDev It does not occur with #163. It has there the same effect as my fix on LUA Side.

@dirthsj
Copy link
Contributor

dirthsj commented Aug 30, 2017

@SquidDev the infinite loop is not actually required to do this. Simplest reproduction steps:

>lua
lua>coroutine.yield = function() end
lua>_

Symptoms:

  • Cursor will continue to blink
  • No input will be accepted by the computer (ei typing or clicking has no effect)
  • Other computers in the world will not turn on when clicked
  • Other computers will not execute code or receive input
  • On other computers, if the cursor was blinking when this was executed, it will continue to blink
  • If you wait long enough, the computer this was executed on will error with something like:
    bios.lua:271: Too long without yielding ... this leads to the computer printing "Goodbye" and presumably attempting to shut off... which doesn't work, and the computer cursor continues to blink.
    image Other symptoms listed continue to occur.

Conclusion: The infinite loop has nothing to do with it. Something about replacing coroutine.yield with a blank function does.

@SquidDev
Copy link
Contributor

SquidDev commented Aug 30, 2017

Sadly this fix isn't going to fix everything - overriding os.pullEventRaw = function() end has the same effect. I haven't looked into this a lot, but here's a tiny bit more useful information:

One problem seems to be the lock on m_machine is never released, as mentioned here. Taking a selective thread dump backs this up full dump here.

java.lang.Thread.State: BLOCKED (on object monitor)
	at dan200.computercraft.core.computer.Computer$3.execute(Computer.java:871)
	- waiting to lock <0x00000000f7a4f7f8> (a dan200.computercraft.core.lua.LuaJLuaMachine)
	at dan200.computercraft.core.computer.ComputerThread$1$1.run(ComputerThread.java:109)
	at java.lang.Thread.run(Thread.java:745)

"Thread-250" #303 daemon prio=5 os_prio=0 tid=0x0000000019d74800 nid=0x2190 waiting for monitor entry [0x000000005e4ae000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:502)
	at org.luaj.vm2.LuaThread$State.lua_resume(Unknown Source)
	- locked <0x00000000f7a4fc08> (a org.luaj.vm2.LuaThread$State)
	at org.luaj.vm2.LuaThread.resume(Unknown Source)
	at org.luaj.vm2.lib.CoroutineLib.invoke(Unknown Source)
	at dan200.computercraft.core.lua.LuaJLuaMachine.handleEvent(LuaJLuaMachine.java:225)
	at dan200.computercraft.core.computer.Computer$3.execute(Computer.java:871)
	- locked <0x00000000f7a4f7f8> (a dan200.computercraft.core.lua.LuaJLuaMachine)
	at dan200.computercraft.core.computer.ComputerThread$1$1.run(ComputerThread.java:109)
	at java.lang.Thread.run(Thread.java:745)

Of course, the real issue is why the Lua thread isn't being interrupted correctly.

function coroutine.yield( sFilter )
return nativeyield( sFilter )
end

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why override coroutine.yield itself??? Is that not simply overhead without any benefit?

-- Install lua parts of the os api
function os.version()
return "CraftOS 1.8"
end

function os.pullEventRaw( sFilter )
return coroutine.yield( sFilter )
return nativeyield( sFilter )
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.pullEventRaw = nativeyield maybe

@@ -683,15 +688,15 @@ local nativeShutdown = os.shutdown
function os.shutdown()
nativeShutdown()
while true do
coroutine.yield()
nativeyield()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could consider setting a "neverResume" filter, or somesuch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally os.shutdown would just terminate/suspend the VM there, meaning we didn't need this code at all.

@SquidDev
Copy link
Contributor

SquidDev commented Aug 31, 2017

I've done a little bit more debugging, and come up with the following reproduction case:

local co = coroutine.create(function()
    while true do end
end)
while coroutine.status(co) ~= "dead" do
    coroutine.resume(co)
end

As we're not calling any CC methods and we never yield, we won't ever get a timeout error, instead we'll just hit the debug hook:

String hardAbortMessage = m_hardAbortMessage;
if( hardAbortMessage != null )
{
LuaThread.yield(LuaValue.NIL);
}

This'll cause our coroutine to yield into the parent loop, which just resumes the thread again - thus nothing is terminated. Changing the hook to execute every instruction by changing this line to be 1 seems to fix this particular instance, but may cause problems elsewhere. Of course, we could always use Cobalt instead :p.

It'd be nice if @dan200 could chime in on this, as I'm not entirely sure why the timeout code is structured the way it is.

@dan200
Copy link
Owner

dan200 commented Sep 9, 2017

@Wilma456 This is a bad fix and doesn't resolve the actual problem (computers that infinite loop and timeout aren't allowing other computers to resume afterwards). Please fix that instead.

@SquidDev The timeout code is 5 years old at this point, so forgive me for not remembering the details. It sounds like you're on the right track though.

@dan200 dan200 closed this Sep 9, 2017
@JakobDev JakobDev deleted the yieldfix branch September 10, 2017 10:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants