
Segfault in LuaTable::set #18

Closed
jonas-schievink opened this issue Jul 21, 2017 · 26 comments

Comments


jonas-schievink commented Jul 21, 2017

After a long debugging session, I've noticed that cargo test --release fails with a segfault on the current nightly.

This might also be a compiler regression; I'm not sure yet.

@jonas-schievink jonas-schievink changed the title Undefined behaviour in LuaTable::set Segfault in LuaTable::set Jul 21, 2017

jonas-schievink commented Jul 21, 2017

Valgrind stacktrace (take this with a grain of salt, it's optimized after all):

==11084== Invalid read of size 8
==11084==    at 0x14F0E6: lua_touserdata (in /home/jonas/dev/rlua/target/release/rlua-c35ad98123863f87)
==11084==    by 0x1256DA: rlua::util::error_guard::call_impl (in /home/jonas/dev/rlua/target/release/rlua-c35ad98123863f87)
==11084==    by 0x601C9CF: ???
==11084==    by 0x162641: luaD_callnoyield (in /home/jonas/dev/rlua/target/release/rlua-c35ad98123863f87)
==11084==    by 0x160351: luaD_rawrunprotected (in /home/jonas/dev/rlua/target/release/rlua-c35ad98123863f87)
==11084==    by 0x162DDE: luaD_pcall (in /home/jonas/dev/rlua/target/release/rlua-c35ad98123863f87)
==11084==    by 0x152354: lua_pcallk (in /home/jonas/dev/rlua/target/release/rlua-c35ad98123863f87)
==11084==    by 0x127CA2: rlua::lua::LuaTable::set (in /home/jonas/dev/rlua/target/release/rlua-c35ad98123863f87)
==11084==    by 0x144779: rlua::tests::test_error (in /home/jonas/dev/rlua/target/release/rlua-c35ad98123863f87)

The frame `==11084== by 0x601C9CF: ???` looks pretty suspicious.

Instruction at fault:

```
Dump of assembler code for function lua_touserdata:
   0x000055555559b0e0 <+0>:	sub    rsp,0x8
   0x000055555559b0e4 <+4>:	test   esi,esi
=> 0x000055555559b0e6 <+6>:	mov    r8,QWORD PTR [rdi+0x20]
```

`rdi` is the first argument, i.e. the pointer to the Lua state, and it holds 0x40 (64), which is obviously not a valid pointer.

I suspect that luaD_callnoyield calls a bad function pointer, causing this mess.


kyren commented Jul 21, 2017

Hmmm, that's concerning. It's ONLY on the current nightly compiler? Do you have a reproduction I could look at?


jonas-schievink commented Jul 21, 2017

It also happens on at least yesterday's nightly, but I have neither the bandwidth nor the disk space to bisect. The reproduction is just running cargo test --release (I'm referring to rlua's own tests here). It might not reproduce on other platforms, though.

EDIT: I'm on x64 Linux, for reference

@jonas-schievink

The easiest way to cause this:

```rust
let lua = Lua::new();
lua.globals().set("bla", LuaNil).unwrap();
```

It happens with any value, not just LuaNil.


kyren commented Jul 23, 2017

It doesn't seem to be failing anymore, but it might be because I have "fixed" #19. I say "fixed" because I'm half convinced nothing will ever be fixed again, and that the Lua API is designed to drive me insane.

@jonas-schievink

@kyren I can still reproduce this


kyren commented Jul 23, 2017

It's not failing on Travis; maybe I was incorrectly assuming Travis is x86_64 Linux? Let me actually boot up a VM and try it.


kyren commented Jul 23, 2017

I can't reproduce this at all; all tests pass on my Linux x86_64 VM on the latest nightly.

I realize that's not super helpful.

@jonas-schievink

Odd. Meanwhile, I've also reproduced this on ARM (also Linux). My exact rustc version, for the record:

```
rustc 1.20.0-nightly (ae98ebfcb 2017-07-20)
binary: rustc
commit-hash: ae98ebfcb9ad5a5384fd229a6ee91315b02ca969
commit-date: 2017-07-20
host: x86_64-unknown-linux-gnu
release: 1.20.0-nightly
LLVM version: 4.0
```

Also reproduces on rustc 1.20.0-nightly (8d22af87d 2017-07-22) (the current one).

Maybe it's a difference between C compilers that's exposing UB? I'm using gcc version 7.1.1 20170630, but I can reproduce with clang 4.0.1, too.


kyren commented Jul 23, 2017

I should be more clear, I said:

> It doesn't seem to be failing anymore

But what I should have said is I actually have not seen this fail at all so far.


kyren commented Jul 23, 2017

I have tried gcc 6.3, but I can try 7.1.1. I've also tried whatever the latest Apple clang is, which I guess is Apple LLVM version 8.1.0 (clang-802.0.42).


kyren commented Jul 23, 2017

Actually, can you just tell me what OS you're using? That might be faster than trying to set up a specific compiler version.

@jonas-schievink

Oh, right. Arch Linux.

I'm currently bisecting, by the way, so no need to start that if you manage to reproduce.

@jonas-schievink

| Result | rustup version | rustc -Vv |
| --- | --- | --- |
| good | nightly-2017-07-14 | rustc 1.20.0-nightly (b2c070787 2017-07-13) |
| bad | nightly-2017-07-15 | rustc 1.20.0-nightly (6d9d82d3d 2017-07-14) |

@jonas-schievink


kyren commented Jul 23, 2017

Nice sleuthing!


kyren commented Jul 23, 2017

So, you possibly won't like this answer, but I have now installed Arch Linux x86_64 in a VM and run the rlua tests, and they all pass. I'm not sure where to go from here :(


@jonas-schievink

Wow, thanks for putting all the effort into this bug!

The only difference is that I'm running a slightly different kernel now; I'll check whether that makes a difference, although it really shouldn't affect anything. Just to make sure: you did run with --release, right?

I'll look into more ways to properly debug/reproduce this, then.


kyren commented Jul 23, 2017

> The only difference is that I'm running a slightly different kernel now, I'll check out and see if that makes a difference although it shouldn't affect anything really. Just to make sure: You did run with --release, right?

AHA, thank you, I was missing that part. NOW I can reproduce it.

I had completely forgotten that's not the default mode of test, because I don't usually run cargo directly; I have a Makefile with a test target, and THAT sets --release. Sorry for all that!

@jonas-schievink

Ah, okay, so I'm not going insane yet. Good :)


kyren commented Jul 23, 2017

Forget everything I said before, it's reproducible without a VM, you just have to ACTUALLY run in release.

PEBKAC


kyren commented Jul 23, 2017

Okay, for real this time, I think this is fixed by 44c99ea.

I'm sure error_guard was subtly wrong somehow, maybe with regard to mutable aliasing? I'm not sure, because I abandoned the approach as too difficult to do correctly.


kyren commented Jul 23, 2017

So, really sorry about not seeing the bug at first, you were absolutely correct that there was a problem.

Now that I have a bit of time to breathe, are we possibly back to the "only" problem left being that errors in __gc metamethods are unsound? Are all the rest of these fun soundness issues solved?

I have no idea how to even approach dealing with errors in __gc metamethods, other than simply writing a LOT of protected versions of ffi calls. I don't think that's really a workable solution; it would be awkward to write as well as massively slow at runtime. SURELY there must be a better way than that?

Is there some way possibly of forcing errors in __gc to simply always panic? That's precisely what would happen when called from the top level of Lua, but in any sort of callback there's always going to be a protected Lua call somewhere up the stack that has called setjmp and will handle it (which is precisely what we don't want).

If I even had an idea of how to approach it, I would feel much better; right now I really don't know how to proceed.

@jonas-schievink

Great, that seems to fix it for me as well!

Apart from what you mentioned, there's also #21, but it's unclear if my assumptions about setjmp/longjmp always being UB are correct.

> Is there some way possibly of forcing errors in __gc to simply always panic?

Since there's no way except pcall to find out that this happens, no.

Well, technically... the m annotation in the Lua reference manual means that a function may raise both out-of-memory errors and errors from __gc metamethods. OOM can be handled by providing our own allocator that simply aborts on OOM, like Rust already does by default; in fact, the default allocator might already do this. As for __gc, it might be possible to disable the GC to prevent those metamethods from being called, but we'd need some way to re-enable it at the right moments so we don't cause a leak.
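(Editor's note: a minimal sketch of the abort-on-OOM allocator idea, using Lua's lua_Alloc-style contract: nsize == 0 frees, otherwise (re)allocate, and Lua passes the old block size as osize. The function name and details are hypothetical, not rlua's actual code.)

```rust
use std::alloc::{self, Layout};
use std::os::raw::c_void;
use std::process;
use std::ptr;

// Hypothetical allocator following the lua_Alloc contract:
// aborts the process on allocation failure instead of returning
// NULL, so Lua never gets a chance to raise a memory error.
unsafe extern "C" fn abort_on_oom_alloc(
    _ud: *mut c_void,
    ptr: *mut c_void,
    osize: usize, // old block size (Lua guarantees this when ptr is non-null)
    nsize: usize, // requested new size; 0 means "free"
) -> *mut c_void {
    if nsize == 0 {
        if !ptr.is_null() {
            let layout = unsafe { Layout::from_size_align_unchecked(osize, 1) };
            unsafe { alloc::dealloc(ptr as *mut u8, layout) };
        }
        return ptr::null_mut();
    }
    let new_ptr = if ptr.is_null() {
        let layout = unsafe { Layout::from_size_align_unchecked(nsize, 1) };
        unsafe { alloc::alloc(layout) }
    } else {
        let layout = unsafe { Layout::from_size_align_unchecked(osize, 1) };
        unsafe { alloc::realloc(ptr as *mut u8, layout, nsize) }
    };
    if new_ptr.is_null() {
        // Out of memory: abort, like Rust's default allocator does,
        // instead of letting Lua longjmp through Rust frames.
        process::abort();
    }
    new_ptr as *mut c_void
}
```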


kyren commented Jul 23, 2017

Yeah, that was actually on my list of things to do: use the default allocator as the Lua allocator and do exactly that, panic on failure to allocate. I should also have mentioned the longjmp-from-Rust problem as one of the remaining issues.

I thought about disabling gc during callbacks, but I had another thought while I was in the shower.

It could be possible to wrap Lua function calls much more cheaply by using thread-local storage to pass the arguments and return values of protected versions of calls. If that works, I could maybe use macros to protected-call-wrap things much more cheaply (in both runtime and programmer time). That, combined with stack_guard, would mean just making a very large number of protected calls in most API entry points.

MAYBE that would be okay?
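(Editor's note: a self-contained sketch of the thread-local trick described above, not rlua's actual code. The point is that a protected entry point like lua_pcall can only invoke a bare function pointer with no arguments, so arguments and results have to travel through TLS; `protected_call` here is a hypothetical stand-in for the C-side protected call.)

```rust
use std::cell::RefCell;

thread_local! {
    // One slot per thread carrying (argument, result) across the
    // protected-call boundary.
    static PCALL_SLOT: RefCell<Option<(i64, i64)>> = RefCell::new(None);
}

// Stand-in for a protected call: like lua_pcall, it only accepts a
// plain fn() with no captures. A real version would run the body
// under setjmp/pcall so Lua errors can't unwind through Rust.
fn protected_call(f: fn()) {
    f();
}

// Wrapped operation: stash the argument in TLS, run the body under
// the protected call, then read the result back out of TLS.
fn protected_add_one(x: i64) -> i64 {
    PCALL_SLOT.with(|slot| *slot.borrow_mut() = Some((x, 0)));
    protected_call(|| {
        PCALL_SLOT.with(|slot| {
            let mut s = slot.borrow_mut();
            let (arg, _) = (*s).unwrap();
            *s = Some((arg, arg + 1)); // body reads arg, writes result via TLS
        });
    });
    PCALL_SLOT.with(|slot| slot.borrow_mut().take().unwrap().1)
}
```

A macro could generate this boilerplate per wrapped ffi call, which is roughly the "cheap in programmer time" part of the idea.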

@jonas-schievink

Possibly. I'm not sure how exactly that would look, though. Anyway, closing this issue since it was fixed!
