Overview

This is a quick proof of concept demonstrating best practices for high-performance LuaJIT C bindings.

It applies some of the approaches documented in Niklas Frykholm's presentation at the Lua Workshop 2015:

Exposing "objects" as lightuserdata, with a type tag in the struct to detect when an incorrect type is passed
Using a shared metatable for all lightuserdata to support operator overloading
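The type-tag check in the first approach can be sketched in plain C as follows. This is only an illustration under stated assumptions: the tag values, the struct layouts and the helper names checkVector2D and setVector2D are hypothetical and not taken from the repository, and the assert stands in for the Lua error a real binding would raise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag values; the repository's actual constants are not shown. */
enum { TAG_VECTOR2D = 0x56324431, TAG_APPLICATION = 0x4150504C };

/* Every object exposed as lightuserdata starts with its type tag. */
typedef struct Vector2D {
    uint32_t typeTag;
    int x, y;
} Vector2D;

typedef struct Application {
    uint32_t typeTag;
    Vector2D origin;
} Application;

/* Returns p as a Vector2D if its leading tag matches, NULL otherwise. */
static Vector2D *checkVector2D(void *p)
{
    if (p == NULL)
        return NULL;
    /* Both structs begin with a uint32_t, so reading the tag first is valid. */
    if (*(const uint32_t *)p != TAG_VECTOR2D)
        return NULL;
    return (Vector2D *)p;
}

/* Stand-in for a binding entry point. A real binding would report the
   mismatch to Lua (for example with luaL_error) instead of asserting. */
static void setVector2D(void *self, int x, int y)
{
    Vector2D *v = checkVector2D(self);
    assert(v != NULL && "expected a Vector2D");
    v->x = x;
    v->y = y;
}
```

Because a lightuserdata is just a raw pointer with no per-object metatable, the tag is the only thing that lets the C side tell a Vector2D from an Application before touching its fields.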

It is then extended to leverage the LuaJIT Allocation Sinking optimization documented at http://wiki.luajit.org/Allocation-Sinking-Optimization. This optimization gets rid of the allocation of ffi metatype objects when their use is local (temporary vector operations, for example).

Building

LuaJIT is retrieved from the luarocks git repository as a submodule, so make sure to update submodules after cloning: git submodule update --init --recursive

The build is done using CMake 3, selecting the Visual Studio 2015 64-bit generator.
To test with LuaJIT 2.1, ensure WITH_LUAJIT21 is ON in CMake.
If the build fails with an error related to readline, remove the readline library auto-detected by CMake (set the readline include/lib entries to a blank string in the CMake GUI).

Measure

See src/luajit_poc/bench.lua for the benchmark code.

Rough performance measured on an Intel Core i7-6700K CPU @ 4.00GHz running Windows 10 64-bit, compiled with Visual Studio 2015 with LuaJIT 2.1. Note that this is an ideal setup with no instruction cache pressure. These are simple measurements to get a rough idea of where we stand.

Set the x, y integer coordinates of a Vector2D struct to the current loop index.

Call to a C Lua binding associated with lightuserdata:

Vector2D.set * 100,000,000.00 in 2.231s = 44,822,793.46 operation/s [64bits]
Vector2D.set * 100,000,000.00 in 2.230s = 44,851,959.06 operation/s [32bits]

Caching Vector2D.set in a local variable (this improved performance with LuaJIT 2.0, but is no longer needed with LuaJIT 2.1):

Vector2D_set * 100,000,000.00 in 2.246s = 44,514,743.73 operation/s [64bits]
Vector2D_set * 100,000,000.00 in 2.075s = 48,197,507.80 operation/s [32bits]

Both of the above calls have to cross the C/Lua language barrier, and the C function has to read its parameters from the Lua stack. IMHO, this performance is very impressive.

Using a LuaJIT ffi metatype brings us to performance comparable to pure C++:

setLuaV2 (ffi struct) * 100,000,000.00 in 0.048s = 2,073,420,998.68 operation/s [64bits]
setLuaV2 (ffi struct) * 100,000,000.00 in 0.049s = 2,054,834,322.41 operation/s [32bits]

Create a new Vector2D initialized to (1, 1) and add it to a sum Vector2D on each loop iteration.

Call to a C Lua binding associated with lightuserdata, which goes through the C/Lua language barrier and calls new/delete:

Application.create/sum/DestroyVector2D * 10,000,000.00 in 1.310s = 7,631,665.20 operation/s [64bits]
Application.create/sum/DestroyVector2D * 10,000,000.00 in 1.491s = 6,707,879.24 operation/s [32bits]

Using a LuaJIT ffi metatype struct for Vector2D:

LuaV2Create/sum/Destroy (ffi struct) * 10,000,000.00 in 0.002s = 4,057,280,688.76 operation/s [64bits]
LuaV2Create/sum/Destroy (ffi struct) * 10,000,000.00 in 0.002s = 4,071,216,984.47 operation/s [32bits]

The LuaJIT Allocation Sinking optimization is clearly triggered and gets rid of the struct allocation. Performance is comparable to what you could get in fully inlined C++. Since this is a 4GHz processor, we're basically at 1 cycle per iteration.
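The cycles-per-iteration estimate can be reproduced with a couple of lines of arithmetic. The sketch below is not part of the benchmark code; the function names are made up for illustration, and it assumes the CPU runs at its nominal 4.00GHz for the whole measurement (turbo or throttling would shift the result).

```c
#include <assert.h>

/* Operations per second from an operation count and an elapsed time. */
static double opsPerSecond(double operations, double seconds)
{
    assert(seconds > 0.0);
    return operations / seconds;
}

/* Average CPU cycles spent per operation at a given clock frequency (Hz). */
static double cyclesPerOperation(double clockHz, double opsPerSec)
{
    assert(opsPerSec > 0.0);
    return clockHz / opsPerSec;
}
```

With the 64-bit figures above, cyclesPerOperation(4.0e9, 4057280688.76) gives just under 1 cycle per iteration for the ffi struct version, while cyclesPerOperation(4.0e9, 7631665.20) gives roughly 524 cycles per iteration for the lightuserdata binding that calls new/delete.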

Pass a Vector2D to the Application

Call to a C Lua binding, passing a Vector2D lightuserdata (nothing new, comparable to the previous Lua C binding performance):

Application_setOrigin * 100,000,000.00 in 1.879s = 53,219,384.30 operation/s [64bits]
Application_setOrigin * 100,000,000.00 in 1.469s = 68,081,362.24 operation/s [32bits]

Call a C function exported by a DLL, passing a pointer to an ffi Vector2D struct:

ffi_c_Application_setOrigin * 100,000,000.00 in 0.122s = 821,501,223.82 operation/s [64bits]
ffi_c_Application_setOrigin * 100,000,000.00 in 0.121s = 828,446,273.54 operation/s [32bits]

=> This shows a reduction of the C/Lua language barrier cost by an order of magnitude, making it comparable to the cost of a non-inlined function call in C++.

Call a C function pointer stored in a struct returned by another C function exported by a DLL, passing a pointer to an ffi Vector2D struct:

bench_ffi_setOrigin_via_TestApi * 100,000,000.00 in 0.120s = 832,537,968.19 operation/s [64bits]
bench_ffi_setOrigin_via_TestApi * 100,000,000.00 in 0.134s = 746,714,629.24 operation/s [32bits]

=> Fairly similar to directly calling a C function (delta within noise of measurement).

Trick to pass a function pointer to Lua without ffi.load()

I initially made a DLL containing C functions exposed to Lua via FFI, as documented in the tutorial. But what I really wanted was to pass the function pointer directly from my executable to the Lua FFI.

Below is the trick I found to expose a C function pointer to the LuaJIT FFI without having to use ffi.load() to load it from a dynamic library.

Short story: a classic Lua binding function returns the pointer as a number, which is then cast back to a pointer using ffi.cast().

Classic Lua binding function:

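int clua_Application_getTestApi( lua_State *L )
{
    // Hand the address of testApi to Lua as a plain number.
    uintptr_t ptr = (uintptr_t)(&testApi);
    // Lua numbers are doubles (53-bit mantissa). This should be safe because
    // current x86-64 user-space addresses fit in ~48 bits.
    lua_pushinteger( L, (lua_Integer)ptr );
    return 1; // one result left on the Lua stack
}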
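The claim in the comment above (an address survives the trip through a Lua number) can be checked in isolation. The following is a minimal sketch, not code from the repository: the helper names are hypothetical, and it uses uint64_t instead of uintptr_t so that the check behaves the same on 32-bit and 64-bit builds. It relies on the fact that an IEEE-754 double represents every integer up to 2^53 exactly.

```c
#include <assert.h>
#include <stdint.h>

/* Every integer up to 2^53 is exactly representable in an IEEE-754 double. */
#define DOUBLE_EXACT_LIMIT ((uint64_t)1 << 53)

/* True when the address can be stored in a double without losing bits. */
static int fitsInDouble(uint64_t address)
{
    return address <= DOUBLE_EXACT_LIMIT;
}

/* What pushing the address as a Lua number amounts to. */
static double addressToNumber(uint64_t address)
{
    assert(fitsInDouble(address));
    return (double)address;
}

/* What casting the number back to a pointer amounts to. */
static uint64_t numberToAddress(double number)
{
    return (uint64_t)number;
}
```

A 48-bit address is well below the 2^53 limit, so the round trip is lossless. A binding that wants to be defensive could call fitsInDouble() before pushing the value and fall back to another mechanism if it ever returns false.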

Lua code using FFI:

local ffi = require("ffi")
ffi.cdef( [[
    struct TestApi
    {
        int( *doPrint1 )( const char *what );
    };

    typedef struct TestApi *TestApiPtr;
]] )

local TestApiPtr = ffi.typeof("TestApiPtr")
local testApi2Num = Application.getTestApi()
local testApi2 = ffi.cast( TestApiPtr, testApi2Num )
testApi2.doPrint1( "via Application.getTestApi() cast" )
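For completeness, here is a sketch of what the executable side of this trick might look like, together with a C-only equivalent of the ffi.cast() step. The struct matches the ffi.cdef declaration above, but doPrint1Impl and callDoPrint1ViaNumber are hypothetical names introduced for illustration, and the way the repository actually defines testApi is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Must match the ffi.cdef declaration field for field. */
struct TestApi
{
    int( *doPrint1 )( const char *what );
};

/* Hypothetical implementation: prints the text and returns printf's result. */
static int doPrint1Impl( const char *what )
{
    return printf( "%s\n", what );
}

/* The table of function pointers whose address is handed to Lua. */
static struct TestApi testApi = { doPrint1Impl };

/* C-side equivalent of ffi.cast( TestApiPtr, number ) followed by a call. */
static int callDoPrint1ViaNumber( uintptr_t number, const char *what )
{
    struct TestApi *api = (struct TestApi *)number;
    assert( api != NULL && api->doPrint1 != NULL );
    return api->doPrint1( what );
}
```

The struct has to outlive every Lua reference to it, which is why the sketch uses a static instance: the FFI side only holds a raw address and does nothing to keep the object alive.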

Conclusion

The LuaJIT FFI is clearly very useful for optimizing away temporary allocations of vector objects.

The overhead of calling a C function through the LuaJIT FFI is ~10 times smaller than through a classic Lua C binding.

The custom allocator in the 32-bit build shows that the FFI struct is not allocated through the allocator (and is therefore likely kept on the stack).

Baptiste Lepilleur.
