LFX Workspace: Improving the performance of running miniruby #1292

Closed
eat4toast opened this issue Mar 2, 2022 · 13 comments

Comments

@eat4toast
Contributor

eat4toast commented Mar 2, 2022

Motivation

Miniruby is a port of Ruby compiled to a Wasm file.
Using it as a workload, we hope to find the bottlenecks in both interpreter mode and AOT mode.
Fix the bottlenecks found with miniruby to achieve an overall improvement (not specific to miniruby).

Details

  • Run various test cases in miniruby to find the hot path.
  • Compare and record the test case results in both interpreter and AOT mode.

Appendix

miniruby proposal: https://bugs.ruby-lang.org/issues/18462

Output PR

#1587
#1382

@eat4toast
Contributor Author

Week 1

  • Understood the basic Ruby environment: a Ruby file runs on miniruby, which runs on WasmEdge. The time cost comes from initialization plus the actual run.
  • Took a first look at the WasmEdge codebase (call chain from the perf report):
VM::Async
  VM::runWasmFile
    VM::unsafeRunWasmFile (the unsafe prefix marks the unlocked variant: safe_func() { lock(); unsafe_func(); })
       VM::unsafeExecute
-----------
(lib/executor/executor.cpp) invoke ->
	(lib/executor/engine/engine.cpp) runFunction() -> enterFunction
		(lib/executor/engine/engine.cpp) execute() -> translates each instruction immediately
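
The safe/unsafe relation noted in the call chain can be sketched as follows. This is only an illustration of the pattern, not WasmEdge's actual code: the class Vm, its members, and the string-length "work" are hypothetical stand-ins. The only part taken from the note above is the shape: the public method takes a lock, then calls the unsafe-prefixed method.

```cpp
#include <cstddef>
#include <mutex>
#include <string>

// Hypothetical stand-in for a VM; this is NOT the real WasmEdge VM class.
class Vm {
public:
  // "Safe" entry point: take the lock, then delegate to the unsafe variant.
  std::size_t runWasmFile(const std::string &Path) {
    std::lock_guard<std::mutex> Lock(Mtx);
    return unsafeRunWasmFile(Path);
  }

  // Number of runs so far, read under the same lock.
  std::size_t runCount() {
    std::lock_guard<std::mutex> Lock(Mtx);
    return Runs;
  }

private:
  // "Unsafe" variant: assumes the caller already holds Mtx.
  std::size_t unsafeRunWasmFile(const std::string &Path) {
    ++Runs;
    // Placeholder for the real work (load, validate, instantiate, execute).
    return Path.size();
  }

  std::mutex Mtx;
  std::size_t Runs = 0;
};
```

Keeping the lock in the public wrapper lets internal code call the unsafe variants freely without re-locking, at the cost that every caller must know which variant it is allowed to use.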

Next week

  • Run experiments that increase the share of actual running time, since the Ruby environment adds an intermediate layer (miniruby).
  • Focus on the specific opcodes run in the execute function (it needs a more detailed analysis).

@eat4toast
Contributor Author

eat4toast commented Mar 20, 2022

Week 2

  • Wrote 5 Ruby programs: fib, gemm, a data-dependent array scan, unbounded recursion, and long-running console printing (I/O).
  • The perf flame graph shows a deep call chain in ValVariant (the parameter pack has to expand all the way to unsigned_int128).
    I modified the type order in ValVariant ( include/common/types.h: 80L ), but the improvement is really small.
  • Since Wasm is a stack machine, and as the WAT also shows, the majority of instructions are:
    pop/push --> get/set global/local Num
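
To make the push/pop observation concrete, here is a minimal sketch of a stack machine, assuming a hypothetical four-instruction set whose names only loosely mirror Wasm opcodes. It is not the WasmEdge interpreter. It counts operand-stack traffic so the cost of a statement like c = a + b can be seen directly.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical mini instruction set; not the real Wasm opcode enum.
enum class Op { LocalGet, LocalSet, I32Const, I32Add };

struct Instr {
  Op Code;
  std::int32_t Arg; // local index or constant; ignored by I32Add
};

struct Counters {
  std::size_t Pushes = 0;
  std::size_t Pops = 0;
};

// Runs Code against Locals on an operand stack and records push/pop traffic.
// Precondition: Code is well-formed (no pop on an empty stack, valid indices).
Counters run(const std::vector<Instr> &Code, std::vector<std::int32_t> &Locals) {
  std::vector<std::int32_t> Stack;
  Counters C;
  auto Push = [&](std::int32_t V) {
    Stack.push_back(V);
    ++C.Pushes;
  };
  auto Pop = [&]() {
    std::int32_t V = Stack.back();
    Stack.pop_back();
    ++C.Pops;
    return V;
  };
  for (const Instr &I : Code) {
    switch (I.Code) {
    case Op::LocalGet:
      Push(Locals[static_cast<std::size_t>(I.Arg)]);
      break;
    case Op::LocalSet:
      Locals[static_cast<std::size_t>(I.Arg)] = Pop();
      break;
    case Op::I32Const:
      Push(I.Arg);
      break;
    case Op::I32Add: {
      std::int32_t Rhs = Pop();
      std::int32_t Lhs = Pop();
      Push(Lhs + Rhs);
      break;
    }
    }
  }
  return C;
}
```

With the sequence local.get 0, local.get 1, i32.add, local.set 2, three of the four instructions are local accesses, and the single addition is surrounded by three pushes and three pops.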

Next week

  • Focus on the stackmgr source code ( include/runtime/stackmgr.h ) and try to dive into PR #1353
  • Do more profiling and try to figure out the WASI-related paths that miniruby calls.

@eat4toast
Contributor Author

Week 3

  • Ran more Ruby test files
  • Made some slight modifications, but the improvement is very small 😟
    1. in lib/executor/engine/engine.cpp: 1617L
      moved OpCode Code = PC->getOpCode(); into the if (Stat) block
    2. in lib/executor/engine/controlInstr.cpp: 62L and 67L
      added value = min(value, LabelTableSize) and removed the if block at 67L
    3. in lib/executor/engine/controlInstr.cpp: 48L
      the code jumps to branchToLabel immediately, so I moved line 39 to replace line 48.
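
The idea behind modification 2 can be sketched as below. This assumes the default label is stored as the last entry of the label list. That layout is an assumption for illustration, and both function names are hypothetical rather than taken from controlInstr.cpp. Variant A does an explicit range check; variant B clamps the index so that any out-of-range value lands on the default.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Variant A: explicit range check, then fall back to the default label.
std::uint32_t selectLabelBranchy(const std::vector<std::uint32_t> &Labels,
                                 std::uint32_t Default, std::uint32_t Index) {
  if (Index < Labels.size()) {
    return Labels[Index];
  }
  return Default;
}

// Variant B: the default label is stored as the last entry, so the index
// is clamped to the table size instead of being range-checked.
// Precondition: LabelsWithDefault is non-empty.
std::uint32_t
selectLabelClamped(const std::vector<std::uint32_t> &LabelsWithDefault,
                   std::uint32_t Index) {
  // Number of entries before the default label.
  const std::size_t TableSize = LabelsWithDefault.size() - 1;
  return LabelsWithDefault[std::min<std::size_t>(Index, TableSize)];
}
```

Both variants return the same label for every index. Whether the clamped form is faster depends on the compiler, since std::min may still be lowered to a compare plus a conditional move, which would be consistent with the small gain reported above.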

Next week

  • Figure out and solve the weird isInstructionCounting cost.
  • Run more ruby test files and find the hot path

@eat4toast
Contributor Author

eat4toast commented Apr 5, 2022

Week 4

  • Ran more Ruby test files. Unfortunately, parts of the Ruby standard library are compiled as shared libraries, and Wasm does not support loading shared libraries (which would also not be portable if the code ran in a browser), so some test opportunities were missed. (sad)
  • Solved last week's weird isInstructionCounting cost 🎉 See: [bug] fix Stat incorrect initialize value #1382

Next week

  • Figure out why loading sections takes more time
  • (Expected) Put the test case results into a table for visualization.

@eat4toast
Contributor Author

Week 5

  • Ran test cases in AOT mode and found the bottlenecks in:
  1. WasmEdge::Loader::Loader::loadSection (/include/loader/loader.h) -> load section
  2. Executor::ProxyHelper (lib/executor/engine/proxy.cc: 49L) -> jump instruction table

Next week

  • Focus on interpreter mode with the pure Ruby library.
    (AOT mode relies on the native library, which is hard to refactor.)

@eat4toast
Contributor Author

Week 6

  • Ran test cases in interpreter mode; the table of major opcodes (time equals count):

opcode = runAndOp time = 1
opcode = runNeOp time = 1
opcode = runLtOp time = 1
opcode = runEqOp time = 1
opcode = runAddOp time = 2
opcode = runBrOp time = 3
opcode = runBrTableOp time = 3
opcode = runEqzOp time = 3
opcode = runLocalSetOp time = 3
opcode = runGlobalSetOp time = 4
opcode = runReturnOp time = 4
opcode = runLocalTeeOp time = 4
opcode = runCallIndirectOp time = 4
opcode = runIfElseOp time = 6
opcode = runLoadOp time = 7
opcode = runLocalGetOp time = 7
opcode = runGlobalGetOp time = 7
opcode = runCallOp time = 7
opcode = runBrIfOp time = 7
opcode = runStoreOp time = 7
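
The issue does not say how these counts were collected. As one possible approach, a per-opcode counter could be kept in the dispatch loop and dumped at exit. The sketch below assumes a hypothetical OpcodeHistogram helper (not part of WasmEdge) that records handler names and returns them sorted by ascending count, matching the ordering of the table above.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper; not part of WasmEdge.
class OpcodeHistogram {
public:
  // Call once per executed instruction with the handler name.
  void record(const std::string &Name) { ++Counts[Name]; }

  // Returns (name, count) pairs sorted by ascending count.
  // Ties keep the alphabetical order that std::map iteration provides.
  std::vector<std::pair<std::string, std::uint64_t>> sorted() const {
    std::vector<std::pair<std::string, std::uint64_t>> Out(Counts.begin(),
                                                           Counts.end());
    std::stable_sort(Out.begin(), Out.end(),
                     [](const auto &A, const auto &B) {
                       return A.second < B.second;
                     });
    return Out;
  }

private:
  std::map<std::string, std::uint64_t> Counts;
};
```

A string-keyed map is convenient for a one-off measurement, but it adds noticeable overhead per instruction; an array indexed by opcode value would perturb the timings less.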
  

Next week

  • Start testing other aspects (memory usage, cache miss position, branch prediction penalty).
  • Run a complete opcode time count in interpreter mode with the latest commit.

@eat4toast
Contributor Author

Week 7

  • Ran test cases focusing on branch misses (branch_miss) and cache misses (cache_miss)

interpreter mode:

interpreter_mode_branchmiss:

  1. runCallOp lib/executor/engine/controlInstr.cpp: 83L

  2. AST::Instruction::getNum include/ast/instruction.h: 181L

  3. Executor::instantiateModule -> instantiate -> Instance::ModuleInstance::addFunc

    lib/executor/instantiate/function.cpp: 22L


interpreter_mode_cachemiss:

  1. runStoreOp include/executor/engine/memory.ipp: 41L

  2. stackMgr::push include/runtime/stackmgr.h: 68L

  3. Executor::instantiateModule->Executor::instantiate (duplicate)

    lib/executor/instantiate/function.cpp: 22L


AOT clock:

Loader::parseModule:

  1. loadSectionContent: include/loader/loader.h:131L

  2. FileMgr::readBytes: lib/loader/filemgr.cpp:415L

AOT branch miss && cache miss:

VM destructor and (linux)native library

Next week

  • Try to take a glance at the Loader code base
  • Find the context in which Instance::ModuleInstance::addFunc is called

@eat4toast
Contributor Author

Week 8

Did only a few things due to May Day (a 5-day holiday where I live)

This week

  • Try to take a glance at the Loader code base (focus on the whole structure, since native lib is hard to modify)
  • Find the context in which Instance::ModuleInstance::addFunc is called (insert code in each branch to measure which one is taken more frequently)

@eat4toast
Contributor Author

Week 9

  • Linked with jemalloc to benchmark: interpreter mode improves by about 10% overall (or less), and AOT mode shows no improvement (since AOT mode is always fast). Also, cache misses and branch misses in interpreter mode are the same as before.
  • For reference: in lib/executor/instantiate/function.cpp (Instance::ModuleInstance::addFunc), I inserted atomic variables to count the branches. The result shows:
    AOT mode always takes the if branch, while interpreter mode only takes the else branch.
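
The branch-counting instrumentation described above can be sketched as follows. This is a stand-in, not the real addFunc: the function addFuncInstrumented, the BranchCounters struct, and the IsCompiled flag (assumed to mean "the function body comes from AOT-compiled code") are all hypothetical names chosen for illustration.

```cpp
#include <atomic>
#include <cstdint>

// Process-wide counters, one per branch.
struct BranchCounters {
  std::atomic<std::uint64_t> IfTaken{0};
  std::atomic<std::uint64_t> ElseTaken{0};
};

inline BranchCounters &branchCounters() {
  static BranchCounters C;
  return C;
}

// Hypothetical stand-in for the two-way branch inside addFunc.
// IsCompiled is an assumed flag, not the actual condition in the source.
inline void addFuncInstrumented(bool IsCompiled) {
  if (IsCompiled) {
    // Relaxed ordering suffices: the counter is only read after the run.
    branchCounters().IfTaken.fetch_add(1, std::memory_order_relaxed);
    // ... the original if-branch work would go here ...
  } else {
    branchCounters().ElseTaken.fetch_add(1, std::memory_order_relaxed);
    // ... the original else-branch work would go here ...
  }
}
```

Reading both counters after a run shows which branch dominates. A result like the one reported (one counter stays at zero in each mode) means the branch is perfectly predictable within a single mode, so it is unlikely to be the source of the measured branch misses on its own.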

This week

@eat4toast
Contributor Author

Week 10

Stuck on issue #1457. I tried different ways to solve it and ran various test cases, but the time cost in interpreter mode did not improve much (branch misses do decrease, but the gain depends on what share of the total time branch misses account for).

This week

Try to improve/fix this issue: https://github.com//issues/1457

@eat4toast
Contributor Author

eat4toast commented Jun 1, 2022

Week 11

Sorry for the late report; I was busy with my final exams. I ran a few simple tests last week and hope to solve this issue as soon as possible.

@eat4toast
Contributor Author

Summary

The main goal of this mentorship was to improve overall VM performance by running Ruby test cases on top of the miniruby Wasm file. What I accomplished is small: I fixed one issue where an incorrect initialization caused unnecessary instruction counting. Future work and potential optimization points: the loader's content loading and issue #1457.

I really appreciate the mentor's patient guidance (I learned many testing methods and tips that are not easy to get from books or videos). Besides, the WasmEdge runtime code deserves further study (it uses a long switch-case table to dispatch instructions in interpreter mode; other VM implementation approaches might also work well).

@dannypsnl
Member

Close as completed.
