Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize time wasted in passing argument and fetching return #137

Open
muayyad-alsadi opened this issue Mar 21, 2023 · 7 comments
Open

optimize time wasted in passing argument and fetching return #137

muayyad-alsadi opened this issue Mar 21, 2023 · 7 comments

Comments

@muayyad-alsadi
Copy link
Contributor

While discussing #96 I was able to identify a bottleneck
there was needlessly large time wasted in passing the arguments and retrieving the return

basically I made a simple 32-bit hash function cdb_djp_hash.c

in pure python the time is proportional to string length
in wasm it was almost constant time ~40ms regardless of string length 13 bytes to 1300 bytes (100x)
which means that most of those 40ms are in passing the arguments not in the actual loop (if it was the loop it would be proportional to string length or a large fraction of that).

This was confirmed using

pr=Profile()
pr.enable()
for i in range(10000):
        cdb_djp_hash_wasm(large_a)
pr.disable()
#pr.print_stats()
#pr.print_stats('cumulative')
pr.print_stats('calls')

the first note that the actual WASM call is fast (169ms out of 800ms)
which means that we might able to squeeze ~650ms of 800ms making it 4x faster

and I was able to identify bottleneck

  • things that take too much cumulative time
  • things that are called more than 10k (number of iterations)

examples of such things

  • too many calls to isinstance which is 170k (in a 10k iteration benchmark), that is 17 times per call
  • converting the argument _value.py:129(_convert) too ~200ms out of 800ms
  • we got dynamically sized lists operations like {method 'append' of 'list' objects}

Are we sure that passing arguments / converting can be made faster?
Yes

I can confirm _value.py:129(_convert) that is too slow

I've moved the conversion outside the loop and confirmed it was the bottleneck

    i32_start = Val.i32(start)
    i32_size = Val.i32(size)
    for i in range(10000):
        np_mem[start:stop]=input
        hash_wasm = ctypes.c_uint32(_cdb_djp_hash_wasm(i32_start, i32_size)).value

also I can confirm that this time is just waste because passing parameter should do the following

to proof that we can convert in almost no time

in_struct=Struct('<LL')
in_pack=in_struct.pack
for i in range(10000):
    b=pack(start, size)
@muayyad-alsadi
Copy link
Contributor Author

in _value._convert
we see a lot of isinstance() and _func.call do a lot of work on each call despite the fact that the signature of the function does not change at runtime

I believe in the constructor of class Func we need to set a property

in_fmt = ''.join([i.struct_code for i in ty])
self._in_struct=Struct('<'+in_fmt) # `<` because WASM is little endian
self._in_pack=self._in_struct.pack
self._out_struct=Struct('<'+out_fmt)
self._out_unpack=self._in_struct.unpack

where Val.struct_code is b, B, l, L, q, Q, f, d for i8, u8, i32, u32, i64, u64, f32, f64 respectively

and this should be done once on Func constructor not on every code
then when we want to encode the arguments

b=self._in_pack(...args)

@muayyad-alsadi

This comment was marked as outdated.

@muayyad-alsadi
Copy link
Contributor Author

I was able to make it 4x faster
running 10k iteration of this hash function typically took ~350ms and now ~80ms

old:  0.3466089940011443
new:  0.0788318810009514

here is the code

def cdb_djp_hash_wasm(input: bytearray):
    size = len(input)
    start = hb
    stop = start + size
    np_mem[start:stop]=input
    params = (start, size)
    raw=(wasmtime_val_raw_t*len(params))()
    for row_i, param in zip(raw, params): row_i.i32=param
    raw_ptr_casted = ctypes.cast(raw, ctypes.POINTER(wasmtime_val_raw_t))
    with enter_wasm(store) as trap:
            error = ffi.wasmtime_func_call_unchecked(
                store._context,
                ctypes.byref(_cdb_djp_hash_wasm_in._func),
                raw_ptr_casted,
                trap)
            if error:
                raise WasmtimeError._from_ptr(error)
    return ctypes.c_uint32(raw[0].i32).value

and there is still more room from improvement

@muayyad-alsadi
Copy link
Contributor Author

now it's ~66ms instead of ~80ms (compared to old 350ms)

by doing things once outside of the loop (the function won't change it's prototype/signature inside the loop)

def func_init(func, store):
    ty = func.type(store)
    ty_params = ty.params
    ty_results = ty.results
    params_str = (str(i) for i in ty_params)
    params_n = len(ty_params)
    results_n = len(ty_results)
    n = max(params_n, results_n)
    raw_type = wasmtime_val_raw_t*n
    func.raw_type = raw_type
    def _create_raw(params):
        raw = raw_type()
        for i, param_str in enumerate(params_str):
            setattr(raw[i], param_str, params[i])
        return raw
    func._create_raw = _create_raw

func_init(_cdb_djp_hash_wasm_in, store)
#....
    raw = _cdb_djp_hash_wasm_in._create_raw(params)
    raw_ptr_casted = ctypes.cast(raw, ctypes.POINTER(wasmtime_val_raw_t))

@muayyad-alsadi
Copy link
Contributor Author

using struct.Struct gave ~80ms that is no extra benefit

each wasmtime_val_raw_t is 16 bytes (that is each one is 4xi32)
since our function takes two i32 we need a struct of <LLLLLLLL

# outside the loop
st=Struct('<LLLLLLLL')
raw_ptr_type = ctypes.c_uint8*st.size
st_ret=Struct('<L')
# inside the loop
    raw = bytearray(st.pack(start,0,0,0, size, 0,0,0))
    raw_ptr = raw_ptr_type.from_buffer(raw)
    raw_ptr_casted = ctypes.cast(raw_ptr, ctypes.POINTER(wasmtime_val_raw_t))

I'll continue using the previous func_init with wasmtime_val_raw_t

@muayyad-alsadi
Copy link
Contributor Author

when profiler is enabled old method took ~700ms
and new method takes ~110ms
of which the actual wasm took less than 20ms (that is _bindings.py:2485(wasmtime_func_call_unchecked) or _bindings.py:2479(wasmtime_func_call) )

the profile of new method is very clean and does not seem to be possible to optimize it further

here are the report

old:  0.3560021289958968
new:  0.07202652499836404
         180001 function calls in 0.112 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    0.011    0.000    0.011    0.000 __init__.py:509(cast)
    10000    0.016    0.000    0.016    0.000 _bindings.py:2485(wasmtime_func_call_unchecked)
    20000    0.007    0.000    0.009    0.000 _func.py:256(enter_wasm)
    10000    0.004    0.000    0.004    0.000 cdb_djp_hash.py:72(_create_raw)
    10000    0.039    0.000    0.112    0.000 cdb_djp_hash.py:85(cdb_djp_hash_wasm)
    10000    0.008    0.000    0.010    0.000 contextlib.py:102(__init__)
    10000    0.004    0.000    0.014    0.000 contextlib.py:130(__enter__)
    10000    0.007    0.000    0.011    0.000 contextlib.py:139(__exit__)
    10000    0.004    0.000    0.013    0.000 contextlib.py:279(helper)
    20000    0.002    0.000    0.002    0.000 {built-in method _ctypes.POINTER}
    20000    0.003    0.000    0.003    0.000 {built-in method _ctypes.byref}
    10000    0.002    0.000    0.002    0.000 {built-in method builtins.getattr}
    10000    0.001    0.000    0.001    0.000 {built-in method builtins.len}
    20000    0.005    0.000    0.014    0.000 {built-in method builtins.next}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


         180001 function calls in 0.112 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    0.039    0.000    0.112    0.000 cdb_djp_hash.py:85(cdb_djp_hash_wasm)
    10000    0.016    0.000    0.016    0.000 _bindings.py:2485(wasmtime_func_call_unchecked)
    20000    0.005    0.000    0.014    0.000 {built-in method builtins.next}
    10000    0.004    0.000    0.014    0.000 contextlib.py:130(__enter__)
    10000    0.004    0.000    0.013    0.000 contextlib.py:279(helper)
    10000    0.011    0.000    0.011    0.000 __init__.py:509(cast)
    10000    0.007    0.000    0.011    0.000 contextlib.py:139(__exit__)
    10000    0.008    0.000    0.010    0.000 contextlib.py:102(__init__)
    20000    0.007    0.000    0.009    0.000 _func.py:256(enter_wasm)
    10000    0.004    0.000    0.004    0.000 cdb_djp_hash.py:72(_create_raw)
    20000    0.003    0.000    0.003    0.000 {built-in method _ctypes.byref}
    20000    0.002    0.000    0.002    0.000 {built-in method _ctypes.POINTER}
    10000    0.002    0.000    0.002    0.000 {built-in method builtins.getattr}
    10000    0.001    0.000    0.001    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


         180001 function calls in 0.112 seconds

   Ordered by: call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    20000    0.007    0.000    0.009    0.000 _func.py:256(enter_wasm)
    20000    0.005    0.000    0.014    0.000 {built-in method builtins.next}
    20000    0.002    0.000    0.002    0.000 {built-in method _ctypes.POINTER}
    20000    0.003    0.000    0.003    0.000 {built-in method _ctypes.byref}
    10000    0.016    0.000    0.016    0.000 _bindings.py:2485(wasmtime_func_call_unchecked)
    10000    0.011    0.000    0.011    0.000 __init__.py:509(cast)
    10000    0.008    0.000    0.010    0.000 contextlib.py:102(__init__)
    10000    0.004    0.000    0.014    0.000 contextlib.py:130(__enter__)
    10000    0.007    0.000    0.011    0.000 contextlib.py:139(__exit__)
    10000    0.004    0.000    0.013    0.000 contextlib.py:279(helper)
    10000    0.039    0.000    0.112    0.000 cdb_djp_hash.py:85(cdb_djp_hash_wasm)
    10000    0.004    0.000    0.004    0.000 cdb_djp_hash.py:72(_create_raw)
    10000    0.002    0.000    0.002    0.000 {built-in method builtins.getattr}
    10000    0.001    0.000    0.001    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

I just wonder why enter_wasm was called twice per iteration

@muayyad-alsadi
Copy link
Contributor Author

this is a great enhancement from 43 ms to 6 ms in @alexprengere question in #96
yes the pure python implementation beats that at 1ms
although this is a huge improvement we have long way to beat pure python (not to mention C)
but I believe according to "Profile" this is as fast as python binding can go, any further optimization should be in the wasmtime itself not in the binding

0.0002523580042179674 # math_gcd
0.0014094869984546676 # python_gcd
0.043804362998344004  #  wasm_gcd
0.005873051006346941  # wasm_gcd_alt <-------- my change

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Mar 25, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Mar 25, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Mar 25, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Mar 26, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Mar 26, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Mar 26, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Mar 27, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Mar 27, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Mar 28, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 2, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 2, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 4, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 4, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 4, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 4, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 4, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 4, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 5, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 5, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 5, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 5, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 5, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 5, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 5, 2023
muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant