Using threaded runtime destroys performance of C calls #115

Open
l29ah opened this issue Jul 28, 2020 · 5 comments

@l29ah (Contributor) commented Jul 28, 2020

‰ ghc -O2 -threaded --make inline-c-crit.hs && ./inline-c-crit +RTS -N        
Linking inline-c-crit ...
benchmarking haskell +
time                 7.679 ns   (7.570 ns .. 7.774 ns)
                     0.998 R²   (0.998 R² .. 0.999 R²)
mean                 7.594 ns   (7.495 ns .. 7.733 ns)
std dev              397.9 ps   (300.3 ps .. 528.6 ps)
variance introduced by outliers: 76% (severely inflated)

benchmarking c +
time                 182.9 ns   (182.0 ns .. 184.6 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 182.7 ns   (182.1 ns .. 183.8 ns)
std dev              2.572 ns   (1.813 ns .. 3.817 ns)
variance introduced by outliers: 15% (moderately inflated)

benchmarking cu +
time                 78.18 ns   (76.60 ns .. 79.59 ns)
                     0.998 R²   (0.998 R² .. 1.000 R²)
mean                 77.06 ns   (76.43 ns .. 78.05 ns)
std dev              2.792 ns   (1.559 ns .. 4.588 ns)
variance introduced by outliers: 56% (severely inflated)

benchmarking c block +
time                 195.8 ns   (188.0 ns .. 203.0 ns)
                     0.993 R²   (0.990 R² .. 0.999 R²)
mean                 189.3 ns   (186.6 ns .. 193.3 ns)
std dev              11.25 ns   (7.601 ns .. 14.98 ns)
variance introduced by outliers: 76% (severely inflated)

benchmarking cu block +
time                 76.09 ns   (75.35 ns .. 76.99 ns)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 78.40 ns   (76.93 ns .. 81.21 ns)
std dev              6.560 ns   (2.949 ns .. 10.09 ns)
variance introduced by outliers: 88% (severely inflated)

A 76 ns call overhead on a modern 3.5 GHz i7 is just insane: that's roughly 266 cycles!
Disabling -N makes it much smaller (though still more than the Haskell version), but then I can't meaningfully use threads in the application.

Code:

{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE QuasiQuotes #-}
import Criterion.Main
import qualified Language.C.Inline as C
import qualified Language.C.Inline.Unsafe as CU
import Data.Word
import System.IO.Unsafe

C.include "<stdint.h>"

type Fun = Word32 -> Word32 -> Word32

fun :: Fun
fun = (+)

cfun :: Fun
cfun x y = [C.pure| uint32_t { $(uint32_t x) + $(uint32_t y) }|]

cufun :: Fun
cufun x y = [CU.pure| uint32_t { $(uint32_t x) + $(uint32_t y) }|]

cblockfun :: Fun
cblockfun x y = unsafePerformIO [C.block| uint32_t { return $(uint32_t x) + $(uint32_t y); }|]

cublockfun :: Fun
cublockfun x y = unsafePerformIO [CU.block| uint32_t { return $(uint32_t x) + $(uint32_t y); }|]

main = defaultMain
	[ bgroup ""
		[ bench "haskell +" $ nf (fun 1) 1
		, bench "c +" $ nf (cfun 1) 1
		, bench "cu +" $ nf (cufun 1) 1
		, bench "c block +" $ nf (cblockfun 1) 1
		, bench "cu block +" $ nf (cublockfun 1) 1
		]
	]
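
For reference, the difference between the C and CU quasiquoters comes down to the safety annotation on the foreign call they generate: Language.C.Inline emits safe calls, Language.C.Inline.Unsafe emits unsafe ones. A minimal raw-FFI sketch of that distinction (not part of the original benchmark; libc's abs stands in for the generated wrapper):

{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign.C.Types (CInt)

-- A "safe" call releases the capability so other Haskell threads can run while
-- the C code executes; under -threaded with -N that RTS bookkeeping dominates
-- a call this small.
foreign import ccall safe "abs" absSafe :: CInt -> IO CInt

-- An "unsafe" call compiles to (almost) a plain C function call.
foreign import ccall unsafe "abs" absUnsafe :: CInt -> IO CInt

main :: IO ()
main = do
	a <- absSafe (-3)
	b <- absUnsafe (-3)
	print (a, b)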
@Ofenhed (Contributor) commented Jul 28, 2020

Interesting tests. It seems to be partly related to lazy evaluation. I expanded your benchmark with variants that don't use lazy evaluation:

{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE QuasiQuotes #-}
import Criterion.Main
import qualified Language.C.Inline as C
import qualified Language.C.Inline.Unsafe as CU
import Data.Word
import System.IO.Unsafe

C.include "<stdint.h>"

type Fun = Word32 -> Word32 -> Word32
type IOFun = Word32 -> Word32 -> IO Word32

fun :: Fun
fun = (+)

cfun :: Fun
cfun x y = [C.pure| uint32_t { $(uint32_t x) + $(uint32_t y) }|]

cufun :: Fun
cufun x y = [CU.pure| uint32_t { $(uint32_t x) + $(uint32_t y) }|]

cblockfun :: Fun
cblockfun x y = unsafePerformIO [C.block| uint32_t { return $(uint32_t x) + $(uint32_t y); }|]

cublockfun :: Fun
cublockfun x y = unsafePerformIO [CU.block| uint32_t { return $(uint32_t x) + $(uint32_t y); }|]

cblockiofun :: IOFun
cblockiofun x y = [C.block| uint32_t { return $(uint32_t x) + $(uint32_t y); }|]

cublockiofun :: IOFun
cublockiofun x y = [CU.block| uint32_t { return $(uint32_t x) + $(uint32_t y); }|]

main = defaultMain
	[ bgroup ""
		[ bench "haskell +" $ nf (fun 1) 1
		, bench "c +" $ nf (cfun 1) 1
		, bench "cu +" $ nf (cufun 1) 1
		, bench "unsafe c block +" $ nf (cblockfun 1) 1
		, bench "unsafe cu block +" $ nf (cublockfun 1) 1
		, bench "c IO block +" $ nfIO (cblockiofun 1 1)
		, bench "cu IO block +" $ nfIO (cublockiofun 1 1)
		]
	]

I get roughly the same kind of results as you (though not as bad for the unsafe functions, on an old i7-2620M), but removing the unsafePerformIO gives me Haskell-like performance for CU.block.

benchmarking haskell +
time                 16.57 ns   (16.49 ns .. 16.68 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 16.73 ns   (16.61 ns .. 17.01 ns)
std dev              554.8 ps   (303.3 ps .. 995.0 ps)
variance introduced by outliers: 54% (severely inflated)

benchmarking c +
time                 395.9 ns   (391.7 ns .. 401.5 ns)
                     0.998 R²   (0.996 R² .. 1.000 R²)
mean                 399.1 ns   (395.1 ns .. 406.4 ns)
std dev              18.28 ns   (11.36 ns .. 31.15 ns)
variance introduced by outliers: 64% (severely inflated)

benchmarking cu +
time                 27.76 ns   (26.76 ns .. 29.92 ns)
                     0.979 R²   (0.948 R² .. 1.000 R²)
mean                 27.48 ns   (26.99 ns .. 28.96 ns)
std dev              2.789 ns   (607.3 ps .. 5.285 ns)
variance introduced by outliers: 92% (severely inflated)

benchmarking unsafe c block +
time                 396.2 ns   (393.8 ns .. 399.7 ns)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 395.8 ns   (394.1 ns .. 398.1 ns)
std dev              6.578 ns   (5.015 ns .. 8.649 ns)
variance introduced by outliers: 19% (moderately inflated)

benchmarking unsafe cu block +
time                 26.70 ns   (26.61 ns .. 26.82 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 27.08 ns   (26.90 ns .. 27.31 ns)
std dev              712.1 ps   (563.6 ps .. 949.6 ps)
variance introduced by outliers: 42% (moderately inflated)

benchmarking c IO block +
time                 397.2 ns   (394.4 ns .. 401.5 ns)
                     0.998 R²   (0.995 R² .. 1.000 R²)
mean                 402.7 ns   (398.4 ns .. 414.5 ns)
std dev              22.17 ns   (10.67 ns .. 40.93 ns)
variance introduced by outliers: 72% (severely inflated)

benchmarking cu IO block +
time                 17.18 ns   (17.13 ns .. 17.23 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 17.29 ns   (17.22 ns .. 17.37 ns)
std dev              255.5 ps   (200.3 ps .. 341.7 ps)
variance introduced by outliers: 19% (moderately inflated)

@l29ah (Contributor, Author) commented Jul 28, 2020

Thanks for your findings! With unsafeDupablePerformIO instead of unsafePerformIO the performance is much better (though still slower than Haskell). Maybe pure, or at least its unsafe counterpart, should use unsafeDupablePerformIO under the hood?
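
For context, unsafePerformIO is defined as unsafeDupablePerformIO with a noDuplicate barrier in front, and that barrier is what gets expensive on the threaded runtime. A sketch of what the change would amount to, using the CU.block action from the benchmark (the cAdd* names are purely illustrative):

{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE QuasiQuotes #-}
import qualified Language.C.Inline as C
import qualified Language.C.Inline.Unsafe as CU
import Data.Word
import System.IO.Unsafe (unsafeDupablePerformIO, unsafePerformIO)

C.include "<stdint.h>"

cAddIO :: Word32 -> Word32 -> IO Word32
cAddIO x y = [CU.block| uint32_t { return $(uint32_t x) + $(uint32_t y); }|]

-- unsafePerformIO adds a noDuplicate barrier that synchronises with the other
-- capabilities so the action cannot run twice; that is the extra cost observed
-- with -threaded.
cAddOnce :: Word32 -> Word32 -> Word32
cAddOnce x y = unsafePerformIO (cAddIO x y)

-- unsafeDupablePerformIO skips that barrier; the action may in principle be
-- evaluated more than once, which is harmless for a side-effect-free C
-- expression like this one.
cAddDupable :: Word32 -> Word32 -> Word32
cAddDupable x y = unsafeDupablePerformIO (cAddIO x y)

main :: IO ()
main = print (cAddOnce 1 1, cAddDupable 1 1)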

@bitonic (Collaborator) commented Sep 20, 2020

@l29ah I've merged #117; it is now released as 0.9.1.1 (thanks @Ofenhed!). Do you think we can close this?

@Ofenhed (Contributor) commented Sep 20, 2020

It improves the performance of pure slightly, but the overhead for safe functions is still very high. That might simply be the cost of safe calls, I don't know, but I wouldn't say this question is put to rest yet.
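
One way to check how much of that is intrinsic to safe calls would be to benchmark raw safe and unsafe imports of the same libc function under -threaded +RTS -N and compare with the inline-c numbers; a sketch (not something that was run in this thread):

{-# LANGUAGE ForeignFunctionInterface #-}
import Criterion.Main
import Foreign.C.Types (CInt)

-- Same libc stand-in as before: only the safety annotation differs.
foreign import ccall safe "abs" absSafe :: CInt -> IO CInt
foreign import ccall unsafe "abs" absUnsafe :: CInt -> IO CInt

main :: IO ()
main = defaultMain
	[ bench "ffi safe abs" $ whnfIO (absSafe (-1))
	, bench "ffi unsafe abs" $ whnfIO (absUnsafe (-1))
	]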

@bitonic (Collaborator) commented Sep 21, 2020

OK, let's leave it open then -- I don't think I will have time to work on this but contributions are very welcome.
