TimeTBB example: no speedup - actually slower by a factor X20 #92

izzys · 2019-07-23T10:58:07Z

Description

Running the TimeTBB.cpp example produces the following results:

numberOfProblems = 1000000
problemSize = 4
With 1 threads:
Without memory allocation, grain size = 1, time = 0.284984
Without memory allocation, grain size = 10, time = 0.279206
Without memory allocation, grain size = 100, time = 0.256432
Without memory allocation, grain size = 1000, time = 0.253955
With memory allocation, grain size = 1, time = 0.422034
With memory allocation, grain size = 10, time = 0.444783
With memory allocation, grain size = 100, time = 0.437323
With memory allocation, grain size = 1000, time = 0.418359

With 4 threads:
Without memory allocation, grain size = 1, time = 4.46345
Without memory allocation, grain size = 10, time = 4.58412
Without memory allocation, grain size = 100, time = 4.66668
Without memory allocation, grain size = 1000, time = 4.60369
With memory allocation, grain size = 1, time = 5.07619
With memory allocation, grain size = 10, time = 5.38483
With memory allocation, grain size = 100, time = 5.23105
With memory allocation, grain size = 1000, time = 5.28864

With 8 threads:
Without memory allocation, grain size = 1, time = 5.24027
Without memory allocation, grain size = 10, time = 5.25576
Without memory allocation, grain size = 100, time = 5.2626
Without memory allocation, grain size = 1000, time = 5.25358
With memory allocation, grain size = 1, time = 5.95175
With memory allocation, grain size = 10, time = 5.93275
With memory allocation, grain size = 100, time = 5.92773
With memory allocation, grain size = 1000, time = 5.93785

Summary of results:
4 threads, without allocation, grain size = 1, speedup = 0.0638485
4 threads, without allocation, grain size = 10, speedup = 0.0609071
4 threads, without allocation, grain size = 100, speedup = 0.0549497
4 threads, without allocation, grain size = 1000, speedup = 0.0551635
4 threads, with allocation, grain size = 1, speedup = 0.0831399
4 threads, with allocation, grain size = 10, speedup = 0.0825993
4 threads, with allocation, grain size = 100, speedup = 0.0836012
4 threads, with allocation, grain size = 1000, speedup = 0.0791052
8 threads, without allocation, grain size = 1, speedup = 0.0543836
8 threads, without allocation, grain size = 10, speedup = 0.0531238
8 threads, without allocation, grain size = 100, speedup = 0.0487273
8 threads, without allocation, grain size = 1000, speedup = 0.0483396
8 threads, with allocation, grain size = 1, speedup = 0.0709091
8 threads, with allocation, grain size = 10, speedup = 0.0749709
8 threads, with allocation, grain size = 100, speedup = 0.0737758
8 threads, with allocation, grain size = 1000, speedup = 0.0704562

Steps to reproduce

Just run the example.

Expected behavior

I would expect some speedup, not a slowdown.
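
For reference, the speedup figures in the summary above appear to be the single-thread time divided by the multi-thread time for the same configuration (for example, 0.284984 / 4.46345 ≈ 0.064), so a value below 1 means the parallel run was slower. A minimal sketch of that calculation follows; the names `speedup` and `slowdownFactor` are illustrative and are not taken from TimeTBB.cpp.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not from TimeTBB.cpp): ratio of the single-thread
// time to the multi-thread time. Values below 1 mean a slowdown.
double speedup(double singleThreadSeconds, double multiThreadSeconds) {
  assert(multiThreadSeconds > 0.0);
  return singleThreadSeconds / multiThreadSeconds;
}

// Hypothetical helper: how many times slower the multi-thread run is
// than the single-thread run (the reciprocal of the speedup).
double slowdownFactor(double singleThreadSeconds, double multiThreadSeconds) {
  return 1.0 / speedup(singleThreadSeconds, multiThreadSeconds);
}
```

With the times reported above for 4 threads, no allocation, and grain size 1, `speedup(0.284984, 4.46345)` is about 0.064, which corresponds to a run roughly 16 times slower than the single-thread case.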

Environment

Linux (Ubuntu 16.04)
Intel i7

Here is my CMake output:

-- GTSAM_SOURCE_ROOT_DIR: [/home/izzys/samples/gtsam_samples]
-- Boost version: 1.58.0
-- Found the following Boost libraries:
-- serialization
-- system
-- filesystem
-- thread
-- program_options
-- date_time
-- timer
-- chrono
-- regex
-- atomic
-- GTSAM_BOOST_LIBRARIES: optimized;/usr/lib/x86_64-linux-gnu/libboost_serialization.so;optimized;/usr/lib/x86_64-linux-gnu/libboost_system.so;optimized;/usr/lib/x86_64-linux-gnu/libboost_filesystem.so;optimized;/usr/lib/x86_64-linux-gnu/libboost_thread.so;optimized;/usr/lib/x86_64-linux-gnu/libboost_date_time.so;optimized;/usr/lib/x86_64-linux-gnu/libboost_regex.so;debug;/usr/lib/x86_64-linux-gnu/libboost_serialization.so;debug;/usr/lib/x86_64-linux-gnu/libboost_system.so;debug;/usr/lib/x86_64-linux-gnu/libboost_filesystem.so;debug;/usr/lib/x86_64-linux-gnu/libboost_thread.so;debug;/usr/lib/x86_64-linux-gnu/libboost_date_time.so;debug;/usr/lib/x86_64-linux-gnu/libboost_regex.so
Ignoring Boost restriction on optional lvalue assignment from rvalues
-- Found Eigen version: 3.3.7
-- Building 3rdparty
-- checking for thread-local storage - found
-- Could NOT find GeographicLib (missing: GeographicLib_LIBRARY_DIRS GeographicLib_LIBRARIES GeographicLib_INCLUDE_DIRS)
-- Building base
-- Building geometry
-- Building inference
-- Building symbolic
-- Building discrete
-- Building linear
-- Building nonlinear
-- Building sam
-- Building sfm
-- Building slam
-- Building smart
-- Building navigation
-- GTSAM Version: 4.0.0
-- Install prefix: /usr/local
-- Building GTSAM - shared: ON
-- Wrote /home/tc34738/samples/gtsam_samples/gtsam-build/GTSAMConfig.cmake
-- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE)
-- ===============================================================
-- ================ Configuration Options ======================
-- CMAKE_CXX_COMPILER_ID type : GNU
-- CMAKE_CXX_COMPILER_VERSION : 5.4.0
-- CMake version : 3.5.1
-- CMake generator : Unix Makefiles
-- CMake build tool : /usr/bin/make
-- Build flags
-- Build Tests : Enabled
-- Build examples with 'make all' : Enabled
-- Build timing scripts with 'make all': Disabled
-- Build shared GTSAM libraries : Enabled
-- Put build type in library name : Enabled
-- Build libgtsam_unstable : Disabled
-- Build for native architecture : Enabled
-- Build type : Release
-- C compilation flags : -O3 -DNDEBUG
-- C++ compilation flags : -O3 -DNDEBUG
-- GTSAM_COMPILE_FEATURES_PUBLIC :
-- GTSAM_COMPILE_OPTIONS_PRIVATE : -Wall;$<$CONFIG:Debug:-g;-fno-inline>;$<$CONFIG:Release:-O3>;$<$CONFIG:Timing:-g;-O3>;$<$CONFIG:Profiling:-O3>;$<$CONFIG:RelWithDebInfo:-g;-O3>;-Wno-unused-local-typedefs
-- GTSAM_COMPILE_OPTIONS_PUBLIC : $<$<COMPILE_LANGUAGE:CXX>:-std=c++11>;-march=native
-- GTSAM_COMPILE_DEFINITIONS_PRIVATE : $<$CONFIG:Debug:_DEBUG;EIGEN_INITIALIZE_MATRICES_BY_NAN>;$<$CONFIG:Release:NDEBUG>;$<$CONFIG:Timing:NDEBUG;ENABLE_TIMING>;$<$CONFIG:Profiling:NDEBUG>;$<$CONFIG:RelWithDebInfo:NDEBUG>
-- GTSAM_COMPILE_DEFINITIONS_PUBLIC : BOOST_OPTIONAL_ALLOW_BINDING_TO_RVALUES;BOOST_OPTIONAL_CONFIG_ALLOW_BINDING_TO_RVALUES
-- GTSAM_COMPILE_OPTIONS_PRIVATE_DEBUG : -g;-fno-inline
-- GTSAM_COMPILE_OPTIONS_PUBLIC_DEBUG :
-- GTSAM_COMPILE_DEFINITIONS_PRIVATE_DEBUG : _DEBUG;EIGEN_INITIALIZE_MATRICES_BY_NAN
-- GTSAM_COMPILE_DEFINITIONS_PUBLIC_DEBUG :
-- GTSAM_COMPILE_OPTIONS_PRIVATE_RELEASE : -O3
-- GTSAM_COMPILE_OPTIONS_PUBLIC_RELEASE :
-- GTSAM_COMPILE_DEFINITIONS_PRIVATE_RELEASE : NDEBUG
-- GTSAM_COMPILE_DEFINITIONS_PUBLIC_RELEASE :
-- GTSAM_COMPILE_OPTIONS_PRIVATE_TIMING : -g;-O3
-- GTSAM_COMPILE_OPTIONS_PUBLIC_TIMING :
-- GTSAM_COMPILE_DEFINITIONS_PRIVATE_TIMING : NDEBUG;ENABLE_TIMING
-- GTSAM_COMPILE_DEFINITIONS_PUBLIC_TIMING :
-- GTSAM_COMPILE_OPTIONS_PRIVATE_PROFILING : -O3
-- GTSAM_COMPILE_OPTIONS_PUBLIC_PROFILING :
-- GTSAM_COMPILE_DEFINITIONS_PRIVATE_PROFILING : NDEBUG
-- GTSAM_COMPILE_DEFINITIONS_PUBLIC_PROFILING :
-- GTSAM_COMPILE_OPTIONS_PRIVATE_RELWITHDEBINFO : -g;-O3
-- GTSAM_COMPILE_OPTIONS_PUBLIC_RELWITHDEBINFO :
-- GTSAM_COMPILE_DEFINITIONS_PRIVATE_RELWITHDEBINFO : NDEBUG
-- GTSAM_COMPILE_DEFINITIONS_PUBLIC_RELWITHDEBINFO :
-- GTSAM_COMPILE_OPTIONS_PRIVATE_MINSIZEREL :
-- GTSAM_COMPILE_OPTIONS_PUBLIC_MINSIZEREL :
-- GTSAM_COMPILE_DEFINITIONS_PRIVATE_MINSIZEREL :
-- GTSAM_COMPILE_DEFINITIONS_PUBLIC_MINSIZEREL :
-- Use System Eigen : OFF (Using version: 3.3.7)
-- Use Intel TBB : Yes
-- Eigen will use MKL : MKL found but GTSAM_WITH_EIGEN_MKL is disabled
-- Eigen will use MKL and OpenMP : OpenMP found but GTSAM_WITH_EIGEN_MKL is disabled
-- Default allocator : TBB
-- Build with ccache : No
-- Packaging flags
-- CPack Source Generator : TGZ
-- CPack Generator : TGZ
-- GTSAM flags
-- Quaternions as default Rot3 : Disabled
-- Runtime consistency checking : Disabled
-- Rot3 retract is full ExpMap : Disabled
-- Pose3 retract is full ExpMap : Disabled
-- Deprecated in GTSAM 4 allowed : Enabled
-- Point3 is typedef to Vector3 : Disabled
-- Metis-based Nested Dissection : Enabled
-- Use tangent-space preintegration: Enabled
-- Build Wrap : Disabled
-- MATLAB toolbox flags
-- Install matlab toolbox : Disabled
-- Cython toolbox flags
-- Install Cython toolbox : Disabled
-- ===============================================================
-- Configuring done
-- Generating done
-- Build files have been written to: /home/izzys/samples/gtsam_samples/gtsam-build

dellaert · 2019-07-30T14:00:21Z

@MandyXie could you try to reproduce?

MandyXie · 2019-08-02T16:30:39Z

I ran the example and got the same issue you described. I will look into it and try to figure out what is going on.

ProfFan · 2019-09-23T14:22:54Z

Side note: we could try integrating a flamegraph library into GTSAM, possibly replacing the gttic/toc machinery.

ProfFan · 2019-09-23T14:47:31Z

#121

ProfFan · 2019-09-25T14:27:35Z

My results on macOS 10.14:

numberOfProblems = 1000000
problemSize = 4
With 1 threads:
Without memory allocation, grain size = 1, time = 0.150485
Without memory allocation, grain size = 10, time = 0.15183
Without memory allocation, grain size = 100, time = 0.149489
Without memory allocation, grain size = 1000, time = 0.152419
With memory allocation, grain size = 1, time = 0.351757
With memory allocation, grain size = 10, time = 0.320499
With memory allocation, grain size = 100, time = 0.314284
With memory allocation, grain size = 1000, time = 0.323573

With 4 threads:
Without memory allocation, grain size = 1, time = 0.162687
Without memory allocation, grain size = 10, time = 0.162498
Without memory allocation, grain size = 100, time = 0.146438
Without memory allocation, grain size = 1000, time = 0.150557
With memory allocation, grain size = 1, time = 0.192916
With memory allocation, grain size = 10, time = 0.200336
With memory allocation, grain size = 100, time = 0.196882
With memory allocation, grain size = 1000, time = 0.195918

With 8 threads:
Without memory allocation, grain size = 1, time = 0.160153
Without memory allocation, grain size = 10, time = 0.160778
Without memory allocation, grain size = 100, time = 0.161141
Without memory allocation, grain size = 1000, time = 0.161196
With memory allocation, grain size = 1, time = 0.198829
With memory allocation, grain size = 10, time = 0.199491
With memory allocation, grain size = 100, time = 0.199772
With memory allocation, grain size = 1000, time = 0.201396

Summary of results:
4 threads, without allocation, grain size = 1, speedup = 0.924997
4 threads, without allocation, grain size = 10, speedup = 0.93435
4 threads, without allocation, grain size = 100, speedup = 1.02083
4 threads, without allocation, grain size = 1000, speedup = 1.01237
4 threads, with allocation, grain size = 1, speedup = 1.82337
4 threads, with allocation, grain size = 10, speedup = 1.59981
4 threads, with allocation, grain size = 100, speedup = 1.59631
4 threads, with allocation, grain size = 1000, speedup = 1.65157
8 threads, without allocation, grain size = 1, speedup = 0.939633
8 threads, without allocation, grain size = 10, speedup = 0.944346
8 threads, without allocation, grain size = 100, speedup = 0.927691
8 threads, without allocation, grain size = 1000, speedup = 0.945551
8 threads, with allocation, grain size = 1, speedup = 1.76914
8 threads, with allocation, grain size = 10, speedup = 1.60658
8 threads, with allocation, grain size = 100, speedup = 1.57321
8 threads, with allocation, grain size = 1000, speedup = 1.60665

dellaert · 2019-09-25T14:52:08Z

I wonder whether we can fix this by looking at where we lose time. Also, the amount of parallelism depends on a good ordering, so we should investigate whether using Metis, for example, gives us better bang for the buck. Finally, we could share this in the docs and possibly a blog post, reminding people about parallelism in the Bayes tree, and possibly provide a flag that controls whether the parallel branch is used.
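
As a starting point for finding where the time goes, individual regions can be timed with a monotonic wall clock. The sketch below uses only the C++ standard library; `timeSeconds` is a hypothetical helper, not GTSAM's gttic/toc machinery, and `sumOfSquares` is a placeholder workload standing in for the code under investigation.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Hypothetical helper: run `work` once and return the elapsed wall-clock
// time in seconds, measured with a monotonic clock.
template <typename Work>
double timeSeconds(Work&& work) {
  const auto start = std::chrono::steady_clock::now();
  std::forward<Work>(work)();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}

// Placeholder workload: sum of i*i for i in [0, n).
double sumOfSquares(int n) {
  assert(n >= 0);
  double total = 0.0;
  for (int i = 0; i < n; ++i) total += static_cast<double>(i) * i;
  return total;
}
```

A call such as `timeSeconds([] { volatile double s = sumOfSquares(1000000); (void)s; })` returns the wall time of the lambda. Timing the serial and parallel variants of the same region this way would show which side of the comparison is unexpectedly slow.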

ProfFan · 2019-09-25T15:04:08Z

Note that the FindTBB.cmake in GTSAM is also out of date (it cannot find TBB 2019.U0). Replacing it with the file from the VTK repo works flawlessly.

ProfFan · 2019-09-26T15:18:03Z

Note that my previous result was wrong. On my Mac it actually works, with up to roughly 4x improvement with TBB. I assume the bug is specific to the environment (Ubuntu 16.04).

I have no time to work on this currently.

> $ ninja TimeTBB.run
[2/2] cd /Users/proffan/Projects/Development/VISION/gtsam_...n/Projects/Development/VISION/gtsam_build/examples/TimeTBB
/Users/proffan/Projects/Development/VISION/GTSAM/gtsam/3rdparty/Eigen/Eigen/src/Core/functors/UnaryFunctors.h:576:88: runtime error: division by zero
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /Users/proffan/Projects/Development/VISION/GTSAM/gtsam/3rdparty/Eigen/Eigen/src/Core/functors/UnaryFunctors.h:576:88 in
numberOfProblems = 1000000
problemSize = 4
With 1 threads:
/usr/local/include/tbb/internal/../task.h:779:30: runtime error: member call on address 0x000116be3e00 which does not point to an object of type 'tbb::internal::scheduler'
0x000116be3e00: note: object is of type 'tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>'
 00 00 00 00  e8 1e 92 12 01 00 00 00  00 00 00 00 00 00 00 00  60 76 bf 16 01 00 00 00  60 76 bf 16
              ^~~~~~~~~~~~~~~~~~~~~~~
              vptr for 'tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/local/include/tbb/internal/../task.h:779:30 in
/usr/local/include/tbb/internal/../task.h:1046:23: runtime error: member call on address 0x000116be3e00 which does not point to an object of type 'tbb::internal::scheduler'
0x000116be3e00: note: object is of type 'tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>'
 00 00 00 00  e8 1e 92 12 01 00 00 00  00 00 00 00 00 00 00 00  60 76 bf 16 01 00 00 00  60 76 bf 16
              ^~~~~~~~~~~~~~~~~~~~~~~
              vptr for 'tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/local/include/tbb/internal/../task.h:1046:23 in
Without memory allocation, grain size = 1, time = 5.46733
Without memory allocation, grain size = 10, time = 5.59286
Without memory allocation, grain size = 100, time = 5.64539
Without memory allocation, grain size = 1000, time = 5.51933
With memory allocation, grain size = 1, time = 8.55949
With memory allocation, grain size = 10, time = 9.07178
With memory allocation, grain size = 100, time = 8.79069
With memory allocation, grain size = 1000, time = 8.66558

With 4 threads:
Without memory allocation, grain size = 1, time = 1.69261
Without memory allocation, grain size = 10, time = 1.68709
Without memory allocation, grain size = 100, time = 1.73469
Without memory allocation, grain size = 1000, time = 1.7691
With memory allocation, grain size = 1, time = 2.58719
With memory allocation, grain size = 10, time = 2.65104
With memory allocation, grain size = 100, time = 2.62247
With memory allocation, grain size = 1000, time = 2.74432

With 8 threads:
Without memory allocation, grain size = 1, time = 1.37712
Without memory allocation, grain size = 10, time = 1.46636
Without memory allocation, grain size = 100, time = 1.46375
Without memory allocation, grain size = 1000, time = 1.45783
With memory allocation, grain size = 1, time = 1.80873
With memory allocation, grain size = 10, time = 1.81393
With memory allocation, grain size = 100, time = 1.8269
With memory allocation, grain size = 1000, time = 1.84683

Summary of results:
4 threads, without allocation, grain size = 1, speedup = 3.23012
4 threads, without allocation, grain size = 10, speedup = 3.31508
4 threads, without allocation, grain size = 100, speedup = 3.25442
4 threads, without allocation, grain size = 1000, speedup = 3.11986
4 threads, with allocation, grain size = 1, speedup = 3.30841
4 threads, with allocation, grain size = 10, speedup = 3.42198
4 threads, with allocation, grain size = 100, speedup = 3.35207
4 threads, with allocation, grain size = 1000, speedup = 3.15765
8 threads, without allocation, grain size = 1, speedup = 3.97013
8 threads, without allocation, grain size = 10, speedup = 3.81411
8 threads, without allocation, grain size = 100, speedup = 3.8568
8 threads, without allocation, grain size = 1000, speedup = 3.78598
8 threads, with allocation, grain size = 1, speedup = 4.73232
8 threads, with allocation, grain size = 10, speedup = 5.00116
8 threads, with allocation, grain size = 100, speedup = 4.81182
8 threads, with allocation, grain size = 1000, speedup = 4.69213

ProfFan · 2019-10-01T16:47:36Z

The UBSan errors above are a problem in TBB itself; see RcppCore/RcppParallel#36.

In light of the code quality, I strongly believe this is an issue with the TBB supplied by Ubuntu 16.04.

@izzys Could you help us reproduce this bug on our side? We need your environment, TBB version, compile command line, etc. Many thanks!

acxz · 2020-05-11T03:13:56Z

I can reproduce the issue:
mkdir build && cd build && cmake .. && make TimeTBB

gist

Hardware: Intel i7-7500U (2) @ 3.5GHz (having only two cores probably affects the times at higher thread counts)
OS: Arch Linux
TBB: 2020.2
GCC: 9.3
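
Because this machine has only two cores, asking for 4 or 8 threads may oversubscribe it. A hedged sketch of capping the benchmarked thread count at what the hardware reports, using only the standard library; `usableThreads` is a hypothetical helper and not part of TimeTBB.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical helper: cap a requested thread count at the number of
// hardware threads. `available` may be 0 when the value cannot be
// determined, in which case the request is returned unchanged.
unsigned usableThreads(unsigned requested, unsigned available) {
  assert(requested > 0);
  if (available == 0) return requested;
  return std::min(requested, available);
}

// Convenience overload that queries the standard library. Note that
// std::thread::hardware_concurrency() counts logical (hyper-threaded)
// CPUs, which can be more than the number of physical cores.
unsigned usableThreads(unsigned requested) {
  return usableThreads(requested, std::thread::hardware_concurrency());
}
```

For example, `usableThreads(8, 4)` returns 4. Thread counts above that value are still valid to run, but their timings mostly measure scheduling overhead rather than extra parallelism.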

ProfFan · 2020-05-11T04:17:24Z

I'll add this to my to-do list, but I'm not sure I'll really have time for it.

dellaert · 2020-05-11T05:07:22Z

@ProfFan you do not have time for this :-)
@acxz if you're motivated: this particular example might not be the best benchmark; the SolverComparer benchmark is probably a better one.


zzodo · 2024-06-10T11:49:02Z

Any updates on this issue?
I can still reproduce this on Ubuntu 22.04 LTS with the system-default TBB (2021.5), on both the 4.2.0 and develop branches.
The test below was run on the develop branch.

$ ./examples/TimeTBB 
numberOfProblems = 1000000
problemSize = 4
With 1 threads:
Without memory allocation, grain size = 1, time = 0.332967
Without memory allocation, grain size = 10, time = 0.328845
Without memory allocation, grain size = 100, time = 0.328481
Without memory allocation, grain size = 1000, time = 0.328192
With memory allocation, grain size = 1, time = 0.369558
With memory allocation, grain size = 10, time = 0.369653
With memory allocation, grain size = 100, time = 0.368168
With memory allocation, grain size = 1000, time = 0.368071

With 4 threads:
Without memory allocation, grain size = 1, time = 2.13116
Without memory allocation, grain size = 10, time = 2.10212
Without memory allocation, grain size = 100, time = 2.11296
Without memory allocation, grain size = 1000, time = 2.11572
With memory allocation, grain size = 1, time = 2.39639
With memory allocation, grain size = 10, time = 2.40664
With memory allocation, grain size = 100, time = 2.43013
With memory allocation, grain size = 1000, time = 2.43989

With 8 threads:
Without memory allocation, grain size = 1, time = 3.15854
Without memory allocation, grain size = 10, time = 3.17693
Without memory allocation, grain size = 100, time = 3.17387
Without memory allocation, grain size = 1000, time = 3.17985
With memory allocation, grain size = 1, time = 3.45604
With memory allocation, grain size = 10, time = 3.50903
With memory allocation, grain size = 100, time = 3.51825
With memory allocation, grain size = 1000, time = 3.52622

Summary of results:
4 threads, without allocation, grain size = 1, speedup = 0.156237
4 threads, without allocation, grain size = 10, speedup = 0.156435
4 threads, without allocation, grain size = 100, speedup = 0.15546
4 threads, without allocation, grain size = 1000, speedup = 0.155121
4 threads, with allocation, grain size = 1, speedup = 0.154214
4 threads, with allocation, grain size = 10, speedup = 0.153597
4 threads, with allocation, grain size = 100, speedup = 0.151501
4 threads, with allocation, grain size = 1000, speedup = 0.150855
8 threads, without allocation, grain size = 1, speedup = 0.105418
8 threads, without allocation, grain size = 10, speedup = 0.10351
8 threads, without allocation, grain size = 100, speedup = 0.103496
8 threads, without allocation, grain size = 1000, speedup = 0.10321
8 threads, with allocation, grain size = 1, speedup = 0.106931
8 threads, with allocation, grain size = 10, speedup = 0.105343
8 threads, with allocation, grain size = 100, speedup = 0.104645
8 threads, with allocation, grain size = 1000, speedup = 0.104381

GTSAM build information:

$ sudo cmake ..
-- GTSAM is a shared library due to GTSAM_FORCE_SHARED_LIB
-- GTSAM_POSE3_EXPMAP=ON, enabling GTSAM_ROT3_EXPMAP as well
-- Found Eigen version: 3.3.7
-- checking for thread-local storage - found
-- Could NOT find MKL (missing: MKL_INCLUDE_DIR MKL_LIBRARIES) 
-- Found Google perftools: 
-- Building 3rdparty
-- Could NOT find GeographicLib (missing: GeographicLib_LIBRARY_DIRS GeographicLib_LIBRARIES GeographicLib_INCLUDE_DIRS) 
-- Building base
-- Building basis
-- Building geometry
-- Building inference
-- Building symbolic
-- Building discrete
-- Building hybrid
-- Building linear
-- Building nonlinear
-- Building sam
-- Building sfm
-- Building slam
-- Building navigation
-- GTSAM Version: 4.3a0
-- Install prefix: /usr/local
-- Building GTSAM - as a SHARED library
-- Wrote /opt/gtsam/build/GTSAMConfig.cmake
-- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE) 
-- ===============================================================
-- ================  Configuration Options  ======================
--  CMAKE_CXX_COMPILER_ID type                       : GNU
--  CMAKE_CXX_COMPILER_VERSION                       : 11.4.0
--  CMake version                                    : 3.22.1
--  CMake generator                                  : Unix Makefiles
--  CMake build tool                                 : /usr/bin/gmake
-- Build flags                                               
--  Build Tests                                      : Disabled
--  Build examples with 'make all'                   : Disabled
--  Build timing scripts with 'make all'             : Disabled
--  Build shared GTSAM libraries                     : Enabled
--  Put build type in library name                   : Enabled
--  Build libgtsam_unstable                          : Disabled
--  Build GTSAM unstable Python                      : Disabled
--  Build MATLAB Toolbox for unstable                : Disabled
--  Build for native architecture                    : Disabled
--  Build type                                       : Release
--  C compilation flags                              :  -O3 -DNDEBUG
--  C++ compilation flags                            :  -O3 -DNDEBUG
--  Enable Boost serialization                       : ON
--  GTSAM_COMPILE_FEATURES_PUBLIC                    : cxx_std_17
--  GTSAM_COMPILE_OPTIONS_PUBLIC                     : 
--  GTSAM_COMPILE_DEFINITIONS_PUBLIC                 : 
--  GTSAM_COMPILE_OPTIONS_PUBLIC_RELEASE             : 
--  GTSAM_COMPILE_DEFINITIONS_PUBLIC_RELEASE         : 
--  Use System Eigen                                 : ON (Using version: 3.3.7)
--  Use System Metis                                 : OFF
--  Using Boost version                              : 1.74.0
--  Use Intel TBB                                    : Yes (Version: 2021.5.0)
--  Eigen will use MKL                               : MKL not found
--  Eigen will use MKL and OpenMP                    : OpenMP found but GTSAM_WITH_EIGEN_MKL is disabled
--  Default allocator                                : TBB
--  Cheirality exceptions enabled                    : YES
--  Build with ccache                                : No
-- Packaging flags
--  CPack Source Generator                           : TGZ
--  CPack Generator                                  : TGZ
-- GTSAM flags                                               
--  Quaternions as default Rot3                      : Disabled
--  Runtime consistency checking                     : Disabled
--  Build with Memory Sanitizer                      : Disabled
--  Rot3 retract is full ExpMap                      : Enabled
--  Pose3 retract is full ExpMap                     : Enabled
--  Enable branch merging in DecisionTree            : Enabled
--  Allow features deprecated in GTSAM 4.3           : Enabled
--  Metis-based Nested Dissection                    : Enabled
--  Use tangent-space preintegration                 : Enabled
-- MATLAB toolbox flags
--  Install MATLAB toolbox                           : Disabled
-- Python toolbox flags                                      
--  Build Python module with pybind                  : Disabled
-- ===============================================================
-- Configuring done
-- Generating done
-- Build files have been written to: /opt/gtsam/build

zzodo · 2024-06-10T12:57:25Z

Here is another example, using the SolverComparer benchmark mentioned above:

$ ./examples/SolverComparer --incremental -d w10000 -o w_inc --threads 8
Loading dataset w10000
Using 8 threads
Looking for first measurement from step 0
Looks like 0 is the first time step, so adding a prior on it
Playing forward time steps...
chi2 = -nan
Step 0
-Total: 0 CPU (0 times, 0 wall, 0 children, min: 0 max: 0)
|   -Collect measurements: 0 CPU (1 times, 2e-06 wall, 0 children, min: 0 max: 0)
|   -Update ISAM2: 0 CPU (1 times, 2e-06 wall, 0 children, min: 0 max: 0)
|   -chi2: 0 CPU (1 times, 3.4e-05 wall, 0 children, min: 0 max: 0)
chi2 = 0.00172843
Step 1000
-Total: 0 CPU (0 times, 0 wall, 0.71 children, min: 0 max: 0)
|   -Collect measurements: 0.08 CPU (1001 times, 0.030624 wall, 0.08 children, min: 0 max: 0.01)
|   -Update ISAM2: 0.63 CPU (1001 times, 0.11614 wall, 0.63 children, min: 0 max: 0.01)
|   -chi2: 0 CPU (2 times, 0.000611 wall, 0 children, min: 0 max: 0)
chi2 = 0.00175299
Step 2000
-Total: 0 CPU (0 times, 0 wall, 1.85 children, min: 0 max: 0)
|   -Collect measurements: 0.18 CPU (2001 times, 0.093793 wall, 0.18 children, min: 0 max: 0.01)
|   -Update ISAM2: 1.67 CPU (2001 times, 0.334617 wall, 1.67 children, min: 0 max: 0.02)
|   -chi2: 0 CPU (3 times, 0.001946 wall, 0 children, min: 0 max: 0)
chi2 = 0.00177148
Step 3000
-Total: 0 CPU (0 times, 0 wall, 4.29 children, min: 0 max: 0)
|   -Collect measurements: 0.52 CPU (3001 times, 0.358602 wall, 0.52 children, min: 0 max: 0.01)
|   -Update ISAM2: 3.77 CPU (3001 times, 0.948901 wall, 3.77 children, min: 0 max: 0.02)
|   -chi2: 0 CPU (4 times, 0.005088 wall, 0 children, min: 0 max: 0)
chi2 = 0.00177683
Step 4000
-Total: 0 CPU (0 times, 0 wall, 7.88 children, min: 0 max: 0)
|   -Collect measurements: 1.09 CPU (4001 times, 0.882309 wall, 1.09 children, min: 0 max: 0.02)
|   -Update ISAM2: 6.78 CPU (4001 times, 1.9246 wall, 6.78 children, min: 0 max: 0.05)
|   -chi2: 0.01 CPU (5 times, 0.008837 wall, 0.01 children, min: 0.01 max: 0.01)
chi2 = 0.00177331
Step 5000
-Total: 0 CPU (0 times, 0 wall, 11.41 children, min: 0 max: 0)
|   -Collect measurements: 1.47 CPU (5001 times, 1.19369 wall, 1.47 children, min: 0 max: 0.02)
|   -Update ISAM2: 9.93 CPU (5001 times, 2.90427 wall, 9.93 children, min: 0.01 max: 0.05)
|   -chi2: 0.01 CPU (6 times, 0.014129 wall, 0.01 children, min: 0 max: 0.01)
chi2 = 0.00178298
Step 6000
-Total: 0 CPU (0 times, 0 wall, 16.1 children, min: 0 max: 0)
|   -Collect measurements: 2.55 CPU (6001 times, 2.11069 wall, 2.55 children, min: 0 max: 0.02)
|   -Update ISAM2: 13.54 CPU (6001 times, 4.45979 wall, 13.54 children, min: 0.01 max: 0.09)
|   -chi2: 0.01 CPU (7 times, 0.022692 wall, 0.01 children, min: 0 max: 0.01)
chi2 = 0.00177962
Step 7000
-Total: 0 CPU (0 times, 0 wall, 19.68 children, min: 0 max: 0)
|   -Collect measurements: 3.37 CPU (7001 times, 2.9156 wall, 3.37 children, min: 0 max: 0.02)
|   -Update ISAM2: 16.29 CPU (7001 times, 5.65427 wall, 16.29 children, min: 0 max: 0.11)
|   -chi2: 0.02 CPU (8 times, 0.029358 wall, 0.02 children, min: 0.01 max: 0.01)
chi2 = 0.00177708
Step 8000
-Total: 0 CPU (0 times, 0 wall, 23.28 children, min: 0 max: 0)
|   -Collect measurements: 4.16 CPU (8001 times, 3.72301 wall, 4.16 children, min: 0 max: 0.02)
|   -Update ISAM2: 19.09 CPU (8001 times, 6.79453 wall, 19.09 children, min: 0.01 max: 0.11)
|   -chi2: 0.03 CPU (9 times, 0.041096 wall, 0.03 children, min: 0.01 max: 0.01)
chi2 = 0.00177835
Step 9000
-Total: 0 CPU (0 times, 0 wall, 29.51 children, min: 0 max: 0)
|   -Collect measurements: 6.08 CPU (9001 times, 5.58775 wall, 6.08 children, min: 0 max: 0.02)
|   -Update ISAM2: 23.38 CPU (9001 times, 9.15137 wall, 23.38 children, min: 0 max: 0.17)
|   -chi2: 0.05 CPU (10 times, 0.055059 wall, 0.05 children, min: 0.02 max: 0.02)
Writing output file w_inc
unregistered class - derived class not registered or exported

$ ./examples/SolverComparer --incremental -d w10000 -o w_inc --threads 4
Loading dataset w10000
Using 4 threads
Looking for first measurement from step 0
Looks like 0 is the first time step, so adding a prior on it
Playing forward time steps...
chi2 = -nan
Step 0
-Total: 0 CPU (0 times, 0 wall, 0 children, min: 0 max: 0)
|   -Collect measurements: 0 CPU (1 times, 1e-06 wall, 0 children, min: 0 max: 0)
|   -Update ISAM2: 0 CPU (1 times, 1e-06 wall, 0 children, min: 0 max: 0)
|   -chi2: 0 CPU (1 times, 3.3e-05 wall, 0 children, min: 0 max: 0)
chi2 = 0.00172843
Step 1000
-Total: 0 CPU (0 times, 0 wall, 0.36 children, min: 0 max: 0)
|   -Collect measurements: 0.07 CPU (1001 times, 0.030023 wall, 0.07 children, min: 0 max: 0.01)
|   -Update ISAM2: 0.29 CPU (1001 times, 0.108681 wall, 0.29 children, min: 0 max: 0.01)
|   -chi2: 0 CPU (2 times, 0.00063 wall, 0 children, min: 0 max: 0)
chi2 = 0.00175299
Step 2000
-Total: 0 CPU (0 times, 0 wall, 0.98 children, min: 0 max: 0)
|   -Collect measurements: 0.15 CPU (2001 times, 0.091805 wall, 0.15 children, min: 0 max: 0.01)
|   -Update ISAM2: 0.82 CPU (2001 times, 0.31337 wall, 0.82 children, min: 0 max: 0.01)
|   -chi2: 0.01 CPU (3 times, 0.001879 wall, 0.01 children, min: 0.01 max: 0.01)
chi2 = 0.00177148
Step 3000
-Total: 0 CPU (0 times, 0 wall, 2.5 children, min: 0 max: 0)
|   -Collect measurements: 0.37 CPU (3001 times, 0.355865 wall, 0.37 children, min: 0 max: 0.01)
|   -Update ISAM2: 2.11 CPU (3001 times, 0.910435 wall, 2.11 children, min: 0 max: 0.02)
|   -chi2: 0.02 CPU (4 times, 0.00504 wall, 0.02 children, min: 0.01 max: 0.01)
chi2 = 0.00177683
Step 4000
-Total: 0 CPU (0 times, 0 wall, 4.89 children, min: 0 max: 0)
|   -Collect measurements: 0.92 CPU (4001 times, 0.877368 wall, 0.92 children, min: 0 max: 0.01)
|   -Update ISAM2: 3.94 CPU (4001 times, 1.85701 wall, 3.94 children, min: 0 max: 0.04)
|   -chi2: 0.03 CPU (5 times, 0.008958 wall, 0.03 children, min: 0.01 max: 0.01)
chi2 = 0.00177331
Step 5000
-Total: 0 CPU (0 times, 0 wall, 7.08 children, min: 0 max: 0)
|   -Collect measurements: 1.16 CPU (5001 times, 1.18555 wall, 1.16 children, min: 0 max: 0.01)
|   -Update ISAM2: 5.88 CPU (5001 times, 2.80783 wall, 5.88 children, min: 0 max: 0.04)
|   -chi2: 0.04 CPU (6 times, 0.014088 wall, 0.04 children, min: 0.01 max: 0.01)
chi2 = 0.00178298
Step 6000
-Total: 0 CPU (0 times, 0 wall, 10.64 children, min: 0 max: 0)
|   -Collect measurements: 2.02 CPU (6001 times, 2.10637 wall, 2.02 children, min: 0 max: 0.01)
|   -Update ISAM2: 8.57 CPU (6001 times, 4.3782 wall, 8.57 children, min: 0.01 max: 0.09)
|   -chi2: 0.05 CPU (7 times, 0.023237 wall, 0.05 children, min: 0.01 max: 0.01)
chi2 = 0.00177962
Step 7000
-Total: 0 CPU (0 times, 0 wall, 13.63 children, min: 0 max: 0)
|   -Collect measurements: 2.9 CPU (7001 times, 2.91521 wall, 2.9 children, min: 0 max: 0.01)
|   -Update ISAM2: 10.67 CPU (7001 times, 5.61952 wall, 10.67 children, min: 0 max: 0.09)
|   -chi2: 0.06 CPU (8 times, 0.030003 wall, 0.06 children, min: 0.01 max: 0.01)
chi2 = 0.00177708
Step 8000
-Total: 0 CPU (0 times, 0 wall, 16.33 children, min: 0 max: 0)
|   -Collect measurements: 3.77 CPU (8001 times, 3.71799 wall, 3.77 children, min: 0 max: 0.02)
|   -Update ISAM2: 12.49 CPU (8001 times, 6.74208 wall, 12.49 children, min: 0.01 max: 0.09)
|   -chi2: 0.07 CPU (9 times, 0.041187 wall, 0.07 children, min: 0.01 max: 0.01)
chi2 = 0.00177835
Step 9000
-Total: 0 CPU (0 times, 0 wall, 21.79 children, min: 0 max: 0)
|   -Collect measurements: 5.86 CPU (9001 times, 5.61159 wall, 5.86 children, min: 0 max: 0.02)
|   -Update ISAM2: 15.85 CPU (9001 times, 9.07511 wall, 15.85 children, min: 0.01 max: 0.11)
|   -chi2: 0.08 CPU (10 times, 0.055581 wall, 0.08 children, min: 0.01 max: 0.01)
Writing output file w_inc
unregistered class - derived class not registered or exported
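
One way to read the two SolverComparer logs above is to divide the accumulated CPU time of a timed section by its wall time. Assuming the CPU figure is summed over all threads (an assumption, though the logs are consistent with it since CPU exceeds wall), the ratio approximates how many cores were busy on average. `averageBusyCores` below is a hypothetical helper, not part of GTSAM.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: accumulated CPU seconds divided by wall seconds.
// Assuming the CPU figure is summed over all threads, this approximates
// the average number of busy cores during the timed section.
double averageBusyCores(double cpuSeconds, double wallSeconds) {
  assert(wallSeconds > 0.0);
  return cpuSeconds / wallSeconds;
}
```

For the `Update ISAM2` line at step 9000, `averageBusyCores(23.38, 9.15137)` is about 2.6 with 8 threads and `averageBusyCores(15.85, 9.07511)` is about 1.7 with 4 threads, while the wall times are nearly identical (9.15 s versus 9.08 s). In these runs the extra threads consume more CPU without finishing sooner.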

My laptop has a hybrid CPU (Intel Core i7-13700H). I also tried TBB version 2021.12, which is newer than v2021.9.0, the release announced as compatible with hybrid CPUs.

dellaert assigned MandyXie Jul 30, 2019

ProfFan added this to the GTSAM 4.1 milestone Sep 23, 2019

dellaert mentioned this issue Jul 7, 2020 in 4.0.3 Release Tracking Issue #357 (closed, 26 tasks)

varunagrawal modified the milestones: GTSAM 4.1, GTSAM 4.0.3 Jul 13, 2020
