# Reference
- https://web.stanford.edu/class/cs107/guide/callgrind.html
- https://waterprogramming.wordpress.com/2017/06/08/profiling-c-code-with-callgrind/

Remove any callgrind output files left over from earlier runs.

In [14]:
rm callgrind.out*

In [15]:
ls

cpuload      hello_world.cpp     try_valg01.out         valgrind_output.ipynb
cpuload.cpp  try_gprof_01.ipynb  try_valgrind_01.ipynb
hello_world  try_valg01.cpp      try_valgrind_02.ipynb


Create a program

In [5]:
cat > hello_world.cpp << EOF
#include <iostream>
using namespace std;

int main() 
{
    cout << "Hello, World!";
    return 0;
}

EOF

Compile and run it.

In [6]:
g++ hello_world.cpp -o hello_world

In [8]:
./hello_world

Hello, World!

# Profiling

In [16]:
valgrind --tool=callgrind ./hello_world

==8038== Callgrind, a call-graph generating cache profiler
==8038== Copyright (C) 2002-2017, and GNU GPL'd, by Josef Weidendorfer et al.
==8038== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==8038== Command: ./hello_world
==8038== 
==8038== For interactive control, run 'callgrind_control -h'.
Hello, World!==8038== 
==8038== Events    : Ir
==8038== Collected : 2144907
==8038== 
==8038== I   refs:      2,144,907


In [17]:
ls

callgrind.out.8038  hello_world.cpp     try_valgrind_01.ipynb
cpuload             try_gprof_01.ipynb  try_valgrind_02.ipynb
cpuload.cpp         try_valg01.cpp      valgrind_output.ipynb
hello_world         try_valg01.out


# Profiling --simulate-cache=yes

To also record cache hits and misses, run callgrind with the `--simulate-cache=yes` option (the current Valgrind manual documents this option as `--cache-sim=yes`).

In [18]:
valgrind --tool=callgrind --simulate-cache=yes ./hello_world

==8040== Callgrind, a call-graph generating cache profiler
==8040== Copyright (C) 2002-2017, and GNU GPL'd, by Josef Weidendorfer et al.
==8040== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==8040== Command: ./hello_world
==8040== 
==8040== For interactive control, run 'callgrind_control -h'.
Hello, World!==8040== 
==8040== Events    : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw
==8040== Collected : 2144907 519841 164633 1689 13562 2410 1611 7778 1580
==8040== 
==8040== I   refs:      2,144,907
==8040== I1  misses:        1,689
==8040== LLi misses:        1,611
==8040== I1  miss rate:      0.08%
==8040== LLi miss rate:      0.08%
==8040== 
==8040== D   refs:        684,474  (519,841 rd + 164,633 wr)
==8040== D1  misses:       15,972  ( 13,562 rd +   2,410 wr)
==8040== LLd misses:        9,358  (  7,778 rd +   1,580 wr)
==8040== D1  miss rate:       2.3% (    2.6%   +     1.5%  )
==8040== LLd miss rate:       1.4% (    1.5%   +     1.0%  )
==8040== 
==8040== LL refs:  

In [19]:
ls

callgrind.out.8038  cpuload.cpp      try_gprof_01.ipynb  try_valgrind_01.ipynb
callgrind.out.8040  hello_world      try_valg01.cpp      try_valgrind_02.ipynb
cpuload             hello_world.cpp  try_valg01.out      valgrind_output.ipynb


# Read the output

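Callgrind writes its results to a file named `callgrind.out.<pid>`, where `<pid>` is the process ID of the profiled run. Use `callgrind_annotate` to turn that file into a readable report.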

In [20]:
callgrind_annotate --auto=yes callgrind.out.8038

--------------------------------------------------------------------------------
Profile data file 'callgrind.out.8038' (creator: callgrind-3.13.0)
--------------------------------------------------------------------------------
I1 cache: 
D1 cache: 
LL cache: 
Timerange: Basic block 0 - 356790
Trigger: Program termination
Profiled target:  ./hello_world (PID 8038, part 1)
Events recorded:  Ir
Events shown:     Ir
Event sort order: Ir
Thresholds:       99
Include dirs:     
User annotated:   
Auto-annotation:  on

--------------------------------------------------------------------------------
       Ir 
--------------------------------------------------------------------------------
2,144,907  PROGRAM TOTALS

--------------------------------------------------------------------------------
     Ir  file:function
--------------------------------------------------------------------------------
928,198  /build/glibc-OTsEL5/glibc-2.27/elf/dl-lookup.c:_dl_lookup_symbol_x [/lib/x86_64-linux-

  /build/glibc-OTsEL5/glibc-2.27/elf/dl-hwcaps.c
  /build/glibc-OTsEL5/glibc-2.27/string/../sysdeps/x86_64/multiarch/../strchr.S
  /build/glibc-OTsEL5/glibc-2.27/elf/dl-misc.c
  /build/glibc-OTsEL5/glibc-2.27/string/../sysdeps/x86_64/multiarch/../strlen.S
  /build/glibc-OTsEL5/glibc-2.27/elf/get-dynamic-info.h
  /build/glibc-OTsEL5/glibc-2.27/string/../sysdeps/x86_64/strcmp.S
  /build/glibc-OTsEL5/glibc-2.27/misc/../sysdeps/unix/sysv/linux/mmap64.c
  /build/glibc-OTsEL5/glibc-2.27/wcsmbs/./wcsmbsload.h
  /build/glibc-OTsEL5/glibc-2.27/string/../bits/stdlib-bsearch.h
  /build/glibc-OTsEL5/glibc-2.27/iconv/gconv_simple.c
  /build/glibc-OTsEL5/glibc-2.27/malloc/malloc.c
  /build/glibc-OTsEL5/glibc-2.27/elf/dl-sort-maps.c
  /build/glibc-OTsEL5/glibc-2.27/elf/dl-object.c
  /build/glibc-OTsEL5/glibc-2.27/elf/dl-minimal.c
  /build/glibc-OTsEL5/glibc-2.27/string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
  /build/glibc-OTsEL5/glibc-2.27/elf/dl-deps.c
  /build/glibc-OTsEL5/glibc-2.

The Ir count is the number of machine instructions executed. A single C statement can translate to one, two, or several instructions.

```
--------------------------------------------------------------------------------
       Ir 
--------------------------------------------------------------------------------
2,144,907  PROGRAM TOTALS
```
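To put a per-function Ir count in context, it can help to express it as a share of the program total. The sketch below does that arithmetic with `awk`, using the two numbers from the report above (the program total and the `_dl_lookup_symbol_x` line). The variable names are illustrative only.

```shell
# Ir counts copied from the callgrind_annotate report above.
total_ir=2144907      # PROGRAM TOTALS
lookup_ir=928198      # _dl_lookup_symbol_x

# Share of all executed instructions attributed to that one function, in percent.
share=$(awk -v part="$lookup_ir" -v total="$total_ir" \
    'BEGIN { printf "%.1f", 100 * part / total }')

echo "_dl_lookup_symbol_x: ${share}% of Ir"
```

For this hello-world program the dynamic loader's symbol lookup accounts for a large part of the instructions, which is why glibc files dominate the report.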

In [21]:
callgrind_annotate --auto=yes callgrind.out.8040

--------------------------------------------------------------------------------
Profile data file 'callgrind.out.8040' (creator: callgrind-3.13.0)
--------------------------------------------------------------------------------
I1 cache: 32768 B, 64 B, 8-way associative
D1 cache: 32768 B, 64 B, 8-way associative
LL cache: 8388608 B, 64 B, 16-way associative
Timerange: Basic block 0 - 356790
Trigger: Program termination
Profiled target:  ./hello_world (PID 8040, part 1)
Events recorded:  Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw
Events shown:     Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw
Event sort order: Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw
Thresholds:       99 0 0 0 0 0 0 0 0
Include dirs:     
User annotated:   
Auto-annotation:  on

--------------------------------------------------------------------------------
       Ir      Dr      Dw  I1mr   D1mr  D1mw  ILmr  DLmr  DLmw 
--------------------------------------------------------------------------------
2,144,910 519,841 164,633 1,690

    582       0      0    1     0     0    1     .    .  /build/glibc-OTsEL5/glibc-2.27/string/../bits/stdlib-bsearch.h:intel_check_word.isra.0
    575     140     65    9     2     0    9     1    .  /build/glibc-OTsEL5/glibc-2.27/stdlib/cxa_finalize.c:__cxa_finalize [/lib/x86_64-linux-gnu/libc-2.27.so]
    575      73     81   23     0     5   23     0    5  /build/glibc-OTsEL5/glibc-2.27/elf/dl-hwcaps.c:_dl_important_hwcaps [/lib/x86_64-linux-gnu/ld-2.27.so]
    560     658    714    4     0     7    4     0    1  /build/glibc-OTsEL5/glibc-2.27/elf/../sysdeps/x86_64/dl-trampoline.h:_dl_runtime_resolve_xsave [/lib/x86_64-linux-gnu/ld-2.27.so]
    556     117     96    3     0     5    3     0    5  /build/glibc-OTsEL5/glibc-2.27/misc/../sysdeps/unix/sysv/linux/mmap64.c:mmap [/lib/x86_64-linux-gnu/ld-2.27.so]
    540     210    146    8     4     0    8     1    .  ???:__cxxabiv1::__vmi_class_type_info::__do_dyncast(long, __cxxabiv1::__class_type_info::__sub_kind, __cxxabiv1::__class_

- Ir: I cache reads (instructions executed)
- I1mr: I1 cache read misses (instruction wasn't in the I1 cache but was in the LL cache)
- ILmr: LL cache instruction read misses (instruction wasn't in the I1 or LL cache, had to be fetched from memory)
- Dr: D cache reads (memory reads)
- D1mr: D1 cache read misses (data location not in the D1 cache, but in the LL cache)
- DLmr: LL cache data read misses (location not in the D1 or LL cache)
- Dw: D cache writes (memory writes)
- D1mw: D1 cache write misses (location not in the D1 cache, but in the LL cache)
- DLmw: LL cache data write misses (location not in the D1 or LL cache)

LL is the last-level cache. Some older guides call these events I2mr, D2mr, and D2mw and refer to the cache as L2.
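The miss rates in the valgrind summary follow directly from these event counts: misses divided by references. As a sketch, the arithmetic below recomputes three of them from the `Collected` line of the `--simulate-cache=yes` run shown earlier. The shell variable names are just the event names and are not something callgrind exports.

```shell
# Event counts copied from the "Collected" line of the cache-simulation run.
Ir=2144907; Dr=519841; Dw=164633
I1mr=1689;  D1mr=13562; D1mw=2410
DLmr=7778;  DLmw=1580

# I1 miss rate = instruction read misses / instructions executed.
i1_rate=$(awk -v m="$I1mr" -v r="$Ir" 'BEGIN { printf "%.2f", 100 * m / r }')

# Data rates combine reads and writes: (read misses + write misses) / (reads + writes).
d1_rate=$(awk -v m="$((D1mr + D1mw))" -v r="$((Dr + Dw))" \
    'BEGIN { printf "%.1f", 100 * m / r }')
lld_rate=$(awk -v m="$((DLmr + DLmw))" -v r="$((Dr + Dw))" \
    'BEGIN { printf "%.1f", 100 * m / r }')

echo "I1  miss rate: ${i1_rate}%"
echo "D1  miss rate: ${d1_rate}%"
echo "LLd miss rate: ${lld_rate}%"
```

The results agree with the `I1 miss rate`, `D1 miss rate`, and `LLd miss rate` lines that valgrind printed for PID 8040.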

# Profiling --toggle-collect=function_name

To limit callgrind to counting events only inside a named function (and the functions it calls), use the `--toggle-collect=function_name` option.
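As a minimal sketch of `--toggle-collect`, the cell below writes a small test program, builds it, and profiles only one function. The file name `toggle_demo.cpp` and the function `sum_squares` are hypothetical and not part of the notebook above. The function is declared `extern "C"` on the assumption that this keeps its symbol name plain, since the name callgrind sees for an ordinary C++ function generally includes its parameter list.

```shell
# Hypothetical example program: sum_squares is the function to profile.
# extern "C" keeps its symbol name free of C++ name decoration.
cat > toggle_demo.cpp << 'EOF'
#include <iostream>

extern "C" long sum_squares(long n)
{
    long total = 0;
    for (long i = 1; i <= n; ++i)
        total += i * i;
    return total;
}

int main()
{
    std::cout << sum_squares(1000) << "\n";
    return 0;
}
EOF

if command -v g++ >/dev/null 2>&1 && command -v valgrind >/dev/null 2>&1; then
    # Build with debug info and without optimization so the call is not inlined.
    g++ -g -O0 toggle_demo.cpp -o toggle_demo

    # Collect events only while sum_squares (and anything it calls) is running.
    valgrind --tool=callgrind --toggle-collect=sum_squares ./toggle_demo
else
    echo "g++ and valgrind are both needed for the profiling step" >&2
fi
```

The resulting `callgrind.out.<pid>` file can be read with `callgrind_annotate` in the same way as the earlier runs. Its Ir total should be far smaller than the whole-program total, because program startup and the dynamic loader are no longer counted.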