Commits
multi-tcg-v2-2…
Name already in use
Commits on Jul 12, 2017
-
qht: return existing entry when qht_insert fails
The meaning of "existing" is now changed to "matches in hash and ht->cmp result". This is saner than just checking the pointer value. Note that we do not just return void * (returning the inserted pointer or the existing pointer) because it could be that the existing entry is the same pointer we wanted to insert. So we still need another return parameter (the bool) to know for sure that the insertion succeeded. Signed-off-by: Emilio G. Cota <cota@braap.org>
-
qht: require a default comparison function
qht_lookup now uses the default cmp function. qht_lookup_custom is defined to retain the old behaviour, that is a cmp function is explicitly provided. qht_insert will gain use the default cmp in the next patch. Signed-off-by: Emilio G. Cota <cota@braap.org>
-
gen-icount: fold exitreq_label into TCGContext
Before we make TCGContext thread-local. Reviewed-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Emilio G. Cota <cota@braap.org>
-
tcg: take .helpers out of TCGContext
Before TCGContext is made thread-local. The hash table becomes read-only after it is filled in, so we can save space by keeping just a global pointer to it. Reviewed-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Emilio G. Cota <cota@braap.org>
-
tcg: take tb_ctx out of TCGContext
Before TCGContext is made thread-local. Reviewed-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Emilio G. Cota <cota@braap.org>
-
translate-all: report correct avg host TB size
Since commit 6e3b2bf ("tcg: allocate TB structs before the corresponding translated code") we are not fully utilizing code_gen_buffer for translated code, and therefore are incorrectly reporting the amount of translated code as well as the average host TB size. Address this by: - Making the conscious choice of misreporting the total translated code; doing otherwise would mislead users into thinking "-tb-size" is not honoured. - Expanding tb_tree_stats to accurately count the bytes of translated code on the host, and using this for reporting the average tb host size, as well as the expansion ratio. In the future we might want to consider reporting the accurate numbers for the total translated code, together with a "bookkeeping/overhead" field to account for the TB structs. Signed-off-by: Emilio G. Cota <cota@braap.org>
-
exec-all: rename tb_free to tb_remove
We don't really free anything in this function anymore; we just remove the TB from the binary search tree. Suggested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Emilio G. Cota <cota@braap.org>
-
translate-all: use a binary search tree to track TBs in TBContext
This is a prerequisite for having threads generate code on separate buffers, which will help scalability when booting multiple cores under MTTCG. For this we need a new field (.size) in struct tb_tc to keep track of the size of the translated code. This field adds a 4-byte hole to the struct (and therefore to TranslationBlock), but we can live with that. The comparison function we use is optimized for the common case: insertions. Profiling shows that upon booting debian-arm, 98% of comparisons are between existing tb's (i.e. a->size and b->size are both !0), which happens during insertions (and removals, but those are rare). The remaining cases are lookups. From reading the glib sources we see that the first key is always the lookup key. However, the code does not assume this to always be the case because this behaviour is not guaranteed in the glib docs. However, we embed this knowledge in the code as a branch hint for the compiler. Note that tb_free does not free space in the code_gen_buffer anymore, since we cannot easily know whether the tb is the last one inserted in code_gen_buffer. The next patch in this series renames tb_free to tb_remove to reflect this. Performance-wise, lookups in tb_find_pc are the same as before: O(log n). However, insertions are O(log n) instead of O(1), which results in a small slowdown when booting debian-arm: Performance counter stats for 'build/arm-softmmu/qemu-system-arm \ -machine type=virt -nographic -smp 1 -m 4096 \ -netdev user,id=unet,hostfwd=tcp::2222-:22 \ -device virtio-net-device,netdev=unet \ -drive file=img/arm/jessie-arm32.qcow2,id=myblock,index=0,if=none \ -device virtio-blk-device,drive=myblock \ -kernel img/arm/aarch32-current-linux-kernel-only.img \ -append console=ttyAMA0 root=/dev/vda1 \ -name arm,debug-threads=on -smp 1' (10 runs): - Before: 8048.598422 task-clock (msec) # 0.931 CPUs utilized ( +- 0.28% ) 16,974 context-switches # 0.002 M/sec ( +- 0.12% ) 0 cpu-migrations # 0.000 K/sec 10,125 page-faults # 0.001 M/sec ( +- 1.23% ) 35,144,901,879 cycles # 4.367 GHz ( +- 0.14% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 65,758,252,643 instructions # 1.87 insns per cycle ( +- 0.33% ) 10,871,298,668 branches # 1350.707 M/sec ( +- 0.41% ) 192,322,212 branch-misses # 1.77% of all branches ( +- 0.32% ) 8.640869419 seconds time elapsed ( +- 0.57% ) - After: 8146.242027 task-clock (msec) # 0.923 CPUs utilized ( +- 1.23% ) 17,016 context-switches # 0.002 M/sec ( +- 0.40% ) 0 cpu-migrations # 0.000 K/sec 18,769 page-faults # 0.002 M/sec ( +- 0.45% ) 35,660,956,120 cycles # 4.378 GHz ( +- 1.22% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 65,095,366,607 instructions # 1.83 insns per cycle ( +- 1.73% ) 10,803,480,261 branches # 1326.192 M/sec ( +- 1.95% ) 195,601,289 branch-misses # 1.81% of all branches ( +- 0.39% ) 8.828660235 seconds time elapsed ( +- 0.38% ) Signed-off-by: Emilio G. Cota <cota@braap.org> -
exec-all: extract tb->tc_* into a separate struct tc_tb
In preparation for adding tc.size to be able to keep track of TB's using the binary search tree implementation from glib. Signed-off-by: Emilio G. Cota <cota@braap.org>
-
translate-all: define and use DEBUG_TB_CHECK_GATE
This prevents bit rot by ensuring the debug code is compiled when building a user-mode target. Unfortunately the helpers are user-mode-only so we cannot fully get rid of the ifdef checks. Add a comment to explain this. Suggested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Emilio G. Cota <cota@braap.org>
-
translate-all: define and use DEBUG_TB_INVALIDATE_GATE
This gets rid of an ifdef check while ensuring that the debug code is compiled, which prevents bit rot. Suggested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Emilio G. Cota <cota@braap.org>
-
translate-all: define and use DEBUG_TB_FLUSH_GATE
This gets rid of some ifdef checks while ensuring that the debug code is compiled, which prevents bit rot. Suggested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Emilio G. Cota <cota@braap.org>
-
tcg: bring parallel_cpus to tb->cflags and use it for TB hashing
This allows us to avoid flushing TB's when parallel_cpus is set. Note that the declaration of parallel_cpus is brought to exec-all.h to be able to define there the inlines. The inlines use an unnecessary temp variable that is there just to make it easier to add more bits to the mask in the future. Signed-off-by: Emilio G. Cota <cota@braap.org>
-
cpu-exec: make tb_htable_lookup static
No need to export it now that we have tb_lookup__cpu_state. Signed-off-by: Emilio G. Cota <cota@braap.org>
-
tcg: consolidate TB lookups in tb_lookup__cpu_state
This avoids duplicating code. cpu_exec_step will also use the new common function once we integrate parallel_cpus into tb->cflags. Signed-off-by: Emilio G. Cota <cota@braap.org>
-
tcg: remove addr argument from lookup_tb_ptr
It is unlikely that we will ever want to call this helper passing an argument other than the current PC. So just remove the argument, and use the pc we already get from cpu_get_tb_cpu_state. This change paves the way to having a common "tb_lookup" function. Signed-off-by: Emilio G. Cota <cota@braap.org>
-
exec-all: bring tb->invalid into tb->cflags
This gets rid of a hole in struct TranslationBlock. Signed-off-by: Emilio G. Cota <cota@braap.org>
-
translate-all: guarantee that tb_hash only holds valid TBs
This gets rid of the need to check the tb->invalid bit during lookups. After this change we do not need atomics to operate on tb->invalid: setting and checking its value is serialised with tb_lock. Signed-off-by: Emilio G. Cota <cota@braap.org>
-
tcg/mips: constify tcg_target_callee_save_regs
Reviewed-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Signed-off-by: Emilio G. Cota <cota@braap.org>
-
tcg/i386: constify tcg_target_callee_save_regs
Reviewed-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Signed-off-by: Emilio G. Cota <cota@braap.org>
-
cpu-exec: rename have_tb_lock to acquired_tb_lock in tb_find
Reusing the have_tb_lock name, which is also defined in translate-all.c, makes code reviewing unnecessarily harder. Avoid potential confusion by renaming the local have_tb_lock variable to something else. Signed-off-by: Emilio G. Cota <cota@braap.org>
-
translate-all: make have_tb_lock static
It is only used by this object, and it's not exported to any other. Reviewed-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Emilio G. Cota <cota@braap.org>
-
exec-all: fix typos in TranslationBlock's documentation
Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org>
-
tcg: fix corruption of code_time profiling counter upon tb_flush
Whenever there is an overflow in code_gen_buffer (e.g. we run out of space in it and have to flush it), the code_time profiling counter ends up with an invalid value (that is, code_time -= profile_getclock(), without later on getting += profile_getclock() due to the goto). Fix it by using the ti variable, so that we only update code_time when there is no overflow. Note that in case there is an overflow we fail to account for the elapsed coding time, but this is quite rare so we can probably live with it. "info jit" before/after, roughly at the same time during debian-arm bootup: - before: Statistics: TB flush count 1 TB invalidate count 4665 TLB flush count 998 JIT cycles -615191529184601 (-256329.804 s at 2.4 GHz) translated TBs 302310 (aborted=0 0.0%) avg ops/TB 48.4 max=438 deleted ops/TB 8.54 avg temps/TB 32.31 max=38 avg host code/TB 361.5 avg search data/TB 24.5 cycles/op -42014693.0 cycles/in byte -121444900.2 cycles/out byte -5629031.1 cycles/search byte -83114481.0 gen_interm time -0.0% gen_code time 100.0% optim./code time -0.0% liveness/code time -0.0% cpu_restore count 6236 avg cycles 110.4 - after: Statistics: TB flush count 1 TB invalidate count 4665 TLB flush count 1010 JIT cycles 1996899624 (0.832 s at 2.4 GHz) translated TBs 297961 (aborted=0 0.0%) avg ops/TB 48.5 max=438 deleted ops/TB 8.56 avg temps/TB 32.31 max=38 avg host code/TB 361.8 avg search data/TB 24.5 cycles/op 138.2 cycles/in byte 398.4 cycles/out byte 18.5 cycles/search byte 273.1 gen_interm time 14.0% gen_code time 86.0% optim./code time 19.4% liveness/code time 10.3% cpu_restore count 6372 avg cycles 111.0 Reviewed-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Signed-off-by: Emilio G. Cota <cota@braap.org>
-
cputlb: bring back tlb_flush_count under !TLB_DEBUG
Commit f0aff0f ("cputlb: add assert_cpu_is_self checks") buried the increment of tlb_flush_count under TLB_DEBUG. This results in "info jit" always (mis)reporting 0 TLB flushes when !TLB_DEBUG. Besides, under MTTCG tlb_flush_count is updated by several threads, so in order not to lose counts we'd either have to use atomic ops or distribute the counter, which is more scalable. This patch does the latter by embedding tlb_flush_count in CPUArchState. The global count is then easily obtained by iterating over the CPU list. Reviewed-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Emilio G. Cota <cota@braap.org>
-
translate-all: remove redundant !tcg_enabled check in dump_exec_info
This check is redundant because it is already performed by the only caller of dump_exec_info -- the caller was updated by b7da97e ("monitor: Check whether TCG is enabled before running the "info jit" code"). Checking twice wouldn't necessarily be too bad, but here the check also returns with tb_lock held. So we can either do the check before tb_lock is acquired, or just get rid of it. Given that it is redundant, I am going for the latter option. Reviewed-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Emilio G. Cota <cota@braap.org>
Commits on Jul 11, 2017
-
Commit e7b161d ("vl: add tcg_enabled() for tcg related code") adds a check to exit the program when !tcg_enabled() while parsing the -tb-size flag. It turns out that when the -tb-size flag is evaluated, tcg_enabled() can only return 0, since it is set (or not) much later by configure_accelerator(). Fix it by unconditionally exiting if the flag is passed to a QEMU binary built with !CONFIG_TCG. Reviewed-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Emilio G. Cota <cota@braap.org>
-
scripts: add "git.orderfile" for ordering diff hunks by pathname patt…
…erns When passed to git-diff (and to every other git command producing diffs and/or diffstats) with "-O" or "diff.orderFile", this list of patterns will place the more declarative / abstract hunks first, while changes to imperative code / details will be near the end of the patches. This saves on scrolling / searching and makes for easier reviewing. We intend to advise contributors in the Wiki to run git config diff.orderFile scripts/git.orderfile once, as part of their initial setup, before formatting their first (or, for repeat contributors, next) patches. See the "-O" option and the "diff.orderFile" configuration variable in git-diff(1) and git-config(1). Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Eric Blake <eblake@redhat.com> Cc: Fam Zheng <famz@redhat.com> Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: John Snow <jsnow@redhat.com> Cc: Max Reitz <mreitz@redhat.com> Cc: Stefan Hajnoczi <stefanha@gmail.com> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
-
trace: [trivial] Statically enable all guest events
The existing optimizations makes it feasible to have them available on all builds. Some quick'n'dirty numbers with 400.perlbench (SPECcpu2006) on the train input (medium size - suns.pl) and the guest_mem_before event: * vanilla, statically disabled real 0m2,259s user 0m2,252s sys 0m0,004s * vanilla, statically enabled (overhead: 2.18x) real 0m4,921s user 0m4,912s sys 0m0,008s * multi-tb, statically disabled (overhead: 0.99x) [within noise range] real 0m2,228s user 0m2,216s sys 0m0,008s * multi-tb, statically enabled (overhead: 0.99x) [within noise range] real 0m2,229s user 0m2,224s sys 0m0,004s Now enabling all events when booting an ARM system that immediately shuts down (https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg04085.html): * vanilla, statically disabled real 0m32,153s user 0m31,276s sys 0m0,108s * vanilla, statically enabled (overhead: 1.35x) real 0m43,507s user 0m42,680s sys 0m0,168s * multi-tb, statically disabled (overhead: 1.03x) real 0m32,993s user 0m32,516s sys 0m0,104s * multi-tb, statically enabled (overhead: 1.00x) [within noise range] real 0m32,110s user 0m31,176s sys 0m0,156s And finally enabling all events using Emilio's dbt-bench (where orig == vanilla, new == multi-tb): NBench score; higher is better 180 +-+--------+----------+----------+---------+----------+----------+----------+----------+----------+---------+----------+--------+-+ | | | *** $$$$%% orig | 160 +-+....................................*.*.$..$.%............................................................orig-enabled +-+ | * * $ $ % new | 140 +-+....................................*.*.$..$.%............................................................new-disabled.......+-+ | * * $ $ % | | * * $ $ % | 120 +-+....................................*.*.$..$.%...............................................................................+-+ | * * $ $ % | | * * $ $ % | 100 +-+....................................*.*.$..$.%.....$$$%%%....................................................................+-+ | * * $ $ % *** $ $ % *** $$$%% | 80 +-+....................................*.*.$..$.%.*.*.$.$..%.*.*.$.$.%..........................................................+-+ | * * $ $ % * * $ $ % * * $ $ % | | * * $ $ % * * $ $ % * * $ $ % | 60 +-+.........................***..$$$%%.*.*##..$.%.*.*.$.$..%.*.*.$.$.%..***.$$$%%...............................................+-+ | **** $$$%% * * $ $ % * * # $ % * *## $ % * * $ $ % * * $ $ % | | * * $ $ % * * $ $ % * * # $ % * * # $ % * *## $ % * * $ $ % | 40 +-+..............*..*.$.$.%.*.*..$.$.%.*.*.#..$.%.*.*.#.$..%.*.*.#.$.%..*.*.$.$.%...............................................+-+ | * * $ $ % * * $ $ % * * # $ % * * # $ % * * # $ % * *## $ % *** $$$%%% | 20 +-+....***.$$$%%.*..*##.$.%.*.*###.$.%.*.*.#..$.%.*.*.#.$..%.*.*.#.$.%..*.*.#.$.%..................................*.*.$.$..%...+-+ | * *## $ % * * # $ % * * # $ % * * # $ % * * # $ % * * # $ % * * # $ % * *## $ % | | * * # $ % * * # $ % * * # $ % * * # $ % * * # $ % * * # $ % * * # $ % ***###$$%% ***##$$$%% * * # $ % | 0 +-+----***##$$%%-****##$$%%-***###$$%%-***##$$$%%-***##$$%%%-***##$$%%--***##$$%%-****##$$%%-***###$$%%-***##$$$%%-***##$$%%%---+-+ NUMERIC SORTSTRING SORT BITFIEFP EMULATION ASSIGNMENT IDEA HUFFMAN FOURIER NEURLU DECOMPOSITION gmean png: http://imgur.com/a/8XG5S Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu> Reviewed-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Lluís Vilanova <address@hidden> Signed-off-by: Emilio G. Cota <cota@braap.org> Message-id: 149915849243.6295.4484103824675839071.stgit@frigg.lan Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Lluís Vilanova authored and Stefan Hajnoczi committedJul 11, 2017 -
trace: [tcg, trivial] Re-align generated code
Last patch removed a nesting level in generated code. Re-align all code generated by backends to be 4-column aligned. Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu> Signed-off-by: Lluís Vilanova <address@hidden> Signed-off-by: Emilio G. Cota <cota@braap.org> Message-id: 149915824586.6295.17820926011082409033.stgit@frigg.lan Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Lluís Vilanova authored and Stefan Hajnoczi committedJul 11, 2017 -
trace: [tcg] Do not generate TCG code to trace dynamically-disabled e…
…vents If an event is dynamically disabled, the TCG code that calls the execution-time tracer is not generated. Removes the overheads of execution-time tracers for dynamically disabled events. As a bonus, also avoids checking the event state when the execution-time tracer is called from TCG-generated code (since otherwise TCG would simply not call it). Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu> Signed-off-by: Lluís Vilanova <address@hidden> Signed-off-by: Emilio G. Cota <cota@braap.org> Message-id: 149915799921.6295.13067154430923434035.stgit@frigg.lan Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Lluís Vilanova authored and Stefan Hajnoczi committedJul 11, 2017 -
exec: [tcg] Use different TBs according to the vCPU's dynamic tracing…
… state Every vCPU now uses a separate set of TBs for each set of dynamic tracing event state values. Each set of TBs can be used by any number of vCPUs to maximize TB reuse when vCPUs have the same tracing state. This feature is later used by tracetool to optimize tracing of guest code events. The maximum number of TB sets is defined as 2^E, where E is the number of events that have the 'vcpu' property (their state is stored in CPUState->trace_dstate). For this to work, a change on the dynamic tracing state of a vCPU will force it to flush its virtual TB cache (which is only indexed by address), and fall back to the physical TB cache (which now contains the vCPU's dynamic tracing state as part of the hashing function). Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu> Reviewed-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Lluís Vilanova <address@hidden> Signed-off-by: Emilio G. Cota <cota@braap.org> Message-id: 149915775266.6295.10060144081246467690.stgit@frigg.lan Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Lluís Vilanova authored and Stefan Hajnoczi committedJul 11, 2017 -
trace: [tcg] Delay changes to dynamic state when translating
This keeps consistency across all decisions taken during translation when the dynamic state of a vCPU is changed in the middle of translating some guest code. Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu> Reviewed-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Lluís Vilanova <address@hidden> Signed-off-by: Emilio G. Cota <cota@braap.org> Message-id: 149915750615.6295.3713699402253529487.stgit@frigg.lan Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Lluís Vilanova authored and Stefan Hajnoczi committedJul 11, 2017 -
trace: Allocate cpu->trace_dstate in place
There's little point in dynamically allocating the bitmap if we know at compile-time the max number of events we want to support. Thus, make room in the struct for the bitmap, which will make things easier later: this paves the way for upcoming changes, in which we'll use a u32 to fully capture cpu->trace_dstate. This change also increases performance by saving a dereference and improving locality--note that this is important since upcoming work makes reading this bitmap fairly common. Signed-off-by: Emilio G. Cota <cota@braap.org> Reviewed-by: Lluís Vilanova <vilanova@ac.upc.edu> Signed-off-by: Lluís Vilanova <address@hidden> Message-id: 149915725977.6295.15069969323605305641.stgit@frigg.lan Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Lluís Vilanova authored and Stefan Hajnoczi committedJul 11, 2017 -
backends: remove empty trace-events file
The content of the backends/trace-events file was entirely removed in commit 6b10e57 Author: Marc-André Lureau <marcandre.lureau@redhat.com> Date: Mon May 29 12:39:42 2017 +0400 char: move char devices to chardev/ Leaving the empty file around, causes tracetool to generate an empty .dtrace file which makes the dtrace compiler throw a syntax error. Signed-off-by: Daniel P. Berrange <berrange@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Message-id: 20170629162046.4135-1-berrange@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>