Commits

multi-tcg

Commits on Jul 9, 2017

  1. translate-all: do not hold tb_lock during code generation in softmmu

    Each vCPU can now generate code with TCG in parallel. Thus,
    drop tb_lock around code generation in softmmu.
    
    Note that we still have to take tb_lock after code translation,
    since there is global state that we have to update.
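
    The locking pattern described above can be sketched as follows. This is
    only an illustration under stated assumptions, not the code of this
    patch: every demo_* name is hypothetical, translation is faked with a
    byte-wise XOR, and a POSIX mutex stands in for tb_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names exist in QEMU. */
static pthread_mutex_t demo_tb_lock = PTHREAD_MUTEX_INITIALIZER;
static size_t demo_nb_tbs;      /* global state, guarded by demo_tb_lock */

/* Pretend translation: touches only the caller's own buffers. */
static size_t demo_translate(const unsigned char *guest, size_t len,
                             unsigned char *host)
{
    for (size_t i = 0; i < len; i++) {
        host[i] = (unsigned char)(guest[i] ^ 0xff);
    }
    return len;
}

/* Generate code without the lock, then take it only to publish. */
static size_t demo_tb_gen_code(const unsigned char *guest, size_t len,
                               unsigned char *host)
{
    assert(guest != NULL && host != NULL);

    size_t out = demo_translate(guest, len, host);  /* lock NOT held */

    pthread_mutex_lock(&demo_tb_lock);
    demo_nb_tbs++;                                  /* short critical section */
    pthread_mutex_unlock(&demo_tb_lock);
    return out;
}
```

    The point of the sketch is that the expensive step runs in parallel on
    every thread, while the mutex only covers the update of shared
    bookkeeping.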
    
    Nonetheless holding tb_lock for less time provides significant performance
    improvements to workloads that are translation-heavy. A good example
    of this is booting Linux; in my measurements, bootup+shutdown time of
    debian-arm is reduced by 20% before/after this entire patchset, when
    using -smp 8 and MTTCG on a machine with >= 8 real cores:
    
     Host: Intel(R) Xeon(R) CPU E5-2690 @ 2.90GHz
     Performance counter stats for 'qemu/build/arm-softmmu/qemu-system-arm \
    	-machine type=virt -nographic -smp 1 -m 4096 \
    	-netdev user,id=unet,hostfwd=tcp::2222-:22 \
    	-device virtio-net-device,netdev=unet \
    	-drive file=foobar.qcow2,id=myblock,index=0,if=none \
    	-device virtio-blk-device,drive=myblock \
    	-kernel /foobar.img -append console=ttyAMA0 root=/dev/vda1 \
    	-name arm,debug-threads=on -smp 8' (3 runs):
    
    Before:
          28764.018852 task-clock                #    1.663 CPUs utilized            ( +-  0.30% )
               727,490 context-switches          #    0.025 M/sec                    ( +-  0.68% )
                 2,429 CPU-migrations            #    0.000 M/sec                    ( +- 11.36% )
                14,042 page-faults               #    0.000 M/sec                    ( +-  1.00% )
        70,644,349,920 cycles                    #    2.456 GHz                      ( +-  0.96% ) [83.42%]
        37,129,806,098 stalled-cycles-frontend   #   52.56% frontend cycles idle     ( +-  1.27% ) [83.20%]
        26,620,190,524 stalled-cycles-backend    #   37.68% backend  cycles idle     ( +-  1.29% ) [66.50%]
        85,528,287,892 instructions              #    1.21  insns per cycle
                                                 #    0.43  stalled cycles per insn  ( +-  0.62% ) [83.40%]
        14,417,482,689 branches                  #  501.233 M/sec                    ( +-  0.49% ) [83.36%]
           321,182,192 branch-misses             #    2.23% of all branches          ( +-  1.17% ) [83.53%]
    
          17.297750583 seconds time elapsed                                          ( +-  1.08% )
    
    After:
          28690.888633 task-clock                #    2.069 CPUs utilized            ( +-  1.54% )
               473,947 context-switches          #    0.017 M/sec                    ( +-  1.32% )
                 2,793 CPU-migrations            #    0.000 M/sec                    ( +- 18.74% )
                22,634 page-faults               #    0.001 M/sec                    ( +-  1.20% )
        69,314,663,510 cycles                    #    2.416 GHz                      ( +-  1.08% ) [83.50%]
        36,114,710,208 stalled-cycles-frontend   #   52.10% frontend cycles idle     ( +-  1.64% ) [83.26%]
        25,519,842,658 stalled-cycles-backend    #   36.82% backend  cycles idle     ( +-  1.70% ) [66.77%]
        84,588,443,638 instructions              #    1.22  insns per cycle
                                                 #    0.43  stalled cycles per insn  ( +-  0.78% ) [83.44%]
        14,258,100,183 branches                  #  496.956 M/sec                    ( +-  0.87% ) [83.32%]
           324,984,804 branch-misses             #    2.28% of all branches          ( +-  0.51% ) [83.17%]
    
          13.870347754 seconds time elapsed                                          ( +-  1.65% )
    
    That is, a speedup of 17.29/13.87=1.24X.
    
    Similar numbers on a slower machine:
    
    Host: AMD Opteron(tm) Processor 6376:
    
    Before:
          74765.850569      task-clock (msec)         #    1.956 CPUs utilized            ( +-  1.42% )
               841,430      context-switches          #    0.011 M/sec                    ( +-  2.50% )
                18,228      cpu-migrations            #    0.244 K/sec                    ( +-  2.87% )
                26,565      page-faults               #    0.355 K/sec                    ( +-  9.19% )
        98,775,815,944      cycles                    #    1.321 GHz                      ( +-  1.40% )  (83.44%)
        26,325,365,757      stalled-cycles-frontend   #   26.65% frontend cycles idle     ( +-  1.96% )  (83.26%)
        17,270,620,447      stalled-cycles-backend    #   17.48% backend  cycles idle     ( +-  3.45% )  (33.32%)
        82,998,905,540      instructions              #    0.84  insns per cycle
                                                      #    0.32  stalled cycles per insn  ( +-  0.71% )  (50.06%)
        14,209,593,402      branches                  #  190.055 M/sec                    ( +-  1.01% )  (66.74%)
           571,258,648      branch-misses             #    4.02% of all branches          ( +-  0.20% )  (83.40%)
    
          38.220740889 seconds time elapsed                                          ( +-  0.72% )
    
    After:
          73281.226761      task-clock (msec)         #    2.415 CPUs utilized            ( +-  0.29% )
               571,984      context-switches          #    0.008 M/sec                    ( +-  1.11% )
                14,301      cpu-migrations            #    0.195 K/sec                    ( +-  2.90% )
                42,635      page-faults               #    0.582 K/sec                    ( +-  7.76% )
        98,478,185,775      cycles                    #    1.344 GHz                      ( +-  0.32% )  (83.39%)
        25,555,945,935      stalled-cycles-frontend   #   25.95% frontend cycles idle     ( +-  0.47% )  (83.37%)
        15,174,223,390      stalled-cycles-backend    #   15.41% backend  cycles idle     ( +-  0.83% )  (33.26%)
        81,939,511,983      instructions              #    0.83  insns per cycle
                                                      #    0.31  stalled cycles per insn  ( +-  0.12% )  (49.95%)
        13,992,075,918      branches                  #  190.937 M/sec                    ( +-  0.16% )  (66.65%)
           580,790,655      branch-misses             #    4.15% of all branches          ( +-  0.20% )  (83.26%)
    
          30.340574988 seconds time elapsed                                          ( +-  0.39% )
    
    That is, a speedup of 1.25X.
    
    Signed-off-by: Emilio G. Cota <cota@braap.org>
    cota committed Jul 9, 2017
    50aee48
  2. tcg: enable per-thread TCG for softmmu

    This allows us to generate TCG code in parallel. MTTCG already uses
    it, although the next commit pushes down a lock to actually
    perform parallel generation.
    
    User-mode is kept out of this: contention due to concurrent translation
    is more commonly found in full-system mode.
    
    This patch is fairly small due to the preparation work done in previous
    patches.
    
    Note that targets do not need any conversion: the TCGContext set up
    during initialization (i.e. where globals are set) is then cloned
    by the vCPU threads, which also double as TCG threads.
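
    A minimal sketch of the cloning idea, assuming a drastically reduced
    context type: DemoContext, demo_init_ctx, demo_ctx and the two functions
    are hypothetical names, and the real TCGContext carries far more state.
    Each thread copies the context that was set up at init time and keeps a
    thread-local pointer to its private copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, heavily reduced context. */
typedef struct DemoContext {
    int nb_globals;     /* set once during initialization */
    int nb_temps;       /* mutated while translating */
} DemoContext;

static DemoContext demo_init_ctx;               /* filled in at init time */
static _Thread_local DemoContext *demo_ctx;     /* one pointer per thread */

static void demo_context_init(int nb_globals)
{
    demo_init_ctx.nb_globals = nb_globals;
    demo_init_ctx.nb_temps = nb_globals;
}

/* Called by each thread before it translates anything. */
static DemoContext *demo_context_clone(void)
{
    DemoContext *s = malloc(sizeof(*s));

    assert(s != NULL);
    memcpy(s, &demo_init_ctx, sizeof(*s));  /* inherit the globals */
    demo_ctx = s;                           /* private to this thread */
    return s;
}
```

    Because the globals are copied rather than re-registered, code that only
    populates the initial context needs no changes in this model.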
    
    I searched for globals under tcg/ that might have to be converted
    to thread-local. I converted the ones that I saw, and I wrote down the
    non-const globals that I found to be set only at init time:
    
    Only written by tcg_context_init:
    - indirect_reg_alloc_order
    - tcg_op_defs
    Only written by tcg_target_init (called from tcg_context_init):
    - tcg_target_available_regs
    - tcg_target_call_clobber_regs
    - arm: arm_arch, use_idiv_instructions
    - i386: have_cmov, have_bmi1, have_bmi2, have_lzcnt,
            have_movbe, have_popcnt
    - mips: use_movnz_instructions, use_mips32_instructions,
            use_mips32r2_instructions, got_sigill (tcg_target_detect_isa)
    - ppc: have_isa_2_06, have_isa_3_00, tb_ret_addr
    - s390: tb_ret_addr, s390_facilities
    - sparc: qemu_ld_trampoline, qemu_st_trampoline (build_trampolines),
             use_vis3_instructions
    
    Only written by tcg_prologue_init:
    - 'struct jit_code_entry one_entry'
    - aarch64: tb_ret_addr
    - arm: tb_ret_addr
    - i386: tb_ret_addr, guest_base_flags
    - ia64: tb_ret_addr
    - mips: tb_ret_addr, bswap32_addr, bswap32u_addr, bswap64_addr
    
    I was not sure about tci_regs. From code inspection it seems that
    they have to be per-thread, so I converted them, but I do not think
    anyone has ever tried to get MTTCG working with TCI.
    
    Signed-off-by: Emilio G. Cota <cota@braap.org>
    cota committed Jul 9, 2017
    eb44461
  3. tcg: dynamically allocate from code_gen_buffer using equally-sized regions
    
    In preparation for having multiple TCG threads.
    
    The naive solution here is to split code_gen_buffer statically
    among the TCG threads; this however results in poor utilization
    if translation needs are different across TCG threads.
    
    What we do here is to add an extra layer of indirection, assigning
    regions that act just like pages do in virtual memory allocation.
    (BTW if you are wondering about the chosen naming, I did not want
    to use blocks or pages because those are already heavily used in QEMU).
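
    The indirection can be pictured with the sketch below. It is an
    illustration under assumptions, not the allocator added by this patch:
    the DemoRegions and DemoThread types and all demo_* functions are
    hypothetical. A shared buffer is cut into equally-sized regions; a
    thread takes the lock only when it needs a new region and otherwise
    bump-allocates inside the region it already owns.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical shared state: one buffer split into equal regions. */
typedef struct DemoRegions {
    pthread_mutex_t lock;   /* protects 'next' */
    char *start;            /* base of the shared buffer */
    size_t region_size;     /* every region has the same size */
    size_t count;           /* total number of regions */
    size_t next;            /* index of the first unassigned region */
} DemoRegions;

/* Per-thread view: a bump pointer inside the currently owned region. */
typedef struct DemoThread {
    char *ptr;
    char *end;
} DemoThread;

static void demo_regions_init(DemoRegions *r, char *buf, size_t size,
                              size_t count)
{
    assert(count > 0);
    pthread_mutex_init(&r->lock, NULL);
    r->start = buf;
    r->region_size = size / count;
    r->count = count;
    r->next = 0;
}

/* Rare path: grab the next free region under the lock. 0 on success. */
static int demo_region_alloc(DemoRegions *r, DemoThread *t)
{
    int ret = -1;

    pthread_mutex_lock(&r->lock);
    if (r->next < r->count) {
        t->ptr = r->start + r->next * r->region_size;
        t->end = t->ptr + r->region_size;
        r->next++;
        ret = 0;
    }
    pthread_mutex_unlock(&r->lock);
    return ret;
}

/* Common path: lock-free bump allocation. NULL means a flush is needed. */
static void *demo_code_alloc(DemoRegions *r, DemoThread *t, size_t size)
{
    if (size > r->region_size) {
        return NULL;
    }
    if (t->ptr == NULL || (size_t)(t->end - t->ptr) < size) {
        if (demo_region_alloc(r, t) < 0) {
            return NULL;    /* all regions have been handed out */
        }
    }
    void *p = t->ptr;
    t->ptr += size;
    return p;
}
```

    In this model a busy thread simply consumes more regions than an idle
    one. The unused tail of each abandoned region is the price paid, which
    is why utilization stays somewhat below 100%.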
    
    The effectiveness of this approach is clear after seeing some numbers.
    I used the bootup+shutdown of debian-arm with '-tb-size 80' as a benchmark.
    Note that I'm evaluating this after enabling per-thread TCG (which
    is done by a subsequent commit).
    
    * -smp 1, 1 region (entire buffer):
        qemu: flush code_size=83885014 nb_tbs=154739 avg_tb_size=357
        qemu: flush code_size=83884902 nb_tbs=153136 avg_tb_size=363
        qemu: flush code_size=83885014 nb_tbs=152777 avg_tb_size=364
        qemu: flush code_size=83884950 nb_tbs=150057 avg_tb_size=373
        qemu: flush code_size=83884998 nb_tbs=150234 avg_tb_size=373
        qemu: flush code_size=83885014 nb_tbs=154009 avg_tb_size=360
        qemu: flush code_size=83885014 nb_tbs=151007 avg_tb_size=370
        qemu: flush code_size=83885014 nb_tbs=151816 avg_tb_size=367
    
    That is, 8 flushes.
    
    * -smp 8, 32 regions (80/32 MB per region) [i.e. this patch]:
    
        qemu: flush code_size=76328008 nb_tbs=141040 avg_tb_size=356
        qemu: flush code_size=75366534 nb_tbs=138000 avg_tb_size=361
        qemu: flush code_size=76864546 nb_tbs=140653 avg_tb_size=361
        qemu: flush code_size=76309084 nb_tbs=135945 avg_tb_size=375
        qemu: flush code_size=74581856 nb_tbs=132909 avg_tb_size=375
        qemu: flush code_size=73927256 nb_tbs=135616 avg_tb_size=360
        qemu: flush code_size=78629426 nb_tbs=142896 avg_tb_size=365
        qemu: flush code_size=76667052 nb_tbs=138508 avg_tb_size=368
    
    Again, 8 flushes. Note how buffer utilization is not 100%, but it
    is close. Smaller region sizes would yield higher utilization,
    but we want region allocation to be rare (it acquires a lock), so
    we do not want to go too small.
    
    * -smp 8, static partitioning of 8 regions (10 MB per region):
        qemu: flush code_size=21936504 nb_tbs=40570 avg_tb_size=354
        qemu: flush code_size=11472174 nb_tbs=20633 avg_tb_size=370
        qemu: flush code_size=11603976 nb_tbs=21059 avg_tb_size=365
        qemu: flush code_size=23254872 nb_tbs=41243 avg_tb_size=377
        qemu: flush code_size=28289496 nb_tbs=52057 avg_tb_size=358
        qemu: flush code_size=43605160 nb_tbs=78896 avg_tb_size=367
        qemu: flush code_size=45166552 nb_tbs=82158 avg_tb_size=364
        qemu: flush code_size=63289640 nb_tbs=116494 avg_tb_size=358
        qemu: flush code_size=51389960 nb_tbs=93937 avg_tb_size=362
        qemu: flush code_size=59665928 nb_tbs=107063 avg_tb_size=372
        qemu: flush code_size=38380824 nb_tbs=68597 avg_tb_size=374
        qemu: flush code_size=44884568 nb_tbs=79901 avg_tb_size=376
        qemu: flush code_size=50782632 nb_tbs=90681 avg_tb_size=374
        qemu: flush code_size=39848888 nb_tbs=71433 avg_tb_size=372
        qemu: flush code_size=64708840 nb_tbs=119052 avg_tb_size=359
        qemu: flush code_size=49830008 nb_tbs=90992 avg_tb_size=362
        qemu: flush code_size=68372408 nb_tbs=123442 avg_tb_size=368
        qemu: flush code_size=33555560 nb_tbs=59514 avg_tb_size=378
        qemu: flush code_size=44748344 nb_tbs=80974 avg_tb_size=367
        qemu: flush code_size=37104248 nb_tbs=67609 avg_tb_size=364
    
    That is, 20 flushes. Note how a static partitioning approach uses
    the code buffer poorly, leading to many unnecessary flushes.
    
    Signed-off-by: Emilio G. Cota <cota@braap.org>
    cota committed Jul 9, 2017
    f8c893e
  4. tcg: introduce tcg_context_clone

    Before we make TCGContext thread-local.
    
    Signed-off-by: Emilio G. Cota <cota@braap.org>
    cota committed Jul 9, 2017
    bab3ed9
  5. tcg: define TCG_HIGHWATER

    Will come in handy very soon.
    
    Signed-off-by: Emilio G. Cota <cota@braap.org>
    cota committed Jul 9, 2017
    5490a2b
  6. tcg: distribute profiling counters across TCGContext's

    TCGContext is about to be made thread-local. To avoid scalability issues
    when profiling info is enabled, this patch distributes the profiling
    counters across contexts via the following changes:
    
    1) Consolidate profile info into its own struct, TCGProfile, which
       TCGContext also includes. Note that tcg_table_op_count is brought
       into TCGProfile after dropping the tcg_ prefix.
    2) Iterate over the TCG contexts in the system to obtain the total counts.
    
    Note that this change also requires updating the accessors to TCGProfile
    fields to use atomic_read/set whenever there may be concurrent accesses
    to them.
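
    The two changes can be sketched roughly as below. This is an assumption-
    laden illustration rather than the patch itself: it uses C11
    <stdatomic.h> in place of QEMU's atomic_read/set helpers, a fixed-size
    array in place of the real context bookkeeping, and hypothetical demo_*
    names throughout.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_CTXS 8

/* Hypothetical per-context profile with a single counter. */
typedef struct DemoProfile {
    _Atomic int64_t tb_count;   /* written only by the owning thread */
} DemoProfile;

typedef struct DemoCtx {
    DemoProfile prof;
} DemoCtx;

static DemoCtx demo_ctxs[DEMO_MAX_CTXS];
static size_t demo_n_ctxs;

/* Owner side: with a single writer, a relaxed load + store suffices. */
static void demo_prof_inc_tb(DemoCtx *s)
{
    int64_t v = atomic_load_explicit(&s->prof.tb_count,
                                     memory_order_relaxed);
    atomic_store_explicit(&s->prof.tb_count, v + 1, memory_order_relaxed);
}

/* Reader side: the total is the sum over all contexts. */
static int64_t demo_prof_total_tb(void)
{
    int64_t total = 0;

    assert(demo_n_ctxs <= DEMO_MAX_CTXS);
    for (size_t i = 0; i < demo_n_ctxs; i++) {
        total += atomic_load_explicit(&demo_ctxs[i].prof.tb_count,
                                      memory_order_relaxed);
    }
    return total;
}
```

    No thread ever writes another thread's counter, so the hot path avoids
    read-modify-write atomics; only the infrequent reader pays for walking
    every context.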
    
    Signed-off-by: Emilio G. Cota <cota@braap.org>
    cota committed Jul 9, 2017
    7dacefa
  7. tcg: keep a list of TCGContext's

    Before we make TCGContext thread-local. Once that is done, iterating
    over all TCG contexts will be quite useful; for instance we
    will need it to gather profiling info from each TCGContext.
    
    A possible alternative would be to keep an array of TCGContext pointers.
    However, this option is not that trivial, because vCPUs are spawned in
    parallel. So let's just keep it simple and use a list protected by a lock.
    
    Note that this lock will soon be used for other purposes, hence the
    generic "tcg_lock" name.
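
    A lock-protected list of this kind might look like the sketch below.
    The node type, demo_tcg_lock and the demo_* functions are hypothetical
    and use a plain POSIX mutex with a hand-rolled singly linked list; the
    patch itself may well use QEMU's own list and lock primitives.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical list node; a real context would embed the link. */
typedef struct DemoNode {
    struct DemoNode *next;
    int id;
} DemoNode;

static pthread_mutex_t demo_tcg_lock = PTHREAD_MUTEX_INITIALIZER;
static DemoNode *demo_ctx_list;

/* Each thread registers itself; the order of arrival does not matter. */
static void demo_ctx_register(DemoNode *n)
{
    assert(n != NULL);
    pthread_mutex_lock(&demo_tcg_lock);
    n->next = demo_ctx_list;
    demo_ctx_list = n;
    pthread_mutex_unlock(&demo_tcg_lock);
}

/* Walk the list under the same lock. */
static size_t demo_ctx_count(void)
{
    size_t n = 0;

    pthread_mutex_lock(&demo_tcg_lock);
    for (DemoNode *p = demo_ctx_list; p != NULL; p = p->next) {
        n++;
    }
    pthread_mutex_unlock(&demo_tcg_lock);
    return n;
}
```

    Unlike an array, the list needs no agreed-upon index per thread, which
    sidesteps the problem of threads being spawned in parallel.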
    
    Signed-off-by: Emilio G. Cota <cota@braap.org>
    cota committed Jul 9, 2017
    c309f6c

Commits on Jul 8, 2017

  1. gen-icount: fold exitreq_label into TCGContext

    Before we make TCGContext thread-local.
    
    Signed-off-by: Emilio G. Cota <cota@braap.org>
    cota committed Jul 8, 2017
    ae0224a
  2. tcg: take .helpers out of TCGContext

    Before TCGContext is made thread-local.
    
    The hash table becomes read-only after it is filled in,
    so we can save space by keeping just a global pointer to it.
    
    Signed-off-by: Emilio G. Cota <cota@braap.org>
    cota committed Jul 8, 2017
    18ae22c
  3. tcg: take tb_ctx out of TCGContext

    Before TCGContext is made thread-local.
    
    Reviewed-by: Richard Henderson <rth@twiddle.net>
    Signed-off-by: Emilio G. Cota <cota@braap.org>
    cota committed Jul 8, 2017
    0ebec1e
  4. translate-all: report correct avg host TB size

    Since commit 6e3b2bf ("tcg: allocate TB structs before the
    corresponding translated code") we are not fully utilizing
    code_gen_buffer for translated code, and therefore are
    incorrectly reporting the amount of translated code as well as
    the average host TB size. Address this by:
    
    - Making the conscious choice of misreporting the total translated code;
      doing otherwise would mislead users into thinking "-tb-size" is not
      honoured.
    
    - Expanding tb_tree_stats to accurately count the bytes of translated code on
      the host, and using this for reporting the average tb host size,
      as well as the expansion ratio.
    
    In the future we might want to consider reporting the accurate numbers for
    the total translated code, together with a "bookkeeping/overhead" field to
    account for the TB structs.
    
    Signed-off-by: Emilio G. Cota <cota@braap.org>
    cota committed Jul 8, 2017
    678f14e
  5. translate-all: use a binary search tree to track TBs in TBContext

    This is a prerequisite for having threads generate code on separate
    buffers, which will help scalability when booting multiple cores
    under MTTCG.
    
    For this we need a new field (.tc_size) in TranslationBlock to keep
    track of the size of the translated code. This field is added into
    a 4-byte hole that the previous commit created.
    
    In order to use glib's binary search tree we embed a helper struct
    in TranslationBlock to allow us to compare tb's based on their
    tc_ptr as well as their tc_size fields. We use an anonymous struct
    in TranslationBlock to minimize churn; the alternatives I can
    see are to (a) just add a comment and cross our fingers, (b) use
    -fms-extensions, and (c) embed the struct and update all calling
    code. I think using an anonymous struct is superior, but I can be
    persuaded otherwise.
    
    The comparison function we use is optimized for the common case:
    insertions. Profiling shows that upon booting debian-arm, 98%
    of comparisons are between existing tb's (i.e. a->size and b->size
    are both !0), which happens during insertions (and removals, but
    those are rare). The remaining cases are lookups. From reading the glib
    sources we see that the first key is always the lookup key. The code
    does not assume this to always be the case, because this behaviour is
    not guaranteed in the glib docs; it does, however, embed this knowledge
    as a branch hint for the compiler.
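
    A comparison function along these lines could be sketched as follows.
    This is not the function from the patch: it is written without glib,
    DemoTc and the demo_* names are hypothetical, and addresses are plain
    integers. A key with size 0 is treated as a lookup for the block whose
    range contains its address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key: start address and size of one translated block. */
typedef struct DemoTc {
    uintptr_t ptr;
    size_t size;    /* 0 marks a lookup key */
} DemoTc;

#if defined(__GNUC__)
#define demo_likely(x) __builtin_expect(!!(x), 1)
#else
#define demo_likely(x) (x)
#endif

/* Compare an address against the range [s->ptr, s->ptr + s->size). */
static int demo_addr_cmp(uintptr_t addr, const DemoTc *s)
{
    if (addr >= s->ptr + s->size) {
        return 1;
    }
    if (addr < s->ptr) {
        return -1;
    }
    return 0;
}

static int demo_tc_cmp(const DemoTc *a, const DemoTc *b)
{
    /* Common case: two real blocks (insertion or removal). */
    if (demo_likely(a->size && b->size)) {
        if (a->ptr > b->ptr) {
            return 1;
        }
        if (a->ptr < b->ptr) {
            return -1;
        }
        assert(a->size == b->size);     /* same start, same block */
        return 0;
    }
    /* Lookup: expect the key first, but handle either order. */
    if (demo_likely(a->size == 0)) {
        return demo_addr_cmp(a->ptr, b);
    }
    return -demo_addr_cmp(b->ptr, a);
}
```

    The branch hints only express which order is expected; the result is
    the same whichever argument carries the lookup key.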
    
    Note that tb_free does not free space in the code_gen_buffer anymore,
    since we cannot easily know whether the tb is the last one inserted
    in code_gen_buffer.
    
    Performance-wise, lookups in tb_find_pc are the same as before:
    O(log n). However, insertions are O(log n) instead of O(1), which
    results in a small slowdown when booting debian-arm:
    
    Performance counter stats for 'build/arm-softmmu/qemu-system-arm \
    	-machine type=virt -nographic -smp 1 -m 4096 \
    	-netdev user,id=unet,hostfwd=tcp::2222-:22 \
    	-device virtio-net-device,netdev=unet \
    	-drive file=img/arm/jessie-arm32.qcow2,id=myblock,index=0,if=none \
    	-device virtio-blk-device,drive=myblock \
    	-kernel img/arm/aarch32-current-linux-kernel-only.img \
    	-append console=ttyAMA0 root=/dev/vda1 \
    	-name arm,debug-threads=on -smp 1' (10 runs):
    
    - Before:
    
           8048.598422      task-clock (msec)         #    0.931 CPUs utilized            ( +-  0.28% )
                16,974      context-switches          #    0.002 M/sec                    ( +-  0.12% )
                     0      cpu-migrations            #    0.000 K/sec
                10,125      page-faults               #    0.001 M/sec                    ( +-  1.23% )
        35,144,901,879      cycles                    #    4.367 GHz                      ( +-  0.14% )
       <not supported>      stalled-cycles-frontend
       <not supported>      stalled-cycles-backend
        65,758,252,643      instructions              #    1.87  insns per cycle          ( +-  0.33% )
        10,871,298,668      branches                  # 1350.707 M/sec                    ( +-  0.41% )
           192,322,212      branch-misses             #    1.77% of all branches          ( +-  0.32% )
    
           8.640869419 seconds time elapsed                                          ( +-  0.57% )
    
    - After:
           8146.242027      task-clock (msec)         #    0.923 CPUs utilized            ( +-  1.23% )
                17,016      context-switches          #    0.002 M/sec                    ( +-  0.40% )
                     0      cpu-migrations            #    0.000 K/sec
                18,769      page-faults               #    0.002 M/sec                    ( +-  0.45% )
        35,660,956,120      cycles                    #    4.378 GHz                      ( +-  1.22% )
       <not supported>      stalled-cycles-frontend
       <not supported>      stalled-cycles-backend
        65,095,366,607      instructions              #    1.83  insns per cycle          ( +-  1.73% )
        10,803,480,261      branches                  # 1326.192 M/sec                    ( +-  1.95% )
           195,601,289      branch-misses             #    1.81% of all branches          ( +-  0.39% )
    
           8.828660235 seconds time elapsed                                          ( +-  0.38% )
    
    Signed-off-by: Emilio G. Cota <cota@braap.org>
    cota committed Jul 8, 2017
    a48acf6
  6. exec-all: move tb->invalid to the end of the struct

    This opens up a 4-byte hole to be used by upcoming work.
    
    Note that moving this field to the 2nd cache line of the struct
    does not affect performance: tb->page_addr is in the 2nd cache
    line as well, and both are accessed during code lookup. Besides,
    the tb->invalid check is easily predicted.
    
    Signed-off-by: Emilio G. Cota <cota@braap.org>
    cota committed Jul 8, 2017
    fdc2860
  7. exec-all: shrink tb->invalid to uint8_t

    To avoid wasting a byte. I don't have any use in mind for this byte,
    but I think it's good to leave this byte explicitly free for future use.
    See this discussion for how the u16 came to be:
      https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04564.html
    We could use a bool but in some systems that would take > 1 byte.
    
    Signed-off-by: Emilio G. Cota <cota@braap.org>
    cota committed Jul 8, 2017
    1be7f7d
  8. tcg/mips: constify tcg_target_callee_save_regs

    Signed-off-by: Emilio G. Cota <cota@braap.org>
    cota committed Jul 8, 2017
    5883d55
  9. tcg/i386: constify tcg_target_callee_save_regs

    Signed-off-by: Emilio G. Cota <cota@braap.org>
    cota committed Jul 8, 2017
    eb628b5
  10. translate-all: make have_tb_lock static

    It is only used by this object, and it's not exported to any other.
    
    Signed-off-by: Emilio G. Cota <cota@braap.org>
    cota committed Jul 8, 2017
    bdbc258
  11. exec-all: fix typos in TranslationBlock's documentation

    Reviewed-by: Richard Henderson <rth@twiddle.net>
    Signed-off-by: Emilio G. Cota <cota@braap.org>
    cota committed Jul 8, 2017
    afa773a
  12. tcg: fix corruption of code_time profiling counter upon tb_flush

    Whenever there is an overflow in code_gen_buffer (e.g. we run out
    of space in it and have to flush it), the code_time profiling counter
    ends up with an invalid value (that is, code_time -= profile_getclock(),
    without later on getting += profile_getclock() due to the goto).
    
    Fix it by using the ti variable, so that we only update code_time
    when there is no overflow. Note that in case there is an overflow
    we fail to account for the elapsed coding time, but this is quite rare
    so we can probably live with it.
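
    The shape of the fix can be illustrated with the sketch below, which
    assumes a fake clock and hypothetical demo_* names and replaces the
    goto-based retry with a simple early return. The start time lives in a
    local variable, and the shared counter is only touched once generation
    has succeeded.

```c
#include <assert.h>
#include <stdint.h>

static int64_t demo_clock;      /* fake clock for the sketch */
static int64_t demo_code_time;  /* accumulated code generation time */

/* Every reading advances the fake clock by 5 units. */
static int64_t demo_getclock(void)
{
    return demo_clock += 5;
}

/* Returns 0 on success, -1 if the pretend code buffer overflowed. */
static int demo_gen_code(int overflow)
{
    int64_t ti = demo_getclock();   /* start time kept in a local */

    if (overflow) {
        return -1;                  /* early exit: counter untouched */
    }
    demo_code_time += demo_getclock() - ti;
    return 0;
}
```

    With the buggy ordering, the subtraction would already have been
    applied to the counter before the overflow exit, leaving it hugely
    negative, as in the "before" output of "info jit".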
    
    "info jit" before/after, roughly at the same time during debian-arm bootup:
    
    - before:
    Statistics:
    TB flush count      1
    TB invalidate count 4665
    TLB flush count     998
    JIT cycles          -615191529184601 (-256329.804 s at 2.4 GHz)
    translated TBs      302310 (aborted=0 0.0%)
    avg ops/TB          48.4 max=438
    deleted ops/TB      8.54
    avg temps/TB        32.31 max=38
    avg host code/TB    361.5
    avg search data/TB  24.5
    cycles/op           -42014693.0
    cycles/in byte      -121444900.2
    cycles/out byte     -5629031.1
    cycles/search byte     -83114481.0
      gen_interm time   -0.0%
      gen_code time     100.0%
    optim./code time    -0.0%
    liveness/code time  -0.0%
    cpu_restore count   6236
      avg cycles        110.4
    
    - after:
    Statistics:
    TB flush count      1
    TB invalidate count 4665
    TLB flush count     1010
    JIT cycles          1996899624 (0.832 s at 2.4 GHz)
    translated TBs      297961 (aborted=0 0.0%)
    avg ops/TB          48.5 max=438
    deleted ops/TB      8.56
    avg temps/TB        32.31 max=38
    avg host code/TB    361.8
    avg search data/TB  24.5
    cycles/op           138.2
    cycles/in byte      398.4
    cycles/out byte     18.5
    cycles/search byte     273.1
      gen_interm time   14.0%
      gen_code time     86.0%
    optim./code time    19.4%
    liveness/code time  10.3%
    cpu_restore count   6372
      avg cycles        111.0
    
    Signed-off-by: Emilio G. Cota <cota@braap.org>
    cota committed Jul 8, 2017
    e3243de
  13. cputlb: bring back tlb_flush_count under !TLB_DEBUG

    Commit f0aff0f ("cputlb: add assert_cpu_is_self checks") buried
    the increment of tlb_flush_count under TLB_DEBUG. This results in
    "info jit" always (mis)reporting 0 TLB flushes when !TLB_DEBUG.
    
    Besides, under MTTCG tlb_flush_count is updated by several threads,
    so in order not to lose counts we'd either have to use atomic ops
    or distribute the counter, which is more scalable.
    
    This patch does the latter by embedding tlb_flush_count in CPUArchState.
    The global count is then easily obtained by iterating over the CPU list.
    
    Signed-off-by: Emilio G. Cota <cota@braap.org>
    cota committed Jul 8, 2017
    a6e811e
  14. translate-all: remove redundant !tcg_enabled check in dump_exec_info

    This check is redundant because it is already performed by the only
    caller of dump_exec_info -- the caller was updated by b7da97e
    ("monitor: Check whether TCG is enabled before running the "info jit"
    code").
    
    Checking twice wouldn't necessarily be too bad, but here the check also
    returns with tb_lock held. So we can either do the check before tb_lock is
    acquired, or just get rid of it. Given that it is redundant, I am going
    for the latter option.
    
    Signed-off-by: Emilio G. Cota <cota@braap.org>
    cota committed Jul 8, 2017
    8b8ff64
  15. vl: fix breakage of -tb-size

    Commit e7b161d ("vl: add tcg_enabled() for tcg related code") adds
    a check to exit the program when !tcg_enabled() while parsing the -tb-size
    flag.
    
    It turns out that when the -tb-size flag is evaluated, tcg_enabled() can
    only return 0, since it is set (or not) much later by configure_accelerator().
    
    Fix it by unconditionally exiting if the flag is passed to a QEMU binary
    built with !CONFIG_TCG.
    
    Signed-off-by: Emilio G. Cota <cota@braap.org>
    cota committed Jul 8, 2017
    c3f882e
  16. scripts: add "git.orderfile" for ordering diff hunks by pathname patterns
    
    When passed to git-diff (and to every other git command producing diffs
    and/or diffstats) with "-O" or "diff.orderFile", this list of patterns
    will place the more declarative / abstract hunks first, while changes to
    imperative code / details will be near the end of the patches. This saves
    on scrolling / searching and makes for easier reviewing.
    
    We intend to advise contributors in the Wiki to run
    
      git config diff.orderFile scripts/git.orderfile
    
    once, as part of their initial setup, before formatting their first (or,
    for repeat contributors, next) patches.
    
    See the "-O" option and the "diff.orderFile" configuration variable in
    git-diff(1) and git-config(1).
    
    Cc: "Michael S. Tsirkin" <mst@redhat.com>
    Cc: Eric Blake <eblake@redhat.com>
    Cc: Fam Zheng <famz@redhat.com>
    Cc: Gerd Hoffmann <kraxel@redhat.com>
    Cc: John Snow <jsnow@redhat.com>
    Cc: Max Reitz <mreitz@redhat.com>
    Cc: Stefan Hajnoczi <stefanha@gmail.com>
    Signed-off-by: Laszlo Ersek <lersek@redhat.com>
    lersek authored and cota committed Jul 8, 2017
    103589c

Commits on Jul 6, 2017

  1. Merge remote-tracking branch 'remotes/borntraeger/tags/s390x-20170706' into staging
    
    s390x/kvm/migration: fixes, enhancements and cleanups
    
    - new email address for Cornelia
    - Fixes: 3270, flic, virtio-scsi-ccw, ipl
    - Enhancements, cpumodel, migration
    
    # gpg: Signature made Thu 06 Jul 2017 08:18:19 BST
    # gpg:                using RSA key 0x117BBC80B5A61C7C
    # gpg: Good signature from "Christian Borntraeger (IBM) <borntraeger@de.ibm.com>"
    # Primary key fingerprint: F922 9381 A334 08F9 DBAB  FBCA 117B BC80 B5A6 1C7C
    
    * remotes/borntraeger/tags/s390x-20170706:
      hw/s390x/ipl: Fix endianness problem with netboot_start_addr
      virtio-scsi-ccw: use ioeventfd even when KVM is disabled
      s390x: return unavailable features via query-cpu-definitions
      s390x/MAINTAINERS: Update my email address
      s390x: fix realize inheritance for kvm-flic
      s390x: fix error propagation in kvm-flic's realize
      s390x/3270: fix instruction interception handler
      s390x: vmstatify config migration for virtio-ccw
    
    Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
    pm215 committed Jul 6, 2017
    b113658
  2. Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into staging
    
    * qemu-thread portability improvement (Fam)
    * virtio-scsi IOMMU fix (Jason)
    * poisoning and common-obj-y cleanups (Thomas)
    * initial Hypervisor.framework refactoring (Sergio)
    * x86 TCG interrupt injection fixes (Wu Xiang, me)
    * --disable-tcg support for x86 (Yang Zhong, me)
    * various other bugfixes and cleanups (Daniel, Peter, Thomas)
    
    # gpg: Signature made Wed 05 Jul 2017 08:12:56 BST
    # gpg:                using RSA key 0xBFFBD25F78C7AE83
    # gpg: Good signature from "Paolo Bonzini <bonzini@gnu.org>"
    # gpg:                 aka "Paolo Bonzini <pbonzini@redhat.com>"
    # Primary key fingerprint: 46F5 9FBD 57D6 12E7 BFD4  E2F7 7E15 100C CD36 69B1
    #      Subkey fingerprint: F133 3857 4B66 2389 866C  7682 BFFB D25F 78C7 AE83
    
    * remotes/bonzini/tags/for-upstream: (42 commits)
      target/i386: add the CONFIG_TCG into Makefiles
      target/i386: add the tcg_enabled() in target/i386/
      target/i386: move TLB refill function out of helper.c
      target/i386: split cpu_set_mxcsr() and make cpu_set_fpuc() inline
      target/i386: make cpu_get_fp80()/cpu_set_fp80() static
      target/i386: move cpu_sync_bndcs_hflags() function
      tcg: add the CONFIG_TCG into Makefiles
      tcg: add CONFIG_TCG guards in headers
      exec: elide calls to tb_lock and tb_unlock
      tcg: move tb_lock out of translate-all.h
      tcg: add the tcg-stub.c file into accel/stubs/
      vapic: use tcg_enabled
      monitor: disable "info jit" and "info opcount" if !TCG
      tcg: make tcg_allowed global
      cpu: move interrupt handling out of translate-common.c
      tcg: move page_size_init() function
      vl: add tcg_enabled() for tcg related code
      vl: convert -tb-size to qemu_strtoul
      configure: add --disable-tcg configure option
      configure: early test for supported targets
      ...
    
    Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
    pm215 committed Jul 6, 2017
    67b9c5d

Commits on Jul 5, 2017

  1. hw/s390x/ipl: Fix endianness problem with netboot_start_addr

    The start address has to be stored in big endian byte order
    in the iplb.ccw block for the guest.
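
The byte-order requirement above can be illustrated with a small sketch. This is not the code from the commit: `store_be64` and `load_be64` are hypothetical helpers, and a 64-bit address field is assumed. The point is only that the value is written most-significant byte first regardless of host endianness.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store a 64-bit value most-significant byte first. */
static void store_be64(uint8_t *dst, uint64_t value)
{
    for (size_t i = 0; i < 8; i++) {
        dst[i] = (uint8_t)(value >> (56 - 8 * i));
    }
}

/* Hypothetical helper: read the value back the way a big-endian guest would. */
static uint64_t load_be64(const uint8_t *src)
{
    uint64_t value = 0;

    for (size_t i = 0; i < 8; i++) {
        value = (value << 8) | src[i];
    }
    return value;
}
```

Shifting instead of copying the host representation keeps the result identical on little-endian and big-endian hosts.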
    
    Signed-off-by: Thomas Huth <thuth@redhat.com>
    Message-Id: <1499268345-12552-1-git-send-email-thuth@redhat.com>
    Reviewed-by: Cornelia Huck <cohuck@redhat.com>
    Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
    huth authored and borntraeger committed Jul 5, 2017
    1045e3c
  2. virtio-scsi-ccw: use ioeventfd even when KVM is disabled

    This patch is based on a similar patch from Stefan Hajnoczi -
    commit c324fd0 ("virtio-pci: use ioeventfd even when KVM is disabled")
    
    Do not check kvm_eventfds_enabled() when KVM is disabled since it
    always returns 0.  Since commit 8c56c1a
    ("memory: emulate ioeventfd") it has been possible to use ioeventfds in
    qtest or TCG mode.
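
A minimal sketch of the check described above, under stated assumptions: `struct accel_state` and `ioeventfd_usable` are hypothetical stand-ins for the accelerator queries, not QEMU definitions. The KVM capability is consulted only when KVM is actually in use, so a non-KVM run is no longer rejected.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the accelerator state; not a QEMU structure. */
struct accel_state {
    bool kvm_enabled;   /* is KVM the active accelerator? */
    bool kvm_eventfds;  /* does KVM report eventfd support? */
};

/* Return true when ioeventfd may be used for the device. */
static bool ioeventfd_usable(const struct accel_state *s)
{
    /* Only a KVM run that lacks eventfd support rules ioeventfd out. */
    if (s->kvm_enabled && !s->kvm_eventfds) {
        return false;
    }
    /* Without KVM, the emulated ioeventfd path is assumed to be available. */
    return true;
}
```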
    
    This patch makes -device virtio-scsi-ccw,iothread=iothread0 work even
    when KVM is disabled.
    We don't yet have an equivalent of "memory: emulate ioeventfd" for ccw,
    but this doesn't hurt, and qemu-iotests 068 can pass by skipping the
    iothread arguments.
    
    I have tested that virtio-scsi-ccw works under tcg both with and without
    iothread.
    
    This patch fixes qemu-iotests 068, which was accidentally merged early
    despite the dependency on ioeventfd.
    
    Signed-off-by: QingFeng Hao <haoqf@linux.vnet.ibm.com>
    Reviewed-by: Cornelia Huck <cohuck@redhat.com>
    Message-Id: <20170704132350.11874-2-haoqf@linux.vnet.ibm.com>
    Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
    Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
    SeanHQF authored and borntraeger committed Jul 5, 2017
    cda3c19
  3. s390x: return unavailable features via query-cpu-definitions

    The response for query-cpu-definitions didn't include the
    unavailable-features field, which is used by libvirt to figure
    out whether a certain cpu model is usable on the host.
    
    The unavailable features are now computed by obtaining the host CPU
    model and comparing it against the known CPU models. The comparison
    takes into account the generation, the GA level and the feature
    bitmaps. In the case of a CPU generation/GA level mismatch
    a feature called "type" is reported to be missing.
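
The comparison described above could look roughly like the following sketch. All names here are hypothetical, the feature bitmap is simplified to a single 64-bit word, and "mismatch" is assumed to mean that the model is newer than the host in generation or GA level.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified CPU model description. */
struct cpu_model {
    int gen;            /* hardware generation */
    int ga;             /* GA level within the generation */
    uint64_t features;  /* feature bitmap (one word for brevity) */
};

/* What a model needs that the host cannot provide. */
struct unavailable {
    bool type_missing;          /* reported as the "type" feature */
    uint64_t missing_features;  /* features set in the model but not on the host */
};

static struct unavailable check_model(const struct cpu_model *host,
                                      const struct cpu_model *model)
{
    struct unavailable u;

    /* Assumption: a model newer than the host counts as a mismatch. */
    u.type_missing = model->gen > host->gen ||
                     (model->gen == host->gen && model->ga > host->ga);
    /* Features the model requires that the host lacks. */
    u.missing_features = model->features & ~host->features;
    return u;
}
```

A model is usable on the host only when `type_missing` is false and `missing_features` is zero.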
    
    As a result, the output of virsh domcapabilities would change
    from something like
     ...
         <mode name='custom' supported='yes'>
          <model usable='unknown'>z10EC-base</model>
          <model usable='unknown'>z9EC-base</model>
          <model usable='unknown'>z196.2-base</model>
          <model usable='unknown'>z900-base</model>
          <model usable='unknown'>z990</model>
     ...
    to
     ...
         <mode name='custom' supported='yes'>
          <model usable='yes'>z10EC-base</model>
          <model usable='yes'>z9EC-base</model>
          <model usable='no'>z196.2-base</model>
          <model usable='yes'>z900-base</model>
          <model usable='yes'>z990</model>
     ...
    
    Signed-off-by: Viktor Mihajlovski <mihajlov@linux.vnet.ibm.com>
    Message-Id: <1499082529-16970-1-git-send-email-mihajlov@linux.vnet.ibm.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Acked-by: Cornelia Huck <cohuck@redhat.com>
    Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
    Viktor Mihajlovski authored and borntraeger committed Jul 5, 2017
    38cba1f
  4. s390x/MAINTAINERS: Update my email address

    Signed-off-by: Cornelia Huck <cohuck@redhat.com>
    Message-Id: <20170704092215.13742-2-cohuck@redhat.com>
    Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
    cohuck authored and borntraeger committed Jul 5, 2017
    c1976ae
  5. s390x: fix realize inheritance for kvm-flic

    Commit f6f4ce4211 ("s390x: add property adapter_routes_max_batch",
    2016-12-09) introduced a realize meant to be common to all flic
    subclasses, but failed to make sure that kvm-flic, which already had
    its own realize, actually calls the common one.
    
    This omission fortunately does not result in a grave problem. The common
    realize was only supposed to catch a possible programming mistake by
    validating a value of a property set via the compat machine macros. Since
    there was no programming mistake we don't need this fixed for stable.
    
    Let's fix this problem by making sure kvm flic honors the realize of its
    parent class.
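
The pattern of a subclass honoring its parent's realize can be sketched as follows. This is an illustration only, not the QOM code from the commit: the types, the `MAX_BATCH_LIMIT` value, and the function names are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device state. */
typedef struct Device {
    unsigned max_batch;   /* property validated by the common realize */
    bool backend_ready;   /* set by the subclass realize */
} Device;

typedef bool (*RealizeFn)(Device *dev, const char **errp);

#define MAX_BATCH_LIMIT 64u  /* hypothetical upper bound */

/* Common realize: validates the property for every subclass. */
static bool parent_realize_sketch(Device *dev, const char **errp)
{
    if (dev->max_batch > MAX_BATCH_LIMIT) {
        *errp = "max_batch exceeds the supported limit";
        return false;
    }
    return true;
}

/* The subclass remembers the parent's realize when it overrides it. */
struct ChildClass {
    RealizeFn parent_realize;
};

static bool child_realize(const struct ChildClass *cls, Device *dev,
                          const char **errp)
{
    /* Run the parent's checks first and stop if they fail. */
    if (cls->parent_realize && !cls->parent_realize(dev, errp)) {
        return false;
    }
    dev->backend_ready = true;  /* subclass-specific setup */
    return true;
}
```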
    
    Let us also improve on the error message we would hypothetically emit
    when the validation fails.
    
    Signed-off-by: Halil Pasic <pasic@linux.vnet.ibm.com>
    Fixes: f6f4ce4211 ("s390x: add property adapter_routes_max_batch")
    Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
    Reviewed-by: Yi Min Zhao <zyimin@linux.vnet.ibm.com>
    Reviewed-by: Cornelia Huck <cohuck@redhat.com>
    Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
    Halil Pasic authored and borntraeger committed Jul 5, 2017
    5cbab1b
  6. s390x: fix error propagation in kvm-flic's realize

    Ever since it was introduced by commit a2875e6f98 ("s390x/kvm:
    implement floating-interrupt controller device", 2013-07-16), the
    kvm-flic has not made realize fail properly when the KVM device cannot
    be created. That device serves as the backend and is essential for an
    operational kvm-flic.
    
    Let's fix this by making sure we do proper error propagation in realize.
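
As a rough illustration of what proper error propagation in realize means, consider the sketch below. `SketchError`, `create_backend`, and `sketch_realize` are hypothetical and do not reflect QEMU's Error API; the failure condition (a negative handle) is invented for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical error object; msg stays NULL when no error occurred. */
typedef struct {
    const char *msg;
} SketchError;

/* Hypothetical backend creation: fails for a negative handle. */
static bool create_backend(int handle, SketchError *err)
{
    if (handle < 0) {
        err->msg = "cannot create backend device";
        return false;
    }
    return true;
}

/* Realize fails, and reports why, when the backend cannot be created. */
static bool sketch_realize(int handle, SketchError *errp)
{
    SketchError local = { NULL };

    if (!create_backend(handle, &local)) {
        *errp = local;  /* hand the error to the caller instead of dropping it */
        return false;
    }
    return true;
}
```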
    
    Signed-off-by: Halil Pasic <pasic@linux.vnet.ibm.com>
    Fixes: a2875e6f98 "s390x/kvm: implement floating-interrupt controller device"
    Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
    Reviewed-by: Yi Min Zhao <zyimin@linux.vnet.ibm.com>
    Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
    Halil Pasic authored and borntraeger committed Jul 5, 2017
    f62f210
  7. s390x/3270: fix instruction interception handler

    Commit bab482d ("s390x/css: ccw translation infrastructure")
    introduced instruction interception handlers for different types of
    subchannels. Emulated 3270 devices must be assigned the virtual
    subchannel handler during device realization, or 3270 will not work.
    
    Fixes: bab482d ("s390x/css: ccw translation infrastructure")
    
    Reviewed-by: Jing Liu <liujbjl@linux.vnet.ibm.com>
    Reviewed-by: Halil Pasic <pasic@linux.vnet.ibm.com>
    Reviewed-by: Cornelia Huck <cohuck@redhat.com>
    Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
    Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
    Dong Jia Shi authored and borntraeger committed Jul 5, 2017
    1728cff
  8. s390x: vmstatify config migration for virtio-ccw

    Let's vmstatify virtio_ccw_save_config and virtio_ccw_load_config for
    flexibility (extending using subsections) and for fun.
    
    To achieve this we need to hack the config_vector, which is VirtIODevice
    (that is common virtio) state, in the middle of the VirtioCcwDevice state
    representation.  This is somewhat ugly, but we have no choice because the
    stream format needs to be preserved.
    
    Behavior is almost unchanged. The exceptions are what comes with
    vmstate: extra bookkeeping about what's in the stream, and possibly
    some extra checks and better error reporting.
    
    Signed-off-by: Halil Pasic <pasic@linux.vnet.ibm.com>
    Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
    Reviewed-by: Juan Quintela <quintela@redhat.com>
    Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
    Message-Id: <20170703213414.94298-1-pasic@linux.vnet.ibm.com>
    Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
    Halil Pasic authored and borntraeger committed Jul 5, 2017
    517ff12
  9. target/i386: add the CONFIG_TCG into Makefiles

    Make the frontend and backend files conditional on CONFIG_TCG in the
    related Makefiles.
    
    Signed-off-by: Yang Zhong <yang.zhong@intel.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    yangzhon authored and bonzini committed Jul 5, 2017
    44eff67
  10. target/i386: add the tcg_enabled() in target/i386/

    Add tcg_enabled() checks where the x86 target needs to disable
    TCG-specific code.
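
A runtime guard of this kind can be sketched as below. The names are hypothetical stand-ins (`accel_tcg_enabled` plays the role of the accelerator query, `tcg_only_flush` the role of a TCG-specific routine); the sketch only shows the shape of the check, not the actual call sites touched by the commit.

```c
#include <stdbool.h>

/* Hypothetical flag recording whether the TCG accelerator is in use. */
static bool tcg_allowed_flag;

static bool accel_tcg_enabled(void)
{
    return tcg_allowed_flag;
}

/* Counts calls so the effect of the guard is observable. */
static int flush_count;

/* Hypothetical TCG-only work, e.g. discarding translated code. */
static void tcg_only_flush(void)
{
    flush_count++;
}

/* Shared code path: TCG-specific work runs only when TCG is active. */
static void cpu_state_changed(void)
{
    if (accel_tcg_enabled()) {
        tcg_only_flush();
    }
}
```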
    
    Signed-off-by: Yang Zhong <yang.zhong@intel.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    yangzhon authored and bonzini committed Jul 5, 2017
    79c664f