Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detection of vtable stubs #890

Closed
qwwdfsad opened this issue Feb 11, 2024 · 5 comments
Closed

Detection of vtable stubs #890

qwwdfsad opened this issue Feb 11, 2024 · 5 comments

Comments

@qwwdfsad
Copy link
Contributor

qwwdfsad commented Feb 11, 2024

Setup

Consider the following benchmark, where two methods are supposedly identical:

@Warmup(iterations = 7, time = 1)  
@Measurement(iterations = 5, time = 1)  
@BenchmarkMode(Mode.AverageTime)  
@OutputTimeUnit(TimeUnit.NANOSECONDS)  
@State(Scope.Benchmark)  
@Fork(1)//jvm = "/Users/qwwdfsad/workspace/jdk17u-dev/build/macosx-aarch64-server-release/images/jdk/bin/java")  
open class PerformancePuzzler {  
  
    private val states = arrayOf(emptyList(), listOf(""), listOf("", ""))  

    @Benchmark  
    fun filter(bh: Blackhole) {  
        repeat(100) {  
            for (state in states) {  
                // filter is Kotlin 'inline' function that is inlined on a bytecode level
                bh.consume(state.filter { true })  
            }  
        }  
    }  
  
    @Benchmark  
    fun manualFilter(bh: Blackhole) {  
        repeat(100) {  
            for (state in states) {  
                val destination = ArrayList<String>()  
                for (i in state) {  
                    if (true) {  
                        destination.add(i)  
                    }  
                }  
                bh.consume(destination)  
            }  
        }  
    }  
}

On my machine, it yields the following result:

Benchmark                        Mode  Cnt     Score    Error  Units
PerformancePuzzler.filter        avgt    5  3803.760 ± 56.948  ns/op
PerformancePuzzler.manualFilter  avgt    5  3011.223 ± 54.237  ns/op

Surprisingly, their score differs by 25%.
Yet, if profiled with async profiler (even if ran with -XX:+DebugNonSafepoints), the resulting flamegraphs are almost totally identical

Flamegraphs image image

Root cause

The two methods, in fact, differ in a few extra CHECKCAST instructions that upcast the list type and prevent Hotspot from devirtualizing Iterator methods.
According to dtraceasm, the extra 25% of time is spent in unknown section that is, from the look on it, just a vtable stub.

Bytecode
  public final filter(Lorg/openjdk/jmh/infra/Blackhole;)V
@Lorg/openjdk/jmh/annotations/Benchmark;()
  // annotable parameter count: 1 (invisible)
  @Lorg/jetbrains/annotations/NotNull;() // invisible, parameter 0
 L0
  ALOAD 1
  LDC "bh"
  INVOKESTATIC kotlin/jvm/internal/Intrinsics.checkNotNullParameter (Ljava/lang/Object;Ljava/lang/String;)V
 L1
  LINENUMBER 19 L1
  BIPUSH 100
  ISTORE 2
  ICONST_0
  ISTORE 3
 L2
 FRAME APPEND [I I]
  ILOAD 3
  ILOAD 2
  IF_ICMPGE L3
  ILOAD 3
  ISTORE 4
 L4
  ICONST_0
  ISTORE 5
 L5
  LINENUMBER 20 L5
  ALOAD 0
  GETFIELD benchmarks/PerformancePuzzler.states : [Ljava/util/List;
  ASTORE 6
  ALOAD 6
  ARRAYLENGTH
  ISTORE 7
  ICONST_0
  ISTORE 8
 L6
  LINENUMBER 20 L6
 FRAME FULL [benchmarks/PerformancePuzzler org/openjdk/jmh/infra/Blackhole I I I I [Ljava/util/List; I I] []
  ILOAD 8
  ILOAD 7
  IF_ICMPGE L7
  ALOAD 6
  ILOAD 8
  AALOAD
  ASTORE 9
 L8
  LINENUMBER 21 L8
  ALOAD 1
  ALOAD 9
  CHECKCAST java/lang/Iterable
  ASTORE 10
  ASTORE 11
 L9
  ICONST_0
  ISTORE 12
 L10
  LINENUMBER 45 L10
  ALOAD 10
  ASTORE 13
  NEW java/util/ArrayList
  DUP
  INVOKESPECIAL java/util/ArrayList.<init> ()V
  CHECKCAST java/util/Collection
  ASTORE 14
 L11
  ICONST_0
  ISTORE 15
 L12
  LINENUMBER 46 L12
  ALOAD 13
  INVOKEINTERFACE java/lang/Iterable.iterator ()Ljava/util/Iterator; (itf)
  ASTORE 16
 L13
 FRAME FULL [benchmarks/PerformancePuzzler org/openjdk/jmh/infra/Blackhole I I I I [Ljava/util/List; I I java/util/List java/lang/Iterable org/openjdk/jmh/infra/Blackhole I java/lang/Iterable java/util/Collection I java/util/Iterator] []
  ALOAD 16
  INVOKEINTERFACE java/util/Iterator.hasNext ()Z (itf)
  IFEQ L14
  ALOAD 16
  INVOKEINTERFACE java/util/Iterator.next ()Ljava/lang/Object; (itf)
  ASTORE 17
 L15
  ALOAD 17
  CHECKCAST java/lang/String
  ASTORE 18
 L16
  ICONST_0
  ISTORE 19
 L17
  LINENUMBER 21 L17
  ICONST_1
 L18
  LINENUMBER 46 L18
  IFEQ L13
  ALOAD 14
  ALOAD 17
  INVOKEINTERFACE java/util/Collection.add (Ljava/lang/Object;)Z (itf)
  POP
 L19
  GOTO L13
 L14
  LINENUMBER 47 L14
 FRAME SAME
  ALOAD 14
 L20
  CHECKCAST java/util/List
 L21
  LINENUMBER 45 L21
  NOP
 L22
  ASTORE 20
  ALOAD 11
  ALOAD 20
 L23
  LINENUMBER 21 L23
  INVOKEVIRTUAL org/openjdk/jmh/infra/Blackhole.consume (Ljava/lang/Object;)V
 L24
  LINENUMBER 20 L24
  IINC 8 1
 L25
  GOTO L6
 L7
  LINENUMBER 23 L7
 FRAME FULL [benchmarks/PerformancePuzzler org/openjdk/jmh/infra/Blackhole I I I I [Ljava/util/List; I I] []
  NOP
 L26
  LINENUMBER 19 L26
  IINC 3 1
  GOTO L2
 L3
  LINENUMBER 24 L3
 FRAME FULL [benchmarks/PerformancePuzzler org/openjdk/jmh/infra/Blackhole I I] []
  RETURN
 L27
  LOCALVARIABLE it Ljava/lang/String; L16 L18 18
  LOCALVARIABLE $i$a$-filter-PerformancePuzzler$filter$1$1 I L17 L18 19
  LOCALVARIABLE element$iv$iv Ljava/lang/Object; L15 L19 17
  LOCALVARIABLE $i$f$filterTo I L12 L20 15
  LOCALVARIABLE $this$filterTo$iv$iv Ljava/lang/Iterable; L11 L20 13
  LOCALVARIABLE destination$iv$iv Ljava/util/Collection; L11 L20 14
  LOCALVARIABLE $i$f$filter I L10 L22 12
  LOCALVARIABLE $this$filter$iv Ljava/lang/Iterable; L9 L22 10
  LOCALVARIABLE state Ljava/util/List; L8 L25 9
  LOCALVARIABLE it I L4 L26 4
  LOCALVARIABLE $i$a$-repeat-PerformancePuzzler$filter$1 I L5 L26 5
  LOCALVARIABLE this Lbenchmarks/PerformancePuzzler; L0 L27 0
  LOCALVARIABLE bh Lorg/openjdk/jmh/infra/Blackhole; L0 L27 1
  MAXSTACK = 2
  MAXLOCALS = 21

// access flags 0x11
public final manualFilter(Lorg/openjdk/jmh/infra/Blackhole;)V
@Lorg/openjdk/jmh/annotations/Benchmark;()
  // annotable parameter count: 1 (invisible)
  @Lorg/jetbrains/annotations/NotNull;() // invisible, parameter 0
 L0
  ALOAD 1
  LDC "bh"
  INVOKESTATIC kotlin/jvm/internal/Intrinsics.checkNotNullParameter (Ljava/lang/Object;Ljava/lang/String;)V
 L1
  LINENUMBER 28 L1
  BIPUSH 100
  ISTORE 2
  ICONST_0
  ISTORE 3
 L2
 FRAME APPEND [I I]
  ILOAD 3
  ILOAD 2
  IF_ICMPGE L3
  ILOAD 3
  ISTORE 4
 L4
  ICONST_0
  ISTORE 5
 L5
  LINENUMBER 29 L5
  ALOAD 0
  GETFIELD benchmarks/PerformancePuzzler.states : [Ljava/util/List;
  ASTORE 6
  ALOAD 6
  ARRAYLENGTH
  ISTORE 7
  ICONST_0
  ISTORE 8
 L6
  LINENUMBER 29 L6
 FRAME FULL [benchmarks/PerformancePuzzler org/openjdk/jmh/infra/Blackhole I I I I [Ljava/util/List; I I] []
  ILOAD 8
  ILOAD 7
  IF_ICMPGE L7
  ALOAD 6
  ILOAD 8
  AALOAD
  ASTORE 9
 L8
  LINENUMBER 30 L8
  NEW java/util/ArrayList
  DUP
  INVOKESPECIAL java/util/ArrayList.<init> ()V
  ASTORE 10
 L9
  LINENUMBER 31 L9
  ALOAD 9
  INVOKEINTERFACE java/util/List.iterator ()Ljava/util/Iterator; (itf)
  ASTORE 11
 L10
  LINENUMBER 31 L10
 FRAME APPEND [java/util/List java/util/ArrayList java/util/Iterator]
  ALOAD 11
  INVOKEINTERFACE java/util/Iterator.hasNext ()Z (itf)
  IFEQ L11
  ALOAD 11
  INVOKEINTERFACE java/util/Iterator.next ()Ljava/lang/Object; (itf)
  CHECKCAST java/lang/String
  ASTORE 12
 L12
  LINENUMBER 32 L12
  NOP
 L13
  LINENUMBER 33 L13
  ALOAD 10
  ALOAD 12
  INVOKEVIRTUAL java/util/ArrayList.add (Ljava/lang/Object;)Z
  POP
 L14
  LINENUMBER 31 L14
  GOTO L10
 L11
  LINENUMBER 36 L11
 FRAME SAME
  ALOAD 1
  ALOAD 10
  INVOKEVIRTUAL org/openjdk/jmh/infra/Blackhole.consume (Ljava/lang/Object;)V
 L15
  LINENUMBER 29 L15
  IINC 8 1
 L16
  GOTO L6
 L7
  LINENUMBER 38 L7
 FRAME CHOP 3
  NOP
 L17
  LINENUMBER 28 L17
  IINC 3 1
  GOTO L2
 L3
  LINENUMBER 39 L3
 FRAME FULL [benchmarks/PerformancePuzzler org/openjdk/jmh/infra/Blackhole I I] []
  RETURN
 L18
  LOCALVARIABLE i Ljava/lang/String; L12 L14 12
  LOCALVARIABLE destination Ljava/util/ArrayList; L9 L15 10
  LOCALVARIABLE state Ljava/util/List; L8 L16 9
  LOCALVARIABLE it I L4 L17 4
  LOCALVARIABLE $i$a$-repeat-PerformancePuzzler$manualFilter$1 I L5 L17 5
  LOCALVARIABLE this Lbenchmarks/PerformancePuzzler; L0 L18 0
  LOCALVARIABLE bh Lorg/openjdk/jmh/infra/Blackhole; L0 L18 1
  MAXSTACK = 2
  MAXLOCALS = 13

// access flags 0x1
public <init>()V
 L0
  LINENUMBER 14 L0
  ALOAD 0
  INVOKESPECIAL java/lang/Object.<init> ()V
 L1
  LINENUMBER 16 L1
  ALOAD 0
  ICONST_3
  ANEWARRAY java/util/List
  DUP
  ICONST_0
  INVOKESTATIC kotlin/collections/CollectionsKt.emptyList ()Ljava/util/List;
  AASTORE
  DUP
  ICONST_1
  LDC ""
  INVOKESTATIC kotlin/collections/CollectionsKt.listOf (Ljava/lang/Object;)Ljava/util/List;
  AASTORE
  DUP
  ICONST_2
  ICONST_2
  ANEWARRAY java/lang/String
  DUP
  ICONST_0
  LDC ""
  AASTORE
  DUP
  ICONST_1
  LDC ""
  AASTORE
  INVOKESTATIC kotlin/collections/CollectionsKt.listOf ([Ljava/lang/Object;)Ljava/util/List;
  AASTORE
  PUTFIELD benchmarks/PerformancePuzzler.states : [Ljava/util/List;
  RETURN
 L2
  LOCALVARIABLE this Lbenchmarks/PerformancePuzzler; L0 L2 0
  MAXSTACK = 8
  MAXLOCALS = 1
}

It would be nice if vtable stubs were detected and somehow reported in the resulting profile along with the methods that are dispatched, so the root cause of such a difference would be obvious.
Also, while this benchmark is very specific, unfortunately, the problem itself is quite widespread [for Kotlin codebases], and the benchmark is a boiled-down hot path from kotlinc.

Additional details

OS/CPU: Mac OS X Sonoma 14.2.3 M3
Async-profiler version: 3.0
JDK version: OpenJDK Runtime Environment Corretto-17.0.7.7.1

@apangin
Copy link
Collaborator

apangin commented Feb 12, 2024

Async-profiler already shows vtable stubs, so I'm not completely sure what this request is about.
Could you please provide an equivalent Java test case for me to double check this?

@qwwdfsad
Copy link
Contributor Author

qwwdfsad commented Feb 12, 2024

It took a while.

Here is the benchmark:

@Warmup(iterations = 7, time = 1)
@Measurement(iterations = 5, time = 1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@Fork(value = 1)
public class JavaBenchmark {
    private final List[] states = new List[]{Collections.emptyList(), Collections.singletonList(""), new ArrayList() {{
        add("");
        add("");
    }}};

    boolean almostContant = true;

    @Benchmark
    public final void filter(Blackhole bh) {
        int var2;
        int var3;
        var2 = 100;
        var3 = 0;

        while (var3 < var2) {
            int var4 = var3;
            int var5 = 0;

            List[] var6 = states;
            int var7 = 0;
            int var8 = var6.length;
            while (var7 < var8) {
                List var9 = var6[var7];
                Iterable var10 = (Iterable) (Object) var9;
                Blackhole var11 = bh;
                int var12 = 0;
                Iterable var13 = var10;
                Collection var14 = (Collection) (Object) new ArrayList();
                int var15 = 0;
                Iterator var16 = var13.iterator();
                while (var16.hasNext()) {
                    Object var17 = var16.next();
                    String var18 = (String) var17;
                    int var19 = 0;
                    if (this.isTrue()) {
                        var14.add(var17);
                        continue;
                    }
                }
                var11.consume((List) (Object) var14);
                var7++;
            }
            var3++;
        }
    }

    private boolean isTrue() {
        return true;
    }

    @Benchmark
    public final void manualFilter(Blackhole bh) {
        for (int i = 0; i < 100; ++i) {
            List[] states = this.states;
            for (int j = 0; j < states.length; ++j) {
                List state = states[j];
                ArrayList destination = new ArrayList();
                Iterator var11 = state.iterator();

                while (var11.hasNext()) {
                    String it = (String) var11.next();
                    if (isTrue()) {
                        destination.add(it);
                    }
                }

                bh.consume(destination);
            }
        }
    }
}

Their score is respectively:

Benchmark                   Mode  Cnt     Score    Error  Units
JavaBenchmark.filter        avgt    5  3711.843 ± 28.255  ns/op
JavaBenchmark.manualFilter  avgt    5  3131.606 ± 32.580  ns/op

Yet for both of them, async profiler yields almost identical profiles. I suspect the root cause is somewhere around itable stubs and checkcasts.

A few pointers that might be helpful:

  • Commenting out continue makes the loop reducible and drops the score to ~3300
  • Getting rid of forced checkcasts in the form of (Type) (Object) drops it further to the baseline

@apangin
Copy link
Collaborator

apangin commented Feb 13, 2024

Thank you for the update. I experimented with your benchmark on M1 and Graviton3 machines and observed performance difference around 10% on both. Unfortunately, it's not that big as in your case, so it's harder to make a solid conclusion. However, I can assure vtable/itable stubs are not related to the issue.

In both benchmarks, async-profiler correctly shows an interface call to List.iterator() and Iterator.hasNext() and in both cases it shows that Iterator.next() method is successfully inlined for two receivers: next is obviously not called for Collections.emptyList().

I attach sample profiles for the reference - profiles.zip

The commands I used to generate profiles:

asprof -d 5 -f filter.jfr -F vtable jps
java -cp converter.jar jfr2flame --bci filter.jfr filter.html

A few hints how to make profiles more verbose:

  1. -F vtable option records target classes of vtable/itable calls, see Show targets of vtable/itable calls #736
  2. Record profiles in JFR format and then add --bci converter option to display bytecode indices.
    Unfortunately, HotSpot cannot always map instructions of a compiled method back to bytecodes, so the indices are not always accurate, but this may still give an idea what parts of the method take most CPU time. By the way, JDK 21 is a lot better in such translation.

Async-profiler generally outlines a performance picture on a method level. --lines and --bci options can increase the profile resolution to some extent, with a caveat that the accuracy of instruction-to-bytecode mapping is limited by the quality of the debug information HotSpot maintains with the compiled methods.

In your benchmark, however, performance variance is not explained by the shape of the callstacks. It is rather a result of C2 ability to optimize code within one compilation unit. To analyze generated code of a microbenchmark, JMH with perfasm profiler would be a more appropriate choice. Instruction-level profiling is already on my list of the next big features for async-profiler, but to be honest, being a very niche feature, it's not the highest priority right now.

@qwwdfsad
Copy link
Contributor Author

Thanks for the quick response!

observed performance difference around 10% on both. Unfortunately, it's not that big as in your case

I observe the same (~10%) difference with Java 21. Also, the difference on Java 17 disappears with -XX:-LoopUnswitching

A few hints how to make profiles more verbose

Thanks for the advice, it indeed makes the profile more helpful and hints where to look at.

It is rather a result of C2 ability to optimize code within one compilation unit

Sure, it indeed seems to be a very compilation strategy-specific and does not generally affect the callstack.
I assume there is not much to do, and the issue can be closed.

However, I can assure vtable/itable stubs are not related to the issue.

We still suspect it can be. On Java 21, both methods are compiled to basically the same assembly, and the difference seems to lie in the compiled stubs, though it requires closer investigation. We'll keep pursuing it as it affects kotlinc compilation strategy; if you see any value in it, we can keep you posted here, but it's clearly an offtopic :)

@apangin
Copy link
Collaborator

apangin commented Feb 14, 2024

OK, let me close the issue for now.
Feel free to reopen if you have evidence of async-profiler not showing something you believe it should.
Thanks.

@apangin apangin closed this as completed Feb 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants