add faster JVM baseline for MMM #1

Closed
wants to merge 1 commit into base: master

Conversation

richardstartin commented Dec 31, 2017

Regarding the JVM flags: in 3.4 (Experimental Setup) > Evaluation (page 7) you state that the forked JVM instance has -XX:CompileThreshold=100, but you don't actually set it in the benchmarks. The second JVM argument (-XX:-TieredCompilation) ensures you use C2; without it I saw tiered compilation happening during the measurements.

I also provide an example of a matrix multiplication implementation which performs about 6x better than the blocked algorithm used as a baseline, and better than the LMS generated code when the pair of matrices cannot reside fully in L3 cache. The implementation is slightly convoluted because it seems that the superword vectoriser in HotSpot bails out when you evaluate SAXPY for offset slices of an array (this might be a bug, and if it is, it's still present in JDK9 and JDK10). Copying the slices into buffers enables the SAXPY core loop to vectorise, yielding better throughput. In later JDK versions, this code uses the widest register width available and matches the throughput of LMS generated code. It's impossible to say how much LMS would profit from a JDK upgrade without upgrading scala-lang.virtualized.* to 2.12.*, however.
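For reference, here is a minimal sketch of the buffer-copy idea (illustrative only; names and layout are simplified and this is not the exact code in this commit):

public static void fastBuffered(float[] a, float[] b, float[] c, int n) {
    // Copying each row of B (and the accumulator row of C) into local buffers lets the
    // SAXPY core loop start at offset 0, which is what HotSpot's SLP vectoriser handles.
    float[] bBuffer = new float[n];
    float[] cBuffer = new float[n];
    int in = 0;
    for (int i = 0; i < n; ++i) {
        System.arraycopy(c, in, cBuffer, 0, n);     // accumulate row i of C in a buffer
        int kn = 0;
        for (int k = 0; k < n; ++k) {
            float aik = a[in + k];
            System.arraycopy(b, kn, bBuffer, 0, n); // copy row k of B into a zero-offset buffer
            for (int j = 0; j < n; ++j) {           // SAXPY core loop, eligible for vectorisation
                cBuffer[j] += aik * bBuffer[j];
            }
            kn += n;
        }
        System.arraycopy(cBuffer, 0, c, in, n);     // write the accumulated row back
        in += n;
    }
}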

====================================================
Benchmarking MMM.jMMM.fast (JVM implementation)
----------------------------------------------------
    Size (N) | Flops / Cycle
----------------------------------------------------
           8 | 0.2500390686
          32 | 0.7710872405
          64 | 1.1302489072
         128 | 2.5113453810
         192 | 2.9525859816
         256 | 3.1180920385
         320 | 3.1081563593
         384 | 3.1458423577
         448 | 3.0493148252
         512 | 3.0551158263
         576 | 3.1430376938
         640 | 3.2169923048
         704 | 3.1026513283
         768 | 2.4190053777
         832 | 3.3358586705
         896 | 3.0755689237
         960 | 2.9996690697
        1024 | 2.2935654309
====================================================

====================================================
Benchmarking MMM.nMMM.blocked (LMS generated)
----------------------------------------------------
    Size (N) | Flops / Cycle
----------------------------------------------------
           8 | 1.0001562744
          32 | 5.3330416826
          64 | 5.8180867784
         128 | 5.1717318641
         192 | 5.1639907462
         256 | 4.3418618628
         320 | 5.2536572701
         384 | 4.0801359215
         448 | 4.1337007093
         512 | 3.2678160754
         576 | 3.7973028890
         640 | 3.3557513664
         704 | 4.0103133240
         768 | 3.4188362575
         832 | 3.2189488327
         896 | 3.2316685219
         960 | 2.9985655539
        1024 | 1.7750946796
====================================================

====================================================
Benchmarking MMM.jMMM.blocked (JVM implementation)
----------------------------------------------------
    Size (N) | Flops / Cycle
----------------------------------------------------
           8 | 0.5000781372
          32 | 0.6400028000
          64 | 0.5682568807
         128 | 0.5996199747
         192 | 0.6239961323
         256 | 0.5864838394
         320 | 0.5991560286
         384 | 0.5701005084
         448 | 0.5740271299
         512 | 0.5466428270
         576 | 0.5493065307
         640 | 0.5538750175
         704 | 0.5577268857
         768 | 0.5512993033
         832 | 0.5565532937
         896 | 0.5406809266
         960 | 0.5047555546
        1024 | 0.4042536740
====================================================

====================================================
Benchmarking MMM.jMMM.baseline (JVM implementation)
----------------------------------------------------
    Size (N) | Flops / Cycle
----------------------------------------------------
           8 | 0.3333854248
          32 | 0.5333379167
          64 | 0.6984999135
         128 | 0.6008512163
         192 | 0.5607890505
         256 | 0.5088439822
         320 | 0.5295694027
         384 | 0.4928785458
         448 | 0.5096290685
         512 | 0.3481941156
         576 | 0.5171241182
         640 | 0.4276359551
         704 | 0.4904493737
         768 | 0.4035653402
         832 | 0.4496286979
         896 | 0.3901097359
         960 | 0.4263629702
        1024 | 0.1689477908
====================================================
astojanov (Owner) commented Jan 10, 2018

Let's start with the easy arguments.

-XX:CompileThreshold=100 should be there, which is a great catch. I am willing to include -XX:-TieredCompilation if that improves the results even further. In fact, I will include both in the codebase, although I have to note that this did not cause any discrepancies in the results published in the paper.
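For concreteness, a hypothetical JMH-style sketch (not the artefact's actual harness) of how both flags could be passed to every forked JVM:

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;

// Hypothetical benchmark class used only to illustrate the flags; the real
// benchmarks in this artefact use their own harness.
@Fork(value = 1, jvmArgsAppend = {"-XX:CompileThreshold=100", "-XX:-TieredCompilation"})
public class MMMFlagsExample {

    @Benchmark
    public double mmmPlaceholder() {
        // placeholder body; the real benchmark would invoke the MMM kernel under test
        return Math.sqrt(42.0);
    }
}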

At one point you mention:

In later JDK versions, this code uses the widest register width available and matches the throughput of LMS generated code. It's impossible to say how much LMS would profit from a JDK upgrade without upgrading scala-lang.virtualized.* to 2.12.*, however.

The assumption here is that LMS will not work if the JDK is updated. That's true, assuming that we continue using scala-lang.virtualized. However, LMS has another branch, called macrovirt, that uses Scala macros and is not implemented on a forked compiler. I believe this was already mentioned in a previous tweet (mirror) by Tiark. It is in fact my personal choice to use scala-lang.virtualized, and the current codebase can be adjusted to the new branch if necessary.

Concerning the results, you are not disclosing what CPU you use. Looking at the data, the results are equivalent to the ones in your blog post (as you tend to change content and delete tweets, I took the liberty of taking a snapshot and creating a mirror on the Web Archive). On your blog you mention that you use a Skylake machine at 2.6 GHz, without stating the model. As you are exceeding 5 F/C on some test cases with LMS (which is great), I need to ask you to verify that the artefact instructions were followed to the letter and that Turbo Boost was disabled while testing; otherwise the F/C calculation might be incorrect.
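For reference, the calculation at issue, as a rough sketch that assumes the usual 2 * N^3 flop count for MMM (an assumption for illustration, not taken from the artefact):

// If Turbo Boost is enabled, the nominal frequency underestimates the real clock,
// so cycles are undercounted and the reported F/C figure is inflated.
static double flopsPerCycle(int n, double elapsedSeconds, double nominalFrequencyHz) {
    double flops = 2.0 * n * n * n;                       // one multiply + one add per inner iteration
    double cycles = elapsedSeconds * nominalFrequencyHz;  // cycle count assuming the nominal clock
    return flops / cycles;
}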

Some flaws in your analysis

Running your new implementation on the same setup as explained in the paper and using HotSpot 64-Bit Server 25.144-b01, I obtain the following:

====================================================
Benchmarking MMM.jMMM.fast (JVM implementation)
----------------------------------------------------
    Size (N) | Flops / Cycle
----------------------------------------------------
           8 | 0.2025548817
          32 | 0.9505490546
          64 | 1.5440060332
         128 | 1.9360151691
         192 | 2.2807233488
         256 | 2.4275490009
         320 | 2.3886886092
         384 | 2.4544359612
         448 | 2.4197389678
         512 | 2.4484295661
         576 | 2.6239068039
         640 | 2.6194253099
         704 | 2.6164823769
         768 | 2.5033714620
         832 | 2.5713442497
         896 | 2.6562038198
         960 | 2.5909380888
        1024 | 2.2196145678
        1536 | 2.0560739099
        2048 | 1.4863463918
        2560 | 1.5392511027
        3072 | 1.3746071575
        3584 | 1.4390068283
        4096 | 1.2191563044
====================================================

You mention:

I also provide an example of a matrix multiplication implementation which performs about 6x better than the blocked algorithm used as a baseline, and better than the LMS generated code when the pair of matrices cannot reside fully in L3 cache.

Furthermore on your blog:

I provide a simple and cache-efficient Java implementation (with the same asymptotic complexity, the improvement is just technical) and benchmark these implementations using JDK8 and the soon to be released JDK10 separately.

This could in principle be further verified and validated, although the theory is not on your side.
You attempt to solve the MMM problem using SAXPY, completely disregarding the memory hierarchy. In fact, your code is not cache efficient: it is only cache efficient as long as bBuffer and cBuffer fit inside the cache. The moment they become too big, you are out of luck. Look at the results above: the moment bBuffer and cBuffer reach 4096 elements (2 x 16 KB = 32 KB) and barely fit the L1 cache of the Intel Xeon CPU E3-1285L v3 @ 3.10 GHz, you see a performance drop of 2x. This clearly debunks the claim of a cache-efficient algorithm.

But none of that is in fact important. Let's assume for a second that your implementation is capable of matching this particular LMS code that does MMM. Now, we can argue whether this is the fastest code for MMM that we can develop with LMS. In fact, looking at it closely, we can conclude that this code is not utilizing the memory hierarchy to the fullest. We can further improve it with all the lessons learned from "Is Search Really Necessary to Generate High-Performance BLAS", even reach the theoretical maximum with proper vectorization, and prove that we have reached that maximum.

See the difference? The lms-intrinsics approach allows me to do so. The JIT, on the other hand, limits me in achieving the best performance of a particular machine, because I have to write the code in a specific way in order to jump-start SLP. While it might be that your algorithm is the best you can get using JDK8, it is definitely not the best that native LMS-generated code can do on that machine. (Can you imagine what could happen if we chose other benchmarks for numerical computations and used LGen to generate code?)

Let's go into the essence of the argument

So what can this pull request contribute to this artefact? Would you like to make the point that the JVM can be faster if you write code in the form of hacks? By writing code that the JVM can digest better? If that is the point, then I take it. After all, the task in this paper is not to compete with the JVM.

See, the idea of using lms-intrinsics is to give you the opportunity to think about platform specifics, ignoring the JIT implementation, while keeping all the high-level features of Java / Scala. The approach of Java development, on the other hand, is (I am quoting you from the blog):

traditional JVM approach of pairing dumb programmers with a (hopefully) smart JIT compiler

In that sense, this paper and these benchmarks show two different approaches: take control and do everything yourself, vs. leave everything to the JIT. Your approach goes in a completely different direction. You hack and think about everything. You hack the JVM, because you are not a dumb programmer and you try to stay ahead of the JIT. You are not only a domain and architecture expert, but also a JVM and JIT expert for a particular version of the product, capable of understanding what the JIT will do if you write slightly different code to jump-start optimizations, and of observing and analyzing performance across different JVM implementations. I am not sure whether this is expected from every JVM developer, or whether it is in fact in the spirit of Java development.

To conclude ...

Nevertheless, you do achieve valid results and I am willing to include them in the artefact. They do in fact show the lengths that JVM developers are willing to go to in order to achieve performance. As the camera-ready deadline for the CGO paper has already passed, I doubt that I can include this benchmark and its accompanying analysis in the paper. But I could discuss the lessons learned during CGO.

But before I do so, and because you have demonstrated unlimited willpower and incentive to test this artefact, would you be willing to take a shot at implementing a tiled version of MMM that kick-starts SLP in JDK8? That would be a great benchmark to have, if you can actually achieve it. It would also be a great insight, in case you can't kick-start SLP, to understand why that doesn't happen. You would also have a chance to fix the false claims in your blog post.

richardstartin commented Jan 10, 2018

The higher-performance implementation isn't cache efficient, and the blog post never claims so; it uses unnecessary memory to work around a compiler bug. The (scalar) implementation I refer to as "cache efficient" is actually twice as fast as your best Java implementation, according to my JMH-based measurements.

This PR is just here to demonstrate two things:

  • you make a fraudulent claim in your paper about your experimental set up.
  • you obviously didn't try very hard to find better Java baselines.

I couldn't care less whether you merge this or not.

astojanov (Owner) commented Jan 11, 2018

Allow me to refresh your memory. This is a snapshot of your blog post, taken on the 2nd of January.

I provide a simple and cache-efficient Java implementation (with the same asymptotic complexity, the improvement is just technical) and benchmark these implementations using JDK8 and the soon to be released JDK10 separately.

This argument about cache inefficiency refers to both algorithms, as neither of them is cache efficient: the one with the buffers, as well as this one here:

public void fast(float[] a, float[] b, float[] c, int n) {
    // Row-major i-k-j loop order: the innermost loop is a SAXPY over row k of B
    // accumulated into row i of C.
    int in = 0;
    for (int i = 0; i < n; ++i) {
        int kn = 0;
        for (int k = 0; k < n; ++k) {
            float aik = a[in + k];
            for (int j = 0; j < n; ++j) {
                c[in + j] += aik * b[kn + j];
            }
            kn += n;
        }
        in += n;
    }
}

You do understand that, right? So you did claim to have a cache-efficient algorithm! I do acknowledge that it is fast; my measurements in fact show 2x or more in some cases.

Let me quote myself again from the previous reply:

although I have to comment that this did not cause any discrepancies in the results published in the paper.

I rest my case concerning that argument.

I like this point, "you obviously didn't try very hard to find better Java baselines", and I don't agree with it. The simplest triple loop is a valid MMM baseline. It is the worst one, but it is a valid one. We are very transparent about what kind of MMM algorithm we benchmark. Now, I have the option to use an MMM library written in Java, or to use copying into buffers; I could even write a generator that generates Java code which automatically tiles the matrix with different tiling decisions, packs the inner loop into a separate function so that SLP kicks in, and on top of that has a feedback loop that measures the timing and picks the fastest variant by trying all tiling decisions. But why would I do that?
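For concreteness, a hand-written sketch of the kind of tiled code such a generator might emit (illustrative only, assuming n is a multiple of the tile size, and with no guarantee that SLP actually kicks in):

public final class TiledMMM {

    private static final int TILE = 8; // tile size chosen to match the block size mentioned below

    public static void multiply(float[] a, float[] b, float[] c, int n) {
        // Walk the matrices tile by tile; n is assumed to be a multiple of TILE.
        for (int ii = 0; ii < n; ii += TILE) {
            for (int kk = 0; kk < n; kk += TILE) {
                for (int jj = 0; jj < n; jj += TILE) {
                    tile(a, b, c, n, ii, kk, jj);
                }
            }
        }
    }

    // Multiply one TILE x TILE tile: C[ii.., jj..] += A[ii.., kk..] * B[kk.., jj..].
    // Keeping this in its own small method is the "packed inner loop" idea from above.
    private static void tile(float[] a, float[] b, float[] c, int n, int ii, int kk, int jj) {
        for (int i = ii; i < ii + TILE; ++i) {
            for (int k = kk; k < kk + TILE; ++k) {
                float aik = a[i * n + k];
                for (int j = jj; j < jj + TILE; ++j) {
                    c[i * n + j] += aik * b[k * n + j];
                }
            }
        }
    }
}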

Picking the fastest Java MMM algorithm (or, as you call it, a baseline) does not really make the point of this paper. And the point is (again, I am telling you this) explicit vs. implicit vectorization: control vs. letting the JIT do everything by itself. In fact, the blocked version and the NGen version are almost equivalent. Both have the same block size: 8. The only difference is that in NGen I shuffle and vectorize, and in the Java blocked version I let the JIT do the job. This goes the other way around as well: if you claim that I did not put much effort into the Java baseline, then I have really not made an effort to produce the fastest LMS version either.

So no, it's not that I did not bother to try a better baseline; quite the contrary, I believe I provided exactly what I needed to make the point.

richardstartin commented Jan 11, 2018

For the record, there are two implementations in the post. I refer to the implementation above (fast) as cache efficient, and I think this is a reasonable claim. The faster implementation sacrifices cache efficiency; I apologise if that isn't clear enough, but please don't fixate on it, it's not the point.

The point is, if most Java programmers look at your paper and see that <1 F/C was the best you could do, it might look as if you had an agenda. I'm offering you a real baseline so you can quantify the real performance trade-off of giving up security, safety and GC stability (hint: it's comparable to what you could get from just upgrading Java).
