Memory Performance on K80 + P100 + V100 #2
The constant memory benchmark involves unwanted computations (to avoid code elimination by the compiler) that potentially degrade performance, so the measurements could be affected. Regarding the shared memory benchmark, could you please try a larger workload (by increasing VECTOR_SIZE in main.cpp)? Perhaps the current setting is too small for Volta. In my experience, shared memory bandwidth can be measured quite accurately with this tool. Could you be more specific about your texture benchmark observations?

Update: Thank you for your kind comments.
Thanks for your quick response. The shared memory throughput improved after increasing VECTOR_SIZE:

P100 Shared Memory throughput
using 32bit operations : 8653.17 GB/sec (2163.29 billion accesses/sec)
using 64bit operations : 9066.01 GB/sec (1133.25 billion accesses/sec)
using 128bit operations : 9276.18 GB/sec (579.76 billion accesses/sec)
# peak bandwidth is: 9519.104 GB/s

V100 Shared Memory throughput
using 32bit operations : 9628.86 GB/sec (2407.21 billion accesses/sec)
using 64bit operations : 11495.49 GB/sec (1436.94 billion accesses/sec)
using 128bit operations : 12154.74 GB/sec (759.67 billion accesses/sec)
# peak bandwidth is: 14131.2 GB/s
# so still some gap

Regarding the texture cache performance and the constant memory: have you already tried benchmarking the latter with inline PTX?
Other than that, the throughput improvements on the new cards are quite impressive.
Perhaps it has to do with the GPU core frequency scaling. I see that you estimate peak bandwidth based on the boost frequency. Is it safe to say that this is the sustained frequency during the experiments? Maybe the issue is due to the clock ramp-up during execution. In my experiments on recent consumer GPUs (e.g. the GTX-1060), the max L1 bandwidth is similar to the max texture bandwidth (e.g. 1188.20 vs 1158.31 GB/sec on the GTX-1060) due to the unified L1/texture cache, as you already said. I'm not sure, though, that this should also apply to shared memory performance. What bandwidth do you observe using L1 and texture memory on the P100 & V100? Truth is, I haven't tried using inline PTX for constant memory. I'm afraid that if I still don't utilize the fetched data in code, the compiler will eliminate the constant memory accesses.
I would assume that the clock frequency could affect performance only by a small percentage, as the runtimes are too short for the GPU to heat up. I haven't checked whether clock throttling happened during the runs, though; I'll look into this in the next days. On the V100 the 4 texture units per SM use the L1 data cache, so I would expect the same throughput. However, gpumembench only reaches ~1/4 of the peak bandwidth at the moment.
I've observed that sometimes the problem is that the warm-up stage is not sufficient: the execution is too short, so the clock frequencies do not get a chance to stabilize at boost levels. Could you duplicate some of the kernel execution calls to ensure that you get uniform execution times? So, on the V100 you get a quarter of the bandwidth when using texture memory compared to L1/L2-cached global memory? I find this strange. I haven't read the Volta architecture details yet, as I don't have access to a Volta GPU :(
OK, clock rate and kernels are fine. Maybe the benchmark code must be updated; there are some lines which look like they were written in the days of Fermi (cudaThreadSynchronize, bindTexture, ...).
(I also checked the K80, where the results were consistent at first glance.) Unfortunately, for the const cache I could not find a metric, and I doubt
Regarding the texture cache: do you request a 128-byte row per warp? A texture cache loves 2D access patterns, so I would expect better bandwidth when a warp requests squares of memory. Maybe the tex cache implements some kind of space-filling curve, where you would lose performance if you just fetch a 1D line of memory.
Yes, texture memory is managed in the traditional way, as the benchmark had to run on at least Fermi GPUs. Each thread requests either a 32-, 64- or 128-bit element, which entails 128-, 256- or 512-byte accesses per warp. Of course the texture cache favors 2D-locality accesses, but that does not explain consistently lower performance. Could you profile the performance metrics tex_cache_transactions, l2_read_transactions & dram_read_transactions (element size: 16 bytes)? How do these compare to the GP100 results?
Finally found some time to measure the metrics :) For full measurement results see attachment. Hope it helps.
V100
P100
K40
Thanks for the data. I did investigate them, but on second thought I believe that in the Volta case it is not just the texture throughput that is reduced; the L1 throughput (~13343) might be overestimated. Can you verify that this is consistent with the values provided by tex_cache_throughput (I guess this metric also measures L1 throughput, as on Pascal GPUs, since it is a unified cache)? Next, could you test whether tex_utilization reaches Max(10) in the case where we get the highest throughput?
Ah yes, I have rerun the benchmarks. For the V100, tex cache utilization only reaches Mid(5) with ~6 TB/s at maximum (with a 100% cache hit rate), while the P100 reaches Max(10) with 2 TB/s tex cache throughput. This means Max(10) would correspond to a throughput of about 12 TB/s on the V100. It also means that the benchmark output is not consistent with the profiler values; it is more like half of the nvprof-measured values. With the V100's L1+SMem+Tex cache being unified, I calculate the peak bandwidth by:
which would almost fit the K40
P100
V100
I know it is not easy for you to guess without playing around with the actual hardware. At the moment I can only try things out by running the profilings and providing the data. I'm not sure how much code change is required to get more memory requested per thread and to make the output consistent with the profiler. Hope the measurements help :)
I can say that when using the 32-bit int type, the doubled profiled throughput is caused by the larger texture access granularity. Just see that tex_cache_hit_rate approaches 50% when accessing large arrays, e.g. for the kernel "void benchmark_func<int, bool=1, int=256, int=64, int=0>(int*)". Normally it should be 0%, so that means texture elements are physically accessed at a minimum of 64 bits per thread. So when accessing 32-bit elements, the first access performs the initial texture element fetch and the second performs a cached access, but both are accounted for in tex_cache_throughput at double the requested size. Regarding the max V100 throughput, I have the same question. I would next investigate the rest of the utilization metrics, i.e. all metrics of the form XXXX_utilization. If one of them reaches Max(10), that could point to a bottleneck.
Quick reply: on the V100 I see Max(10) for the
Are you sure about this? I see in your latest V100 results that the DRAM read transactions are just 1034.
OK, the Max values were achieved in other instances, not for `int, bool=1, int=256, int=1, int=8192`. Here, every utilization metric reported Low except the Mid of the tex utilization. I will rerun the results on the V100 with all metrics, stay tuned...
It looks like there is not enough data to fully utilize the texture cache, as I cannot see bottlenecks in the case mentioned above. Attached are all metrics measured on the V100.
Err, sorry, found it, the

Edit:
Now, that's an exhaustive profiling execution. Thanks.
Hi,
I have used your nice benchmark tool again to compare Kepler K80, Pascal P100 and Volta V100 memory bandwidths.
I will look into it, but maybe you already know the reasons, so we can discuss possible benchmark changes here.
Selected Peak Comparisons