AGMS VTune Hotspot Analysis #19
For this test we consider samples of data drawn from Zipfian, Uniform, and Discrete Normal distributions respectively. Each distribution was sampled at two sizes - one hundred thousand and one million integers. This is done partly to monitor whether any changes in behavior occur when the algorithm is left to run longer, while the larger sample size also enables more precise analytics via VTune, since in some cases the 100k sample did not give the program enough time to gather sufficient profiling data. For the update vector we used a size of 16384 tuples, which corresponds to 128 rows and 128 buckets respectively. It is important to note that while AGMS' throughput depends heavily on the size of its update vector, it scales linearly with the bucket count alone, whereas Fast-AGMS throughput depends on whether the bucket count is a power of 2, due to a truncation function used for its bucket hashing. That function is used as a substitute for the modulo operation, which in our testing proved to be more compute-heavy [REF].

Looking at the generated profiles, it is apparent that each run yielded fairly similar results. The CPU utilization graph in VTune points towards a CPU bottleneck for AGMS, as the whole graph was brown, indicating that the CPU was computing the whole time the program was running.

[PIC]

For a more detailed analysis, we first take a quick look at the Summary tab. For the Zipfian distribution with 100k samples, total execution time was 6.413s, and the Top Hotspots section showed the parts of the code that took the most compute time - the functions

[PIC]
[SNIPPET]

As pointed out previously, the rest of the distributions showed very similar results, with execution times dependent on the data sample size. For this reason we show diagrams only for the Zipfian distribution in this analysis; in case of major deviation in any of the other distributions, it will be noted and shown additionally.

Above the tab menu there is a dropdown with additional overview information. We select the HPC Performance Characterization category to reveal a vector-performance analysis. According to this view, the compiler saw the loops containing floating-point variables and decided not to vectorize them, so all three of the most time-consuming functions were left as scalar operations. This suggests there is a potential speed-up to be gained from optimizing via SIMD.

Switching back to the Hotspots overview from the dropdown menu, the Bottom-up and Top-down Tree tabs are useful for getting a better picture of what each of the functions is doing. The focus here is the Microarchitecture Usage column, which can hint at performance issues such as branch mispredictions or CPU stalls due to memory latency. The values are given as percentages, and the general rule of thumb is that no cell should be colored red, which is the case here. This confirms that the algorithm is mainly compute-bound on this platform for the given parameters.
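To make the vectorization finding concrete, the following is a hypothetical sketch (not the project's actual code) of the kind of scalar counter-update loop an AGMS-style sketch performs; the function `agms_update` and the hash constant are assumptions for illustration. The hash-dependent bucket index and sign produce a scattered, data-dependent store, which is the sort of pattern compilers often refuse to auto-vectorize without explicit SIMD work.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative scalar update loop: each of the `rows` rows adds +1 or -1
// to one hash-selected counter. The indirect index `b` and the
// hash-derived sign make the loop hard for the compiler to vectorize.
void agms_update(std::vector<double>& counters, std::size_t rows,
                 std::size_t buckets, uint32_t item) {
    for (std::size_t r = 0; r < rows; ++r) {
        // Knuth-style multiplicative hash, seeded per row (illustrative only).
        uint32_t h = 2654435761u * (item + static_cast<uint32_t>(r));
        std::size_t b = h & (buckets - 1);            // power-of-two truncation
        double sign = (h & 0x80000000u) ? -1.0 : 1.0; // hash-derived +/-1
        counters[r * buckets + b] += sign;            // scattered scalar store
    }
}
```

Restructuring such a loop so the stores become contiguous, or using intrinsics/gather-scatter SIMD, would be one way to chase the speed-up the HPC Performance Characterization view suggests.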