AGMS VTune Hotspot Analysis #19

Closed
beepy0 opened this issue Apr 18, 2019 · 1 comment

beepy0 commented Apr 18, 2019

No description provided.

@beepy0 beepy0 self-assigned this Apr 18, 2019
@beepy0 beepy0 mentioned this issue Apr 18, 2019
7 tasks
@beepy0 beepy0 changed the title VTune Hotspot Analysis AGMS VTune Hotspot Analysis Apr 18, 2019
@beepy0 beepy0 added this to the Benchmark/Optimization milestone Jun 18, 2019

beepy0 commented Jun 27, 2019

For this test we consider data samples drawn from Zipfian, Uniform, and Discrete Normal distributions. Each distribution was sampled at two sizes: one hundred thousand and one million integers. The two sizes serve two purposes: to check whether the algorithm's behavior changes when it is left to run longer, and to give VTune enough runtime for more precise analytics, since in some cases the 100k sample did not run long enough for the profiler to gather sufficient data.

For the update vector size we used 16384 tuples, which corresponds to 128 rows and 128 buckets. It is important to note that while AGMS' throughput depends heavily on the size of its update vector, it scales linearly with the bucket count alone; for Fast-AGMS, throughput depends on whether the bucket count is a power of 2, because a truncation function is used for its bucket hashing. That function serves as a substitute for the modulo operation, which in our testing proved more compute-heavy [REF].

Looking at the generated profiles, it is apparent that each run yielded broadly similar results. The CPU utilization graph in VTune points toward a CPU bottleneck for AGMS: the whole graph was brown, indicating that the CPU was busy computing for the entire duration of the run.

[PIC]

For a more detailed analysis, we first take a quick look at the Summary tab. For the Zipfian distribution with 100k samples, total execution time was 6.413s, and the Top Hotspots section listed the functions that consumed the most compute time: AGMS_Sketch::Update_Sketch, seq_xor, and EH3.

[PIC]

Xi_EH3::element also takes some time, but in this case it is negligible, as the function is mostly used to call EH3, as shown in [REF: SNIPPET].

[SNIPPET]

As noted above, the other distributions showed very similar results, with execution times depending on the sample size. For this reason we show diagrams only for the Zipfian distribution in this analysis; if any other distribution deviates significantly, that will be noted and shown separately.

Above the tabs there is a dropdown menu with additional overview information. Selecting the HPC Performance Characterization category reveals a vector-performance analysis. According to this view, the compiler saw the loops containing floating-point variables and decided not to vectorize them, so all three of the most time-consuming functions were left as scalar operations. This suggests a potential speed-up from SIMD optimization.

[PIC]

Switching back to the Hotspots overview from the dropdown menu, the Bottom-up and Top-down Tree tabs help build a better picture of what each function is doing. The focus here is the Microarchitecture Usage column, which hints at performance issues such as branch mispredictions or CPU stalls due to memory latency. The values are percentages, and the general rule of thumb is to avoid any cells colored red, which is the case here. This suggests that the algorithm is mainly compute-bound on this platform for the given parameters.

@beepy0 beepy0 closed this as completed Oct 15, 2019