
Intermediate Results

Stefan Dietiker edited this page May 22, 2014 · 2 revisions

Performance

ref/bst_ref.c is the reference implementation. Unsurprisingly, it performs poorly for large input sizes.
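The reference algorithm is presumably the classic O(n³) dynamic program for optimal binary search trees, filling the triangular e-table diagonal by diagonal (the "e-table" naming suggests the CLRS formulation). A minimal sketch under that assumption, using integer access frequencies and no dummy keys; the function name `opt_bst_ref` is ours:

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define NMAX 64

/* Cost of an optimal BST over keys 1..n with access frequencies
 * freq[0..n-1]. e[i][j] holds the cost of the optimal subtree over
 * keys i..j; the table is filled diagonal by diagonal, i.e. by
 * increasing subtree size, as in the reference implementation. */
int opt_bst_ref(const int *freq, int n)
{
    static int e[NMAX + 2][NMAX + 1]; /* e[i][i-1] = 0: empty subtree */
    static int w[NMAX + 2][NMAX + 1]; /* w[i][j]: total frequency of keys i..j */
    memset(e, 0, sizeof e);
    memset(w, 0, sizeof w);

    for (int len = 1; len <= n; len++) {          /* diagonal = subtree size */
        for (int i = 1; i + len - 1 <= n; i++) {
            int j = i + len - 1;
            w[i][j] = w[i][j - 1] + freq[j - 1];
            e[i][j] = INT_MAX;
            for (int r = i; r <= j; r++) {        /* try every root r */
                int c = e[i][r - 1] + e[r + 1][j] + w[i][j];
                if (c < e[i][j])
                    e[i][j] = c;
            }
        }
    }
    return e[1][n];
}
```

Note that `e[r + 1][j]` in the innermost loop walks down a column of the table; this is exactly the access pattern the variants below attack.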

opt/bst_001_transposed.c is our first optimization. It mirrors the triangular table (storing every value in the e-table twice) so that columns in the table become rows.
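The idea can be sketched as follows: the column read `e[r + 1][j]` becomes the row read `et[j][r + 1]` if every value is also stored at its mirrored position. The empty-subtree cases must then be special-cased, because the lower triangle no longer holds zeros. A sketch under the same assumptions as above (function name is ours):

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define NMAX 64

/* Same DP as the reference, but every value is stored twice: at
 * et[i][j] and at the mirrored position et[j][i], so the column
 * access e[r+1][j] turns into the contiguous row access et[j][r+1]. */
int opt_bst_transposed(const int *freq, int n)
{
    static int et[NMAX + 2][NMAX + 2];
    static int w[NMAX + 2][NMAX + 1];
    memset(et, 0, sizeof et);
    memset(w, 0, sizeof w);

    for (int len = 1; len <= n; len++) {
        for (int i = 1; i + len - 1 <= n; i++) {
            int j = i + len - 1;
            w[i][j] = w[i][j - 1] + freq[j - 1];
            int best = INT_MAX;
            for (int r = i; r <= j; r++) {
                int left  = (r > i) ? et[i][r - 1] : 0; /* row access, as before */
                int right = (r < j) ? et[j][r + 1] : 0; /* mirrored e[r+1][j] */
                int c = left + right + w[i][j];
                if (c < best)
                    best = c;
            }
            et[i][j] = best;
            et[j][i] = best; /* store the mirror copy */
        }
    }
    return et[1][n];
}
```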

opt/bst_101_scalar_rep.c (blue line at the bottom) uses a different access pattern on the triangular table: it is built up row by row from the bottom instead of along the diagonals, as in the reference implementation.
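A bottom-up, row-wise build might look like the following sketch (again assuming the frequency-based formulation): row i depends only on entries to its left in the same row and on rows below it, so i can run from n down to 1. The interval weight is accumulated in a scalar as j grows, which is likely where the "scalar replacement" in the file name comes from.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define NMAX 64

/* Bottom-up, row-wise build of the e-table: rows are computed from
 * i = n down to 1, and within a row the interval [i, j] is extended
 * to the right. The interval weight w lives in a scalar. */
int opt_bst_rowwise(const int *freq, int n)
{
    static int e[NMAX + 2][NMAX + 1];
    memset(e, 0, sizeof e);

    for (int i = n; i >= 1; i--) {       /* rows from the bottom up */
        int w = 0;
        for (int j = i; j <= n; j++) {   /* extend the interval to the right */
            w += freq[j - 1];
            int best = INT_MAX;
            for (int r = i; r <= j; r++) {
                int c = e[i][r - 1] + e[r + 1][j] + w;
                if (c < best)
                    best = c;
            }
            e[i][j] = best;
        }
    }
    return e[1][n];
}
```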

This implementation, however, performs worse than the reference implementation for large sizes, probably because elements near the bottom of the table have less spatial locality.

opt/bst_106_scalar_rep_transp.c: if we additionally apply the same trick as before and mirror the triangular table along the diagonal, performance increases significantly.
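Combining the two ideas gives a sketch like this (same assumptions and naming caveats as above): the problematic column read `e[r + 1][j]` becomes the row read `et[j][r + 1]`, whose source row j was completed in an earlier iteration of i.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define NMAX 64

/* Row-wise bottom-up build combined with the mirrored table. */
int opt_bst_rowwise_transposed(const int *freq, int n)
{
    static int et[NMAX + 2][NMAX + 2];
    memset(et, 0, sizeof et);

    for (int i = n; i >= 1; i--) {
        int w = 0;
        for (int j = i; j <= n; j++) {
            w += freq[j - 1];
            int best = INT_MAX;
            for (int r = i; r <= j; r++) {
                int left  = (r > i) ? et[i][r - 1] : 0; /* same-row read */
                int right = (r < j) ? et[j][r + 1] : 0; /* mirrored e[r+1][j] */
                int c = left + right + w;
                if (c < best)
                    best = c;
            }
            et[i][j] = best;
            et[j][i] = best; /* mirror copy along the diagonal */
        }
    }
    return et[1][n];
}
```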

opt/bst_103_why_would_you_block reserves twice the memory actually needed, just as the reference implementation does. We still build the table bottom-up, but swap the r- and j-loops, so that the innermost loop accesses the table row-wise.
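A sketch of the loop interchange (same assumptions as the earlier sketches): with the root loop r in the middle and j innermost, both the reads of `e[r + 1][j]` and the updates of `e[i][j]` sweep along rows. The minimum is accumulated directly in the table rather than in a scalar, so valid entries must start at "infinity"; this is legal because `e[i][r - 1]` only receives contributions from roots smaller than r, all of which were processed in earlier r-iterations.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define NMAX 64

/* r- and j-loops swapped: j is innermost, so e[r+1][j] and e[i][j]
 * are both accessed row-wise. Minima accumulate in the table itself. */
int opt_bst_loop_swapped(const int *freq, int n)
{
    static int e[NMAX + 2][NMAX + 1];
    static int w[NMAX + 2][NMAX + 1];
    memset(w, 0, sizeof w);
    for (int i = 1; i <= n + 1; i++)
        for (int j = 0; j <= n; j++)
            e[i][j] = (j < i) ? 0 : INT_MAX;      /* empty subtrees cost 0 */
    for (int i = 1; i <= n; i++)
        for (int j = i; j <= n; j++)
            w[i][j] = w[i][j - 1] + freq[j - 1];

    for (int i = n; i >= 1; i--)
        for (int r = i; r <= n; r++)              /* root loop in the middle... */
            for (int j = r; j <= n; j++) {        /* ...j innermost: row-wise */
                int c = e[i][r - 1] + e[r + 1][j] + w[i][j];
                if (c < e[i][j])
                    e[i][j] = c;
            }
    return e[1][n];
}
```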

opt/bst_112_new_index (a misleading name, admittedly) is the same as above, except that it uses half the memory. Performance increases, probably because the hardware prefetcher no longer loads unused memory into the cache, as is likely the case for bst_103. This observation is also supported by the last-level cache misses, which are slightly higher for bst_103 than for bst_112.

opt/bst_113_strength_red and opt/bst_117_index_opt are essentially the same. They differ from bst_112 in that the complex index calculations are hoisted out of the inner loops.
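Hoisting the index arithmetic might look like this on a flat 1-D table (a sketch with our own names, not the repository's actual code): the base of row i is computed once per row, and the access `e[(r + 1) * stride + j]` is strength-reduced to a pointer that simply advances by one stride per iteration of r.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define NMAX 64

/* Flat 1-D table with hoisted index arithmetic and a strength-reduced
 * pointer walk through the e[r+1][j] entries of the inner loop. */
int opt_bst_index_opt(const int *freq, int n)
{
    static int e[(NMAX + 2) * (NMAX + 2)];
    int stride = n + 2;
    memset(e, 0, sizeof e);

    for (int i = n; i >= 1; i--) {
        int *row_i = e + i * stride;               /* hoisted: base of row i */
        int w = 0;
        for (int j = i; j <= n; j++) {
            w += freq[j - 1];
            int best = INT_MAX;
            int *cell = e + (i + 1) * stride + j;  /* points at e[r+1][j] */
            for (int r = i; r <= j; r++) {
                int c = row_i[r - 1] + *cell + w;
                if (c < best)
                    best = c;
                cell += stride;                    /* advance to e[r+2][j] */
            }
            row_i[j] = best;
        }
    }
    return e[stride + n];                          /* e[1][n] */
}
```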

Runtime

Cache Misses