Speedup frame conversion #1200
@tammojan, could you review it, or could you assign someone for review? Thanks.
The changes overall look good to me, though the extra
@tnakazato, question: Your new code creates
@dpetry, my understanding is that
Could you do a test where you call your new routine thousands of times and monitor the memory footprint?
Hi @dpetry, I've created a small program that performs
By the way, I found that this program produces many memory errors from wcslib and casacore itself when it uses OpenMP (run
Excellent! I think this is fine then.
OK
Thanks!
I created a new ticket for this, #1217. Thanks for the thorough review, and thanks for the analysis and improvements, @tnakazato!
These two functions use a static 1-element Vector to avoid allocations on each invocation; however these static vectors are shared across multiple threads, leading to race conditions on unlocked data access and therefore unexpected results. This commit adds the thread_local specifier on these static variables such that in multi-threaded scenarios each thread ends up with its own static copy of the variable, thus avoiding unlocked data shared across threads. This issue was originally reported in casacore#1200, and then more specifically in casacore#1217. Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
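The effect of the thread_local specifier described in the commit message can be sketched as follows. This is a minimal illustration, not the casacore code: convertOne and runThreads are hypothetical names, and std::vector stands in for the 1-element casacore Vector.

```cpp
#include <thread>
#include <vector>

// Hypothetical stand-in for a conversion routine that reuses a 1-element
// scratch buffer instead of allocating one on every call.
double convertOne(double value) {
    // thread_local gives every thread its own buffer. A plain `static`
    // buffer would be shared and could be overwritten by another thread
    // between the write and the read below.
    static thread_local std::vector<double> scratch(1);
    scratch[0] = value;
    // ... a real routine would hand `scratch` to a vector-based API here ...
    return scratch[0] * 2.0;
}

// Calls convertOne from several threads at once and reports whether every
// call returned the value expected for its own input.
bool runThreads(int nThreads, int nCalls) {
    // One int flag per thread (not vector<bool>, whose packed bits would
    // themselves be a data race when written concurrently).
    std::vector<int> ok(nThreads, 1);
    std::vector<std::thread> workers;
    for (int t = 0; t < nThreads; ++t) {
        workers.emplace_back([t, nCalls, &ok] {
            for (int i = 0; i < nCalls; ++i) {
                const double in = t * 1000.0 + i;
                if (convertOne(in) != in * 2.0) ok[t] = 0;
            }
        });
    }
    for (auto& w : workers) w.join();
    for (int flag : ok) {
        if (!flag) return false;
    }
    return true;
}
```

Calling runThreads(4, 10000) exercises the routine from four threads. The buffer is still allocated only once per thread, so the allocation saving that motivated the static variable is kept.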
These changes aim to speed up frequency frame conversion by reducing memory allocation and deallocation. I'm sorry that many whitespace changes are included, especially in Coordinate.cc. The major changes in Coordinate.cc are at L977-1036.
I carried out a hot spot analysis of the imaging (convolutional gridding) of single-dish data using CASA's tsdimaging task. The test data contain about 70000 rows with 2048 channels and 2 polarizations. Roughly 10^8 frequency conversions are required. The screenshots below show the result of the hot spot analysis of the original code. A significant fraction of CPU time was spent in operator new (please see the Summary figure), which is very likely accompanied by extra CPU time for executing constructor code. Indeed, construction was one of the bottlenecks of frequency conversion. The Breakdown figure for the original code shows that about 23% of CPU time was spent on frequency conversion (MCFrequency::doConvert), which is comparable to the CPU time for gridding (ggridsd_), and about half of the frequency conversion time went to construction, destruction, or copying of MVPosition and MVDirection objects. This is really inefficient.

In the improved code, I tried to reduce the use of operator new (Coordinate.cc) and to avoid using a constructor/destructor to update another object (MCFrequency.cc and MeasMath.cc). The figures for the branch code show that the CPU time for operator new was less than half of that in the original analysis, and the fraction of CPU time for frequency conversion was reduced to 15%. This means the performance of frequency conversion improved significantly, although it did not reach double speed. In terms of total CPU time, the improvement was about 10% (about 20% when the changes in the casa6 repo are included).

The change is currently a trade-off between performance and the "beauty" of object-oriented programming, because it somewhat breaks encapsulation by exposing implementation details of MVDirection and MVPosition outside these classes. But a 10% improvement in total execution time is quite beneficial.

Hot spot analysis of original code
Summary
Breakdown
Hot spot analysis of branch code
Summary
Breakdown
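The idea in the description of updating an existing object rather than constructing a temporary to assign to it can be illustrated with a small sketch. Vec3, scaleViaTemporary, scaleInPlace and countConstructions are hypothetical names for illustration only, and the code is not taken from MCFrequency.cc or MeasMath.cc.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 3-vector with heap-backed storage, standing in for a measure
// value class. It counts constructions so the two styles can be compared.
struct Vec3 {
    static std::size_t constructions;
    std::vector<double> xyz;
    Vec3(double x, double y, double z) : xyz{x, y, z} { ++constructions; }
};
std::size_t Vec3::constructions = 0;

// Style 1: build a temporary and assign it. Every call runs the constructor
// (and a heap allocation for the temporary's storage).
void scaleViaTemporary(Vec3& v, double f) {
    v = Vec3(v.xyz[0] * f, v.xyz[1] * f, v.xyz[2] * f);
}

// Style 2: write the new components directly into the existing storage.
// No constructor runs and nothing is allocated.
void scaleInPlace(Vec3& v, double f) {
    for (double& c : v.xyz) c *= f;
}

// Returns how many Vec3 constructions nCalls updates cost in either style.
std::size_t countConstructions(bool inPlace, int nCalls) {
    Vec3 v(1.0, 2.0, 3.0);
    const std::size_t before = Vec3::constructions;
    for (int i = 0; i < nCalls; ++i) {
        if (inPlace) {
            scaleInPlace(v, 1.0);
        } else {
            scaleViaTemporary(v, 1.0);
        }
    }
    return Vec3::constructions - before;
}
```

The in-place style needs access to the object's storage, which is the encapsulation cost mentioned in the description.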