Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speedup frame conversion #1200

Merged
merged 2 commits into from Aug 28, 2022

Conversation

tnakazato
Copy link
Contributor

Changes aim to speedup frequency frame conversion by reducing memory allocation and deallocation. I'm sorry that many whitespace changes are included, especially in Coordinate.cc. Major changes of Coordinate.cc is L977-1036.

I carried out hot spot analysis on the imaging (convolutional gridding) of single dish data using CASA's tsdimaging task. Test data contains about 70000 rows with 2048 channels and 2 polarizations. Roughly 10^8 frequency conversions are required. The screenshots below are the result of hot spot analysis of original code. You can see that significant fraction of CPU time was spent for operator new (please see Summary figure), which is very likely to accompany extra CPU time to execute constructor code. Indeed, constructor was one of the bottleneck of frequency conversion. The Breakdown figure for the original code shows that about 23% of CPU time was spent for frequency conversion (MCFrequency::doConvert), which competes to the CPU time for gridding (ggridsd_), and about half of frequency conversion was construction/destruction or copy of MVPosition and MVDirection objects. This is really inefficient.

In the improved code, I tried to reduce using operator new (Coordinate.cc) and to avoid using constructor/destructor to update another object (MCFrequency.cc and MeasMath.cc). You can see in the figures for branch code that CPU time for operator new was less than half of the original analysis, and fraction of the CPU time for frequency conversion was reduced to 15%, which means performance of frequency conversion was significantly improved although it didn't reached to double-speed. In terms of total CPU time, improvement was about 10% (actually performance was improved about 20% including the changes in casa6 repo).

The change is currently a trade-off between performance and the "beauty" of the object oriented programming because the change somehow breaks encapsulation of the object by exposing implementation detail of MVDirection and MVPosition outside these classes. But 10% improvement in terms of total execution time is quite beneficial.

Hot spot analysis of original code

Summary

conversion-benchmark-summary-original

Breakdown

conversion-benchmark-breakdown-original

How spot analysis of branch code

Summary

conversion-benchmark-summary-tuned

Breakdown

conversion-benchmark-breakdown-tuned

@tnakazato
Copy link
Contributor Author

@tammojan, could you review it, or could you assign someone for review? Thanks.

@tammojan
Copy link
Contributor

The changes overall look good to me, though the extra ifs make the code a little bit less readable. @dpetry, could you have a look at this perhaps?

@dpetry dpetry self-assigned this Aug 11, 2022
@dpetry
Copy link
Contributor

dpetry commented Aug 11, 2022

@tnakazato , question: Your new code creates
thread_local static double imgCrd[NAXES_THRESHOLD]
When is this memory released again?
Does this work without consuming increasing amounts of memory when called repeatedly?

@tnakazato
Copy link
Contributor Author

@dpetry, my understanding is that imgCrd obeys thread storage duration.

The storage duration is the entire execution of the thread in which it was created, and the value stored in the object is initialized when the thread is started.

https://en.cppreference.com/w/c/language/storage_duration

@dpetry
Copy link
Contributor

dpetry commented Aug 23, 2022

Could you do a test where you call your new routine thousands of times and monitor the memory footprint?

@tnakazato
Copy link
Contributor Author

Hi @dpetry,

I've created small program that performs toPixel as well as toWorld many times. Source code with Makefile is uploaded (experiment.zip). I measured memory usage using valgrind. I compared memory usage (heap and stack) for "modest" and "large" number of conversions (runing make prof will produce raw data for the result below). For "modest" number of conversions case, number of conversions is 204800 for each (204800 times for toPixel and 204800 times for toWorld). The "large" case, the number is 10 times larger than "modest" case. Below is a result. Please note that sampling interval is normalized by valgrind automatically so that number of samples is less than ~80. Thus, total execution duration is longer for "large" case in spite of less number of samples. You can see that both heap and stack usage is almost constant against number of conversions. I think memory pile up didn't happen.

By the way, I found that this program produces a lot of memory error from wcslib and casacore itself when it uses OpenMP (run make prof_stackMT to reproduce). It seems that wcslib is not thread-safe. According to the documentation, wcslib before 5.18 is not thread-safe. But casacore accepts older version (cmake requires 4.7 or higher and my wcslib version is 5.15). If casacore world-pixel conversion doesn't assume multi-threading, thread_local keyword is useless. Should I remove thread_local?

---------------------------------------------------------------------------------------------------------------------------
        MODEST CASE                                                   LARGE CASE
  n     total(B)   useful-heap(B) extra-heap(B)    stacks(B)          total(B)   useful-heap(B) extra-heap(B)    stacks(B)
---------------------------------------------------------------------------------------------------------------------------
  0            0                0             0            0                 0                0             0            0
  1        5,432                0             0        5,432             5,432                0             0        5,432
  2        5,240                0             0        5,240         1,248,448          944,147       290,693       13,608
  3        5,440                0             0        5,440           990,016          705,823       278,857        5,336
  4        5,240                0             0        5,240           990,016          705,823       278,857        5,336
  5        5,240                0             0        5,240           989,992          705,819       278,837        5,336
  6        5,432                0             0        5,432           989,992          705,819       278,837        5,336
  7        5,760                0             0        5,760           989,760          705,819       278,837        5,104
  8        5,432                0             0        5,432           989,760          705,819       278,837        5,104
  9        5,240                0             0        5,240           989,760          705,819       278,837        5,104
 10        4,936                0             0        4,936           989,760          705,819       278,837        5,104
 11       24,016           19,408           792        3,816           990,016          705,823       278,857        5,336
 12       78,568           65,513         9,743        3,312           990,080          705,819       278,837        5,424
 13      604,664          363,263       231,729        9,672           989,584          705,819       278,837        4,928
 14    1,110,392          837,964       263,788        8,640           989,584          705,819       278,837        4,928
 15    1,087,776          795,436       273,548       18,792           989,584          705,819       278,837        4,928
 16    1,001,120          700,238       281,874       19,008           989,584          705,819       278,837        4,928
 17    1,248,448          944,147       290,693       13,608           989,584          705,819       278,837        4,928
 18      990,088          705,819       278,837        5,432           989,584          705,819       278,837        4,928
 19      990,016          705,823       278,857        5,336           989,992          705,819       278,837        5,336
 20      989,744          705,819       278,837        5,088           989,792          705,819       278,837        5,136
 21      989,576          705,819       278,837        4,920           989,992          705,819       278,837        5,336
 22      989,528          705,819       278,837        4,872           989,792          705,819       278,837        5,136
 23      989,752          705,819       278,837        5,096           989,992          705,819       278,837        5,336
 24      990,040          705,819       278,837        5,384           989,792          705,819       278,837        5,136
 25      990,080          705,819       278,837        5,424           989,992          705,819       278,837        5,336
 26      989,584          705,819       278,837        4,928           989,792          705,819       278,837        5,136
 27      989,744          705,819       278,837        5,088           990,104          705,823       278,857        5,424
 28      990,040          705,819       278,837        5,384           990,016          705,823       278,857        5,336
 29      990,080          705,819       278,837        5,424           990,136          705,879       278,905        5,352
 30      989,584          705,819       278,837        4,928           989,680          705,875       278,885        4,920
 31      989,752          705,819       278,837        5,096           989,848          705,875       278,885        5,088
 32      990,104          705,823       278,857        5,424           989,680          705,875       278,885        4,920
 33      989,992          705,819       278,837        5,336           989,848          705,875       278,885        5,088
 34      989,584          705,819       278,837        4,928           989,680          705,875       278,885        4,920
 35      989,992          705,819       278,837        5,336           989,848          705,875       278,885        5,088
 36      989,584          705,819       278,837        4,928           989,680          705,875       278,885        4,920
 37      989,992          705,819       278,837        5,336           989,848          705,875       278,885        5,088
 38      989,584          705,819       278,837        4,928           989,680          705,875       278,885        4,920
 39      989,992          705,819       278,837        5,336           989,848          705,875       278,885        5,088
 40      989,584          705,819       278,837        4,928           989,680          705,875       278,885        4,920
 41      989,992          705,819       278,837        5,336           989,848          705,875       278,885        5,088
 42      989,584          705,819       278,837        4,928           989,680          705,875       278,885        4,920
 43      990,136          705,879       278,905        5,352           989,848          705,875       278,885        5,088
 44      989,832          705,875       278,885        5,072           989,680          705,875       278,885        4,920
 45      989,840          705,875       278,885        5,080           989,848          705,875       278,885        5,088
 46      989,728          705,875       278,885        4,968           989,680          705,875       278,885        4,920
 47      990,176          705,879       278,905        5,392           989,848          705,875       278,885        5,088
 48      989,840          705,875       278,885        5,080           989,680          705,875       278,885        4,920
 49      989,728          705,875       278,885        4,968           990,072          705,875       278,885        5,312
 50      989,704          705,875       278,885        4,944           990,176          705,879       278,905        5,392
 51      989,832          705,875       278,885        5,072           990,088          705,879       278,905        5,304
 52      989,888          705,875       278,885        5,128           990,136          705,879       278,905        5,352
 53      989,888          705,875       278,885        5,128           989,920          705,875       278,885        5,160
 54      989,856          705,875       278,885        5,096           989,632          705,875       278,885        4,872
 55      990,064          705,875       278,885        5,304           989,680          705,875       278,885        4,920
 56      990,176          705,879       278,905        5,392           989,704          705,875       278,885        4,944
 57      990,080          705,879       278,905        5,296           989,832          705,875       278,885        5,072
 58      990,088          705,879       278,905        5,304           989,888          705,875       278,885        5,128
 59      990,136          705,879       278,905        5,352           989,880          705,875       278,885        5,120
 60      990,152          705,875       278,885        5,392           990,160          705,875       278,885        5,400
 61      989,832          705,875       278,885        5,072           990,112          705,879       278,905        5,328
 62      989,632          705,875       278,885        4,872
 63      989,640          705,875       278,885        4,880
 64      989,728          705,875       278,885        4,968
 65      989,704          705,875       278,885        4,944
 66      989,832          705,875       278,885        5,072
 67      989,888          705,875       278,885        5,128
 68      989,888          705,875       278,885        5,128
 69      989,856          705,875       278,885        5,096
 70      990,064          705,875       278,885        5,304
 71      990,176          705,879       278,905        5,392
 72      990,080          705,879       278,905        5,296
 73      990,088          705,879       278,905        5,304
 74      990,136          705,879       278,905        5,352
 75      990,152          705,875       278,885        5,392
 76      989,832          705,875       278,885        5,072
 77      989,632          705,875       278,885        4,872
 78      989,640          705,875       278,885        4,880
 79      989,728          705,875       278,885        4,968
 80      989,704          705,875       278,885        4,944

@dpetry
Copy link
Contributor

dpetry commented Aug 25, 2022

Excellent! I think this is fine then.
@tammojan , what do you think about the wcslib thread-safety issue which Takeshi found?
Could casacore require wcslib 5.18?
In any case, dealing with that would be a different ticket.
I suggest to approve this pull request and leave the thread_local in place. It doesn't hurt as far as I can see.

Copy link
Contributor

@dpetry dpetry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@dpetry dpetry merged commit b763b54 into casacore:master Aug 28, 2022
@tnakazato
Copy link
Contributor Author

Thanks!

@tammojan
Copy link
Contributor

@tammojan , what do you think about the wcslib thread-safety issue which Takeshi found?
Could casacore require wcslib 5.18?
In any case, dealing with that would be a different ticket.

I created a new ticket for this, #1217. Thanks for the thorough review, and thanks for the analysis and improvements @tnakazato !

rtobar added a commit to rtobar/casacore that referenced this pull request Nov 1, 2022
These two functions use a static 1-element Vector to avoid allocations
on each invocation; however these static vectors are shared across
multiple threads, leading to race conditions on unlocked data access and
therefore unexpected results.

This commit adds the thread_local specifier on these static variables
such that in multi-threaded scenarios each thread ends up with its own
static copy of the variable, thus avoiding unlocked data shared across
threads.

This issue was originally reported in casacore#1200, and then more specifically
in casacore#1217.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
tammojan pushed a commit that referenced this pull request Dec 21, 2022
These two functions use a static 1-element Vector to avoid allocations
on each invocation; however these static vectors are shared across
multiple threads, leading to race conditions on unlocked data access and
therefore unexpected results.

This commit adds the thread_local specifier on these static variables
such that in multi-threaded scenarios each thread ends up with its own
static copy of the variable, thus avoiding unlocked data shared across
threads.

This issue was originally reported in #1200, and then more specifically
in #1217.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants