# Hardware
### DGX Station
V100 SXM2 16GB

Intel Xeon E5-2698 v4 @ 2.20GHz

# Software
CUDA 11.0

cuSignal 0.19

SciPy 1.6.3

Numba 0.53.1

# Profiling
[Nsight Systems](https://docs.nvidia.com/nsight-systems/UserGuide/index.html)

## CPU Baseline
Baseline: SciPy's signal.lombscargle function
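For orientation, the classical Lomb-Scargle periodogram can be sketched in plain Python. This is not the code in icassp_scipy_v1.py, which is not shown here. It is a minimal reference under the assumption that the benchmark computes the unnormalized periodogram over angular frequencies. The function name `lombscargle_reference` and the test signal are made up for illustration.

```python
import math


def lombscargle_reference(t, x, angular_freqs):
    """Classical (unnormalized) Lomb-Scargle periodogram.

    t: sample times, x: sample values, angular_freqs: frequencies in rad/s.
    Returns one periodogram value per frequency.
    """
    pgram = []
    for w in angular_freqs:
        c = [math.cos(w * ti) for ti in t]
        s = [math.sin(w * ti) for ti in t]
        xc = sum(xi * ci for xi, ci in zip(x, c))
        xs = sum(xi * si for xi, si in zip(x, s))
        cc = sum(ci * ci for ci in c)
        ss = sum(si * si for si in s)
        cs = sum(ci * si for ci, si in zip(c, s))
        # Time offset tau that makes the shifted sine and cosine terms orthogonal.
        tau = math.atan2(2.0 * cs, cc - ss) / (2.0 * w)
        c_tau = math.cos(w * tau)
        s_tau = math.sin(w * tau)
        num_c = (c_tau * xc + s_tau * xs) ** 2
        den_c = c_tau**2 * cc + 2.0 * c_tau * s_tau * cs + s_tau**2 * ss
        num_s = (c_tau * xs - s_tau * xc) ** 2
        den_s = c_tau**2 * ss - 2.0 * c_tau * s_tau * cs + s_tau**2 * cc
        pgram.append(0.5 * (num_c / den_c + num_s / den_s))
    return pgram


# Made-up test data: unevenly spaced times and a pure tone at 2.0 rad/s.
t = [0.37 * i + 0.1 * math.sin(i) for i in range(200)]
x = [math.sin(2.0 * ti) for ti in t]
freqs = [1.0, 2.0, 3.0]
pgram = lombscargle_reference(t, x, freqs)
peak = freqs[pgram.index(max(pgram))]
print("peak at", peak, "rad/s")
```

Each output frequency is independent of the others, which is why the computation maps naturally onto one GPU thread per frequency in the Numba scenarios.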

In [1]:
!nsys profile -s none -o scipy_v1 -f true --stats=true python3 icassp_scipy_v1.py float32 5

Collecting data...
Processing events...
Capturing symbol files...
Saving temporary "/tmp/nsys-report-a0a1-dd95-ddbd-66b2.qdstrm" file to disk...
Creating final output files...

Saved report file to "/tmp/nsys-report-a0a1-dd95-ddbd-66b2.qdrep"

Exported successfully to
/tmp/nsys-report-a0a1-dd95-ddbd-66b2.sqlite

Generating CUDA API Statistics...
CUDA API Statistics (nanoseconds)




CUDA trace data was not collected.


Generating Operating System Runtime API Statistics...
Operating System Runtime API Statistics (nanoseconds)

Time(%)      Total Time       Calls         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   98.9      1290356236      641357          2011.9            1000         5836680  sched_yield                                        

### Single Precision
|                       | Registers | JIT (ms) | Kernel (ms) | Speed Up |
|:----------------------|:----------|:---------|:------------|:---------|
| Scipy                 | -         | 50,987.8 | 51,098.9    | 1.0      |
| Numba (Baseline)      |           |          |             |          |
| Numba (User Cache)    |           |          |             |          |
| Numba (Data Type)     |           |          |             |          |
| Numba (Fast Math)     |           |          |             |          |
| Numba (Max Registers) |           |          |             |          |

### Double Precision
|                       | Registers | JIT (ms) | Kernel (ms) | Speed Up |
|:----------------------|:----------|:---------|:------------|:---------|
| Scipy                 | -         | 50,835.6 | 50,879.4    | 1.0      |
| Numba (Baseline)      |           |          |             |          |
| Numba (User Cache)    |           |          |             |          |
| Numba (Data Type)     |           |          |             |          |
| Numba (Fast Math)     |           |          |             |          |
| Numba (Max Registers) |           |          |             |          |

## GPU Scenario #1
1. Numba implementation of Lombscargle function

In [2]:
!nsys profile -s none -o numba_v1 -f true --stats=true python3 icassp_numba_v1.py float32 2000 0

Collecting data...
Processing events...
Capturing symbol files...
Saving temporary "/tmp/nsys-report-ff16-3ea3-093f-daa6.qdstrm" file to disk...
Creating final output files...

Saved report file to "/tmp/nsys-report-ff16-3ea3-093f-daa6.qdrep"

Exported successfully to
/tmp/nsys-report-ff16-3ea3-093f-daa6.sqlite

Generating CUDA API Statistics...
CUDA API Statistics (nanoseconds)

Time(%)      Total Time       Calls         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   96.6     26863363272        2001      13424969.2        12898256        13497451  cuCtxSynchronize                                                                
    1.9       527733336        4005        131768.6           10834        14797231  cuMemAlloc_v2                     

In [3]:
#!nsys stats --report gpukernsum --report nvtxppsum --report gputrace numba_v1.qdrep # Use to get register usage
!nsys stats --report gpukernsum --report nvtxppsum numba_v1.qdrep

Using numba_v1.sqlite file for stats and reports.
Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/gpukernsum numba_v1.sqlite] to console... 

 Time(%)  Total Time (ns)  Instances    Average      Minimum     Maximum                                                    Name                                                
 -------  ---------------  ---------  ------------  ----------  ----------  ----------------------------------------------------------------------------------------------------
   100.0   26,929,874,752      2,001  13,458,208.3  12,932,078  13,529,040  cudapy::__main__::_numba_lombscargle$241(Array<float, 1, C, mutable, aligned>, Array<float, 1, C, m…

Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/nvtxppsum numba_v1.sqlite] to console... 

 Time(%)  Total Time (ns)  Instances     Average       Minimum      Maximum            Range         
 -------  ---------------  ---------  -------------  -----------  -----------  ------

### Single Precision
|                       | Registers | JIT (ms) | Kernel (ms) | Speed Up |
|:----------------------|:----------|:---------|:------------|:---------|
| Scipy                 | -         | 50,987.8 | 51,098.9    | 1.0      |
| Numba (Baseline)      | 58        | 428.9    | 16.0        | 3193.7   |
| Numba (User Cache)    |           |          |             |          |
| Numba (Data Type)     |           |          |             |          |
| Numba (Fast Math)     |           |          |             |          |
| Numba (Max Registers) |           |          |             |          |


### Double Precision
|                       | Registers | JIT (ms) | Kernel (ms) | Speed Up |
|:----------------------|:----------|:---------|:------------|:---------|
| Scipy                 | -         | 50,835.6 | 50,879.4    | 1.0      |
| Numba (Baseline)      | 66        | 421.7    | 29.6        | 1718.9   |
| Numba (User Cache)    |           |          |             |          |
| Numba (Data Type)     |           |          |             |          |
| Numba (Fast Math)     |           |          |             |          |
| Numba (Max Registers) |           |          |             |          |
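The Speed Up column appears to be the SciPy kernel time divided by the kernel time of the row in question. A small sketch of that arithmetic, using the values from the tables above (the helper name `speed_up` is made up):

```python
def speed_up(baseline_ms, candidate_ms):
    """Ratio of baseline time to candidate time; larger means faster."""
    return baseline_ms / candidate_ms


# Kernel (ms) values for the Scipy and Numba (Baseline) rows.
single = round(speed_up(51098.9, 16.0), 1)
double = round(speed_up(50879.4, 29.6), 1)
print(single, double)
```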

## GPU Scenario #2
1. Numba implementation of Lombscargle function
2. Added user cache
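The source of icassp_numba_v2.py is not shown here, so the following is only an assumption about what a user cache could look like: a dictionary that keeps one compiled kernel per data type so that later calls skip the lookup and specialization work. All names (`_kernel_cache`, `_compile_kernel`, `get_kernel`) are hypothetical, and the "kernel" is a plain Python stand-in so the sketch runs without a GPU.

```python
_kernel_cache = {}           # hypothetical cache: dtype name -> compiled kernel
stats = {"compiles": 0}      # counts how often the expensive step runs


def _compile_kernel(dtype_name):
    """Stand-in for an expensive JIT compilation step (hypothetical)."""
    stats["compiles"] += 1
    return lambda values: [v * 2 for v in values]  # placeholder "kernel"


def get_kernel(dtype_name):
    """Return the cached kernel for this dtype, compiling it on first use."""
    kernel = _kernel_cache.get(dtype_name)
    if kernel is None:
        kernel = _compile_kernel(dtype_name)
        _kernel_cache[dtype_name] = kernel
    return kernel


first = get_kernel("float32")
second = get_kernel("float32")   # served from the cache
third = get_kernel("float64")    # new dtype, compiled once
print(stats["compiles"])
```

A cache like this mainly helps call overhead across repeated launches, not the kernel's own run time, which would be consistent with the near-identical kernel times in the User Cache rows.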

In [4]:
!nsys profile -s none -o numba_v2 -f true python3 icassp_numba_v2.py float32 2000 0

Collecting data...
Registers 58
Processing events...
Capturing symbol files...
Saving temporary "/tmp/nsys-report-da44-82eb-1e84-d47c.qdstrm" file to disk...
Creating final output files...

Saved report file to "/tmp/nsys-report-da44-82eb-1e84-d47c.qdrep"
Report file moved to "/home/odysseus/workStuff/cusignal-icassp-tutorial/notebooks/numba_cuda/numba_v2.qdrep"


In [5]:
!nsys stats --report gpukernsum --report nvtxppsum numba_v2.qdrep

Generate SQLite file numba_v2.sqlite from numba_v2.qdrep
Using numba_v2.sqlite file for stats and reports.
Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/gpukernsum numba_v2.sqlite] to console... 

 Time(%)  Total Time (ns)  Instances    Average      Minimum     Maximum                                                    Name                                                
 -------  ---------------  ---------  ------------  ----------  ----------  ----------------------------------------------------------------------------------------------------
   100.0   26,937,979,145      2,001  13,462,258.4  12,938,960  13,901,395  cudapy::__main__::_numba_lombscargle$241(Array<float, 1, C, mutable, aligned>, Array<float, 1, C, m…

Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/nvtxppsum numba_v2.sqlite] to console... 

 Time(%)  Total Time (ns)  Instances     Average       Minimum      Maximum            Range         
 -------  ---------------  -

### Single Precision
|                       | Registers | JIT (ms) | Kernel (ms) | Speed Up |
|:----------------------|:----------|:---------|:------------|:---------|
| Scipy                 | -         | 50,987.8 | 51,098.9    | 1.0      |
| Numba (Baseline)      | 58        | 428.9    | 16.0        | 3193.7   |
| Numba (User Cache)    | 58        | 431.6    | 16.1        | 3173.8   |
| Numba (Data Type)     |           |          |             |          |
| Numba (Fast Math)     |           |          |             |          |
| Numba (Max Registers) |           |          |             |          |

### Double Precision
|                       | Registers | JIT (ms) | Kernel (ms) | Speed Up |
|:----------------------|:----------|:---------|:------------|:---------|
| Scipy                 | -         | 50,835.6 | 50,879.4    | 1.0      |
| Numba (Baseline)      | 66        | 421.7    | 29.6        | 1718.9   |
| Numba (User Cache)    | 66        | 428.5    | 29.4        | 1730.6   |
| Numba (Data Type)     |           |          |             |          |
| Numba (Fast Math)     |           |          |             |          |
| Numba (Max Registers) |           |          |             |          |

## GPU Scenario #3
1. Numba implementation of Lombscargle function
2. Added user cache
3. Specific functions per data type

In [6]:
!nsys profile -s none -o numba_v3 -f true python3 icassp_numba_v3.py float32 2000 0

Collecting data...
Registers 40
Processing events...
Capturing symbol files...
Saving temporary "/tmp/nsys-report-4f00-3887-9db4-2448.qdstrm" file to disk...
Creating final output files...

Saved report file to "/tmp/nsys-report-4f00-3887-9db4-2448.qdrep"
Report file moved to "/home/odysseus/workStuff/cusignal-icassp-tutorial/notebooks/numba_cuda/numba_v3.qdrep"


In [7]:
!nsys stats --report gpukernsum --report nvtxppsum numba_v3.qdrep

Generate SQLite file numba_v3.sqlite from numba_v3.qdrep
Using numba_v3.sqlite file for stats and reports.
Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/gpukernsum numba_v3.sqlite] to console... 

 Time(%)  Total Time (ns)  Instances    Average      Minimum     Maximum                                                    Name                                                
 -------  ---------------  ---------  ------------  ----------  ----------  ----------------------------------------------------------------------------------------------------
   100.0   21,979,219,486      2,001  10,984,117.7  10,663,816  11,620,523  cudapy::__main__::_numba_lombscargle_32$241(Array<float, 1, C, mutable, aligned>, Array<float, 1, C…

Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/nvtxppsum numba_v3.sqlite] to console... 

 Time(%)  Total Time (ns)  Instances     Average       Minimum      Maximum            Range         
 -------  ---------------  -

### Single Precision
|                       | Registers | JIT (ms) | Kernel (ms) | Speed Up |
|:----------------------|:----------|:---------|:------------|:---------|
| Scipy                 | -         | 50,987.8 | 51,098.9    | 1.0      |
| Numba (Baseline)      | 58        | 428.9    | 16.0        | 3193.7   |
| Numba (User Cache)    | 58        | 431.6    | 16.1        | 3173.8   |
| Numba (Data Type)     | 40        | 554.9    | 13.3        | 3842.0   |
| Numba (Fast Math)     |           |          |             |          |
| Numba (Max Registers) |           |          |             |          |

### Double Precision
|                       | Registers | JIT (ms) | Kernel (ms) | Speed Up |
|:----------------------|:----------|:---------|:------------|:---------|
| Scipy                 | -         | 50,835.6 | 50,879.4    | 1.0      |
| Numba (Baseline)      | 66        | 421.7    | 29.6        | 1718.9   |
| Numba (User Cache)    | 66        | 428.5    | 29.4        | 1730.6   |
| Numba (Data Type)     | 58        | 428.9    | 29.2        | 1742.4   |
| Numba (Fast Math)     |           |          |             |          |
| Numba (Max Registers) |           |          |             |          |

## GPU Scenario #4
1. Numba implementation of Lombscargle function
2. Added user cache
3. Specific functions per data type
4. Utilize fast_math compiler flag

In [8]:
!nsys profile -s none -o numba_v4 -f true python3 icassp_numba_v4.py float32 2000 0

Collecting data...
Registers 35
Processing events...
Capturing symbol files...
Saving temporary "/tmp/nsys-report-9b55-f5fc-3916-fb5b.qdstrm" file to disk...
Creating final output files...

Saved report file to "/tmp/nsys-report-9b55-f5fc-3916-fb5b.qdrep"
Report file moved to "/home/odysseus/workStuff/cusignal-icassp-tutorial/notebooks/numba_cuda/numba_v4.qdrep"


In [9]:
!nsys stats --report gpukernsum --report nvtxppsum numba_v4.qdrep

Generate SQLite file numba_v4.sqlite from numba_v4.qdrep
Using numba_v4.sqlite file for stats and reports.
Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/gpukernsum numba_v4.sqlite] to console... 

 Time(%)  Total Time (ns)  Instances    Average      Minimum     Maximum                                                    Name                                                
 -------  ---------------  ---------  ------------  ----------  ----------  ----------------------------------------------------------------------------------------------------
   100.0   21,834,204,824      2,001  10,911,646.6  10,656,904  11,666,092  cudapy::__main__::_numba_lombscargle_32$241(Array<float, 1, C, mutable, aligned>, Array<float, 1, C…

Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/nvtxppsum numba_v4.sqlite] to console... 

 Time(%)  Total Time (ns)  Instances     Average       Minimum      Maximum            Range         
 -------  ---------------  -

### Single Precision
|                       | Registers | JIT (ms) | Kernel (ms) | Speed Up |
|:----------------------|:----------|:---------|:------------|:---------|
| Scipy                 | -         | 50,987.8 | 51,098.9    | 1.0      |
| Numba (Baseline)      | 58        | 428.9    | 16.0        | 3193.7   |
| Numba (User Cache)    | 58        | 431.6    | 16.1        | 3173.8   |
| Numba (Data Type)     | 40        | 554.9    | 13.3        | 3842.0   |
| Numba (Fast Math)     | 35        | 442.9    | 13.1        | 3900.7   |
| Numba (Max Registers) |           |          |             |          |

### Double Precision
|                       | Registers | JIT (ms) | Kernel (ms) | Speed Up |
|:----------------------|:----------|:---------|:------------|:---------|
| Scipy                 | -         | 50,835.6 | 50,879.4    | 1.0      |
| Numba (Baseline)      | 66        | 421.7    | 29.6        | 1718.9   |
| Numba (User Cache)    | 66        | 428.5    | 29.4        | 1730.6   |
| Numba (Data Type)     | 58        | 428.9    | 29.2        | 1742.4   |
| Numba (Fast Math)     | 58        | 507.3    | 29.1        | 1748.4   |
| Numba (Max Registers) |           |          |             |          |

## GPU Scenario #5
1. Numba implementation of Lombscargle function
2. Added user cache
3. Specific functions per data type
4. Utilize fast_math compiler flag
5. Utilize max registers compiler switch
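Capping registers matters because the register file limits how many threads can be resident on a streaming multiprocessor at once. The sketch below estimates that upper bound for the register counts reported in this notebook. It assumes the V100 limits of 65,536 32-bit registers and 2,048 threads per SM, and it ignores block size, shared memory, and register allocation granularity, so treat the numbers as rough bounds and not as measured occupancy. In Numba the cap is requested with the `max_registers` argument of `cuda.jit`. The helper name `max_resident_threads` is made up.

```python
REGISTERS_PER_SM = 65536     # 32-bit registers per SM on V100 (assumed limit)
MAX_THREADS_PER_SM = 2048    # resident thread limit per SM on V100 (assumed limit)
WARP_SIZE = 32


def max_resident_threads(registers_per_thread):
    """Upper bound on resident threads per SM imposed by register usage alone."""
    by_registers = REGISTERS_PER_SM // registers_per_thread
    by_registers -= by_registers % WARP_SIZE   # whole warps only
    return min(MAX_THREADS_PER_SM, by_registers)


# Register counts reported for the float32 kernel in scenarios 1 to 5.
for regs in (58, 40, 35, 32):
    threads = max_resident_threads(regs)
    share = 100 * threads / MAX_THREADS_PER_SM
    print(f"{regs} registers -> {threads} threads ({share:.1f}% of the SM limit)")
```

Under these assumptions, 32 registers per thread is the point at which registers stop being the limiting resource.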


In [10]:
!nsys profile -s none -o numba_v5 -f true python3 icassp_numba_v5.py float32 2000 0

Collecting data...
Registers 32
Processing events...
Capturing symbol files...
Saving temporary "/tmp/nsys-report-525c-d8da-e12b-ef07.qdstrm" file to disk...
Creating final output files...

Saved report file to "/tmp/nsys-report-525c-d8da-e12b-ef07.qdrep"
Report file moved to "/home/odysseus/workStuff/cusignal-icassp-tutorial/notebooks/numba_cuda/numba_v5.qdrep"


In [11]:
!nsys stats --report gpukernsum --report nvtxppsum numba_v5.qdrep

Generate SQLite file numba_v5.sqlite from numba_v5.qdrep
Using numba_v5.sqlite file for stats and reports.
Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/gpukernsum numba_v5.sqlite] to console... 

 Time(%)  Total Time (ns)  Instances    Average      Minimum     Maximum                                                    Name                                                
 -------  ---------------  ---------  ------------  ----------  ----------  ----------------------------------------------------------------------------------------------------
   100.0   21,775,858,592      2,001  10,882,488.1  10,468,520  11,332,043  cudapy::__main__::_numba_lombscargle_32$241(Array<float, 1, C, mutable, aligned>, Array<float, 1, C…

Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/nvtxppsum numba_v5.sqlite] to console... 

 Time(%)  Total Time (ns)  Instances     Average       Minimum      Maximum            Range         
 -------  ---------------  -

### Single Precision
|                       | Registers | JIT (ms) | Kernel (ms) | Speed Up |
|:----------------------|:----------|:---------|:------------|:---------|
| Scipy                 | -         | 50,987.8 | 51,098.9    | 1.0      |
| Numba (Baseline)      | 58        | 428.9    | 16.0        | 3193.7   |
| Numba (User Cache)    | 58        | 431.6    | 16.1        | 3173.8   |
| Numba (Data Type)     | 40        | 554.9    | 13.3        | 3842.0   |
| Numba (Fast Math)     | 35        | 442.9    | 13.1        | 3900.7   |
| Numba (Max Registers) | 32        | 439.4    | 13.1        | 3900.7   |

### Double Precision
|                       | Registers | JIT (ms) | Kernel (ms) | Speed Up |
|:----------------------|:----------|:---------|:------------|:---------|
| Scipy                 | -         | 50,835.6 | 50,879.4    | 1.0      |
| Numba (Baseline)      | 66        | 421.7    | 29.6        | 1718.9   |
| Numba (User Cache)    | 66        | 428.5    | 29.4        | 1730.6   |
| Numba (Data Type)     | 58        | 428.9    | 29.2        | 1742.4   |
| Numba (Fast Math)     | 58        | 507.3    | 29.1        | 1748.4   |
| Numba (Max Registers) | 62        | 449.2    | 29.5        | 1724.7   |