# T Function

\begin{equation}
T(p,t)_{l_0l_1l_2l_3}=\sum_{x}e^{-i p x}\epsilon_{c_0c_1c_2c_3}V(t)_{c_0x,l_0}V(t)_{c_1x,l_1}V(t)_{c_2x,l_2}V(t)_{c_3x,l_3}
\end{equation}

\begin{align}
T_{ijkl}&=\epsilon_{abcd}V_{ai}V_{bj}V_{ck}V_{dl}\\
        &=\epsilon_{0123}V_{0i}V_{1j}V_{2k}V_{3l}
         +\epsilon_{0132}V_{0i}V_{1j}V_{3k}V_{2l}+\cdots\\
        &=V_{0i}V_{1j}V_{2k}V_{3l}-V_{0i}V_{1j}V_{3k}V_{2l}+\cdots\\
T_{ijkk}&=V_{0i}V_{1j}(V_{2k}V_{3k}-V_{3k}V_{2k})+\cdots\\
        &=0
\end{align}


If $\epsilon_{c_0c_1c_2c_3}=0$, skip that color combination entirely. This cuts the color sum from $N_c^4$ terms to $N_c!$ terms.
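
As a quick check of this counting, the sketch below enumerates every color index combination for $N_c=4$ and counts those with non-zero $\epsilon$. The helper `levi_civita` is an illustrative name, not a function from the original code, and $N_c=4$ is assumed from the rank-4 $\epsilon$ in the definition of $T$.

```python
from itertools import product


def levi_civita(indices):
    """Return the Levi-Civita symbol (+1, -1 or 0) for a tuple of indices."""
    if len(set(indices)) != len(indices):
        return 0  # a repeated index always gives zero
    # Count inversions; the parity of the count is the permutation sign.
    inversions = sum(
        1
        for a in range(len(indices))
        for b in range(a + 1, len(indices))
        if indices[a] > indices[b]
    )
    return -1 if inversions % 2 else 1


NC = 4  # assumed: four values per index, matching the rank-4 epsilon
all_terms = list(product(range(NC), repeat=NC))
nonzero_terms = [c for c in all_terms if levi_civita(c) != 0]

print(len(all_terms), len(nonzero_terms))  # N_c^4 versus N_c!
```

Only the permutations of `(0, 1, 2, 3)` survive, so in practice the color loop can iterate over `itertools.permutations(range(NC))` directly instead of filtering.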

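If any two indices of $T$ are equal, the element vanishes (as shown above), so don't compute it. This reduces the $N_v^4$ elements to the $N_v!/(N_v-4)!$ elements with distinct indices. Because $T$ is antisymmetric in its indices, only $\binom{N_v}{4}$ of those are independent.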
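
The following is a minimal sketch that checks this numerically at a single spatial site. It uses a random stand-in for $V$ (the sizes, the name `V`, and the helpers `sign` and `T` are all hypothetical, not taken from the production code), verifies that elements with a repeated index vanish and that swapping two indices flips the sign, and prints the number of distinct-index elements next to $N_v!/(N_v-4)!$ and $\binom{N_v}{4}$.

```python
import random
from itertools import permutations, product
from math import comb, factorial

NC, NV = 4, 6  # illustrative sizes; NV is a small stand-in for the real N_v

random.seed(0)
# V[c][l]: random stand-in for the eigenvector components at one spatial site
V = [[random.uniform(-1, 1) for _ in range(NV)] for _ in range(NC)]


def sign(perm):
    """Sign of a permutation, from the parity of its inversion count."""
    inv = sum(
        perm[a] > perm[b]
        for a in range(len(perm))
        for b in range(a + 1, len(perm))
    )
    return -1 if inv % 2 else 1


def T(i, j, k, l):
    """T_ijkl = eps_abcd V_ai V_bj V_ck V_dl, summing only non-zero eps terms."""
    return sum(
        sign(p) * V[p[0]][i] * V[p[1]][j] * V[p[2]][k] * V[p[3]][l]
        for p in permutations(range(NC))
    )


all_indices = list(product(range(NV), repeat=4))
distinct = [idx for idx in all_indices if len(set(idx)) == 4]
repeated = [idx for idx in all_indices if len(set(idx)) < 4]

# Elements with any repeated index vanish up to rounding error
assert all(abs(T(*idx)) < 1e-12 for idx in repeated)
# Swapping two indices flips the sign
assert abs(T(0, 1, 2, 3) + T(1, 0, 2, 3)) < 1e-12

print(len(distinct), factorial(NV) // factorial(NV - 4), comb(NV, 4))
```

The first two printed numbers agree (ordered tuples of distinct indices), while the third is the count of independent elements once antisymmetry is used to relate the different orderings.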


In [1]:
from math import comb

In [5]:
comb(20,4)

4845

### Timing

Parallelized on CPUs, parallelizing the loop over $l_3$ rather than the loop over spatial sites.
All times are for zero momentum.

#### L=4, T=8, NVEC=8


| Threads | Time(ms) | Speedup |
|---------|----------|---------|
| 1       | 19242    | -       |
| 2       | 10734    | 1.79    |
| 4       | 5450     | 3.53    |
| 8       | 3237     | 5.94    |
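
The speedup column can be reproduced from the timings, and the same numbers give the parallel efficiency (speedup divided by thread count). This is a small sketch; the variable names are illustrative.

```python
# Wall-clock times in ms, copied from the table above (threads -> time)
times_ms = {1: 19242, 2: 10734, 4: 5450, 8: 3237}

baseline = times_ms[1]
results = {}
for threads, t in times_ms.items():
    speedup = baseline / t
    efficiency = speedup / threads  # 1.0 would be ideal linear scaling
    results[threads] = (speedup, efficiency)
    print(f"{threads:2d} threads: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

Efficiency falls from about 0.90 at 2 threads to about 0.74 at 8 threads.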


#### L=16, T=32

| NVEC | Threads | Time(hrs) | Expected Scaling | Actual Scaling |
|------|---------|-----------|------------------|----------------|
| 4    | 4       | 0.056     | -                | -              |
| 8    | 8       | 0.96      | $2^3$            | 17             |

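* The 8-thread calculation produced a warning about using too many PUs
* QDP reports 8 threads, but `omp_get_num_threads` reports 1 (it returns 1 whenever it is called outside a parallel region, so this may just reflect where it was called)
* The `jsrun` command could be suboptimal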
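
The expected and actual scaling figures in the table can be reproduced as follows. The sketch assumes that the expected value comes from a cost proportional to $N_v^4$ divided by an ideal speedup from doubling the thread count; that reading of the $2^3$ entry is an assumption, not something stated in the notes.

```python
# Values copied from the L=16, T=32 table
nvec_a, threads_a, hours_a = 4, 4, 0.056
nvec_b, threads_b, hours_b = 8, 8, 0.96

work_ratio = (nvec_b / nvec_a) ** 4      # assumed N_v^4 cost: 2^4
thread_ratio = threads_b / threads_a     # ideal speedup from extra threads: 2
expected = work_ratio / thread_ratio     # 2^3
actual = hours_b / hours_a

print(f"expected {expected:.1f}, actual {actual:.1f}, work-only {work_ratio:.1f}")
```

The measured ratio is close to the $2^4=16$ that would be expected if the additional threads gave no benefit, which may be related to the threading warnings listed above.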

## GPU Code

The spatial sum is now parallelized on the GPU, so the code will scale as $N_v^4$.

| NX | Nv | Time(minutes) |
|----|----|-----------|
| 16 | 4  | 1.0 |
| 32 | 4  | 5.4 |

Still way too long to do large $N_v$!



The GPU code has the following steps:

1. Loop over color, computing only for non-zero $\epsilon$
2. Loop over eigenvectors (evec)
    1. Copy the evec data onto the GPU as a spatial vector
    2. Run a GPU kernel to multiply the evecs and reduce to a scalar

| Kernel Part | Time(microseconds) |
|-------------|--------------------|
| Setup/Data transfer | 120 |
| Multiply & Reduce | 30 |
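
The loop structure described in the steps above can be sketched as a CPU reference in plain Python. Everything here is a hypothetical stand-in: the data layout `V[l][c][x]`, the one-dimensional lattice, the random data, and the function names are assumptions for illustration, and `multiply_and_reduce` only mimics what the GPU kernel is described as doing. Nesting the evec loop inside the color loop, and skipping repeated evec indices, are also assumptions.

```python
import cmath
import random
from itertools import permutations

NC, NV, NX = 4, 4, 8  # illustrative sizes; NX is the number of spatial sites

random.seed(1)
# V[l][c][x]: eigenvector l, color c, site x (random stand-in data)
V = [[[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(NX)]
      for _ in range(NC)] for _ in range(NV)]


def perm_sign(perm):
    """Sign of a permutation, from the parity of its inversion count."""
    inv = sum(
        perm[a] > perm[b]
        for a in range(len(perm))
        for b in range(a + 1, len(perm))
    )
    return -1 if inv % 2 else 1


def multiply_and_reduce(vecs, phase):
    """Stand-in for the GPU kernel: site-wise product, then sum over sites."""
    return sum(ph * a * b * c * d for ph, a, b, c, d in zip(phase, *vecs))


def compute_T(p=0.0):
    """Return a dict mapping evec index tuples (l0, l1, l2, l3) to T(p)."""
    phase = [cmath.exp(-1j * p * x) for x in range(NX)]
    T = {}
    # Step 1: loop over colors, visiting only the non-zero epsilon terms
    for colors in permutations(range(NC)):
        eps = perm_sign(colors)
        # Step 2: loop over eigenvector tuples with distinct indices
        for l in permutations(range(NV), 4):
            # Step 2.1: gather four spatial vectors (the device copy on a GPU)
            vecs = [V[l[n]][colors[n]] for n in range(4)]
            # Step 2.2: multiply site-wise and reduce to a scalar
            T[l] = T.get(l, 0j) + eps * multiply_and_reduce(vecs, phase)
    return T


T = compute_T()
print(len(T))
```

Step 2.1 runs once per (color, evec tuple) pair in this layout, which is where the per-call setup and data-transfer cost in the table above would be paid.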


### Speedup (To do)

1. Copy all data to the GPU once at the start and index it with a stride, to reduce setup time
2. Use a better reduction over the resultant vector
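
Both ideas can be illustrated on the CPU in plain Python. This is only a sketch under assumed conventions: the flat layout `(l, c, x)` with the site index running fastest, and the names `flat`, `offset`, `site_value`, and `tree_reduce`, are hypothetical and not taken from the GPU code.

```python
NC, NV, NX = 4, 4, 8  # illustrative sizes

# One flat buffer holding every (evec, color) spatial vector back to back,
# standing in for a single up-front copy to the device.
flat = [float(i) for i in range(NV * NC * NX)]


def offset(l, c):
    """Start of the spatial vector for eigenvector l, color c in the buffer."""
    return (l * NC + c) * NX


def site_value(l, c, x):
    """Look up one element by stride arithmetic instead of a fresh copy."""
    return flat[offset(l, c) + x]


def tree_reduce(values):
    """Pairwise (tree) sum: about log2(n) passes over shrinking lists."""
    vals = list(values)
    while len(vals) > 1:
        if len(vals) % 2:
            vals.append(0.0)  # pad odd lengths so every element has a partner
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0] if vals else 0.0


print(site_value(1, 2, 3), tree_reduce([1.0, 2.0, 3.0]))
```

On a GPU each pass of the pairwise sum can be done by many threads at once, which is why this shape is commonly used for parallel reductions.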


## L=32 Baryon Correlators

### Timing Info

* Times measured in milliseconds
* Nvec=8 hit time limit on debug node

| Nvec | Compute T | Compute B | Compute Bprop | Evaluate Diagrams |
|------|-----------|-----------|---------------|-------------------|
| 4 | $1.8\times10^5$ | 228 | $7.3\times10^4$ | $3.6\times10^5$ |
| 6 | $1.5\times10^6$ | 1104 | $4.3\times10^5$ | $6.9\times10^5$ |
| 8 | $6.8\times10^6$ | 3696 | ... | ... |
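
To see how these timings grow with Nvec, one can estimate an effective power-law exponent between pairs of rows. The sketch below does this for the two columns that have all three entries; the function name and the choice of Nvec=4 as the reference are illustrative.

```python
from math import log

# Timings in ms copied from the table above
compute_T = {4: 1.8e5, 6: 1.5e6, 8: 6.8e6}
compute_B = {4: 228, 6: 1104, 8: 3696}


def effective_exponent(timings, n_ref, n):
    """Exponent k such that time is proportional to Nvec**k between two sizes."""
    return log(timings[n] / timings[n_ref]) / log(n / n_ref)


for name, timings in [("Compute T", compute_T), ("Compute B", compute_B)]:
    for n in (6, 8):
        k = effective_exponent(timings, 4, n)
        print(f"{name}: Nvec 4 -> {n}, effective exponent {k:.2f}")
```

With only three points this is a rough guide, but Compute B comes out close to $N_v^4$ while Compute T grows somewhat faster, with an effective exponent a little above 5.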



### Numerical Values

| Nvec | t | C(t) |
|------|---|------|
| 4 | 0 | 6.88748194e-10-4.43434764e-14j | 
| 6 | 0 | 2.28479330e-09-1.70586292e-12j | 
|-|-|-|
| 4 | 1 | -1.35823441e-22-4.98618281e-22j |
| 6 | 1 | -1.49544234e-21-1.45096305e-22j |
|-|-|-|
| 4 | 2 | 5.01071221e-23-3.52351481e-23j |
| 6 | 2 |  -1.38497310e-23+5.44879980e-23j |

In [8]:
import math

In [14]:
for nvec in [4,6,8]:
    print(math.factorial(nvec)/(math.factorial(nvec-4)))

24.0
360.0
1680.0


In [11]:
4**4

256