# trackingML DL with GlueX Fall 2018 data part 3

Training an AI model is closely analogous to curve fitting. In fact, one can argue that they really are the same thing. The model has a lot of parameters that are varied in order to minimize a loss function. The main difference is that in traditional curve fitting, the functional form you are fitting usually has some physical meaning behind it. The parameters themselves are therefore linked to physical traits. In AI/ML, the weights and biases are not individually associated with any physical traits, which is what makes the model a "black box".

In traditional curve fitting, we often do a $\chi^2$ minimization:

$$
\chi^2 = \sum_i{\frac{\left[y_i - f(x_i)\right]^2}{\sigma_i^2}}
$$

Here, the $x_i$, $y_i$ are the data points and the $f()$ is a function whose parameters are varied in order to best minimize the $\chi^2$ and therefore best match the data. The values $\sigma_i$ represent the uncertainty of the measurements. In reality, this should represent the combined uncertainty of the measurements *and* the function $f()$ at that point. We don't include the uncertainty of $f()$ for a couple of reasons, but basically we assume that by the end of the fitting, it will be much smaller than the uncertainties of individual measurements given that it now contains the wisdom of all measurements.
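As a concrete illustration of the $\chi^2$ minimization described above, here is a minimal sketch that fits a straight line with NumPy. The data points, uncertainties, and the helper name `chi2` are all invented for illustration and are not taken from the tracking data.

```python
import numpy as np

def chi2(y, f_x, sigma):
    """Sum of squared residuals, each normalized by its uncertainty."""
    return float(np.sum(((y - f_x) / sigma) ** 2))

# Made-up data points (x_i, y_i) with per-point uncertainties sigma_i
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
sigma = np.array([0.1, 0.2, 0.1, 0.2])

# Weighted least-squares fit of f(x) = m*x + b.
# np.polyfit expects weights of 1/sigma (not 1/sigma**2) for Gaussian uncertainties.
m, b = np.polyfit(x, y, deg=1, w=1.0 / sigma)

chi2_best = chi2(y, m * x + b, sigma)
chi2_off = chi2(y, (m + 0.1) * x + b, sigma)  # same line with a slightly perturbed slope
print(chi2_best, chi2_off)
```

Any change to the fitted parameters, such as the perturbed slope here, gives a larger $\chi^2$ than the best-fit values do.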

For the tracking problem we are looking for 5 state vector parameters ($\frac{q}{p_t}$, $\phi$, $D$, $tanl$, and $z$). The model is supposed to predict these, and in order to measure how well the model is doing, we need to find how close the prediction is to the true values. The values themselves are all in different units and also have different uncertainties based on the area of phase space they are in, how many actual measurements are included, and the uncertainties of those individual measurements. Ultimately, the loss function needs to be a single number that represents how many $\sigma$s away one state vector is from the truth in the 5 dimensional space. This can be written as:

$$
\chi_{i}^2 = \vec{\delta s_i}^\intercal \cdot C^{-1} \cdot \vec{\delta s_i}
$$

where:

$\;\;\;\;\;\;$ $\vec{\delta s_i}$ = $(\vec{s_i^{model}} - \vec{s_i^{label}})$  is the difference between model and actual state vectors for the track
 
 and

$\;\;\;\;\;\;$ $C^{-1}$ is the inverse of the covariance matrix for the track
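
To make the formula above concrete, here is a small NumPy sketch for a single track. All numbers are invented, and the covariance matrix is taken to be diagonal (no correlations) purely to keep the expected answer obvious: if each of the 5 parameters is off by exactly one $\sigma$, the result should come out very close to 5.

```python
import numpy as np

# Hypothetical uncertainties for (q/pt, phi, D, tanl, z); values are invented
sigmas = np.array([0.01, 0.002, 0.05, 0.01, 0.5])
C = np.diag(sigmas ** 2)        # uncorrelated case: diagonal covariance matrix

s_label = np.array([1.20, 0.500, 0.10, 0.30, 65.0])
s_model = s_label + sigmas      # every parameter off by exactly 1 sigma

delta = s_model - s_label
# Solve C x = delta rather than forming the inverse explicitly
chi2_track = float(delta @ np.linalg.solve(C, delta))
print(chi2_track)
```

With a real covariance matrix the off-diagonal terms also contribute, which is why the full matrix product is needed rather than a simple sum over parameters.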

<br>
<br>

This may be a little subtle, but in the analogy with the traditional curve fitting described above:

$\;\;\;\;\;\;$ $\vec{s_i^{label}}$ represents the $y_i$ measurements. <br>
$\;\;\;\;\;\;$ $C^{-1}$ is the $\frac{1}{\sigma_i^2}$ uncertainties (a 5x5 tensor) on $\vec{s_i^{label}}$. <br>
$\;\;\;\;\;\;$ $\vec{s_i^{model}}$ represents the function $f(x_i)$. <br>

<br>

OK, so from the tracking code, we do have a covariance matrix which represents the uncertainties and correlations between the state vector parameters. Note that the matrix is symmetric so there are only 15 unique values to worry about.

$$
C =
\begin{bmatrix} 
\sigma_{q/p_t}^2 & \sigma_{q/p_t}\sigma_{\phi} &  \sigma_{q/p_t}\sigma_{D} &  \sigma_{q/p_t}\sigma_{tanl} &  \sigma_{q/p_t}\sigma_{z} \\
\ddots           & \sigma_{\phi}^2             &  \sigma_{\phi}\sigma_{D}  &  \sigma_{\phi}\sigma_{tanl}  &  \sigma_{\phi}\sigma_{z}  \\
\ddots           & \ddots                      &  \sigma_{D}^2             &  \sigma_{D}\sigma_{tanl}     &  \sigma_{D}\sigma_{z}     \\
\ddots           & \ddots                      &  \ddots                   &  \sigma_{tanl}^2             &  \sigma_{tanl}\sigma_{z}  \\
\ddots           & \ddots                      &  \ddots                   &  \ddots                      &  \sigma_{z}^2             \\
\end{bmatrix}
\quad
$$

What we really need is the inverse of this matrix for every track. It turns out that the inverse of a symmetric matrix is also a symmetric matrix, so there will also be only 15 unique values. We'll write this as the following:

$$
C^{-1} = W =
\begin{bmatrix} 
w_{(q/p_t)^2} & w_{q/p_t,\phi} &  w_{q/p_t,D} &  w_{q/p_t,tanl} &  w_{q/p_t,z}      \\
\ddots           & w_{\phi^2}       &  w_{\phi,D}  &  w_{\phi,tanl}  &  w_{\phi,z}  \\
\ddots           & \ddots           &  w_{D^2}     &  w_{D,tanl}     &  w_{D,z}     \\
\ddots           & \ddots           &  \ddots      &  w_{tanl^2}     &  w_{tanl,z}  \\
\ddots           & \ddots           &  \ddots      &  \ddots         &  w_{z^2}     \\
\end{bmatrix}
\quad
$$
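
The symmetry claim is easy to check numerically. The sketch below uses a randomly generated symmetric positive-definite matrix as a stand-in for a real track covariance matrix (an assumption made only for this illustration) and pulls out the 15 unique elements of its inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
C = A @ A.T + 5.0 * np.eye(5)   # symmetric positive-definite stand-in for a covariance matrix

W = np.linalg.inv(C)

iu = np.triu_indices(5)         # row/column indices of the upper triangle, diagonal included
w_packed = W[iu]                # the unique elements of W

print(np.allclose(W, W.T), w_packed.size)   # True 15
```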

We should be able to form a 2-D tensor and 1-D vector in Keras and use the backend to do two successive dot products to calculate the loss. The big question is what is the sequence of the labels that makes it easiest to copy them into a 5x5 tensor? Should we just write out all 25 values of the W matrix rather than do backflips to build it from the 15 unique values? Should we also keep the covariance matrix in the same labels file, or just write it to a separate file?

### Tensor multiplication in the back end

The custom loss function will need to do multiplication between vectors and tensors. Moreover, it will need to take a 1-D array of values and reshape it into a 2-D tensor. In this next cell, I do a simple exercise to implement some of this using keras backend functions. This can be quite tricky since the "customLoss" procedure does not actually get run every time the loss needs to be calculated. Instead, it is called once in order to define the set of operations needed to calculate the loss for a batch of data. Keras/tensorflow then apply these operations directly during training.

The first thing to know is that keras will pass in exactly 2 parameters, both of which are tensors whose first dimension is the batch size. The first tensor, "y_true", represents the labels and the second, "y_pred", the model prediction. In this exercise, what I actually want to calculate is the square of a 2-D vector using a 2x2 matrix like this:

$$
L = 
\begin{bmatrix}
a & b
\end{bmatrix}
\begin{bmatrix}
c & d \\
e & f \\
\end{bmatrix}
\begin{bmatrix}
a \\
b\\
\end{bmatrix}
= a(ac + be) + b(ad + bf) 
$$

The shape of the inputs, though, is not exactly correct for the multiplication routine, so we need to do some reshaping and one transpose. Note that the reshape operations do not alter the data, only the dimensions and axes used to index it.

Below the function I test it using an example matrix and vector. I make a batch of 3 inputs by simply duplicating these when evaluating the loss. This allows me to verify all of the steps are correct, even when working with a batch size that is not just 1. Specifically, the loss should end up doing the following calculation for each of the 3 entries in the batch:

$$
L = 
\begin{bmatrix}
5 & 6
\end{bmatrix}
\begin{bmatrix}
1 & 2 \\
3 & 4 \\
\end{bmatrix}
\begin{bmatrix}
5 \\
6 \\
\end{bmatrix}
=
\begin{bmatrix}
23 & 34
\end{bmatrix}
\begin{bmatrix}
5 \\
6 \\
\end{bmatrix}
= 319 
$$

There are detailed comments indicating the shape and contents of each tensor to make it easier to follow. I should note that the two things that took the most time to figure out were:
1. You need to use tf.transpose instead of K.transpose to avoid including the batch index in the transpose operation
2. You need to use K.batch_dot instead of K.dot for the same reason.

In [2]:
import tensorflow as tf
import tensorflow.keras.backend as K

#--------------------------------------------
# Define custom loss function 
def customLoss(y_true, y_pred):

    print('y_true shape: ' + str(y_true.shape) )  # y_true shape is (batch, 4)
    print('y_pred shape: ' + str(y_pred.shape) )  # y_pred shape is (batch, 2)
    
    batch_size = y_pred.shape[0]
    y_pred = K.reshape(y_pred, (batch_size, 2,1)) # y_pred shape is now (batch, 2,1) [[[5.] [6.]] [[5.] [6.]] [[5.] [6.]]]
    y_true = K.reshape(y_true, (batch_size, 2,2)) # y_true shape is now (batch, 2,2) [[[1. 2.] [3. 4.]] [[1. 2.] [3. 4.]] [[1. 2.] [3. 4.]]]
    
    # n.b. we must use tf.transpose here and not K.transpose since the latter does not allow perm argument
    y_true = tf.transpose(y_true, perm=[0,2,1])   # y_true shape is now (batch,2,2)  [[[1. 3.] [2. 4.]] [[1. 3.] [2. 4.]] [[1. 3.] [2. 4.]]]
    
    # n.b. use "batch_dot" and not "dot"!
    y_dot = K.batch_dot(y_true, y_pred)           # y_dot shape is (batch,2,1) [[[23.] [34.]] [[23.] [34.]] [[23.] [34.]]]
    y_dot = K.reshape(y_dot, (batch_size, 1, 2))  # y_dot shape is now (batch,1,2)  [[[23. 34.]] [[23. 34.]] [[23. 34.]]]
    y_loss = K.batch_dot(y_dot, y_pred)           # y_loss shape is (batch,1,1) [[[319.]] [[319.]] [[319.]]]
    y_loss = K.reshape(y_loss, (batch_size,))     # y_loss shape is now (batch) [319. 319. 319.]
    return y_loss

#--------------------------------------------
# Test loss function
xx = [1.0, 2.0, 3.0, 4.0]
yy = [5.0, 6.0]

loss = K.eval(customLoss(K.variable([xx,xx,xx]), K.variable([yy,yy,yy])))
print('loss shape: '    + str(loss.shape)    )
print(loss)


y_true shape: (3, 4)
y_pred shape: (3, 2)
loss shape: (3,)
[319. 319. 319.]
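
As an independent cross-check of the result above, the same batched calculation can be written in plain NumPy with `einsum`. This is only a sanity check on the arithmetic, not a replacement for the backend version, since a Keras loss has to be built from tensor operations.

```python
import numpy as np

M = np.array([[1.0, 2.0], [3.0, 4.0]])
v = np.array([5.0, 6.0])

# Batch of 3 identical entries, mirroring the test above
M_batch = np.stack([M, M, M])   # shape (3, 2, 2)
v_batch = np.stack([v, v, v])   # shape (3, 2)

# For each batch entry b: sum over i, j of v[b,i] * M[b,i,j] * v[b,j]
loss = np.einsum('bi,bij,bj->b', v_batch, M_batch, v_batch)
print(loss)   # [319. 319. 319.]
```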


OK, at this point I would like to do the same thing as above, but with a sparse matrix definition (only the unique elements) to represent a symmetric matrix. I also want to include a "truth" vector in the labels. Thus, the labels will be 5 elements like this:

$$
\begin{pmatrix}
a_{label} & b_{label} & c & d & e
\end{pmatrix}
$$

and the prediction still just 2 elements like this:

$$
\begin{pmatrix}
a_{pred} & b_{pred}
\end{pmatrix}
$$

The difference vector is defined as:

$$
\vec{\delta a} =
\begin{bmatrix}
(a_{pred}-a_{label}) \\
(b_{pred}-b_{label})
\end{bmatrix}
$$

and the multiplication tensor is:

$$
W = 
\begin{bmatrix}
c & d \\
d & e \\
\end{bmatrix}
$$


So the loss we want to calculate is:

$$
L = \vec{\delta a}^\intercal W \vec{\delta a}
$$
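
For reference, here is what this loss should evaluate to, sketched in NumPy rather than with backend functions. The function name `symmetric_loss` and the test values are made up for illustration. With $\vec{\delta a} = (1, 2)$, $c = 2$, $d = 0.5$ and $e = 1$, the expected value is $2\cdot1 + 2\cdot0.5\cdot1\cdot2 + 1\cdot4 = 8$.

```python
import numpy as np

def symmetric_loss(y_true, y_pred):
    """y_true: (batch, 5) rows of [a_label, b_label, c, d, e]; y_pred: (batch, 2)."""
    delta = y_pred - y_true[:, :2]                      # difference vector, shape (batch, 2)
    c, d, e = y_true[:, 2], y_true[:, 3], y_true[:, 4]
    # Build W = [[c, d], [d, e]] for every batch entry -> shape (batch, 2, 2)
    W = np.stack([np.stack([c, d], axis=1),
                  np.stack([d, e], axis=1)], axis=1)
    return np.einsum('bi,bij,bj->b', delta, W, delta)

y_true = np.array([[1.0, 2.0, 2.0, 0.5, 1.0]] * 3)      # batch of 3 identical labels
y_pred = np.array([[2.0, 4.0]] * 3)
print(symmetric_loss(y_true, y_pred))   # [8. 8. 8.]
```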

The problem here is to manipulate the sparse matrix (15 elements) into a full matrix form (25 elements) so that it can be multiplied. This would be pretty easy for a single matrix, but here we have a batch AND it must be done using backend functions. I'm sure there is probably some clever way to do this, but it is not clear to me what it is. A simple way to handle this, though, is to just write out the full 25 value matrix to the labels file. I need to write out the 15 parameters of the sparse inverse covariance matrix anyway, so adding the extra 10 parameters will increase it from 38 values to 48 values per track in the labels file. That file is already smaller than the features file by 2 orders of magnitude. Thus, I won't waste time on this now and will work on writing out the 25 element inverse covariance matrix to the labels file and moving on.
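
For what it's worth, one possible way to expand the 15 values for a whole batch is a fixed index map: a length-25 list saying which of the 15 packed values goes in each slot of the full matrix. The sketch below shows the idea in NumPy. It assumes the 15 values are stored in row-major upper-triangle order, which may not match how the labels file is actually written. In TensorFlow, the same lookup could presumably be done with `tf.gather(packed, flat_map, axis=1)` followed by a reshape, but that has not been tested here.

```python
import numpy as np

N = 5
iu = np.triu_indices(N)                          # upper-triangle positions, row-major order
index_map = np.zeros((N, N), dtype=int)
index_map[iu] = np.arange(len(iu[0]))            # packed index (0..14) for each upper-triangle slot
index_map = np.maximum(index_map, index_map.T)   # mirror the indices into the lower triangle
flat_map = index_map.reshape(-1)                 # length 25

def expand_symmetric(packed):
    """packed: (batch, 15) -> full symmetric matrices of shape (batch, 5, 5)."""
    return packed[:, flat_map].reshape(-1, N, N)

packed = np.arange(30, dtype=float).reshape(2, 15)   # two fake tracks
W = expand_symmetric(packed)
print(W.shape)   # (2, 5, 5)
```

The index map is computed once outside the loss function, so the only operations needed per batch are a gather and a reshape.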