# Advanced options

In the previous lesson we learned about the basic and intermediate options of Tensor Fox. For most applications this is enough, but sometimes one needs to change more parameters, add constraints, and so on. In this lesson we also cover the options regarding higher order tensors. Warning: this lesson has a more mathematical flavour.

Options already covered:

    display
    maxiter  
    tol     
    tol_step
    tol_improv
    tol_grad
    tol_mlsvd
    trunc_dims
    initialization
    refine    
    init_damp
    symm    
    tol_jump
    
Options to be covered:

    method
    inner_method 
    cg_maxiter 
    cg_factor
    cg_tol 
    constraints 
    trials 
    bi_method
    bi_method_maxiter 
    bi_method_tol 
    epochs 

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import TensorFox as tfx
from IPython.display import Image

In [2]:
# Create the tensor.
m = 2
T = np.zeros((m, m, m))
s = 0

for k in range(m):
    for i in range(m):
        for j in range(m):
            T[i,j,k] = s
            s += 1

# Inner algorithm options

The method we are using to solve the problem of tensor approximation is called *damped Gauss-Newton* (dGN), and at each step of this method the program needs to solve an equation of the form

$$(J^T J + \mu D) x = J^Tb$$
as already mentioned. To solve this equation we have to rely on another method, which can be an iterative method like the [conjugate gradient](https://en.wikipedia.org/wiki/Conjugate_gradient_method) (default) or a direct method using matrix factorization. The conjugate gradient method has its own parameters, and the user may have to tune them sometimes. With this in mind, Tensor Fox offers the parameters $\verb|inner| \_ \verb|method|, \ \verb|cg| \_ \verb|maxiter|, \ \verb|cg| \_ \verb|tol|$ and $\verb|cg| \_ \verb|factor|$. They are explained below.
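
To make the role of the inner solver concrete, here is a minimal NumPy sketch of a conjugate gradient loop applied to $(J^T J + \mu D) x = J^T b$. It is only an illustration and not the Tensor Fox implementation: the function name `cg_solve`, the choice of storing $D$ as a vector holding a diagonal, and the random test data are all assumptions made for this example.

```python
import numpy as np

def cg_solve(J, b, mu, D, maxiter=50, tol=1e-10):
    """Approximately solve (J^T J + mu*D) x = J^T b with conjugate gradient.

    D is assumed (for this sketch) to be a vector holding the diagonal of D.
    Returns the approximate solution and the number of iterations performed.
    """
    def apply_A(v):
        # Matrix-vector product with J^T J + mu*D, without forming the matrix.
        return J.T @ (J @ v) + mu * (D * v)

    rhs = J.T @ b
    x = np.zeros_like(rhs)
    r = rhs.copy()            # residual for the initial guess x = 0
    p = r.copy()              # first search direction
    rs_old = r @ r
    iters = 0
    for _ in range(maxiter):
        if np.sqrt(rs_old) < tol:   # absolute residual small enough
            break
        Ap = apply_A(p)
        alpha = rs_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
        iters += 1
    return x, iters

# Small random problem, compared against a direct solve.
rng = np.random.default_rng(0)
J = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
D = np.ones(5)
mu = 0.1

x_cg, iters = cg_solve(J, b, mu, D)
x_direct = np.linalg.solve(J.T @ J + mu * np.diag(D), J.T @ b)
print(iters, np.linalg.norm(x_cg - x_direct))
```

The two arguments `maxiter` and `tol` play the same role here as the cap on the number of iterations and the tolerance discussed below: a small cap gives a cheaper but less accurate step.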

The inner methods are: $\verb|cg|, \ \verb|cg| \_ \verb|static|$, $\verb|direct|$ and $\verb|als|$ (alternating least squares, but this one doesn't take the regularization into account). It is also possible to pass the parameter $\verb|inner| \_ \verb|method|$ as a list of strings containing the names of the available methods. The program then uses the prescribed sequence of methods, one at each iteration. We noticed that this hybrid approach can sometimes bring good results.
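
As an illustration of the list form, the options class below prescribes one inner method per outer iteration. This is a sketch only: the particular pattern of methods is arbitrary, and making the list exactly `maxiter` entries long is an assumption made here so that every outer iteration has an explicit entry (how the program handles a shorter or longer list is not covered in this lesson).

```python
# Hypothetical schedule: two cg steps followed by one direct step, repeated.
n_outer = 6
schedule = ['cg', 'cg', 'direct'] * (n_outer // 3)

class options:
    inner_method = schedule   # one method name per outer (dGN) iteration
    maxiter = n_outer         # assumption: kept equal to len(inner_method)

print(options.inner_method)
```

Such a class would then be passed to `tfx.cpd` in the same way as the options classes used elsewhere in this lesson.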

The difference between the static and non-static versions is the way the program deals with the maximum number of iterations. The static algorithm has a maximum number of iterations $\verb|cg| \_ \verb|maxiter|$ which stays fixed during the whole run. The non-static version uses the parameter $\verb|cg| \_ \verb|factor|$ to control the number of iterations in a different way. If the program is at the $k$-th iteration of the dGN, then the maximum number of iterations permitted for the cg method is
 
$$1 + \mathrm{int}\left( \verb|cg|\_\verb|factor| \cdot \verb|randint|\left( 1 + k^{0.4}, 2 + k^{0.9} \right) \right).$$
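
To get a feeling for how this cap grows with $k$, the formula can be evaluated with the standard library. The sketch below is not taken from the Tensor Fox source. It assumes that the two bounds are truncated to integers and that the upper bound is excluded from the draw (as in `numpy.random.randint`). The name `cg_iteration_cap` is made up for this example.

```python
import random

def cg_iteration_cap(k, cg_factor=1.0, rng=random):
    """Maximum number of cg iterations at the k-th dGN iteration (sketch)."""
    low = int(1 + k ** 0.4)          # assumption: bounds truncated to integers
    high = int(2 + k ** 0.9)
    draw = rng.randrange(low, high)  # assumption: upper bound is exclusive
    return 1 + int(cg_factor * draw)

rng = random.Random(0)
for k in (1, 10, 100):
    default_caps = [cg_iteration_cap(k, 1.0, rng) for _ in range(5)]
    reduced_caps = [cg_iteration_cap(k, 0.1, rng) for _ in range(5)]
    print(k, default_caps, reduced_caps)
```

With `cg_factor = 0.1` the cap stays small even for large $k$, which is the effect described in the next paragraph.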

The strange interval of random integers in the formula above was obtained after a lot of tests, a lot! This seems to be a robust choice, but since we can't be right all the time, the parameter $\verb|cg| \_ \verb|factor|$ comes to the rescue. If the maximum number of iterations is growing too much, just set this parameter to a low value such as $0.1$ or $0.5$. Finally, the parameter $\verb|cg| \_ \verb|tol|$, as the name suggests, is the tolerance parameter for the cg method. The cg iterations stop when the (absolute) residual is less than $\verb|cg| \_ \verb|tol|$. Below there is an example showing how to set up a method and its parameters.

In [3]:
# Let's use cg_static as the inner algorithm, with 3 iterations max and tolerance of 1e-7.
class options:
    inner_method = 'cg_static'
    cg_maxiter = 3
    cg_tol = 1e-7
    display = 2

R = 3
factors, output = tfx.cpd(T, R, options)

-----------------------------------------------------------------------------------------------
Computing MLSVD
    No compression detected
    Working with dimensions (2, 2, 2)
-----------------------------------------------------------------------------------------------
Type of initialization: random
-----------------------------------------------------------------------------------------------
Computing CPD
    Iteration | Rel error |  Step size  | Improvement | norm(grad) | Predicted error | # Inner iterations
        1     | 8.85e-01  |  5.17e-01   |  8.85e-01   |  2.55e+01  |    2.31e-03     |        3        
        2     | 2.73e-01  |  7.18e-01   |  6.12e-01   |  9.85e+00  |    7.60e-03     |        3        
        3     | 1.13e-01  |  9.79e-02   |  1.59e-01   |  7.66e+00  |    3.56e-04     |        3        
        4     | 3.71e-02  |  5.93e-02   |  7.63e-02   |  2.46e+00  |    1.72e-04     |        3        
        5     | 1.76e-02  |  2.26e-02   |  1.95e-02   |  7.63e-

       97     | 3.42e-13  |  3.43e-13   |  9.26e-14   |  4.57e-12  |    6.52e-26     |        3        
       98     | 2.87e-13  |  1.25e-13   |  5.46e-14   |  4.56e-12  |    2.07e-26     |        3        
       99     | 2.35e-13  |  2.29e-13   |  5.23e-14   |  1.93e-12  |    6.24e-26     |        3        
       100    | 1.79e-13  |  1.64e-13   |  5.63e-14   |  3.74e-12  |    1.65e-26     |        3        
       101    | 1.51e-13  |  5.76e-14   |  2.74e-14   |  3.18e-12  |    7.88e-27     |        3        
       102    | 1.04e-13  |  1.84e-13   |  4.75e-14   |  1.36e-12  |    1.84e-26     |        3        
       103    | 8.32e-14  |  4.39e-14   |  2.05e-14   |  1.72e-12  |    1.76e-27     |        3        
       104    | 7.35e-14  |  2.72e-14   |  9.68e-15   |  8.54e-13  |    1.42e-27     |        3        
       105    | 5.39e-14  |  8.55e-14   |  1.96e-14   |  4.97e-13  |    4.91e-27     |        3        
       106    | 3.50e-14  |  4.91e-14   |  1.89e-14   |  1.12e-1

# Constraints

The parameter $\verb|factors| \_ \verb|norm|$ is used to fix the norm of the factor matrices of the CPD. Suppose $T$ is a third order tensor and $(X^{(k)}, Y^{(k)}, Z^{(k)})$ is the approximated CPD at iteration $k$. If one sets $\verb|factors| \_ \verb|norm| = 2$, for example, then $\| X^{(k)} \| = \| Y^{(k)} \| = \| Z^{(k)} \| = 2$ for all $k$.
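
The effect of such a constraint on a single iterate can be pictured with the short NumPy sketch below, which rescales three matrices so that each one has norm $2$. It only illustrates the condition $\| X^{(k)} \| = \| Y^{(k)} \| = \| Z^{(k)} \| = 2$. The helper `rescale_to_norm` is hypothetical, and using the Frobenius norm is an assumption of this example, not a statement about how Tensor Fox enforces the constraint.

```python
import numpy as np

def rescale_to_norm(M, target_norm):
    """Return a multiple of M whose Frobenius norm equals target_norm."""
    current = np.linalg.norm(M)  # Frobenius norm for a 2-D array
    if current == 0:
        raise ValueError("cannot rescale a zero matrix")
    return M * (target_norm / current)

rng = np.random.default_rng(0)
X, Y, Z = (rng.standard_normal((2, 3)) for _ in range(3))
X, Y, Z = (rescale_to_norm(M, 2.0) for M in (X, Y, Z))

print([float(np.linalg.norm(M)) for M in (X, Y, Z)])
```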

# Higher order tensors and the Tensor Train format

Tensor Fox has distinct approaches when it comes to computing the CPD of third order tensors and higher order tensors. By default the program relies on the *Damped Gauss-Newton* (dGN) method. However, you can set the program to use the *Tensor Train format* (TT format), also called *Tensor Train decomposition*. Without going into too much detail, we use a specific configuration of the TT format which can be obtained by computing several third order CPD's. More precisely, if $T$ is a tensor of order $L$, then its TT format can be obtained by computing $L-2$ third order CPD's, and once we have the TT format of $T$, a CPD for $T$ can also be computed. The figure below illustrates the representation of a tensor train associated to a tensor of order $L$.

![tensortrain](tensor-train.png)

Each square represents the coordinates of a tensor and each circle is the coordinate which is shared between two consecutive tensors. These tensors are usually denoted by $\mathcal{G}^{(\ell)}$. For example, the second tensor is $\mathcal{G}^{(2)}$, which has coordinates $\mathcal{G}^{(2)}_{j_1 i_2 j_2}$, and the next tensor is $\mathcal{G}^{(3)}$, which has coordinates $\mathcal{G}^{(3)}_{j_2 i_3 j_3}$. The first and last tensors are actually matrices (which means $j_0 = j_L = 1$), and the other $L-2$ tensors are third order tensors. They are related to $T$ by the following formula:
$$T_{i_1 i_2 \ldots i_L} = \sum_{j_0, j_1, \ldots, j_L} \mathcal{G}^{(1)}_{j_0 i_1 j_1} \cdot \mathcal{G}^{(2)}_{j_1 i_2 j_2} \cdot \ldots \cdot \mathcal{G}^{(L)}_{j_{L-1} i_L j_L}.$$

By computing a CPD for $\mathcal{G}^{(2)}, \ldots, \mathcal{G}^{(L-1)}$ we can obtain a CPD for $T$.
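
The formula relating the tensors $\mathcal{G}^{(\ell)}$ to $T$ can be checked numerically. The NumPy sketch below builds random tensors $\mathcal{G}^{(1)}, \ldots, \mathcal{G}^{(4)}$ with $j_0$ and $j_L$ of size $1$ and contracts them over the shared indices. The sizes and the helper name `tt_to_tensor` are arbitrary choices for this illustration and are not part of Tensor Fox.

```python
import numpy as np

def tt_to_tensor(cores):
    """Contract a list of cores G^(l) of shape (r_{l-1}, n_l, r_l) into T."""
    result = cores[0]
    for G in cores[1:]:
        # Sum over the index shared by consecutive cores (j_1, j_2, ...).
        result = np.tensordot(result, G, axes=([-1], [0]))
    # j_0 and j_L have size 1, so the first and last axes can be dropped.
    return result[0, ..., 0]

rng = np.random.default_rng(0)
n, r, L = 3, 2, 4
ranks = [1] + [r] * (L - 1) + [1]          # sizes of j_0, j_1, ..., j_L
cores = [rng.standard_normal((ranks[l], n, ranks[l + 1])) for l in range(L)]

T_tt = tt_to_tensor(cores)

# The same contraction written explicitly for L = 4.
T_check = np.einsum('aib,bjc,ckd,dle->ijkl', *cores)
print(T_tt.shape, np.allclose(T_tt, T_check))
```

A single coordinate $T_{i_1 i_2 i_3 i_4}$ is just a product of the matrix slices `cores[l][:, i_l, :]`, which is why the first and last tensors behave like matrices.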

## Trials

In the case $T$ has order higher than $3$, the parameter $\verb|trials|$ defines how many times we compute each one of these third order CPD's. The idea is to compute each CPD several times and keep the best result (smallest error). This may be helpful because all $L-2$ CPD's need to be of good quality in order to get a good CPD for $T$. If just one of the third order CPD's has bad precision, then everything falls apart. Currently the default is $\verb|trials|= 3$, but this may change depending on the problem. This parameter makes no difference if $T$ is a third order tensor.
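
The "compute several times and keep the best" idea is generic and can be sketched in a few lines. In the example below `best_of_trials` and `toy_run` are hypothetical names: `toy_run` only stands in for one third order CPD computation and returns a random error, so nothing here reflects the actual internals of Tensor Fox.

```python
import random

def best_of_trials(run_once, trials=3):
    """Call run_once() `trials` times and keep the result with smallest error.

    run_once must return a pair (result, error).
    """
    best_result, best_error = None, float('inf')
    for _ in range(trials):
        result, error = run_once()
        if error < best_error:
            best_result, best_error = result, error
    return best_result, best_error

# Toy stand-in for a CPD computation whose quality varies between runs.
rng = random.Random(0)

def toy_run():
    error = rng.random()
    return {'error': error}, error

best, best_error = best_of_trials(toy_run, trials=3)
print(best_error)
```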

## Display

As we've said, the option $\verb|trials|$ controls the repetition of the third order CPD computations. If $\verb|display|$ is set to $1, 2, 3$ or $4$, then all the information about each one of these CPD's is printed on the screen. This means we will have the information of $(L-2) \cdot \verb|trials|$ CPD's printed on the screen when $T$ has order $L$. Sometimes this amount of information is just too much. In these situations we can make everything more succinct just by setting $\verb|display| = -1$. Consider the following fourth order tensor.

In [4]:
# Initialize dimensions of the tensor.
k = 2
dims = (k+1, k+1, k+1, k+1)
L = len(dims)

# Create four random factor matrices so that
# A = (orig_factors[0], orig_factors[1], orig_factors[2], orig_factors[3])*I.
orig_factors = []
for l in range(L):
    M = np.random.randn(dims[l], k)
    Q, R = np.linalg.qr(M)
    orig_factors.append(Q)
    
# From the factor matrices generate the respective tensor in coordinates.
A = tfx.cpd2tens(orig_factors)

print('A = ')
tfx.showtens(A) # now this is the same as print(A)

A = 
[[[[-0.03711491  0.07680179  0.10025952]
   [-0.00721874  0.01731023  0.0211103 ]
   [ 0.07075037  0.13224048 -0.00201738]]

  [[ 0.03438766  0.12081365  0.03738994]
   [ 0.00750004  0.0268292   0.00848021]
   [ 0.02978475  0.16094999  0.07059853]]

  [[-0.13563723 -0.06139147  0.13425705]
   [-0.02782745 -0.01312277  0.0271862 ]
   [ 0.08868373 -0.02183128 -0.12983804]]]


 [[[ 0.18185858 -0.00451664 -0.23893454]
   [ 0.03694311 -0.00179423 -0.04913262]
   [-0.16202511 -0.09894374  0.142997  ]]

  [[-0.04128539 -0.06991435  0.00609931]
   [-0.00868676 -0.01543356  0.00079264]
   [ 0.00155304 -0.08229022 -0.05786075]]

  [[ 0.43793686  0.11552733 -0.48959812]
   [ 0.08949784  0.02390535 -0.09985465]
   [-0.32740134 -0.05161518  0.38960847]]]


 [[[-0.11415024 -0.30844164 -0.06127267]
   [-0.02450491 -0.06838206 -0.01462497]
   [-0.05288387 -0.39753762 -0.20119995]]

  [[-0.08058704 -0.39318723 -0.16231672]
   [-0.01804163 -0.0874508  -0.03594885]
   [-0.12445869 -0.5397061  -0.204

In [5]:
# Compute the CPD of A with succinct display for higher order tensors.
class options:
    display = -1
    method = 'ttcpd'
    
factors, output = tfx.cpd(A, k, options)

-----------------------------------------------------------------------------------------------
Computing MLSVD
    Compression detected
    Compressing from (3, 3, 3, 3) to (2, 2, 2, 2)

Total of 2 third order CPDs to be computed:
CPD 1 error = 5.768888059150692e-16
CPD 2 error = 7.684335501194144e-16

Final results
    Number of steps = 151
    Relative error = 1.1144179076987622e-15
    Accuracy =  100.0 %


In [6]:
# The option display = -2 is shown below.
options.display = -2
factors, output = tfx.cpd(A, k, options)

-----------------------------------------------------------------------------------------------
Computing MLSVD
    Compression detected
    Compressing from (3, 3, 3, 3) to (2, 2, 2, 2)
    Compression relative error = 9.507865e-16

SVD Tensor train error =  1.3699703393889097e-15

Total of 2 third order CPDs to be computed:
CPD 1 error = 5.959374814064024e-16
CPD 2 error = 5.255267395909838e-16

CPD Tensor train error =  1.3970430154872666

Final results
    Number of steps = 203
    Relative error = 1.3256020285923175e-15
    Accuracy =  100.0 %


## MLSVD tolerance with high order tensors

Now let's see what happens when we set $\verb|tol| \_ \verb|mlsvd| = [\verb|1e-6|, \verb|-1|]$ for the tensor $A$. This choice means the program will perform the high order compression using $10^{-6}$ as tolerance, and will not compress the intermediate third order tensors. 

In [7]:
options.tol_mlsvd = [1e-6, -1]
factors, output = tfx.cpd(A, k, options)

-----------------------------------------------------------------------------------------------
Computing MLSVD
    Compression detected
    Compressing from (3, 3, 3, 3) to (2, 2, 2, 2)
    Compression relative error = 9.507865e-16

SVD Tensor train error =  1.3699703393889097e-15

Total of 2 third order CPDs to be computed:
CPD 1 error = 1.766354016019247e-16
CPD 2 error = 4.069766273599779e-16

CPD Tensor train error =  0.7223142407664687

Final results
    Number of steps = 130
    Relative error = 1.3400884814913531e-15
    Accuracy =  100.0 %


## Inner algorithm options

Just as the usual third order tensors have the options $\verb|method|, \verb|tol|, \verb|maxiter|$ for their inner computations, the third order tensors of a tensor train can also receive these parameters. However there is a difference here: when computing the CPD's of each $\mathcal{G}^{(\ell)}$, the program starts by computing the CPD of $\mathcal{G}^{(2)}$, and one factor of this CPD is used to compute the CPD of $\mathcal{G}^{(3)}$. Then one factor of that CPD is used to compute the CPD of $\mathcal{G}^{(4)}$, and so on. In short, each CPD depends on the previously computed CPD. The matrices $\mathcal{G}^{(1)}$ and $\mathcal{G}^{(L)}$ are easily computed after we have the other CPD's.

The first CPD can be computed as any CPD, but the others always depend on some previously computed factor, which is used to fix one factor of the next CPD. This means each CPD, except the first, is actually only computing two factors, so there is a difference in how the program computes the first CPD and the remaining ones. Therefore, the parameters $\verb|method|, \ \verb|method| \_ \verb|tol|, \ \verb|method| \_ \verb|maxiter|$ are used for the first CPD and the parameters $\verb|bi| \_ \verb|method|, \ \verb|bi| \_ \verb|method| \_ \verb|tol|, \ \verb|bi| \_ \verb|method| \_ \verb|maxiter|$ are used for all the remaining CPD's. The figure below illustrates these observations.

![tensortrainmethods](tensor-train-methods.png)
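
To illustrate what "computing only two factors" means, the NumPy sketch below fits a rank-$R$ CPD of a third order tensor while keeping the first factor fixed. It uses a plain alternating least squares loop as a stand-in. This is an assumption of the example and not necessarily what the program does internally, and the names `khatri_rao` and `cpd_fixed_first_factor` are made up for the illustration.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product; row i*B.shape[0] + j is A[i, :] * B[j, :]."""
    R = A.shape[1]
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, R)

def cpd_fixed_first_factor(G, X, n_iter=50, seed=0):
    """Fit G[i,j,k] ~ sum_r X[i,r] Y[j,r] Z[k,r] with X held fixed."""
    rng = np.random.default_rng(seed)
    I, J, K = G.shape
    R = X.shape[1]
    Y = rng.standard_normal((J, R))
    Z = rng.standard_normal((K, R))
    G2 = G.transpose(1, 0, 2).reshape(J, I * K)   # mode-2 unfolding
    G3 = G.transpose(2, 0, 1).reshape(K, I * J)   # mode-3 unfolding
    errors = []
    for _ in range(n_iter):
        # Each update is a linear least squares problem in one factor.
        Y = np.linalg.lstsq(khatri_rao(X, Z), G2.T, rcond=None)[0].T
        Z = np.linalg.lstsq(khatri_rao(X, Y), G3.T, rcond=None)[0].T
        approx = np.einsum('ir,jr,kr->ijk', X, Y, Z)
        errors.append(np.linalg.norm(G - approx) / np.linalg.norm(G))
    return Y, Z, errors

# Build a rank-2 tensor and pretend its first factor is already known.
rng = np.random.default_rng(1)
X_true, Y_true, Z_true = (rng.standard_normal((3, 2)) for _ in range(3))
G = np.einsum('ir,jr,kr->ijk', X_true, Y_true, Z_true)

Y, Z, errors = cpd_fixed_first_factor(G, X_true)
print(errors[0], errors[-1])
```

Since each update solves its least squares problem exactly, the relative error cannot increase from one sweep to the next, although the loop is not guaranteed to reach a small error in a fixed number of sweeps.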

## Epochs

As we can note, there is a flow of information in the tensor train format: the CPD's are computed from left to right, and the next CPD always depends on some information about the previous CPD. Once we compute the CPD of $\mathcal{G}^{(L-1)}$ it is possible to "go back", that is, use the information of the CPD of $\mathcal{G}^{(L-1)}$ to compute a new CPD for $\mathcal{G}^{(L-2)}$. We just have to reverse the way information is propagated. Doing this we may be able to refine all CPD's. These cycles can be repeated several times, with the information being propagated forward and backward again and again. Each cycle is called an *epoch*, and the number of epochs can be passed to the program through the parameter $\verb|epochs|$. Below we redefine the tensor $A$ to have a higher order and try to refine the CPD by using more epochs than just $1$.

In [8]:
# Initialize the sixth-order tensor and compute its CPD with default options.
k = 2
dims = (k+1, k+1, k+1, k+1, k+1, k+1)
L = len(dims)

orig_factors = []
for l in range(L):
    M = np.random.randn(dims[l], k)
    Q, R = np.linalg.qr(M)
    orig_factors.append(Q)
    
A = tfx.cpd2tens(orig_factors)

class options:
    display = -1
    method = 'ttcpd'
    
factors, output = tfx.cpd(A, k, options)

-----------------------------------------------------------------------------------------------
Computing MLSVD
    Compression detected
    Compressing from (3, 3, 3, 3, 3, 3) to (2, 2, 2, 2, 2, 2)

Total of 4 third order CPDs to be computed:
CPD 1 error = 3.630479235543843e-16
CPD 2 error = 8.402324947297753e-16
CPD 3 error = 8.626582930171298e-16
CPD 4 error = 7.7864160336646665e-16

Final results
    Number of steps = 157
    Relative error = 2.238693023212968e-15
    Accuracy =  100.0 %


In [9]:
# Now we use 5 epochs on the same tensor.
options.epochs = 5
options.display = -1
factors, output = tfx.cpd(A, k, options)

-----------------------------------------------------------------------------------------------
Computing MLSVD
    Compression detected
    Compressing from (3, 3, 3, 3, 3, 3) to (2, 2, 2, 2, 2, 2)

Total of 4 third order CPDs to be computed:
Epoch  1
CPD 1 error = 4.1278632793483593e-16
CPD 2 error = 9.639678422602486e-16
CPD 3 error = 9.820925191391736e-16
CPD 4 error = 3.5208032455753626e-15

Epoch  2
CPD 3 error = 1.1756050901949517e-15
CPD 2 error = 2.286430943149485e-15
CPD 1 error = 1.856750182925619e-15

Epoch  3
CPD 2 error = 1.9035644921838262e-15
CPD 3 error = 8.224272590745016e-16
CPD 4 error = 6.099915288392242e-16

Epoch  4
CPD 3 error = 1.2331724470195406e-15
CPD 2 error = 9.60527086057593e-16
CPD 1 error = 6.494983885761952e-16

Epoch  5
CPD 2 error = 1.9772512311969352e-15
CPD 3 error = 2.085180913601653e-15
CPD 4 error = 4.585973723508304e-16

Final results
    Number of steps = 12
    Relative error = 2.5431153877136344e-15
    Accuracy =  100.0 %
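
The order in which the CPD's are visited in the output above (forward in odd epochs, backward in even epochs, skipping the CPD that was just computed) can be reproduced with a few lines of Python. The function `sweep_order` is only a reading of the printed output, written for illustration. It is not taken from the Tensor Fox source.

```python
def sweep_order(n_cpds, epochs):
    """List, for each epoch, the indices of the CPD's in the order visited."""
    order = []
    for epoch in range(1, epochs + 1):
        if epoch == 1:
            visit = list(range(1, n_cpds + 1))     # first pass: 1, 2, ..., n
        elif epoch % 2 == 0:
            visit = list(range(n_cpds - 1, 0, -1)) # backward: n-1, ..., 1
        else:
            visit = list(range(2, n_cpds + 1))     # forward again: 2, ..., n
        order.append(visit)
    return order

for epoch, visit in enumerate(sweep_order(4, 5), start=1):
    print('Epoch', epoch, visit)
```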


# Method

The last parameter to be seen is $\verb|method|$. By default Tensor Fox uses $\verb|method| = \verb|'dGN'|$, which means the program will use the damped Gauss-Newton method. For higher order tensors, $\verb|method| = \verb|'ttcpd'|$ can be a good choice when the rank is smaller than all dimensions. This method is recommended for dense tensors since it avoids the curse of dimensionality. We note that it is also possible to set $\verb|method| = \verb|'ttcpd'|$ for third order tensors. Finally, another possibility is $\verb|method| = \verb|'als'|$ (Alternating Least Squares), a classic and well known method to compute CPDs.