### Optimizer description: InexactZSCG
**Name:** Zeroth-order Stochastic Conditional Gradient with Inexact Updates<br>
**Class:** zeroOptim.InexactZSCG <br>
**Paper:** *Zeroth-order Nonconvex Stochastic Optimization: Handling Constraints, High-Dimensionality and Saddle-Points* (Krishnakumar Balasubramanian and Saeed Ghadimi) <br>


**Description:** <br>
The Zeroth-order Stochastic Conditional Gradient method with Inexact Updates is a modified version of the classic *ZSCG* that uses the *Inexact Conditional Gradient* (*ICG*) procedure. At each iteration *k* it tries to minimize *F(x)* with these two main steps:

    1. Estimate the gradient as follows:
$$G_{v}^{k} \equiv G_{v}(x_{k-1}, \xi_{k}, u_{k}) = \frac{1}{m_{k}} \sum_{j=1}^{m_{k}} \frac{F(x_{k-1} + vu_{k,j}, \xi_{k,j}) - F(x_{k-1}, \xi_{k,j})}{v}u_{k,j}$$
    2. Use ICG to compute the new x:
$$x_{k} = ICG(x_{k-1}, G_{v}^{k}, \gamma_{k}, \mu_{k}) $$


    where ICG is described as follows:

&emsp;&emsp;Input is ($x, g, \gamma, \mu$) <br>
&emsp;&emsp;Set $\hat{y}_{0} = x$, $t=1$ and $n=0$ <br>
&emsp;&emsp;*While* n = 0: <br>
$$y_{t} = \arg\min_{u \in \chi}\big\{h_{\gamma}(u) := \langle g + \gamma(\hat{y}_{t-1} - x), u - \hat{y}_{t-1}\rangle \big\}$$

&emsp;&emsp;&emsp;&emsp;If $h_{\gamma}(y_{t}) \geq -\mu$, set $n = 1$<br>
&emsp;&emsp;&emsp;&emsp;Else $\hat{y}_{t} = \frac{t-1}{t+1}\hat{y}_{t-1} + \frac{2}{t+1}y_{t}$ and set $t = t + 1$<br>
&emsp;&emsp;*end while*<br>
&emsp;&emsp;Output $\hat{y}_{t}$
    
where: <br>
$x_{k}$ is the optimization variable at iteration k <br>
$\xi_{k}$ is a sample drawn from our distribution <br>
$u_{k,j} \sim N(0, I_{d})$ <br>
$m_{k}$ is the number of Gaussian vectors to generate <br>
$v$ is the Gaussian smoothing parameter <br>
$\gamma_{k}$ is the momentum parameter inside ICG at iteration k <br>
$\mu_{k}$ is the stopping tolerance of ICG at iteration k <br>
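The gradient estimate in step 1 can be sketched in a few lines. This is only an illustrative sketch, not the code of `zeroOptim.InexactZSCG`: it uses plain Python lists instead of torch tensors, it assumes a deterministic objective (a single fixed sample $\xi$, so $F(x)$ is evaluated once and reused), and the names `estimate_gradient`, `linear_F` and `a` are hypothetical.

```python
import random


def estimate_gradient(F, x, v, m, rng):
    """Average of m finite differences (F(x + v*u) - F(x)) / v * u with u ~ N(0, I_d)."""
    d = len(x)
    fx = F(x)  # assumption: F is deterministic, so F(x) is computed once and reused
    grad = [0.0] * d
    for _ in range(m):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]     # Gaussian direction u_{k,j}
        x_pert = [xi + v * ui for xi, ui in zip(x, u)]  # x + v * u
        scale = (F(x_pert) - fx) / v                    # finite difference along u
        for i in range(d):
            grad[i] += scale * u[i] / m
    return grad


# Hypothetical toy objective: a linear function whose true gradient is `a`.
a = [1.0, -2.0, 0.5]


def linear_F(x):
    return sum(ai * xi for ai, xi in zip(a, x))


rng = random.Random(42)
g = estimate_gradient(linear_F, x=[0.2, 0.4, 0.6], v=1e-3, m=2000, rng=rng)
print(g)  # a noisy estimate of `a`; the noise shrinks as m grows
```

Each call costs $m + 1$ evaluations of $F$, which is why the choice of $m_{k}$ dominates the running time.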

**Args:**

        Name            Type                Description
        x               (torch.tensor)      The variable of our optimization problem. Should be a 3D tensor (img)
        v               (float)             The Gaussian smoothing parameter
        n_gradient      (list)              Number of normal vectors to generate at every step
        gamma_k         (list)              Momentum at every step inside ICG
        mu_k            (list)              Stopping criterion at every step k inside ICG
        max_t           (int)               The maximum number of iterations inside ICG
        epsilon         (float)             The upper bound on the norm of the perturbation
        L_type          (int)               Either -1 for L_infinity or x for Lx. Default is -1
        batch_size      (int)               Maximum parallelization during the gradient estimation. Default is -1 (= n_gradient)
        C               (tuple)             The bounds of the pixel values. Default is (0, 1)
        max_steps       (int)               The maximum number of steps. Default is 100
        verbose         (int)               Display information or not. Default is 0
        additional_out  (bool)              Also return all the intermediate x. Default is False
        tqdm_disable    (bool)              Disable the tqdm bar. Default is False
     
**Suggested values:** <br>
$v = \sqrt{\frac{1}{2N(d+3)^3}}$, 
$\gamma_{k} =2L$,
$\mu_{k} = \frac{1}{4N}$,
$m_{k} = 6(d + 5)N$,
$\forall k \geq 1$

where:<br>
- *N* is the number of steps <br>
- *d* is the dimension of *x* <br>
- *L* is the Lipschitz constant of the gradient of *F*

**Empirical values:** <br>
In the case of MNIST we can set:<br>
$N = 100$ and $d = 784$, so, taking $L = 1$:<br>

- $v \approx 3.2 \cdot 10^{-6}$
- $\gamma_{k} = 2$
- $\mu_{k} = 0.0025$
- $m_{k} = 473400$
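
The arithmetic behind these numbers can be reproduced with a short helper. This is a sketch under the assumption $L = 1$ (which is what $\gamma_{k} = 2$ implies); the function name `suggested_parameters` is hypothetical and not part of the package.

```python
import math


def suggested_parameters(N, d, L):
    """Evaluate the suggested formulas for v, gamma_k, mu_k and m_k."""
    v = math.sqrt(1.0 / (2 * N * (d + 3) ** 3))  # Gaussian smoothing parameter
    gamma = 2 * L                                # gamma_k = 2L
    mu = 1.0 / (4 * N)                           # mu_k = 1 / (4N)
    m = 6 * (d + 5) * N                          # m_k = 6(d + 5)N
    return v, gamma, mu, m


# MNIST setting from the text: N = 100 steps, d = 28 * 28 = 784, assumed L = 1.
v, gamma, mu, m = suggested_parameters(N=100, d=784, L=1.0)
print(f"v = {v:.2e}, gamma = {gamma}, mu = {mu}, m = {m}")
```

Note that $m_{k}$ grows linearly in both $d$ and $N$, so the theoretical value quickly becomes expensive for image-sized inputs.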

**N.B** <br>
In practice, the *ICG* loop takes a long time to reach the stopping criterion, but it usually gives good results after fewer than a hundred iterations. So the algorithm works much more efficiently when we set a maximum *t* inside *ICG*. <br>
Moreover, as with the classic ZSCG, the number of function evaluations needed for a good approximation of the gradient ($m_{k}$ in the paper, *n_gradient* in the *run* arguments) can be much smaller than the one indicated by the paper. In fact, while the paper multiplies the dimension *d* by 6 and by *N*, we found that this parameter does not need to depend on *N* and can be reduced even to *d*.
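
The ICG procedure with an iteration cap can be sketched as follows. This is a minimal illustration, not the package's implementation. It assumes the feasible set is a box (for example the intersection of an $L_\infty$ ball of radius *epsilon* with the pixel range *C*), so the linear minimization has a closed form. It stops when $h_{\gamma}(y_{t}) \geq -\mu$, since $h_{\gamma}(y_{t})$ is never positive ($y_{t}$ minimizes $h_{\gamma}$ and $h_{\gamma}(\hat{y}_{t-1}) = 0$), and it returns the last averaged iterate it computed. The names `icg`, `lower` and `upper` are hypothetical.

```python
def icg(x, g, gamma, mu, lower, upper, max_t):
    """Inexact conditional gradient over the box [lower, upper], capped at max_t iterations."""
    y_hat = list(x)  # y_hat_0 = x (assumed to lie inside the box)
    for t in range(1, max_t + 1):
        # Linear term g + gamma * (y_hat - x) used in h_gamma
        c = [gi + gamma * (yh - xi) for gi, yh, xi in zip(g, y_hat, x)]
        # Minimizing a linear function over a box picks a corner coordinate-wise
        y = [lo if ci > 0 else hi for ci, lo, hi in zip(c, lower, upper)]
        h = sum(ci * (yi - yh) for ci, yi, yh in zip(c, y, y_hat))
        if h >= -mu:  # gap small enough: stop
            break
        w = 2.0 / (t + 1)  # so that (t - 1) / (t + 1) = 1 - w
        y_hat = [(1.0 - w) * yh + w * yi for yh, yi in zip(y_hat, y)]
    return y_hat


# Toy usage: box of radius 0.1 around x0, clipped to the pixel range (0, 1).
x0 = [0.5, 0.5]
eps, C = 0.1, (0.0, 1.0)
lower = [max(C[0], xi - eps) for xi in x0]
upper = [min(C[1], xi + eps) for xi in x0]
new_x = icg(x0, g=[1.0, -1.0], gamma=2.0, mu=0.0025, lower=lower, upper=upper, max_t=100)
print(new_x)
```

Because every update is a convex combination of points of the box, the output stays feasible even when the loop is interrupted by `max_t` before the stopping criterion is met.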


### Results

The results are all taken with the torch random seed set to *42*. A good convergence time can be achieved with $\gamma \in [2, 4]$, *max_t* $\in [50, 200]$ and $\mu = 0.0025$.

The maximum number of steps has been set to 100.

**N.B** <br>
All the results were obtained in the *Google Colab environment* using the available *Tesla K80* GPU.


**1. MNIST**
    
    1.a) Untargeted

         Check results at these values of epsilon (0.25, 0.20, 0.15, 0.10, 0.05) for the infinity norm



    1.b) Targeted

         Check results at these values of epsilon (0.50, 0.40, 0.30, 0.20, 0.10) for the infinity norm
    
       
**2. CIFAR-10**


    2.a) Untargeted

         Check results at these values of epsilon (0.02, 0.01, 0.005) for the infinity norm



    2.b) Targeted

         Check results at these values of epsilon (0.02, 0.01, 0.005) for the infinity norm
    