# Overview
This notebook is a short summary of the ***ProxSkip*** optimization algorithm introduced in the following ***[paper](https://proceedings.mlr.press/v162/mishchenko22b.html)***.

# The problem statement
ProxSkip tackles the following class of problems:

$$ \min_{x \in \mathbb{R}^d} f(x) + \psi(x)$$

where $f: \mathbb{R}^d \to \mathbb{R}$ is a smooth, convex function and $\psi: \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is a non-smooth regularizer whose proximity operator is expensive to evaluate.

Numerous applications fit this setting:

1. Signal Processing: splitting a signal (a function) into a sum of functions subject to convex constraints; the constraints can be modeled as the indicator functions of the corresponding sets [1](https://arxiv.org/pdf/0912.3522.pdf)
2. Machine Learning: decentralized / distributed training is crucial to train huge models. The consensus form is a mechanism to ensure that local solutions (on different devices) can be effectively leveraged to minimize the ***global objective***

# Prox Gradient Descent: The starting point
This class of problems is generally solved with Proximal Gradient Descent:

$$ x_{t+1} = prox_{\gamma_t \psi}(x_t - \gamma_t \nabla f(x_t))$$

where the $prox$ operator is defined as:

$$ prox_{\gamma \psi}(x) = \operatorname*{argmin}_{y \in \mathbb{R}^d} \left[~\frac{1}{2} \| x - y \| ^ 2 + \gamma \cdot \psi(y)~\right]$$

Even though the proximity operator is itself an optimization subproblem, closed-form expressions are known for many standard and popular regularizers, such as $\|x\|_1$, $\|x\|_2$ ...

However, since $\psi$ is generally non-smooth and not differentiable (at least not on its entire domain, take $\|x\|_1$ for example), evaluating the ***proximity operator*** can turn out to be quite expensive when no closed form is available.
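As a concrete illustration of the closed forms mentioned above, the sketch below implements the well-known soft-thresholding formula for the proximity operator of $\lambda \|x\|_1$ and plugs it into a plain Proximal Gradient Descent loop. The test problem (a diagonal quadratic $f$), the weight `lam` and all function names are assumptions made for this example; they do not come from the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximity operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient_descent(grad_f, prox, x0, gamma, num_iters):
    """Iterate x <- prox_{gamma * psi}(x - gamma * grad_f(x))."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(num_iters):
        x = prox(x - gamma * grad_f(x), gamma)
    return x

# Illustrative problem (an assumption of this sketch):
# f(x) = 0.5 * x^T Q x - b^T x with diagonal Q, psi(x) = lam * ||x||_1.
q = np.array([1.0, 10.0])   # diagonal of Q, so mu = 1 and L = 10
b = np.array([3.0, 0.5])
lam = 1.0

grad_f = lambda x: q * x - b
prox_l1 = lambda v, step: soft_threshold(v, step * lam)

x_pgd = proximal_gradient_descent(
    grad_f, prox_l1, np.zeros(2), gamma=1.0 / q.max(), num_iters=500
)

# Because Q is diagonal, the minimizer is known coordinate-wise.
x_closed_form = soft_threshold(b, lam) / q
```

On this separable problem the iterates can be compared directly with `x_closed_form`, which makes it a convenient test bed for the variants discussed later.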

# Expensive Proximity Operators
## Inherently computationally expensive

The proximity operator bridges the gap between constrained and unconstrained optimization. The constrained problem

$$
\min_{x \in \mathbb{R}^d} f(x) \quad \text{subject to} \quad x \in X
$$

can be reformulated as:

$$ \min_{x \in \mathbb{R}^d} f(x) + \psi(x)$$

where $\psi$ is the indicator function of $X$:

$$
\psi(x) =
\begin{cases}
0 & \text{if } x \in X \\
\infty & \text{if } x \notin X
\end{cases}
$$

For this $\psi$, the proximity operator is the Euclidean projection onto $X$, which can itself be a difficult optimization problem when $X$ is a complex set.
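For simple sets, the proximity operator of the indicator function (the Euclidean projection onto the set) is cheap. The sketch below shows two textbook cases, a box and a Euclidean ball, where it reduces to a one-line formula. Both sets and the function names are chosen here for illustration and are not taken from the paper. For more complicated sets such a formula may not exist, and every evaluation then means solving an optimization problem.

```python
import numpy as np

def project_box(x, lower, upper):
    """Projection onto the box {z : lower <= z <= upper} (coordinate-wise clipping)."""
    return np.clip(x, lower, upper)

def project_ball(x, radius=1.0):
    """Projection onto the Euclidean ball {z : ||z|| <= radius}."""
    norm = np.linalg.norm(x)
    if norm <= radius:
        return x.copy()          # already inside the ball
    return x * (radius / norm)   # rescale onto the boundary

x = np.array([3.0, 4.0])
x_box = project_box(x, -1.0, 1.0)      # [1., 1.]
x_ball = project_ball(x, radius=1.0)   # [0.6, 0.8]
```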

## Expensive Communication-wise
The proximity operator also emerges in the decentralized training regime. Assuming $n$ devices, the original global training objective is:

$$
f(x) = \frac{1}{n} \sum_{i = 1} ^ n f_i(x)
$$

We give each device its own copy $x_i$ of the parameters and model the problem as $f(X)$:

$$
f(X) = f(x_1, x_2, \dots, x_n) = \frac{1}{n} \sum_{i = 1} ^ n f_i(x_i)
$$

while adding the consensus constraint:

$$x_1 = x_2 = \dots = x_{n - 1} = x_n$$


As you have probably guessed, such a constraint can be expressed as an indicator function:

$$
\psi_C(x_1, x_2, \dots, x_{n - 1}, x_n) =
\begin{cases}
0 & \text{if } x_1 = x_2 = \dots = x_{n - 1} = x_n \\
\infty & \text{otherwise}
\end{cases}
$$



The theoretical solution for the proximity operator of $\psi_C$ is to replace every $x_i$ with the average of the $x_i$, which is theoretically straightforward. However, in Federated Learning, such a simple operation requires $O(n)$ communications, which can be prohibitively expensive in modern settings where $n$ is quite large.
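A minimal sketch of this averaging step, assuming the local vectors are stacked as the rows of a NumPy array (the function name `prox_consensus` is made up for this example). In a real federated system each row lives on a different device, so computing the mean is exactly the communication round discussed above.

```python
import numpy as np

def prox_consensus(X):
    """Project the rows x_1, ..., x_n of X onto the consensus set x_1 = ... = x_n.

    Every row is replaced by the average of all rows.
    """
    mean = X.mean(axis=0, keepdims=True)
    return np.repeat(mean, X.shape[0], axis=0)

X = np.array([[1.0, 0.0],
              [3.0, 2.0],
              [5.0, 4.0]])   # n = 3 devices, d = 2 parameters
X_avg = prox_consensus(X)    # every row equals [3., 2.]
```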

# ProxSkip: A provable solution:

The consensus constraint is just one example of a constraint that is expensive because of communication. Since communication is the bottleneck, researchers have been seeking Gradient Descent methods with a communication complexity lower than $$ O(\kappa \cdot \log(\frac{1}{\epsilon}))$$ with no additional assumptions on data similarity or stronger smoothness assumptions,

where $\kappa = \frac{L}{\mu}$ is the condition number ($L$ is the smoothness constant and $\mu$ the strong convexity constant of $f$).

The authors of the paper introduce a version of Proximal Gradient Descent in which the proximity operator is evaluated only with probability $p$ at each iteration, i.e. $\frac{1}{p}$ times less frequently on average, and ***Scaffnew***, an extension of this algorithm to distributed training settings. The speed-up is obtained without relying on any particular acceleration mechanism.

Scaffnew achieves a linear convergence rate, with an iteration complexity of $$O(\kappa \cdot \log(\frac{1}{\epsilon}))$$
and the theoretically optimal communication complexity:

$$O(\sqrt{\kappa} \cdot \log(\frac{1}{\epsilon}))$$

The authors also extend vanilla ProxSkip to a stochastic version (Stochastic ProxSkip) and to a decentralized version.

# Convergence Analysis: The magic Explained

## Assumptions:
1. $f$ is strongly convex with constant $\mu$ and smooth with constant $L$.
2. $\psi$ is a proper, closed and convex regularizer.
3. Firm non-expansiveness: the proximity operator satisfies the following inequality for all $x,y \in \mathbb{R}^d$ (this holds for any $\psi$ satisfying assumption 2):

    $$
    \| prox_{\psi}(x) - prox_{\psi}(y) \| ^ 2 + \|(x - prox_{\psi} (x)) - (y - prox_{\psi} (y)) \|^2 \leq \|x - y \| ^ 2
    $$
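The firm non-expansiveness inequality can be checked numerically. The sketch below does so for the soft-thresholding operator (the proximity operator of the $\ell_1$ norm, used here purely as an example) on random pairs of points. It is a sanity check, not a proof, and the helper names are made up for this notebook.

```python
import numpy as np

def soft_threshold(v, tau=1.0):
    """Proximity operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def firm_nonexpansiveness_gap(prox, x, y):
    """Return ||x - y||^2 - ||P(x) - P(y)||^2 - ||(x - P(x)) - (y - P(y))||^2.

    Firm non-expansiveness says this quantity is non-negative.
    """
    px, py = prox(x), prox(y)
    lhs = np.sum((px - py) ** 2) + np.sum(((x - px) - (y - py)) ** 2)
    return np.sum((x - y) ** 2) - lhs

rng = np.random.default_rng(0)
gaps = [
    firm_nonexpansiveness_gap(
        soft_threshold, 3.0 * rng.normal(size=5), 3.0 * rng.normal(size=5)
    )
    for _ in range(1000)
]
min_gap = min(gaps)   # expected to be >= 0 up to floating-point error
```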

## Intuition
<p float="left">
  <img src="proximal_gd.png" width="600" />
  <img src="proxskip.png" width="600" /> 
</p>

ProxSkip differs from Proximal Gradient Descent not only by skipping the proximity operator but also by adding a control mechanism, referred to in the paper as a **control variate** $h_t$.

In general, the gradient of $f$ at $x^{*}$ is not necessarily $0$. Thus, skipping the proximity operator without proper compensation would make $x_t$ drift away from $x^{*}$. The gradient step is therefore shifted by $h_t$:

$$
\hat{x}_{t + 1} = x_t - \gamma (\nabla f(x_t) - h_t)
$$

Intuitively, we would like the algorithm to converge:

$$
\lim_{t \rightarrow \infty} x_t = x^{*}
$$

For this, $h_t$ should converge to $h^{*} = \nabla f(x^{*})$: when $x_t$ converges to $x^{*}$, we would like the term multiplied by $\gamma$ to converge to $0$. More formally, we would like a Lyapunov function $\Psi_t$ in terms of $x_t, h_t, x^{*}, h^{*}$ that (in expectation) converges to $0$. The authors chose the following:

$$
\Psi_{t} = \|x_t - x^{*} \| ^ 2 + (\frac{\gamma}{p})^2 \cdot \|h_t - h^{*} \| ^ 2
$$

The authors prove that for any $T \geq 0$, $~\frac{1}{L} \geq \gamma > 0$ and $0 < p \leq 1$, the following inequality holds:

$$ \mathbb{E}[\Psi_T] \leq (1 - \xi) ^ T \cdot \Psi_0 $$

where $\xi = \min \{\gamma \cdot \mu , p^2\}$
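Putting the pieces together, here is a minimal NumPy sketch of the ProxSkip loop. The update of the control variate, $h_{t+1} = h_t + \frac{p}{\gamma}(x_{t+1} - \hat{x}_{t+1})$, and the argument of the proximity operator, $\hat{x}_{t+1} - \frac{\gamma}{p} h_t$, follow our reading of Algorithm 1 in the paper and should be checked against it. The test problem (a diagonal quadratic $f$ and an $\ell_1$ regularizer with weight `lam`), the function names, the zero initialization of $h$ and the random seed are assumptions of this example.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximity operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proxskip(grad_f, prox, x0, gamma, p, num_iters, seed=0):
    """Sketch of ProxSkip; prox(v, step) must return prox_{step * psi}(v)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    h = np.zeros_like(x)   # control variate, initialised at zero in this sketch
    prox_calls = 0
    for _ in range(num_iters):
        # Gradient step shifted by the control variate.
        x_hat = x - gamma * (grad_f(x) - h)
        if rng.random() < p:
            # With probability p: apply the prox with the enlarged step gamma / p.
            x = prox(x_hat - (gamma / p) * h, gamma / p)
            prox_calls += 1
        else:
            # Otherwise: skip the prox.
            x = x_hat
        # Control variate update; h is unchanged when the prox is skipped.
        h = h + (p / gamma) * (x - x_hat)
    return x, prox_calls

# Illustrative problem: f(x) = 0.5 * x^T Q x - b^T x (Q diagonal), psi = lam * ||.||_1.
q = np.array([1.0, 10.0])   # mu = 1, L = 10, kappa = 10
b = np.array([3.0, 0.5])
lam = 1.0
kappa = q.max() / q.min()

x_out, prox_calls = proxskip(
    grad_f=lambda x: q * x - b,
    prox=lambda v, step: soft_threshold(v, step * lam),
    x0=np.zeros(2),
    gamma=1.0 / q.max(),
    p=1.0 / np.sqrt(kappa),
    num_iters=500,
)
x_star = soft_threshold(b, lam) / q   # closed-form minimizer for diagonal Q
```

Comparing `x_out` with `x_star` and `prox_calls` with the number of iterations gives a quick feel for how many proximity operator evaluations are skipped on this toy problem.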


## Important Results:
Under the assumptions listed above, convex analysis tells us that: 

1. The problem $\min_x f(x) + \psi(x)$ has a unique minimizer $x^{*}$.

2. Using the strong convexity of $f$, we have:

\begin{align}
f(y) &\geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \cdot \| x - y\|^2 \\
\iff \langle \nabla f(x), x - y \rangle &\geq f(x) - f(y) + \frac{\mu}{2} \cdot \| x - y\|^2
\end{align}

3. Using the convexity and $L$-smoothness of $f$:
\begin{align}
f(x) - f(y) - \langle \nabla f(y), x - y \rangle \geq \frac{1}{2 \cdot L} \cdot \| \nabla f(x) - \nabla f(y)\|^2
\end{align}

## Interesting Remark: 
Let's, as in the paper, denote $prox_{\frac{\gamma}{p} \psi}(.)$ by $P(.)$. $P$ satisfies the following equation:

$$ 
x^{*} = P(x^{*} - \frac{\gamma}{p} \cdot h^{*})
$$

where $h^{*} = \nabla f(x^{*})$ 

This is a powerful property of the proximity operator, which can be proven as follows. Let $\alpha = \frac{\gamma}{p}$ and $y = P(x^{*} - \alpha h^{*})$. Then:

\begin{align}
y &= \operatorname*{argmin}_{u \in \mathbb{R}^d} \frac{1}{2} \| u - x^{*} + \alpha h^{*}\|^2 + \alpha \cdot \psi(u) \\
\iff 0 &\in \partial \left( \frac{1}{2} \| u - x^{*} + \alpha h^{*}\|^2 + \alpha \cdot \psi(u) \right) \Big|_{u = y} \\
\iff 0 &\in y - x^{*} + \alpha \nabla f(x^{*}) + \alpha \cdot \partial \psi(y)
\end{align}

On the other hand:
\begin{align}
x^{*} &= \operatorname*{argmin}_{x \in \mathbb{R}^d} f(x) + \psi(x) \\
\iff 0 & \in \nabla f (x^{*}) + \partial \psi(x^{*}) \\
\iff 0 & \in x^{*} - x^{*} + \alpha \nabla f (x^{*}) + \alpha \cdot \partial \psi(x^{*})
\end{align}

We can see that $y = x^{*}$ satisfies the subdifferential inclusion above. Since the objective defining $P$ is strongly convex, its minimizer is unique. Thus:

$$ 
x^{*} = P(x^{*} - \frac{\gamma}{p} \cdot h^{*})
$$
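The identity is easy to verify numerically on a problem whose minimizer is known in closed form. The sketch below uses a diagonal quadratic $f$ and an $\ell_1$ regularizer (an example chosen for this notebook, not taken from the paper) and checks that $P(x^{*} - \alpha h^{*}) = x^{*}$ for several values of $\alpha = \frac{\gamma}{p}$.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximity operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

# Illustrative problem: f(x) = 0.5 * x^T Q x - b^T x (Q diagonal), psi = lam * ||.||_1.
q = np.array([1.0, 10.0])
b = np.array([3.0, 0.5])
lam = 1.0

x_star = soft_threshold(b, lam) / q   # minimizer of f + psi for diagonal Q
h_star = q * x_star - b               # h* = grad f(x*), non-zero in general

def P(v, alpha):
    """prox_{alpha * psi}(v) with psi = lam * ||.||_1."""
    return soft_threshold(v, alpha * lam)

# Distance between P(x* - alpha * h*) and x* for a few step sizes alpha.
residuals = [
    np.linalg.norm(P(x_star - alpha * h_star, alpha) - x_star)
    for alpha in (0.01, 0.1, 1.0, 10.0)
]
```

Note that `h_star` is not zero here, which is exactly why the shift by $h^{*}$ is needed inside $P$.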


Using this identity and firm non-expansiveness, the authors derive a bound on the expectation of $\Psi_{t+1}$ (conditional on the iterates at step $t$) that is completely free of the proximity operator:

$$
\mathbb{E}[\Psi_{t + 1}] \leq p \cdot \|\hat{x}_{t+1} - \frac{\gamma}{p} \cdot h_t - (x^{*} - \frac{\gamma}{p} h^{*}) \| ^ 2 + (1 - p) \cdot (\|\hat{x}_{t + 1} - x^{*} \| ^ 2 + \frac{\gamma ^ 2}{p^2} \|h_t - h^{*} \| ^ 2)
$$

The proof is explained in detail in the paper.

## Proximity operator rates
Using the convergence rate and the bound $(1 - \xi)^T \leq e^{-\xi T}$, we can say that for $T \geq \max(\frac{1}{\mu \gamma}, \frac{1}{p^2}) \log (\frac{1}{\epsilon})$, we have

$$\mathbb{E}[\Psi_T] \leq \epsilon \Psi_0$$

Since the proximity operator is called $p \cdot T$ times (on average) over $T$ iterations, this accuracy is reached after an expected number of

$$ p \cdot \max(\frac{1}{\mu \gamma}, \frac{1}{p^2}) \log (\frac{1}{\epsilon}) =  \max(\frac{p}{\mu \gamma}, \frac{1}{p}) \log (\frac{1}{\epsilon})$$
proximity operator calls.

The next step is to minimize the term $\max(\frac{p}{\mu \gamma}, \frac{1}{p})$. It is minimized by taking the maximum step size $\gamma = \frac{1}{L}$ and balancing the two terms, $\frac{p}{\mu \gamma} = \frac{1}{p}$, which implies $p = \frac{1}{\sqrt{\kappa}}$.

Thus, for $\gamma = \frac{1}{L}$ and $p = \frac{1}{\sqrt{\kappa}}$, the expected number of proximity operator calls is:
$$O(\sqrt{\kappa} \log(\frac{1}{\epsilon}))$$
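To make the trade-off concrete, the helper below evaluates these bounds for given $L$, $\mu$ and $\epsilon$, and confirms on a grid that $p = \frac{1}{\sqrt{\kappa}}$ minimizes $\max(\frac{p}{\mu\gamma}, \frac{1}{p})$ when $\gamma = \frac{1}{L}$. The numeric values ($L = 100$, $\mu = 1$, $\epsilon = 10^{-6}$) and the function name are arbitrary choices for this example.

```python
import numpy as np

def proxskip_budget(L, mu, eps, p=None):
    """Iteration bound and expected prox-call bound for gamma = 1 / L.

    If p is None, the value 1 / sqrt(kappa) is used.
    """
    gamma = 1.0 / L
    kappa = L / mu
    if p is None:
        p = 1.0 / np.sqrt(kappa)
    iterations = max(1.0 / (mu * gamma), 1.0 / p**2) * np.log(1.0 / eps)
    expected_prox_calls = p * iterations
    return p, iterations, expected_prox_calls

L, mu, eps = 100.0, 1.0, 1e-6
p_opt, iters, prox_calls = proxskip_budget(L, mu, eps)

# Same bounds with p = 1, i.e. a prox call at every iteration.
_, _, prox_calls_no_skip = proxskip_budget(L, mu, eps, p=1.0)

# Grid search over p to confirm where max(p * kappa, 1 / p) is smallest.
grid = np.linspace(0.01, 1.0, 100)
p_grid_best = grid[np.argmin(np.maximum(grid * (L / mu), 1.0 / grid))]
```

With these values $\kappa = 100$, so the bound on the number of proximity operator calls drops by a factor of $\sqrt{\kappa} = 10$ compared with calling the operator at every iteration, while the bound on the number of iterations is unchanged.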


The authors apply standard techniques to prove similar rates for the Stochastic and Federated Learning versions of the algorithm.