# Overview
This notebook is a short summary of the ***ProxSkip*** optimization algorithm introduced in the following ***[paper](https://proceedings.mlr.press/v162/mishchenko22b.html)***.


## Problem Statement
The paper tackles the following class of problems: 

$$ \min_{x \in \mathbb{R}^d} f(x) + \psi(x)$$

where $f: \mathbb{R}^d \to \mathbb{R}$ is a smooth function and $\psi: \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is a proper closed convex regularizer.

Proximal gradient descent, also known as the forward-backward algorithm, is the de facto approach: 

$$ x_{t+1} = prox_{\gamma_t \psi}(x_t - \gamma_t \nabla f(x_t))$$

where the $prox$ operator is defined as: 

$$ prox_{\gamma \psi}(x) = \arg\min_{y \in \mathbb{R}^d} \left[~\frac{1}{2} \| x - y \| ^ 2 + \gamma \psi(y)~\right]$$
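
As a concrete illustration, here is a minimal NumPy sketch of the proximal gradient iteration. The toy problem ($f(x) = \frac{1}{2}\|x - c\|^2$, $\psi(x) = \lambda \|x\|_1$) and the helper names `prox_l1` and `proximal_gradient` are assumptions made for this example and do not come from the paper. For the $\ell_1$ norm, the prox has the well-known closed form of soft-thresholding.

```python
import numpy as np


def prox_l1(v, threshold):
    """Soft-thresholding: the prox of threshold * ||.||_1 evaluated at v."""
    return np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)


def proximal_gradient(grad_f, prox, x0, gamma, n_iters):
    """Iterate x <- prox_{gamma * psi}(x - gamma * grad_f(x)).

    `prox(v, gamma)` must return prox_{gamma * psi}(v).
    """
    x = x0.copy()
    for _ in range(n_iters):
        x = prox(x - gamma * grad_f(x), gamma)
    return x


# Toy problem (an assumption for illustration):
# f(x) = 0.5 * ||x - c||^2 and psi(x) = lam * ||x||_1.
c = np.array([3.0, -0.5, 1.2])
lam = 1.0

x_pg = proximal_gradient(
    grad_f=lambda x: x - c,
    prox=lambda v, gamma: prox_l1(v, gamma * lam),
    x0=np.zeros(3),
    gamma=0.5,
    n_iters=100,
)

# For this separable problem the minimizer is known in closed form.
x_closed_form = prox_l1(c, lam)
```

Every iteration calls the prox once, which is exactly the cost ProxSkip tries to avoid when the prox is the expensive part.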

This approach (as well as the more recent methods built on top of it) is suited to problems where the ***proximity operator*** is inexpensive and the bottleneck lies mainly in computing the gradient $\nabla f$.

The authors of ***ProxSkip*** tackle a different sub-class of problems, where the ***proximity operator*** is ***expensive*** while the forward step (computing $\nabla f$) is computationally cheap. 

## The relevance of the problem
It is crucial to consider the relevance and usefulness of the problem tackled by the paper's authors.

The success of Machine Learning is to a great extent due to the substantial increase in both computational power and generated data. Distributing training across multiple nodes is a promising way to reduce training time. Nevertheless, this direction has an inherent bottleneck: the communication cost between nodes, which is a central concern in ***Federated Learning***. 

In distributed training, evaluating the prox operator requires communicating with the different nodes, which, as mentioned above, is expensive. The authors are therefore tackling a problem relevant to both industry and academia.

## Main Contributions
The paper introduces a new optimization method with an iteration complexity of 
$$O(\kappa \log \frac{1}{\epsilon})$$ 
where $\kappa =\frac{L}{\mu}$ is the condition number of $f$, and an expected number of prox operator evaluations of 
$$O(\sqrt{\kappa} \log \frac{1}{\epsilon})$$ 
without any assumptions on the data, unlike several recently proposed methods.

The same convergence and prox operator rates hold for both stochastic and distributed training.

## Convergence Analysis
This section sketches the proof of the convergence and prox operator rates: 

### Known facts and assumptions:
1. $f$ is $L$-smooth and $\mu$-strongly convex, and $\psi$ is a proper closed convex regularizer
2. The assumption above implies that the problem has a unique solution denoted by $x^* = \arg\min_{x \in \mathbb{R}^d} f(x) + \psi(x)$
3. Two other important implications are the following: 
    * The prox is firmly nonexpansive: $\| prox_{\psi}(x) - prox_{\psi}(y) \| ^ 2 + \|(x - prox_{\psi}(x)) - (y - prox_{\psi}(y)) \|^2 \leq \|x - y \| ^ 2$
    * $\forall \gamma > 0$, $x^*$ satisfies $x^* = prox_{\gamma \psi} (x^*  - \gamma \cdot \nabla f (x^*))$
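
Both implications can be checked numerically. The sketch below assumes, purely for illustration, that $\psi = \lambda \|\cdot\|_1$ (whose prox is soft-thresholding) and $f(x) = \frac{1}{2}\|x - c\|^2$; neither the example problem nor the helper name `prox_l1` comes from the paper.

```python
import numpy as np


def prox_l1(v, threshold):
    """Soft-thresholding: the prox of threshold * ||.||_1 evaluated at v."""
    return np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)


rng = np.random.default_rng(0)

# 1) Firm nonexpansiveness on random pairs (x, y).
threshold = 0.7
max_violation = -np.inf
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    px, py = prox_l1(x, threshold), prox_l1(y, threshold)
    lhs = np.sum((px - py) ** 2) + np.sum(((x - px) - (y - py)) ** 2)
    rhs = np.sum((x - y) ** 2)
    max_violation = max(max_violation, lhs - rhs)

# 2) Fixed-point property for several step sizes gamma.
# With f(x) = 0.5 * ||x - c||^2 and psi = lam * ||.||_1,
# the minimizer is x_star = prox_l1(c, lam).
c = np.array([3.0, -0.5, 1.2])
lam = 1.0
x_star = prox_l1(c, lam)
fixed_point_ok = all(
    np.allclose(prox_l1(x_star - gamma * (x_star - c), gamma * lam), x_star)
    for gamma in (0.1, 1.0, 10.0)
)
```

`max_violation` should stay at or below zero up to floating-point rounding, and `fixed_point_ok` should be `True` for every step size tried.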

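Introducing a bit of notation: 
* $\hat{x}_{t+1} = x_t - \gamma (\nabla f(x_t) - h_t)$, the gradient step shifted by the control variate $h_t$
* $h^{*} = \nabla f(x^*)$ 
* $P(.) = prox_{\frac{\gamma}{p} \psi}(.)$
* $x = \hat{x}_{t+1} - \frac{\gamma}{p} \cdot h_t$
* $y = x^{*} - \frac{\gamma}{p} h^{*}$, so that $x^{*} = P(y)$ by the fixed-point property above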

1. First, let's write $x_{t+1}$ and $h_{t+1}$ in terms of $\hat{x}_{t+1}$ and $h_t$:

$$
x_{t + 1} = 
\begin{cases}
P(x) & \text{with probability } p \\
\hat{x}_{t+1} & \text{with probability } 1 - p
\end{cases}
$$

$$
h_{t + 1} = 
\begin{cases}
h_t + \frac{p}{\gamma} (P(x) - \hat{x}_{t+1}) & \text{with probability } p \\
h_t & \text{with probability } 1 - p
\end{cases}
$$
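
The update rules above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `proxskip`, the initialization $h_0 = 0$, and the toy problem ($f(x) = \frac{1}{2}\sum_i a_i (x_i - c_i)^2$, $\psi = \lambda\|\cdot\|_1$) are all assumptions chosen so that the solution is known in closed form.

```python
import numpy as np


def prox_l1(v, threshold):
    """Soft-thresholding: the prox of threshold * ||.||_1 evaluated at v."""
    return np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)


def proxskip(grad_f, prox, x0, gamma, p, n_iters, seed=0):
    """Sketch of ProxSkip. `prox(v, s)` must return prox_{s * psi}(v)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    h = np.zeros_like(x0)  # control variate, initialized at 0 (assumption)
    n_prox = 0
    for _ in range(n_iters):
        # Gradient step shifted by the control variate.
        x_hat = x - gamma * (grad_f(x) - h)
        if rng.random() < p:
            # With probability p: apply the prox with step gamma / p.
            x = prox(x_hat - (gamma / p) * h, gamma / p)
            n_prox += 1
        else:
            # With probability 1 - p: skip the prox.
            x = x_hat
        # Reduces to h unchanged whenever the prox was skipped.
        h = h + (p / gamma) * (x - x_hat)
    return x, h, n_prox


# Toy separable problem (assumption): L = max(a), mu = min(a).
a = np.array([1.0, 4.0, 10.0, 25.0])
c = np.array([3.0, -0.5, 1.2, 0.02])
lam = 1.0
L, mu = a.max(), a.min()

x_ps, h_ps, n_prox = proxskip(
    grad_f=lambda x: a * (x - c),
    prox=lambda v, s: prox_l1(v, s * lam),
    x0=np.zeros(4),
    gamma=1.0 / L,
    p=np.sqrt(mu / L),
    n_iters=2000,
)

# Closed-form minimizer of the toy problem, coordinate by coordinate.
x_star = prox_l1(c, lam / a)
```

On this toy problem `x_ps` should match `x_star` and `h_ps` should approach $\nabla f(x^*)$, while `n_prox` counts how few of the 2000 iterations actually evaluated the prox.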

The main result is: 

$$\mathbb{E}[\Psi(T)] \leq (1 - \xi)^{T} \Psi(0)$$
where $\Psi(t) = \|x_t - x^{*} \| ^ 2 + \frac{\gamma ^ 2}{p^2} \|h_t - h^{*} \| ^ 2$ and $\xi = \min(\gamma \mu, p^2)$.

Taking the expectation over the coin flip at step $t$, conditional on $x_t$ and $h_t$:

\begin{align*}
\mathbb{E}[\Psi(t + 1)] &= p (\|P(x) - x^{*} \| ^ 2 + \frac{\gamma ^ 2}{p^2} \|h_t + \frac{p}{\gamma} (P(x) - \hat{x}_{t + 1}) - h^{*} \| ^ 2) + (1 - p) \cdot (\|\hat{x}_{t + 1} - x^{*} \| ^ 2 + \frac{\gamma ^ 2}{p^2} \|h_t - h^{*} \| ^ 2) && \text{this can be written as}\\
\mathbb{E}[\Psi(t + 1)] &= p (\|P(x) - P(y) \| ^ 2 + \|P(x) - x + y - P(y) \| ^ 2) + (1 - p) \cdot (\|\hat{x}_{t + 1} - x^{*} \| ^ 2 + \frac{\gamma ^ 2}{p^2} \|h_t - h^{*} \| ^ 2)  && \text{using $x^{*} = P(y)$ and algebraic manipulation}\\
\mathbb{E}[\Psi(t + 1)] &\leq p (\|x - y \| ^ 2) + (1 - p) \cdot (\|\hat{x}_{t + 1} - x^{*} \| ^ 2 + \frac{\gamma ^ 2}{p^2} \|h_t - h^{*} \| ^ 2)  && \text{applying firm nonexpansiveness}
\end{align*}

recalling the definitions:

* $x = \hat{x}_{t+1} - \frac{\gamma}{p} \cdot h_t$
* $y = x^{*} - \frac{\gamma}{p} h^{*}$

then:


\begin{align*}
\mathbb{E}[\Psi(t + 1)] &\leq p (\|x - y \| ^ 2) + (1 - p) \cdot (\|\hat{x}_{t + 1} - x^{*} \| ^ 2 + \frac{\gamma ^ 2}{p^2} \|h_t - h^{*} \| ^ 2) \\
\mathbb{E}[\Psi(t + 1)] &\leq \|\hat{x}_{t + 1} - x^{*} \| ^ 2 - 2 \cdot \gamma \langle \hat{x}_{t + 1} - x^{*}, h_t - h^{*} \rangle + \frac{\gamma ^ 2}{ p ^ 2} \| h_t - h^{*} \| ^ 2 \\
\mathbb{E}[\Psi(t + 1)] &\leq \|\hat{x}_{t + 1} - x^{*} \| ^ 2 - 2 \cdot \gamma \langle \hat{x}_{t + 1} - x^{*}, h_t - h^{*} \rangle + \gamma ^ 2 \cdot \|h_t - h^{*} \| ^ 2 + (\frac{\gamma ^ 2}{ p ^ 2} - \gamma ^ 2)\| h_t - h^{*} \| ^ 2 \\
\mathbb{E}[\Psi(t + 1)] &\leq \|(\hat{x}_{t + 1} - \gamma h_t) - (x^{*} - \gamma h^{*}) \| ^ 2 + \frac{\gamma ^ 2}{ p ^ 2} \cdot (1 - p^2)\| h_t - h^{*} \| ^ 2
\end{align*}


Since $\hat{x}_{t+1} - \gamma h_t = x_t - \gamma \nabla f(x_t)$ and $h^{*} = \nabla f(x^{*})$, the strong convexity and smoothness of $f$ give an upper bound on the first term:

\begin{align*}
\|(\hat{x}_{t + 1} - \gamma h_t) - (x^{*} - \gamma h^{*}) \| ^ 2 &= \|x_{t} - x^{*} - \gamma(\nabla f(x_t) - \nabla f(x^*)) \| ^ 2\\
\|(\hat{x}_{t + 1} - \gamma h_t) - (x^{*} - \gamma h^{*}) \| ^ 2 &= \|x_{t} - x^{*}\| ^ 2 + \gamma ^ 2 \cdot \| \nabla f(x_t) - \nabla f(x^*) \| ^ 2 - 2\gamma \langle \nabla f(x_t) - \nabla f(x^*), x_t - x^{*} \rangle \\
\|(\hat{x}_{t + 1} - \gamma h_t) - (x^{*} - \gamma h^{*}) \| ^ 2 & \leq (1 - \gamma \mu) \|x_{t} - x^{*}\| ^ 2 + \gamma ^ 2 \cdot \| \nabla f(x_t) - \nabla f(x^*) \| ^ 2 - 2\gamma D_f(x_t, x^*) && \text{using strong convexity} \\
\|(\hat{x}_{t + 1} - \gamma h_t) - (x^{*} - \gamma h^{*}) \| ^ 2 & \leq (1 - \gamma \mu) \|x_{t} - x^{*}\| ^ 2 - 2 \gamma \cdot (D_f(x_t, x^*) - \frac{\gamma}{2} \| \nabla f(x_t) - \nabla f(x^*) \| ^ 2 )\\
\|(\hat{x}_{t + 1} - \gamma h_t) - (x^{*} - \gamma h^{*}) \| ^ 2 & \leq (1 - \gamma \mu) \|x_{t} - x^{*}\| ^ 2 && \text{The last term is nonpositive for $0 < \gamma \leq \frac{1}{L}$}
\end{align*}

Here $D_f(x, y) = f(x) - f(y) - \langle \nabla f(y), x - y \rangle$ is the Bregman divergence of $f$, and the last step uses the smoothness bound $\| \nabla f(x_t) - \nabla f(x^*) \| ^ 2 \leq 2 L \cdot D_f(x_t, x^*)$.


Combining both intermediate results, we reach the main result of the paper:
\begin{align}
\mathbb{E}[\Psi(t + 1)] &\leq \|(\hat{x}_{t + 1} - \gamma h_t) - (x^{*} - \gamma h^{*}) \| ^ 2 + \frac{\gamma ^ 2}{ p ^ 2} \cdot (1 - p^2)\| h_t - h^{*} \| ^ 2 \\
\mathbb{E}[\Psi(t + 1)] &\leq (1 - \mu \gamma) \|x_t - x^{*}\|^2 + \frac{\gamma ^ 2}{ p ^ 2} \cdot (1 - p^2)\| h_t - h^{*} \| ^ 2 \\
\mathbb{E}[\Psi(t + 1)] &\leq (1 - \xi) (\|x_t - x^{*}\|^2 + \frac{\gamma ^ 2}{ p ^ 2} \cdot \| h_t - h^{*} \| ^ 2) && \text{$\xi = \min(\gamma \mu , p^2)$}\\
\mathbb{E}[\Psi(t + 1)] &\leq (1 - \xi) \Psi(t)
\end{align}
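
The one-step contraction can be sanity-checked numerically without any sampling, because the conditional expectation over the coin flip is just a two-term weighted sum. The sketch below reuses an assumed toy problem ($f(x) = \frac{1}{2}\sum_i a_i (x_i - c_i)^2$, $\psi = \lambda \|\cdot\|_1$) whose solution is known in closed form; the helper names are hypothetical.

```python
import numpy as np


def prox_l1(v, threshold):
    """Soft-thresholding: the prox of threshold * ||.||_1 evaluated at v."""
    return np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)


# Toy separable problem (assumption): L = max(a), mu = min(a).
a = np.array([1.0, 4.0, 10.0, 25.0])
c = np.array([3.0, -0.5, 1.2, 0.02])
lam = 1.0
L, mu = a.max(), a.min()
gamma, p = 1.0 / L, np.sqrt(mu / L)
xi = min(gamma * mu, p ** 2)


def grad_f(x):
    return a * (x - c)


x_star = prox_l1(c, lam / a)  # closed-form minimizer
h_star = grad_f(x_star)


def lyapunov(x, h):
    """Psi = ||x - x*||^2 + (gamma / p)^2 * ||h - h*||^2."""
    return np.sum((x - x_star) ** 2) + (gamma / p) ** 2 * np.sum((h - h_star) ** 2)


def expected_next_lyapunov(x, h):
    """Exact E[Psi(t+1) | x_t = x, h_t = h] over the coin flip."""
    x_hat = x - gamma * (grad_f(x) - h)
    # Branch taken with probability p: prox step and control-variate update.
    x_prox = prox_l1(x_hat - (gamma / p) * h, (gamma / p) * lam)
    h_prox = h + (p / gamma) * (x_prox - x_hat)
    # Branch taken with probability 1 - p: prox skipped, h unchanged.
    return p * lyapunov(x_prox, h_prox) + (1 - p) * lyapunov(x_hat, h)


rng = np.random.default_rng(0)
worst_ratio = 0.0
for _ in range(1000):
    x, h = 5.0 * rng.normal(size=4), 5.0 * rng.normal(size=4)
    worst_ratio = max(worst_ratio, expected_next_lyapunov(x, h) / lyapunov(x, h))
```

If the derivation is right, `worst_ratio` should never exceed $1 - \xi$, whatever the starting pair $(x_t, h_t)$.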

This inequality proves the linear convergence of the ***ProxSkip*** method and, at the same time, that $h_t$ converges to $\nabla f (x^*)$.


## Prox operator rates
Using the convergence rate, we can say that for $T \geq \max(\frac{1}{\mu \gamma}, \frac{1}{p^2}) \log (\frac{1}{\epsilon})$, we have 

$$\mathbb{E}[\Psi(T)] \leq \epsilon \Psi(0)$$

Since the prox operator is called $p \cdot T$ times (on average) over $T$ iterations, this accuracy is reached after an expected

$$ p \cdot \max(\frac{1}{\mu \gamma}, \frac{1}{p^2}) \log (\frac{1}{\epsilon}) =  \max(\frac{p}{\mu \gamma}, \frac{1}{p}) \log (\frac{1}{\epsilon})$$
prox operator calls.

The next step is to minimize the term $\max(\frac{p}{\mu \gamma}, \frac{1}{p})$. For a fixed $p$ it is smallest at the maximum step size $\gamma = \frac{1}{L}$, and the two arguments of the max then balance when $\frac{p}{\mu \gamma} = \frac{1}{p}$, implying $p = \frac{1}{\sqrt{\kappa}}$.

Thus, for $\gamma = \frac{1}{L}$, $p = \frac{1}{\sqrt{\kappa}}$, the prox operator rate is:
$$O(\sqrt{\kappa} \log \frac{1}{\epsilon})$$ 
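
To make the orders of magnitude concrete, the following sketch plugs numbers into the formulas above. The function name `proxskip_parameters` and the example values of $L$, $\mu$ and $\epsilon$ are assumptions for illustration, and the counts ignore the constants hidden in the $O(\cdot)$ notation.

```python
import math


def proxskip_parameters(L, mu, eps):
    """Step size, prox probability and complexity estimates from the analysis."""
    kappa = L / mu
    gamma = 1.0 / L              # largest step size allowed by the analysis
    p = 1.0 / math.sqrt(kappa)   # balances p / (mu * gamma) against 1 / p
    n_iters = max(1.0 / (gamma * mu), 1.0 / p ** 2) * math.log(1.0 / eps)
    return {
        "kappa": kappa,
        "gamma": gamma,
        "p": p,
        "iterations": n_iters,
        "expected_prox_calls": p * n_iters,
    }


params = proxskip_parameters(L=100.0, mu=1.0, eps=1e-6)
```

With $\kappa = 100$ and $\epsilon = 10^{-6}$ this gives $p = 0.1$, so on average only about one iteration in ten evaluates the prox.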


The authors apply standard techniques to prove similar rates for the case of the Stochastic and Federated Learning versions of the algorithm.