### <span style="color:rgb(139,69,19)">CIMPA Research School "Control, Optimization, and Model Reduction in Machine Learning"</span>

### <span style="color:rgb(139,69,19)">Optimization for Machine Learning - C. W. Royer</span>


# <span style="color:rgb(139,69,19)">Lab 02 - Stochastic gradient</span>

#### <span style="color:rgb(139,69,19)">Preliminary remarks</span>

The goal of this lab session is to implement several variants of the stochastic gradient method. As in the previous lab session, we will try those on two different regression problems that possess a finite-sum structure.

In [None]:
# Preamble: useful toolboxes, libraries, functions, etc.

%matplotlib inline
import matplotlib.pyplot as plt

from math import sqrt # Square root

# NumPy - Matrix and vector structures
import numpy as np # NumPy library
from numpy.random import multivariate_normal, randn # Probability distributions on vectors

# SciPy - Efficient mathematical calculation
from scipy.linalg import toeplitz # A special kind of matrices
from scipy.linalg import svdvals # Singular values
from scipy.linalg import norm # Euclidean norm
from scipy.optimize import check_grad # Check accuracy between objective and gradient values
from scipy.optimize import fmin_l_bfgs_b # Efficient optimizer

# <span style="color:rgb(139,69,19)">Part 1 - Data generation and finite-sum problems</span>

In this section, we restate our setup from the first lab session. Recall that our results were based upon a dataset $\{(\mathbf{a}_i,y_i)\}_{i=1,\dots,n}$, where $\mathbf{a}_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, available in the form of:

- a feature matrix $A \in \mathbb{R}^{n \times d}$;
- and a vector of labels $y \in \mathbb{R}^n$. 

Given this dataset, we seek a model parameterized by a vector $\mathbf{w}$ that explains the data according to a certain loss function. This results in the following formulation:
$$
    \min_{\mathbf{w} \in \mathbb{R}^d} f(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n f_i(\mathbf{w}), \qquad f_i(\mathbf{w}) = \ell(h(\mathbf{a}_i,\mathbf{w}),y_i) + \frac{\lambda}{2}\|\mathbf{w}\|^2,
$$
where $\lambda \ge 0$ is an optional regularization parameter *(more on this in the lectures on proximal gradient and LASSO)*.

The dataset will be produced according to the procedure below.

In [None]:
# Data generation.
# This code is inspired by a generator proposed by A. Gramfort.

def simu_linmodel(w, n, std=1., corr=0.5):
    """
    Simulation values obtained by a linear model with additive noise
    
    Parameters
    ----------
    w : np.ndarray, shape=(d,)
        The coefficients of the model
    
    n : int
        Sample size
    
    std : float, default=1.
        Standard-deviation of the noise

    corr : float, default=0.5
        Correlation of the feature matrix
    """    
    d = w.shape[0]
    cov = toeplitz(corr ** np.arange(0, d))
    X = multivariate_normal(np.zeros(d), cov, size=n)
    noise = std * randn(n)
    y = X.dot(w) + noise
    return X, y

The data is thus produced from a linear model corrupted with (Gaussian) noise. Note that the feature vectors are correlated, which makes the problem more interesting for stochastic gradient approaches.
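
To make the correlation structure concrete, the short sketch below rebuilds the Toeplitz covariance matrix used in `simu_linmodel` for a small dimension. The values `d = 4` and `corr = 0.5` are chosen here for illustration only and are not the ones used later in the lab. Entry $(i,j)$ of the matrix equals `corr ** |i - j|`, so neighbouring features are the most correlated.

```python
import numpy as np
from scipy.linalg import toeplitz

d, corr = 4, 0.5  # illustrative values only (assumption, not the lab settings)

# Covariance as built in simu_linmodel: first column is (1, corr, corr^2, ...)
cov = toeplitz(corr ** np.arange(0, d))

# Same matrix written entrywise: cov[i, j] = corr ** |i - j|
idx = np.arange(d)
cov_entrywise = corr ** np.abs(idx[:, None] - idx[None, :])

print(cov)
print("Matches the entrywise formula:", np.allclose(cov, cov_entrywise))
```

Increasing `corr` towards 1 makes the columns of the feature matrix closer to collinear, which typically worsens the conditioning of the problem.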

### <span style="color:rgb(139,69,19)"> Regression models</span>

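We implement a generic regression problem encoding both linear and logistic regression tasks. From now on, the feature matrix is denoted by $\mathbf{X}$ and its $i$-th row by $\mathbf{x}_i^T$, in agreement with the code.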

- In linear regression, our model is linear and the loss is the least-squares (or $\ell_2$) loss. We thus have
$$
    f(\mathbf{w}) = \frac{1}{2 n} \|\mathbf{X} \mathbf{w} - \mathbf{y}\|^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2, 
    \quad
    f_i(\mathbf{w}) = \frac{1}{2} (\mathbf{x}_i^T \mathbf{w} - y_i)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2.
$$
The function $f$ is $\mathcal{C}^{1,1}_L$ with $L=\frac{\|\mathbf{X}^T \mathbf{X}\|}{n}+\lambda$ and
$$
    \nabla f(\mathbf{w})=\frac{1}{n}\mathbf{X}^T (\mathbf{X} \mathbf{w} - \mathbf{y}) + \lambda \mathbf{w}.
$$
We can also establish that every $f_i$ is $\mathcal{C}^1$ with
$$
    \nabla f_i(\mathbf{w}) = (\mathbf{x}_i^T \mathbf{w}-y_i)\mathbf{x}_i + \lambda\mathbf{w}.
$$
- In logistic regression, our goal is to correctly classify the data points. Our model is still linear, but we now consider the logistic loss:
$$
    f(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^n f_i(\mathbf{w}), \quad 
    f_i(\mathbf{w}) = \log(1+\exp(-y_i \mathbf{x}_i^T \mathbf{w}))+\frac{\lambda}{2}\|\mathbf{w}\|^2.
$$
We have already seen that $f$ is $\mathcal{C}^{1,1}_L$ with
$$
\nabla f(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^n  -\frac{y_i}{1 + \exp(y_i \mathbf{x}_i^T \mathbf{w})} \mathbf{x}_i + \lambda \mathbf{w}
$$
and $L:=\frac{\|\mathbf{X}^T \mathbf{X}\|}{4n}+\lambda$. Every $f_i$ is also $\mathcal{C}^1$ with
$$
\nabla f_i(\mathbf{w}) = - \frac{y_i}{1 + \exp(y_i \mathbf{x}_i^T \mathbf{w})} \mathbf{x}_i + \lambda \mathbf{w}.
$$
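
A gradient formula of this kind can be checked numerically against finite differences. The sketch below does this for a single logistic component $f_i$, using `scipy.optimize.check_grad`. The data point, the label, the value of `lbda` and the function names `f_i` and `grad_f_i` are synthetic choices made for this illustration; they are not the lab's dataset or the methods of the class defined next.

```python
import numpy as np
from scipy.optimize import check_grad

rng = np.random.default_rng(0)
d = 5
x_i = rng.standard_normal(d)   # one synthetic feature vector
y_i = 1.0                      # its label, in {-1, +1}
lbda = 0.1                     # regularization parameter (illustrative value)

def f_i(w):
    # Logistic loss of one sample plus the l2 regularization term
    return np.log(1.0 + np.exp(-y_i * x_i.dot(w))) + 0.5 * lbda * w.dot(w)

def grad_f_i(w):
    # Closed-form gradient, as in the formula above
    return -y_i / (1.0 + np.exp(y_i * x_i.dot(w))) * x_i + lbda * w

w = rng.standard_normal(d)
# check_grad returns the 2-norm of (analytic gradient - finite-difference gradient)
err = check_grad(f_i, grad_f_i, w)
print("Discrepancy with finite differences:", err)
```

A discrepancy that is small relative to the norm of the gradient suggests that the formula and its implementation agree.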

In [None]:
# Python class for regression problems
class RegPb(object):
    '''
        A class for regression problems with linear models.
        
        Attributes:
            X: Data matrix (features)
            y: Data vector (labels)
            n,d: Dimensions of X
            loss: Loss function to be considered in the regression
                'l2': Least-squares loss
                'logit': Logistic loss
            lbda: Regularization parameter
    '''
   
    # Instantiate the class
    def __init__(self, X, y,lbda=0,loss='l2'):
        self.X = X
        self.y = y
        self.n, self.d = X.shape
        self.loss = loss
        self.lbda = lbda
        
    
    # Objective value
    def fun(self, w):
        if self.loss=='l2':
            return norm(self.X.dot(w) - self.y) ** 2 / (2. * self.n) + self.lbda * norm(w) ** 2 / 2.
        elif self.loss=='logit':
            yXw = self.y * self.X.dot(w)
            return np.mean(np.log(1. + np.exp(-yXw))) + self.lbda * norm(w) ** 2 / 2.
    
    # Partial objective value
    def f_i(self, i, w):
        if self.loss=='l2':
            return norm(self.X[i].dot(w) - self.y[i]) ** 2 / (2.) + self.lbda * norm(w) ** 2 / 2.
        elif self.loss=='logit':
            yXwi = self.y[i] * np.dot(self.X[i], w)
            return np.log(1. + np.exp(- yXwi)) + self.lbda * norm(w) ** 2 / 2.
    
    # Full gradient computation
    def grad(self, w):
        if self.loss=='l2':
            return self.X.T.dot(self.X.dot(w) - self.y) / self.n + self.lbda * w
        elif self.loss=='logit':
            yXw = self.y * self.X.dot(w)
            aux = 1. / (1. + np.exp(yXw))
            return - (self.X.T).dot(self.y * aux) / self.n + self.lbda * w
    
    # Partial gradient
    def grad_i(self,i,w):
        x_i = self.X[i]
        if self.loss=='l2':
            return (x_i.dot(w) - self.y[i]) * x_i + self.lbda*w
        elif self.loss=='logit':
            grad = - x_i * self.y[i] / (1. + np.exp(self.y[i]* x_i.dot(w)))
            grad += self.lbda * w
            return grad     

    # Lipschitz constant for the gradient
    def lipgrad(self):
        if self.loss=='l2':
            L = norm(self.X, ord=2) ** 2 / self.n + self.lbda
        elif self.loss=='logit':
            L = norm(self.X, ord=2) ** 2 / (4. * self.n) + self.lbda
        return L

In [None]:
# Generate the problem instances - we use moderate sizes but those will serve our purpose

d = 50
n = 1000
idx = np.arange(d)
lbda = 1. / n ** (0.5)

# Fix random seed for reproducibility
np.random.seed(1)

# Ground truth coefficients of the model
w_model_truth = (-1)**idx * np.exp(-idx / 10.)

Xlin, ylin = simu_linmodel(w_model_truth, n, std=1., corr=0.1)
Xlog, ylog = simu_linmodel(w_model_truth, n, std=1., corr=0.7)
ylog = np.sign(ylog) # Taking the sign to obtain labels in {-1,+1} for binary classification

pblinreg = RegPb(Xlin, ylin,lbda,loss='l2')
pblogreg = RegPb(Xlog, ylog,lbda,loss='logit')

In this lab, we work with relatively simple loss functions: we can thus efficiently compute a solution using a quasi-Newton method (L-BFGS-B). This provides us with a target objective value as well as a target vector of weights.

In [None]:
# Use L-BFGS-B to determine a solution for both problems

w_init = np.zeros(d)
# Compute the optimal solution for linear regression
w_min_lin, f_min_lin, _ = fmin_l_bfgs_b(pblinreg.fun, w_init, pblinreg.grad, args=(), pgtol=1e-30, factr =1e-30)
print("Linear regression:")
print("\t Numerical minimal value:",f_min_lin)
print("\t Numerical minimum gradient norm:",norm(pblinreg.grad(w_min_lin)))

# Compute the optimal solution for logistic regression
w_min_log, f_min_log, _ = fmin_l_bfgs_b(pblogreg.fun, w_init, pblogreg.grad, args=(), pgtol=1e-30, factr =1e-30)
print("Logistic regression:")
print("\t Numerical minimal value:",f_min_log)
print("\t Numerical minimum gradient norm:",norm(pblogreg.grad(w_min_log)))

These solutions will enable us to study the behavior of the distance to optimality in terms of function values 
$f(\mathbf{w}_k)-f^*$ and iterates $\|\mathbf{w}_k -\mathbf{w}^*\|$. 

# <span style="color:rgb(139,69,19)">Part 2 - Gradient and stochastic gradient methods</span>

Having defined our problems, we now build a generic stochastic gradient method for comparison with gradient descent. We will then try to tune the batch size. 

## <span style="color:rgb(139,69,19)"> 2.1 Generic stochastic gradient framework</span>

The iteration of stochastic gradient (also called *Stochastic Gradient Descent*, or *SGD*) is given by:

$$
    \mathbf{w}_{k+1} = \mathbf{w}_k - \alpha_k \nabla f_{i_k}(\mathbf{w}_k),
$$

where $i_k$ is drawn at random in $\{1,\dots,n\}$. For the purpose of this lab session, $i_k$ will be drawn uniformly at random.

A more general version of stochastic gradient, called batch stochastic gradient, is given by the iteration

$$
    \mathbf{w}_{k+1} = \mathbf{w}_k - \frac{\alpha_k}{|S_k|} \sum_{i \in S_k} \nabla f_i(\mathbf{w}_k)
$$

where $S_k$ is a set of indices drawn uniformly in $\{1,\dots,n\}$. For this lab, the samples will be drawn without replacement, so that $|S_k|=n$ results in a full gradient step, while $|S_k|=1$ corresponds to a basic stochastic gradient step. In this notebook, we will focus on using the same batch size across all iterations.
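
The claim that a batch of size $n$ drawn without replacement recovers the full gradient can be verified on a toy least-squares problem. The sketch below uses synthetic data and no regularization; the helper names `grad_i` and `batch_grad` are introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.standard_normal((n, d))   # synthetic features
y = rng.standard_normal(n)        # synthetic labels
w = rng.standard_normal(d)

def grad_i(i, w):
    # Gradient of f_i(w) = 0.5 * (x_i^T w - y_i)^2 (no regularization here)
    return (X[i].dot(w) - y[i]) * X[i]

def batch_grad(w, batch):
    # Average of the component gradients over the batch
    return sum(grad_i(i, w) for i in batch) / len(batch)

full_grad = X.T.dot(X.dot(w) - y) / n

S_full = rng.choice(n, size=n, replace=False)    # a permutation of {0,...,n-1}
S_small = rng.choice(n, size=5, replace=False)   # a proper mini-batch

print("Full batch equals full gradient:", np.allclose(batch_grad(w, S_full), full_grad))
print("Mini-batch estimation error:", np.linalg.norm(batch_grad(w, S_small) - full_grad))
```

With replacement, a batch of size $n$ may contain repeated indices, so the same identity does not hold in general.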

The algorithmic template below implements a general batch stochastic gradient method. We include the possibility for two stepsize choices:
 - *$\alpha_k=\tfrac{1}{L}$, and* 
 - *$\alpha_k=\frac{\alpha_0}{\sqrt{k+1}}$, where $\alpha_0$ is an input parameter of the method.*
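
As a quick illustration of these two rules, the sketch below evaluates the stepsize at a few iteration indices. The function `stepsize` and its arguments are hypothetical names introduced for this example (the lab's implementation computes the step inline), and the value `L = 4.0` is arbitrary.

```python
def stepsize(k, alpha0=0.2, t=0.5, L=None):
    """Return alpha_k: the constant 1/L if L is given, else alpha0 / (k+1)**t."""
    if L is not None:
        return 1.0 / L
    return alpha0 / (k + 1) ** t

# Decreasing rule alpha0 / sqrt(k+1) at iterations 0, 3 and 99
decreasing = [stepsize(k) for k in (0, 3, 99)]
# Constant rule 1/L, for an arbitrary illustrative Lipschitz constant
constant = [stepsize(k, L=4.0) for k in (0, 3, 99)]

print("Decreasing steps:", decreasing)
print("Constant steps:  ", constant)
```

The decreasing rule starts at $\alpha_0$ and shrinks slowly: dividing the stepsize by 10 requires multiplying $k+1$ by 100.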

In [None]:
# Stochastic gradient implementation
def stoch_grad(w0,problem,wtarget,stepchoice=0,step0=1, n_iter=1000,nb=1,with_replace=False,verbose=True): 
    """
        A code for batch stochastic gradient with various step choices.
        
        Inputs:
            w0: Initial vector
            problem: Problem structure
                problem.fun() returns the objective function, which is assumed to be a finite sum of functions
                problem.n returns the number of components in the finite sum
                problem.grad_i() returns the gradient of a single component f_i
                problem.lipgrad() returns the Lipschitz constant for the gradient
                problem.lbda returns the value of the regularization parameter
            wtarget: Target minimum (unknown in practice!)
            stepchoice: Strategy for computing the stepsize 
                0: Constant step size equal to step0/L
                t>0: Step size decreasing as step0/(k+1)**t
            step0: Scaling factor for the stepsize (see stepchoice)
            n_iter: Number of iterations, used as stopping criterion
            nb: Number of components drawn per iteration/Batch size 
                1: Classical stochastic gradient algorithm (default value)
                problem.n: Classical gradient descent (when drawn without replacement)
            with_replace: Boolean indicating whether components are drawn with or without replacement
                True: Components drawn with replacement
                False: Components drawn without replacement (Default)
            verbose: Boolean indicating whether information should be printed at every epoch (Default: True)
            
        Outputs:
            w_output: Final iterate of the method
            objvals: History of function values, recorded once per epoch (Numpy array of length n_iter+1 at most)
            normits: History of distances between iterates and optimum, recorded once per epoch (Numpy array of length n_iter+1 at most)
    """
    ############
    # Initial step: Compute and plot some initial quantities

    # objective history
    objvals = []
    
    # iterates distance to the minimum history
    normits = []
    
    # Lipschitz constant
    L = problem.lipgrad()
    
    # Number of samples
    n = problem.n
    
    # Initial value of current iterate  
    w = w0.copy()
    nw = norm(w)

    # Initialize iteration counter
    k=0
    
    # Current objective
    obj = problem.fun(w) 
    objvals.append(obj)
    # Current distance to the optimum
    nmin = norm(w-wtarget)
    normits.append(nmin)
    
    if verbose:
        # Plot initial quantities of interest
        print("Stochastic Gradient, batch size=",nb,"/",n)
        print(' | '.join([name.center(8) for name in ["iter", "fval", "normit"]]))
        print(' | '.join([("%d" % k).rjust(8),("%.2e" % obj).rjust(8),("%.2e" % nmin).rjust(8)]))
    
    ################
    # Main loop
    while (k < n_iter and nw < 10**100):
        
        ############################################
        # Computing stochastic gradient
        
        # Draw the batch indices
        ik = np.random.choice(n,nb,replace=with_replace)# Batch gradient
        # Stochastic gradient calculation
        sg = np.zeros(w.shape) # Same shape as w (avoids relying on a global variable d)
        for j in range(nb):
            gi = problem.grad_i(ik[j],w)
            sg = sg + gi
        sg = (1/nb)*sg
        
        #############################################
            
        if stepchoice==0:
            w[:] = w - (step0/L) * sg
        elif stepchoice>0:
            sk = float(step0/((k+1)**stepchoice))
            w[:] = w - sk * sg
        
        nw = norm(w) #Computing the norm to measure divergence 

        obj = problem.fun(w)
        nmin = norm(w-wtarget)
        
       
        
        k += 1
        # Plot quantities of interest at the end of every epoch only
        if (k*nb) % n == 0:
            objvals.append(obj)
            normits.append(nmin)
            if verbose:
                print(' | '.join([("%d" % k).rjust(8),("%.2e" % obj).rjust(8),("%.2e" % nmin).rjust(8)]))       
    
    # End of main loop
    #################
    
    # Plot quantities of interest for the last iterate (if needed)
    if (k*nb) % n > 0:
        objvals.append(obj)
        normits.append(nmin)
        if verbose:
            print(' | '.join([("%d" % k).rjust(8),("%.2e" % obj).rjust(8),("%.2e" % nmin).rjust(8)]))              
    
    # Outputs
    w_output = w.copy()
    
    return w_output, np.array(objvals), np.array(normits)

## <span style="color:rgb(139,69,19)"> 2.2 Gradient descent vs Stochastic gradient on logistic regression</span>

### <span style="color:rgb(139,69,19)">Hands-on! Gradient descent VS Stochastic gradient</span>

**Run the script below to compare stochastic gradient and gradient descent on the logistic regression problem with 60 epochs and the step size strategies $\alpha_k = \tfrac{1}{L}$ and $\alpha_k = \tfrac{0.2}{\sqrt{k+1}}$. What is your interpretation of those curves?**

In [None]:
# Compare implementations of gradient descent/stochastic gradient
# Pay attention to the budget allocated to each solver (the cost of one iteration of gradient descent vs 
# the cost of 1 iteration of stochastic gradient are different)

nb_epochs = 60
n = pblogreg.n
nbset = 1
w0 = np.zeros(d)

# Run a - Gradient descent with constant stepsize
_, obj_a, nits_a = stoch_grad(w0,pblogreg,w_min_log,stepchoice=0,step0=1, n_iter=nb_epochs,nb=n)

# Run b - Stochastic gradient with constant stepsize
# The version below may diverge, in which case the bound on norm(w) in the code will be triggered
_, obj_b, nits_b = stoch_grad(w0,pblogreg,w_min_log,stepchoice=0,step0=1, n_iter=int(nb_epochs*n/nbset),nb=1)

# Run c - Gradient descent with decreasing stepsize
_, obj_c, nits_c = stoch_grad(w0,pblogreg,w_min_log,stepchoice=0.5,step0=0.2, n_iter=nb_epochs,nb=n)

# Run d - Stochastic gradient with decreasing stepsize
_, obj_d, nits_d = stoch_grad(w0,pblogreg,w_min_log,stepchoice=0.5,step0=0.2, n_iter=nb_epochs*n,nb=1)

In [None]:
# Plot the comparison of variants of GD/SG with the same stepsize rule
# NB: The x-axis is in epochs (1 iteration of GD).

# In terms of objective value (logarithmic scale)
plt.figure(figsize=(7, 5))
plt.semilogy(obj_a-f_min_log, label="GD - 1/L", lw=2)
plt.semilogy(obj_b-f_min_log, label="SG - 1/L", lw=2)
plt.semilogy(obj_c-f_min_log, label="GD - 1/sqrt(k+1)", lw=2)
plt.semilogy(obj_d-f_min_log, label="SG - 1/sqrt(k+1)", lw=2)
plt.title("Convergence plot", fontsize=16)
plt.xlabel("#epochs", fontsize=14)
plt.ylabel("Objective (log scale)", fontsize=14)
plt.legend()

## <span style="color:rgb(139,69,19)"> 2.3 Experimenting with the step size/learning rate</span>

We now run several instances of vanilla stochastic gradient with constant step size proportional to $\frac{1}{L}$ (using the ``step0`` parameter from the code).

### <span style="color:rgb(139,69,19)">Hands-on! Tuning the learning rate</span>

**1) Run vanilla stochastic gradient with various constant values for the stepsize chosen to be proportional to $\frac{1}{L}$. What do you observe?**

**2) Run vanilla stochastic gradient with decreasing stepsizes of the form $\frac{\alpha_0}{(k+1)^t}$. What do you observe?**

In [None]:
# Run several instances of stochastic gradient with constant batch size

nb_epochs = 60
n = pblogreg.n
nbset = 1
w0 = np.zeros(d)

########################
# Input your choices of scaling factor for the stepsizes here
valsstep0 = []
##########################################
nvals = len(valsstep0)

objs = np.zeros((nb_epochs+1,nvals))

for val in range(nvals):
    _, objs[:,val], _ = stoch_grad(w0,pblogreg,w_min_log,stepchoice=0,step0=valsstep0[val], n_iter=int(nb_epochs*n/nbset),nb=1)


In [None]:
# Plot the comparison of variants of SG with different (constant) stepsizes
# NB: The x-axis is in epochs (1 iteration of GD).

# In terms of objective value (logarithmic scale)
plt.figure(figsize=(7, 5))
plt.set_cmap("RdPu")
for val in range(nvals):
    plt.semilogy(objs[:,val]-f_min_log, label="SG -"+str(valsstep0[val])+"/L", lw=2)
plt.title("Convergence plot", fontsize=16)
plt.xlabel("#epochs", fontsize=14)
plt.ylabel("Objective (log scale)", fontsize=14)
plt.legend()

In [None]:
# Run several instances of stochastic gradient with decreasing step sizes

nb_epochs = 60
n = pblogreg.n
nbset = 1
w0 = np.zeros(d)

###########################
# Input your choices for t to define a stepsize sequence {step0/(k+1)^t}
decstep = []
###########################

nvals = len(decstep)

objs = np.zeros((nb_epochs+1,nvals))

for val in range(nvals):
    _, objs[:,val], _ = stoch_grad(w0,pblogreg,w_min_log,stepchoice=decstep[val],step0=0.2, n_iter=int(nb_epochs*n/nbset),nb=1)

In [None]:
# Plot the comparison of variants of SG with different decreasing stepsize rules
# NB: The x-axis is in epochs (1 iteration of GD).

# In terms of objective value (logarithmic scale)
plt.figure(figsize=(7, 5))
plt.set_cmap("RdPu")
for val in range(nvals):
    plt.semilogy(objs[:,val]-f_min_log, label="SG - 0.2/(k+1)^"+str(decstep[val]), lw=2)
plt.title("Convergence plot", fontsize=16)
plt.xlabel("#epochs", fontsize=14)
plt.ylabel("Objective (log scale)", fontsize=14)
plt.legend(loc=1)

## <span style="color:rgb(139,69,19)"> 2.4 Experimenting with the batch size</span>

We now wish to compare the performance of stochastic gradient for several values of the batch size (with decreasing stepsizes), under the same epoch budget.

***NB: One must pay attention to the definition of the number of iterations for each variant, as it depends on the batch size.***  
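
The arithmetic behind this remark can be sketched as follows: one epoch corresponds to $n$ component gradient evaluations, so a method with batch size $n_b$ needs $n/n_b$ iterations per epoch. The dictionary `iters` is a name introduced for this illustration; `n` and `nb_epochs` take the values used in the cells of this section.

```python
n = 1000          # number of samples (default value in this lab)
nb_epochs = 100   # epoch budget shared by all variants

# Number of iterations that consume nb_epochs * n component gradients
iters = {}
for nb in (1, n // 100, n // 10, n // 2, n):
    iters[nb] = (nb_epochs * n) // nb
    print("batch size", nb, "->", iters[nb], "iterations")
```

All variants then access the same number of data points, which makes the convergence curves comparable when plotted against epochs.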

### <span style="color:rgb(139,69,19)">Hands-on! Batch size tuning</span>

**1) Run the method with $n_b \in \left\{1,\tfrac{n}{100},\tfrac{n}{10},\tfrac{n}{2},n \right\}$ (NB: $n=1000$ in the default settings) without replacement, to include stochastic gradient ($n_b=1$) and gradient descent ($n_b=n$ without replacement).**

**2) Can you find a better value for the batch size than those above?**

**3) Do the conclusions change when the indices are drawn with replacement?**

In [None]:
# Test several values for the batch size (without replacement) using the same epoch budget.

nb_epochs = 100
n = pblinreg.n
w0 = np.zeros(d)

replace_batch=False

# Stochastic gradient (batch size 1)
_, obj_a, nits_a = stoch_grad(w0,pblinreg,w_min_lin,stepchoice=0.5,step0=0.2, n_iter=nb_epochs*n,nb=1)

# Batch stochastic gradient (batch size n/100)
nbset=int(n/100)
_, obj_b, nits_b = stoch_grad(w0,pblinreg,w_min_lin,stepchoice=0.5,step0=0.2, n_iter=int(nb_epochs*n/nbset),nb=nbset,with_replace=replace_batch)
# Batch stochastic gradient (batch size n/10)
nbset=int(n/10)
_, obj_c, nits_c = stoch_grad(w0,pblinreg,w_min_lin,stepchoice=0.5,step0=0.2, n_iter=int(nb_epochs*n/nbset),nb=nbset,with_replace=replace_batch)
# Batch stochastic gradient (batch size n/2)
nbset=int(n/2)
_, obj_d, nits_d = stoch_grad(w0,pblinreg,w_min_lin,stepchoice=0.5,step0=0.2, n_iter=int(nb_epochs*n/nbset),nb=nbset,with_replace=replace_batch)

# Gradient descent (batch size n, taken without replacement)
_, obj_e, nits_e = stoch_grad(w0,pblinreg,w_min_lin,stepchoice=0.5,step0=0.2, n_iter=int(nb_epochs),nb=n)

In [None]:
# Plot the comparison of variants of batch SGD with the same stepsize rule

# In terms of objective value (logarithmic scale)
plt.figure(figsize=(7, 5))
plt.semilogy(obj_a-f_min_lin, label="SG (batch=1)", lw=2)
plt.semilogy(obj_b-f_min_lin, label="Batch SG - n/100", lw=2)
plt.semilogy(obj_c-f_min_lin, label="Batch SG - n/10", lw=2)
plt.semilogy(obj_d-f_min_lin, label="Batch SG - n/2", lw=2)
plt.semilogy(obj_e-f_min_lin, label="GD", lw=2)
plt.title("Convergence plot", fontsize=16)
plt.xlabel("#epochs", fontsize=14)
plt.ylabel("Objective (log scale)", fontsize=14)
plt.legend()

In [None]:
# Test several values for the batch size (with replacement) using the same epoch budget.

nb_epochs = 100
n = pblinreg.n
w0 = np.zeros(d)

replace_batch=True

# Stochastic gradient (batch size 1)
_, obj_ar, nits_ar = stoch_grad(w0,pblinreg,w_min_lin,stepchoice=0.5,step0=0.2, n_iter=nb_epochs*n,nb=1)
# Batch stochastic gradient (batch size n/100)
nbset=int(n/100)
_, obj_br, nits_br = stoch_grad(w0,pblinreg,w_min_lin,stepchoice=0.5,step0=0.2, n_iter=int(nb_epochs*n/nbset),nb=nbset,with_replace=replace_batch)
# Batch stochastic gradient (batch size n/10)
nbset=int(n/10)
_, obj_cr, nits_cr = stoch_grad(w0,pblinreg,w_min_lin,stepchoice=0.5,step0=0.2, n_iter=int(nb_epochs*n/nbset),nb=nbset,with_replace=replace_batch)
# Batch stochastic gradient (batch size n/2)
nbset=int(n/2)
_, obj_dr, nits_dr = stoch_grad(w0,pblinreg,w_min_lin,stepchoice=0.5,step0=0.2, n_iter=int(nb_epochs*n/nbset),nb=nbset,with_replace=replace_batch)

# Gradient descent (batch size n, taken with replacement)
_, obj_er, nits_er = stoch_grad(w0,pblinreg,w_min_lin,stepchoice=0.5,step0=0.2, n_iter=int(nb_epochs),nb=n,with_replace=replace_batch)



In [None]:
# Plot the comparison of variants of batch SGD with the same stepsize rule

# In terms of objective value (logarithmic scale)
plt.figure(figsize=(7, 5))
plt.semilogy(obj_ar-f_min_lin, label="SG (batch=1 wr)", lw=2)
plt.semilogy(obj_a-f_min_lin, label="SG (batch=1)", lw=2)
plt.semilogy(obj_br-f_min_lin, label="Batch SG - n/100 (wr)", lw=2)
plt.semilogy(obj_b-f_min_lin, label="Batch SG - n/100", lw=2)
plt.semilogy(obj_cr-f_min_lin, label="Batch SG - n/10 (wr)", lw=2)
plt.semilogy(obj_c-f_min_lin, label="Batch SG - n/10", lw=2)
plt.semilogy(obj_dr-f_min_lin, label="Batch SG - n/2 (wr)", lw=2)
plt.semilogy(obj_d-f_min_lin, label="Batch SG - n/2", lw=2)
plt.semilogy(obj_er-f_min_lin, label="Batch SG - n (wr)", lw=2)
plt.semilogy(obj_e-f_min_lin, label="GD", lw=2)
plt.title("Convergence plot", fontsize=16)
plt.xlabel("#epochs", fontsize=14)
plt.ylabel("Objective (log scale)", fontsize=14)
plt.legend()

### <span style="color:rgb(139,69,19)">Bonus experiment</span> 

As a bonus experiment, we compare batch stochastic gradient techniques over 10 random runs.

In [None]:
# Test several values for the batch size using the same epoch budget.

nb_epochs = 100
n = pblinreg.n
w0 = np.zeros(d)

nruns = 10

for i in range(nruns):
    ############################
    # Run standard stochastic gradient (batch size 1)
    _, obj_a, _ = stoch_grad(w0,pblinreg,w_min_lin,stepchoice=0.5,step0=0.2, n_iter=nb_epochs*n,nb=1,with_replace=True,verbose=False)
    # Batch stochastic gradient (batch size n/10)
    nbset=int(n/10)
    _, obj_b, _ = stoch_grad(w0,pblinreg,w_min_lin,stepchoice=0.5,step0=0.2, n_iter=int(nb_epochs*n/nbset),nb=nbset,with_replace=True,verbose=False)
    # Batch stochastic gradient (batch size n/2)
    nbset=int(n/2)
    _, obj_c, _ = stoch_grad(w0,pblinreg,w_min_lin,stepchoice=0.5,step0=0.2, n_iter=int(nb_epochs*n/nbset),nb=nbset,with_replace=True,verbose=False)
    # Batch stochastic gradient (batch size n, with replacement)
    nbset=n
    _, obj_d, _ = stoch_grad(w0,pblinreg,w_min_lin,stepchoice=0.5,step0=0.2, n_iter=int(nb_epochs*n/nbset),nb=nbset,with_replace=True,verbose=False)
    ############################
    
    # Plots runs on the same figure
    if i<nruns-1:
        plt.semilogy(obj_a-f_min_lin,color='orange',lw=2)
        plt.semilogy(obj_b-f_min_lin,color='green', lw=2)
        plt.semilogy(obj_c-f_min_lin,color='red', lw=2)
        plt.semilogy(obj_d-f_min_lin,color='blue', lw=2)
plt.semilogy(obj_a-f_min_lin,label="SG",color='orange',lw=2)
plt.semilogy(obj_b-f_min_lin,label="batch n/10",color='green',lw=2)
plt.semilogy(obj_c-f_min_lin,label="batch n/2",color='red', lw=2)
plt.semilogy(obj_d-f_min_lin,label="batch n",color='blue', lw=2)    

plt.title("Convergence plot", fontsize=16)
plt.xlabel("#epochs", fontsize=14)
plt.ylabel("Objective (log scale)", fontsize=14)
plt.legend()

# <span style="color:rgb(139,69,19)">Part 3 - Variants on the stochastic gradient framework</span>

The goal of this section is to present popular variants of the classical stochastic gradient scheme. The augmented code below adds diagonal scaling to the stochastic gradient implementation of Part 2 and will be used throughout this part.

##  <span style="color:rgb(139,69,19)">3.1 Practical SG variants based on diagonal scaling</span>

#### <span style="color:rgb(139,69,19)">About *RMSProp* and *Adagrad*</span>

*RMSProp* and *Adagrad* are both based on diagonal scaling. This corresponds to rescaling the stochastic gradient step componentwise as follows

 $$
     [\mathbf{w}_{k+1}]_i  = [\mathbf{w}_k]_i -\frac{\alpha_k}{\sqrt{[\mathbf{r}_k]_i + \epsilon}}[\nabla f_{i_k}(\mathbf{w}_k)]_i,
 $$
 
 where $\epsilon>0$ is added to avoid numerical issues, and $\mathbf{r}_k \in \mathbb{R}^d$ is defined recursively by $\mathbf{r}_{-1} = 0_{\mathbb{R}^d}$ and
 
 $$ 
     \forall k \ge 0,\ \forall i=1,\dots,d, \qquad 
     [\mathbf{r}_k]_i = 
     \left\{
         \begin{array}{ll}
             \beta [\mathbf{r}_{k-1}]_i + (1-\beta) [\nabla f_{i_k}(\mathbf{w}_k)]_i^2 &\mathrm{for\ RMSProp,} \\
             [\mathbf{r}_{k-1}]_i + [\nabla f_{i_k}(\mathbf{w}_k)]_i^2 &\mathrm{for\ Adagrad.}
         \end{array}
     \right.
 $$
(Suggested values: $\epsilon=\tfrac{1}{2 \sqrt{n}}$, $\beta=0.8$.)

The use of $\epsilon>0$ keeps the denominator $\sqrt{[\mathbf{r}_k]_i + \epsilon}$ bounded away from zero, so that the scaling factor applied to each component of the stochastic gradient remains bounded. *This technique is typically adopted in modern implementations of these methods.*
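
The two recursions can be sketched on a single step as follows. The function `scaled_step`, the fixed vector `g` standing in for a stochastic gradient, and the values of `alpha` and `eps` are assumptions made for this illustration. The convention that `beta=1` selects Adagrad follows the implementation given next.

```python
import numpy as np

def scaled_step(w, g, r, alpha, beta, eps):
    """One diagonally scaled step: beta == 1 gives Adagrad, 0 < beta < 1 gives RMSProp."""
    if beta == 1:
        r = r + g * g                       # Adagrad: accumulate squared gradients
    else:
        r = beta * r + (1 - beta) * g * g   # RMSProp: moving average of squared gradients
    w = w - alpha * g / np.sqrt(r + eps)    # componentwise rescaled step
    return w, r

w0 = np.zeros(2)
g = np.array([1.0, -2.0])   # fixed vector standing in for a stochastic gradient
r0 = np.zeros(2)            # r_{-1} = 0

w_ada, r_ada = scaled_step(w0, g, r0, alpha=0.1, beta=1, eps=1e-8)
w_rms, r_rms = scaled_step(w0, g, r0, alpha=0.1, beta=0.8, eps=1e-8)

print("Adagrad:", w_ada, r_ada)
print("RMSProp:", w_rms, r_rms)
```

In both cases, the first step has the same magnitude in every component even though the gradient entries differ by a factor of 2, which is the intended effect of the diagonal scaling.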

In [None]:
# Advanced stochastic gradient implementation based on diagonal scaling
def stoch_grad_scaling(w0,problem,wtarget,stepchoice=0,step0=1, n_iter=1000,nb=1,beta=0,with_replace=False,verbose=False): 
    """
        A code for batch stochastic gradient with diagonal scaling and various step choices.
        
        Inputs:
            w0: Initial vector
            problem: Problem structure
                problem.fun() returns the objective function, which is assumed to be a finite sum of functions
                problem.n returns the number of components in the finite sum
                problem.grad_i() returns the gradient of a single component f_i
                problem.lipgrad() returns the Lipschitz constant for the gradient
                problem.lbda returns the value of the regularization parameter
            wtarget: Target minimum (unknown in practice!)
            stepchoice: Strategy for computing the stepsize 
                0: Constant step size equal to step0/L
                t>0: Step size decreasing as step0/(k+1)^t
            step0: Scaling factor for the stepsize (see stepchoice)
            n_iter: Number of iterations, used as stopping criterion
            nb: Number of components drawn per iteration/Batch size 
                1: Classical stochastic gradient algorithm (default value)
                problem.n: Classical gradient descent (when drawn without replacement)
            beta: Use a diagonal scaling
                0: No scaling (default)
                (0,1): Average of magnitudes (RMSProp)
                1: Normalization with magnitudes (Adagrad)
            with_replace: Boolean indicating whether components are drawn with or without replacement
                True: Components drawn with replacement
                False: Components drawn without replacement (Default)
            verbose: Boolean indicating whether information should be printed at every epoch (Default: False)
            
        Outputs:
            w_output: Final iterate of the method
            objvals: History of function values, recorded once per epoch (Numpy array of length n_iter+1 at most)
            normits: History of distances between iterates and optimum, recorded once per epoch (Numpy array of length n_iter+1 at most)
    """
    ############
    # Initial step: Compute and plot some initial quantities

    # objective history
    objvals = []
    
    # iterates distance to the minimum history
    normits = []
    
    # Lipschitz constant
    L = problem.lipgrad()
    
    # Number of samples
    n = problem.n
    
    # Initial value of current iterate  
    w = w0.copy()
    nw = norm(w)

    
    #Scaling values
    if beta>0:
        eps=1/(2 *(n ** (0.5))) # To avoid numerical issues
        v = np.zeros(w.shape) # Same shape as w (avoids relying on a global variable d)

    # Initialize iteration counter
    k=0
    
    # Current objective
    obj = problem.fun(w) 
    objvals.append(obj)
    # Current distance to the optimum
    nmin = norm(w-wtarget)
    normits.append(nmin)
    
    # Plot initial quantities of interest
    if verbose:
        print("Stochastic Gradient, batch size=",nb,"/",n)
        print(' | '.join([name.center(8) for name in ["iter", "fval", "normit"]]))
        print(' | '.join([("%d" % k).rjust(8),("%.2e" % obj).rjust(8),("%.2e" % nmin).rjust(8)]))
    
    ################
    # Main loop
    while (k < n_iter and nw < 10**100):
        
        #########################################
        # Draw the batch indices
        ik = np.random.choice(n,nb,replace=with_replace)# Batch gradient
        # Stochastic gradient calculation
        sg = np.zeros(w.shape) # Same shape as w (avoids relying on a global variable d)
        for j in range(nb):
            gi = problem.grad_i(ik[j],w)
            sg = sg + gi
        sg = (1/nb)*sg
        ###########################################
        
        ###########################################
        # Scaling
        if beta>0:
            if beta==1:
                # Adagrad update
                v = v + sg*sg 
            else:
                # RMSProp update
                v = beta*v + (1-beta)*sg*sg
            sg = sg/(np.sqrt(v+eps))
        ##########################################
            
        if stepchoice==0:
            w[:] = w - (step0/L) * sg
        elif stepchoice>0:
            sk = float(step0/((k+1)**stepchoice))
            w[:] = w - sk * sg
        
        nw = norm(w) # Compute the norm to detect divergence (used in the loop condition)
        
        obj = problem.fun(w)
        nmin = norm(w-wtarget)
       
        
        k += 1
        # Plot quantities of interest at the end of every epoch only
        if (k*nb) % n == 0:
            objvals.append(obj)
            normits.append(nmin)
            if verbose:
                print(' | '.join([("%d" % k).rjust(8),("%.2e" % obj).rjust(8),("%.2e" % nmin).rjust(8)]))     
    
    # End of main loop
    #################
    
    # Plot quantities of interest for the last iterate (if needed)
    if (k*nb) % n > 0:
        objvals.append(obj)
        normits.append(nmin)
        if verbose:
            print(' | '.join([("%d" % k).rjust(8),("%.2e" % obj).rjust(8),("%.2e" % nmin).rjust(8)]))              
    
    # Outputs
    w_output = w.copy()
    
    return w_output, np.array(objvals), np.array(normits)

### <span style="color:rgb(139,69,19)">Hands-on! Diagonal scaling</span> 

**Run the block below to compare RMSProp and Adagrad with SG using a decreasing stepsize $\frac{\alpha_0}{\sqrt{k+1}}$. What do you observe?**

**Can you improve the performance of RMSProp by playing with $\beta$?**
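
To build intuition for the two accumulators before running the comparison, here is a minimal sketch in a toy setting that is not part of the lab: the gradient is assumed to be a constant scalar, and `scaled_step_sizes` is a hypothetical helper. It tracks the magnitude of the scaled step $g/\sqrt{v}$ under the Adagrad and RMSProp updates used in `stoch_grad_scaling` (with $\epsilon=0$ for simplicity).

```python
import numpy as np

def scaled_step_sizes(g, n_steps, beta):
    """Return |g / sqrt(v_k)| over n_steps for a constant scalar gradient g.

    beta == 1 uses the Adagrad accumulator v <- v + g*g,
    0 < beta < 1 uses the RMSProp accumulator v <- beta*v + (1-beta)*g*g.
    """
    v = 0.0
    sizes = []
    for _ in range(n_steps):
        if beta == 1:
            v = v + g * g
        else:
            v = beta * v + (1 - beta) * g * g
        sizes.append(abs(g) / np.sqrt(v))
    return np.array(sizes)

# Toy values chosen for illustration only
adagrad_sizes = scaled_step_sizes(g=2.0, n_steps=100, beta=1)
rmsprop_sizes = scaled_step_sizes(g=2.0, n_steps=100, beta=0.8)
print(adagrad_sizes[-1], rmsprop_sizes[-1])
```

In this toy setting, the Adagrad step decays like $1/\sqrt{k}$ after $k$ updates, whereas the RMSProp step stabilizes near $1$. This is worth keeping in mind when pairing each method with a constant or a decreasing stepsize.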

In [None]:
# Comparison of stochastic gradient with and without diagonal scaling

nb_epochs = 60
n = pblinreg.n
w0 = np.zeros(d)

# Stochastic gradient (batch size 1) without diagonal scaling
_, obj_a, nits_a = stoch_grad_scaling(w0,pblinreg,w_min_lin,stepchoice=0.5,step0=0.2, n_iter=nb_epochs*n,nb=1,beta=0)

# Stochastic gradient (batch size 1) with Adagrad diagonal scaling (decreasing stepsize)
_, obj_bd, nits_bd = stoch_grad_scaling(w0,pblinreg,w_min_lin,stepchoice=0.5,step0=0.2, n_iter=nb_epochs*n,nb=1,beta=1)

# Stochastic gradient (batch size 1) with RMSProp diagonal scaling (decreasing stepsize)
_, obj_cd, nits_cd = stoch_grad_scaling(w0,pblinreg,w_min_lin,stepchoice=0.5,step0=0.2, n_iter=nb_epochs*n,nb=1,beta=0.8)

# Stochastic gradient (batch size 1) with Adagrad diagonal scaling (constant stepsize)
_, obj_bc, nits_bc = stoch_grad_scaling(w0,pblinreg,w_min_lin,stepchoice=0,step0=0.2, n_iter=nb_epochs*n,nb=1,beta=1)

# Stochastic gradient (batch size 1) with RMSProp diagonal scaling (constant stepsize)
_, obj_cc, nits_cc = stoch_grad_scaling(w0,pblinreg,w_min_lin,stepchoice=0,step0=0.2, n_iter=nb_epochs*n,nb=1,beta=0.8)

In [None]:
# Plot the results - Comparison of stochastic gradient with and without diagonal scaling
# In terms of objective value (logarithmic scale)
plt.figure(figsize=(7, 5))
plt.semilogy(obj_a-f_min_lin, label="SG", lw=2)
plt.semilogy(obj_bd-f_min_lin, label="Adagrad (dec)", lw=2)
plt.semilogy(obj_cd-f_min_lin, label="RMSProp (dec)", lw=2)
plt.semilogy(obj_bc-f_min_lin, label="Adagrad (cst)", lw=2)
plt.semilogy(obj_cc-f_min_lin, label="RMSProp (cst)", lw=2)
plt.title("SG vs Adagrad/RMSProp", fontsize=16)
plt.xlabel("#epochs", fontsize=14)
plt.ylabel("Objective (log scale)", fontsize=14)
plt.legend()
# In terms of distance to the minimum (logarithmic scale)
plt.figure(figsize=(7, 5))
plt.semilogy(nits_a, label="SG", lw=2)
plt.semilogy(nits_bd, label="Adagrad (dec)", lw=2)
plt.semilogy(nits_cd, label="RMSProp (dec)", lw=2)
plt.semilogy(nits_bc, label="Adagrad (cst)", lw=2)
plt.semilogy(nits_cc, label="RMSProp (cst)", lw=2)
plt.title("SG vs Adagrad/RMSProp", fontsize=16)
plt.xlabel("#epochs", fontsize=14)
plt.ylabel("Distance to minimum (log scale)", fontsize=14)
plt.legend()

## <span style="color:rgb(139,69,19)">3.2 Momentum-based techniques</span>

#### <span style="color:rgb(139,69,19)">SGD with momentum and Adam</span>

The idea behind momentum is to leverage information from the past iterations, by combining previous steps with the step suggested by stochastic gradient.

***Stochastic gradient with momentum*** has the following form

$$
    \mathbf{w}_{k+1} = \mathbf{w}_k - \alpha_k \mathbf{m}_k, 
    \quad \mathrm{where} \quad
    \mathbf{m}_k = \beta_1 \mathbf{m}_{k-1} + (1-\beta_1)\nabla f_{i_k}(\mathbf{w}_k),
$$

with $\beta_1 \in [0,1)$ (with $\beta_1=0$, we recover the standard stochastic gradient technique).
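
The recursion above can be unrolled to see what "leveraging past iterations" means: assuming $\mathbf{m}_{-1}=\mathbf{0}$, the vector $\mathbf{m}_k$ is a weighted sum of all past stochastic gradients, with weights $(1-\beta_1)\beta_1^{k-j}$ that decay geometrically with the age of the gradient. The sketch below checks this numerically on random vectors standing in for stochastic gradients (illustrative data, not the lab's problem).

```python
import numpy as np

rng = np.random.default_rng(0)
beta1 = 0.9
grads = rng.standard_normal((50, 3))  # stand-ins for stochastic gradients (illustrative)

# Recursive form: m_k = beta1*m_{k-1} + (1-beta1)*g_k, assuming m_{-1} = 0
m = np.zeros(3)
for g in grads:
    m = beta1 * m + (1 - beta1) * g

# Unrolled form: m_k = (1-beta1) * sum_{j=0}^{k} beta1**(k-j) * g_j
k = len(grads) - 1
weights = (1 - beta1) * beta1 ** (k - np.arange(k + 1))
m_unrolled = weights @ grads

print(np.allclose(m, m_unrolled))
```

The weights sum to $1-\beta_1^{k+1}$ rather than $1$, so $\mathbf{m}_k$ is biased towards zero during the first iterations.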

***Adam*** combines momentum and diagonal scaling ideas, and can be written as 
$$
    \mathbf{w}_{k+1} = \mathbf{w}_k - \alpha_k \mathbf{m}_k \oslash \sqrt{\mathbf{v}_k},
$$
with
$$
    \mathbf{m}_k = \frac{1-\beta_1^k}{1-\beta_1^{k+1}}\beta_1\mathbf{m}_{k-1} + \frac{1-\beta_1}{1-\beta_1^{k+1}}\nabla f_{i_k}(\mathbf{w}_k)
$$
and
$$
    \mathbf{v}_k = \frac{1-\beta_2^k}{1-\beta_2^{k+1}}\beta_2\mathbf{v}_{k-1} + \frac{1-\beta_2}{1-\beta_2^{k+1}}\nabla f_{i_k}(\mathbf{w}_k)\otimes\nabla f_{i_k}(\mathbf{w}_k).
$$

In practice, the vector $\mathbf{v}_k$ is replaced by $\mathbf{v}_k+\epsilon$.
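
These recursions may look different from the way Adam is often presented, namely as plain moving averages $\boldsymbol{\mu}_k$ and $\boldsymbol{\nu}_k$ followed by a bias correction $\boldsymbol{\mu}_k/(1-\beta_1^{k+1})$ and $\boldsymbol{\nu}_k/(1-\beta_2^{k+1})$. The sketch below checks numerically that both forms produce the same vectors, assuming zero initialization and an iteration counter starting at $k=0$. The names `mu`, `nu`, `d_toy` and the random gradients are illustrative and not part of the lab.

```python
import numpy as np

rng = np.random.default_rng(1)
beta1, beta2 = 0.9, 0.999
d_toy = 4
grads = rng.standard_normal((200, d_toy))  # stand-ins for stochastic gradients

m, v = np.zeros(d_toy), np.zeros(d_toy)    # recursions written as in the text
mu, nu = np.zeros(d_toy), np.zeros(d_toy)  # uncorrected moving averages
max_gap = 0.0

for k, g in enumerate(grads):
    c1, c2 = 1 - beta1 ** (k + 1), 1 - beta2 ** (k + 1)
    # Recursions from the text (the first coefficient vanishes at k=0)
    m = ((1 - beta1 ** k) / c1) * beta1 * m + ((1 - beta1) / c1) * g
    v = ((1 - beta2 ** k) / c2) * beta2 * v + ((1 - beta2) / c2) * g * g
    # Plain moving averages, bias-corrected afterwards by dividing by c1 and c2
    mu = beta1 * mu + (1 - beta1) * g
    nu = beta2 * nu + (1 - beta2) * g * g
    gap = max(np.abs(m - mu / c1).max(), np.abs(v - nu / c2).max())
    max_gap = max(max_gap, gap)

print(max_gap)  # largest discrepancy between the two forms
```

At $k=0$ the coefficient $1-\beta_1^k$ is zero, so $\mathbf{m}_0$ is simply the first stochastic gradient and $\mathbf{v}_0$ its componentwise square.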

The code below implements these momentum-based techniques.

In [None]:
# Stochastic gradient with momentum
def stoch_grad_momentum(w0,problem,wtarget,stepchoice=0,step0=1, n_iter=1000,nb=1,beta1=0.9,beta2=0.999,with_replace=False,verbose=False): 
    """
        A code for stochastic gradient with momentum and Adam
        The code depends on two parameters, beta1 and beta2.
            
            1) beta1=beta2=0 gives vanilla stochastic gradient.
            2) beta1>0 and beta2=0 corresponds to stochastic gradient with momentum
            3) beta1>0 and beta2>0 corresponds to Adam
            
            Nb: Choosing beta1=0 and beta2>0 would be more in the spirit of RMSProp above.
        
        Inputs:
            w0: Initial point
            problem: Instance to be minimized
                problem.fun(x) Objective function            
                problem.grad_i() Gradient of function f_i for finite sum
                problem.lipgrad() Lipschitz constant for the gradient
            wtarget: Target minimum
            stepchoice: Stepsize choice
                0: Constant, equal to step0/L (L Lipschitz constant for the gradient)
                a>0: Decreasing, set to step0/((k+1)**a)
            step0: Initial stepsize
            n_iter: Maximum number of iterations
            nb: Batch size
            beta1: Momentum parameter
                0: Classical stochastic gradient direction (no momentum)
                (0,1): Momentum-based method (default: 0.9)
            beta2: Scaling parameter
                0: No scaling (Same stepsize for each coordinate)
                (0,1): Scaling every coordinate (default: 0.999) 
            with_replace: Draw indices with replacement?
            verbose: Print iteration-dependent information
            
        Outputs:
            w_output: Final iterate of the method
            objvals: History of function values (Numpy array of length n_iter at most)
            normits: History of distances between iterates and optimum (Numpy array of length n_iter at most)
    """

    ############
    # Initialization

    # History of objective values and distance to the target minimum
    objvals = []
    normits = []
    
    # Lipschitz constant
    L = problem.lipgrad()
    
    # Number of data points
    n = problem.n
    
    # Initial point
    w = w0.copy()
    nw = norm(w)
    
    # Vector characterizing the direction
    mv = np.zeros(d)
    
    # Scaling vector (if needed)
    if beta2>0:
        eps=0 # Set to a small positive value (e.g. 10**(-8)) to avoid numerical issues
        v = np.zeros(d)

    # Iteration count
    k=0
    
    # Initial values
    obj = problem.fun(w) 
    objvals.append(obj);
    # Initial distance to the target minimum
    nmin = norm(w-wtarget)
    normits.append(nmin)
    
    # Plotting information (optional)
    if verbose:
        if beta1>0:
            if beta2>0:
                print("Adam, batch size=",nb,"/",n)
            else:
                print("SGD with momentum, batch size=",nb,"/",n)
        else:
            print("Stochastic gradient, batch size=",nb,"/",n)
        print(' | '.join([name.center(8) for name in ["iter", "fval", "normit"]]))
        print(' | '.join([("%d" % k).rjust(8),("%.2e" % obj).rjust(8),("%.2e" % nmin).rjust(8)]))
    
    ################
    # Main loop
    while (k < n_iter and nw < 10**100):
        
        #########################################
        # Draw indices
        ik = np.random.choice(n,nb,replace=with_replace)
        # Compute stochastic gradient estimate
        sg = np.zeros(d)
        for j in range(nb):
            gi = problem.grad_i(ik[j],w)
            sg = sg + gi
        sg = (1/nb)*sg
        ###########################################
        
        ###########################################
        # Update the direction
        if beta1>0:
            if beta2>0:
                mv = ((1-beta1**k)/(1-beta1**(k+1)))*beta1*mv + ((1-beta1)/(1-beta1**(k+1)))*sg
            else:
                mv = beta1*mv + (1-beta1)*sg
        else:
            mv = sg
            
        ###########################################
        # Update scaling vector
        # The scaled direction is stored in dk so that mv keeps the
        # unscaled momentum vector m_k for the next iteration.
        dk = mv
        if beta2>0:
            v = ((1-beta2**k)/(1-beta2**(k+1)))*beta2*v + ((1-beta2)/(1-beta2**(k+1)))*sg*sg
            if k>0:
                dk = mv/(np.sqrt(v+eps))
        ##########################################
            
        if stepchoice==0:
            w[:] = w - (step0/L) * dk
        elif stepchoice>0:
            sk = float(step0/((k+1)**stepchoice))
            w[:] = w - sk * dk
        
        nw = norm(w) # Compute norm to detect divergence
        
        obj = problem.fun(w)
        nmin = norm(w-wtarget)
        
        k += 1
        # Record (and optionally print) quantities of interest at the end of every epoch
        if (k*nb) % n == 0:
            objvals.append(obj)
            normits.append(nmin)
            if verbose:
                print(' | '.join([("%d" % k).rjust(8),("%.2e" % obj).rjust(8),("%.2e" % nmin).rjust(8)]))     

    
    # End main loop
    #################
    
    # Plotting information (optional)
    if (k*nb) % n > 0:
        objvals.append(obj)
        normits.append(nmin)
        if verbose:
            print(' | '.join([("%d" % k).rjust(8),("%.2e" % obj).rjust(8),("%.2e" % nmin).rjust(8)]))              
    
    # Last iterate
    w_output = w.copy()
    
    return w_output, np.array(objvals), np.array(normits)

### <span style="color:rgb(139,69,19)">Hands-on! Momentum</span> 

**1) Compare SG with momentum and Adam to SG using the default settings for Adam and $\beta_1=0.9$ for SG with momentum.**

**2) Play with the learning rate to try to improve the numerical performance of Adam.**

In [None]:
# Numerical comparison
nb_epochs = 100
n = pblinreg.n
w0 = np.zeros(d)

# Fix random seed for reproducibility
np.random.seed(4)

# Vanilla SG (decreasing stepsize)
_, obj_sg_d, nits_sg_d = stoch_grad_momentum(w0,pblinreg,w_min_lin,stepchoice=0.5,step0=0.2, n_iter=nb_epochs*n,nb=1,beta1=0,beta2=0)
# SG with momentum (decreasing stepsize)
_, obj_sgm_d, nits_sgm_d = stoch_grad_momentum(w0,pblinreg,w_min_lin,stepchoice=0.5,step0=0.2, n_iter=nb_epochs*n,nb=1,beta1=0.9,beta2=0)
# Adam (decreasing stepsize)
_, obj_adam_d, nits_adam_d = stoch_grad_momentum(w0,pblinreg,w_min_lin,stepchoice=0.5,step0=0.01, n_iter=nb_epochs*n,nb=1)
# SG (constant stepsize)
_, obj_sg_c, nits_sg_c = stoch_grad_momentum(w0,pblinreg,w_min_lin,stepchoice=0,step0=0.001, n_iter=nb_epochs*n,nb=1,beta1=0,beta2=0)
# SG with momentum (constant stepsize)
_, obj_sgm_c, nits_sgm_c = stoch_grad_momentum(w0,pblinreg,w_min_lin,stepchoice=0,step0=0.001, n_iter=nb_epochs*n,nb=1,beta1=0.9,beta2=0)
# Adam (constant stepsize)
_, obj_adam_c, nits_adam_c = stoch_grad_momentum(w0,pblinreg,w_min_lin,stepchoice=0,step0=0.01, n_iter=nb_epochs*n,nb=1)

In [None]:
# Results
# In terms of function values
plt.figure(figsize=(7, 5))
plt.semilogy(obj_sg_d-f_min_lin, label="SG (dec)", lw=2)
plt.semilogy(obj_sgm_d-f_min_lin, label="SG+momentum (dec)", lw=2)
plt.semilogy(obj_adam_d-f_min_lin, label="Adam (dec)", lw=2)
plt.semilogy(obj_sg_c-f_min_lin, label="SG (cst)", lw=2)
plt.semilogy(obj_sgm_c-f_min_lin, label="SG+momentum (cst)", lw=2)
plt.semilogy(obj_adam_c-f_min_lin, label="Adam (cst)", lw=2)
plt.title("SG vs Momentum", fontsize=16)
plt.xlabel("#epochs", fontsize=14)
plt.ylabel("Objective-target (log)", fontsize=14)
plt.legend()
# In terms of distance to the target value
plt.figure(figsize=(7, 5))
plt.semilogy(nits_sg_d, label="SG (dec)", lw=2)
plt.semilogy(nits_sgm_d, label="SG+momentum (dec)", lw=2)
plt.semilogy(nits_adam_d, label="Adam (dec)", lw=2)
plt.semilogy(nits_sg_c, label="SG (cst)", lw=2)
plt.semilogy(nits_sgm_c, label="SG+momentum (cst)", lw=2)
plt.semilogy(nits_adam_c, label="Adam (cst)", lw=2)
plt.title("SG vs Momentum", fontsize=16)
plt.xlabel("#epochs", fontsize=14)
plt.ylabel("Distance to minimum (log)", fontsize=14)
plt.legend()

In [None]:
# Version 4.4 - C. W. Royer, February 2025.