<a href="https://colab.research.google.com/github/gabrielecola/Optimization_Tecniques/blob/main/Stochastic_methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework $\beta$ - Stochastic methods and regularization techniques

In [None]:
import typing
import numpy as np
import scipy.linalg
import plotly.colors
import plotly.graph_objects as go
import plotly.subplots
import os
import pandas as pd

## Homework $\beta$.1

In this exercise we consider a training dataset composed of $m$ entries, where each entry is characterized by $d$ features and one observation. Throughout the exercise we will use a linear regression problem on such dataset to compare deterministic methods to stochastic methods, as well as unregularized problems to problems with regularization techniques. In this exercise we will be focused only on the training part (and not on the validation of the trained model on a test dataset), therefore in the following the training dataset will be simply called *the dataset*, as it will be the only dataset we will encounter in this exercise.

1. Generate a dataset with $d = 50$ features and $m = 1000$ samples using the following Python function.

   For reproducibility, please set the *seed* `np.random.seed(21 + 100)` at the beginning of the cell in which you call the function.

In [None]:
def generate_dataset(d: int, m: int) -> typing.Tuple[np.ndarray, np.ndarray]:
    """Generate a dataset with d features and m samples."""
    range_d = np.arange(d)
    cov = scipy.linalg.toeplitz(0.1**range_d)
    X = np.random.multivariate_normal(np.zeros(d), cov, size=m)
    w = (-1)**range_d * np.exp(- range_d / 10.)
    noise = np.random.randn(m)
    y = np.dot(X, w) + noise
    return X, y

In [None]:
m = 1000
d = 50
np.random.seed(21 + 100)
X, y = generate_dataset(d, m)


In [None]:
X.shape #(m, d)

(1000, 50)

2. Implement the evaluation of the empirical risk associated with a linear regression via least squares with multiple features, as well as its gradient. The implemented functions should have two arguments: the first argument `w` is a vector with $n = d + 1$ components, while the second argument `addend` is optional and can be either a natural number `j` or `None` (default value: `None`), for later use in stochastic methods. If `addend` is a natural number `j`, the empirical risk function should return the least squares loss associated with the `j`-th row of the dataset; if `addend` is `None`, the empirical risk on the whole dataset should be computed.

To evaluate the empirical risk, we first need the prediction function $\hat{y}$, defined as:
$$\hat{y}(\boldsymbol{x}; \boldsymbol{w}) = \sum_{i = 1}^{n - 1} w^{(i)} x^{(i)} + w^{(n)}$$

In [None]:
def y_hat(x_j: np.ndarray, w: np.ndarray) -> float:
    # Evaluate the prediction function associated to a linear regression:
    # w[:-1] holds the feature weights and w[-1] the intercept.
    return np.dot(w[:-1], x_j) + w[-1]

The least squares loss and its gradient are:
$$\ell(\boldsymbol{x}, y; \boldsymbol{w}) = \left(\sum_{i = 1}^{n - 1} w^{(i)} x^{(i)} + w^{(n)} - y\right)^2$$
$$\nabla_\boldsymbol{w} \ell(\boldsymbol{x}, y; \boldsymbol{w}) = 2 \left(\sum_{i = 1}^{n - 1} w^{(i)} x^{(i)} + w^{(n)} - y\right) \begin{bmatrix}\boldsymbol{x}\\1\end{bmatrix}$$

In [None]:
def least_squares_loss(x_j: np.ndarray, y_j: float, w: np.ndarray) -> float:
  #Evaluate the least squares loss
  return (y_hat(x_j, w) - y_j)**2

In [None]:
def grad_least_squares_loss(x_j: np.ndarray, y_j: float, w: np.ndarray) -> np.ndarray:
  #Evaluate the gradient of the least squares loss
  vec = np.zeros(w.shape[0])
  vec[:-1] = x_j
  vec[-1] = 1
  return 2 * (y_hat(x_j, w) - y_j) * vec
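
A quick way to gain confidence in the analytic gradient is to compare it with central finite differences. The sketch below is self-contained: it re-declares the loss and gradient under the hypothetical names `loss` and `grad_loss`, and uses small random inputs and an arbitrary seed that are assumptions for illustration, not part of the homework data.

```python
import numpy as np

def y_hat(x_j, w):
    # Linear prediction: feature weights w[:-1] plus intercept w[-1].
    return np.dot(w[:-1], x_j) + w[-1]

def loss(x_j, y_j, w):
    # Least squares loss of a single sample.
    return (y_hat(x_j, w) - y_j) ** 2

def grad_loss(x_j, y_j, w):
    # Analytic gradient: 2 * residual * [x_j, 1].
    return 2 * (y_hat(x_j, w) - y_j) * np.append(x_j, 1.0)

def finite_difference_grad(x_j, y_j, w, h=1e-6):
    # Central differences, one coordinate at a time.
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (loss(x_j, y_j, w + e) - loss(x_j, y_j, w - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)  # arbitrary seed, unrelated to the homework seeds
x_j, y_j, w = rng.normal(size=5), rng.normal(), rng.normal(size=6)
max_abs_diff = np.max(np.abs(grad_loss(x_j, y_j, w) - finite_difference_grad(x_j, y_j, w)))
print(max_abs_diff)
```

Because the loss is quadratic in `w`, central differences are exact up to rounding, so the printed discrepancy should be tiny.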

The empirical risk is then obtained by averaging the loss over all elements of the dataset.

In [None]:
def empirical_risk(w: np.ndarray, addend: typing.Optional[int] = None) -> float:
  #Evaluate the empirical risk (if addend is None).
  if addend is None:
      m = X.shape[0]
      return 1 / m * sum(least_squares_loss(x_j, y_j, w) for (x_j, y_j) in zip(X, y))

  #For use within a stochastic method, the optional parameter addend may take an integer value.
  #In such case, the loss associated to the addend-th element is computed instead.
  else:
      return least_squares_loss(X[addend], y[addend], w)

In [None]:
def grad_empirical_risk(w: np.ndarray, addend: typing.Optional[int] = None) -> np.ndarray:
  #Evaluate the gradient of the empirical risk (if addend is None).
  if addend is None:
    m = X.shape[0]
    return 1 / m * sum(grad_least_squares_loss(x_j, y_j, w) for (x_j, y_j) in zip(X, y))

  #For use within a stochastic method, the optional parameter addend may take an integer value.
  #In such case, the gradient loss associated to the addend-th element is computed instead.

  else:
    return grad_least_squares_loss(X[addend], y[addend], w)
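
The Python-level loops above call the loss once per sample. As a possible alternative, the same quantities can be written with matrix operations. The sketch below is an assumption-laden illustration: the function names `empirical_risk_vectorized` and `grad_empirical_risk_vectorized` are hypothetical, they take the data explicitly instead of reading the global `X` and `y`, and the comparison uses small synthetic data.

```python
import numpy as np

def empirical_risk_vectorized(X, y, w):
    # Mean of squared residuals; w[:-1] are feature weights, w[-1] the intercept.
    residuals = X @ w[:-1] + w[-1] - y
    return np.mean(residuals ** 2)

def grad_empirical_risk_vectorized(X, y, w):
    # Gradient (2/m) * [X, 1]^T residuals, matching the per-sample formula.
    m = X.shape[0]
    residuals = X @ w[:-1] + w[-1] - y
    X_aug = np.hstack([X, np.ones((m, 1))])
    return 2 / m * X_aug.T @ residuals

# Synthetic data for the comparison (not the homework dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
y_demo = rng.normal(size=20)
w_demo = rng.normal(size=4)

# Reference values computed sample by sample, as in the cells above.
loop_risk = sum((np.dot(w_demo[:-1], x) + w_demo[-1] - t) ** 2
                for x, t in zip(X_demo, y_demo)) / 20
loop_grad = sum(2 * (np.dot(w_demo[:-1], x) + w_demo[-1] - t) * np.append(x, 1.0)
                for x, t in zip(X_demo, y_demo)) / 20

print(np.isclose(empirical_risk_vectorized(X_demo, y_demo, w_demo), loop_risk))
print(np.allclose(grad_empirical_risk_vectorized(X_demo, y_demo, w_demo), loop_grad))
```

The vectorized form computes the same values and mainly pays off when the full-dataset risk and gradient are evaluated at every iteration.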

3. How do the strong convexity constant $\mu$ and smoothness constant $L$ of the least squares empirical risk relate to the eigenvalues of $\boldsymbol{X}^T \boldsymbol{X}$? Based on your answer, determine if the least squares empirical risk associated to the generated dataset $\boldsymbol{X}$ is a strongly convex and smooth function.

The Hessian of the least squares empirical risk with respect to the feature weights is $\frac{2}{m} \boldsymbol{X}^T \boldsymbol{X}$, so the strong convexity constant $\mu$ is equal to $$\mu = 2 \frac{\lambda_{\min}(\boldsymbol{X}^T \boldsymbol{X})}{m}.$$

Computing this value gives $\mu \approx 1.1178$, which is positive; therefore the least squares empirical risk associated to the generated dataset $\boldsymbol{X}$ is strongly convex.

Moreover, $$L = 2 \frac{\lambda_{\max}(\boldsymbol{X}^T \boldsymbol{X})}{m},$$
which is approximately $3.2456$. Since $L$ is finite, the gradient is Lipschitz continuous, and therefore the least squares empirical risk associated to the generated dataset $\boldsymbol{X}$ is also a smooth function.

In [None]:
# Compute (X^T)X
XtX = X.T @ X
# Compute eigenvalues (XtX is symmetric, so eigvalsh returns real values in ascending order)
eigs = np.linalg.eigvalsh(XtX)

# Compute the maximum and the minimum eigenvalues
lambda_min = eigs[0]
lambda_max = eigs[-1]

# Compute mu
mu = 2 / m * lambda_min
mu  # positive, so f is strongly convex

1.1177985917378568

In [None]:
# Compute L
L = 2 / m * lambda_max
L  # finite, so f is smooth

3.245587826079118
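
The computation above uses $\boldsymbol{X}^T \boldsymbol{X}$, as the question asks. If the intercept $w^{(n)}$ is also counted among the optimization variables, one could instead form the Hessian from $\boldsymbol{X}$ with an appended column of ones. The sketch below compares the two on synthetic data; the names `X_demo` and `X_aug` and the sizes are assumptions for illustration, not the homework dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for the illustration
m_demo, d_demo = 200, 4
X_demo = rng.normal(size=(m_demo, d_demo))

# Hessian of the risk restricted to the d feature weights: (2/m) X^T X.
H = 2 / m_demo * X_demo.T @ X_demo
eigs = np.linalg.eigvalsh(H)  # real eigenvalues in ascending order
mu_demo, L_demo = eigs[0], eigs[-1]

# Variant including the intercept: append a column of ones to X.
X_aug = np.hstack([X_demo, np.ones((m_demo, 1))])
eigs_aug = np.linalg.eigvalsh(2 / m_demo * X_aug.T @ X_aug)

print("features only :", mu_demo, L_demo)
print("with intercept:", eigs_aug[0], eigs_aug[-1])
```

By eigenvalue interlacing, the augmented matrix can only have a smaller or equal minimum eigenvalue and a larger or equal maximum eigenvalue, so including the intercept can only make the estimates of $\mu$ and $L$ more conservative.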

4. Implement the gradient descent method in a Python function. Such function should take as inputs
   1. the function $f$,
   2. its gradient $\nabla f$,
   3. the value $\mu$ of the strong convexity constant,
   4. the value $L$ of the smoothness constant,
   5. a string that controls how successive step lengths are generated: `constant` to generate $\alpha_k = \frac{1}{L}$, `inverse_linear` to generate $\alpha_k = \frac{1}{\mu k + L}$, `inverse_square_root` to generate $\alpha_k = \frac{1}{\sqrt{\mu^2 k + L^2}}$,
   6. the tolerance $\varepsilon$ for the stopping criterion, based on the norm of the gradient,
   7. maximum number $K_{\max}$ of allowed iterations, and
   8. the initial condition $\boldsymbol{w}_{0}$;

   and return as outputs:
   1. the optimization variable iterations $\{\boldsymbol{w}_k\}_k$,
   2. the corresponding function values $\{f(\boldsymbol{w}_k)\}_k$ and
   3. the corresponding gradients $\{\nabla f(\boldsymbol{w}_k)\}_k$.

In [None]:
def gradient_descent(f: typing.Callable, grad_f: typing.Callable, mu: float, L: float, step_lengths: str, epsilon: float, k_max: int, w_0: np.ndarray)-> typing.Tuple[np.ndarray, np.ndarray, np.ndarray]:
  # Prepare lists collecting the required outputs over the iterations
  all_w = [w_0]
  all_f = [f(w_0)]
  all_grad_f = [grad_f(w_0)]

  # Prepare iteration counter
  k = 0

  # Use the norm of the gradient as stopping criterion.
  while np.linalg.norm(all_grad_f[k]) > epsilon:

    # Set the value of alpha depending on the chosen step length rule
    if step_lengths == "constant":
      alpha_k = 1 / L
    elif step_lengths == "inverse_linear":
      alpha_k = 1 / (mu * k + L)
    elif step_lengths == "inverse_square_root":
      alpha_k = 1 / np.sqrt(mu**2 * k + L**2)
    else:
      raise ValueError(f"Unknown step length rule: {step_lengths!r}")

    w_k = all_w[k]
    f_k = all_f[k]
    grad_f_k = all_grad_f[k]

    p_k = - grad_f_k

    # Compute w_{k+1}
    w_k_plus_1 = w_k + alpha_k * p_k

    # Update required outputs
    all_w.append(w_k_plus_1)
    all_f.append(f(w_k_plus_1))
    all_grad_f.append(grad_f(w_k_plus_1))

    # Increment iteration counter
    k += 1

    # Bail out if exceeded allowed number of iterations
    if k >= k_max:
      # print("WARNING: gradient descent exceeded number of allowed iterations")
      return np.array(all_w), np.array(all_f), np.array(all_grad_f)

  # For convenience we transform the outputs into numpy array before returning
  return np.array(all_w), np.array(all_f), np.array(all_grad_f)
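
To see how the three step length rules used in `gradient_descent` compare, they can be isolated in a small helper and tabulated for a few values of $k$. This is only a sketch: the helper name `step_length` and the constants `mu_demo` and `L_demo` are illustrative assumptions, not the values computed from the dataset.

```python
import numpy as np

def step_length(rule, k, mu, L):
    # Same three rules as in gradient_descent, written as a standalone function.
    if rule == "constant":
        return 1 / L
    if rule == "inverse_linear":
        return 1 / (mu * k + L)
    if rule == "inverse_square_root":
        return 1 / np.sqrt(mu**2 * k + L**2)
    raise ValueError(f"Unknown step length rule: {rule!r}")

mu_demo, L_demo = 1.0, 4.0  # illustrative constants only
rules = ("constant", "inverse_linear", "inverse_square_root")
for rule in rules:
    values = [float(step_length(rule, k, mu_demo, L_demo)) for k in (0, 1, 10, 100)]
    print(rule, [round(v, 4) for v in values])
```

All three rules start from $\frac{1}{L}$ at $k = 0$; afterwards the inverse linear rule shrinks fastest and the constant rule not at all, with the inverse square root rule in between.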

5. Choose step length generation as either `constant`, `inverse_linear` or `inverse_square_root`, $\varepsilon = 10^{-8}$, $K_{\max} = 600$ and $\boldsymbol{w}_0 = \boldsymbol{0}$. Run the gradient method and visualize:
   * a semilogarithmic plot of the function value $\{f(\boldsymbol{w}_k)\}_k$ versus the iteration counter $k$;
   * a semilogarithmic plot of the norm of the gradients $\{\nabla f(\boldsymbol{w}_k)\}_k$ versus the iteration counter $k$.

   Each plot should contain three curves, corresponding to the three possible step length generation choices `constant`, `inverse_linear`, `inverse_square_root`.

   Based on the information in the plots, answer the following questions:
   1. which step length generation choice would you prefer, and why?
   2. are the two plots equally helpful in answering the previous question, or is one of them more helpful than the other? If the latter, why?

In [None]:
w0 = np.zeros(X.shape[1] + 1)

Running gradient descent method with constant step length equal to $\alpha_k = \frac{1}{L}$:

In [None]:
all_w_gradient_constant, all_f_gradient_constant, all_grad_f_gradient_constant = gradient_descent(
    empirical_risk,
    grad_empirical_risk,
    mu,
    L,
    "constant",
    10**(-8),
    600,
    w0)

Running gradient descent method with inverse linear step length equal to $\alpha_k = \frac{1}{k \mu + L}$:

In [None]:
all_w_gradient_inverse_linear, all_f_gradient_inverse_linear, all_grad_f_gradient_inverse_linear = gradient_descent(
    empirical_risk,
    grad_empirical_risk,
    mu,
    L,
    "inverse_linear",
    10**(-8),
    600,
    w0)

Running gradient descent method with inverse square root step length equal to $\alpha_k = \frac{1}{\sqrt{k \mu^2 + L^2}}$:

In [None]:
all_w_gradient_inverse_square_root, all_f_gradient_inverse_square_root, all_grad_f_gradient_inverse_square_root = gradient_descent(
    empirical_risk,
    grad_empirical_risk,
    mu,
    L,
    "inverse_square_root",
    10**(-8),
    600,
    w0)

In [None]:
fig1 = go.Figure()
all_f = [all_f_gradient_constant, all_f_gradient_inverse_linear, all_f_gradient_inverse_square_root]
methods = ["constant step length",  "inverse linear step length",  "inverse square root step length"]
for run in range(3):
  fig1.add_scatter(
      x=np.arange(all_f[run].shape[0]),
      y=all_f[run],
      marker=dict(color=plotly.colors.qualitative.Set1[run], size=7),
      line=dict(color=plotly.colors.qualitative.Set1[run], width=2),
      mode="lines+markers", name= methods[run]
      )
fig1.update_layout(
    title="Function value VS iterations k",
    width=768, height=768, autosize=False
)
fig1.update_yaxes(type="log", exponentformat="power")
fig1.show()

In [None]:
fig2 = go.Figure()
all_grad_f = [all_grad_f_gradient_constant, all_grad_f_gradient_inverse_linear, all_grad_f_gradient_inverse_square_root]
methods = ["constant step length",  "inverse linear step length",  "inverse square root step length"]
for run in range(3):
  fig2.add_scatter(
      x=np.arange(all_grad_f[run].shape[0]), y=np.linalg.norm(all_grad_f[run], axis=1),
      marker=dict(color=plotly.colors.qualitative.Set1[run], size=10),
      line=dict(color=plotly.colors.qualitative.Set1[run], width=2),
      mode="lines+markers", name= methods[run]
      )
fig2.update_layout(
    title="Norm of the gradient VS iterations k",
    width=768, height=768, autosize=False
)
fig2.update_yaxes(type="log", exponentformat="power")
fig2.show()

From the graphs above, it is evident that the gradient method with a constant step length ($\alpha = \frac{1}{L}$) converges in the fewest iterations. Therefore, for gradient descent, I would select this constant value for $\alpha$. Additionally, the table below shows that the gradient method reaches the tolerance $\varepsilon = 10^{-8}$ within 600 iterations only when using the constant step length or the inverse square root step length; with the inverse linear step length the iteration budget is exhausted while the gradient norm is still about $2 \cdot 10^{-3}$.

Examining the differences between the two plots, both indicate the same preferred step length. However, the second plot (showing the norm of the gradient) highlights more clearly that the constant step length is the fastest. It also reveals that the initial decrease differs significantly among the three curves, whereas in the first plot the three curves look similar in the initial iterations. This is because the function value levels off at its minimum, about $0.945$, whereas the gradient norm keeps decreasing towards zero and therefore spreads over many decades on a logarithmic axis.
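
For context, the standard worst-case bound for gradient descent with $\alpha = \frac{1}{L}$ on a $\mu$-strongly convex, $L$-smooth function is $\|\boldsymbol{w}_k - \boldsymbol{w}^*\|^2 \le \left(1 - \frac{\mu}{L}\right)^k \|\boldsymbol{w}_0 - \boldsymbol{w}^*\|^2$. The sketch below evaluates this bound using the rounded constants from part 3; the helper name `iterations_for_reduction` and the target reduction factor are assumptions made for illustration.

```python
import math

mu, L = 1.1178, 3.2456  # rounded values of the constants computed in part 3
rate = 1 - mu / L       # worst-case contraction of the squared distance per iteration

def iterations_for_reduction(factor, rate):
    # Smallest k such that rate**k <= factor.
    return math.ceil(math.log(factor) / math.log(rate))

# Illustrative target: shrink the squared distance to the minimizer by 1e-16.
k_bound = iterations_for_reduction(1e-16, rate)
print(rate, k_bound)
```

This is an upper bound on the iteration count, so an actual run may well stop earlier.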

In [None]:
result = pd.DataFrame(columns=[0,1,2], index=["Norm of the gradient", "Iterations", "Function Value"])
for i in range(3):
  # Assign with a single .iloc call to avoid chained assignment
  result.iloc[0, i] = np.linalg.norm(all_grad_f[i][-1])
  result.iloc[1, i] = len(all_f[i])
  result.iloc[2, i] = all_f[i][-1]

result.rename({0:'alpha = 1/L', 1:'alpha_k = 1/(mu*k+L)', 2:'alpha_k = 1/(sqrt(k*mu^2 + L^2))'}, axis=1, inplace=True)



g = result.style.set_caption("GRADIENT DESCENT: Final norm of the gradient, iterations and cost function for different values of alpha") \
.highlight_min(axis=1, subset= "alpha = 1/L", props='color:black; background-color: #8deeee;')\
.format(na_rep='\\', thousands=" ")

headers = {
    'selector': 'th:not(.index_name)',
    'props': 'background-color: #009acd; color: white;text-align: center;font-weight:bold;'
}
cells = {'selector': 'td',
         'props': 'text-align: center;'}
g.set_table_styles([headers, cells])

| | `alpha = 1/L` | `alpha_k = 1/(mu*k+L)` | `alpha_k = 1/(sqrt(k*mu^2 + L^2))` |
|---|---|---|---|
| Norm of the gradient | 0.0 | 0.001968 | 0.0 |
| Iterations | 39 | 601 | 96 |
| Function Value | 0.94534 | 0.945341 | 0.94534 |


6. Implement the mini-batch stochastic gradient method in a Python function. Such function should take as inputs
   1. the number $m$ of addends,
   2. the size $m_b$ of a mini-batch,
   3. the function $f$,
   4. its gradient $\nabla f$,
   5. the value $\alpha_0$ of the first step length,
   6. a string that controls how successive step lengths are generated: `constant` to generate $\alpha_k = \alpha_0$, `inverse_linear` to generate $\alpha_k = \frac{\alpha_0}{k + 1}$, `inverse_square_root` to generate $\alpha_k = \frac{\alpha_0}{\sqrt{k + 1}}$,
   7. the tolerance $\varepsilon$ for the stopping criterion, based on the norm of the gradient,
   8. maximum number $E_{\max}$ of allowed *epochs*, and
   9. the initial condition $\boldsymbol{w}_{0}$;

   and return as outputs:
   1. the optimization variable iterations $\{\boldsymbol{w}_k\}_k$,
   2. the corresponding function values $\{f(\boldsymbol{w}_k)\}_k$ and
   3. the corresponding gradients $\{\nabla f(\boldsymbol{w}_k)\}_k$.

In [None]:
def mini_batch_stochastic_gradient(m: int, m_b: int, f: typing.Callable, grad_f: typing.Callable, alpha_0: float, epsilon: float, Emax: int, w_0: np.ndarray, step_lengths: str) -> typing.Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
  # Prepare lists collecting the required outputs over the iterations
  all_w = [w_0]
  all_f = [f(w_0)]
  all_grad_f = [grad_f(w_0)]
  E = []

  # Prepare iteration counter
  k = 0

  # Use the norm of the gradient as stopping criterion.
  while np.linalg.norm(all_grad_f[k]) > epsilon:

    # Set the step length according to the chosen rule
    if step_lengths == "constant":
      alpha_k = alpha_0
    elif step_lengths == "inverse_linear":
      alpha_k = alpha_0 / (k + 1)
    elif step_lengths == "inverse_square_root":
      alpha_k = alpha_0 / np.sqrt(k + 1)
    else:
      raise ValueError(f"Unknown step length rule: {step_lengths!r}")

    w_k = all_w[k]

    # Draw random indices
    J_k = np.random.choice(m, size=m_b, replace=False)

    # Compute the update direction
    g_k = - 1 / m_b * sum([grad_f(w_k, addend=j) for j in J_k])

    # Compute w_{k + 1}
    w_k_plus_1 = w_k + alpha_k * g_k

    # Update required outputs
    all_w.append(w_k_plus_1)
    all_f.append(f(w_k_plus_1))
    all_grad_f.append(grad_f(w_k_plus_1))

    # Epochs completed at iterate w_k (recorded before k is incremented,
    # so E ends up with one entry fewer than all_w)
    E_k = k * m_b / m

    E.append(E_k)

    # Bail out if exceeded allowed number of epochs
    if E_k >= Emax:
      #print("WARNING: stochastic gradient method exceeded number of allowed epochs")
      return np.array(all_w), np.array(all_f), np.array(all_grad_f), np.array(E)

    # Increment iteration counter
    k += 1


  # For convenience we transform the outputs into numpy array before returning
  return np.array(all_w), np.array(all_f), np.array(all_grad_f), np.array(E)
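
The update direction in `mini_batch_stochastic_gradient` averages per-sample gradients over a random batch drawn without replacement. A property worth checking is that this average is an unbiased estimate of the full gradient. The sketch below illustrates this on synthetic data; all names (`X_demo`, `per_sample_grads`, `n_draws`, and so on) and sizes are assumptions for illustration, not part of the homework.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for the illustration
m_demo, d_demo, m_b_demo = 50, 3, 5
X_demo = rng.normal(size=(m_demo, d_demo))
y_demo = rng.normal(size=m_demo)
w_demo = rng.normal(size=d_demo + 1)
X_aug = np.hstack([X_demo, np.ones((m_demo, 1))])  # last column multiplies the intercept

def per_sample_grads(w):
    # Row j holds the least squares gradient of sample j: 2 * residual_j * [x_j, 1].
    residuals = X_aug @ w - y_demo
    return 2 * residuals[:, None] * X_aug

G = per_sample_grads(w_demo)
full_grad = G.mean(axis=0)

# Average many mini-batch estimates; the mean should approach the full gradient.
n_draws = 5000
estimates = np.array([
    G[rng.choice(m_demo, size=m_b_demo, replace=False)].mean(axis=0)
    for _ in range(n_draws)
])
bias_norm = np.linalg.norm(estimates.mean(axis=0) - full_grad)

# With m_b = m the batch is the whole dataset, so the estimate is exact.
full_batch = G[rng.choice(m_demo, size=m_demo, replace=False)].mean(axis=0)
print(bias_norm, np.allclose(full_batch, full_grad))
```

Individual estimates still fluctuate around the full gradient, which is what a decreasing step length is meant to dampen.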

7. Choose $m_b = \frac{m}{10}$, $\alpha_0 = \frac{1}{L}$, step length generation choice either `constant`, `inverse_linear` or `inverse_square_root`, $\varepsilon = 10^{-3}$, $E_{\max} = 100$ and $\boldsymbol{w}_0 = \boldsymbol{0}$. Run the mini-batch stochastic gradient method and visualize:
   * a semilogarithmic plot of the norm of the gradients $\{\nabla f(\boldsymbol{w}_k)\}_k$ versus the iteration counter $k$;
   * a semilogarithmic plot of the norm of the gradients $\{\nabla f(\boldsymbol{w}_e)\}_e$ versus the epoch counter $e$.

   Each plot should contain three curves, corresponding to the three possible step length generation choices `constant`, `inverse_linear`, `inverse_square_root`.

   Based on the information in the plots, answer the following questions:
   1. which step length generation choice would you prefer, and why? Is the answer the same as the one that you gave for the gradient method?
   2. are the two plots equally helpful in answering the previous question, or is one of them more helpful than the other? If the latter, why?

   For reproducibility, please set the *seed* `np.random.seed(21 + 700)` at the beginning of the cell where the mini-batch stochastic gradient method is run.

Running mini-batch stochastic gradient method with constant step length equal to $\alpha_k = \alpha_0$:

In [None]:
np.random.seed(21 + 700)
all_w_MB_SG_constant, all_f_MB_SG_constant, all_grad_f_MB_SG_constant, E_MB_SG_constant = mini_batch_stochastic_gradient(
    m,
    m//10,
    empirical_risk,
    grad_empirical_risk,
    1/L,
    10**(-3),
    100,
    np.array(np.zeros(X.shape[1]+1)),
    "constant")

Running mini-batch stochastic gradient method with inverse linear step length equal to $\alpha_k = \frac{\alpha_0}{k+1}$:

In [None]:
np.random.seed(21 + 700)
all_w_MB_SG_inverse_linear, all_f_MB_SG_inverse_linear, all_grad_f_MB_SG_inverse_linear, E_MB_SG_inverse_linear = mini_batch_stochastic_gradient(
    m,
    m//10,
    empirical_risk,
    grad_empirical_risk,
    1/L,
    10**(-3),
    100,
    np.array(np.zeros(X.shape[1]+1)),
    "inverse_linear")

Running mini-batch stochastic gradient method with inverse square root step length equal to $\alpha_k = \frac{\alpha_0}{\sqrt{k+1}}$:

In [None]:
np.random.seed(21 + 700)
all_w_MB_SG_inverse_square_root, all_f_MB_SG_inverse_square_root, all_grad_f_MB_SG_inverse_square_root, E_MB_SG_inverse_square_root = mini_batch_stochastic_gradient(
    m,
    m//10,
    empirical_risk,
    grad_empirical_risk,
    1/L,
    10**(-3),
    100,
    np.array(np.zeros(X.shape[1]+1)),
    "inverse_square_root")

In [None]:
# Creating lists with the results for each step length rule
all_f_MB = [all_f_MB_SG_constant, all_f_MB_SG_inverse_linear, all_f_MB_SG_inverse_square_root]
all_grad_f_MB = [all_grad_f_MB_SG_constant, all_grad_f_MB_SG_inverse_linear, all_grad_f_MB_SG_inverse_square_root]
methods = ["Constant step length",  "Inverse linear step length",  "Inverse square root step length"]
E = [E_MB_SG_constant, E_MB_SG_inverse_linear, E_MB_SG_inverse_square_root]

# Plotting Norm of the gradient on the Y axis VS the iterations k on the X-axis for each step length
fig3 = go.Figure()
for run in range(3):
  fig3.add_scatter(
      x=np.arange(all_f_MB[run].shape[0]), y=np.linalg.norm(all_grad_f_MB[run], axis=1),
      marker=dict(color=plotly.colors.qualitative.Set1[run], size=10),
      line=dict(color=plotly.colors.qualitative.Set1[run], width=2),
      mode="lines+markers", name= methods[run]
      )
fig3.update_layout(
    title="Norm of the gradient VS iterations k"
)
fig3.update_yaxes(type="log", exponentformat="power")
fig3.show()

In [None]:
# Plotting Norm of the gradient on the Y axis VS the epochs E on the X-axis for each step length
fig4 = go.Figure()
for run in range(3):
  fig4.add_scatter(
      # E has one entry fewer than the gradients, so drop the last gradient norm
      x=E[run], y=np.linalg.norm(all_grad_f_MB[run], axis=1)[:len(E[run])],
      marker=dict(color=plotly.colors.qualitative.Set1[run], size=5),
      line=dict(color=plotly.colors.qualitative.Set1[run], width=1),
      mode="lines+markers", name= methods[run]
      )
fig4.update_layout(
    title="Norm of the gradient vs epochs",
)
fig4.update_yaxes(type="log", exponentformat="power")
fig4.show()

In [None]:
result = pd.DataFrame(columns=[0,1,2], index=["Norm of the gradient", "Iterations", "Function Value", "Epochs"])
for i in range(3):
  # Assign with a single .iloc call to avoid chained assignment
  result.iloc[0, i] = np.linalg.norm(all_grad_f_MB[i][-1])
  result.iloc[1, i] = len(all_grad_f_MB[i])
  result.iloc[2, i] = all_f_MB[i][-1]
  result.iloc[3, i] = E[i][-1]

result.rename({0:'Constant step length', 1:'Inverse linear step length', 2:'Inverse square root step length'}, axis=1, inplace=True)



g = result.style.set_caption("MINI BATCH STOCHASTIC GRADIENT: Final norm of the gradient, iterations, function value and epochs for different values of alpha") \
.highlight_min(axis=1, subset= "Inverse linear step length", props='color:black; background-color: #8deeee;')\
.format(na_rep='\\', thousands=" ")

headers = {
    'selector': 'th:not(.index_name)',
    'props': 'background-color: #009acd; color: white;text-align: center;font-weight:bold;'
}
cells = {'selector': 'td',
         'props': 'text-align: center;'}
g.set_table_styles([headers, cells])

| | Constant step length | Inverse linear step length | Inverse square root step length |
|---|---|---|---|
| Norm of the gradient | 1.188747 | 0.101827 | 0.128115 |
| Iterations | 1002 | 1002 | 1002 |
| Function Value | 1.258834 | 0.948837 | 0.949274 |
| Epochs | 100.0 | 100.0 | 100.0 |



Using the mini-batch stochastic gradient method, the best step length rule appears to be the inverse linear one: with it, the norm of the gradient is the smallest when the algorithm reaches the maximum number of allowed epochs. The answer therefore differs from the one given for the gradient method, where the constant step length was preferred. However, it is important to note that none of the considered cases reaches $\varepsilon = 10^{-3}$ within 100 epochs, and that the final gradient norms of the inverse linear and inverse square root step lengths are close (about $0.102$ versus $0.128$), while the constant step length stalls at about $1.19$.

The main distinction among the results is that the runs with constant step length and inverse square root step length exhibit a lot of oscillation. With the constant step length, the norm of the gradient oscillates around 1, while with the inverse square root step length it oscillates but tends to decrease. In contrast, the run with inverse linear step length does not oscillate.

Despite having different x-axes (iterations $k$ in the first plot, ranging from 0 to about 1000, and epochs in the second plot, ranging from 0 to 100), the two plots do not provide different insights. The curves appear identical, differing only in the x-axis scale. This is because $E = K \frac{m_{b}}{m}$, and since $\frac{m_{b}}{m} = 0.1$ for all three runs, the two plots convey the same information.
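
The relation $E = K \frac{m_b}{m}$ can be made concrete with a few values. This is a minimal sketch; the helper name `epochs_after` is hypothetical, and the numbers match the setting of this part ($m = 1000$, $m_b = 100$).

```python
def epochs_after(k, m_b, m):
    # Each iteration processes m_b samples, so k iterations cover k * m_b / m epochs.
    return k * m_b / m

m, m_b = 1000, 100
epochs = [epochs_after(k, m_b, m) for k in (0, 10, 500, 1000)]
print(epochs)  # [0.0, 1.0, 50.0, 100.0]
```

With a fixed ratio $\frac{m_b}{m}$, the epoch axis is just the iteration axis rescaled by a constant, which is why the two plots look the same.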

8. Choose $m_b \in \{0.05 m, 0.1 m, 0.5 m, m\}$, $\alpha_0 = \frac{1}{L}$, the `inverse_linear` step length generation, $\varepsilon = 10^{-3}$, $E_{\max} = 100$ and $\boldsymbol{w}_0 = \boldsymbol{0}$. Run the mini-batch stochastic gradient method and visualize:
   * a semilogarithmic plot of the norm of the gradients $\{\nabla f(\boldsymbol{w}_k)\}_k$ versus the iteration counter $k$;
   * a semilogarithmic plot of the norm of the gradients $\{\nabla f(\boldsymbol{w}_e)\}_e$ versus the epoch counter $e$.

   Each plot should contain four curves, corresponding to $m_b \in \{0.05 m, 0.1 m, 0.5 m, m\}$.

   Based on the information in the plots, answer the following questions:
   1. which value of $m_b$ would you prefer (especially if the evaluation of the gradient of the loss function associated to a single entry in the dataset is *very* expensive), and why?
   2. are the two plots equally helpful in answering the previous question, or is one of them more helpful than the other? If the latter, why?

   For reproducibility, please set the *seed* `np.random.seed(21 + 800)` at the beginning of the cell where the mini-batch stochastic gradient method is run.

In [None]:
all_w = [None] * 4
all_f = [None] * 4
all_grad_f = [None] * 4
all_E = [None] * 4

for run in range(4):
  np.random.seed(21 + 800)
  m_b = [int(0.05 * m), int(0.1 * m), int(0.5 * m), int(m)][run]
  all_w[run], all_f[run], all_grad_f[run], all_E[run] = mini_batch_stochastic_gradient(
    m,
    m_b,
    empirical_risk,
    grad_empirical_risk,
    1/L,
    10**(-3),
    100,
    np.array(np.zeros(X.shape[1]+1)),
    "inverse_linear")


In [None]:
result = pd.DataFrame(columns=[0,1,2, 3], index=["Norm of the gradient", "Iterations", "Function Value", "Epochs"])
for i in range(4):
  # Assign with a single .iloc call to avoid chained assignment
  result.iloc[0, i] = np.linalg.norm(all_grad_f[i][-1])
  result.iloc[1, i] = len(all_grad_f[i])
  result.iloc[2, i] = all_f[i][-1]
  result.iloc[3, i] = all_E[i][-1]

result.rename({0:'m_b = 0.05m', 1:'m_b = 0.1m', 2:'m_b = 0.5m', 3: 'm_b = m'}, axis=1, inplace=True)

g = result.style.set_caption("MINI-BATCH STOCHASTIC GRADIENT: Final norm of the gradient, iterations, function values and epochs for different values of m_b") \
.highlight_min(axis=1, subset= "m_b = 0.05m", props='color:black; background-color: #8deeee;')\
.format(na_rep='\\', thousands=" ")

headers = {
    'selector': 'th:not(.index_name)',
    'props': 'background-color: #009acd; color: white;text-align: center;font-weight:bold;'
}
cells = {'selector': 'td',
         'props': 'text-align: center;'}
g.set_table_styles([headers, cells])

| | m_b = 0.05m | m_b = 0.1m | m_b = 0.5m | m_b = m |
|---|---|---|---|---|
| Norm of the gradient | 0.122168 | 0.121140 | 0.187735 | 0.243932 |
| Iterations | 2002 | 1002 | 202 | 102 |
| Function Value | 0.950296 | 0.950441 | 0.957539 | 0.966167 |
| Epochs | 100.0 | 100.0 | 100.0 | 100.0 |


In [None]:
m_b = [0.05 * m, 0.1*m, 0.5*m, m]
col = ["mediumturquoise", "skyblue", "steelblue", "royalblue"]

fig5 = go.Figure()
for run in range(4):
    fig5.add_scatter(
        x=np.arange(all_f[run].shape[0]), y=np.linalg.norm(all_grad_f[run], axis=1),
        marker=dict(color=col[run], size=10),
        line=dict(color=col[run], width=2),
        mode="lines+markers", name= "m_b = " + str(int(m_b[run]))
    )
fig5.update_layout(
    title="Norm of the gradient VS iterations k"
)
fig5.update_yaxes(type="log", exponentformat="power")
fig5.show()

In [None]:
fig6 = go.Figure()
for run in range(4):
  fig6.add_scatter(
      # all_E has one entry fewer than the gradients, so drop the last gradient norm
      x=all_E[run], y=np.linalg.norm(all_grad_f[run], axis=1)[:len(all_E[run])],
      marker=dict(color=col[run], size=10),
      line=dict(color=col[run], width=2),
      mode="lines+markers", name= "m_b = " + str(int(m_b[run]))
      )
fig6.update_layout(
    title="Norm of the gradient VS epochs"
)
fig6.update_yaxes(type="log", exponentformat="power")

fig6.show()

As evident from the two plots and the table above, none of the runs reaches the desired tolerance within 100 epochs. At the 100th epoch, the gradient norm is smallest and quite similar for $m_b = 50$ and $m_b = 100$, with $m_b = 100$ being slightly smaller. A notable difference between the four runs is the number of iterations required to reach 100 epochs: larger values of $m_b$ require fewer iterations. Since one epoch corresponds to $\frac{m}{m_b}$ iterations, and each iteration costs $m_b$ per-sample gradient evaluations, comparing iterations across different batch sizes is not appropriate because the cost per iteration varies. Therefore, it is more informative to compare the runs using the plot with epochs on the x-axis. Based on this comparison, $m_b = 100$ is the best choice, though it is almost equal to $m_b = 50$.
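
The cost argument can be checked with simple arithmetic: for a fixed epoch budget, the number of iterations changes with $m_b$, but the total number of per-sample gradient evaluations does not. The sketch below tabulates this for the four batch sizes of this part; the variable names are illustrative, and the count ignores the full-gradient evaluations used only for monitoring.

```python
m, E_max = 1000, 100

costs = {}
for m_b in (50, 100, 500, 1000):
    iterations = E_max * m // m_b         # iterations needed to complete E_max epochs
    per_sample_grads = iterations * m_b   # per-sample gradients used by the updates
    costs[m_b] = (iterations, per_sample_grads)
    print(m_b, iterations, per_sample_grads)
```

Every batch size uses 100 000 per-sample gradient evaluations over 100 epochs, which is why epochs, rather than iterations, give a fair x-axis for the comparison.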