In [1]:
import requests
import json
from IPython.display import display, Markdown

In [2]:
query = "Make a search of 10 recent publications and write a literature review on recent advances in deep learning optimizers with an accent on novel optimization algorithms"
# Send the query to the locally running agent endpoint as a URL parameter.
params = {"query": query}
url = "http://localhost:8086/agent"
response = requests.get(url, params=params)

In [3]:
response

<Response [200]>

In [4]:
result = json.loads(response.content)["message"]

In [5]:
display(Markdown(result))

Below is a literature review synthesizing recent advances in deep learning optimization methods—with special attention to novel optimization algorithms—based on ten recent publications. The review discusses innovations from adaptive batch-size learning and second‐order and inertial methods to central–flow analyses, ADMM‐based techniques, adaptive subspace acceleration, learned optimizers with safeguards, and general–purpose solvers for constrained deep learning. The following sections offer an integrated discussion of these approaches, highlighting their motivations, methodologies, theoretical contributions, empirical findings, and open challenges.

──────────────────────────────
1. Introduction

The last few years have witnessed tremendous progress in deep learning, largely driven by improvements in optimization algorithms that train ever more complex neural networks. Early work used first–order methods such as stochastic gradient descent (SGD) and its momentum variants (Polyak, 1964; Robbins & Monro, 1951). However, as deep models grew in depth and datasets became larger, these classical methods revealed limitations such as slow convergence, sensitivity to hyperparameters, and difficulties with nonconvex and nonsmooth objectives. Consequently, researchers have explored novel methods that aim to overcome these challenges, ranging from adaptive gradient approaches and dynamic batch-size selection to second-order and quasi–Newton methods, inertial schemes, and techniques for constrained optimization. This review synthesizes ten recent contributions that represent a cross–section of the latest advances in deep learning optimizers, with an emphasis on novel optimization algorithms that incorporate adaptive mechanisms, curvature information, and hybrid strategies.

──────────────────────────────
2. Overview of First–Order Innovations and Adaptive Approaches

A comprehensive overview by Shulman (2023) revisits optimization methods in deep learning. The paper provides a detailed examination of first–order methods (SGD, Adagrad, Adadelta, RMSProp) as well as momentum–based and adaptive algorithms (Adam, Nadam, AdaMax, AMSGrad). Shulman highlights that while these methods continue to be widely used because of their simplicity and scalability, their performance is highly sensitive to hyperparameter choices such as learning rates or momentum coefficients. The work also reviews auxiliary techniques including weight initialization and normalization (e.g., batch normalization, layer normalization) which work in tandem with the optimization algorithm to stabilize training.

Soydaner (2020) extends this discussion by comparing a broad spectrum of optimization algorithms, especially adaptive gradient methods, for both supervised and unsupervised tasks. The paper shows that on simple datasets such as MNIST, adaptive methods quickly lower the loss, whereas on more challenging datasets like CIFAR‑10 or LFW, methods such as Adam and AdaMax outperform vanilla SGD. Soydaner’s extensive experiments demonstrate that while no single optimizer uniformly outperforms the rest, adaptive methods tend to balance training speed and generalization performance better than classic SGD variants. This evaluation underlines the practical importance of adapting algorithmic parameters on the fly as well as the need for guarding mechanisms in some settings (discussed later).
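To make the contrast between plain gradient descent and an adaptive method concrete, the following is a minimal sketch of the two update rules on a toy ill-conditioned quadratic. The objective, the function names, and the hyperparameter values are illustrative assumptions and are not taken from Shulman (2023) or Soydaner (2020); the gradients here are exact rather than stochastic.

```python
import math

def loss(w):
    # Toy objective (an assumption for illustration): curvature 100 along w[0], 1 along w[1].
    return 0.5 * (100.0 * w[0] ** 2 + w[1] ** 2)

def grad(w):
    # Exact gradient of the toy objective above.
    return [100.0 * w[0], w[1]]

def sgd(w, lr=0.01, steps=200):
    # Plain gradient descent: one global step size for every coordinate.
    w = list(w)
    for _ in range(steps):
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    # Adam: per-coordinate step sizes from running moment estimates of the gradient.
    w = list(w)
    m = [0.0] * len(w)  # first-moment (mean) estimate
    v = [0.0] * len(w)  # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad(w)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction compensates for the zero initialization of m and v.
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        w = [wi - lr * mh / (math.sqrt(vh) + eps)
             for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w

start = [1.0, 1.0]
print("initial loss:", loss(start))
print("SGD loss:    ", loss(sgd(start)))
print("Adam loss:   ", loss(adam(start)))
```

The point of the sketch is the structural difference: SGD's single learning rate is limited by the steepest direction, whereas Adam rescales each coordinate by its own gradient history.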

──────────────────────────────
3. Adaptive Batch Size and Instance–Specific Acceleration

Traditional SGD methods rely on a fixed batch size, which may be suboptimal given that different phases of training (or different datasets) require a different trade–off between variance reduction and computational cost. Alfarra et al. (2020) propose an adaptive approach to learning the optimal batch size online. Building on theoretical analyses relating the batch size, gradient noise, and smoothness, the authors design an algorithm that estimates, at each iteration, a batch size that minimizes an upper bound on the iteration complexity. Their method dynamically updates the batch size along with the learning rate and achieves near-optimal behavior when compared with extensive grid–search baselines. This dynamic adaptation is especially valuable in strongly convex and smooth settings and suggests a pathway for designing instance–adaptive optimizers. The adaptive batch size mechanism underscores the broader trend of moving from conservative “worst–case” fixed parameters to instance–adaptive schemes that exploit the local geometry of the loss landscape.
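The trade-off between gradient noise and batch size can be illustrated with a short sketch. Note that this is not the rule derived by Alfarra et al. (2020); it swaps in a simpler, well-known variance-based heuristic (often called a norm test): pick the smallest batch size b for which the estimated variance of the mini-batch gradient, sigma^2 / b, is at most theta^2 times the squared norm of the mean gradient. The function name, the `theta` parameter, and the clipping bounds are hypothetical.

```python
import math

def suggest_batch_size(per_sample_grads, theta=0.5, b_min=1, b_max=1024):
    """Variance-based batch-size heuristic (illustrative, not the paper's rule).

    per_sample_grads: list of per-example gradient vectors (lists of floats).
    Returns the smallest b with  sigma^2 / b <= theta^2 * ||mean_grad||^2,
    clipped to [b_min, b_max].
    """
    n = len(per_sample_grads)
    if n < 2:
        raise ValueError("need at least two per-sample gradients")
    dim = len(per_sample_grads[0])
    mean = [sum(g[j] for g in per_sample_grads) / n for j in range(dim)]
    # Unbiased estimate of the total (trace) variance of a single-sample gradient.
    var = sum(
        sum((g[j] - mean[j]) ** 2 for j in range(dim)) for g in per_sample_grads
    ) / (n - 1)
    mean_sq = sum(m * m for m in mean)
    if mean_sq == 0.0:
        # No usable signal in the mean gradient: fall back to the largest batch.
        return b_max
    b = math.ceil(var / (theta ** 2 * mean_sq))
    return max(b_min, min(b_max, b))

# Noisy gradients call for a larger batch than perfectly consistent ones.
print(suggest_batch_size([[1.0], [3.0]]))
print(suggest_batch_size([[2.0], [2.0]]))
```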

──────────────────────────────
4. Second–Order and Inertial Algorithms

Several recent contributions turn to second-order information and inertial methods for more effective navigation of the nonconvex loss surfaces of deep neural networks. One challenge with classic second–order methods is that they directly compute or store full Hessians, which is computationally prohibitive in deep learning. Gomes (2025) introduces AdaFisher, a novel adaptive second–order optimizer that exploits a diagonal block–Kronecker approximation of the Fisher Information Matrix (FIM). By approximating the curvature information efficiently, AdaFisher “preconditions” the gradient update in a manner similar to Adam but with richer local geometry information. The analysis shows that this approach leads to faster convergence and improved generalization compared to state–of–the–art first–order methods.
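The idea of preconditioning a gradient with Fisher information can be sketched in a few lines. The example below is a deliberately simplified stand-in, not AdaFisher: it uses a plain diagonal empirical Fisher estimate (the mean of squared per-sample gradients) instead of the diagonal block-Kronecker approximation described above. The function name and the `damping` parameter are assumptions made for illustration.

```python
def diag_fisher_step(theta, per_sample_grads, lr=0.1, damping=1e-3):
    """One update preconditioned by a diagonal empirical Fisher estimate.

    theta: current parameters (list of floats).
    per_sample_grads: list of per-example gradient vectors.
    damping: small positive constant that keeps the division well defined.
    """
    n = len(per_sample_grads)
    dim = len(theta)
    # Mean gradient over the mini-batch.
    g = [sum(gs[j] for gs in per_sample_grads) / n for j in range(dim)]
    # Diagonal empirical Fisher: mean of squared per-sample gradients.
    fisher = [sum(gs[j] ** 2 for gs in per_sample_grads) / n for j in range(dim)]
    # Scale each coordinate of the gradient by its inverse curvature estimate.
    return [t - lr * gj / (fj + damping) for t, gj, fj in zip(theta, g, fisher)]

new_theta = diag_fisher_step([0.0, 0.0], [[2.0, 0.0], [2.0, 0.0]], lr=1.0, damping=1.0)
print(new_theta)
```

Coordinates with large estimated curvature take smaller steps, which is the sense in which such a preconditioner carries more local geometry than a single global learning rate.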

Complementing these developments, Castera et al. (2019) propose the Inertial Newton Algorithm (INNA), which combines gradient–descent and Newton-like behaviors with inertial effects. By leveraging techniques from nonsmooth analysis, including Clarke subdifferentials and the notion of D–criticality, INNA is designed to work with the nonsmooth, nonconvex loss functions typical in deep learning. The paper provides convergence analyses using continuous–time formulations and Lyapunov methods under moderate assumptions. Empirically, INNA competes with conventional optimizers like SGD, Adagrad, and Adam on popular benchmarks. These results suggest that inertial and curvature-based acceleration can offer more stable convergence and, in some cases, superior performance.

──────────────────────────────
5. Central Flow Analysis and Implicit Curvature Regularization

Innovative ideas for understanding optimizer dynamics have also emerged. Cohen et al. (2024) propose a “central flow” framework, which is a differential equation that characterizes the time–averaged (or smoothed) trajectory of an optimizer operating at the so–called “edge of stability.” Traditional gradient flow fails to capture the oscillatory behavior visible in practical training, whereas the central flow—derived from a third–order Taylor expansion—predicts macroscopic behavior, loss descent, and even the variation in the top eigenvalue of the Hessian. Their analysis shows that the inherent oscillations in gradient descent effectively lead to implicit curvature regularization, an effect that can “push” the optimizer toward regions of the loss surface where larger stable steps are feasible. This insight offers a theoretical justification for several empirical observations in adaptive methods and indicates how hyperparameters such as the learning rate or EMA decay in RMSProp might modulate stability.
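The stability threshold behind the "edge of stability" can be seen on the simplest possible example. The sketch below does not implement the central flow itself; it only illustrates the classical fact that gradient descent on a quadratic with curvature (sharpness) lambda contracts when the learning rate is below 2 / lambda and oscillates with growing amplitude above it. The function name and the numeric values are illustrative assumptions.

```python
def run_gd(sharpness, lr, x0=1.0, steps=50):
    """Gradient descent on f(x) = 0.5 * sharpness * x**2.

    Each step multiplies x by (1 - lr * sharpness), so the iterates
    shrink when lr < 2 / sharpness and blow up when lr > 2 / sharpness.
    """
    x = x0
    for _ in range(steps):
        x = x - lr * sharpness * x  # gradient of f is sharpness * x
    return x

sharpness = 10.0                              # stability threshold is 2 / 10 = 0.2
print(abs(run_gd(sharpness, lr=0.19)))        # just below the threshold: decays
print(abs(run_gd(sharpness, lr=0.21)))        # just above the threshold: grows
```

In both runs the iterate flips sign at every step; only the amplitude trend differs. The central flow analysis summarized above concerns what happens when training hovers near this threshold on a non-quadratic loss.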

──────────────────────────────
6. Adaptive Subspace Search and Instance–Faster Algorithms

Building on the notion that worst–case complexity bounds are often pessimistic, Liu et al. (2023) propose accelerated gradient algorithms that incorporate an “adaptive subspace search” to achieve instance–faster convergence. Observing that in many machine learning tasks the Hessian’s eigenvalues drop sharply, the authors introduce parameters (α, τ_α) to quantify the degeneracy in the Hessian spectrum. Their methods first extract the subspace corresponding to directions with significant curvature and then apply accelerated routines within that subspace. In quadratic settings (e.g., linear regression), the proposed method improves the gradient complexity from the classical O(μ^(-1/2)) to nearly optimal O(μ^(-1/3)). Extensions to general convex and nonconvex problems, through a combination with cubic regularization and large–step prox–Newton schemes, further illustrate that tailored optimization algorithms can outperform worst–case–optimal methods by adapting to instance–specific structure.

──────────────────────────────
7. ADMM and Constrained Optimization in Deep Learning

The alternating direction method of multipliers (ADMM) has historically been valued for its ability to handle constraints and overcome issues such as gradient saturation. Zeng et al. (2019) leverage ADMM in a novel way for training deep neural networks with sigmoid–type activation functions. Their “sigmoid–ADMM pair” circumvents the saturation issues inherent to sigmoid activations and provides global convergence guarantees that rely on the Kurdyka–Łojasiewicz inequality. This work demonstrates that methods other than purely gradient–based updates can be competitive in deep learning settings when constraints (or nonlinear activation properties) are a limiting factor. In related work, Liang, Mitchell, and Sun (2022) introduce NCVX and its PyGRANSO solver—a general–purpose optimization package for constrained deep learning problems. NCVX combines auto–differentiation, GPU acceleration and a quasi–Newton SQP framework to handle nonsmooth and nonconvex constraints directly. Such tools are especially relevant in trustworthy AI and scientific domains where explicit constraints must be respected.
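For readers unfamiliar with ADMM, the alternating structure is easiest to see on a toy problem. The sketch below is generic scaled-form ADMM applied to a scalar lasso-type problem, minimizing 0.5 * (x - a)^2 + lam * |z| subject to x = z. It is not the sigmoid-ADMM pair of Zeng et al. (2019) and has nothing to do with the NCVX or PyGRANSO APIs; the function names and default values are assumptions.

```python
def soft_threshold(v, k):
    # Proximal operator of k * |.|: shrink v toward zero by k.
    if v > k:
        return v - k
    if v < -k:
        return v + k
    return 0.0

def admm_scalar_lasso(a, lam, rho=1.0, iters=100):
    """Scaled-form ADMM for  min 0.5*(x - a)**2 + lam*|z|  s.t.  x = z."""
    x = z = u = 0.0
    for _ in range(iters):
        # x-update: minimize the smooth term plus the quadratic penalty.
        x = (a + rho * (z - u)) / (1.0 + rho)
        # z-update: proximal step on the nonsmooth term.
        z = soft_threshold(x + u, lam / rho)
        # Dual update: accumulate the constraint violation x - z.
        u = u + x - z
    return z

# The known closed-form solution is soft_threshold(a, lam).
print(admm_scalar_lasso(3.0, 1.0), soft_threshold(3.0, 1.0))
print(admm_scalar_lasso(0.5, 1.0), soft_threshold(0.5, 1.0))
```

Each subproblem is solved in closed form without taking a gradient step on the full objective, which is the property that makes ADMM-style splitting attractive when gradients of parts of the objective are poorly behaved.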

──────────────────────────────
8. Learned Optimizers and Effective Safeguards

Learned optimizers (L2Os) represent a meta–learning approach in which neural networks are trained to produce update rules. Although promising in terms of fast initial convergence, learned optimizers have been observed to plateau or even diverge during long–run training or when operating out–of–distribution. Prémont–Schwarz et al. (2022) propose a simple guarding mechanism—Loss–Guarded L2O (LGL2O)—that hybridizes a learned optimizer with a conventional optimizer (such as SGD or Adam). The method evaluates the anticipated loss reduction under both the learned update and the fallback update, selecting the candidate that yields a lower loss. The theoretical guarantees show that when the fallback optimizer is provably convergent, the hybrid method inherits its convergence properties. Empirical tests across in– and out–of–distribution settings demonstrate that LGL2O combines the fast initial progress of L2Os with the asymptotic stability of classical methods.
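The guarding idea described above, comparing the loss at the learned candidate with the loss at the fallback candidate and keeping the better one, can be sketched as follows. This is a minimal illustration of that selection step only, under the assumption that both candidates can be evaluated cheaply; the function names are hypothetical, and the "learned" updates below are hand-written stand-ins rather than trained networks.

```python
def guarded_step(params, loss_fn, learned_update, fallback_update):
    """Take whichever candidate update yields the lower loss.

    learned_update / fallback_update: callables mapping params -> new params.
    """
    candidate_learned = learned_update(params)
    candidate_fallback = fallback_update(params)
    if loss_fn(candidate_learned) < loss_fn(candidate_fallback):
        return candidate_learned
    return candidate_fallback

def loss_fn(x):
    # Toy scalar objective with its minimum at 0.
    return x * x

def gd_update(x):
    # Fallback: one gradient-descent step with learning rate 0.1.
    return x - 0.1 * 2.0 * x

def good_learned(x):
    # Stand-in for a learned update that makes fast progress.
    return 0.1 * x

def bad_learned(x):
    # Stand-in for a learned update that diverges out of distribution.
    return 3.0 * x

print(guarded_step(1.0, loss_fn, good_learned, gd_update))  # keeps the learned step
print(guarded_step(1.0, loss_fn, bad_learned, gd_update))   # falls back to gradient descent
```

Because the accepted step is never worse in loss than the fallback step from the same point, the hybrid can exploit a good learned update while being protected from a diverging one.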

──────────────────────────────
9. Discussion and Future Directions

Taken together, these works indicate a strong trend away from static, worst–case–oriented design and toward algorithms that adapt on multiple levels. Adaptive batch–size methods, second–order approximations (via the Fisher or Hessian information) combined with inertial and curvature–aware adjustments, and subspace–focused acceleration are all directed at tailoring the algorithm to the structure inherent to the problem. Meanwhile, new modeling approaches like central flows provide more refined theoretical insights into the “edge–of–stability” regime frequently encountered in practice. In the realm of constrained optimization, both ADMM for saturating activations and the NCVX toolkit offer promising ways to incorporate explicit constraints in deep learning without incurring prohibitive manual effort.

Notwithstanding these advances, several challenges remain. For instance, extending frameworks like central flows to stochastic mini–batch settings is an open research direction. Similarly, while learned optimizers have made exciting progress, guaranteeing their convergence across a range of out–of–distribution tasks remains a key concern. Finally, developing solvers that integrate second–order information with the computational efficiency required for very large networks continues to be an active area of research.

──────────────────────────────
10. Conclusion

Recent advances in deep learning optimizers illustrate a multifaceted approach to overcoming intrinsic challenges. Innovations range from adaptive batch–size learning and second–order inertial methods to theoretical tools that describe the time‐averaged dynamics of oscillatory updates. In parallel, new frameworks for handling constraints and learning update rules promise to broaden the applicability and robustness of deep learning techniques. Collectively, these contributions not only refine our theoretical understanding of optimization in nonconvex settings but also highlight practical pathways toward more efficient, stable, and adaptive training algorithms.

──────────────────────────────
References

Alfarra, M., Hanzely, S., Albasyoni, A., Ghanem, B., & Richtárik, P. (2020). Adaptive Learning of the Optimal Batch Size of SGD. Retrieved from https://arxiv.org/pdf/2005.01097v2

Castera, C., Bolte, J., Févotte, C., & Pauwels, E. (2019). An Inertial Newton Algorithm for Deep Learning. Retrieved from https://arxiv.org/pdf/1905.12278v6

Cohen, J. M., Damian, A., Talwalkar, A., Kolter, J. Z., & Lee, J. D. (2024). Understanding Optimization in Deep Learning with Central Flows. Retrieved from https://arxiv.org/pdf/2410.24206v2

Gomes, D. M. (2025). Towards Practical Second-Order Optimizers in Deep Learning: Insights from Fisher Information Analysis. Retrieved from https://arxiv.org/pdf/2504.20096v1

Heaton, H., Chen, X., Wang, Z., & Yin, W. (2020). Safeguarded Learned Convex Optimization. (Related ideas discussed in Prémont–Schwarz et al., 2022).

Liang, B., Mitchell, T., & Sun, J. (2022). NCVX: A General-Purpose Optimization Solver for Constrained Machine and Deep Learning. Retrieved from https://arxiv.org/pdf/2210.00973v2

Liu, Y., Zhao, H., Xu, Y., Yue, P., & Fang, C. (2023). Accelerated Gradient Algorithms with Adaptive Subspace Search for Instance-Faster Optimization. Retrieved from https://arxiv.org/pdf/2312.03218v1

Prémont-Schwarz, I., Vítků, J., & Feyereisl, J. (2022). A Simple Guard for Learned Optimizers. Retrieved from https://arxiv.org/pdf/2201.12426v3

Shulman, D. (2023). Optimization Methods in Deep Learning: A Comprehensive Overview. Retrieved from https://arxiv.org/pdf/2302.09566v2

Soydaner, D. (2020). A Comparison of Optimization Algorithms for Deep Learning. Retrieved from https://arxiv.org/pdf/2007.14166v1

Zeng, J., Lin, S.-B., Yao, Y., & Zhou, D.-X. (2019). On ADMM in Deep Learning: Convergence and Saturation-Avoidance. Retrieved from https://arxiv.org/pdf/1902.02060v3

──────────────────────────────
Final Remarks

The examined contributions reveal a vibrant research landscape in which adaptive, second–order, inertial, and hybrid approaches are actively being developed to address the practical and theoretical challenges in deep learning optimization. As research continues to bridge the gap between worst-case theory and instance-specific performance, these novel optimization algorithms are poised to support the next generation of deep learning architectures and applications.