Great job! You've finished your linear algebra and calculus, a good amount of probability theory and statistics, and maybe even a course on optimization. You've gone through your assignments, written some code (perhaps with PyTorch), and done some projects.
Now what? Before you jump on the hype train of LLaMAs, Stable Diffusion, NeRFs, and so on, you should be aware: the field is changing rapidly. I've curated some of the unique and foundational papers that you should read to understand the field better. By foundational, I mean insightful papers that explore ideas generally applicable to many different real-life problems. Deep learning and neural networks are exciting and have far more interesting literature than just theory (though I do think theory is incredibly important).
You might even consider this a survey of surveys in neural networks, as much of the literature I will mention here is quite unique.
[WIP. I WILL ADD THEM AS I FIND MORE TIME]
We need to rethink generalization
Tendency to find low-rank solutions
Tendency to find smoother solutions
Tendency to find low-frequency solutions
Learning in High Dimension Always Amounts to Extrapolation:
Deep Learning without Poor Local Minima:
Scaling law from the dimensionality of the data:
Pruning Data to improve scaling: Quality of the data matters, even in the large-scale setting!
Scaling Reward Model : Reward modeling + RL is a promising approach in deep learning, popularized by the famous ChatGPT (InstructGPT). Scaling helps, and we should limit the KL divergence from the initial policy during optimization.
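To make the KL part concrete, here is a minimal sketch of the KL-shaped reward used in InstructGPT-style RLHF. The function name, tensor shapes, and the beta value are my own illustration, not the paper's exact code:

```python
import torch

def kl_shaped_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.02):
    """RL reward: the reward model's score, penalized by an estimate of the
    KL divergence between the current policy and the frozen reference model."""
    # per-token KL estimate for the sampled tokens: log pi(a) - log pi_ref(a)
    kl = policy_logprobs - ref_logprobs
    return rm_score - beta * kl.sum(dim=-1)

# toy usage with made-up numbers
rm_score = torch.tensor([1.3])             # scalar score from the reward model
policy_lp = torch.tensor([[-0.5, -1.2]])   # log-probs of sampled tokens (policy)
ref_lp = torch.tensor([[-0.7, -1.0]])      # same tokens under the reference model
print(kl_shaped_reward(rm_score, policy_lp, ref_lp))
```

The penalty keeps the optimized policy from drifting too far from where the reward model is still trustworthy.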
In-batch variance : the smaller, the better?
How large should your batch be?
Larger batch size, larger learning rate?
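The simplest heuristic I know of here is the linear scaling rule from the large-minibatch SGD literature; the numbers below are made up for illustration:

```python
# linear scaling rule: if the batch grows by a factor k, scale the LR by k too
base_lr, base_batch_size = 0.1, 256           # hypothetical reference recipe
batch_size = 2048                             # new, larger batch
lr = base_lr * batch_size / base_batch_size   # 0.1 * 8 = 0.8
```

This is a heuristic, not a law; in practice it usually needs learning-rate warmup and breaks down at very large batch sizes.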
Emergent Capabilities vs Inverse scaling :
Shortcut learning, Gradient Starvation : Neural networks tend to "cheat" during learning when they have the chance.
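Here is a tiny synthetic illustration I put together (not from any particular paper): the last feature is a spurious shortcut that perfectly encodes the label at train time, and a linear model happily leans on it instead of the true rule:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d = 1000, 10
X = torch.randn(n, d)
y = (X[:, 0] > 0).long()          # the "true" rule: the sign of feature 0
X[:, -1] = y.float() * 2 - 1      # the shortcut: feature d-1 encodes the label

model = nn.Linear(d, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(500):
    loss = nn.functional.cross_entropy(model(X), y)
    opt.zero_grad(); loss.backward(); opt.step()

# at test time the shortcut is absent (pure noise), and accuracy falls well
# below training accuracy: the easy feature starved the true rule of gradient
X_test = torch.randn(200, d)
y_test = (X_test[:, 0] > 0).long()
acc = (model(X_test).argmax(1) == y_test).float().mean().item()
print(f"test accuracy without the shortcut: {acc:.2f}")
```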
Dataset Distillation : Did you know that you can train the dataset itself, in reverse, so that a neural network can learn from it faster? The field has grown a lot.
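As a sketch of the basic bilevel idea (a simplified linear-model version of my own, not any paper's exact algorithm): take one differentiable SGD step on the synthetic data, then backpropagate the real-data loss through that step into the synthetic examples themselves:

```python
import torch

torch.manual_seed(0)
X_real = torch.randn(256, 10)
y_real = (X_real.sum(dim=1) > 0).long()

X_syn = torch.randn(10, 10, requires_grad=True)   # 10 learnable synthetic points
y_syn = torch.arange(10) % 2                      # fixed synthetic labels
W0 = 0.1 * torch.randn(10, 2)                     # fixed model initialization

opt_syn = torch.optim.Adam([X_syn], lr=1e-2)
loss_fn = torch.nn.functional.cross_entropy

for step in range(200):
    w = W0.clone().requires_grad_(True)
    # inner step: one SGD update of the model on the synthetic data,
    # kept differentiable so gradients can flow back into X_syn
    inner_loss = loss_fn(X_syn @ w, y_syn)
    g, = torch.autograd.grad(inner_loss, w, create_graph=True)
    w_updated = w - 0.1 * g
    # outer step: the updated model should do well on *real* data
    outer_loss = loss_fn(X_real @ w_updated, y_real)
    opt_syn.zero_grad(); outer_loss.backward(); opt_syn.step()
```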
Localization and Edit : Maybe this is too narrow, but the way they do causal tracing to find which layer is responsible for a certain output is very generally applicable.
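A toy analogue of that causal-tracing / activation-patching idea (my own simplification on a tiny MLP; the actual papers do this per token position in a transformer): run the model on a corrupted input, splice part of the clean run's activation back in at each layer, and see which layer's patch pulls the output back toward the clean one:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(),
                      nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
x_clean = torch.randn(1, 8)
x_corrupt = x_clean + torch.randn(1, 8)

# 1) record every layer's activation on the clean run
clean_acts = {}
def save_hook(idx):
    def hook(module, inputs, output):
        clean_acts[idx] = output.detach()
    return hook
handles = [m.register_forward_hook(save_hook(i)) for i, m in enumerate(model)]
model(x_clean)
for h in handles:
    h.remove()

# 2) rerun on the corrupted input, restoring a few clean units at one layer
def patch_hook(idx):
    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, :4] = clean_acts[idx][:, :4]  # partial restoration
        return patched                           # returned value replaces output
    return hook

with torch.no_grad():
    y_clean = model(x_clean)
    print("no patch:", (model(x_corrupt) - y_clean).abs().sum().item())
    for i, m in list(enumerate(model))[:-1]:     # patch each hidden layer in turn
        h = m.register_forward_hook(patch_hook(i))
        dist = (model(x_corrupt) - y_clean).abs().sum().item()
        h.remove()
        print(f"patch layer {i}: distance to clean output = {dist:.4f}")
```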
Grokking :
Bootstrapping, self-distillation, ensembles... Learning from itself? How does that even make sense? :
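One concrete instance that makes it feel less magical is the standard knowledge-distillation loss; in self-distillation the teacher is simply an earlier copy of the same architecture. A minimal sketch, with temperature and mixing weight chosen arbitrarily by me:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # soft targets: match the teacher's temperature-softened distribution
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T
    # hard targets: the usual cross-entropy on the true labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# toy usage
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distill_loss(student_logits, teacher_logits, labels))
```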
Adversarial Examples Are Not Bugs, They Are Features:
Infinite-width Neural Networks : Of course, we see that neural networks work well in practice, especially in the large-scale setting. But since their exact training dynamics are analytically intractable, we can't really say much about them. Infinite-width neural networks, in contrast, are much easier to work with. NNGP, NTK, and Tensor Programs are some of the most fundamental works in this field. It may be a bit too math-heavy; I recommend you read the blog post by Lilian Weng (as always) first.
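If you want something hands-on before the math: the *empirical* NTK is just an inner product of parameter gradients at two inputs. A minimal sketch for a tiny scalar-output network (architecture and sizes are my own toy choice):

```python
import torch
import torch.nn as nn

# empirical NTK entry: Theta(x, x') = <d f(x)/d theta, d f(x')/d theta>
net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 1))

def param_grad(x):
    net.zero_grad()
    net(x).sum().backward()
    return torch.cat([p.grad.flatten().clone() for p in net.parameters()])

x1, x2 = torch.randn(1, 3), torch.randn(1, 3)
print("empirical NTK entry:", (param_grad(x1) @ param_grad(x2)).item())
```

As the width goes to infinity (under the right parameterization), this kernel stops depending on the random initialization and stays essentially fixed during training, which is what makes the infinite-width analysis tractable.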
Infinite Matrix Factorizations : Alternatively, the training dynamics of matrix factorization give you a surprisingly good grasp of what might be happening inside a neural network.
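For intuition, here is an illustrative experiment (my own toy version of the deep-matrix-factorization setting): complete a rank-1 matrix from partial observations using an overparameterized product W2 @ W1. Gradient descent from small initialization tends to find a low-rank completion even though nothing in the loss enforces it:

```python
import torch

torch.manual_seed(0)
d = 20
target = torch.randn(d, 1) @ torch.randn(1, d)   # rank-1 ground truth
mask = (torch.rand(d, d) < 0.3).float()          # observe ~30% of the entries

W1 = (1e-2 * torch.randn(d, d)).requires_grad_(True)
W2 = (1e-2 * torch.randn(d, d)).requires_grad_(True)
opt = torch.optim.SGD([W1, W2], lr=0.2)

for step in range(5000):
    resid = (W2 @ W1 - target) * mask            # loss only on observed entries
    loss = (resid ** 2).sum() / mask.sum()
    opt.zero_grad(); loss.backward(); opt.step()

svals = torch.linalg.svdvals((W2 @ W1).detach())
print("top singular values:", svals[:4])         # typically one dominant value
```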
Common variable trick (I made this term up):
Mechanistic interpretability (CNN, Transformer):
Why not just learn from expert data?