If factorization of computation is possible, very small models equipped with a search algorithm will perform arbitrarily complex computations.
- `bin_pack/`: old BinPack code, may be reused later.
- `c_vpr/`: code for the C-VPR task with M hops.
- `multiplication/`: code for the N*N multiplication task.
- `regression/`: old quasi-linear regression code, may be reused later.
- [ ]: implement pmap instead of jit
- [ ]: add gradient accumulation to allow arbitrarily high batch sizes
- [x]: implement cycle env
- [x]: implement transformer with chain of thought
- [x]: implement learning rate warmup
- [x]: add training on Cycle env (+ find good difficulty)
- [x]: implement chain of thought training
- [x]: implement RL training
- [x]: implement variance reduction for RL training (e.g. using a baseline)
- [x]: log cot policy entropy
- [ ]: speed up compilation time by using flax `nn.scan` instead of a Python for loop in the COT Module call (see the sketch after this list)
- [x]: implement CLS token (instead of mean embedding output)
- [x]: pass num_hops as input (in case of curriculum learning)
- [x]: extend the COT chain to include the last hop (the label)
- [x]: pad cot chain to allow curriculum learning and cot/rl mode
- [x]: add "start_token" as a possible action for sampled cot_tokens (to use as explicit padding)
- [x]: implement policy improvement through search or POPPY-like algo
- [x]: log some examples of inputs and COT during training
- [ ]: evaluate model by altering the generated CoT to see if the last token is a function of the previous ones (see if accuracy decreases)
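A minimal sketch of the open `nn.scan` item above, assuming a hypothetical `CoTBlock` as a stand-in for the actual CoT module: scanning the weight-shared block traces it once instead of once per step, which should cut compilation time compared to an unrolled Python for loop.

```python
import flax.linen as nn
import jax
import jax.numpy as jnp

class CoTBlock(nn.Module):
    """One weight-shared chain-of-thought step (hypothetical stand-in)."""
    features: int

    @nn.compact
    def __call__(self, carry, _):
        carry = nn.Dense(self.features)(nn.relu(nn.Dense(self.features)(carry)))
        return carry, None  # (new carry, no per-step output)

class ScannedCoT(nn.Module):
    features: int
    num_steps: int

    @nn.compact
    def __call__(self, x):
        # Trace the block once and scan it num_steps times with shared weights,
        # instead of unrolling a Python for loop at trace time.
        scan = nn.scan(
            CoTBlock,
            variable_broadcast="params",     # same parameters at every step
            split_rngs={"params": False},
            length=self.num_steps,
        )
        x, _ = scan(self.features)(x, None)
        return x

# Usage sketch:
model = ScannedCoT(features=64, num_steps=8)
params = model.init(jax.random.PRNGKey(0), jnp.zeros((2, 64)))
out = model.apply(params, jnp.zeros((2, 64)))
```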
Should intermediate computations be discretely sampled or continuous (e.g. full embedding)? If the former, one will need RL to train the system. If the latter, one can backpropagate through the chain of thought which looks more like an RNN (with attention layers if one uses a small transformer as the factorized model). It is equivalent to a full transformer with complete weight sharing between layers.
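As a concrete illustration of the two options (hypothetical names, not the repo's actual modules), the difference is whether gradients can flow through an intermediate step:

```python
import jax
import jax.numpy as jnp

def discrete_step(rng, logits, embedding_table):
    # Sample a discrete intermediate token. Sampling blocks gradients,
    # so the chain must be trained with RL (e.g. REINFORCE).
    token = jax.random.categorical(rng, logits)
    return embedding_table[token]

def continuous_step(logits, embedding_table):
    # Keep the full softmax-weighted embedding. The chain stays differentiable,
    # so one can backpropagate through it as through an RNN with shared weights.
    return jax.nn.softmax(logits) @ embedding_table
```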
The above question triggers this one: can a deep model with lots of layers be factorized into a small model repeated many times? This links to adaptive compute, universal transformers, mixture of experts, etc. The number of repetitions of the small model may have to exceed the ratio deep_model_size / small_model_size, since the factorization need not be perfectly efficient (e.g. a 2-layer block may need more than 12 repetitions to match a 24-layer model of the same width).
What is the best algorithm for exploiting a factorized model? A deep transformer could be seen as doing chain of thought layer by layer. Doing a tree search on top of a factorized model would probably increase its abilities and does not have any equivalent in the NN architecture domain.
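A toy sketch of such a search, assuming a hypothetical `step_logits(params, chain)` function that returns the small model's next-token logits given the chain so far:

```python
import jax
import jax.numpy as jnp

def beam_search(params, step_logits, start_token, num_steps, beam_width):
    # Each beam is a (chain, cumulative log-probability) pair.
    beams = [([start_token], 0.0)]
    for _ in range(num_steps):
        candidates = []
        for chain, score in beams:
            # step_logits: hypothetical call returning the next-token logits
            # of the small factorized model given the chain so far.
            log_probs = jax.nn.log_softmax(step_logits(params, jnp.array(chain)))
            for token in jnp.argsort(log_probs)[-beam_width:]:
                candidates.append((chain + [int(token)], score + float(log_probs[token])))
        # Keep only the beam_width most promising chains.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams
```

Beam search is only one option; MCTS or a POPPY-style population of chains would slot into the same interface.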
Can the results of these intermediate computations be discovered efficiently, i.e. using only supervision from the final output? Chain of thought works well because the chains are supervised, i.e. they exist in the training data. Can we discover these chains without chain-level supervision, potentially via reinforcement learning?
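A minimal sketch of the RL route (the `apply_fn` signature is an assumption, standing for a model that samples a CoT chain and predicts the final answer): the only supervision is whether the final answer is correct, with a batch-mean baseline for variance reduction as in the RL items of the TODO list.

```python
import jax
import jax.numpy as jnp

def reinforce_loss(params, apply_fn, rng, inputs, labels):
    # apply_fn is assumed to return (cot_log_probs, answer_logits):
    # log-probabilities of the sampled chain tokens, shape (batch, chain_len),
    # and logits of the final prediction, shape (batch, num_classes).
    cot_log_probs, answer_logits = apply_fn(params, rng, inputs)

    # Reward comes only from the final output: 1 if correct, 0 otherwise.
    reward = (jnp.argmax(answer_logits, axis=-1) == labels).astype(jnp.float32)

    # Batch-mean baseline for variance reduction.
    advantage = reward - reward.mean()

    # REINFORCE: increase the log-probability of chains leading to a correct answer.
    pg_loss = -(jax.lax.stop_gradient(advantage)[:, None] * cot_log_probs).sum(-1).mean()

    # Supervised loss on the final answer itself (the differentiable path).
    ce_loss = -jnp.take_along_axis(
        jax.nn.log_softmax(answer_logits), labels[:, None], axis=-1
    ).mean()
    return pg_loss + ce_loss
```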
We could end up with very small LLMs matching or outperforming the largest ones (e.g. GPT-4), which would be incredibly sick and revolutionary. Also, it would be possible to query these factorized models for arbitrary compute budgets, giving rise to a controllable compute-performance trade-off.
On a tangent, two other applications could emerge. First, one could even imagine a manageable compute-safety trade-off by searching the computation tree for "safe" states as scored by a value function (see LeCun's JEPA). Second, if the search heuristic is well calibrated, one could imagine estimating uncertainty during the search, potentially leading to automatic search budgets.
- Study the scaling laws of factorized computation.
- How does performance on a simple computational task (e.g. arithmetic) evolve as a function of model size, chain of thought length, and training data?
- Can one recover similar performance by factorizing a model?
- Does one get higher sample efficiency than 1-shot models assuming the chains are given?
- Does chain-of-thought factorization lead to linear, sublinear or superlinear scaling when increasing the chain length?
- Teaching Arithmetic to Small Transformers, [Lee et al., 2023]
- Universal Transformers, [Dehghani et al., 2019]
- Adaptivity and Modularity for Efficient Generalization Over Task Complexity, [Abnar et al., 2023]
- Modular Deep Learning, [Pfeiffer et al., 2023]
- Adaptive Computation Time for Recurrent Neural Networks, [Graves, 2016]
- PonderNet: Learning to Ponder, [Banino et al., 2021]
- An investigation of model-free planning, [Guez et al., 2019]
- Chain of Code: Reasoning with a Language Model-Augmented Code Emulator, [Li et al., 2023]
- Think before you speak: Training Language Models With Pause Tokens, [Goyal et al., 2023]
- Implicit Chain of Thought Reasoning via Knowledge Distillation, [Deng et al., 2023]
- Addressing Some Limitations of Transformers with Feedback Memory, [Fan et al., 2020]
- CoTFormer: More Tokens With Attention Make Up For Less Depth, [Mohtashami et al., 2023]
- Adaptive Computation with Elastic Input Sequence, [Xue et al., 2023]: generalization with respect to the computation sequence length.
- The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers, [Csordás et al., 2021]
- Recurrent Independent Mechanisms, [Goyal et al., 2019]
- Transferring Inductive Biases through Knowledge Distillation, [Abnar et al., 2020]
- Pointer Value Retrieval: A new benchmark for understanding the limits of neural network generalization, [Zhang et al., 2021]: generalization, neural network reasoning, indirection.
- Explaining grokking through circuit efficiency, [Varma et al., 2023]: on generalization versus memorization.
- Thinking Like Transformers, [Weiss et al., 2021]: RASP language and computation model behind the transformer.
- What Algorithms can Transformers Learn? A Study in Length Generalization, [Zhou et al., 2023]: on length generalization.
- The No Free Lunch Theorem, Kolmogorov Complexity, and the Role of Inductive Biases in Machine Learning, [Goldblum et al., 2023]: on the no-free-lunch theorem.
- Lessons on Parameter Sharing across Layers in Transformers, [Takase & Kiyono, 2021]: superior performance when stacking shared-parameter layers multiple times.
- Understanding Parameter Sharing in Transformers, [Lin et al., 2023]
- Sparse Universal Transformer, [Tan et al., 2023]
- Improving the Neural GPU Architecture for Algorithm Learning, [Freivalds & Liepins, 2017]
- Neural GPUs Learn Algorithms, [Kaiser & Sutskever, 2015]
- Self-Discover: Large Language Models Self-Compose Reasoning Structures, [Zhou et al., 2024]
- The Unreasonable Ineffectiveness of the Deeper Layers, [Gromov et al., 2024]
- Let's Think Dot by Dot: Hidden Computation in Transformer Language Models, [Pfau et al., 2024]
- Do Large Language Models Latently Perform Multi-Hop Reasoning?, [Yang et al., 2024]