Hello SimAI Team,
We are using SimAI to model LLM pre-training, where state-of-the-art (SOTA) practice calls for global batch sizes in the millions of tokens, achieved with a very large number of gradient accumulation steps (GAS).
Context: The Need to Simulate SOTA Configurations
In modern, state-of-the-art LLM pre-training, it is standard practice to use global batch sizes on the order of millions of tokens. To achieve this, configurations often set global_batch (in samples) to hundreds of thousands or even millions, coupled with a very high number of gradient accumulation steps (GAS).
This large-batch strategy is essential for maximizing Model FLOPs Utilization (MFU), because the high cost of gradient synchronization is amortized across many compute-intensive micro-steps.
The Problem
When we set --global_batch to a realistic value (e.g., 1,048,576 samples), the required GAS becomes massive (e.g., 32,768 steps).
This causes workload_generator and SimAI_analytical to run prohibitively slowly, presumably because they simulate every single micro-step individually instead of abstracting the repeated computation.
For SimAI to be effective for its main use case, it must be able to efficiently simulate these high-GAS scenarios.
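For concreteness, here is the arithmetic behind the numbers above. The variable names and the micro-batch/data-parallel values are our assumptions for illustration, not SimAI parameters:

```python
# Illustrative arithmetic only; micro_batch and dp_size are assumed
# values chosen so that the example reproduces our configuration.
global_batch = 1_048_576   # samples per global step (--global_batch)
micro_batch  = 1           # samples per micro-step per DP rank (assumed)
dp_size      = 32          # data-parallel world size (assumed)

# Gradient accumulation steps needed to reach the global batch.
gas = global_batch // (micro_batch * dp_size)
print(gas)  # 32768 micro-steps simulated per global step
```

Any simulator that iterates once per micro-step therefore does roughly 32,768x the work of a single-micro-step configuration, which matches the slowdown we observe.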
Questions
Is this linear runtime scaling with GAS intended behavior?
Is there a more abstract, efficient way to model a global step (e.g., as (N * T_compute) + T_comm, where N is the number of accumulation steps) that we are missing?
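To make the second question concrete, this is the kind of closed-form shortcut we have in mind. It is a sketch of our proposal, not SimAI's API; the function name and the timing values are placeholders:

```python
# Hedged sketch of the closed-form global-step model we are proposing.
# T_compute and T_comm here are placeholder values, not measurements.
def global_step_time(gas: int, t_compute_us: float, t_comm_us: float) -> float:
    """Model one global step as `gas` identical compute micro-steps
    followed by a single gradient-synchronization phase, instead of
    simulating each micro-step individually."""
    return gas * t_compute_us + t_comm_us

# Example: 32,768 accumulation steps, 500 us compute per micro-step,
# 12,000 us for the final gradient all-reduce.
print(global_step_time(32_768, 500.0, 12_000.0))  # 16396000.0 (us)
```

Under this model the simulator would evaluate one micro-step once and multiply, so runtime would be independent of GAS rather than linear in it.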
Thank you for your insights.