Hello SimAI Team,
We are using SimAI to model LLM pre-training, where state-of-the-art (SOTA) practice calls for global batch sizes in the millions of tokens, achieved with a very large number of gradient accumulation steps (GAS).
Context: The Need to Simulate SOTA Configurations
In modern, state-of-the-art LLM pre-training, it is standard practice to use global batch sizes on the order of millions of tokens. To achieve this, configurations often set global_batch (in samples) to hundreds of thousands or even millions, coupled with a very high number of gradient accumulation steps (GAS).
This large-batch strategy is essential for maximizing Model FLOPs Utilization (MFU), because the high cost of gradient synchronization is amortized across many compute-intensive micro-steps.
The Problem
When we set --global_batch to a realistic value (e.g., 1,048,576 samples), the required GAS becomes massive (e.g., 32,768 steps).
This causes workload_generator and SimAI_analytical to run prohibitively slowly, presumably because they simulate every single micro-step individually instead of abstracting the repeated computation.
For SimAI to be effective for its main use case, it must be able to efficiently simulate these high-GAS scenarios.
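For concreteness, here is the arithmetic behind the numbers above. The variable names and the micro-batch/data-parallel values are our assumptions for illustration, not SimAI parameters:

```python
# Illustrative arithmetic only; micro_batch and dp_size are assumed
# values chosen so that the example reproduces our configuration.
global_batch = 1_048_576   # samples per global step (--global_batch)
micro_batch  = 1           # samples per micro-step per DP rank (assumed)
dp_size      = 32          # data-parallel world size (assumed)

# Gradient accumulation steps needed to reach the global batch.
gas = global_batch // (micro_batch * dp_size)
print(gas)  # 32768 micro-steps simulated per global step
```

Any simulator that iterates once per micro-step therefore does roughly 32,768x the work of a single-micro-step configuration, which matches the slowdown we observe.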
Questions
Is this linear runtime scaling with GAS intended behavior?
Is there a more abstract, efficient way to model a global step (e.g., as (N * T_compute) + T_comm, where N is the number of accumulation steps) that we are missing?
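To make the second question concrete, this is the kind of closed-form shortcut we have in mind. It is a sketch of our proposal, not SimAI's API; the function name and the timing values are placeholders:

```python
# Hedged sketch of the closed-form global-step model we are proposing.
# T_compute and T_comm here are placeholder values, not measurements.
def global_step_time(gas: int, t_compute_us: float, t_comm_us: float) -> float:
    """Model one global step as `gas` identical compute micro-steps
    followed by a single gradient-synchronization phase, instead of
    simulating each micro-step individually."""
    return gas * t_compute_us + t_comm_us

# Example: 32,768 accumulation steps, 500 us compute per micro-step,
# 12,000 us for the final gradient all-reduce.
print(global_step_time(32_768, 500.0, 12_000.0))  # 16396000.0 (us)
```

Under this model the simulator would evaluate one micro-step once and multiply, so runtime would be independent of GAS rather than linear in it.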
Thank you for your insights.