
workload_generator and SimAI_analytical are extremely slow with very large global_batch values (high gradient accumulation) #192


Description

@horser1

Hello SimAI Team,

We are using SimAI to model LLM pre-training, where state-of-the-art practice calls for global batches on the order of millions of tokens, achieved through a very high number of gradient accumulation steps (GAS).

Context: The Need to Simulate SOTA Configurations
In modern, state-of-the-art LLM pre-training, it is standard practice to use global batch sizes on the order of millions of tokens. To achieve this, configurations often set --global_batch (in samples) to hundreds of thousands or even millions, coupled with a very high number of gradient accumulation steps (GAS).

This large-batch strategy is essential for maximizing Model FLOPs Utilization (MFU): the high cost of gradient synchronization is amortized across many compute-intensive micro-steps.

The Problem
When we set --global_batch to a realistic value (e.g., 1,048,576), the required GAS becomes massive (e.g., 32,768 steps).
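For reference, the 32,768 figure follows directly from the batch configuration. A minimal arithmetic sketch, assuming a hypothetical micro-batch of 1 and a data-parallel size of 32; the variable names are illustrative, not actual SimAI/aicb parameters:

```python
# GAS implied by the batch configuration (standard Megatron-style accounting).
# All names below are illustrative assumptions, not SimAI/aicb flags.
global_batch = 1_048_576  # samples per optimizer (global) step
micro_batch = 1           # samples per micro-step per data-parallel rank (assumed)
dp_size = 32              # data-parallel world size (assumed)

gas = global_batch // (micro_batch * dp_size)
print(gas)  # 32768 micro-steps simulated per global step
```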

This causes workload_generator and SimAI_analytical to run prohibitively slowly, presumably because they are simulating every single micro-step individually instead of abstracting the computation.

For SimAI to be effective for its main use case, it must be able to efficiently simulate these high-GAS scenarios.

Questions
Is this slow, linear scaling of runtime with GAS intended?

Is there a more abstract, efficient way to model a global step that we are missing, e.g., as `T_step = N * T_compute + T_comm`, where N is the number of accumulation steps?
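To make that suggestion concrete, here is a minimal sketch of the closed-form model we have in mind. It is only an illustration under our assumptions: `global_step_time`, `t_compute`, `t_comm`, and `overlap` are hypothetical names, not existing SimAI APIs.

```python
def global_step_time(gas: int, t_compute: float, t_comm: float,
                     overlap: float = 0.0) -> float:
    """Closed-form estimate of one optimizer (global) step.

    Rather than simulating each of the `gas` micro-steps individually,
    multiply one representative micro-step time by GAS and add the
    gradient synchronization cost once, optionally discounted by
    compute/comm overlap (0.0 = no overlap, 1.0 = fully hidden).
    """
    return gas * t_compute + (1.0 - overlap) * t_comm

# Example: 32,768 micro-steps of 5 ms each plus a 120 ms gradient all-reduce.
print(global_step_time(gas=32_768, t_compute=0.005, t_comm=0.120))  # ~163.96 s
```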

Thank you for your insights.
