## 1. Making Single-GPU Training Repeatable (Bit-for-Bit)

To get bit-for-bit repeatable runs, seed every source of randomness and forbid non-deterministic kernels. The cost: training is slower, some ops raise errors, and results are only repeatable on the same hardware and software versions.


```python
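# ---- put these at the VERY top of your script, before importing torch ----
import os
# NOTE: PYTHONHASHSEED is read at interpreter startup, so setting it here only reaches
# child processes. To fix str/bytes hashing (and thus set iteration order) in this
# process, launch with: PYTHONHASHSEED=0 python train.py
os.environ["PYTHONHASHSEED"] = "0"
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # or ":16:8"  (required for deterministic cuBLAS on CUDA >= 10.2)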

import random
import numpy as np
import torch

# 1) Seed ALL RNGs
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

# 2) Forbid non-deterministic algorithms
torch.use_deterministic_algorithms(True)   # raises if a nondeterministic kernel would be used

# 3) Make cuDNN + matmul deterministic and stable
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False     # no auto-tuning
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

# 4) Avoid AMP for strict determinism (mixed precision can introduce tiny diffs)
use_amp = False

# 5) Build your DataLoader with deterministic shuffling
g = torch.Generator()
g.manual_seed(SEED)

from torch.utils.data import DataLoader
train_loader = DataLoader(
    dataset, 
    batch_size=BS, 
    shuffle=True,              # OK, but make it reproducible:
    generator=g,               # <- controls the shuffle order deterministically
    num_workers=0,             # easiest way to avoid nondeterminism from workers
    persistent_workers=False,  # if you later use workers>0, keep this False for determinism
    drop_last=True,            # avoids partial-batch edge cases
    pin_memory=False           # optional; not needed for determinism
)

# 6) Stochastic layers/augs are fine as long as their RNG is seeded
#   - Dropout draws from the torch RNG seeded above, so its masks repeat across runs
#   - Random augmentations must draw from a seeded RNG (torch, NumPy, or `random`)
#   - BatchNorm is fine but depends on batch content/order (which we've fixed)

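# 7) Training loop (no AMP; all randomness comes from the seeded RNGs above)
model.train()
for epoch in range(NUM_EPOCHS):
    # shuffle=True with generator=g already reshuffles every epoch, identically across runs.
    # Reseed with a function of (SEED, epoch) only if you need epoch k's order without
    # replaying earlier epochs, e.g., when resuming from a checkpoint:
    # g.manual_seed(SEED + epoch)
    for x, y in train_loader:
        x, y = x.cuda(), y.cuda()  # assumes the model already lives on the GPU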
        optimizer.zero_grad(set_to_none=True)
        out = model(x)
        loss = criterion(out, y)
        loss.backward()
        optimizer.step()
```
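One way to confirm that a run really is bit-for-bit repeatable is to hash the final weights and compare digests between runs. The sketch below is an illustration rather than part of the recipe: `state_digest` and `tiny_run` are hypothetical helpers, and the toy model runs on CPU so it can be tried anywhere. It keeps a Dropout layer to show that a seeded stochastic layer still repeats.

```python
import hashlib

import torch


def state_digest(model: torch.nn.Module) -> str:
    """SHA-256 over all parameters and buffers, in state_dict order."""
    h = hashlib.sha256()
    for name, tensor in model.state_dict().items():
        h.update(name.encode("utf-8"))
        h.update(tensor.detach().cpu().contiguous().numpy().tobytes())
    return h.hexdigest()


def tiny_run(seed: int, steps: int = 5) -> str:
    """Train a toy model for a few steps and return the digest of its weights."""
    torch.manual_seed(seed)  # seeds weight init, the fake data, and the dropout masks
    model = torch.nn.Sequential(
        torch.nn.Linear(4, 8),
        torch.nn.ReLU(),
        torch.nn.Dropout(0.5),
        torch.nn.Linear(8, 1),
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(16, 4), torch.randn(16, 1)
    model.train()
    for _ in range(steps):
        optimizer.zero_grad(set_to_none=True)
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
    return state_digest(model)


if __name__ == "__main__":
    print(tiny_run(42) == tiny_run(42))  # same seed: digests should match
    print(tiny_run(42) == tiny_run(43))  # different seed: digests should differ
```

For a real training script, calling something like `state_digest(model)` at the end of two runs and comparing the strings is usually enough to tell "bit-identical" from "merely close".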

### What can still break determinism?

* **Different hardware/driver/PyTorch/CUDA/cuDNN versions.** Bitwise sameness is only realistic when the full stack is identical.
* **Ops without deterministic implementations.** With `torch.use_deterministic_algorithms(True)`, PyTorch will throw if you hit one (e.g., some pooling/atomic-based ops on certain versions). Replace those ops or move to CPU alternatives.
* **Data augmentation randomness.** Random transforms (Albumentations, torchvision) repeat only if every RNG they draw from is seeded, including the per-worker RNGs when `num_workers > 0`.
* **Dropout.** Dropout draws its masks from the seeded torch RNG, so it repeats across runs as long as the number and order of RNG draws stay the same. Disable it (`p=0`) only if you want to rule it out while debugging.
* **Mixed precision / TF32.** For strict reproducibility, keep FP32 only and disable TF32 (done above).
* **Multi-GPU / DDP.** True bitwise determinism is much harder due to collective operations and scheduling. It can sometimes be approximated with careful NCCL settings and static graphs, but expect occasional drift; single-GPU is the safe path.
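For the data augmentation point, one possible pattern is to derive a private RNG for each sample from `(seed, epoch, index)`, so the augmentation no longer depends on how many random draws happened earlier or on which worker handled the sample. This is a minimal sketch using only the standard library; `sample_rng` and `random_flip_and_shift` are hypothetical names, and the "image" is just a list of numbers.

```python
import hashlib
import random


def sample_rng(seed: int, epoch: int, index: int) -> random.Random:
    """Return an RNG whose stream depends only on (seed, epoch, index)."""
    key = f"{seed}:{epoch}:{index}".encode("utf-8")
    # hashlib is stable across processes, unlike the built-in hash() on strings
    derived = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return random.Random(derived)


def random_flip_and_shift(pixels, rng: random.Random):
    """Toy augmentation: maybe reverse the sequence, then add a small offset."""
    if rng.random() < 0.5:
        pixels = pixels[::-1]
    offset = rng.randint(-2, 2)
    return [p + offset for p in pixels]


if __name__ == "__main__":
    first = random_flip_and_shift([1, 2, 3, 4], sample_rng(42, epoch=0, index=7))
    again = random_flip_and_shift([1, 2, 3, 4], sample_rng(42, epoch=0, index=7))
    print(first == again)  # True: same (seed, epoch, index) gives the same augmentation
```

The trade-off is that the augmentation library has to accept an explicit RNG or seed; libraries that only read a global RNG need to be seeded globally instead.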

### FAQ

* **“Is it meaningful to train ‘deterministically’ if I remove randomness like dropout and data aug?”**
  It’s useful for **debugging, ablations, and CI**. For best generalization, you usually want stochasticity (augmentations, dropout), but you can still keep runs reproducible by fixing seeds and using deterministic kernels—your randomness is then *controlled and repeatable*.

* **“Can I keep per-epoch reshuffling and still be reproducible?”**
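  Yes. With `shuffle=True` and a seeded `generator`, the DataLoader already draws a *different* permutation each epoch, and the sequence of permutations is the *same* across runs. Reseeding per epoch with a deterministic function (e.g., `SEED + epoch`) is only needed when you want a given epoch's order without replaying the earlier epochs, for example when resuming from a checkpoint.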

* **“How about BatchNorm?”**
  BN is deterministic given the same batch order/content. If you switch to `model.eval()` to disable dropout, remember that BN will then use running stats (different behavior). Prefer disabling dropout explicitly instead of flipping to eval.
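The per-epoch reshuffling answer can be checked directly. The following sketch (with a hypothetical helper `epoch_orders` and a toy dataset of 64 integers) records the sample order of a few epochs from a seeded, shuffling `DataLoader`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def epoch_orders(seed: int, num_epochs: int = 3):
    """Return the sample order seen in each epoch of a seeded, shuffling DataLoader."""
    dataset = TensorDataset(torch.arange(64))
    g = torch.Generator()
    g.manual_seed(seed)
    loader = DataLoader(dataset, batch_size=64, shuffle=True, generator=g, num_workers=0)
    orders = []
    for _ in range(num_epochs):
        (batch,) = next(iter(loader))  # one batch covers the whole toy dataset
        orders.append(batch.tolist())
    return orders


if __name__ == "__main__":
    run_a = epoch_orders(42)
    run_b = epoch_orders(42)
    print(run_a == run_b)        # the full sequence of epoch orders repeats across runs
    print(run_a[0] == run_a[1])  # while consecutive epochs use different orders
```

Because the generator's state advances from epoch to epoch, epoch 5's order depends on epochs 0 to 4 having been drawn first; that dependency is what explicit per-epoch reseeding removes.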



## 2. Making Multi-GPU Training Repeatable (Bit-for-Bit)

**Multi-GPU Distributed Data Parallel (DDP) training adds extra sources of non-determinism**, mainly from parallel communication/reduction, process-local RNGs, and data sharding. You can still get *highly reproducible* runs, but **bit-for-bit determinism is harder and not guaranteed** the way it often is on a single GPU.

Here’s what’s different + how to lock it down as much as PyTorch allows.

### What’s different (and why it drifts)

1. **Floating-point reduction order**
   Gradients are summed across GPUs via collective ops (NCCL all-reduce). Different reduction *orders* (ring vs tree, bucket timings) change rounding, so results can differ by a few ULPs.

2. **Asynchrony & scheduling**
   Backward/communication overlap, CUDA stream timing, and kernel scheduling can vary slightly between runs → tiny numeric diffs.

3. **Parameter bucketing & module order**
   DDP groups params into “buckets” dynamically. If module/param registration order changes (e.g., dict/hash order), bucket order can differ → different reduction order.

4. **Per-rank data & RNGs**
   Each rank sees a different shard of data and has its own RNG stream. If you don’t seed things *per rank* (and per epoch), shuffles/augs diverge run-to-run.

5. **Dataloader workers**
   Multiple workers per rank introduce more threads & RNG streams to coordinate.

6. **AMP / loss scaling**
   Mixed precision adds nondeterminism (different overflow patterns, atomics). Static scale helps, but strict FP32 is safer.
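The reduction-order point (item 1) comes down to floating-point addition not being associative. The sketch below uses plain Python floats to show it; `sequential_sum` and `pairwise_sum` are hypothetical stand-ins for "one running total" versus "combine partial sums", loosely analogous to ring and tree reductions, and do not model NCCL itself.

```python
def sequential_sum(values):
    """Add left to right, keeping one running total."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values):
    """Add neighbours first, then combine the partial sums."""
    values = list(values)
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])  # odd element carries over to the next round
        values = paired
    return values[0]


if __name__ == "__main__":
    # Same three addends, different order
    print(sequential_sum([0.1, 0.2, 0.3]))  # 0.6000000000000001
    print(sequential_sum([0.3, 0.2, 0.1]))  # 0.6

    # Same four "per-rank gradients", different grouping (the exact sum is 2.0)
    grads = [1e16, 1.0, -1e16, 1.0]
    print(sequential_sum(grads))  # 1.0
    print(pairwise_sum(grads))    # 0.0
```

Real gradients rarely differ by this much, but the last-bit version of the same effect is what shows up as run-to-run drift when the reduction order is not fixed.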

### As-deterministic-as-possible DDP recipe

Use this as a template (single-node, multi-GPU; one process per GPU with `torchrun`):

```python
# ---- very top of your entry script ----
import os
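# NOTE: PYTHONHASHSEED is read at interpreter startup; setting it here only reaches child
# processes. To fix hashing in the ranks themselves, launch with: PYTHONHASHSEED=0 torchrun ...
os.environ["PYTHONHASHSEED"] = "0"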
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"   # CUDA matmul determinism
# Optional: pin NCCL algo to stabilize reduction order (not a hard guarantee)
os.environ.setdefault("NCCL_ALGO", "Ring")
os.environ.setdefault("NCCL_LAUNCH_MODE", "GROUP")
# Optional: reduces timing jitter when debugging; can slow things down
# os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import random, numpy as np, torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

def set_global_determinism(seed: int, rank: int):
    # Per-rank seeding: base + rank so each process is distinct but reproducible
    base = seed + rank
    random.seed(base)
    np.random.seed(base)
    torch.manual_seed(base)
    torch.cuda.manual_seed(base)
    torch.cuda.manual_seed_all(base)

    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False

def seed_worker(worker_id: int, base_seed: int, rank: int):
    wseed = base_seed + 1000 * rank + worker_id
    random.seed(wseed); np.random.seed(wseed); torch.manual_seed(wseed)

def create_loader(dataset, batch_size, seed, rank, num_workers=0):
    sampler = DistributedSampler(dataset, shuffle=True, seed=seed, drop_last=True)
    g = torch.Generator()
    g.manual_seed(seed + rank)
    return DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=sampler,
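        num_workers=num_workers,            # for strictness keep 0; if >0, use worker_init_fn
        # NOTE: a lambda works with the "fork" start method (Linux default); with "spawn" it
        # cannot be pickled, so use functools.partial(seed_worker, base_seed=seed, rank=rank)
        worker_init_fn=(lambda wid: seed_worker(wid, seed, rank)) if num_workers>0 else None,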
        generator=g,
        persistent_workers=False,
        pin_memory=False,
        drop_last=True,
    ), sampler

def main_worker(rank, world_size, seed):
    dist.init_process_group(backend="nccl", init_method="env://", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    set_global_determinism(seed, rank)

    model = build_model().cuda(rank)
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[rank], find_unused_parameters=False, static_graph=True
    )

    optimizer = build_optimizer(model.parameters())
    criterion = build_criterion()

    train_loader, train_sampler = create_loader(train_dataset, BS, seed, rank, num_workers=0)

    # Stochastic layers such as Dropout can stay: with the per-rank seeding above,
    # their masks repeat across runs. Set p=0 only to rule them out while debugging.
    # Avoid AMP for strictest determinism.
    use_amp = False

    for epoch in range(EPOCHS):
        train_sampler.set_epoch(epoch)  # deterministic reshuffle across runs
        model.train()
        for x, y in train_loader:
            x, y = x.cuda(rank, non_blocking=False), y.cuda(rank, non_blocking=False)
            optimizer.zero_grad(set_to_none=True)
            out = model(x)
            loss = criterion(out, y)
            loss.backward()
            optimizer.step()

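    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    # torchrun sets RANK and WORLD_SIZE; on a single node RANK equals the local GPU index
    main_worker(int(os.environ["RANK"]), int(os.environ["WORLD_SIZE"]), seed=42)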
```
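To see what `DistributedSampler` and `set_epoch` contribute, the sharding can be inspected without launching any processes, because `num_replicas` and `rank` can be passed explicitly. This is a small sketch with a hypothetical helper `shard_indices` and a toy dataset of 16 items.

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset


def shard_indices(seed: int, epoch: int, rank: int, world_size: int = 2, n: int = 16):
    """Indices that `rank` would see in `epoch`.

    No process group is needed because num_replicas and rank are given explicitly.
    """
    dataset = TensorDataset(torch.arange(n))
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=True, seed=seed, drop_last=True
    )
    sampler.set_epoch(epoch)
    return list(sampler)


if __name__ == "__main__":
    rank0 = shard_indices(seed=42, epoch=0, rank=0)
    rank1 = shard_indices(seed=42, epoch=0, rank=1)
    print(sorted(rank0 + rank1) == list(range(16)))          # shards are disjoint and cover the data
    print(rank0 == shard_indices(seed=42, epoch=0, rank=0))  # same inputs, same shard
    print(rank0 == shard_indices(seed=42, epoch=1, rank=0))  # a new epoch gives a new shuffle
```

The shard depends only on the sampler's `seed`, the epoch, the rank, and the world size, which is why the sampler takes the shared `seed` rather than `seed + rank`, and why changing the number of GPUs changes what every rank sees.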

**Launch consistently** (same world size, same device order, same seeds):

```bash
torchrun --nproc_per_node=<NUM_GPUS> --master_port=29500 train.py
# Ensure the same versions of PyTorch/CUDA/cuDNN/NCCL and identical GPUs/drivers
```
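Since repeatability depends on an identical software and hardware stack, it can help to log a fingerprint of that stack with every run and compare it before comparing results. The sketch below is one possible way to do that; `stack_fingerprint` and `fingerprint_digest` are hypothetical helpers, and the fields are a starting point (NCCL and driver versions are not included here and would need to be added for a complete picture).

```python
import hashlib
import json
import platform

import torch


def stack_fingerprint() -> dict:
    """Collect version information that can affect numerical results."""
    return {
        "python": platform.python_version(),
        "torch": torch.__version__,
        "cuda": torch.version.cuda,               # None on CPU-only builds
        "cudnn": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
        "gpus": [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())],
    }


def fingerprint_digest(info: dict) -> str:
    """Stable short identifier for a fingerprint dict."""
    payload = json.dumps(info, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    info = stack_fingerprint()
    print(json.dumps(info, indent=2))
    print(fingerprint_digest(info))
```

If two runs report different digests, a mismatch in their results is expected and says nothing about the seeding.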

### Extra tips / gotchas

* **Bit-for-bit?** Even with all of the above, **PyTorch does not guarantee bit-exact determinism across multiple GPUs**. You *can* get extremely close (repeatable curves/metrics), but an occasional last-bit (ULP-level) difference in a reduction can compound over many steps, so weights may not match exactly between runs.

* **Keep the graph static.** `static_graph=True` (PyTorch ≥1.12) avoids bucket re-builds. Also keep `find_unused_parameters=False`.

* **Parameter registration order.** Build modules in a deterministic order. Python dicts preserve insertion order, but iterating over a `set` of strings depends on hash randomization unless `PYTHONHASHSEED` is fixed before the interpreter starts.

* **Data augs.** Stochastic augmentations are **reproducible across runs** with the seeding above (including per-epoch reshuffles via `set_epoch`), provided the number and order of RNG draws do not change. Any extra RNG draw elsewhere in the code shifts every augmentation that follows it.

* **AMP.** For strictest reproducibility, run FP32. If you must use AMP, prefer **static loss scale** and accept small nondeterminism.

* **Multiple nodes.** Cross-node adds more variability (network fabric, clocks). Pin NCCL envs and keep machines identical; determinism becomes even less guaranteed.

* **Evaluation sync.** Call `dist.barrier()` before validation to keep ranks aligned (and to avoid one rank using different running stats if you switch modes).

---

