mlsweep is a slightly opinionated but very general solution for managing tons of machine learning runs. It takes flexible combinations of hyperparameters and schedules them across your hardware.
The project contains a controller, which schedules jobs across workers, and a visualizer. You aren't forced to use our visualizer: you can use mlsweep alongside the wandb or tensorboard loggers, and the mlsweep metrics format can be exported to wandb or tensorboard. The mlsweep logger is also extensible, should you wish. Logs end up on the controller machine.
The main feature of mlsweep is not the logger, but the sweep configuration file. The good stuff. The thing that I've been missing all my machine learning life, and the reason I wrote this library.
mlsweep does pretty much everything that wandb does. If you're missing anything, let me know on Discord or Twitter.
But first, let's install it and add the logging.
```shell
git clone <your-project>
cd <your-project>
python -m venv .venv
source .venv/bin/activate
pip install 'mlsweep[all]'
```

```python
import os

from mlsweep.logger import MLSweepLogger

# If you don't want to use it as a context manager, remember to call .close().
with MLSweepLogger() as logger:
    for step in range(1, num_steps + 1):
        loss = train_step()
        logger.log({"loss": loss}, step=step)

        # Write checkpoints to MLSWEEP_RUN_DIR; they get rsynced back automatically.
        if step % 1000 == 0:
            # If your checkpoint saving is asynchronous, launch a thread that
            # awaits the future and then calls logger.sync().
            save_checkpoint(os.environ["MLSWEEP_RUN_DIR"], step)
            # Trigger an immediate rsync mid-run (fire-and-forget). logger.sync()
            # is async and nonblocking, but must be called after the artifact
            # dir is ready.
            logger.sync()
```

This logging is usually a no-op when run outside of mlsweep_run. Metrics land in `outputs/sweeps/<experiment>/<run>/metrics.jsonl`. Anything written to `MLSWEEP_RUN_DIR` is rsynced to `outputs/sweeps/<experiment>/<run>/artifacts/`: at the end of every run, and immediately on `logger.sync()`.
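For quick post-hoc analysis you can read the metrics back yourself. A minimal sketch, assuming each line of metrics.jsonl is one JSON object holding the logged dict plus its step (an assumption; the exact schema isn't spelled out here):

```python
import json
from pathlib import Path

def load_metrics(run_dir):
    """Read one run's metrics.jsonl into a list of dicts."""
    records = []
    with open(Path(run_dir) / "metrics.jsonl") as f:
        for line in f:
            if line.strip():  # skip blank lines
                records.append(json.loads(line))
    return records
```

From there it's easy to, say, pull out `[r["loss"] for r in records]` and plot a curve.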
Add the following shebang, and chmod +x your sweep file so that it's directly executable.
```python
#!/usr/bin/env mlsweep_run
COMMAND = ["python", "train.py"]
OPTIONS = {
    ".lr": {
        "values": [1e-4, 3e-4, 1e-3],
        "flags": "--optimizer.lr",
        "name": "lr",
    },
    ".batch_size": {
        "values": [32, 64, 128],
        "flags": "--training.batch_size",
        "name": "bs",
    },
}
```

Running this produces 9 runs named my_sweep_lr1e-4_bs32, my_sweep_lr1e-4_bs64, etc.
Each run receives its flags appended to `COMMAND`: `python train.py --optimizer.lr 0.0001 --training.batch_size 32`.
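For intuition, this expansion amounts to a Cartesian product over each dimension's values list. A hypothetical sketch of roughly what happens, not mlsweep's implementation (in particular, the real run names render floats like 1e-4 rather than 0.0001):

```python
import itertools

OPTIONS = {
    ".lr": {"values": [1e-4, 3e-4, 1e-3], "flags": "--optimizer.lr", "name": "lr"},
    ".batch_size": {"values": [32, 64, 128], "flags": "--training.batch_size", "name": "bs"},
}

def expand(command, options, experiment="my_sweep"):
    """Yield a (run_name, argv) pair for every combination of option values."""
    dims = list(options.values())
    for combo in itertools.product(*(d["values"] for d in dims)):
        name = experiment + "".join(f"_{d['name']}{v}" for d, v in zip(dims, combo))
        argv = list(command)
        for d, v in zip(dims, combo):
            argv += [d["flags"], str(v)]
        yield name, argv

runs = list(expand(["python", "train.py"], OPTIONS))  # 3 lrs x 3 batch sizes = 9 runs
```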
See sweep_configuration.md for the full format: subdimensions, monotonic/singular skipping, EXCLUDE, NODES_PER_RUN and GPUS_PER_RUN for training with torchrun (see SET_DIST_ENV), and more. For end-to-end examples with real frameworks (Prime-RL, TorchTitan), see examples.md.
If your sweep is specifically for hyperparameter optimization, you can add an OPTIMIZE dict to save compute — it uses TPE (via optuna) to intelligently sample the space and find good configs faster than trying all combinations.
```python
#!/usr/bin/env mlsweep_run
COMMAND = ["python", "train.py"]
OPTIMIZE = {
    "method": "bayes",
    "metric": "val_loss",
    "goal": "minimize",
    "budget": 40,
}
OPTIONS = {
    # Discrete dim
    ".optimizer": {
        "name": "opt",
        ".adam": {"flags": ["--optimizer", "adam"]},
        ".muon": {"flags": ["--optimizer", "muon"]},
    },
    # Continuous dims
    ".lr": {
        "distribution": "log_uniform",
        "min": 1e-5,
        "max": 1e-1,
        "flags": "--optimizer.lr",
        "name": "lr",
    },
    ".wd": {
        "distribution": "log_uniform",
        "min": 0.0,
        "max": 0.2,
        "flags": "--optimizer.weight_decay",
        "name": "wd",
    },
}
```

Requires `pip install 'mlsweep[bayes]'` (already included if you installed 'mlsweep[all]').
Run the same way as any other sweep:
```shell
mlsweep_run sweeps/bayes_sweep.py -g 4
```

See sweep_configuration.md for continuous ranges, singular dims, and all OPTIMIZE fields.
```shell
mlsweep_run sweeps/my_sweep.py            # 1 GPU
mlsweep_run sweeps/my_sweep.py -g 4       # 4 GPUs in parallel
mlsweep_run sweeps/my_sweep.py -g         # all visible GPUs
mlsweep_run sweeps/my_sweep.py -g 4 -j 5  # 5 jobs per GPU (20 total)
```

To run on remote workers, first set each machine up over SSH:

```shell
ssh user@host -i path/to/key
cd path/to/project/
pip install mlsweep
```

Then describe your workers in a workers.toml:

```toml
[[workers]]
host = "user@host1"
remote_dir = "/absolute/path/to/project"
ssh_key = "~/.ssh/id_ed25519"
venv = "/absolute/path/to/venv/"  # Optional; resolves .venv/ or venv/, calls bin/activate, defaults to remote_dir
devices = [0, 1, 2, 3]            # Sets CUDA_VISIBLE_DEVICES/HIP_VISIBLE_DEVICES
jobs = 2
```

| Field | Required | Notes |
|---|---|---|
| `host` | yes | SSH target |
| `remote_dir` | yes | Project root on the remote |
| `ssh_key` | no | Path to identity file (`-i`) |
| `pass` | no | SSH password (needs `sshpass`); or set the `MLSWEEP_SSH_PASS` env var |
| `venv` | no | Venv locator (default: `remote_dir`). Accepts a project root, venv root, `bin/` dir, activate script, or python binary. |
| `devices` | no | Specific GPU IDs to use |
| `gpus` | no | Total GPU count, like `-g` (default: all visible) |
| `jobs` | no | Concurrent jobs per GPU slot, like `-j` (default: 1) |
`venv` accepts any of:
- Project root containing `.venv/` or `venv/`
- Venv root directory (contains `bin/mlsweep_worker`)
- `bin/` directory
- Path to an `activate` script
- Path to a python binary
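The resolution order can be pictured as a small helper. A hypothetical sketch, not mlsweep's actual code (the real implementation also validates that bin/mlsweep_worker exists):

```python
from pathlib import Path

def resolve_venv(path):
    """Map any accepted venv value to the venv's bin/ directory."""
    p = Path(path)
    if p.is_file():                    # activate script or python binary
        return p.parent
    if p.name == "bin" and p.is_dir():
        return p
    if (p / "bin").is_dir():           # venv root
        return p / "bin"
    for name in (".venv", "venv"):     # project root
        if (p / name / "bin").is_dir():
            return p / name / "bin"
    raise FileNotFoundError(f"no venv found under {path}")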
```shell
mlsweep_run sweeps/my_sweep.py --workers workers.toml
```

Once you've launched the sweep, on the machine and in the dir you called mlsweep_run from, run:
```shell
mlsweep_viz
# or
mlsweep_viz experiment_name
```

This will prompt you to open a browser (or pass --open-browser to do so automatically) to see the sweep visualizer. It will watch your experiment folder and update the metrics viewer in real time.
mlsweep can log all runs to Weights & Biases with no changes to your training script. The controller owns the W&B session — your training script only calls MLSweepLogger as usual.
Install the extra:
```shell
pip install 'mlsweep[wandb]'
```

Then pass --wandb-project when launching:
```shell
export WANDB_API_KEY=your_key_here
mlsweep_run sweeps/my_sweep.py -g 4 --wandb-project my-project
mlsweep_run sweeps/my_sweep.py -g 4 --wandb-project my-project --wandb-entity my-team
```

Each run appears in W&B under the project, grouped by experiment name, with its hyperparameter combo stored as the run config.
Same idea — no changes to your training script needed.
Install the extra (or use an existing torch/tensorboardX install):
```shell
pip install 'mlsweep[tensorboard]'
```

Then pass --tensorboard-dir when launching:

```shell
mlsweep_run sweeps/my_sweep.py -g 4 --tensorboard-dir ./tb_logs
```

Logs are written to `<tensorboard-dir>/<experiment>/<run>/`. Point TensorBoard at the top-level dir to compare all runs:

```shell
tensorboard --logdir ./tb_logs
```

If the error messages are bad, the docs are unclear, or you just feel confused, feel free to hit me up on Discord or Twitter.
