You start training at 11 PM. You go to sleep. You wake up at 8 AM.
The training crashed at 2 AM.
You want to kill someone.
Train Watch — Your AI training supervisor that never sleeps.
- 🤖 Powered by Claude Code
- 🔄 Checks every N minutes (you decide)
- 🚨 Auto-detects crashes, OOM, NaN, GPU hangs
- 🔧 Auto-fixes and restarts when possible
- 📝 Logs everything for morning review
# 1. Start your training with Claude Code
claude
> "Help me run this training"
# 2. Launch the watchdog (it will monitor in background)
./watchdog.sh --session <your-session-id> --pid <training-pid> --log train.log
# 3. Go to sleep
# 4. Wake up, resume your session
claude --resume <your-session-id>
> "What happened overnight?"Claude will tell you everything: how many checks, any issues found, what actions were taken.
- ✅ Process alive/dead
- ✅ GPU utilization (detects hangs)
- ✅ Log errors (NaN, Inf, OOM, exceptions)
- ✅ Loss trends
- ✅ Disk space
git clone https://github.com/yourusername/train-watch.git
cd train-watch
chmod +x watchdog.shRequirements:
- Claude Code CLI installed
- Python 3.7+
- nvidia-smi (for GPU monitoring)
Before using Train Watch:
- Backup your code - Commit and push all changes
- Use
/branchto fork your session - This keeps your main conversation clean and allows rollback - Test on non-critical experiments first - Make sure it works as expected
- Review the auto-fix logic - Claude will attempt to fix issues automatically with
--dangerously-skip-permissions
Disclaimer: This tool gives Claude Code autonomous control over your training process. While it's designed to help, it may make mistakes. Use at your own risk. Always have backups.
Step 1: Copy CLAUDE.md to your training project root
Step 2: Start training with Claude Code and fork the session
cd /your/training/project
claudeIn the conversation:
You: /branch
Claude: [creates a branch]
You: Help me run train.py with these configs
Claude: [starts training, tells you PID and session ID]
Why fork? This keeps your main conversation history clean and lets you return to the original session if needed.
Step 3: Launch watchdog with the branched session ID
nohup ./watchdog.sh --session <session-id> --pid <pid> --log train.log > watchdog.out 2>&1 &Step 4: Sleep peacefully 😴
Step 5: Morning review
claude --resume <session-id>Or just read .train_status.log:
cat .train_status.log./watchdog.sh \
--session <claude-session-id> # Required
--pid <training-pid> # Optional but recommended
--log <log-file> # Optional but recommended
--interval 1800 # Check interval in seconds (default: 30min)
--disk /path/to/check # Disk path to monitor (default: .)kill $(cat .watchdog.pid)[2026-03-26 03:00:01] ✅ OK — 进程存活,GPU 利用率 94%,loss 0.342,无异常
[2026-03-26 03:30:01] 🚨 CRITICAL — 进程已死亡,原因:CUDA OOM。已将 batch_size 从 128 改为 64,已重启,新 PID 45231
[2026-03-26 04:00:01] ✅ OK — 进程存活(PID 45231),GPU 利用率 91%,loss 0.318,恢复正常
Claude Code and Codex are amazing, but they lack one critical feature: heartbeat monitoring.
When you're running overnight experiments, you need an agent that:
- Actively checks status (not passively waiting for you to ask)
- Takes action when things go wrong (not just notifying you)
- Works while you sleep (because 2 AM crashes are the worst)
This is that missing piece.
PRs welcome! Especially for:
- Better error detection patterns
- Auto-fix strategies for common issues
- Support for other training frameworks
MIT
晚上 11 点开始训练。你去睡觉。早上 8 点醒来。
训练在凌晨 2 点崩了。
你想杀人。
Train Watch — 永不睡觉的 AI 训练监工。
- 🤖 基于 Claude Code
- 🔄 每 N 分钟检查一次(你决定)
- 🚨 自动检测崩溃、OOM、NaN、GPU 卡死
- 🔧 可能的话自动修复并重启
- 📝 记录所有事件供早晨查看
# 1. 用 Claude Code 启动训练
claude
> "帮我跑这个训练"
# 2. 启动 watchdog(后台监控)
./watchdog.sh --session <你的session-id> --pid <训练进程PID> --log train.log
# 3. 去睡觉
# 4. 醒来后恢复会话
claude --resume <你的session-id>
> "昨晚发生了什么?"Claude 会告诉你一切:检查了几次、发现了什么问题、采取了什么行动。
- ✅ 进程存活状态
- ✅ GPU 利用率(检测卡死)
- ✅ 日志错误(NaN、Inf、OOM、异常)
- ✅ Loss 趋势
- ✅ 磁盘空间
git clone https://github.com/yourusername/train-watch.git
cd train-watch
chmod +x watchdog.sh依赖:
- Claude Code CLI
- Python 3.7+
- nvidia-smi(用于 GPU 监控)
使用 Train Watch 之前:
- 备份代码 - 提交并推送所有更改
- 使用
/branchfork 会话 - 保持主对话干净,允许回滚 - 先在非关键实验上测试 - 确保按预期工作
- 检查自动修复逻辑 - Claude 会使用
--dangerously-skip-permissions自动修复问题
免责声明:此工具赋予 Claude Code 对训练过程的自主控制权。虽然设计用于帮助,但可能会出错。使用风险自负。务必备份。
第 1 步:将 CLAUDE.md 复制到你的训练项目根目录
第 2 步:用 Claude Code 启动训练并 fork 会话
cd /your/training/project
claude在对话中:
你:/branch
Claude:[创建分支]
你:帮我用这些配置运行 train.py
Claude:[启动训练,告诉你 PID 和 session ID]
为什么要 fork? 保持主对话历史干净,需要时可以返回原始会话。
第 3 步:使用 分支的 session ID 启动 watchdog
nohup ./watchdog.sh --session <session-id> --pid <pid> --log train.log > watchdog.out 2>&1 &第 4 步:安心睡觉 😴
第 5 步:早晨查看
claude --resume <session-id>或直接读日志:
cat .train_status.log./watchdog.sh \
--session <claude-session-id> # 必需
--pid <训练进程PID> # 可选但推荐
--log <日志文件> # 可选但推荐
--interval 1800 # 检查间隔(秒),默认 30 分钟
--disk /path/to/check # 监控的磁盘路径,默认当前目录kill $(cat .watchdog.pid)[2026-03-26 03:00:01] ✅ OK — 进程存活,GPU 利用率 94%,loss 0.342,无异常
[2026-03-26 03:30:01] 🚨 CRITICAL — 进程已死亡,原因:CUDA OOM。已将 batch_size 从 128 改为 64,已重启,新 PID 45231
[2026-03-26 04:00:01] ✅ OK — 进程存活(PID 45231),GPU 利用率 91%,loss 0.318,恢复正常
Claude Code 和 Codex 很强大,但缺少一个关键功能:心跳监控。
跑通宵实验时,你需要一个能:
- 主动检查状态(而不是被动等你问)
- 出问题时采取行动(而不只是通知你)
- 在你睡觉时工作(因为凌晨 2 点崩溃最要命)
这就是那个缺失的拼图。
欢迎 PR!特别是:
- 更好的错误检测模式
- 常见问题的自动修复策略
- 支持其他训练框架
MIT
Train Watch — You sleep. I watch. 🌙