# Vast.ai Training Setup and Execution

This notebook automates the setup and training process for the Rainbow DQN project on a Vast.ai GPU instance.

## Section 1: Clone Repository from Source

Clone the repository from GitHub to the Vast.ai instance and check out the `morusthamid-dev` branch.

In [None]:
%%bash

# Clone the repository (skipped if a previous run already cloned it)
cd /root
[ -d function-approximated-RL ] || git clone https://github.com/aryanchegini/function-approximated-RL.git
cd function-approximated-RL
git checkout morusthamid-dev

# Display current directory and contents
pwd
ls -la

## Section 2: Install Dependencies and Create Environment

Set up the Python environment with all required packages for training.

In [None]:
%%bash

# Update pip and conda
pip install --upgrade pip
conda update -n base -c defaults conda -y

# Install core dependencies from requirements_vast.txt
cd /root/function-approximated-RL
pip install -r requirements_vast.txt

# Download the Atari ROMs, accepting the license non-interactively
AutoROM --accept-license

# Optional: uncomment if requirements_vast.txt does not already provide a CUDA build of PyTorch
# pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Verify installations
python -c "import torch; print(f'PyTorch version: {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}')"
python -c "import gymnasium; print(f'Gymnasium version: {gymnasium.__version__}')"

## Section 3: Configure GPU Environment Variables

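Set GPU-related environment variables and verify that the GPU is visible on the Vast.ai instance. Variables exported in a `%%bash` cell apply only to that cell, so any that training depends on must also be set in the training cell.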

In [None]:
%%bash

# Set GPU environment variables (these last only for this cell)
export CUDA_VISIBLE_DEVICES=0  # Use GPU 0 (adjust if needed)
export CUDA_LAUNCH_BLOCKING=1   # Synchronous CUDA kernel launches: useful for debugging, but slows training
export TF_FORCE_GPU_ALLOW_GROWTH=true  # TensorFlow-only setting; has no effect on PyTorch

# Verify GPU access
nvidia-smi
python -c "import torch; print(f'GPU Count: {torch.cuda.device_count()}'); print(f'Current GPU: {torch.cuda.get_device_name(0)}')"
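
One caveat worth illustrating: each `%%bash` cell runs in its own shell process, so a variable exported in one cell is gone by the time the next cell runs. The sketch below uses a subshell as a stand-in for a cell (an assumption made for illustration, not a description of Jupyter internals) and then shows the per-command form, which keeps the setting next to the command that needs it.

```bash
# Start from a clean state so the demonstration is reproducible.
unset CUDA_VISIBLE_DEVICES

# A subshell stands in for one %%bash cell: the export dies with the process.
( export CUDA_VISIBLE_DEVICES=0 )
after_cell="${CUDA_VISIBLE_DEVICES:-unset}"
echo "After the cell exits: $after_cell"

# Prefixing a command sets the variable for that command only.
inside_command="$(CUDA_VISIBLE_DEVICES=0 sh -c 'echo "$CUDA_VISIBLE_DEVICES"')"
echo "Inside the command: $inside_command"
```

In this notebook, that means keeping `export CUDA_VISIBLE_DEVICES=0` in the same cell as `python Train.py`, as the training cell in Section 4 does.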

## Section 4: Execute Training Script

Run the training script. By default the output appears only in the cell; to also save it to a timestamped file under `logs/`, use the commented `tee` line instead.

In [None]:
%%bash

cd /root/function-approximated-RL

# Create logs directory if it doesn't exist
mkdir -p logs

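# Run the training script
echo "Starting Rainbow DQN training..."
export CUDA_VISIBLE_DEVICES=0
# To also save the output to a timestamped log, use this line instead:
# python Train.py 2>&1 | tee logs/training_$(date +%Y%m%d_%H%M%S).log
# The completion message prints only if training exits successfully
python Train.py && echo "Training completed!"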
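
To keep a record of each run, the output can be written to a timestamped file. The following is a minimal sketch, not part of the repository: `run_logged` is a hypothetical helper that redirects stdout and stderr to a log file and returns the command's own exit status, so a completion message prints only on success. A short stand-in command replaces `python Train.py` so the sketch runs anywhere.

```bash
# Hypothetical helper: run a command, capture stdout and stderr in a log file,
# and return the command's exit status.
run_logged() {
    log_file=$1
    shift
    mkdir -p "$(dirname "$log_file")"
    "$@" > "$log_file" 2>&1
}

# Stand-in for: run_logged "logs/training_$(date +%Y%m%d_%H%M%S).log" python Train.py
demo_log="${TMPDIR:-/tmp}/training_$(date +%Y%m%d_%H%M%S).log"
if run_logged "$demo_log" sh -c 'echo "episode 1 done"'; then
    echo "Training completed! Log: $demo_log"
else
    echo "Training failed; see $demo_log" >&2
fi
```

Because the output goes to the file rather than the cell, running `tail -f` on the log from a separate terminal shows progress while the run is active.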

## Section 5: Monitor Training Progress and Logs

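Check GPU utilization, disk usage, and running Python processes. A running `%%bash` cell blocks the notebook kernel, so while training is in progress run these commands from a separate terminal.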

In [None]:
%%bash

echo "=== GPU Monitoring ==="
nvidia-smi

echo ""
echo "=== Disk Usage ==="
df -h /root/function-approximated-RL

echo ""
echo "=== Process Status ==="
ps aux | grep python | grep -v grep
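
Checkpoints and logs accumulate over a long run, so it can help to turn the disk check into a pass/fail test. This is a sketch under assumptions: `free_kb` is a hypothetical helper name and the 5 GB threshold is an arbitrary illustrative value; adjust both to the instance.

```bash
# Hypothetical helper: print the free space (in KB) on the filesystem holding a path.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

min_free_kb=$((5 * 1024 * 1024))   # assumed threshold: 5 GB
available_kb="$(free_kb .)"        # on the instance, pass /root/function-approximated-RL

if [ "$available_kb" -lt "$min_free_kb" ]; then
    echo "Warning: only ${available_kb} KB free" >&2
else
    echo "Disk OK: ${available_kb} KB free"
fi
```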

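## Section 6: Push Results to Git

Commit the checkpoints and logs produced by training and push them to the remote repository.

In [None]:
%%bash

# Navigate to the directory that holds the checkpoints and logs
# (the path is relative to the notebook's working directory)
cd vast_ai_checkpoints_and_logs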

# Add your files
git add .

# Commit
git commit -m "Training results"

git push


## Notes for Vast.ai Usage

1. **Run cells sequentially**: Execute each cell from top to bottom
2. **Keep the notebook open**: Training started from a cell depends on the notebook kernel staying alive, so stay connected and avoid shutting down the kernel or the instance mid-run
3. **Checkpoints**: Model checkpoints are saved in the `checkpoints/` directory
4. **Logs**: With the `tee` line in Section 4 enabled, training output is saved in the `logs/` directory with a timestamp in the filename
5. **GPU Memory**: Monitor GPU memory with `nvidia-smi` in Section 5
6. **Long Training**: For extended training sessions, consider using `tmux` or `screen`:
   ```bash
   tmux new-session -d -s training "cd /root/function-approximated-RL && python Train.py"
   tmux attach -t training
   ```
7. **Data Backup**: Copy important files to your local machine periodically to avoid data loss
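
For the backup step, one option is to bundle the output directories into a single timestamped archive before copying it off the instance. This is a sketch under assumptions: `make_backup` is a hypothetical helper, and the directory names follow the `checkpoints/` and `logs/` paths mentioned in the notes above. A temporary directory with dummy files stands in for the project.

```bash
# Hypothetical helper: archive checkpoints/ and logs/ from a project directory
# into a timestamped tarball and print the archive path.
make_backup() {
    project_dir=$1
    out_dir=$2
    archive="$out_dir/backup_$(date +%Y%m%d_%H%M%S).tar.gz"
    mkdir -p "$out_dir"
    tar -czf "$archive" -C "$project_dir" checkpoints logs && echo "$archive"
}

# Demonstration on a throwaway directory with dummy files.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/checkpoints" "$demo_dir/logs"
echo "weights" > "$demo_dir/checkpoints/model.pt"
echo "step 1" > "$demo_dir/logs/run.log"

backup_path="$(make_backup "$demo_dir" "$demo_dir/backups")"
tar -tzf "$backup_path"   # list the archive contents
```

On the instance this would be called as `make_backup /root/function-approximated-RL /root/backups`, after which the archive can be copied down with `scp` or a similar tool.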