I would like to find out how to ensure reproducibility of results when I run two identical training jobs using multiple GPUs of a node.
For example, here is a training process where I ask the cluster scheduler to assign a single node with 4 GPUs and 16 available CPUs. Of the 16 allocated CPUs, I use 3 per GPU as data loader workers, and 4 should remain "free".
Say, a file named yolov7_baseline_train.sh contains the instructions for the cluster, like this:
#!/bin/bash
#SBATCH --job-name=tr_y7 # Job name;
#SBATCH --partition=clara # Request a certain partition;
#SBATCH --nodes=1 # Number of nodes;
#SBATCH --cpus-per-task=16 # Number of CPUs;
#SBATCH --gres=gpu:rtx2080ti:4 # Type and number of GPUs;
#SBATCH --mem-per-gpu=11G # RAM per GPU;
#SBATCH --time=6:00:00 # Requested time in d-hh:mm:ss;
#SBATCH --output=/path/to/detectors/logs_train_jobs/%j.log # Path for job-id.log file;
#SBATCH --error=/path/to/detectors/logs_train_jobs/%j.err # Path for job-id.err file;
#SBATCH --mail-type=BEGIN,TIME_LIMIT,END # Email options;

# Delete any cache files in the dataset folder that might have been created
# by previous jobs. This is important when using different YOLO versions.
# See https://github.com/WongKinYiu/yolov7/blob/main/README.md#training
rm --force ~/path/to/data/img-sample/*.cache
# Start with a clean environment
module purge
# Load the needed modules from the software tree (same ones used when we created the environment)
module load Python/3.9.6-GCCcore-11.2.0
# Activate virtual environment
source ~/path/to/venv/yolov7/bin/activate
# Train YOLO by calling train.py
cd ~/path/to/detectors/yolov7 # where I cloned the yolov7 GitHub repository
python -m torch.distributed.launch --nproc_per_node 4 train.py \
--sync-bn \
--weights weights_v0_1/yolov7-tiny.pt \
--data ~/path/to/data/img-sample/data_yolo.yaml \
--hyp data/hyp.scratch.tiny.yaml \
--epochs 300 \
--batch-size 128 \
--img-size 640 640 \
--workers 3 \
--name yolov7_tiny_img640_b32_e300_hyp_scratchtiny_"$SLURM_JOB_ID"
# The per-GPU batch size is the total --batch-size divided by the number of GPUs.

# Deactivate virtual environment
deactivate
I expected to get identical results when I submit the job script twice to the cluster with sbatch ~/path/to/yolov7_baseline_train.sh.
However, I get worryingly different results.
For example, the two confusion matrices on the validation dataset (confusion_matrix_yolov7_tiny_img640_b32_e300_hyp_scratchtiny_589028.png and confusion_matrix_yolov7_tiny_img640_b32_e300_hyp_scratchtiny_592454.png) look strikingly different; just compare the main diagonals. The *.err log files from the two jobs (589028.err and 592454.err) indicate identical model configurations.
Could you help me understand what is happening?
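In case it helps, one way to quantify how far apart the two runs end up would be to compare the saved checkpoints parameter by parameter, roughly like the sketch below. The run directories are placeholders and the "model" key is an assumption about how YOLOv7's best.pt is structured, so it may need adjusting (and it has to be run from inside the cloned yolov7 repository so the pickled model classes can be found).

# Sketch: compare the weights of two training runs parameter by parameter.
# Paths and checkpoint layout are assumptions and may need adjusting.
import torch

ckpt_a = torch.load("runs/train/<first_run>/weights/best.pt", map_location="cpu")
ckpt_b = torch.load("runs/train/<second_run>/weights/best.pt", map_location="cpu")

state_a = ckpt_a["model"].float().state_dict()
state_b = ckpt_b["model"].float().state_dict()

for name, tensor_a in state_a.items():
    max_diff = (tensor_a - state_b[name]).abs().max().item()
    if max_diff > 0:
        print(f"{name}: max abs difference {max_diff:.6g}")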
If you want a detector that can be trained deterministically, you'd probably have to look for one that advertises that as a specific feature. I would expect it to be rare.
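For a sense of what that would involve: fully deterministic PyTorch training typically requires seeding every RNG and opting into deterministic kernels, roughly like the sketch below (illustrative only; as far as I know this is not something YOLOv7's train.py exposes as a flag, and some CUDA ops have no deterministic implementation at all, so torch.use_deterministic_algorithms(True) may simply raise an error).

# Illustrative sketch of the usual PyTorch determinism settings (not YOLOv7 code).
import os
import random

import numpy as np
import torch

def make_deterministic(seed: int = 0) -> None:
    # cuBLAS needs this set before the first CUDA call to allow deterministic matmuls.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    # Seed every RNG the training pipeline may draw from.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Prefer deterministic cuDNN kernels and disable autotuning.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Error out on ops that have no deterministic implementation.
    torch.use_deterministic_algorithms(True)

Even with all of this in place, DataLoader workers and the distributed launch add further sources of randomness that have to be controlled, which is part of why bit-identical multi-GPU runs are rare in practice.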
In terms of the confusion matrices being "worryingly different", you would expect that the two training runs produce weights that perform comparably overall if the training hyperparameters are suitable. It's totally valid if the two networks perform differently on each class but have a similar overall performance.
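As a rough illustration of that last point, here are two hypothetical (made-up) confusion matrices whose per-class diagonals differ noticeably while the overall accuracy is identical.

# Two hypothetical 3-class confusion matrices (rows = true class,
# columns = predicted class); the numbers are invented for illustration.
import numpy as np

def overall_accuracy(confusion: np.ndarray) -> float:
    # Fraction of all samples that lie on the main diagonal.
    return float(np.trace(confusion) / confusion.sum())

run_a = np.array([[80, 10, 10],
                  [ 5, 90,  5],
                  [15, 15, 70]])

run_b = np.array([[90,  5,  5],
                  [10, 75, 15],
                  [10, 15, 75]])

print(overall_accuracy(run_a))  # 0.8
print(overall_accuracy(run_b))  # 0.8

The per-class numbers differ, yet the overall performance is the same; that is the sense in which two non-identical trainings can still be comparable.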