I would like to find out how to ensure reproducibility of results when I run two identical training jobs using multiple GPUs of a node.
For example, here is a training process where I ask the cluster scheduler to assign a single node with 4 GPUs and 16 available CPUs. Of the 16 allocated CPUs, I use 3 per GPU as data loader workers, and 4 should remain "free".
Say, a file named yolov7_baseline_train.sh contains the instructions for the cluster, like this:
#!/bin/bash
#SBATCH --job-name=tr_y7 # Job name;
#SBATCH --partition=clara # Request a certain partition;
#SBATCH --nodes=1 # Number of nodes;
#SBATCH --cpus-per-task=16 # Number of CPUs;
#SBATCH --gres=gpu:rtx2080ti:4 # Type and number of GPUs;
#SBATCH --mem-per-gpu=11G # RAM per GPU;
#SBATCH --time=6:00:00 # Requested time in d-hh:mm:ss;
#SBATCH --output=/path/to/detectors/logs_train_jobs/%j.log # Path for job-id.log file;
#SBATCH --error=/path/to/detectors/logs_train_jobs/%j.err # Path for job-id.err file;
#SBATCH --mail-type=BEGIN,TIME_LIMIT,END # Email options;

# Delete any cache files in the dataset folder that might have been created
# by previous jobs. This is important when using different YOLO versions.
# See https://github.com/WongKinYiu/yolov7/blob/main/README.md#training
rm --force ~/path/to/data/img-sample/*.cache
# Start with a clean environment
module purge
# Load the needed modules from the software tree (same ones used when we created the environment)
module load Python/3.9.6-GCCcore-11.2.0
# Activate virtual environment
source ~/path/to/venv/yolov7/bin/activate
# Train YOLO by calling train.py
cd ~/path/to/detectors/yolov7 # where I cloned the yolov7 GitHub repository
python -m torch.distributed.launch --nproc_per_node 4 train.py \
--sync-bn \
--weights weights_v0_1/yolov7-tiny.pt \
--data ~/path/to/data/img-sample/data_yolo.yaml \
--hyp data/hyp.scratch.tiny.yaml \
--epochs 300 \
--batch-size 128 \
--img-size 640 640 \
--workers 3 \
--name yolov7_tiny_img640_b32_e300_hyp_scratchtiny_"$SLURM_JOB_ID"
# The per-GPU batch size is the total --batch-size divided by the number of GPUs.

# Deactivate virtual environment
deactivate
I expected to get identical results when I submit the job script twice to the cluster with sbatch ~/path/to/yolov7_baseline_train.sh.
However, I get worryingly different results.
For example, the two confusion matrices on the validation dataset (confusion_matrix_yolov7_tiny_img640_b32_e300_hyp_scratchtiny_589028.png and confusion_matrix_yolov7_tiny_img640_b32_e300_hyp_scratchtiny_592454.png) look strikingly different; just compare the main diagonals. The *.err log files from the two jobs (589028.err and 592454.err) indicate identical model configurations.
Could you help me understand what is happening?
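In case it helps, one way to quantify how far apart the two runs end up would be to compare the saved checkpoints parameter by parameter, roughly like the sketch below. The run directories are placeholders and the "model" key is an assumption about how YOLOv7's best.pt is structured, so it may need adjusting (and it has to be run from inside the cloned yolov7 repository so the pickled model classes can be found).

# Sketch: compare the weights of two training runs parameter by parameter.
# Paths and checkpoint layout are assumptions and may need adjusting.
import torch

ckpt_a = torch.load("runs/train/<first_run>/weights/best.pt", map_location="cpu")
ckpt_b = torch.load("runs/train/<second_run>/weights/best.pt", map_location="cpu")

state_a = ckpt_a["model"].float().state_dict()
state_b = ckpt_b["model"].float().state_dict()

for name, tensor_a in state_a.items():
    max_diff = (tensor_a - state_b[name]).abs().max().item()
    if max_diff > 0:
        print(f"{name}: max abs difference {max_diff:.6g}")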
If you want a detector that can be trained deterministically, you'd probably have to look for one that advertises that as a specific feature. I would expect it to be rare.
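For a sense of what that would involve: fully deterministic PyTorch training typically requires seeding every RNG and opting into deterministic kernels, roughly like the sketch below (illustrative only; as far as I know this is not something YOLOv7's train.py exposes as a flag, and some CUDA ops have no deterministic implementation at all, so torch.use_deterministic_algorithms(True) may simply raise an error).

# Illustrative sketch of the usual PyTorch determinism settings (not YOLOv7 code).
import os
import random

import numpy as np
import torch

def make_deterministic(seed: int = 0) -> None:
    # cuBLAS needs this set before the first CUDA call to allow deterministic matmuls.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    # Seed every RNG the training pipeline may draw from.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Prefer deterministic cuDNN kernels and disable autotuning.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Error out on ops that have no deterministic implementation.
    torch.use_deterministic_algorithms(True)

Even with all of this in place, DataLoader workers and the distributed launch add further sources of randomness that have to be controlled, which is part of why bit-identical multi-GPU runs are rare in practice.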
In terms of the confusion matrices being "worryingly different", you would expect that the two training runs produce weights that perform comparably overall if the training hyperparameters are suitable. It's totally valid if the two networks perform differently on each class but have a similar overall performance.
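As a rough illustration of that last point, here are two hypothetical (made-up) confusion matrices whose per-class diagonals differ noticeably while the overall accuracy is identical.

# Two hypothetical 3-class confusion matrices (rows = true class,
# columns = predicted class); the numbers are invented for illustration.
import numpy as np

def overall_accuracy(confusion: np.ndarray) -> float:
    # Fraction of all samples that lie on the main diagonal.
    return float(np.trace(confusion) / confusion.sum())

run_a = np.array([[80, 10, 10],
                  [ 5, 90,  5],
                  [15, 15, 70]])

run_b = np.array([[90,  5,  5],
                  [10, 75, 15],
                  [10, 15, 75]])

print(overall_accuracy(run_a))  # 0.8
print(overall_accuracy(run_b))  # 0.8

The per-class numbers differ, yet the overall performance is the same; that is the sense in which two non-identical trainings can still be comparable.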