Reproducibility of two YOLOv7 identical train jobs #1144

valentinitnelav opened this issue Nov 21, 2022 · 1 comment

I would like to find out how to ensure reproducible results when I run two identical training jobs using multiple GPUs of a single node.

For example, here is a training setup where I ask the cluster scheduler to assign a single node that has 4 GPUs and 16 available CPUs. Of the 16 allocated CPUs, I use 3 per GPU as data-loader workers, and 4 should remain "free".

Say a file named yolov7_baseline_train.sh contains the instructions for the cluster, like this:

#!/bin/bash
#SBATCH --job-name=tr_y7 # Job name;
#SBATCH --partition=clara # Request a certain partition;
#SBATCH --nodes=1 # Number of nodes;
#SBATCH --cpus-per-task=16 # Number of CPUs;
#SBATCH --gres=gpu:rtx2080ti:4 # Type and number of GPUs;
#SBATCH --mem-per-gpu=11G # RAM per GPU;
#SBATCH --time=6:00:00 # requested time in d-hh:mm:ss
#SBATCH --output=/path/to/detectors/logs_train_jobs/%j.log # path for job-id.log file;
#SBATCH --error=/path/to/detectors/logs_train_jobs/%j.err # path for job-id.err file;
#SBATCH --mail-type=BEGIN,TIME_LIMIT,END # email options;


# Delete any cache files in the dataset folder that might have been
# created by previous jobs.
# This is important when using different YOLO versions.
# See https://github.com/WongKinYiu/yolov7/blob/main/README.md#training
rm --force ~/path/to/data/img-sample/*.cache


# Start with a clean environment
module purge
# Load the needed modules from the software tree (same ones used when we created the environment)
module load Python/3.9.6-GCCcore-11.2.0
# Activate virtual environment
source ~/path/to/venv/yolov7/bin/activate

# Train YOLO by calling train.py
cd ~/path/to/detectors/yolov7 # where I cloned the yolov7 github repository

python -m torch.distributed.launch --nproc_per_node 4 train.py \
--sync-bn \
--weights weights_v0_1/yolov7-tiny.pt \
--data ~/path/to/data/img-sample/data_yolo.yaml \
--hyp data/hyp.scratch.tiny.yaml \
--epochs 300 \
--batch-size 128 \
--img-size 640 640 \
--workers 3 \
--name yolov7_tiny_img640_b32_e300_hyp_scratchtiny_"$SLURM_JOB_ID"
# The per-GPU batch size is the total --batch-size divided by the number of GPUs

# Deactivate virtual environment
deactivate
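
For clarity, here is a small Python sketch of the arithmetic implied by the script above (the values are copied from the SLURM directives and the train.py flags; the per-GPU split matches batch_size=32 vs. total_batch_size=128 reported in the logs further below):

gpus = 4                        # --gres=gpu:rtx2080ti:4 and --nproc_per_node 4
workers_per_gpu = 3             # --workers 3
cpus_requested = 16             # --cpus-per-task=16
total_batch_size = 128          # --batch-size 128

per_gpu_batch = total_batch_size // gpus     # 128 // 4 = 32
loader_cpus = gpus * workers_per_gpu         # 12 CPUs used as data-loader workers
spare_cpus = cpus_requested - loader_cpus    # 4 CPUs left "free"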

I expected to get identical results when submitting the job script twice to the cluster with sbatch ~/path/to/yolov7_baseline_train.sh.

However, the two runs give worryingly different results.
For example, below are the two confusion matrices on the validation dataset. Just compare the main diagonals and you will see striking differences.
Could you help me understand what is happening?

[Confusion matrix from job 589028: confusion_matrix_yolov7_tiny_img640_b32_e300_hyp_scratchtiny_589028.png]

[Confusion matrix from job 592454: confusion_matrix_yolov7_tiny_img640_b32_e300_hyp_scratchtiny_592454.png]

The *.err log files from each job show identical model configurations:

589028.err

Namespace(weights='weights_v0_1/yolov7-tiny.pt', cfg='', data='/home/cluster-name/user-name/prj-name/data/img-sample/data_yolo.yaml', hyp='data/hyp.scratch.tiny.yaml', epochs=300, batch_size=32, img_size=[640, 640], rect=False, resume=False, nosave=False, notest=False, noautoanchor=False, evolve=False, bucket='', cache_images=False, image_weights=False, device='', multi_scale=False, single_cls=False, adam=False, sync_bn=True, local_rank=0, workers=3, project='runs/train', entity=None, name='yolov7_tiny_img640_b32_e300_hyp_scratchtiny_589028', exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias='latest', freeze=[0], v5_metric=False, world_size=4, global_rank=0, save_dir='runs/train/yolov7_tiny_img640_b32_e300_hyp_scratchtiny_589028', total_batch_size=128)
tensorboard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.05, copy_paste=0.0, paste_in=0.05, loss_ota=1
Overriding model.yaml nc=80 with nc=12

592454.err

Namespace(weights='weights_v0_1/yolov7-tiny.pt', cfg='', data='/home/cluster-name/user-name/prj-name/data/img-sample/data_yolo.yaml', hyp='data/hyp.scratch.tiny.yaml', epochs=300, batch_size=32, img_size=[640, 640], rect=False, resume=False, nosave=False, notest=False, noautoanchor=False, evolve=False, bucket='', cache_images=False, image_weights=False, device='', multi_scale=False, single_cls=False, adam=False, sync_bn=True, local_rank=0, workers=3, project='runs/train', entity=None, name='yolov7_tiny_img640_b32_e300_hyp_scratchtiny_592454', exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias='latest', freeze=[0], v5_metric=False, world_size=4, global_rank=0, save_dir='runs/train/yolov7_tiny_img640_b32_e300_hyp_scratchtiny_592454', total_batch_size=128)
tensorboard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.05, copy_paste=0.0, paste_in=0.05, loss_ota=1
Overriding model.yaml nc=80 with nc=12

vtyw commented Jul 18, 2023

If you want a detector that can be trained deterministically, you'd probably have to look for one that advertises that as a specific feature. I would expect it to be rare.

In terms of the confusion matrices being "worryingly different": if the training hyperparameters are suitable, you would expect the two training runs to produce weights that perform comparably overall. It is entirely normal for the two networks to perform differently on individual classes while having similar overall performance.
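
As a rough illustration, below is a minimal sketch of the usual PyTorch determinism settings, assuming a hypothetical seed_everything helper called at the start of train.py (YOLOv7 does not provide this out of the box, and even with all of these set, multi-GPU DDP runs can still diverge because of non-deterministic CUDA kernels and floating-point reduction order):

import os
import random

import numpy as np
import torch


def seed_everything(seed: int = 0) -> None:
    # Hypothetical helper, not part of yolov7: pin the common RNGs.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Required for deterministic cuBLAS behaviour on CUDA >= 10.2;
    # must be set before the first cuBLAS call.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    # Prefer deterministic cuDNN kernels and disable autotuning.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Warn (or raise, with warn_only=False) when an op has no
    # deterministic implementation (warn_only needs PyTorch >= 1.11).
    torch.use_deterministic_algorithms(True, warn_only=True)

Even then, bit-identical weights from two 4-GPU runs are hard to guarantee, so some run-to-run variation in the per-class numbers is to be expected.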
