
Error at Training places512 #53

Open
Albert-learner opened this issue Jan 5, 2023 · 6 comments
@Albert-learner

Hello, I'm trying to train the MAT model on the Places365 dataset.
I'm running into an error similar to #48.
My error message is shown in the screenshot below:
[Screenshot from 2023-01-05 14-07-15]

This is the command I ran:

```
python train.py \
    --outdir=places365_train_large \
    --gpus=4 \
    --batch=32 \
    --metrics=fid36k5_full \
    --data=/home/adminuser/Jabblee/MAT/Places365/train_large \
    --data_val=/home/adminuser/Jabblee/MAT/Places365/val_large \
    --dataloader=datasets.dataset_512.ImageFolderMaskDataset \
    --mirror=False \
    --cond=False \
    --cfg=places512 \
    --aug=noaug \
    --generator=networks.mat.Generator \
    --discriminator=networks.mat.Discriminator \
    --loss=losses.loss.TwoStageLoss \
    --pr=0.1 \
    --pl=False \
    --truncation=0.5 \
    --style_mix=0.5 \
    --ema=10 \
    --lr=0.001
```

I have 10,950 images for training, 4,050 for validation, and 5,000 for testing.
I'm training the MAT model on 4 A100 GPUs.
I also have some questions about training:

  1. In issue questions with places512 trainning #48, you said that training the MAT model with the fid36k5 metric requires 36.5k images.
     Does this mean 36.5k images across train and val combined, or 36.5k training images alone?
  2. If I want to use 20k images instead, do I only need to change the metric, or other things as well?
     I changed the fid36k5_full function in metric_main.py like this:

     ```
     @register_metric
     def fid36k5_full(opts):
         opts.dataset_kwargs.update(max_size=None, xflip=False)
         fid = frechet_inception_distance.compute_fid(opts, max_real=20000, num_gen=20000)  # Here is Error Point 2.
         return dict(fid36k5_full=fid)
     ```
Albert-learner changed the title from "Error at Training" to "Error at Training places512" on Jan 5, 2023
@fenglinglwb
Owner

Thanks for your attention. Yes, you just need to redefine the metric.

@Albert-learner
Author

Thank you for your answer. I have another question about training the MAT model on Places512:
where can I find the training epoch of this model? Thank you for your quick answer.

@fenglinglwb
Owner

We use 'iteration' as a tag instead of 'epoch'. The trained models are saved in the log folder.

@Albert-learner
Author

Albert-learner commented Jan 27, 2023

Hello, I'd like to use your model on a different dataset.
The images are RGBA (each batch has shape (32, 4, 512, 512)), and I reshape them to (1, 128, 512, 512).
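For reference, the reshape described above can be sketched in NumPy as folding the batch dimension into the channel dimension (illustrative shapes only; this is not the actual dataset code):

```python
import numpy as np

# Fold a batch of 32 RGBA frames, shape (32, 4, 512, 512),
# into a single 128-channel tensor of shape (1, 128, 512, 512).
frames = np.zeros((32, 4, 512, 512), dtype=np.float32)
folded = frames.reshape(1, 32 * 4, 512, 512)
print(folded.shape)  # (1, 128, 512, 512)
```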
But FID cannot be computed on these images, so I commented out the parts of training_loop.py that calculate FID scores:
```
...

# Evaluate metrics.
# if (snapshot_data is not None) and (len(metrics) > 0):
#     if rank == 0:
#         print('Evaluating metrics...')
#     for metric in metrics:
#         result_dict = metric_main.calc_metric(metric=metric, G=snapshot_data['G_ema'],
#             dataset_kwargs=val_set_kwargs, num_gpus=num_gpus, rank=rank, device=device)  # Here is Error Point.
#         if rank == 0:
#             metric_main.report_metric(result_dict, run_dir=run_dir, snapshot_pkl=snapshot_pkl)
#         stats_metrics.update(result_dict.results)
# del snapshot_data # conserve memory

...
```
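As an alternative to commenting the block out entirely, the metric evaluation could be skipped only for non-RGB data, since the Inception network behind FID expects 3-channel input. A hypothetical sketch of such a guard (should_eval_metrics is my own name, not a function from the repo):

```python
def should_eval_metrics(snapshot_data, metrics, image_channels):
    # Hypothetical guard: only evaluate FID-style metrics when a snapshot
    # exists, metrics were requested, and images are 3-channel RGB
    # (the Inception network used by FID expects RGB input).
    return snapshot_data is not None and len(metrics) > 0 and image_channels == 3

print(should_eval_metrics({'G_ema': object()}, ['fid36k5_full'], 3))    # True
print(should_eval_metrics({'G_ema': object()}, ['fid36k5_full'], 128))  # False
```

Wrapping the original block in such a condition would keep FID working for RGB runs while skipping it for the 128-channel data.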

But when I do this, a different problem occurs. When I run train.py, I get the following output:

```

Output directory:   realestate10k_train/00014-train-places512-mat-lr0.001-TwoStageLoss-pr0.1-nopl-batch32-tc0.5-sm0.5-ema10-noaug
Training data:      /work/MAT/Single_View_MPI_results/RealEstate10K/train
Training duration:  15000 kimg
Number of GPUs:     4
Number of images:   15000
Image resolution:   512
Conditional model:  False
Dataset x-flips:    False

Validation options:
Validation data:      /work/MAT/Single_View_MPI_results/RealEstate10K/test
Number of images:   5000
Image resolution:   512
Conditional model:  False
Dataset x-flips:    False

Creating output directory...
Launching processes...
Loading training set...

Num images:  15000
Image shape: [128, 512, 512]
Label shape: [0]

Constructing networks...
Setting up augmentation...
Distributing across 4 GPUs...
Setting up training phases...
Exporting sample images...
Setting up PyTorch plugin "bias_act_plugin"... Done.
Setting up PyTorch plugin "upfirdn2d_plugin"... 
```

I waited for more than an hour, but there was no progress after the last line ("Setting up PyTorch plugin "upfirdn2d_plugin"... ").
I've run this several times on the modified dataset, but even after several rounds of changes it still doesn't work. And whenever I start training, this warning keeps appearing:

UserWarning: semaphore_tracker: There appear to be 34 leaked semaphores to clean up at shutdown


Could you tell me why it gets stuck on the last line?

@fenglinglwb
Owner

fenglinglwb commented Feb 2, 2023

This may be caused by the compilation lock. You may try deleting the cache files in .cache/xxx and recompiling.
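For instance, on Linux the compiled plugin cache typically lives under ~/.cache/torch_extensions (overridden by the TORCH_EXTENSIONS_DIR environment variable if set). A small helper to remove it so the plugins recompile cleanly on the next run might look like this (illustrative; clear_torch_extension_cache is my own name, and you should adjust the path to match your setup):

```python
import pathlib
import shutil

def clear_torch_extension_cache(home=None):
    # Remove the PyTorch C++ extension build cache so that plugins such as
    # bias_act_plugin and upfirdn2d_plugin recompile (and re-lock) cleanly.
    # ~/.cache/torch_extensions is the usual default location on Linux.
    base = pathlib.Path(home) if home else pathlib.Path.home()
    cache = base / ".cache" / "torch_extensions"
    if cache.exists():
        shutil.rmtree(cache)
    return cache
```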

@Albert-learner
Author

Hello, I'm trying to train the Mask-Aware Transformer model on my custom dataset, but it takes more than 30 days.

I'd like to know how to make MAT training faster, and which parts of training take the longest.

My custom dataset consists of 640,000 RGBA images.
