
Error at Training places512 #53

Open
Albert-learner opened this issue Jan 5, 2023 · 6 comments
@Albert-learner

Hello, I'm trying to train the MAT model on the Places365 dataset.
I'm running into an error similar to #48.
My error message is shown in the screenshot below:
[Screenshot from 2023-01-05 14-07-15]

This is the command I ran:

```
python train.py \
    --outdir=places365_train_large \
    --gpus=4 \
    --batch=32 \
    --metrics=fid36k5_full \
    --data=/home/adminuser/Jabblee/MAT/Places365/train_large \
    --data_val=/home/adminuser/Jabblee/MAT/Places365/val_large \
    --dataloader=datasets.dataset_512.ImageFolderMaskDataset \
    --mirror=False \
    --cond=False \
    --cfg=places512 \
    --aug=noaug \
    --generator=networks.mat.Generator \
    --discriminator=networks.mat.Discriminator \
    --loss=losses.loss.TwoStageLoss \
    --pr=0.1 \
    --pl=False \
    --truncation=0.5 \
    --style_mix=0.5 \
    --ema=10 \
    --lr=0.001
```

I have 10,950 images for training, 4,050 for validation, and 5,000 for testing.
I'm training the MAT model on 4 A100 GPUs.
I also have some questions about training:

  1. In issue questions with places512 trainning #48, you said that training the MAT model with the fid36k5 metric requires 36.5k images.
     Does this mean 36.5k images across train and val combined, or 36.5k training images alone?
  2. If I want to use 20k images instead, do I only need to change the metric, or other things as well?
     I changed the fid36k5_full function in metric_main.py like this:

     ```
     @register_metric
     def fid36k5_full(opts):
         opts.dataset_kwargs.update(max_size=None, xflip=False)
         fid = frechet_inception_distance.compute_fid(opts, max_real=20000, num_gen=20000)  # Here is Error Point 2.
         return dict(fid36k5_full=fid)
     ```
Albert-learner changed the title from "Error at Training" to "Error at Training places512" on Jan 5, 2023
@fenglinglwb
Owner

Thanks for your attention. Yes, you just need to redefine the metric.

@Albert-learner
Author

Thank you for your answer. I have another question about training the MAT model on Places512:
where can I find the training epoch of this model? Thank you for your quick answer.

@fenglinglwb
Owner

We use 'iteration' as a tag instead of 'epoch'. The trained models are saved in the log folder.

@Albert-learner
Author

Albert-learner commented Jan 27, 2023

Hello, I'd like to use your model on a different dataset.
The images are RGBA (each batch has shape (32, 4, 512, 512)), and I reshape them to (1, 128, 512, 512).
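For reference, the reshape described above can be sketched in NumPy as folding the batch dimension into the channel dimension (illustrative shapes only; this is not the actual dataset code):

```python
import numpy as np

# Fold a batch of 32 RGBA frames, shape (32, 4, 512, 512),
# into a single 128-channel tensor of shape (1, 128, 512, 512).
frames = np.zeros((32, 4, 512, 512), dtype=np.float32)
folded = frames.reshape(1, 32 * 4, 512, 512)
print(folded.shape)  # (1, 128, 512, 512)
```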
But FID cannot be computed on these images, so I commented out the parts of training_loop.py that calculate FID scores:
```
...

# Evaluate metrics.
# if (snapshot_data is not None) and (len(metrics) > 0):
#     if rank == 0:
#         print('Evaluating metrics...')
#     for metric in metrics:
#         result_dict = metric_main.calc_metric(metric=metric, G=snapshot_data['G_ema'],
#             dataset_kwargs=val_set_kwargs, num_gpus=num_gpus, rank=rank, device=device)  # Here is Error Point.
#         if rank == 0:
#             metric_main.report_metric(result_dict, run_dir=run_dir, snapshot_pkl=snapshot_pkl)
#         stats_metrics.update(result_dict.results)
# del snapshot_data # conserve memory

...
```
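As an alternative to commenting the block out entirely, the metric evaluation could be skipped only for non-RGB data, since the Inception network behind FID expects 3-channel input. A hypothetical sketch of such a guard (should_eval_metrics is my own name, not a function from the repo):

```python
def should_eval_metrics(snapshot_data, metrics, image_channels):
    # Hypothetical guard: only evaluate FID-style metrics when a snapshot
    # exists, metrics were requested, and images are 3-channel RGB
    # (the Inception network used by FID expects RGB input).
    return snapshot_data is not None and len(metrics) > 0 and image_channels == 3

print(should_eval_metrics({'G_ema': object()}, ['fid36k5_full'], 3))    # True
print(should_eval_metrics({'G_ema': object()}, ['fid36k5_full'], 128))  # False
```

Wrapping the original block in such a condition would keep FID working for RGB runs while skipping it for the 128-channel data.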

But when I do this, a different problem occurs. When I run train.py, I get the following output:

```

Output directory:   realestate10k_train/00014-train-places512-mat-lr0.001-TwoStageLoss-pr0.1-nopl-batch32-tc0.5-sm0.5-ema10-noaug
Training data:      /work/MAT/Single_View_MPI_results/RealEstate10K/train
Training duration:  15000 kimg
Number of GPUs:     4
Number of images:   15000
Image resolution:   512
Conditional model:  False
Dataset x-flips:    False

Validation options:
Validation data:      /work/MAT/Single_View_MPI_results/RealEstate10K/test
Number of images:   5000
Image resolution:   512
Conditional model:  False
Dataset x-flips:    False

Creating output directory...
Launching processes...
Loading training set...

Num images:  15000
Image shape: [128, 512, 512]
Label shape: [0]

Constructing networks...
Setting up augmentation...
Distributing across 4 GPUs...
Setting up training phases...
Exporting sample images...
Setting up PyTorch plugin "bias_act_plugin"... Done.
Setting up PyTorch plugin "upfirdn2d_plugin"... 
```

I waited for more than an hour, but there was no progress after the last line ("Setting up PyTorch plugin "upfirdn2d_plugin"... ").
I've run this several times on the modified dataset, but even after several rounds of changes it still doesn't work. And whenever I start training, this warning keeps appearing:

UserWarning: semaphore_tracker: There appear to be 34 leaked semaphores to clean up at shutdown


Could you tell me why it gets stuck on the last line?

@fenglinglwb
Owner

fenglinglwb commented Feb 2, 2023

This may be caused by the compilation lock. You may try deleting the cache files in .cache/xxx and recompiling.
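For instance, on Linux the compiled plugin cache typically lives under ~/.cache/torch_extensions (overridden by the TORCH_EXTENSIONS_DIR environment variable if set). A small helper to remove it so the plugins recompile cleanly on the next run might look like this (illustrative; clear_torch_extension_cache is my own name, and you should adjust the path to match your setup):

```python
import pathlib
import shutil

def clear_torch_extension_cache(home=None):
    # Remove the PyTorch C++ extension build cache so that plugins such as
    # bias_act_plugin and upfirdn2d_plugin recompile (and re-lock) cleanly.
    # ~/.cache/torch_extensions is the usual default location on Linux.
    base = pathlib.Path(home) if home else pathlib.Path.home()
    cache = base / ".cache" / "torch_extensions"
    if cache.exists():
        shutil.rmtree(cache)
    return cache
```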

@Albert-learner
Author

Hello, I'm trying to train the Mask-Aware Transformer model on my custom dataset, but it takes more than 30 days.

I'd like to know how to make MAT training faster, and which parts of training take the longest.

My custom dataset consists of 640,000 RGBA images.
