Error at Training places512 #53
Comments
Thanks for your attention. Yes, you just need to redefine the metric.
Thank you for your answer. I have another question about training the MAT model on Places512.
We use 'iteration' as a tag instead of 'epoch'. The trained models are saved in the log folder.
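For readers used to counting epochs, a rough conversion is possible. The sketch below assumes that one iteration processes one batch of `--batch` images, which this thread does not confirm. `iterations_per_epoch` and `epochs_completed` are hypothetical helpers, not part of the repository.

```python
import math


def iterations_per_epoch(num_train_images: int, batch_size: int) -> int:
    """Iterations needed to see every training image once.

    Assumes one iteration consumes one batch; the last batch may be partial.
    """
    if num_train_images <= 0 or batch_size <= 0:
        raise ValueError("both arguments must be positive")
    return math.ceil(num_train_images / batch_size)


def epochs_completed(iteration: int, num_train_images: int, batch_size: int) -> float:
    """Approximate number of passes over the training set after `iteration` steps."""
    return iteration * batch_size / num_train_images


# Figures quoted in this issue: 10950 training images, --batch=32.
print(iterations_per_epoch(10950, 32))  # 343
```

If the repository counts iterations differently (for example in thousands of images), the same arithmetic applies after rescaling.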
Hello, I'd like to use your model on a different dataset. Evaluate metrics.
... But when I apply this, another kind of problem occurs:
UserWarning: semaphore_tracker: There appear to be 34 leaked semaphores to clean up at shutdown
This may be caused by the compilation lock. You may try deleting the cache files in .cache/xxx and recompiling.
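A minimal sketch of that cleanup step. The actual cache location is elided above (`.cache/xxx`), so the directory is left as an argument for the reader to supply. `clear_extension_cache` is a hypothetical helper, not part of the repository.

```python
import shutil
from pathlib import Path


def clear_extension_cache(cache_dir: str) -> bool:
    """Delete `cache_dir` and everything under it.

    Returns True if a directory was removed, False if it did not exist.
    The caller must pass the real cache path; none is assumed here.
    """
    path = Path(cache_dir).expanduser()
    if not path.is_dir():
        return False
    shutil.rmtree(path)
    return True
```

After the directory is removed, rerunning the training command should trigger a fresh compilation. Make sure no other training process is still running when the cache is deleted.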
Hello, I'm trying to train the Mask-Aware Transformer (MAT) model on my custom dataset, but training takes about 30 days or more. How can I train the MAT model faster, and which part of training takes the longest? My custom dataset consists of 640,000 RGBA images.
Hello, I'm trying to train the MAT model on the Places365 dataset.
I get an error similar to the one in #48.
My error message is shown in the image below.
I ran this command:
python train.py \
    --outdir=places365_train_large \
    --gpus=4 \
    --batch=32 \
    --metrics=fid36k5_full \
    --data=/home/adminuser/Jabblee/MAT/Places365/train_large \
    --data_val=/home/adminuser/Jabblee/MAT/Places365/val_large \
    --dataloader=datasets.dataset_512.ImageFolderMaskDataset \
    --mirror=False \
    --cond=False \
    --cfg=places512 \
    --aug=noaug \
    --generator=networks.mat.Generator \
    --discriminator=networks.mat.Discriminator \
    --loss=losses.loss.TwoStageLoss \
    --pr=0.1 \
    --pl=False \
    --truncation=0.5 \
    --style_mix=0.5 \
    --ema=10 \
    --lr=0.001
I have 10950 images for training, 4050 images for validation, and 5000 images for testing.
I use 4 A100 GPUs to train the MAT model.
I also have some questions about training.
Does this mean that the train and val sets together contain 36.5k images, or that the train set alone contains 36.5k images?
I changed the fid36k5_full function in metric_main.py as follows:
@register_metric
def fid36k5_full(opts):
    opts.dataset_kwargs.update(max_size=None, xflip=False)
    fid = frechet_inception_distance.compute_fid(opts, max_real=20000, num_gen=20000)  # Here is Error Point 2.
    return dict(fid36k5_full=fid)
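One thing worth checking before redefining the metric is how many images the evaluation folder actually holds. The sketch below counts image files under a directory and caps the requested sample count at that number. It rests on assumptions this thread does not confirm: that requesting more real images than exist is related to the error, and that using the same count for `max_real` and `num_gen` is acceptable. `count_images` and `fid_sample_counts` are hypothetical helpers, not part of the repository.

```python
from pathlib import Path

# Assumed image extensions; adjust to match the dataset.
IMAGE_EXTS = {".png", ".jpg", ".jpeg"}


def count_images(root: str) -> int:
    """Count image files under `root`, searching subfolders recursively."""
    return sum(
        1
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTS
    )


def fid_sample_counts(num_available: int, requested: int) -> dict:
    """Cap the requested sample count at the number of images available."""
    if num_available <= 0:
        raise ValueError("no images available for evaluation")
    n = min(num_available, requested)
    return {"max_real": n, "num_gen": n}


# Figures quoted in this issue: 4050 validation images, 20000 requested.
print(fid_sample_counts(4050, 20000))  # {'max_real': 4050, 'num_gen': 4050}
```

The resulting values could then be passed as `max_real` and `num_gen` in the redefined metric function.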