The results of LFFont are not satisfactory on a custom Korean dataset #3

Closed
ammar-deep opened this issue Jan 4, 2022 · 12 comments

ammar-deep commented Jan 4, 2022

I am trying to train LFFont on a custom Korean dataset consisting of 60 printed font styles.

For phase 1 training I use all the default configurations for the data (cfgs/data) and LFFont (cfgs/LF/p1), except that the batch size is set to 4 and the number of workers to 8, on a single 3080 Ti GPU with 12 GB. Training runs normally for 200k iterations. The results are OK, considering that phase 2 training will improve them further.

When I train the model for phase 2, I run into an OOM error with the default configuration, even though the default batch size is 1. With some alterations to the p2 default configuration file (default.yaml), setting num_workers to 2 and emb_dim to 6 (roughly as sketched below), I can finally train the model.
However, the training results are really bad and do not seem to improve up to 200k iterations (in the paper, I think p2 was trained for 50k iterations for Korean characters). I personally assume this is probably due to the embedding dimension I adopted (6); however, the paper shows that lowering it (from 8 to 6) does not affect performance much.
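
Concretely, the alterations were roughly the following (a minimal sketch of my own overrides; the complete configuration files are posted further down in this thread):

cfgs/LF/p2/default.yaml (only the fields I changed)

gen:
  emb_dim: 6        # lowered to 6 (the paper reports little difference between 8 and 6)

dset:
  loader:
    batch_size: 1   # the phase 2 default
    num_workers: 2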

So, I tried to train p2 on multiple GPUs by only setting use_ddp to True in p2/train.yaml and gpus_per_node to 3 (in the train_LF file), but I still get OOM.

Could you please help me with the following:

  1. How can I train the model on multiple GPUs? Do I need to change any parameters other than use_ddp: True and gpus_per_node=3?
  2. Do you think the poor performance of my trained model is due to lowering emb_dim and num_workers?
  3. For phase 2 training, do we need to pass the last p1 checkpoint as the --resume value?

BTW, I trained FUNIT with your provided source code on the same dataset and it performs well, so the problem is definitely not down to the dataset I am using.

8uos (Collaborator) commented Jan 7, 2022

Hi, sorry for the late reply.

  1. We control the number of images for each iteration by setting the values of n_in_s, n_in_c, and n_trg here.
    I recommend first changing the n_trg value to a smaller one; if that still does not work, change the other two values (n_in_s and n_in_c) to 2.
    If you want to train the model on multiple GPUs, it is correct to set use_ddp to True and gpus_per_node to your own value.
    However, this will not reduce the number of images per GPU, because the batch_size is already 1. (A sketch of these overrides follows this list.)

  2. We used a small emb_dim value (3 or 4) for the Korean dataset. num_workers can be any value that suits your machine.

  3. (Important) That is right! I forgot to mention it in our documentation. Sorry for the inconvenience; I will update the document. Thanks for noticing it. Note that you should not change force_resume in p2/defaults.yaml to False -- setting it to False will cause a loading error.
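
Putting the three points together, a minimal sketch of the suggested phase 2 overrides (illustrative only: whether n_in_s / n_in_c / n_trg sit under dset.train follows the repository's data configuration, gpus_per_node is passed to the launch script rather than set in this yaml, and the resume path is just an example):

cfgs/LF/p2/train.yaml (sketch, not the exact schema)

use_ddp: True                 # enable multi-GPU training; set gpus_per_node in the launch script

dset:
  train:
    n_trg: 2                  # lower this first if you hit OOM
    n_in_s: 2                 # if it still does not fit, reduce these two to 2 as well
    n_in_c: 2

trainer:
  resume: result/lf1/checkpoints/200000.pth   # last phase 1 checkpoint (example path)
  # leave force_resume at its phase 2 default (True); setting it to False causes a loading error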

ammar-deep (Author) commented

"We control the number of images for each iteration by setting the values of n_in_s, n_in_c, and n_trg."

This did the trick for me: setting n_in_s and n_in_c to 2, I can now train LFFont on multiple GPUs.

"We used a small emb_dim value (3 or 4) for the Korean dataset. num_workers can be any value that suits your machine."

Alright, so setting emb_dim to 3 or 4 will not affect the performance.

"(Important) Please do not change force_resume in p2/defaults.yaml to False."

Thanks.

I am training the network now and hope to have satisfactory results this time.
Thanks for helping out.

ammar-deep (Author) commented

So, I finished the training, but the results are not satisfactory. I can see broken content.
Any suggestions for this?

8uos (Collaborator) commented Jan 10, 2022

As our experimental results show, our model may not generate perfect content.
Can I see your results?

ammar-deep (Author) commented

[Attached image: 0200000-None]

The attached image shows training samples generated by the phase 2 LFFont model at 200k iterations.

8uos (Collaborator) commented Jan 10, 2022

It seems like there were some problems during training. These results are much worse than ours.
Can I see the phase 1 training results?
Also, I wonder what decomposition and primals files were used for the training.
Showing the configuration files (the .yaml files) you used for phase 1 and phase 2 training will help me find the problem.

ammar-deep (Author) commented

Below is the image generated by the LFFont phase 1 model at 200k iterations.

[Attached image: 0200000-None]

Below are the configurations I used for training the LFFont model.

cfgs/LF/p1/train.yaml

use_ddp: True
decomposition: data/kor/decomposition.json
primals: data/kor/primals.json

trainer:
  resume:
  work_dir: ./result/lf1

dset:
  train:
    source_path: data/kor/source.ttf
    source_ext: ttf

cfgs/LF/p1/default.yaml

seed: 2
model: lf
phase: comb

# Decomposition rule
decomposition:
primals:

# Optimizer
max_iter: 200000
g_lr: 2e-4
d_lr: 8e-4
ac_lr: 2e-4
adam_betas: [0.0, 0.9]

# Trainer
trainer:
  resume: 
  force_resume: False
  work_dir: ./result/lf
  # Losses
  pixel_loss_type: l1
  pixel_w: 0.1
  gan_w: 1.0
  fm_layers: all
  fm_w: 1.0
  ac_w: 0.1
  ac_gen_w: 0.1
  fact_const_w: 0.
  # Display
  save: all-last
  print_freq: 1000
  val_freq: 10000
  save_freq: 50000
  tb_freq: 100

# Generator
gen:
  emb_dim:

# Dataloader
dset:
  loader:
    batch_size: 8
    num_workers: 16
  train:
    n_in_s: 3
    n_in_min: 3
    n_in_max: 5

cfgs/LF/p2/train.yaml

use_ddp: True
decomposition: data/kor/decomposition.json
primals: data/kor/primals.json

trainer:
  resume: result/lf1/checkpoints/200000.pth
  work_dir: ./result/lf2

dset:
  train:
    source_path: data/kor/source.ttf
    source_ext: ttf
    
gen:
  emb_dim: 3

cfgs/LF/p2/default.yaml

seed: 2
model: lf
phase: fact

# Decomposition rule
decomposition:
primals:

# Optimizer
max_iter: 200000
g_lr: 2e-4
d_lr: 8e-4
ac_lr: 2e-4
adam_betas: [0.0, 0.9]

# Trainer
trainer:
  resume: 
  force_resume: True
  work_dir: ./result/lf
  # Losses
  pixel_loss_type: l1
  pixel_w: 0.1
  gan_w: 1.0
  fm_layers: all
  fm_w: 1.0
  ac_w: 0.1
  ac_gen_w: 0.1
  fact_const_w: 1.
  # Display
  save: all-last
  print_freq: 1000
  val_freq: 10000
  save_freq: 50000
  tb_freq: 100

# Generator
gen:
  emb_dim: 3

# Dataloader
dset:
  loader:
    batch_size: 1
    num_workers: 8

8uos (Collaborator) commented Jan 10, 2022

I cannot find any problem in the configuration files. How many fonts did you use for training?
When I tested this code, the model generated plausible images at the end of phase 1 training (we used 68 fonts).
If you are using more than 68 fonts, I recommend training the model for longer than 200k iterations in phase 1.
For example, we used 367 Chinese fonts and trained for 800k iterations in phase 1 for the Chinese experiments.
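
Assuming train.yaml overrides are merged over default.yaml (as with the other keys in this thread), extending the phase 1 schedule is a one-line change; a sketch, using the Chinese setting above as the reference point:

cfgs/LF/p1/train.yaml (illustrative)

max_iter: 800000   # 200k is the default; 800k was used for the 367-font Chinese setup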

These are our results at the end of phase 1 (200k):
[Image: our phase 1 results at 200k iterations]

ammar-deep (Author) commented

I used 60 font files for training with the configurations I mentioned above. With the same configurations, FUNIT did a good job, so I am not sure what the issue is. I can train it for longer and let you know if it gets better.

8uos (Collaborator) commented Jan 10, 2022

OK, thanks. I will check the code again.
Also, I think mode collapse could be the reason for this failure.
We observed that our model can be sensitive to the learning rates.
The default learning rates (0.0002 and 0.0008) are the values we used; however, results may be improved by tuning them.
These are the losses when our model converged:
|D 3.830 |G 0.231 |FM 1.323 |R_font 0.497 |F_font 0.562 |R_uni 0.555 |F_uni 0.626
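
For reference, the learning rates sit at the top level of the configs shown above, so tuning them is a small override. The halved values below are only an illustration of the kind of adjustment meant, not recommended settings:

cfgs/LF/p2/train.yaml (illustrative)

g_lr: 1e-4    # presumably the generator rate (default 2e-4)
d_lr: 4e-4    # presumably the discriminator rate (default 8e-4)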

ammar-deep (Author) commented

Thanks for this tip.
I am going to train it again, increasing the iterations and adjusting the learning rates. Hopefully it will work better.

SanghyukChun (Collaborator) commented

Closing the issue, assuming the answer resolves the problem.
Please re-open the issue as necessary.
