
How many epochs of training should I expect? #143

Closed
batchku opened this issue Nov 23, 2022 · 3 comments

Comments

batchku commented Nov 23, 2022

I've been running training on a set of audio files and am wondering how I should assess how training is going.

After about 24 hours, I'm at about 13,000 epochs. I'm not sure how to interpret the TensorBoard visualizations; any pointers would be very much appreciated.

/content/drive/MyDrive/RAVE_COLLAB
Recursive search in /content/drive/MyDrive/RAVE_COLLAB/resampled/parbass/
audio_00158_00000.wav: 100% 159/159 [00:04<00:00, 33.67it/s] 
/content/miniconda/lib/python3.9/site-packages/torch/utils/data/dataloader.py:487: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Restoring states from the checkpoint path at /content/drive/MyDrive/RAVE_COLLAB/runs/parbass/rave/version_2/checkpoints/last-v1.ckpt
/content/miniconda/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:342: UserWarning: The dirpath has changed from 'runs/parbass/rave/version_2/checkpoints' to 'runs/parbass/rave/version_3/checkpoints', therefore `best_model_score`, `kth_best_model_path`, `kth_value`, `last_model_path` and `best_k_models` won't be reloaded. Only `best_model_path` will be reloaded.
  warnings.warn(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type                | Params
------------------------------------------------------
0 | pqmf          | CachedPQMF          | 4.2 K 
1 | loudness      | Loudness            | 0     
2 | encoder       | Encoder             | 4.8 M 
3 | decoder       | Generator           | 12.8 M
4 | discriminator | StackDiscriminators | 16.9 M
------------------------------------------------------
34.5 M    Trainable params
0         Non-trainable params
34.5 M    Total params
138.092   Total estimated model params size (MB)
Restored all states from the checkpoint file at /content/drive/MyDrive/RAVE_COLLAB/runs/parbass/rave/version_2/checkpoints/last-v1.ckpt
/content/miniconda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1933: PossibleUserWarning: The number of training batches (19) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
Epoch 11571:   0% 0/20 [00:00<00:00, -106397.56it/s]/content/miniconda/lib/python3.9/site-packages/torch/utils/data/dataloader.py:487: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
Epoch 12623:  95% 19/20 [00:04<?, ?it/s, v_num=3]
Validation: 0it [00:00, ?it/s]
Validation:   0% 0/1 [00:00<?, ?it/s]
Validation DataLoader 0:   0% 0/1 [00:00<?, ?it/s]
Epoch 12623: 100% 20/20 [00:04<00:00,  4.59s/it, v_num=3]
Epoch 12624:   0% 0/19 [00:00<00:00, -111926.65it/s, v_num=3]/content/miniconda/lib/python3.9/site-packages/torch/utils/data/dataloader.py:487: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
Epoch 12706:  63% 12/19 [00:03<-1:59:57, -2.16it/s, v_num=3]

[Three TensorBoard screenshots attached]

@jacklion710
Did you ever find out how many? How long did training take for you?

0x7b1 commented Jun 9, 2023

Isn't it 100,000 epochs?

max_epochs=100000,

@jacklion710
In my experience, it goes by the number of steps rather than epochs, at least as of the last time I trained, which was several months ago. The default was 6,000,000 steps, which can be changed by setting the proper flag on the 'rave train' command.
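To relate that step budget to the epoch counts shown in the log above: with one optimizer step per batch, a run's total epochs is roughly the step budget divided by the number of batches per epoch. The sketch below assumes the 6,000,000-step default mentioned here and the 19 batches per epoch visible in this log; both numbers will differ for other datasets and configs.

```python
def steps_to_epochs(max_steps: int, batches_per_epoch: int) -> int:
    """Rough conversion: each epoch consumes one step per batch."""
    return max_steps // batches_per_epoch

# With the small dataset in the log (19 batches/epoch), the default
# step budget corresponds to a very large number of epochs:
print(steps_to_epochs(6_000_000, 19))  # 315789
```

This also explains why a 24-hour run can sit at "epoch 13,000" while still being early in the overall schedule: on a tiny dataset, epochs go by quickly but the step count is what the trainer is actually budgeting against.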
