
How many epochs of training should I expect? #143

Closed
batchku opened this issue Nov 23, 2022 · 3 comments

Comments

batchku commented Nov 23, 2022

I've been running training on a set of audio files and am wondering how I should assess how training is going.

After about 24 hours, I'm at about 13,000 epochs. I'm not sure how to interpret the TensorBoard visualizations; any pointers would be very much appreciated.

/content/drive/MyDrive/RAVE_COLLAB
Recursive search in /content/drive/MyDrive/RAVE_COLLAB/resampled/parbass/
audio_00158_00000.wav: 100% 159/159 [00:04<00:00, 33.67it/s] 
/content/miniconda/lib/python3.9/site-packages/torch/utils/data/dataloader.py:487: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Restoring states from the checkpoint path at /content/drive/MyDrive/RAVE_COLLAB/runs/parbass/rave/version_2/checkpoints/last-v1.ckpt
/content/miniconda/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:342: UserWarning: The dirpath has changed from 'runs/parbass/rave/version_2/checkpoints' to 'runs/parbass/rave/version_3/checkpoints', therefore `best_model_score`, `kth_best_model_path`, `kth_value`, `last_model_path` and `best_k_models` won't be reloaded. Only `best_model_path` will be reloaded.
  warnings.warn(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type                | Params
------------------------------------------------------
0 | pqmf          | CachedPQMF          | 4.2 K 
1 | loudness      | Loudness            | 0     
2 | encoder       | Encoder             | 4.8 M 
3 | decoder       | Generator           | 12.8 M
4 | discriminator | StackDiscriminators | 16.9 M
------------------------------------------------------
34.5 M    Trainable params
0         Non-trainable params
34.5 M    Total params
138.092   Total estimated model params size (MB)
Restored all states from the checkpoint file at /content/drive/MyDrive/RAVE_COLLAB/runs/parbass/rave/version_2/checkpoints/last-v1.ckpt
/content/miniconda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1933: PossibleUserWarning: The number of training batches (19) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
Epoch 11571:   0% 0/20 [00:00<00:00, -106397.56it/s]/content/miniconda/lib/python3.9/site-packages/torch/utils/data/dataloader.py:487: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
Epoch 12623:  95% 19/20 [00:04<?, ?it/s, v_num=3]
Validation: 0it [00:00, ?it/s]
Validation:   0% 0/1 [00:00<?, ?it/s]
Validation DataLoader 0:   0% 0/1 [00:00<?, ?it/s]
Epoch 12623: 100% 20/20 [00:04<00:00,  4.59s/it, v_num=3]
Epoch 12624:   0% 0/19 [00:00<00:00, -111926.65it/s, v_num=3]/content/miniconda/lib/python3.9/site-packages/torch/utils/data/dataloader.py:487: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
Epoch 12706:  63% 12/19 [00:03<-1:59:57, -2.16it/s, v_num=3]

[Three TensorBoard screenshots attached]

@jacklion710
Did you ever find out how many? How long did training take for you?

0x7b1 commented Jun 9, 2023

Isn't it 100,000 epochs?

max_epochs=100000,

@jacklion710
In my experience, it goes by the number of steps rather than epochs, at least as of the last time I trained, which was several months ago. The default was 6,000,000 steps, which can be changed by setting the proper flag on the 'rave train' command.
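To relate that step budget to the epoch counts shown in the log above: with one optimizer step per batch, a run's total epochs is roughly the step budget divided by the number of batches per epoch. The sketch below assumes the 6,000,000-step default mentioned here and the 19 batches per epoch visible in this log; both numbers will differ for other datasets and configs.

```python
def steps_to_epochs(max_steps: int, batches_per_epoch: int) -> int:
    """Rough conversion: each epoch consumes one step per batch."""
    return max_steps // batches_per_epoch

# With the small dataset in the log (19 batches/epoch), the default
# step budget corresponds to a very large number of epochs:
print(steps_to_epochs(6_000_000, 19))  # 315789
```

This also explains why a 24-hour run can sit at "epoch 13,000" while still being early in the overall schedule: on a tiny dataset, epochs go by quickly but the step count is what the trainer is actually budgeting against.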
