
After interrupting model training, how do I resume from where it left off? #470

Closed
dapsjj opened this issue Jan 16, 2020 · 10 comments

@dapsjj

dapsjj commented Jan 16, 2020

Hello everyone!
When running train.py, suppose it has been running for a while and I need to pause training at 10:00 in the morning and then continue at 14:00 in the afternoon. How should I do that? I found that after stopping train.py and running it again, it does not pick up from where it was interrupted. What do I need to do so that training resumes from the interruption point?
I tried editing config.py, changing __C.YOLO.ORIGINAL_WEIGHT = "./checkpoint/yolov3_coco.ckpt" to my own checkpoint, for example __C.YOLO.ORIGINAL_WEIGHT = "./checkpoint/yolov3_test_loss=16.4555.ckpt", but when I run train.py again the output is unchanged; it still prints:

=> Restoring weights from: ./checkpoint/yolov3_coco_demo.ckpt ... 
  0%|          | 0/4014 [00:00<?, ?it/s]2020-01-11 15:11:44.567005: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.73GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-01-11 15:11:45.392378: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-01-11 15:11:45.420473: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
train loss: 4195.52:   0%|          | 1/4014 [00:07<8:15:17,  7.41s/it]2020-01-11 15:11:46.625862: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-01-11 15:11:46.704219: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
train loss: 1974.33:   0%|          | 2/4014 [00:08<6:09:40,  5.53s/it]2020-01-11 15:11:47.816559: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-01-11 15:11:47.841658: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
train loss: 3769.86:   0%|          | 3/4014 [00:09<4:44:16,  4.25s/it]2020-01-11 15:11:49.013124: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-01-11 15:11:49.089715: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
train loss: 1902.35:   0%|          | 4/4014 [00:10<3:41:04,  3.31s/it]2020-01-11 15:11:50.022054: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
train loss: 13.42: 100%|██████████| 4014/4014 [11:29<00:00,  5.82it/s]
=> Epoch:  1 Time: 2020-01-11 15:23:59 Train loss: 546.57 Test loss: 30.60 Saving ./checkpoint/yolov3_test_loss=30.6046.ckpt ...

So it is still training from scratch. How can I continue training instead of starting over?
Looking at the checkpoint folder, the files produced by the first run are:

yolov3_test_loss=30.1690.ckpt-1.data-00000-of-00001
yolov3_test_loss=30.1690.ckpt-1.index
yolov3_test_loss=30.1690.ckpt-1.meta

while the files produced by the second run are:

yolov3_test_loss=30.6046.ckpt-1.data-00000-of-00001
yolov3_test_loss=30.6046.ckpt-1.index
yolov3_test_loss=30.6046.ckpt-1.meta

The ckpt-1 suffix appears twice: the set above is from the first run and the set below from the second. Only the loss in the file names and the timestamps of the two groups differ. So, how do I resume training from the checkpoint?
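A minimal sketch of why both runs produce a "-1" set, assuming train.py saves with a TensorFlow 1.x tf.train.Saver and passes the current epoch as global_step (an assumption about this repo; the variable below is a stand-in, not the real model):

```python
import os
import tensorflow as tf  # TensorFlow 1.x API, matching the logs above

os.makedirs("./checkpoint", exist_ok=True)
w = tf.get_variable("w", initializer=0.0)  # stand-in for the model weights
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Passing the epoch as global_step appends "-<epoch>" to the prefix.
    # A restarted run counts its epochs from 1 again, so it writes another
    # "...ckpt-1" set instead of continuing the numbering of the old run.
    saver.save(sess, "./checkpoint/yolov3_test_loss=30.6046.ckpt", global_step=1)
    # -> yolov3_test_loss=30.6046.ckpt-1.data-00000-of-00001 / .index / .meta
```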

@Starrynightzyq

Same question here; any advice would be appreciated.

@wl4135

wl4135 commented Mar 14, 2020

Same question here, hoping for some pointers.

@wl4135

wl4135 commented Mar 15, 2020

Here is how I solved it: when you restart training, change __C.TRAIN.INITIAL_WEIGHT = "" under the Train options section of config.py to your latest weight file, and training will continue from there.

@dapsjj

dapsjj commented Mar 16, 2020

@wl4135 Hello, I tried the method you described and made this change:
__C.TRAIN.INITIAL_WEIGHT = "./checkpoint/yolov3_test_loss=25.1203.ckpt"
But judging from the loss of the newly saved model files, training did not resume from where it was interrupted.

@wl4135

wl4135 commented Mar 16, 2020

> @wl4135 Hello, I tried the method you described and made this change:
> __C.TRAIN.INITIAL_WEIGHT = "./checkpoint/yolov3_test_loss=25.1203.ckpt"
> But judging from the loss of the newly saved model files, training did not resume from where it was interrupted.

That is exactly how I changed it, and training then continued from the new loss value: when I trained from scratch the loss started at over 1000, whereas after the change it started from about 28. It does, however, start again from the first epoch:
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
=> Restoring weights from: ./checkpoint/yolov3_test_loss=28.8428.ckpt-10 ...
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
train loss: 18.45: 27% 37/135 [04:05<10:05, 6.18s/it]Corrupt JPEG data: 1 extraneous bytes before marker 0xdb
train loss: 14.99: 100% 135/135 [14:23<00:00, 6.70s/it]
=> Epoch: 1 Time: 2020-03-15 07:54:53 Train loss: 24.01 Test loss: 19.03 Saving ./checkpoint/yolov3_test_loss=19.0295.ckpt ...
train loss: 15.90: 50% 68/135 [04:30<07:42, 6.91s/it]Corrupt JPEG data: 1 extraneous bytes before marker 0xdb
train loss: 14.39: 100% 135/135 [08:52<00:00, 3.49s/it]
=> Epoch: 2 Time: 2020-03-15 08:07:10 Train loss: 16.27 Test loss: 14.00 Saving ./checkpoint/yolov3_test_loss=13.9986.ckpt ...
train loss: 12.29: 13% 18/135 [00:11<00:58, 2.01it/s]Corrupt JPEG data: 1 extraneous bytes before marker 0xdb
train loss: 29.28: 100% 135/135 [01:28<00:00, 1.72it/s]
This is the output of my restarted run.
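If one also wanted the reported epoch number to continue rather than restart at 1, a small workaround (hypothetical; not something the repo does) would be to read the step back out of the restored prefix and offset the training loop, e.g.:

```python
# Hypothetical workaround: recover the epoch from a prefix such as
# "...ckpt-10" and continue counting from there instead of from 1.
initial_weight = "./checkpoint/yolov3_test_loss=28.8428.ckpt-10"  # illustrative path
start_epoch = int(initial_weight.rsplit("-", 1)[-1])  # -> 10
total_epochs = 30  # illustrative

for epoch in range(start_epoch + 1, total_epochs + 1):
    print("Epoch:", epoch)  # prints 11, 12, ... instead of starting at 1
```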

@dapsjj

dapsjj commented Mar 16, 2020

@wl4135 I found the problem; I had written it incorrectly here:
__C.TRAIN.INITIAL_WEIGHT = "./checkpoint/yolov3_test_loss=25.1203.ckpt"
It should be:
__C.TRAIN.INITIAL_WEIGHT = "./checkpoint/yolov3_test_loss=25.1203.ckpt-20"
The epoch number at the end was missing.
One more question:
does the content of the checkpoint text file need to be changed to the following?

model_checkpoint_path: "yolov3_test_loss=25.1203.ckpt-20"
all_model_checkpoint_paths: "yolov3_test_loss=25.1203.ckpt-20"

@wl4135

wl4135 commented Mar 16, 2020

> @wl4135 I found the problem; I had written it incorrectly here:
> __C.TRAIN.INITIAL_WEIGHT = "./checkpoint/yolov3_test_loss=25.1203.ckpt"
> It should be:
> __C.TRAIN.INITIAL_WEIGHT = "./checkpoint/yolov3_test_loss=25.1203.ckpt-20"
> The epoch number at the end was missing.
> One more question:
> does the content of the checkpoint text file need to be changed to the following?
>
> model_checkpoint_path: "yolov3_test_loss=25.1203.ckpt-20"
> all_model_checkpoint_paths: "yolov3_test_loss=25.1203.ckpt-20"

I only changed that one thing, the path to the latest weight file; everything else was left untouched. Since I train on Google Cloud Platform, I kept only the latest weight file and deleted the rest.
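For reference, a minimal sketch (TensorFlow 1.x, illustrative paths) of what the checkpoint text file is for: it is only consulted by helpers such as tf.train.latest_checkpoint, so it does not need hand-editing when INITIAL_WEIGHT already names the full prefix including the step suffix:

```python
import tensorflow as tf  # TensorFlow 1.x API

# The "checkpoint" text file inside ./checkpoint records the newest prefix;
# tf.train.latest_checkpoint reads it and returns that prefix (or None if
# nothing is found). Restoring from an explicit path ignores this file.
latest = tf.train.latest_checkpoint("./checkpoint")
print(latest)  # e.g. ./checkpoint/yolov3_test_loss=25.1203.ckpt-20 (illustrative)
```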

@dapsjj

dapsjj commented Mar 16, 2020

@wl4135 Thanks! Following your advice, I can now resume training.

@dapsjj dapsjj closed this as completed Mar 16, 2020
@wl4135

wl4135 commented Mar 16, 2020

> @wl4135 Thanks! Following your advice, I can now resume training.

You're welcome; we learn from each other.

@B1D1ng

B1D1ng commented Jul 25, 2020

> @wl4135 I found the problem; I had written it incorrectly here:
> __C.TRAIN.INITIAL_WEIGHT = "./checkpoint/yolov3_test_loss=25.1203.ckpt"
> It should be:
> __C.TRAIN.INITIAL_WEIGHT = "./checkpoint/yolov3_test_loss=25.1203.ckpt-20"
> The epoch number at the end was missing.
> One more question:
> does the content of the checkpoint text file need to be changed to the following?
>
> model_checkpoint_path: "yolov3_test_loss=25.1203.ckpt-20"
> all_model_checkpoint_paths: "yolov3_test_loss=25.1203.ckpt-20"
>
> I only changed that one thing, the path to the latest weight file; everything else was left untouched. Since I train on Google Cloud Platform, I kept only the latest weight file and deleted the rest.

Hello, may I ask whether you trained on Colab? Could I get your contact details so I can ask you some questions?
