strange loss curve #3

argman · 2017-09-07T15:51:03Z

Thanks for the clean and elegant code!
I tried to run training from scratch (using the VGG-16 model pretrained on ImageNet), but the training process looks weird.

Total Loss

And the corresponding loss for others.

The loss quickly plateaued at a little over 10. When I tested the model, no text boxes were detected. How can I diagnose this?

BowieHsu · 2017-09-11T03:25:23Z

@argman Have you converted the checkpoint from the VGG16 FC-reduced caffemodel? I trained from scratch on ICDAR2015 with the converted checkpoint and got good results; the loss should converge to roughly 2.0. See #4 to download my checkpoints.

argman · 2017-09-11T03:37:42Z

@BowieHsu, thanks! I will try it and post my results here.

argman · 2017-09-11T03:40:45Z

@BowieHsu, by the way, can you share your trained model?
I am using tf-1.3, so I need to check whether something changed in TensorFlow.

argman · 2017-09-11T10:49:16Z

@BowieHsu, after 6 hours of training on 4 GPUs, the loss curve is:

argman · 2017-09-12T04:33:36Z

@BowieHsu, thanks for your model, I can get meaningful results now! The model is really hard to train.

BowieHsu · 2017-09-12T15:47:06Z

Haha, that's really good news.

JiasiWang · 2017-10-11T05:15:01Z

@BowieHsu, hi, I used the converted checkpoint and trained from scratch on ICDAR2015, but I got bad results. I set the learning rate in the JSON file like this:
"max_steps": 90000, "base_lr": 1e-4, "lr_breakpoints": [10000, 20000, 60000, 75000, 90000], "lr_decay": [0.64, 0.8, 1.0, 0.1, 0.01],
My guess is that base_lr is too small, or that something else is wrong. Could you please share your training strategy and your good results? Thank you so much!
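To make a schedule like this easier to reason about, here is a minimal sketch that prints the effective learning rate per interval. It assumes that each `lr_decay` entry is a multiplier on `base_lr` that applies to the steps below the matching `lr_breakpoints` entry; the thread does not confirm that the solver reads the fields this way, and `learning_rate_at` is a hypothetical helper, not a function from the repository.

```python
import bisect

# Values copied from the JSON fragment quoted in the thread.
LR_BREAKPOINTS = [10000, 20000, 60000, 75000, 90000]
LR_DECAY = [0.64, 0.8, 1.0, 0.1, 0.01]


def learning_rate_at(step, base_lr, lr_breakpoints, lr_decay):
    """Return the learning rate at `step` under the ASSUMED semantics:
    lr_decay[i] scales base_lr for all steps below lr_breakpoints[i]."""
    # Index of the first breakpoint strictly greater than `step`.
    idx = bisect.bisect_right(lr_breakpoints, step)
    # Past the last breakpoint, keep using the last multiplier.
    idx = min(idx, len(lr_decay) - 1)
    return base_lr * lr_decay[idx]


if __name__ == "__main__":
    # Compare the two base_lr values discussed in the thread.
    for step in (0, 10000, 20000, 60000, 75000):
        low = learning_rate_at(step, 1e-4, LR_BREAKPOINTS, LR_DECAY)
        high = learning_rate_at(step, 5e-4, LR_BREAKPOINTS, LR_DECAY)
        print(f"step {step:>6}: base_lr=1e-4 -> {low:.2e}, base_lr=5e-4 -> {high:.2e}")
```

Under that assumption, lowering `base_lr` from 5e-4 to 1e-4 makes every interval five times smaller, which would be worth ruling out before looking for other causes.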

BowieHsu · 2017-10-11T08:14:03Z

@JiasiWang Hi Wang, I also trained the model with the default pretrain.json, which gives good results. What is your batch size? You could also check the loss values in TensorBoard.

JiasiWang · 2017-10-11T08:17:42Z

@BowieHsu, I did not change the batch size; it is 32. I only changed base_lr to 1e-4. I will check it, thanks.

BowieHsu · 2017-10-11T08:24:24Z

@JiasiWang Yep, the default learning rate should be 5e-4.

BowieHsu · 2017-10-15T08:40:29Z

@JiasiWang By the way, the ICDAR2015 SegLink model should be pretrained on the SynthText dataset first and then fine-tuned on the ICDAR2015 training set if you want to reach 75% Hmean.

JiasiWang · 2017-10-15T13:19:44Z

@BowieHsu Yes, I know the SegLink model needs to be pretrained on SynthText; without pretraining I only get 58% Hmean.
After that, I also pretrained the model as described in the paper and then fine-tuned it, using the default JSON files for both steps, but the loss does not seem to converge in the fine-tuning step.

Godricly · 2017-11-17T08:17:29Z

May I ask how to use your model? I am not familiar with TensorFlow. I tried to load it in TensorFlow 1.4 but got the following error. I did some searching, but no solution worked for me.

I tried the following:

Changed seglink/solver.py to use

model_loader.restore(sess, './data/VGG_ILSVRC_16_layers_ssd/VGG_ILSVRC_16_layers_ssd.ckpt.data-00000-of-00001')

Created a folder named VGG_ILSVRC_16_layers_ssd and passed its path in the JSON file.
Set the finetune_model value to VGG_ILSVRC_16_layers_ssd.ckpt, which is a copy of VGG_ILSVRC_16_layers_ssd.ckpt.data-00000-of-00001.

Error log:

seglink/data/VGG_ILSVRC_16_layers_ssd/VGG_ILSVRC_16_layers_ssd.ckpt.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

BowieHsu · 2017-11-17T09:46:54Z

Try "model_loader.restore(sess, './data/VGG_ILSVRC_16_layers_ssd/VGG_ILSVRC_16_layers_ssd.ckpt')" @Godricly
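The reason this works is that a TensorFlow 1.x `Saver.restore` call expects the checkpoint prefix (the name shared by the `.index` and `.data-*` files), not one of the individual files. A minimal sketch of deriving that prefix from whichever file path you have at hand follows; `checkpoint_prefix` is a hypothetical helper for illustration and is not part of the repository or of TensorFlow.

```python
import re

# Suffixes that TensorFlow appends to the checkpoint prefix on disk.
_CKPT_FILE_SUFFIX = re.compile(r"\.(data-\d{5}-of-\d{5}|index|meta)$")


def checkpoint_prefix(path):
    """Strip a trailing .data-XXXXX-of-XXXXX, .index or .meta suffix, if any.

    Paths that are already a bare prefix are returned unchanged.
    """
    return _CKPT_FILE_SUFFIX.sub("", path)


if __name__ == "__main__":
    shard = ("./data/VGG_ILSVRC_16_layers_ssd/"
             "VGG_ILSVRC_16_layers_ssd.ckpt.data-00000-of-00001")
    # Prints the path to hand to restore(): it ends in ".ckpt".
    print(checkpoint_prefix(shard))
```

Copying the `.data-00000-of-00001` file to a name ending in `.ckpt` does not help, because the copied file is still only the data shard and the matching `.index` file is missing.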

Godricly · 2017-11-17T10:13:58Z

Many thanks! That saved my ass. 👍

BowieHsu · 2017-11-17T10:28:44Z

@Godricly You're welcome, my friend.

happycoding1996 · 2018-01-10T13:22:03Z

@BowieHsu How can I use your pretrained model to skip the pretrain step? Is exp/sgd/checkpoint the model produced during pretraining? When I put your model there, it complains that the format is wrong.

BowieHsu · 2018-01-11T03:11:20Z

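@tianzhuotao The pretrain JSON file is for training the model on the SynthText dataset. If you do not want to train that model and would rather train directly on ICDAR2015:
1. In exp/sgd/finetune_ic15.json, change checkpoint_path to the location where you put the VGG model.
2. Run ./manager train exp/sgd finetune_ic15 and that's it.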

happycoding1996 · 2018-01-11T03:17:18Z

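@BowieHsu That finetune JSON file only has a finetune_model entry. It seems a checkpoint file has to exist in exp/sgd, but I have not run pretraining, so I do not have one. Your model also seems to contain only 3 files. How can I solve this?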

BowieHsu · 2018-01-11T03:31:41Z

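You can see that finetune.json contains these two lines:
"resume": "finetune",
"finetune_model": "../exp/sgd/checkpoint"
Replace /exp/sgd/checkpoint here with the path where you put the checkpoint I converted. Keep an eye on the log output: if TensorFlow finds the checkpoint but still reports an error, that is because the resume option is set to finetune and some variables do not exist in the VGG model, so you may also need to change "resume":"finetune" to "resume":"vgg16". Give it a try first.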
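As an illustration of the two changes described here, the sketch below patches a config dictionary in Python. The keys `resume` and `finetune_model` and the values `finetune` and `vgg16` come from this thread; the checkpoint path and the helper name `point_config_at_vgg` are hypothetical, and the real JSON file presumably contains more fields than shown.

```python
import json


def point_config_at_vgg(config, ckpt_prefix):
    """Return a copy of `config` that resumes from a converted VGG checkpoint.

    `ckpt_prefix` should be the checkpoint prefix (ending in .ckpt here),
    not the .data-00000-of-00001 file.
    """
    updated = dict(config)  # leave the caller's dictionary untouched
    updated["resume"] = "vgg16"
    updated["finetune_model"] = ckpt_prefix
    return updated


if __name__ == "__main__":
    # Only the two fields quoted in the thread; other fields are omitted.
    config = {"resume": "finetune", "finetune_model": "../exp/sgd/checkpoint"}
    # Hypothetical location of the converted checkpoint.
    patched = point_config_at_vgg(
        config, "../premodel/VGG_ILSVRC_16_layers_ssd.ckpt")
    print(json.dumps(patched, indent=2))
```

Editing the JSON file by hand achieves the same thing; the sketch only makes explicit which two fields change together.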

happycoding1996 · 2018-01-11T03:55:45Z

@BowieHsu Thank you so much, and all the best to you! I also fixed a few other problems (GPU stuff...) and it is finally running.

BowieHsu · 2018-01-11T04:06:33Z

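@tianzhuotao Keep an eye on the training loss. If you fine-tune directly from the VGG model, you need to adjust the learning rate; just tune the hyperparameters step by step. You will of course also have to hack the code to fit your actual task. Good luck.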

happycoding1996 · 2018-01-11T11:49:49Z

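@BowieHsu Thanks! I am currently using the default parameters, but training is very slow: 6% done after 7 hours, which feels slow qwq. Roughly how long did your training take? My current cluster allocation is a 16-core CPU, 1 GPU, 32 GB of RAM, and 10 GB of disk.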

19931991 · 2018-03-06T13:02:36Z

Hello, I have recently been working on multi-oriented text detection as well. Could we connect on QQ to discuss?

@tianzhuotao @BowieHsu

13230380356 · 2018-04-20T07:34:11Z

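Hello, convert_caffemodel_to_ckpt.py contains import model_vgg16. How do I install this model_vgg16, and where should it go? Also, running run.sh fails with a caffe error. People online say it is a Python version problem and that I need to switch to Python 2.7, but your description says you use Python 3. Could you help clear this up?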

ZimingLu · 2018-05-08T02:35:01Z

@13230380356 I just solved the pretrain problem; for details, see the tips I just wrote in #13.

HardSoft2023 · 2018-11-23T05:01:19Z

try "model_loader.restore(sess, './data/VGG_ILSVRC_16_layers_ssd/VGG_ILSVRC_16_layers_ssd.ckpt')" @Godricly

Everything is OK until:
2018-11-23 04:53:37,597 [INFO ] Restoring parameters from ../premodel/ILVSR_VGG_16_FC_REDUCED/VGG_ILSVRC_16_layers_ssd.ckpt
Segmentation fault (core dumped)
How can I debug this segmentation fault? Any comment is welcome.

Shualite · 2019-09-10T07:23:11Z

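@BowieHsu @JiasiWang I built the tf files from the 40 GB SynthText dataset and pretrained for 90000 steps. Because the default "finetune_model": "../exp/sgd/checkpoint" in finetune_ic15.json did not work, I changed it to "finetune_model": "../exp/sgd/checkpoint-90000". After training for another 10000 steps, the result on the IC15 test set is only:
Recall | Precision | Hmean
59.56 % | 63.47 % | 61.45 %

Why does it not reach 75%?
Looking forward to a reply, friends. Thank you!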
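For reference, Hmean in the table above is the harmonic mean of recall and precision, so the third column can be checked from the first two. A small sketch, where `hmean` is a hypothetical helper name:

```python
def hmean(recall, precision):
    """Harmonic mean of recall and precision (same units as the inputs)."""
    if recall + precision == 0:
        return 0.0  # avoid dividing by zero when both scores are zero
    return 2 * recall * precision / (recall + precision)


if __name__ == "__main__":
    # Recall and precision (in percent) as reported in the comment above.
    print(f"{hmean(59.56, 63.47):.2f} %")
```

Because the harmonic mean is pulled toward the lower of the two scores, both recall and precision have to rise to around 75% before Hmean does.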

Shualite · 2019-09-12T01:26:44Z

After changing the batch size to 32, Hmean is still around 61%.

Shualite · 2019-09-12T01:28:03Z

I ran the test with the pretrained model, without fine-tuning, and got an Hmean of 49%.

gzpyunduan · 2021-02-12T11:51:17Z

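I ran the test with the pretrained model, without fine-tuning, and got an Hmean of 49%.

I get the same results as you, and at the moment I do not know how to improve them.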
