strange loss curve #3

Open
argman opened this issue Sep 7, 2017 · 31 comments

Comments

@argman

argman commented Sep 7, 2017

Thanks for the clean and elegant code!
I tried to run training from scratch (using the vgg_16 model pretrained on ImageNet), but the training process looks weird.

Total loss:
[Image: total loss curve (screenshot "qq 20170907234749")]

And the corresponding curves for the other losses:
[Image: other loss curves (screenshot "qq 20170907234947")]

The loss quickly converged to a little above 10. I tested the model, but no text boxes were detected. How can I diagnose this?

@BowieHsu

@argman Have you converted the checkpoint from the VGG16 FC-reduced caffemodel? I used the converted checkpoint to train from scratch on ICDAR2015 and got good results; the loss should converge to about 2.0. See #4 to download my checkpoints.

@argman
Author

argman commented Sep 11, 2017

@BowieHsu, thanks! I will try it and post my results here.

@argman
Author

argman commented Sep 11, 2017

@BowieHsu, by the way, can you share your trained model?
I am using tf-1.3, so I need to check whether something changed in TensorFlow.

@argman
Author

argman commented Sep 11, 2017

@BowieHsu, after 6 hours of training on 4 GPUs, the loss curve is:
[Image: loss curve (screenshot "snp20170911184828296")]

@argman
Author

argman commented Sep 12, 2017

@BowieHsu, thanks for your model, I can get meaningful results now! The model is really hard to train.

@BowieHsu

Haha, that's really good news.

@JiasiWang

@BowieHsu, hi, I used the converted checkpoints and trained from scratch on ICDAR2015, but I got a bad result. I set the learning rate in the JSON file like this:
"max_steps": 90000, "base_lr": 1e-4, "lr_breakpoints": [10000, 20000, 60000, 75000, 90000], "lr_decay": [0.64, 0.8, 1.0, 0.1, 0.01],
I guess the base_lr may be too small, or something else is wrong. Could you please share your training strategy and your good results? Thank you so much!
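
For reference, the schedule in this config can be read as a piecewise-constant learning rate. The sketch below assumes (the thread does not confirm this) that the i-th entry of `lr_decay` multiplies `base_lr` for every step before the i-th entry of `lr_breakpoints`. `learning_rate_at` is a hypothetical helper, not a function from the repository.

```python
import bisect


def learning_rate_at(step, base_lr, lr_breakpoints, lr_decay):
    """Piecewise-constant schedule (assumed reading of the JSON config)."""
    # Number of breakpoints that are <= step, i.e. the index of the active segment.
    idx = bisect.bisect_right(lr_breakpoints, step)
    # Past the last breakpoint, keep using the final multiplier.
    idx = min(idx, len(lr_decay) - 1)
    return base_lr * lr_decay[idx]


# Values copied from the config quoted above.
base_lr = 1e-4
lr_breakpoints = [10000, 20000, 60000, 75000, 90000]
lr_decay = [0.64, 0.8, 1.0, 0.1, 0.01]

for step in (0, 15000, 40000, 70000, 80000):
    lr = learning_rate_at(step, base_lr, lr_breakpoints, lr_decay)
    print(f"step {step:>6}: lr = {lr:.2e}")
```

Under this assumption, a `base_lr` of 1e-4 means the first 10,000 steps run at about 6.4e-5, which is worth keeping in mind when comparing loss curves across configs.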

@BowieHsu

@JiasiWang Hi Wang, I also trained the model with the default pretrain.json, which gives good results. What is your batch size? You could also check the loss values in TensorBoard.

@JiasiWang

@BowieHsu, I did not change the batch size; it is 32. I only changed base_lr to 1e-4. I will check it, thanks.

@BowieHsu

@JiasiWang Yep, the default learning rate should be 5e-4.

@BowieHsu

@JiasiWang By the way, the ICDAR2015 SegLink model should be pretrained on the SynthText dataset first and then fine-tuned on the ICDAR2015 training set if you want to reach 75% Hmean.

@JiasiWang

@BowieHsu Yeah, I know that the SegLink model needs pretraining on SynthText; without it, I only get 58% Hmean.
After that, I also pretrained the model as described in the paper and then fine-tuned it. I used the default JSON file for both steps, but the loss did not seem to converge in the fine-tuning step.

@Godricly

May I ask how to use your model? I am not familiar with TensorFlow. I tried to load it in TensorFlow 1.4 but got the following error. I did some searching, but no solution works for me.

I tried the following:

  1. Changed seglink/solver.py to use
model_loader.restore(sess, './data/VGG_ILSVRC_16_layers_ssd/VGG_ILSVRC_16_layers_ssd.ckpt.data-00000-of-00001')
  2. Created a folder named VGG_ILSVRC_16_layers_ssd and passed its path in the JSON
  3. Set finetune_model to VGG_ILSVRC_16_layers_ssd.ckpt, which is a copy of VGG_ILSVRC_16_layers_ssd.ckpt.data-00000-of-00001

Error log:

seglink/data/VGG_ILSVRC_16_layers_ssd/VGG_ILSVRC_16_layers_ssd.ckpt.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

@BowieHsu

@Godricly Try "model_loader.restore(sess, './data/VGG_ILSVRC_16_layers_ssd/VGG_ILSVRC_16_layers_ssd.ckpt')"
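
Some background on why the shorter path works: a TensorFlow checkpoint saved in the V2 format is a set of files (`.index`, one or more `.data-NNNNN-of-NNNNN` shards, and usually `.meta`) that share a common prefix, and `restore` expects that prefix rather than any single file. A minimal sketch of deriving the prefix from a shard path follows; `checkpoint_prefix` is a hypothetical helper, not part of TensorFlow or of this repository.

```python
import re

# Suffixes that TensorFlow appends to the checkpoint prefix when saving.
_CKPT_SUFFIX = re.compile(r"\.(data-\d{5}-of-\d{5}|index|meta)$")


def checkpoint_prefix(path):
    """Strip a shard/index/meta suffix so that only the prefix remains."""
    return _CKPT_SUFFIX.sub("", path)


shard = ("./data/VGG_ILSVRC_16_layers_ssd/"
         "VGG_ILSVRC_16_layers_ssd.ckpt.data-00000-of-00001")
# This prefix is the value to hand to model_loader.restore(sess, ...).
print(checkpoint_prefix(shard))
```

A path that is already a prefix passes through unchanged, so the helper is safe to apply to whatever value the config holds.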

@Godricly

Many thanks! That saved my ass. 👍

@BowieHsu

@Godricly You're welcome, friend.

@happycoding1996

@BowieHsu How can I use your pretrained model to skip the pretraining step? Is exp/sgd/checkpoint the model produced during pretraining? When I put your model there, it says the format is wrong.

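@BowieHsu

@tianzhuotao The pretrain JSON file is for training a model on the SynthText dataset. If you don't want to train that model and would rather train directly on ICDAR2015:
1. In exp/sgd/finetune_ic15.json, change checkpoint_path to the location where you put the VGG model.
2. Run ./manager train exp/sgd finetune_ic15 and that's it.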

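@happycoding1996

@BowieHsu The finetune JSON file only has a finetune_model entry. It seems a checkpoint file needs to exist in exp/sgd, but I have not run pretraining, so I don't have one, and your model seems to contain only 3 files. How can I solve this?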

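@BowieHsu

You can see these two lines in the finetune.json file:
"resume": "finetune",
"finetune_model": "../exp/sgd/checkpoint"
Replace /exp/sgd/checkpoint here with the location where you put the checkpoint I converted. Keep an eye on the log output: if TensorFlow finds the checkpoint but still reports an error, that is because the resume option is set to finetune and some variables do not exist in the VGG model, so you may also need to change "resume":"finetune" to "resume":"vgg16". Give it a try first.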
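
The config change described above can be sketched in a few lines. Only the keys `resume` and `finetune_model` and the values `finetune` and `vgg16` come from this thread; the helper name `retarget_finetune_config` and the checkpoint path used in the example are hypothetical and need to be adapted to your own layout.

```python
import json


def retarget_finetune_config(config, checkpoint_path, resume="finetune"):
    """Return a copy of the config that points at a different checkpoint."""
    updated = dict(config)  # shallow copy; the caller's dict is left untouched
    updated["resume"] = resume
    updated["finetune_model"] = checkpoint_path
    return updated


# The two lines quoted above, as they appear in the default file.
original = json.loads(
    '{"resume": "finetune", "finetune_model": "../exp/sgd/checkpoint"}'
)

# Hypothetical location of the converted VGG checkpoint (prefix, no suffix).
updated = retarget_finetune_config(
    original,
    "../model/VGG_ILSVRC_16_layers_ssd.ckpt",
    resume="vgg16",
)
print(json.dumps(updated, indent=2))
```

Writing the result back with `json.dump` would replace the file on disk, so keeping a backup of the original JSON is a sensible precaution.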

@happycoding1996

@BowieHsu Thank you very much, that was a great help! I also solved a few other problems (GPU-related and so on) and it is finally running.

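@BowieHsu

@tianzhuotao Keep an eye on the training loss. If you fine-tune directly from the VGG model, you will need to adjust the learning rate; just tune the hyperparameters gradually. You will also need to adapt the code to your actual task. Good luck.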

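@happycoding1996

@BowieHsu Thanks! I am currently using the default parameters, but training is very slow: 6% done after 7 hours. How long did your training take? On the cluster I have requested 16 CPU cores, 1 GPU, 32 GB of RAM, and 10 GB of disk.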
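
As a rough sanity check on the numbers above, the total run time can be extrapolated from the progress so far. This sketch assumes constant throughput for the whole run, which may not hold in practice; `estimate_total_hours` is a hypothetical helper.

```python
def estimate_total_hours(elapsed_hours, fraction_done):
    """Linear extrapolation of total run time from progress so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


elapsed = 7.0    # hours, from the comment above
fraction = 0.06  # 6% of the run completed

total = estimate_total_hours(elapsed, fraction)
print(f"estimated total: {total:.0f} h, remaining: {total - elapsed:.0f} h")
```

At that pace the full run would take roughly 117 hours, a little under five days on the single GPU described.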

@19931991

19931991 commented Mar 6, 2018

Hello, I have also been working on multi-oriented text detection recently. Could we exchange QQ contacts to discuss it?

@tianzhuotao @BowieHsu

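@13230380356

Hello, convert_caffemodel_to_ckpt.py contains "import model_vgg16". How do I install model_vgg16, and where should it go? Also, running run.sh fails with a caffe error. Online sources say it is a Python version problem and that I need to switch to Python 2.7, but your README says you use Python 3. Could you help clear this up?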

@ZimingLu

ZimingLu commented May 8, 2018

@13230380356 I just solved the pretrain problem. For details, see the tips I just wrote in #13.

@HardSoft2023

HardSoft2023 commented Nov 23, 2018

try "model_loader.restore(sess, './data/VGG_ILSVRC_16_layers_ssd/VGG_ILSVRC_16_layers_ssd.ckpt')" @Godricly

Everything is OK until:
2018-11-23 04:53:37,597 [INFO ] Restoring parameters from ../premodel/ILVSR_VGG_16_FC_REDUCED/VGG_ILSVRC_16_layers_ssd.ckpt
Segmentation fault (core dumped)
How can I debug this segmentation fault? Any comment is welcome.

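@Shualite

@BowieHsu @JiasiWang I built the TF files from the 40 GB SynthText dataset and pretrained for 90,000 steps. Because the default "finetune_model": "../exp/sgd/checkpoint" in finetune_ic15.json did not work, I changed it to "finetune_model": "../exp/sgd/checkpoint-90000". I then trained for another 10,000 steps. The result on the IC15 test set is only:
Recall | Precision | Hmean
59.56 % | 63.47 % | 61.45 %

Why didn't it reach 75%?
Hoping for a reply, friends. Thank you!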
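
For readers checking such tables: Hmean is the harmonic mean of recall and precision (the F-measure). A small sketch that recomputes it from the two reported values; `hmean` is a hypothetical helper, not the evaluation script used for the numbers above.

```python
def hmean(recall, precision):
    """Harmonic mean of recall and precision (both in the same units)."""
    if recall + precision == 0:
        return 0.0  # avoid dividing by zero when nothing was detected
    return 2 * recall * precision / (recall + precision)


# Recall and precision in percent, copied from the table above.
print(f"{hmean(59.56, 63.47):.2f} %")
```

The recomputed value agrees with the 61.45 % in the table, so the reported row is internally consistent.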

@Shualite

After changing the batch size to 32, the Hmean is still about 61%.

@Shualite

Testing the pretrained model directly, without fine-tuning, gives an Hmean of 49%.

@gzpyunduan

Testing the pretrained model directly, without fine-tuning, gives an Hmean of 49%.

I get the same result as you, and for now I don't know how to improve it.
