Chapter 6 default code on Dogs vs. Cats: loss drops to nearly 0, but validation accuracy stays at 50% #23

Open
ofooo opened this issue Mar 11, 2018 · 23 comments

@ofooo

ofooo commented Mar 11, 2018

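Hello. I ran the Chapter 6 code on Dogs vs. Cats. The loss drops to nearly 0, yet validation accuracy stays at 50%.
Is this overfitting? Is this result normal for the example code? What should I try in order to improve it?
Thank you.
[Attached image: default]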

@chenyuntc
Owner

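A loss of about 0.69 means the model has learned nothing (ln 2 ≈ 0.693).
This is not overfitting; it is probably vanishing gradients. Either way, training has collapsed.

I suggest using ResNet rather than AlexNet.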
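
To see why a loss stuck near 0.69 signals no learning, the arithmetic can be checked directly. This is a minimal sketch, assuming the loss is a two-class cross-entropy with natural logarithms (the thread does not show the loss code); the function name `cross_entropy` is illustrative only.

```python
import math

def cross_entropy(probs, target):
    """Cross-entropy (natural log) of one sample, given class probabilities."""
    return -math.log(probs[target])

# A model that has learned nothing assigns 0.5 to each of the two classes.
uniform_loss = cross_entropy([0.5, 0.5], target=0)
print(round(uniform_loss, 3))  # 0.693, i.e. ln 2

# A confident, correct prediction gives a loss close to 0.
confident_loss = cross_entropy([0.99, 0.01], target=0)
print(round(confident_loss, 3))
```

Under that assumption, a loss that never leaves 0.693 is what a model predicting 0.5 for every image would produce, which matches the 50% accuracy reported above.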

@vaeXu

vaeXu commented Mar 16, 2018

Hello, I ran the Chapter 6 code and increased the training set from 2,000 to 20,000 images, but the output is always 0.9999... What is the cause, and how can I fix it?

[25391, 0.9999626874923706]
[25410, 0.9999626874923706]
[25473, 0.9999626874923706]
[25495, 0.9999626874923706]
[156420, 0.9999626874923706]
[156422, 0.9999626874923706]
[156469, 0.9999626874923706]
[156484, 0.9999626874923706]
[156499, 0.9999626874923706]
[156505, 0.9999626874923706]
[156514, 0.9999626874923706]
[156585, 0.9999626874923706]
[156600, 0.9999626874923706]
[156604, 0.9999626874923706]
[156637, 0.9999626874923706]
[156646, 0.9999626874923706]
[156647, 0.9999626874923706]
[156696, 0.9999626874923706]
[156754, 0.9999626874923706]
[156755, 0.9999626874923706]
[156761, 0.9999626874923706]
[156788, 0.9999626874923706]
[201101, 0.9999626874923706]
[201104, 0.9999626874923706]
[201108, 0.9999626874923706]
[201142, 0.9999626874923706]
[201174, 0.9999626874923706]
[201176, 0.9999626874923706]
[201180, 0.9999626874923706]
[201186, 0.9999626874923706]
[201193, 0.9999626874923706]
[201198, 0.9999626874923706]
[201200, 0.9999626874923706]
[201216, 0.9999626874923706]
[201225, 0.9999626874923706]
[201245, 0.9999626874923706]
[201251, 0.9999626874923706]
[201256, 0.9999626874923706]
[201271, 0.9999626874923706]
[201318, 0.9999626874923706]

@chenyuntc
Owner

What does this 0.99 mean?

@vaeXu

vaeXu commented Mar 19, 2018

results += batch_results

write_csv(results,opt.result_file)
I printed results directly. Aren't these the predicted values for the corresponding images?

@chenyuntc
Owner

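Right. I found that the default parameters in the new version are somewhat off (the learning rate and weight_decay are too large). I will look into it over the next few days. In the meantime, try setting the learning rate to 0.001, lr_decay to 0.5, and weight_decay to 0.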

@vaeXu

vaeXu commented Mar 19, 2018

OK

@bobo0810

bobo0810 commented Apr 2, 2018

Set the learning rate to 0.001 and lr_decay to 0.95, then train for 100 epochs; validation accuracy reaches about 97%. I tested this myself.

@arisliang

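@bobo0810 Could you post your training loss curve so I can see where it starts to break below 0.69? I switched to the parameters you mentioned, but my loss is still stuck at 0.69.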

@arisliang

After I increased the batch size from 4 (#37) to 32, the loss started to break below 0.69.

@nemonameless

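@bobo0810 I set the learning rate to 0.001 and lr_decay to 0.95 and trained for 100 epochs, but the predicted probability for every test image is still about 0.5. Is that overfitting, or vanishing gradients?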

@bobo0810

@nemonameless
model = 'ResNet34' # model to use; the name must match the one in models/__init__.py

train_data_root = './data/train/' # path to the training set
test_data_root = './data/test' # path to the test set
load_model_path = './checkpoints/resnet34_0401_11:01:17.pth' # path of the pretrained model to load; None means do not load

batch_size = 128 # batch size
use_gpu = True # use GPU or not
num_workers = 4 # how many workers for loading data
print_freq = 20 # print info every N batch

debug_file = '/tmp/debug' # if os.path.exists(debug_file): enter ipdb
result_file = 'result.csv'

max_epoch = 1000
lr = 0.001 # initial learning rate
lr_decay = 0.95 # when val_loss increase, lr = lr*lr_decay
weight_decay = 0 # weight decay coefficient (L2 penalty)
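
The comment on `lr_decay` describes the schedule: whenever the validation loss increases, the learning rate is multiplied by `lr_decay`. Below is a minimal pure-Python sketch of that rule, assuming the comparison is against the previous epoch's loss. The function `decay_schedule` and the sample loss values are made up for illustration and are not taken from the repository.

```python
def decay_schedule(val_losses, lr=0.001, lr_decay=0.95):
    """Return the learning rate in effect after each epoch.

    Rule (from the config comment): when val_loss increases, lr = lr * lr_decay.
    """
    history = []
    previous = float("inf")
    for loss in val_losses:
        if loss > previous:
            lr = lr * lr_decay  # shrink the step size after a worse epoch
        previous = loss
        history.append(lr)
    return history

# Hypothetical validation losses; the loss rises at the 3rd and 5th epochs.
losses = [0.69, 0.60, 0.62, 0.55, 0.58]
for lr_decay in (0.95, 0.5):
    print(lr_decay, decay_schedule(losses, lr_decay=lr_decay)[-1])
```

With the same loss history, `lr_decay = 0.5` shrinks the rate much faster than `0.95`, which is one difference between the settings tried in this thread.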

@nemonameless

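@bobo0810 Thanks. Your max_epoch is 100, right? I'm training on an Alibaba Cloud host and haven't set up visualization of the training process yet. Your run is one that succeeded earlier, right? I don't know whether the author has changed anything recently. My parameters basically follow the author's defaults. lr_decay = 0.95 matches yours, and I also tried the author's default lr_decay = 0.5. Either way, the trained model behaves abnormally on the test set: every image scores about 0.49, no better than random guessing. I don't know why.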

@bobo0810

@nemonameless Yes, it reached 0.97.

@qingfenghcy

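@bobo0810 Hello, I'm using the Chapter 6 code from the 0.3 branch and get an error during training that I haven't been able to solve. Could you take a look? Thanks.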
Traceback (most recent call last):
  File "main.py", line 172, in <module>
    fire.Fire()
  File "/opt/anaconda3/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/opt/anaconda3/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/opt/anaconda3/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "main.py", line 112, in train
    model.save()
  File "/home/dep_pic/code/hcy/pytorch/Dogs_vs_cats/models/BasicModule.py", line 28, in save
    t.save(self.state_dict(), name)
  File "/opt/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 135, in save
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/opt/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 115, in _with_file_like
    f = open(f, mode)
PermissionError: [Errno 13] Permission denied: 'checkpoints/resnet34_0426_03:42:49.pth'

@bobo0810

@qingfenghcy Permission denied: could this be a permissions problem? Is your checkpoints folder in the same directory as the rest of the project code?

@qingfenghcy

@bobo0810 Got it... thanks. I had put the data under the root account... Thanks, it's solved.
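
A `PermissionError` like the one in this exchange can be caught before a long training run by checking that the checkpoint directory is writable. The following is a sketch only: the helper `can_write_checkpoints` is hypothetical and not part of the repository, and the demo uses a temporary directory rather than the project's real `checkpoints` folder.

```python
import os
import tempfile

def can_write_checkpoints(directory):
    """Return True if a file can be created in `directory`."""
    try:
        os.makedirs(directory, exist_ok=True)
        # Creating and removing a temporary file is a direct writability test.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return True
    except OSError:  # PermissionError is a subclass of OSError
        return False

# Demo on a throwaway location (assumption: the temp directory is writable).
demo_dir = os.path.join(tempfile.mkdtemp(), "checkpoints")
print(can_write_checkpoints(demo_dir))
```

Calling such a check once at startup, with the directory the training script will save to, fails fast instead of after the first epoch.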

@Debatrix

Debatrix commented Aug 1, 2018

Even lr=0.001 is still too high for AlexNet. I had to lower it to 5e-5 before the loss decreased effectively; at epoch 40 the loss drops below 0.2.

@JalexDooo

I ran into the same problem; the loss stays at about 0.693.

@JalexDooo

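The stochastic gradient descent optimizer may not be a good fit here. With SGD my loss stayed at 0.693; after switching to the author's Adam, the loss problem went away.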

@1171273538

Traceback (most recent call last):
  File "main.py", line 164, in <module>
    import fire
  File "D:\Users\Yang\Anaconda3\lib\site-packages\fire\core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "D:\Users\Yang\Anaconda3\lib\site-packages\fire\core.py", line 366, in _Fire
    component, remaining_args)
  File "D:\Users\Yang\Anaconda3\lib\site-packages\fire\core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "main.py", line 54, in train
    model.load(opt.load_model_path)
AttributeError: module 'models.resnet34' has no attribute 'to'
I ran the code as-is, changing only the data source path and the CPU acceleration setting. What causes this?

@piddnad

piddnad commented Mar 21, 2019

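@nemonameless Hello, I have run into this problem too. The loss keeps decreasing and validation accuracy keeps rising, but on the test set everything collapses to about 0.5...
Have you found the cause?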

@piddnad

piddnad commented Mar 22, 2019

I found the problem: the test command in the README is wrong. The pretrained model should be passed with the --load-model-path parameter, not --load-path.
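
A mistyped option name such as `--load-path` is easy to miss if a configuration accepts unknown keys, since that can leave the model running without the intended weights. As a sketch of one way to surface the mistake early, the hypothetical `Config` class below rejects option names it does not define. The class, its `parse` method, and the path `checkpoints/model.pth` are illustrative assumptions and do not reproduce the repository's configuration code.

```python
class Config:
    """Hypothetical option container with a strict parser."""
    load_model_path = None  # path to a trained checkpoint; None means do not load
    batch_size = 128

    def parse(self, **kwargs):
        for key, value in kwargs.items():
            if not hasattr(self, key):
                # Fail loudly instead of silently accepting a misspelled option.
                raise AttributeError(f"unknown option: {key!r}")
            setattr(self, key, value)
        return self

# Correct option name: the checkpoint path is stored.
opt = Config().parse(load_model_path="checkpoints/model.pth")
print(opt.load_model_path)

# Mistyped option name: rejected before any testing starts.
try:
    Config().parse(load_path="checkpoints/model.pth")
except AttributeError as err:
    print(err)
```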

@halfsummer583

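Hello, when I run the Chapter 6 code I keep getting an error saying the resnet34 module cannot be found. How should I fix it?
[Attached image]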
