
LevelDB core dump while snapshotting (out of files) #38

Closed
sguada opened this issue Jan 19, 2014 · 12 comments

@sguada (Contributor) commented Jan 19, 2014

Unexpected core dump in io.cpp, in WriteProtoToBinaryFile at CHECK(proto.SerializeToOstream(&output)). After writing 1 or 2 snapshots correctly, the next one fails.

@lifeiteng commented

I also have this problem. I changed the code to let the program continue when this error occurs.

@sguada (Contributor, Author) commented Jan 19, 2014

Thanks for sharing. Do you know if it was later able to write a snapshot? Otherwise the workaround will be useless. I will check other options too.

Sergio
On Jan 18, 2014 7:30 PM, "Feiteng Li" notifications@github.com wrote:

I also have this problem. I change the code, let the program continue if
this problem occurs.


Reply to this email directly or view it on GitHub: https://github.com//issues/38#issuecomment-32700185

@sguada (Contributor, Author) commented Jan 21, 2014

I have changed the code to let the program keep running, but afterwards it is not able to snapshot the network again, and therefore becomes useless, since I can never save the parameters.
For now what I'm doing is training for 10000 iterations, making 2 snapshots, and then resuming from there.

@lifeiteng commented

The program can keep running after the change... (no, it cannot work normally!)
I changed the function like this:

```cpp
void WriteProtoToBinaryFile(const Message& proto, const char* filename) {
  fstream output(filename, ios::out | ios::trunc | ios::binary);
  // CHECK(proto.SerializeToOstream(&output));
  if (!proto.SerializeToOstream(&output)) {  // added by LiFT
    fstream out("SerializeToOstream_Error.txt", fstream::out | fstream::app);
    out << "---- SerializeToOstream Error: file " << filename << " ----\n";
    out.close();
  }
  output.close();  // added by LiFT
}
```

@sguada (Contributor, Author) commented Jan 23, 2014

@lifeiteng thanks for sharing your code. I tried something similar on my own and was able to keep the code running, but the problem is that after the first failure in WriteProtoToBinaryFile all later attempts fail too, so I can never get a snapshot of the network for later use.

What I did is change the parameter snapshot_prefix:, and then the code started working again. I don't know yet why it was failing in the first case; I cannot think of any explanation for why it fails only sometimes.

@lifeiteng commented

What is the cause of the problem?
It always happens when I use the code for acoustic modeling (I have changed the code to make it work for acoustic modeling).

@sguada (Contributor, Author) commented Feb 8, 2014

I haven't found the reason yet, but I just changed the snapshot_prefix and it worked. My only explanation is that maybe there were too many snapshots with that name on the disk already.

Sergio

2014-02-07 Feiteng Li notifications@github.com:

What is the cause of the problem?
It always happens when I use the code to do Acoustic Modeling(I have
changed the code to make it OK for Acoustic Modeling).


Reply to this email directly or view it on GitHub: https://github.com//issues/38#issuecomment-34526219

@lifeiteng commented

The snapshot name is stored in a string value.

```cpp
template <typename Dtype>
void Solver<Dtype>::Snapshot() {
  NetParameter net_param;
  // For intermediate results, we will also dump the gradient values.
  net_->ToProto(&net_param, param_.snapshot_diff());
  string filename(param_.snapshot_prefix());
  char iter_str_buffer[20];
  sprintf(iter_str_buffer, "_iter_%d", iter_);
  filename += iter_str_buffer;
  LOG(INFO) << "Snapshotting to " << filename;
  WriteProtoToBinaryFile(net_param, filename.c_str());  // write error happens here
  SolverState state;
  SnapshotSolverState(&state);
  state.set_iter(iter_);
  state.set_learned_net(filename);
  filename += ".solverstate";
  LOG(INFO) << "Snapshotting solver state to " << filename;
  WriteProtoToBinaryFile(state, filename.c_str());
}

void WriteProtoToBinaryFile(const Message& proto, const char* filename) {
  fstream output(filename, ios::out | ios::trunc | ios::binary);
  CHECK(proto.SerializeToOstream(&output));
}
```

error information:

```
I0210 14:28:27.514936 22349 solver.cpp:126] Snapshotting to cnn_iter_10000
F0210 14:28:27.960814 22349 io.cpp:69] Check failed: proto.SerializeToOstream(&output)
*** Check failure stack trace: ***
    @ 0x7f5a486c9b7d (unknown)
    @ 0x7f5a486cbc7f (unknown)
    @ 0x7f5a486c976c (unknown)
    @ 0x7f5a486cc51d (unknown)
    @ 0x41fbfd (unknown)
    @ 0x4212e8 (unknown)
    @ 0x4251db (unknown)
    @ 0x40f3be (unknown)
    @ 0x7f5a474cb76d (unknown)
    @ 0x4109ad (unknown)
```

If I restart the training using the latest cnn_xx_xx.solverstate, this problem occurs about every 10 snapshots.

@Yangqing (Member) commented

@sguada Sergio, I was visiting a friend who was using caffe in his work, and he pointed me to his solution: it turns out that leveldb opens too many files for caching - the default is 1000, and the Ubuntu default open file limit is 1024. That puts it dangerously near the limit, so you see random crashes from SerializeToOstream().

You could try either reducing the leveldb cache size (see #13) or increasing the open file limit:

http://posidev.com/blog/2009/06/04/set-ulimit-parameters-on-ubuntu/

Let me know if it works :)

@sguada (Contributor, Author) commented Feb 22, 2014

Thanks @Yangqing, that probably explains why the error was a bit random at times. I think we should make leveldb's options.max_open_files = 10 the default, since we are reading in sequence and having multiple open files will not help; I guess that would only be useful for random access.

@shelhamer (Member) commented

Symptom of the same leveldb number-of-open-files issue as #13.

The solution is to modify src/caffe/layers/data_layer.cpp by setting options.max_open_files = 100 (or any number significantly lower than 1000), as discovered and confirmed by @reedscot, @Yangqing, and @sguada.

Fixed by #154.
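For readers asking where the change goes: the fix amounts to setting max_open_files on the leveldb::Options passed to leveldb::DB::Open when the data layer opens its database. A sketch of the relevant fragment in src/caffe/layers/data_layer.cpp, with identifiers approximating the Caffe code of that era (treat the surrounding names as illustrative, not verbatim):

```cpp
// In DataLayer<Dtype>::SetUp, where the leveldb is opened:
leveldb::DB* db_temp;
leveldb::Options options;
options.create_if_missing = false;
options.max_open_files = 100;  // keep well below the 1024 ulimit default
LOG(INFO) << "Opening leveldb " << this->layer_param_.source();
leveldb::Status status = leveldb::DB::Open(
    options, this->layer_param_.source(), &db_temp);
CHECK(status.ok()) << "Failed to open leveldb "
                   << this->layer_param_.source();
```

Because the training net reads the database strictly in sequence, a small table-cache budget costs nothing here; the default of 1000 cached files only pays off for random-access workloads.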

shelhamer added a commit that referenced this issue Feb 25, 2014
Set leveldb options.max_open_files = 100 and Fix #13 and #38
shelhamer pushed a commit to shelhamer/caffe that referenced this issue Feb 26, 2014
shelhamer added a commit to shelhamer/caffe that referenced this issue Feb 26, 2014
Set leveldb options.max_open_files = 100 and Fix BVLC#13 and BVLC#38
shelhamer added a commit that referenced this issue Feb 26, 2014
Set leveldb options.max_open_files = 100 and Fix #13 and #38
shelhamer added a commit that referenced this issue Feb 26, 2014
Set leveldb options.max_open_files = 100 and Fix #13 and #38
mitmul pushed a commit to mitmul/caffe that referenced this issue Sep 30, 2014
@xiadaoxun commented

> Solution is to modify src/caffe/layers/data_layer.cpp by setting options.max_open_files = 100 (or any number significantly lower than 1000)

Can you give an example? Where should the code be inserted?

anandthakker pushed a commit to anandthakker/caffe that referenced this issue Jul 19, 2016
anandthakker pushed a commit to anandthakker/caffe that referenced this issue Jul 19, 2016
anandthakker pushed a commit to anandthakker/caffe that referenced this issue Aug 2, 2016
mindcont pushed a commit to mindcont/caffe that referenced this issue Apr 25, 2017
twmht pushed a commit to twmht/caffe that referenced this issue Aug 20, 2018
soulsheng pushed a commit to soulsheng/caffe that referenced this issue Oct 19, 2023