LevelDB core dump while snapshotting (out of files) #38
Comments
I also have this problem. I changed the code to let the program continue when this error occurs.

Thanks for sharing. Do you know if it was later able to write a snapshot? Sergio
I have changed the code to let the program keep running, but afterwards it is not able to snapshot the network again, and therefore becomes useless, since I can never save the parameters.
The program can work normally after modifying `WriteProtoToBinaryFile(const Message& proto, const char* filename)` ... (edit: NO, it cannot work normally!)
@lifeiteng thanks for sharing your code. I tried something similar on my own and was able to keep the code running. But the problem is that after the first failure in WriteProtoToBinaryFile, all later attempts fail too, so I can never get a snapshot of the network for later use. What I did is change the parameter snapshot_prefix, and then the code started working again. I don't know yet why it was failing in the first case; I cannot think of any explanation for why it fails sometimes.
What is the cause of the problem? |
I haven't found the reason yet, but I just changed the snapshot_prefix. Sergio

The snapshot name is stored in a string value and passed to `WriteProtoToBinaryFile(const Message& proto, const char* filename)`. Error information: if I restart the training using the latest cnn_xx_xx.solverstate, this problem occurs at every 10th snapshot.
@sguada Sergio, I was visiting a friend who was using caffe in their work, and he pointed me to his solution: it turns out that leveldb is opening too many files for caching. The default is 1000, and the Ubuntu default open-file limit is 1024. This puts the process dangerously close to the limit, so you are seeing random crashes from SerializeToOstream(). You could try either reducing the leveldb cache size (see #13), or increasing the open-file limit: http://posidev.com/blog/2009/06/04/set-ulimit-parameters-on-ubuntu/ Let me know if it works :)
Thanks @Yangqing, that probably explains why the error was a bit random sometimes. I think we should make
The solution is to modify src/caffe/layers/data_layer.cpp by setting options.max_open_files = 100 (or any number significantly lower than 1000).

Can you give some examples? Where should the code be inserted?
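To illustrate where that option would go: when DataLayer opens its leveldb source, it builds a `leveldb::Options` object before calling `leveldb::DB::Open`, and `max_open_files` is a real field of that struct. The sketch below shows the shape of such a change; the function name `OpenSourceDB` is made up for illustration (it is not Caffe's actual function), and `CHECK` stands in for Caffe's glog-based error handling:

```cpp
// Sketch only: in src/caffe/layers/data_layer.cpp, the options are
// configured just before the database is opened.
#include <leveldb/db.h>
#include <leveldb/options.h>

leveldb::DB* OpenSourceDB(const std::string& source) {
  leveldb::Options options;
  options.create_if_missing = false;
  // LevelDB's default is 1000; with Ubuntu's default limit of 1024
  // descriptors that leaves almost no headroom, so cap the cache
  // well below the limit.
  options.max_open_files = 100;

  leveldb::DB* db = nullptr;
  leveldb::Status status = leveldb::DB::Open(options, source, &db);
  CHECK(status.ok()) << "Failed to open leveldb " << source;
  return db;
}
```

The trade-off is that a smaller `max_open_files` means LevelDB reopens table files more often during reads, which can slow data loading slightly, but it frees descriptors for everything else the process does, including writing snapshots.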
Unexpected core dump in io.cpp, in WriteProtoToBinaryFile at:

CHECK(proto.SerializeToOstream(&output));

After writing correctly 1 or 2 times, the next one fails.