RuntimeError during the training example with OIM #4
Comments
I wonder whether the official MNIST example runs fine on your machine?
Hi @Cysu, I also tried to train the Inception net from the examples:
The interesting thing is that I tried to train ResNet again:
Is this caused by GPU resource usage?
I'm not sure whether it is caused by a deadlock between PyTorch and Caffe, especially when both are using NCCL. You could try running it again once the Caffe experiments have finished.
Hi @Cysu, the training freezes when the -j (workers) argument is set to a value greater than 1.
I tested the MNIST example with 16 workers. Everything works correctly.
Sorry, but I currently have no idea why this happens. There should not be much difference between our data loader and the MNIST one. I'm not sure whether it is related to running as root instead of a normal user on Linux.
Thanks @Cysu
@GBJim any luck on this?
Hi @Cysu, is there a new tutorial on how to run training or testing?
It seems the code has been reorganized into oim_loss.py, softmax_loss.py, and triplet_loss.py.
I tried these commands:
The process still freezes when I use a single job. (Maybe I should wait longer for the process.) I set the number of jobs to 1 and tried the following combinations:
OIM + ResNet -->
OIM + Inception -->
SOFTMAX + ResNet -->
SOFTMAX + Inception -->
And thank you for updating the documentation!
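When it is unclear whether a run is frozen or just slow, one option is to launch each combination with a hard time limit. The following is only a minimal sketch using the standard library; `run_with_timeout` is a hypothetical helper name, not something from this repository, and the time limit is whatever you consider "too long" for your setup.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it exceeded timeout_s seconds."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
        return completed.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising the timeout.
        # Processes spawned by the child (e.g. data loader workers) may still
        # need to be cleaned up by hand.
        return None
```

For example, `run_with_timeout(["python", "examples/resnet.py", "-d", "viper", "-b", "64", "-j", "1", "--loss", "xentropy", "--logs-dir", "logs/resnet-viper-xentropy"], 600)` would return `None` if the command from the original report is still running after ten minutes (the 600 is an arbitrary choice).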
That's weird... What's the script for OIM + Inception?
@Cysu
@lzj322 Do you use Nvidia-docker to host the environment?
@GBJim yes. Would that be a problem? I don't know much about it. I asked the administrator to reset the docker. Now it has normal results. But we don't know why.
@GBJim, @Cysu I guess that PyTorch's DataParallel doesn't work well with Nvidia-docker. Or maybe it is caused by PyTorch itself; see the PyTorch forum.
@lzj322 Yeah, two programs cannot run on the same device if using NCCL.
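If the machine has more than one GPU, one way to keep two jobs off the same device is to restrict which GPUs each process can see through the `CUDA_VISIBLE_DEVICES` environment variable. This is a minimal sketch, not a confirmed fix for the freeze discussed here; `env_for_gpus` is a hypothetical helper name and the GPU indices are placeholders.

```python
import os


def env_for_gpus(gpu_ids, base_env=None):
    """Return a copy of the environment that exposes only the given GPU indices to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA reads this variable at start-up, so it must be set before the
    # training process launches (or before the framework initializes CUDA).
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env
```

The result can be passed as `env=env_for_gpus([1])` to `subprocess.run` when launching the training script, so that the job only sees GPU 1 while another program uses GPU 0.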
Hi all
After I executed the command
python examples/resnet.py -d viper -b 64 -j 2 --loss oim --logs-dir logs/resnet-viper-oim
I encountered the following errors:
When I switch to the xentropy loss with
python examples/resnet.py -d viper -b 64 -j 1 --loss xentropy --logs-dir logs/resnet-viper-xentropy
The following error occurred:
In both situations, the terminal freezes after these errors appear. I have to kill the corresponding Python process in order to exit.
Any suggestions to solve this?
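To see where a frozen Python process is stuck, the standard library's `faulthandler` module can dump the stack of every thread after a delay. This is a minimal sketch meant to be called near the top of the training script; `enable_hang_dump` is a hypothetical helper name and the timeout is an assumption to adjust for your run.

```python
import faulthandler
import sys


def enable_hang_dump(timeout_s, file=None):
    """Dump all thread tracebacks to `file` every timeout_s seconds until cancelled."""
    if file is None:
        file = sys.stderr
    # Also report fatal errors such as segmentation faults.
    faulthandler.enable(file=file)
    # With repeat=True the dump is written again after each timeout_s interval,
    # which shows whether the process is stuck in the same place.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)
```

Calling `faulthandler.cancel_dump_traceback_later()` stops the periodic dumps once the suspicious section has passed. Note that this only covers the process that calls it, not separately spawned worker processes.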