RuntimeError during the training example with OIM #4

GBJim · 2017-06-14T06:24:23Z

Hi all

After I executed the command
python examples/resnet.py -d viper -b 64 -j 2 --loss oim --logs-dir logs/resnet-viper-oim
I encountered the following errors:

Process Process-4:
Traceback (most recent call last):
File "/root/miniconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/root/miniconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/root/miniconda2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 45, in _worker_loop
data_queue.put((idx, samples))
File "/root/miniconda2/lib/python2.7/multiprocessing/queues.py", line 392, in put
return send(obj)
File "/root/miniconda2/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 17, in send
ForkingPickler(buf, pickle.HIGHEST_PROTOCOL).dump(obj)
File "/root/miniconda2/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/root/miniconda2/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/root/miniconda2/lib/python2.7/pickle.py", line 554, in save_tuple
save(element)
File "/root/miniconda2/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/root/miniconda2/lib/python2.7/pickle.py", line 606, in save_list
self._batch_appends(iter(obj))
File "/root/miniconda2/lib/python2.7/pickle.py", line 639, in _batch_appends
save(x)
File "/root/miniconda2/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/root/miniconda2/lib/python2.7/multiprocessing/forking.py", line 67, in dispatcher
self.save_reduce(obj=obj, *rv)
File "/root/miniconda2/lib/python2.7/pickle.py", line 401, in save_reduce
save(args)
File "/root/miniconda2/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/root/miniconda2/lib/python2.7/pickle.py", line 554, in save_tuple
save(element)
File "/root/miniconda2/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/root/miniconda2/lib/python2.7/multiprocessing/forking.py", line 66, in dispatcher
rv = reduce(obj)
File "/root/miniconda2/lib/python2.7/site-packages/torch/multiprocessing/reductions.py", line 113, in reduce_storage
fd, size = storage.share_fd()
RuntimeError: unable to write to file </torch_29225_1654046705> at /py/conda-bld/pytorch_1493669264383/work/torch/lib/TH/THAllocator.c:267

When I switch to the xentropy loss with
python examples/resnet.py -d viper -b 64 -j 1 --loss xentropy --logs-dir logs/resnet-viper-xentropy
the following error occurred:

Exception in thread Thread-1:
Traceback (most recent call last):
File "/root/miniconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/root/miniconda2/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/root/miniconda2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 51, in _pin_memory_loop
r = in_queue.get()
File "/root/miniconda2/lib/python2.7/multiprocessing/queues.py", line 378, in get
return recv()
File "/root/miniconda2/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 22, in recv
return pickle.loads(buf)
File "/root/miniconda2/lib/python2.7/pickle.py", line 1388, in loads
return Unpickler(file).load()
File "/root/miniconda2/lib/python2.7/pickle.py", line 864, in load
dispatch[key](self)
File "/root/miniconda2/lib/python2.7/pickle.py", line 1139, in load_reduce
value = func(*args)
File "/root/miniconda2/lib/python2.7/site-packages/torch/multiprocessing/reductions.py", line 68, in rebuild_storage_fd
fd = multiprocessing.reduction.rebuild_handle(df)
File "/root/miniconda2/lib/python2.7/multiprocessing/reduction.py", line 155, in rebuild_handle
conn = Client(address, authkey=current_process().authkey)
File "/root/miniconda2/lib/python2.7/multiprocessing/connection.py", line 169, in Client
c = SocketClient(address)
File "/root/miniconda2/lib/python2.7/multiprocessing/connection.py", line 308, in SocketClient
s.connect(address)
File "/root/miniconda2/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused

In both situations, the terminal freezes after these errors are printed. I have to kill the corresponding Python process in order to exit.
Any suggestions to solve this?
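
The first traceback fails inside storage.share_fd(), where a DataLoader worker writes a batch into shared memory. A minimal sketch for checking how much shared-memory space is left, assuming it is mounted at /dev/shm (typical on Linux, but an assumption here; nothing in this thread confirms that a full /dev/shm is the cause). The helper name shm_free_bytes is hypothetical.

```python
import os


def shm_free_bytes(path="/dev/shm"):
    """Return the number of bytes available to unprivileged users under path."""
    stats = os.statvfs(path)
    return stats.f_bavail * stats.f_frsize


if __name__ == "__main__":
    shm = "/dev/shm"  # assumed mount point of the shared-memory filesystem
    if os.path.isdir(shm):
        free_mib = shm_free_bytes(shm) / (1024.0 * 1024.0)
        print("free space in %s: %.1f MiB" % (shm, free_mib))
    else:
        print("%s not found on this system" % shm)
```

If the reported value is small compared with the size of a batch, shared memory is a plausible suspect worth ruling out before looking elsewhere.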


Cysu · 2017-06-14T06:57:51Z

Does the official MNIST example run fine on your machine?

GBJim · 2017-06-15T09:38:06Z

Hi @Cysu
I ran the MNIST example and no errors occurred.

I also tried to train the inception net from the examples: python examples/inception.py -d viper -b 64 -j 2 --loss xentropy --logs-dir logs/inception-viper-xentropy
No errors occurred there either.

Interestingly, when I tried to train ResNet again,
the training process froze as shown below, with no error messages:

Files already downloaded and verified
VIPeR dataset loaded
  subset   | # ids | # images
  ---------------------------
  train    |   216 |      432
  val      |   100 |      200
  trainval |   316 |      632
  query    |   316 |      632
  gallery  |   316 |      632
Epoch: [0][1/7] Time 160.275 (160.275) Data 0.446 (0.446) Loss 5.375 (5.375) Prec 0.00% (0.00%)
Epoch: [0][2/7] Time 0.563 (80.419) Data 0.001 (0.223) Loss 10.057 (7.716) Prec 0.00% (0.00%)

Could this be caused by GPU resource contention?
A Caffe process is also using my GPUs at the moment.

Cysu · 2017-06-15T11:47:49Z

I'm not sure, but it might be caused by a deadlock between PyTorch and Caffe, especially if both are using NCCL. You could try running it again once the Caffe experiments have finished.

GBJim · 2017-06-19T02:41:16Z

Hi @Cysu
Sorry for the late response.
I tried it again after my Caffe process had terminated.

The training freezes when the -j (worker) argument is set to a value greater than 1.
If the -j argument is set to 1, I get error: [Errno 111] Connection refused

Cysu · 2017-06-19T04:52:26Z

@GBJim Could you please change the num_workers in the official mnist example and see if it has the same problem?

GBJim · 2017-06-19T08:03:00Z

@Cysu:

I tested the MNIST example with 16 workers. Everything works correctly.

Cysu · 2017-06-19T08:30:38Z

Sorry, but currently I have no idea why this happens. There should not be much difference between our data loader and the MNIST one. I'm not sure whether it is related to running as root instead of a normal user on Linux.

GBJim · 2017-06-19T08:40:19Z

Thanks @Cysu
I will try to figure it out!

Cysu · 2017-07-04T01:44:05Z

@GBJim any luck on this?

GBJim · 2017-07-04T06:33:01Z

Hi @Cysu
I've built a new environment for open RE-ID and cloned the latest commit.
But it seems that resnet.py and inception.py have been removed from the examples folder.

Is there a new tutorial on how to run training or testing?
Thanks!

GBJim · 2017-07-04T07:03:47Z

It seems the code has been reorganized into oim_loss.py, softmax_loss.py, and triplet_loss.py.
Let me check whether these scripts work.

GBJim · 2017-07-04T09:02:44Z

@Cysu

I tried each of these commands: python examples/oim_loss.py -d viper, python examples/softmax_loss.py -d viper, and python examples/triplet_loss.py -d viper.
The following output was printed and then the process froze. I had to use Ctrl+Z to get out of the process.

root@e50f76502ce4:~/open-reid# python examples/oim_loss.py -d viper
Files already downloaded and verified
VIPeR dataset loaded
  subset   | # ids | # images
  ---------------------------
  train    |   216 |      432
  val      |   100 |      200
  trainval |   316 |      632
  query    |   316 |      632
  gallery  |   316 |      632

Cysu · 2017-07-04T09:30:48Z

@GBJim Oh, I forgot to update the tutorials. Just finished. Please check here.

Does the previous error still occur when -j 1 is used?

GBJim · 2017-07-04T10:00:55Z

@Cysu

The process still freezes when I set it to a single job. (Maybe I should wait longer for the process.)

I set -j to 1 and tried the following combinations:

OIM + ResNet --> Frozen

OIM + Inception --> RuntimeError: The expanded size of the tensor (128) must match the existing size (64) at non-singleton dimension 1. at /root/pytorch/torch/lib/THC/generic/THCTensor.c:323

SOFTMAX + ResNet --> Frozen

SOFTMAX + Inception --> Works Normally

And thank you for updating the documentation!

Cysu · 2017-07-05T01:53:47Z

That's weird... What's the script for OIM + Inception?

GBJim · 2017-07-05T05:32:50Z

@Cysu
python examples/oim_loss.py -d viper -a inception -j 1
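
Since several of the runs in this thread hang rather than fail, a time limit makes them easier to triage than waiting and pressing Ctrl+Z. A minimal sketch, assuming Python 3.3+ (the tracebacks above come from Python 2.7, where Popen.wait has no timeout argument). The helper name run_with_timeout is hypothetical, and the time limit is something you would need to choose yourself.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd (a list of arguments).

    Returns the exit code, or the string "frozen" if the process
    did not finish within timeout_s seconds.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # stop the hung process
        proc.wait()  # reap it so no zombie is left behind
        return "frozen"
```

For example, run_with_timeout(["python", "examples/oim_loss.py", "-d", "viper", "-a", "inception", "-j", "1"], timeout_s=600) would report either an exit code or "frozen" for the command above. Note that kill() only targets the launched process; DataLoader worker processes it spawned may need to be cleaned up separately.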

lzj322 · 2017-07-17T09:10:48Z

I ran into the same issue; the problems @GBJim described happen to me as well. In particular, inception.py runs fine, but resnet.py freezes.

GBJim · 2017-07-17T09:28:36Z

@lzj322 Do you use Nvidia-docker to host the environment?

lzj322 · 2017-07-17T13:22:32Z

@GBJim Yes. Would that be a problem? I don't know much about it. I asked the administrator to reset the Docker container. Now it gives normal results, but we don't know why.
I'm afraid this issue could happen again someday.

lzj322 · 2017-07-17T14:30:54Z

@GBJim, @Cysu I guess that PyTorch's DataParallel doesn't work well with Nvidia-docker. Or maybe it is caused by PyTorch itself (see the PyTorch forum).

Cysu · 2017-07-18T07:15:06Z

@lzj322 Yeah, two programs cannot run on the same device if using NCCL.
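
If the machine has more than one GPU, one way to keep two jobs off the same device is to restrict what each process can see through the CUDA_VISIBLE_DEVICES environment variable. A minimal sketch; the helper name env_for_gpus and the GPU index used below are assumptions for illustration, not something confirmed in this thread.

```python
import os


def env_for_gpus(gpu_ids, base_env=None):
    """Return a copy of the environment that exposes only the given GPU indices."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


if __name__ == "__main__":
    # Hypothetical: expose only GPU 1 to the training process.
    env = env_for_gpus([1])
    print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"])
```

The returned dictionary can be passed as the env argument of subprocess.Popen when launching the training script, so the other job's GPUs stay invisible to it.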

Cysu closed this as completed Sep 14, 2017
