Training the transition model is too resource intensive, uses too much memory. Possible bug #27

Open
kamal94 opened this issue Sep 24, 2016 · 4 comments

Comments

kamal94 commented Sep 24, 2016

After training the autoencoder, I tried to train the transition model as described in the same document, using

./server.py --time 60 --batch 64

and

./train_generative_model.py transition --batch 64 --name transition

in two separate tmux sessions.

Within about a minute of running the training command, the process is killed because my memory and swap (16 GB + 10 GB) are used up, and I'm still on epoch one.

Here is a dump:

./train_generative_model.py transition --batch 64 --name transition
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: GeForce GTX 1060 6GB
major: 6 minor: 1 memoryClockRate (GHz) 1.7085
pciBusID 0000:01:00.0
Total memory: 5.93GiB
Free memory: 5.58GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0)
T.shape:  (64, 14, 512)
Transition variables:
transition/dreamyrnn_1_W:0
transition/dreamyrnn_1_U:0
transition/dreamyrnn_1_b:0
transition/dreamyrnn_1_V:0
transition/dreamyrnn_1_ext_b:0
Epoch 1/200
Killed
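
The T.shape: (64, 14, 512) line shows the transition model consuming batches of 64 sequences of 512-dimensional codes, and the crash is on host RAM and swap rather than GPU memory, so the pressure comes from the data the server process and the training generator keep buffered. One workaround worth trying (the values below are illustrative, not taken from this thread) is to shrink --batch, and possibly --time, on both processes and check whether the first epoch completes:

./server.py --time 30 --batch 32
./train_generative_model.py transition --batch 32 --name transition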
@EderSantana
Contributor

It is super resource intensive, yes. I have seen reports elsewhere that Keras leaks a lot of memory. I used to have a TensorFlow-only implementation that seemed lighter, but it was less convenient, which is why I opted for Keras for the release.

@sunny1986

@kamal94: Were you able to resolve this issue? I am having the same problem; my training sometimes fails on epoch 1/200 or 2/200 and never gets beyond that. Any suggestions?

@zhaohuaqing1993

How did you manage to train the autoencoder with train_generative_model.py successfully? I ran into some difficulty; did you have to change something in the code?
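
For reference, the autoencoder stage is launched the same way as the transition stage, with server.py feeding data in one tmux session and the trainer running in another. The invocation below simply mirrors the transition commands earlier in this thread; the batch value is only illustrative, so check the repo's README for the exact flags:

./server.py --batch 64
./train_generative_model.py autoencoder --batch 64 --name autoencoder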


pandamax commented May 17, 2017

Have you solved this issue? I am having the same problem; my training sometimes fails on epoch 10/200 or 40/200 and never gets beyond that. Any suggestions?
Traceback (most recent call last):
  File "./train_generative_model.py", line 168, in <module>
    nb_epoch=args.epoch, verbose=1, saver=saver
  File "./train_generative_model.py", line 84, in train_model
    z, x = next(generator)
  File "./train_generative_model.py", line 31, in gen
    X = cleanup(tup)
  File "/home/deep-learning/research-master/models/transition.py", line 34, in cleanup
    X = X/127.5 - 1.
MemoryError
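
The MemoryError on X = X/127.5 - 1. is consistent with NumPy's type promotion: dividing a uint8 image batch by a Python float allocates a brand-new float64 array roughly eight times the size of the input, so a batch that fits comfortably as uint8 can exhaust RAM the moment it is normalized. Below is a minimal sketch of the effect and of a lower-memory variant, assuming X is a plain NumPy uint8 array; the shape is only an example, not the repo's actual batch shape.

import numpy as np

X = np.zeros((64, 60, 3, 160, 320), dtype=np.uint8)  # example batch, ~560 MB as uint8

# Pattern from the traceback: promotes to float64, ~8x the uint8 footprint.
# X = X / 127.5 - 1.

# Lower-memory variant: one float32 copy (~4x), then scale in place with no extra temporaries.
Xf = X.astype(np.float32)
Xf /= 127.5
Xf -= 1.0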
