
ResourceExhaustedError after several iterations in a grid search #482

Closed · 4 tasks done
bjtho08 opened this issue Apr 19, 2020 · 33 comments

Labels: priority: MEDIUM (medium priority), topic: tensorflow (relates with tensorflow backend), value: ⭐⭐⭐ (high value)

Comments
@bjtho08 commented Apr 19, 2020

First off, make sure to check your support options.

The preferred way to resolve usage-related matters is through the docs, which are kept up to date with the latest version of Talos.

If you do end up asking for support in a new issue, make sure to follow the below steps carefully.

1) Confirm the below

  • I have looked for an answer in the Docs
  • My Python version is 3.5 or higher
  • I have searched through the issues for a duplicate
  • I've tested that my Keras model works as a stand-alone

2) Include the output of:

talos.__version__ == 0.6.7

3) Explain clearly what you are trying to achieve

I am running a grid search that comes out to 36 rounds.
After about 4 or 5 rounds, I suddenly get hit by a ResourceExhaustedError during model.fit. I find this very odd given that I am able to complete at least 3 rounds of fitting on the GPU (with a model and batch size that take up pretty much all of the GPU memory), so it seems there is a small but significant memory leak somewhere. Any ideas what it could be?
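
One way to narrow this down would be to watch GPU memory between rounds. A minimal sketch (the callback name is illustrative, not part of Talos or my code) that logs usage via nvidia-smi after every epoch:

import subprocess
from keras.callbacks import Callback  # or tensorflow.keras.callbacks, depending on the setup

class GPUMemoryLogger(Callback):
    # Logs GPU memory after every epoch, to see whether the baseline
    # usage creeps up from one Talos round to the next.
    def on_epoch_end(self, epoch, logs=None):
        used = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
        ).decode().strip()
        print("epoch {}: GPU memory used = {} MiB".format(epoch, used))

Passing an instance of this in the callbacks list given to fit would show whether memory accumulates across rounds.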

@bjtho08 (Author) commented Apr 19, 2020

My parameter dictionary is:

p = {
    "sigma_noise": [0, 0.01],
    "nb_filters_0": [16, 32, 64],
    "loss_func": ["cat_CE", "tversky_loss", "cat_FL"],
    "arch": ["U-Net"],
    "act": [Swish, ReLU],
}

And I'm running a U-Net with 34 million trainable parameters (for nb_filters_0 == 64), input dimensions of (208, 208, 3), a batch size of 12, and 400 epochs.
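
(For reference, the dictionary above gives 2 × 3 × 3 × 1 × 2 = 36 permutations, which matches the 36 rounds mentioned earlier.)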

@bjtho08 (Author) commented Apr 20, 2020

UPDATE: I did a "quick" test where I ran each model for only 50 epochs, and I got a ResourceExhaustedError again in round 4, during the 5th epoch. I think that was actually the exact same spot as before, when each of the 3 previous models had run for 100+ epochs. This tells me that the models are not properly cleaned out of GPU memory, and on top of that I might have a memory leak in my generator. @mikkokotila, what do you think?

@mikkokotila (Contributor) commented:

Very interesting. Can you post your full trace?

@bjtho08 (Author) commented Apr 21, 2020

Of course! See below. I also added the output leading up to it, because I think it gives some idea of how the exception occurs.

[screenshot "oom-crash": console output leading up to the exception]

---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input-10-d1427f7c3b24> in <module>
    104     params=p,
    105     experiment_name="talos/" + date_string,
--> 106     reduction_method='gamify',
    107 )
    108 

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/scan/Scan.py in __init__(self, x, y, params, model, experiment_name, x_val, y_val, val_split, random_method, seed, performance_target, fraction_limit, round_limit, time_limit, boolean_limit, reduction_method, reduction_interval, reduction_window, reduction_threshold, reduction_metric, minimize_loss, disable_progress_bar, print_params, clear_session, save_weights)
    194         # start runtime
    195         from .scan_run import scan_run
--> 196         scan_run(self)

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/scan/scan_run.py in scan_run(self)
     24         # otherwise proceed with next permutation
     25         from .scan_round import scan_round
---> 26         self = scan_round(self)
     27         self.pbar.update(1)
     28 

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/scan/scan_round.py in scan_round(self)
     17     # fit the model
     18     from ..model.ingest_model import ingest_model
---> 19     self.model_history, self.round_model = ingest_model(self)
     20     self.round_history.append(self.model_history.history)
     21 

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/model/ingest_model.py in ingest_model(self)
      8                       self.x_val,
      9                       self.y_val,
---> 10                       self.round_params)

~/myapps/mmciad/src/mmciad/utils/hyper.py in talos_model(x, y, val_x, val_y, talos_params)
    301                 class_weight=class_weights,
    302                 verbose=internal_params["verbose"],
--> 303                 callbacks=model_callbacks + opti_callbacks,
    304             )
    305         return history, model

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
     89                 warnings.warn('Update your `' + object_name + '` call to the ' +
     90                               'Keras 2 API: ' + signature, stacklevel=2)
---> 91             return func(*args, **kwargs)
     92         wrapper._original_function = func
     93         return wrapper

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
   1730             use_multiprocessing=use_multiprocessing,
   1731             shuffle=shuffle,
-> 1732             initial_epoch=initial_epoch)
   1733 
   1734     @interfaces.legacy_generator_methods_support

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/engine/training_generator.py in fit_generator(model, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
    218                                             sample_weight=sample_weight,
    219                                             class_weight=class_weight,
--> 220                                             reset_metrics=False)
    221 
    222                 outs = to_list(outs)

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/engine/training.py in train_on_batch(self, x, y, sample_weight, class_weight, reset_metrics)
   1512             ins = x + y + sample_weights
   1513         self._make_train_function()
-> 1514         outputs = self.train_function(ins)
   1515 
   1516         if reset_metrics:

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/keras/backend.py in __call__(self, inputs)
   3290 
   3291     fetched = self._callable_fn(*array_vals,
-> 3292                                 run_metadata=self.run_metadata)
   3293     self._call_fetch_callbacks(fetched[-len(self._fetches):])
   3294     output_structure = nest.pack_sequence_as(

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
   1456         ret = tf_session.TF_SessionRunCallable(self._session._session,
   1457                                                self._handle, args,
-> 1458                                                run_metadata_ptr)
   1459         if run_metadata:
   1460           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

ResourceExhaustedError: OOM when allocating tensor with shape[16,192,208,208] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node training/Adam/gradients/block1_u_conv1/convolution_grad/Conv2DBackpropInput}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
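
For scale, the failed allocation itself is modest (a rough back-of-the-envelope calculation, assuming 4-byte float32):

# tensor shape [16, 192, 208, 208], float32 (4 bytes per element)
16 * 192 * 208 * 208 * 4 / 2**30  # ≈ 0.5 GiB

So this looks less like one oversized tensor and more like the last allocation on an already nearly full 11 GB card, consistent with memory accumulating across rounds.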

@mikkokotila (Contributor) commented:

Have you looked at this SO post?

@mikkokotila (Contributor) commented:

How much memory does your GPU have?

@bjtho08 (Author) commented Apr 22, 2020

I just checked out your link, and it does not appear to describe the issue I am having, though at first glance it did look similar. I am running an Nvidia GeForce GTX 1080 Ti with 11 GB of RAM.

@mikkokotila (Contributor) commented:

Can you do this:

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

...and share the output you get.
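
For what it's worth, here is a sketch of how that hint is usually applied with multi-backend Keras on the TensorFlow backend (extra keyword arguments to compile() are forwarded to tf.Session.run, so the exact wiring may differ in your setup; the optimizer and loss below are placeholders):

import tensorflow as tf

# Ask TF to report the allocated tensors when an OOM occurs.
run_opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)

model.compile(
    optimizer="adam",                  # placeholder
    loss="categorical_crossentropy",   # placeholder
    options=run_opts,
)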

@bjtho08 (Author) commented Apr 23, 2020

I would love to, but that option crashes my Python kernel, so it's not really possible. This is a long-standing Keras bug, I believe.

@mikkokotila (Contributor) commented:

Yes, it most certainly is an upstream bug in Keras or TensorFlow.

To avoid doubt, can you share your Scan() command?

Also, how about giving Talos 1.0 a shot? It uses a different backend, so you might have better luck.

@bjtho08 (Author) commented Apr 23, 2020

Sure! I use custom keras.utils.Sequence data generators, so I have two dummy variables in my Scan() command, as shown below:

dummy_x = np.empty((1, BATCH_SIZE, 208, 208))
dummy_y = np.empty((1, BATCH_SIZE))

scan_object = ta.Scan(
    x=dummy_x,
    y=dummy_y,
    disable_progress_bar=False,
    print_params=True,
    model=talos_model,
    params=p,
    experiment_name="talos/" + date_string,
    reduction_method='gamify',
)

I will take a look at talos 1.0 right away!
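
One knob visible in the Scan() signature from the traceback above is clear_session (along with save_weights). A variant worth trying, as a sketch rather than a confirmed fix:

scan_object = ta.Scan(
    x=dummy_x,
    y=dummy_y,
    model=talos_model,
    params=p,
    experiment_name="talos/" + date_string,
    reduction_method='gamify',
    clear_session=True,    # clear the Keras session between rounds
    save_weights=False,    # do not keep every round's weights on the Scan object
)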

@bjtho08 (Author) commented Apr 23, 2020

So running Talos 1.0 had the same outcome, but with a slightly different error message at the end:

 14% |█▌        | 5/36 [1:52:10<11:43:52, 1362.35s/it]
{'act': <class 'keras_contrib.layers.advanced_activations.swish.Swish'>, 'arch': 'U-Net', 'loss_func': 'cat_CE', 'nb_filters_0': 64, 'sigma_noise': 0.01}
tracking <tf.Variable 'block1_d_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block1_d_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block2_d_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block2_d_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block3_d_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block3_d_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block4_d_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block4_d_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block5_bottom_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block5_bottom_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block4_u_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block4_u_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block3_u_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block3_u_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block2_u_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block2_u_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block1_u_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block1_u_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
Training |          | 0% 0/5 [00:00<?, ?it/s]
Epoch 0  |██▌        | [loss: 2.2988, acc: 0.1848, jaccard1_coef: 0.0575] : 25% 113/451 [01:19<03:12, 1.76it/s]
---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input-10-050e8f7c8199> in <module>
    104         params=p,
    105         experiment_name="talos/" + date_string,
--> 106         reduction_method='gamify',
    107     )
    108 

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/scan/Scan.py in __init__(self, x, y, params, model, experiment_name, x_val, y_val, val_split, random_method, seed, performance_target, fraction_limit, round_limit, time_limit, boolean_limit, reduction_method, reduction_interval, reduction_window, reduction_threshold, reduction_metric, minimize_loss, disable_progress_bar, print_params, clear_session, save_weights)
    194         # start runtime
    195         from .scan_run import scan_run
--> 196         scan_run(self)

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/scan/scan_run.py in scan_run(self)
     24         # otherwise proceed with next permutation
     25         from .scan_round import scan_round
---> 26         self = scan_round(self)
     27         self.pbar.update(1)
     28 

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/scan/scan_round.py in scan_round(self)
     17     # fit the model
     18     from ..model.ingest_model import ingest_model
---> 19     self.model_history, self.round_model = ingest_model(self)
     20     self.round_history.append(self.model_history.history)
     21 

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/model/ingest_model.py in ingest_model(self)
      8                       self.x_val,
      9                       self.y_val,
---> 10                       self.round_params)

~/myapps/mmciad/src/mmciad/utils/hyper.py in talos_model(x, y, val_x, val_y, talos_params)
    303                 class_weight=class_weights,
    304                 verbose=internal_params["verbose"],
--> 305                 callbacks=model_callbacks + opti_callbacks,
    306             )
    307         return history, model

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
     89                 warnings.warn('Update your `' + object_name + '` call to the ' +
     90                               'Keras 2 API: ' + signature, stacklevel=2)
---> 91             return func(*args, **kwargs)
     92         wrapper._original_function = func
     93         return wrapper

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
   1730             use_multiprocessing=use_multiprocessing,
   1731             shuffle=shuffle,
-> 1732             initial_epoch=initial_epoch)
   1733 
   1734     @interfaces.legacy_generator_methods_support

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/engine/training_generator.py in fit_generator(model, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
    218                                             sample_weight=sample_weight,
    219                                             class_weight=class_weight,
--> 220                                             reset_metrics=False)
    221 
    222                 outs = to_list(outs)

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/engine/training.py in train_on_batch(self, x, y, sample_weight, class_weight, reset_metrics)
   1512             ins = x + y + sample_weights
   1513         self._make_train_function()
-> 1514         outputs = self.train_function(ins)
   1515 
   1516         if reset_metrics:

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/keras/backend.py in __call__(self, inputs)
   3290 
   3291     fetched = self._callable_fn(*array_vals,
-> 3292                                 run_metadata=self.run_metadata)
   3293     self._call_fetch_callbacks(fetched[-len(self._fetches):])
   3294     output_structure = nest.pack_sequence_as(

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
   1456         ret = tf_session.TF_SessionRunCallable(self._session._session,
   1457                                                self._handle, args,
-> 1458                                                run_metadata_ptr)
   1459         if run_metadata:
   1460           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[16,64,208,208] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node GaussianNoise_preout/cond/add-1-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[metrics/acc/Identity/_1095]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[16,64,208,208] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node GaussianNoise_preout/cond/add-1-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

@mikkokotila (Contributor) commented:

Can you run the input model in a plain loop a few times and see if you get the same result? If yes, I suggest posting this directly to TensorFlow.

@mikkokotila mikkokotila self-assigned this Apr 27, 2020
@mikkokotila mikkokotila added the topic: tensorflow relates with tensorflow backend label Apr 27, 2020
@bjtho08 (Author) commented Apr 27, 2020

Do you mean a simple loop like this:

model = create_model(*args, **kwargs)

for _ in range(6):
    model.compile(**kwargs)
    model.fit(train_generator, epochs=10)

Or should I try to add some sort of garbage collection to this?
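
For completeness, the garbage-collection variant I have in mind would look roughly like this (a sketch; create_model, train_generator, and the *_kwargs placeholders stand in for my own code):

import gc
from keras import backend as K  # or tensorflow.keras.backend, depending on the setup

for _ in range(6):
    model = create_model(*args, **kwargs)   # rebuild the model every iteration
    model.compile(**compile_kwargs)
    model.fit(train_generator, epochs=10)
    del model
    K.clear_session()   # drop the TF graph/session backing the old model
    gc.collect()        # release lingering Python-side references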

@mikkokotila (Contributor) commented:

No, just the simplest possible loop.

@bjtho08 (Author) commented Apr 28, 2020

Is the above code example simple enough, or can it be even simpler?

EDIT: BTW, can you maybe elaborate a bit on what was changed in Talos 1.0? I tried upgrading to tf-2.2.0rc3 because it fixed a memory leak in the fit method related to the Keras Sequence class.

@bjtho08 (Author) commented Apr 28, 2020

So far, when running a loop like the one I wrote earlier, I am not getting any ResourceExhaustedError. I have almost completed 5 iterations of the loop with 50 epochs per iteration. With Talos, it crashed at the beginning of the fifth iteration.

@bjtho08 (Author) commented Apr 29, 2020

Okay, so the loop was set to do ten training sessions of 50 epochs each, since I knew that 50 epochs was enough to trigger the ResourceExhaustedError after 5 iterations in the Talos Scan(). It has now completed all 10 passes of the loop without any errors whatsoever. I assume this rules out it being a TensorFlow bug?

@bjtho08 (Author) commented Apr 29, 2020

For good measure, I redid the Scan() just to confirm that updating some of the packages did not alter the outcome. Rather than getting the ResourceExhaustedError, my kernel crashed completely (though it may still be due to a ResourceExhaustedError). Any ideas on how to proceed?

@bjtho08 (Author) commented May 1, 2020

I have now tested it on a different machine with a larger GPU (an NVIDIA Quadro RTX 6000 with 24 GB of RAM) and the same thing happens.

@bjtho08 (Author) commented May 3, 2020

To summarize

The bug(?) appears on two systems with the below configuration(s):

  • Nvidia GTX 1080 Ti or Nvidia Quadro RTX 6000
  • Nvidia driver 418.87.00
  • CUDA 10.1
  • CuDNN 7.6.5
  • Python >=3.7.6
  • TensorFlow 1.13, 2.1, and >=2.2.0rc2
  • Talos >= 0.6.0

It does not appear to happen if a model is compiled and fitted several times in a simple loop, which seems to rule out this being a TensorFlow problem.

@mikkokotila (Contributor) commented May 5, 2020

Is it possible for you to share a self-contained Jupyter notebook or Colab, so I can just run it and reproduce the problem?

Also, is create_model identical in both cases?

@bjtho08 (Author) commented May 7, 2020

Yes, create_model is identical. I will try and see if I can make a self-contained notebook. For now, my solution has been to modify Talos to accept a new boolean parameter allow_resume. If True, it saves the ParamSpace, the list of keys/metrics, and the various stores to files on disk; in the event of a crash (or interrupt) it reads these files back and restores the important parts of the Scan object before executing scan_run(). It might sacrifice some efficiency, but it sure beats never getting to the finish line ;)

BTW, are there any special considerations behind doing method-level imports rather than module/top-level imports?

EDIT: If you want, you can have a look at my fork of Talos and see what I changed. I haven't committed the latest additions yet, but the primary pieces are in place.

@mikkokotila (Contributor) commented:

Sorry, I totally missed this.

BTW, are there any special considerations behind doing method-level imports rather than module/top-level imports?

Yes. Chunks of code stay self-contained, readability improves, imports only happen when they are needed, etc.

How about we implement the feature described above in v1.1?

@bjtho08 (Author) commented Nov 11, 2020

We could do that, but I'm not sure my hack is the best way to go at this point. It makes sense to me, but it could be a lot cleaner, I think. Perhaps storing everything in one file rather than having about three different files to read from :) I will be happy to show you the changes I made, though, and you can decide for yourself what you think of it.

@bjtho08 (Author) commented Nov 11, 2020

You can look at the changes and additions here: https://github.com/bjtho08/talos/tree/1.0.1-dev

@mikkokotila (Contributor) commented:

Thanks. Do I understand correctly that the feature is simply to:

  • allow storing a "restore point" as an option of Scan()
  • be able to point to the file where the "restore point" is stored

Is there anything I'm missing?

@bjtho08 (Author) commented Nov 12, 2020

Yep, that pretty much sums it up.

In the project directory (where the logging CSV file is stored), three additional files are created: a pickle that contains the various stores from each run; a YAML file that lists the remaining permutations in the parameter space (dumped from self.param_object); and a YAML file containing self._all_keys, self._metric_keys, and self._val_keys.

As I said, it can most likely be done in a cleaner fashion. I just hacked this together in a few days to work around my issue with constant crashes after a few iterations :)
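
To make it concrete, the dump step does roughly the following (a simplified sketch; the helper name, file names, and exact attributes are illustrative rather than the committed code):

import pickle
import yaml

def save_restore_point(scan_object, prefix):
    # Per-round result stores, so completed rounds survive a crash.
    with open(prefix + "_stores.pkl", "wb") as f:
        pickle.dump(scan_object.round_history, f)
    # Keys/metrics needed to rebuild the logging structures on resume.
    with open(prefix + "_keys.yaml", "w") as f:
        yaml.safe_dump(
            {"all_keys": scan_object._all_keys,
             "metric_keys": scan_object._metric_keys,
             "val_keys": scan_object._val_keys},
            f,
        )
    # A third file (not shown) dumps the remaining permutations from
    # scan_object.param_object so a resumed Scan() can skip finished rounds.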

@Kristin-Schwarzmuller commented:

Hello,
have you found a working solution or a workaround for this problem yet? I am currently facing the same issue.

@mikkokotila (Contributor) commented:

I will try to work on this next week.

@mikkokotila mikkokotila added value: ⭐⭐⭐ high value priority: MEDIUM medium priority labels Nov 15, 2020
@Kristin-Schwarzmuller commented:

Disabling eager execution solved the problem for me:
tf.compat.v1.disable_eager_execution()
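
For anyone else hitting this: the call has to happen before any models or layers are built, e.g. right after importing TensorFlow at the top of the script:

import tensorflow as tf

# Must run before any graph/model construction for it to take effect.
tf.compat.v1.disable_eager_execution()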

@mikkokotila (Contributor) commented:

@MolineraNegra wonderful.

@bjtho08 can you confirm if this works for you?

@bjtho08 (Author) commented Dec 23, 2020

@mikkokotila
I'm confused; I thought Keras disabled eager execution by default, even on TF 2.x?
