
Travis tests no longer pass. #1123

Open
de-vri-es opened this issue Oct 2, 2019 · 29 comments

@de-vri-es
Contributor

The Travis tests no longer pass, as can be seen here: https://travis-ci.org/fizyr/keras-retinanet/builds/592460276.

At a glance, it seems to be a problem with an incompatible version of tensorflow, but I didn't dig very deep.

de-vri-es changed the title from "Travis tests no longer passes." to "Travis tests no longer pass." on Oct 2, 2019
@hgaiser
Contributor

hgaiser commented Oct 2, 2019

Tensorflow recently updated to 2.0, probably related :)

@de-vri-es
Contributor Author

Sounds plausible ^^

@anuar12

anuar12 commented Oct 2, 2019

Any chance the necessary changes will be made for TF 2.0 support?

@hgaiser
Contributor

hgaiser commented Oct 3, 2019

Yes, I expect to push an update this week or next week to fix compatibility with tensorflow 2.0.

@de-vri-es
Contributor Author

If I try to run the tests locally on a machine without CUDA, the tests also hang forever. In my case, all 8GB of RAM was exhausted and the whole computer froze.

Perhaps we have a memory leak with TF2 somewhere, or perhaps TF2 or its TF1 compatibility layer is doing something very inefficiently. I didn't try to pinpoint it to anything specific, and I don't think I'll have time for that anytime soon.

@hgaiser
Contributor

hgaiser commented Oct 14, 2019

all 8GB of RAM

You need an upgrade :)

I was testing only tests/models/test_mobilenet.py and it was using a lot of memory, but it did finish. Interestingly, adding the following:

import tensorflow as tf
tf.compat.v1.disable_v2_behavior()

Seems to make everything "normal". I don't want to add that code everywhere, but so far it's the only solution I have. I don't have much time to look into this issue further, but I will see if I can find out the underlying issue. In the meantime, @de-vri-es could you verify that disabling v2 behavior fixes the tests on your system too?
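
For reference, a minimal sketch of how this could be applied once for the whole test run, via a conftest.py at the repository root (the file name and placement are an assumption here, not something already in the repo):

# conftest.py (hypothetical placement, picked up automatically by pytest)
# Switch TensorFlow back to TF1-style graph-mode behavior before any test
# module builds a model.
import tensorflow as tf

tf.compat.v1.disable_v2_behavior()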

@hgaiser
Contributor

hgaiser commented Oct 14, 2019

I feel like this could be related: tensorflow/tensorflow#32052 . Sounds like there are some memory issues with tf2 and the travis build most likely gets killed because it uses too much memory. Maybe as a temporary fix we can disable the regression tests (the ones that actually train on a bit of data) and only test specific functions for now?
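
A sketch of how that temporary disabling could look (the marker name is made up here, nothing like it exists in the repo yet): tag the data-driven training tests with a custom pytest marker and deselect them on Travis.

# tests/bin/test_train.py -- hypothetical sketch, marker not in the repo
import pytest

@pytest.mark.regression  # custom marker; would need registering in setup.cfg/pytest.ini
def test_coco():
    ...  # existing training/evaluation body stays unchanged

# Travis would then run only the cheap tests with:
#   pytest -m "not regression"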

@de-vri-es
Contributor Author

all 8GB of RAM

You need an upgrade :)

It was actually 16 GB, though maybe I still should :)

Anyway, to set a baseline I tried running the tests again without any modifications, and python segfaulted in some multiprocessing operation 💃

tests/test_losses.py .                                                                                          [  1%]
tests/backend/test_common.py ..                                                                                 [  5%]
tests/bin/test_train.py Fatal Python error: Segmentation fault

Thread 0x00007ffa7d5fb700 (most recent call first):
  File "/usr/lib/python3.7/multiprocessinzsh: segmentation fault (core dumped)  pytest

Ran it again, no segfault, but memory usage crept up in tests/models/test_densenet.py until I killed it when it reached 14 GB. Then I added the change you suggested to that test, and the same test (also densenet) passed using "only" 4 GB of RAM.

However, then I copied the change into keras_retinanet/models/retinanet.py to get it everywhere, and the train.py tests started failing:

______________________________________________________ test_coco ______________________________________________________

    def test_coco():
        # ignore warnings in this test
        warnings.simplefilter('ignore')
    
        # run training / evaluation
        keras_retinanet.bin.train.main([
            '--epochs=1',
            '--steps=1',
            '--no-weights',
            '--no-snapshots',
            'coco',
>           'tests/test-data/coco',
        ])

tests/bin/test_train.py:44: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
keras_retinanet/bin/train.py:516: in main
    validation_data=validation_generator
/usr/lib/python3.7/site-packages/keras/legacy/interfaces.py:91: in wrapper
    return func(*args, **kwargs)
/usr/lib/python3.7/site-packages/keras/engine/training.py:1732: in fit_generator
    initial_epoch=initial_epoch)
/usr/lib/python3.7/site-packages/keras/engine/training_generator.py:260: in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
/usr/lib/python3.7/site-packages/keras/callbacks/callbacks.py:152: in on_epoch_end
    callback.on_epoch_end(epoch, logs)
/usr/lib/python3.7/site-packages/keras/callbacks/callbacks.py:1036: in on_epoch_end
    logs['lr'] = K.get_value(self.model.optimizer.lr)
/usr/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:2927: in get_value
    return x.numpy()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <tf.Variable 'Adam/learning_rate:0' shape=() dtype=float32>

    def numpy(self):
      if context.executing_eagerly():
        return self.read_value().numpy()
      raise NotImplementedError(
>         "numpy() is only available when eager execution is enabled.")
E     NotImplementedError: numpy() is only available when eager execution is enabled.

/usr/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:579: NotImplementedError

So it looks like we can't just do this for everything blindly.
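
For what it's worth, the kind of guard that seems to be needed is roughly this (a hypothetical helper, just to illustrate why calling .numpy() unconditionally is not enough):

import tensorflow as tf

def get_variable_value(variable):
    # In eager mode the variable can be read directly.
    if tf.executing_eagerly():
        return variable.numpy()
    # With v2 behavior disabled we are in graph mode, so evaluate it
    # through the TF1-compatible session instead.
    return tf.compat.v1.keras.backend.get_session().run(variable)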

@de-vri-es
Contributor Author

hmm, for some reason I'm now getting that error even without modifications. It may not be related. Besides that error, the test suite is passing with the suggested change added only to the densenet and mobilenet tests.

@de-vri-es
Contributor Author

de-vri-es commented Oct 14, 2019

Ah, I see, pytest isn't running each test in a clean, isolated environment. It appears it had already imported the densenet and mobilenet tests, which had already called tensorflow.compat.v1.disable_v2_behavior(). So it is that call that's causing the train.py test to fail.

I glanced at pytest --help but I didn't immediately see an option to fork a clean python interpreter for each test.

@hgaiser
Contributor

hgaiser commented Oct 14, 2019

Actually, I think we had that a few commits ago with pytest-xdist. At first I thought it was causing issues, so I removed it to test in Travis, but then that change got merged a bit prematurely.
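
For reference, the --forked option used for this comes from the pytest-forked plugin (split out of pytest-xdist), so a local run with per-test process isolation looks roughly like this:

pip install pytest-forked
pytest tests --forked   # each test runs in a forked subprocess, so TF state is not shared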

@hgaiser
Contributor

hgaiser commented Oct 17, 2019

I put the forking of tests back; it stops the memory usage from accumulating, but it still doesn't work on Travis (it does run fine locally, for the record).

Also, on two comparable systems, one running tensorflow 2 and the other tensorflow 1.14, the times for test_mobilenet.py are quite different: 100s for tensorflow 2 versus 46s for tensorflow 1.14 =\

Disabling eager execution in tensorflow reduces the time to 70s; disabling v2 behavior gives approximately the same time (47s) as 1.14.
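
For reference, the two switches being compared here are both in the stable tf.compat.v1 API (pick one; the second also implies the first):

import tensorflow as tf

# ~70s in the measurement above: keep the rest of the TF2 behavior,
# only turn off eager execution.
tf.compat.v1.disable_eager_execution()

# ~47s, roughly matching tensorflow 1.14: turn off all TF2 behavior,
# eager execution included.
tf.compat.v1.disable_v2_behavior()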

@Borda
Contributor

Borda commented Oct 17, 2019

So what is the advantage of using TF 2.x?

@hgaiser
Contributor

hgaiser commented Oct 24, 2019

So what is the advantage of using TF 2.x?

It's going to be the version of tensorflow many people will use, so we have to support it. The main advantages of tf 2 are eager execution (which is being annoying at the moment) and the move to tf.keras as its API.

@Borda
Contributor

Borda commented Oct 29, 2019

According to the experiments in #1137, it is kind of a vicious circle: you disable v2 behaviour, which includes the eager execution that is needed for #1123 (comment), while enabling that execution blows past the memory and time limits... so we would need to disable the v2 behaviour and replace the .numpy() calls...

@hgaiser
Contributor

hgaiser commented Oct 31, 2019

Yep you're right :)

I'm hoping keras-team/keras#13476 will get merged soon; then we can just continue with disabling v2 behavior. That seems to me like the easiest way forward.

@Borda
Contributor

Borda commented Oct 31, 2019

Well, the time of being merged and the time of being released can be quite different... 8-)

@hgaiser
Contributor

hgaiser commented Nov 4, 2019

Well, the time of being merged and the time of being released can be quite different... 8-)

Yes, but merging that now would break usage with tensorflow 2 :) At least now it should work with tensorflow 1.14/1.15.

@Borda
Contributor

Borda commented Nov 4, 2019

"Should" is not very convincing. I would merge #1137 even if it is not fully fixed; it has tests for TF 1.x and 2.x, so it is clear what passes and what does not...

@hgaiser
Contributor

hgaiser commented Nov 4, 2019

There's no point in merging it yet without having the fix in Keras to go with it.

@Borda
Contributor

Borda commented Nov 4, 2019

Let me clarify... what you expect from the Keras fix is that disable_v2_behavior will work fine with .numpy(), right? So you do not expect any other change on the side of this repo... meaning the mentioned PR, after dropping enable_eager_execution, is what should be the final solution, right?
Does it also mean that until Keras is fixed you will not merge any other PR, because there is no working CI to check that everything is running correctly?

@ThanhNhann

Excuse me, when I read .travis.yml I see the line
travis_wait coverage run --source keras_retinanet -m py.test keras_retinanet tests --doctest-modules --forked --flake8
but when I run it, it says travis_wait: command not found.
I also searched for travis_wait but couldn't find it; can you explain this to me?
Thanks!

@Borda
Contributor

Borda commented Dec 15, 2019

travis_wait is a Travis-only command, which does not work elsewhere...
In the end, it does not have much effect, so you can omit it completely :]
There is a work-around in this PR - #1137
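
In other words, the local equivalent is just the same command with travis_wait dropped:

coverage run --source keras_retinanet -m py.test keras_retinanet tests --doctest-modules --forked --flake8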

@ThanhNhann

Thanks for the quick reply @Borda. When I run the file train.py, I run into this problem:
File "keras_retinanet/bin/train.py", line 528, in <module>
    main()
File "keras_retinanet/bin/train.py", line 521, in main
    validation_steps = args.steps_for_validation,
AttributeError: 'Namespace' object has no attribute 'steps_for_validation'
I don't see steps_for_validation defined anywhere in train.py; can you explain this?
Thanks

@Borda
Contributor

Borda commented Dec 16, 2019

You are probably missing a parameter steps_for_validation when you call the training script... So there may be a bug, since it should be a mandatory parameter or have a default value...
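
If the argument is indeed meant to exist, a sketch of what would be needed in the argument parser (the flag name and default are assumptions, not taken from the repository):

import argparse

parser = argparse.ArgumentParser()
# argparse maps '--steps-for-validation' onto args.steps_for_validation,
# which is the attribute that main() is reading.
parser.add_argument('--steps-for-validation', type=int, default=None,
                    help='Number of steps to run during validation.')

args = parser.parse_args(['--steps-for-validation', '10'])
print(args.steps_for_validation)  # 10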

@ThanhNhann

When I look for the parameter steps_for_validation in train.py, it is only used but never defined. Do you think this is an omission?

@pendex900x

same error here

/usr/local/lib/python3.6/dist-packages/keras/callbacks/tensorboard_v2.py:92: UserWarning: The TensorBoard callback batch_size argument (for histogram computation) is deprecated with TensorFlow 2.0. It will be ignored.
warnings.warn('The TensorBoard callback batch_size argument '
Traceback (most recent call last):
  File "/content/keras-retinanet/keras_retinanet/bin/train.py", line 528, in <module>
    main()
  File "/content/keras-retinanet/keras_retinanet/bin/train.py", line 521, in main
    validation_steps = args.steps_for_validation,
AttributeError: 'Namespace' object has no attribute 'steps_for_validation'

@hgaiser
Contributor

hgaiser commented Dec 19, 2019

I just reverted some commits, can you try again?

Note: You still need keras-team/keras#13476 to pass the tests and in case you are using tensorflow 2.1, you also need tensorflow/tensorflow#34870 (or remove Tensorboard).

@SalahAdDin
Contributor

What about the following warning?
warnings.warn('The TensorBoard callback batch_size argument '
