
trains causes ETA for epoch to be 3 times slower #206

Closed
oak-tree opened this issue Sep 8, 2020 · 28 comments

@oak-tree

oak-tree commented Sep 8, 2020

Hey,
We are using trains with TF 2.3.0 and tf.keras. We noticed that with trains the ETA for a single epoch is about 7.5 hours:

with trains: ETA: 7:46:59
without trains: ETA: 2:14:54

Any ideas/solution for this trains bottleneck?

Thanks,
Alon

EDIT:

trains-server version is 0.13, trains package version (trains.__version__) is 0.16.1

@bmartinn
Member

bmartinn commented Sep 8, 2020

Hi @oak-tree ,
This seems very strange; Trains should not cause any slowdown in training, definitely not a 3x slowdown...
Is the ETA accurate (i.e. does a single epoch actually take 3 times longer with trains)?

Could you provide some information on your setup?
Machine OS?
trains-server OS?
Are you storing / loading a lot of models / artifacts ?
Can you isolate the problem, e.g. create a reproducible code snippet or point to a repository that reproduces the issue?

@oak-tree
Author

oak-tree commented Sep 9, 2020

Hey @bmartinn
I agree. It took us a while to understand that trains, and not our data_generator, is the cause of the very long epochs.

Our setup:

Machine OS:

Training is done from Kubeflow over GKE. The GKE-managed machines run Container-Optimized OS from Google, version 77.
The container that runs trains on these machines runs Ubuntu 18.04.3 LTS (Bionic Beaver).

Trains server OS

We are using K8s over Azure for trains, i.e. running AKS.
trains-server OS: Ubuntu-1604-202003

Store

We are not storing/loading lots of models. We do have a callback that stores weights locally at the end of each epoch, but the ETA above is from before the first epoch finishes, so I find it hard to believe that this callback is related to the long ETA.
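
For reference, the callback is along these lines (a rough sketch with a made-up file name, assuming the standard tf.keras ModelCheckpoint; not our exact code):

import tensorflow as tf

# Rough sketch of an end-of-epoch, local weight-saving callback (illustrative file name).
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath='weights_epoch_{epoch:02d}.h5',  # written locally at the end of every epoch
    save_weights_only=True,
    save_freq='epoch',
)

# model.fit(..., callbacks=[checkpoint_cb])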

@bmartinn
Member

Hi @oak-tree
Any chance you can share some logs?
I'm still trying to understand what could delay the training by such a huge factor, and I can't think of anything.
Are you maybe reporting a lot of images/audio to TensorBoard (so network upload could be the cause)?
Also could you test with TF auto-connect disabled?

task = Task.init('examples', 'test speed', auto_connect_frameworks={'tensorflow': False})

@oak-tree
Author

oak-tree commented Sep 10, 2020

@bmartinn
can you tell me which logs would help?

We report 1 image every 1000 steps, so I do not consider it a network load (see the sketch below).
We also tested it without the reporting and got the same ETA.

We will try disabling the TF auto-connect.
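
For reference, the image reporting is roughly of this form (a sketch with illustrative names; depending on the trains version, report_image may take the array via image= or matrix=, so check your SDK):

from trains import Task

logger = Task.current_task().get_logger()

def maybe_report_image(step, img_array):
    # Report only one image every 1000 steps, so the network upload stays negligible.
    if step % 1000 == 0:
        logger.report_image(title='debug', series='sample', iteration=step, image=img_array)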

@oak-tree
Author

oak-tree commented Sep 10, 2020

We tested it with the following:

task = Task.init('examples', 'test speed', auto_connect_frameworks={'tensorflow': False})
and with trains version 0.16.2rc0

but the ETA is still higher by a factor of 3.

2020-09-10 17:10:42
31/875 [>.............................] - ETA: 6:40:48 - loss: 0.4945

Note that the ETA seems to keep rising.

@bmartinn
Member

@oak-tree this is very very strange...
what happens when you do:

task = Task.init('examples', 'test speed', auto_connect_frameworks=False)

And what happens if you do:

task = Task.init('examples', 'test speed', auto_connect_frameworks=False, auto_resource_monitoring=False)

@oak-tree
Author

oak-tree commented Sep 10, 2020

Hey @bmartinn
We ran it. It looked promising, but after about 12 steps we started to get errors as described in #205, and then the ETA started to grow. At step 50 it was more than double.

Before the error:
image

50th step.
image

The error:
image

Note that it is still faster compared to trains without the flags above,
but still very slow compared to runs without trains.

@bmartinn
Member

@oak-tree what exactly are the flags you are using?
Also, what happens at step 50-ish ?
Last request: any chance you can put a breakpoint at https://github.com/allegroai/trains/blob/10ec4d56fb4a1f933128b35d68c727189310aae8/trains/backend_api/session/session.py#L344 and send the traceback (a.k.a. the stack)?
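
If a real breakpoint is not practical inside the Kubeflow pod, printing the stack right before that statement would also work, e.g. (a rough sketch of the kind of patch, not the exact code):

import sys
import traceback

# Two lines inserted right before the statement at the referenced session.py location,
# so the call stack ends up in the container's stderr log.
traceback.print_stack(file=sys.stderr)
sys.stderr.flush()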

@oak-tree
Author

task = Task.init(project_name=project_name, task_name=model_name, auto_connect_frameworks=False, auto_resource_monitoring=False)
trains_logger = Task.current_task().get_logger()


@bmartinn
Member

@oak-tree any chance of getting a traceback? I have the feeling you are correct and the log error somehow has something to do with the slowdown.

@oak-tree
Author

I'm trying to think how to get the traceback. This runs as a Kubeflow pipeline on a GPU, so it is tricky to debug. Any ideas how to get the traceback? Maybe you can patch the code to print the stack trace?

@bmartinn
Member

@oak-tree if I compile a debug wheel for you, could you package it in the docker and use it ?
(Then I'll print the traceback when raising the error)

@oak-tree
Author

@bmartinn sure, it's possible, just tell me how to download the package.

@oak-tree
Author

@bmartinn do you use trains with Kubeflow pipelines? In the beginning I thought Kale was related to the slowness, but now, after testing, it looks like trains and Kubeflow pipelines do not play nice together.

@bmartinn
Member

bmartinn commented Sep 13, 2020

Hi @oak-tree, I'm not aware of any issue running trains on K8s (kubeflow is just a pipeline on top), and since I assume the docker/yaml are the same and the only part that is different is calling Task.init, what comes to mind is a network or configuration problem. Just making sure: you have the correct trains.conf in your docker (or you set the environment variables) pointing to your trains-server (not the demo server), correct?
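
For reference, pointing the SDK at your own server via environment variables (instead of a baked-in trains.conf) looks roughly like the sketch below; the TRAINS_* variable names and ports here are the usual defaults, adjust them to your deployment:

import os

# Sketch: point the SDK at the self-hosted trains-server before Task.init runs
# (variable names and ports are the usual defaults; adjust to your deployment).
os.environ['TRAINS_API_HOST'] = 'http://your-trains-server:8008'
os.environ['TRAINS_WEB_HOST'] = 'http://your-trains-server:8080'
os.environ['TRAINS_FILES_HOST'] = 'http://your-trains-server:8081'

from trains import Task
task = Task.init(project_name='examples', task_name='connectivity check')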

BTW:
I'm attaching a wheel that will print the stack trace when raising the aforementioned warning, could you run it and send back the log?
Notice that even though it is the same version as the one on PyPI, this is not the same package, so you need to uninstall the installed package and then manually install the attached wheel.
trains-0.16.2rc0-py2.py3-none-any.zip

@oak-tree
Author

Hey @bmartinn,
We do have trains.conf in our container, and we are reporting to our own trains server.
Besides traffic, note that we are using Kale to build pipelines from Jupyter notebooks. I did see that Kale runs the code in IPython as a notebook and thought it might be the reason, but from our testing it is not. More info on that is in kubeflow-kale/kale#200

So you are welcome to suggest more ideas :)

Thanks for the dev wheel. I'll try using it to print the stack trace on the error.

@oak-tree
Author

Hey @bmartinn
Unfortunately, we got the same error but no additional output; maybe the print does not go to stdout/stderr?

Note that we upgraded our trains server; we are now running 0.16.1.

@oak-tree
Author

oak-tree commented Sep 18, 2020

Hey @bmartinn
We got the stack. I did not find anything special there; maybe you will?
image

EDIT

Your fix for the message worked. But I think it's only a symptom: something is trying to log something and fails to do so.

EDIT 2

Maybe this something is the cause of trains making the training 3x slower.

@bmartinn
Member

@oak-tree can you try using the wheel attached below?

trains-0.16.2rc3-py2.py3-none-any.whl.zip

@oak-tree
Author

Hey @bmartinn
Sure, can you tell me what the differences are? What should I expect?

@Shaked
Contributor

Shaked commented Sep 22, 2020

Hey @bmartinn

I have run the latest wheel that you attached, but I'm not seeing anything special.

The errors have disappeared, but looking into the code (I diffed your first wheel against the latest one), it seems like I was supposed to see some requests coming in with an unknown value, right?

What should I expect?

Thank you for your help
Shaked

@bmartinn
Member

@oak-tree I think it was wrong to raise this exception to begin with (this is more of a debug print, definitely not an exception), so this is what the dev wheel does (and what will probably get committed)

@Shaked what do you mean by "I was supposed to see some requests coming in with an unknown value"?

I think the errors are unrelated to the issue; it probably has to do with Kale spawning new processes (but this is just a theory). Any chance I could reproduce this behavior?

@Shaked
Contributor

Shaked commented Sep 24, 2020

@bmartinn

" I was suppose to see some requests coming in with an unknown value,"

I looked into the diff and I saw that you added this:

     @classmethod
+    def _make_raw_response(cls, request_cls=None, service=None, action=None, status_code=200, text=None):
+        service = service or (request_cls._service if request_cls else 'unknown')
+        action = action or (request_cls._action if request_cls else 'unknown')
+        return cls(request_cls=request_cls, meta=ResponseMeta.from_raw_data(
+            status_code=status_code, text=text, endpoint='%(service)s.%(action)s' % locals()))
+
+    @classmethod
     def from_result(cls, res, request_cls=None, logger=None, service=None, action=None, session=None):
         """ From requests result """
         response_cls = get_response_cls(request_cls)
+
+        if res is None:
+            return cls._make_raw_response(
+                request_cls=request_cls,
+                service=service,
+                action=action
+            )
+
         try:
             data = res.json()
         except ValueError:
-            service = service or (request_cls._service if request_cls else 'unknown')
-            action = action or (request_cls._action if request_cls else 'unknown')
-            return cls(request_cls=request_cls, meta=ResponseMeta.from_raw_data(
-                status_code=res.status_code, text=res.text, endpoint='%(service)s.%(action)s' % locals()))
+            return cls._make_raw_response(
+                request_cls=request_cls,
+                service=service,
+                action=action,
+                status_code=res.status_code,
+                text=res.text
+            )

I assumed that now when res is None there should be a call made, considering this part:

         if not data and not json:
-            import traceback
-            traceback.print_stack()
-            raise ValueError(
-                "Missing data (data or json), batch requests are meaningless without it."
-            )
+            # Missing data (data or json), batch requests are meaningless without it.
+            return None

Am I wrong about it? What is the following piece of code supposed to do?

+            return cls._make_raw_response(
+                request_cls=request_cls,
+                service=service,
+                action=action
+            )

I think the errors are unrelated to the issue, and I think it has to do with Kale spawning new processes (but this is just a theory). Any chance I could reproduce this behavior ?

Unfortunately, I have no idea what caused this problem. If there is a way for us to add some debugging, I'd be more than happy to investigate.

Thank you

@bmartinn
Member

@Shaked this is mostly code refactoring, the idea is to skip logging requests with no actual data (i.e. nothing to log).

To the problem at hand, I don't think this is actually related to the slowness issue.
I think the slowness is somehow related to Kale creating a lot of sub-processes (or at least this is my current theory ;)

  1. Is this only Kale related?
  2. If so, could you create a toy example for me to debug this issue (something along the lines of the sketch below)?
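
Something as small as the following would be enough (purely illustrative: random data and a tiny model, not your actual pipeline):

import numpy as np
import tensorflow as tf
from trains import Task

# Toy task: compare the per-epoch time with and without the Task.init line below.
task = Task.init(project_name='examples', task_name='toy speed test')

x = np.random.rand(10000, 32).astype('float32')
y = np.random.randint(0, 2, size=(10000, 1))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

model.fit(x, y, epochs=3, batch_size=64, verbose=1)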

@oak-tree
Author

oak-tree commented Sep 29, 2020

Hey @bmartinn
At the moment I do not think it is Kale related, as trainings without the error do not get the 3x slowness (they do get a 1.2x slowness factor).

EDIT

But I think the error is just a symptom, i.e. something is trying to get data and fails to do so. Maybe this failure is the reason for the slowness.

@bmartinn
Member

@oak-tree , I have the same feeling ...

  1. I'm happy to hear the slowness disappeared with the warning fix.
  2. This is the reason I attached a dev wheel printing the stack; that said, you can see here that there is a specific check making sure zero-size reporting is ignored...
    Any chance you can attach the full log from this experiment?

@bmartinn
Member

@oak-tree / @Shaked are you still experiencing slowness with the latest (0.16.2) trains release ?

@clearml-bot
Member

@oak-tree @Shaked please reopen if this is still an issue with the latest ClearML v0.17.4 release
