
trains causes ETA for epoch to be 3 times slower #206

Closed
oak-tree opened this issue Sep 8, 2020 · 28 comments

@oak-tree

oak-tree commented Sep 8, 2020

Hey,
We are using trains with TF 2.3.0 and tf.keras. We noticed that with trains the ETA for a single epoch is about 7.5 hours:

with trains: ETA: 7:46:59
without trains: ETA: 2:14:54

Any ideas/solution for this trains bottleneck?

Thanks,
Alon

EDIT:

trains-server version is 0.13, trains package version (trains.__version__) is 0.16.1

@bmartinn
Member

bmartinn commented Sep 8, 2020

Hi @oak-tree ,
This seems very strange; Trains should not cause any slowdown in training, definitely not a 3x slowdown...
Is the ETA accurate (i.e. does a single epoch actually take 3 times longer with trains)?

Could you provide some information on your setup?
Machine OS?
trains-server OS?
Are you storing / loading a lot of models / artifacts ?
Can you isolate the problem, e.g. create a reproducible code snippet or point to a repository that reproduces the issue?

@oak-tree
Author

oak-tree commented Sep 9, 2020

Hey @bmartinn
I agree. It took us a while to understand that trains, and not our data_generator, is the cause of the very long epochs.

Our setup:

Machine OS:

Training is done from Kubeflow over GKE. The GKE-managed machines run Container-Optimized OS from Google, version 77.
The container that runs trains on these machines runs Ubuntu 18.04.3 LTS (Bionic Beaver).

Trains server OS

We are using K8s over Azure for trains, i.e. running AKS.
trains-server OS: Ubuntu-1604-202003

Store

We are not storing/loading lots of models. We do have a callback that stores weights locally at the end of each epoch, but the ETA above is from before the first epoch finishes, so I find it hard to believe that this callback is related to the long ETA.
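
For reference, the callback is along these lines (a rough sketch with a made-up file name, assuming the standard tf.keras ModelCheckpoint; not our exact code):

import tensorflow as tf

# Rough sketch of an end-of-epoch, local weight-saving callback (illustrative file name).
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath='weights_epoch_{epoch:02d}.h5',  # written locally at the end of every epoch
    save_weights_only=True,
    save_freq='epoch',
)

# model.fit(..., callbacks=[checkpoint_cb])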

@bmartinn
Member

Hi @oak-tree
Any chance you can share some logs?
I'm still trying to understand what could delay the training by such a huge factor, and I can't think of anything.
Are you maybe reporting a lot of images/audio to TensorBoard (so network upload could be the cause)?
Also could you test with TF auto-connect disabled?

task = Task.init('examples', 'test speed', auto_connect_frameworks={'tensorflow': False})

@oak-tree
Author

oak-tree commented Sep 10, 2020

@bmartinn
can you tell me which logs would help?

We report 1 image every 1000 steps, so I do not consider it a network load (see the sketch below).
We also tested it without the reporting and got the same ETA.

We will try disabling the TF auto-connect.
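
For reference, the image reporting is roughly of this form (a sketch with illustrative names; depending on the trains version, report_image may take the array via image= or matrix=, so check your SDK):

from trains import Task

logger = Task.current_task().get_logger()

def maybe_report_image(step, img_array):
    # Report only one image every 1000 steps, so the network upload stays negligible.
    if step % 1000 == 0:
        logger.report_image(title='debug', series='sample', iteration=step, image=img_array)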

@oak-tree
Author

oak-tree commented Sep 10, 2020

We tested it with the following:

task = Task.init('examples', 'test speed', auto_connect_frameworks={'tensorflow': False})
and with trains version 0.16.2rc0

but the ETA is still higher by a factor of 3.

2020-09-10 17:10:42
31/875 [>.............................] - ETA: 6:40:48 - loss: 0.4945

Note that the ETA seems to keep rising.

@bmartinn
Member

@oak-tree this is very very strange...
what happens when you do:

task = Task.init('examples', 'test speed', auto_connect_frameworks=False)

And what happens if you do:

task = Task.init('examples', 'test speed', auto_connect_frameworks=False, auto_resource_monitoring=False)

@oak-tree
Author

oak-tree commented Sep 10, 2020

Hey @bmartinn
We ran it. It looked promising, but after about 12 steps we started to get errors as described in #205, and then the ETA started to grow. At step 50 it was more than double.

Before the error:
image

50th step.
image

The error:
image

Note that it is still faster compared to trains without the flags above,
but still very slow compared to runs without trains.

@bmartinn
Member

@oak-tree what exactly are the flags you are using?
Also, what happens at step 50-ish ?
Last request: any chance you can put a breakpoint at https://github.com/allegroai/trains/blob/10ec4d56fb4a1f933128b35d68c727189310aae8/trains/backend_api/session/session.py#L344 and send the traceback (a.k.a. the stack)?
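
If a real breakpoint is not practical inside the Kubeflow pod, printing the stack right before that statement would also work, e.g. (a rough sketch of the kind of patch, not the exact code):

import sys
import traceback

# Two lines inserted right before the statement at the referenced session.py location,
# so the call stack ends up in the container's stderr log.
traceback.print_stack(file=sys.stderr)
sys.stderr.flush()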

@oak-tree
Author

task = Task.init(project_name=project_name, task_name=model_name, auto_connect_frameworks=False, auto_resource_monitoring=False)
trains_logger = Task.current_task().get_logger()


@bmartinn
Member

@oak-tree any chance of getting a traceback? I have the feeling you are correct and the log error somehow has something to do with the slowdown.

@oak-tree
Author

I'm trying to think how to get the traceback. This runs as a Kubeflow pipeline on a GPU, so it is tricky to debug. Any ideas how to get the traceback? Maybe you can patch the code to print the stack trace?

@bmartinn
Member

@oak-tree if I compile a debug wheel for you, could you package it in the docker and use it ?
(Then I'll print the traceback when raising the error)

@oak-tree
Author

@bmartinn sure, it's possible, just tell me how to download the package.

@oak-tree
Author

@bmartinn do you use trains with Kubeflow pipelines? In the beginning I thought Kale was related to the slowness, but now, after testing, it looks like trains and Kubeflow pipelines do not play nice together.

@bmartinn
Member

bmartinn commented Sep 13, 2020

Hi @oak-tree, I'm not aware of any issue running trains on K8s (kubeflow is just a pipeline on top), and since I assume the docker/yaml are the same and the only part that is different is calling Task.init, what comes to mind is a network or configuration problem. Just making sure: you have the correct trains.conf in your docker (or you set the environment variables) pointing to your trains-server (not the demo server), correct?
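
For reference, pointing the SDK at your own server via environment variables (instead of a baked-in trains.conf) looks roughly like the sketch below; the TRAINS_* variable names and ports here are the usual defaults, adjust them to your deployment:

import os

# Sketch: point the SDK at the self-hosted trains-server before Task.init runs
# (variable names and ports are the usual defaults; adjust to your deployment).
os.environ['TRAINS_API_HOST'] = 'http://your-trains-server:8008'
os.environ['TRAINS_WEB_HOST'] = 'http://your-trains-server:8080'
os.environ['TRAINS_FILES_HOST'] = 'http://your-trains-server:8081'

from trains import Task
task = Task.init(project_name='examples', task_name='connectivity check')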

BTW:
I'm attaching a wheel that will print the stack trace when raising the aforementioned warning, could you run it and send back the log?
Notice that even though it is the same version as the one on PyPI, this is not the same package, so you need to uninstall the installed package and then manually install the attached wheel.
trains-0.16.2rc0-py2.py3-none-any.zip

@oak-tree
Author

Hey @bmartinn,
We do have trains.conf in our container, and we are reporting to our own trains server.
Besides traffic, note that we are using Kale to build pipelines from Jupyter notebooks. I did see that Kale runs the code in IPython as a notebook and thought it might be the reason, but from our testing it is not. More info on that is in kubeflow-kale/kale#200

So you are welcome to suggest more ideas :)

Thanks for the dev wheel. I'll try using it to print the stack trace on the error.

@oak-tree
Author

Hey @bmartinn
Unfortunately, we got the same error but no additional output; maybe the print does not go to stdout/stderr?

Note that we upgraded our trains server; we are now running 0.16.1.

@oak-tree
Author

oak-tree commented Sep 18, 2020

Hey @bmartinn
We got the stack. I did not find anything special there; maybe you will?
image

EDIT

Your fix for the message worked. But I think it's only a symptom: something is trying to log something and fails to do so.

EDIT 2

Maybe this something is the cause of trains making the training 3x slower.

@bmartinn
Member

@oak-tree can you try using the wheel attached below?

trains-0.16.2rc3-py2.py3-none-any.whl.zip

@oak-tree
Author

Hey @bmartinn
Sure, can you tell me what the differences are? What should I expect?

@Shaked
Contributor

Shaked commented Sep 22, 2020

Hey @bmartinn

I have run the latest wheel that you attached, but I'm not seeing anything special.

The errors have disappeared, but looking into the code (I diffed your first wheel against the latest one), it seems like I was supposed to see some requests coming in with an unknown value, right?

What should I expect?

Thank you for your help
Shaked

@bmartinn
Member

@oak-tree I think it was wrong to raise this exception to begin with (this is more of a debug print, definitely not an exception), so this is what the dev wheel does (and what will probably get committed)

@Shaked what do you mean by "I was supposed to see some requests coming in with an unknown value"?

I think the errors are unrelated to the issue; it probably has to do with Kale spawning new processes (but this is just a theory). Any chance I could reproduce this behavior?

@Shaked
Contributor

Shaked commented Sep 24, 2020

@bmartinn

" I was suppose to see some requests coming in with an unknown value,"

I looked into the diff and I saw that you added this:

     @classmethod
+    def _make_raw_response(cls, request_cls=None, service=None, action=None, status_code=200, text=None):
+        service = service or (request_cls._service if request_cls else 'unknown')
+        action = action or (request_cls._action if request_cls else 'unknown')
+        return cls(request_cls=request_cls, meta=ResponseMeta.from_raw_data(
+            status_code=status_code, text=text, endpoint='%(service)s.%(action)s' % locals()))
+
+    @classmethod
     def from_result(cls, res, request_cls=None, logger=None, service=None, action=None, session=None):
         """ From requests result """
         response_cls = get_response_cls(request_cls)
+
+        if res is None:
+            return cls._make_raw_response(
+                request_cls=request_cls,
+                service=service,
+                action=action
+            )
+
         try:
             data = res.json()
         except ValueError:
-            service = service or (request_cls._service if request_cls else 'unknown')
-            action = action or (request_cls._action if request_cls else 'unknown')
-            return cls(request_cls=request_cls, meta=ResponseMeta.from_raw_data(
-                status_code=res.status_code, text=res.text, endpoint='%(service)s.%(action)s' % locals()))
+            return cls._make_raw_response(
+                request_cls=request_cls,
+                service=service,
+                action=action,
+                status_code=res.status_code,
+                text=res.text
+            )

I assumed that now when res is None there should be a call made, considering this part:

         if not data and not json:
-            import traceback
-            traceback.print_stack()
-            raise ValueError(
-                "Missing data (data or json), batch requests are meaningless without it."
-            )
+            # Missing data (data or json), batch requests are meaningless without it.
+            return None

Am I wrong about it? What is the following piece of code supposed to do?

+            return cls._make_raw_response(
+                request_cls=request_cls,
+                service=service,
+                action=action
+            )

I think the errors are unrelated to the issue, and I think it has to do with Kale spawning new processes (but this is just a theory). Any chance I could reproduce this behavior ?

Unfortunately, I have no idea what caused this problem. If there is a way for us to add some debugging, I'd be more than happy to investigate.

Thank you

@bmartinn
Member

@Shaked this is mostly code refactoring, the idea is to skip logging requests with no actual data (i.e. nothing to log).

To the problem at hand, I don't think this is actually related to the slowness issue.
I think the slowness is somehow related to Kale creating a lot of sub-processes (or at least this is my current theory ;)

  1. Is this only Kale related?
  2. If so, could you create a toy example for me to debug this issue (something along the lines of the sketch below)?
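
Something as small as the following would be enough (purely illustrative: random data and a tiny model, not your actual pipeline):

import numpy as np
import tensorflow as tf
from trains import Task

# Toy task: compare the per-epoch time with and without the Task.init line below.
task = Task.init(project_name='examples', task_name='toy speed test')

x = np.random.rand(10000, 32).astype('float32')
y = np.random.randint(0, 2, size=(10000, 1))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

model.fit(x, y, epochs=3, batch_size=64, verbose=1)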

@oak-tree
Author

oak-tree commented Sep 29, 2020

Hey @bmartinn
At the moment I do not think it is Kale related, as trainings without the error do not get the 3x slowness (they do get a 1.2x slowness factor).

EDIT

But I think the error is just a symptom, i.e. something is trying to get data and fails to do so. Maybe this failure is the reason for the slowness.

@bmartinn
Member

@oak-tree , I have the same feeling ...

  1. I'm happy to hear the slowness disappeared with the warning fix.
  2. This is the reason I attached a dev wheel printing the stack; that said, you can see here that there is a specific check making sure zero-size reporting is ignored...
    Any chance you can attach the full log from this experiment?

@bmartinn
Member

@oak-tree / @Shaked are you still experiencing slowness with the latest (0.16.2) trains release ?

@clearml-bot
Member

@oak-tree @Shaked please reopen if this is still an issue with the latest ClearML v0.17.4 release
