trains causes ETA for epoch to be 3 times slower #206
Hi @oak-tree, could you provide some information on your setup?
Hey @bmartinn
Our setup:
Machine OS: training is done from Kubeflow over GKE, on GKE-managed machines running …
Trains server OS: we are using K8s over Azure for trains, i.e. running AKS.
Store: we are not storing/loading lots of models. We do have …
Hi @oak-tree
task = Task.init('examples', 'test speed', auto_connect_frameworks={'tensorflow': False})
@bmartinn We report 1 image every 1000 steps, so I do not consider it a network load. We will try to disable the …
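The "1 image every 1000 steps" pattern above can be sketched as a simple step guard. This is an illustrative pure-Python helper, not trains' own API; the function name `should_report` is hypothetical.

```python
# Illustrative sketch (not trains' internal API): throttle reporting
# so only 1 out of every N training steps actually sends data, the
# pattern described above (1 image every 1000 steps).

def should_report(step: int, every_n: int = 1000) -> bool:
    """Return True only on steps that are multiples of every_n."""
    return step % every_n == 0

# Over 5000 steps, only 5 steps would trigger a report:
reported_steps = [s for s in range(5000) if should_report(s)]
```

With this guard the reporting traffic is tiny, which is why the reporter above rules it out as a network-load explanation.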
We tested it with the following, but the ETA is still higher by a magnitude of 3. Note that the ETA seems to keep rising.
@oak-tree this is very, very strange...
task = Task.init('examples', 'test speed', auto_connect_frameworks=False)
And what happens if you do:
task = Task.init('examples', 'test speed', auto_connect_frameworks=False, auto_resource_monitoring=False)
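The debugging approach in this exchange is an ablation: run the same workload under different flag combinations and time each run to isolate the feature causing the slowdown. Below is a minimal, hypothetical harness sketching that approach; `dummy_workload` stands in for a real training epoch (real code would pass these flags to Task.init and then train), and the names are illustrative only.

```python
# Hypothetical ablation harness: time the same workload under
# different feature-flag combinations to isolate a slowdown.
import time

def time_run(workload, **flags):
    """Time a single run of `workload` configured with `flags`."""
    start = time.perf_counter()
    workload(**flags)
    return time.perf_counter() - start

def dummy_workload(auto_connect_frameworks=True, auto_resource_monitoring=True):
    # Stand-in for a training epoch; a real version would call
    # Task.init(..., auto_connect_frameworks=..., auto_resource_monitoring=...)
    # and run the training loop.
    pass

timings = {
    "all on": time_run(dummy_workload),
    "no frameworks": time_run(dummy_workload, auto_connect_frameworks=False),
    "no frameworks, no monitor": time_run(
        dummy_workload,
        auto_connect_frameworks=False,
        auto_resource_monitoring=False,
    ),
}
```

Comparing the three timings shows whether framework auto-connect, resource monitoring, or neither accounts for the overhead.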
Hey @bmartinn Note that it is still faster compared to …
@oak-tree what exactly are the flags you are using?
@oak-tree any chance of getting a traceback? I have the feeling you are correct, and the log error somehow has something to do with the slowdown.
I'm trying to think how to get the traceback, because this runs as …
@oak-tree if I compile a debug wheel for you, could you package it in the docker and use it?
@bmartinn sure, it's possible; just tell me how to download the package.
@bmartinn do you use …
Hi @oak-tree, I'm not aware of any issue running …
BTW: …
Hey @bmartinn, so you are welcome to suggest more ideas :) Thanks for the …
Hey @bmartinn note that we upgraded our …
Hey @bmartinn
EDIT: your fix for the message worked. But I think it's only a symptom: something is trying to log something and fails to do so.
EDIT 2: and maybe this …
@oak-tree can you try using the wheel attached below?
Hey @bmartinn …
Hey @bmartinn, I have run the latest wheel that you attached, but I'm not seeing anything special. The errors have disappeared, but looking into the code (I have …) What should I expect? Thank you for your help.
@oak-tree I think it was wrong to raise this exception to begin with (this is more of a debug print, definitely not an exception), so this is what the dev wheel does (and what will probably get committed). @Shaked what do you mean by "I was suppose to see some requests coming in with an unknown value"? I think the errors are unrelated to the issue, and I think it has to do with Kale spawning new processes (but this is just a theory). Any chance I could reproduce this behavior?
I looked into the diff and I saw that you added this: …
I assumed that now when …
Am I wrong about it? What is the following part of the code supposed to do? …
Unfortunately I have no idea what caused this problem. If there is a way for us to add some debugging, I'd be more than happy to investigate. Thank you
@Shaked this is mostly code refactoring; the idea is to skip logging requests with no actual data (i.e. nothing to log). As for the problem at hand, I don't think this is actually related to the slowness issue.
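The refactor described above, skipping logging requests with no actual data, can be sketched in a few lines. This is a hedged, pure-Python illustration of the idea only; the event shape and the name `filter_empty_events` are hypothetical, not trains' internal API.

```python
# Minimal sketch of "skip logging requests with no actual data":
# drop events whose payload is empty before they are queued or sent.
# The dict layout here is illustrative, not trains' real event format.

def filter_empty_events(events):
    """Keep only events that carry a non-empty payload."""
    return [e for e in events if e.get("data")]

batch = [
    {"metric": "loss", "data": {"value": 0.42}},
    {"metric": "debug", "data": {}},      # nothing to log -> dropped
    {"metric": "images", "data": None},   # nothing to log -> dropped
]
kept = filter_empty_events(batch)
```

Filtering before sending avoids both the wasted network round-trip and the spurious error that the earlier wheel turned from an exception into a debug print.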
Hey @bmartinn
EDIT: But I think the error is just a …
@oak-tree, I have the same feeling...
Hey,
We are using trains with tf 2.3.0 and tf.keras. We notice that with trains the ETA for a single epoch is about 7.5 hours:
with trains: ETA: 7:46:59
without trains: ETA: 2:14:54
Any ideas/solutions for this trains bottleneck?
Thanks,
Alon
EDIT: trains server version is 0.13, the trains package version (trains.__version__) is 0.16.1.
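To quantify a slowdown like the one reported (7:46:59 vs. 2:14:54 per epoch), one can time the same loop with and without a per-step hook. The sketch below is a hypothetical micro-benchmark; `noisy_hook` is a stand-in for trains' per-step instrumentation, not its actual code.

```python
# Hypothetical micro-benchmark: time the same loop with and without a
# per-step hook, the way trains' instrumentation wraps each step.
import time

def run_epoch(steps, hook=None):
    """Run `steps` iterations, calling `hook` (if any) on each step."""
    start = time.perf_counter()
    for step in range(steps):
        # ... forward/backward pass would go here ...
        if hook is not None:
            hook(step)
    return time.perf_counter() - start

def noisy_hook(step):
    # Stand-in for per-step logging work done by the instrumentation.
    _ = [i * i for i in range(100)]

baseline = run_epoch(10_000)
with_hook = run_epoch(10_000, hook=noisy_hook)
slowdown = with_hook / baseline  # ratio > 1 means the hook adds overhead
```

If the measured ratio grows with step count, that would be consistent with the observation above that the ETA keeps rising during the epoch.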