Local rank 0 logs random loss values when auto_metric_logging is set to true #461
Comments
Hello @nsriniva03. Looking into this. Just to confirm, are you running distributed training with Comet? Would it be possible to share the code used to run these experiments? It would give our team more context on what could be happening.
@DN6, I can't share the code, but I could put together skeleton code that highlights the main steps.
Thank you! That would be very helpful. Additionally, are you using data parallel training? If so, is your data distributed the same way across your training nodes on every run? Or is it shuffled, with batches randomly assigned to nodes across runs?
Yes, I am using distributed data parallel for training. I am using PyTorch's distributed sampler to handle data distribution across ranks. It loads onto each rank a subset of the data that is exclusive to that rank.
Understood. Would it be possible to run your experiment again and send the data that currently goes to rank 0 to another rank, to see whether the same behavior occurs on that rank?
@DN6, this will take some time, as I currently have a large experiment running. But I will keep you posted.
Got it. Would it be possible to take a look at the data being sent to rank 0 to see if there is anything unusual about it? Also, if you can share a skeleton version of your code, I can try to replicate the issue.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.
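The thread above asks how data is split across ranks. As a rough illustration, the sketch below mimics in plain Python how a strided, non-shuffled split of dataset indices across ranks works, which is my understanding of what `torch.utils.data.DistributedSampler` does with `shuffle=False` and `drop_last=False`. The helper name `shard_indices` is hypothetical and not part of PyTorch or comet_ml, and this is not the reporter's code.

```python
import math


def shard_indices(dataset_len, num_replicas, rank):
    """Hypothetical helper: indices one rank would receive under a strided,
    non-shuffled split (assumed to mirror DistributedSampler's behavior)."""
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices = list(range(dataset_len))
    padding = total_size - len(indices)
    if padding > 0:
        # Wrap around to the start so every rank gets the same number of samples.
        indices += (indices * math.ceil(padding / len(indices)))[:padding]
    # Each rank takes every num_replicas-th index, starting at its own rank.
    return indices[rank:total_size:num_replicas]


if __name__ == "__main__":
    for r in range(4):
        print(r, shard_indices(10, 4, r))
```

Note that when the dataset length is not divisible by the number of ranks, the padding step repeats a few indices, so the per-rank subsets are only strictly exclusive when the length divides evenly.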
Describe the Bug
When logging the loss metric, comet_ml logs seemingly random values for local rank 0 and the correct loss values for the other ranks. See the graphs below, logged by comet_ml.
For rank 0:
For rank 1:
Loss metric logged on different ranks:
I noticed that a very large loss value is logged for rank 0.
Expected behavior
I expect the loss logged on rank 0 to be very similar to that of rank 1.
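One way to check this expectation outside of comet_ml would be to collect the loss each rank computes at a given step and flag any rank whose value is non-finite or far larger than the others. The sketch below is a hypothetical diagnostic, not part of comet_ml or PyTorch. The function name `flag_suspect_ranks`, the `ratio` threshold, and the numbers in the usage example are all assumptions made for illustration and are not values from this issue.

```python
import math
from statistics import median


def flag_suspect_ranks(losses_by_rank, ratio=10.0):
    """Hypothetical check: return ranks whose loss is non-finite or more than
    `ratio` times the median loss of the other ranks."""
    suspects = []
    for rank, value in losses_by_rank.items():
        # Compare each rank against the finite losses of all other ranks.
        others = [
            v for r, v in losses_by_rank.items()
            if r != rank and math.isfinite(v)
        ]
        if not math.isfinite(value):
            suspects.append(rank)
        elif others and abs(value) > ratio * abs(median(others)):
            suspects.append(rank)
    return sorted(suspects)


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the experiment.
    print(flag_suspect_ranks({0: 1e6, 1: 0.5}))
```

If a rank is flagged on the raw loss value before it is logged, the problem would lie in the training step or the data on that rank rather than in the logging layer.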
Where is the issue?