
local rank 0 logging random values of loss when auto_metric_logging is set to true? #461

Closed
nsriniva03 opened this issue Feb 7, 2022 · 9 comments

nsriniva03 commented Feb 7, 2022

Describe the Bug

When logging the loss metric, comet_ml logs random values for local rank 0 but the correct loss values for the other ranks. See the graphs below, logged by comet_ml.

For rank 0:

[screenshot: loss curve logged for rank 0]

For rank 1:

[screenshot: loss curve logged for rank 1]

Loss metric logged on different ranks:

[screenshot: loss metric compared across ranks]

I noticed that a very large value is logged for rank 0.

Expected behavior

I expect the loss logged on rank 0 to be very similar to that of rank 1.

Where is the issue?

- [x] Comet Python SDK
- [x] Comet UI
- [ ] Third Party Integrations (Huggingface, TensorboardX, PyTorch Lightning, etc.)
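
For context, here is a minimal sketch of the kind of setup involved (this is not the actual training code; the project name, dataset, model, and hyperparameters are placeholders):

```python
# skeleton_ddp_comet.py -- placeholders only, not the real training code.
# Launch with: torchrun --nproc_per_node=2 skeleton_ddp_comet.py
import os

import comet_ml  # imported before torch, per Comet's recommendation for auto-logging
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    dist.init_process_group("gloo")  # "nccl" on GPUs; gloo keeps the sketch CPU-only
    rank = dist.get_rank()

    # One Experiment per rank; auto_metric_logging is the flag from the issue title.
    # The API key is read from COMET_API_KEY or the Comet config file.
    experiment = comet_ml.Experiment(
        project_name="ddp-loss-repro",  # placeholder project name
        auto_metric_logging=True,
    )
    experiment.log_parameter("rank", rank)

    # Placeholder data and model; DistributedSampler gives each rank its own shard.
    dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = DDP(torch.nn.Linear(16, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)
        for step, (x, y) in enumerate(loader):
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
            # Explicit logging shown for clarity; the plots above show the loss
            # metric as it appears in Comet per rank.
            experiment.log_metric("loss", loss.item(), step=step, epoch=epoch)

    experiment.end()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```
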
DN6 (Contributor) commented Feb 7, 2022

Hello @nsriniva03. Looking into this. Just to confirm, are you running distributed training with Comet?

Would it be possible to share the code used to run these experiments? It would help our team have more context around what could be happening.

nsriniva03 (Author)

@DN6, I can't share the code, but I can put together skeletal code that highlights the main steps.

DN6 (Contributor) commented Feb 7, 2022

Thank you! That would be very helpful. Additionally, are you using data parallel training? If so, is your data distributed the same way across your training nodes, or is it shuffled, with batches randomly assigned to nodes across runs?

nsriniva03 (Author)

Yes, I am using distributed data parallel for training, with PyTorch's DistributedSampler handling data distribution across ranks. It loads a subset of the data onto each rank that is exclusive to that rank.
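
To illustrate the sharding, here is a quick toy check that DistributedSampler hands each rank a disjoint set of indices (the 8-sample dataset and 2 ranks are just for illustration):

```python
from torch.utils.data.distributed import DistributedSampler

# Toy example: 8 samples split across 2 ranks, no shuffling.
sampler_rank0 = DistributedSampler(range(8), num_replicas=2, rank=0, shuffle=False)
sampler_rank1 = DistributedSampler(range(8), num_replicas=2, rank=1, shuffle=False)
print(list(sampler_rank0))  # [0, 2, 4, 6]
print(list(sampler_rank1))  # [1, 3, 5, 7]
```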

DN6 (Contributor) commented Feb 9, 2022

Understood.

Would it be possible to run your experiment again, sending the data that currently goes to rank 0 to another rank, to see if the same behavior occurs on that rank?
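
(For concreteness, one way to do this without changing anything else might be to give the sampler a shifted rank, so each process trains on its neighbour's usual shard; `dataset` here is a placeholder name:)

```python
import torch.distributed as dist
from torch.utils.data.distributed import DistributedSampler

# Shift every process one shard over: the process running as rank 0 gets the
# shard that normally goes to rank 1, and so on. `dataset` is a placeholder.
shifted_rank = (dist.get_rank() + 1) % dist.get_world_size()
sampler = DistributedSampler(dataset, num_replicas=dist.get_world_size(),
                             rank=shifted_rank)
```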

nsriniva03 (Author)

@DN6, this will take some time, as I currently have a large experiment running, but I will keep you posted.

DN6 (Contributor) commented Feb 10, 2022

Got it. Would it be possible to take a look at the data being sent to rank 0 to see if there is anything unusual about it? Also, if you can share a skeletal version of your code, I can try to replicate the issue.
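
(As a rough idea of what to look for, something along these lines run on rank 0 would surface NaNs, infs, or extreme values in that shard; `loader` is assumed to be the rank-0 DataLoader from a skeletal setup like the one in the issue body:)

```python
import torch

# Quick sanity check on the batches that rank 0 actually sees.
for step, (x, y) in enumerate(loader):
    if torch.isnan(x).any() or torch.isinf(x).any():
        print(f"step {step}: NaN/inf in inputs")
    print(f"step {step}: x in [{x.min().item():.3g}, {x.max().item():.3g}], "
          f"y in [{y.min().item():.3g}, {y.max().item():.3g}]")
```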

github-actions bot commented

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label Oct 19, 2023
github-actions bot commented

This issue was closed because it has been stalled for 5 days with no activity.

github-actions bot closed this as not planned Oct 24, 2023