
local rank 0 logging random values of loss when auto_metric_logging is set to true? #461

Closed
nsriniva03 opened this issue Feb 7, 2022 · 9 comments

nsriniva03 commented Feb 7, 2022

Describe the Bug

When logging the loss metric, comet_ml logs random values for local rank 0 but the correct loss values for the other ranks. See the graphs below, logged by comet_ml.

For rank 0:

[screenshot: loss curve logged for rank 0]

For rank 1:

[screenshot: loss curve logged for rank 1]

Loss metric logged on different ranks:

[screenshot: loss metric compared across ranks]

I noticed that a very large value is logged for rank 0.

Expected behavior

I expect the loss logged on rank 0 to be very similar to that of rank 1.

Where is the issue?

- [x] Comet Python SDK
- [x] Comet UI
- [ ] Third Party Integrations (Huggingface, TensorboardX, PyTorch Lightning, etc.)
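
For context, here is a minimal sketch of the kind of setup involved (this is not the actual training code; the project name, dataset, model, and hyperparameters are placeholders):

```python
# skeleton_ddp_comet.py -- placeholders only, not the real training code.
# Launch with: torchrun --nproc_per_node=2 skeleton_ddp_comet.py
import os

import comet_ml  # imported before torch, per Comet's recommendation for auto-logging
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    dist.init_process_group("gloo")  # "nccl" on GPUs; gloo keeps the sketch CPU-only
    rank = dist.get_rank()

    # One Experiment per rank; auto_metric_logging is the flag from the issue title.
    # The API key is read from COMET_API_KEY or the Comet config file.
    experiment = comet_ml.Experiment(
        project_name="ddp-loss-repro",  # placeholder project name
        auto_metric_logging=True,
    )
    experiment.log_parameter("rank", rank)

    # Placeholder data and model; DistributedSampler gives each rank its own shard.
    dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = DDP(torch.nn.Linear(16, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)
        for step, (x, y) in enumerate(loader):
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
            # Explicit logging shown for clarity; the plots above show the loss
            # metric as it appears in Comet per rank.
            experiment.log_metric("loss", loss.item(), step=step, epoch=epoch)

    experiment.end()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```
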
DN6 (Contributor) commented Feb 7, 2022

Hello @nsriniva03. Looking into this. Just to confirm, are you running distributed training with Comet?

Would it be possible to share the code used to run these experiments? It would help our team have more context around what could be happening.

nsriniva03 (Author)

@DN6, I can't share the code, but I can put together skeletal code that highlights the main steps.

DN6 (Contributor) commented Feb 7, 2022

Thank you! That would be very helpful. Additionally, are you using data parallel training? If so, is your data distributed the same way across your training nodes, or is it shuffled, with batches randomly assigned to nodes across runs?

nsriniva03 (Author)

Yes, I am using distributed data parallel for training, with PyTorch's DistributedSampler handling data distribution across ranks. It loads a subset of the data onto each rank that is exclusive to that rank.
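
To illustrate the sharding, here is a quick toy check that DistributedSampler hands each rank a disjoint set of indices (the 8-sample dataset and 2 ranks are just for illustration):

```python
from torch.utils.data.distributed import DistributedSampler

# Toy example: 8 samples split across 2 ranks, no shuffling.
sampler_rank0 = DistributedSampler(range(8), num_replicas=2, rank=0, shuffle=False)
sampler_rank1 = DistributedSampler(range(8), num_replicas=2, rank=1, shuffle=False)
print(list(sampler_rank0))  # [0, 2, 4, 6]
print(list(sampler_rank1))  # [1, 3, 5, 7]
```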

DN6 (Contributor) commented Feb 9, 2022

Understood.

Would it be possible to run your experiment again, sending the data that currently goes to rank 0 to another rank, to see if the same behavior occurs on that rank?
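
(For concreteness, one way to do this without changing anything else might be to give the sampler a shifted rank, so each process trains on its neighbour's usual shard; `dataset` here is a placeholder name:)

```python
import torch.distributed as dist
from torch.utils.data.distributed import DistributedSampler

# Shift every process one shard over: the process running as rank 0 gets the
# shard that normally goes to rank 1, and so on. `dataset` is a placeholder.
shifted_rank = (dist.get_rank() + 1) % dist.get_world_size()
sampler = DistributedSampler(dataset, num_replicas=dist.get_world_size(),
                             rank=shifted_rank)
```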

nsriniva03 (Author)

@DN6, this will take some time, as I currently have a large experiment running, but I will keep you posted.

DN6 (Contributor) commented Feb 10, 2022

Got it. Would it be possible to take a look at the data being sent to rank 0 to see if there is anything unusual about it? Also, if you can share a skeletal version of your code, I can try to replicate the issue.
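
(As a rough idea of what to look for, something along these lines run on rank 0 would surface NaNs, infs, or extreme values in that shard; `loader` is assumed to be the rank-0 DataLoader from a skeletal setup like the one in the issue body:)

```python
import torch

# Quick sanity check on the batches that rank 0 actually sees.
for step, (x, y) in enumerate(loader):
    if torch.isnan(x).any() or torch.isinf(x).any():
        print(f"step {step}: NaN/inf in inputs")
    print(f"step {step}: x in [{x.min().item():.3g}, {x.max().item():.3g}], "
          f"y in [{y.min().item():.3g}, {y.max().item():.3g}]")
```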

github-actions bot commented

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label Oct 19, 2023
github-actions bot commented

This issue was closed because it has been stalled for 5 days with no activity.

github-actions bot closed this as not planned Oct 24, 2023