
[TB] log restarts#234

Merged
stas00 merged 1 commit into main from tb-resume-world-size
Jan 29, 2022

stas00 commented Jan 20, 2022

As we recently discovered, there was a very subtle issue where the optimizer wasn't fully restored on resume, and it's possible that some of the spikes we have encountered were related to that.

So I think it'd help to log resume events.

I'm thinking of simply logging world_size once per run. This also has a second use: it will tell us if we switched to a higher or lower number of nodes, so we can perhaps see some additional correlations there.

Since TB would end up drawing a flat line and we wouldn't see the restart events, I came up with the following hack: on restart it logs world_size, and then logs another step with a value of 0, so we end up with a nice spike for each restart, with the height of the spike also showing the world size for that run.
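To illustrate the idea, here is a minimal sketch, not the code from this PR. It assumes a writer object exposing `add_scalar(tag, value, step)`, which is how TensorBoard's `SummaryWriter` is typically called. The `RecordingWriter` stand-in, the `log_restart` helper and the tag name are all hypothetical and only serve to show which points get logged.

```python
class RecordingWriter:
    """Hypothetical stand-in for a TensorBoard writer.

    It records add_scalar calls in memory instead of writing event files,
    so the logged points can be inspected directly.
    """

    def __init__(self):
        self.events = []

    def add_scalar(self, tag, value, step):
        self.events.append((tag, value, step))


def log_restart(writer, world_size, iteration, tag="restarts/world-size"):
    """Log a spike marking a (re)start of the run.

    The first point has height world_size; the second point, one step
    later, drops back to 0 so consecutive restarts are not joined into
    a flat line. The tag name is an assumption for this sketch.
    """
    writer.add_scalar(tag, world_size, iteration)
    writer.add_scalar(tag, 0, iteration + 1)


# Simulate 3 (re)starts of a run with world_size=2 at made-up iterations.
writer = RecordingWriter()
for resume_iteration in (0, 100, 250):
    log_restart(writer, world_size=2, iteration=resume_iteration)
```

In a real run the helper would be called once at startup, after the checkpoint is loaded and the current iteration is known, and only from the rank that owns the TB writer.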

Here is an example of 3 restarts with world_size=2 on my machine:

[screenshot: snapshot_114]

Perhaps TB has another way to flag sparse points without making them disappear by drawing a line between them. It'd be nice to have some way of making those points extra fat or something, but I didn't find one.

Feedback and suggestions for improvements are very welcome.

stas00 merged commit dfb5d68 into main Jan 29, 2022
stas00 deleted the tb-resume-world-size branch January 29, 2022 16:56