
@emersodb emersodb commented Jul 11, 2023

This PR addresses the issue we were seeing where Batch Normalization layers end up with negative running variance estimates because of the momentum-based aggregation in FedAdam. A negative variance makes the forward pass fail during evaluation, since the normalization takes the square root of the variance and produces NaNs. The fix is to configure the batch normalization layers so that they do not use the running estimates of the mean and variance at evaluation time, which means the estimates tracked during training have no effect.
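To illustrate how momentum on the server side can push a variance estimate below zero, here is a minimal scalar sketch. It assumes a simplified FedAdam-style server update applied to a single running-variance value; the function name, the hyperparameters (`lr`, `beta1`, `beta2`, `tau`), and the client value are illustrative assumptions, not the settings used in these experiments.

```python
import math


def fedadam_scalar_rounds(x0, client_value, rounds,
                          lr=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
    """Hypothetical helper: run a scalar FedAdam-style server update.

    x0 is the server's initial running-variance estimate and client_value is
    the (always positive) average variance the clients send back each round.
    """
    x, m, v = x0, 0.0, tau ** 2
    trajectory = [x]
    for _ in range(rounds):
        delta = client_value - x                      # pseudo-gradient
        m = beta1 * m + (1.0 - beta1) * delta         # first-moment (momentum)
        v = beta2 * v + (1.0 - beta2) * delta ** 2    # second-moment estimate
        x = x + lr * m / (math.sqrt(v) + tau)         # adaptive server step
        trajectory.append(x)
    return trajectory


# Every client reports a valid variance of 0.05, yet momentum carries the
# server estimate past the target and below zero.
trajectory = fedadam_scalar_rounds(x0=1.0, client_value=0.05, rounds=20)
for round_idx, value in enumerate(trajectory):
    flag = "  <-- negative variance" if value < 0 else ""
    print(f"round {round_idx:2d}: {value:+.4f}{flag}")
```

Plain averaging of positive variances can never go negative; the overshoot here comes entirely from the momentum term continuing to push downward after the estimate has reached the client value.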

I've tested this fix on the Fed Isic 2019 EfficientNet model, and it now avoids NaN values in the eval stage throughout training. I'll post-process the HP search and measure performance tomorrow, as long as the scripts run well through the night.

emersodb added 2 commits July 11, 2023 17:31
…ization layers. This should alleviate the issue we were seeing where nans crept into the model due to negative variance coming from momentum on the server side aggregation.
@yuchongzhang yuchongzhang left a comment

David and I had a call to discuss the math behind this and why setting the `track_running_stats` boolean to `False` is insufficient (the official PyTorch documentation is kind of misleading on this point). Changes look good to me.
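For readers following along, a minimal sketch of one reading of that point. It assumes that an already-constructed PyTorch batch norm layer in eval mode keeps normalizing with its running buffers for as long as those buffers exist, regardless of the flag. The helper `disable_running_stats` and the toy model are hypothetical and are not the code merged in this PR.

```python
import torch
from torch import nn

BATCH_NORM_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)


def disable_running_stats(model: nn.Module) -> None:
    """Hypothetical helper: make batch norm layers use batch statistics in eval."""
    for module in model.modules():
        if isinstance(module, BATCH_NORM_TYPES):
            module.track_running_stats = False
            # Dropping the buffers is what switches eval over to batch statistics.
            module.running_mean = None
            module.running_var = None
            module.num_batches_tracked = None


torch.manual_seed(0)
model = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3), nn.BatchNorm2d(4))
bn = model[1]
with torch.no_grad():
    bn.running_var.fill_(-1.0)  # simulate a negative variance after aggregation

model.eval()
x = torch.randn(8, 3, 16, 16)

# Flipping only the flag: the negative running variance is still used in eval.
bn.track_running_stats = False
flag_only_output = model(x)
print("flag only, any NaN:", torch.isnan(flag_only_output).any().item())

# Flag plus removed buffers: eval normalizes with the current batch statistics.
disable_running_stats(model)
fixed_output = model(x)
print("buffers removed, all finite:", torch.isfinite(fixed_output).all().item())
```

With the buffers set to `None`, they also no longer appear in `model.state_dict()`, so under this approach there is nothing left for server-side aggregation to drive negative.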

@emersodb emersodb changed the base branch from dbe/expand_basic_client to main July 13, 2023 21:40
@emersodb emersodb changed the base branch from main to dbe/expand_basic_client July 13, 2023 21:41
Base automatically changed from dbe/expand_basic_client to main July 14, 2023 15:17
@emersodb emersodb merged commit 212d698 into main Jul 14, 2023
@emersodb emersodb deleted the dbe/fix_flamby_fedadam_bn_issue branch July 14, 2023 15:27
3 participants