
AttributeError in distributed mode -- (Avoid type conversion outside worker) #198

Closed
rayrayraykk opened this issue Jun 30, 2022 · 2 comments · Fixed by #232
Labels
bug Something isn't working

Comments

rayrayraykk (Collaborator) commented Jun 30, 2022

Describe the bug
When the model contains a BatchNorm layer, the buffer bn.num_batches_tracked gets converted to a plain int by the gRPC message passing, and trainer.update cannot handle this case.

[screenshot: AttributeError traceback]
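For reference, the effect can be reproduced outside FederatedScope (the exact gRPC packing code differs, but a plain tensor-to-list round trip shows the same collapse of the 0-dim buffer into a Python int):

  import torch
  import torch.nn as nn

  bn = nn.BatchNorm1d(4)
  sd = bn.state_dict()
  print(type(sd['num_batches_tracked']))   # <class 'torch.Tensor'> (0-dim LongTensor)

  # Packing tensors as plain lists (as a language-neutral message would)
  # turns the 0-dim buffer into a Python int:
  packed = {k: v.tolist() for k, v in sd.items()}
  print(type(packed['num_batches_tracked']))   # <class 'int'>
  print(type(packed['weight']))                # <class 'list'>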

A dummy solution:

  def update(self, model_parameters):
      '''
      Called by the FL client to update the model parameters.

      Arguments:
          model_parameters (dict): PyTorch Module object's state_dict.
      '''
      for key in model_parameters:
          if isinstance(model_parameters[key], list):
              model_parameters[key] = torch.FloatTensor(model_parameters[key])
          elif isinstance(model_parameters[key], int):
              # e.g. bn.num_batches_tracked arrives as a plain int
              model_parameters[key] = torch.tensor(model_parameters[key],
                                                   dtype=torch.long)
          elif isinstance(model_parameters[key], float):
              model_parameters[key] = torch.tensor(model_parameters[key],
                                                   dtype=torch.float)
      self.ctx.model.load_state_dict(self._param_filter(model_parameters),
                                     strict=False)

Or can we solve it before sending the model_param?
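One possible sender-side option (just a sketch; pack_state_dict / unpack_state_dict are hypothetical helpers, not existing FederatedScope functions) is to ship shape and dtype metadata together with the values, so the receiver can always rebuild the exact tensors, including 0-dim buffers:

  import torch

  def pack_state_dict(state_dict):
      # Keep shape/dtype so that even 0-dim buffers such as
      # num_batches_tracked survive the round trip.
      return {
          key: {'values': value.flatten().tolist(),
                'shape': list(value.shape),
                'dtype': str(value.dtype)}
          for key, value in state_dict.items()
      }

  def unpack_state_dict(packed):
      restored = {}
      for key, entry in packed.items():
          dtype = getattr(torch, entry['dtype'].replace('torch.', ''))
          restored[key] = torch.tensor(entry['values'],
                                       dtype=dtype).reshape(entry['shape'])
      return restored

  # example round trip
  sd = torch.nn.BatchNorm1d(4).state_dict()
  assert torch.equal(unpack_state_dict(pack_state_dict(sd))['num_batches_tracked'],
                     sd['num_batches_tracked'])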

rayrayraykk added the bug label Jun 30, 2022
rayrayraykk (Collaborator, Author) commented Jul 7, 2022

Avoid type conversion outside worker

There is another type-conversion bug in the aggregator.

If the value is an int, e.g. 8, FloatTensor(8) produces an uninitialized tensor of shape (8,) rather than the scalar value 8 (as illustrated below).
Should we handle these two situations in the message buffer instead of in the aggregator?
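To illustrate the FloatTensor pitfall on Python scalars:

  import torch

  torch.FloatTensor(8)               # uninitialized tensor of shape (8,), not the value 8
  torch.tensor(8)                    # 0-dim tensor holding the value 8 (int64)
  torch.tensor(8, dtype=torch.long)  # matches num_batches_tracked's dtype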

Need a discussion @xieyxclack @joneswong.

A dummy solution in aggregator:

  # Inside the aggregation loop, once per key of local_model:
  if not self.cfg.federate.use_ss:
      if isinstance(local_model[key], torch.Tensor):
          local_model[key] = local_model[key].float()
      elif isinstance(local_model[key], list):
          local_model[key] = torch.FloatTensor(local_model[key])
      elif isinstance(local_model[key], int):
          local_model[key] = torch.tensor(local_model[key], dtype=torch.long)
      elif isinstance(local_model[key], float):
          local_model[key] = torch.tensor(local_model[key], dtype=torch.float)

rayrayraykk changed the title from "AttributeError in distributed mode" to "AttributeError in distributed mode -- (Avoid type conversion outside worker)" Jul 7, 2022
xieyxclack (Collaborator) commented:
Avoid type conversion outside worker

IMO, the type conversion cannot happen inside the worker, since the worker should be language-independent.
