
AttributeError in distributed mode -- (Avoid type conversion outside worker) #198

Closed
rayrayraykk opened this issue Jun 30, 2022 · 2 comments · Fixed by #232
Labels
bug Something isn't working

Comments

rayrayraykk (Collaborator) commented Jun 30, 2022

Describe the bug
When the model contains a BatchNorm layer, the buffer bn.num_batches_tracked gets converted to a plain int by the gRPC message passing, and trainer.update cannot handle this case.

[screenshot: AttributeError traceback]
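For reference, the effect can be reproduced outside FederatedScope (the exact gRPC packing code differs, but a plain tensor-to-list round trip shows the same collapse of the 0-dim buffer into a Python int):

  import torch
  import torch.nn as nn

  bn = nn.BatchNorm1d(4)
  sd = bn.state_dict()
  print(type(sd['num_batches_tracked']))   # <class 'torch.Tensor'> (0-dim LongTensor)

  # Packing tensors as plain lists (as a language-neutral message would)
  # turns the 0-dim buffer into a Python int:
  packed = {k: v.tolist() for k, v in sd.items()}
  print(type(packed['num_batches_tracked']))   # <class 'int'>
  print(type(packed['weight']))                # <class 'list'>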

A dummy solution:

  def update(self, model_parameters):
      '''
      Called by the FL client to update the model parameters.

      Arguments:
          model_parameters (dict): PyTorch Module object's state_dict.
      '''
      for key in model_parameters:
          if isinstance(model_parameters[key], list):
              model_parameters[key] = torch.FloatTensor(model_parameters[key])
          elif isinstance(model_parameters[key], int):
              # e.g. bn.num_batches_tracked arrives as a plain int
              model_parameters[key] = torch.tensor(model_parameters[key],
                                                   dtype=torch.long)
          elif isinstance(model_parameters[key], float):
              model_parameters[key] = torch.tensor(model_parameters[key],
                                                   dtype=torch.float)
      self.ctx.model.load_state_dict(self._param_filter(model_parameters),
                                     strict=False)

Or can we solve it before sending the model_param?
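One possible sender-side option (just a sketch; pack_state_dict / unpack_state_dict are hypothetical helpers, not existing FederatedScope functions) is to ship shape and dtype metadata together with the values, so the receiver can always rebuild the exact tensors, including 0-dim buffers:

  import torch

  def pack_state_dict(state_dict):
      # Keep shape/dtype so that even 0-dim buffers such as
      # num_batches_tracked survive the round trip.
      return {
          key: {'values': value.flatten().tolist(),
                'shape': list(value.shape),
                'dtype': str(value.dtype)}
          for key, value in state_dict.items()
      }

  def unpack_state_dict(packed):
      restored = {}
      for key, entry in packed.items():
          dtype = getattr(torch, entry['dtype'].replace('torch.', ''))
          restored[key] = torch.tensor(entry['values'],
                                       dtype=dtype).reshape(entry['shape'])
      return restored

  # example round trip
  sd = torch.nn.BatchNorm1d(4).state_dict()
  assert torch.equal(unpack_state_dict(pack_state_dict(sd))['num_batches_tracked'],
                     sd['num_batches_tracked'])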

rayrayraykk added the bug label Jun 30, 2022
rayrayraykk (Collaborator, Author) commented Jul 7, 2022

Avoid type conversion outside worker

There is another type-conversion bug in the aggregator.

If the value is an int, e.g. 8, FloatTensor(8) produces an uninitialized tensor of shape (8,) rather than the scalar value 8 (as illustrated below).
Should we handle these two situations in the message buffer instead of in the aggregator?
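To illustrate the FloatTensor pitfall on Python scalars:

  import torch

  torch.FloatTensor(8)               # uninitialized tensor of shape (8,), not the value 8
  torch.tensor(8)                    # 0-dim tensor holding the value 8 (int64)
  torch.tensor(8, dtype=torch.long)  # matches num_batches_tracked's dtype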

Need a discussion @xieyxclack @joneswong.

A dummy solution in aggregator:

  # Inside the aggregation loop, once per key of local_model:
  if not self.cfg.federate.use_ss:
      if isinstance(local_model[key], torch.Tensor):
          local_model[key] = local_model[key].float()
      elif isinstance(local_model[key], list):
          local_model[key] = torch.FloatTensor(local_model[key])
      elif isinstance(local_model[key], int):
          local_model[key] = torch.tensor(local_model[key], dtype=torch.long)
      elif isinstance(local_model[key], float):
          local_model[key] = torch.tensor(local_model[key], dtype=torch.float)

rayrayraykk changed the title from "AttributeError in distributed mode" to "AttributeError in distributed mode -- (Avoid type conversion outside worker)" Jul 7, 2022
xieyxclack (Collaborator) commented:
Avoid type conversion outside worker

IMO, the type conversion cannot happen inside the worker, since the worker should be language-independent.
