[BUG] RuntimeError: expected scalar type Float but found Half #2842

Hi everyone, I'm having an issue similar to the one reported in [#1233](https://github.com/microsoft/DeepSpeed/issues/1233).
I'm using:

- DeepSpeed 0.8.0
- pytorch-lightning==1.9.0
- torch==1.12.1+cu116
- torch-cluster==1.6.0+pt112cu116
- torch-geometric==2.2.0
- torch-geometric-temporal==0.54.0
- torch-scatter==2.1.0+pt112cu116
- torch-sparse==0.6.16+pt112cu116
- torch-spline-conv==1.2.1+pt112cu116
- torchaudio==0.12.1+cu116
- torchfile==0.1.0
- torchmetrics==0.9.3
- torchvision==0.13.1+cu116

With `precision=32` on the Lightning Trainer, on a single Quadro RTX 6000 GPU, everything works fine.
After switching to `precision=16` I get the following traceback, even when calling `torch.Tensor.half()` on the model, on the input, or on both.

```
Traceback (most recent call last):
  File "/user/projects/myproject/utils.py", line 153, in train
    trainer.fit(model, train_dataloaders=dataset_train, val_dataloaders=dataset_val)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1103, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1182, in _run_stage
    self._run_train()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run_train
    self._run_sanity_check()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1267, in _run_sanity_check
    val_loop.run()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 137, in advance
    output = self._evaluation_step(**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 234, in _evaluation_step
    output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1485, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/deepspeed.py", line 917, in validation_step
    return self.model(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1836, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1148, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/overrides/base.py", line 110, in forward
    return self._forward_module.validation_step(*inputs, **kwargs)
  File "/user/projects/myproject/GraphNN.py", line 127, in validation_step
    return self.step(batch)
  File "/user/projects/myproject/GraphNN.py", line 141, in step
    net_outputs = self(x.half(), batch.edge_index, batch.edge_attr.half(), batch.batch.half())
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1148, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/user/projects/myproject/GraphNN.py", line 105, in forward
    x = f(x, edge_index, edge_attr)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1148, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch_geometric_temporal/nn/recurrent/gconv_gru.py", line 166, in forward
    Z = self._calculate_update_gate(X, edge_index, edge_weight, H, lambda_max)
  File "/usr/local/lib/python3.8/dist-packages/torch_geometric_temporal/nn/recurrent/gconv_gru.py", line 121, in _calculate_update_gate
    Z = Z + self.conv_h_z(H, edge_index, edge_weight, lambda_max=lambda_max)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1148, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch_geometric/nn/conv/cheb_conv.py", line 170, in forward
    out = self.lins[0](Tx_0)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1148, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch_geometric/nn/dense/linear.py", line 136, in forward
    return F.linear(x, self.weight, self.bias)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/linear.py", line 116, in zero3_linear_wrap
    return LinearFunctionForZeroStage3.apply(input, weight)
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/amp/autocast_mode.py", line 110, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/linear.py", line 61, in forward
    output = input.matmul(weight.t())
RuntimeError: expected scalar type Float but found Half

<traceback object at 0x7f614d365200>
```
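The last frames of the traceback fail in `input.matmul(weight.t())`, reached through `self.conv_h_z(H, ...)`, so the mismatch appears to involve the hidden state `H` and the layer weights rather than `X`. To narrow this down, a small diagnostic could report which dtypes the parameters and the tensors of interest actually have at that point. The helper below is only a sketch: `summarize_dtypes` is a hypothetical name, and it relies solely on `named_parameters()` and `.dtype`, which `torch.nn.Module` and `torch.Tensor` provide. Whether partitioned parameters under ZeRO stage 3 report the dtype used in the forward pass is an assumption that should be checked.

```python
from collections import Counter


def summarize_dtypes(module, **tensors):
    """Return (parameter dtype counts, dtype of each named tensor).

    Duck-typed: `module` only needs `named_parameters()`, and each value in
    `tensors` only needs a `.dtype` attribute.
    """
    # Count how many parameters use each dtype (e.g. float16 vs float32).
    param_counts = Counter(str(p.dtype) for _, p in module.named_parameters())
    # Record the dtype of every tensor passed in by keyword.
    tensor_dtypes = {name: str(t.dtype) for name, t in tensors.items()}
    return dict(param_counts), tensor_dtypes
```

Inside `step`, it could be called as `print(summarize_dtypes(self, x=batch.x, edge_attr=batch.edge_attr))` just before the forward pass.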

                                                                   


**To Reproduce**

```
import torch
from torch_geometric.loader import DataLoader
from torch_geometric.datasets import TUDataset
from pytorch_lightning.strategies import DeepSpeedStrategy
from torch_geometric.nn import global_mean_pool
from torch_geometric_temporal import GConvGRU
import torch.nn.functional as F
from torch.nn import Linear
import pytorch_lightning as pl
import time
from typing import Union, Dict, List
import math
import torchmetrics as tm


class GraphNN(pl.LightningModule):
    def __init__(self,
                 num_features,
                 num_classes,
                 k,
                 dropout: Union[int, float] = 0.2,
                 learning_rate: float = 1e-4,
                 weight_decay: float = 1e-4,
                 num_layers: int = 2,
                 batch_size: int = 32,
                 ):
        super(GraphNN, self).__init__()
        self.dropout = dropout
        self.learning_rate = learning_rate
        self.criterion = torch.nn.CrossEntropyLoss()
        self.weight_decay = weight_decay
        self.k = k
        self.num_classes = num_classes+1
        self.layers = torch.nn.ModuleList()
        self.batch_size = batch_size
        h_channels = round(num_features/4)
        for layer in range(num_layers):
            if layer == 0:
                self.layers.append(GConvGRU(num_features, h_channels, k))

            else:
                output = round(math.sqrt(h_channels * self.num_classes))
                self.layers.append(GConvGRU(h_channels, output, k))
                h_channels = output

        output = round(math.sqrt(h_channels * self.num_classes))
        self.fc = Linear(h_channels, output)
        self.output = Linear(output, self.num_classes)
        self.save_hyperparameters()

    def forward(self, x, edge_index, edge_attr, batch):
        for f in self.layers:
            x = f(x, edge_index, edge_attr)
            x = x.relu()
        x = self.fc(x)
        x = global_mean_pool(x, batch)  # [batch_size, hidden_channels]
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.output(x)
        return x

    def training_step(self, batch, batch_idx):
        return self.step(batch)

    def validation_step(self, batch, batch_idx):
        return self.step(batch)

    def step(self, batch):
        phase: str = "train" if self.training is True else "val"
        x: torch.Tensor = batch.x
        x.requires_grad = True
        y = batch.y
        starting_time = time.time()
        net_outputs = self(x, batch.edge_index, batch.edge_attr, batch.batch)
        loss = self.criterion(net_outputs, y)

        results = {
            "loss": loss,
            "acc": torch.as_tensor(
                tm.functional.accuracy(net_outputs, y, average="micro")),
            "time": time.time() - starting_time,
        }
        if phase == "val":
            results["val_preds"] = net_outputs
            results["val_target"] = y

        elif phase == "train":
            results["train_preds"] = net_outputs
            results["train_target"] = y
        return results

    def training_epoch_end(self, outputs: List[Dict[str, torch.Tensor]]):

        self.log_stats(outputs)

    def validation_epoch_end(self, outputs: List[Dict[str, torch.Tensor]]):
        self.log_stats(outputs)

    def log_stats(self, outputs: List[Dict[str, torch.Tensor]]):

        phase: str = "train" if self.training is True else "val"

        self.log(f"time_{phase}", sum([e["time"] for e in outputs]) / len(outputs),
                 prog_bar=False, sync_dist=True)

        self.log(f"loss_{phase}", torch.stack([e["loss"] for e in outputs]).mean(),
                 prog_bar=True if phase == "val" else False, sync_dist=True)

        for metric in ["acc"]: #, "f1", "precision", "recall", "mcc"
            metric_data = torch.stack([e[metric] for e in outputs]).float()
            self.log(f"{metric}_mean_{phase}", metric_data.mean(),
                     prog_bar=True if metric == "acc" else False, sync_dist=True)
            del metric_data
        del outputs

    def optimizer_zero_grad(self, epoch, batch_idx, optimizer, optimizer_idx):
        for param in self.parameters():
            param.grad = None

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate, weight_decay=self.weight_decay)
        return optimizer


def run():
    dataset = TUDataset(root='data/TUDataset', name='MUTAG')
    channels = 64
    classes = 2
    max_epochs = 100
    batch_size = 64
    k = 1
    in_channels = 2
    num_features = dataset.num_features
    lr = 0.001

    dropout = 0.1

    torch.manual_seed(12345)
    dataset = dataset.shuffle()
    train_dataset = dataset[:150]
    test_dataset = dataset[150:]

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)


    model = GraphNN(
        dropout=dropout,
        num_classes=classes,
        num_features=num_features,
        k=k,
        learning_rate=lr,
        weight_decay=0,
        num_layers=in_channels
    )

    trainer = pl.Trainer(
        accelerator="cuda" if torch.cuda.is_available() else "cpu",
        devices=1,
        precision=16,
        max_epochs=max_epochs,
        gradient_clip_val=1,  # if disable_gradient_clipping is False else 0
        strategy=DeepSpeedStrategy(
            stage=3,
            offload_optimizer=True,
            offload_parameters=True,
            allgather_partitions=False,
        ),
    )
    trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=test_loader)


if __name__ == "__main__":
    run()
```
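Note that the `step` shown in the traceback calls `.half()` on every argument, including `batch.batch`, which in PyTorch Geometric is normally an integer index vector, while the script above passes the tensors unchanged. If inputs are cast by hand, one option is to cast only the floating-point ones and leave index tensors alone. The sketch below assumes tensor-like objects with `is_floating_point()` and `to(dtype)`, as `torch.Tensor` offers; `cast_floating_only` is a hypothetical helper, and it is not claimed to resolve the error reported here.

```python
def cast_floating_only(tensors, dtype):
    """Cast floating-point tensors to `dtype`; return the others unchanged.

    `tensors` is a dict of name -> tensor-like object. Integer tensors such
    as edge indices or batch assignment vectors are passed through as-is.
    """
    return {
        name: t.to(dtype) if t.is_floating_point() else t
        for name, t in tensors.items()
    }
```

With real tensors, a call might look like `cast_floating_only({"x": batch.x, "edge_attr": batch.edge_attr, "batch": batch.batch}, torch.float16)`.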


**Expected behavior**
I expected training to start normally, but with precision 16 instead of 32 I get the error above.

**ds_report output**
```
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
-------------------------------------------------- 
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]     
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]                      
quantizer .............. [YES] ...... [OKAY]      
random_ltd ............. [YES] ...... [OKAY]      
sparse_attn ............ [YES] ...... [OKAY]   
spatial_inference ...... [YES] ...... [OKAY]   
transformer ............ [YES] ...... [OKAY]   
stochastic_transformer . [YES] ...... [OKAY]   
transformer_inference .. [YES] ...... [OKAY]   
utils .................. [YES] ...... [OKAY]   
--------------------------------------------------  
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'  
DeepSpeed general environment info:                          
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']   
torch version .................... 1.12.1+cu116                                      
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.8.0, unknown, unknown                                                
torch cuda version ............... 11.6                                                                  
torch hip version ................ None                                                                 
nvcc version ..................... 11.6                                                       
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.6   
```


**System info:**
 - OS: Ubuntu 20.04.4 LTS
 - GPU count and types: 1 x Quadro RTX 6000
 - Python version: Python 3.8.10

**Launcher context**
I'm using the PyTorch Lightning DeepSpeed integration.

**Docker context**
I'm running in an NVIDIA-based Docker container on an HTCondor cluster.


