CUDNN_STATUS_MAPPING_ERROR when using DeepSpeed own DataLoader #2722

@yorickbrunet

Description

Hello,

I started using DeepSpeed with the Caltech256 dataset and PyTorch data loaders.
The Caltech256 dataset is instantiated once and the data loaders are derived from it. These data loaders are not passed to the DeepSpeed initializer; they are used directly in the training, test, and validation loops.
Example:

# Assumed imports for the aliases used below;
# CaltechResnet and gdeep are my own modules.
import torch
import torch.utils.data as dl
import torchvision.datasets as td
import deepspeed

caltech_ds = td.Caltech256(root="examples/data/DatasetCloud",
    download=False,
    transform=CaltechResnet.CALTECH_IMG_TRANSFORM
)
split_ratios = [0.8, 0.1, 0.1]
samplers = CaltechResnet.caltechSamplers(
    caltech_ds, split_ratios
)
# DataLoader positional args: (dataset, batch_size, shuffle, sampler)
caltech_dl_tr = dl.DataLoader(caltech_ds, 4, False, samplers[0])
caltech_dl_ts = dl.DataLoader(caltech_ds, 1, False, samplers[1])
caltech_dl_val = dl.DataLoader(caltech_ds, 1, False, samplers[2])
model = CaltechResnet.CaltechResnet()
optim = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
parameters = filter(lambda p: p.requires_grad, model.parameters())
model_engine, optimizer, _, _ = deepspeed.initialize(
    None,
    model,
    model_parameters=parameters,
    config=gdeep.trainer.dp_cfg.config,
    optimizer=optim
    )
for epoch in range(num_epochs):
    [...]

This works. I can at least reach the end of the train, test and validation loops.
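To illustrate why this path works: the stock PyTorch DataLoader's default collate batches plain integer labels into an int64 tensor automatically. A toy sketch (a stand-in dataset, not the real Caltech code) shows this:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Stand-in for Caltech256: returns (image tensor, plain int label)."""
    def __len__(self):
        return 8
    def __getitem__(self, i):
        return torch.zeros(3, 4, 4), i  # label is a Python int, not a tensor

loader = DataLoader(ToyDataset(), batch_size=4, shuffle=False)
images, targets = next(iter(loader))
print(targets.dtype)  # default collate turned the int labels into an int64 tensor
```

So the loss function always receives tensor targets on this path.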

So I also tried DeepSpeed's data loader for the training set. I split the Caltech256 dataset manually on disk and used each part separately. Example:

# Train set, 80% of the original Caltech256 dataset
caltech_ds_tr = td.Caltech256(root="examples/data/DatasetCloud_train",
    download=False,
    transform=CaltechResnet.CALTECH_IMG_TRANSFORM
)
# Validation set, 20% of the original Caltech256 dataset
caltech_ds_val = td.Caltech256(root="examples/data/DatasetCloud_test",
    download=False,
    transform=CaltechResnet.CALTECH_IMG_TRANSFORM
)
# Split validation dataset into validation and test sets
split_ratios = [0.5, 0.5] # 50% for each dataset (test, validation)
samplers = CaltechResnet.caltechSamplers(
    caltech_ds_val, split_ratios
)
# Create validation dataloaders
caltech_dl_ts = dl.DataLoader(caltech_ds_val, 1, False, samplers[0])
caltech_dl_val = dl.DataLoader(caltech_ds_val, 1, False, samplers[1])
model = CaltechResnet.CaltechResnet()
optim = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
parameters = filter(lambda p: p.requires_grad, model.parameters())
model_engine, optimizer, caltech_dl_tr, _ = deepspeed.initialize(
    None,
    model,
    model_parameters=parameters,
    config=gdeep.trainer.dp_cfg.config,
    optimizer=optim,
    training_data=caltech_ds_tr
    )
for epoch in range(num_epochs):
    [...]

However, this raises the following error:

RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR
../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered

It seems that the DeepSpeed data loader does not perform the same collation as PyTorch's. In particular, DeepSpeed does not appear to convert the integer targets into a tensor, whereas PyTorch's default collate function does.
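If that is the cause (an assumption on my side), a stopgap would be to coerce the targets by hand inside the loop before calling the loss; `nll_loss`/`cross_entropy` expect an int64 tensor with values in `[0, n_classes)`:

```python
import torch

n_classes = 257  # Caltech256: 256 object categories plus a clutter class
raw_targets = [3, 7, 12, 255]  # plain ints, as the loader appears to yield them

# Coerce on the host before moving to the GPU / calling the loss.
targets = torch.as_tensor(raw_targets, dtype=torch.long)

# The device-side assert `t >= 0 && t < n_classes` fires for out-of-range
# labels; checking here gives a readable Python error instead of a CUDA abort.
assert 0 <= targets.min() and targets.max() < n_classes
```

But that only works around the symptom, not the difference in loader behavior.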

Any idea?
Thanks!
