CUDNN_STATUS_MAPPING_ERROR when using DeepSpeed own DataLoader #2722

@yorickbrunet

Description

Hello,

I started using DeepSpeed with the Caltech256 dataset and PyTorch data loaders.
The Caltech256 dataset is instantiated once and the data loaders are derived from it. These data loaders are not passed to the DeepSpeed initializer; they are used directly in the training, test, and validation loops.
Example:

# Assumed imports for the aliases used below;
# CaltechResnet and gdeep are my own modules.
import torch
import torch.utils.data as dl
import torchvision.datasets as td
import deepspeed

caltech_ds = td.Caltech256(root="examples/data/DatasetCloud",
    download=False,
    transform=CaltechResnet.CALTECH_IMG_TRANSFORM
)
split_ratios = [0.8, 0.1, 0.1]
samplers = CaltechResnet.caltechSamplers(
    caltech_ds, split_ratios
)
# DataLoader positional args: (dataset, batch_size, shuffle, sampler)
caltech_dl_tr = dl.DataLoader(caltech_ds, 4, False, samplers[0])
caltech_dl_ts = dl.DataLoader(caltech_ds, 1, False, samplers[1])
caltech_dl_val = dl.DataLoader(caltech_ds, 1, False, samplers[2])
model = CaltechResnet.CaltechResnet()
optim = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
parameters = filter(lambda p: p.requires_grad, model.parameters())
model_engine, optimizer, _, _ = deepspeed.initialize(
    None,
    model,
    model_parameters=parameters,
    config=gdeep.trainer.dp_cfg.config,
    optimizer=optim
    )
for epoch in range(num_epochs):
    [...]

This works. I can at least reach the end of the train, test and validation loops.
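To illustrate why this path works: the stock PyTorch DataLoader's default collate batches plain integer labels into an int64 tensor automatically. A toy sketch (a stand-in dataset, not the real Caltech code) shows this:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Stand-in for Caltech256: returns (image tensor, plain int label)."""
    def __len__(self):
        return 8
    def __getitem__(self, i):
        return torch.zeros(3, 4, 4), i  # label is a Python int, not a tensor

loader = DataLoader(ToyDataset(), batch_size=4, shuffle=False)
images, targets = next(iter(loader))
print(targets.dtype)  # default collate turned the int labels into an int64 tensor
```

So the loss function always receives tensor targets on this path.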

So I also tried DeepSpeed's data loader for the training set. I split the Caltech256 dataset manually on disk and used each part separately. Example:

# Train set, 80% of the original Caltech256 dataset
caltech_ds_tr = td.Caltech256(root="examples/data/DatasetCloud_train",
    download=False,
    transform=CaltechResnet.CALTECH_IMG_TRANSFORM
)
# Validation set, 20% of the original Caltech256 dataset
caltech_ds_val = td.Caltech256(root="examples/data/DatasetCloud_test",
    download=False,
    transform=CaltechResnet.CALTECH_IMG_TRANSFORM
)
# Split validation dataset into validation and test sets
split_ratios = [0.5, 0.5] # 50% for each dataset (test, validation)
samplers = CaltechResnet.caltechSamplers(
    caltech_ds_val, split_ratios
)
# Create validation dataloaders
caltech_dl_ts = dl.DataLoader(caltech_ds_val, 1, False, samplers[0])
caltech_dl_val = dl.DataLoader(caltech_ds_val, 1, False, samplers[1])
model = CaltechResnet.CaltechResnet()
optim = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
parameters = filter(lambda p: p.requires_grad, model.parameters())
model_engine, optimizer, caltech_dl_tr, _ = deepspeed.initialize(
    None,
    model,
    model_parameters=parameters,
    config=gdeep.trainer.dp_cfg.config,
    optimizer=optim,
    training_data=caltech_ds_tr
    )
for epoch in range(num_epochs):
    [...]

However, this raises the following error:

RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR
../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered

It seems that the DeepSpeed data loader does not perform the same collation as PyTorch's. In particular, DeepSpeed does not appear to convert the integer targets into a tensor, whereas PyTorch's default collate function does.
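If that is the cause (an assumption on my side), a stopgap would be to coerce the targets by hand inside the loop before calling the loss; `nll_loss`/`cross_entropy` expect an int64 tensor with values in `[0, n_classes)`:

```python
import torch

n_classes = 257  # Caltech256: 256 object categories plus a clutter class
raw_targets = [3, 7, 12, 255]  # plain ints, as the loader appears to yield them

# Coerce on the host before moving to the GPU / calling the loss.
targets = torch.as_tensor(raw_targets, dtype=torch.long)

# The device-side assert `t >= 0 && t < n_classes` fires for out-of-range
# labels; checking here gives a readable Python error instead of a CUDA abort.
assert 0 <= targets.min() and targets.max() < n_classes
```

But that only works around the symptom, not the difference in loader behavior.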

Any idea?
Thanks!
