Description
Hello,
I started using DeepSpeed with the Caltech256 dataset and PyTorch data loaders.
The Caltech dataset is instantiated and the data loaders are derived from it. These data loaders are not passed to the DeepSpeed initializer, but they are used in the training, test and validation loops.
Example:
```python
caltech_ds = td.Caltech256(
    root="examples/data/DatasetCloud",
    download=False,
    transform=CaltechResnet.CALTECH_IMG_TRANSFORM,
)
split_ratios = [0.8, 0.1, 0.1]
samplers = CaltechResnet.caltechSamplers(caltech_ds, split_ratios)

caltech_dl_tr = dl.DataLoader(caltech_ds, 4, False, samplers[0])
caltech_dl_ts = dl.DataLoader(caltech_ds, 1, False, samplers[1])
caltech_dl_val = dl.DataLoader(caltech_ds, 1, False, samplers[2])

model = CaltechResnet.CaltechResnet()
optim = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
parameters = filter(lambda p: p.requires_grad, model.parameters())

model_engine, optimizer, _, _ = deepspeed.initialize(
    None,
    model,
    model_parameters=parameters,
    config=gdeep.trainer.dp_cfg.config,
    optimizer=optim,
)

for epoch in range(num_epochs):
    [...]
```
This works: I can at least reach the end of the train, test and validation loops.
So I also tried DeepSpeed's data loader for the training set. I split the Caltech256 dataset manually and then used each part separately. Example:
```python
# Train set, 80% of the original Caltech256 dataset
caltech_ds_tr = td.Caltech256(
    root="examples/data/DatasetCloud_train",
    download=False,
    transform=CaltechResnet.CALTECH_IMG_TRANSFORM,
)
# Validation set, 20% of the original Caltech256 dataset
caltech_ds_val = td.Caltech256(
    root="examples/data/DatasetCloud_test",
    download=False,
    transform=CaltechResnet.CALTECH_IMG_TRANSFORM,
)

# Split the validation dataset into test and validation sets, 50% each
split_ratios = [0.5, 0.5]
samplers = CaltechResnet.caltechSamplers(caltech_ds_val, split_ratios)

# Create test and validation dataloaders
caltech_dl_ts = dl.DataLoader(caltech_ds_val, 1, False, samplers[0])
caltech_dl_val = dl.DataLoader(caltech_ds_val, 1, False, samplers[1])

model = CaltechResnet.CaltechResnet()
optim = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
parameters = filter(lambda p: p.requires_grad, model.parameters())

# Let DeepSpeed build the training dataloader from the dataset
model_engine, optimizer, caltech_dl_tr, _ = deepspeed.initialize(
    None,
    model,
    model_parameters=parameters,
    config=gdeep.trainer.dp_cfg.config,
    optimizer=optim,
    training_data=caltech_ds_tr,
)

for epoch in range(num_epochs):
    [...]
```
However, this raises the following error:
```
RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR
../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: device-side assert triggered
```
It seems that DeepSpeed's data loader does not perform the same operations as PyTorch's. In particular, DeepSpeed does not appear to convert the integer targets into tensors, while PyTorch's default collate function does.
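For what it's worth, the CUDA assertion `t >= 0 && t < n_classes` usually means a target label reached the loss outside the valid class range, which would be consistent with the targets not being collated the way `nll_loss` expects. A quick host-side sanity check can confirm whether the loader is yielding out-of-range targets (plain-Python sketch; `check_labels` is a hypothetical helper, not part of DeepSpeed or PyTorch):

```python
def check_labels(labels, n_classes):
    """Return the labels that violate nll_loss's requirement 0 <= t < n_classes."""
    return [t for t in labels if not (0 <= t < n_classes)]

# Toy example: Caltech256 has 257 classes (256 categories + clutter),
# so any label outside [0, 257) would trip the device-side assert.
bad = check_labels([0, 12, 256, 300, -1], n_classes=257)
print(bad)  # [300, -1]
```

In the real loop, one could run the same range check on each batch of targets coming out of the DeepSpeed-built `caltech_dl_tr` before they hit the GPU.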
Any idea?
Thanks!