FSDP: issues with inferencing #703

Closed
HITESHLPATEL opened this issue May 28, 2021 · 24 comments
Labels
FSDP FullyShardedDataParallel (zero-3)

Comments

@HITESHLPATEL commented May 28, 2021

❓ Questions and Help

I am trying to integrate FSDP into my code and have questions about optimizer sharding. Does FSDP automatically shard the gradients, optimizer state, and parameters, or do I need to call OSS to shard the optimizer? Also, can anyone suggest the right way to do mixed precision with autocast and FSDP? Finally, does validation/testing happen on GPU rank 0 or on all the nodes?

min-xu-ai added the FSDP FullyShardedDataParallel (zero-3) label on May 28, 2021
@min-xu-ai (Contributor)

  • no, you don't need to call OSS to shard parameters
  • FSDP automatically shards parameters, gradients, and optimizer state
  • if you need the forward-pass compute to overlap with network communication, you need nested FSDP, i.e. fsdp(module(fsdp(layer1), ... fsdp(layer_n)))
  • check the tests dir for a large number of tests that use FSDP, including mixed-precision usage
  • the common way to use mixed precision is the autocast context plus FSDP with mixed_precision=True (see the sketch after this list)
  • validation still happens on rank 0 only unless you do distributed inference. However, you need all ranks to participate in the validation because rank 0 needs to fetch parameters from them. There aren't many examples of validation in the tests dir yet.
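A rough sketch of that combination (not taken from the tests dir; the layer names and sizes are made up, and it assumes the process group has already been initialized, e.g. via dist_init):

import torch
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP
from fairscale.optim.grad_scaler import ShardedGradScaler

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # nested wrapping: inner FSDP instances let compute overlap with communication
        self.layer1 = FSDP(nn.Linear(1024, 1024).cuda(), mixed_precision=True)
        self.layer2 = FSDP(nn.Linear(1024, 10).cuda(), mixed_precision=True)

    def forward(self, x):
        return self.layer2(torch.relu(self.layer1(x)))

model = FSDP(ToyModel(), mixed_precision=True)  # outer wrapper
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = ShardedGradScaler()

with torch.cuda.amp.autocast():
    loss = model(torch.randn(8, 1024, device="cuda")).sum()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()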

@HITESHLPATEL (Author) commented May 28, 2021

Thank you for the clarification. However, when I use autocast and the sharded grad scaler I get this error:
assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer
Without mixed precision and autocast, my loss stays the same across all epochs, which is very weird.

Can you share an example of how I can make all ranks participate in validation?

@min-xu-ai (Contributor)

cc @blefaudeux. Check the file tests/nn/data_parallel/test_fsdp_regnet.py. Did you use it this way:

(screenshot of the relevant test code omitted)

For validation, I don't know of an example in the test dir, but I could have missed one. Another good source to check is the fairseq code to see their validation code with FSDP, though I don't have a direct pointer, unfortunately.

@blefaudeux (Contributor)

Thank you for the clarification. However, when I use autocast and the sharded grad scaler I get this error:
AssertionError: No inf checks were recorded for this optimizer
Without mixed precision and autocast, my loss stays the same across all epochs, which is very weird.

Can you share an example of how I can make all ranks participate in validation?

It's a bit hard to understand the context here; it feels like there are a couple of hidden variables (could you share a code snippet if you're still stuck?). My guess is that you're missing a call, and the assert is there precisely for that. On the grad scaler side we follow the PyTorch way of doing this (see the doc); the differences are tied to the sharding but are not visible on the user side, so that doc should apply entirely.

If I misunderstood and you would like to sync some metric across ranks, you can use an all_reduce to consolidate it across ranks, for instance like it's done here (but be mindful that this is a collective operation; any rank skipping it will deadlock the whole fleet).
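For what it's worth, a rough sketch of that kind of consolidation (illustrative only, not the exact code linked above; it assumes the process group is already initialized):

import torch
import torch.distributed as dist

def consolidate_metric(value: float, device: torch.device) -> float:
    # every rank must call this, otherwise the collective deadlocks
    t = torch.tensor([value], dtype=torch.float32, device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    return (t / dist.get_world_size()).item()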

@HITESHLPATEL (Author)

def train(rank, args, model, params, device, train_loader, num_epochs):
    ##############
    # SETUP
    dist_init(rank, WORLD_SIZE, BACKEND)
    ddp = FullyShardedDataParallel(module=model, mixed_precision=True)
    optimizer = torch.optim.SGD(params, lr=0.01)

    ddp.train()

    # Training loop
    training_start = time.monotonic()

    loss_fn = nn.CrossEntropyLoss()

    model.train()
    measurements = []
    scaler = ShardedGradScaler()
    for epoch in range(num_epochs):
        epoch_start = time.monotonic()
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)

            model.zero_grad()
            with torch.cuda.amp.autocast(enabled=True):
                outputs = model(data)
                loss = loss_fn(outputs, target)

            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

@blefaudeux (Contributor)

(training snippet quoted above)

Thanks for the snippet! It's not visible here, but could it be that your optimizer is not tied to the proper model parameters? It's been a while since I spent time on this, but I think that could explain both the assert and the loss not changing (the optimizer sees no grads, so there are no inf checks and the loss stays constant).

@HITESHLPATEL (Author)

Could you please suggest what could be done in such scenarios? Or is it a bug?

@blefaudeux (Contributor)

Could you please suggest what could be done in such scenarios? Or is it a bug?

It's not possible to see from your snippet how the optimizer is defined, so I can only conjecture that maybe it does not have the right parameters. I would expect something like optimizer = torch.optim.MyFancyOptimizer(model.parameters()) after the model has been properly defined and wrapped. It may well be there, but if it was not properly declared, it could explain your symptoms.
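In other words, something along these lines (a sketch based on your snippet, not a tested fix):

import torch
from fairscale.nn import FullyShardedDataParallel

# `model` is your original nn.Module, as in your snippet
ddp = FullyShardedDataParallel(module=model, mixed_precision=True)
# build the optimizer from the wrapped module's parameters,
# not from a parameter list captured before wrapping
optimizer = torch.optim.SGD(ddp.parameters(), lr=0.01)
# (and the training loop should then call ddp(...) rather than the unwrapped model)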

@HITESHLPATEL (Author)

@blefaudeux Thank you. Could you point me to any sample code for validation? In standard PyTorch DDP, validation happens on rank 0. Since the model and states are sharded, does that still hold true?

@blefaudeux (Contributor)

@blefaudeux Thank you. Could you point me to any sample code for validation? In standard PyTorch DDP, validation happens on rank 0. Since the model and states are sharded, does that still hold true?

The ranks are still in sync after an optimizer step(), so you should be able to run the validation anywhere, if I'm not mistaken (true for OSS and ShardedDDP, and I believe also for FSDP, but pulling in @min-xu-ai just in case).
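In case it helps, a rough sketch of all-rank validation (assuming `model` is the FSDP-wrapped module, `val_loader` yields the same number of batches on every rank, and the process group is already initialized; the names are illustrative, not from the fairscale tests):

import torch
import torch.distributed as dist

def validate(model, val_loader, device):
    model.eval()
    correct = torch.zeros(1, device=device)
    total = torch.zeros(1, device=device)
    with torch.no_grad():
        for data, target in val_loader:
            data, target = data.to(device), target.to(device)
            # every rank runs the forward pass, so the parameter
            # all-gathers inside FSDP stay in sync across ranks
            preds = model(data).argmax(dim=1)
            correct += (preds == target).sum()
            total += target.numel()
    # consolidate the metric (collective: every rank must call it)
    dist.all_reduce(correct)
    dist.all_reduce(total)
    if dist.get_rank() == 0:
        print(f"validation accuracy: {(correct / total).item():.4f}")
    model.train()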

@HITESHLPATEL (Author)

@blefaudeux I was able to get rid of the inf-check error and the unchanging loss by setting flatten_parameters=False when doing mixed-precision training. However, my training now takes about twice as long as the standard FullyShardedDataParallel setup.
self.model = FullyShardedDataParallel(self.model, mixed_precision=True, flatten_parameters=False)  # works fine
self.model = FullyShardedDataParallel(self.model, mixed_precision=True, flatten_parameters=True)   # inf-check error or illegal memory access crash. I assume there is a bug.

@min-xu-ai (Contributor)

Thanks for narrowing it down a bit! There is likely a bug. Can you try to distill the test code into a small piece that demonstrates this issue?

@HITESHLPATEL (Author)

@min-xu-ai Thank you. I noticed that my error comes from the validation phase. Could you provide me with sample code on how to do validation? I am doing validation on rank 0 only, and it gives me a bad-param error.

@min-xu-ai (Contributor)

Sorry to hear that. I haven't done that myself; maybe others can chime in. Another option is to provide the failing code for me to debug the error. No promises on time to fix, though.

@HITESHLPATEL (Author) commented May 31, 2021

Here is the sample code from my training pipeline

Training Loop

dist_init(rank, WORLD_SIZE, BACKEND)
ddp = FullyShardedDataParallel(module=model,mixed_precision=True)
optimizer=torch.optim.SGD(params,lr=0.01)

ddp.train()

training_start = time.monotonic()

loss_fn = nn.CrossEntropyLoss()

model.train()
measurements = []
scaler = ShardedGradScaler()
for epoch in range(num_epochs):
    epoch_start = time.monotonic()
    for batch_idx, (data, target) in enumerate(train_loader[phase]):
        data, target = data.to(device), target.to(device)

        model.zero_grad()
        with torch.set_grad_enabled(phase == 'train') and torch.cuda.amp.autocast(enabled=True):
            outputs = model(data)
            loss = loss_fn(outputs, target)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

I am doing validation on rank 0 only. As soon as the training epoch completes, I hit this error:
x = F.conv2d(x, self.weight, self.bias, self.stride, self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM

Just changing to flatten_parameters=False makes everything work fine, but then the time is almost doubled.

min-xu-ai changed the title from "Fully Sharded Data Parallel" to "FSDP: issues with inferencing" on Jun 2, 2021
@HITESHLPATEL (Author) commented Jun 4, 2021

@blefaudeux @min-xu-ai I was trying to save my trained model on rank 0, while running inference on the validation set, by calling model.state_dict(), but it gets stuck there. Is there anything else that needs to be done to save the model, given that I am calling this on rank 0 only?

@blefaudeux (Contributor) commented Jun 4, 2021

@blefaudeux @min-xu-ai I was trying to save my trained model on rank 0, while running inference on the validation set, by calling model.state_dict(), but it gets stuck there. Is there anything else that needs to be done to save the model, given that I am calling this on rank 0 only?

@min-xu-ai will chime in, but you probably need to consolidate the model on this rank prior to pulling out the state; it should be mentioned in the doc.

Edit: either that, or call state_dict() on all ranks; I'm not sure now which one FSDP asks for. The "stuck" symptom is expected: there must be a collective operation that is only being called on a single rank, hence the deadlock.
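Roughly like this, if the all-rank variant is the one FSDP wants (a sketch, assuming `model` is the FSDP-wrapped module; the checkpoint path is just a placeholder):

import torch
import torch.distributed as dist

# call state_dict() on every rank: gathering the full parameters
# involves collective communication, so a rank-0-only call can hang
state = model.state_dict()

# only rank 0 actually writes the checkpoint to disk
if dist.get_rank() == 0:
    torch.save(state, "checkpoint.pt")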

@HITESHLPATEL (Author)

@min-xu-ai, can you point me to the documentation on how FSDP saves the model? I tried to search for this but didn't find any information.

@min-xu-ai (Contributor)

Take a look at this test:

def _load_local_and_train(self, config, rank, group, d_model=16, d_vocab=23):

The doc is here:
(screenshot of the relevant documentation omitted)

But I must admit that I have not personally debugged issues around state dict lately. I will need to refresh my own knowledge about it.

@HITESHLPATEL (Author) commented Jun 22, 2021

@min-xu-ai @blefaudeux Thank you. I was able to save the model. One more question: is there any test case that uses different resolutions for training and validation? For example, if the train resolution is 300 and the validation resolution is 384, it gives a CUDNN_STATUS_BAD_PARAM error. Using the same size for train and validation does not give an error. I think this has something to do with flatten_parameters.

@min-xu-ai (Contributor)

Good to hear!

If you suspect flatten_parameters, perhaps you can set it to False and test?

To debug CUDNN_STATUS_BAD_PARAM, perhaps you can disable cuDNN and try it?

torch.backends.cudnn.enabled = False would disable cuDNN.

cuDNN error messages are hard to debug and lack useful info. Sometimes they are related to running out of memory, sometimes to dtype and mixed precision.

@HITESHLPATEL (Author)

Thank you. Setting flatten_parameters=False solves that issue; however, it leads to an increase in training time.

@min-xu-ai (Contributor)

Thank you. Setting flatten_parameters=False solves that issue; however, it leads to an increase in training time.

Good. That's great for confirming. Now, for debugging the conv error, did you try the cudnn.enabled option above?

@min-xu-ai (Contributor)

I assume this issue is largely solved. Reopen if not.
