FSDP: issues with inferencing #703

Closed
HITESHLPATEL opened this issue May 28, 2021 · 24 comments
Labels
FSDP FullyShardedDataParallel (zero-3)

Comments

@HITESHLPATEL commented May 28, 2021

❓ Questions and Help

I am trying to integrate FSDP into my code and have questions about optimizer sharding. Does FSDP automatically shard the gradients, optimizer state, and parameters, or do I need to call OSS to shard the optimizer? Also, can anyone suggest the right way to do mixed precision with autocast and FSDP? Finally, does validation/testing happen on GPU rank 0 or on all the nodes?

min-xu-ai added the FSDP FullyShardedDataParallel (zero-3) label on May 28, 2021
@min-xu-ai (Contributor)

  • no, you don't need to call OSS to shard parameters
  • FSDP automatically shards parameters, gradients, and optimizer state
  • if you need the forward-pass compute to overlap with network communication, you need nested FSDP, i.e. fsdp(module(fsdp(layer1), ... fsdp(layer_n)))
  • check the tests dir for a large number of tests that use FSDP, including mixed-precision usage
  • the common way to use mixed precision is the autocast context plus FSDP with mixed_precision=True (see the sketch after this list)
  • validation still happens on rank 0 only unless you do distributed inference. However, you need all ranks to participate in the validation because rank 0 needs to fetch parameters from them. There aren't many examples of validation in the tests dir yet.
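A rough sketch of that combination (not taken from the tests dir; the layer names and sizes are made up, and it assumes the process group has already been initialized, e.g. via dist_init):

import torch
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP
from fairscale.optim.grad_scaler import ShardedGradScaler

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # nested wrapping: inner FSDP instances let compute overlap with communication
        self.layer1 = FSDP(nn.Linear(1024, 1024).cuda(), mixed_precision=True)
        self.layer2 = FSDP(nn.Linear(1024, 10).cuda(), mixed_precision=True)

    def forward(self, x):
        return self.layer2(torch.relu(self.layer1(x)))

model = FSDP(ToyModel(), mixed_precision=True)  # outer wrapper
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = ShardedGradScaler()

with torch.cuda.amp.autocast():
    loss = model(torch.randn(8, 1024, device="cuda")).sum()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()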

@HITESHLPATEL (Author) commented May 28, 2021

Thank you for the clarification. However, when I use autocast and the sharded grad scaler I get this error:
assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer
Without mixed precision and autocast, my loss stays the same across all epochs, which is very weird.

Can you share an example of how I can make all ranks participate in validation?

@min-xu-ai (Contributor)

cc @blefaudeux. Check the file tests/nn/data_parallel/test_fsdp_regnet.py. Did you use it this way:

(screenshot of the relevant test code omitted)

For validation, I don't know of an example in the test dir, but I could have missed one. Another good source to check is the fairseq code to see their validation code with FSDP, though I don't have a direct pointer, unfortunately.

@blefaudeux (Contributor)

Thank you for the clarification. However, when I use autocast and the sharded grad scaler I get this error:
AssertionError: No inf checks were recorded for this optimizer
Without mixed precision and autocast, my loss stays the same across all epochs, which is very weird.

Can you share an example of how I can make all ranks participate in validation?

It's a bit hard to understand the context here; it feels like there are a couple of hidden variables (could you share a code snippet if you're still stuck?). My guess is that you're missing a call, and the assert is there precisely for that. On the grad scaler side we follow the PyTorch way of doing this (see the doc); the differences are tied to the sharding but are not visible on the user side, so that doc should apply entirely.

If I misunderstood and you would like to sync some metric across ranks, you can use an all_reduce to consolidate it across ranks, for instance like it's done here (but be mindful that this is a collective operation; any rank skipping it will deadlock the whole fleet).
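For what it's worth, a rough sketch of that kind of consolidation (illustrative only, not the exact code linked above; it assumes the process group is already initialized):

import torch
import torch.distributed as dist

def consolidate_metric(value: float, device: torch.device) -> float:
    # every rank must call this, otherwise the collective deadlocks
    t = torch.tensor([value], dtype=torch.float32, device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    return (t / dist.get_world_size()).item()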

@HITESHLPATEL (Author)

def train(rank, args, model, params, device, train_loader, num_epochs):
    ##############
    # SETUP
    dist_init(rank, WORLD_SIZE, BACKEND)
    ddp = FullyShardedDataParallel(module=model, mixed_precision=True)
    optimizer = torch.optim.SGD(params, lr=0.01)

    ddp.train()

    # Training loop
    training_start = time.monotonic()

    loss_fn = nn.CrossEntropyLoss()

    model.train()
    measurements = []
    scaler = ShardedGradScaler()
    for epoch in range(num_epochs):
        epoch_start = time.monotonic()
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)

            model.zero_grad()
            with torch.cuda.amp.autocast(enabled=True):
                outputs = model(data)
                loss = loss_fn(outputs, target)

            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

@blefaudeux (Contributor)

(training snippet quoted above)

Thanks for the snippet! It's not visible here, but could it be that your optimizer is not tied to the proper model parameters? It's been a while since I spent time on this, but I think that could explain both the assert and the loss not changing (the optimizer sees no grads, so there are no inf checks and the loss stays constant).

@HITESHLPATEL (Author)

Could you please suggest what could be done in such scenarios? Or is it a bug?

@blefaudeux (Contributor)

Could you please suggest what could be done in such scenarios? Or is it a bug?

It's not possible to see from your snippet how the optimizer is defined, so I can only conjecture that maybe it does not have the right parameters. I would expect something like optimizer = torch.optim.MyFancyOptimizer(model.parameters()) after the model has been properly defined and wrapped. It may well be there, but if it was not properly declared, it could explain your symptoms.
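In other words, something along these lines (a sketch based on your snippet, not a tested fix):

import torch
from fairscale.nn import FullyShardedDataParallel

# `model` is your original nn.Module, as in your snippet
ddp = FullyShardedDataParallel(module=model, mixed_precision=True)
# build the optimizer from the wrapped module's parameters,
# not from a parameter list captured before wrapping
optimizer = torch.optim.SGD(ddp.parameters(), lr=0.01)
# (and the training loop should then call ddp(...) rather than the unwrapped model)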

@HITESHLPATEL (Author)

@blefaudeux Thank you. Could you point me to any sample code for validation? In standard PyTorch DDP, validation happens on rank 0. Since the model and states are sharded, does that still hold true?

@blefaudeux (Contributor)

@blefaudeux Thank you. Could you point me to any sample code for validation? In standard PyTorch DDP, validation happens on rank 0. Since the model and states are sharded, does that still hold true?

The ranks are still in sync after an optimizer step(), so you should be able to run the validation anywhere, if I'm not mistaken (true for OSS and ShardedDDP, and I believe also for FSDP, but pulling in @min-xu-ai just in case).
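In case it helps, a rough sketch of all-rank validation (assuming `model` is the FSDP-wrapped module, `val_loader` yields the same number of batches on every rank, and the process group is already initialized; the names are illustrative, not from the fairscale tests):

import torch
import torch.distributed as dist

def validate(model, val_loader, device):
    model.eval()
    correct = torch.zeros(1, device=device)
    total = torch.zeros(1, device=device)
    with torch.no_grad():
        for data, target in val_loader:
            data, target = data.to(device), target.to(device)
            # every rank runs the forward pass, so the parameter
            # all-gathers inside FSDP stay in sync across ranks
            preds = model(data).argmax(dim=1)
            correct += (preds == target).sum()
            total += target.numel()
    # consolidate the metric (collective: every rank must call it)
    dist.all_reduce(correct)
    dist.all_reduce(total)
    if dist.get_rank() == 0:
        print(f"validation accuracy: {(correct / total).item():.4f}")
    model.train()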

@HITESHLPATEL (Author)

@blefaudeux I was able to get rid of the inf-check error and the unchanging loss by setting flatten_parameters=False when doing mixed-precision training. However, my training now takes about twice as long as the standard FullyShardedDataParallel setup.
self.model = FullyShardedDataParallel(self.model, mixed_precision=True, flatten_parameters=False)  # works fine
self.model = FullyShardedDataParallel(self.model, mixed_precision=True, flatten_parameters=True)   # inf-check error or illegal memory access crash. I assume there is a bug.

@min-xu-ai (Contributor)

Thanks for narrowing it down a bit! There is likely a bug. Can you try to distill the test code into a small piece that demonstrates this issue?

@HITESHLPATEL (Author)

@min-xu-ai Thank you. I noticed that my error comes from the validation phase. Could you provide me with sample code on how to do validation? I am doing validation on rank 0 only, and it gives me a bad-param error.

@min-xu-ai (Contributor)

Sorry to hear that. I haven't done that myself; maybe others can chime in. Another option is to provide the failing code for me to debug the error. No promises on time to fix, though.

@HITESHLPATEL (Author) commented May 31, 2021

Here is the sample code from my training pipeline

Training Loop

dist_init(rank, WORLD_SIZE, BACKEND)
ddp = FullyShardedDataParallel(module=model,mixed_precision=True)
optimizer=torch.optim.SGD(params,lr=0.01)

ddp.train()

training_start = time.monotonic()

loss_fn = nn.CrossEntropyLoss()

model.train()
measurements = []
scaler = ShardedGradScaler()
for epoch in range(num_epochs):
    epoch_start = time.monotonic()
    for batch_idx, (data, target) in enumerate(train_loader[phase]):
        data, target = data.to(device), target.to(device)

        model.zero_grad()
        with torch.set_grad_enabled(phase == 'train') and torch.cuda.amp.autocast(enabled=True):
            outputs = model(data)
            loss = loss_fn(outputs, target)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

I am doing validation on rank 0 only. As soon as the training epoch completes, I hit this error:
x = F.conv2d(x, self.weight, self.bias, self.stride, self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM

Just changing to flatten_parameters=False makes everything work fine, but then the time is almost doubled.

min-xu-ai changed the title from "Fully Sharded Data Parallel" to "FSDP: issues with inferencing" on Jun 2, 2021
@HITESHLPATEL (Author) commented Jun 4, 2021

@blefaudeux @min-xu-ai I was trying to save my trained model on rank 0, while running inference on the validation set, by calling model.state_dict(), but it gets stuck there. Is there anything else that needs to be done to save the model, given that I am calling this on rank 0 only?

@blefaudeux (Contributor) commented Jun 4, 2021

@blefaudeux @min-xu-ai I was trying to save my trained model on rank 0, while running inference on the validation set, by calling model.state_dict(), but it gets stuck there. Is there anything else that needs to be done to save the model, given that I am calling this on rank 0 only?

@min-xu-ai will chime in, but you probably need to consolidate the model on this rank prior to pulling out the state; it should be mentioned in the doc.

Edit: either that, or call state_dict() on all ranks; I'm not sure now which one FSDP asks for. The "stuck" symptom is expected: there must be a collective operation that is only being called on a single rank, hence the deadlock.
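Roughly like this, if the all-rank variant is the one FSDP wants (a sketch, assuming `model` is the FSDP-wrapped module; the checkpoint path is just a placeholder):

import torch
import torch.distributed as dist

# call state_dict() on every rank: gathering the full parameters
# involves collective communication, so a rank-0-only call can hang
state = model.state_dict()

# only rank 0 actually writes the checkpoint to disk
if dist.get_rank() == 0:
    torch.save(state, "checkpoint.pt")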

@HITESHLPATEL (Author)

@min-xu-ai, can you point me to the documentation on how FSDP saves the model? I tried to search for this but didn't find any information.

@min-xu-ai (Contributor)

Take a look at this test:

def _load_local_and_train(self, config, rank, group, d_model=16, d_vocab=23):

The doc is here:
(screenshot of the relevant documentation omitted)

But I must admit that I have not personally debugged issues around state dict lately. I will need to refresh my own knowledge about it.

@HITESHLPATEL (Author) commented Jun 22, 2021

@min-xu-ai @blefaudeux Thank you. I was able to save the model. One more question: is there any test case that uses different resolutions for training and validation? For example, if the train resolution is 300 and the validation resolution is 384, it gives a CUDNN_STATUS_BAD_PARAM error. Using the same size for train and validation does not give an error. I think this has something to do with flatten_parameters.

@min-xu-ai (Contributor)

Good to hear!

If you suspect flatten_parameters, perhaps you can set it to False and test?

To debug CUDNN_STATUS_BAD_PARAM, perhaps you can disable cuDNN and try it?

torch.backends.cudnn.enabled = False would disable cuDNN.

cuDNN error messages are hard to debug and lack useful info. Sometimes they are related to running out of memory, sometimes to dtype and mixed precision.

@HITESHLPATEL (Author)

Thank you. Setting flatten_parameters=False solves that issue; however, it leads to an increase in training time.

@min-xu-ai (Contributor)

Thank you. Setting flatten_parameters=False solves that issue; however, it leads to an increase in training time.

Good. That's great for confirming. Now, for debugging the conv error, did you try the cudnn.enabled option above?

@min-xu-ai (Contributor)

I assume this issue is largely solved. Reopen if not.
