FSDP: issues with inferencing #703
Comments
Thank you for the clarification. However, when I use autocast and the sharded grad scaler I get this error. Can you share an example of how I can make all ranks participate in validation?
cc @blefaudeux. Check this file. For validation, I don't know of an example in the test dir, but I could have missed it. Another good source to check is the fairseq code, to see their validation code with FSDP, but I don't have a direct pointer unfortunately.
It's a bit hard to understand the context here; it feels like there are a couple of hidden variables (could you share a code snippet if you're still stuck?), but my guess is that you're missing a call, and the assert is there precisely for that. On the grad scaler side we're following the PyTorch way of doing this (see the doc); the differences are tied to the sharding but are not visible on the user side, so that doc should apply entirely. If I misunderstood and you would like to sync some metric across ranks, you can use an all_reduce to consolidate it across ranks, for instance like it's done here (but be mindful that this is a collective operation: any rank skipping it will deadlock the whole fleet).
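For the metric-sync case, a minimal sketch could look like the following (assuming `torch.distributed` has already been initialized; the helper name `sync_metric` is mine, not from the thread):

```python
import torch
import torch.distributed as dist

def sync_metric(metric: torch.Tensor) -> torch.Tensor:
    # all_reduce is a collective operation: EVERY rank must call this,
    # otherwise the ranks that do call it will block forever.
    dist.all_reduce(metric, op=dist.ReduceOp.SUM)
    metric /= dist.get_world_size()  # average the metric across ranks
    return metric
```

With `world_size == 1` this is a no-op average; with more ranks every process ends up holding the same averaged value.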
def train(rank, args, model, params, device, train_loader, num_epochs):
Thanks for the snippet! It's not visible here, but could it be that your optimizer is not tied to the proper model parameters? It's been a while since I spent time in that code, but I think that could explain both the assert and your loss not changing (the optimizer sees no grads -> no inf check, and the loss is constant).
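For reference, the standard autocast + grad-scaler loop looks roughly like the sketch below (with fairscale FSDP you would swap in the sharded scaler, the call pattern being the same; `model`, `criterion`, and `loader` are placeholders). The inf check mentioned above happens inside `scaler.step()`, which is why an optimizer holding the wrong parameters breaks it:

```python
import torch

def train_one_epoch(model, optimizer, criterion, loader, use_amp=False):
    # enabled=False makes both autocast and the scaler no-ops,
    # so this sketch also runs on CPU.
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    for x, y in loader:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=use_amp):
            loss = criterion(model(x), y)
        scaler.scale(loss).backward()  # backward on the scaled loss
        scaler.step(optimizer)         # unscale + inf/nan check + step
        scaler.update()
    return loss.item()
```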
Could you please suggest what could be done in such scenarios? Or is it a bug?
It's not possible to see from your snippet how the optimizer is defined, so I can only conjecture that maybe it does not have the right parameters. I would expect something like
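Something along these lines, i.e. the optimizer built from the wrapped model's parameters (sketch with a toy module; the point is the ordering, wrap first, then construct the optimizer):

```python
import torch

# Toy stand-in for the real model. With fairscale you would do
# model = FullyShardedDataParallel(model) *before* this step, so the
# optimizer sees the (possibly flattened) sharded parameters.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Sanity check: the optimizer holds exactly the model's parameters.
assert all(p is q for p, q in zip(model.parameters(),
                                  optimizer.param_groups[0]["params"]))
```

If the optimizer is created from the unwrapped module, it never sees the gradients FSDP actually produces, which matches the "no inf check, constant loss" symptom.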
@blefaudeux Thank you. Could you point me to some sample code for validation? In standard PyTorch DDP, validation happens on rank 0. Since the model and states are sharded, does that still hold true?
The ranks are still in sync after an optimizer step(), so you should be able to run the validation anywhere, if I'm not mistaken (true for OSS and ShardedDDP; also for FSDP I think, but pulling in @min-xu-ai just in case).
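A sketch of a validation loop that every rank runs (names are placeholders; the key point, per the deadlock discussion above, is that no rank skips the forward pass, since FSDP forwards involve collective all-gathers of the parameter shards):

```python
import torch

@torch.no_grad()
def validate(model, val_loader, device="cpu"):
    # Run this on EVERY rank: with FSDP the forward pass all-gathers
    # the parameter shards, so a rank that skips it stalls the rest.
    model.eval()
    correct, total = 0, 0
    for x, y in val_loader:
        preds = model(x.to(device)).argmax(dim=1)
        correct += (preds == y.to(device)).sum().item()
        total += y.numel()
    model.train()
    return correct / total
```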
@blefaudeux I was able to get rid of the inf-check error and the constant loss by setting flatten_parameters=False when doing mixed-precision training. However, my training now takes double the time compared to the standard FullyShardedDataParallel setup.
Thanks for narrowing it down a bit! There is likely a bug. Can you try to distill the test code into a small piece that demonstrates this issue?
@min-xu-ai Thank you. I noticed that my error comes from the validation phase. Could you provide me with sample code for how to do validation? I am running validation on rank 0 only, and it gives me a bad-param error.
Sorry to hear that. I haven't done that myself; maybe others can chime in. Another way is to provide the failing code for me to debug. No promises on time to fix, though.
Here is the sample code from my training pipeline (training loop):
I am doing validation on rank 0. As soon as the training epochs complete, I hit this error. Just changing flatten_parameters=False makes everything work fine, but then the time is almost double.
@blefaudeux @min-xu-ai I was trying to save my trained model on rank 0 when inferencing on the validation set, by calling model.state_dict(), but it gets stuck there. Is there anything else that needs to be done to save the model, since I am calling this on rank 0?
@min-xu-ai will chime in, but you probably need to consolidate the model on this rank prior to pulling out the state; it should be mentioned in the doc. Edit: this, or call state_dict() on all ranks; not sure right now which one FSDP asks for. The "stuck" syndrome is normal: there must be a collective operation which is being called on a single rank only, hence the deadlock.
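In sketch form, under the assumption that state_dict() must be called collectively on every rank while only rank 0 writes to disk (the helper name is mine):

```python
import torch

def save_checkpoint(model, rank, path):
    # Assumption: with FSDP, state_dict() can trigger a collective
    # gather of the parameter shards, so EVERY rank must call it,
    # not just rank 0.
    state = model.state_dict()
    # Only one rank actually writes the file.
    if rank == 0:
        torch.save(state, path)
```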
@min-xu-ai, can you point me to the document on how FSDP saves the model? I tried searching for this, but didn't find any information.
Take a look at this test:
But I must admit that I have not personally debugged issues around state dict lately. I will need to refresh my own knowledge about it. |
@min-xu-ai @blefaudeux Thank you. I was able to save the model. One more question: is there any test case with a different resolution during training and validation? E.g., if the train resolution is 300 and the val resolution is 384, it gives a CUDNN_STATUS_BAD_PARAM error. Using the same size for train and val does not give an error. I think this has something to do with flatten_parameters.
Good to hear! If you suspect flatten_parameters, perhaps you can set it to False and test? To debug CUDNN_STATUS_BAD_PARAM, perhaps you can disable cuDNN and retry? torch.backends.cudnn.enabled = False would disable cuDNN. cuDNN error messages are hard to debug and lack useful info; sometimes they are related to running out of memory, sometimes to dtype and mixed precision.
Thank you. Setting flatten_parameters=False solves that issue; however, it leads to an increase in time.
Good. That's great for confirming. Now, for debugging the conv error, did you try the cudnn.enabled option above? |
I assume this issue is largely solved. Reopen if not. |
❓ Questions and Help
I am trying to integrate FSDP into my code, and I have questions related to optimizer sharding. Does FSDP automatically shard the gradients, optimizer state, and parameters, or do I need to use OSS to shard the optimizer? Also, can anyone suggest the right way to do mixed precision with autocast and FSDP? And does validation/testing happen on GPU rank 0 or on all the nodes?