[chkpt conversion] handle the case where tp=0 , should be 1#146

Open
stas00 wants to merge 1 commit into main from chpt-conversion-fix

@stas00
Contributor

@stas00 stas00 commented Oct 20, 2021

This PR is trying to fix:

Traceback (most recent call last): 
 File "/gpfswork/rech/six/commun/code/Megatron-DeepSpeed/tools/convert_checkpoint/deepspeed_to_transformers.py", line 83, in <module> 
   main() 
 File "/gpfswork/rech/six/commun/code/Megatron-DeepSpeed/tools/convert_checkpoint/deepspeed_to_transformers.py", line 22, in main 
   ds_checkpoint = DeepSpeedCheckpoint(args.input_folder, args.target_tp, args.target_pp) 
 File "/gpfsssd/worksf/projects/rech/six/commun/code/Megatron-DeepSpeed/tools/convert_checkpoint/deepspeed_checkpoint.py", line 36, in __init__ 
   self.dp_degree = len(self.zero_files) // (self.original_pp_degree * self.original_tp_degree) 
ZeroDivisionError: integer division or modulo by zero

It seems we have original_tp_degree = 0 rather than 1.
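
To illustrate the kind of guard this is about, here is a minimal sketch, not the actual diff. The helper name `compute_dp_degree` and its arguments are hypothetical; only the division itself is taken from the traceback above. It assumes that a parallel degree detected as 0 really means 1 (no parallelism along that axis), and it applies that to both degrees.

```python
def compute_dp_degree(num_zero_files, pp_degree, tp_degree):
    """Hypothetical helper mirroring the failing division in the traceback."""
    # Assumption: a degree detected as 0 means "no parallelism", so treat it as 1.
    pp_degree = max(pp_degree, 1)
    tp_degree = max(tp_degree, 1)
    return num_zero_files // (pp_degree * tp_degree)


# With tp detected as 0, the division no longer raises ZeroDivisionError.
dp_degree = compute_dp_degree(num_zero_files=8, pp_degree=2, tp_degree=0)
```

Silently falling back to 1 keeps the conversion running, but it can also hide a checkpoint whose files were not detected correctly.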

@stas00 stas00 changed the title from "handle the case where tp=1" to "[chkpt conversion] handle the case where tp=0 , should be 1" on Oct 20, 2021
@stas00
Contributor Author

stas00 commented Oct 20, 2021

@thomasw21, please feel free to close this one or build on top of it, either way works.

I think we should also test that original_pp_degree != 0.

@thomasw21
Member

@stas00 perfect, I'll probably convert all of them to asserts. I've yet to rule out that the checkpoint file is corrupted ...
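
A rough sketch of what the assert-based variant could look like. The function name `check_degrees` and the messages are hypothetical and not taken from the repository; the idea is to fail early with a readable message instead of hitting a ZeroDivisionError later.

```python
def check_degrees(pp_degree, tp_degree):
    """Hypothetical validation: refuse to continue if a degree was detected as 0."""
    # A degree of 0 most likely means the checkpoint files were not found or parsed as expected.
    assert pp_degree > 0, f"invalid pipeline-parallel degree: {pp_degree}"
    assert tp_degree > 0, f"invalid tensor-parallel degree: {tp_degree}"


check_degrees(pp_degree=1, tp_degree=1)  # passes silently
```

Unlike falling back to 1, this surfaces a possibly corrupted checkpoint immediately. Note that asserts are skipped when Python runs with `-O`.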
