[Bugfix] Reproduce experimental results in docker image #9

Merged
chhzh123 merged 11 commits into awslabs:main from chhzh123:fix_docker_exp on Jan 22, 2023

@chhzh123
Contributor

Description

This PR fixes the errors encountered when reproducing the experimental results presented in the paper with the provided docker image. Most of the errors are dependency issues.

Checklist

The following items list the failed test cases and their root causes:

  • slapo-megatron, albert, gpu=1,2,4,8 (networkx is required by functorch; epoi must use the latest commit to support the injection policy)
  • slapo-deepspeed, albert, gpu=2,4,8 (same issue)
  • slapo-megatron, gpt, gpu=2,4,8 (same issue)
  • deepspeed, gpt, gpu=2,4,8 (missing datasets library)
  • slapo-deepspeed, gpt, gpu=2,4,8 (same issue)
  • slapo-megatron, wideresnet, gpu=2,4,8 (TorchScript is used in multi-device)
  • slapo-deepspeed, wideresnet, gpu=2,4,8 (same issue)
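
Since most of the failures above come from missing packages, a quick import check inside the container can surface them before running the experiments. The following is a minimal sketch, not part of this PR: the helper `missing_modules` is hypothetical, and the import names in `REQUIRED_MODULES` are assumed from the checklist. It only checks that each module can be located, so it does not verify that epoi is at the latest commit.

```python
import importlib.util

# Import names assumed from the checklist above; the project's actual
# requirements may differ.
REQUIRED_MODULES = ["networkx", "functorch", "epoi", "datasets"]


def missing_modules(names):
    """Return the subset of `names` that cannot be located for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for malformed or partially loaded modules.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules are importable.")
```

Running this script in the docker image before the benchmarks would report any of the listed modules that are absent.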

@chhzh123
Contributor Author

I have run all the end-to-end single-device and single-node experiments in the docker image and verified that the results can be reproduced.

@chhzh123 chhzh123 requested a review from comaniac January 20, 2023 09:33
Contributor

@comaniac comaniac left a comment

I'll update the docker image accordingly.

@comaniac
Contributor

Docker image updated.

Contributor

@comaniac comaniac left a comment

Otherwise LGTM

chhzh123 and others added 2 commits January 22, 2023 15:17
Co-authored-by: Cody Yu <comaniac0422@gmail.com>
@chhzh123 chhzh123 merged commit a506761 into awslabs:main Jan 22, 2023