[paved path] basic charnn with GPT example #27
Conversation
@hudeven has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@hudeven has updated the pull request. You must reimport the pull request before landing.
Is this ready for review? Sorry, I started adding comments before asking.
@edward-io yes, thanks for taking a look! As this is the initial check-in from earlier prototyping repos, there is a lot to improve. Any feedback is welcome!
```python
import os

import torch


def get_device() -> torch.device:
    """Pick the CUDA device for this process's local rank, or fall back to CPU."""
    if torch.cuda.is_available():
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))
        device = torch.device(f"cuda:{local_rank}")
        torch.cuda.set_device(device)
    else:
        device = torch.device("cpu")
    return device
```
Should we use the util here?
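A hedged sketch of what using a shared utility could look like. The `torchtnt` import path is an assumption, so the snippet falls back to the PR's own `get_device` logic when the library isn't installed:

```python
import os

import torch

try:
    # torchtnt ships a device helper; the exact import path is assumed here
    from torchtnt.utils import get_device_from_env
except ImportError:
    # Fallback mirroring this PR's get_device logic
    def get_device_from_env() -> torch.device:
        if torch.cuda.is_available():
            local_rank = int(os.environ.get("LOCAL_RANK", "0"))
            return torch.device(f"cuda:{local_rank}")
        return torch.device("cpu")

device = get_device_from_env()
```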
```python
def get_ddp_model_and_optimizer(
    gpt_config: GPTConfig, opt_config: OptimizerConfig, checkpoint: Optional[Checkpoint]
) -> Tuple[torch.nn.Module, torch.optim.Optimizer]:
    # Create a new GPT model on CPU
    model = GPT(gpt_config)
    # Load the model weights from the checkpoint, if one is given
    if checkpoint:
        model.load_state_dict(checkpoint.model_state)
    device = get_device()
    device_ids = None
    if device.type == "cuda":
        model = model.to(device)
        device_ids = [device.index]
    model = DistributedDataParallel(
        model,
        device_ids=device_ids,
    )
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=opt_config.lr, weight_decay=opt_config.weight_decay
    )
    return model, optimizer
```
Should we try out torchsnapshot here? @yifuwang ?
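For reference, a minimal sketch of what torchsnapshot-based checkpointing could look like. The snapshot path and the app-state keys are illustrative, a tiny linear layer stands in for the GPT model, and the snippet skips itself when torchsnapshot isn't installed:

```python
import torch

try:
    import torchsnapshot
    HAVE_TORCHSNAPSHOT = True
except ImportError:
    HAVE_TORCHSNAPSHOT = False

# Stand-ins for the GPT model/optimizer (illustrative only)
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# torchsnapshot persists an "app state": a dict of stateful objects
app_state = {"model": model, "optim": optimizer}

if HAVE_TORCHSNAPSHOT:
    # Take a snapshot of model + optimizer state, then restore it in place
    snapshot = torchsnapshot.Snapshot.take(path="/tmp/charnn-snapshot", app_state=app_state)
    snapshot.restore(app_state=app_state)
```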
```python
import os

import torch
import torch.distributed as dist


def setup_process_group() -> None:
    device = get_device()
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    # NCCL for GPU collectives, Gloo as the CPU fallback
    if device.type == "cuda":
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
    else:
        dist.init_process_group("gloo", rank=rank, world_size=world_size)
```
Assuming we're launching with torchrun/torchx, this would initialize both the device and the process group:
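In case it helps, here is a hedged sketch of such a combined init under torchrun/torchx (which export `RANK`, `WORLD_SIZE`, and `LOCAL_RANK`). The helper name is hypothetical; torchtnt's `init_from_env` covers similar ground:

```python
import os

import torch
import torch.distributed as dist


def init_device_and_process_group() -> torch.device:
    # Hypothetical helper: pick the device from LOCAL_RANK, then initialize
    # the process group only when the launcher exported rank information.
    if torch.cuda.is_available():
        device = torch.device(f"cuda:{int(os.environ.get('LOCAL_RANK', '0'))}")
        torch.cuda.set_device(device)
        backend = "nccl"
    else:
        device = torch.device("cpu")
        backend = "gloo"
    if not dist.is_initialized() and "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        # With the default env:// store, init_process_group reads
        # RANK/WORLD_SIZE from the environment itself
        dist.init_process_group(backend=backend)
    return device


dev = init_device_and_process_group()
```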
```python
def generate_seq(cfg: DictConfig, model: torch.nn.Module, dataset: CharDataset) -> None:
    if dist.get_rank() == 0:
```
Is distributed always initialized? We cover this case here: https://github.com/pytorch/tnt/blob/5fd77ae9ad8357079cebd14532cb40a2ce8e5761/torchtnt/utils/distributed.py#L98-L110
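A sketch in the spirit of the linked helper (not the exact torchtnt source): resolve the rank safely whether or not the process group has been initialized, so the rank-0 guard in `generate_seq` can't crash on single-process runs:

```python
import os

import torch.distributed as dist


def get_global_rank() -> int:
    # Prefer the initialized process group, fall back to the RANK env var,
    # and default to 0 for single-process runs.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return int(os.environ.get("RANK", "0"))


if get_global_rank() == 0:
    pass  # safe place for rank-0-only work such as generate_seq
```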
```python
import random

import torch


def set_seed(seed: int) -> None:
    random.seed(seed)
    torch.manual_seed(seed)
```
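A quick check that reseeding reproduces the same draws. Extending `set_seed` with `torch.cuda.manual_seed_all` is a common addition for multi-GPU determinism; it is shown here as an assumption, not as part of the PR:

```python
import random

import torch


def set_seed(seed: int) -> None:
    random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        # Assumed extension: seed every visible CUDA device as well
        torch.cuda.manual_seed_all(seed)


set_seed(42)
a = torch.rand(3)
set_seed(42)
b = torch.rand(3)  # identical to `a` after reseeding
```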
@ananthsub thanks for the comments! All of them make sense, and I plan to adopt those libraries soon. For now, the main goal of this pull request is to land the basic example and then set up the pipeline config/instructions on top of it.
Testing:

Single GPU:

```shell
python main.py
```

Multiple GPUs:

```shell
torchrun --nnodes 1 --nproc_per_node 4 \
  --rdzv_backend c10d \
  --rdzv_endpoint localhost:29500 \
  main.py
```

torchx, single GPU:

```shell
torchx run -s local_cwd dist.ddp -j 1x1 --script main.py
```

torchx, multiple GPUs:

```shell
torchx run -s local_cwd dist.ddp -j 1x4 --script main.py
```