
[paved path] basic charnn with GPT example #27

Closed
wants to merge 14 commits

Conversation

@hudeven (Contributor) commented Aug 8, 2022:

  1. We decided to put the paved-path examples in the torchrecipes repo.
  2. This is an initial commit based on previous work in https://github.com/aivanou/disttraining and https://github.com/dracifer/disttraining.
  3. It's a basic example without any external dependencies (MLflow logging, Airflow, TorchX, etc.); those will be added soon.

Testing:

  • single GPU
    python main.py

  • multi GPU
    torchrun --nnodes 1 --nproc_per_node 4 \
      --rdzv_backend c10d \
      --rdzv_endpoint localhost:29500 \
      main.py

  • torchx single GPU
    torchx run -s local_cwd dist.ddp -j 1x1 --script main.py

  • torchx multi GPU
    torchx run -s local_cwd dist.ddp -j 1x4 --script main.py

@facebook-github-bot added the CLA Signed label on Aug 8, 2022.
@facebook-github-bot (Contributor) commented:

@hudeven has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) commented:

@hudeven has updated the pull request. You must reimport the pull request before landing.

@edward-io (Contributor) left a comment:

Is this ready for review? Sorry, I started adding comments before asking.

Outdated review threads (resolved):

  • torchrecipes/paved_path/charnn/char_dataset.py (2)
  • torchrecipes/paved_path/charnn/main.py (5)
  • torchrecipes/paved_path/charnn/model.py (2)
@facebook-github-bot (Contributor) commented:

@hudeven has updated the pull request. You must reimport the pull request before landing.

@hudeven (Contributor, Author) commented Aug 9, 2022:

@edward-io yes, thanks for taking a look! As this is the initial check-in from previous prototyping repos, there is a lot to improve. Any feedback is welcome!

@facebook-github-bot (Contributor) commented:

@hudeven has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot (Contributor) commented:

@hudeven has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Comment on lines +49 to +56:

    def get_device() -> torch.device:
        if torch.cuda.is_available():
            local_rank = int(os.environ.get("LOCAL_RANK", "0"))
            device = torch.device(f"cuda:{local_rank}")
            torch.cuda.set_device(device)
        else:
            device = torch.device("cpu")
        return device
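For context, torchrun/torchx export LOCAL_RANK for every worker process, so each rank binds to its own GPU. A small usage sketch (the setdefault call is only there to make it runnable outside a launcher):

    import os

    # torchrun sets LOCAL_RANK per worker; default it for a bare, single-process run.
    os.environ.setdefault("LOCAL_RANK", "0")

    device = get_device()
    print(device)  # e.g. "cuda:0" on a GPU host, "cpu" otherwise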

Comment on lines +60 to +80:

    def get_ddp_model_and_optimizer(
        gpt_config: GPTConfig, opt_config: OptimizerConfig, checkpoint: Optional[Checkpoint]
    ) -> Tuple[torch.nn.Module, torch.optim.Optimizer]:
        # Create new GPT model on CPU
        model = GPT(gpt_config)
        # Load GPT model from checkpoint if present
        if checkpoint:
            model.load_state_dict(checkpoint.model_state)
        device = get_device()
        device_ids = None
        if device.type == "cuda":
            model = model.to(device)
            device_ids = [device.index]
        model = DistributedDataParallel(
            model,
            device_ids=device_ids,
        )
        optimizer = torch.optim.AdamW(
            model.parameters(), lr=opt_config.lr, weight_decay=opt_config.weight_decay
        )
        return model, optimizer

@ananthsub (Contributor) commented: should we try out torchsnapshot here? @yifuwang?
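For reference, a minimal sketch of what torchsnapshot could look like here (assuming the torchsnapshot package is installed; the snapshot path and app_state keys are illustrative, not from this PR):

    from torchsnapshot import Snapshot

    # Label the stateful objects to capture; torchsnapshot calls
    # state_dict()/load_state_dict() on them under the hood.
    app_state = {"model": model, "optimizer": optimizer}

    # Take a snapshot; works for both single-process and distributed runs.
    snapshot = Snapshot.take(path="/tmp/charnn_snapshot", app_state=app_state)

    # On a later run, restore the same objects in place.
    Snapshot(path="/tmp/charnn_snapshot").restore(app_state=app_state)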

Comment on lines +95 to +102:

    def setup_process_group() -> None:
        device = get_device()
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        if device.type == "cuda":
            dist.init_process_group("nccl", rank=rank, world_size=world_size)
        else:
            dist.init_process_group("gloo", rank=rank, world_size=world_size)

@ananthsub (Contributor) commented: assuming we're launching with torchrun/torchx, this would initialize both the device and the process group:

https://github.com/pytorch/tnt/blob/5fd77ae9ad8357079cebd14532cb40a2ce8e5761/torchtnt/utils/env.py#L34-L38
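For concreteness, a minimal sketch of that approach (assuming torchtnt is installed; init_from_env is the utility linked above):

    from torchtnt.utils.env import init_from_env

    # Reads the RANK/WORLD_SIZE/LOCAL_RANK variables set by torchrun/torchx,
    # selects cuda:LOCAL_RANK when CUDA is available (cpu otherwise), and
    # initializes the process group with a matching backend.
    device = init_from_env()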



    def generate_seq(cfg: DictConfig, model: torch.nn.Module, dataset: CharDataset) -> None:
        if dist.get_rank() == 0:

Comment on lines +115 to +117:

    def set_seed(seed: int) -> None:
        random.seed(seed)
        torch.manual_seed(seed)
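Worth noting: torch.manual_seed already seeds the CUDA RNGs on all devices as well. If NumPy randomness were ever used, a fuller helper could look like this (a sketch; the NumPy dependency is an assumption, not part of this PR):

    import random

    import numpy as np
    import torch

    def set_seed(seed: int) -> None:
        random.seed(seed)
        np.random.seed(seed)     # only needed if the NumPy RNG is used anywhere
        torch.manual_seed(seed)  # seeds the CPU RNG and all CUDA RNGs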

@hudeven (Contributor, Author) commented Aug 11, 2022:

@ananthsub thanks for the comments! All of them make sense, and I plan to adopt those libraries soon. Currently, the main goal of this pull request is to add the basic example; pipeline setup, config, and instructions will be built on top of it.
