[paved path] basic charnn with GPT example #27
Conversation
@hudeven has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@hudeven has updated the pull request. You must reimport the pull request before landing.
Is this ready for review? Sorry, I started adding comments before asking.
@edward-io yes, thanks for taking a look! As this is the initial check-in from earlier prototyping repos, there is a lot to improve. Any feedback is welcome!
```python
import os

import torch


def get_device() -> torch.device:
    """Pick the CUDA device for this process's local rank, or fall back to CPU."""
    if torch.cuda.is_available():
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))
        device = torch.device(f"cuda:{local_rank}")
        torch.cuda.set_device(device)
    else:
        device = torch.device("cpu")
    return device
```
Should we use the util here?
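A hedged sketch of what using a shared utility could look like. The `torchtnt` import path is an assumption, so the snippet falls back to the PR's own `get_device` logic when the library isn't installed:

```python
import os

import torch

try:
    # torchtnt ships a device helper; the exact import path is assumed here
    from torchtnt.utils import get_device_from_env
except ImportError:
    # Fallback mirroring this PR's get_device logic
    def get_device_from_env() -> torch.device:
        if torch.cuda.is_available():
            local_rank = int(os.environ.get("LOCAL_RANK", "0"))
            return torch.device(f"cuda:{local_rank}")
        return torch.device("cpu")

device = get_device_from_env()
```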
```python
def get_ddp_model_and_optimizer(
    gpt_config: GPTConfig, opt_config: OptimizerConfig, checkpoint: Optional[Checkpoint]
) -> Tuple[torch.nn.Module, torch.optim.Optimizer]:
    # Create a new GPT model on CPU
    model = GPT(gpt_config)
    # Load the model weights from the checkpoint, if one is given
    if checkpoint:
        model.load_state_dict(checkpoint.model_state)
    device = get_device()
    device_ids = None
    if device.type == "cuda":
        model = model.to(device)
        device_ids = [device.index]
    model = DistributedDataParallel(
        model,
        device_ids=device_ids,
    )
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=opt_config.lr, weight_decay=opt_config.weight_decay
    )
    return model, optimizer
```
Should we try out torchsnapshot here? @yifuwang ?
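For reference, a minimal sketch of what torchsnapshot-based checkpointing could look like. The snapshot path and the app-state keys are illustrative, a tiny linear layer stands in for the GPT model, and the snippet skips itself when torchsnapshot isn't installed:

```python
import torch

try:
    import torchsnapshot
    HAVE_TORCHSNAPSHOT = True
except ImportError:
    HAVE_TORCHSNAPSHOT = False

# Stand-ins for the GPT model/optimizer (illustrative only)
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# torchsnapshot persists an "app state": a dict of stateful objects
app_state = {"model": model, "optim": optimizer}

if HAVE_TORCHSNAPSHOT:
    # Take a snapshot of model + optimizer state, then restore it in place
    snapshot = torchsnapshot.Snapshot.take(path="/tmp/charnn-snapshot", app_state=app_state)
    snapshot.restore(app_state=app_state)
```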
```python
import os

import torch
import torch.distributed as dist


def setup_process_group() -> None:
    device = get_device()
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    # NCCL for GPU collectives, Gloo as the CPU fallback
    if device.type == "cuda":
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
    else:
        dist.init_process_group("gloo", rank=rank, world_size=world_size)
```
Assuming we're launching with torchrun/torchx, this would initialize both the device and the process group:
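In case it helps, here is a hedged sketch of such a combined init under torchrun/torchx (which export `RANK`, `WORLD_SIZE`, and `LOCAL_RANK`). The helper name is hypothetical; torchtnt's `init_from_env` covers similar ground:

```python
import os

import torch
import torch.distributed as dist


def init_device_and_process_group() -> torch.device:
    # Hypothetical helper: pick the device from LOCAL_RANK, then initialize
    # the process group only when the launcher exported rank information.
    if torch.cuda.is_available():
        device = torch.device(f"cuda:{int(os.environ.get('LOCAL_RANK', '0'))}")
        torch.cuda.set_device(device)
        backend = "nccl"
    else:
        device = torch.device("cpu")
        backend = "gloo"
    if not dist.is_initialized() and "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        # With the default env:// store, init_process_group reads
        # RANK/WORLD_SIZE from the environment itself
        dist.init_process_group(backend=backend)
    return device


dev = init_device_and_process_group()
```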
```python
def generate_seq(cfg: DictConfig, model: torch.nn.Module, dataset: CharDataset) -> None:
    if dist.get_rank() == 0:
```
Is distributed always initialized? We cover this case here: https://github.com/pytorch/tnt/blob/5fd77ae9ad8357079cebd14532cb40a2ce8e5761/torchtnt/utils/distributed.py#L98-L110
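A sketch in the spirit of the linked helper (not the exact torchtnt source): resolve the rank safely whether or not the process group has been initialized, so the rank-0 guard in `generate_seq` can't crash on single-process runs:

```python
import os

import torch.distributed as dist


def get_global_rank() -> int:
    # Prefer the initialized process group, fall back to the RANK env var,
    # and default to 0 for single-process runs.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return int(os.environ.get("RANK", "0"))


if get_global_rank() == 0:
    pass  # safe place for rank-0-only work such as generate_seq
```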
```python
import random

import torch


def set_seed(seed: int) -> None:
    random.seed(seed)
    torch.manual_seed(seed)
```
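A quick check that reseeding reproduces the same draws. Extending `set_seed` with `torch.cuda.manual_seed_all` is a common addition for multi-GPU determinism; it is shown here as an assumption, not as part of the PR:

```python
import random

import torch


def set_seed(seed: int) -> None:
    random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        # Assumed extension: seed every visible CUDA device as well
        torch.cuda.manual_seed_all(seed)


set_seed(42)
a = torch.rand(3)
set_seed(42)
b = torch.rand(3)  # identical to `a` after reseeding
```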
@ananthsub thanks for the comments! All of them make sense, and I plan to adopt those libraries soon. For now, the main goal of this pull request is to land the basic example and then set up the pipeline config/instructions on top of it.
Testing:

Single GPU:

```shell
python main.py
```

Multiple GPUs:

```shell
torchrun --nnodes 1 --nproc_per_node 4 \
  --rdzv_backend c10d \
  --rdzv_endpoint localhost:29500 \
  main.py
```

torchx, single GPU:

```shell
torchx run -s local_cwd dist.ddp -j 1x1 --script main.py
```

torchx, multiple GPUs:

```shell
torchx run -s local_cwd dist.ddp -j 1x4 --script main.py
```