Merge pull request #1 from arcee-ai/mock-gpt-mp-training
modified the model-parallelized GPT pre-training script
shamanez committed Apr 22, 2024
2 parents ccfeda4 + f934094 commit 05a3bdf
Showing 5 changed files with 96 additions and 1 deletion.
70 changes: 70 additions & 0 deletions examples/docker_setup/README.md
@@ -0,0 +1,70 @@
## Quick Start Guide to Running Your PyTorch Docker Container

### Step 1: Create the Dockerfile

1. **Open Terminal**: Open a terminal on your Ubuntu machine.
2. **Create Dockerfile**: Enter `nano Dockerfile` to create and edit a new Dockerfile.
3. **Enter Dockerfile Content**:
```dockerfile
# Use an official PyTorch image as a base
FROM nvcr.io/nvidia/pytorch:latest

# Set the working directory inside the container
WORKDIR /workspace

# Copy the local code (including requirements.txt) into the container's workspace
COPY ./ /workspace/

# Install any necessary dependencies
RUN pip install -r requirements.txt

# Set the default command to execute
CMD ["/bin/bash"]
```
Replace `latest` with the specific NGC PyTorch tag you need (for example, `24.03-py3`). Modify `requirements.txt` to include all necessary Python packages. Note that `requirements.txt` must be copied into the image before the `RUN pip install` step, which is why the `COPY` instruction comes first.
4. **Save and Exit**: Press `Ctrl+O`, hit `Enter` to save, then `Ctrl+X` to exit `nano`.

### Step 2: Build Your Docker Image

1. **Build Image**: In your terminal, run:
```bash
docker build -t my-pytorch-app .
```
This command builds the Docker image named `my-pytorch-app` using the Dockerfile in the current directory.
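To confirm the build succeeded, you can list the image afterwards (a quick optional check, not part of the original guide):

```shell
# List the freshly built image; an empty table means the build failed or was tagged differently
docker images my-pytorch-app
```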

### Step 3: Create the Docker Run Script

1. **Open Terminal**: Open a terminal on your Ubuntu machine.
2. **Create Script File**: Enter `nano run_pytorch_docker.sh` to create and edit a new shell script.
3. **Enter Script Content**:
```bash
#!/bin/bash
# This script runs a Docker container with necessary volume mounts for the PyTorch application.

docker run --gpus all -it --rm \
-v /path/to/megatron:/workspace/megatron \
-v /path/to/dataset:/workspace/dataset \
-v /path/to/checkpoints:/workspace/checkpoints \
my-pytorch-app \
/bin/bash
```
Replace `/path/to/megatron`, `/path/to/dataset`, and `/path/to/checkpoints` with the actual paths to your resources. Running the container this way drops you into an interactive shell.
4. **Save and Exit**: Press `Ctrl+O`, hit `Enter` to save, then `Ctrl+X` to exit `nano`.
5. **Make Executable**: Run `chmod +x run_pytorch_docker.sh` to make your script executable.

### Step 4: Run the Docker Container

- **Execute the Script**: In your terminal, type `./run_pytorch_docker.sh` to start the Docker container. This script mounts specified directories and opens a container with GPU access enabled.
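Once inside the container, a quick sanity check confirms the GPUs are visible (hypothetical commands; `nvidia-smi` and PyTorch ship with the NGC base image, but your setup may differ):

```shell
# Inside the container: list the visible GPUs
nvidia-smi

# Confirm PyTorch can see CUDA and count the devices
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```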


### Step 5: Debugging Inside the Container

Once your Docker container is running and you're inside its interactive shell, you can proceed as if you're in a typical development environment:

- **Full Access to Libraries**: All libraries and tools installed in the Docker image are at your disposal. You can run commands, execute scripts, and use your usual debugging tools just like on a local machine.
- **Normal Operation**: Interact with the terminal as you would in any Linux environment. Edit, execute, and debug your applications directly inside the container using the command line or any terminal-based editors like Vim or Nano.

This setup provides a seamless experience for development and debugging, ensuring that your work environment is both controlled and replicable.
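For example, with the volume mounts above in place, you could launch the pre-training script that this commit modifies (a sketch assuming the repository is mounted at `/workspace/megatron` as shown earlier):

```shell
# Inside the container: run the model-parallel GPT pre-training example
cd /workspace/megatron
bash examples/pretrain_gpt_distributed_with_mp.sh
```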

### Step 6: Exit the Container

- **To Exit**: Type `exit` in the container's terminal. The container will stop, and due to the `--rm` flag, it will also be automatically removed, cleaning up your system.
10 changes: 10 additions & 0 deletions examples/docker_setup/docker_run.sh
@@ -0,0 +1,10 @@
#!/bin/bash

# This script runs a Docker container with the necessary volume mounts for the PyTorch application.

docker run --ipc=host --shm-size=512m --gpus all -it --rm \
-v /home/ubuntu/src/Megatron-LM:/workspace/megatron \
-v /home/ubuntu/src/dataset-dir:/workspace/dataset \
-v /home/ubuntu/src/checkpoint-dir:/workspace/checkpoints \
my-pytorch-app \
/bin/bash
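The hard-coded paths in `docker_run.sh` could be factored into variables so each host path is defined once (a sketch, not part of the commit; the directory names are the same ones used above):

```shell
#!/bin/bash
# Define each host path once so the mount list stays consistent
MEGATRON_DIR=/home/ubuntu/src/Megatron-LM
DATASET_DIR=/home/ubuntu/src/dataset-dir
CKPT_DIR=/home/ubuntu/src/checkpoint-dir

# Assemble the -v flags from the variables (backslash-newlines fold into one line)
MOUNTS="-v ${MEGATRON_DIR}:/workspace/megatron \
-v ${DATASET_DIR}:/workspace/dataset \
-v ${CKPT_DIR}:/workspace/checkpoints"

echo "$MOUNTS"
```

The assembled `$MOUNTS` string can then be passed to `docker run --gpus all -it --rm $MOUNTS my-pytorch-app /bin/bash` unquoted so the flags split as separate arguments.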
11 changes: 11 additions & 0 deletions examples/docker_setup/dockerfile
@@ -0,0 +1,11 @@
# Use NVIDIA's PyTorch image as the base
FROM nvcr.io/nvidia/pytorch:24.03-py3

# Set the working directory in the container
WORKDIR /app

# Copy the requirements file into the container at /app
COPY requirements.txt /app/

# Install any needed packages specified in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
3 changes: 3 additions & 0 deletions examples/docker_setup/requirements.txt
@@ -0,0 +1,3 @@
transformers
datasets
sentencepiece
3 changes: 2 additions & 1 deletion examples/pretrain_gpt_distributed_with_mp.sh
@@ -28,6 +28,7 @@ DISTRIBUTED_ARGS="
 GPT_ARGS="
     --tensor-model-parallel-size 2 \
     --pipeline-model-parallel-size 2 \
+    --attention-softmax-in-fp32 \
     --sequence-parallel \
     --num-layers 24 \
     --hidden-size 1024 \
@@ -44,7 +45,7 @@ GPT_ARGS="
     --weight-decay 1e-2 \
     --lr-warmup-fraction .01 \
     --clip-grad 1.0 \
-    --fp16
+    --bf16
 "

DATA_ARGS="