Modified the model-parallelized GPT pre-training script #1

Merged 1 commit on Apr 22, 2024
70 changes: 70 additions & 0 deletions examples/docker_setup/README.md
@@ -0,0 +1,70 @@
## Quick Start Guide to Running Your PyTorch Docker Container

### Step 1: Create the Dockerfile

1. **Open Terminal**: Open a terminal on your Ubuntu machine.
2. **Create Dockerfile**: Enter `nano Dockerfile` to create and edit a new Dockerfile.
3. **Enter Dockerfile Content**:
```dockerfile
# Use an official NVIDIA NGC PyTorch image as a base
FROM nvcr.io/nvidia/pytorch:24.03-py3

# Set the working directory inside the container
WORKDIR /workspace

# Copy the local code (including requirements.txt) into the container's workspace
COPY ./ /workspace/

# Install any necessary Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Set the default command to execute
CMD ["/bin/bash"]
```
   Replace `24.03-py3` with the NGC PyTorch release you need. Modify `requirements.txt` to include all necessary Python packages; a minimal starting point is shown below.
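
For reference, the `requirements.txt` added under `examples/docker_setup` in this PR lists just three packages. Assuming you are in the same directory as the Dockerfile, one way to create a matching starting point:
```bash
# Create a minimal requirements.txt alongside the Dockerfile
# (contents match examples/docker_setup/requirements.txt in this PR).
cat > requirements.txt <<'EOF'
transformers
datasets
sentencepiece
EOF
```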

### Step 2: Build Your Docker Image

1. **Build Image**: In your terminal, run:
```bash
docker build -t my-pytorch-app .
```
This command builds the Docker image named `my-pytorch-app` using the Dockerfile in the current directory.
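
To confirm the build succeeded, you can list the image and check the PyTorch installation inside it (an optional sanity check, assuming the image name used above):
```bash
# The REPOSITORY column should show my-pytorch-app with a recent CREATED time.
docker images my-pytorch-app

# Print the PyTorch version baked into the image.
docker run --rm my-pytorch-app python -c "import torch; print(torch.__version__)"
```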

### Step 3: Create the Docker Run Script

1. **Open Terminal**: Open a terminal on your Ubuntu machine.
2. **Create Script File**: Enter `nano run_pytorch_docker.sh` to create and edit a new shell script.
3. **Enter Script Content**:
```bash
#!/bin/bash
# This script runs a Docker container with necessary volume mounts for the PyTorch application.

docker run --gpus all -it --rm \
-v /path/to/megatron:/workspace/megatron \
-v /path/to/dataset:/workspace/dataset \
-v /path/to/checkpoints:/workspace/checkpoints \
my-pytorch-app \
/bin/bash
```
   Replace `/path/to/megatron`, `/path/to/dataset`, and `/path/to/checkpoints` with the actual paths to your resources. Running this script drops you into an interactive shell inside the container.
4. **Save and Exit**: Press `Ctrl+O`, hit `Enter` to save, then `Ctrl+X` to exit `nano`.
5. **Make Executable**: Run `chmod +x run_pytorch_docker.sh` to make your script executable.
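
Note that the `docker_run.sh` added under `examples/docker_setup` in this PR also passes `--ipc=host` and `--shm-size=512m`; PyTorch DataLoader workers often need the extra shared memory, so a variant of the script with those flags is a reasonable default:
```bash
#!/bin/bash
# Variant with host IPC and a larger shared-memory segment for PyTorch
# DataLoader workers (flags match examples/docker_setup/docker_run.sh).
docker run --ipc=host --shm-size=512m --gpus all -it --rm \
    -v /path/to/megatron:/workspace/megatron \
    -v /path/to/dataset:/workspace/dataset \
    -v /path/to/checkpoints:/workspace/checkpoints \
    my-pytorch-app \
    /bin/bash
```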

### Step 4: Run the Docker Container

- **Execute the Script**: In your terminal, type `./run_pytorch_docker.sh` to start the Docker container. This script mounts specified directories and opens a container with GPU access enabled.
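
Once inside the container, a quick optional check confirms that the GPUs and the PyTorch build are visible:
```bash
# Should list every GPU passed through by --gpus all.
nvidia-smi

# Should print True followed by the number of visible CUDA devices.
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```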


### Step 5: Debugging Inside the Container

Once your Docker container is running and you're inside its interactive shell, you can proceed as if you're in a typical development environment:

- **Full Access to Libraries**: All libraries and tools installed in the Docker image are at your disposal. You can run commands, execute scripts, and use your usual debugging tools just like on a local machine.
- **Normal Operation**: Interact with the terminal as you would in any Linux environment. Edit, execute, and debug your applications directly inside the container using the command line or any terminal-based editors like Vim or Nano.

This setup provides a seamless experience for development and debugging, ensuring that your work environment is both controlled and replicable.
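
As a concrete example, here is a sketch of launching the pre-training script modified in this PR from inside the container. It assumes the Megatron-LM checkout is mounted at `/workspace/megatron` as in the run script above, and that you have already filled in the dataset, vocab, and checkpoint paths inside the script:
```bash
# Launch the tensor/pipeline-parallel GPT pre-training example from the mounted repo.
cd /workspace/megatron
bash examples/pretrain_gpt_distributed_with_mp.sh
```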

### Step 6: Exit the Container

- **To Exit**: Type `exit` in the container's terminal. The container will stop, and due to the `--rm` flag, it will also be automatically removed, cleaning up your system.
10 changes: 10 additions & 0 deletions examples/docker_setup/docker_run.sh
@@ -0,0 +1,10 @@
#!/bin/bash

# This script runs a Docker container with the necessary volume mounts for the PyTorch application.

docker run --ipc=host --shm-size=512m --gpus all -it --rm \
-v /home/ubuntu/src/Megatron-LM:/workspace/megatron \
-v /home/ubuntu/src/dataset-dir:/workspace/dataset \
-v /home/ubuntu/src/checkpoint-dir:/workspace/checkpoints \
my-pytorch-app \
/bin/bash
11 changes: 11 additions & 0 deletions examples/docker_setup/dockerfile
@@ -0,0 +1,11 @@
# Use NVIDIA's PyTorch image as the base
FROM nvcr.io/nvidia/pytorch:24.03-py3

# Set the working directory in the container
WORKDIR /app

# Copy the requirements file into the container at /app
COPY requirements.txt /app/

# Install any needed packages specified in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
3 changes: 3 additions & 0 deletions examples/docker_setup/requirements.txt
@@ -0,0 +1,3 @@
transformers
datasets
sentencepiece
3 changes: 2 additions & 1 deletion examples/pretrain_gpt_distributed_with_mp.sh
@@ -28,6 +28,7 @@ DISTRIBUTED_ARGS="
 GPT_ARGS="
     --tensor-model-parallel-size 2 \
     --pipeline-model-parallel-size 2 \
+    --attention-softmax-in-fp32 \
     --sequence-parallel \
     --num-layers 24 \
     --hidden-size 1024 \
@@ -44,7 +45,7 @@ GPT_ARGS="
     --weight-decay 1e-2 \
     --lr-warmup-fraction .01 \
     --clip-grad 1.0 \
-    --fp16
+    --bf16
 "
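
Switching mixed precision from `--fp16` to `--bf16` assumes GPUs with native bfloat16 support (Ampere or newer); a quick check from inside the container:
```bash
# Prints True on GPUs with native bfloat16 support.
python -c "import torch; print(torch.cuda.is_bf16_supported())"
```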