Merge pull request #1 from arcee-ai/mock-gpt-mp-training
modified the model-parallelized GPT pre-training script
shamanez committed Apr 22, 2024
2 parents ccfeda4 + f934094 commit 05a3bdf
Showing 5 changed files with 96 additions and 1 deletion.
70 changes: 70 additions & 0 deletions examples/docker_setup/README.md
@@ -0,0 +1,70 @@
## Quick Start Guide to Running Your PyTorch Docker Container

### Step 1: Create the Dockerfile

1. **Open Terminal**: Open a terminal on your Ubuntu machine.
2. **Create Dockerfile**: Enter `nano Dockerfile` to create and edit a new Dockerfile.
3. **Enter Dockerfile Content**:
```dockerfile
# Use an official PyTorch image as a base
FROM nvcr.io/nvidia/pytorch:latest

# Set the working directory inside the container
WORKDIR /workspace

# Copy the local code (including requirements.txt) into the container's workspace
COPY ./ /workspace/

# Install any necessary dependencies
RUN pip install -r requirements.txt

# Set the default command to execute
CMD ["/bin/bash"]
```
Replace `latest` with the specific NGC PyTorch tag you need (for example, `24.03-py3`). Modify `requirements.txt` to include all necessary Python packages. Note that `requirements.txt` must be copied into the image before the `RUN pip install` step, which is why the `COPY` instruction comes first.
4. **Save and Exit**: Press `Ctrl+O`, hit `Enter` to save, then `Ctrl+X` to exit `nano`.

### Step 2: Build Your Docker Image

1. **Build Image**: In your terminal, run:
```bash
docker build -t my-pytorch-app .
```
This command builds the Docker image named `my-pytorch-app` using the Dockerfile in the current directory.
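To confirm the build succeeded, you can list the image afterwards (a quick optional check, not part of the original guide):

```shell
# List the freshly built image; an empty table means the build failed or was tagged differently
docker images my-pytorch-app
```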

### Step 3: Create the Docker Run Script

1. **Open Terminal**: Open a terminal on your Ubuntu machine.
2. **Create Script File**: Enter `nano run_pytorch_docker.sh` to create and edit a new shell script.
3. **Enter Script Content**:
```bash
#!/bin/bash
# This script runs a Docker container with necessary volume mounts for the PyTorch application.

docker run --gpus all -it --rm \
-v /path/to/megatron:/workspace/megatron \
-v /path/to/dataset:/workspace/dataset \
-v /path/to/checkpoints:/workspace/checkpoints \
my-pytorch-app \
/bin/bash
```
Replace `/path/to/megatron`, `/path/to/dataset`, and `/path/to/checkpoints` with the actual paths to your resources. Running the container this way drops you into an interactive shell.
4. **Save and Exit**: Press `Ctrl+O`, hit `Enter` to save, then `Ctrl+X` to exit `nano`.
5. **Make Executable**: Run `chmod +x run_pytorch_docker.sh` to make your script executable.

### Step 4: Run the Docker Container

- **Execute the Script**: In your terminal, type `./run_pytorch_docker.sh` to start the Docker container. This script mounts specified directories and opens a container with GPU access enabled.
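Once inside the container, a quick sanity check confirms the GPUs are visible (hypothetical commands; `nvidia-smi` and PyTorch ship with the NGC base image, but your setup may differ):

```shell
# Inside the container: list the visible GPUs
nvidia-smi

# Confirm PyTorch can see CUDA and count the devices
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```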


### Step 5: Debugging Inside the Container

Once your Docker container is running and you're inside its interactive shell, you can proceed as if you're in a typical development environment:

- **Full Access to Libraries**: All libraries and tools installed in the Docker image are at your disposal. You can run commands, execute scripts, and use your usual debugging tools just like on a local machine.
- **Normal Operation**: Interact with the terminal as you would in any Linux environment. Edit, execute, and debug your applications directly inside the container using the command line or any terminal-based editors like Vim or Nano.

This setup provides a seamless experience for development and debugging, ensuring that your work environment is both controlled and replicable.
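For example, with the volume mounts above in place, you could launch the pre-training script that this commit modifies (a sketch assuming the repository is mounted at `/workspace/megatron` as shown earlier):

```shell
# Inside the container: run the model-parallel GPT pre-training example
cd /workspace/megatron
bash examples/pretrain_gpt_distributed_with_mp.sh
```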

### Step 6: Exit the Container

- **To Exit**: Type `exit` in the container's terminal. The container will stop, and due to the `--rm` flag, it will also be automatically removed, cleaning up your system.
10 changes: 10 additions & 0 deletions examples/docker_setup/docker_run.sh
@@ -0,0 +1,10 @@
#!/bin/bash

# This script runs a Docker container with the necessary volume mounts for the PyTorch application.

docker run --ipc=host --shm-size=512m --gpus all -it --rm \
-v /home/ubuntu/src/Megatron-LM:/workspace/megatron \
-v /home/ubuntu/src/dataset-dir:/workspace/dataset \
-v /home/ubuntu/src/checkpoint-dir:/workspace/checkpoints \
my-pytorch-app \
/bin/bash
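The hard-coded paths in `docker_run.sh` could be factored into variables so each host path is defined once (a sketch, not part of the commit; the directory names are the same ones used above):

```shell
#!/bin/bash
# Define each host path once so the mount list stays consistent
MEGATRON_DIR=/home/ubuntu/src/Megatron-LM
DATASET_DIR=/home/ubuntu/src/dataset-dir
CKPT_DIR=/home/ubuntu/src/checkpoint-dir

# Assemble the -v flags from the variables (backslash-newlines fold into one line)
MOUNTS="-v ${MEGATRON_DIR}:/workspace/megatron \
-v ${DATASET_DIR}:/workspace/dataset \
-v ${CKPT_DIR}:/workspace/checkpoints"

echo "$MOUNTS"
```

The assembled `$MOUNTS` string can then be passed to `docker run --gpus all -it --rm $MOUNTS my-pytorch-app /bin/bash` unquoted so the flags split as separate arguments.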
11 changes: 11 additions & 0 deletions examples/docker_setup/dockerfile
@@ -0,0 +1,11 @@
# Use NVIDIA's PyTorch image as the base
FROM nvcr.io/nvidia/pytorch:24.03-py3

# Set the working directory in the container
WORKDIR /app

# Copy the requirements file into the container at /app
COPY requirements.txt /app/

# Install any needed packages specified in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
3 changes: 3 additions & 0 deletions examples/docker_setup/requirements.txt
@@ -0,0 +1,3 @@
transformers
datasets
sentencepiece
3 changes: 2 additions & 1 deletion examples/pretrain_gpt_distributed_with_mp.sh
@@ -28,6 +28,7 @@ DISTRIBUTED_ARGS="
 GPT_ARGS="
     --tensor-model-parallel-size 2 \
     --pipeline-model-parallel-size 2 \
+    --attention-softmax-in-fp32 \
     --sequence-parallel \
     --num-layers 24 \
     --hidden-size 1024 \
@@ -44,7 +45,7 @@ GPT_ARGS="
     --weight-decay 1e-2 \
     --lr-warmup-fraction .01 \
     --clip-grad 1.0 \
-    --fp16
+    --bf16
 "

DATA_ARGS="