## 1. NGC Docker Login

Authenticate with NVIDIA NGC (GPU Cloud) container registry using your API key.

This step is required to pull private NVIDIA containers. Make sure the `NGC_API_KEY` environment variable is set before running this cell.

In [None]:
%%bash
echo "${NGC_API_KEY}" | docker login nvcr.io -u '$oauthtoken' --password-stdin
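
If `NGC_API_KEY` is unset, the login above fails with a message that does not name the missing variable. A small pre-flight check can make that failure clearer. This is a minimal sketch; `require_env` is a hypothetical helper name, not part of any NVIDIA or Docker tooling.

```python
import os

def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the login cell.")
    return value

# Hypothetical usage before the login cell:
# require_env("NGC_API_KEY")
```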

## 2. Configure Cache & Storage Paths

Set up ephemeral storage locations for various caches to optimize performance and disk usage:

1. **NIM Cache**: Store NVIDIA Inference Microservices cache
2. **Docker Storage**: Relocate Docker data root to ephemeral storage
3. **Pip Cache**: Cache Python packages for faster installs
4. **HuggingFace Cache**: Store downloaded models and datasets
5. **Temp Directory**: Set custom temporary file location

⚠️ **Note**: This cell requires sudo permissions to modify Docker configuration.

In [None]:
import os, json, subprocess, time

# -------------------------------
# 1. Setup NeMo/NIM cache
# -------------------------------
os.environ["LOCAL_NIM_CACHE"] = "/ephemeral/cache/nim"
os.makedirs(os.environ["LOCAL_NIM_CACHE"], exist_ok=True)
print(f"LOCAL_NIM_CACHE set to {os.environ['LOCAL_NIM_CACHE']}")

# -------------------------------
# 2. Setup Docker ephemeral storage
# -------------------------------
storage_path = "/ephemeral/cache/docker"
os.makedirs(storage_path, exist_ok=True)

daemon_file = "/etc/docker/daemon.json"
config = {}
try:
    # Load any existing settings so that only "data-root" is changed
    if os.path.exists(daemon_file):
        with open(daemon_file) as f:
            config = json.load(f)
except PermissionError:
    print("Cannot read daemon.json. Run with sudo or check path.")

# Update Docker root
config["data-root"] = storage_path
config_str = json.dumps(config, indent=4)

# Write daemon.json (requires sudo); pass the JSON on stdin to avoid shell-quoting issues
subprocess.run(["sudo", "tee", daemon_file], input=config_str, text=True,
               stdout=subprocess.DEVNULL, check=True)

# Restart Docker
subprocess.run("sudo systemctl restart docker", shell=True, check=True)
time.sleep(5)

# Verify new Docker root
docker_root = subprocess.run(
    "docker info | grep 'Docker Root Dir'",
    shell=True, capture_output=True, text=True
).stdout.strip()
print("Docker Root Dir:", docker_root)

# -------------------------------
# 3. Setup pip cache
# -------------------------------
pip_cache = "/ephemeral/cache/pip"
os.makedirs(pip_cache, exist_ok=True)
os.environ["PIP_CACHE_DIR"] = pip_cache
print(f"PIP_CACHE_DIR set to {pip_cache}")

# -------------------------------
# 4. Setup HuggingFace cache
# -------------------------------
hf_cache = "/ephemeral/cache/huggingface"
os.makedirs(hf_cache, exist_ok=True)
os.environ["HF_HOME"] = hf_cache
print(f"HF_HOME set to {hf_cache}")

# -------------------------------
# 5. Setup tmpdir
# -------------------------------
tmp_dir = "/ephemeral/tmp"
os.makedirs(tmp_dir, exist_ok=True)
os.environ["TMPDIR"] = tmp_dir
print(f"TMPDIR set to {tmp_dir}")
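
The `daemon.json` update in the cell above amounts to "load the existing settings, set `data-root`, write everything back". As an illustration, that merge can be isolated in a pure function that is easy to check without sudo. `merge_data_root` is a hypothetical name used only for this sketch.

```python
import json

def merge_data_root(existing_text: str, data_root: str) -> str:
    """Return daemon.json content with "data-root" set, preserving all other keys."""
    # An empty or missing file is treated as an empty configuration
    config = json.loads(existing_text) if existing_text.strip() else {}
    if not isinstance(config, dict):
        raise ValueError("daemon.json must contain a JSON object")
    config["data-root"] = data_root
    return json.dumps(config, indent=4)
```

Keeping the existing keys matters because `daemon.json` may already hold other settings, such as a container runtime entry.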

## 3. Launch NeMo RL Container

Start the NeMo RL Docker container in detached mode with:
- GPU support enabled
- Port 9000 exposed for services
- Current directory mounted to `/workspace`
- NVIDIA NeMo RL v0.4.0 image

The container will run in the background, allowing us to execute commands inside it.


In [None]:
!docker run --gpus all --name nemo-rl -it \
  -p 9000:9000 \
  -v "$(pwd)":/workspace \
  -w /workspace \
  -d nvcr.io/nvidia/nemo-rl:v0.4.0
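
Since the container starts in the background, it can be useful to confirm that it is actually running before executing commands in it. The sketch below relies on `docker inspect -f '{{.State.Running}}'`, which prints `true` for a running container. `container_is_running` is a hypothetical helper, and the `run` parameter exists only so the function can be exercised without Docker.

```python
import subprocess

def container_is_running(name: str, run=subprocess.run) -> bool:
    """Return True if `docker inspect` reports the named container as running."""
    result = run(
        ["docker", "inspect", "-f", "{{.State.Running}}", name],
        capture_output=True, text=True,
    )
    # A non-zero exit code means the container does not exist
    return result.returncode == 0 and result.stdout.strip() == "true"

# Hypothetical usage:
# container_is_running("nemo-rl")
```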

## 4. Setup NeMo RL Repository

Clone the NeMo RL repository and configure authentication:

1. **Clone Repository**: Get the latest NeMo RL code with all submodules
2. **Activate Virtual Environment**: Use the pre-configured NeMo RL Python environment
3. **HuggingFace Login**: Authenticate to download gated models (like Llama)
4. **Weights & Biases**: Set API key for experiment tracking

🔑 **Security Note**: Replace placeholder tokens with your actual credentials.


### Container Started

The NeMo RL container is now running. The next cell will set up the repository and authentication inside the container.

In [None]:
container = "nemo-rl"

!docker exec {container} bash -c "git clone https://github.com/NVIDIA-NeMo/RL.git nemo-rl --recursive"
!docker exec {container} bash -c "cd nemo-rl && git submodule update --init --recursive"

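# Activate NeMo RL venv
# Note: each `docker exec` starts a fresh shell, so this activation does not carry over
# to later cells; those cells source the venv in the same command that needs it.
!docker exec {container} bash -c "source /opt/nemo_rl_venv/bin/activate"

# HuggingFace login
!docker exec {container} bash -c "huggingface-cli login --token hf_********"

# WANDB API key
# Note: an `export` in its own `docker exec` is lost when that shell exits. To make the key
# visible to a later command, set it on that command, e.g. with `docker exec -e WANDB_API_KEY=...`.
!docker exec {container} bash -c 'export WANDB_API_KEY="*****"'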
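
Because every `docker exec` call runs in its own shell, environment variables are most reliably supplied on the call that needs them, using the `-e` flag of `docker exec`. The following is a minimal sketch of building such a command; `exec_with_env` is a hypothetical helper, and the key value shown is a placeholder.

```python
def exec_with_env(container: str, command: str, env: dict) -> list:
    """Build a `docker exec` argv that sets `env` for this one command only."""
    argv = ["docker", "exec"]
    for key, value in env.items():
        argv += ["-e", f"{key}={value}"]
    argv += [container, "bash", "-c", command]
    return argv

# Hypothetical usage (pass the result to subprocess.run):
# exec_with_env("nemo-rl", "env | grep WANDB", {"WANDB_API_KEY": "*****"})
```

Passing the command as an argument list also avoids a layer of shell quoting on the host.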

## 5. Run DPO Training

Execute Direct Preference Optimization (DPO) training using the NeMo RL framework.

### Training Configuration:
- **Model**: `meta-llama/Llama-3.2-1B-Instruct` (1B parameter instruction-tuned model)
- **GPUs**: 1 GPU per node
- **Steps**: 10 training steps (quick demo - increase for production)
- **Output**: Checkpoints saved to `./results/dpo/step_10/`

### What is DPO?
DPO is a preference-alignment technique that trains a model directly on preference pairs (a chosen and a rejected response for the same prompt). It needs neither a separate reward model nor a reinforcement-learning sampling loop.
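
To make the idea concrete, the per-pair DPO objective can be written in a few lines. This is a sketch of the standard DPO loss, not the NeMo RL implementation; the function name and the default `beta=0.1` are assumptions made for illustration. Inputs are the summed log-probabilities of each response under the policy being trained and under a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch on the sign for numerical stability
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference, the margin is zero and the loss is `log(2)`. The loss falls as the policy raises the chosen response's probability relative to the rejected one.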

⏱️ **Expected Time**: Several minutes depending on GPU and model size.


### Repository Configured

Authentication and repository setup complete. Ready to run DPO training.

In [None]:
container = "nemo-rl"

!docker exec -it $container bash -c 'source /opt/nemo_rl_venv/bin/activate && \
uv run python nemo-rl/examples/run_dpo.py \
cluster.gpus_per_node=1 \
dpo.max_num_steps=10 \
policy.model_name=meta-llama/Llama-3.2-1B-Instruct \
policy.tokenizer.name=meta-llama/Llama-3.2-1B-Instruct'
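
The training command above passes its settings as dotted `key=value` overrides. When experimenting with several values, it can help to generate those arguments from a dictionary. This is a minimal sketch; `build_overrides` is a hypothetical helper, and the keys are the ones used in the cell above.

```python
def build_overrides(settings: dict) -> list:
    """Turn {"a.b": 1} into ["a.b=1"], the key=value form used on the command line."""
    return [f"{key}={value}" for key, value in settings.items()]

overrides = build_overrides({
    "cluster.gpus_per_node": 1,
    "dpo.max_num_steps": 10,
    "policy.model_name": "meta-llama/Llama-3.2-1B-Instruct",
    "policy.tokenizer.name": "meta-llama/Llama-3.2-1B-Instruct",
})
print(" ".join(overrides))
```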

## 6. Convert Model to HuggingFace Format

Convert the trained DCP (Distributed Checkpoint) format to HuggingFace format for wider compatibility.

### Conversion Details:
- **Input**: DCP checkpoint from `./results/dpo/step_10/policy/weights`
- **Output**: HuggingFace-compatible model at `./results/dpo/step_10/hf`
- **Benefits**: Enables use with HuggingFace Transformers library and ecosystem

This conversion makes the model easier to share, deploy, and integrate with standard tools.


### Training Complete

DPO training finished. The trained model checkpoints are saved in `./results/dpo/step_10/`. The next step converts these to HuggingFace format.

In [None]:
container = "nemo-rl"

!docker exec {container} bash -c "source /opt/nemo_rl_venv/bin/activate && \
    uv run nemo-rl/examples/converters/convert_dcp_to_hf.py \
    --config ./results/dpo/step_10/config.yaml \
    --dcp-ckpt-path ./results/dpo/step_10/policy/weights \
    --hf-ckpt-path ./results/dpo/step_10/hf"
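
Before moving on, a quick check that the output directory looks like a HuggingFace checkpoint can catch a failed conversion early. The sketch below assumes that the converted directory contains a `config.json`, as HuggingFace model directories normally do. `missing_files` is a hypothetical helper, and the list of required files is an assumption you may want to extend.

```python
from pathlib import Path

def missing_files(directory: str, required=("config.json",)) -> list:
    """Return the names from `required` that are not present as files in `directory`."""
    root = Path(directory)
    return [name for name in required if not (root / name).is_file()]

# Hypothetical usage:
# missing_files("./results/dpo/step_10/hf")
```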

### Conversion Complete

Model successfully converted to HuggingFace format. Now let's test the model with a local inference script.

## 7. Local Inference Testing

Test the converted model with a simple inference script to verify it works correctly.

### Inference Script:
- Loads the HuggingFace model and tokenizer
- Uses `bfloat16` precision for efficiency
- Generates up to 50 new tokens
- Tests with a science question about photosynthesis

This step validates that the model conversion was successful before deployment.


### Create the Inference Script

The next cell writes `inference.py`. A later cell executes it inside the container to verify the model works correctly.

In [None]:
%%writefile inference.py
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

hf_path = "./results/dpo/step_10/hf/"

tokenizer = AutoTokenizer.from_pretrained(hf_path)
model = AutoModelForCausalLM.from_pretrained(hf_path, torch_dtype=torch.bfloat16)
model.eval()

prompt = "How does photosynthesis work in plants?"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))


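### Run the Local Inference Test

The next cell runs `inference.py` inside the container. If the model generates output, the conversion worked. The cell after it writes `convert.py`, which converts the model to SafeTensors format for better security and performance.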

In [None]:
container = "nemo-rl"
!docker exec {container} bash -c "source /opt/nemo_rl_venv/bin/activate && python inference.py"

In [None]:
%%writefile convert.py
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "./results/dpo/step_10/hf"
dst = "./results/dpo/step_10/hf_st"

model = AutoModelForCausalLM.from_pretrained(src)
model.save_pretrained(dst, safe_serialization=True)

tok = AutoTokenizer.from_pretrained(src)
tok.save_pretrained(dst)

print("Saved to:", dst)


## 8. Convert to SafeTensors Format

Convert the HuggingFace model to use SafeTensors serialization for improved security and performance.

### SafeTensors Benefits:
- **Security**: Stores raw tensor data instead of pickled Python objects, so loading a file cannot execute arbitrary code
- **Performance**: Faster loading times with zero-copy deserialization
- **Reliability**: Better error handling and validation
- **Compatibility**: Works with all major ML frameworks

The converted model is saved to `./results/dpo/step_10/hf_st` (st = safe tensors).
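
Part of what makes the format fast and safe to load is its simple layout: an 8-byte little-endian header length, a JSON header describing each tensor's dtype, shape, and byte offsets, then the raw tensor bytes. As an illustration, the header can be read with the standard library alone. `read_safetensors_header` is a hypothetical helper, not part of the `safetensors` package.

```python
import json
import struct

def read_safetensors_header(path: str) -> dict:
    """Read only the JSON header of a .safetensors file; no tensor data is loaded."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header size as an unsigned little-endian integer
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))

# Hypothetical usage, assuming the unsharded default file name `model.safetensors`:
# header = read_safetensors_header("./results/dpo/step_10/hf_st/model.safetensors")
# print(list(header)[:5])
```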


### SafeTensors Script Created

The `convert.py` script has been written. Execute it in the next cell to convert the model.

In [None]:
container = "nemo-rl"
!docker exec {container} bash -c "source /opt/nemo_rl_venv/bin/activate && python convert.py"

### SafeTensors Conversion Complete

Model successfully converted to SafeTensors format at `./results/dpo/step_10/hf_st`. Ready for NIM deployment.

## 9. Deploy with NVIDIA NIM

Deploy the trained model using NVIDIA Inference Microservices (NIM) for production-ready serving.

### NIM Configuration:
- **Container**: `MultiLLM-NIM` running the latest NVIDIA LLM-NIM image
- **Model Name**: Exposed as `dpo-llm` via the API
- **GPU Support**: All available GPUs with 16GB shared memory
- **Port**: Service available on `localhost:8000`
- **Caching**: Uses ephemeral storage for optimal performance

### What is NIM?
NVIDIA NIM provides optimized inference microservices with:
- Low latency and high throughput
- OpenAI-compatible API endpoints
- Built-in performance optimizations
- Easy deployment and scaling


### Set NIM Configuration

The next cell defines the container and model settings. The cell after it launches the NIM container.

In [None]:
# ===============================
#   MultiLLM-NIM Container Launcher
#   (Detached mode)
# ===============================

# Choose container name
CONTAINER_NAME = "MultiLLM-NIM"

# NGC Multi-LLM NIM repo
Repository = "nim/nvidia/llm-nim"
TAG = "latest"
IMG_NAME = f"nvcr.io/{Repository}:{TAG}"

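# Path to your local HF DPO model
# (made absolute, since older Docker releases reject relative host paths in `-v` bind mounts)
import os
LOCAL_MODEL_DIR = os.path.abspath("./results/dpo/step_10/hf_st")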

# Name to expose the served model
NIM_SERVED_MODEL_NAME = "dpo-llm"

# Local NIM cache (you chose ephemeral)
LOCAL_NIM_CACHE = "/ephemeral/cache/nim"

# Create cache directory
!mkdir -p "{LOCAL_NIM_CACHE}"
!chmod -R a+w "{LOCAL_NIM_CACHE}"

print("MultiLLM-NIM launch settings (detached mode):")
print("Container:", CONTAINER_NAME)
print("Image:", IMG_NAME)
print("Model Path:", LOCAL_MODEL_DIR)
print("NIM Cache:", LOCAL_NIM_CACHE)

### Launch the NIM Container

The next cell starts the MultiLLM-NIM container in detached mode. Startup takes a while, so use the health check to wait for it to be ready.

In [None]:
# -------------------------------
# Run the container DETACHED
# -------------------------------
!docker run -d --rm --name={CONTAINER_NAME} \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NIM_MODEL_PROFILE="e2f00b2cbfb168f907c8d6d4d40406f7261111fbab8b3417a485dcd19d10cc98" \
  -e NIM_MODEL_NAME="/opt/models/local_model" \
  -e NIM_SERVED_MODEL_NAME={NIM_SERVED_MODEL_NAME} \
  -v "{LOCAL_MODEL_DIR}:/opt/models/local_model" \
  -v "{LOCAL_NIM_CACHE}:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  {IMG_NAME}

## 10. Health Check

Wait for the NIM service to be fully ready before sending inference requests.

This cell continuously polls the `/v1/health/ready` endpoint until the service reports ready status. The container needs time to:
- Load the model into GPU memory
- Initialize the inference engine
- Start the API server

⏱️ **Expected Wait Time**: 1-5 minutes depending on model size and hardware.


### Wait for NIM Readiness

Once the cell below reports that the service is ready, it can accept inference requests. Test it with the completions API.

In [None]:
import time
import requests

url = 'http://localhost:8000/v1/health/ready'  # make sure the LLM NIM port is correct
headers = {'accept': 'application/json'}

print("Checking MultiLLM NIM readiness...")
while True:
    try:
        # A timeout keeps a hung connection from blocking the loop indefinitely
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            data = response.json()
            if data.get("message") == "Service is ready.":
                print("LLM NIM is ready.")
                break
            else:
                print("LLM NIM is not ready. Waiting for 30 seconds...")
        else:
            print(f"Unexpected status code {response.status_code}. Waiting for 30 seconds...")
    except requests.RequestException:
        # Covers refused connections and timeouts while the server is still starting
        print("LLM NIM is not ready. Waiting for 30 seconds...")
    time.sleep(30)
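
The loop above waits indefinitely, which is inconvenient if the container exited during startup. One possible variant adds an overall deadline. This is a sketch under assumed defaults (a 600-second timeout and 30-second interval); `wait_until_ready` is a hypothetical helper, and `check` is any zero-argument function that returns `True` once the service is ready.

```python
import time

def wait_until_ready(check, timeout_s=600.0, interval_s=30.0,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Call `check()` until it returns True or `timeout_s` elapses; return whether it became ready."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        # Stop if another full wait would pass the deadline
        if clock() + interval_s > deadline:
            return False
        sleep(interval_s)
```

The `sleep` and `clock` parameters default to the real functions and can be replaced with fakes for testing.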

### Health Check Complete

When the loop above exits, the service is ready. The next section sends a completion request. After that, you can stop the container or continue using the service.

## 11. Test Completions API

Test the deployed model using the OpenAI-compatible completions endpoint.

### Request Parameters:
- **model**: `dpo-llm` (our trained DPO model)
- **prompt**: A starter phrase to complete
- **max_tokens**: Maximum number of tokens to generate (64)

The API will return a JSON response with the model's completion of the prompt.


In [None]:
!curl -X POST 'http://localhost:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{"model": "dpo-llm", "prompt": "The sky appears blue because", "max_tokens": 64}'
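
The same request can be sent from Python, which makes it easier to extract the generated text from the JSON response. The sketch below uses only the standard library and assumes the OpenAI-style response shape, in which the text is found at `choices[0].text`. The function names are hypothetical.

```python
import json
import urllib.request

def build_completion_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build the JSON body for a /v1/completions request."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def extract_text(response: dict) -> str:
    """Return the generated text of the first choice in a completions response."""
    return response["choices"][0]["text"]

def complete(base_url: str, payload: dict, timeout: float = 60.0) -> str:
    """POST the payload to the completions endpoint and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return extract_text(json.load(resp))

# Hypothetical usage once the service is ready:
# payload = build_completion_request("dpo-llm", "The sky appears blue because")
# print(complete("http://localhost:8000", payload))
```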

## 12. Cleanup

Stop the MultiLLM-NIM and nemo-rl containers to free up GPU resources.

🧹 **Note**: The MultiLLM-NIM container was started with the `--rm` flag, so it is automatically removed after stopping. The nemo-rl container was started without `--rm`; it remains in a stopped state until you remove it with `docker rm nemo-rl`.


In [None]:
!docker stop MultiLLM-NIM nemo-rl