# Milestone Workshop: Cosmos Transfer 2.5
**Authors:** Aiden Chang, Akul Santhosh


This notebook is a hands-on guide for Milestone data. The goal is for you to understand, create, and use the multi-control modalities that power Cosmos Transfer 2.5 (CT 2.5).

**Important:** Select the Cosmos Transfer 2.5 kernel.

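We will cover:

- Key concepts: guidance scale and control weight normalization
- The control modalities: edge, segmentation (seg), vis, and depth
- Hands-on multi-control recipes: color/texture, lighting, and background changes
- Generating realistic data from Omniverse
- A prompt generator for scene conditions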

Let's begin by setting up our environment. 

In [4]:
!huggingface-cli login --token "YOUR TOKEN HERE"


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `hf`CLI if you want to set the git credential as well.
Token is valid (permission: read).
The token `read_token` has been saved to /home/nvidia/.cache/huggingface/stored_tokens
Your token has been saved to /home/nvidia/.cache/huggingface/token
Login successful.
The current active token is: `read_token`


In [1]:
import os
os.makedirs("prompts", exist_ok=True)
os.makedirs("outputs", exist_ok=True)
os.makedirs("control_modalities", exist_ok=True)

## 1. Key Concepts: Governing Strength
Success with CT 2.5 depends on balancing two key principles: the text prompt's influence (Guidance Scale) and the influence of the visual controls (Control Weight Normalization).

### 1.1. Guidance Scale (Prompt Strength)
This dictates how strictly the model adheres to your text prompt versus the visual controls.
- What it is: Controls the influence of the text prompt.
- Good Starting Point: Guidance = 3.
- When to Increase: Increase to 5+ if the visual output fails to incorporate the changes described in your prompt (e.g., trying to change a shirt from "blue" to "red").

### 1.2. Control Weight Normalization 
This governs how the model balances multiple control modalities (e.g., Edge + Seg + Vis) against each other.
- Rule 1: Weights WILL NOT Normalize if the total sum of all control weights is 1.0 or less. The weights are applied as-is.
    - Example: {seg: 0.2, edge: 0.2} (sum 0.4) will be used as-is.
- Rule 2: Weights WILL NORMALIZE if the total sum is greater than 1.0. The weights are re-scaled proportionally so the new total sum equals 1.0.
    - Example: {seg: 4.0, edge: 1.0} (sum 5.0) will be normalized and run as {seg: 0.8, edge: 0.2}.
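
The two rules can be expressed in a few lines. The sketch below is illustrative only: `normalize_control_weights` is a hypothetical helper, not a CT 2.5 API, and it simply mirrors the behavior described above.

```python
def normalize_control_weights(weights: dict[str, float]) -> dict[str, float]:
    """Rescale control weights to sum to 1.0, but only when their sum exceeds 1.0."""
    total = sum(weights.values())
    if total <= 1.0:
        # Rule 1: the sum is 1.0 or less, so the weights are used as-is.
        return dict(weights)
    # Rule 2: the sum exceeds 1.0, so rescale proportionally to a total of 1.0.
    return {name: value / total for name, value in weights.items()}


print(normalize_control_weights({"seg": 0.2, "edge": 0.2}))  # {'seg': 0.2, 'edge': 0.2}
print(normalize_control_weights({"seg": 4.0, "edge": 1.0}))  # {'seg': 0.8, 'edge': 0.2}
```

By the same rules, a combination such as `{edge: 1.0, vis: 0.2}` sums to 1.2 and would be rescaled to roughly 0.83 and 0.17, so only the ratio between the weights matters once the sum passes 1.0.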

## 2. Technical Details: The Control Modalities
The system uses modalities to inject structural, semantic, relative, and visual consistency into the video. We will generate these control modalities using the following video:


In [2]:
from IPython.display import HTML
import sys

milestone_example = "milestone_data/clip_0_harder_version.mp4"

HTML(f"""
<video width="600" controls>
  <source src="{milestone_example}" type="video/mp4">
</video>
""")

### 2.1. Edge Control (Structure Preservation)

Function: Preserves the original structure, shape, and layout of the video.

Edge control is natively supported in CT2.5. However, when object and background contours are visually similar, the default Canny-based edge detection may miss important boundaries. In these situations, it’s helpful to run a preprocessing step within the CT2.5 repository to generate a cleaner, higher-contrast edge map.

Hands-On: Let's generate our own edge-control video with enhanced contrast and brightness. We can do this using the command line or by using the CT2.5 python function.

In [12]:
from cosmos_transfer2_5.cosmos_transfer2._src.transfer2.auxiliary.utils.generate_edges import generate_edges
from pathlib import Path

def generate_output_path(file_path, modality_type):
    name = Path(file_path).stem
    parent_dir = os.path.join("control_modalities", name)
    os.makedirs(parent_dir, exist_ok=True)
    return os.path.join(parent_dir, f"{modality_type}.mp4")

# --- KEY PARAMETERS ---
# We increase brightness to help distinguish contours
bright = 1
contrast = 0.2

# --- EXERCISE ---
in_path = milestone_example
out_path = generate_output_path(milestone_example, "edge")

generate_edges(in_path, out_path, bright=bright, contrast=contrast)

print("finished generating the new video!")


finished generating the new video!


In [13]:
!ffmpeg -y -i control_modalities/clip_0_harder_version/edge.mp4 -vcodec libx264 -acodec aac control_modalities/clip_0_harder_version/edge_h264.mp4 -v quiet

HTML(f"""
<video width="600" controls>
  <source src="control_modalities/clip_0_harder_version/edge_h264.mp4" type="video/mp4">
</video>
""")
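
The H.264 re-encode recurs for every control video so that the notebook can display it. A small wrapper could save retyping the command. In the sketch below, `h264_path`, `h264_command`, and `reencode_h264` are hypothetical helpers, and the flags are the same ones used in the cell above (with `-v quiet` moved to the front as a global option).

```python
import subprocess
from pathlib import Path


def h264_path(src: str) -> str:
    """Return the sibling path '<stem>_h264.mp4' for a source video."""
    src_path = Path(src)
    return str(src_path.with_name(f"{src_path.stem}_h264.mp4"))


def h264_command(src: str) -> list[str]:
    """Build the ffmpeg argument list that re-encodes `src` to H.264/AAC."""
    return ["ffmpeg", "-v", "quiet", "-y", "-i", src,
            "-vcodec", "libx264", "-acodec", "aac", h264_path(src)]


def reencode_h264(src: str) -> str:
    """Run ffmpeg (which must be on PATH) and return the re-encoded file's path."""
    subprocess.run(h264_command(src), check=True)
    return h264_path(src)


# Show the command without running it.
print(" ".join(h264_command("control_modalities/clip_0_harder_version/edge.mp4")))
```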

### 2.2. Segmentation (Seg) Control (Structural Change & Semantic Replacement)

**Function**: Facilitates large, structural changes and semantic replacement. Used to completely transform or replace objects, people, or backgrounds. 

There are three parts:
1. **Identify objects in the scene:** Use an object detection model (e.g., [RAM++](https://github.com/xinyu1205/recognize-anything)) to obtain object labels and bounding boxes. *(We skip this step here, as multiple approaches can be applied.)*
2. **Prompt the objects:** Detection models such as [Grounding DINO](https://github.com/IDEA-Research/GroundingDINO) or [YOLO](https://docs.ultralytics.com/models/yolov9/) can generate either box or point prompts (e.g., coordinates). These prompts guide the segmentation process. *(If a standalone detector provides both labels and spatial prompts, Step 1 is implicitly covered.)*
3. **Generate pixel-accurate segmentations:** Feed the prompts (boxes or points) into SAM/SAM2 to obtain high-quality masks that drive the structural or semantic edits.

In CT2.5, steps 2 and 3 are combined into a single pipeline. You can use the following commands to segment the objects automatically.

In [1]:
!python -m pip install 'git+https://github.com/facebookresearch/sam3.git'

Collecting git+https://github.com/facebookresearch/sam3.git
  Cloning https://github.com/facebookresearch/sam3.git to /tmp/pip-req-build-pzz8ts5t
  Running command git clone --filter=blob:none --quiet https://github.com/facebookresearch/sam3.git /tmp/pip-req-build-pzz8ts5t
  Resolved https://github.com/facebookresearch/sam3.git to commit 757bbb0206a0b68bee81b17d7eb4877177025b2f
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting numpy==1.26 (from sam3==0.1.0)
  Downloading numpy-1.26.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (58 kB)
Collecting ftfy==6.1.1 (from sam3==0.1.0)
  Downloading ftfy-6.1.1-py3-none-any.whl.metadata (6.1 kB)
Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
Downloading numpy-1.26.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)

In [None]:
# You can also specify the tags or use models such as RAM++ to detect them. 
# NOTE: The tags must be lowercase and separated by a period! 
!{sys.executable} cosmos_transfer2_5/cosmos_transfer2/_src/transfer2/auxiliary/sam2/sam2_pipeline.py \
    --input_video assets/wave.mp4 \
    --output_video assets/seg.mp4 \
    --output_tensor assets/tensor.pt \
    --mode prompt \
    --prompt "floor. ceiling. wall. staircase. railing. bench. light. person. man. shirt. pants. shoes. badge. hand. arm. wave. pose. hall. atrium. building. door. corridor. background. reflection. shadow"

In [None]:
!ffmpeg -y -i assets/seg.mp4 -vcodec libx264 -acodec aac assets/seg_h264.mp4 -v quiet

HTML(f"""
<video width="600" controls>
  <source src="assets/seg_h264.mp4" type="video/mp4">
</video>
""")
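
The segmentation prompt above is a list of lowercase tags separated by periods. If the tags come from a detector or a hand-written list, a small formatter can enforce that shape. `build_tag_prompt` below is a hypothetical helper; dropping duplicates is a convenience added here, not a requirement stated by the pipeline.

```python
def build_tag_prompt(tags: list[str]) -> str:
    """Join tags into a lowercase, period-separated prompt string."""
    cleaned = []
    for tag in tags:
        tag = tag.strip().strip(".").lower()
        if tag and tag not in cleaned:  # skip empties and duplicates, keep order
            cleaned.append(tag)
    return ". ".join(cleaned)


print(build_tag_prompt(["Floor", "ceiling ", "Person", "person", "ID badge."]))
# floor. ceiling. person. id badge
```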

### 2.3. Vis Control (Lighting & Background Feel)
Function: Preserves the original video’s background, lighting, and overall appearance. It applies a subtle smoothing/blur. Vis control is natively built into the CT 2.5 repository. No need to pre-generate.

### 2.4. Depth Control (Distance Consistency)
Function: Preserves the original video’s 3D geometry. Depth control is natively built into the CT 2.5 repository. No need to pre-generate.

In [None]:
# Optional: generate a depth map manually (kept commented out).
# Assumes `video_np` already holds the video frames as a NumPy array.
# import numpy as np
# import torch
# from cosmos_transfer2._src.transfer2.auxiliary.depth_anything.video_depth_anything import VideoDepthAnythingModel

# depth_model = VideoDepthAnythingModel()
# depth_model.setup()
# depth_maps = depth_model.generate(video_np)
# depth_tensor = torch.from_numpy(depth_maps.astype(np.float32))
# # Min-max normalize the whole clip to the 0-255 range
# d_min, d_max = depth_tensor.min(), depth_tensor.max()
# depth_normalized = (depth_tensor - d_min) / (d_max - d_min + 1e-8) * 255.0
# depth_normalized = depth_normalized.unsqueeze(0)  # (1, T, H, W)
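
The commented-out cell above rescales the depth maps to the 0-255 range with min-max normalization over the whole clip. A dependency-free sketch of just that step follows, using nested lists in place of tensors. `normalize_depth` is a hypothetical name and the depth values are made up.

```python
def normalize_depth(frames, eps=1e-8):
    """Min-max normalize a (T, H, W) nested list of depth values to [0, 255].

    The minimum and maximum are taken over the whole clip, so relative depth
    stays comparable from frame to frame.
    """
    values = [v for frame in frames for row in frame for v in row]
    d_min, d_max = min(values), max(values)
    scale = 255.0 / (d_max - d_min + eps)  # eps avoids division by zero on flat depth
    return [[[(v - d_min) * scale for v in row] for row in frame] for frame in frames]


depth = [[[1.0, 2.0], [3.0, 5.0]]]  # one 2x2 frame of made-up depth values
normalized = normalize_depth(depth)
print([[round(v, 1) for v in row] for row in normalized[0]])
```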


## 3. Hands-On: Multi-Control Recipes
| Task | Suggested Controls & Settings| 
|--|--|
|Change clothing or textures|Edge: 1.0, Guidance: 3|
|Change lighting|Edge: 1.0 + Vis: 0.2, Guidance: 3|
|Change background, keep subject|Filtered Edge: 1.0 + Seg (Mask Inverted): 0.4 + Vis: 0.6, Guidance: 3|


### Recipe 1: Color/Texture Change

Goal: Modify the color of the person's shirt. This is the simplest recipe. We already generated the edge modality from the previous steps!

```json
{
  "name": "color_change",
  "prompt_path": "prompts/prompt_color.txt",
  "video_path": "assets/wave.mp4",
  "guidance": 3,
  "edge": {
    "control_weight": 1.0
  }
}
```

*Note: Why no Vis? Vis would preserve the original colors. We rely only on Edge (1.0) to hold the shape and let the Prompt do all the color work.*
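
The inference script is invoked with a `.jsonl` file, which this notebook does not show. Assuming that file simply holds one JSON config per line, the recipe above could be written out as sketched below. `write_jsonl` is a hypothetical helper, the file layout is an assumption, and the output name is deliberately different from the workshop's own `scripts/color_change.jsonl` so nothing is overwritten.

```python
import json
import os

color_change = {
    "name": "color_change",
    "prompt_path": "prompts/prompt_color.txt",
    "video_path": "assets/wave.mp4",
    "guidance": 3,
    "edge": {"control_weight": 1.0},
}


def write_jsonl(configs, path):
    """Write each config dict as one JSON object per line (JSON Lines)."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "w") as f:
        for config in configs:
            f.write(json.dumps(config) + "\n")


write_jsonl([color_change], "scripts/color_change_custom.jsonl")
```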

#### Recipe:
<img src="assets/color_change_recipe.png" width="300"/>

#### Example Results:
<div style="display: flex; gap: 20px;">
  <video src="assets/wave.mp4" width="45%" controls></video>
  <video src="assets/color.mp4" width="45%" controls></video>
</div>

In [2]:
# Feel free to change the prompt
prompt = "The camera pans over the person sitting on a chair wearing a red t-shirt"
os.makedirs("temp", exist_ok=True)  # make sure the target folder exists
with open("temp/background_change.txt", "w") as f:
    f.write(prompt)


In [3]:
# Due to time constraints, we have commented out this exercise. Feel free to run it on your own time.

# Run CT2.5 
# WARNING: This should take a couple of minutes
# !{sys.executable} cosmos_transfer2_5/examples/inference.py -i scripts/color_change.jsonl -o outputs/color_change
!torchrun --nproc_per_node=8 --master_port=12341 cosmos_transfer2_5/examples/inference.py -i temp/background_change.jsonl -o outputs/temp

W1119 04:58:17.562000 1570167 torch/distributed/run.py:766] 
W1119 04:58:17.562000 1570167 torch/distributed/run.py:766] *****************************************
W1119 04:58:17.562000 1570167 torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1119 04:58:17.562000 1570167 torch/distributed/run.py:766] *****************************************
Fetching 102 files: 100%|███████████████████| 102/102 [00:00<00:00, 4529.92it/s]
[11-19 04:58:26|INFO|cosmos_transfer2_5/cosmos_transfer2/_src/imaginaire/utils/checkpoint_db.py:171:path] Downloading checkpoint nvidia/Cosmos-Transfer2.5-2B/general/edge(ecd0ba00-d598-4f94-aa09-e8627899c431)
[11-19 04:58:26|INFO|cosmos_transfer2_5/cosmos_transfer2/_src/imaginaire/utils/checkpoint_db.py:96:_download] Downloading c

In [None]:
HTML(f"""
<video width="600" controls>
  <source src="outputs/color_change/color_change.mp4" type="video/mp4">
</video>
""")

### Recipe 2: Lighting Change

Goal: Modify scene lighting (e.g., day to night) while keeping all objects the same. Configuration:

```json
{
  "name": "change_lighting",
  "prompt_path": "prompts/prompt_lighting.txt",
  "video_path": "assets/wave.mp4",
  "guidance": 3,
  "edge": {
    "control_weight": 1.0
  },
  "vis": {
    "control_weight": 0.2
  }
}
```

We already generated the edge modality from the previous steps, and vis can be computed on the fly.

*Note: We use full Edge (1.0) to lock the structure and a tiny bit of Vis (0.2) to maintain realism, but low enough to allow the prompt to change the lighting.*

#### Recipe:
<img src="assets/lighting_change_recipe.png" width="500"/>

#### Example Results:
<div style="display: flex; gap: 20px;">
  <video src="assets/wave.mp4" width="45%" controls></video>
  <video src="assets/lighting.mp4" width="45%" controls></video>
</div>

In [None]:
prompt = "A realistic, static full-body shot of a young man standing in the center of a spacious, modern atrium. He has short dark hair and is dressed casually in a dark grey t-shirt, loose black pants, and white sneakers, with an ID badge clipped to his waistband. He faces the camera directly and waves his right hand continuously in a friendly greeting. The surrounding space is bright and open, featuring a high industrial-style ceiling with exposed white beams and large, angular black structural supports. The floor is polished light grey concrete, subtly reflecting the warm, soft afternoon sunlight that pours in from large windows above. The overall lighting has a gentle golden tint, with natural shadows stretching slightly to the side in the way they do during late afternoon. In the background, a mezzanine level with glass railings is visible, along with several modern wooden benches and tables scattered throughout the area."
with open("prompts/prompt_lighting.txt", "w") as f:
    f.write(prompt)

In [13]:
# Due to time constraints, we have commented out this exercise. Feel free to run it on your own time.

# Run CT2.5 
# WARNING: This should take a couple of minutes
# !{sys.executable} cosmos_transfer2_5/examples/inference.py -i scripts/lighting_change.jsonl -o outputs/lighting_change

In [4]:
HTML(f"""
<video width="600" controls>
  <source src="outputs/lighting_change/lighting_change.mp4" type="video/mp4">
</video>
""")

### Recipe 3: Background Change

Goal: Modify the background while keeping selected objects and/or subjects the same. Configuration:

```json
{
  "name": "change_background",
  "prompt_path": "prompt.txt",
  "video_path": "original.mp4",
  "guidance": 3,
  "edge": {
    "control_weight": 1.0,
    "control_path": "filtered_edge.mp4"
  },
  "seg": {
    "control_weight": 0.4,
    "control_path": "segmentation.mp4",
    "mask_path": "mask_inverted.mp4"
  },
  "vis": {
    "control_weight": 0.4
  }
}
```

*Note: Adjust the Vis control weight based on your use case.*

One control modality still needs to be generated:
- `filtered_edge.mp4`

#### Recipe:
<img src="assets/background_change_recipe.png" width="80%"/>

#### Example Results:
<div style="display: flex; gap: 20px;">
  <video src="assets/wave.mp4" width="45%" controls></video>
  <video src="assets/ocean.mp4" width="45%" controls></video>
</div>

#### Generating the filtered edge
Let's learn how to generate the filtered edge. We found that, in practice, this works best. We generated a mask in the previous step, and we already have our edge video!


<img src="assets/filtered_edge_recipe.png" width="80%"/>

In [None]:
!{sys.executable} cosmos_transfer2_5/cosmos_transfer2/_src/transfer2/auxiliary/utils/filter_edges.py \
    assets/edge.mp4 \
    assets/mask.mp4 \
    assets/filtered_edge.mp4 \
    --threshold 0 \
    --grow_px 3 \
    --close_px 3 \
    --feather_px 2

# Encoding in a format we can view
!ffmpeg -y -i assets/filtered_edge.mp4 -vcodec libx264 -acodec aac assets/filtered_edge_h264.mp4 -v quiet

In [None]:
HTML(f"""
<video width="600" controls>
  <source src="assets/filtered_edge_h264.mp4" type="video/mp4">
</video>
""")
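
Conceptually, the filter keeps edge pixels only where the mask is active, so contours outside the masked region no longer constrain the generation. The toy sketch below shows just that masking step on a single frame of nested lists. It is an assumption about the idea, not the implementation of `filter_edges.py`, and it ignores the `--grow_px`, `--close_px`, and `--feather_px` refinements. `filter_edge_frame` is a hypothetical name.

```python
def filter_edge_frame(edge, mask, threshold=0):
    """Zero out edge pixels wherever the mask value is not above `threshold`."""
    return [
        [e if m > threshold else 0 for e, m in zip(edge_row, mask_row)]
        for edge_row, mask_row in zip(edge, mask)
    ]


edge = [[255, 0, 255],
        [255, 255, 0]]
mask = [[255, 255, 0],
        [0, 255, 0]]
print(filter_edge_frame(edge, mask))
# [[255, 0, 0], [0, 255, 0]]
```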

#### Changing the background
We have all four ingredients (Inverted Mask, Filtered Edge, Seg, and Vis)! Let's now modify our background.

*Note: Remember, vis is computed on the fly!*

Vis can be tuned here depending on your background. Keeping Vis generates a clearer background, but it may backfire depending on how much you change the background. For example, we can use Vis when changing to a similar background, such as a street. In that case, Vis gives us a clear, crisp background.

| Original Video | Street background |
|---|---|
| ![wave](assets/wave.png) | ![street](assets/street.png) |


However, this could backfire when we modify the background to something that has very different visual elements. For example, if we change it to an ocean background, we can see that having vis leaves some artifacts in the background. In this case, we *don't want vis*. This does generate more of a blurry background, but it's better than our original result! 


| Original Video | With Vis | Without Vis |
|---|---|---|
| ![wave](assets/wave.png) | ![ocean with vis](assets/ocean_vis.png) | ![ocean without vis](assets/ocean_no_vis.png) |

In [6]:
prompt = "A realistic, static full-body shot of a young man standing outdoors near the coast. He has short dark hair and is dressed casually in a dark grey t-shirt, loose black pants, and white sneakers, with an ID badge clipped to his waistband. He faces the camera directly and waves his right hand continuously in a friendly greeting. The surrounding environment is bright and open. In the background, a vast ocean stretches out toward the horizon, with gentle waves, shimmering reflections, and a clear blue sky above. A coastal walkway with railings and scattered pedestrians lines the foreground, replacing the busy city street elements. Soft natural lighting from the sun enhances the calm, breezy seaside atmosphere."
with open("prompts/prompt_background.txt", "w") as f:
    f.write(prompt)

In [None]:
# Run CT2.5 
# WARNING: This should take a couple of minutes
!{sys.executable} cosmos_transfer2_5/examples/inference.py -i scripts/background_change.jsonl -o outputs/background_change

In [11]:
HTML(f"""
<video width="600" controls>
  <source src="outputs/background_change/background_change.mp4" type="video/mp4">
</video>
""")

## 4. Generating Realistic Data from Omniverse

An important robotics workflow is "Sim-to-Real." NVIDIA Omniverse can generate synthetic data, but we can use CT 2.5 to add real-world domain randomization (new lighting, textures, backgrounds) and generate photorealistic scenes.

The Workflow:
1. Generate in Omniverse: Create a base scenario (e.g., cars driving around) and export the video.
2. Extract Ground Truth: From Omniverse, also export the perfect ground-truth modalities (Depth, Segmentation, Edge).
3. Augment with CT 2.5: Use these perfect synthetic controls to run CT 2.5 with a new prompt (e.g., "in a dimly lit snowy day").
4. Package with Cosmos Writer: Save the new, augmented video alongside the original, ground-truth controls. This teaches a downstream model to associate the ground-truth controls with the new, realistic style.


### Omniverse Control Modalities

We start with the following control modalities:

| Original Video | Edge | Seg | Depth |
|----------|----------|----------|----------|
| <video src="simulation_data/simulator_rgb_input.mp4" controls width="300"></video> | <video src="simulation_data/simulator_edge.mp4" controls width="300"></video> | <video src="simulation_data/simulator_segmentation.mp4" controls width="300"></video> | <video src="simulation_data/simulator_depth.mp4" controls width="300"></video> |


### Recipe 


| Task | Suggested Controls & Settings| Example Results | Prompt |
|--|--|---|-|
|Original Video| N/A | <video src="TODO" width="100%" controls></video> | N/A |
|Photorealistic Generation|TODO| <video src="TODO" width="100%" controls></video> | [Prompt Location](simulation_data/prompt_location.txt) |
|Fog|TODO| <video src="TODO" width="100%" controls></video> | [Prompt Location](simulation_data/fog.txt) |
|Morning Sunlight|TODO| <video src="TODO" width="100%" controls></video> | [Prompt Location](simulation_data/morning_sun.txt) |
|Night|TODO| <video src="TODO" width="100%" controls></video> | [Prompt Location](simulation_data/night.txt) |
|Rain|TODO| <video src="TODO" width="100%" controls></video> | [Prompt Location](simulation_data/rain.txt) |
|Snow|TODO| <video src="TODO" width="100%" controls></video> | [Prompt Location](simulation_data/snow.txt) |
|Wooden Road|TODO| <video src="TODO" width="100%" controls></video> | [Prompt Location](simulation_data/wooden_road.txt) |

<!-- #### Example Results:
<div style="display: flex; gap: 20px;">
  <video src="TODO" width="45%" controls></video>
  <video src="TODO" width="45%" controls></video>
</div> -->

### Photorealistic Generation
Let us change our simulated environment to look realistic:

```json
{
  "name": "omniverse_photorealistic",
  "prompt_path": "prompt.txt",
  "video_path": "original.mp4",
  "guidance": 7,
  "edge": {
    "control_weight": 1.0,
    "control_path": "edges.mp4"
  },
  "seg": {
    "control_weight": 0.9,
    "control_path": "segmentation.mp4",
    "mask_prompt": "battered orange safety cone"
  },
  "depth": {
    "control_weight": 0.9,
    "control_path": "depth.mp4",
    "mask_path": "mask_inverted.mp4"
  }
}
```
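
The recipe table lists one prompt file per condition. One way to batch them is to reuse a single control template and swap only the name and the prompt. The sketch below assumes a JSONL layout of one config per line, takes the control videos from `simulation_data/`, and reuses the weights from the photorealistic example purely as placeholders, since the per-condition settings are still marked TODO. `build_condition_configs` and the `omniverse_<condition>` naming are hypothetical.

```python
import copy
import json

# Placeholder template: weights copied from the photorealistic example, not tuned per condition.
TEMPLATE = {
    "video_path": "simulation_data/simulator_rgb_input.mp4",
    "guidance": 7,
    "edge": {"control_weight": 1.0, "control_path": "simulation_data/simulator_edge.mp4"},
    "seg": {"control_weight": 0.9, "control_path": "simulation_data/simulator_segmentation.mp4"},
    "depth": {"control_weight": 0.9, "control_path": "simulation_data/simulator_depth.mp4"},
}

# Condition names match the prompt files referenced in the recipe table.
CONDITIONS = ["fog", "morning_sun", "night", "rain", "snow", "wooden_road"]


def build_condition_configs(conditions, template=TEMPLATE):
    """Return one config per condition, differing only in name and prompt path."""
    configs = []
    for condition in conditions:
        config = copy.deepcopy(template)  # avoid sharing nested dicts between configs
        config["name"] = f"omniverse_{condition}"
        config["prompt_path"] = f"simulation_data/{condition}.txt"
        configs.append(config)
    return configs


jsonl_text = "\n".join(json.dumps(c) for c in build_condition_configs(CONDITIONS))
print(jsonl_text.splitlines()[0])
```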


In [None]:
# Run CT2.5
# WARNING: This should take a couple of minutes
# - Single GPU: python examples/inference.py -i scripts/omniverse_av_configs.jsonl -o outputs/omniverse_generations_av
# - Multi GPU:  torchrun --nproc_per_node=8 --master_port=12341 examples/inference.py -i scripts/omniverse_av_configs.jsonl -o outputs/omniverse_generations_av

!torchrun --nproc_per_node=8 --master_port=12341 examples/inference.py -i scripts/omniverse_av_configs.jsonl -o outputs/omniverse_generations_av

## Prompt Generator for Scene Conditions
This module provides a configurable system for automatically generating natural-language prompts based on selected environmental, weather, and road-surface conditions. It is designed for data generation, augmentation workflows, or any pipeline where you want consistent, high-quality scene descriptions without manually rewriting prompts.

#### How It Works

The system uses:
- A SceneConfig dataclass
- Three condition dictionaries:
    - ENV_LIGHTING
    - WEATHER
    - ROAD_SURFACE
- A single function: generate_prompt(config)

It takes your base scene, inserts the selected conditions, and returns a polished final prompt. 

#### Code Structure:

```python
from dataclasses import dataclass
from typing import Optional, List

ENV_LIGHTING = { ... }
WEATHER = { ... }
ROAD_SURFACE = { ... }

@dataclass
class SceneConfig:
    base_scene: str
    env_lighting: Optional[str] = None
    weather: Optional[str] = None
    road_surface: Optional[str] = None
    extra_tags: Optional[List[str]] = None

def generate_prompt(config: SceneConfig) -> str:
    parts = [config.base_scene.strip()]
    if config.env_lighting: parts.append(f"The scene is {ENV_LIGHTING[config.env_lighting]}.")
    if config.weather: parts.append(WEATHER[config.weather])
    if config.road_surface: parts.append(ROAD_SURFACE[config.road_surface])
    parts.append("All visual elements should be consistent with these conditions.")
    return " ".join(p for p in parts if p)
```


You can find the full codebase at [src/prompt_generation.py](src/prompt_generation.py)

#### Example:

```python
config = SceneConfig(
    base_scene="A busy urban intersection with multiple vehicles.",
    env_lighting="sunrise",
    weather="fog",
    road_surface="wooden"
)

print(generate_prompt(config))
```

Output:
```
A busy urban intersection with multiple vehicles.
The scene is bathed in warm morning light.
A layer of fog softens distant structures.
The road surface is made of wooden planks.
All visual elements should be consistent with these conditions.
```
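
To run the example end to end without the full module, the three dictionaries can be stubbed with just the entries the example uses. The values below are reconstructed from the output shown above and are assumptions. The real tables live in `src/prompt_generation.py` and may differ. Note that with `" ".join`, the sentences come back as a single line, whereas the listing above shows them wrapped one per line.

```python
from dataclasses import dataclass
from typing import List, Optional

# Stub tables: only the keys used below, with values inferred from the example output.
ENV_LIGHTING = {"sunrise": "bathed in warm morning light"}
WEATHER = {"fog": "A layer of fog softens distant structures."}
ROAD_SURFACE = {"wooden": "The road surface is made of wooden planks."}


@dataclass
class SceneConfig:
    base_scene: str
    env_lighting: Optional[str] = None
    weather: Optional[str] = None
    road_surface: Optional[str] = None
    extra_tags: Optional[List[str]] = None


def generate_prompt(config: SceneConfig) -> str:
    parts = [config.base_scene.strip()]
    if config.env_lighting:
        parts.append(f"The scene is {ENV_LIGHTING[config.env_lighting]}.")
    if config.weather:
        parts.append(WEATHER[config.weather])
    if config.road_surface:
        parts.append(ROAD_SURFACE[config.road_surface])
    parts.append("All visual elements should be consistent with these conditions.")
    return " ".join(p for p in parts if p)


prompt = generate_prompt(SceneConfig(
    base_scene="A busy urban intersection with multiple vehicles.",
    env_lighting="sunrise",
    weather="fog",
    road_surface="wooden",
))
print(prompt)
```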

## Additional Recipes
Didn't find what you were looking for? There are many more examples in the [Cosmos Cookbook](https://nvidia-cosmos.github.io/cosmos-cookbook/)!