# Image to Video Generation: A Demo Notebook

This Jupyter Notebook provides a runnable implementation of the end-to-end **Video Virtual Try-On (VVT)** pipeline detailed in the provided technical report, *'Architecting a Web-Based Video Virtual Try-On Demonstrator'*. 

The notebook chains several open-source models (vid2densepose, Segment Anything, the Self-Correction Human Parsing model used with OOTDiffusion, and ViViD) to take a video of a person and an image of a garment and produce a new video of that person wearing the garment. The core generative model is **ViViD**, as selected in the report.

### ⚠️ **System Requirements**
This is a computationally intensive pipeline that requires specific hardware and significant storage for model weights.
- **GPU**: A powerful NVIDIA GPU with **at least 24 GB of VRAM** is strongly recommended. The process may fail on GPUs with less memory.
- **RAM**: 32 GB of system RAM or more.
- **Storage**: ~50 GB of free disk space for repositories and model checkpoints.
- **OS**: This notebook is intended for a Linux environment with `git` and `wget` installed.
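
The requirements above can be checked before any large download starts. The following is a minimal sketch, not part of the original pipeline: the helper names `preflight` and `gpu_vram_gb` are hypothetical, the 50 GB threshold comes from the list above, and the VRAM query only works once PyTorch is installed.

```python
import shutil

def preflight(min_free_gb=50, tools=("git", "wget"), path="."):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"'{tool}' was not found on PATH")
    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free, {min_free_gb} GB recommended")
    return problems

def gpu_vram_gb():
    """Return total VRAM of GPU 0 in GB, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch  # imported lazily: it may not be installed yet
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1024**3

for problem in preflight():
    print("WARNING:", problem)
vram = gpu_vram_gb()
print("GPU VRAM (GB):", "unknown" if vram is None else f"{vram:.1f}")
```

A reported value below 24 GB does not guarantee failure, but it is a signal to expect out-of-memory errors in the later stages.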

## Step 1: Environment Setup

First, we clone the necessary open-source repositories. The pipeline is a complex system that relies on code from multiple projects.

In [None]:
print("Cloning required repositories...")
# Clone the main ViViD repository for the core VTON model
!git clone https://github.com/alibaba-yuanjing-aigclab/ViViD.git

# Clone vid2densepose for human pose extraction
!git clone https://github.com/Flode-Labs/vid2densepose.git

# Clone Segment Anything Model (SAM) for garment masking
!git clone https://github.com/facebookresearch/segment-anything.git

# Clone OOTDiffusion and its recommended human parser for the clothing-agnostic step
!git clone https://github.com/levihsu/OOTDiffusion.git
!git clone https://github.com/GoGoDuck912/Self-Correction-Human-Parsing.git
print("Repositories cloned successfully.")
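
Re-running the clone cell makes `git clone` complain when a target directory already exists. One possible way to make the step repeatable is sketched below; `repo_dir_name` and `clone_if_missing` are hypothetical helpers, and the sketch assumes the default directory name that `git clone` derives from the URL.

```python
import subprocess
from pathlib import Path

def repo_dir_name(url):
    """Derive the directory name git would use for a clone URL."""
    name = url.rstrip("/").rsplit("/", 1)[-1]
    return name[:-4] if name.endswith(".git") else name

def clone_if_missing(url, parent="."):
    """Clone `url` into `parent` unless the target directory already exists."""
    target = Path(parent) / repo_dir_name(url)
    if target.exists():
        print(f"Skipping {target} (already present)")
        return target
    subprocess.run(["git", "clone", url, str(target)], check=True)
    return target
```

For example, `clone_if_missing("https://github.com/facebookresearch/segment-anything.git")` would clone on the first run and skip on later runs. It does not detect a partial or corrupted checkout.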

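Next, we install the Python dependencies listed by each cloned repository. These projects pin their own versions of shared packages, so installing them into one environment may cause version conflicts; starting from a fresh virtual environment is advisable.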

In [None]:
print("Installing dependencies...")
%pip install -r ViViD/requirements.txt
%pip install -r vid2densepose/requirements.txt
# Install Detectron2, a dependency for vid2densepose
%pip install "git+https://github.com/facebookresearch/detectron2.git"
# Install Segment Anything
%pip install -e segment-anything/
# Install OOTDiffusion dependencies
%pip install -r OOTDiffusion/requirements.txt
# Install additional required libraries
%pip install pyyaml
print("Dependencies installed successfully.")
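
Because `%pip` prints errors without stopping the notebook, a quick import check can confirm that the key packages actually landed in the kernel's environment. This is a sketch only; the module list is an assumption based on the imports used later in this notebook, and `missing_modules` is a hypothetical helper.

```python
import importlib.util

def missing_modules(names):
    """Return the subset of top-level module `names` that the import system cannot locate."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Assumed list: the top-level modules this notebook or its scripts rely on.
required = ["torch", "cv2", "yaml", "detectron2", "segment_anything"]
print("Missing modules:", missing_modules(required) or "none")
```

The check only confirms that a module can be found, not that its version is compatible with the others.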

## Step 2: Download Pre-trained Model Weights

This pipeline requires several large pre-trained model files (checkpoints). We will create directories for them and then download the files.

**Note:** Some download links, particularly for the main ViViD model, may change. Always refer to the official GitHub repositories for the most up-to-date links if a download fails.

In [None]:
import os

print("Creating directories for checkpoints...")
os.makedirs("ViViD/checkpoints", exist_ok=True)
os.makedirs("vid2densepose/checkpoints", exist_ok=True)
os.makedirs("segment-anything/checkpoints", exist_ok=True)
os.makedirs("OOTDiffusion/checkpoints", exist_ok=True)
os.makedirs("Self-Correction-Human-Parsing/checkpoints", exist_ok=True)

print("Downloading model weights...")

# --- ViViD Checkpoints ---
# Please download the ViViD checkpoints from their official repository or Hugging Face page
# and place them in the 'ViViD/checkpoints/' directory.
print("ACTION REQUIRED: Manually download ViViD weights into ViViD/checkpoints/")

# --- vid2densepose Checkpoint ---
# This is the model for DensePose extraction.
!wget -P vid2densepose/checkpoints/ https://dl.fbaipublicfiles.com/densepose/densepose_rcnn_R_50_FPN_s1x/165712039/model_final_162be9.pkl

# --- Segment Anything (SAM) Checkpoint ---
# Using the ViT-H SAM model.
!wget -P segment-anything/checkpoints/ https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

# --- Human Parsing Checkpoint (for Agnostic Video) ---
# This model is used by OOTDiffusion's preprocessing steps.
!wget -P Self-Correction-Human-Parsing/checkpoints/ https://github.com/GoGoDuck912/Self-Correction-Human-Parsing/releases/download/eccv2022/exp-schp-201908261155-lip.pth

print("\nDownloads complete. Please ensure you manually downloaded the ViViD weights.")
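
A failed `wget` can leave a missing or truncated file behind, which only surfaces much later as a model-loading error. The sketch below checks that the three automatically downloaded checkpoints exist and are not suspiciously small. The file names are taken from the download URLs above, while `check_checkpoints` and the 1 MB threshold are illustrative assumptions.

```python
import os

def check_checkpoints(paths, min_bytes=1_000_000):
    """Map each path to 'ok', 'missing', or 'too small' (possibly a failed download)."""
    status = {}
    for path in paths:
        if not os.path.isfile(path):
            status[path] = "missing"
        elif os.path.getsize(path) < min_bytes:
            status[path] = "too small"
        else:
            status[path] = "ok"
    return status

expected = [
    "vid2densepose/checkpoints/model_final_162be9.pkl",
    "segment-anything/checkpoints/sam_vit_h_4b8939.pth",
    "Self-Correction-Human-Parsing/checkpoints/exp-schp-201908261155-lip.pth",
]
for path, state in check_checkpoints(expected).items():
    print(f"{state:>9}  {path}")
```

The manually downloaded ViViD weights are not listed because their file names depend on the official release.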

## Step 3: Configuration

Set the paths for your input files and the directory for the final output.

In [None]:
# --- USER INPUTS ---

# Path to the input video of the person
# IMPORTANT: The person video should be a short clip (e.g., 5-10 seconds).
person_video_path = "./path/to/your/person_video.mp4" 

# Path to the input image of the garment
garment_image_path = "./path/to/your/garment_image.png"

# Directory to save all intermediate and final results
output_dir = "./vton_results"

# --- END USER INPUTS ---

os.makedirs(output_dir, exist_ok=True)

# Define paths for all intermediate files
densepose_video_path = os.path.join(output_dir, "densepose.mp4")
garment_mask_path = os.path.join(output_dir, "garment_mask.png")
agnostic_video_path = os.path.join(output_dir, "agnostic.mp4")
final_video_path = os.path.join(output_dir, "result.mp4")

print(f"Input Person Video: {person_video_path}")
print(f"Input Garment Image: {garment_image_path}")
print(f"Output Directory: {output_dir}")
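
Since the person video should be a short clip, it can help to measure its duration before starting the expensive stages. The following sketch assumes OpenCV is installed (it is imported later in this notebook); the function names and the 10-second limit are illustrative, the latter taken from the guidance in the cell above.

```python
def clip_duration_seconds(frame_count, fps):
    """Duration in seconds from a frame count and frame rate; None if fps is not positive."""
    if fps <= 0:
        return None
    return frame_count / fps

def probe_video(path):
    """Return (frame_count, fps) as reported by OpenCV."""
    import cv2  # imported lazily so clip_duration_seconds has no dependencies
    cap = cv2.VideoCapture(path)
    if not cap.isOpened():
        raise FileNotFoundError(f"Could not open video: {path}")
    try:
        return cap.get(cv2.CAP_PROP_FRAME_COUNT), cap.get(cv2.CAP_PROP_FPS)
    finally:
        cap.release()

def warn_if_long(path, max_seconds=10.0):
    """Print a warning when the clip exceeds `max_seconds`; return the duration."""
    duration = clip_duration_seconds(*probe_video(path))
    if duration is not None and duration > max_seconds:
        print(f"WARNING: {path} is {duration:.1f}s long; consider trimming it to {max_seconds:.0f}s or less.")
    return duration
```

Calling `warn_if_long(person_video_path)` after setting the inputs gives an early signal. Frame counts reported by container metadata can be approximate, so treat the result as an estimate.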

## Step 4: The VTON Pipeline
Here, we define Python functions to encapsulate each stage of the pipeline described in the report. This automates the series of command-line calls and script executions.
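
The stage functions launch external scripts with a bare `"python"` command, which may resolve to a different interpreter than the one running this kernel. A minimal sketch of an alternative launcher is shown below; `build_command` and `run_script` are hypothetical helpers, not part of any of the cloned projects.

```python
import subprocess
import sys

def build_command(script, *args):
    """Build an argv list that runs `script` with the interpreter backing this kernel."""
    return [sys.executable, script, *map(str, args)]

def run_script(script, *args, cwd=None):
    """Run a script and return its stdout; on failure, print the tail of stderr and raise."""
    result = subprocess.run(
        build_command(script, *args), cwd=cwd, capture_output=True, text=True
    )
    if result.returncode != 0:
        print(result.stderr[-2000:])  # only the end of a long traceback
        result.check_returncode()     # raises subprocess.CalledProcessError
    return result.stdout
```

Using `sys.executable` keeps the subprocess in the environment where the dependencies were installed. The optional `cwd` argument is useful for scripts that expect to be run from their own repository directory.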

In [None]:
import subprocess
import cv2
import numpy as np
import yaml
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

def generate_densepose_video(input_video, output_video):
    """Runs the vid2densepose script to extract human pose."""
    print("\n--- Stage 1: Generating DensePose Video ---")
    vid2densepose_script = "vid2densepose/main.py"
    # Note: only the input and output paths are passed here. This assumes main.py
    # resolves its own DensePose model config and weights; check the script if it
    # does not pick up the checkpoint downloaded in Step 2.
    command = [
        "python", vid2densepose_script,
        "-i", input_video,
        "-o", output_video
    ]
    subprocess.run(command, check=True)
    print("✅ DensePose video generated successfully.")

def generate_garment_mask(input_image, output_mask):
    """Uses Segment Anything Model (SAM) to isolate the garment."""
    print("\n--- Stage 2: Generating Garment Mask ---")
    sam_checkpoint = "segment-anything/checkpoints/sam_vit_h_4b8939.pth"
    model_type = "vit_h"
    device = "cuda"

    sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
    sam.to(device=device)
    mask_generator = SamAutomaticMaskGenerator(sam)

    image = cv2.imread(input_image)
    if image is None:
        raise FileNotFoundError(f"Could not read image file: {input_image}")
    image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    masks = mask_generator.generate(image_rgb)

    # SAM returns a mask for every region it finds, which can include the background.
    if not masks:
        raise ValueError("SAM did not find any objects in the garment image.")

    # Simple heuristic: select the largest mask. If the background is the largest
    # region, this picks the wrong mask, so inspect the saved result.
    sorted_masks = sorted(masks, key=lambda m: m['area'], reverse=True)
    best_mask = sorted_masks[0]['segmentation']
    
    # Create a binary image (0 or 255) and save it
    binary_mask_img = (best_mask * 255).astype(np.uint8)
    cv2.imwrite(output_mask, binary_mask_img)
    print("✅ Garment mask generated successfully.")

def generate_agnostic_video(input_video, output_video):
    """
    Generates the clothing-agnostic video.

    This step involves human parsing and inpainting; the ViViD paper recommends
    the method from OOTDiffusion. The function calls a helper script expected
    to live in the OOTDiffusion checkout.

    NOTE: You may need to create or adapt a script in the OOTDiffusion
    repository to perform this video-based task. If the script is missing,
    the function falls back to copying the input video.
    """
    print("\n--- Stage 3: Generating Clothing-Agnostic Video ---")
    # This assumes a script exists at 'OOTDiffusion/run_agnostic_video.py'
    # that takes an input video, uses the human_parser, and creates an inpainted video.
    agnostic_script = "OOTDiffusion/run_agnostic_video.py" # You might need to create this script.
    
    if not os.path.exists(agnostic_script):
        print("⚠️ WARNING: Agnostic video generation script not found.")
        print(f"This step is being SKIPPED. You must implement '{agnostic_script}' using OOTDiffusion's human parsing and inpainting logic.")
        # As a fallback, we will just copy the original video.
        # The final result will be poor, but it allows the pipeline to complete.
        import shutil
        shutil.copyfile(input_video, output_video)
        return
        
    command = [
        "python", agnostic_script,
        "--input_video", input_video,
        "--output_video", output_video,
        "--parser_checkpoint", "Self-Correction-Human-Parsing/checkpoints/exp-schp-201908261155-lip.pth"
    ]
    subprocess.run(command, check=True)
    print("✅ Agnostic video generated successfully.")

def run_vivid_inference(config_path, person_video, densepose_video, garment_image, garment_mask, final_output_dir):
    """Runs the final ViViD model inference."""
    print("\n--- Stage 4: Running ViViD Inference ---")
    
    # 4a. Modify the ViViD config file dynamically
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)
    
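    # Update paths in the config.
    # NOTE: the key names below are assumptions about the template's layout;
    # compare them with the keys in `config_path` and adjust them if they differ.
    config['test_cases']['upper_1']['garment_path'] = os.path.abspath(garment_image)
    config['test_cases']['upper_1']['garment_mask_path'] = os.path.abspath(garment_mask)
    # `person_video` is the clothing-agnostic video (or the original video if Stage 3 is skipped)
    config['test_cases']['upper_1']['video_path'] = os.path.abspath(person_video)
    config['test_cases']['upper_1']['pose_video'] = os.path.abspath(densepose_video)
    config['out_dir'] = os.path.abspath(final_output_dir)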
    
    # Save the modified config inside the output directory passed to this function
    os.makedirs(final_output_dir, exist_ok=True)
    modified_config_path = os.path.join(final_output_dir, "run_config.yaml")
    with open(modified_config_path, 'w') as f:
        yaml.dump(config, f)
    print(f"Modified config saved to {modified_config_path}")

    # 4b. Run the inference script
    vivid_inference_script = "ViViD/vivid.py"
    command = [
        "python", vivid_inference_script,
        "--config", modified_config_path
    ]
    subprocess.run(command, check=True)
    print(f"✅ Final video generated in directory: {final_output_dir}")
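
The garment-masking stage selects the largest SAM mask, which can be the background rather than the garment. A hedged alternative is sketched below: it prefers the largest mask that covers the image's center pixel, on the assumption that the garment is roughly centered in the photo. `pick_garment_mask` is a hypothetical helper; it relies only on the `area` and `segmentation` fields that `SamAutomaticMaskGenerator.generate` returns for each mask.

```python
def pick_garment_mask(masks, image_w, image_h):
    """Pick the largest mask covering the center pixel, else the largest mask overall.

    Each mask is a dict with an 'area' value and a 'segmentation' grid
    (rows of booleans, e.g. a 2-D NumPy bool array) of size image_h x image_w.
    """
    if not masks:
        raise ValueError("no masks to choose from")
    row, col = image_h // 2, image_w // 2
    by_area = sorted(masks, key=lambda m: m["area"], reverse=True)
    central = [m for m in by_area if m["segmentation"][row][col]]
    return (central or by_area)[0]
```

Inside `generate_garment_mask`, this could replace the sort-and-take-first logic with `h, w = image_rgb.shape[:2]` followed by `best_mask = pick_garment_mask(masks, w, h)['segmentation']`. The heuristic fails for off-center garments, so the saved mask should still be inspected.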

## Step 5: Orchestration and Execution

Now, we'll create a master function to run the entire pipeline in sequence and then execute it.

In [None]:
def run_full_pipeline():
    """Orchestrates the entire VTON pipeline from raw inputs to final video."""
    print("===========================================")
    print("STARTING VIDEO VIRTUAL TRY-ON PIPELINE")
    print("===========================================")

    try:
        # Stage 1: DensePose Extraction
        generate_densepose_video(person_video_path, densepose_video_path)

        # Stage 2: Garment Masking
        generate_garment_mask(garment_image_path, garment_mask_path)

        # Stage 3: Clothing-Agnostic Video
        # For this demo, we assume the agnostic video is the same as the original.
        # Replace this with the real `generate_agnostic_video` call once implemented.
        print("\n--- Stage 3: Generating Clothing-Agnostic Video (SKIPPED) ---")
        print("Using original person video as the agnostic video. For best results, implement this step.")
        agnostic_input_video = person_video_path 
        # generate_agnostic_video(person_video_path, agnostic_video_path)
        # agnostic_input_video = agnostic_video_path

        # Stage 4: Run ViViD Inference
        vivid_config_template = "ViViD/configs/prompts/upper1.yaml"
        vivid_output_dir = os.path.join(output_dir, 'vivid_output')
        run_vivid_inference(
            config_path=vivid_config_template, 
            person_video=agnostic_input_video, 
            densepose_video=densepose_video_path, 
            garment_image=garment_image_path, 
            garment_mask=garment_mask_path, 
            final_output_dir=vivid_output_dir
        )
        
        print("\n===========================================")
        print("🎉 PIPELINE COMPLETED SUCCESSFULLY! 🎉")
        print(f"Check the '{vivid_output_dir}' directory for your video.")
        print("===========================================")

    except FileNotFoundError as e:
        print("\nERROR: A required file was not found. Please check your paths.")
        print(e)
    except subprocess.CalledProcessError as e:
        print("\nERROR: A script failed to execute. See the output above for details.")
        print(e)
    except Exception as e:
        print(f"\nAn unexpected error occurred: {e}")

# --- UNCOMMENT THE LINE BELOW TO RUN THE PIPELINE ---
# run_full_pipeline()
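
After a run, the output directory may contain several files. The sketch below lists any `.mp4` files found under a directory, newest first. `find_videos` is a hypothetical helper, and the assumption that ViViD writes `.mp4` output should be checked against the actual contents of the directory.

```python
from pathlib import Path

def find_videos(directory, pattern="*.mp4"):
    """Return files matching `pattern` anywhere under `directory`, newest first."""
    files = [p for p in Path(directory).rglob(pattern) if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)
```

For example, `find_videos("./vton_results/vivid_output")` would return the most recently written video as the first element, if any exist.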