
Temporal IP2P-ControlNet

This project implements a video frame prediction and editing model based on InstructPix2Pix and ControlNet. It is designed to work with the Something-Something V2 (SSv2) dataset.

The model takes a sequence of history frames (e.g., 20 frames) and a text prompt (e.g., "Moving something down") to predict the next frame (the target). A Temporal Adapter aggregates the historical context and feeds it into a ControlNet, which conditions the InstructPix2Pix UNet (frozen except for LoRA adapters).
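
A minimal sketch of the idea behind the Temporal Adapter, assuming the history frames are stacked along the channel axis before fusion (the repo's actual module in model_control_v2.py may differ):

import torch
import torch.nn as nn

class TemporalAdapterSketch(nn.Module):
    # Illustrative only: encode each history frame, fold time into
    # channels, and fuse down to one conditioning map for ControlNet.
    def __init__(self, in_ch=3, hidden=64, out_ch=3, num_frames=20):
        super().__init__()
        self.encode = nn.Conv2d(in_ch, hidden, 3, padding=1)
        self.fuse = nn.Conv2d(hidden * num_frames, out_ch, 1)

    def forward(self, frames):  # frames: (B, T, C, H, W)
        b, t, c, h, w = frames.shape
        feats = self.encode(frames.view(b * t, c, h, w))
        feats = feats.view(b, t * feats.shape[1], h, w)  # time -> channels
        return self.fuse(feats)  # (B, out_ch, H, W) control signal

history = torch.randn(1, 20, 3, 256, 256)
print(TemporalAdapterSketch()(history).shape)  # torch.Size([1, 3, 256, 256])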

📂 Project Structure

.
├── config.yaml            # Main configuration file (Create this based on template below)
├── data.py                # SSv2 Dataset loader & transforms
├── model_control_v2.py    # Architecture: TemporalAdapter + IP2P_ControlNet
├── preprocess.py          # Data extraction: Video to Frames + Metadata CSV
├── sample.py              # Selects tasks/IDs from raw SSv2 JSON labels
├── train_control.py       # Main training script (ControlNet + Adapter + LoRA)
├── test_control_v2.py     # Quantitative evaluation (PSNR/SSIM)
├── test_vis.py            # Qualitative evaluation (Visual generation)
└── utils_config.py        # Helper to load config.yaml

🛠 Requirements

Install the required Python packages. It is recommended to use a virtual environment (Conda or venv).

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install diffusers transformers peft accelerate
pip install pandas numpy pillow decord scikit-image matplotlib tqdm pyyaml
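
A quick sanity check that the core packages import and a GPU is visible:

import torch, diffusers, peft, decord
print(torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("diffusers", diffusers.__version__, "| peft", peft.__version__)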

📥 Model Weights Download

This project is based on timbrooks/instruct-pix2pix (mirrored as AI-ModelScope/instruct-pix2pix on ModelScope). First, download the pretrained InstructPix2Pix weights from Hugging Face, then place them in a local folder (e.g., models/instruct-pix2pix) and update your config.yaml:

base_model_path: "./models/instruct-pix2pix"
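
One way to fetch the weights is via huggingface_hub (installed as a dependency of diffusers/transformers); the target folder here matches the config entry above:

from huggingface_hub import snapshot_download

# Download the pretrained InstructPix2Pix weights into the local folder
snapshot_download(
    repo_id="timbrooks/instruct-pix2pix",
    local_dir="./models/instruct-pix2pix",
)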

📂 Data Preparation

This project uses the Something-Something V2 dataset.

Step 1: Filter the samples (sample.py)

Filter video IDs from the raw SSv2 JSON labels based on the specified tasks (e.g., move_object, drop_object).

  • Modify the task_plan dictionary in sample.py to define the tasks (see the sketch after this list).

  • Run:

    python sample.py
  • Output file: selected_samples_train_test_small.json
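
A guess at the shape of task_plan, mapping task names to the SSv2 label templates they cover (the real keys and template strings depend on which tasks you want to keep):

# Hypothetical structure -- adjust to the SSv2 templates you need.
task_plan = {
    "move_object": ["Moving something up", "Moving something down"],
    "drop_object": ["Dropping something onto something"],
}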

Step 2: Video Preprocessing (preprocess.py)

Segment each video into history frames and a target frame, and generate metadata CSVs.

  • Ensure that the video path in config.yaml (or VIDEO_DIR in the script) points to your SSv2 video folder.

  • Run:

    python preprocess.py
  • Output directory structure:

    processed_dataset/
    ├── train/
    │   └── drop_object/
    │       ├── history_images/ (history frame sequence)
    │       └── target_images/  (ground-truth frames)
    ├── metadata_train.csv
    └── metadata_test.csv
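
The frame extraction itself can be done with decord (listed in the requirements). A sketch under the assumption that the first 21 frames are taken contiguously, 20 as history and 1 as target; the repo's actual sampling strategy lives in preprocess.py:

import os
from decord import VideoReader
from PIL import Image

os.makedirs("history_images", exist_ok=True)
os.makedirs("target_images", exist_ok=True)

vr = VideoReader("path/to/video.webm")  # SSv2 clips are WebM
frames = [Image.fromarray(vr[i].asnumpy()) for i in range(21)]

history, target = frames[:20], frames[20]  # 20 history frames + 1 target
for i, frame in enumerate(history):
    frame.save(f"history_images/frame_{i:02d}.png")
target.save("target_images/target.png")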
    

🚀 How to Run

1. Configuration file

All parameters are managed in config.yaml.

Before running, check that task_name and all paths are set correctly.
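
Since config.yaml must be created by hand, here is an illustrative template. Only base_model_path, task_name, resolution, and checkpoint_folder are named in this README; the remaining keys are assumptions to cross-check against utils_config.py and the scripts:

base_model_path: "./models/instruct-pix2pix"
task_name: "drop_object"
resolution: 256
video_dir: "./ssv2/videos"              # assumed key; see preprocess.py
data_root: "./processed_dataset"        # assumed key
num_history_frames: 20                  # assumed key
checkpoint_folder: "checkpoint_epoch_25"   # used at test time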

2. Training

Start training using train_control.py. The script will automatically load the parameters from config.yaml.

python train_control.py

Training process overview:

  • Loads the IP2P UNet and VAE (the VAE and text encoder are frozen).

  • UNet: fine-tuned with LoRA.

  • ControlNet: initialized from a copy of the UNet weights, then fully fine-tuned (see the sketch after this list).

  • Temporal Adapter: aggregates the 20 history frames and outputs conditioning features for the ControlNet.

  • Checkpoints are saved under experiments/{task_name}_{resolution}/.
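
A sketch of how the LoRA and ControlNet pieces are typically wired up with diffusers and peft (the hyperparameters and target modules here are illustrative; train_control.py holds the real values):

from diffusers import ControlNetModel, UNet2DConditionModel
from peft import LoraConfig

# Frozen base UNet from the local InstructPix2Pix weights
unet = UNet2DConditionModel.from_pretrained(
    "./models/instruct-pix2pix", subfolder="unet"
)
unet.requires_grad_(False)

# ControlNet initialized as a copy of the UNet encoder, fully trainable
controlnet = ControlNetModel.from_unet(unet)

# LoRA adapters on the UNet attention projections (illustrative config)
unet.add_adapter(LoraConfig(
    r=8,
    lora_alpha=8,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
))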

3. Testing

Use test_control_v2.py or test_vis.py for inference and evaluation.

# Metrics only (PSNR/SSIM)
python test_control_v2.py
# Metrics + visualizations (recommended)
python test_vis.py

Note:

  • Ensure that the checkpoint_folder (e.g., checkpoint_epoch_25) set in config.yaml exists in the experiments directory.

  • The test script generates side-by-side comparisons of input vs. prediction vs. ground truth.
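
For reference, PSNR/SSIM as typically computed with scikit-image (file names here are placeholders; the actual evaluation loop is in test_control_v2.py):

import numpy as np
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

pred = np.asarray(Image.open("prediction.png").convert("RGB"))
gt = np.asarray(Image.open("ground_truth.png").convert("RGB"))

print("PSNR:", peak_signal_noise_ratio(gt, pred, data_range=255))
print("SSIM:", structural_similarity(gt, pred, channel_axis=-1, data_range=255))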

