This project implements a video frame prediction and editing model based on InstructPix2Pix and ControlNet. It is designed to work with the Something-Something V2 (SSv2) dataset.
The model takes a sequence of history frames (e.g., 20 frames) and a text prompt (e.g., "Moving something down") to predict the next frame (Target). It utilizes a Temporal Adapter to aggregate historical context and feeds it into a ControlNet that conditions a frozen InstructPix2Pix UNet.
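At a high level, the Temporal Adapter collapses the history clip into a conditioning signal that the ControlNet consumes. The sketch below is illustrative only; the class name, layer choices, and shapes are assumptions, not the actual `model_control_v2.py` implementation:

```python
import torch
import torch.nn as nn

class TemporalAdapterSketch(nn.Module):
    """Illustrative sketch: collapse a (B, T, C, H, W) history clip into a
    single conditioning map that a ControlNet-style branch can consume.
    The real adapter in model_control_v2.py may differ substantially."""
    def __init__(self, in_channels=3, hidden=64, out_channels=3, num_frames=20):
        super().__init__()
        self.proj = nn.Conv2d(in_channels * num_frames, hidden, 3, padding=1)
        self.out = nn.Conv2d(hidden, out_channels, 3, padding=1)

    def forward(self, history):              # history: (B, T, C, H, W)
        b, t, c, h, w = history.shape
        x = history.reshape(b, t * c, h, w)  # stack frames along the channel axis
        return self.out(torch.relu(self.proj(x)))  # conditioning map for the ControlNet
```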
Project layout:

```
.
├── config.yaml          # Main configuration file (create this from the template below)
├── data.py              # SSv2 dataset loader & transforms
├── model_control_v2.py  # Architecture: TemporalAdapter + IP2P_ControlNet
├── preprocess.py        # Data extraction: video to frames + metadata CSV
├── sample.py            # Selects tasks/IDs from the raw SSv2 JSON labels
├── train_control.py     # Main training script (ControlNet + Adapter + LoRA)
├── test_control_v2.py   # Quantitative evaluation (PSNR/SSIM)
├── test_vis.py          # Qualitative evaluation (visual generation)
└── utils_config.py      # Helper to load config.yaml
```
Install the required Python packages. It is recommended to use a virtual environment (Conda or venv).
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install diffusers transformers peft accelerate
pip install pandas numpy pillow decord scikit-image matplotlib tqdm pyyaml
```

This project is built on timbrooks/instruct-pix2pix (mirrored as AI-ModelScope/instruct-pix2pix on ModelScope). First, download the pretrained InstructPix2Pix weights from Hugging Face, then place them in a local folder (e.g., `models/instruct-pix2pix`) and update your `config.yaml`:
```yaml
base_model_path: "./models/instruct-pix2pix"
```
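If you prefer to script the download, here is a minimal sketch using the `diffusers` pipeline class (it assumes network access to the Hugging Face Hub; the local output path is just an example):

```python
from diffusers import StableDiffusionInstructPix2PixPipeline

# Download the pretrained InstructPix2Pix weights and save them locally so that
# base_model_path in config.yaml can point to this folder.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained("timbrooks/instruct-pix2pix")
pipe.save_pretrained("./models/instruct-pix2pix")
```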
This project uses the Something-Something V2 (SSv2) dataset. The first step filters video IDs from the raw SSv2 JSON labels according to the selected tasks (e.g., `move_object`, `drop_object`).
- Modify the `task_plan` dictionary in `sample.py` to define the tasks (see the illustrative sketch after this list).
- Run: `python sample.py`
- Generated file: `selected_samples_train_test_small.json`
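The exact schema of `task_plan` is defined in `sample.py`; the snippet below is a hypothetical illustration only (the keys, sample counts, and label templates are assumptions and may not match the script):

```python
# Hypothetical example only; match the actual structure expected by sample.py.
task_plan = {
    "drop_object": {
        # SSv2 label templates to keep for this task (illustrative selection).
        "templates": [
            "Dropping something onto something",
            "Dropping something in front of something",
        ],
        # How many clips to sample for training and testing (illustrative values).
        "num_train": 500,
        "num_test": 50,
    },
}
```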
Next, segment each video into history frames and a target frame, and generate the metadata CSVs.

- Ensure that `config.yaml` (or `VIDEO_DIR` in the script) points to your SSv2 video folder.
- Run: `python preprocess.py`
- Output directory structure:

```
processed_dataset/
├── train/
│   └── drop_object/
│       ├── history_images/   (history frames)
│       └── target_images/    (ground-truth frames)
├── metadata_train.csv
└── metadata_test.csv
```
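`preprocess.py` implements the real extraction logic. As a rough illustration of how frames can be pulled from an SSv2 video with `decord`, consider the sketch below; the function name, file layout, and frame-selection rule are assumptions, not the script's actual behavior:

```python
import os
import decord
from PIL import Image

def extract_clip(video_path, out_dir, num_history=20):
    """Illustrative sketch: save the num_history frames preceding the final
    frame as history images, and the final frame as the target image."""
    os.makedirs(out_dir, exist_ok=True)
    vr = decord.VideoReader(video_path)

    target_idx = len(vr) - 1
    history_idx = range(max(0, target_idx - num_history), target_idx)

    for i, idx in enumerate(history_idx):
        Image.fromarray(vr[idx].asnumpy()).save(f"{out_dir}/history_{i:02d}.png")
    Image.fromarray(vr[target_idx].asnumpy()).save(f"{out_dir}/target.png")
```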
All parameters are managed in `config.yaml`. Before running, check that `task_name` and the paths are set correctly.
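`utils_config.py` provides the config loader used by the scripts. For a quick sanity check of your settings you can also read the file directly with PyYAML; the keys printed below are just the ones referenced in this README:

```python
import yaml

# Quick sanity check of config.yaml before training or testing.
with open("config.yaml", "r") as f:
    cfg = yaml.safe_load(f)

# Keys mentioned elsewhere in this README; adjust to your actual config schema.
for key in ("task_name", "base_model_path", "checkpoint_folder"):
    print(key, "=", cfg.get(key))
```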
Start training with `train_control.py`; the script automatically loads its parameters from `config.yaml`.

```bash
python train_control.py
```

Training process overview:
- Load the IP2P UNet and VAE; the VAE and text encoder are kept frozen.
- UNet: fine-tuned with LoRA.
- ControlNet: initialized from a copy of the UNet weights and fully fine-tuned (see the setup sketch after this list).
- Temporal Adapter: processes the past 20 frames and outputs conditioning features to the ControlNet.
- Weights are stored in the `experiments/{task_name}_{resolution}/` directory.
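A condensed sketch of this trainable/frozen split using `diffusers` and `peft` is shown below; the LoRA hyperparameters are assumptions, and the wiring of the Temporal Adapter output into the ControlNet (handled by `model_control_v2.py`) is omitted:

```python
from diffusers import ControlNetModel, StableDiffusionInstructPix2PixPipeline
from peft import LoraConfig

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained("./models/instruct-pix2pix")
unet, vae, text_encoder = pipe.unet, pipe.vae, pipe.text_encoder

# Freeze the VAE, text encoder, and the UNet base weights.
for module in (vae, text_encoder, unet):
    module.requires_grad_(False)

# UNet: attach LoRA adapters; only the newly added LoRA weights are trainable.
unet.add_adapter(
    LoraConfig(r=8, lora_alpha=8, target_modules=["to_q", "to_k", "to_v", "to_out.0"])
)

# ControlNet: initialized from a copy of the UNet weights and fully fine-tuned.
controlnet = ControlNetModel.from_unet(unet)
controlnet.requires_grad_(True)
```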
Use `test_control_v2.py` or `test_vis.py` for inference and evaluation.

```bash
# Metrics only (PSNR/SSIM)
python test_control_v2.py

# Metrics + visualization (recommended)
python test_vis.py
```

Note:
- Ensure that the `checkpoint_folder` (e.g., `checkpoint_epoch_25`) set in `config.yaml` exists in the experiment directory.
- The test script generates a side-by-side comparison of `input` vs `prediction` vs `ground_truth`.
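For reference, PSNR and SSIM can be computed with scikit-image, which is already in the dependency list; the standalone example below uses placeholder image paths and may differ in detail from what `test_control_v2.py` actually does:

```python
import numpy as np
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Placeholder paths; in practice these come from the test script's outputs.
pred = np.array(Image.open("prediction.png").convert("RGB"))
gt = np.array(Image.open("ground_truth.png").convert("RGB"))

psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
print(f"PSNR: {psnr:.2f} dB  SSIM: {ssim:.4f}")
```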