[Homepage] [Paper] [Code] [Full Dataset] [Challenge]
This repository provides a baseline method for both the ViCo challenge and ViCo Project, including vivid talking head video generation and responsive listening head video generation.
Our code is organized into five groups:

- `Deep3DFaceRecon_pytorch`: used to extract 3DMM coefficients. Mainly from sicxu/Deep3DFaceRecon, modified following RenYurui/PIRender.
- `preprocess`: scripts for making the dataset compatible with our method.
- `vico`: our method proposed in the paper Responsive Listening Head Generation: A Benchmark Dataset and Baseline [arXiv].
- `PIRender`: renders 3DMM coefficients to video. Mainly from RenYurui/PIRender with minor modifications.
- `evaluation`: quantitative analysis of generations, including SSIM, CPBD, PSNR, FID, CSIM, etc.
  - code for CSIM is mainly from deepinsight/insightface
  - code for lip sync evaluation is mainly from joonson/syncnet_python
  - in Challenge 2023, we use cleardusk/3DDFA_V2 to extract landmarks for LipLMD and 3DMM reconstruction
For end-to-end inference, this repo may be useful.
This repo was created largely for the challenge, while the full dataset released in the ViCo Project differs slightly from the challenge data. You can use the following script to convert:

```shell
python convert.py --anno_file path_to_anno_file --video_folder path_to_videos --audio_folder path_to_audios --target_folder path_to_target_dataset
```
- create a workspace

  ```shell
  mkdir vico-workspace
  cd vico-workspace
  ```

- download dataset from this link and unzip `listening_head.zip` to folder `data/`

  ```shell
  unzip listening_head.zip -d data/
  ```

- reorganize the `data/` folder to meet the requirements of PIRender

  ```shell
  mkdir -p data/listening_head/videos/test
  mv data/listening_head/videos/*.mp4 data/listening_head/videos/test
  ```

- clone our code

  ```shell
  git clone https://github.com/dc3ea9f/vico_challenge_baseline.git
  ```
- extract 3D coefficients for videos ([reference])

  - change directory to `vico_challenge_baseline/Deep3DFaceRecon_pytorch/`

    ```shell
    cd vico_challenge_baseline/Deep3DFaceRecon_pytorch/
    ```

  - prepare the environment following this
  - prepare `BFM/` and `checkpoints/` following these instructions
  - extract facial landmarks from videos

    ```shell
    python extract_kp_videos.py \
        --input_dir ../../data/listening_head/videos/ \
        --output_dir ../../data/listening_head/keypoints/ \
        --device_ids 0,1,2,3 \
        --workers 12
    ```

  - extract coefficients for videos

    ```shell
    python face_recon_videos.py \
        --input_dir ../../data/listening_head/videos/ \
        --keypoint_dir ../../data/listening_head/keypoints/ \
        --output_dir ../../data/listening_head/recons/ \
        --inference_batch_size 128 \
        --name=face_recon_feat0.2_augment \
        --epoch=20 \
        --model facerecon
    ```
- extract audio features

  - change directory to `vico_challenge_baseline/preprocess`

    ```shell
    cd ../preprocess
    ```

  - install the Python packages `librosa`, `torchaudio` and `soundfile`
  - extract audio features

    ```shell
    python extract_audio_features.py \
        --input_audio_folder ../../data/listening_head/audios/ \
        --input_recons_folder ../../data/listening_head/recons/ \
        --output_folder ../../data/listening_head/example/features/audio_feats
    ```

  - reorganize video features

    ```shell
    python rearrange_recon_coeffs.py \
        --input_folder ../../data/listening_head/recons/ \
        --output_folder ../../data/listening_head/example/features/video_feats
    ```
- organize data

  - compute mean and std for features

    ```shell
    python statistics_mean_std.py ../../data/listening_head/example/features
    ```

  - organize for training

    ```shell
    mkdir ../../data/listening_head/example/metadata
    cp ../../data/listening_head/train.csv ../../data/listening_head/example/metadata/data.csv
    cd ../vico
    ln -s ../../data/listening_head/example/ ./data
    ```
- train the speaker baseline

  ```shell
  python -m torch.distributed.launch --nproc_per_node 4 --master_port 12345 train.py \
      --batch_size 4 \
      --time_size 90 \
      --max_epochs 500 \
      --lr 0.002 \
      --task speaker \
      --output_path saved/baseline_speaker
  ```

- inference

  ```shell
  python eval.py \
      --batch_size 4 \
      --output_path saved/baseline_speaker_E500 \
      --resume saved/baseline_speaker/checkpoints/Epoch_500.bin \
      --task speaker
  ```

- train the listener baseline

  ```shell
  python -m torch.distributed.launch --nproc_per_node 4 --master_port 22345 train.py \
      --batch_size 4 \
      --time_size 90 \
      --max_epochs 500 \
      --lr 0.002 \
      --task listener \
      --output_path saved/baseline_listener
  ```

- inference

  ```shell
  python eval.py \
      --batch_size 4 \
      --output_path saved/baseline_listener_E500 \
      --resume saved/baseline_listener/checkpoints/Epoch_500.bin \
      --task listener
  ```
- change directory to the renderer

  ```shell
  cd ../PIRender
  ```

- prepare the environment for PIRender following this
- download the trained weights of PIRender following this
- prepare vox lmdb for the speaker results

  ```shell
  python scripts/prepare_vox_lmdb.py \
      --path ../../data/listening_head/videos/ \
      --coeff_3dmm_path ../vico/saved/baseline_speaker_E500/recon_coeffs/ \
      --out ../vico/saved/baseline_speaker_E500/vox_lmdb/
  ```

- render to videos

  ```shell
  python -m torch.distributed.launch --nproc_per_node=1 --master_port 32345 inference.py \
      --config ./config/face_demo.yaml \
      --name face \
      --no_resume \
      --input ../vico/saved/baseline_speaker_E500/vox_lmdb/ \
      --output_dir ./vox_result/baseline_speaker_E500
  ```

- prepare vox lmdb for the listener results

  ```shell
  python scripts/prepare_vox_lmdb.py \
      --path ../../data/listening_head/videos/ \
      --coeff_3dmm_path ../vico/saved/baseline_listener_E500/recon_coeffs/ \
      --out ../vico/saved/baseline_listener_E500/vox_lmdb/
  ```

- render to videos

  ```shell
  python -m torch.distributed.launch --nproc_per_node=1 --master_port 42345 inference.py \
      --config ./config/face_demo.yaml \
      --name face \
      --no_resume \
      --input ../vico/saved/baseline_listener_E500/vox_lmdb/ \
      --output_dir ./vox_result/baseline_listener_E500
  ```
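The stages above pass Deep3DFaceRecon-style 3DMM coefficients between modules. As a rough orientation, here is a numpy sketch of slicing one frame's 257-dimensional coefficient vector; the layout below is Deep3DFaceRecon's usual convention, so verify it against the code and checkpoint you actually run:

```python
import numpy as np

def split_coeff(c):
    """Slice a 257-dim Deep3DFaceRecon coefficient vector into named parts.

    Assumed layout (verify against the checkpoint you use):
    80 identity + 64 expression + 80 texture + 3 angle + 27 gamma + 3 translation.
    """
    return {
        "id":    c[0:80],      # face identity
        "exp":   c[80:144],    # expression
        "tex":   c[144:224],   # texture
        "angle": c[224:227],   # head rotation (pitch, yaw, roll)
        "gamma": c[227:254],   # spherical-harmonics lighting
        "trans": c[254:257],   # head translation
    }

# Hypothetical single-frame coefficient vector
parts = split_coeff(np.zeros(257, dtype=np.float32))
print({k: v.shape for k, v in parts.items()})
```

The `exp`, `angle`, and `trans` slices are the ones the listening/speaking head models predict and the evaluation scripts compare.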
We evaluate the quality of the generated videos from the following perspectives:

- generation quality (image level): SSIM, CPBD, PSNR
- generation quality (feature level): FID
- identity preservation: cosine similarity (CSIM) from ArcFace
- expression: L1 distance of 3DMM expression features
- head motion: L1 distance of 3DMM angle and translation features
- lip sync: AV offset and AV confidence from SyncNet
- lip landmark distance (LipLMD)
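For orientation, the image-level PSNR metric reduces to a few lines of numpy. This is a minimal sketch of the standard formula, not the repo's `compute_base_metrics.py`:

```python
import numpy as np

def psnr(gt, pred, max_val=255.0):
    """Peak signal-to-noise ratio between two frames of equal shape."""
    mse = np.mean((gt.astype(np.float64) - pred.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

# Example: two synthetic 64x64 grayscale frames differing by 10 everywhere
gt = np.zeros((64, 64), dtype=np.uint8)
pred = np.full((64, 64), 10, dtype=np.uint8)
print(round(psnr(gt, pred), 2))  # MSE = 100 -> 10*log10(255^2/100) ≈ 28.13
```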
```shell
python compute_base_metrics.py --gt_video_folder {} --pd_video_folder {} --anno_file {*.csv} --task {listener,speaker}
python compute_cpbd.py --pd_video_folder {} --anno_file {*.csv} --task {listener,speaker}
python compute_fid.py --gt_video_folder {} --pd_video_folder {} --anno_file {*.csv} --task {listener,speaker}
```
Pretrained model: `ms1mv3_arcface_r100_fp16/backbone.pth` from this download link.

```shell
cd arcface_torch/
python compute_csim.py \
    --gt_video_folder {} \
    --pd_video_folder {} \
    --anno_file {} \
    --task {listener,speaker} \
    --weight ms1mv3_arcface_r100_fp16/backbone.pth
```
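CSIM is the cosine similarity between ArcFace identity embeddings of the ground-truth and generated faces. Given two embedding vectors, the metric itself is one line; this sketch uses hypothetical random embeddings in place of the ArcFace forward pass:

```python
import numpy as np

def csim(a, b):
    """Cosine similarity between two identity embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 512-dim embeddings (ArcFace r100 outputs 512-dim features)
rng = np.random.default_rng(0)
e_gt = rng.normal(size=512)
print(round(csim(e_gt, e_gt), 4))   # same identity -> 1.0
print(round(csim(e_gt, -e_gt), 4))  # opposite vector -> -1.0
```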
Mean L1 distance of the expression (exp) and head pose (angle, trans) 3DMM coefficients.
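A sketch of that metric over per-frame coefficient sequences, assuming expression is shaped (T, 64) and the concatenated angle/trans pose is (T, 6) as in the Deep3DFaceRecon layout (hypothetical arrays, not the repo's evaluation script):

```python
import numpy as np

def coeff_l1(gt, pred):
    """Mean per-element L1 distance between two coefficient sequences."""
    return float(np.mean(np.abs(np.asarray(gt) - np.asarray(pred))))

# Hypothetical sequences: T=90 frames, 64 expression dims, 3+3 pose dims
T = 90
gt_exp,  pd_exp  = np.zeros((T, 64)), np.full((T, 64), 0.5)
gt_pose, pd_pose = np.zeros((T, 6)),  np.ones((T, 6))
print(coeff_l1(gt_exp, pd_exp))    # 0.5 -> expression distance
print(coeff_l1(gt_pose, pd_pose))  # 1.0 -> head-motion distance
```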
```shell
cd lip_sync/
python compute_lipsync.py --pd_video_folder {} --gt_audio_folder {} --anno_file {*.csv}
```

```shell
python compute_lmd.py --gt_video_folder {} --pd_video_folder {} --anno_file {*.csv}
```
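Lip landmark distance boils down to the mean Euclidean distance between corresponding lip landmarks of the ground-truth and generated frames (the repo extracts landmarks with 3DDFA_V2). A sketch with hypothetical (frames, landmarks, 2) arrays; real implementations often also normalize by face size, which this skips:

```python
import numpy as np

def lip_lmd(gt, pred):
    """Mean Euclidean distance between corresponding lip landmarks.

    gt, pred: arrays of shape (frames, n_landmarks, 2) in pixel coords.
    """
    gt = np.asarray(gt, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)
    per_point = np.linalg.norm(gt - pred, axis=-1)  # (frames, n_landmarks)
    return float(per_point.mean())

# Hypothetical: 90 frames, 20 lip landmarks, predictions offset by (3, 4) px
gt = np.zeros((90, 20, 2))
pred = gt + np.array([3.0, 4.0])
print(lip_lmd(gt, pred))  # every point is 5 px away -> 5.0
```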
If you find this work helpful, please give it a star and a citation :)
```bibtex
@InProceedings{zhou2022responsive,
  title     = {Responsive Listening Head Generation: A Benchmark Dataset and Baseline},
  author    = {Zhou, Mohan and Bai, Yalong and Zhang, Wei and Yao, Ting and Zhao, Tiejun and Mei, Tao},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022}
}
```