GitHub - conphi/SadTalker: (CVPR 2023) SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

😭 SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

Wenxuan Zhang*¹,² Xiaodong Cun*² Xuan Wang³ Yong Zhang² Xi Shen²
Yu Guo¹ Ying Shan² Fei Wang¹

¹ Xi'an Jiaotong University ² Tencent AI Lab ³ Ant Group

CVPR 2023

TL;DR: A realistic and stylized talking head video generation method from a single image and audio.

📋 Changelog

2023.03.18 Added support for expression intensity. You can now change the intensity of the generated motion: python inference.py --expression_scale 2 (any value > 1).
2023.03.18 Reorganized the data folders. You can now download the checkpoints automatically using bash utils/download_models.sh.
2023.03.18 We have officially integrated GFPGAN for face enhancement. Use python inference.py --enhancer gfpgan for better visual quality.
2023.03.14 Pinned the version of the joblib package to fix errors when using librosa. The Colab demo is online!
Previous Changelogs
- 2023.03.06 Fixed some bugs in the code and errors in installation
- 2023.03.03 Release the test code for audio-driven single image animation!
- 2023.02.28 SadTalker has been accepted by CVPR 2023!

🎼 Pipeline

🚧 TODO

Generating 2D face from a single Image.
Generating 3D face from Audio.
Generating 4D free-view talking examples from audio and a single image.
Gradio/Colab Demo.
Training code for each component.
Audio-driven Anime Avatar.
Integrate ChatGPT for a conversation demo 🤔
Integrate with stable-diffusion-web-ui (stay tuned!).

sadtalker_demo_short.mp4

🔮 Inference Demo!

Dependency Installation

git clone https://github.com/Winfredy/SadTalker.git
cd SadTalker 
conda create -n sadtalker python=3.8
source activate sadtalker
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
conda install ffmpeg
pip install ffmpy Cmake boost dlib-bin # dlib-bin installs much faster than dlib; alternatively: conda install dlib
pip install -r requirements.txt

### install GFPGAN for the enhancer
pip install git+https://github.com/TencentARC/GFPGAN
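
After running the commands above, it can help to confirm that the main pieces are visible from the new environment. The following is a minimal sketch, not part of the repository: the helper name check_environment is hypothetical, and the import names (torch, torchvision, torchaudio, dlib, gfpgan) are assumed from the packages installed above. It only checks that the modules can be located and that ffmpeg is on the PATH; it does not import them or verify their versions.

```python
import importlib.util
import shutil


def check_environment(modules, executables):
    """Report which Python modules and command-line tools can be found."""
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for name in executables:
        # shutil.which returns None when the executable is not on PATH
        report[f"executable:{name}"] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    # Import names are assumptions based on the install commands above
    result = check_environment(
        modules=["torch", "torchvision", "torchaudio", "dlib", "gfpgan"],
        executables=["ffmpeg"],
    )
    for item, found in result.items():
        print(f"{item}: {'ok' if found else 'MISSING'}")
```

Anything reported as MISSING points to the install step that needs to be repeated.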

Trained Models

You can run the following script to put all the models in the right place.

bash utils/download_models.sh

Alternatively, download our pre-trained models from Google Drive or our GitHub release page, and put them in ./checkpoints.

Model	Description
checkpoints/auido2exp_00300-model.pth	Pre-trained ExpNet in SadTalker.
checkpoints/auido2pose_00140-model.pth	Pre-trained PoseVAE in SadTalker.
checkpoints/mapping_00229-model.pth.tar	Pre-trained MappingNet in SadTalker.
checkpoints/facevid2vid_00189-model.pth.tar	Pre-trained face-vid2vid model from the reproduction of face-vid2vid.
checkpoints/epoch_20.pth	Pre-trained 3DMM extractor in Deep3DFaceReconstruction.
checkpoints/wav2lip.pth	Highly accurate lip-sync model in Wav2lip.
checkpoints/shape_predictor_68_face_landmarks.dat	Face landmark model used in dlib.
checkpoints/BFM	3DMM library file.
checkpoints/hub	Face detection models used in face alignment.
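
Before running inference, you may want to confirm that every entry in the table above is present. This is a small sketch under the assumption that all files sit directly under ./checkpoints, as the table suggests. The helper name missing_checkpoints is hypothetical, and the file names are copied verbatim from the table, including their spelling.

```python
from pathlib import Path

# Entries copied from the table above (spelling kept exactly as listed)
EXPECTED_CHECKPOINTS = [
    "auido2exp_00300-model.pth",
    "auido2pose_00140-model.pth",
    "mapping_00229-model.pth.tar",
    "facevid2vid_00189-model.pth.tar",
    "epoch_20.pth",
    "wav2lip.pth",
    "shape_predictor_68_face_landmarks.dat",
    "BFM",  # directory
    "hub",  # directory
]


def missing_checkpoints(checkpoint_dir="./checkpoints"):
    """Return the expected entries that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_CHECKPOINTS if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:")
        for name in missing:
            print(f"  {name}")
    else:
        print("All expected checkpoints found.")
```

An empty result only means the files exist; it does not check that the downloads are complete or uncorrupted.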

Generating 2D face from a single Image

python inference.py --driven_audio <audio.wav> \
                    --source_image <video.mp4 or picture.png> \
                    --batch_size <default is 2; a larger value runs faster> \
                    --expression_scale <default is 1.0, a larger value will make the motion stronger> \
                    --result_dir <a folder to store results> \
                    --enhancer <default is None, you can choose gfpgan or RestoreFormer>

basic	w/ gfpgan	w/ gfpgan, w/ expression scale = 2
art_0.japanese.mp4	art_0.japanese_es1.mp4	art_0.japanese_es2.mp4
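
If you want to launch the command above from a script, for example to process several images, one option is to assemble the argument list in Python. The sketch below uses only the flags shown in the command above. The helper name build_inference_command and the example file names are hypothetical, and the defaults (batch size 2, expression scale 1.0, no enhancer) are taken from the placeholder descriptions.

```python
import subprocess
import sys

# Enhancer names taken from the --enhancer description above
ALLOWED_ENHANCERS = ("gfpgan", "RestoreFormer")


def build_inference_command(driven_audio, source_image, result_dir,
                            batch_size=2, expression_scale=1.0, enhancer=None):
    """Build the argument list for inference.py from the documented flags."""
    if enhancer is not None and enhancer not in ALLOWED_ENHANCERS:
        raise ValueError(f"enhancer must be one of {ALLOWED_ENHANCERS} or None")
    cmd = [
        sys.executable, "inference.py",
        "--driven_audio", str(driven_audio),
        "--source_image", str(source_image),
        "--result_dir", str(result_dir),
        "--batch_size", str(batch_size),
        "--expression_scale", str(expression_scale),
    ]
    if enhancer is not None:
        # Omit the flag entirely so that inference.py uses its own default
        cmd += ["--enhancer", enhancer]
    return cmd


def run_inference(**kwargs):
    """Run inference.py from the repository root; raises on a non-zero exit."""
    subprocess.run(build_inference_command(**kwargs), check=True)
```

For example, run_inference(driven_audio="speech.wav", source_image="face.png", result_dir="results", expression_scale=2, enhancer="gfpgan") would correspond to the third column of the table above, assuming those input files exist.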

Generating 3D face from Audio

To do ...

Generating 4D free-view talking examples from audio and a single image

We use camera_yaw, camera_pitch, and camera_roll to control the camera pose. For example, --camera_yaw -20 30 10 means the camera yaw changes from -20 degrees to 30 degrees, and then from 30 degrees to 10 degrees.

python inference.py --driven_audio <audio.wav> \
                    --source_image <video.mp4 or picture.png> \
                    --result_dir <a folder to store results> \
                    --camera_yaw -20 30 10
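
To build intuition for how a keyframe list such as -20 30 10 could translate into a per-frame angle, here is an illustrative sketch. It assumes the keyframes are spread evenly over the video and linearly interpolated; the actual schedule used by inference.py may differ. The function name interpolate_camera_angles is hypothetical.

```python
def interpolate_camera_angles(keyframes, num_frames):
    """Linearly interpolate keyframe angles (degrees) across num_frames frames.

    Assumption: keyframes are evenly spaced over the clip. This is an
    illustration, not necessarily how inference.py schedules the camera.
    """
    if not keyframes:
        raise ValueError("keyframes must not be empty")
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if len(keyframes) == 1 or num_frames == 1:
        return [float(keyframes[0])] * num_frames

    segments = len(keyframes) - 1
    angles = []
    for i in range(num_frames):
        # Position along the whole path, from 0 to `segments`
        position = i * segments / (num_frames - 1)
        # Clamp so the final frame falls at the end of the last segment
        index = min(int(position), segments - 1)
        fraction = position - index
        start, end = keyframes[index], keyframes[index + 1]
        angles.append(start + (end - start) * fraction)
    return angles


if __name__ == "__main__":
    # Five frames: yaw rises from -20 to 30, then falls back to 10
    print(interpolate_camera_angles([-20, 30, 10], 5))
```

With five frames, the yaw values are -20, 5, 30, 20, and 10 degrees: the first half of the clip sweeps from -20 to 30, and the second half returns from 30 to 10.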

🛎 Citation

If you find our work useful in your research, please consider citing:

@article{zhang2022sadtalker,
  title={SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation},
  author={Zhang, Wenxuan and Cun, Xiaodong and Wang, Xuan and Zhang, Yong and Shen, Xi and Guo, Yu and Shan, Ying and Wang, Fei},
  journal={arXiv preprint arXiv:2211.12194},
  year={2022}
}

💗 Acknowledgements

Facerender code borrows heavily from zhanglonghao's reproduction of face-vid2vid and PIRender. We thank the authors for sharing their wonderful code. In the training process, we also use the models from Deep3DFaceReconstruction and Wav2lip. We thank them for their wonderful work.

🥂 Related Works

📢 Disclaimer

This is not an official product of Tencent. This repository can only be used for personal/research/non-commercial purposes.
