😭 SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation


Wenxuan Zhang *,1,2, Xiaodong Cun *,2, Xuan Wang 3, Yong Zhang 2, Xi Shen 2,
Yu Guo 1, Ying Shan 2, Fei Wang 1

1 Xi'an Jiaotong University   2 Tencent AI Lab   3 Ant Group  

CVPR 2023


TL;DR: A realistic and stylized talking head video generation method from a single image and audio.


📋 Changelog

  • 2023.03.18 Support expression intensity: you can now change the intensity of the generated motion with python inference.py --expression_scale 2 (or another value > 1).

  • 2023.03.18 Reconfigured the data folders; you can now download the checkpoints automatically with bash utils/download_models.sh.

  • 2023.03.18 We have officially integrated GFPGAN for face enhancement; use python inference.py --enhancer gfpgan for better visual quality.

  • 2023.03.14 Pinned the version of the joblib package to remove the errors when using librosa. Open In Colab is online!

    Previous Changelogs

    • 2023.03.06 Fixed some bugs in the code and errors in installation.
    • 2023.03.03 Released the test code for audio-driven single image animation!
    • 2023.02.28 SadTalker has been accepted by CVPR 2023!

🎼 Pipeline

Figure: main_of_sadtalker (pipeline overview image)

🚧 TODO

  • Generating 2D face from a single Image.
  • Generating 3D face from Audio.
  • Generating 4D free-view talking examples from audio and a single image.
  • Gradio/Colab Demo.
  • Training code of each component.
  • Audio-driven Anime Avatar.
  • Integrate ChatGPT for a conversation demo 🤔
  • Integrate with stable-diffusion-web-ui. (stay tuned!)

🔮 Inference Demo!

Dependency Installation

git clone https://github.com/Winfredy/SadTalker.git
cd SadTalker 
conda create -n sadtalker python=3.8
source activate sadtalker
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
conda install ffmpeg
pip install ffmpy cmake boost dlib-bin  # dlib-bin installs much faster than building dlib; alternatively: conda install dlib
pip install -r requirements.txt

# install GFPGAN for the enhancer
pip install git+https://github.com/TencentARC/GFPGAN
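
Before running inference, it can help to confirm that the environment matches the steps above. The following is only a minimal sketch: the helper names `missing_tools` and `python_matches` are hypothetical and not part of SadTalker, and the check covers just the Python version (3.8, as created with conda above) and whether ffmpeg is on the PATH.

```python
import shutil
import sys


def missing_tools(tools=("ffmpeg",), which=shutil.which):
    """Return the command-line tools from `tools` that are not found on PATH.

    `which` is injectable so the check can be exercised without touching
    the real environment.
    """
    return [tool for tool in tools if which(tool) is None]


def python_matches(expected=(3, 8), version_info=sys.version_info):
    """Return True when the interpreter's major.minor version equals `expected`."""
    return tuple(version_info[:2]) == tuple(expected)


if __name__ == "__main__":
    # Hypothetical helpers; they only report, they do not install anything.
    print("missing tools:", missing_tools() or "none")
    print("python 3.8:", python_matches())
```

If ffmpeg is reported missing, rerunning conda install ffmpeg inside the sadtalker environment is the likely fix.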

Trained Models


You can run the following script to put all the models in the right place.

bash utils/download_models.sh

Or download our pre-trained models from Google Drive or our GitHub release page, and put them in ./checkpoints.

| Model | Description |
| --- | --- |
| checkpoints/auido2exp_00300-model.pth | Pre-trained ExpNet in SadTalker. |
| checkpoints/auido2pose_00140-model.pth | Pre-trained PoseVAE in SadTalker. |
| checkpoints/mapping_00229-model.pth.tar | Pre-trained MappingNet in SadTalker. |
| checkpoints/facevid2vid_00189-model.pth.tar | Pre-trained face-vid2vid model from the reproduction of face-vid2vid. |
| checkpoints/epoch_20.pth | Pre-trained 3DMM extractor in Deep3DFaceReconstruction. |
| checkpoints/wav2lip.pth | Highly accurate lip-sync model in Wav2lip. |
| checkpoints/shape_predictor_68_face_landmarks.dat | Face landmark model used in dlib. |
| checkpoints/BFM | 3DMM library file. |
| checkpoints/hub | Face detection models used in face alignment. |
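
After a manual download, a quick check that every entry listed above is present can save a failed run. This is a sketch under assumptions: `missing_checkpoints` and `EXPECTED_CHECKPOINTS` are hypothetical names, the file names are copied verbatim from the list above (including their released spelling), and only existence is checked, not file integrity.

```python
from pathlib import Path

# Names copied verbatim from the model list above; BFM and hub are directories.
EXPECTED_CHECKPOINTS = [
    "auido2exp_00300-model.pth",
    "auido2pose_00140-model.pth",
    "mapping_00229-model.pth.tar",
    "facevid2vid_00189-model.pth.tar",
    "epoch_20.pth",
    "wav2lip.pth",
    "shape_predictor_68_face_landmarks.dat",
    "BFM",
    "hub",
]


def missing_checkpoints(root="./checkpoints", expected=EXPECTED_CHECKPOINTS):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    print("all checkpoints found" if not missing else f"missing: {missing}")
```

An empty result only means the files exist; a truncated download would still pass this check.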

Generating 2D face from a single Image

python inference.py --driven_audio <audio.wav> \
                    --source_image <video.mp4 or picture.png> \
                    --batch_size <default is 2; a larger value runs faster> \
                    --expression_scale <default is 1.0; a larger value makes the motion stronger> \
                    --result_dir <a folder to store results> \
                    --enhancer <default is None, you can choose gfpgan or RestoreFormer>
| basic | w/ gfpgan | w/ gfpgan, w/ expression scale = 2 |
| --- | --- | --- |
| art_0.japanese.mp4 | art_0.japanese_es1.mp4 | art_0.japanese_es2.mp4 |
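
When inference is launched from another script, for example to process several images in a loop, the same command can be assembled as an argument list. The sketch below uses only the flags shown in the command above; the function name `build_command` is hypothetical and not part of the repository.

```python
import sys


def build_command(driven_audio, source_image, result_dir,
                  batch_size=2, expression_scale=1.0, enhancer=None):
    """Assemble the inference.py call as an argument list for subprocess.run.

    Defaults mirror the ones documented above; --enhancer is omitted
    unless a value such as "gfpgan" or "RestoreFormer" is given.
    """
    cmd = [
        sys.executable, "inference.py",
        "--driven_audio", str(driven_audio),
        "--source_image", str(source_image),
        "--result_dir", str(result_dir),
        "--batch_size", str(batch_size),
        "--expression_scale", str(expression_scale),
    ]
    if enhancer is not None:
        cmd += ["--enhancer", enhancer]
    return cmd
```

From the repository root, something like subprocess.run(build_command("speech.wav", "face.png", "results", enhancer="gfpgan"), check=True) would then run it; the three file names here are placeholders. Passing a list rather than a single shell string avoids quoting problems with paths that contain spaces.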

Generating 3D face from Audio

To do ...

Generating 4D free-view talking examples from audio and a single image

We use camera_yaw, camera_pitch, and camera_roll to control the camera pose. For example, --camera_yaw -20 30 10 means the camera yaw changes from -20 to 30 degrees and then from 30 to 10 degrees.

python inference.py --driven_audio <audio.wav> \
                    --source_image <video.mp4 or picture.png> \
                    --result_dir <a folder to store results> \
                    --camera_yaw -20 30 10
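
To make the keyframe idea concrete, the sketch below turns a list such as -20 30 10 into one yaw angle per frame. It is an illustration only, not the repository's implementation: the function name `yaw_schedule` is hypothetical, and it assumes linear interpolation with every segment given an equal share of the frames, which may differ from what inference.py actually does.

```python
def yaw_schedule(keyframes, num_frames):
    """Piecewise-linear yaw angle (degrees) per frame, visiting each keyframe in order.

    Assumption: each segment between consecutive keyframes gets an equal
    share of the timeline.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if len(keyframes) == 1 or num_frames == 1:
        return [float(keyframes[0])] * num_frames

    segments = len(keyframes) - 1
    angles = []
    for i in range(num_frames):
        # Position along the whole path, measured in segments (0 .. segments).
        pos = i * segments / (num_frames - 1)
        k = min(int(pos), segments - 1)  # index of the segment's start keyframe
        t = pos - k                      # fraction of the way through that segment
        angles.append(keyframes[k] + t * (keyframes[k + 1] - keyframes[k]))
    return angles


print(yaw_schedule([-20, 30, 10], 5))  # [-20.0, 5.0, 30.0, 20.0, 10.0]
```

With five frames the yaw rises from -20 to 30 over the first half and falls back to 10 over the second, matching the description above.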


🛎 Citation

If you find our work useful in your research, please consider citing:

@article{zhang2022sadtalker,
  title={SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation},
  author={Zhang, Wenxuan and Cun, Xiaodong and Wang, Xuan and Zhang, Yong and Shen, Xi and Guo, Yu and Shan, Ying and Wang, Fei},
  journal={arXiv preprint arXiv:2211.12194},
  year={2022}
}

💗 Acknowledgements

The Facerender code borrows heavily from zhanglonghao's reproduction of face-vid2vid and from PIRender. We thank the authors for sharing their wonderful code. In the training process, we also use the models from Deep3DFaceReconstruction and Wav2lip. We thank the authors for their wonderful work.

🥂 Related Works

📢 Disclaimer

This is not an official product of Tencent. This repository can only be used for personal/research/non-commercial purposes.
