- Requirements
- Data Extraction
- Data Preprocessing
- Training Different Models
- Pretrained Model
- How to run an example
- Citations
- FAQs
- miniconda
- python 3.7 bash installer
- ffmpeg (with x264 enabled)
sudo apt-get install ffmpeg
brew install -i ffmpeg
./configure --enable-gpl --enable-libx264
You may install the requirements by running the following commands
conda init <bash|zsh|...>
<close and re-open terminal or vscode>
conda deactivate
conda env remove -n obamanet
conda create -n obamanet
conda activate obamanet
conda install python=3.7 pip
conda install numpy scikit-learn scipy tqdm cmake
conda run pip install -r requirements.txt
The project is built for Python 3.5 and above. The other libraries are listed below:
- OpenCV (sudo pip3 install opencv-contrib-python)
- Dlib (sudo pip3 install dlib) with this file unzipped in the data folder
- Python Speech Features (sudo pip3 install python-speech-features)
For a complete list, refer to the requirements.txt file.
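After installing, a quick way to confirm that the three libraries named above are visible to the active environment is to look them up by import name. This is a minimal sketch, not part of the repository; the import names (cv2, dlib, python_speech_features) are the usual ones for these packages and are assumed here, and the helper name missing_modules is hypothetical.

```python
import importlib.util

# Import names assumed for the packages listed above:
# opencv-contrib-python -> cv2, python-speech-features -> python_speech_features.
REQUIRED_MODULES = ["cv2", "dlib", "python_speech_features"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules are importable.")
```

Run it inside the obamanet environment so that it checks the same interpreter the project uses.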
I used the tools below to extract and manipulate the data:
I extracted the data from YouTube using youtube-dl. It's perhaps the best downloader for YouTube on Linux. Commands for extracting particular streams are given below.
- Subtitle Extraction
youtube-dl --sub-lang en --skip-download --write-sub --output '~/obamanet/data/captions/%(autonumber)s.%(ext)s' --batch-file ~/obamanet/data/obama_addresses.txt --ignore-config
- Video Extraction
youtube-dl --batch-file ~/obamanet/data/obama_addresses.txt -o '~/obamanet/data/videos/%(autonumber)s.%(ext)s' -f "best[height=720]" --autonumber-start 1
(Videos not available in 720p: 165)
- Video to Audio Conversion
python3 vid2wav.py
- Video to Images
ffmpeg -i 00001.mp4 -r 1/5 -vf scale=-1:720 images/00001-$filename%05d.bmp
To convert the images from BMP to JPG, run the following in the directory that contains them:
mogrify -format jpg *.bmp
rm -rf *.bmp
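The audio and frame extraction steps above operate on one video at a time. As a rough sketch of how they might be applied to a whole folder of downloads, the helpers below build the ffmpeg commands for each video. They are hypothetical and are not the contents of vid2wav.py, which is not shown here. The frame command copies the flags from the example above; the mono output, the 16000 Hz sample rate, and the use of the video's file stem in the output names are assumptions.

```python
import subprocess
from pathlib import Path


def build_wav_command(video_path, audio_dir, sample_rate=16000):
    """Build an ffmpeg command that writes a mono WAV track for one video."""
    video_path = Path(video_path)
    wav_path = Path(audio_dir) / f"{video_path.stem}.wav"
    return [
        "ffmpeg", "-y",
        "-i", str(video_path),
        "-vn",                    # ignore the video stream
        "-ac", "1",               # mono output (assumption)
        "-ar", str(sample_rate),  # sample rate in Hz (assumption)
        str(wav_path),
    ]


def build_frame_command(video_path, image_dir):
    """Build an ffmpeg command mirroring the frame-extraction flags above."""
    video_path = Path(video_path)
    pattern = Path(image_dir) / f"{video_path.stem}-%05d.bmp"
    return [
        "ffmpeg",
        "-i", str(video_path),
        "-r", "1/5",            # one output frame every five seconds
        "-vf", "scale=-1:720",  # 720 px high, width follows the aspect ratio
        str(pattern),
    ]


def process_videos(video_dir, audio_dir, image_dir):
    """Run both commands for every .mp4 file in video_dir (requires ffmpeg)."""
    Path(audio_dir).mkdir(parents=True, exist_ok=True)
    Path(image_dir).mkdir(parents=True, exist_ok=True)
    for video in sorted(Path(video_dir).glob("*.mp4")):
        subprocess.run(build_wav_command(video, audio_dir), check=True)
        subprocess.run(build_frame_command(video, image_dir), check=True)
```

Building the argument lists separately from running them makes it easy to print and inspect a command before processing the whole folder.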
Copy the patched images into folder a and the cropped images into folder b.
python3 tools/process.py --input_dir a --b_dir b --operation combine --output_dir c
python3 tools/split.py --dir c
You may use this pretrained model or train pix2pix from scratch using this dataset. Unzip the dataset into the pix2pix main directory.
python3 pix2pix.py --mode train --output_dir output --max_epochs 200 --input_dir c/train/ --which_direction AtoB
To run the trained pix2pix model:
python3 pix2pix.py --mode test --output_dir test_out/ --input_dir c_test/ --checkpoint output/
To convert images to video
ffmpeg -r 30 -f image2 -s 256x256 -i %d-targets.png -vcodec libx264 -crf 25 ../targets.mp4
The pretrained model and a subset of the data are available here - Link
landmarks file unzipped in the data folder
Download and extract the checkpoints and the data folders into the repository. The file structure should look as shown below.
obamanet
|
└─ data
| | audios
| | a2key_data
| | shape_predictor_68_face_landmarks.dat
| ...
|
└─ checkpoints
| | output
| | my_model.h5
| ...
└─ train.py
└─ run.py
└─ run.sh
...
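To check that the extracted folders ended up in the right place, something like the following sketch could be run from anywhere. The path list is one reading of the tree above (with output and my_model.h5 both taken to sit directly under checkpoints), and the helper name missing_paths is hypothetical.

```python
from pathlib import Path

# Paths read from the file structure shown above; the nesting of
# checkpoints/output and checkpoints/my_model.h5 is an assumption.
EXPECTED_PATHS = [
    "data/audios",
    "data/a2key_data",
    "data/shape_predictor_68_face_landmarks.dat",
    "checkpoints/output",
    "checkpoints/my_model.h5",
    "train.py",
    "run.py",
    "run.sh",
]


def missing_paths(repo_root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    for rel in missing_paths("."):
        print("Missing:", rel)
```

Running it from the repository root prints one line per missing entry and nothing if everything listed is present.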
Run the following commands
conda run ./run.sh <relative_path_to_audio_wav_file>
Example:
conda run ./run.sh data/audios/karan.wav
Feel free to experiment with different voices. However, the result will depend on how close your voice is to the subject we trained on.
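When trying your own recording, it may help to inspect the file before passing it to run.sh. The sketch below uses only Python's standard wave module to report the basic properties of a PCM WAV file. It does not know which sample rate or channel count the model expects, so treat the output as information to compare against a known working file such as the example above. The function name describe_wav is hypothetical.

```python
import wave


def describe_wav(path):
    """Return basic properties of a PCM WAV file using the standard library."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_seconds": frames / rate if rate else 0.0,
        }


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        print(describe_wav(sys.argv[1]))
```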
If you use this code for your research, please cite the paper this code is based on, ObamaNet: Photo-realistic lip-sync from text, as well as the amazing pix2pix repository by affinelayer.
Cite as arXiv:1801.01442v1 [cs.CV]