Deep Fake

Random collection of code snippets used to create Deep Fakes, using Databricks for the GPU computation. These are not expected to be useful to the public; rather, this is just the collection of JavaScript, Python and bash commands we used throughout the process.

This was a collaborative project between me and @soswow.

Repos

Face Swap

Environment configuration

Prepare Databricks environment:

cd /dbfs/.../faceswap
sudo /databricks/python3/bin/pip install --upgrade pip
sudo /databricks/python3/bin/pip install -r requirements.txt
sudo apt-get install ffmpeg

Training data collection

Download YouTube videos from text file:

youtube-dl \
--batch-file "data/url_list.txt" \
-o 'data/videos/%(autonumber)s.%(ext)s' \
--autonumber-start 1

Trim videos to appropriate sections:

ffmpeg -ss 0:10 -to 1:10 -i 00001.mp4 -codec copy 00001a.mp4

Extract every 12th frame from a video (see StackOverflow post):

ffmpeg -i videos/00001a.mp4 \
-vf select='not(mod(n\,12))',setpts=N/TB \
-r 1 video-frames/video1a-%04d.jpg

Extract faces from directory of frames:

python faceswap.py extract -i ../data/video-frames -o ../data/faces

Although you can extract faces directly from a video, it is recommended to extract frames and then faces in two separate steps, since this allows explicit removal of frames from which you do not want any faces extracted. For example, a frame may contain many people, which makes face extraction slow. While there are some ways to mitigate this (e.g., excluding certain people from face extraction), we found the easiest approach was to manually select the frames containing the person of interest and use only those as input to the face extractor, as sketched below.
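As a minimal sketch of that manual selection step (the keep_list.txt filename and the video-frames-selected directory are illustrative names, not part of the original workflow), copy only the hand-picked frames into a separate directory and point the extractor at it:

import shutil
from pathlib import Path

# Hypothetical layout: the hand-picked frame filenames live in keep_list.txt,
# one per line; frames are in data/video-frames.
frames_dir = Path("data/video-frames")
selected_dir = Path("data/video-frames-selected")
selected_dir.mkdir(parents=True, exist_ok=True)

with open("keep_list.txt") as f:
    keep = {line.strip() for line in f if line.strip()}

for name in sorted(keep):
    src = frames_dir / name
    if src.exists():
        shutil.copy2(src, selected_dir / name)
    else:
        print(f"missing frame: {name}")

The extract command above is then pointed at ../data/video-frames-selected instead of ../data/video-frames.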

Model build

Build a model on a specific GPU:

CUDA_VISIBLE_DEVICES=0 /databricks/python3/bin/python faceswap.py train \
-A /dbfs/.../facesA \
-B /dbfs/.../facesB \
-m /dbfs/.../model-output \
--timelapse-input-A /dbfs/.../facesA-tl \
--timelapse-input-B /dbfs/.../facesB-tl \
--timelapse-output /dbfs/.../model-tl \
--write_image \
-t dfaker \
--batch-size 16

Note that you should put some representative faces of person A and person B (e.g., 6 of each) in the facesA-tl and facesB-tl folders, which will produce a frame-by-frame animation as your model builds (a small sketch for populating these folders appears after the next command). Once the model is built, you can convert the timelapse frames into a nice animation at, e.g., 10 frames per second:

ffmpeg -framerate 10 -pattern_type glob -i '*.jpg' model-tl.mp4
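To populate the facesA-tl and facesB-tl folders, something like the following Python sketch is enough (the local paths, the .png glob and the choice of 6 faces are illustrative assumptions):

import random
import shutil
from pathlib import Path

# Illustrative paths; the extractor output may be .png or .jpg depending on settings.
faces_dir = Path("facesA")
timelapse_dir = Path("facesA-tl")
timelapse_dir.mkdir(parents=True, exist_ok=True)

# Copy 6 randomly chosen face crops into the timelapse folder.
for face in random.sample(sorted(faces_dir.glob("*.png")), 6):
    shutil.copy2(face, timelapse_dir / face.name)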

Inference

Extract faces from a video file, typically in preparation for inference:

python faceswap.py extract -i ../data/videos/eval.mp4 -o ../data/video-frames-eval/

Remove all but the first face detection from each frame, using the alignments.json output from the previous step (warning: verify that in each case the first detection is indeed the one you care about; a small check for this is sketched after the snippet):

// alignments.json maps each frame to a list of face detections; keep only the first one.
const data = require('./alignments.json');
const fs = require('fs');
let finalObject = {};
Object.entries(data).forEach(([key, value]) => {
  finalObject[key] = [value[0]];
});
fs.writeFileSync('alignments-filtered.json', JSON.stringify(finalObject));
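To support that verification, here is a minimal Python sketch, assuming the same alignments.json structure the snippet above relies on (a mapping from frame name to a list of detections), that lists frames with more than one detection so they can be reviewed by hand:

import json

# Load the alignments written by the extract step.
with open("alignments.json") as f:
    alignments = json.load(f)

# Flag frames where keeping only the first detection might drop the wrong face.
multi = {frame: len(faces) for frame, faces in alignments.items() if len(faces) > 1}
for frame, count in sorted(multi.items()):
    print(f"{frame}: {count} detections, check which is the person of interest")
print(f"{len(multi)} of {len(alignments)} frames have multiple detections")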

Make the actual face swap on your video using a specific GPU:

CUDA_VISIBLE_DEVICES=0 /databricks/python3/bin/python faceswap.py convert \
-i /dbfs/.../data/eval.mp4 \
-al /dbfs/.../data/alignments-filtered.json \
-o /dbfs/.../converted \
-m /dbfs/.../model-output \
-w ffmpeg

Voice synthesis

Environment configuration

Prepare Databricks environment:

sudo /databricks/python3/bin/pip install torch==1.0
cd /dbfs/.../tacotron2
sudo /databricks/python3/bin/pip install --upgrade pip
sudo /databricks/python3/bin/pip install -r requirements.txt

Training data collection

Use the Google Speech API to transcribe a bunch of videos.
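A rough sketch of this step using the google-cloud-speech client (the bucket URI, sample rate and language code are placeholders, and the exact call signature depends on the library version):

from google.cloud import speech

client = speech.SpeechClient()

# Audio must already be uploaded to a GCS bucket for long_running_recognize.
audio = speech.RecognitionAudio(uri="gs://my-bucket/video-audio.wav")  # placeholder URI
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    enable_word_time_offsets=True,  # word timings are needed to cut clips later
)

operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=3600)

for result in response.results:
    alt = result.alternatives[0]
    # alt.words carries per-word start_time/end_time, used later to cut clips.
    print(alt.transcript)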

Convert the transcribed videos into smaller .wav clips and create a metadata .txt that can be used as the labelled set for Tacotron2.
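A minimal sketch of the idea, assuming you have recovered (start, end, transcript) segments from the Speech API timings: cut each segment with ffmpeg and append a DUMMY/<clip>.wav|<text> line to meta.txt (DUMMY is rewritten to the real wavs path by the sed command further down):

import subprocess

# Hypothetical input: (start_seconds, end_seconds, transcript) segments recovered
# from the Speech API word timings for one source video.
segments = [
    (12.4, 16.9, "This is a deep fake voice."),
]

source = "videos/00001.mp4"  # placeholder source video
with open("meta.txt", "a") as meta:
    for i, (start, end, text) in enumerate(segments, start=1):
        clip = f"clip_{i:05d}.wav"
        # Cut the segment to mono 22.05 kHz (the sampling_rate used at inference below).
        subprocess.run([
            "ffmpeg", "-y", "-i", source,
            "-ss", str(start), "-to", str(end),
            "-ac", "1", "-ar", "22050",
            f"wavs/{clip}",
        ], check=True)
        # "DUMMY" is the prefix the sed step rewrites to the real wavs path.
        meta.write(f"DUMMY/{clip}|{text}\n")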

Convert the files to the correct bit depth (16-bit PCM):

for f in *.WAV; do ffmpeg -i "$f" -acodec pcm_s16le "16bit/$f"; done

Model build

Tacotron (text -> mel-spectrogram)

Replace the supplied train/test filelists with the new ones and update the paths. We just reuse the supplied filelist names to avoid changing extra configuration. Replace the 100/101 split point with whatever you like (roughly 10% of the total number of wavs is a reasonable test set):

tail -n+101 meta.txt > /dbfs/.../tacotron2/filelists/ljs_audio_text_train_filelist.txt
head -n100 meta.txt > /dbfs/.../tacotron2/filelists/ljs_audio_text_test_filelist.txt
sed -i -- 's,DUMMY,/dbfs/.../wavs,g' /dbfs/.../tacotron2/filelists/lj*.txt

Fine-tune (transfer learn) the Tacotron model from the NVIDIA-supplied checkpoint:

CUDA_VISIBLE_DEVICES=1 /databricks/python3/bin/python train.py \
--output_directory ../taco_out \
--log_directory ../taco_log \
--checkpoint_path /dbfs/.../tacotron2/tacotron2_statedict.pt \
--warm_start \
--hparams batch_size=8

Waveglow (vocoder: mel-spectrogram -> wave)

Create the train/test files, which are simple lists of the .wav files. Again, replace the 100/101 split point with whatever you like (roughly 10% of the total number of wavs is a reasonable test set):

ls /dbfs/.../wavs/*.wav | tail -n+101 > /dbfs/.../tacotron2/waveglow/train_files.txt
ls /dbfs/.../wavs/*.wav | head -n100 > /dbfs/.../tacotron2/waveglow/test_files.txt

Modify train.py as per waveglow/issues/35:

# Warm-start from the released Waveglow checkpoint: reset the iteration counter
# and skip loading the optimizer state.
def load_checkpoint(checkpoint_path, model, optimizer):
    assert os.path.isfile(checkpoint_path)
    checkpoint_dict = torch.load(checkpoint_path, map_location='cpu')
    # iteration = checkpoint_dict['iteration']
    iteration = 1
    # optimizer.load_state_dict(checkpoint_dict['optimizer'])
    model_for_loading = checkpoint_dict['model']
    model.load_state_dict(model_for_loading.state_dict())
    print("Loaded checkpoint '{}' (iteration {})" .format(
          checkpoint_path, iteration))
    return model, optimizer, iteration

Adjust the config.json:

checkpoint_path=/dbfs/.../tacotron2/waveglow/waveglow_256channels.pt
channels=256

Fine-tune (transfer learn) the Waveglow model from the NVIDIA-supplied checkpoint:

CUDA_VISIBLE_DEVICES=2 /databricks/python3/bin/python train.py -c config.json

Audio inference

This Python function is adapted from the inference.ipynb in the NVIDIA Tacotron2 repo, expanded to include the actual .wav generation (for some reason this was never supplied) and to parameterise which Tacotron and Waveglow models to use. By default it uses the supplied checkpoints, so to use the fine-tuned models substitute in the appropriate checkpoint paths:

import sys
import os
import numpy as np
import torch

sys.path.append('/dbfs/.../tacotron2/')
sys.path.append('/dbfs/.../tacotron2/waveglow')
from hparams import create_hparams
from model import Tacotron2
from layers import TacotronSTFT, STFT
from audio_processing import griffin_lim
from train import load_model
from text import text_to_sequence
#from denoiser import Denoiser
import librosa


def text_to_wav(tacotron_path='/dbfs/.../tacotron2_statedict.pt',
                waveglow_path='/dbfs/.../waveglow_256channels.pt',
                output_file='output.wav',
                text="This is a Deep Fake voice."):

    hparams = create_hparams()
    hparams.sampling_rate = 22050

    # Load the Tacotron2 model (text -> mel-spectrogram).
    model = load_model(hparams)
    model.load_state_dict(torch.load(tacotron_path)['state_dict'])
    _ = model.cuda().eval().half()

    # Load the Waveglow vocoder (mel-spectrogram -> waveform).
    waveglow = torch.load(waveglow_path)['model']
    waveglow.cuda().eval().half()
    for k in waveglow.convinv:
        k.float()
    #denoiser = Denoiser(waveglow)

    # Encode the input text and run inference.
    sequence = np.array(text_to_sequence(text, ['english_cleaners']))[None, :]
    sequence = torch.autograd.Variable(torch.from_numpy(sequence)).cuda().long()
    mel_outputs, mel_outputs_postnet, _, alignments = model.inference(sequence)
    with torch.no_grad():
        audio = waveglow.infer(mel_outputs_postnet, sigma=0.666)

    # Write the result to a .wav file.
    wav = audio[0].data.cpu().numpy()
    librosa.output.write_wav(os.path.join('/dbfs/.../', output_file),
                             wav.astype(np.float32), hparams.sampling_rate)
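For example, to synthesise a line with the fine-tuned Tacotron model and the default Waveglow vocoder (the checkpoint path below is a placeholder; use whatever your training run produced):

# Hypothetical checkpoint path from the Tacotron fine-tuning run above; per the notes
# below, the stock Waveglow vocoder (the default waveglow_path) gave the best results.
text_to_wav(tacotron_path='/dbfs/.../taco_out/checkpoint_10000',
            output_file='deepfake-voice.wav',
            text="This is a Deep Fake voice.")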

Notes:

  • We found that ending the input text with a period (.) is important; otherwise the model outputs a few seconds of stuttering garbage at the end of the speech.
  • We couldn't get the "de-noiser" to work, so it is commented out above.
  • Interestingly, we got our best results with a Tacotron model fine-tuned for 10k iterations (batch size 8 and the default learning rate) and the default Waveglow vocoder. Using any more Tacotron iterations, or indeed any fine-tuning of the vocoder, made things worse.
