# VOCA
VOCA (Voice Operated Character Animation) is an audio-driven, rig-based framework for speech avatar animation. The learned model takes any speech signal as input, including speech in languages other than English, and realistically animates a wide range of adult faces. It can also be applied quickly to subjects not seen during training.
![](https://ps.is.mpg.de/uploads/publication/image/22550/voca.png)

### Check CUDA availability
If no GPU is available, edit the notebook settings to enable a GPU runtime.

In [None]:
#%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


## VOCA Installation

### Install MESH library (psbody)

In [None]:
### Install MESH library
!pip3 install git+https://github.com/MPI-IS/mesh.git

Collecting git+https://github.com/MPI-IS/mesh.git
  Cloning https://github.com/MPI-IS/mesh.git to /tmp/pip-req-build-l35y7z3m
  Running command git clone -q https://github.com/MPI-IS/mesh.git /tmp/pip-req-build-l35y7z3m


### VOCA requirements

In [None]:
### VOCA requirements
%cd /content/
!sudo apt install ffmpeg
!git clone https://github.com/TimoBolkart/voca.git
#!pip install tensorboard==1.15
#!pip install gast==0.3.2
!cd voca && pip install -r requirements.txt

/content
Reading package lists... Done
Building dependency tree       
Reading state information... Done
ffmpeg is already the newest version (7:3.4.8-0ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 39 not upgraded.
fatal: destination path 'voca' already exists and is not an empty directory.


### Download data and models for VOCA demo
The demo expects all models and data to be stored in the user's Google Drive, under the folder **/content/drive/MyDrive/voca/**
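
Before running the unzip cells in the following sections, it can help to confirm that the archives are actually present in that folder. The sketch below is a hypothetical helper (`missing_archives` is not part of VOCA). The file names are the ones used by the cells in this notebook, and the example assumes Google Drive is already mounted at `/content/drive`.

```python
import os

# Archive names taken from the unzip/tar cells in this notebook.
EXPECTED_ARCHIVES = [
    'model.zip',
    'FLAME2020.zip',
    'audio.zip',
    'templates.zip',
    'deepspeech-0.1.0-models.tar.gz',
]


def missing_archives(folder, expected=EXPECTED_ARCHIVES):
    """Return the expected archive names that are not present in `folder`."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(folder, name))]


# Assumed location: Drive mounted at /content/drive.
data_dir = '/content/drive/MyDrive/voca/'
missing = missing_archives(data_dir)
if missing:
    print('Missing from {}: {}'.format(data_dir, ', '.join(missing)))
else:
    print('All archives found.')
```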

In [None]:
'''
### Download all data for VOCA demo
from urllib import request
import os

#### Get VOCA account credentials first
# To get the credentials, please register and confirm your account on the
# following websites:
#   https://voca.is.tue.mpg.de/register.php
#   https://flame.is.tue.mpg.de/register.php
username = 'YOUR_USERNAME'  # add your username
password = 'YOUR_PASSWORD'  # add your password


#### Define path
voca_model_url = 'https://download.is.tue.mpg.de/download.php?domain=voca&resume=1&sfile=model.zip'
voca_audio_sequences_url = 'https://download.is.tue.mpg.de/download.php?domain=voca&resume=1&sfile=audio.zip'
template_meshes_url = 'https://download.is.tue.mpg.de/download.php?domain=voca&resume=1&sfile=templates.zip'
flame_mpi_url = 'https://download.is.tue.mpg.de/download.php?domain=flame&resume=1&sfile=FLAME2020.zip'
deep_speech_url = 'https://github.com/mozilla/DeepSpeech/releases/download/v0.1.0/deepspeech-0.1.0-models.tar.gz'

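#### Download zip files
# os.system runs a shell command, so the IPython "!" prefix must not be used.
# The URLs are quoted because they contain "&".
cmd = ("wget --user='%s' --password='%s' '%s' && "
       "wget --user='%s' --password='%s' '%s' && "
       "wget --user='%s' --password='%s' '%s' && "
       "wget --user='%s' --password='%s' '%s' && "
       "wget '%s'") % (username, password, voca_model_url,
                       username, password, voca_audio_sequences_url,
                       username, password, template_meshes_url,
                       username, password, flame_mpi_url,
                       deep_speech_url)

# Only the DeepSpeech model is downloaded here; this overrides the command above.
cmd = "wget '%s'" % deep_speech_url
os.system(cmd)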
'''


In [None]:
#!wget 'https://github.com/mozilla/DeepSpeech/releases/download/v0.1.0/deepspeech-0.1.0-models.tar.gz' -P /content/drive/MyDrive/voca/

### Place data and models in the appropriate location

In [None]:
### Set up files and place them in the appropriate location
!unzip /content/drive/MyDrive/voca/model.zip -d /content/voca/model

Archive:  /content/drive/MyDrive/voca/model.zip
  inflating: /content/voca/model/gstep_52280.model.data-00000-of-00001  
  inflating: /content/voca/model/gstep_52280.model.index  
  inflating: /content/voca/model/gstep_52280.model.meta  
  inflating: /content/voca/model/readme.pdf  


In [None]:
# Unzip and place FLAME MODEL to appropriate folder
!unzip /content/drive/MyDrive/voca/FLAME2020.zip -d /content/voca/flame

Archive:  /content/drive/MyDrive/voca/FLAME2020.zip
  inflating: /content/voca/flame/female_model.pkl  
  inflating: /content/voca/flame/generic_model.pkl  
  inflating: /content/voca/flame/male_model.pkl  
  inflating: /content/voca/flame/Readme.pdf  


In [None]:
# Unzip and place audio sequences to audio folder
!unzip /content/drive/MyDrive/voca/audio.zip -d /content/voca/audio

Archive:  /content/drive/MyDrive/voca/audio.zip
   creating: /content/voca/audio/FaceTalk_170725_00137_TA/
  inflating: /content/voca/audio/FaceTalk_170725_00137_TA/sentence07.wav  
  inflating: /content/voca/audio/FaceTalk_170725_00137_TA/sentence03.wav  
  inflating: /content/voca/audio/FaceTalk_170725_00137_TA/sentence28.wav  
  inflating: /content/voca/audio/FaceTalk_170725_00137_TA/sentence22.wav  
  inflating: /content/voca/audio/FaceTalk_170725_00137_TA/sentence15.wav  
  inflating: /content/voca/audio/FaceTalk_170725_00137_TA/sentence11.wav  
  inflating: /content/voca/audio/FaceTalk_170725_00137_TA/sentence35.wav  
  inflating: /content/voca/audio/FaceTalk_170725_00137_TA/sentence38.wav  
  inflating: /content/voca/audio/FaceTalk_170725_00137_TA/sentence09.wav  
  inflating: /content/voca/audio/FaceTalk_170725_00137_TA/sentence36.wav  
  inflating: /content/voca/audio/FaceTalk_170725_00137_TA/sentence17.wav  
  inflating: /content/voca/audio/FaceTalk_170725_00137_TA/sentence20

In [None]:
# Unzip and place templates to template folder
!unzip /content/drive/MyDrive/voca/templates.zip -d /content/voca/template

Archive:  /content/drive/MyDrive/voca/templates.zip
  inflating: /content/voca/template/FaceTalk_170725_00137_TA.ply  
  inflating: /content/voca/template/FaceTalk_170728_03272_TA.ply  
  inflating: /content/voca/template/FaceTalk_170731_00024_TA.ply  
  inflating: /content/voca/template/FaceTalk_170809_00138_TA.ply  
  inflating: /content/voca/template/FaceTalk_170811_03274_TA.ply  
  inflating: /content/voca/template/FaceTalk_170811_03275_TA.ply  
  inflating: /content/voca/template/FaceTalk_170904_00128_TA.ply  
  inflating: /content/voca/template/FaceTalk_170904_03276_TA.ply  
  inflating: /content/voca/template/FaceTalk_170908_03277_TA.ply  
  inflating: /content/voca/template/FaceTalk_170912_03278_TA.ply  
  inflating: /content/voca/template/FaceTalk_170913_03279_TA.ply  
  inflating: /content/voca/template/FaceTalk_170915_00223_TA.ply  
  inflating: /content/voca/template/readme.pdf  


In [None]:
### Extract the DeepSpeech model and place it in the ds_graph folder
!tar -xvf /content/drive/MyDrive/voca/deepspeech-0.1.0-models.tar.gz  -C /content/voca/ds_graph

models/
models/lm.binary
models/output_graph.pb
models/trie
models/alphabet.txt


## VOCA demo
This demo runs VOCA, which outputs animation meshes for a given audio sequence and renders the animation sequence to a video.
If the demo fails in Google Colab (a remote environment), deactivate visualization and save all meshes to a folder. You can then view them with an online 3D renderer.
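
With visualization disabled, the saved meshes can still be checked quickly without a viewer. The sketch below counts vertices and faces in one of the `.obj` files that the inference code writes to the `meshes` subfolder of the output path. `summarize_obj` is a hypothetical helper, not part of VOCA, and it only reads the `v` and `f` records of the OBJ format.

```python
def summarize_obj(path):
    """Count vertex ('v') and face ('f') records in a Wavefront OBJ file."""
    num_vertices = 0
    num_faces = 0
    with open(path) as obj_file:
        for line in obj_file:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            if parts[0] == 'v':
                num_vertices += 1
            elif parts[0] == 'f':
                num_faces += 1
            # other records (vt, vn, comments, ...) are ignored
    return num_vertices, num_faces
```

For example, `summarize_obj('./animation_output/meshes/00000.obj')` (the path is an assumption that depends on `--out_path`) should report the same counts for every frame, since all frames share the template topology.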

### Disable Eager Execution 

In [None]:
%%writefile /content/voca/utils/inference.py
'''
Max-Planck-Gesellschaft zur Foerderung der Wissenschaften e.V. (MPG) is holder of all proprietary rights on this
computer program.

You can only use this computer program if you have closed a license agreement with MPG or you get the right to use
the computer program from someone who is authorized to grant you that right.

Any use of the computer program without a valid license is prohibited and liable to prosecution.

Copyright 2019 Max-Planck-Gesellschaft zur Foerderung der Wissenschaften e.V. (MPG). acting on behalf of its
Max Planck Institute for Intelligent Systems and the Max Planck Institute for Biological Cybernetics.
All rights reserved.

More information about VOCA is available at http://voca.is.tue.mpg.de.
For comments or questions, please email us at voca@tue.mpg.de
'''


import os
import cv2
import scipy
import tempfile
import numpy as np
import tensorflow as tf
from subprocess import call
from scipy.io import wavfile


from psbody.mesh import Mesh
from utils.audio_handler import  AudioHandler
from utils.rendering import render_mesh_helper


### Check if TF is running with eager execution and disable it,
### since the graph- and session-based code below requires graph mode
if tf.executing_eagerly():
    tf.compat.v1.disable_eager_execution()


def process_audio(ds_path, audio, sample_rate):
    config = {}
    config['deepspeech_graph_fname'] = ds_path
    config['audio_feature_type'] = 'deepspeech'
    config['num_audio_features'] = 29

    config['audio_window_size'] = 16
    config['audio_window_stride'] = 1

    tmp_audio = {'subj': {'seq': {'audio': audio, 'sample_rate': sample_rate}}}
    audio_handler = AudioHandler(config)
    return audio_handler.process(tmp_audio)['subj']['seq']['audio']


def output_sequence_meshes(sequence_vertices, template, out_path, uv_template_fname='', texture_img_fname=''):
    mesh_out_path = os.path.join(out_path, 'meshes')
    if not os.path.exists(mesh_out_path):
        os.makedirs(mesh_out_path)

    if os.path.exists(uv_template_fname):
        uv_template = Mesh(filename=uv_template_fname)
        vt, ft = uv_template.vt, uv_template.ft
    else:
        vt, ft = None, None

    num_frames = sequence_vertices.shape[0]
    for i_frame in range(num_frames):
        out_fname = os.path.join(mesh_out_path, '%05d.obj' % i_frame)
        out_mesh = Mesh(sequence_vertices[i_frame], template.f)
        if vt is not None and ft is not None:
            out_mesh.vt, out_mesh.ft = vt, ft
        if os.path.exists(texture_img_fname):
            out_mesh.set_texture_image(texture_img_fname)
        out_mesh.write_obj(out_fname)

def render_sequence_meshes(audio_fname, sequence_vertices, template, out_path, uv_template_fname='', texture_img_fname=''):
    if not os.path.exists(out_path):
        os.makedirs(out_path)

    tmp_video_file = tempfile.NamedTemporaryFile('w', suffix='.mp4', dir=out_path)
    if int(cv2.__version__[0]) < 3:
        writer = cv2.VideoWriter(tmp_video_file.name, cv2.cv.CV_FOURCC(*'mp4v'), 60, (800, 800), True)
    else:
        writer = cv2.VideoWriter(tmp_video_file.name, cv2.VideoWriter_fourcc(*'mp4v'), 60, (800, 800), True)

    if os.path.exists(uv_template_fname) and os.path.exists(texture_img_fname):
        uv_template = Mesh(filename=uv_template_fname)
        vt, ft = uv_template.vt, uv_template.ft
        tex_img = cv2.imread(texture_img_fname)[:,:,::-1]
    else:
        vt, ft = None, None
        tex_img = None

    num_frames = sequence_vertices.shape[0]
    center = np.mean(sequence_vertices[0], axis=0)
    for i_frame in range(num_frames):
        render_mesh = Mesh(sequence_vertices[i_frame], template.f)
        if vt is not None and ft is not None:
            render_mesh.vt, render_mesh.ft = vt, ft
        img = render_mesh_helper(render_mesh, center, tex_img=tex_img)
        writer.write(img)
    writer.release()

    video_fname = os.path.join(out_path, 'video.mp4')
    # Build the argument list directly so that paths containing spaces are not split.
    cmd = ['ffmpeg', '-i', audio_fname, '-i', tmp_video_file.name, '-vcodec', 'h264', '-ac', '2',
           '-channel_layout', 'stereo', '-pix_fmt', 'yuv420p', video_fname]
    call(cmd)


def inference(tf_model_fname, ds_fname, audio_fname, template_fname, condition_idx, out_path, render_sequence=True, uv_template_fname='', texture_img_fname=''):
    template = Mesh(filename=template_fname)

    sample_rate, audio = wavfile.read(audio_fname)
    if audio.ndim != 1:
        print('Audio has multiple channels, only first channel is considered')
        audio = audio[:,0]

    processed_audio = process_audio(ds_fname, audio, sample_rate)

    # Load previously saved meta graph in the default graph
    saver = tf.train.import_meta_graph(tf_model_fname + '.meta')
    graph = tf.get_default_graph()

    speech_features = graph.get_tensor_by_name(u'VOCA/Inputs_encoder/speech_features:0')
    condition_subject_id = graph.get_tensor_by_name(u'VOCA/Inputs_encoder/condition_subject_id:0')
    is_training = graph.get_tensor_by_name(u'VOCA/Inputs_encoder/is_training:0')
    input_template = graph.get_tensor_by_name(u'VOCA/Inputs_decoder/template_placeholder:0')
    output_decoder = graph.get_tensor_by_name(u'VOCA/output_decoder:0')

    num_frames = processed_audio.shape[0]
    feed_dict = {speech_features: np.expand_dims(np.stack(processed_audio), -1),
                 condition_subject_id: np.repeat(condition_idx-1, num_frames),
                 is_training: False,
                 input_template: np.repeat(template.v[np.newaxis, :, :, np.newaxis], num_frames, axis=0)}

    with tf.Session() as session:
        # Restore trained model
        saver.restore(session, tf_model_fname)
        predicted_vertices = np.squeeze(session.run(output_decoder, feed_dict))
        output_sequence_meshes(predicted_vertices, template, out_path)
        if(render_sequence):
            render_sequence_meshes(audio_fname, predicted_vertices, template, out_path, uv_template_fname, texture_img_fname)
    tf.reset_default_graph()


def inference_interpolate_styles(tf_model_fname, ds_fname, audio_fname, template_fname, condition_weights, out_path):
    template = Mesh(filename=template_fname)

    sample_rate, audio = wavfile.read(audio_fname)
    if audio.ndim != 1:
        print('Audio has multiple channels, only first channel is considered')
        audio = audio[:, 0]

    processed_audio = process_audio(ds_fname, audio, sample_rate)

    # Load previously saved meta graph in the default graph
    saver = tf.train.import_meta_graph(tf_model_fname + '.meta')
    graph = tf.get_default_graph()

    speech_features = graph.get_tensor_by_name(u'VOCA/Inputs_encoder/speech_features:0')
    condition_subject_id = graph.get_tensor_by_name(u'VOCA/Inputs_encoder/condition_subject_id:0')
    is_training = graph.get_tensor_by_name(u'VOCA/Inputs_encoder/is_training:0')
    input_template = graph.get_tensor_by_name(u'VOCA/Inputs_decoder/template_placeholder:0')
    output_decoder = graph.get_tensor_by_name(u'VOCA/output_decoder:0')

    non_zeros = np.where(condition_weights > 0.0)[0]
    condition_weights[non_zeros] /= sum(condition_weights[non_zeros])

    num_frames = processed_audio.shape[0]
    output_vertices = np.zeros((num_frames, template.v.shape[0], template.v.shape[1]))

    with tf.Session() as session:
        # Restore trained model
        saver.restore(session, tf_model_fname)

        for condition_id in non_zeros:
            feed_dict = {speech_features: np.expand_dims(np.stack(processed_audio), -1),
                         condition_subject_id: np.repeat(condition_id, num_frames),
                         is_training: False,
                         input_template: np.repeat(template.v[np.newaxis, :, :, np.newaxis], num_frames, axis=0)}
            predicted_vertices = np.squeeze(session.run(output_decoder, feed_dict))
            output_vertices += condition_weights[condition_id] * predicted_vertices

        output_sequence_meshes(output_vertices, template, out_path)

Overwriting /content/voca/utils/inference.py
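
The blending step in `inference_interpolate_styles` above can be tried in isolation. The sketch below reproduces the weight normalization and the weighted sum with NumPy. The `predict` callable is a hypothetical stand-in for the TensorFlow session run, and the array shapes are illustrative assumptions.

```python
import numpy as np


def blend_styles(condition_weights, predict, num_frames, num_vertices):
    """Blend per-style vertex predictions using normalized positive weights.

    `predict(condition_id)` stands in for the session.run call and must
    return an array of shape (num_frames, num_vertices, 3).
    """
    weights = np.asarray(condition_weights, dtype=float).copy()
    non_zeros = np.where(weights > 0.0)[0]
    # Normalize the positive weights so that they sum to one.
    weights[non_zeros] /= weights[non_zeros].sum()

    output = np.zeros((num_frames, num_vertices, 3))
    for condition_id in non_zeros:
        output += weights[condition_id] * predict(condition_id)
    return output, weights


# Toy stand-in: style i predicts a mesh whose coordinates all equal i.
def fake_predict(condition_id):
    return np.full((4, 5, 3), float(condition_id))


blended, weights = blend_styles([0.0, 1.0, 0.0, 3.0], fake_predict,
                                num_frames=4, num_vertices=5)
print(weights)        # 0.25 for style 1 and 0.75 for style 3
print(blended[0, 0])  # each coordinate is 0.25 * 1 + 0.75 * 3 = 2.5
```

Unlike the in-place update in the function above, this version copies the weights into a float array, so integer inputs work and the caller's array is left untouched.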


### Inference

In [None]:
%cd /content/voca/
!python3 run_voca.py --tf_model_fname './model/gstep_52280.model' \
--ds_fname './ds_graph/models/output_graph.pb'  \
--audio_fname '/content/drive/MyDrive/voca/trifilo_varvato_k_oraio.wav' \
--template_fname './template/FLAME_sample.ply' --condition_idx 3 --out_path '/content/drive/MyDrive/voca/animation_pao/'
# Alternative output path: './animation_output'. Append "--visualize False" to disable visualization.

/content/voca
Audio has multiple channels, only first channel is considered




2022-04-19 12:01:59.034659: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2022-04-19 12:01:59.051612: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-19 12:01:59.052285: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: 
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:00:04.0
2022-04-19 12:01:59.052607: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2022-04-19 12:01:59.053862: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2022-04-19 12:01:59.054956: I tensorflow/stream_exec

Run VOCA and visualize the meshes with a pre-defined texture (obtained by fitting FLAME to an image using TF_FLAME).

In [None]:
!python run_voca.py --tf_model_fname './model/gstep_52280.model'  \
--ds_fname './ds_graph/models/output_graph.pb' --audio_fname './audio/test_sentence.wav'  \
--template_fname './template/FLAME_sample.ply' --condition_idx 3       \
--uv_template_fname './template/texture_mesh.obj' --texture_img_fname './template/texture_mesh.png'  \
--out_path './animation_pao_textured'





2022-04-19 12:18:28.435044: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2022-04-19 12:18:28.452083: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-19 12:18:28.452720: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: 
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:00:04.0
2022-04-19 12:18:28.453054: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2022-04-19 12:18:28.454372: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2022-04-19 12:18:28.455512: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library 
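
The two `run_voca.py` invocations above differ only in a few arguments, so they can also be assembled programmatically. The sketch below builds the argument list from the flags used in this notebook. `build_voca_command` is a hypothetical helper, and its defaults are simply the paths from the cells above.

```python
def build_voca_command(audio_fname, out_path, condition_idx=3,
                       tf_model_fname='./model/gstep_52280.model',
                       ds_fname='./ds_graph/models/output_graph.pb',
                       template_fname='./template/FLAME_sample.ply',
                       uv_template_fname=None, texture_img_fname=None):
    """Return the run_voca.py command as a list of arguments."""
    cmd = ['python3', 'run_voca.py',
           '--tf_model_fname', tf_model_fname,
           '--ds_fname', ds_fname,
           '--audio_fname', audio_fname,
           '--template_fname', template_fname,
           '--condition_idx', str(condition_idx),
           '--out_path', out_path]
    # The texture flags are only added when both files are given.
    if uv_template_fname and texture_img_fname:
        cmd += ['--uv_template_fname', uv_template_fname,
                '--texture_img_fname', texture_img_fname]
    return cmd


print(' '.join(build_voca_command('./audio/test_sentence.wav', './animation_output')))
```

The list can be passed to `subprocess.run(cmd, cwd='/content/voca', check=True)`. Passing a list instead of a shell string avoids quoting problems with paths that contain spaces.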

### VOCA extension
Use a text-to-speech system to convert text to an audio signal, then pass this signal to the VOCA model. (DeepSpeech itself is a speech-to-text model; VOCA uses it to extract audio features.) Along with the audio signal, add another input that describes the emotional state of the text, obtained through sentiment analysis. This input might require retraining the model; alternatively, the existing model can serve as a baseline that produces a video animation of the speaking person. At this stage the animation will still lack emotion and realistic facial expressions. The video can then be used as the driving video input for fd_vid2vid, along with a synthetic face reference input from StyleGAN.

There are now two options:
 1. Influence the motion and expressions of the face directly at the avatar level (driving video level).
 2. Influence facial expressions through the images generated by StyleGAN. For each word, an image could be generated that matches the sentiment of the text (reference level).

 A third option would be to influence both levels jointly.
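
As a rough illustration of the additional sentiment input, the sketch below concatenates a sentiment vector to a one-hot subject condition for every frame. Everything here is hypothetical: `build_condition`, the one-hot representation, the default of eight subjects, and the three-class sentiment vector are assumptions for illustration, and a model consuming this input would have to be retrained.

```python
import numpy as np


def build_condition(subject_idx, sentiment, num_frames, num_subjects=8):
    """Return a (num_frames, num_subjects + len(sentiment)) conditioning array.

    Each row is a one-hot subject vector followed by the sentiment scores.
    num_subjects=8 is an assumed value, not read from the VOCA model.
    """
    one_hot = np.zeros(num_subjects)
    one_hot[subject_idx] = 1.0
    sentiment = np.asarray(sentiment, dtype=float)
    per_frame = np.concatenate([one_hot, sentiment])
    # Repeat the same condition for every frame of the sequence.
    return np.tile(per_frame, (num_frames, 1))


# Assumed sentiment scores in the order (negative, neutral, positive).
cond = build_condition(subject_idx=2, sentiment=[0.1, 0.7, 0.2], num_frames=60)
print(cond.shape)  # (60, 11)
```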