<a href="https://colab.research.google.com/github/Vipul2084/ProjectV/blob/main/Talking_Head_Final_code_UI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Project Overview

The objective of this Proof of Concept (PoC) is to develop a system that generates lip-synced videos based on user-provided text input. Users will specify various parameters, including language, gender, speaker code, and the choice of one out of four available image options representing different persons. The system will produce a video where the selected person appears to be speaking the provided text in a realistic and synchronized manner.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Scope of Work
Input Parameters:

* Text: User-provided text that the selected person will be lip-syncing.
* Language: The language in which the text is written and will be spoken.
* Gender: The gender of the speaker, affecting the voice modulation.
* Speaker Code: A unique identifier for different speaker voice profiles.
* Image Option: Selection of one out of four predefined images of persons who will appear in the video.

Expected Output:

A lip-synced video featuring the chosen person, where the video output aligns the movements of the person's lips with the provided text.
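
The input parameters listed above could be bundled and checked before any model is run. The following is a minimal sketch, not part of the notebook code: the `TalkingHeadRequest` name is hypothetical, and numbering the image options 1 to 4 is an assumption, since the notebook does not specify how they are identified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TalkingHeadRequest:
    """Hypothetical container for the PoC inputs; not part of the notebook code."""
    text: str
    language: str
    gender: str
    speaker_code: str
    image_option: int  # assumed numbering: 1..4

    def __post_init__(self):
        # Reject empty text and image options outside the assumed range
        if not self.text.strip():
            raise ValueError("text must not be empty")
        if self.image_option not in (1, 2, 3, 4):
            raise ValueError("image_option must be between 1 and 4")


request = TalkingHeadRequest(
    text="Hi, my name is George.",
    language="English",
    gender="Male",
    speaker_code="v2/en_speaker_0",
    image_option=1,
)
print(request.image_option)
```

Validating early like this would surface a bad input before the slow TTS and lip-sync steps start.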

## Technical Specifications
* Programming Language: Python (or any other suitable language)
* Frameworks and Libraries:
  * Text-to-Speech: Use the Suno Bark small model (`suno/bark-small`) for TTS.
  * Lip Sync: Use the Wav2Lip model for generating realistic lip-synced videos.

## Running the code
Upload the `requirements.txt` and `speakers.json` files first (this notebook reads both from `/content/drive/MyDrive/Talkinghead/`), then install the dependencies with the command below:

pip install -r requirements.txt

## Installing libraries and models

In [22]:
pip install -r /content/drive/MyDrive/Talkinghead/requirements.txt

Collecting cudf-cu12@ https://pypi.nvidia.com/cudf-cu12/cudf_cu12-24.4.1-cp310-cp310-manylinux_2_28_x86_64.whl#sha256=57366e7ef09dc63e0b389aff20df6c37d91e2790065861ee31a4720149f5b694 (from -r /content/drive/MyDrive/Talkinghead/requirements.txt (line 73))
  Using cached https://pypi.nvidia.com/cudf-cu12/cudf_cu12-24.4.1-cp310-cp310-manylinux_2_28_x86_64.whl (473.3 MB)
Collecting en-core-web-sm@ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl#sha256=86cc141f63942d4b2c5fcee06630fd6f904788d2f0ab005cce45aadb8fb73889 (from -r /content/drive/MyDrive/Talkinghead/requirements.txt (line 104))
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 71.3 MB/s eta 0:00:00
Collecting ghc@ https://raw.githubusercontent.com/AwaleSajil/ghc/mast

In [23]:
# Clone the Wav2Lip repository from GitHub
!git clone https://github.com/zabique/Wav2Lip

# Download the pretrained Wav2Lip model
!wget 'https://iiitaphyd-my.sharepoint.com/personal/radrabha_m_research_iiit_ac_in/_layouts/15/download.aspx?share=EdjI7bZlgApMqsVoEUUXpLsBxqXbn5z8VTmoxp55YNDcIA' -O '/content/Wav2Lip/checkpoints/wav2lip_gan.pth'

# Install the ghc helper wheel; assigning to `a` captures pip's output instead of printing it
a = !pip install https://raw.githubusercontent.com/AwaleSajil/ghc/master/ghc-1.0-py3-none-any.whl

# Install requirements from the Wav2Lip repository
!cd Wav2Lip && pip install -r requirements.txt

# Install youtube-dl for downloading videos
!pip install -q youtube-dl

# Install librosa for audio processing (specific version 0.9.1)
!pip install librosa==0.9.1

# Install moviepy for video editing
!pip install -q moviepy

variable_name = False  # Initialize a variable (not used in this snippet)

# Remove the default sample data directory and create a new one
!rm -rf /content/sample_data
!mkdir /content/sample_data

#Required libraries for audio processing and display
import torch
import scipy.io.wavfile  # For reading and writing WAV files
from transformers import BarkModel, AutoProcessor  # Hugging Face Transformers
from IPython.display import Audio, HTML  # For displaying audio and HTML in Jupyter notebooks
import time  # For time-related functions
from base64 import b64encode  # For encoding binary data to base64
import json  # For working with JSON data
from moviepy.editor import VideoFileClip, AudioFileClip  # For editing video and audio files
import random  # For generating random numbers



fatal: destination path 'Wav2Lip' already exists and is not an empty directory.
--2024-07-27 22:36:11--  https://iiitaphyd-my.sharepoint.com/personal/radrabha_m_research_iiit_ac_in/_layouts/15/download.aspx?share=EdjI7bZlgApMqsVoEUUXpLsBxqXbn5z8VTmoxp55YNDcIA
Resolving iiitaphyd-my.sharepoint.com (iiitaphyd-my.sharepoint.com)... 13.107.136.10, 13.107.138.10, 2620:1ec:8f8::10, ...
Connecting to iiitaphyd-my.sharepoint.com (iiitaphyd-my.sharepoint.com)|13.107.136.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 435801865 (416M) [application/octet-stream]
Saving to: ‘/content/Wav2Lip/checkpoints/wav2lip_gan.pth’


2024-07-27 22:36:18 (64.3 MB/s) - ‘/content/Wav2Lip/checkpoints/wav2lip_gan.pth’ saved [435801865/435801865]
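
If the checkpoint download is ever interrupted, Wav2Lip would fail later with a less obvious error. One possible guard is to compare the file size with the length reported by `wget` in the log above (435801865 bytes). This is a sketch under that assumption; `checkpoint_looks_complete` is a hypothetical helper, and a size check does not prove the file contents are correct.

```python
import os
import tempfile

EXPECTED_BYTES = 435801865  # size reported by wget in the log above


def checkpoint_looks_complete(path, expected_bytes=EXPECTED_BYTES):
    """Return True if the file exists and has exactly the expected size."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes


# Demonstration on a small temporary file instead of the real 416 MB checkpoint
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
demo_path = tmp.name

matches_small = checkpoint_looks_complete(demo_path, expected_bytes=10)
matches_full = checkpoint_looks_complete(demo_path)
os.remove(demo_path)

print(matches_small, matches_full)
```

In the notebook, the call would be `checkpoint_looks_complete('/content/Wav2Lip/checkpoints/wav2lip_gan.pth')`.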



In [24]:
!pip install ipython==8.4.0




## Text to speech module

In [25]:
import json
# Open the 'speakers.json' file in read mode
with open('/content/drive/MyDrive/Talkinghead/speakers.json', 'r') as f:
    # Load the JSON data from the file into a Python dictionary
    speakers = json.load(f)


In [26]:
def choose_voice(data, language, gender, speaker_code):
    """
    Choose a specific voice based on language, gender, and speaker code.

    Parameters:
    - data (list of dict): List of speaker dictionaries, each containing 'language', 'gender', and 'code'.
    - language (str): The desired language of the speaker.
    - gender (str): The desired gender of the speaker.
    - speaker_code (str): The specific code of the speaker.

    Returns:
    - dict or None: The dictionary of the selected speaker if found, otherwise None.
    """

    # Filter speakers by the desired language
    speakers_by_language = [d for d in data if d['language'] == language]

    # Further filter by the desired gender
    speakers_by_gender = [d for d in speakers_by_language if d['gender'] == gender]

    # Find the specific speaker with the given code
    selected_speaker = next((d for d in speakers_by_gender if d['code'] == speaker_code), None)

    return selected_speaker
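
The structure of `speakers.json` is not shown in this notebook. Judging by the keys that `choose_voice` reads (`language`, `gender`, `code`) and the inputs used further down, it is presumably a list of dictionaries like the one below. The sketch assumes that shape; the second entry is made up for illustration.

```python
# Assumed shape of speakers.json: a list of dicts with these three keys
sample_speakers = [
    {"language": "English", "gender": "Male", "code": "v2/en_speaker_0"},
    {"language": "English", "gender": "Female", "code": "v2/en_speaker_9"},  # illustrative entry
]


def choose_voice(data, language, gender, speaker_code):
    """Same lookup as above, written as a single pass over the list."""
    return next(
        (d for d in data
         if d["language"] == language
         and d["gender"] == gender
         and d["code"] == speaker_code),
        None,
    )


match = choose_voice(sample_speakers, "English", "Male", "v2/en_speaker_0")
miss = choose_voice(sample_speakers, "english", "male", "v2/en_speaker_0")
print(match)
print(miss)
```

The second call returns `None` because the comparison is exact and case-sensitive, so user input such as `"english"` would need to be normalized before the lookup.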


In [27]:
def text_to_speech(text, language, gender, speaker_code):
    """
    Convert text to speech using the specified parameters.

    Parameters:
    - text (str): The text to convert to speech.
    - language (str): The desired language of the voice.
    - gender (str): The desired gender of the voice.
    - speaker_code (str): The code of the specific speaker to use.

    Returns:
    - speech_output (tensor): The generated speech output.
    - sampling_rate (int): The sampling rate of the generated speech.
    """

    # Initialize model and processor
    model = BarkModel.from_pretrained("suno/bark-small")
    processor = AutoProcessor.from_pretrained("suno/bark")

    # Determine the computation device (GPU if available, otherwise CPU)
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    model = model.to(device)

    # Choose the specific voice based on parameters
    voice = choose_voice(speakers, language, gender, speaker_code)

    if voice is None:
        raise ValueError("The specified voice was not found.")

    # Process the text with the voice preset
    inputs = processor(text, voice_preset=voice['code'])

    # Generate speech from the processed inputs
    speech_output = model.generate(**inputs.to(device))

    # Retrieve the sampling rate from the model's configuration
    sampling_rate = model.generation_config.sample_rate

    return speech_output, sampling_rate
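
As written, `text_to_speech` loads the model and processor on every call. One possible way to avoid the repeated load is to cache the loader with `functools.lru_cache`. The sketch below uses a placeholder tuple so that it runs without downloading anything; `load_tts` is a hypothetical helper, and in the notebook its body would call `BarkModel.from_pretrained` and `AutoProcessor.from_pretrained` instead.

```python
from functools import lru_cache

load_count = 0  # only here to show that loading happens once


@lru_cache(maxsize=None)
def load_tts(model_name):
    """Hypothetical cached loader.

    A placeholder tuple stands in for the (model, processor) pair that the
    notebook would build with from_pretrained.
    """
    global load_count
    load_count += 1
    return ("model:" + model_name, "processor:" + model_name)


first = load_tts("suno/bark-small")
second = load_tts("suno/bark-small")
print(first is second, load_count)
```

Both calls return the same object and the loader body runs only once, which is the behavior that would save time across repeated TTS requests.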


In [28]:
def save_audio(text, language, gender, speaker_code):
    """
    Generate speech from text and save it as a WAV file.

    Parameters:
    - text (str): The text to convert to speech.
    - language (str): The desired language of the voice.
    - gender (str): The desired gender of the voice.
    - speaker_code (str): The code of the specific speaker to use.

    Returns:
    - str: Path to the saved WAV file.
    """
    # Generate speech output and retrieve the sampling rate
    speech_output, sampling_rate = text_to_speech(text, language, gender, speaker_code)

    # Convert speech output tensor to numpy array and ensure it's 1D
    audio_data = speech_output.cpu().numpy().squeeze()

    # Save the audio data to a WAV file
    scipy.io.wavfile.write("/content/audio.wav", rate=sampling_rate, data=audio_data)
    tts_audio_path = "/content/audio.wav"

    print("Audio saved as /content/audio.wav")
    return tts_audio_path
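
`save_audio` writes the array exactly as the model returns it, so a floating-point array results in a floating-point WAV file. If a downstream tool expects 16-bit PCM instead, the samples can be scaled and written as integers. The following is a sketch using only the standard library; `write_pcm16_wav` is a hypothetical helper, and the 24000 Hz rate and test tone are example values, not taken from the notebook.

```python
import math
import os
import struct
import tempfile
import wave


def write_pcm16_wav(path, samples, sample_rate):
    """Write float samples in [-1.0, 1.0] as a mono 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))           # clip to the valid range
        frames += struct.pack("<h", int(s * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)              # mono
        wav_file.setsampwidth(2)              # 2 bytes = 16 bits
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# One second of a 440 Hz tone as stand-in audio
rate = 24000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
out_path = os.path.join(tempfile.gettempdir(), "tone_pcm16.wav")
write_pcm16_wav(out_path, tone, rate)

# Read the header back to confirm the format
with wave.open(out_path, "rb") as check:
    n_frames, width = check.getnframes(), check.getsampwidth()
print(n_frames, width)
```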



## Wav2Lip module

In [29]:
def generate_video(tts_audio_path, video_path):
    """
    Generate a lip-synced video using Wav2Lip.

    Parameters:
    - tts_audio_path (str): Path to the text-to-speech audio file.
    - video_path (str): Path to the input video file.

    Returns:
    - None
    """

    # Command to run the Wav2Lip inference script
    !cd Wav2Lip && python inference.py --checkpoint_path checkpoints/wav2lip_gan.pth --face "{video_path}" --audio "{tts_audio_path}"

    print("Video generation complete.")
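
The `!` shell syntax only works inside IPython or Colab. Outside a notebook, the same call could be made with `subprocess`. This is a sketch, not the notebook's method: `build_wav2lip_command` and `run_wav2lip` are hypothetical helpers, and the flags are the ones used in the cell above.

```python
import subprocess
import sys


def build_wav2lip_command(face_path, audio_path,
                          checkpoint="checkpoints/wav2lip_gan.pth"):
    """Return the argument list for Wav2Lip's inference.py (hypothetical helper)."""
    return [
        sys.executable, "inference.py",
        "--checkpoint_path", checkpoint,
        "--face", face_path,
        "--audio", audio_path,
    ]


def run_wav2lip(face_path, audio_path, repo_dir="Wav2Lip"):
    """Run inference inside the cloned repo; raises CalledProcessError on failure."""
    subprocess.run(build_wav2lip_command(face_path, audio_path),
                   cwd=repo_dir, check=True)


# Only build and inspect the command here; run_wav2lip needs the cloned repo
cmd = build_wav2lip_command("/content/drive/MyDrive/Talkinghead/1.mp4",
                            "/content/audio.wav")
print(cmd[1:])
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces, and `check=True` turns a failed inference run into an exception instead of a silent "complete" message.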

## MoviePy module

In [30]:
def sync_audio_with_video(video_path, tts_audio_path):
    """
    Replace a video's audio track with the TTS audio, trimming the video to a
    random window or looping it so that it matches the audio duration.
    No lip syncing is performed here.

    Parameters:
    - video_path (str): Path to the input video file.
    - tts_audio_path (str): Path to the TTS audio file.

    Returns:
    - None. The result is written to /content/output.mp4.
    """
    # Load the video and audio clips
    video_clip = VideoFileClip(video_path)
    audio_clip = AudioFileClip(tts_audio_path)

    # Determine the duration of the video and audio
    video_duration = video_clip.duration
    audio_duration = audio_clip.duration

    # Calculate the random start time for the video subclip
    if video_duration > audio_duration:
        max_start_time = video_duration - audio_duration
        start_time = random.uniform(0, max_start_time)
        video_clip = video_clip.subclip(start_time, start_time + audio_duration)
    else:
        # If the video duration is less than or equal to the audio duration, loop the video
        video_clip = video_clip.loop(duration=audio_duration)

    # Set the audio of the video clip to the new TTS audio
    final_video = video_clip.set_audio(audio_clip)

    # Write the result to a file
    final_video.write_videofile("/content/output.mp4", codec="libx264", audio_codec="aac")

    print("Video saved")
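
The trimming rule in `sync_audio_with_video` can be isolated from MoviePy and checked on plain numbers. The sketch below mirrors that rule; `pick_window` is a hypothetical helper, and the seed is there only to make the example repeatable.

```python
import random


def pick_window(video_duration, audio_duration, rng=random):
    """Return (start, end) of the subclip, or None when the video must be looped."""
    if video_duration > audio_duration:
        # Any start up to this value leaves room for the full audio
        start = rng.uniform(0, video_duration - audio_duration)
        return start, start + audio_duration
    return None


rng = random.Random(0)  # seeded only for a repeatable example
window = pick_window(30.0, 7.5, rng)
print(window)
print(pick_window(5.0, 7.5))
```

For a 30-second video and 7.5 seconds of audio, the start always falls between 0 and 22.5 seconds and the window is exactly 7.5 seconds long. When the video is shorter than the audio, the function returns `None`, which corresponds to the looping branch.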


## Display video

In [31]:
def display_video(file_path):
    """
    Display a video in an IPython notebook using base64 encoding.

    Parameters:
    - file_path (str): Path to the video file to be displayed.

    Returns:
    - HTML: HTML object to render the video in the notebook.
    """
    # Read the video file in binary mode
    with open(file_path, 'rb') as video_file:
        mp4 = video_file.read()

    # Encode the video file in base64
    data_url = "data:video/mp4;base64," + b64encode(mp4).decode()

    # Create HTML for displaying the video
    video_html = f"""
    <video width="50%" height="50%" controls>
          <source src="{data_url}" type="video/mp4">
    </video>"""

    # Return the HTML object
    return HTML(video_html)
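
The data URL that `display_video` builds can be tried on a few bytes to see what ends up embedded in the page. In this sketch, `to_data_url` is a hypothetical helper and the sample bytes are a placeholder, not a playable video.

```python
from base64 import b64decode, b64encode


def to_data_url(payload, mime_type="video/mp4"):
    """Encode raw bytes as a data URL, as display_video does for the MP4 file."""
    return "data:" + mime_type + ";base64," + b64encode(payload).decode("ascii")


sample = b"\x00\x00\x00\x18ftypmp42"  # placeholder bytes only
url = to_data_url(sample)
encoded_part = url.split(",", 1)[1]

print(url)
print(len(sample), len(encoded_part))
```

Base64 turns every 3 bytes into 4 characters, so the embedded video is about a third larger than the file on disk. For long clips this makes the notebook output heavy.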



In [32]:
wav2lip_output_path = "/content/Wav2Lip/results/result_voice.mp4"
moviepy_output_path = "/content/output.mp4"

## Providing the inputs

In [33]:
text = "Hi my name is George, i am news reader for evening news channel"
language = "English"
gender = "Male"
speaker_code = "v2/en_speaker_0"
video_path = "/content/drive/MyDrive/Talkinghead/1.mp4"
wav2lip_path = "/content/drive/MyDrive/Talkinghead"

## Calling the TTS function to generate audio

In [34]:
tts_audio_path = save_audio(text, language, gender, speaker_code)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



config.json:   0%|          | 0.00/8.80k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.68G [00:00<?, ?B/s]


  self.register_buffer("padding_total", torch.tensor(kernel_size - stride, dtype=torch.int64), persistent=False)



generation_config.json:   0%|          | 0.00/4.91k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/353 [00:00<?, ?B/s]

speaker_embeddings_path.json:   0%|          | 0.00/61.1k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.92M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

en_speaker_0_semantic_prompt.npy:   0%|          | 0.00/2.86k [00:00<?, ?B/s]

en_speaker_0_coarse_prompt.npy:   0%|          | 0.00/8.32k [00:00<?, ?B/s]

en_speaker_0_fine_prompt.npy:   0%|          | 0.00/16.5k [00:00<?, ?B/s]

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Audio saved as /content/audio.wav


Generating this audio took 37 seconds on a T4 GPU.

## Calling the Wav2Lip function to make a video

Upload the video files before running this cell.

In [35]:
generate_video(tts_audio_path, video_path)

Using cpu for inference.
Reading video frames...
Number of frames available for inference: 922
  return librosa.filters.mel(hp.sample_rate, hp.n_fft, n_mels=hp.num_mels,
(80, 337)
Length of mel chunks: 243
  0% 0/2 [00:00<?, ?it/s]Downloading: "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" to /root/.cache/torch/hub/checkpoints/s3fd-619a316812.pth


In [36]:
display_video(wav2lip_output_path)

## Calling the MoviePy function to generate a video

In [37]:
sync_audio_with_video(video_path, tts_audio_path)

Moviepy - Building video /content/output.mp4.
MoviePy - Writing audio in outputTEMP_MPY_wvf_snd.mp4




MoviePy - Done.
Moviepy - Writing video /content/output.mp4





Moviepy - Done !
Moviepy - video ready /content/output.mp4
Video saved


In [38]:
display_video(moviepy_output_path)

## Gradio app

In [41]:
!pip install gradio moviepy scipy




In [42]:
import gradio as gr
import torch
import scipy.io.wavfile
from moviepy.editor import VideoFileClip, AudioFileClip
from transformers import AutoProcessor, BarkModel
from base64 import b64encode
from IPython.display import HTML

import json
import random  # needed by sync_audio_with_video

# Open the 'speakers.json' file in read mode
with open('/content/drive/MyDrive/Talkinghead/speakers.json', 'r') as f:
    # Load the JSON data from the file into a Python dictionary
    speakers = json.load(f)

def choose_voice(data, language, gender, speaker_code):
    speakers_by_language = [d for d in data if d['language'] == language]
    speakers_by_gender = [d for d in speakers_by_language if d['gender'] == gender]
    selected_speaker = next((d for d in speakers_by_gender if d['code'] == speaker_code), None)
    return selected_speaker

def text_to_speech(text, language, gender, speaker_code):
    model = BarkModel.from_pretrained("suno/bark-small")
    processor = AutoProcessor.from_pretrained("suno/bark")
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    voice = choose_voice(speakers, language, gender, speaker_code)
    if voice is None:
        raise ValueError("The specified voice was not found.")
    inputs = processor(text, voice_preset=voice['code'])
    speech_output = model.generate(**inputs.to(device))
    sampling_rate = model.generation_config.sample_rate
    return speech_output, sampling_rate

def save_audio(text, language, gender, speaker_code):
    speech_output, sampling_rate = text_to_speech(text, language, gender, speaker_code)
    audio_data = speech_output.cpu().numpy().squeeze()
    scipy.io.wavfile.write("/content/audio.wav", rate=sampling_rate, data=audio_data)
    tts_audio_path = "/content/audio.wav"
    return tts_audio_path

def generate_video(tts_audio_path, video_path):
    !cd Wav2Lip && python inference.py --checkpoint_path checkpoints/wav2lip_gan.pth --face "{video_path}" --audio "{tts_audio_path}"
    print("Video generation complete.")

def sync_audio_with_video(video_path, tts_audio_path):
    video_clip = VideoFileClip(video_path)
    audio_clip = AudioFileClip(tts_audio_path)
    video_duration = video_clip.duration
    audio_duration = audio_clip.duration
    if video_duration > audio_duration:
        max_start_time = video_duration - audio_duration
        start_time = random.uniform(0, max_start_time)
        video_clip = video_clip.subclip(start_time, start_time + audio_duration)
    else:
        video_clip = video_clip.loop(duration=audio_duration)
    final_video = video_clip.set_audio(audio_clip)
    final_video.write_videofile("/content/output.mp4", codec="libx264", audio_codec="aac")
    return "/content/output.mp4"

def display_video(file_path):
    with open(file_path, 'rb') as video_file:
        mp4 = video_file.read()
    data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
    video_html = f"""
    <video width="50%" height="50%" controls>
          <source src="{data_url}" type="video/mp4">
    </video>"""
    return HTML(video_html)

# Gradio interface functions
def process(text, language, gender, speaker_code, video_path):
    tts_audio_path = save_audio(text, language, gender, speaker_code)
    output_video_path = sync_audio_with_video(video_path, tts_audio_path)
    # gr.HTML expects a plain HTML string, so unwrap the IPython HTML object
    return display_video(output_video_path).data

# Create Gradio interface
interface = gr.Interface(
    fn=process,
    inputs=[
        gr.components.Textbox(lines=2, placeholder="Enter text here...", label="Text"),
        gr.components.Textbox(placeholder="Language (e.g., 'English')", label="Language"),
        gr.components.Textbox(placeholder="Gender (e.g., 'Male')", label="Gender"),
        gr.components.Textbox(placeholder="Speaker Code (e.g., 'v2/en_speaker_0')", label="Speaker Code"),
        gr.components.Textbox(placeholder="Video Path", label="Video Path")
    ],
    outputs=gr.components.HTML(label="Output Video"),
    title="Text-to-Speech to Video Generator"
)

# Launch the interface
interface.launch()


Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://68f5060d98addc91f6.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




## Gradio app with image input

In [43]:
import gradio as gr
import torch
import scipy.io.wavfile
from moviepy.editor import VideoFileClip, AudioFileClip
from transformers import AutoProcessor, BarkModel
from base64 import b64encode
from IPython.display import HTML
import json
import random  # needed by sync_audio_with_video

# Load speakers data
with open('/content/drive/MyDrive/Talkinghead/speakers.json', 'r') as f:
    speakers = json.load(f)

def choose_voice(data, language, gender, speaker_code):
    speakers_by_language = [d for d in data if d['language'] == language]
    speakers_by_gender = [d for d in speakers_by_language if d['gender'] == gender]
    selected_speaker = next((d for d in speakers_by_gender if d['code'] == speaker_code), None)
    return selected_speaker

def text_to_speech(text, language, gender, speaker_code):
    model = BarkModel.from_pretrained("suno/bark-small")
    processor = AutoProcessor.from_pretrained("suno/bark")
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    voice = choose_voice(speakers, language, gender, speaker_code)
    if voice is None:
        raise ValueError("The specified voice was not found.")
    inputs = processor(text, voice_preset=voice['code'])
    speech_output = model.generate(**inputs.to(device))
    sampling_rate = model.generation_config.sample_rate
    return speech_output, sampling_rate

def save_audio(text, language, gender, speaker_code):
    speech_output, sampling_rate = text_to_speech(text, language, gender, speaker_code)
    audio_data = speech_output.cpu().numpy().squeeze()
    scipy.io.wavfile.write("/content/audio.wav", rate=sampling_rate, data=audio_data)
    tts_audio_path = "/content/audio.wav"
    return tts_audio_path

def generate_video(tts_audio_path, video_path):
    !cd Wav2Lip && python inference.py --checkpoint_path checkpoints/wav2lip_gan.pth --face "{video_path}" --audio "{tts_audio_path}"
    print("Video generation complete.")

def sync_audio_with_video(video_path, tts_audio_path):
    video_clip = VideoFileClip(video_path)
    audio_clip = AudioFileClip(tts_audio_path)
    video_duration = video_clip.duration
    audio_duration = audio_clip.duration
    if video_duration > audio_duration:
        max_start_time = video_duration - audio_duration
        start_time = random.uniform(0, max_start_time)
        video_clip = video_clip.subclip(start_time, start_time + audio_duration)
    else:
        video_clip = video_clip.loop(duration=audio_duration)
    final_video = video_clip.set_audio(audio_clip)
    final_video.write_videofile("/content/output.mp4", codec="libx264", audio_codec="aac")
    return "/content/output.mp4"

def display_video(file_path):
    with open(file_path, 'rb') as video_file:
        mp4 = video_file.read()
    data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
    video_html = f"""
    <video width="50%" height="50%" controls>
          <source src="{data_url}" type="video/mp4">
    </video>"""
    return HTML(video_html)

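# Gradio interface functions
def process(text, language, gender, speaker_code, image_index, image):
    # The interface passes six inputs. image_index comes from the "Select Image"
    # radio and is not used yet; the uploaded image path is what gets passed on.
    tts_audio_path = save_audio(text, language, gender, speaker_code)
    output_video_path = sync_audio_with_video(image, tts_audio_path)
    # gr.HTML expects a plain HTML string, so unwrap the IPython HTML object
    return display_video(output_video_path).data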

# Create Gradio interface
interface = gr.Interface(
    fn=process,
    inputs=[
        gr.Textbox(lines=2, placeholder="Enter text here...", label="Text"),
        gr.Textbox(placeholder="Language (e.g., 'English')", label="Language"),
        gr.Textbox(placeholder="Gender (e.g., 'Male')", label="Gender"),
        gr.Textbox(placeholder="Speaker Code (e.g., 'v2/en_speaker_0')", label="Speaker Code"),
        gr.Radio(label="Select Image", choices=["image1", "image2", "image3", "image4"], type="index"),
        gr.Image(type="filepath", label="Image")  # Assuming images are used for the face
    ],
    outputs=gr.HTML(label="Output Video"),
    title="Talking Head AI Video Generation Tool"
)

interface.launch()






Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://c8d2a05dd6c2f38a6a.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
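
A `gr.Radio` created with `type="index"` passes the zero-based position of the selected choice to the function. One way to turn that index into one of the four predefined images from the scope of work is a simple lookup. This is a sketch: the file paths are hypothetical, `resolve_image` is not part of the notebook, and letting an uploaded image take precedence is an assumed design choice.

```python
# Hypothetical locations of the four predefined images; adjust to the real files
PREDEFINED_IMAGES = [
    "/content/drive/MyDrive/Talkinghead/image1.png",
    "/content/drive/MyDrive/Talkinghead/image2.png",
    "/content/drive/MyDrive/Talkinghead/image3.png",
    "/content/drive/MyDrive/Talkinghead/image4.png",
]


def resolve_image(radio_index, uploaded_path):
    """Prefer an uploaded image; otherwise use the radio selection (zero-based)."""
    if uploaded_path:
        return uploaded_path
    if radio_index is None:
        raise ValueError("Select one of the predefined images or upload one.")
    return PREDEFINED_IMAGES[radio_index]


print(resolve_image(2, None))
print(resolve_image(None, "/tmp/custom_face.png"))
```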




## Gradio app with and without lip sync

In [44]:
import gradio as gr
import torch
import scipy.io.wavfile
from moviepy.editor import VideoFileClip, AudioFileClip
from transformers import AutoProcessor, BarkModel
from base64 import b64encode
from IPython.display import HTML
import json
import random  # needed by sync_audio_with_video

# Load speakers data
with open('/content/drive/MyDrive/Talkinghead/speakers.json', 'r') as f:
    speakers = json.load(f)

def choose_voice(data, language, gender, speaker_code):
    speakers_by_language = [d for d in data if d['language'] == language]
    speakers_by_gender = [d for d in speakers_by_language if d['gender'] == gender]
    selected_speaker = next((d for d in speakers_by_gender if d['code'] == speaker_code), None)
    return selected_speaker

def text_to_speech(text, language, gender, speaker_code):
    model = BarkModel.from_pretrained("suno/bark-small")
    processor = AutoProcessor.from_pretrained("suno/bark")
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    voice = choose_voice(speakers, language, gender, speaker_code)
    if voice is None:
        raise ValueError("The specified voice was not found.")
    inputs = processor(text, voice_preset=voice['code'])
    speech_output = model.generate(**inputs.to(device))
    sampling_rate = model.generation_config.sample_rate
    return speech_output, sampling_rate

def save_audio(text, language, gender, speaker_code):
    speech_output, sampling_rate = text_to_speech(text, language, gender, speaker_code)
    audio_data = speech_output.cpu().numpy().squeeze()
    scipy.io.wavfile.write("/content/audio.wav", rate=sampling_rate, data=audio_data)
    tts_audio_path = "/content/audio.wav"
    return tts_audio_path

def generate_video_with_lipsync(tts_audio_path, video_path):
    !cd Wav2Lip && python inference.py --checkpoint_path checkpoints/wav2lip_gan.pth --face "{video_path}" --audio "{tts_audio_path}"
    return "/content/Wav2Lip/results/result_voice.mp4"

def sync_audio_with_video(video_path, tts_audio_path):
    video_clip = VideoFileClip(video_path)
    audio_clip = AudioFileClip(tts_audio_path)
    video_duration = video_clip.duration
    audio_duration = audio_clip.duration
    if video_duration > audio_duration:
        max_start_time = video_duration - audio_duration
        start_time = random.uniform(0, max_start_time)
        video_clip = video_clip.subclip(start_time, start_time + audio_duration)
    else:
        video_clip = video_clip.loop(duration=audio_duration)
    final_video = video_clip.set_audio(audio_clip)
    final_video.write_videofile("/content/output.mp4", codec="libx264", audio_codec="aac")
    return "/content/output.mp4"

def display_video(file_path):
    with open(file_path, 'rb') as video_file:
        mp4 = video_file.read()
    data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
    video_html = f"""
    <video width="50%" height="50%" controls>
          <source src="{data_url}" type="video/mp4">
    </video>"""
    return HTML(video_html)

def process_with_lipsync(text, language, gender, speaker_code, image):
    tts_audio_path = save_audio(text, language, gender, speaker_code)
    output_video_path = generate_video_with_lipsync(tts_audio_path, image)
    # gr.HTML expects a plain HTML string, so unwrap the IPython HTML object
    return display_video(output_video_path).data

def process_without_lipsync(text, language, gender, speaker_code, image):
    tts_audio_path = save_audio(text, language, gender, speaker_code)
    output_video_path = sync_audio_with_video(image, tts_audio_path)
    # gr.HTML expects a plain HTML string, so unwrap the IPython HTML object
    return display_video(output_video_path).data

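def process_both(text, language, gender, speaker_code, image):
    # Generate the TTS audio once and reuse it, so both videos share the same audio
    tts_audio_path = save_audio(text, language, gender, speaker_code)
    lipsync_path = generate_video_with_lipsync(tts_audio_path, image)
    plain_path = sync_audio_with_video(image, tts_audio_path)
    # gr.HTML expects plain HTML strings, so unwrap the IPython HTML objects
    return display_video(lipsync_path).data, display_video(plain_path).data

# Create Gradio interface
interface = gr.Interface(
    fn=process_both,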
    inputs=[
        gr.Textbox(lines=2, placeholder="Enter text here...", label="Text"),
        gr.Textbox(placeholder="Language (e.g., 'English')", label="Language"),
        gr.Textbox(placeholder="Gender (e.g., 'Male')", label="Gender"),
        gr.Textbox(placeholder="Speaker Code (e.g., 'v2/en_speaker_0')", label="Speaker Code"),
        gr.Image(type="filepath", label="Image")  # Assuming images are used for the face
    ],
    outputs=[
        gr.HTML(label="Output Video with Lip Sync"),
        gr.HTML(label="Output Video without Lip Sync")
    ],
    title="Talking Head AI Video Generation Tool"
)

interface.launch()


Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://3ba24fdc9bb46b6854.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


