<a href="https://colab.research.google.com/github/bemxio/colab-notebooks/blob/main/AutomaticVoiceover/AutomaticVoiceover.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automatic Voice-Over
A Colab notebook that takes a video file, processes its audio with [Whisper](https://github.com/openai/whisper) to produce subtitles translated into English, uses [gTTS](https://github.com/pndurette/gTTS) together with [pydub](https://github.com/jiaaro/pydub) to generate a voice-over from those subtitles, and finally uses [ffmpeg](https://github.com/FFmpeg/FFmpeg) to replace the video's audio track with the voice-over.

It also uses `srt` to read the subtitle file and `audio-effects` to slow down a sample.

Made purely for fun and laughs. Do not use it for anything professional, unless you want to look dumb. :3

#### Install required dependencies

In [1]:
!pip install openai-whisper srt gTTS pydub audio-effects

Collecting openai-whisper
  Downloading openai-whisper-20230918.tar.gz (794 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 794.3/794.3 kB 13.0 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting srt
  Downloading srt-3.5.3.tar.gz (28 kB)
  Preparing metadata (setup.py) ... done
Collecting gTTS
  Downloading gTTS-2.3.2-py3-none-any.whl (28 kB)
Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Collecting audio-effects
  Downloading audio_effects-0.22.tar.gz (62 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.1/62.1 kB 7.9 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting tiktoken==0.3.3 (from openai-whisper)
  Downloading tiktoken-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux201

#### Set the parameters and upload the video file

In [3]:
import pathlib

from moviepy.editor import ipython_display
from google.colab import files

# constants set by the user in the notebook
MODEL = "large-v2" # @param ["tiny.en", "tiny", "base.en", "base", "small.en", "small", "medium.en", "medium", "large-v1", "large-v2", "large"]
LANGUAGE = "Polish" # @param ["Afrikaans", "Albanian", "Amharic", "Arabic", "Armenian", "Assamese", "Azerbaijani", "Bashkir", "Basque", "Belarusian", "Bengali", "Bosnian", "Breton", "Bulgarian", "Burmese", "Castilian", "Catalan", "Chinese", "Croatian", "Czech", "Danish", "Dutch", "Estonian", "Faroese", "Finnish", "Flemish", "French", "Galician", "Georgian", "German", "Greek", "Gujarati", "Haitian", "Haitian Creole", "Hausa", "Hawaiian", "Hebrew", "Hindi", "Hungarian", "Icelandic", "Indonesian", "Italian", "Japanese", "Javanese", "Kannada", "Kazakh", "Khmer", "Korean", "Lao", "Latin", "Latvian", "Letzeburgesch", "Lingala", "Lithuanian", "Luxembourgish", "Macedonian", "Malagasy", "Malay", "Malayalam", "Maltese", "Maori", "Marathi", "Moldavian", "Moldovan", "Mongolian", "Myanmar", "Nepali", "Norwegian", "Nynorsk", "Occitan", "Panjabi", "Pashto", "Persian", "Polish", "Portuguese", "Punjabi", "Pushto", "Romanian", "Russian", "Sanskrit", "Serbian", "Shona", "Sindhi", "Sinhala", "Sinhalese", "Slovak", "Slovenian", "Somali", "Spanish", "Sundanese", "Swahili", "Swedish", "Tagalog", "Tajik", "Tamil", "Tatar", "Telugu", "Thai", "Tibetan", "Turkish", "Turkmen", "Ukrainian", "Urdu", "Uzbek", "Valencian", "Vietnamese", "Welsh", "Yiddish", "Yoruba"]
TTS_REGION = "co.uk" # @param {type: "string"}

# show a prompt for file upload
uploads = files.upload()

# get the path of the file
path = pathlib.Path(next(iter(uploads)))

# show a preview of the video
ipython_display(str(path), filetype="video", maxduration=300)

Saving pszeruba ivona.webm to pszeruba ivona.webm
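
Every later cell derives its file names from `path`. Here is a minimal sketch of that naming scheme, using the upload name from the output above. It assumes Whisper writes its `.srt` file into the working directory under the video's stem, which matches how the later cell opens `path.with_suffix(".srt")`. The variable names `subtitles_path`, `voiceover_path` and `output_path` are only for illustration.

```python
from pathlib import PurePosixPath

# Example upload name taken from the cell output above
path = PurePosixPath("pszeruba ivona.webm")

# Assumed location of the subtitle file produced by Whisper
subtitles_path = path.with_suffix(".srt")

# The voice-over is exported under the same stem as a WAV file
voiceover_path = path.with_suffix(".wav")

# The final video keeps the original extension (with_stem needs Python 3.9+)
output_path = path.with_stem(path.stem + " voiceover")

print(subtitles_path)
print(voiceover_path)
print(output_path)
```

Since the name contains a space, anything passed to a shell command has to stay quoted, which is why the `!` commands wrap `{path}` in double quotes.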


#### Generate the subtitles based on the video file

In [4]:
!python3 -m whisper --model {MODEL} --language "{LANGUAGE}" --task translate "{path}" --output_format srt

100%|█████████████████████████████████████| 2.87G/2.87G [00:46<00:00, 66.7MiB/s]
[00:00.000 --> 00:02.320]  PISU PISU PISU MAZU MAZU MAZU
[00:02.640 --> 00:04.640]  PISU PISU PISU MAZU MAZU MAZU
[00:04.640 --> 00:06.560]  PER PER PER PER PER PER PER PER
[00:07.320 --> 00:08.320]  Heil Hitler!
[00:08.320 --> 00:09.200]  Bitch!
[00:09.200 --> 00:09.920]  What?
[00:09.920 --> 00:10.880]  Fuck!
[00:10.880 --> 00:12.320]  Oh fuck, Satan!
[00:12.320 --> 00:13.120]  Satan?
[00:13.120 --> 00:15.120]  Your old man, it's Satan!
[00:15.120 --> 00:17.120]  I came for the cigarettes, you moron!
[00:17.120 --> 00:19.120]  I smoked them all, and I don't have money!
[00:19.120 --> 00:21.120]  But I don't have cigarettes anymore!
[00:21.120 --> 00:22.320]  Oh, fuck!
[00:22.320 --> 00:24.320]  How come you don't have cigarettes, you old whore!
[00:24.320 --> 00:25.920]  Can you not call me?
[00:25.920 --> 00:27.680]  Fuck off, Heil Hitler!
[00:27.760 --> 00:30.160]  You come here and you call me a whore
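
The console shows timestamps as `[MM:SS.mmm --> MM:SS.mmm]`, while the `.srt` file stores them as `HH:MM:SS,mmm`. To make it clearer what the next cell reads, here is a rough standard-library sketch that parses two cues into `timedelta` pairs. `parse_timestamp` and `parse_cues` are hypothetical helpers written for this example; the notebook itself uses `srt.parse`, and this is not that library's implementation.

```python
import re
from datetime import timedelta

# Two cues in SRT form, using the first two lines of the transcript above
SAMPLE = """\
1
00:00:00,000 --> 00:00:02,320
PISU PISU PISU MAZU MAZU MAZU

2
00:00:02,640 --> 00:00:04,640
PISU PISU PISU MAZU MAZU MAZU
"""

TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")

def parse_timestamp(text: str) -> timedelta:
    """Convert an SRT timestamp such as 00:00:02,320 to a timedelta."""
    match = TIMESTAMP.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {text!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds, milliseconds=millis)

def parse_cues(text: str):
    """Yield (start, end, content) tuples from well-formed SRT text."""
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        # lines[0] is the cue index, lines[1] the timing line, the rest is text
        start, end = lines[1].split("-->")
        yield parse_timestamp(start), parse_timestamp(end), "\n".join(lines[2:])

cues = list(parse_cues(SAMPLE))
for start, end, content in cues:
    print(start, end, (end - start).total_seconds(), content)
```

The sketch only handles well-formed input. For real files the `srt` package is the safer choice.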

#### Generate the voice-over based on subtitles

In [5]:
from io import BytesIO

import srt
import gtts
from pydub import AudioSegment

from pydub.effects import speedup as speed_up
from audio_effects import speed_down

# a helper function for changing the speed of a sample
def change_length(segment: AudioSegment, length: float) -> AudioSegment:
    multiplier = segment.duration_seconds / length

    if multiplier > 1:
        return speed_up(segment, multiplier, chunk_size=50)
    elif multiplier < 1:
        return speed_down(segment, multiplier)
    else:
        return segment

# set the cache dictionary
cache = {}

with open(path.with_suffix(".srt"), "r", encoding="utf-8") as file:
    # get the subtitles in a list
    subtitles = list(srt.parse(file.read()))

    # get the duration of the whole voice-over in milliseconds
    duration = subtitles[-1].end.total_seconds() * 1000

    # make a silent audio segment
    audio = AudioSegment.silent(duration=duration)

    for subtitle in subtitles:
        if subtitle.content in cache:
            stream = cache[subtitle.content] # no need to call the API that way
        else:
            stream = BytesIO()

            speech = gtts.gTTS(subtitle.content, lang="en", tld=TTS_REGION, slow=True)
            speech.write_to_fp(stream)

            cache[subtitle.content] = stream

        # seek to the beginning of the stream
        stream.seek(0)

        # import the audio from the stream
        speech = AudioSegment.from_file(stream, format="mp3")

        # get the start, end and the duration of a subtitle
        start = subtitle.start
        end = subtitle.end

        length = (end - start).total_seconds()

        # change the speed of the sample and overlay it onto the voice-over
        audio = audio.overlay(change_length(speech, length), start.total_seconds() * 1000)

        # print debug information about the subtitle
        print(f"Subtitle: {subtitle.content}")
        print(f"Start: {start}")
        print(f"End: {end}")
        print(f"Length: {length}\n")

# get the voice-over path
voiceover = path.with_suffix(".wav")

# export the voice-over to a WAV file
audio.export(voiceover, format="wav")

# show a preview of the voice-over
ipython_display(str(voiceover), filetype="audio", maxduration=300)

Subtitle: PISU PISU PISU MAZU MAZU MAZU
Start: 0:00:00
End: 0:00:02.320000
Length: 2.32

Subtitle: PISU PISU PISU MAZU MAZU MAZU
Start: 0:00:02.640000
End: 0:00:04.640000
Length: 2.0

Subtitle: PER PER PER PER PER PER PER PER
Start: 0:00:04.640000
End: 0:00:06.560000
Length: 1.92

Subtitle: Heil Hitler!
Start: 0:00:07.320000
End: 0:00:08.320000
Length: 1.0

Subtitle: Bitch!
Start: 0:00:08.320000
End: 0:00:09.200000
Length: 0.88

Subtitle: What?
Start: 0:00:09.200000
End: 0:00:09.920000
Length: 0.72

Subtitle: Fuck!
Start: 0:00:09.920000
End: 0:00:10.880000
Length: 0.96

Subtitle: Oh fuck, Satan!
Start: 0:00:10.880000
End: 0:00:12.320000
Length: 1.44

Subtitle: Satan?
Start: 0:00:12.320000
End: 0:00:13.120000
Length: 0.8

Subtitle: Your old man, it's Satan!
Start: 0:00:13.120000
End: 0:00:15.120000
Length: 2.0

Subtitle: I came for the cigarettes, you moron!
Start: 0:00:15.120000
End: 0:00:17.120000
Length: 2.0

Subtitle: I smoked them all, and I don't have money!
Start: 0:00:17.120000


#### Make a video file with the new voice-over
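
The cell below calls ffmpeg through the shell with the paths wrapped in double quotes. As an alternative sketch, the same arguments can be built as a Python list, which sidesteps shell quoting for file names with spaces or quotes in them. `build_ffmpeg_command` is a hypothetical helper, not part of the notebook, and the example only prints the command instead of running it.

```python
import shlex
from pathlib import PurePosixPath

def build_ffmpeg_command(video, voiceover, output):
    """Return the ffmpeg arguments used by the notebook cell as a list."""
    return [
        "ffmpeg",
        "-i", str(video),        # input 0: original video
        "-i", str(voiceover),    # input 1: generated voice-over
        "-c:v", "copy",          # copy the video stream without re-encoding
        "-map", "0:v:0",         # take the first video stream of input 0
        "-map", "1:a:0",         # take the first audio stream of input 1
        str(output),
    ]

path = PurePosixPath("pszeruba ivona.webm")
command = build_ffmpeg_command(
    path,
    path.with_suffix(".wav"),
    path.with_stem(path.stem + " voiceover"),
)

# shlex.join shows the command as it could be pasted into a shell
print(shlex.join(command))

# To actually run it (requires ffmpeg on PATH):
# import subprocess; subprocess.run(command, check=True)
```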

In [6]:
# get the video output path
output = path.with_stem(path.stem + ' voiceover')

# replace the audio track in the video file
!ffmpeg -i "{path}" -i "{voiceover}" -c:v copy -map 0:v:0 -map 1:a:0 "{output}"

# show a preview of the video file
ipython_display(str(output), filetype="video", maxduration=300)

# download the video file
files.download(str(output))

ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enab