<a href="https://colab.research.google.com/github/aioi50/vvray/blob/main/faster_whisper_youtube_drive.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **YouTube/Google Drive Video Translation/Transcription with Faster Whisper**

[faster-whisper](https://github.com/guillaumekln/faster-whisper) is a reimplementation of OpenAI's Whisper model using CTranslate2, which is a fast inference engine for Transformer models.

This implementation is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.

This Notebook will guide you through the transcription or translation of a video file (from YouTube/Google Drive) using Faster Whisper. You'll be able to explore most inference parameters or use the Notebook as-is to store the output and video audio in your Google Drive.

## **How to use**
1. Read and understand the notebook. At the very least, modify the **video selection section** to choose the video you wish to translate/transcribe.
2. Click Runtime -> Run all and wait for the notebook to finish. Alternatively, run the cells one by one and skip the Google Drive portion if you do not intend to use it.
3. A download prompt should appear once the subtitles are ready; otherwise, check the 'Files' tab on the left for the output.


In [1]:
#@markdown # **[Optional]** Access data in Google Drive 💾
#@markdown Enter a Google Drive path and run this cell to store the results inside Google Drive.

# Mount Google Drive so results can be saved there, which in my experience is faster than downloading them directly from Colab.
from google.colab import drive
from pathlib import Path

drive_mount_path = Path("/") / "content" / "drive"
drive.mount(str(drive_mount_path))
drive_mount_path /= "My Drive"
#@markdown ---
drive_path = "Colab Notebooks/Faster Whisper" #@param {type:"string"}
#@markdown ---
#@markdown **Run this cell again if you change your Google Drive path.**

drive_whisper_path = drive_mount_path / Path(drive_path.lstrip("/"))
drive_whisper_path.mkdir(parents=True, exist_ok=True)

Mounted at /content/drive
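
The `lstrip("/")` in the cell above matters because `pathlib` discards everything to the left of an absolute path when joining. A minimal sketch of that behavior follows; it uses `PurePosixPath` so it runs without a mounted Drive, and the helper name `resolve_drive_path` is hypothetical, not part of the notebook.

```python
from pathlib import PurePosixPath

DRIVE_ROOT = PurePosixPath("/content/drive/My Drive")

def resolve_drive_path(user_path: str) -> PurePosixPath:
    """Join a user-supplied path onto the Drive root (hypothetical helper)."""
    # Strip leading slashes so the path is treated as relative to the Drive root.
    return DRIVE_ROOT / PurePosixPath(user_path.lstrip("/"))

# Without stripping, an absolute right-hand side replaces the left-hand side entirely.
naive = DRIVE_ROOT / PurePosixPath("/Colab Notebooks/Faster Whisper")
fixed = resolve_drive_path("/Colab Notebooks/Faster Whisper")

print(naive)  # /Colab Notebooks/Faster Whisper
print(fixed)  # /content/drive/My Drive/Colab Notebooks/Faster Whisper
```

So a path typed with or without a leading slash ends up in the same folder under the Drive root.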


In [2]:
#@markdown # **Check GPU type** 🕵️

#@markdown The type of GPU you get assigned in your Colab session determines the speed at which the video will be transcribed.
#@markdown The higher the number of floating point operations per second (FLOPS), the faster the transcription.
#@markdown But even the least powerful GPU available in Colab is able to run any Whisper model.
#@markdown Make sure you've selected `GPU` as hardware accelerator for the Notebook (Runtime &rarr; Change runtime type &rarr; Hardware accelerator).

#@markdown |  GPU   |  GPU RAM   | FP32 teraFLOPS |     Availability   |
#@markdown |:------:|:----------:|:--------------:|:------------------:|
#@markdown |  T4    |    16 GB   |       8.1      |         Free       |
#@markdown | P100   |    16 GB   |      10.6      |      Colab Pro     |
#@markdown | V100   |    16 GB   |      15.7      |  Colab Pro (Rare)  |

#@markdown ---
#@markdown **Factory reset your Notebook's runtime if you want to get assigned a new GPU.**

!nvidia-smi -L

!nvidia-smi

GPU 0: Tesla T4 (UUID: GPU-9d63f2fc-98a7-61b3-b528-b04170b4853d)
Mon Feb  3 16:36:18 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   46C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+-------
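
To relate the `nvidia-smi -L` line above to the GPU table, the name can be pulled out with a regular expression and looked up. This is only a sketch: the helper names are hypothetical, the teraFLOPS figures are copied from the table in this cell, and GPUs outside that table simply return `None`.

```python
import re

# FP32 teraFLOPS figures copied from the table in the cell above.
FP32_TFLOPS = {"T4": 8.1, "P100": 10.6, "V100": 15.7}

def parse_gpu_name(line: str) -> str:
    """Extract the GPU name from one line of `nvidia-smi -L` output."""
    match = re.match(r"GPU \d+: (.+?) \(UUID:", line)
    if match is None:
        raise ValueError(f"Unrecognized nvidia-smi line: {line!r}")
    return match.group(1)

def lookup_tflops(gpu_name: str):
    """Return the table's FP32 teraFLOPS for a GPU name, or None if it is not listed."""
    for model, tflops in FP32_TFLOPS.items():
        # Match on word prefixes so suffixed names such as "V100-SXM2-16GB" still resolve.
        if any(token.startswith(model) for token in gpu_name.split()):
            return tflops
    return None

line = "GPU 0: Tesla T4 (UUID: GPU-9d63f2fc-98a7-61b3-b528-b04170b4853d)"
name = parse_gpu_name(line)
print(name, lookup_tflops(name))  # Tesla T4 8.1
```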

In [3]:
#@markdown # **Install libraries** 🏗️
#@markdown This cell will take a little while to download several libraries, including Faster Whisper.

#@markdown ---

! pip install faster-whisper
! pip install yt-dlp

import os
import sys
import warnings
from faster_whisper import WhisperModel
import yt_dlp
import subprocess
import torch
import shutil
import numpy as np
from IPython.display import display, Markdown, YouTubeVideo
import requests
from urllib.parse import urlsplit
from google.colab import files
from pathlib import Path

device = torch.device('cuda:0')
print('Using device:', device, file=sys.stderr)

!sudo apt-get update
!sudo apt install -y nvidia-cuda-toolkit

Collecting faster-whisper
  Downloading faster_whisper-1.1.1-py3-none-any.whl.metadata (16 kB)
Collecting ctranslate2<5,>=4.0 (from faster-whisper)
  Downloading ctranslate2-4.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting onnxruntime<2,>=1.14 (from faster-whisper)
  Downloading onnxruntime-1.20.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting av>=11 (from faster-whisper)
  Downloading av-14.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.7 kB)
Collecting coloredlogs (from onnxruntime<2,>=1.14->faster-whisper)
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB)
Collecting humanfriendly>=9.1 (from coloredlogs->onnxruntime<2,>=1.14->faster-whisper)
  Downloading humanfriendly-10.0-py2.py3-none-any.whl.metadata (9.2 kB)
Downloading faster_whisper-1.1.1-py3-none-any.whl (1.1 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.1/1.1 MB[0m [31m

Using device: cuda:0


Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [1,306 kB]
Get:7 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Get:9 https://r2u.stat.illinois.edu/ubuntu jammy/main all Packages [8,643 kB]
Get:10 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease [24.3 kB]
Get:11 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:12 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [1,229 kB]
Hit:13 https://ppa.launchp

In [4]:
#@markdown # **Model selection** 🧠

#@markdown There are several models to choose from, trading accuracy against speed. large-v2 is recommended for most cases; the form below defaults to `large-v3`, which is not listed in the table:

#@markdown |  Size  | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
#@markdown |:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
#@markdown |  tiny  |    39 M    |     `tiny.en`      |       `tiny`       |     ~0.8 GB     |      ~32x      |
#@markdown |  base  |    74 M    |     `base.en`      |       `base`       |     ~1.0 GB     |      ~16x      |
#@markdown | small  |   244 M    |     `small.en`     |      `small`       |     ~1.4 GB     |      ~6x       |
#@markdown | medium |   769 M    |    `medium.en`     |      `medium`      |     ~2.7 GB     |      ~2x       |
#@markdown | large-v1  |   1550 M   |        N/A         |      `large-v1`       |    ~4.3 GB     |       1x       |
#@markdown | large-v2  |   1550 M   |        N/A         |      `large-v2`       |    ~4.3 GB     |       1x       |

#@markdown ---
model_size = "large-v3" # @param ["tiny","tiny.en","base","base.en","small","small.en","medium","medium.en","large-v1","large-v2","large-v3"]
device_type = "cuda" #@param ["cuda", "cpu"]
compute_type = "float16" #@param ["float16", "int8_float16", "int8"]
#@markdown ---
#@markdown **Run this cell again if you change the model.**

model = WhisperModel(model_size, device=device_type, compute_type=compute_type)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/2.39k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.48M [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

vocabulary.json:   0%|          | 0.00/1.07M [00:00<?, ?B/s]

model.bin:   0%|          | 0.00/3.09G [00:00<?, ?B/s]
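
The 3.09 GB `model.bin` download above is consistent with roughly 1550 M parameters stored as 16-bit floats. The back-of-the-envelope sketch below shows that arithmetic and what 8-bit weights would save. It counts weights only, under the assumption that every parameter uses the same data type; activations, runtime buffers, and any layers kept at higher precision are ignored, so real VRAM use will differ.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_size_gb(num_params: float, dtype: str) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

large_params = 1550e6  # "1550 M" from the model table above

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_size_gb(large_params, dtype):.2f} GB")
# float32: 6.20 GB
# float16: 3.10 GB
# int8: 1.55 GB
```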

In [6]:
#@markdown # **Video selection** 📺

#@markdown Enter the URL of the YouTube video **OR** the Google Drive path of the video you want to translate/transcribe, and run the cell. Make sure the correct Type is chosen! This may take a while depending on the video file size.

Type = "Google Drive" #@param ['Youtube video or playlist', 'Google Drive', 'Direct download']
#@markdown ---
#@markdown #### **Youtube video or playlist**
URL = "https://www.youtube.com/watch?v=9ez8lm9I26Y" #@param {type:"string"}
# store_audio = True #@param {type:"boolean"}
#@markdown ---
#@markdown #### **Google Drive video, audio (mp4, wav), or folder containing video and/or audio files**
video_path = "1/audio.aac" #@param {type:"string"}
#@markdown ---
#@markdown #### **Direct Download**
ddl_url = "https://www.example.com/video.mkv" #@param {type:"string"}
#@markdown ---
#@markdown **Run this cell again if you change the video.**

video_path_local_list = []

if Type == "Youtube video or playlist":

    ydl_opts = {
        'format': 'm4a/bestaudio/best',
        'outtmpl': '%(id)s.%(ext)s',
        # ℹ️ See help(yt_dlp.postprocessor) for a list of available Postprocessors and their arguments
        'postprocessors': [{  # Extract audio using ffmpeg
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'wav',
        }]
    }

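    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        # Download and collect metadata in a single call
        info = ydl.extract_info(URL, download=True)

    # A playlist lists its videos under 'entries'; a single video is its own entry
    list_video_info = info.get('entries') or [info]

    for video_info in list_video_info:
        if video_info:  # playlist items that could not be fetched may be None
            video_path_local_list.append(Path(f"{video_info['id']}.wav"))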

elif Type == "Google Drive":
    # video_path_drive = drive_mount_path / Path(video_path.lstrip("/"))
    video_path = drive_mount_path / Path(video_path.lstrip("/"))
    if video_path.is_dir():
        for video_path_drive in video_path.glob("*"):
            if video_path_drive.is_dir():
                display(Markdown(f"**{str(video_path_drive)} is a subfolder; subfolders are not supported, skipping.**"))
                continue
            if not video_path_drive.is_file():
                display(Markdown(f"**{str(video_path_drive)} does not exist, skipping.**"))
                continue
            display(Markdown(f"**{str(video_path_drive)} selected for processing.**"))
            # Copy only regular files into the working directory
            video_path_local = Path(".").resolve() / (video_path_drive.name)
            shutil.copy(video_path_drive, video_path_local)
            video_path_local_list.append(video_path_local)
    elif video_path.is_file():
        video_path_local = Path(".").resolve() / (video_path.name)
        shutil.copy(video_path, video_path_local)
        video_path_local_list.append(video_path_local)
        print(f"{video_path_local} appended to list for processing")
        display(Markdown(f"**{str(video_path)} selected for processing.**"))
    else:
        display(Markdown(f"**{str(video_path)} does not exist.**"))

elif Type == "Direct download":
    print(f"⚠️ Please ensure this is a direct download link and is of a valid format")
    print(f"Attempting to download: {ddl_url}\n")
    # !wget {ddl_url} -O ddl_video.mp4
    # video_path_local_list.append("/content/ddl_video.mp4")

    response = requests.get(ddl_url)

    if response.status_code == 200:
        # Extract the filename from the URL
        filename = urlsplit(ddl_url).path.split("/")[-1]

        # Create the full path for the destination file in the current working directory
        destination_path = os.path.join(os.getcwd(), filename)

        # Save the file
        with open(destination_path, 'wb') as file:
            file.write(response.content)

        print(f"File downloaded successfully: {destination_path}")

        video_path_local = Path(".").resolve() / (filename)

        # print(f"Path local: {video_path_local}") # /content/video.mkv

        video_path_local_list.append(video_path_local)
    else:
        print(f"Failed to download file. Status code: {response.status_code}")

else:
    raise ValueError("Please select a supported input type.")

valid_suffixes = [".mp4", ".mkv", ".mov", ".avi", ".wmv", ".flv", ".webm", ".3gp", ".mpeg"]

for i, video_path_local in enumerate(video_path_local_list):
    print(f"Processing video file {video_path_local} with ffmpeg..")

    # Only the video containers listed above are converted; other files are left untouched
    if video_path_local.suffix.lower() in valid_suffixes:
        source_path = video_path_local
        video_path_local = source_path.with_suffix(".wav")
        # -y overwrites an existing .wav instead of waiting on a prompt
        result = subprocess.run(["ffmpeg", "-y", "-i", str(source_path), "-vn", "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", str(video_path_local)])
        # Keep the list in sync with the converted file
        video_path_local_list[i] = video_path_local


/content/audio.aac appended to list for processing


**/content/drive/My Drive/1/audio.aac selected for processing.**

Processing video file /content/audio.aac with ffmpeg..
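
Note that `.aac` is not in `valid_suffixes`, so although the message above is printed, no ffmpeg conversion takes place for this file. The sketch below isolates that decision and the command that would be built for a video container. The helper name `build_ffmpeg_command` is hypothetical, and `PurePosixPath` is used only so the example runs without the files being present.

```python
from pathlib import PurePosixPath

# Same container list as the conversion loop above.
VALID_SUFFIXES = [".mp4", ".mkv", ".mov", ".avi", ".wmv", ".flv", ".webm", ".3gp", ".mpeg"]

def build_ffmpeg_command(source: PurePosixPath):
    """Return the ffmpeg argument list for a video file, or None if no conversion applies."""
    if source.suffix not in VALID_SUFFIXES:
        return None
    target = source.with_suffix(".wav")
    # -vn drops the video stream; pcm_s16le at 16 kHz mono is the WAV layout the loop requests.
    return ["ffmpeg", "-i", str(source), "-vn", "-acodec", "pcm_s16le",
            "-ar", "16000", "-ac", "1", str(target)]

print(build_ffmpeg_command(PurePosixPath("/content/audio.aac")))      # None
print(build_ffmpeg_command(PurePosixPath("/content/video.mkv"))[-1])  # /content/video.wav
```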


In [7]:
def seconds_to_time_format(s):
    # Round to whole milliseconds first so the carry propagates
    # (e.g. 1.9996 s becomes 00:00:02,000 rather than 00:00:01,1000)
    total_ms = int(round(s * 1000))

    # Split into hours, minutes, seconds, and milliseconds
    hours, total_ms = divmod(total_ms, 3_600_000)
    minutes, total_ms = divmod(total_ms, 60_000)
    seconds, milliseconds = divmod(total_ms, 1000)

    # Return the formatted string
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}"

#@markdown # **Run the model** 🚀

#@markdown Run this cell to execute the transcription/translation of the video. This can take a while and varies with the length of the video and the number of parameters of the model selected above.

#@markdown ## **Parameters** ⚙️

#@markdown ### **Behavior control**
#@markdown #### Language
language = "ja" #@param ["auto", "en", "zh", "ja", "fr", "de"] {allow-input: true}
#@markdown #### Initial prompt (optional text that primes the decoder with context or vocabulary; it does not switch the model between transcribing and translating)
initial_prompt = "Please transcribe this from Japanese." #@param {type:"string"}
#@markdown ---
#@markdown #### Word-level timestamps
word_level_timestamps = False #@param {type:"boolean"}
#@markdown ---
#@markdown #### VAD filter
vad_filter = True #@param {type:"boolean"}
vad_filter_min_silence_duration_ms = 50 #@param {type:"integer"}
#@markdown ---
#@markdown #### Output (default is srt; txt if `text_only` is checked)
text_only = False #@param {type:"boolean"}


segments, info = model.transcribe(str(video_path_local), beam_size=5,
                                  language=None if language == "auto" else language,
                                  initial_prompt=initial_prompt,
                                  word_timestamps=word_level_timestamps,
                                  vad_filter=vad_filter,
                                  vad_parameters=dict(min_silence_duration_ms=vad_filter_min_silence_duration_ms))

display(Markdown(f"Detected language '{info.language}' with probability {info.language_probability}"))

ext_name = '.txt' if text_only else ".srt"
output_file_name = video_path_local.stem + ext_name
sentence_idx = 1
with open(output_file_name, 'w', encoding='utf-8') as f:
  for segment in segments:
    if word_level_timestamps:
      for word in segment.words:
        ts_start = seconds_to_time_format(word.start)
        ts_end = seconds_to_time_format(word.end)
        print(f"[{ts_start} --> {ts_end}] {word.word}")
        if not text_only:
          f.write(f"{sentence_idx}\n")
          f.write(f"{ts_start} --> {ts_end}\n")
          f.write(f"{word.word}\n\n")
        else:
          f.write(f"{word.word}")
        f.write("\n")
        sentence_idx = sentence_idx + 1
    else:
      ts_start = seconds_to_time_format(segment.start)
      ts_end = seconds_to_time_format(segment.end)
      print(f"[{ts_start} --> {ts_end}] {segment.text}")
      if not text_only:
        f.write(f"{sentence_idx}\n")
        f.write(f"{ts_start} --> {ts_end}\n")
        f.write(f"{segment.text.strip()}\n\n")
      else:
        f.write(f"{segment.text.strip()}\n")
      sentence_idx = sentence_idx + 1

try:
  files.download(output_file_name)
  shutil.copy(video_path_local.parent / output_file_name,
            drive_whisper_path / output_file_name
  )
  display(Markdown(f"**Output file created: {drive_whisper_path / output_file_name}**"))
except Exception:
  # Drive was not mounted, or the download/copy failed: report the local copy instead
  display(Markdown(f"**Output file created: {video_path_local.parent / output_file_name}**"))


Detected language 'ja' with probability 1

[00:00:42,000 --> 00:00:45,120] いつもより良い味してるけど。
[00:00:45,120 --> 00:00:48,370] もう全然混ざってないんだけど。
[00:00:48,370 --> 00:00:50,650] あ、そう?
[00:00:50,650 --> 00:00:52,650] 温度かな?入れ方?
[00:00:52,650 --> 00:00:54,650] 全然ね、変わんない変わんない。
[00:00:54,650 --> 00:00:56,870] いつもと同じ。
[00:00:56,870 --> 00:00:58,870] お茶ってやってなかった?
[00:00:58,870 --> 00:01:04,320] 私飛行だから茶道の授業があったけどね。
[00:01:04,320 --> 00:01:06,320] あ、そうか。
[00:01:06,320 --> 00:01:09,880] そう。
[00:01:09,880 --> 00:01:10,880] こう回すんでしょ?
[00:01:10,880 --> 00:01:12,880] こう回して。
[00:01:12,880 --> 00:01:13,880] まあね。
[00:01:13,880 --> 00:01:15,880] 尚げた飲むんじゃないけど。
[00:01:15,880 --> 00:01:16,880] あ、そう?回し飲み?
[00:01:16,880 --> 00:01:17,880] まさかまさか。
[00:01:17,880 --> 00:01:18,880] 知らんの?
[00:01:18,880 --> 00:01:19,880] こう拭いて。
[00:01:19,880 --> 00:01:21,880] あ、一個一個出すんだ。
[00:01:21,880 --> 00:01:23,880] そう。一人でしょ。
[00:01:23,880 --> 00:01:24,880] あ、そうだ。
[00:01:24,880 --> 00:01:26,710] うん。
[00:01:26,710 --> 00:01:28,710] あ、でもお試し美味しかったよ。
[00:01:28,710 --> 00:01:31,740] 

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

**Output file created: /content/drive/My Drive/Colab Notebooks/Faster Whisper/audio.srt**
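
For reference, each subtitle written to the `.srt` file is a numbered cue: an index line, a `start --> end` line with comma-separated milliseconds, the text, and a blank line. The sketch below builds one cue from the first segment printed above. The function names are hypothetical and it is a standalone illustration of the format, not the notebook's own writer.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    # Work in whole milliseconds so rounding carries into the seconds field.
    total_ms = int(round(seconds * 1000))
    hours, total_ms = divmod(total_ms, 3_600_000)
    minutes, total_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(total_ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def srt_entry(index: int, start: float, end: float, text: str) -> str:
    """Build one SRT cue: index line, time range line, text, then a blank line."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n\n"

print(srt_entry(1, 42.0, 45.12, " いつもより良い味してるけど。"), end="")
# 1
# 00:00:42,000 --> 00:00:45,120
# いつもより良い味してるけど。
```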