<a href="https://colab.research.google.com/github/bemxio/colab-notebooks/blob/main/WhisperDemo/WhisperDemo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Whisper (YT-DLP variant)

Whisper is a general-purpose speech recognition model made by OpenAI. It is trained on a large dataset of diverse audio, and it is a multitask model that can perform multilingual speech recognition, speech translation, and language identification.

This notebook is a Google Colab demo of it, made for fun. This variant fetches its input with yt-dlp instead of taking an uploaded file.

#### Install required dependencies

In [None]:
!pip install openai-whisper yt-dlp

#### Set parameters for Whisper and download the audio

In [None]:
import pathlib

from moviepy.editor import ipython_display

# constants set by the user in the notebook
SOURCE_URL = "" # @param {type: "string"}

MODEL = "large" # @param ["tiny.en", "tiny", "base.en", "base", "small.en", "small", "medium.en", "medium", "large-v1", "large-v2", "large"]
LANGUAGE = "English" # @param ["Afrikaans", "Albanian", "Amharic", "Arabic", "Armenian", "Assamese", "Azerbaijani", "Bashkir", "Basque", "Belarusian", "Bengali", "Bosnian", "Breton", "Bulgarian", "Burmese", "Castilian", "Catalan", "Chinese", "Croatian", "Czech", "Danish", "Dutch", "English", "Estonian", "Faroese", "Finnish", "Flemish", "French", "Galician", "Georgian", "German", "Greek", "Gujarati", "Haitian", "Haitian Creole", "Hausa", "Hawaiian", "Hebrew", "Hindi", "Hungarian", "Icelandic", "Indonesian", "Italian", "Japanese", "Javanese", "Kannada", "Kazakh", "Khmer", "Korean", "Lao", "Latin", "Latvian", "Letzeburgesch", "Lingala", "Lithuanian", "Luxembourgish", "Macedonian", "Malagasy", "Malay", "Malayalam", "Maltese", "Maori", "Marathi", "Moldavian", "Moldovan", "Mongolian", "Myanmar", "Nepali", "Norwegian", "Nynorsk", "Occitan", "Panjabi", "Pashto", "Persian", "Polish", "Portuguese", "Punjabi", "Pushto", "Romanian", "Russian", "Sanskrit", "Serbian", "Shona", "Sindhi", "Sinhala", "Sinhalese", "Slovak", "Slovenian", "Somali", "Spanish", "Sundanese", "Swahili", "Swedish", "Tagalog", "Tajik", "Tamil", "Tatar", "Telugu", "Thai", "Tibetan", "Turkish", "Turkmen", "Ukrainian", "Urdu", "Uzbek", "Valencian", "Vietnamese", "Welsh", "Yiddish", "Yoruba"]
TASK = "transcribe" # @param ["transcribe", "translate"]

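# download the video and record its file name in filename.txt
# (--no-simulate is needed because print options skip the download by default)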
!python3 -m yt_dlp --no-simulate --print-to-file "%(id)s.%(ext)s" filename.txt "{SOURCE_URL}" --output "%(id)s.%(ext)s"

# get the path of the video
with open("filename.txt", "r", encoding="utf-8") as file:
    path = pathlib.Path(file.read().strip())

# delete the filename file
!rm filename.txt

# show a preview of the audio
ipython_display(str(path), filetype="audio", maxduration=300)

#### Process the audio file with Whisper
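
The cell below decodes with a single temperature of 0.0 and beam search. Whisper's `transcribe` also accepts a tuple of temperatures, which it tries in order when a segment fails its quality thresholds. The following is a minimal sketch of building such a tuple; `temperature_schedule` is a hypothetical helper introduced here for illustration, not part of Whisper or of this notebook, and the 0.2 step is an assumption modelled on the library's default fallback sequence.

```python
from typing import Tuple


def temperature_schedule(
    start: float = 0.0, step: float = 0.2, stop: float = 1.0
) -> Tuple[float, ...]:
    """Build an increasing tuple of temperatures from start up to stop (inclusive)."""
    if step <= 0:
        raise ValueError("step must be positive")

    temperatures = []
    index = 0

    while True:
        # round to avoid floating-point noise such as 0.6000000000000001
        value = round(start + index * step, 6)

        if value > stop + 1e-9:
            break

        temperatures.append(value)
        index += 1

    return tuple(temperatures)
```

Passing the resulting tuple as `temperature=` in place of `TEMPERATURE` would enable the fallback. Beam search only applies at temperature 0, so `BEAM_SIZE` would be ignored for the later attempts.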

In [None]:
import torch

from whisper.transcribe import transcribe
from whisper.utils import get_writer
from whisper import load_model

# other constants; edit them here in the code if you really want to
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

TEMPERATURE = 0.0
BEAM_SIZE = 5

OUTPUT_FORMAT = "all"

# load the model for audio processing and the writer for various formats
model = load_model(MODEL, device=DEVICE, download_root=None)
writer = get_writer(OUTPUT_FORMAT, output_dir=".")

# get the transcription
result = transcribe(
    model=model, 
    audio=str(path), 
    
    verbose=True,

    task=TASK,
    language=LANGUAGE,

    temperature=TEMPERATURE,
    beam_size=BEAM_SIZE
)

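# write the result in the defined output format
# (the subtitle options are spelled out because some Whisper releases look these keys up directly)
writer(result, str(path), options={"max_line_width": None, "max_line_count": None, "highlight_words": False})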

You can now access all of the files Whisper generated in the Files tab (the folder icon on the left sidebar).
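
To know which files to look for, the sketch below lists the names the `"all"` output format is expected to produce for a given input. `expected_outputs` is a hypothetical helper, and the extension list is an assumption based on Whisper's built-in writers (plain text, WebVTT, SRT, TSV, and JSON); check the Files tab if your Whisper version differs.

```python
from pathlib import Path
from typing import List

# extensions assumed to be written by the "all" output format
ALL_EXTENSIONS = ("txt", "vtt", "srt", "tsv", "json")


def expected_outputs(audio_path: str, output_dir: str = ".") -> List[Path]:
    """Return the output paths expected for an input file, one per extension."""
    # the writers name each file after the input, minus its last extension
    stem = Path(audio_path).stem

    return [Path(output_dir) / f"{stem}.{extension}" for extension in ALL_EXTENSIONS]
```

In this notebook it could be called as `expected_outputs(str(path))` after the processing cell has finished.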

Congratulations! Download whatever you need, or change the parameters and run the cells again to generate more.
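
If you would rather inspect the transcription in code than download it, the JSON file is the easiest one to work with. The sketch below assumes it holds the same dictionary that `transcribe` returns, with a `segments` list whose entries carry `start`, `end`, and `text`. `load_result` and `format_segments` are hypothetical helpers, not part of Whisper.

```python
import json
from typing import List


def load_result(json_path: str) -> dict:
    """Read a Whisper JSON output file back into a dictionary."""
    with open(json_path, "r", encoding="utf-8") as file:
        return json.load(file)


def format_segments(result: dict) -> List[str]:
    """Turn each segment into a '[start -> end] text' line, with times in seconds."""
    lines = []

    for segment in result.get("segments", []):
        start = float(segment["start"])
        end = float(segment["end"])
        text = segment["text"].strip()

        lines.append(f"[{start:.2f} -> {end:.2f}] {text}")

    return lines
```

For example, `print("\n".join(format_segments(load_result(str(path.with_suffix(".json"))))))` would print the timed segments for the file processed above.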