Use a CUDA-enabled build of whisper.cpp; setup is outlined at https://github.com/ggerganov/whisper.cpp. The YouTube extraction code comes from the OpenVINO example notebook 227-whisper-nncf-quantize.ipynb.

CAUTION: BACK UP ANY AUDIO FILES THAT HAVE SPACES IN THEIR NAMES, AS THEY WILL BE RENAMED (AND THEIR METADATA ALTERED). Files whose names contain no spaces are not renamed; instead, a copy is made in a compatible audio format.
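
One way to make that backup first is sketched below. It copies every file with a space in its name into a subfolder before anything else touches it. The function name and the `backup_originals` folder name are assumptions made for this sketch, not part of the notebook.

```python
import shutil
from pathlib import Path

def backup_files_with_spaces(folder, backup_dirname="backup_originals"):
    """Copy every file whose name contains a space into a backup subfolder.

    `backup_dirname` is a hypothetical name chosen for this sketch.
    Returns the list of backup paths that were written.
    """
    folder = Path(folder)
    backup_dir = folder / backup_dirname
    copied = []
    # sorted() materialises the listing before the backup folder is created
    for path in sorted(folder.iterdir()):
        if path.is_file() and " " in path.name:
            backup_dir.mkdir(exist_ok=True)
            # copy2 preserves timestamps and other file metadata where possible
            copied.append(Path(shutil.copy2(path, backup_dir / path.name)))
    return copied
```

Calling `backup_files_with_spaces(directory)` before the conversion cells would leave untouched originals in the backup subfolder.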

In [15]:
import numpy as np
import subprocess
import os
import glob

import ipywidgets as widgets
from pytube import YouTube
from utils import get_audio
from utils import prepare_srt
from pathlib import Path

np.set_printoptions(linewidth=50)

# change to reflect your local directory and file name
home_directory = os.path.expanduser("~")
directory = home_directory + '/machine_learning/whisper.cpp/samples/'

In [16]:
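# copy and paste the YouTube share link below:
VIDEO_LINK = "https://youtu.be/eUuGdh3nBGo?feature=shared"
# the video ID is the last path segment, before any "?..." query string
name = VIDEO_LINK.split("/")[-1].split("?")[0]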
print(name)

link = widgets.Text(
    value=VIDEO_LINK,
    placeholder="Type link for video",
    description="Video:",
    disabled=False
)
link

print(f"Downloading video {link.value} started")
output_file = Path(directory + name + ".mp4")
yt = YouTube(link.value)
yt.streams.get_lowest_resolution().download(filename=output_file)
print(f"Video saved to {output_file}")

eUuGdh3nBGo
Downloading video https://youtu.be/eUuGdh3nBGo?feature=shared started
Video saved to /var/home/fraser/machine_learning/whisper.cpp/samples/eUuGdh3nBGo.mp4
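
The video ID above is taken from a `youtu.be` share link. If links may also arrive in the longer `youtube.com/watch?v=...` form, a parser along the following lines could be used instead. This is a sketch using only the standard library; the function name `youtube_video_id` and the list of accepted hosts are assumptions for illustration.

```python
from urllib.parse import urlparse, parse_qs

def youtube_video_id(url):
    """Return the video ID from a youtu.be or youtube.com/watch link."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short share links carry the ID as the first path segment
        video_id = parsed.path.lstrip("/").split("/")[0]
        if video_id:
            return video_id
    elif host in ("youtube.com", "www.youtube.com", "m.youtube.com"):
        # Standard watch links carry the ID in the "v" query parameter
        ids = parse_qs(parsed.query).get("v")
        if ids:
            return ids[0]
    raise ValueError(f"Could not find a video ID in {url!r}")

print(youtube_video_id("https://youtu.be/eUuGdh3nBGo?feature=shared"))
```

Raising `ValueError` on an unrecognised link avoids silently saving files under a wrong or empty name.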


In [17]:
import subprocess

extracted_audio_file = name + '.wav'

def extract_audio(video_path, audio_path):
    # -y overwrites an existing output file without prompting;
    # the remaining flags produce 16 kHz, mono, 16-bit PCM WAV audio
    command = ["ffmpeg", "-y", "-i", str(video_path), "-vn",
               "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", str(audio_path)]
    # passing a list (no shell) keeps paths with spaces intact;
    # check=True raises CalledProcessError on a non-zero exit code
    subprocess.run(command, check=True)

# Usage
try:
    extract_audio(output_file, directory + extracted_audio_file)
    print("Audio converted successfully.")
except subprocess.CalledProcessError as e:
    print(f"Audio conversion failed with error {e.returncode}.")

ffmpeg version 4.4 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 9.3.0 (crosstool-NG 1.24.0.133_b0863d8_dirty)
  configuration: --prefix=/root/miniconda3/envs/conda_bld/conda-bld/ffmpeg_1635335682798/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_place --cc=/root/miniconda3/envs/conda_bld/conda-bld/ffmpeg_1635335682798/_build_env/bin/x86_64-conda-linux-gnu-cc --disable-doc --disable-openssl --enable-avresample --enable-hardcoded-tables --enable-libfreetype --enable-libopenh264 --enable-pic --enable-pthreads --enable-shared --disable-static --enable-version3 --enable-zlib --enable-libmp3lame
  libavutil      56. 70.100 / 56. 70.100
  libavcodec     58.134.100 / 58.134.100
  libavformat    58. 76.100 / 58. 76.100
  libavdevice    58. 13.100 / 58. 13.100
  libavfilter     7.110.100 /  7.110.100
  libavresample   4.  0.  0 /  4.  0.  0

Audio converted successfully.


size=  145435kB time=01:17:33.90 bitrate= 256.0kbits/s speed=1.33e+03x    
video:0kB audio:145435kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.000052%
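
The 256.0kbits/s bitrate in the log is what the ffmpeg flags imply: 16000 samples/s × 16 bits × 1 channel = 256000 bits/s. A quick check that a WAV file really has this layout can be sketched with the standard-library `wave` module. The function name `is_whisper_ready` is an assumption made for this sketch.

```python
import wave

def is_whisper_ready(wav_path):
    """Return True if the WAV file is 16 kHz, mono, 16-bit PCM."""
    with wave.open(str(wav_path), "rb") as wav:
        return (
            wav.getframerate() == 16000   # matches -ar 16000
            and wav.getnchannels() == 1   # matches -ac 1
            and wav.getsampwidth() == 2   # pcm_s16le is 2 bytes per sample
        )
```

In the notebook this could be called as `is_whisper_ready(directory + extracted_audio_file)` before starting a long transcription.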


In [18]:
# transcribe using the base model; writes .txt, .vtt, .srt and .lrc files (great with CUDA enabled whisper.cpp)
model = home_directory + '/machine_learning/whisper.cpp/models/ggml-base.en.bin'
try:
    # pass the arguments as a list (no shell) so paths are not split on spaces
    subprocess.run(['transcribe', '-t', '12', '-m', model,
                    '-f', directory + extracted_audio_file,
                    '-otxt', '-ovtt', '-osrt', '-olrc'], check=True)
    print("Transcription executed successfully and saved in " + directory)
except subprocess.CalledProcessError as e:
    print(f"Transcription failed with error {e.returncode}.")

whisper_init_from_file_with_params_no_state: loading model from '/var/home/fraser/machine_learning/whisper.cpp/models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA RTX A1000 Laptop GPU, compute capability 8.6, VMM: yes
whisper_backend_init: using 


[00:00:00.000 --> 00:00:13.000]   So welcome, this is going further with CUDA for Python programmers, as the name suggests this won't make too much sense unless you've got started with CUDA for Python programmers.
[00:00:13.000 --> 00:00:19.000]   The good news is that I have a video called getting started with CUDA for Python programmers.
[00:00:19.000 --> 00:00:28.000]   So start there, it's only a bit over an hour long, you might be surprised at how quick and easy it is to get started if you haven't.
[00:00:28.000 --> 00:00:40.000]   So assuming that you have got started, today we're going to be looking at the most important next step of taking advantage of CUDA,
[00:00:40.000 --> 00:00:49.000]   which is we've already learnt to take advantage of the thousands of threads that you can run simultaneously on a GPU.
[00:00:49.000 --> 00:00:54.000]   Today we're going to learn how to take advantage of the incredibly fast memory.
[00:00:54.000 --> 00:01:03.000]   So in the up to now, alt

output_txt: saving output to '/var/home/fraser/machine_learning/whisper.cpp/samples/eUuGdh3nBGo.wav.txt'
output_vtt: saving output to '/var/home/fraser/machine_learning/whisper.cpp/samples/eUuGdh3nBGo.wav.vtt'
output_srt: saving output to '/var/home/fraser/machine_learning/whisper.cpp/samples/eUuGdh3nBGo.wav.srt'
output_lrc: saving output to '/var/home/fraser/machine_learning/whisper.cpp/samples/eUuGdh3nBGo.wav.lrc'

whisper_print_timings:     load time =  1260.26 ms
whisper_print_timings:     fallbacks =   3 p /   4 h
whisper_print_timings:      mel time =  2623.71 ms
whisper_print_timings:   sample time = 33439.42 ms / 82363 runs (    0.41 ms per run)
whisper_print_timings:   encode time =   357.77 ms /   195 runs (    1.83 ms per run)
whisper_print_timings:   decode time =  1400.00 ms /   798 runs (    1.75 ms per run)
whisper_print_timings:   batchd time = 42444.65 ms / 80578 runs (    0.53 ms per run)
whisper_print_timings:   prompt time =  9885.02 ms / 44003 runs (    0.22 ms per
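
The timestamped lines printed above follow the pattern `[hh:mm:ss.mmm --> hh:mm:ss.mmm]   text`. If that output is captured as text, it could be turned into (start, end, text) tuples roughly as follows. This is a sketch that assumes the line format shown in the output above; `SEGMENT_RE` and `parse_segments` are names made up for illustration.

```python
import re

# One transcript line: "[00:00:13.000 --> 00:00:19.000]   some text"
SEGMENT_RE = re.compile(
    r"^\[(\d{2}):(\d{2}):(\d{2})\.(\d{3}) --> "
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3})\]\s*(.*)$"
)

def parse_segments(lines):
    """Yield (start_seconds, end_seconds, text) for each timestamped line."""
    for line in lines:
        match = SEGMENT_RE.match(line.strip())
        if not match:
            continue  # skip log lines such as "whisper_model_load: ..."
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups()[:8])
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        yield start, end, match.group(9)
```

Lines that do not match, such as the model-loading and timing logs, are skipped rather than treated as errors.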

In [31]:
# Send to Kindle joins the lines of a plain-text file, so wrap each line in its own HTML paragraph
import html

def convert_txt_to_html(txt_file_path, html_file_path):
    with open(txt_file_path, 'r', encoding='utf-8') as txt_file, \
         open(html_file_path, 'w', encoding='utf-8') as html_file:
        html_file.write('<!DOCTYPE html>\n<html><head><meta charset="utf-8"></head><body>\n')
        for line in txt_file:
            # escape &, < and > so transcript text cannot break the markup
            html_file.write('<p>{}</p>\n'.format(html.escape(line.strip())))
        html_file.write('</body>\n</html>')

txt_file_path = directory + name + '.wav.lrc'  # transcript to convert (here the .lrc output); replace with your own path
html_file_path = directory + name + '.html'  # replace with your HTML file path
convert_txt_to_html(txt_file_path, html_file_path)
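
Because the input here is the .lrc file, each paragraph keeps whatever bracketed tag starts the line. If plain text is preferred on the Kindle, those tags could be stripped before the line is written. The sketch below assumes each .lrc line begins with one or more `[...]` tags (for example a `[mm:ss.xx]` timestamp); check your own file first. `LRC_TAG_RE` and `strip_lrc_tags` are hypothetical names.

```python
import re

# One or more leading [...] tags, e.g. "[00:13.00]" or a metadata tag
LRC_TAG_RE = re.compile(r"^(\[[^\]]*\])+")

def strip_lrc_tags(line):
    """Remove leading LRC tags and surrounding whitespace from one line."""
    return LRC_TAG_RE.sub("", line.strip()).strip()
```

Inside `convert_txt_to_html`, calling `strip_lrc_tags(line)` in place of `line.strip()` and skipping empty results would drop both the timestamps and any tag-only lines.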