[![notebook shield](https://img.shields.io/static/v1?label=&message=Notebook&color=blue&style=for-the-badge&logo=googlecolab&link=https://colab.research.google.com/github/ArthurFDLR/whisper-youtube/blob/main/whisper_youtube.ipynb)](https://colab.research.google.com/github/Majdoddin/nlp/blob/main/Pyannote_plays_and_Whisper_rhymes_v_2_0.ipynb)
[![repository shield](https://img.shields.io/static/v1?label=&message=Repository&color=blue&style=for-the-badge&logo=github&link=https://github.com/openai/whisper)](https://github.com/majdoddin/nlp)

# Whisper's transcription plus Pyannote's Diarization

**Update** - [@johnwyles](https://github.com/johnwyles) added HTML output for audio/video files from Google Drive, along with some fixes.

Using Whisper's new word-level timestamps, each word of the transcription is highlighted as the video plays, with optional autoscroll. The layout on small displays is also improved.

Moreover, the model is now loaded only once, so the whole pipeline runs much faster. You can also hardcode your Hugging Face token.

---
Andrej Karpathy [suggested](https://twitter.com/karpathy/status/1574476200801538048?s=20&t=s5IMMXOYjBI6-91dib6w8g) training a classifier on top of OpenAI [Whisper](https://openai.com/blog/whisper/) model features to identify the speaker, so we can visualize the speaker in the transcript. But, as [pointed out](https://twitter.com/tarantulae/status/1574493613362388992?s=20&t=s5IMMXOYjBI6-91dib6w8g) by Christian Perone, features from Whisper are unlikely to work well for speaker recognition, because its main objective is basically to ignore speaker differences.

In the following, I use [**`pyannote-audio`**](https://github.com/pyannote/pyannote-audio), a speaker diarization toolkit by Hervé Bredin, to identify the speakers, and then match them with the transcriptions of Whisper, linked to the video. The input can be a YouTube video or a video/audio file (also on Google Drive). I try it on a [Customer Support Call](https://youtu.be/hpZFJctBUHQ). Check the result [**here**](https://majdoddin.github.io/dyson.html).

To make it easier to match the transcriptions to diarizations by speaker change, Sarah Kaiser [suggested](https://github.com/openai/whisper/discussions/264#discussioncomment-3825375) running pyannote.audio first and then just running Whisper on the split-by-speaker chunks.
For the sake of performance (and transcription quality?), we attach the audio segments into a single audio file with a silent spacer as a separator, and run Whisper on it. Enjoy it!

(For the sake of performance, I also tried attaching the audio segments into a single audio file with a silent (or beep) spacer as a separator and running Whisper on it; see it on [colab](https://colab.research.google.com/drive/1HuvcY4tkTHPDzcwyVH77LCh_m8tP-Qet?usp=sharing). It [works](https://majdoddin.github.io/lexicap.html) on some audio and fails on other audio (Dyson's interview). The problem is that Whisper does not reliably produce a timestamp at a spacer. See the discussions [#139](https://github.com/openai/whisper/discussions/139) and [#29](https://github.com/openai/whisper/discussions/29).)

The Markdown form used below is from [@ArthurFDLR](https://github.com/ArthurFDLR/whisper-youtube/).   

# Preparing the audio file

**Optional:** Mount Google Drive



In [1]:
from google.colab import drive
from pathlib import Path

drive_mount_path = Path("/content/drive")
drive.mount(str(drive_mount_path))
drive_mount_path /= "MyDrive"

Mounted at /content/drive


In [2]:
from pathlib import Path
#@markdown Enter the URL of the YouTube video, or the path to the video/audio file you want to transcribe, give the output path, etc. and run the cell. HTML file embeds the video for YouTube, and audio for media files.

Source = 'Youtube' #@param ['Youtube', 'File (Google Drive)']
#@markdown ---
#@markdown #### **Youtube video**
video_url = "https://youtu.be/hpZFJctBUHQ" #@param {type:"string"}
#store_audio = True #@param {type:"boolean"}
#@markdown ---
#@markdown #### **Google Drive video or audio path (mp4, wav, mp3)**
video_path = "/content/drive/MyDrive/Customer_Service.mp3" #@param {type:"string"}
#@markdown ---
output_path = "/content/transcript/" #@param {type:"string"}
output_path = str(Path(output_path))
#@markdown ---
#@markdown #### **Title for transcription of media file**
audio_title = "Sample Order Taking" #@param {type:"string"}
#@markdown ---
#@markdown #### Copy a token from your [Hugging Face tokens page](https://huggingface.co/settings/tokens) and paste it below.
access_token = "hf_" #@param {type:"string"}
#@markdown ---
#@markdown **Run this cell again if you change the video.**

In [3]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [4]:
Path(output_path).mkdir(parents=True, exist_ok=True)
%cd {output_path}
video_title = ""
video_id = ""

/content/transcript


## From YouTube

Installing [`yt-dlp`](https://github.com/yt-dlp/yt-dlp) and downloading the [video](https://youtu.be/NSp2fEQ6wyA) from YouTube.

In [5]:
if Source == "Youtube":
  !pip install -U yt-dlp

Collecting yt-dlp
  Downloading yt_dlp-2024.8.6-py3-none-any.whl.metadata (170 kB)
Collecting brotli (from yt-dlp)
  Downloading Brotli-1.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.5 kB)
Collecting mutagen (from yt-dlp)
  Downloading mutagen-1.47.0-py3-none-any.whl.metadata (1.7 kB)
Collecting pycryptodomex (from yt-dlp)
  Downloading pycryptodomex-3.20.0-cp35-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting websockets>=12.0 (from yt-dlp)
  Downloading websockets-13.0.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading yt_dlp-2024.8.6-py3-none-any.whl (3.1 MB)

Custom build of `ffmpeg` as [recommended](https://github.com/yt-dlp/yt-dlp#strongly-recommended) by `yt-dlp`.

In [6]:
if Source == "Youtube":
  !wget -O - -q  https://github.com/yt-dlp/FFmpeg-Builds/releases/download/latest/ffmpeg-master-latest-linux64-gpl.tar.xz | xz -qdc| tar -x

In [7]:
#Getting video info
if Source == "Youtube":
  from yt_dlp import YoutubeDL
  with YoutubeDL() as ydl:
    info_dict = ydl.extract_info(video_url, download=False)
    video_title = info_dict.get('title', None)
    video_id = info_dict.get('id', None)
    print(f"Title: {video_title}")  # title reported by yt-dlp (None if unavailable)


[youtube] Extracting URL: https://youtu.be/hpZFJctBUHQ
[youtube] hpZFJctBUHQ: Downloading webpage
[youtube] hpZFJctBUHQ: Downloading ios player API JSON
[youtube] hpZFJctBUHQ: Downloading web creator player API JSON
[youtube] hpZFJctBUHQ: Downloading player e38bb6de
[youtube] hpZFJctBUHQ: Downloading m3u8 information
Title: Sample Order Taking | Customer Support Philippines


Downloading the audio from YouTube.

In [8]:
if Source == "Youtube":
  !yt-dlp -xv --ffmpeg-location ffmpeg-master-latest-linux64-gpl/bin --audio-format wav  -o "{str(output_path) + '/'}input.wav" -- {video_url}

[debug] Command-line config: ['-xv', '--ffmpeg-location', 'ffmpeg-master-latest-linux64-gpl/bin', '--audio-format', 'wav', '-o', '/content/transcript/input.wav', '--', 'https://youtu.be/hpZFJctBUHQ']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2024.08.06 from yt-dlp/yt-dlp [4d9231208] (pip)
[debug] Python 3.10.12 (CPython x86_64 64bit) - Linux-6.1.85+-x86_64-with-glibc2.35 (OpenSSL 3.0.2 15 Mar 2022, glibc 2.35)
[debug] exe versions: ffmpeg N-117043-g8707c8660d-20240915 (setts), ffprobe N-117043-g8707c8660d-20240915
[debug] Optional libraries: Cryptodome-3.20.0, brotli-1.1.0, certifi-2024.08.30, mutagen-1.47.0, requests-2.32.3, secretstorage-3.3.1, sqlite3-3.37.2, urllib3-2.0.7, websockets-13.0.1
[debug] Proxy map: {'colab_language_server': '/usr/colab/bin/language_service'}
[debug] Request Handlers: urllib, requests, websockets
[debug] Loaded 1830 extractors
[youtube] Extracting URL: https://youtu.be/hpZFJct

## or from File (Google Drive)

In [9]:
if Source == 'File (Google Drive)':
    !ffmpeg -i {repr(video_path)} -vn -acodec pcm_s16le -ar 16000 -ac 1 -y input.wav
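
The `{repr(video_path)}` interpolation above leans on Python's string quoting being acceptable to the shell. A minimal sketch of an alternative, assuming you would rather pass ffmpeg an argument list through `subprocess` so that paths with spaces or quotes need no escaping. `build_ffmpeg_cmd` is a hypothetical helper name, not part of the notebook; the flags are the same ones used in the cell above.

```python
import subprocess

def build_ffmpeg_cmd(src, dst="input.wav"):
    """Return the ffmpeg argument list for a 16 kHz mono 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-i", str(src),          # input media file (video or audio)
        "-vn",                   # drop any video stream
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        "-ar", "16000",          # 16 kHz sample rate
        "-ac", "1",              # mono
        "-y",                    # overwrite the output if it exists
        str(dst),
    ]

cmd = build_ffmpeg_cmd("/content/drive/MyDrive/Customer_Service.mp3")
print(cmd)
# To actually run the conversion (requires ffmpeg on PATH):
# subprocess.run(cmd, check=True)
```

Because no shell is involved, the path is handed to ffmpeg exactly as written.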

## Prepending a spacer

`pyannote.audio` seems to miss the first 0.5 seconds of the audio, so we prepend a silent spacer.

In [10]:
!pip install pydub

Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1


In [11]:
from pydub import AudioSegment

spacermilli = 2000
spacer = AudioSegment.silent(duration=spacermilli)


audio = AudioSegment.from_wav("input.wav")

audio = spacer.append(audio, crossfade=0)

audio.export('input_prep.wav', format='wav')

<_io.BufferedRandom name='input_prep.wav'>
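
Because of the spacer, every timestamp measured on `input_prep.wav` is `spacermilli` later than the same moment in the original media, so times have to be shifted back before they are linked to the video. A minimal sketch of that mapping, assuming chunk start times in milliseconds and offsets within a chunk in seconds; `to_original_seconds` and `SPACER_MS` are hypothetical names used only for illustration.

```python
SPACER_MS = 2000  # same value as `spacermilli` above

def to_original_seconds(group_start_ms, offset_s, spacer_ms=SPACER_MS):
    """Map a time inside a chunk back to the original (un-padded) media.

    group_start_ms: where the chunk starts in input_prep.wav, in milliseconds
    offset_s:       time relative to the chunk start, in seconds
    """
    # Clamp at zero: anything that falls inside the spacer maps to the very start.
    shift_ms = max(group_start_ms - spacer_ms, 0)
    return (shift_ms + offset_s * 1000.0) / 1000.0

print(to_original_seconds(2781, 0.0))  # a chunk starting 2.781 s into the padded audio
print(to_original_seconds(1500, 0.0))  # a chunk starting inside the spacer
```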

# Pyannote's Diarization

[`pyannote.audio`](https://github.com/pyannote/pyannote-audio) is an open-source toolkit written in Python for **speaker diarization**.

Based on the [`PyTorch`](https://pytorch.org) machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines.

`pyannote.audio` also comes with pretrained [models](https://huggingface.co/models?other=pyannote-audio-model) and [pipelines](https://huggingface.co/models?other=pyannote-audio-pipeline) covering a wide range of domains for voice activity detection, speaker segmentation, overlapped speech detection, and speaker embedding, reaching state-of-the-art performance for most of them.

Installing `pyannote.audio`.

In [12]:
!pip install light-the-torch

Collecting light-the-torch
  Downloading light_the_torch-0.7.5-py3-none-any.whl.metadata (9.5 kB)
Collecting pip<23.3,>=22.3 (from light-the-torch)
  Downloading pip-23.2.1-py3-none-any.whl.metadata (4.2 kB)
Downloading light_the_torch-0.7.5-py3-none-any.whl (14 kB)
Downloading pip-23.2.1-py3-none-any.whl (2.1 MB)
Installing collected packages: pip, light-the-torch
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed light-the-torch-0.7.5 pip-23.2.1


In [13]:
!ltt install torch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1

Collecting torch==1.13.1
  Downloading https://download.pytorch.org/whl/cu117/torch-1.13.1%2Bcu117-cp310-cp310-linux_x86_64.whl (1801.8 MB)
Collecting torchvision==0.14.1
  Downloading https://download.pytorch.org/whl/cu117/torchvision-0.14.1%2Bcu117-cp310-cp310-linux_x86_64.whl (24.3 MB)
Collecting torchaudio==0.13.1
  Downloading https://download.pytorch.org/whl/cu117/torchaudio-0.13.1%2Bcu117-cp310-cp310-linux_x86_64.whl (4.2 MB)
Installing collected packages: torch, torchvision, torchaudio
  Attempting uninstall: torch
    Found existing installation: torch 2.4.0+cu121
    Uninstalling torch-2.4.0+cu121:
      Succes

In [14]:
!pip install  git+https://github.com/hmmlearn/hmmlearn.git
!pip install  git+https://github.com/pyannote/pyannote-audio.git@develop

Collecting git+https://github.com/hmmlearn/hmmlearn.git
  Cloning https://github.com/hmmlearn/hmmlearn.git to /tmp/pip-req-build-1nfa_u4l
  Running command git clone --filter=blob:none --quiet https://github.com/hmmlearn/hmmlearn.git /tmp/pip-req-build-1nfa_u4l
  Resolved https://github.com/hmmlearn/hmmlearn.git to commit 33b1a916a11efb5c569abfc9705b7ae030e81c31
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: hmmlearn
  Building wheel for hmmlearn (setup.py) ... done
  Created wheel for hmmlearn: filename=hmmlearn-0.3.2.post4+g33b1a91-cp310-cp310-linux_x86_64.whl size=139513 sha256=4924cdb5fc888e81f07655bdfe3c013ee5aff0beacfc462ab9d30e6b4123d5a0
  Stored in directory: /tmp/pip-ephem-wheel-cache-xasswotf/wheels/f2/09/3d/bb8629ebd26dfd25f201dc48b05a7addd80388e2a9bc61e19d
Successfully built hmmlearn
Installing collected packages: hmmlearn
Successfully installed hmmlearn-0.3.2.post4+g33b1a91
Collecting git+https://github.com/pyannote

**Important:** To load the pyannote speaker diarization pipeline,

* accept the user conditions on both [hf.co/pyannote/speaker-diarization](https://hf.co/pyannote/speaker-diarization) and [hf.co/pyannote/segmentation](https://huggingface.co/pyannote/segmentation).
* paste your access token in the form above, or log in using `notebook_login` below

In [23]:
# if not(access_token):
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [24]:
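from pyannote.audio import Pipeline
# Use the token entered in the form above; if it is empty or still the "hf_"
# placeholder, pass True so the credentials cached by notebook_login are used.
auth = access_token if access_token not in ("", "hf_") else True
pipeline = Pipeline.from_pretrained('pyannote/speaker-diarization', use_auth_token=auth)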

config.yaml:   0%|          | 0.00/500 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/17.7M [00:00<?, ?B/s]

config.yaml:   0%|          | 0.00/318 [00:00<?, ?B/s]

INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.4.0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../root/.cache/torch/pyannote/models--pyannote--segmentation/snapshots/c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b/pytorch_model.bin`


Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.4.1+cu121. Bad things might happen unless you revert torch to 1.x.


hyperparams.yaml:   0%|          | 0.00/1.92k [00:00<?, ?B/s]

embedding_model.ckpt:   0%|          | 0.00/83.3M [00:00<?, ?B/s]

mean_var_norm_emb.ckpt:   0%|          | 0.00/1.92k [00:00<?, ?B/s]

classifier.ckpt:   0%|          | 0.00/5.53M [00:00<?, ?B/s]

label_encoder.txt:   0%|          | 0.00/129k [00:00<?, ?B/s]

In [25]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipeline.to(device)

<pyannote.audio.pipelines.speaker_diarization.SpeakerDiarization at 0x7a8727d6e680>

Running pyannote.audio to generate the diarizations.

In [26]:
DEMO_FILE = {'uri': 'blabla', 'audio': 'input_prep.wav'}
dz = pipeline(DEMO_FILE)

with open("diarization.txt", "w") as text_file:
    text_file.write(str(dz))

In [27]:
print(*list(dz.itertracks(yield_label = True))[:10], sep="\n")

(<Segment(2.78159, 5.46472)>, 'A', 'SPEAKER_00')
(<Segment(5.78534, 9.29534)>, 'B', 'SPEAKER_01')
(<Segment(9.75097, 12.6872)>, 'C', 'SPEAKER_00')
(<Segment(13.0922, 14.0372)>, 'D', 'SPEAKER_01')
(<Segment(14.5266, 16.9903)>, 'E', 'SPEAKER_00')
(<Segment(17.5135, 20.7872)>, 'F', 'SPEAKER_01')
(<Segment(21.496, 24.0272)>, 'G', 'SPEAKER_01')
(<Segment(24.7697, 29.2753)>, 'H', 'SPEAKER_00')
(<Segment(29.2753, 30.7435)>, 'I', 'SPEAKER_01')
(<Segment(31.4185, 32.1778)>, 'J', 'SPEAKER_01')


# Preparing audio files according to the diarization

In [28]:
def millisec(timeStr):
  # Convert an "HH:MM:SS.mmm" timestamp string to integer milliseconds.
  spl = timeStr.split(":")
  s = int((int(spl[0]) * 60 * 60 + int(spl[1]) * 60 + float(spl[2])) * 1000)
  return s
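
Note that `int(...)` truncates and the seconds field is parsed as a float, so a product that lands just below a whole millisecond could in principle come out one millisecond short. A minimal sketch of an integer-only variant, assuming timestamps always have the `HH:MM:SS.fff` shape seen in the diarization output; `millisec_exact` is a hypothetical name, not something the notebook defines.

```python
import re

_TS = re.compile(r"(\d+):(\d+):(\d+)(?:\.(\d+))?")

def millisec_exact(time_str):
    """Convert 'HH:MM:SS.fff' to integer milliseconds without float arithmetic."""
    m = _TS.fullmatch(time_str.strip())
    if m is None:
        raise ValueError(f"not a timestamp: {time_str!r}")
    hours, minutes, seconds = (int(m.group(i)) for i in (1, 2, 3))
    # Keep exactly three fractional digits: pad short fractions, cut long ones.
    frac = (m.group(4) or "")[:3].ljust(3, "0")
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(frac)

print(millisec_exact("00:00:02.781"))  # 2781
print(millisec_exact("01:02:03.004"))  # 3723004
```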

Grouping the diarization segments according to the speaker.

In [29]:
import re
dzs = open('diarization.txt').read().splitlines()

groups = []
g = []
lastend = 0

for d in dzs:
  if g and (g[0].split()[-1] != d.split()[-1]):      #speaker changed: close the current group
    groups.append(g)
    g = []

  g.append(d)

  end = re.findall(r'[0-9]+:[0-9]+:[0-9]+\.[0-9]+', string=d)[1]
  end = millisec(end)
  if (lastend > end):       #segment engulfed by a previous segment
    groups.append(g)
    g = []
  else:
    lastend = end
if g:
  groups.append(g)
print(*groups, sep='\n')

['[ 00:00:02.781 -->  00:00:05.464] A SPEAKER_00']
['[ 00:00:05.785 -->  00:00:09.295] B SPEAKER_01']
['[ 00:00:09.750 -->  00:00:12.687] C SPEAKER_00']
['[ 00:00:13.092 -->  00:00:14.037] D SPEAKER_01']
['[ 00:00:14.526 -->  00:00:16.990] E SPEAKER_00']
['[ 00:00:17.513 -->  00:00:20.787] F SPEAKER_01', '[ 00:00:21.495 -->  00:00:24.027] G SPEAKER_01']
['[ 00:00:24.769 -->  00:00:29.275] H SPEAKER_00']
['[ 00:00:29.275 -->  00:00:30.743] I SPEAKER_01', '[ 00:00:31.418 -->  00:00:32.177] J SPEAKER_01', '[ 00:00:32.819 -->  00:00:33.747] K SPEAKER_01', '[ 00:00:34.540 -->  00:00:35.805] L SPEAKER_01']
['[ 00:00:36.329 -->  00:00:42.758] M SPEAKER_00']
['[ 00:00:43.163 -->  00:00:46.049] N SPEAKER_01', '[ 00:00:46.690 -->  00:00:48.327] O SPEAKER_01']
['[ 00:00:48.917 -->  00:00:53.760] P SPEAKER_00']
['[ 00:00:53.963 -->  00:00:54.975] Q SPEAKER_01']
['[ 00:00:55.262 -->  00:00:55.819] R SPEAKER_00']
['[ 00:00:56.342 -->  00:00:57.321] S SPEAKER_01', '[ 00:00:58.064 -->  00:00:59.312] T
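
The grouping rule can also be tried on plain tuples, without the round trip through text. A simplified sketch, assuming segments are `(start_ms, end_ms, speaker)` tuples in time order: it merges consecutive turns of the same speaker but leaves out the engulfed-segment split used in the cell above. The sample values are the first eight segments of the output above; `group_by_speaker` and `span` are hypothetical helper names.

```python
from itertools import groupby

segments = [
    (2781, 5464, "SPEAKER_00"),
    (5785, 9295, "SPEAKER_01"),
    (9750, 12687, "SPEAKER_00"),
    (13092, 14037, "SPEAKER_01"),
    (14526, 16990, "SPEAKER_00"),
    (17513, 20787, "SPEAKER_01"),
    (21495, 24027, "SPEAKER_01"),
    (24769, 29275, "SPEAKER_00"),
]

def group_by_speaker(segs):
    """Merge runs of consecutive segments that share a speaker label."""
    return [list(run) for _, run in groupby(segs, key=lambda s: s[2])]

def span(group):
    """Start of the first segment and end of the last one, in milliseconds."""
    return group[0][0], group[-1][1]

sample_groups = group_by_speaker(segments)
for idx, grp in enumerate(sample_groups):
    start, end = span(grp)
    print(f"group {idx}: {start}--{end} ({grp[0][2]}, {len(grp)} segment(s))")
```

The span of each group is what gets cut out of the audio in the next step.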

Save the audio part corresponding to each diarization group.

In [30]:
audio = AudioSegment.from_wav("input_prep.wav")
gidx = -1
for g in groups:
  start = re.findall(r'[0-9]+:[0-9]+:[0-9]+\.[0-9]+', string=g[0])[0]
  end = re.findall(r'[0-9]+:[0-9]+:[0-9]+\.[0-9]+', string=g[-1])[1]
  start = millisec(start) #- spacermilli
  end = millisec(end)  #- spacermilli
  gidx += 1
  audio[start:end].export(str(gidx) + '.wav', format='wav')
  print(f"group {gidx}: {start}--{end}")

group 0: 2781--5464
group 1: 5785--9295
group 2: 9750--12687
group 3: 13092--14037
group 4: 14526--16990
group 5: 17513--24027
group 6: 24769--29275
group 7: 29275--35805
group 8: 36329--42758
group 9: 43163--48327
group 10: 48917--53760
group 11: 53963--54975
group 12: 55262--55819
group 13: 56342--63548
group 14: 63548--72610
group 15: 73184--75057
group 16: 76002--78719
group 17: 78719--79225
group 18: 79697--83174
group 19: 84507--84878
group 20: 87105--95914
group 21: 96285--98564
group 22: 98901--100335
group 23: 100825--101584
group 24: 102225--104014
group 25: 104487--105854
group 26: 106360--109904


Freeing up some memory

In [31]:
del DEMO_FILE, pipeline, spacer, audio, dz

# Whisper's Transcriptions

Installing OpenAI Whisper.

In [32]:
!pip install git+https://github.com/openai/whisper.git

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-vv9gjrqy
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-vv9gjrqy
  Resolved https://github.com/openai/whisper.git to commit 279133e3107392276dc509148da1f41bfb532c7e
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tiktoken (from openai-whisper==20231117)
  Obtaining dependency information for tiktoken from https://files.pythonhosted.org/packages/e7/8c/7d1007557b343d5cf18349802e94d3a14397121e9105b4661f8cd753f9bf/tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_

Run Whisper on every chunk. Each result, including the word-level timestamps, is written to a JSON file named after the chunk index.

In [33]:
import whisper
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = whisper.load_model('large', device = device)


100%|█████████████████████████████████████| 2.88G/2.88G [01:26<00:00, 35.6MiB/s]


In [34]:
import json
for i in range(len(groups)):
  audiof = str(i) + '.wav'
  result = model.transcribe(audio=audiof, language='en', word_timestamps=True)#, initial_prompt=result.get('text', ""))
  with open(str(i)+'.json', "w") as outfile:
    json.dump(result, outfile, indent=4)
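
Each JSON file written above holds Whisper's result dictionary. With `word_timestamps=True`, every entry in `segments` carries a `words` list whose items have `word`, `start` and `end` keys, with times in seconds relative to the start of the chunk. A minimal sketch of walking that structure; `iter_words` is a hypothetical helper, and the `sample` dictionary is hand-written for illustration, not real model output.

```python
def iter_words(result):
    """Yield (word, start_s, end_s) for every timed word in a Whisper result dict."""
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            yield w["word"], w["start"], w["end"]

# Hypothetical stand-in for the contents of one of the JSON files.
sample = {
    "text": " Thank you for calling.",
    "segments": [
        {
            "start": 0.0,
            "end": 1.2,
            "text": " Thank you for calling.",
            "words": [
                {"word": " Thank", "start": 0.0, "end": 0.3},
                {"word": " you", "start": 0.3, "end": 0.5},
                {"word": " for", "start": 0.5, "end": 0.7},
                {"word": " calling.", "start": 0.7, "end": 1.2},
            ],
        }
    ],
}

for word, start, end in iter_words(sample):
    print(f"{start:5.2f} - {end:5.2f} {word!r}")
```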

# Generating the HTML and/or txt file from the Transcriptions and the Diarization

Change or add to the speaker names and colors below as you wish: `(speaker, textbox color, speaker color)`.

In [35]:
speakers = {'SPEAKER_00':('Customer', '#e1ffc7', 'darkgreen'), 'SPEAKER_01':('Call Center', 'white', 'darkorange') }
def_boxclr = 'white'
def_spkrclr = 'orange'
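
A label that is missing from `speakers` falls back to the two default colors and keeps its raw diarization name. A minimal sketch of that lookup, assuming the same dictionary and defaults as in the cell above; `style_for` is a hypothetical helper name.

```python
speakers = {
    "SPEAKER_00": ("Customer", "#e1ffc7", "darkgreen"),
    "SPEAKER_01": ("Call Center", "white", "darkorange"),
}
def_boxclr = "white"
def_spkrclr = "orange"

def style_for(label):
    """Return (display name, box color, name color), falling back to the defaults."""
    return speakers.get(label, (label, def_boxclr, def_spkrclr))

print(style_for("SPEAKER_01"))  # a label with a custom entry
print(style_for("SPEAKER_02"))  # an unlisted label uses the defaults
```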

In the generated HTML, the transcriptions for each diarization group are written in a box, with the speaker name at the top. Clicking a word of the transcription makes the embedded video jump to the corresponding time.

In [36]:
if Source == 'Youtube':
    preS = '<!DOCTYPE html>\n<html lang="en">\n\n<head>\n\t<meta charset="UTF-8">\n\t<meta name="viewport" content="width=device-width, initial-scale=1.0">\n\t<meta http-equiv="X-UA-Compatible" content="ie=edge">\n\t<title>' + \
video_title+ \
'</title>\n\t<style>\n\t\tbody {\n\t\t\tfont-family: sans-serif;\n\t\t\tfont-size: 14px;\n\t\t\tcolor: #111;\n\t\t\tpadding: 0 0 1em 0;\n\t\t\tbackground-color: #efe7dd;\n\t\t}\n\n\t\ttable {\n\t\t\tborder-spacing: 10px;\n\t\t}\n\n\t\tth {\n\t\t\ttext-align: left;\n\t\t}\n\n\t\t.lt {\n\t\t\tcolor: inherit;\n\t\t\ttext-decoration: inherit;\n\t\t}\n\n\t\t.l {\n\t\t\tcolor: #050;\n\t\t}\n\n\t\t.s {\n\t\t\tdisplay: inline-block;\n\t\t}\n\n\t\t.c {\n\t\t\tdisplay: inline-block;\n\t\t}\n\n\t\t.e {\n\t\t\t/*background-color: white; Changing background color */\n\t\t\tborder-radius: 10px;\n\t\t\t/* Making border radius */\n\t\t\twidth: 50%;\n\t\t\t/* Making auto-sizable width */\n\t\t\tpadding: 0 0 0 0;\n\t\t\t/* Making space around letters */\n\t\t\tfont-size: 14px;\n\t\t\t/* Changing font size */\n\t\t\tmargin-bottom: 0;\n\t\t}\n\n\t\t.t {\n\t\t\tdisplay: inline-block;\n\t\t}\n\n\t\t#player-div {\n\t\t\tposition: sticky;\n\t\t\ttop: 20px;\n\t\t\tfloat: right;\n\t\t\twidth: 40%\n\t\t}\n\n\t\t#player {\n\t\t\taspect-ratio: 16 / 9;\n\t\t\twidth: 100%;\n\t\t\theight: auto;\n\n\t\t}\n\n\t\ta {\n\t\t\tdisplay: inline;\n\t\t}\n\t</style>\n\t<script>\n\t\tvar tag = document.createElement(\'script\');\n\t\ttag.src = "https://www.youtube.com/iframe_api";\n\t\tvar firstScriptTag = document.getElementsByTagName(\'script\')[0];\n\t\tfirstScriptTag.parentNode.insertBefore(tag, firstScriptTag);\n\t\tvar player;\n\t\tfunction onYouTubeIframeAPIReady() {\n\t\t\tplayer = new YT.Player(\'player\', {\n\t\t\t\t//height: \'210\',\n\t\t\t\t//width: \'340\',\n\t\t\t\tvideoId: \''+ \
video_id + \
'\',\n\t\t\t});\n\n\n\n\t\t\t// This is the source "window" that will emit the events.\n\t\t\tvar iframeWindow = player.getIframe().contentWindow;\n\t\t\tvar lastword = null;\n\n\t\t\t// So we can compare against new updates.\n\t\t\tvar lastTimeUpdate = "-1";\n\n\t\t\t// Listen to events triggered by postMessage,\n\t\t\t// this is how different windows in a browser\n\t\t\t// (such as a popup or iFrame) can communicate.\n\t\t\t// See: https://developer.mozilla.org/en-US/docs/Web/API/Window/postMessage\n\t\t\twindow.addEventListener("message", function (event) {\n\t\t\t\t// Check that the event was sent from the YouTube IFrame.\n\t\t\t\tif (event.source === iframeWindow) {\n\t\t\t\t\tvar data = JSON.parse(event.data);\n\n\t\t\t\t\t// The "infoDelivery" event is used by YT to transmit any\n\t\t\t\t\t// kind of information change in the player,\n\t\t\t\t\t// such as the current time or a playback quality change.\n\t\t\t\t\tif (\n\t\t\t\t\t\tdata.event === "infoDelivery" &&\n\t\t\t\t\t\tdata.info &&\n\t\t\t\t\t\tdata.info.currentTime\n\t\t\t\t\t) {\n\t\t\t\t\t\t// currentTime is emitted very frequently (milliseconds),\n\t\t\t\t\t\t// but we only care about whole second changes.\n\t\t\t\t\t\tvar ts = (data.info.currentTime).toFixed(1).toString();\n\t\t\t\t\t\tts = (Math.round((data.info.currentTime) * 5) / 5).toFixed(1);\n\t\t\t\t\t\tts = ts.toString();\n\t\t\t\t\t\tconsole.log(ts)\n\t\t\t\t\t\tif (ts !== lastTimeUpdate) {\n\t\t\t\t\t\t\tlastTimeUpdate = ts;\n\n\t\t\t\t\t\t\t// It\'s now up to you to format the time.\n\t\t\t\t\t\t\t//document.getElementById("time2").innerHTML = time;\n\t\t\t\t\t\t\tword = document.getElementById(ts)\n\t\t\t\t\t\t\tif (word) {\n\t\t\t\t\t\t\t\tif (lastword) {\n\t\t\t\t\t\t\t\t\tlastword.style.fontWeight = \'normal\';\n\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\tlastword = word;\n\t\t\t\t\t\t\t\t//word.style.textDecoration = \'underline\';\n\t\t\t\t\t\t\t\tword.style.fontWeight = \'bold\';\n\n\t\t\t\t\t\t\t\tlet toggle = ' + \
'document.getElementById("autoscroll");\n\t\t\t\t\t\t\t\tif (toggle.checked) {\n\t\t\t\t\t\t\t\t\tlet position = word.offsetTop - 20;\n\t\t\t\t\t\t\t\t\twindow.scrollTo({\n\t\t\t\t\t\t\t\t\t\ttop: position,\n\t\t\t\t\t\t\t\t\t\tbehavior: \'smooth\'\n\t\t\t\t\t\t\t\t\t});\n\t\t\t\t\t\t\t\t}\n\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t})\n\t\t}\n\t\tfunction jumptoTime(timepoint, id) {\n\t\t\tevent.preventDefault();\n\t\t\thistory.pushState(null, null, "#" + id);\n\t\t\tplayer.seekTo(timepoint);\n\t\t\tplayer.playVideo();\n\t\t}\n\t</script>\n</head>\n\n<body>\n\t<h2>'  + \
video_title + \
'</h2>\n\t<i>Click on a part of the transcription, to jump to its video, and get an anchor to it in the address\n\t\tbar<br><br></i>\n\t<div id="player-div">\n\t\t<div id="player"></div>\n\t\t<div><label for="autoscroll">auto-scroll: </label>\n\t\t\t<input type="checkbox" id="autoscroll" checked>\n\t\t</div>\n\t</div>\n  '
else:
    preS = '\n<!DOCTYPE html>\n<html lang="en">\n\n<head>\n\t<meta charset="UTF-8">\n\t<meta name="viewport" content="width=device-width, initial-scale=1.0">\n\t<meta http-equiv="X-UA-Compatible" content="ie=edge">\n\t<title>' + \
    audio_title+ \
    '</title>\n\t<style>\n\t\tbody {\n\t\t\tfont-family: sans-serif;\n\t\t\tfont-size: 14px;\n\t\t\tcolor: #111;\n\t\t\tpadding: 0 0 1em 0;\n\t\t\tbackground-color: #efe7dd;\n\t\t}\n\n\t\ttable {\n\t\t\tborder-spacing: 10px;\n\t\t}\n\n\t\tth {\n\t\t\ttext-align: left;\n\t\t}\n\n\t\t.lt {\n\t\t\tcolor: inherit;\n\t\t\ttext-decoration: inherit;\n\t\t}\n\n\t\t.l {\n\t\t\tcolor: #050;\n\t\t}\n\n\t\t.s {\n\t\t\tdisplay: inline-block;\n\t\t}\n\n\t\t.c {\n\t\t\tdisplay: inline-block;\n\t\t}\n\n\t\t.e {\n\t\t\t/*background-color: white; Changing background color */\n\t\t\tborder-radius: 10px;\n\t\t\t/* Making border radius */\n\t\t\twidth: 50%;\n\t\t\t/* Making auto-sizable width */\n\t\t\tpadding: 0 0 0 0;\n\t\t\t/* Making space around letters */\n\t\t\tfont-size: 14px;\n\t\t\t/* Changing font size */\n\t\t\tmargin-bottom: 0;\n\t\t}\n\n\t\t.t {\n\t\t\tdisplay: inline-block;\n\t\t}\n\n\t\t#player-div {\n\t\t\tposition: sticky;\n\t\t\ttop: 20px;\n\t\t\tfloat: right;\n\t\t\twidth: 40%\n\t\t}\n\n\t\t#player {\n\t\t\taspect-ratio: 16 / 9;\n\t\t\twidth: 100%;\n\t\t\theight: auto;\n\t\t}\n\n\t\ta {\n\t\t\tdisplay: inline;\n\t\t}\n\t</style>';
    preS += '\n\t<script>\n\twindow.onload = function () {\n\t\t\tvar player = document.getElementById("audio_player");\n\t\t\tvar player;\n\t\t\tvar lastword = null;\n\n\t\t\t// So we can compare against new updates.\n\t\t\tvar lastTimeUpdate = "-1";\n\n\t\t\tsetInterval(function () {\n\t\t\t\t// currentTime is checked very frequently (1 millisecond),\n\t\t\t\t// but we only care about whole second changes.\n\t\t\t\tvar ts = (player.currentTime).toFixed(1).toString();\n\t\t\t\tts = (Math.round((player.currentTime) * 5) / 5).toFixed(1);\n\t\t\t\tts = ts.toString();\n\t\t\t\tconsole.log(ts);\n\t\t\t\tif (ts !== lastTimeUpdate) {\n\t\t\t\t\tlastTimeUpdate = ts;\n\n\t\t\t\t\t// Its now up to you to format the time.\n\t\t\t\t\tword = document.getElementById(ts)\n\t\t\t\t\tif (word) {\n\t\t\t\t\t\tif (lastword) {\n\t\t\t\t\t\t\tlastword.style.fontWeight = "normal";\n\t\t\t\t\t\t}\n\t\t\t\t\t\tlastword = word;\n\t\t\t\t\t\t//word.style.textDecoration = "underline";\n\t\t\t\t\t\tword.style.fontWeight = "bold";\n\n\t\t\t\t\t\tlet toggle = document.getElementById("autoscroll");\n\t\t\t\t\t\tif (toggle.checked) {\n\t\t\t\t\t\t\tlet position = word.offsetTop - 20;\n\t\t\t\t\t\t\twindow.scrollTo({\n\t\t\t\t\t\t\t\ttop: position,\n\t\t\t\t\t\t\t\tbehavior: "smooth"\n\t\t\t\t\t\t\t});\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}, 0.1);\n\t\t}\n\n\t\tfunction jumptoTime(timepoint, id) {\n\t\t\tvar player = document.getElementById("audio_player");\n\t\t\thistory.pushState(null, null, "#" + id);\n\t\t\tplayer.pause();\n\t\t\tplayer.currentTime = timepoint;\n\t\t\tplayer.play();\n\t\t}\n\t\t</script>\n\t</head>';
    preS += '\n\n<body>\n\t<h2>' + audio_title + '</h2>\n\t<i>Click on a part of the transcription, to jump to its portion of audio, and get an anchor to it in the address\n\t\tbar<br><br></i>\n\t<div id="player-div">\n\t\t<div id="player">\n\t\t\t<audio controls="controls" id="audio_player">\n\t\t\t\t<source src="input.wav" />\n\t\t\t</audio>\n\t\t</div>\n\t\t<div><label for="autoscroll">auto-scroll: </label>\n\t\t\t<input type="checkbox" id="autoscroll" checked>\n\t\t</div>\n\t</div>\n';

postS = '\t</body>\n</html>'
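
Word highlighting works by giving each word an `id` equal to its start time snapped to a 0.2 s grid; the page script snaps the player's `currentTime` the same way and looks the element up by that id. A minimal sketch of the snapping, mirroring the expression used in the next cell; `word_id` and `word_id_half_up` are hypothetical helper names. One caveat: Python's `round` sends exact halves to the nearest even integer, while JavaScript's `Math.round` rounds them up, so the two sides can disagree on exact ties.

```python
import math

def word_id(start_s):
    """Snap a start time (seconds) to a 0.2 s grid, formatted like the element ids."""
    return "{:.1f}".format(round(start_s * 5) / 5)

def word_id_half_up(start_s):
    """Same grid, but with ties rounded up as JavaScript's Math.round does."""
    return "{:.1f}".format(math.floor(start_s * 5 + 0.5) / 5)

print(word_id(0.78))         # 0.8
print(word_id(3.04))         # 3.0
print(word_id(0.5))          # 0.4 (tie rounds to even)
print(word_id_half_up(0.5))  # 0.6 (tie rounds up)
```

Since several words can start within the same 0.2 s slot, ids produced this way are not guaranteed to be unique.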

In [37]:
#import webvtt
import json
from datetime import timedelta

def timeStr(t):
  return '{0:02d}:{1:02d}:{2:06.2f}'.format(round(t // 3600),
                                                round(t % 3600 // 60),
                                                t % 60)

html = [preS]
txt = []
gidx = -1
for g in groups:
  shift = re.findall(r'[0-9]+:[0-9]+:[0-9]+\.[0-9]+', string=g[0])[0]
  shift = millisec(shift) - spacermilli #the start time in the original video
  shift=max(shift, 0)

  gidx += 1

  captions = json.load(open(str(gidx) + '.json'))['segments']

  if captions:
    speaker = g[0].split()[-1]
    boxclr = def_boxclr
    spkrclr = def_spkrclr
    if speaker in speakers:
      speaker, boxclr, spkrclr = speakers[speaker]

    html.append(f'<div class="e" style="background-color: {boxclr}">\n');
    html.append('<p  style="margin:0;padding: 5px 10px 10px 10px;word-wrap:normal;white-space:normal;">\n')
    html.append(f'<span style="color:{spkrclr};font-weight: bold;">{speaker}</span><br>\n\t\t\t\t')

    for c in captions:
      start = shift + c['start'] * 1000.0
      start = start / 1000.0   #YouTube's time resolution is one second.
      end = (shift + c['end'] * 1000.0) / 1000.0
      txt.append(f'[{timeStr(start)} --> {timeStr(end)}] [{speaker}] {c["text"]}\n')

      for i, w in enumerate(c['words']):
        if w == "":
           continue
        start = (shift + w['start']*1000.0) / 1000.0
        #end = (shift + w['end']) / 1000.0   #time resolution ot youtube is Second.
        html.append(f'<a href="#{timeStr(start)}" id="{"{:.1f}".format(round(start*5)/5)}" class="lt" onclick="jumptoTime({int(start)}, this.id)">{w["word"]}</a><!--\n\t\t\t\t-->')
    #html.append('\n')
    html.append('</p>\n')
    html.append(f'</div>\n')

html.append(postS)


with open(f"capspeaker.txt", "w", encoding='utf-8') as file:
  s = "".join(txt)
  file.write(s)
  print('captions saved to capspeaker.txt:')
  print(s+'\n')

with open(f"capspeaker.html", "w", encoding='utf-8') as file:    #TODO: proper html embed tag when video/audio from file
  s = "".join(html)
  file.write(s)
  print('captions saved to capspeaker.html:')
  print(s+'\n')

captions saved to capspeaker.txt:
[00:00:000.78 --> 00:00:003.04] [Customer]  Thank you for calling Martha's Flores, Towne S-60.
[00:00:003.79 --> 00:00:007.08] [Call Center]  Hello, I'd like to order flowers and I think you have what I'm looking for.
[00:00:007.75 --> 00:00:010.33] [Customer]  I'd be happy to take care of your order. May I have your name, please?
[00:00:011.09 --> 00:00:011.65] [Call Center]  Randall Thomas.
[00:00:012.53 --> 00:00:014.77] [Customer]  Randall Thomas, can you spell that for me?
[00:00:015.51 --> 00:00:018.63] [Call Center]  Randall, R-A-N-D-A-L-L,
[00:00:019.55 --> 00:00:021.89] [Call Center]  Robert, T-H-O-M-A-N.
[00:00:022.77 --> 00:00:024.27] [Customer]  Thank you for that information, Randall.
[00:00:024.29 --> 00:00:026.89] [Customer]  You may have your home or office number area code first.
[00:00:027.27 --> 00:00:033.45] [Call Center]  Aircode 409, then 866-5088.
[00:00:034.33 --> 00:00:037.73] [Customer]  That's 409-866-5088.
[00:00:038.03 --> 