The token setup below is needed for the first run only. For future runs, you can skip to "Run the code block to connect Google Drive."

First, prepare your Hugging Face token to access required models for diarization. You can generate the token [here](https://huggingface.co/settings/tokens), making it a "Read" token, and paste it in a text document, since you'll need it later.

You then need to accept the user agreement for each of the following models: [Segmentation](https://huggingface.co/pyannote/segmentation-3.0) and [Speaker-Diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1).

To add the token to your Colab secrets, click the key on the left of this screen, click "Add new secret", name it "HF_TOKEN", paste the token into Value, and make sure Notebook access is on.
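
A quick way to confirm the secret is visible to the notebook is sketched below. The helper name `read_hf_token` is hypothetical, and the fallback to an environment variable is an assumption added only so the snippet also runs outside Colab.

```python
import os


def read_hf_token(name="HF_TOKEN"):
    """Return the token from Colab secrets, or from the environment as a fallback."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        # Assumption: outside Colab, look for an environment variable instead.
        return os.environ.get(name) or None
    try:
        return userdata.get(name) or None
    except Exception:
        # Colab may raise if the secret is missing or notebook access is off.
        return None


token = read_hf_token()
print("HF_TOKEN found" if token else "HF_TOKEN missing: add it under the key icon")
```

If this prints the "missing" message inside Colab, the usual causes are a misspelled secret name or the Notebook access toggle being off.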

Run the code block to connect Google Drive, which is where the notebook gets the file(s) to transcribe and saves the transcriptions. Make sure the source file(s) (mp4 and wav are supported) are in a folder named "transcription" in the root folder of your Drive ("My Drive"). Several popups will ask whether you want to connect and whether you are sure; grant access in each. Another popup will warn that this code was not written by Google; click "Run anyway".

To run each code block, click the play button at its top left, then wait for it to finish.

Note: All files in "My Drive/transcription" will be processed, so make sure to move old files out of this folder before continuing!

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
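
Once Drive is mounted, files from earlier runs can be moved aside automatically rather than by hand. The sketch below is one possible approach, not part of the original notebook: `archive_transcribed` and the `archive` subfolder are hypothetical names, and it assumes the output naming used by the transcription block (`<name>.wav.txt` and `<name>.wav_readable.txt`). Only media files that already have a raw transcript are moved, so new files stay in place.

```python
import shutil
from pathlib import Path

MEDIA_SUFFIXES = {".wav", ".mp4", ".mp3"}


def archive_transcribed(folder, archive_name="archive"):
    """Move already-transcribed media and their transcripts into a subfolder."""
    folder = Path(folder)
    archive = folder / archive_name
    moved = []
    # sorted() snapshots the listing, so moving files mid-loop is safe.
    for media in sorted(folder.iterdir()):
        if not (media.is_file() and media.suffix.lower() in MEDIA_SUFFIXES):
            continue
        raw = folder / (media.stem + ".wav.txt")
        readable = folder / (media.stem + ".wav_readable.txt")
        if not raw.exists():
            continue  # not transcribed yet, leave it where it is
        archive.mkdir(exist_ok=True)
        for path in (media, raw, readable):
            if path.exists():
                shutil.move(str(path), str(archive / path.name))
                moved.append(path.name)
    return moved


drive_folder = Path("/content/drive/MyDrive/transcription")
if drive_folder.is_dir():
    print("Archived:", archive_transcribed(drive_folder))
else:
    print("Folder not found; is Drive mounted?")
```

Because the transcription block only looks at files directly inside the folder, anything moved into the subfolder is skipped on the next run.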

Run this block to set up the environment. You will get a prompt to restart the session; wait for the block to finish before restarting.

In [None]:
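# Note: this line is written in conda syntax (single "=", "-c" channels), so pip is
# expected to reject it; the WhisperX install below pulls in its own PyTorch dependency.
!pip install pytorch==2.0.0 torchaudio=2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia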
!pip install git+https://github.com/m-bain/whisperx.git
!pip install ctranslate2==4.4.0
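
After the session restarts, it can help to confirm what actually got installed before starting a long transcription. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the install commands above.

```python
from importlib import metadata


def installed_versions(packages=("torch", "torchaudio", "whisperx", "ctranslate2")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'NOT INSTALLED'}")

# The transcription block uses device = "cuda", so a GPU runtime is needed.
try:
    import torch
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    print("torch is not importable yet")
```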

Run this block to do the actual transcription, alignment, and diarization with WhisperX. It outputs a raw, not-quite-human-readable file and a second version meant to be easier to read. The code is adapted from the example at https://github.com/m-bain/whisperX. It takes about 15 minutes on a 2h21m video pre-converted to .wav, and prints a message when finished.
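
The raw output is the Python `str()` of the segment list rather than JSON, which is why the readable step parses it with `ast.literal_eval`. The round trip can be sketched as follows; the two segments are made-up sample data that use only the keys the code below relies on (`start`, `text`, `speaker`).

```python
from ast import literal_eval

# Made-up segments, for illustration only.
segments = [
    {"start": 0.5, "text": " Hello there.", "speaker": "SPEAKER_00"},
    {"start": 2.4, "text": " Hi.", "speaker": "SPEAKER_01"},
]

raw_text = str(segments)            # what gets written to the raw .txt file
restored = literal_eval(raw_text)   # how the readable step parses it back

print(restored == segments)
print(restored[1]["speaker"])
```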

In [None]:
import whisperx
import gc
import torch

from google.colab import userdata
if not userdata.get('HF_TOKEN'):
  print("Error: missing HF_TOKEN! Please go to https://huggingface.co/settings/tokens to generate a token and add it to your notebook secrets")

device = "cuda"
batch_size = 16 # reduce if low on GPU mem
compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("large-v3", device, compute_type=compute_type)

def transcribe(audio_file_name):
  audio = whisperx.load_audio(audio_file_name)
  result = model.transcribe(audio, batch_size=batch_size)
  #print(result["segments"]) # before alignment

  # delete model if low on GPU resources
  # import gc; gc.collect(); torch.cuda.empty_cache(); del model

  # 2. Align whisper output
  model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
  result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

  #print(result["segments"]) # after alignment

  # delete model if low on GPU resources
  # import gc; gc.collect(); torch.cuda.empty_cache(); del model_a

  # 3. Assign speaker labels
  diarize_model = whisperx.DiarizationPipeline(use_auth_token=userdata.get('HF_TOKEN'), device=device)

  # add min/max number of speakers if known
  diarize_segments = diarize_model(audio_file_name)
  # diarize_model(audio_file, min_speakers=min_speakers, max_speakers=max_speakers)

  result = whisperx.assign_word_speakers(diarize_segments, result)
  #print(diarize_segments)
  #print(result["segments"]) # segments are now assigned speaker IDs
  # Save the raw segments as a Python literal next to the audio file
  with open(str(audio_file_name) + ".txt", "w") as f:
      f.write(str(result["segments"]))

import glob, os
files = glob.glob("/content/drive/MyDrive/transcription/*.*")
print("Files to transcribe:")
for path in files:
  # List every type the loop below handles, including .mp3
  if os.path.isfile(path) and path.endswith((".wav", ".mp4", ".mp3")):
    print(path)

for path in files:
  if os.path.isfile(path) and path.endswith(".wav"):
    print("Transcribing file " + path)
    transcribe(path)
  elif os.path.isfile(path) and path.endswith((".mp4", ".mp3")):
    print("Converting and transcribing file " + path)
    basefile = os.path.splitext(path)[0]
    # Convert to a temporary .wav, transcribe it, then delete it
    !ffmpeg -i "{path}" "{basefile}.wav"
    transcribe(basefile + ".wav")
    os.remove(basefile + ".wav")


from ast import literal_eval
import sys
import math
import os


def seconds_to_h_m_s(seconds):
    return (
        str(math.floor(seconds / 3600))
        + ":"
        + str(math.floor((seconds % 3600) / 60))
        + ":"
        + "{:.2f}".format(seconds % 60)
    )


def to_string(data):
    current_speaker = None
    lines = []
    for segment in data:
        # Segments that diarization could not attribute get a placeholder label
        speaker = segment.get("speaker", "SPEAKER_UNK")
        # Emit the speaker label only when the speaker changes
        if speaker != current_speaker:
            current_speaker = speaker
            lines.append(current_speaker + ":")
        lines.append(
            "\t" + seconds_to_h_m_s(segment["start"]) + ": " + segment["text"]
        )
    return "".join(line + "\n" for line in lines)

def make_readable(filename):
  # The raw file holds a Python literal (the str() of the segment list), not JSON
  with open(filename, "r") as raw_file:
      data = literal_eval(raw_file.read())
  # Write the speaker-grouped version next to it as <name>_readable.txt
  with open(os.path.splitext(filename)[0] + "_readable.txt", "w") as readable_file:
      readable_file.write(to_string(data))

import glob, os
files = glob.glob("/content/drive/MyDrive/transcription/*.txt")
for i in range(len(files)):
  if (os.path.isfile(files[i]) and files[i].endswith(".txt") and not files[i].endswith("_readable.txt")):
    print("Making " + files[i] + " readable ")
    make_readable(files[i])

print("Done! Look in the transcription folder for the results")
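
The timestamps produced by `seconds_to_h_m_s` are unpadded (for example `0:5:3.20`), which can make long transcripts harder to scan or sort. If zero-padded timestamps are preferred, a drop-in variant might look like the sketch below; the name `seconds_to_hh_mm_ss` is hypothetical and this is an optional alternative, not the format the notebook currently writes.

```python
def seconds_to_hh_mm_ss(seconds):
    """Format a duration in seconds as zero-padded HH:MM:SS.ss."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{secs:05.2f}"


print(seconds_to_hh_mm_ss(3723.5))  # prints 01:02:03.50
```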