<a href="https://colab.research.google.com/github/fedexmax10603/ai/blob/main/whisper_ai_pyannote_transcribing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Explanation of the Provided Code

The provided code performs several tasks related to preparing audio, transcribing it, performing speaker diarization, and generating subtitles. Here's a breakdown of each part of the code:

## Downloading Video from YouTube

1. The code starts by installing the `pytube` library for working with YouTube videos.

2. It specifies the YouTube video URL that you want to download and assigns it to the variable `video_url`.

3. A `YouTube` object is created using the provided video URL.

4. The highest-resolution progressive stream (video and audio in one file) is obtained using `yt.streams.get_highest_resolution()`.

5. The download path is specified as "test".

6. The video is downloaded using `video_stream.download()`.

7. The code then retrieves the path of the downloaded video file.

## Extracting Audio from Video

1. The code installs the `moviepy` library for working with multimedia files.

2. It specifies the output audio file path as "test.mp3".

3. The audio is extracted from the downloaded video using `ffmpeg_extract_audio()` and saved as an MP3 file.

4. The path of the saved audio file is printed.
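
As a rough sketch of what this step amounts to, the helper below builds an ffmpeg command line that drops the video stream and writes only the audio. The function names `build_extract_audio_cmd` and `extract_audio`, and the 16 kHz mono defaults, are assumptions made for illustration; the notebook itself relies on `ffmpeg_extract_audio()`, which chooses its own bitrate and sample rate.

```python
import shutil
import subprocess


def build_extract_audio_cmd(video_path, audio_path, sample_rate=16000, channels=1):
    """Return an ffmpeg argument list that extracts the audio track of a video."""
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it already exists
        "-i", video_path,          # input video file
        "-vn",                     # drop the video stream
        "-ac", str(channels),      # number of audio channels
        "-ar", str(sample_rate),   # audio sample rate in Hz
        audio_path,                # output format is inferred from the extension
    ]


def extract_audio(video_path, audio_path):
    """Run ffmpeg if it is installed; raise a clear error otherwise."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    subprocess.run(build_extract_audio_cmd(video_path, audio_path), check=True)
    return audio_path
```

Keeping the command builder separate from the call that runs it makes the arguments easy to inspect before any file is touched.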

## Transcribing

1. The code stores a Hugging Face (HF) access token in `HF_TOKEN`. The token is needed later to download the gated diarization models, so use your own token here.

2. It checks the availability of a CUDA-enabled GPU and assigns the appropriate device (CPU or GPU) to the variable `DEVICE`.
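
One possible way to keep the token out of the notebook is to read it from an environment variable. This is a minimal sketch, not what the notebook does: the helper names `get_hf_token` and `pick_device`, and the variable name `HF_TOKEN`, are assumptions chosen for illustration.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Return the Hugging Face token stored in an environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return token


def pick_device():
    """Return "cuda" when PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


# Usage (assumes the variable was exported before starting the notebook):
# HF_TOKEN = get_hf_token()
# DEVICE = pick_device()
```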

## Generating Script

1. The code installs the `whisper` library, which is used for automatic speech recognition.

2. It specifies the desired ASR (Automatic Speech Recognition) model size (e.g., "large").

3. The ASR model is loaded using `whisper.load_model()`.

4. The script (transcription) is generated from the audio file using the loaded ASR model.
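
The object returned by `transcribe()` is a dictionary. The sketch below assumes its `text`, `language`, and `segments` keys, with `start`, `end`, and `text` on each segment. The helper name `preview_segments` and the sample values are made up for illustration and are not output from the notebook.

```python
def preview_segments(result, limit=3):
    """Return one formatted line per segment: [start -> end] text."""
    lines = []
    for seg in result["segments"][:limit]:
        lines.append(
            f"[{seg['start']:7.2f} -> {seg['end']:7.2f}] {seg['text'].strip()}"
        )
    return lines


# Hypothetical result with the same shape as a transcription result
sample_result = {
    "language": "en",
    "text": " No, sir. Arabic, French, and Spanish.",
    "segments": [
        {"start": 6.4, "end": 6.6, "text": " No, sir."},
        {"start": 15.0, "end": 18.2, "text": " Arabic, French, and Spanish."},
    ],
}

for line in preview_segments(sample_result):
    print(line)
```

The `language` and `segments` entries are the parts the later alignment step consumes.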

## Speaker Diarization

1. The code installs the `whisperX` library, which provides both the speaker diarization pipeline (built on pyannote models) and the alignment helpers used in the next section.

2. It creates a diarization pipeline using the Hugging Face token.

3. Speaker diarization is performed on the audio using the diarization pipeline.

## Combining Script with Speaker Diarization

1. The code aligns the generated script with the speaker diarization results.

2. It loads an align model and metadata using `whisperx.load_align_model()`.

3. The script segments are aligned with the audio using `whisperx.align()`.

4. Speaker information is assigned to word segments, creating a list of transcribed segments with speaker labels.
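
The general idea behind the speaker assignment can be illustrated with plain Python: give each transcribed segment the speaker whose diarization turns overlap it the longest. This is a simplified sketch of that idea, not the actual `whisperx` implementation; `overlap`, `assign_speakers`, and the `"UNKNOWN"` fallback label are hypothetical names, and the sample timings are rounded from the outputs shown in this notebook.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="UNKNOWN"):
    """Label each segment with the speaker that overlaps it the longest."""
    labeled = []
    for seg in segments:
        totals = {}
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > 0:
                totals[turn["speaker"]] = totals.get(turn["speaker"], 0.0) + shared
        # Segments that overlap no speaker turn get the fallback label
        speaker = max(totals, key=totals.get) if totals else unknown
        labeled.append({**seg, "speaker": speaker})
    return labeled


turns = [
    {"speaker": "SPEAKER_01", "start": 0.144, "end": 1.638},
    {"speaker": "SPEAKER_01", "start": 2.487, "end": 4.389},
    {"speaker": "SPEAKER_01", "start": 4.711, "end": 6.681},
    {"speaker": "SPEAKER_00", "start": 5.917, "end": 6.715},
    {"speaker": "SPEAKER_00", "start": 7.767, "end": 9.295},
]
segments = [
    {"start": 0.522, "end": 5.24, "text": "Sir, you there, are you going to teach your son to be a fighter?"},
    {"start": 7.84, "end": 13.9, "text": "My son is already, he's two years old, and we're starting to learn three languages."},
]

for seg in assign_speakers(segments, turns):
    print(seg["speaker"], seg["text"])
```

Very short segments that sit inside overlapping speech can tie between two speakers, which is one reason a real pipeline works at the word level first.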

## Generating Subtitles File

1. The code specifies the output SubRip (`.srt`) subtitles file path as "subtitles.srt".

2. It opens the `.srt` file for writing.

3. The code iterates through the transcribed segments, converts the start and end times to the SubRip format, and writes each subtitle entry to the `.srt` file.

4. Speaker names are mapped from codes (e.g., "SPEAKER_00") to actual names (e.g., "Ali").

5. The `.srt` file is created with subtitle entries following the SubRip format.

6. A message is printed to indicate the successful creation of the `.srt` file.
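
For reference, a SubRip entry is an index line, a time line of the form `HH:MM:SS,mmm --> HH:MM:SS,mmm` (with a comma before the milliseconds), the subtitle text, and a blank line. The sketch below shows one way to build such an entry; the helper names `to_srt_timestamp` and `srt_entry` are assumptions for illustration, not names from the notebook.

```python
def to_srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm."""
    # Work in whole milliseconds so rounding can never produce "60.000" seconds
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"


def srt_entry(index, start, end, text):
    """Return one complete SubRip entry, including the trailing blank line."""
    return f"{index}\n{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}\n{text}\n\n"


print(srt_entry(1, 0.522, 5.24, "Host: Sir, you there, are you going to teach your son to be a fighter?"))
```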

This code is designed to download a video from YouTube, extract its audio, transcribe the audio, perform speaker diarization, and generate subtitles. It leverages various libraries and Hugging Face models to automate these tasks.

# Original Video

In [None]:
from IPython.display import HTML
HTML('<div align="center"><iframe align = "middle" width="790" height="440" src="https://www.youtube.com/embed/qR4JwjI3ldU" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></div>')

# Transcribed Video

In [None]:
from IPython.display import HTML
HTML('<div align="center"><iframe align = "middle" width="790" height="440" src="https://www.youtube.com/embed/4jQzoXxlzMU" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></div>')

# Preparing the Audio

## Downloading Video from YouTube

In [1]:
!pip install pytube

Collecting pytube
  Downloading pytube-15.0.0-py3-none-any.whl (57 kB)
Installing collected packages: pytube
Successfully installed pytube-15.0.0


In [2]:
from pytube import YouTube

# Input the YouTube video URL
video_url = "https://www.youtube.com/watch?v=qR4JwjI3ldU"

# Create a YouTube object
yt = YouTube(video_url)

# Get the highest-resolution progressive stream (video and audio in one file)
video_stream = yt.streams.get_highest_resolution()

# Provide the download path where you want to save the video
download_path = "test"

# Download the video; download() returns the path of the saved file,
# which is more reliable than taking the first entry of os.listdir()
import os
video_path = os.path.relpath(video_stream.download(output_path=download_path))
print("video downloaded at:", video_path)

video downloaded at: test/Muhammad Ali Speech - Value Of Education.mp4


## Extracting Audio from Video

In [3]:
! pip install moviepy



In [4]:
from moviepy.video.io.ffmpeg_tools import ffmpeg_extract_audio

# Output audio file path (MP3)
audio_path = "test.mp3"

# Extract audio from the video and save it as MP3
ffmpeg_extract_audio(video_path, audio_path)
print("audio saved at:", audio_path)

Moviepy - Running:
>>> "+ " ".join(cmd)
Moviepy - Command successful
audio saved at: test.mp3


# Transcribing

In [26]:
# https://huggingface.co/settings/tokens
HF_TOKEN = "<your-hugging-face-token>"  # placeholder: never commit a real token

In [5]:
import torch
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
DEVICE

device(type='cpu')

## Generating Script

In [7]:
!pip install git+https://github.com/openai/whisper.git

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-ja0ymc84
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-ja0ymc84
  Resolved https://github.com/openai/whisper.git to commit ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tiktoken (from openai-whisper==20231117)
  Downloading tiktoken-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->openai-whisper==20231117)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)

In [8]:
import whisper

model_name = "large"   # tiny | base | small | medium  | large
model = whisper.load_model(model_name, device=DEVICE)  # use the device selected above
script = model.transcribe(audio_path)

100%|██████████████████████████████████████| 2.88G/2.88G [00:23<00:00, 134MiB/s]


In [9]:
script['text'][:1000]

" Sir, you there, are you going to teach your son to be a fighter? Are you bringing him up to be a fighter? No, sir. My son is already, he's two years old, and we're starting to learn three languages. Arabic, French, and Spanish. My son is going to, by the time he's where he is, we in America would have had what we call our independence by now. The separation we preached would have been taking place in maybe another less than ten years. This is going to happen. God's going to force it. By then, he's going to be, I hope, to be a world traveler and interpreter, talking to other people. In French, many African people speak only French. Many darker people speak Spanish. And he's going to have to do a lot of traveling and ambassador work and doing different things, I plan. He's learning these three. He's two years old, and we're getting him ready now for Arabic, French. French and Spanish. And so he ain't going to be no fighter. He's going to use his brain. Why not? Why wouldn't you let him

## Speaker Diarization

In [33]:
! pip install git+https://github.com/m-bain/whisperX.git

Collecting git+https://github.com/m-bain/whisperX.git
  Cloning https://github.com/m-bain/whisperX.git to /tmp/pip-req-build-w_h75u8f
  Running command git clone --filter=blob:none --quiet https://github.com/m-bain/whisperX.git /tmp/pip-req-build-w_h75u8f
  Resolved https://github.com/m-bain/whisperX.git to commit 78dcfaab51005aa703ee21375f81ed31bc248560
  Preparing metadata (setup.py) ... done


In [35]:
from whisperx.diarize import DiarizationPipeline

# Reuse the token and device defined earlier instead of hard-coding a second token
diarization_pipeline = DiarizationPipeline(use_auth_token=HF_TOKEN, device=DEVICE)
diarized = diarization_pipeline(audio_path)

pytorch_model.bin:   0%|          | 0.00/5.91M [00:00<?, ?B/s]

config.yaml:   0%|          | 0.00/399 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/26.6M [00:00<?, ?B/s]

config.yaml:   0%|          | 0.00/221 [00:00<?, ?B/s]

In [36]:
diarized

Unnamed: 0,segment,label,speaker,start,end
0,[ 00:00:00.144 --> 00:00:01.638],A,SPEAKER_01,0.144312,1.63837
1,[ 00:00:02.487 --> 00:00:04.388],B,SPEAKER_01,2.487267,4.388795
2,[ 00:00:04.711 --> 00:00:06.680],C,SPEAKER_01,4.711375,6.680815
3,[ 00:00:05.916 --> 00:00:06.714],D,SPEAKER_00,5.916808,6.714771
4,[ 00:00:07.767 --> 00:00:09.295],E,SPEAKER_00,7.767402,9.295416
5,[ 00:00:10.008 --> 00:00:11.027],F,SPEAKER_00,10.008489,11.027165
6,[ 00:00:11.553 --> 00:00:14.151],G,SPEAKER_00,11.55348,14.151104
7,[ 00:00:14.864 --> 00:00:15.475],H,SPEAKER_00,14.864177,15.475382
8,[ 00:00:16.528 --> 00:00:17.088],I,SPEAKER_00,16.528014,17.088285
9,[ 00:00:17.546 --> 00:00:18.446],J,SPEAKER_00,17.546689,18.44652


## Combining Script with Speaker Diarization

In [22]:
from whisperx import load_align_model, align
from whisperx.diarize import assign_word_speakers

In [37]:
# Align Script
model_a, metadata = load_align_model(language_code=script["language"], device=DEVICE)
script_aligned = align(script["segments"], model_a, metadata, audio_path, DEVICE)

# Assign a speaker to each aligned segment
result = assign_word_speakers(diarized, script_aligned)
transcribed = [
    {
        "start": segment["start"],
        "end": segment["end"],
        "text": segment["text"],
        # a segment that overlaps no speaker turn may carry no "speaker" key
        "speaker": segment.get("speaker", "UNKNOWN"),
    }
    for segment in result["segments"]
]

Failed to align segment (" He's going to use his brain."): backtrack failed, resorting to original...
Failed to align segment (" I was from Kentucky."): backtrack failed, resorting to original...
Failed to align segment (" You understand?"): backtrack failed, resorting to original...
Failed to align segment (" Leave me alone."): backtrack failed, resorting to original...
Failed to align segment (" Leave me alone."): backtrack failed, resorting to original...
Failed to align segment (" I mean, you know what I mean?"): backtrack failed, resorting to original...
Failed to align segment (" So I need people like you to match with me."): backtrack failed, resorting to original...


In [38]:
for segment in transcribed:
    print(segment["start"], segment["end"], segment["speaker"], segment["text"])

0.522 5.24 SPEAKER_01  Sir, you there, are you going to teach your son to be a fighter?
5.4 6.38 SPEAKER_01  Are you bringing him up to be a fighter?
6.422 6.573 SPEAKER_00  No, sir.
7.84 13.9 SPEAKER_00  My son is already, he's two years old, and we're starting to learn three languages.
15.041 18.24 SPEAKER_00  Arabic, French, and Spanish.
19.82 26.62 SPEAKER_00  My son is going to, by the time he's where he is, we in America would have had what we call our independence by now.
27.021 31.78 SPEAKER_00  The separation we preached would have been taking place in maybe another less than ten years.
31.86 32.419 SPEAKER_00  This is going to happen.
33.26 34.017 SPEAKER_00  God's going to force it.
34.54 38.98 SPEAKER_00  By then, he's going to be, I hope, to be a world traveler and interpreter, talking to other people.
39.822 43.48 SPEAKER_00  In French, many African people speak only French.
44.04 45.68 SPEAKER_00  Many darker people speak Spanish.
46.18 50.36 SPEAKER_00  And he's going t

# Generating Subtitles File

In [39]:
# Output .srt file path
srt_file_path = "subtitles.srt"

def to_srt_time(seconds):
    """Format seconds as a SubRip timestamp: HH:MM:SS,mmm (comma before milliseconds)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"

# Map diarization labels to display names
speaker_names = {
    "SPEAKER_00": "Ali",
    "SPEAKER_01": "Host",
}

# Open the .srt file for writing
with open(srt_file_path, "w", encoding="utf-8") as srt_file:
    # SubRip entries are numbered from 1
    for count, entry in enumerate(transcribed, start=1):
        # Fall back to the raw label if a speaker has no mapped name
        speaker = speaker_names.get(entry["speaker"], entry["speaker"])
        text = speaker + ": " + entry["text"].strip()

        # Write the subtitle entry to the .srt file
        srt_file.write(f"{count}\n")
        srt_file.write(f"{to_srt_time(entry['start'])} --> {to_srt_time(entry['end'])}\n")
        srt_file.write(text + "\n\n")

print(f".srt file '{srt_file_path}' created successfully.")


.srt file 'subtitles.srt' created successfully.
