# Explanation of the Provided Code

The provided code performs several tasks related to preparing audio, transcribing it, performing speaker diarization, and generating subtitles. Here's a breakdown of each part of the code:

## Downloading Video from YouTube

1. The code starts by installing the `pytube` library for working with YouTube videos.

2. It specifies the YouTube video URL that you want to download and assigns it to the variable `video_url`.

3. A `YouTube` object is created using the provided video URL.

4. The highest-resolution progressive stream (video and audio in a single file) is selected with `yt.streams.get_highest_resolution()`.

5. The download directory is set to `test`.

6. The video is downloaded using `video_stream.download()`.

7. The code then retrieves the path of the downloaded video file.
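
Step 7 depends on knowing where the file landed. When that has to be worked out from the download directory, one option is to pick the newest video file instead of the first directory entry. The sketch below is a minimal illustration using only the standard library; the function name `newest_video` and the `.mp4` default are assumptions made for this example, not part of the notebook.

```python
from pathlib import Path
from typing import Optional


def newest_video(download_dir: str, suffix: str = ".mp4") -> Optional[Path]:
    """Return the most recently modified file with `suffix` in `download_dir`, or None."""
    candidates = [
        p for p in Path(download_dir).iterdir()
        if p.is_file() and p.suffix.lower() == suffix
    ]
    if not candidates:
        return None
    # Pick by modification time so unrelated, older files are not chosen by accident
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Calling `newest_video("test")` would then return the path of the latest `.mp4` in the `test` directory, or `None` if the directory holds no such file.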

## Extracting Audio from Video

1. The code installs the `moviepy` library for working with multimedia files.

2. It specifies the output audio file path as "test.mp3".

3. The audio is extracted from the downloaded video using `ffmpeg_extract_audio()` and saved as an MP3 file.

4. The path of the saved audio file is printed.
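
The extraction step relies on the `ffmpeg` command-line tool. As a rough sketch of what such a call can look like when issued directly, the example below builds and runs an `ffmpeg` command with `subprocess`. The function names and the choice of flags (`-y`, `-vn`, `-ar`) are assumptions for illustration and are not necessarily the exact command that `moviepy` runs.

```python
import shutil
import subprocess
from typing import List


def build_extract_audio_cmd(video_path: str, audio_path: str,
                            sample_rate: int = 44100) -> List[str]:
    """Build an ffmpeg command that writes the audio track of `video_path` to `audio_path`."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", video_path,         # input video
        "-vn",                    # drop the video stream
        "-ar", str(sample_rate),  # audio sample rate in Hz
        audio_path,               # output format is inferred from the extension
    ]


def extract_audio(video_path: str, audio_path: str) -> None:
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg executable not found on PATH")
    subprocess.run(build_extract_audio_cmd(video_path, audio_path), check=True)
```

Keeping the command construction separate from its execution makes it easy to inspect or log the command before running it.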

## Transcribing

1. The code stores a Hugging Face (HF) access token in `HF_TOKEN`; the diarization models need it to be downloaded. In the notebook it is set to a placeholder value that you replace with your own token.

2. It checks the availability of a CUDA-enabled GPU and assigns the appropriate device (CPU or GPU) to the variable `DEVICE`.
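
Rather than pasting the token into the notebook, it can be read from the environment so it never ends up in a shared file. The sketch below shows one way to do that; the helper name `read_hf_token` and the environment variable name `HF_TOKEN` are assumptions chosen for this example.

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token stored in `env_var`, or raise a clear error if it is unset."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face token"
        )
    return token
```

With this in place, the notebook cell could read `HF_TOKEN = read_hf_token()` instead of holding the token as a literal string.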

## Generating Script

1. The code installs the `whisper` library, which is used for automatic speech recognition.

2. It specifies the desired ASR (Automatic Speech Recognition) model size (e.g., "large").

3. The ASR model is loaded using `whisper.load_model()`.

4. The script (transcription) is generated from the audio file using the loaded ASR model.
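
The transcription result is a dictionary; the notebook later reads its `"text"`, `"segments"`, and `"language"` entries. The sketch below shows how the segment list can be inspected. The `sample` dictionary is a hand-written stand-in with illustrative timings, not real model output, and `preview_segments` is a hypothetical helper.

```python
def preview_segments(result: dict, limit: int = 3) -> list:
    """Format the first `limit` segments as 'start-end: text' strings."""
    lines = []
    for seg in result["segments"][:limit]:
        lines.append(f"{seg['start']:.2f}-{seg['end']:.2f}: {seg['text'].strip()}")
    return lines


# Hand-written stand-in for a transcription result; times are illustrative
sample = {
    "language": "en",
    "text": " Sir, you there. Are you going to teach your son to be a fighter?",
    "segments": [
        {"start": 0.0, "end": 2.0, "text": " Sir, you there."},
        {"start": 2.0, "end": 5.0,
         "text": " Are you going to teach your son to be a fighter?"},
    ],
}

for line in preview_segments(sample):
    print(line)
```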

## Speaker Diarization

1. The code installs the `whisperX` library, which provides the alignment and speaker diarization tools used in the following steps.

2. It creates a diarization pipeline using the Hugging Face token.

3. Speaker diarization is performed on the audio using the diarization pipeline.

## Combining Script with Speaker Diarization

1. The code aligns the generated script with the speaker diarization results.

2. It loads an align model and metadata using `whisperx.load_align_model()`.

3. The script segments are aligned with the audio using `whisperx.align()`.

4. Speaker information is assigned to word segments, creating a list of transcribed segments with speaker labels.
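
To make the speaker assignment step more concrete, the sketch below labels each transcript segment with the speaker whose diarization turns overlap it for the longest total time. This is an illustration of the general idea under that assumption, not necessarily the exact procedure `whisperx` follows; the function names are hypothetical, and the sample values are rounded from the notebook's own outputs.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each segment with the speaker whose turns overlap it the longest."""
    labelled = []
    for seg in segments:
        totals = {}
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > 0:
                totals[turn["speaker"]] = totals.get(turn["speaker"], 0.0) + shared
        # None marks a segment that no diarization turn overlaps
        speaker = max(totals, key=totals.get) if totals else None
        labelled.append({**seg, "speaker": speaker})
    return labelled


turns = [
    {"start": 2.404, "end": 6.775, "speaker": "SPEAKER_01"},
    {"start": 5.897, "end": 6.741, "speaker": "SPEAKER_00"},
    {"start": 7.737, "end": 9.255, "speaker": "SPEAKER_00"},
]
segments = [
    {"start": 2.609, "end": 5.338,
     "text": "Are you going to teach your son to be a fighter?"},
    {"start": 6.02, "end": 7.845, "text": "No, sir."},
]

for seg in assign_speakers(segments, turns):
    print(seg["speaker"], seg["text"])
```

Summing the overlap per speaker, rather than taking the single longest turn, lets several short turns by one speaker outweigh a brief overlap with another.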

## Generating Subtitles File

1. The code specifies the output SubRip (`.srt`) subtitles file path as "subtitles.srt".

2. It opens the `.srt` file for writing.

3. The code iterates through the transcribed segments, converts the start and end times to the SubRip timestamp format (`hours:minutes:seconds,milliseconds`), and writes each subtitle entry to the `.srt` file.

4. Speaker names are mapped from codes (e.g., "SPEAKER_00") to actual names (e.g., "Ali").

5. The `.srt` file is created with subtitle entries following the SubRip format.

6. A message is printed to indicate the successful creation of the `.srt` file.
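
A SubRip entry consists of an index line, a timing line of the form `start --> end`, the subtitle text, and a blank line, with timestamps written as `HH:MM:SS,mmm` (a comma before the milliseconds). As a way to check a generated file, the sketch below parses timing lines back into milliseconds and rejects malformed ones. The function names are hypothetical and the pattern is a simplified assumption about well-formed timestamps.

```python
import re

# Two or more hour digits, then minutes, seconds, and a comma before milliseconds
_SRT_TIME = re.compile(r"(\d{2,}):([0-5]\d):([0-5]\d),(\d{3})")


def parse_srt_time(stamp: str) -> int:
    """Convert an 'HH:MM:SS,mmm' timestamp to milliseconds; raise ValueError if malformed."""
    match = _SRT_TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not a SubRip timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def parse_timing_line(line: str) -> tuple:
    """Split a 'start --> end' line into (start_ms, end_ms)."""
    start, sep, end = line.partition(" --> ")
    if not sep:
        raise ValueError(f"missing ' --> ' separator: {line!r}")
    return parse_srt_time(start), parse_srt_time(end)
```

A timestamp that uses a period instead of a comma, such as `00:00:02.609`, fails this check, which makes the helper useful for catching formatting slips.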

This code is designed to download a video from YouTube, extract its audio, transcribe the audio, perform speaker diarization, and generate subtitles. It leverages various libraries and Hugging Face models to automate these tasks.

# Original Video

In [13]:
from IPython.display import HTML
HTML('<div align="center"><iframe align = "middle" width="790" height="440" src="https://www.youtube.com/embed/qR4JwjI3ldU" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></div>')

# Transcribed Video

In [19]:
from IPython.display import HTML
HTML('<div align="center"><iframe align = "middle" width="790" height="440" src="https://www.youtube.com/embed/4jQzoXxlzMU" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></div>')

# Preparing the Audio

## Downloading Video from YouTube

In [None]:
!pip install pytube

Collecting pytube
  Downloading pytube-15.0.0-py3-none-any.whl (57 kB)
Installing collected packages: pytube
Successfully installed pytube-15.0.0


In [None]:
import os

from pytube import YouTube

# Input the YouTube video URL
video_url = "https://www.youtube.com/watch?v=qR4JwjI3ldU"

# Create a YouTube object
yt = YouTube(video_url)

# Get the highest-resolution progressive stream (video and audio in one file)
video_stream = yt.streams.get_highest_resolution()

# Directory where the video will be saved
download_path = "test"

# Download the video; download() returns the path of the saved file,
# so there is no need to guess it from the directory listing
video_path = os.path.relpath(video_stream.download(output_path=download_path))
print("video downloaded at:", video_path)

video downloaded at: test/Muhammad Ali Speech - Value Of Education.mp4


## Extracting Audio from Video

In [None]:
!pip install moviepy

In [None]:
from moviepy.video.io.ffmpeg_tools import ffmpeg_extract_audio

# Output audio file path (MP3)
audio_path = "test.mp3"

# Extract audio from the video and save it as MP3
ffmpeg_extract_audio(video_path, audio_path)
print("audio saved at:", audio_path)

Moviepy - Running:

>>> "+ " ".join(cmd)

Moviepy - Command successful

audio saved at: test.mp3


# Transcribing

In [None]:
# https://huggingface.co/settings/tokens
HF_TOKEN = "here goes your hugging face code"

In [None]:
import torch
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
DEVICE

device(type='cuda')

## Generating Script

In [None]:
! pip install git+https://github.com/openai/whisper.git

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-9jxvu_dm
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-9jxvu_dm
  Resolved https://github.com/openai/whisper.git to commit e8622f9afc4eba139bf796c210f5c01081000472
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tiktoken==0.3.3 (from openai-whisper==20230314)
  Downloading tiktoken-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Building wheels for collected packages: openai-whisper
  Building wheel for openai-whisper (pyproject.toml) ... done
  Created wheel for openai-whisper: f

In [None]:
import whisper

model_name = "large"   # tiny | base | small | medium  | large
model = whisper.load_model(model_name, DEVICE)
script = model.transcribe(audio_path)

100%|█████████████████████████████████████| 2.87G/2.87G [00:33<00:00, 91.7MiB/s]


In [None]:
script['text'][:1000]

" Sir, you there. Are you going to teach your son to be a fighter? Bring him up to be a fighter? No, sir. My son is already, he's two years old and we're starting, we want him to learn three languages. Arabic, French, and Spanish. My son is going to, by the time he's where he is, we in America would have had what we call Independence for now. The separation we preach would have been taking place in maybe another, less than ten years. This is going to happen. God's going to force it. By then, he's going to be, I hope to be a world traveler, an interpreter, talking to other people. In French, most, many African people speak only French. Many darker people speak Spanish. And he's going to have to do a lot of traveling, ambassador work, and doing different things, I plan. And I'm making, he's learning these three, he's two years old, and we're getting him ready now for Arabic, French, and Spanish. And so he ain't going to be no fighter, he's going to use his brain. Why not? Why wouldn't yo

## Speaker Diarization

In [None]:
! pip install git+https://github.com/m-bain/whisperX.git

Installing collected packages: tokenizers, sentencepiece, safetensors, python-editor, primePy, docopt, av, antlr4-python3-runtime, websockets, tensorboardX, shellingham, semver, ruamel.yaml.clib, readchar, python-multipart, ordered-set, omegaconf, Mako, lightning-utilities, humanfriendly, h11, ffmpeg-python, einops, ctranslate2, colorlog, colorama, cmaes, blessed, backoff, uvicorn, starlette, ruamel.yaml, pyannote.core, inquirer, huggingface-hub, deepdiff, dateutils, croniter, coloredlogs, arrow, alembic, transformers, starsessions, optuna, onnxruntime, hyperpyyaml, fastapi, pyannote.database, lightning-cloud, faster-whisper, pyannote.pipeline, pyannote.metrics, torchmetrics, torch-pitch-shift, pytorch-lightning, julius, torch_audiomentations, speechbrain, pytorch_metric_learning, lightning, asteroid-filterbanks, pyannote.audio, whisperx

Successfully installed Mako-1.2.4 alembic-1.12.0 antlr4-python3-runtime-4.9.3 arrow-1.2.3 asteroid-filterbanks-0.4.0 av-10.0.0 backoff-2.2.1 blessed-

In [None]:
from whisperx.diarize import DiarizationPipeline

diarization_pipeline = DiarizationPipeline(use_auth_token=HF_TOKEN)
diarized = diarization_pipeline(audio_path)

INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.9. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/pyannote/models--pyannote--segmentation/snapshots/c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b/pytorch_model.bin`


Model was trained with pyannote.audio 0.0.1, yours is 2.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.

Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu118. Bad things might happen unless you revert torch to 1.x.


In [None]:
diarized

| | 0 | 1 | speaker | start | end |
|---|---|---|---|---|---|
| 0 | [ 00:00:00.497 --> 00:00:01.763] | P | SPEAKER_01 | 0.497812 | 1.763437 |
| 1 | [ 00:00:02.404 --> 00:00:06.775] | Q | SPEAKER_01 | 2.404688 | 6.775313 |
| 2 | [ 00:00:05.897 --> 00:00:06.741] | A | SPEAKER_00 | 5.897813 | 6.741563 |
| 3 | [ 00:00:07.737 --> 00:00:09.255] | B | SPEAKER_00 | 7.737188 | 9.255938 |
| 4 | [ 00:00:09.998 --> 00:00:14.132] | C | SPEAKER_00 | 9.998438 | 14.132813 |
| 5 | [ 00:00:14.824 --> 00:00:15.432] | D | SPEAKER_00 | 14.824688 | 15.432188 |
| 6 | [ 00:00:16.512 --> 00:00:18.419] | E | SPEAKER_00 | 16.512187 | 18.419062 |
| 7 | [ 00:00:19.701 --> 00:00:32.442] | F | SPEAKER_00 | 19.701563 | 32.442188 |
| 8 | [ 00:00:33.184 --> 00:00:55.628] | G | SPEAKER_00 | 33.184688 | 55.628438 |
| 9 | [ 00:00:56.320 --> 00:01:00.758] | H | SPEAKER_00 | 56.320313 | 60.758438 |
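
Diarization turns can overlap in time: in the output above, the `SPEAKER_00` turn starting at 5.897813 s begins before the `SPEAKER_01` turn ending at 6.775313 s has finished. The sketch below lists such overlaps between different speakers. It works on plain `(speaker, start, end)` tuples copied from the first rows of that output; `find_overlaps` is a hypothetical helper, not a `whisperx` function.

```python
def find_overlaps(turns):
    """Return (speaker_a, speaker_b, start, end) for overlapping turns by different speakers."""
    found = []
    ordered = sorted(turns, key=lambda t: t[1])  # sort by start time
    for i, (spk_a, start_a, end_a) in enumerate(ordered):
        for spk_b, start_b, end_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # later turns start even later, so none can overlap turn a
            if spk_a != spk_b:
                found.append((spk_a, spk_b, start_b, min(end_a, end_b)))
    return found


turns = [
    ("SPEAKER_01", 0.497812, 1.763437),
    ("SPEAKER_01", 2.404688, 6.775313),
    ("SPEAKER_00", 5.897813, 6.741563),
    ("SPEAKER_00", 7.737188, 9.255938),
]

print(find_overlaps(turns))
```

Regions found this way are where a segment's speaker label is most likely to be ambiguous, so they are worth checking by ear.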


## Combining Script with Speaker Diarization

In [None]:
from whisperx import load_align_model, align
from whisperx.diarize import assign_word_speakers

In [None]:
# Align Script
model_a, metadata = load_align_model(language_code=script["language"], device=DEVICE)
script_aligned = align(script["segments"], model_a, metadata, audio_path, DEVICE)

# Align Speakers: the result is a dict, so read its entries by key
# rather than relying on the order of .values()
result = assign_word_speakers(diarized, script_aligned)
result_segments = result["segments"]

# Keep only the fields needed for the subtitles
transcribed = [
    {
        "start": segment["start"],
        "end": segment["end"],
        "text": segment["text"],
        "speaker": segment["speaker"],
    }
    for segment in result_segments
]

In [None]:
for segment in transcribed:
    print(segment["start"], segment["end"], segment["speaker"], segment["text"])

0.522 2.609 SPEAKER_01  Sir, you there.

2.609 5.338 SPEAKER_01 Are you going to teach your son to be a fighter?

5.338 5.98 SPEAKER_01 Bring him up to be a fighter?

6.02 7.845 SPEAKER_00  No, sir.

7.845 13.88 SPEAKER_00 My son is already, he's two years old and we're starting, we want him to learn three languages.

15.042 19.831 SPEAKER_00  Arabic, French, and Spanish.

19.831 25.0 SPEAKER_00 My son is going to, by the time he's where he is, we in America would have had what we call

25.642 27.027 SPEAKER_00  Independence for now.

27.027 30.98 SPEAKER_00 The separation we preach would have been taking place in maybe another, less than ten years.

31.04 33.268 SPEAKER_00  This is going to happen.

33.268 34.532 SPEAKER_00 God's going to force it.

34.532 37.0 SPEAKER_00 By then, he's going to be, I hope to be a world traveler,

37.542 39.829 SPEAKER_00  an interpreter, talking to other people.

39.829 43.0 SPEAKER_00 In French, most, many African people speak only French.

44.024 46

# Generating Subtitles File

In [None]:
# Output .srt file path
srt_file_path = "subtitles.srt"

# Map diarization labels to display names
speaker_names = {
    "SPEAKER_00": "Ali",
    "SPEAKER_01": "Host",
}


def to_srt_time(seconds):
    """Convert seconds to the SubRip timestamp format HH:MM:SS,mmm."""
    # Work in whole milliseconds so rounding can never produce "60" seconds
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    # SubRip separates seconds and milliseconds with a comma, not a period
    return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"


# Open the .srt file for writing
with open(srt_file_path, "w", encoding="utf-8") as srt_file:
    # SubRip entries are numbered from 1
    for count, entry in enumerate(transcribed, start=1):
        # Fall back to the raw label if a speaker has no mapped name
        speaker = speaker_names.get(entry["speaker"], entry["speaker"])
        text = speaker + ": " + entry["text"].strip()

        # Write the subtitle entry to the .srt file
        srt_file.write(f"{count}\n")
        srt_file.write(f"{to_srt_time(entry['start'])} --> {to_srt_time(entry['end'])}\n")
        srt_file.write(text + "\n\n")

print(f".srt file '{srt_file_path}' created successfully.")


.srt file 'subtitles.srt' created successfully.
