# Task: Semantic Chunking of a YouTube Video

## Workflow Steps:
1. Download Video and Extract Audio
2. Transcription of Audio
3. Time-Align Transcript with Audio
4. Semantic Chunking of Data
5. Evaluation and Judgement Criteria
6. Generalization
7. Gradio App Interface

## 1. Download Video and Extract Audio

**Explanation of the Requirements:** <br>
1. youtube-dl: This library is used for downloading videos from YouTube.<br>
2. moviepy: Useful for video and audio processing, such as extracting audio tracks from video files.<br>
3. torch and transformers: Required for PyTorch-based speech-to-text models such as Facebook's Wav2Vec or OpenAI's Whisper.<br>
4. pandas and numpy: Essential for data manipulation and numerical operations.<br>
5. scipy and librosa: Used for more detailed audio analysis and processing tasks.<br>
6. soundfile: For reading and writing audio files.<br>
7. nltk and spacy: Natural Language Processing libraries useful for text manipulation and processing, especially for the semantic chunking.<br>
8. gradio: Allows you to create a web interface to easily interact with your models and processes.<br>
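
Before running the install cells, it can help to see which of these libraries are already present. The sketch below is one possible check; the import names in `REQUIRED_MODULES` are assumptions (pip package names and import names can differ, e.g. `youtube-dl` is imported as `youtube_dl`), and `missing_modules` is a hypothetical helper, not part of any of the libraries above.

```python
from importlib.util import find_spec

# Assumed import names for the packages listed above; adjust to match
# the actual contents of requirements.txt.
REQUIRED_MODULES = [
    "youtube_dl", "moviepy", "torch", "transformers", "pandas", "numpy",
    "scipy", "librosa", "soundfile", "nltk", "spacy", "gradio",
]


def missing_modules(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Anything printed here still needs to be installed.
print("Missing:", missing_modules(REQUIRED_MODULES))
```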

In [1]:
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Looking in indexes: https://download.pytorch.org/whl/cu118, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.


In [2]:
import torch
torch.cuda.is_available()

True

In [3]:
%pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting chardet (from -r requirements.txt (line 3))
  Downloading chardet-5.2.0-py3-none-any.whl.metadata (3.4 kB)
Downloading chardet-5.2.0-py3-none-any.whl (199 kB)
Installing collected packages: chardet
Successfully installed chardet-5.2.0
Note: you may need to restart the kernel to use updated packages.


### Using youtube_transcript_api

In [1]:
from youtube_transcript_api import YouTubeTranscriptApi

In [2]:
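def get_video_id_from_url(url):
    # Extract the video ID from a YouTube watch URL, dropping any trailing
    # parameters or fragments such as "&t=30s", "&list=..." or "#comments"
    if "youtube.com/watch?v=" in url:
        return url.split("watch?v=")[1].split("&")[0].split("#")[0]
    else:
        raise ValueError("Invalid YouTube URL")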

In [3]:
def get_transcript(youtube_url):
    try:
        # Extract the video ID from the URL
        video_id = get_video_id_from_url(youtube_url)

        # Fetch the available transcripts
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        
        # Print each entry of the transcript
        for entry in transcript:
            print(f"Start: {entry['start']} - Duration: {entry['duration']}s")
            print(entry['text'] + '\n')
    
    except Exception as e:
        print(f"An error occurred: {e}")

In [4]:
youtube_url = "https://www.youtube.com/watch?v=ysLiABvVos8"
get_transcript(youtube_url)

Start: 0.24 - Duration: 7.97s
presents Morning

Start: 2.94 - Duration: 5.27s
[Music]

Start: 10.719 - Duration: 7.041s
News good morning I am SRA maaba the

Start: 15.64 - Duration: 4.719s
headlines campaigning for remaining

Start: 17.76 - Duration: 5.04s
phases of L SAA elections

Start: 20.359 - Duration: 5.121s
intensifies president dropadi murmu to

Start: 22.8 - Duration: 6.44s
confir Padma awards at second investure

Start: 25.48 - Duration: 6.08s
ceremony in raspati bavan today try

Start: 29.24 - Duration: 4.159s
service conference parivartan chintan

Start: 31.56 - Duration: 4.88s
chaired by Chief of Defense staff

Start: 33.399 - Duration: 6.361s
General Anil Chan to be held in New

Start: 36.44 - Duration: 5.84s
Delhi unrest erupts in Pakistan occupied

Start: 39.76 - Duration: 4.4s
Kashmir as Pakistan Security Forces

Start: 42.28 - Duration: 5.16s
deployed ahead of

Start: 44.16 - Duration: 5.48s
protests IMD forast fresh spell of heat

Start: 47.44 - Duration: 5.72s
wav
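
The caption entries above overlap in time: each entry's `start + duration` runs past the start of the next entry. For time alignment and chunking it is often easier to work with non-overlapping segments. The following is a minimal sketch, assuming the entries are sorted by `start` and have the `text`/`start`/`duration` keys printed above; `to_segments` is a hypothetical helper, and clipping each end time to the next start is only one possible convention.

```python
def to_segments(entries):
    """Convert caption entries into non-overlapping segments.

    Each segment ends at start + duration, clipped to the start of the
    next entry so that consecutive segments never overlap.
    """
    segments = []
    for i, entry in enumerate(entries):
        start = entry["start"]
        end = start + entry["duration"]
        if i + 1 < len(entries):
            end = min(end, entries[i + 1]["start"])
        segments.append({"start": start, "end": round(end, 3), "text": entry["text"]})
    return segments


# First few entries copied from the output above.
sample = [
    {"text": "presents Morning", "start": 0.24, "duration": 7.97},
    {"text": "[Music]", "start": 2.94, "duration": 5.27},
    {"text": "News good morning I am SRA maaba the", "start": 10.719, "duration": 7.041},
    {"text": "headlines campaigning for remaining", "start": 15.64, "duration": 4.719},
]

for seg in to_segments(sample):
    print(f"{seg['start']:>7.3f} -> {seg['end']:>7.3f}  {seg['text']}")
```

The last entry keeps its full duration because there is no following entry to clip against.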

### Using pytube

In [5]:
import os
import subprocess
from pytube import YouTube

In [6]:
def rename_audio(file_path):
    # Define the new file name
    new_file_name = "audio_file"

    # Get the directory of the file
    file_directory = os.path.dirname(file_path)

    # Get the file extension
    file_extension = os.path.splitext(file_path)[1]

    # Create a unique new file name
    count = 1
    while os.path.exists(os.path.join(file_directory, f"{new_file_name}_{count}{file_extension}")):
        count += 1

    new_file_name = f"{new_file_name}_{count}{file_extension}"
    # Get the new file path
    new_file_path = os.path.join(file_directory, new_file_name)

    # Rename the file
    os.rename(file_path, new_file_path)

    print(f"Renamed '{file_path}' to '{new_file_path}'")

    return new_file_path

In [7]:
def convert_to_wav(file_path):
    # Define output directory
    output_directory = os.path.dirname(file_path)

    # Define the new file name with .wav extension
    wav_file_name = os.path.splitext(os.path.basename(file_path))[0] + ".wav"

    # Define the output path for the converted .wav file
    wav_file_path = os.path.join(output_directory, wav_file_name)

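    # Execute FFmpeg command to convert to .wav format; check=True raises
    # CalledProcessError if FFmpeg fails instead of silently returning a bad path
    subprocess.run(["ffmpeg", "-i", file_path, "-acodec", "pcm_s16le", "-ar", "44100", wav_file_path], stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True)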

    return wav_file_path
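
Whisper models operate on 16 kHz audio, so a 16 kHz mono WAV is smaller than the 44.1 kHz file produced above and should need no further resampling. The sketch below only builds the FFmpeg argument list, so it can be inspected without FFmpeg installed; `build_ffmpeg_command` and its defaults are illustrative assumptions, not part of the notebook's original code.

```python
def build_ffmpeg_command(input_path, output_path, sample_rate=16000, channels=1):
    """Build an FFmpeg argument list for a PCM 16-bit WAV conversion."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file without prompting
        "-i", input_path,         # input file
        "-ac", str(channels),     # number of audio channels (1 = mono)
        "-ar", str(sample_rate),  # output sample rate in Hz
        "-acodec", "pcm_s16le",   # 16-bit little-endian PCM, as in convert_to_wav
        output_path,
    ]


# Hypothetical file names, for illustration only.
print(" ".join(build_ffmpeg_command("audio_file_1.mp4", "audio_file_1.wav")))
```

The returned list can be passed to `subprocess.run(..., check=True)` in the same way as the command in `convert_to_wav`.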

In [8]:
def download_audio(youtube_url):
    # Create a YouTube object with the URL
    yt = YouTube(youtube_url)

    # Define output directory
    output_path = "downloaded_audio"

    # Get the best audio stream
    audio_stream = yt.streams.get_audio_only()

    # Download the audio stream
    downloaded_file_path = audio_stream.download(output_path)

    # Rename the downloaded audio file
    renamed_file_path = rename_audio(downloaded_file_path)

    print(f"Downloaded '{audio_stream.title}' to '{renamed_file_path}' as '{audio_stream.mime_type}'.")

    # Convert to .wav format
    wav_file_path = convert_to_wav(renamed_file_path)

    print(f"Converted '{renamed_file_path}' to '{wav_file_path}'.")

    return wav_file_path

In [9]:
# Example usage:
youtube_url = "https://www.youtube.com/watch?v=ysLiABvVos8"
wav_file_path = download_audio(youtube_url)

Renamed 'e:\Github_projects\sarvamai-hiring-challenge\downloaded_audio\Campaigning for remaining phases of Lok Sabha elections intensifies.mp4' to 'e:\Github_projects\sarvamai-hiring-challenge\downloaded_audio\audio_file_2.mp4'
Downloaded 'Campaigning for remaining phases of Lok Sabha elections intensifies' to 'e:\Github_projects\sarvamai-hiring-challenge\downloaded_audio\audio_file_2.mp4' as 'audio/mp4'.
Converted 'e:\Github_projects\sarvamai-hiring-challenge\downloaded_audio\audio_file_2.mp4' to 'e:\Github_projects\sarvamai-hiring-challenge\downloaded_audio\audio_file_2.wav'.


In [12]:
wav_file_path

'e:\\Github_projects\\sarvamai-hiring-challenge\\downloaded_audio\\audio_file_2.wav'

In [13]:
youtube_url = "https://www.youtube.com/watch?v=ysLiABvVos8"

### Audio to Text

In [14]:
%pip install --upgrade pip
%pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to c:\users\fyzan\appdata\local\temp\pip-req-build-la8u2kn2
  Resolved https://github.com/huggingface/transformers.git to commit b8aee2e918d7ba2d5e9e80162ae26b4806873307
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Installing backend dependencies: started
  Installing backend dependencies: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'


  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git 'C:\Users\fyzan\AppData\Local\Temp\pip-req-build-la8u2kn2'


In [17]:
%pip install numpy==1.22.0

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting numpy==1.22.0
  Downloading numpy-1.22.0-cp310-cp310-win_amd64.whl.metadata (2.1 kB)
Downloading numpy-1.22.0-cp310-cp310-win_amd64.whl (14.7 MB)

In [18]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

In [19]:
import torch
torch.cuda.is_available()

True

In [20]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [25]:
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=5,
    batch_size=2,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

In [26]:
result = pipe(wav_file_path, return_timestamps=True)
print(result["chunks"])

RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
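
The out-of-memory error above suggests the GPU could not hold `openai/whisper-large-v3` together with the inference buffers, possibly alongside memory still held by earlier cell runs. One way to approach this is to restart the kernel and then pick a checkpoint and batch size based on the memory that is actually free. The sketch below is only an illustration: `pick_whisper_config` is a hypothetical helper, and its thresholds are rough guesses rather than measured requirements.

```python
def pick_whisper_config(free_gib):
    """Pick a Whisper checkpoint and batch size from free GPU memory (GiB).

    The thresholds are illustrative assumptions, not measured requirements.
    """
    if free_gib >= 8:
        return {"model_id": "openai/whisper-large-v3", "batch_size": 2}
    if free_gib >= 4:
        return {"model_id": "openai/whisper-medium", "batch_size": 1}
    return {"model_id": "openai/whisper-small", "batch_size": 1}


# In the notebook, free memory could be read with torch (not run here):
#   free_bytes, _total_bytes = torch.cuda.mem_get_info()
#   config = pick_whisper_config(free_bytes / 1024**3)
print(pick_whisper_config(6.0))
```

The resulting `model_id` and `batch_size` would replace the corresponding values in the model-loading and `pipeline(...)` cells above.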
