# Task: Semantic Chunking of a YouTube Video

## Workflow Steps:
1. Download Video and Extract Audio
2. Transcription of Audio
3. Time-Align Transcript with Audio
4. Semantic Chunking of Data
5. Evaluation and Judgement Criteria
6. Generalization
7. Gradio App Interface

## 1. Download Video and Extract Audio

**Explanation of the Requirements:** <br>
1. youtube-dl: This library is used for downloading videos from YouTube.<br>
2. moviepy: Useful for video and audio processing, such as extracting audio tracks from video files.<br>
3. torch and transformers: Needed for PyTorch-based speech-to-text models such as Facebook's Wav2Vec or OpenAI's Whisper (used below).<br>
4. pandas and numpy: Essential for data manipulation and numerical operations.<br>
5. scipy and librosa: Used for more detailed audio analysis and processing tasks.<br>
6. soundfile: For reading and writing audio files.<br>
7. nltk and spacy: Natural Language Processing libraries for text manipulation and processing, especially for semantic chunking.<br>
8. gradio: Allows you to create a web interface to easily interact with your models and processes.<br>

In [1]:
%pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting Cython (from -r requirements.txt (line 2))
  Downloading Cython-3.0.10-cp310-cp310-win_amd64.whl.metadata (3.2 kB)
Collecting sentencepiece<1.0.0 (from -r requirements.txt (line 18))
  Downloading sentencepiece-0.2.0-cp310-cp310-win_amd64.whl.metadata (8.3 kB)
Collecting youtokentome>=1.0.5 (from -r requirements.txt (line 19))
  Downloading youtokentome-1.0.6.tar.gz (86 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'error'
Note: you may need to restart the kernel to use updated packages.


  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "C:\Users\fyzan\AppData\Local\Temp\pip-install-ebgxt4w8\youtokentome_24307e26fc504dbc9e898b80124a37ab\setup.py", line 5, in <module>
          from Cython.Build import cythonize
      ModuleNotFoundError: No module named 'Cython'
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
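
The log above shows `youtokentome` failing during `setup.py egg_info` because `Cython` was not importable when its metadata was generated, even though `Cython` is listed earlier in `requirements.txt`. One likely workaround is to install `Cython` on its own first and then rerun the requirements install. The sketch below only checks which build prerequisites are importable; the helper name `missing_packages` is hypothetical and not part of the original notebook.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Cython must be importable before pip can build youtokentome from source.
prerequisites = ["Cython"]
missing = missing_packages(prerequisites)
if missing:
    print(f"Install these first, e.g. %pip install {' '.join(missing)}")
else:
    print("Build prerequisites are available.")
```

If the check reports `Cython` as missing, running `%pip install Cython` in its own cell before `%pip install -r requirements.txt` should let the `youtokentome` build find it.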


### using youtube_transcript_api

In [6]:
from youtube_transcript_api import YouTubeTranscriptApi

video_id = 'ysLiABvVos8'
YouTubeTranscriptApi.get_transcript(video_id)

[{'text': 'presents Morning', 'start': 0.24, 'duration': 7.97},
 {'text': '[Music]', 'start': 2.94, 'duration': 5.27},
 {'text': 'News good morning I am SRA maaba the',
  'start': 10.719,
  'duration': 7.041},
 {'text': 'headlines campaigning for remaining',
  'start': 15.64,
  'duration': 4.719},
 {'text': 'phases of L SAA elections', 'start': 17.76, 'duration': 5.04},
 {'text': 'intensifies president dropadi murmu to',
  'start': 20.359,
  'duration': 5.121},
 {'text': 'confir Padma awards at second investure',
  'start': 22.8,
  'duration': 6.44},
 {'text': 'ceremony in raspati bavan today try',
  'start': 25.48,
  'duration': 6.08},
 {'text': 'service conference parivartan chintan',
  'start': 29.24,
  'duration': 4.159},
 {'text': 'chaired by Chief of Defense staff',
  'start': 31.56,
  'duration': 4.88},
 {'text': 'General Anil Chan to be held in New',
  'start': 33.399,
  'duration': 6.361},
 {'text': 'Delhi unrest erupts in Pakistan occupied',
  'start': 36.44,
  'duration': 5.
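
Each entry carries a `start` and a `duration` in seconds, so an end time can be derived as `start + duration`. In the output above the caption windows overlap (the first entry runs from 0.24 s to about 8.21 s, while the second starts at 2.94 s), so these windows are display intervals rather than strict speech boundaries. A minimal sketch of that arithmetic, using the first three entries shown above; the helper name `add_end_times` is hypothetical.

```python
def add_end_times(entries):
    """Return copies of transcript entries with an extra 'end' key (seconds)."""
    return [
        {**entry, "end": round(entry["start"] + entry["duration"], 3)}
        for entry in entries
    ]


# First three entries copied from the output above
sample_entries = [
    {"text": "presents Morning", "start": 0.24, "duration": 7.97},
    {"text": "[Music]", "start": 2.94, "duration": 5.27},
    {"text": "News good morning I am SRA maaba the", "start": 10.719, "duration": 7.041},
]

timed = add_end_times(sample_entries)

# Compare each entry's start with the previous entry's end to detect overlap
for previous, current in zip(timed, timed[1:]):
    overlaps = current["start"] < previous["end"]
    print(f"{current['start']:>7.3f}s  overlaps previous: {overlaps}")
```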

In [4]:
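from urllib.parse import urlparse, parse_qs

from youtube_transcript_api import YouTubeTranscriptApi

def get_video_id_from_url(url):
    # Extract the video ID from the "v" query parameter, so that extra
    # parameters such as "&t=30s" do not end up in the ID
    query = parse_qs(urlparse(url).query)
    if "youtube.com/watch" in url and "v" in query:
        return query["v"][0]
    else:
        raise ValueError("Invalid YouTube URL")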

def get_transcript(youtube_url):
    try:
        # Extract the video ID from the URL
        video_id = get_video_id_from_url(youtube_url)

        # Fetch the available transcripts
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        
        # Print each entry of the transcript
        for entry in transcript:
            print(f"Start: {entry['start']} - Duration: {entry['duration']}s")
            print(entry['text'] + '\n')
    
    except Exception as e:
        print(f"An error occurred: {e}")

# Example usage
get_transcript('https://www.youtube.com/watch?v=ysLiABvVos8')  # Replace with the actual YouTube URL


Start: 0.24 - Duration: 7.97s
presents Morning

Start: 2.94 - Duration: 5.27s
[Music]

Start: 10.719 - Duration: 7.041s
News good morning I am SRA maaba the

Start: 15.64 - Duration: 4.719s
headlines campaigning for remaining

Start: 17.76 - Duration: 5.04s
phases of L SAA elections

Start: 20.359 - Duration: 5.121s
intensifies president dropadi murmu to

Start: 22.8 - Duration: 6.44s
confir Padma awards at second investure

Start: 25.48 - Duration: 6.08s
ceremony in raspati bavan today try

Start: 29.24 - Duration: 4.159s
service conference parivartan chintan

Start: 31.56 - Duration: 4.88s
chaired by Chief of Defense staff

Start: 33.399 - Duration: 6.361s
General Anil Chan to be held in New

Start: 36.44 - Duration: 5.84s
Delhi unrest erupts in Pakistan occupied

Start: 39.76 - Duration: 4.4s
Kashmir as Pakistan Security Forces

Start: 42.28 - Duration: 5.16s
deployed ahead of

Start: 44.16 - Duration: 5.48s
protests IMD forast fresh spell of heat

Start: 47.44 - Duration: 5.72s
wav
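
The printed entries are short caption fragments and include non-speech tags such as `[Music]`. Before any semantic chunking it can help to drop those tags and merge the fragments into longer, time-stamped passages. The sketch below is a purely time-based grouping, not a semantic one; the helper `group_entries` and its `max_seconds` parameter are hypothetical.

```python
def group_entries(entries, max_seconds=15.0):
    """Merge caption fragments whose start times fall within max_seconds
    of the first fragment in the group."""
    groups = []
    current = None
    for entry in entries:
        text = entry["text"].strip()
        # Skip non-speech tags such as "[Music]"
        if text.startswith("[") and text.endswith("]"):
            continue
        if current is None or entry["start"] - current["start"] >= max_seconds:
            # Start a new group at this fragment
            current = {"start": entry["start"], "text": text}
            groups.append(current)
        else:
            current["text"] += " " + text
    return groups


# First five entries copied from the output above
sample_entries = [
    {"text": "presents Morning", "start": 0.24, "duration": 7.97},
    {"text": "[Music]", "start": 2.94, "duration": 5.27},
    {"text": "News good morning I am SRA maaba the", "start": 10.719, "duration": 7.041},
    {"text": "headlines campaigning for remaining", "start": 15.64, "duration": 4.719},
    {"text": "phases of L SAA elections", "start": 17.76, "duration": 5.04},
]

groups = group_entries(sample_entries)
for group in groups:
    print(f"{group['start']:>7.3f}s  {group['text']}")
```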

### using pytube

In [18]:
import os

from pytube import YouTube

In [23]:
youtube_url = "https://www.youtube.com/watch?v=ysLiABvVos8"

In [19]:
def rename_audio(file_path):
    # Define the new file name
    new_file_name = "audio_file" + os.path.splitext(file_path)[1]

    # Get the directory of the file
    file_directory = os.path.dirname(file_path)

    # Get the new file path
    new_file_path = os.path.join(file_directory, new_file_name)

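    # Rename the file (os.replace overwrites an existing target, whereas
    # os.rename raises FileExistsError on Windows if the target exists)
    os.replace(file_path, new_file_path)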

    print(f"Renamed '{file_path}' to '{new_file_path}'")

    return new_file_path

In [20]:
def download_audio(youtube_url):
    # Create a YouTube object with the URL
    yt = YouTube(youtube_url)

    # Define output directory
    output_path = "downloaded_audio"

    # Get the best audio stream
    audio_stream = yt.streams.get_audio_only()

    # Download the audio stream
    downloaded_file_path = audio_stream.download(output_path)

    # Rename the downloaded audio file
    renamed_file_path = rename_audio(downloaded_file_path)

    print(f"Downloaded '{audio_stream.title}' to '{renamed_file_path}' as '{audio_stream.mime_type}'.")

    return renamed_file_path

In [None]:
from pytube import YouTube
import os

# Alternative signature: the caller chooses the output directory and file name
def download_audio(youtube_url, output_path, file_name):
    # Create a YouTube object with the URL
    yt = YouTube(youtube_url)

    # Get the best audio stream
    audio_stream = yt.streams.get_audio_only()

    # Download the audio stream with the specified file name
    audio_file_path = os.path.join(output_path, file_name)
    audio_stream.download(output_path=output_path, filename=file_name)

    print(f"Downloaded '{audio_stream.title}' to '{audio_file_path}' as '{audio_stream.mime_type}'.")

    return audio_file_path

In [21]:
# Example usage
download_audio('https://www.youtube.com/watch?v=ysLiABvVos8')

Renamed 'e:\Github_projects\sarvamai-hiring-challenge\downloaded_audio\Campaigning for remaining phases of Lok Sabha elections intensifies.mp4' to 'e:\Github_projects\sarvamai-hiring-challenge\downloaded_audio\audio_file.mp4'
Downloaded 'Campaigning for remaining phases of Lok Sabha elections intensifies' to 'e:\Github_projects\sarvamai-hiring-challenge\downloaded_audio\audio_file.mp4' as 'audio/mp4'.


'e:\\Github_projects\\sarvamai-hiring-challenge\\downloaded_audio\\audio_file.mp4'

### .wav file

In [24]:
import os
import subprocess
from pytube import YouTube

def download_audio(youtube_url):
    # Create a YouTube object with the URL
    yt = YouTube(youtube_url)

    # Define output directory
    output_path = "downloaded_audio"

    # Get the best audio stream
    audio_stream = yt.streams.get_audio_only()

    # Download the audio stream
    downloaded_file_path = audio_stream.download(output_path)

    # Rename the downloaded audio file
    renamed_file_path = rename_audio(downloaded_file_path)

    print(f"Downloaded '{audio_stream.title}' to '{renamed_file_path}' as '{audio_stream.mime_type}'.")

    # Convert to .wav format
    wav_file_path = convert_to_wav(renamed_file_path)

    print(f"Converted '{renamed_file_path}' to '{wav_file_path}'.")

    return wav_file_path

def rename_audio(file_path):
    # Define the new file name
    new_file_name = "audio_file"

    # Get the directory of the file
    file_directory = os.path.dirname(file_path)

    # Get the file extension
    file_extension = os.path.splitext(file_path)[1]

    # Create a unique new file name
    count = 1
    while os.path.exists(os.path.join(file_directory, f"{new_file_name}_{count}{file_extension}")):
        count += 1

    new_file_name = f"{new_file_name}_{count}{file_extension}"
    # Get the new file path
    new_file_path = os.path.join(file_directory, new_file_name)

    # Rename the file
    os.rename(file_path, new_file_path)

    print(f"Renamed '{file_path}' to '{new_file_path}'")

    return new_file_path

def convert_to_wav(file_path):
    # Define output directory
    output_directory = os.path.dirname(file_path)

    # Define the new file name with .wav extension
    wav_file_name = os.path.splitext(os.path.basename(file_path))[0] + ".wav"

    # Define the output path for the converted .wav file
    wav_file_path = os.path.join(output_directory, wav_file_name)

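    # Run FFmpeg to convert to 16-bit PCM .wav at 44.1 kHz; "-y" overwrites an
    # existing output file and check=True raises CalledProcessError on failure
    subprocess.run(
        ["ffmpeg", "-y", "-i", file_path, "-acodec", "pcm_s16le", "-ar", "44100", wav_file_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )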

    return wav_file_path

# Example usage:
youtube_url = "https://www.youtube.com/watch?v=ysLiABvVos8"
wav_file_path = download_audio(youtube_url)

Renamed 'e:\Github_projects\sarvamai-hiring-challenge\downloaded_audio\Campaigning for remaining phases of Lok Sabha elections intensifies.mp4' to 'e:\Github_projects\sarvamai-hiring-challenge\downloaded_audio\audio_file_1.mp4'
Downloaded 'Campaigning for remaining phases of Lok Sabha elections intensifies' to 'e:\Github_projects\sarvamai-hiring-challenge\downloaded_audio\audio_file_1.mp4' as 'audio/mp4'.
Converted 'e:\Github_projects\sarvamai-hiring-challenge\downloaded_audio\audio_file_1.mp4' to 'e:\Github_projects\sarvamai-hiring-challenge\downloaded_audio\audio_file_1.wav'.
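
To confirm that the conversion produced the expected format, the resulting `.wav` file can be inspected with Python's standard `wave` module. The sketch below is self-contained: it writes one second of silence to a temporary file and reads its properties back. The helper name `describe_wav` is hypothetical; in the notebook it would be called on `wav_file_path` instead of the demo file.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return the sample rate, channel count and duration of a PCM .wav file."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav_file.getnchannels(),
            "duration_s": frames / rate,
        }


# Demo input: one second of 16-bit mono silence at 44.1 kHz
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit, matching pcm_s16le
    wav_file.setframerate(44100)
    wav_file.writeframes(b"\x00\x00" * 44100)

info = describe_wav(demo_path)
print(info)
```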


In [7]:
!ffmpeg -i E:/Github_projects/sarvamai-hiring-challenge/downloaded_audio/audio.mp4 -acodec pcm_s16le -ar 16000 ytaudio.wav

ffmpeg version N-115182-g0d9591841b-20240512 Copyright (c) 2000-2024 the FFmpeg developers
  built with gcc 13.2.0 (crosstool-NG 1.26.0.65_ecc5e41)
  configuration: --prefix=/ffbuild/prefix --pkg-config-flags=--static --pkg-config=pkg-config --cross-prefix=x86_64-w64-mingw32- --arch=x86_64 --target-os=mingw32 --enable-gpl --enable-version3 --disable-debug --disable-w32threads --enable-pthreads --enable-iconv --enable-libxml2 --enable-zlib --enable-libfreetype --enable-libfribidi --enable-gmp --enable-fontconfig --enable-libharfbuzz --enable-libvorbis --enable-opencl --disable-libpulse --enable-libvmaf --disable-libxcb --disable-xlib --enable-amf --enable-libaom --enable-libaribb24 --enable-avisynth --enable-chromaprint --enable-libdav1d --enable-libdavs2 --enable-libdvdread --enable-libdvdnav --disable-libfdk-aac --enable-ffnvcodec --enable-cuda-llvm --enable-frei0r --enable-libgme --enable-libkvazaar --enable-libaribcaption --enable-libass --enable-libbluray --enable-libjxl --enable-l
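
The command above resamples to 16 kHz, whereas `convert_to_wav` earlier used 44.1 kHz. Whisper models work on 16 kHz audio, so converting once at that rate is a reasonable choice for the transcription step. The sketch below builds the equivalent FFmpeg argument list in Python; the helper name `build_ffmpeg_command` is hypothetical, and the `-ac 1` mono downmix and `-y` overwrite flag are additions that the original command does not use.

```python
def build_ffmpeg_command(src, dst, sample_rate=16000, mono=True):
    """Build an FFmpeg argument list that converts src to 16-bit PCM .wav."""
    command = ["ffmpeg", "-y", "-i", src, "-acodec", "pcm_s16le", "-ar", str(sample_rate)]
    if mono:
        # Downmix to a single channel (an added assumption, not in the original command)
        command += ["-ac", "1"]
    command.append(dst)
    return command


command = build_ffmpeg_command("audio.mp4", "ytaudio.wav")
print(" ".join(command))
```

Passing the list to `subprocess.run(command, check=True)` would execute it and raise an error if FFmpeg fails.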

### audio to text

In [1]:
!pip install --upgrade pip
!pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to c:\users\fyzan\appdata\local\temp\pip-req-build-wngl1esl
  Resolved https://github.com/huggingface/transformers.git to commit 37543bad3c1589bfe469e76f896b7fd5e5b1d0e6
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Installing backend dependencies: started
  Installing backend dependencies: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting accelerate
  Downloading accelerate-0.30.1-py3-none-any.whl.metadata (18 kB)
Collecting datasets[audio]
  Downloadin

  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git 'C:\Users\fyzan\AppData\Local\Temp\pip-req-build-wngl1esl'


In [None]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

In [None]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

In [None]:
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

In [None]:
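# Load a sample clip from a public long-form LibriSpeech dataset
# (the `sample` variable is overwritten with the YouTube audio in the next cell)
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]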

In [None]:
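# Transcribe the 16 kHz .wav file produced by the FFmpeg command above
sample = "ytaudio.wav"
result = pipe(sample, return_timestamps=True)

# Each chunk holds a (start, end) timestamp in seconds and the transcribed text
print(result["chunks"])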
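
With `return_timestamps=True`, the pipeline returns `result["chunks"]` as a list of dictionaries, each with a `timestamp` tuple `(start, end)` in seconds and a `text` string. The sketch below renders such chunks as readable time-aligned lines. The helper name `format_chunks` is hypothetical, and the example chunks are made-up placeholders in the expected shape, not real output from this video. The end time is treated as possibly `None`, which can happen for the final chunk.

```python
def format_chunks(chunks):
    """Render pipeline chunks as '[start - end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        # The end timestamp may be missing for the last chunk
        end_label = f"{end:.2f}" if end is not None else "?"
        lines.append(f"[{start:.2f} - {end_label}] {chunk['text'].strip()}")
    return lines


# Hypothetical chunks in the shape the pipeline returns; not real output
example_chunks = [
    {"timestamp": (0.0, 4.5), "text": " Good morning."},
    {"timestamp": (4.5, None), "text": " These are the headlines."},
]

for line in format_chunks(example_chunks):
    print(line)
```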