Semantic Chunking of a YouTube Video 
YouTube video link: https://youtu.be/Sby1uJ_NFIY?si=QCGzzQptwQgW78o9


**Choice of Open source model**

To achieve high transcription accuracy and easy integration with other applications, we chose **OpenAI's Whisper speech recognition model**. The task has two objectives: (1) precise, time-aligned transcription of the audio, and (2) semantic chunking, with each chunk shorter than 15 seconds. Whisper suits both: it transcribes the complete audio and returns the text as segments, each with its own start and end timestamp.

In contrast, other open-source options such as Mozilla DeepSpeech and Kaldi typically require the audio to be split sequentially into small, equal-length segments that are then processed one by one to obtain comparably accurate timestamps. That approach is time-consuming and data-intensive, especially for large audio files.
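As a rough illustration of the time-aligned output described above, the sketch below walks over a dictionary shaped like the result of Whisper's `transcribe` call: a `segments` list whose items carry `start`, `end`, and `text`. The helper name `format_segments` and the sample values are made up for illustration and are not taken from the video.

```python
def format_segments(result):
    """Return one '[start - end] text' line per segment of a Whisper-style result."""
    lines = []
    for segment in result.get("segments", []):
        start = segment["start"]
        end = segment["end"]
        text = segment["text"].strip()
        lines.append(f"[{start:.2f}s - {end:.2f}s] {text}")
    return lines


# Hypothetical sample that only mimics the structure of Whisper's output.
sample_result = {
    "text": " Hello there. Welcome back.",
    "segments": [
        {"start": 0.0, "end": 1.8, "text": " Hello there."},
        {"start": 1.8, "end": 3.5, "text": " Welcome back."},
    ],
}

for line in format_segments(sample_result):
    print(line)
```

Because every segment already carries its own timestamps, later chunking can work on this list directly instead of re-splitting the audio.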

Setting up prerequisites for the Whisper model

In [1]:
pip install -U openai-whisper

Note: you may need to restart the kernel to use updated packages.




In [2]:
pip install git+https://github.com/openai/whisper.git 

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to c:\users\tanay\appdata\local\temp\pip-req-build-c5diuqie
  Resolved https://github.com/openai/whisper.git to commit ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Note: you may need to restart the kernel to use updated packages.


  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git 'C:\Users\TANAY\AppData\Local\Temp\pip-req-build-c5diuqie'


Whisper's setup instructions note that Rust may also need to be installed and set up on the system. The setuptools-rust package is installed here for that purpose.

In [3]:
pip install setuptools-rust

Note: you may need to restart the kernel to use updated packages.




**Making predictions**

Whisper comes in five model sizes. For this task, the base model is used, which has 74 million parameters and requires about 1 GB of VRAM. In comparison, the large model has 1.55 billion parameters and requires about 10 GB of VRAM; the base model runs roughly 16 times faster. It works well for videos where the voice and pronunciation are clear.
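The trade-off between model size and memory can be sketched as a small lookup. The figures for the base and large models come from the text above; the figures for tiny, small, and medium are assumptions taken from the Whisper README and should be treated as indicative only. The helper `pick_model` is a hypothetical name and is not part of the Whisper API.

```python
# (name, parameters in millions, approximate VRAM in GB).
# base and large match the text above; the other rows are assumed
# from the Whisper README and may change between releases.
WHISPER_MODELS = [
    ("tiny", 39, 1),
    ("base", 74, 1),
    ("small", 244, 2),
    ("medium", 769, 5),
    ("large", 1550, 10),
]


def pick_model(vram_gb):
    """Return the largest listed model whose approximate VRAM need fits vram_gb."""
    fitting = [name for name, _params, vram in WHISPER_MODELS if vram <= vram_gb]
    if not fitting:
        raise ValueError(f"No listed model fits in {vram_gb} GB of VRAM")
    return fitting[-1]


print(pick_model(1))   # largest model that fits in 1 GB
print(pick_model(16))  # largest model that fits in 16 GB
```

The returned name could then be passed to `whisper.load_model`, as in the next cell.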

In [5]:
import whisper
model = whisper.load_model("base")

In [6]:
result = model.transcribe("T:/SarvamAI/Task1_ytVideo/audio.mp3")
print(result["text"])    # Getting the full transcription



 Congratulations to you Mr. Raghavan for that. Thank you so much for joining us. Over to you. Hi everybody. How are you? Okay I am not hearing this at all. It's like a post lunch energy downer or something. Let's hear it. Are you guys awake? All right you better be because we have a superstar guest here. You heard the 41 million dollars and I didn't hear honestly anything she said after that. So we're going to ask for about 40 million dollars from him by the end of this conversation. But let's get started. I want to introduce Vivek and Pratius, she's co-founder who's not here. We wanted to start with a playing a video of what OpenHathe does. I encourage all of you to go to the website, www.severalm.ai and check it out. But let me start by introducing Vivek. Vivek is a dear friend and he is very, very modest. One of the most modest guys that I know. But his personal journey Vivek, you've got a PhD from Carnegie Mellon. You've sat in and sold the company to Magma. Vivek and I moved back 

# Semantic chunking of the audio file

In [8]:
# Transcribe the audio and group Whisper's segments into fixed-length time windows
def transcribe_audio_to_chunks(audio_path, chunk_duration=10):
    result = model.transcribe(audio_path)
    segments = result["segments"]
    total_duration = segments[-1]["end"] if segments else 0
    chunks = []

    # chunk_id is the window's start time in seconds; it is converted to an index below
    for chunk_id in range(0, int(total_duration), chunk_duration):
        chunk_start = chunk_id
        chunk_end = min(chunk_id + chunk_duration, total_duration)
        # Keep only the segments that lie entirely inside this window.
        # Note: a segment that crosses a window boundary is left out of every chunk.
        chunk_text = " ".join(segment["text"].strip() for segment in segments
                              if segment["start"] >= chunk_start and segment["end"] <= chunk_end)
        
        # Converting text chunks into given format
        chunks.append({
            "chunk_id": chunk_id // chunk_duration,
            "chunk_length": chunk_end - chunk_start,
            "text": chunk_text,
            "start_time": chunk_start,
            "end_time": chunk_end
        })

    return chunks
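Because the function above uses fixed time windows, a sentence that straddles a window boundary is not assigned to any chunk. One possible alternative, sketched here under stated assumptions, is to merge whole Whisper segments greedily until the next one would push the chunk past the limit. The name `merge_segments_into_chunks` and the sample segments are hypothetical. A single segment longer than `max_duration` would still become its own oversized chunk in this sketch.

```python
def merge_segments_into_chunks(segments, max_duration=15.0):
    """Greedily merge consecutive segments into chunks of at most max_duration seconds."""
    chunks = []
    current = None
    for segment in segments:
        start, end = segment["start"], segment["end"]
        text = segment["text"].strip()
        if current is not None and end - current["start_time"] <= max_duration:
            # The segment still fits: extend the current chunk.
            current["text"] += " " + text
            current["end_time"] = end
        else:
            # Close the current chunk (if any) and start a new one.
            if current is not None:
                chunks.append(current)
            current = {"text": text, "start_time": start, "end_time": end}
    if current is not None:
        chunks.append(current)

    # Add the id and length fields used by the fixed-window version.
    for chunk_id, chunk in enumerate(chunks):
        chunk["chunk_id"] = chunk_id
        chunk["chunk_length"] = round(chunk["end_time"] - chunk["start_time"], 2)
    return chunks


# Hypothetical segments, shaped like Whisper's result["segments"].
sample_segments = [
    {"start": 0.0, "end": 6.0, "text": " A."},
    {"start": 6.0, "end": 12.0, "text": " B."},
    {"start": 12.0, "end": 20.0, "text": " C."},
    {"start": 20.0, "end": 26.0, "text": " D."},
]
for chunk in merge_segments_into_chunks(sample_segments):
    print(chunk)
```

Chunk boundaries then always coincide with segment boundaries, so no transcribed text is lost, at the cost of chunks with uneven lengths.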

In [9]:
transcribe_audio_to_chunks("T:/SarvamAI/Task1_ytVideo/audio.mp3", 15)

[{'chunk_id': 0,
  'chunk_length': 15,
  'text': 'Congratulations to you Mr. Raghavan for that. Thank you so much for joining us. Over to you. Hi everybody. How are you?',
  'start_time': 0,
  'end_time': 15},
 {'chunk_id': 1,
  'chunk_length': 15,
  'text': "Let's hear it. Are you guys awake? All right you better be because we have a superstar guest here.",
  'start_time': 15,
  'end_time': 30},
 {'chunk_id': 2,
  'chunk_length': 15,
  'text': "So we're going to ask for about 40 million dollars from him by the end of this conversation. But let's get started. I want to introduce Vivek and Pratius, she's co-founder who's not here.",
  'start_time': 30,
  'end_time': 45},
 {'chunk_id': 3,
  'chunk_length': 15,
  'text': 'We wanted to start with a playing a video of what OpenHathe does. I encourage all of you to go to the website, www.severalm.ai and check it out. But let me start by introducing Vivek. Vivek is a',
  'start_time': 45,
  'end_time': 60},
 {'chunk_id': 4,
  'chunk_length': 

Creating a basic Gradio Interface for the application

In [10]:
!pip install gradio pytube moviepy





In [11]:
import gradio as gr
from pytube import YouTube 
# moviepy 1.x import path; moviepy 2.x exposes AudioFileClip from the top-level moviepy package
from moviepy.editor import AudioFileClip

In [12]:
# This cell defines all the functions used by the Gradio interface

# Download audio from YouTube in order to perform transcription on it
def download_audio_from_youtube(youtube_url):
    yt = YouTube(youtube_url)  # create the YouTube object for the given URL
    audio_stream = yt.streams.filter(only_audio=True).first()  # select the first audio-only stream
    audio_file_path = audio_stream.download(filename="youtube_audio.mp4")
    audio_clip = AudioFileClip(audio_file_path)
    audio_clip.write_audiofile("youtube_audio.wav", codec='pcm_s16le')
    return "youtube_audio.wav"


# Download video from YouTube (to display the actual video in the Gradio interface)
def download_video_from_youtube(youtube_url):
    yt = YouTube(youtube_url)
    video_stream = yt.streams.filter(progressive=True, file_extension="mp4").first()
    video_file_path = video_stream.download(filename="youtube_video.mp4")
    return video_file_path

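# Callback for the Gradio interface: download the media, transcribe it, and format the chunks
def process_youtube_link(youtube_url):
    import html, json  # standard library; used to embed the chunk data safely

    # Download the audio (for transcription) and the video (for playback)
    audio_path = download_audio_from_youtube(youtube_url)
    video_path = download_video_from_youtube(youtube_url)
    # Split the transcription into chunks with transcribe_audio_to_chunks, defined above
    chunks = transcribe_audio_to_chunks(audio_path)

    # Prepare the chunks for display, one <span> per chunk
    formatted_chunks = "".join([
        f"<span id='chunk-{chunk['chunk_id']}'>{chunk['text']} (Start: {chunk['start_time']}s, End: {chunk['end_time']}s)</span><br><br>"
        for chunk in chunks
    ])

    # Serialize the chunks as JSON inside an element with id "chunks-data",
    # which is the element the JavaScript below looks up and parses
    chunks_data = f"<div id='chunks-data'>{html.escape(json.dumps(chunks))}</div>"

    return video_path, formatted_chunks, chunks_data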
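The display step can also be written as a standalone helper that is easy to test without downloading anything. The sketch below is one possible variant, not the notebook's own method: it escapes the transcribed text before placing it in HTML and shows timestamps as MM:SS. The names `format_timestamp` and `format_chunks_as_html` are hypothetical, and the sample chunk is invented.

```python
import html


def format_timestamp(seconds):
    """Format a number of seconds as MM:SS (fractions of a second are dropped)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_chunks_as_html(chunks):
    """Render chunks as <span> elements whose ids follow the 'chunk-<id>' pattern."""
    parts = []
    for chunk in chunks:
        chunk_id = chunk["chunk_id"]
        text = html.escape(chunk["text"])  # avoid injecting markup from the transcript
        start = format_timestamp(chunk["start_time"])
        end = format_timestamp(chunk["end_time"])
        parts.append(f"<span id='chunk-{chunk_id}'>{text} ({start} - {end})</span>")
    return "<br><br>".join(parts)


# Invented sample chunk for illustration.
sample_chunks = [{"chunk_id": 0, "text": "a < b", "start_time": 60, "end_time": 75}]
print(format_chunks_as_html(sample_chunks))
```

Keeping the `chunk-<id>` pattern means the highlighting script can still find each span by id.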

In [13]:
# JavaScript code for syncing text highlighting with video playback
js_code = """
// "timeupdate" does not bubble, so it is captured at the document level;
// this also covers a <video> element that is created after the script runs.
document.addEventListener("timeupdate", (event) => {
    const videoPlayer = event.target;
    if (!(videoPlayer instanceof HTMLVideoElement)) return;

    const chunksData = document.getElementById("chunks-data");
    if (!chunksData) return;

    let chunks;
    try {
        chunks = JSON.parse(chunksData.textContent);
    } catch (error) {
        return;  // chunk data is missing or is not valid JSON
    }

    const currentTime = videoPlayer.currentTime;
    chunks.forEach(chunk => {
        const chunkElement = document.getElementById(`chunk-${chunk.chunk_id}`);
        if (!chunkElement) return;
        // Highlight the chunk that contains the current playback time
        const isActive = currentTime >= chunk.start_time && currentTime < chunk.end_time;
        chunkElement.style.backgroundColor = isActive ? "yellow" : "transparent";
    });
}, true);
"""


# Create and launch the Gradio interface: define the components and wire the submit button to its callback.
with gr.Blocks() as demo:
    youtube_link = gr.Textbox(label="Enter YouTube URL")
    submit_button = gr.Button("Submit")
    video_output = gr.Video(label="Downloaded Video")
    transcription_output = gr.HTML(label="Transcribed Chunks")
    hidden_chunks_data = gr.HTML(visible=False)

    submit_button.click(fn=process_youtube_link, inputs=youtube_link,
                        outputs=[video_output, transcription_output, hidden_chunks_data])

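    # Add JavaScript to the interface.
    # Note: browsers do not execute <script> tags inserted through innerHTML, so depending on
    # how the installed Gradio version renders gr.HTML, this script may not run.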
    gr.HTML(value=f"""
    <script>{js_code}</script>
    """)

demo.launch()

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
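The highlighting rule used by the JavaScript in cell 13 (find the chunk whose time range contains the current playback position) can be checked in Python without a browser. The sketch below assumes the chunks are sorted by `start_time` and treats each chunk as a half-open interval, so that a boundary time belongs to exactly one chunk; both are choices of this sketch. The name `find_active_chunk` is hypothetical.

```python
from bisect import bisect_right


def find_active_chunk(chunks, current_time):
    """Return the chunk with start_time <= current_time < end_time, or None.

    Assumes chunks are sorted by start_time and do not overlap.
    """
    starts = [chunk["start_time"] for chunk in chunks]
    index = bisect_right(starts, current_time) - 1  # last chunk starting at or before current_time
    if index >= 0 and current_time < chunks[index]["end_time"]:
        return chunks[index]
    return None


# Two chunks shaped like the output of transcribe_audio_to_chunks (text omitted).
sample_chunks = [
    {"chunk_id": 0, "start_time": 0, "end_time": 15},
    {"chunk_id": 1, "start_time": 15, "end_time": 30},
]
for t in (0, 14.9, 15, 29.9, 30):
    active = find_active_chunk(sample_chunks, t)
    print(t, active["chunk_id"] if active else None)
```

The binary search is not needed for a handful of chunks, but it keeps the lookup cheap for long videos where the check runs on every playback update.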


