# Task: Semantic Chunking of a YouTube Video

## Problem Statement:

The objective is to extract high-quality, meaningful (semantic) segments from the specified YouTube video: Watch Video.



Suggested workflow:

**1. Download Video and Extract Audio:** Download the video and separate the audio component. <br>
**2. Transcription of Audio:** Utilize an open-source Speech-to-Text model to transcribe the audio. Provide an explanation of the chosen model and any techniques used to enhance the quality of the transcription. <br>
**3. Time-Align Transcript with Audio:** Describe the methodology and steps for aligning the transcript with the audio. <br>
**4. Semantic Chunking of Data:** Slice the data into audio-text pairs, using both semantic information from the text and voice activity information from the audio, with each audio chunk shorter than 15 seconds. Explain the logic used for semantic chunking and discuss the strengths and weaknesses of your approach. <br>
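
To make the constraint in step 4 concrete, here is a minimal sketch of one possible chunking rule, not a reference solution. It assumes word-level timestamps are already available from step 3 in a hypothetical format (`{"word", "start", "end"}`), uses sentence-ending punctuation as a stand-in for semantic boundaries, and uses a gap between words (the assumed `PAUSE_SECONDS` threshold) as a stand-in for real voice activity detection. The function name `greedy_chunks` is illustrative only.

```python
from typing import Dict, List

MAX_CHUNK_SECONDS = 15.0  # limit stated in the task
PAUSE_SECONDS = 0.5       # assumed silence threshold standing in for a real VAD


def greedy_chunks(words: List[Dict], max_len: float = MAX_CHUNK_SECONDS,
                  pause: float = PAUSE_SECONDS) -> List[Dict]:
    """Group time-sorted words into chunks shorter than max_len seconds.

    A chunk is closed after a word that ends a sentence, after a word that is
    followed by a long pause, or early if adding the next word would make the
    chunk reach max_len. A single word longer than max_len is not handled.
    """
    groups, current = [], []
    for i, w in enumerate(words):
        # Close the running chunk first if this word would push it to the limit.
        if current and w["end"] - current[0]["start"] >= max_len:
            groups.append(current)
            current = []
        current.append(w)

        nxt = words[i + 1] if i + 1 < len(words) else None
        ends_sentence = w["word"].rstrip().endswith((".", "?", "!"))
        long_pause = nxt is not None and nxt["start"] - w["end"] >= pause
        if ends_sentence or long_pause or nxt is None:
            groups.append(current)
            current = []

    # Emit the groups in the output format requested by the task.
    return [
        {
            "chunk_id": idx,
            "chunk_length": round(g[-1]["end"] - g[0]["start"], 3),
            "text": " ".join(x["word"] for x in g),
            "start_time": g[0]["start"],
            "end_time": g[-1]["end"],
        }
        for idx, g in enumerate(groups, start=1)
    ]


example_words = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "world.", "start": 0.5, "end": 1.0},
    {"word": "Next", "start": 2.0, "end": 2.3},
    {"word": "one", "start": 2.4, "end": 2.7},
]
for chunk in greedy_chunks(example_words):
    print(chunk)
```

Cutting only at punctuation and pauses keeps chunk edges away from the middle of words, which matters for the timestamp criterion, but it relies on the transcription model producing punctuation; an embedding-based similarity check between adjacent sentences would be a natural refinement.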

Judgement Criteria:

Precision-Oriented Evaluation: The evaluation focuses on precision rather than recall. Higher scores are achieved by reporting fewer but more accurate segments rather than a larger number of segments with inaccuracies. Segment accuracy is determined by:

- Transcription Quality: Accuracy of the text transcription for each audio chunk.
- Segment Quality: Semantic richness of the text segments.
- Timestamp Accuracy: Precision of the start and end times for each segment. Avoid audio cuts at the start or end of a segment.
- Detailed Explanations: Provide reasoning behind each step in the process.

Generalization: Discuss the general applicability of your approach, potential failure modes on different types of videos, and adaptation strategies for other languages.

[Bonus-1] Gradio App Interface: Wrap your code in a Gradio app that takes a YouTube link as input and displays the output in a text box.

[Bonus-2] Utilizing Ground-Truth Transcripts: Propose a method to improve the quality of your transcript using a ground-truth transcript provided as a single text string. Explain your hypothesis for this approach. Note that a code snippet isn't required for this question.

As an example: for the audio extracted from yt-link, how can we leverage the transcript scraped from here to improve the overall transcription quality of the segments?

Submission Format:

Your submission should be a well-documented Jupyter notebook capable of reproducing your results. The notebook should automatically install all required dependencies and output the results in the specified format.

Output Format: Provide the results as a list of dictionaries, each representing a semantic chunk. Each dictionary should include:

- chunk_id: A unique identifier for the chunk (integer).
- chunk_length: The duration of the chunk in seconds (float).
- text: The transcribed text of the chunk (string).
- start_time: The start time of the chunk within the video, in seconds (float).
- end_time: The end time of the chunk within the video, in seconds (float).

In [1]:
sample_output_list = [
    {
        "chunk_id": 1,
        "chunk_length": 14.5,
        "text": "Here is an example of a semantic chunk from the video.",
        "start_time": 0.0,
        "end_time": 14.5,
    },
    # Additional chunks follow...
]
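
Before reporting results, it may help to check a chunk list against the output format and the 15-second limit. The helper below is a hypothetical sketch (the name `validate_chunks` and the tolerance value are assumptions, not part of the task): it verifies that the required keys are present, that each `chunk_id` is unique, that `chunk_length` matches `end_time - start_time`, and that every chunk is shorter than the limit.

```python
import math

REQUIRED_KEYS = {"chunk_id", "chunk_length", "text", "start_time", "end_time"}


def validate_chunks(chunks, max_len=15.0, tol=1e-6):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    seen_ids = set()
    for pos, chunk in enumerate(chunks):
        missing = REQUIRED_KEYS - set(chunk)
        if missing:
            problems.append(f"chunk at index {pos}: missing keys {sorted(missing)}")
            continue  # the remaining checks need every key

        cid = chunk["chunk_id"]
        if cid in seen_ids:
            problems.append(f"chunk {cid}: duplicate chunk_id")
        seen_ids.add(cid)

        duration = chunk["end_time"] - chunk["start_time"]
        if duration <= 0:
            problems.append(f"chunk {cid}: end_time must be greater than start_time")
        if not math.isclose(chunk["chunk_length"], duration, abs_tol=tol):
            problems.append(f"chunk {cid}: chunk_length does not match end_time - start_time")
        if chunk["chunk_length"] >= max_len:
            problems.append(f"chunk {cid}: chunk_length is not below {max_len} seconds")
    return problems


example_chunks = [
    {
        "chunk_id": 1,
        "chunk_length": 14.5,
        "text": "Here is an example of a semantic chunk from the video.",
        "start_time": 0.0,
        "end_time": 14.5,
    },
]
print(validate_chunks(example_chunks))
```

Returning a list of messages instead of raising on the first failure makes it easier to see every offending chunk in one pass.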

Ensure that your code is clear, well-commented, and easy to follow, with explanations for each major step and decision in the process. The notebook should be able to install all the dependencies automatically and generate the reported output when run.

## Workflow Steps:
1. Download Video and Extract Audio
2. Transcription of Audio
3. Time-Align Transcript with Audio
4. Semantic Chunking of Data
5. Evaluation and Judgement Criteria
6. Generalization
7. Gradio App Interface

## 1. Download Video and Extract Audio

In [None]:
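import youtube_dl  # yt_dlp is a maintained fork that exposes the same YoutubeDL interface
from moviepy.editor import AudioFileClip

def download_video_extract_audio(youtube_url, output_audio_file):
    # Download the best available audio stream and record where it was saved
    ydl_opts = {'format': 'bestaudio/best', 'outtmpl': '%(id)s.%(ext)s'}
    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(youtube_url, download=True)
        downloaded_file = ydl.prepare_filename(info)

    # Extract audio: open the downloaded file (not the URL) and write it to the output path
    audio = AudioFileClip(downloaded_file)
    audio.write_audiofile(output_audio_file)
    audio.close()

# Replace the placeholder URL with the actual video link before running
download_video_extract_audio('https://youtube.com/watch?v=your_video_id', 'output_audio.wav')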
