# Task: Semantic Chunking of a YouTube Video

## Workflow Steps:
1. Download Video and Extract Audio
2. Transcription of Audio
3. Time-Align Transcript with Audio
4. Semantic Chunking of Data
5. Evaluation and Judgement Criteria
6. Generalization
7. Gradio App Interface

## 1. Download Video and Extract Audio

**Explanation of the Requirements:** <br>
1. youtube-dl: This library is used for downloading videos from YouTube.<br>
2. moviepy: Useful for video and audio processing, such as extracting audio tracks from video files.<br>
3. torch and transformers: These libraries are needed if you are using PyTorch-based models like Facebook's Wav2Vec for speech-to-text transcription.<br>
4. pandas and numpy: Essential for data manipulation and numerical operations.<br>
5. scipy and librosa: Used for more detailed audio analysis and processing tasks.<br>
6. soundfile: For reading and writing audio files.<br>
7. nltk and spacy: Natural Language Processing libraries useful for text manipulation and processing, especially for the semantic chunking.<br>
8. gradio: Allows you to create a web interface to easily interact with your models and processes.<br>
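
As a quick sanity check before running the rest of the notebook, the sketch below reports which of the listed packages can be imported in the current environment. The mapping from pip names to import names is an assumption (for example, `youtube-dl` is imported as `youtube_dl`), and `missing_packages` is a hypothetical helper, not part of any library.

```python
from importlib.util import find_spec

# Assumed mapping from pip package name to top-level import name.
REQUIRED = {
    "youtube-dl": "youtube_dl",
    "moviepy": "moviepy",
    "torch": "torch",
    "transformers": "transformers",
    "pandas": "pandas",
    "numpy": "numpy",
    "scipy": "scipy",
    "librosa": "librosa",
    "soundfile": "soundfile",
    "nltk": "nltk",
    "spacy": "spacy",
    "gradio": "gradio",
}

def missing_packages(required=REQUIRED):
    """Return the pip names whose import name cannot be found."""
    return sorted(
        pip_name
        for pip_name, module in required.items()
        if find_spec(module) is None
    )

print("Missing packages:", missing_packages() or "none")
```

Anything reported as missing can then be added to `requirements.txt` before the install cell is run.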

In [7]:
%pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting pytube (from -r requirements.txt (line 3))
  Downloading pytube-15.0.0-py3-none-any.whl.metadata (5.0 kB)
Downloading pytube-15.0.0-py3-none-any.whl (57 kB)
Installing collected packages: pytube
Successfully installed pytube-15.0.0
Note: you may need to restart the kernel to use updated packages.


### using youtube_transcript_api

In [6]:
from youtube_transcript_api import YouTubeTranscriptApi

video_id = 'ysLiABvVos8'
YouTubeTranscriptApi.get_transcript(video_id)

[{'text': 'presents Morning', 'start': 0.24, 'duration': 7.97},
 {'text': '[Music]', 'start': 2.94, 'duration': 5.27},
 {'text': 'News good morning I am SRA maaba the',
  'start': 10.719,
  'duration': 7.041},
 {'text': 'headlines campaigning for remaining',
  'start': 15.64,
  'duration': 4.719},
 {'text': 'phases of L SAA elections', 'start': 17.76, 'duration': 5.04},
 {'text': 'intensifies president dropadi murmu to',
  'start': 20.359,
  'duration': 5.121},
 {'text': 'confir Padma awards at second investure',
  'start': 22.8,
  'duration': 6.44},
 {'text': 'ceremony in raspati bavan today try',
  'start': 25.48,
  'duration': 6.08},
 {'text': 'service conference parivartan chintan',
  'start': 29.24,
  'duration': 4.159},
 {'text': 'chaired by Chief of Defense staff',
  'start': 31.56,
  'duration': 4.88},
 {'text': 'General Anil Chan to be held in New',
  'start': 33.399,
  'duration': 6.361},
 {'text': 'Delhi unrest erupts in Pakistan occupied',
  'start': 36.44,
  'duration': 5.
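
The entries above are overlapping caption windows (each `start + duration` runs past the next entry's `start`), and some of them, such as `[Music]`, carry no speech. A minimal sketch of a cleaning pass, assuming the list-of-dicts shape shown in the output; `clean_transcript` is a hypothetical helper and is not part of `youtube_transcript_api`:

```python
import re

def clean_transcript(entries):
    """Drop bracketed non-speech cues and add an explicit end time to each entry."""
    cleaned = []
    for entry in entries:
        text = entry["text"].strip()
        # Skip empty entries and cues such as "[Music]" that carry no spoken content.
        if not text or re.fullmatch(r"\[.*\]", text):
            continue
        cleaned.append({
            "text": text,
            "start": entry["start"],
            "end": round(entry["start"] + entry["duration"], 3),
        })
    return cleaned

# First three entries copied from the output above.
sample = [
    {"text": "presents Morning", "start": 0.24, "duration": 7.97},
    {"text": "[Music]", "start": 2.94, "duration": 5.27},
    {"text": "News good morning I am SRA maaba the", "start": 10.719, "duration": 7.041},
]
for row in clean_transcript(sample):
    print(row)
```

Keeping an explicit `end` makes the later time-alignment and chunking steps easier to express, since each chunk can take the `start` of its first entry and the `end` of its last.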

In [4]:
from youtube_transcript_api import YouTubeTranscriptApi

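from urllib.parse import urlparse, parse_qs

def get_video_id_from_url(url):
    # Extract the video ID from a watch URL, ignoring extra query
    # parameters such as &t= or &list= that a plain split would keep
    parsed = urlparse(url)
    if parsed.hostname and parsed.hostname.endswith("youtube.com") and parsed.path == "/watch":
        video_ids = parse_qs(parsed.query).get("v")
        if video_ids:
            return video_ids[0]
    raise ValueError("Invalid YouTube URL")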

def get_transcript(youtube_url):
    try:
        # Extract the video ID from the URL
        video_id = get_video_id_from_url(youtube_url)

        # Fetch the available transcripts
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        
        # Print each entry of the transcript
        for entry in transcript:
            print(f"Start: {entry['start']} - Duration: {entry['duration']}s")
            print(entry['text'] + '\n')
    
    except Exception as e:
        print(f"An error occurred: {e}")

# Example usage
get_transcript('https://www.youtube.com/watch?v=ysLiABvVos8')  # Replace with the actual YouTube URL


Start: 0.24 - Duration: 7.97s
presents Morning

Start: 2.94 - Duration: 5.27s
[Music]

Start: 10.719 - Duration: 7.041s
News good morning I am SRA maaba the

Start: 15.64 - Duration: 4.719s
headlines campaigning for remaining

Start: 17.76 - Duration: 5.04s
phases of L SAA elections

Start: 20.359 - Duration: 5.121s
intensifies president dropadi murmu to

Start: 22.8 - Duration: 6.44s
confir Padma awards at second investure

Start: 25.48 - Duration: 6.08s
ceremony in raspati bavan today try

Start: 29.24 - Duration: 4.159s
service conference parivartan chintan

Start: 31.56 - Duration: 4.88s
chaired by Chief of Defense staff

Start: 33.399 - Duration: 6.361s
General Anil Chan to be held in New

Start: 36.44 - Duration: 5.84s
Delhi unrest erupts in Pakistan occupied

Start: 39.76 - Duration: 4.4s
Kashmir as Pakistan Security Forces

Start: 42.28 - Duration: 5.16s
deployed ahead of

Start: 44.16 - Duration: 5.48s
protests IMD forast fresh spell of heat

Start: 47.44 - Duration: 5.72s
wav
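
Later cells in this notebook use a `/live/` URL for the same video, which a watch-only parser rejects. The sketch below is one way to accept several common URL shapes; the set of shapes handled (`watch`, `youtu.be`, `live`, `shorts`, `embed`) is an assumption about what is worth supporting, and `extract_video_id` is a hypothetical name.

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(url):
    """Return the video ID from common YouTube URL shapes, or raise ValueError."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Normalize "www." and "m." variants to the bare host name.
    for prefix in ("www.", "m."):
        if host.startswith(prefix):
            host = host[len(prefix):]
    parts = [p for p in parsed.path.split("/") if p]

    if host == "youtu.be" and parts:
        return parts[0]
    if host == "youtube.com":
        if parsed.path == "/watch":
            ids = parse_qs(parsed.query).get("v")
            if ids:
                return ids[0]
        elif len(parts) == 2 and parts[0] in ("live", "shorts", "embed"):
            return parts[1]
    raise ValueError(f"Unrecognized YouTube URL: {url}")

print(extract_video_id("https://www.youtube.com/watch?v=ysLiABvVos8"))
print(extract_video_id("https://www.youtube.com/live/ysLiABvVos8?si=7o4BF8DuZg91_i1t"))
```

Both calls resolve to the same ID, so the transcript and download steps can share one parsing function regardless of which link form was pasted.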

### using pytube

In [8]:
from pytube import YouTube

In [9]:
def download_audio(youtube_url, output_path):
    # Create a YouTube object with the URL
    yt = YouTube(youtube_url)

    # Get the best audio stream
    audio_stream = yt.streams.get_audio_only()

    # Download the audio stream
    audio_stream.download(output_path)

    print(f"Downloaded '{audio_stream.title}' to '{output_path}' as '{audio_stream.mime_type}'.")


In [10]:
# Example usage
download_audio('https://www.youtube.com/watch?v=ysLiABvVos8', 'downloaded_audio')

Downloaded 'Campaigning for remaining phases of Lok Sabha elections intensifies' to 'downloaded_audio' as 'audio/mp4'.


In [7]:
!ffmpeg -version

'ffmpeg' is not recognized as an internal or external command,
operable program or batch file.


In [17]:
!ffmpeg -i "downloaded_audio\Campaigning for remaining phases of Lok Sabha elections intensifies.mp4" -acodec pcm_s16le -ar 16000 ytaudio.wav

'ffmpeg' is not recognized as an internal or external command,
operable program or batch file.
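
Both cells above fail because no `ffmpeg` executable is on `PATH`. A minimal sketch that checks for the binary first and builds the same conversion as an argument list, which avoids shell quoting problems with file names that contain spaces. `build_ffmpeg_command` and `ffmpeg_available` are hypothetical helpers, and the input path is a placeholder.

```python
import shutil

def build_ffmpeg_command(input_path, output_path, sample_rate=16000):
    """Build the argument list for converting audio to 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-i", input_path,          # passed as one argument, so spaces need no quoting
        "-acodec", "pcm_s16le",    # 16-bit little-endian PCM
        "-ar", str(sample_rate),   # output sample rate in Hz
        output_path,
    ]

def ffmpeg_available():
    """Return True if an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None

cmd = build_ffmpeg_command("downloaded_audio/input.mp4", "ytaudio.wav")
if ffmpeg_available():
    print("Would run:", cmd)  # e.g. subprocess.run(cmd, check=True)
else:
    print("ffmpeg not found on PATH; install it before converting.")
```

The list form can be handed directly to `subprocess.run(cmd, check=True)` once ffmpeg is installed.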


### using youtube-dl and yt-dlp

In [None]:
%pip install git+https://github.com/ytdl-org/youtube-dl.git

In [10]:
import yt_dlp
from moviepy.editor import AudioFileClip  # only AudioFileClip is used below

def download_video_extract_audio(youtube_url, output_audio_file):
    # Options for downloading best audio
    ydl_opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'wav',  # or 'mp3' depending on your needs
            'preferredquality': '192',
        }],
        'outtmpl': 'downloaded_audio.%(ext)s',  # Template for the output filename
    }

    # Download audio using yt-dlp
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([youtube_url])

    # Load the downloaded audio file with moviepy (change the filename accordingly if needed)
    audio_clip = AudioFileClip('downloaded_audio.wav')
    
    # Write the audio to the desired output file
    audio_clip.write_audiofile(output_audio_file)

# Example usage
download_video_extract_audio('https://youtube.com/watch?v=your_video_id', 'output_audio.wav')  # placeholder ID: replace with a real video ID before running

[youtube] Extracting URL: https://youtube.com/watch?v=your_video_id
[youtube] your_video_: Downloading webpage
[youtube] your_video_: Downloading ios player API JSON
[youtube] your_video_: Downloading android player API JSON
[youtube] your_video_: Downloading iframe API JS
[youtube] your_video_: Downloading web player API JSON


ERROR: [youtube] your_video_: Failed to extract any player response; please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using  yt-dlp -U


DownloadError: ERROR: [youtube] your_video_: Failed to extract any player response; please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using  yt-dlp -U
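
The failure above comes from the placeholder `your_video_id`, which the log shows truncated to `your_video_`. A cheap guard is to check the ID's shape before calling the downloader. The sketch below assumes that video IDs are 11 characters drawn from letters, digits, `-` and `_`; that matches the ID used elsewhere in this notebook but is an observed convention, not a guarantee. `looks_like_video_id` is a hypothetical helper.

```python
import re

# Assumption: video IDs are 11 characters drawn from letters, digits, "-" and "_".
VIDEO_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{11}")

def looks_like_video_id(candidate):
    """Return True if the string has the assumed shape of a YouTube video ID."""
    return VIDEO_ID_PATTERN.fullmatch(candidate) is not None

print(looks_like_video_id("ysLiABvVos8"))    # True
print(looks_like_video_id("your_video_id"))  # False
```

A shape check cannot tell whether the video exists, so the download call should still be wrapped in error handling.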

In [18]:
!youtube-dl --verbose https://www.youtube.com/live/ysLiABvVos8?si=7o4BF8DuZg91_i1t

[youtube:tab] live: Downloading webpage
[youtube] ysLiABvVos8: Downloading webpage


[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--verbose', 'https://www.youtube.com/live/ysLiABvVos8?si=7o4BF8DuZg91_i1t']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Python version 3.10.14 (CPython) - Windows-10-10.0.22631-SP0
[debug] exe versions: none
[debug] Proxy map: {}
ERROR: Unable to extract uploader id; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "c:\Users\fyzan\anaconda3\envs\sarvam\lib\site-packages\youtube_dl\YoutubeDL.py", line 815, in wrapper
    return func(self, *args, **kwargs)
  File "c:\Users\fyzan\anaconda3\envs\sarvam\lib\site-packages\youtube_dl\YoutubeDL.py", line 836, in __extract_info
    ie_result = ie.extract(url

In [7]:
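import youtube_dl
from moviepy.editor import AudioFileClip

def download_video_extract_audio(youtube_url, output_audio_file):
    # Download the best available audio stream
    ydl_opts = {'format': 'bestaudio/best'}
    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
        # extract_info with download=True saves the file and returns its metadata
        info = ydl.extract_info(youtube_url, download=True)
        # Resolve the local path the download was written to
        downloaded_file = ydl.prepare_filename(info)

    # Extract audio from the downloaded file (moviepy cannot open the YouTube URL itself)
    audio = AudioFileClip(downloaded_file)
    audio.write_audiofile(output_audio_file)
    audio.close()

download_video_extract_audio('https://www.youtube.com/live/ysLiABvVos8?si=7o4BF8DuZg91_i1t', 'output_audio.wav')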


[youtube:tab] live: Downloading webpage


[youtube] ysLiABvVos8: Downloading webpage


ERROR: Unable to extract uploader id; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.


DownloadError: ERROR: Unable to extract uploader id; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
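
Since this section tries several download backends and some of them fail, one option is to chain them and keep the first that succeeds. The sketch below shows only the control flow: `first_successful` is a hypothetical helper, and the two backends are stand-ins that simulate a failure and a success rather than calling any real library.

```python
def first_successful(downloaders, url):
    """Try each (name, callable) pair in order; return (name, result) from the first that succeeds."""
    errors = []
    for name, download in downloaders:
        try:
            return name, download(url)
        except Exception as exc:  # broad on purpose: each backend raises its own error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All downloaders failed:\n" + "\n".join(errors))

# Stand-in backends for illustration only; real ones would wrap the libraries used above.
def failing_backend(url):
    raise IOError("Unable to extract uploader id")

def working_backend(url):
    return "downloaded_audio.wav"

name, path = first_successful(
    [("backend_a", failing_backend), ("backend_b", working_backend)],
    "https://www.youtube.com/watch?v=ysLiABvVos8",
)
print(name, path)
```

Collecting the error messages rather than discarding them keeps the final exception informative when every backend fails.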