# Task: Semantic Chunking of a YouTube Video

## Workflow Steps:
1. Download Video and Extract Audio
2. Transcription of Audio
3. Time-Align Transcript with Audio
4. Semantic Chunking of Data
5. Evaluation and Judgement Criteria
6. Generalization
7. Gradio App Interface

## 1. Download Video and Extract Audio

**Explanation of the Requirements:** <br>
1. youtube-dl: This library is used for downloading videos from YouTube.<br>
2. moviepy: Useful for video and audio processing, such as extracting audio tracks from video files.<br>
3. torch and transformers: These libraries are needed for PyTorch-based speech-to-text models, such as OpenAI's Whisper (used below) or Facebook's Wav2Vec.<br>
4. pandas and numpy: Essential for data manipulation and numerical operations.<br>
5. scipy and librosa: Used for more detailed audio analysis and processing tasks.<br>
6. soundfile: For reading and writing audio files.<br>
7. nltk and spacy: Natural Language Processing libraries useful for text manipulation and processing, especially for the semantic chunking.<br>
8. gradio: Allows you to create a web interface to easily interact with your models and processes.<br>
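Before running the install cells, it can help to see which of the packages above are already available. The sketch below probes for each package by its import name. The import names in the mapping are assumptions based on the usual naming (for example `youtube_dl` for youtube-dl), and `find_missing` is a hypothetical helper, not part of this notebook.

```python
import importlib.util

# pip package name -> import name (the import names are assumptions)
PACKAGES = {
    "youtube-dl": "youtube_dl",
    "moviepy": "moviepy",
    "torch": "torch",
    "transformers": "transformers",
    "pandas": "pandas",
    "numpy": "numpy",
    "scipy": "scipy",
    "librosa": "librosa",
    "soundfile": "soundfile",
    "nltk": "nltk",
    "spacy": "spacy",
    "gradio": "gradio",
}

def find_missing(packages):
    """Return the pip names whose import name cannot be found."""
    return [
        pip_name
        for pip_name, module_name in packages.items()
        if importlib.util.find_spec(module_name) is None
    ]

print("Missing packages:", find_missing(PACKAGES))
```

Anything reported as missing would still need to be installed, for instance through `requirements.txt`.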

In [1]:
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Looking in indexes: https://download.pytorch.org/whl/cu118, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.


In [1]:
import torch
torch.cuda.is_available()

True

In [2]:
%pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.


### Using youtube_transcript_api

In [3]:
from youtube_transcript_api import YouTubeTranscriptApi

In [4]:
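def get_video_id_from_url(url):
    # Extract the video ID from a standard YouTube watch URL
    if "youtube.com/watch?v=" in url:
        # Drop trailing query parameters or fragments (e.g. "&t=30s", "&list=...")
        # so that only the ID itself is returned
        return url.split("watch?v=")[1].split("&")[0].split("#")[0]
    else:
        raise ValueError("Invalid YouTube URL")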
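The function above only accepts `youtube.com/watch?v=` links. As a hedged sketch, the same idea can be extended to `youtu.be` short links by parsing the URL with the standard library instead of splitting strings. `extract_video_id` is a hypothetical name introduced here for illustration, and the set of accepted hosts is an assumption.

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(url):
    """Return the video ID for watch URLs and youtu.be short links (hypothetical helper)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    # Standard form: https://www.youtube.com/watch?v=<id>&other=params
    if host in ("youtube.com", "m.youtube.com") and parsed.path == "/watch":
        ids = parse_qs(parsed.query).get("v")
        if ids:
            return ids[0]

    # Short form: https://youtu.be/<id>?t=5
    if host == "youtu.be":
        video_id = parsed.path.lstrip("/").split("/")[0]
        if video_id:
            return video_id

    raise ValueError(f"Unsupported YouTube URL: {url}")

print(extract_video_id("https://www.youtube.com/watch?v=ysLiABvVos8&t=30s"))
print(extract_video_id("https://youtu.be/ysLiABvVos8?t=5"))
```

Parsing the query string rather than splitting on `watch?v=` keeps extra parameters such as `t` or `list` out of the returned ID.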

In [5]:
def get_transcript(youtube_url):
    try:
        # Extract the video ID from the URL
        video_id = get_video_id_from_url(youtube_url)

        # Fetch the available transcripts
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        
        # Print each entry of the transcript
        for entry in transcript:
            print(f"Start: {entry['start']} - Duration: {entry['duration']}s")
            print(entry['text'] + '\n')
    
    except Exception as e:
        print(f"An error occurred: {e}")

In [6]:
youtube_url = "https://www.youtube.com/watch?v=ysLiABvVos8"
get_transcript(youtube_url)

Start: 0.24 - Duration: 7.97s
presents Morning

Start: 2.94 - Duration: 5.27s
[Music]

Start: 10.719 - Duration: 7.041s
News good morning I am SRA maaba the

Start: 15.64 - Duration: 4.719s
headlines campaigning for remaining

Start: 17.76 - Duration: 5.04s
phases of L SAA elections

Start: 20.359 - Duration: 5.121s
intensifies president dropadi murmu to

Start: 22.8 - Duration: 6.44s
confir Padma awards at second investure

Start: 25.48 - Duration: 6.08s
ceremony in raspati bavan today try

Start: 29.24 - Duration: 4.159s
service conference parivartan chintan

Start: 31.56 - Duration: 4.88s
chaired by Chief of Defense staff

Start: 33.399 - Duration: 6.361s
General Anil Chan to be held in New

Start: 36.44 - Duration: 5.84s
Delhi unrest erupts in Pakistan occupied

Start: 39.76 - Duration: 4.4s
Kashmir as Pakistan Security Forces

Start: 42.28 - Duration: 5.16s
deployed ahead of

Start: 44.16 - Duration: 5.48s
protests IMD forast fresh spell of heat

Start: 47.44 - Duration: 5.72s
wav
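In the output above, consecutive caption entries overlap: each entry's `start + duration` runs past the start of the next one, and non-speech markers such as `[Music]` are mixed in. A minimal sketch of one way to turn these entries into non-overlapping blocks follows. It assumes the entry layout shown above (`start`, `duration`, `text`); `merge_captions` and its `max_block_s` parameter are hypothetical names, and using the next entry's start as the end time is a heuristic rather than anything the API guarantees.

```python
def merge_captions(entries, max_block_s=15.0):
    """Group caption entries into blocks spanning at most max_block_s seconds."""
    # Drop non-speech markers such as "[Music]"
    speech = [
        e for e in entries
        if not (e["text"].strip().startswith("[") and e["text"].strip().endswith("]"))
    ]

    blocks = []
    for i, entry in enumerate(speech):
        # Captions overlap on screen, so treat the next entry's start as this entry's end
        if i + 1 < len(speech):
            end = speech[i + 1]["start"]
        else:
            end = entry["start"] + entry["duration"]

        if blocks and end - blocks[-1]["start"] <= max_block_s:
            # Still within the time budget: extend the current block
            blocks[-1]["end"] = end
            blocks[-1]["text"] += " " + entry["text"]
        else:
            blocks.append({"start": entry["start"], "end": end, "text": entry["text"]})
    return blocks

sample = [
    {"start": 0.24, "duration": 7.97, "text": "presents Morning"},
    {"start": 2.94, "duration": 5.27, "text": "[Music]"},
    {"start": 10.719, "duration": 7.041, "text": "News good morning I am SRA maaba the"},
    {"start": 15.64, "duration": 4.719, "text": "headlines campaigning for remaining"},
]

for block in merge_captions(sample):
    print(f"{block['start']:.2f} - {block['end']:.2f}: {block['text']}")
```

Blocks built this way are bounded by time only; they do not follow sentence or topic boundaries.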

### Using pytube

In [7]:
import os
import subprocess
from pytube import YouTube

In [8]:
def rename_audio(file_path):
    # Define the new file name
    new_file_name = "audio_file"

    # Get the directory of the file
    file_directory = os.path.dirname(file_path)

    # Get the file extension
    file_extension = os.path.splitext(file_path)[1]

    # Create a unique new file name
    count = 1
    while os.path.exists(os.path.join(file_directory, f"{new_file_name}_{count}{file_extension}")):
        count += 1

    new_file_name = f"{new_file_name}_{count}{file_extension}"
    # Get the new file path
    new_file_path = os.path.join(file_directory, new_file_name)

    # Rename the file
    os.rename(file_path, new_file_path)

    print(f"Renamed '{file_path}' to '{new_file_path}'")

    return new_file_path

In [9]:
def convert_to_wav(file_path):
    # Define output directory
    output_directory = os.path.dirname(file_path)

    # Define the new file name with .wav extension
    wav_file_name = os.path.splitext(os.path.basename(file_path))[0] + ".wav"

    # Define the output path for the converted .wav file
    wav_file_path = os.path.join(output_directory, wav_file_name)

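    # Execute FFmpeg command to convert to 16-bit PCM .wav at 44.1 kHz.
    # "-y" overwrites an existing output file instead of waiting for confirmation,
    # and check=True raises CalledProcessError if FFmpeg exits with an error.
    subprocess.run(["ffmpeg", "-y", "-i", file_path, "-acodec", "pcm_s16le", "-ar", "44100", wav_file_path], stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True)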

    return wav_file_path
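After conversion it is worth confirming that the file really has the expected format before passing it to a model. The sketch below reads the header with Python's standard `wave` module; `describe_wav` is a hypothetical helper, and the demo writes a one-second silent file so that it runs without the downloaded audio.

```python
import os
import tempfile
import wave

def describe_wav(path):
    """Return basic properties of a PCM .wav file (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": frames / rate,
        }

# Demo input: one second of mono silence, written with the same settings
# that the FFmpeg command requests (16-bit samples at 44.1 kHz)
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
    out.setframerate(44100)
    out.writeframes(b"\x00\x00" * 44100)

info = describe_wav(demo_path)
print(info)
```

For the real pipeline, `describe_wav(wav_file_path)` would be called on the path returned by `convert_to_wav`.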

In [10]:
def download_audio(youtube_url):
    # Create a YouTube object with the URL
    yt = YouTube(youtube_url)

    # Define output directory
    output_path = "downloaded_audio"

    # Get the best audio stream
    audio_stream = yt.streams.get_audio_only()

    # Download the audio stream
    downloaded_file_path = audio_stream.download(output_path)

    # Rename the downloaded audio file
    renamed_file_path = rename_audio(downloaded_file_path)

    print(f"Downloaded '{audio_stream.title}' to '{renamed_file_path}' as '{audio_stream.mime_type}'.")

    # Convert to .wav format
    wav_file_path = convert_to_wav(renamed_file_path)

    print(f"Converted '{renamed_file_path}' to '{wav_file_path}'.")

    return wav_file_path

In [11]:
# Example usage:
youtube_url = "https://www.youtube.com/watch?v=ysLiABvVos8"
wav_file_path = download_audio(youtube_url)

Renamed 'e:\Github_projects\sarvamai-hiring-challenge\downloaded_audio\Campaigning for remaining phases of Lok Sabha elections intensifies.mp4' to 'e:\Github_projects\sarvamai-hiring-challenge\downloaded_audio\audio_file_1.mp4'
Downloaded 'Campaigning for remaining phases of Lok Sabha elections intensifies' to 'e:\Github_projects\sarvamai-hiring-challenge\downloaded_audio\audio_file_1.mp4' as 'audio/mp4'.
Converted 'e:\Github_projects\sarvamai-hiring-challenge\downloaded_audio\audio_file_1.mp4' to 'e:\Github_projects\sarvamai-hiring-challenge\downloaded_audio\audio_file_1.wav'.


In [12]:
wav_file_path

'e:\\Github_projects\\sarvamai-hiring-challenge\\downloaded_audio\\audio_file_1.wav'

In [13]:
youtube_url = "https://www.youtube.com/watch?v=ysLiABvVos8"

### Audio to Text

In [14]:
!pip install --upgrade pip
!pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to c:\users\fyzan\appdata\local\temp\pip-req-build-9mlbgqwu
  Resolved https://github.com/huggingface/transformers.git to commit 15c74a28294fe9082b81b24efe58df16fed79a9e
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Installing backend dependencies: started
  Installing backend dependencies: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'


  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git 'C:\Users\fyzan\AppData\Local\Temp\pip-req-build-9mlbgqwu'


In [16]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

In [17]:
torch.cuda.is_available()

True

![Image Description](image.png)

In [18]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-tiny"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [24]:
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=5,
    batch_size=2,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

In [26]:
result = pipe(wav_file_path, return_timestamps=True, generate_kwargs={"language": "english"})
print(result["chunks"])

[{'timestamp': (0.0, 28.99), 'text': ' He presents morning news. Thank you. you Good morning, I am Saira Muchthaba. much the bar. The headlines. campaigning for remaining phases of looks of elections intensifies. files. Padmin Awards at Second Investiture Serbony in Rush by the Bhavan today.'}, {'timestamp': (28.99, 37.49), 'text': ' Troy service conference, Barry Wartanthan Chinthan, chaired by Chief of Defence staff, General Anil Johan to be held in New Delhi.'}, {'timestamp': (37.49, 44.33), 'text': ' Undressed Irritate And rest erupts in Pakistan occupied Kashmir as Pakistan security forces deployed ahead of protests.'}, {'timestamp': (44.33, 64.33), 'text': ' IMD forecast fresh spell of heat wave to continue over Rajasthan and Madhipraadesh. and in table tennis, Monica Battra reaches quarterfinals at Saudi Smash, in Jadda. . And now the news in detail. Camping for the news. care.'}, {'timestamp': (64.33, 66.53), 'text': ' campaigning for the remaining phases of looks of high'}, {'

In [27]:
data = result["chunks"]

for entry in data:
    start_time, end_time = entry['timestamp']
    text = entry['text']
    print(f"Timestamp: {start_time:.2f} - {end_time:.2f}")
    print(f"Text: {text}")

Timestamp: 0.00 - 28.99
Text:  He presents morning news. Thank you. you Good morning, I am Saira Muchthaba. much the bar. The headlines. campaigning for remaining phases of looks of elections intensifies. files. Padmin Awards at Second Investiture Serbony in Rush by the Bhavan today.
Timestamp: 28.99 - 37.49
Text:  Troy service conference, Barry Wartanthan Chinthan, chaired by Chief of Defence staff, General Anil Johan to be held in New Delhi.
Timestamp: 37.49 - 44.33
Text:  Undressed Irritate And rest erupts in Pakistan occupied Kashmir as Pakistan security forces deployed ahead of protests.
Timestamp: 44.33 - 64.33
Text:  IMD forecast fresh spell of heat wave to continue over Rajasthan and Madhipraadesh. and in table tennis, Monica Battra reaches quarterfinals at Saudi Smash, in Jadda. . And now the news in detail. Camping for the news. care.
Timestamp: 64.33 - 66.53
Text:  campaigning for the remaining phases of looks of high
Timestamp: 66.53 - 72.96
Text:  elections in Tennessee ha
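The raw second offsets printed above are hard to match against a video player. As a small sketch, the chunk list can be rendered with `HH:MM:SS.mmm` timestamps instead. `format_timestamp` and `format_chunks` are hypothetical helpers, and the `None` handling is a defensive assumption in case a chunk has no end timestamp.

```python
def format_timestamp(seconds):
    """Format a second offset as HH:MM:SS.mmm; None becomes a placeholder."""
    if seconds is None:
        return "--:--:--.---"
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"

def format_chunks(chunks):
    """Render chunks shaped like the pipeline output as one line per chunk."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        lines.append(
            f"[{format_timestamp(start)} -> {format_timestamp(end)}] {chunk['text'].strip()}"
        )
    return lines

# Two entries copied from the output above
sample_chunks = [
    {"timestamp": (28.99, 37.49), "text": " Troy service conference, Barry Wartanthan Chinthan, chaired by Chief of Defence staff, General Anil Johan to be held in New Delhi."},
    {"timestamp": (64.33, 66.53), "text": " campaigning for the remaining phases of looks of high"},
]

for line in format_chunks(sample_chunks):
    print(line)
```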

In [None]:

def transcribe_audio(wav_file_path):
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

    model_id = "openai/whisper-tiny"

    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
    )
    model.to(device)

    processor = AutoProcessor.from_pretrained(model_id)

    pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        max_new_tokens=128,
        chunk_length_s=30,
        batch_size=16,
        return_timestamps=True,
        torch_dtype=torch_dtype,
        device=device,
    )

    result = pipe(wav_file_path, return_timestamps=True)
    return result["chunks"]

In [None]:
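import re

def adjust_timestamps(transcription_data):
    # Split each transcribed chunk into sentences and spread the chunk's time
    # span evenly across them (an approximation, not a true forced alignment)
    new_chunks = []

    for entry in transcription_data:
        start_time, end_time = entry['timestamp']
        text = entry['text']

        # Guard against a missing end timestamp (None) by using a zero-length span
        if end_time is None:
            end_time = start_time

        # Split after sentence-ending punctuation followed by whitespace; the
        # punctuation stays with its sentence, and a trailing fragment without
        # punctuation is kept instead of being dropped
        sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text)]

        # Drop empty pieces and stray punctuation such as a lone "."
        sentences = [s for s in sentences if re.search(r'\w', s)]

        # Terminate a trailing fragment so every sentence ends with punctuation
        if sentences and sentences[-1][-1] not in '.!?':
            sentences[-1] += "."

        # Calculate the average duration of each sentence
        sentence_count = len(sentences)
        duration = (end_time - start_time) / sentence_count if sentence_count > 0 else 0

        for i, sentence in enumerate(sentences):
            new_start_time = start_time + i * duration
            new_end_time = start_time + (i + 1) * duration
            new_chunks.append({
                'timestamp': (new_start_time, new_end_time),
                'text': sentence
            })

    return new_chunks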

# Example usage within the main function of a Streamlit app.
# Save the functions above in a script and launch it with `streamlit run`;
# this cell is not meant to be executed inside the notebook.
import streamlit as st

def main():
    st.title("YouTube Transcription App")

    youtube_url = st.text_input("Enter a YouTube video URL")

    if youtube_url:
        try:
            # Validate the URL before starting the download (raises ValueError if invalid)
            video_id = get_video_id_from_url(youtube_url)
            wav_file_path = download_audio(youtube_url)
            transcription_data = transcribe_audio(wav_file_path)

            adjusted_transcription_data = adjust_timestamps(transcription_data)

            st.subheader("Transcription")
            for entry in adjusted_transcription_data:
                start_time, end_time = entry['timestamp']
                text = entry['text']
                st.write(f"Timestamp: {start_time:.2f} - {end_time:.2f}")
                st.write(f"Text: {text}")
        except Exception as e:
            st.error(f"An error occurred: {e}")

if __name__ == "__main__":
    main()
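`adjust_timestamps` gives every sentence in a chunk the same duration, so a two-word sentence receives as much time as a long one. A hedged alternative is to allot time in proportion to sentence length in characters. This is still a heuristic and not a forced alignment against the audio; `split_proportional` is a hypothetical helper introduced for illustration.

```python
import re

def split_proportional(start, end, text):
    """Split text into sentences and allot time in proportion to sentence length."""
    # Each match is a run of non-terminator characters plus any terminators that
    # follow, so a trailing fragment without punctuation is also captured
    pieces = re.findall(r"[^.!?]+[.!?]*", text)

    # Keep only pieces that contain at least one word character
    sentences = [p.strip() for p in pieces if re.search(r"\w", p)]

    total_chars = sum(len(s) for s in sentences)
    if total_chars == 0:
        return []

    chunks = []
    cursor = start
    for sentence in sentences:
        # Longer sentences receive a proportionally larger share of the span
        share = (end - start) * len(sentence) / total_chars
        chunks.append({"timestamp": (cursor, cursor + share), "text": sentence})
        cursor += share
    return chunks

for chunk in split_proportional(0.0, 15.0, " Hello there. Hi!"):
    print(chunk)
```

Character count is only a rough proxy for speaking time; word-level timestamps from the recognizer would be a more accurate basis where they are available.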
