<a href="https://colab.research.google.com/github/aithentic-dev/colab-yt/blob/main/YT_fetch_and_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# YouTube Comments Summarization with OpenAI

This notebook fetches a YouTube video's transcript and viewer comments, summarizes both with OpenAI models, and scores the sentiment of the comments.

1. Fetch Video Transcripts: Automatically retrieve the textual content of YouTube videos using the youtube-transcript-api.
2. Extract and Analyze Comments: Use the YouTube Data API to gather viewer comments from specified videos for further analysis.
3. Summarize and Link Comments to Transcripts: Apply OpenAI's `gpt-3.5-turbo` model to summarize comments and relate them to the video transcript, highlighting what viewers praised, criticized, or asked for.
4. Perform Sentiment Analysis: Use the Hugging Face `sentiment-analysis` pipeline to label each comment positive or negative, counting results with a confidence score of 0.9 or lower as neutral, which gives a quantitative measure of audience sentiment.
5. Visualize Data: Display sentiment analysis results using pie charts to give a visual representation of audience reactions.



## Setup
First, install and import the OpenAI client, then define the API keys and the YouTube video ID.

Save the following three tokens in the Colab secrets to be able to run this notebook:

1.   YT_APIKEY - Google API Key
2.   OPENAI_API_KEY - OpenAI API Key
3.   AZ_STR - Azure Storage connection string (only needed for the Azure Blob sections)
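
Outside Colab the `userdata` store is not available, so a small lookup helper can make the secret handling easier to follow. The sketch below is only an illustration: `get_secret` is a hypothetical helper that is not part of this notebook, and the fallback to environment variables is an assumption about how you might run the code locally.

```python
import os


def get_secret(name: str) -> str:
    """Return a secret from Colab's store, falling back to an environment variable.

    Hypothetical helper for illustration; the notebook itself calls
    userdata.get(...) directly.
    """
    try:
        # google.colab can only be imported inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        userdata = None

    value = None
    if userdata is not None:
        try:
            value = userdata.get(name)
        except Exception:
            # The lookup can fail if the secret is missing or access is disabled.
            value = None

    # Assumed fallback for local runs: an environment variable with the same name.
    value = value or os.environ.get(name)
    if not value:
        raise KeyError(f"Secret {name!r} is not set in Colab secrets or the environment")
    return value


# Example (commented out so the cell does not fail when a secret is missing):
# GOOGLE_APIKEY = get_secret("YT_APIKEY")
```

Raising `KeyError` early makes a missing token visible in the Setup cell instead of surfacing later as an authentication error.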



In [None]:
!pip install openai

from openai import OpenAI
from google.colab import userdata

# Create the OpenAI client with the key stored in Colab secrets.
client = OpenAI(
    api_key=userdata.get('OPENAI_API_KEY')
)

VIDEO_ID = 'boHg6quSZtI'  # YouTube video to analyze
VIDEOID = VIDEO_ID  # Alias: later cells refer to the video as VIDEOID
GOOGLE_APIKEY = userdata.get('YT_APIKEY')
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
YT_TRANSCRIPTS = '/content/drive/My Drive/YouTubeTranscripts'
AZ_STR = userdata.get('AZ_STR')  # Azure Storage connection string



## Fetching the YouTube Video Transcript

In [None]:
# Install required packages
!pip install youtube-transcript-api tiktoken openai
# Import required libraries
from youtube_transcript_api import YouTubeTranscriptApi
import tiktoken
from typing import List
import time

def get_video_transcript(video_id: str) -> str:
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['te', 'en'])
        full_transcript = " ".join(entry['text'] for entry in transcript)

        # Print the full transcript
        print("=== FULL TRANSCRIPT ===")
        print(full_transcript)
        print("\nTranscript length (characters):", len(full_transcript))
        print("===== END TRANSCRIPT =====\n")

        return full_transcript
    except Exception as e:
        print(f"Failed to fetch transcript: {e}")
        return ""


def chunk_text(text: str, max_tokens: int = 12000) -> List[str]:
    """Split text into chunks that fit within token limit."""
    try:
        # Initialize tokenizer for GPT-3.5-turbo
        enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

        words = text.split()
        chunks = []
        current_chunk = []
        current_length = 0

        for word in words:
            word_tokens = len(enc.encode(word + " "))
            if current_length + word_tokens > max_tokens:
                # Save current chunk and start new one
                chunks.append(" ".join(current_chunk))
                current_chunk = [word]
                current_length = word_tokens
            else:
                current_chunk.append(word)
                current_length += word_tokens

        # Add the last chunk if it exists
        if current_chunk:
            chunks.append(" ".join(current_chunk))

        return chunks
    except Exception as e:
        print(f"Error in chunk_text: {e}")
        # Fallback to a simpler chunking method if tiktoken fails
        words = text.split()
        chunk_size = 3000  # Approximate size that should stay under token limit
        return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def get_transcript_summary(transcript: str, client) -> str:
    # Split transcript into smaller chunks
    chunks = chunk_text(transcript)
    summaries = []

    # Process each chunk
    for i, chunk in enumerate(chunks):
        try:
            response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": f"You are summarizing part {i+1} of {len(chunks)} of a YouTube video transcript. Provide a concise summary of this section."},
                    {"role": "user", "content": chunk}
                ]
            )
            summaries.append(response.choices[0].message.content)
            # Add a small delay to avoid rate limiting
            time.sleep(1)
        except Exception as e:
            print(f"Error processing chunk {i+1}: {e}")

    # Combine chunk summaries into final summary
    if summaries:
        final_prompt = "Combine these section summaries into a coherent overall summary:\n\n" + "\n\n".join(summaries)
        try:
            final_response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "Create a coherent overall summary from these section summaries."},
                    {"role": "user", "content": final_prompt}
                ]
            )
            return final_response.choices[0].message.content
        except Exception as e:
            print(f"Error creating final summary: {e}")
            return "\n\n".join(summaries)  # Return individual summaries if final combination fails

    return "Failed to generate summary."

# Usage example: summarize the transcript of the video chosen in Setup.
video_transcript = get_video_transcript(VIDEOID)
transcript_summary = get_transcript_summary(video_transcript, client)
print(transcript_summary)


=== FULL TRANSCRIPT ===
హలో వెల్కమ్ టు అవర్ ఛానల్ మొత్తానికి అల్లు అర్జున్ గారు రిలీజ్ అయ్యారు ఇంటికి వచ్చారు ఒక రోజంతా జైల్లో ఉండాల్సిన తప్పైతే అల్లు అర్జున్ గారు చేయలేదు అన్నది నా వ్యక్తిగత అభిప్రాయం బట్ అగైన్ ఎందుకు జర
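
The cell above follows a chunk, summarize, then combine pattern. A minimal sketch of that control flow is shown below, assuming a word budget in place of `tiktoken` token counts and a stand-in summarizer in place of the OpenAI call; `split_by_word_budget`, `summarize_in_chunks`, and `fake_summarize` are hypothetical names used only for this illustration.

```python
from typing import Callable, List


def split_by_word_budget(text: str, max_words: int) -> List[str]:
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_in_chunks(text: str, summarize: Callable[[str], str], max_words: int = 3000) -> str:
    """Summarize each chunk, then summarize the joined partial summaries."""
    chunks = split_by_word_budget(text, max_words)
    if not chunks:
        return ""
    partial = [summarize(chunk) for chunk in chunks]
    if len(partial) == 1:
        # A single chunk needs no combining step.
        return partial[0]
    return summarize("\n\n".join(partial))


def fake_summarize(text: str) -> str:
    """Stand-in for the model call: keeps the first three words of its input."""
    return " ".join(text.split()[:3])


result = summarize_in_chunks(
    "one two three four five six seven eight", fake_summarize, max_words=4
)
print(result)
```

Passing the summarizer in as a function lets the chunking logic be dry-run without spending API calls; in the notebook the callable would wrap `client.chat.completions.create`.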

## Fetching YouTube Video Comments

In [None]:
from googleapiclient.discovery import build

# Function to get YouTube comments
def get_comments(video_id, api_key):
    youtube = build('youtube', 'v3', developerKey=api_key)
    comments = []
    response = youtube.commentThreads().list(
        part="snippet",
        videoId=video_id,
        textFormat="plainText",
        maxResults=100
    ).execute()

    while response:
        for item in response['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comments.append(comment)

        if 'nextPageToken' in response:
            # Request the next page with the same options as the first call
            response = youtube.commentThreads().list(
                part="snippet",
                videoId=video_id,
                textFormat="plainText",
                pageToken=response['nextPageToken'],
                maxResults=100
            ).execute()
        else:
            break

    return comments

video_comments = get_comments(VIDEOID, GOOGLE_APIKEY)
print(video_comments)
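
To make the nesting that `get_comments` walks through easier to see, here is a small offline sketch. The `sample_pages` data is hand-written for illustration and only mirrors the fields the function above reads (`items`, `snippet`, `topLevelComment`, `textDisplay`, `nextPageToken`); `extract_comment_texts` is a hypothetical helper, not part of the YouTube client library.

```python
from typing import Any, Dict, Iterable, List


def extract_comment_texts(pages: Iterable[Dict[str, Any]]) -> List[str]:
    """Collect the top-level comment text from commentThreads-style response pages."""
    texts: List[str] = []
    for page in pages:
        for item in page.get("items", []):
            snippet = item["snippet"]["topLevelComment"]["snippet"]
            texts.append(snippet["textDisplay"])
    return texts


# Hand-written sample pages (not real API output).
sample_pages = [
    {
        "items": [
            {"snippet": {"topLevelComment": {"snippet": {"textDisplay": "Great video"}}}}
        ],
        "nextPageToken": "abc",
    },
    {
        "items": [
            {"snippet": {"topLevelComment": {"snippet": {"textDisplay": "Too long"}}}}
        ]
    },
]

texts = extract_comment_texts(sample_pages)
print(texts)
```

Only top-level comments are read here, matching the function above; replies are not included.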



## Summarizing Comments
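Comments are grouped into batches, each batch is summarized with `gpt-3.5-turbo`, and the batch summaries are then analyzed against the transcript summary.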

In [None]:
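# Function to split comments into manageable batches.
# Note: max_tokens is compared against a whitespace word count, which only
# approximates the model's token count.
def batch_comments(comments, max_tokens=2048):
    batches = []
    current_batch = []
    current_length = 0

    for comment in comments:
        comment_length = len(comment.split())
        # Only flush a non-empty batch, so an oversized first comment
        # does not produce an empty batch
        if current_batch and current_length + comment_length > max_tokens: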
            batches.append(current_batch)
            current_batch = [comment]
            current_length = comment_length
        else:
            current_batch.append(comment)
            current_length += comment_length

    if current_batch:
        batches.append(current_batch)

    return batches

# Function to get summaries from OpenAI
def get_comments_summaries(batches):
    summaries = []

    for batch in batches:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Summarize the following comments while keeping the detailed context."},
                {"role": "user", "content": " ".join(batch)}
            ]
        )
        print(response)
        summaries.append(response.choices[0].message.content)

    return summaries

# Function to create final summary from summaries
def create_final_summary(summaries, transcript_summary):
    summary_text = " ".join(summaries)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
              {"role": "system", "content": f"This is the summary of a YouTube video's transcript: {transcript_summary}. A user has commented on the video. Your task is to analyze this comment in the context of the video transcript. Based on the comment content and its relation to the transcript, please provide detailed insights, addressing these key points:\n1. Identify positive aspects of the video that the comment highlights and link these to specific parts of the transcript where possible.\n2. Identify any criticisms or areas for improvement mentioned in the comment, and relate these to relevant sections of the transcript.\n3. Based on the feedback or suggestions in the comment, recommend new content ideas or topics for future videos that align with the viewer's interests and the overall content strategy but don't make up things from your side unnecessarily. Ensure your analysis is clear and includes specific examples from both the comment and the transcript to support your insights."},
              {"role": "user", "content": summary_text}
        ]
    )
    return response.choices[0].message.content

In [None]:
batches = batch_comments(video_comments)
summaries = get_comments_summaries(batches)
final_comments_summary = create_final_summary(summaries, transcript_summary)
print(final_comments_summary)

ChatCompletion(id='chatcmpl-Alje5eG8J0ogklXwnGM514YPjyXof', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="Revanth Reddy criticizes film stars for their influence in the mega family, mentioning about their indifference and the need for clarity from Congress. Many comments express support for his statements and criticize the behavior of certain political figures. There is also a call to increase movie ticket rates, cap celebrities' remunerations, and a suggestion to put the cinema industry in its place. Some users show dissatisfaction with the behavior of certain celebrities and political figures in the Telugu states, emphasizing the need for ethical conduct in politics and entertainment industry.", refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1735940241, model='gpt-3.5-turbo-0125', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tok
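
One edge case in the batching step is a single comment that is longer than the whole budget. The sketch below is one possible way to handle it, assuming a word budget as in `batch_comments`; `batch_by_words` is a hypothetical variant written for this illustration, not the function used above.

```python
from typing import List


def batch_by_words(comments: List[str], max_words: int) -> List[List[str]]:
    """Group comments into batches whose total word count stays within max_words."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_words = 0

    for comment in comments:
        words = comment.split()
        # Split an oversized comment into pieces that each fit the budget.
        pieces = [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
        for piece in pieces:
            n = len(piece.split())
            if current and current_words + n > max_words:
                # The current batch is full: store it and start a new one.
                batches.append(current)
                current, current_words = [], 0
            current.append(piece)
            current_words += n

    if current:
        batches.append(current)
    return batches


batches = batch_by_words(["a b", "c d e", "f g h i j k l"], max_words=4)
print(batches)
```

With a budget of four words, the seven-word comment is split into two pieces, so no batch exceeds the budget and no empty batch is produced.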

## Sentiment Analysis of Comments

In [None]:
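def get_sentiments(comments):
    """Count comments per sentiment bucket.

    Relies on the global `sentiment_analyzer` pipeline, which must be created
    before this function is called. Results with a score of 0.9 or lower, and
    comments that fail analysis, are counted as neutral.
    """
    sentiment_counts = {'positive': 0, 'negative': 0, 'neutral': 0}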

    for comment in comments:
        # Perform sentiment analysis on each comment individually
        try:
            sentiments = sentiment_analyzer(comment)
            # Assuming the first result is the relevant one if multiple are returned
            sentiment = sentiments[0] if isinstance(sentiments, list) and sentiments else None
        except Exception as e:
            print(f"Error in sentiment analysis: {e}")
            sentiment = None

        # Tally sentiments based on the analysis result
        if sentiment and sentiment['label'] == 'POSITIVE' and sentiment['score'] > 0.9:
            sentiment_counts['positive'] += 1
        elif sentiment and sentiment['label'] == 'NEGATIVE' and sentiment['score'] > 0.9:
            sentiment_counts['negative'] += 1
        else:
            sentiment_counts['neutral'] += 1

    return sentiment_counts
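
The bucketing rule in `get_sentiments` can be tried without downloading a model. The sketch below assumes result dictionaries with `label` and `score` keys, as read by the function above; `to_bucket`, `tally`, and the sample results are hypothetical and hand-written for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def to_bucket(result: Optional[Dict[str, object]], threshold: float = 0.9) -> str:
    """Map one classifier result to 'positive', 'negative', or 'neutral'."""
    if not result:
        return "neutral"
    label = str(result.get("label", "")).upper()
    score = float(result.get("score", 0.0))
    # Only confident POSITIVE/NEGATIVE results leave the neutral bucket.
    if score > threshold and label in ("POSITIVE", "NEGATIVE"):
        return label.lower()
    return "neutral"


def tally(results: Iterable[Optional[Dict[str, object]]], threshold: float = 0.9) -> Dict[str, int]:
    """Count results per bucket, keeping buckets that end up at zero."""
    counts = Counter({"positive": 0, "negative": 0, "neutral": 0})
    counts.update(to_bucket(r, threshold) for r in results)
    return dict(counts)


# Hand-written sample results; None stands for a comment whose analysis failed.
sample = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.95},
    {"label": "POSITIVE", "score": 0.60},
    None,
]
counts = tally(sample)
print(counts)
```

Because "neutral" here means "low confidence or failed", a large neutral share may point to comments the classifier handles poorly rather than to genuinely neutral opinions.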

In [None]:
# Function to plot the sentiment distribution
def plot_sentiment_distribution(sentiment_counts):
    # Convert the dict views to lists before handing them to matplotlib
    labels = list(sentiment_counts.keys())
    sizes = list(sentiment_counts.values())
    colors = ['gold', 'lightcoral', 'lightskyblue']
    explode = (0.1, 0, 0)  # explode 1st slice

    plt.pie(sizes, explode=explode, labels=labels, colors=colors,
            autopct='%1.1f%%', shadow=True, startangle=140)

    plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
    plt.show()

In [None]:
!pip install transformers matplotlib

import matplotlib.pyplot as plt
from transformers import pipeline

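# Initialize the sentiment-analysis pipeline. No model is named, so
# transformers falls back to its default English SST-2 DistilBERT checkpoint,
# which returns only POSITIVE or NEGATIVE labels.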
sentiment_analyzer = pipeline("sentiment-analysis")

sentiment_counts = get_sentiments(video_comments)
plot_sentiment_distribution(sentiment_counts)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

## Fetch Transcripts

In [None]:
from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.formatters import TextFormatter

def get_video_transcript(video_id):
    try:
        # First try to get available transcripts
        transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)

        # Preference order: a manually created transcript, then an
        # auto-generated one, then any transcript in the listed languages.
        # Both find_* helpers need the language codes to search for.
        languages = ['en', 'hi', 'te']
        try:
            transcript = transcript_list.find_manually_created_transcript(languages)
        except Exception:
            # If no manual transcript, get the auto-generated one
            try:
                transcript = transcript_list.find_generated_transcript(languages)
            except Exception:
                # Last resort: any transcript in the listed languages
                transcript = transcript_list.find_transcript(languages)

        # Get the actual transcript
        transcript_data = transcript.fetch()

        # Format the transcript
        formatter = TextFormatter()
        formatted_transcript = formatter.format_transcript(transcript_data)

        return formatted_transcript

    except Exception as e:
        print(f"Failed to fetch transcript: {str(e)}")
        # Print available languages for debugging
        try:
            available_transcripts = YouTubeTranscriptApi.list_transcripts(video_id)
            print("Available transcript languages:")
            for transcript in available_transcripts:
                print(f"- {transcript.language_code}")
        except Exception:
            print("Could not retrieve available transcripts")
        return ""

def get_transcript_summary(transcript):
    if not transcript:
        return "No transcript available to summarize."

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Provide a detailed summary of the given youtube video transcript."},
            {"role": "user", "content": transcript}
        ]
    )
    return response.choices[0].message.content

# Usage
video_id = "YOUR_VIDEO_ID"
video_transcript = get_video_transcript(video_id)
if video_transcript:
    transcript_summary = get_transcript_summary(video_transcript)
    print(transcript_summary)
else:
    print("Could not generate summary as no transcript was available.")
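
The selection order used in `get_video_transcript` (manual before auto-generated, languages in preference order) can be checked offline. In the sketch below, `TranscriptInfo` is a hypothetical stand-in with just two fields, assumed to mirror the `language_code` and `is_generated` attributes of the library's transcript objects; `pick_transcript` is likewise only an illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TranscriptInfo:
    """Stand-in for a transcript entry (assumed fields, for illustration only)."""
    language_code: str
    is_generated: bool


def pick_transcript(
    available: Sequence[TranscriptInfo], languages: Sequence[str]
) -> Optional[TranscriptInfo]:
    """Prefer manual transcripts over generated ones, then follow language order."""
    for want_generated in (False, True):
        for code in languages:
            for info in available:
                if info.language_code == code and info.is_generated == want_generated:
                    return info
    return None


available = [
    TranscriptInfo("te", True),
    TranscriptInfo("en", True),
    TranscriptInfo("hi", False),
]
chosen = pick_transcript(available, ["en", "hi", "te"])
print(chosen)
```

In this sample the manually created Hindi transcript is chosen even though English is listed first, because a manual transcript takes precedence over any auto-generated one.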

## V2 - Transcripts with Timestamps

In [None]:
!pip install youtube-transcript-api



In [None]:
from youtube_transcript_api import YouTubeTranscriptApi

def download_youtube_transcript(video_id):
    try:
        # Fetch the available transcript languages
        transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)

        # Try to fetch transcript in the first available language
        transcript = None
        for transcript_info in transcript_list:
            try:
                transcript = transcript_info.fetch()
                print(f"Transcript found in language: {transcript_info.language}")
                break
            except Exception:
                # This language could not be fetched; try the next one
                continue

        if not transcript:
            print("Could not fetch a transcript in any available language.")
            return

        # Print the structured output
        for entry in transcript:
            print({
                'text': entry['text'],
                'start_time': entry['start'],
                'duration': entry['duration']
            })
    except Exception as e:
        print(f"An error occurred: {e.__class__.__name__} - {str(e)}")

# Example usage
if __name__ == "__main__":
    video_id = 'boHg6quSZtI'
    download_youtube_transcript(video_id)
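
The raw `start` and `duration` values are seconds as floats, which are hard to scan in long transcripts. A small sketch of turning them into `HH:MM:SS` ranges follows; `format_timestamp` and `format_entry` are hypothetical helpers, and the sample entry reuses the `text`, `start`, and `duration` keys printed by the cell above.

```python
def format_timestamp(seconds: float) -> str:
    """Format a number of seconds as HH:MM:SS, dropping the fractional part."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_entry(entry: dict) -> str:
    """Render one transcript entry as '[start - end] text'."""
    end = entry["start"] + entry["duration"]
    return f"[{format_timestamp(entry['start'])} - {format_timestamp(end)}] {entry['text']}"


line = format_entry({"text": "hello", "start": 3.28, "duration": 5.92})
print(line)
```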


## V2 - Hyperlink to the Video Segment

In [None]:
from youtube_transcript_api import YouTubeTranscriptApi

def download_youtube_transcript(video_id):
    try:
        # Fetch the available transcript languages
        transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)

        # Try to fetch transcript in the first available language
        transcript = None
        for transcript_info in transcript_list:
            try:
                transcript = transcript_info.fetch()
                print(f"Transcript found in language: {transcript_info.language}")
                break
            except Exception:
                # This language could not be fetched; try the next one
                continue

        if not transcript:
            print("Could not fetch a transcript in any available language.")
            return

        # Print the structured output with hyperlinks
        base_url = f"https://www.youtube.com/watch?v={video_id}&t="
        for entry in transcript:
            hyperlink = base_url + str(int(entry['start'])) + "s"
            print({
                'text': entry['text'],
                'start_time': entry['start'],
                'duration': entry['duration'],
                'hyperlink': hyperlink
            })
    except Exception as e:
        print(f"An error occurred: {e.__class__.__name__} - {str(e)}")

# Example usage
if __name__ == "__main__":
    video_id = 'boHg6quSZtI'
    download_youtube_transcript(video_id)


Transcript found in language: Telugu (auto-generated)
{'text': 'హలో వెల్కమ్ టు అవర్ ఛానల్ మొత్తానికి', 'start_time': 0.4, 'duration': 6.399, 'hyperlink': 'https://www.youtube.com/watch?v=boHg6quSZtI&t=0s'}
{'text': 'అల్లు అర్జున్ గారు రిలీజ్ అయ్యారు ఇంటికి', 'start_time': 3.28, 'duration': 5.92, 'hyperlink': 'https://www.youtube.com/watch?v=boHg6quSZtI&t=3s'}
{'text': 'వచ్చారు ఒక', 'start_time': 6.799, 'duration': 5.361, 'hyperlink': 'https://www.youtube.com/watch?v=boHg6quSZtI&t=6s'}
{'text': 'రోజంతా జైల్లో ఉండాల్సిన తప్పైతే అల్లు', 'start_time': 9.2, 'duration': 4.88, 'hyperlink': 'https://www.youtube.com/watch?v=boHg6quSZtI&t=9s'}
{'text': 'అర్జున్ గారు చేయలేదు అన్నది నా వ్యక్తిగత', 'start_time': 12.16, 'duration': 5.44, 'hyperlink': 'https://www.youtube.com/watch?v=boHg6quSZtI&t=12s'}
{'text': 'అభిప్రాయం బట్ అగైన్ ఎందుకు జరిగింది ఏంటి', 'start_time': 14.08, 'duration': 6.08, 'hyperlink': 'https://www.youtube.com/watch?v=boHg6quSZtI&t=14s'}
{'text': 'అన్నది ఇంకా వారి విజ్ఞతకే వదిలేద
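
The timestamped links above are built by string concatenation. As an alternative sketch, the same URL can be assembled with `urllib.parse.urlencode`, which also escapes the parameters; `timestamp_url` is a hypothetical helper name used only here.

```python
from urllib.parse import urlencode


def timestamp_url(video_id: str, start_seconds: float) -> str:
    """Build a watch URL that starts playback at the given whole second."""
    query = urlencode({"v": video_id, "t": f"{int(start_seconds)}s"})
    return f"https://www.youtube.com/watch?{query}"


url = timestamp_url("boHg6quSZtI", 12.16)
print(url)
```

The start time is truncated to a whole second, as in the cell above, so playback begins slightly before the caption rather than after it.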

## V2 - Transcripts to Google Drive

In [None]:
from youtube_transcript_api import YouTubeTranscriptApi
import os

def save_transcripts_to_folder(video_id, folder_path):
    try:
        # Fetch the available transcript languages
        transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)

        # Try to fetch transcript in the first available language
        transcript = None
        for transcript_info in transcript_list:
            try:
                transcript = transcript_info.fetch()
                print(f"Transcript found in language: {transcript_info.language}")
                break
            except Exception:
                continue

        if not transcript:
            print("Could not fetch a transcript in any available language.")
            return

        # Ensure the folder exists
        os.makedirs(folder_path, exist_ok=True)

        # Save the transcript to a file
        transcript_file = os.path.join(folder_path, f"transcript_{video_id}.txt")
        print(f"Saving file to: {transcript_file}")  # Debug print
        with open(transcript_file, 'w', encoding='utf-8') as file:
            for entry in transcript:
                file.write(f"{entry['start']} - {entry['duration']}: {entry['text']}\n")

        print(f"Transcript saved to {transcript_file}")
    except Exception as e:
        print(f"An error occurred: {e.__class__.__name__} - {str(e)}")

# Example usage
if __name__ == "__main__":
    from google.colab import drive
    drive.mount('/content/drive')  # Mount Google Drive
    video_id = 'boHg6quSZtI'  # Replace with your YouTube video ID
    folder_path = '/content/drive/My Drive/YT'  # Google Drive folder path
    save_transcripts_to_folder(video_id, folder_path)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Transcript found in language: Telugu (auto-generated)
Saving file to: /content/drive/My Drive/YT/transcript_boHg6quSZtI.txt
Transcript saved to /content/drive/My Drive/YT/transcript_boHg6quSZtI.txt
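
The saved file uses one line per entry in the form `start - duration: text`. The sketch below writes and re-reads that format in a temporary folder to show that it round-trips; `write_transcript`, `read_transcript`, and the sample entries are hypothetical and stand in for the Drive path used above.

```python
import os
import tempfile
from typing import Dict, List


def write_transcript(entries: List[Dict], path: str) -> None:
    """Write entries as 'start - duration: text', one per line."""
    with open(path, "w", encoding="utf-8") as f:
        for e in entries:
            f.write(f"{e['start']} - {e['duration']}: {e['text']}\n")


def read_transcript(path: str) -> List[Dict]:
    """Parse lines written by write_transcript back into dictionaries."""
    entries: List[Dict] = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            try:
                # Split only on the first separators so the text may contain them too.
                start, rest = line.rstrip("\n").split(" - ", 1)
                duration, text = rest.split(": ", 1)
                entries.append({"start": float(start), "duration": float(duration), "text": text})
            except ValueError:
                # Skip lines that do not match the expected layout.
                continue
    return entries


entries = [
    {"text": "hello: world - again", "start": 0.4, "duration": 6.399},
    {"text": "second line", "start": 3.28, "duration": 5.92},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "transcript_demo.txt")
    write_transcript(entries, path)
    restored = read_transcript(path)
print(restored)
```

One limitation of this line-based layout: caption text that itself contains a line break would be split across lines, and the continuation line would be skipped when reading.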


In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


## V2 - Transcripts to Azure Blob

In [None]:
!pip install azure-storage-blob


In [None]:
from youtube_transcript_api import YouTubeTranscriptApi
from azure.storage.blob import BlobServiceClient
import os

def save_transcripts_to_blob(video_id, connection_string, container_name):
    try:
        # Fetch the available transcript languages
        transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)

        # Try to fetch transcript in the first available language
        transcript = None
        for transcript_info in transcript_list:
            try:
                transcript = transcript_info.fetch()
                print(f"Transcript found in language: {transcript_info.language}")
                break
            except Exception:
                continue

        if not transcript:
            print("Could not fetch a transcript in any available language.")
            return

        # Create the transcript content as a string
        transcript_content = "".join(
            f"{entry['start']} - {entry['duration']}: {entry['text']}\n"
            for entry in transcript
        )

        # Upload the transcript to Azure Blob Storage
        blob_name = f"transcript_{video_id}.txt"
        blob_service_client = BlobServiceClient.from_connection_string(connection_string)
        blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)

        blob_client.upload_blob(transcript_content, overwrite=True)

        print(f"Transcript uploaded to blob: {blob_name} in container: {container_name}")
    except Exception as e:
        print(f"An error occurred: {e.__class__.__name__} - {str(e)}")

# Example usage
if __name__ == "__main__":
    video_id = 'boHg6quSZtI'  # Replace with your YouTube video ID
    connection_string = AZ_STR  # Azure Storage connection string loaded from Colab secrets in Setup
    container_name = 'content'  # Replace with your Azure Blob container name
    save_transcripts_to_blob(video_id, connection_string, container_name)


## V2 - Transcripts with Hyperlinks to Azure Blob

In [None]:
from youtube_transcript_api import YouTubeTranscriptApi
from azure.storage.blob import BlobServiceClient
import os

def save_transcripts_to_blob(video_id, connection_string, container_name):
    try:
        # Fetch the available transcript languages
        transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)

        # Try to fetch transcript in the first available language
        transcript = None
        for transcript_info in transcript_list:
            try:
                transcript = transcript_info.fetch()
                print(f"Transcript found in language: {transcript_info.language}")
                break
            except Exception:
                continue

        if not transcript:
            print("Could not fetch a transcript in any available language.")
            return

        # Create the transcript content as a string with hyperlinks
        base_url = f"https://www.youtube.com/watch?v={video_id}&t="
        transcript_content = ""
        for entry in transcript:
            hyperlink = base_url + str(int(entry['start'])) + "s"
            transcript_content += f"{entry['start']} - {entry['duration']}: {entry['text']}\nHyperlink: {hyperlink}\n\n"

        # Upload the transcript to Azure Blob Storage
        blob_name = f"transcript_{video_id}.txt"
        blob_service_client = BlobServiceClient.from_connection_string(connection_string)
        blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)

        blob_client.upload_blob(transcript_content, overwrite=True)

        print(f"Transcript uploaded to blob: {blob_name} in container: {container_name}")
    except Exception as e:
        print(f"An error occurred: {e.__class__.__name__} - {str(e)}")

def download_youtube_transcript(video_id):
    try:
        # Fetch the available transcript languages
        transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)

        # Try to fetch transcript in the first available language
        transcript = None
        for transcript_info in transcript_list:
            try:
                transcript = transcript_info.fetch()
                print(f"Transcript found in language: {transcript_info.language}")
                break
            except Exception:
                # This language could not be fetched; try the next one
                continue

        if not transcript:
            print("Could not fetch a transcript in any available language.")
            return

        # Print the structured output with hyperlinks
        base_url = f"https://www.youtube.com/watch?v={video_id}&t="
        for entry in transcript:
            hyperlink = base_url + str(int(entry['start'])) + "s"
            print({
                'text': entry['text'],
                'start_time': entry['start'],
                'duration': entry['duration'],
                'hyperlink': hyperlink
            })
    except Exception as e:
        print(f"An error occurred: {e.__class__.__name__} - {str(e)}")

# Example usage
if __name__ == "__main__":
    video_id = 'boHg6quSZtI'  # Replace with your YouTube video ID
    connection_string = AZ_STR  # Azure Storage connection string loaded from Colab secrets in Setup
    container_name = 'contentwithlinks'  # Replace with your Azure Blob container name
    save_transcripts_to_blob(video_id, connection_string, container_name)
    download_youtube_transcript(video_id)
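
Every V2 function repeats the same loop that tries each listed transcript until one can be fetched. A sketch of factoring that loop into one helper is shown below. `fetch_first_available` and `FakeTranscript` are hypothetical names; the fake class only imitates the `language` attribute and `fetch()` method used in the cells above so the helper can be run offline.

```python
from typing import Any, Iterable, List, Optional, Tuple


def fetch_first_available(transcript_infos: Iterable[Any]) -> Optional[Tuple[str, Any]]:
    """Return (language, entries) for the first transcript that fetches without error."""
    for info in transcript_infos:
        try:
            return info.language, info.fetch()
        except Exception as exc:
            # Report the failure and move on to the next language.
            print(f"Skipping {info.language}: {exc}")
    return None


class FakeTranscript:
    """Offline stand-in for a transcript object (illustration only)."""

    def __init__(self, language: str, entries: Optional[List[dict]] = None,
                 error: Optional[Exception] = None):
        self.language = language
        self.entries = entries or []
        self.error = error

    def fetch(self) -> List[dict]:
        if self.error is not None:
            raise self.error
        return self.entries


result = fetch_first_available([
    FakeTranscript("Telugu (auto-generated)", error=RuntimeError("blocked")),
    FakeTranscript("English", entries=[{"text": "hi", "start": 0.0, "duration": 1.0}]),
])
print(result)
```

Unlike the silent `continue` in the cells above, this version prints why a language was skipped, which can help when no transcript is found at all.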


## V2 - Search for Dialog in Video and Generate Hyperlink - Google Drive as Source

**Fuzzy matching does not capture the context of the dialog well, so this is not the preferred method 😞**
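
To illustrate what partial fuzzy matching does and why it ignores context, here is a dependency-free sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for `rapidfuzz`, so the scores will not match `fuzz.partial_ratio` exactly; `partial_score` and the English sample lines are hypothetical and chosen only for readability.

```python
from difflib import SequenceMatcher


def partial_score(needle: str, haystack: str) -> float:
    """Best similarity (0-100) of needle against any same-length window of haystack."""
    if not needle or not haystack:
        return 0.0
    if len(needle) >= len(haystack):
        return 100.0 * SequenceMatcher(None, needle, haystack).ratio()
    best = 0.0
    # Slide a window of len(needle) characters across the haystack.
    for i in range(len(haystack) - len(needle) + 1):
        window = haystack[i:i + len(needle)]
        best = max(best, SequenceMatcher(None, needle, window).ratio())
    return 100.0 * best


lines = [
    "welcome to our channel",
    "he was released yesterday",
    "thanks for watching",
]
# The search term is misspelled on purpose to show the tolerance for typos.
matches = [(line, partial_score("relased", line)) for line in lines]
best_line = max(matches, key=lambda m: m[1])[0]
print(best_line)
```

The score depends only on character overlap, so a line can rank highly without being about the same topic, which is the weakness noted above.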

In [None]:
!pip install rapidfuzz
!pip install indic-transliteration




In [None]:
from rapidfuzz import fuzz
import os
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

def search_transcript_with_fuzzy_matching_from_file(video_id, search_term, folder_path, threshold=70):
    try:
        # Transliterate the search term to Telugu
        transliterated_term = transliterate(search_term, sanscript.ITRANS, sanscript.TELUGU)
        print(f"Transliterated term: {transliterated_term}")

        # Construct the file path from video_id
        file_path = os.path.join(folder_path, f"transcript_{video_id}.txt")

        # Check if the file exists
        if not os.path.exists(file_path):
            print(f"File not found: {file_path}")
            return

        # Read the transcript from the file
        with open(file_path, 'r', encoding='utf-8') as file:
            transcript = file.readlines()

        # Perform fuzzy matching on the transcript
        base_url = f"https://www.youtube.com/watch?v={video_id}&t="
        matches = []

        for line in transcript:
            try:
                start_time, rest = line.split(" - ", 1)
                duration, text = rest.split(": ", 1)
                score = fuzz.partial_ratio(transliterated_term, text.strip())
                if score >= threshold:  # Filter matches above the threshold
                    hyperlink = base_url + str(int(float(start_time))) + "s"
                    matches.append({
                        'text': text.strip(),
                        'start_time': float(start_time),
                        'duration': float(duration),
                        'hyperlink': hyperlink,
                        'score': score
                    })
            except ValueError:
                continue

        # Sort matches by score in descending order
        matches.sort(key=lambda x: x['score'], reverse=True)

        # Output results
        if matches:
            print(f"Found {len(matches)} match(es) for '{search_term}':")
            for match in matches:
                print(f"Score: {match['score']}%")
                print(f"Dialog: {match['text']}")
                print(f"Hyperlink: {match['hyperlink']}\n")
        else:
            print(f"No matches found for '{search_term}'.")

    except Exception as e:
        print(f"An error occurred: {e.__class__.__name__} - {str(e)}")

# Example usage
if __name__ == "__main__":
    from google.colab import drive
    drive.mount('/content/drive')  # Mount Google Drive

    video_id = 'boHg6quSZtI'  # Replace with your YouTube video ID
    folder_path = '/content/drive/My Drive/YT'  # Google Drive folder path
    search_term = input("Enter the dialog or part of speech to search: ").strip()
    search_transcript_with_fuzzy_matching_from_file(video_id, search_term, folder_path)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Enter the dialog or part of speech to search: anna dandam
Transliterated term: అన్న దన్దమ్
Found 3 match(es) for 'anna dandam':
Score: 76.19047619047619%
Dialog: అన్న చేత పెట్టించుకుంటాడు మా అన్న పెద్ద
Hyperlink: https://www.youtube.com/watch?v=boHg6quSZtI&t=429s

Score: 72.72727272727273%
Dialog: చెప్పేసి రేవంత్ రెడ్డి అన్నాడు దమ్ము
Hyperlink: https://www.youtube.com/watch?v=boHg6quSZtI&t=653s

Score: 70.58823529411764%
Dialog: అన్న మెగా స్టార్ మా అన్న అన్నావు మా
Hyperlink: https://www.youtube.com/watch?v=boHg6quSZtI&t=391s
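
Both search cells split each saved transcript line on the `start - duration: text` layout before building a link. The following is a minimal standalone sketch of that parsing step, using only the standard library. The names `TranscriptLine`, `parse_transcript_line`, and `timestamp_link` are hypothetical helpers introduced here for illustration and are not part of the notebook.

```python
from typing import NamedTuple, Optional


class TranscriptLine(NamedTuple):
    start: float      # seconds from the beginning of the video
    duration: float   # seconds the caption stays on screen
    text: str         # caption text


def parse_transcript_line(line: str) -> Optional[TranscriptLine]:
    """Parse 'start - duration: text'; return None for malformed lines."""
    try:
        start, rest = line.split(" - ", 1)
        duration, text = rest.split(": ", 1)
        return TranscriptLine(float(start), float(duration), text.strip())
    except ValueError:
        # Raised when a separator is missing or a number cannot be parsed
        return None


def timestamp_link(video_id: str, start: float) -> str:
    """Build a watch URL that starts playback at the given whole second."""
    return f"https://www.youtube.com/watch?v={video_id}&t={int(start)}s"


# Sample line in the same layout as the saved transcript files
parsed = parse_transcript_line("259.519 - 5.281: రేవంత్ రెడ్డి అన్న మాటలతో నేను కొంతవరకు")
print(parsed.start, parsed.duration)
print(timestamp_link("boHg6quSZtI", parsed.start))
```

Returning `None` for malformed lines mirrors the `except ValueError: continue` pattern in the search functions, so a caller can skip bad lines without wrapping every call in its own `try` block.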



V2 - Semantic Search with AI Models

Use sentence transformers to capture the context of the query and return the closest-matching dialogs.
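
To make the ranking step concrete, here is a small sketch of cosine-similarity top-k retrieval in plain Python. It is only an illustration: the three-dimensional vectors are made-up stand-ins for real sentence embeddings, and `cosine_similarity` and `top_k_matches` are hypothetical helper names, not functions from sentence-transformers.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_matches(query: Sequence[float],
                  candidates: Sequence[Sequence[float]],
                  k: int) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k candidates most similar to the query."""
    scored = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]  # slicing never fails, even if k exceeds the candidate count


# Toy vectors standing in for embeddings of a query and three transcript lines
toy_query = [1.0, 0.0, 1.0]
toy_lines = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
ranking = top_k_matches(toy_query, toy_lines, k=2)
print(ranking)
```

The indices in `ranking` play the same role as `top_results.indices` in the cell below: they point back into the list of transcript lines and their start times.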

In [None]:
!pip install sentence-transformers
!pip install torch



In [None]:
import torch  # Ensure PyTorch is imported
from sentence_transformers import SentenceTransformer, util
import os

def search_transcript_with_semantic_search(video_id, search_term, folder_path, top_k=5):
    try:
        # Load the pre-trained SentenceTransformer model
        # Trained mainly on English text; consider a multilingual model for Telugu transcripts
        model = SentenceTransformer('all-mpnet-base-v2')  # You can choose another model if needed

        # Construct the file path from video_id
        file_path = os.path.join(folder_path, f"transcript_{video_id}.txt")

        # Check if the file exists
        if not os.path.exists(file_path):
            print(f"File not found: {file_path}")
            return

        # Read the transcript from the file
        with open(file_path, 'r', encoding='utf-8') as file:
            transcript_lines = file.readlines()

        # Extract text and timestamps from the transcript
        transcript = []
        start_times = []
        for line in transcript_lines:
            try:
                start_time, rest = line.split(" - ", 1)
                _, text = rest.split(": ", 1)
                transcript.append(text.strip())
                start_times.append(float(start_time))
            except ValueError:
                continue

        # Compute embeddings for the transcript and search query
        transcript_embeddings = model.encode(transcript, convert_to_tensor=True)
        query_embedding = model.encode(search_term, convert_to_tensor=True)

        # Calculate semantic similarity
        similarities = util.pytorch_cos_sim(query_embedding, transcript_embeddings)[0]

        # Retrieve the top-k most similar dialogs (k must not exceed the number of lines)
        top_results = torch.topk(similarities, k=min(top_k, len(transcript)))

        print(f"Found {len(top_results.indices)} match(es) for '{search_term}':")
        base_url = f"https://www.youtube.com/watch?v={video_id}&t="
        for idx, score in zip(top_results.indices, top_results.values):
            idx = idx.item()  # Convert tensor to int
            print(f"Score: {score:.2f}")
            print(f"Dialog: {transcript[idx]}")
            print(f"Hyperlink: {base_url}{int(start_times[idx])}s\n")

    except Exception as e:
        print(f"An error occurred: {e.__class__.__name__} - {str(e)}")

# Example usage
if __name__ == "__main__":
    from google.colab import drive
    drive.mount('/content/drive')  # Mount Google Drive

    video_id = 'boHg6quSZtI'  # Replace with your YouTube video ID
    folder_path = '/content/drive/My Drive/YT'  # Google Drive folder path
    search_term = input("Enter the dialog or part of speech to search: ").strip()
    search_transcript_with_semantic_search(video_id, search_term, folder_path)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Enter the dialog or part of speech to search: chandra babu naidu arrest
Found 5 match(es) for 'chandra babu naidu arrest':
Score: 0.19
Dialog: దుర్గం చెరువు ఫ్లై ఓవరా ఆడ బైక్ లో పోతూ
Hyperlink: https://www.youtube.com/watch?v=boHg6quSZtI&t=561s

Score: 0.19
Dialog: విల్ బి బిజీ ఇన్ అవర్ లైఫ్ ఇప్పుడు పుష్ప
Hyperlink: https://www.youtube.com/watch?v=boHg6quSZtI&t=169s

Score: 0.19
Dialog: నో వాట్ ఎవర్ ద ప్రోగ్రాం నేమ్ బట్ ఐ
Hyperlink: https://www.youtube.com/watch?v=boHg6quSZtI&t=111s

Score: 0.19
Dialog: బి బిజీ ఇన్ లైఫ్ లైక్ యు నో మనకు
Hyperlink: https://www.youtube.com/watch?v=boHg6quSZtI&t=187s

Score: 0.19
Dialog: ఫార్మింగ్ లేదు ఏం లేదు సీ ఎవరీ వన్ విల్
Hyperlink: https://www.youtube.com/watch?v=boHg6quSZtI&t=185s



V2 - SEARCH TRANSCRIPT WITH OPENAI

In [None]:
!pip install --upgrade openai tiktoken

from openai import OpenAI
import os
from typing import Optional, List
from google.colab import userdata
import tiktoken

class TeluguTranscriptSearcher:
    def __init__(self):
        """Initialize the TranscriptSearcher with OpenAI API credentials from Colab secrets."""
        api_key = userdata.get('OPENAI_API_KEY')
        if not api_key:
            raise ValueError("OPENAI_API_KEY not found in Colab secrets")

        self.client = OpenAI(api_key=api_key)
        self.encoding = tiktoken.encoding_for_model("gpt-4")
        # Per-chunk token budget; leaves room in GPT-4's 8k context for the prompt and the reply
        self.max_tokens = 6000

    @staticmethod
    def format_youtube_link(video_id: str, timestamp_seconds: int) -> str:
        """Generate a YouTube link with timestamp."""
        return f"https://youtube.com/watch?v={video_id}&t={timestamp_seconds}s"

    def count_tokens(self, text: str) -> int:
        """Count the number of tokens in a text string."""
        return len(self.encoding.encode(text))

    def chunk_transcript(self, transcript: str) -> List[str]:
        """Split transcript into chunks that fit within token limit."""
        chunks = []
        lines = transcript.split('\n')
        current_chunk = []
        current_tokens = 0

        for line in lines:
            line_tokens = self.count_tokens(line)

            if current_tokens + line_tokens > self.max_tokens:
                if current_chunk:
                    chunks.append('\n'.join(current_chunk))
                current_chunk = [line]
                current_tokens = line_tokens
            else:
                current_chunk.append(line)
                current_tokens += line_tokens

        if current_chunk:
            chunks.append('\n'.join(current_chunk))

        return chunks

    def read_transcript(self, file_path: str) -> Optional[str]:
        """Read transcript from a file."""
        if not os.path.exists(file_path):
            print(f"File not found: {file_path}")
            return None

        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                return file.read()
        except IOError as e:
            print(f"Error reading file: {str(e)}")
            return None

    def search_chunk(self, chunk: str, search_term: str, video_id: str, max_tokens: int = 1000) -> Optional[str]:
        """Search a single chunk of Telugu transcript."""
        try:
            messages = [
                {"role": "system", "content": """You are a bilingual assistant expert in Telugu and English.
Your task is to find relevant content in Telugu transcripts based on English queries.
When you find matches, provide the original Telugu text, its English translation, and a YouTube link with timestamp."""},
                {"role": "user", "content": f"""
The following is part of a Telugu video transcript. Find dialogs or parts most relevant to the English query.
Only return matches if they are highly relevant. If no relevant matches are found, return an empty string.

Transcript (Telugu):
{chunk}

Query (English):
{search_term}

For each match, return in this format:
[Timestamp in seconds]
Telugu: (original Telugu text)
English: (English translation)
Link: {self.format_youtube_link(video_id, "TIMESTAMP")} (replace TIMESTAMP with the actual seconds)

Note: Always include the timestamp in seconds and generate the full YouTube link.
"""}
            ]

            response = self.client.chat.completions.create(
                model="gpt-4",
                messages=messages,
                max_tokens=max_tokens,
                temperature=0.7
            )

            return response.choices[0].message.content.strip()

        except Exception as e:
            print(f"Error processing chunk: {str(e)}")
            return None

    def search_transcript(self,
                         video_id: str,
                         search_term: str,
                         folder_path: str) -> Optional[str]:
        """Search through a Telugu video transcript for relevant content."""
        try:
            # Read and chunk the transcript
            file_path = os.path.join(folder_path, f"transcript_{video_id}.txt")
            full_transcript = self.read_transcript(file_path)

            if not full_transcript:
                return None

            chunks = self.chunk_transcript(full_transcript)
            print(f"Split transcript into {len(chunks)} chunks")

            # Search each chunk
            all_results = []
            for i, chunk in enumerate(chunks, 1):
                print(f"Processing chunk {i}/{len(chunks)}...")
                result = self.search_chunk(chunk, search_term, video_id)
                if result and result.strip():
                    all_results.append(result)

            # Combine and format results
            if not all_results:
                return "No relevant dialogs found. Note: The search was performed on Telugu content using your English query."

            return "\n\n".join(all_results)

        except Exception as e:
            print(f"An error occurred: {e.__class__.__name__} - {str(e)}")
            return None

def main():
    """Main function to run the transcript search"""
    try:
        # Mount Google Drive
        from google.colab import drive
        drive.mount('/content/drive')

        # Initialize the searcher
        searcher = TeluguTranscriptSearcher()

        # Set up search parameters
        video_id = 'boHg6quSZtI'  # Replace with your YouTube video ID
        folder_path = '/content/drive/My Drive/YT'
        search_term = input("Enter your search term in English: ").strip()

        print("\nSearching Telugu transcript for relevant content...")
        # Perform search
        result = searcher.search_transcript(video_id, search_term, folder_path)

        # Print results
        if result:
            print("\nSearch Results:")
            print("-" * 50)
            print(result)

    except Exception as e:
        print(f"An error occurred in main: {str(e)}")

if __name__ == "__main__":
    main()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Enter your search term in English: what presenter agree with revanth reddy

Searching Telugu transcript for relevant content...
Split transcript into 4 chunks
Processing chunk 1/4...
Processing chunk 2/4...
Processing chunk 3/4...
Processing chunk 4/4...

Search Results:
--------------------------------------------------
No relevant matches found in the provided Telugu transcript for the English query.

259.519 - 5.281: రేవంత్ రెడ్డి అన్న మాటలతో నేను కొంతవరకు
262.72 - 5.84: ఏకీభవిస్తాను కొంతవరకు నేను విభేదిస్తాను

[Timestamp in seconds: 259.519 - 265.56]
Telugu: (రేవంత్ రెడ్డి అన్న మాటలతో నేను కొంతవరకు ఏకీభవిస్తాను కొంతవరకు నేను విభేదిస్తాను)
English: (I agree to some extent with what Revanth Reddy said, and I disagree to some extent)
Link: https://youtube.com/watch?v=boHg6quSZtI&t=259s

No relevant matches found in the provided Telugu transcript for the Engl
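
In the output above, chunks without a match come back as a "No relevant matches found" sentence rather than an empty string, so those sentences end up in the combined result. One possible way to handle this, sketched here under the assumption that every genuine match follows the requested format and contains a `Link:` line, is to keep only replies that include such a line. `keep_results_with_links` is a hypothetical helper, not part of `TeluguTranscriptSearcher`.

```python
from typing import List


def keep_results_with_links(chunk_replies: List[str]) -> List[str]:
    """Keep only replies that contain at least one 'Link: https://...' line."""
    kept = []
    for reply in chunk_replies:
        has_link = any(
            line.strip().startswith("Link: https://")
            for line in reply.splitlines()
        )
        if has_link:
            kept.append(reply.strip())
    return kept


# Sample replies modeled on the run above (the Telugu line is abbreviated here)
replies = [
    "No relevant matches found in the provided Telugu transcript for the English query.",
    "[Timestamp in seconds: 259.519 - 265.56]\n"
    "Telugu: (రేవంత్ రెడ్డి అన్న మాటలతో నేను కొంతవరకు)\n"
    "English: (I agree to some extent with what Revanth Reddy said, and I disagree to some extent)\n"
    "Link: https://youtube.com/watch?v=boHg6quSZtI&t=259s",
    "",
]
filtered = keep_results_with_links(replies)
print("\n\n".join(filtered))
```

A filter like this could be applied to `all_results` before joining them. It relies on the model following the requested output format, so it is a heuristic rather than a guarantee.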

In [None]:
from google.colab import drive
drive.flush_and_unmount()


In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive
