<a href="https://colab.research.google.com/github/arjunsn-03/AI-youtube-summarizer/blob/main/AI_Youtube_Video_Summarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [8]:
!pip install -q pytube # To download YouTube videos
!pip install -q git+https://github.com/openai/whisper.git  # OpenAI Whisper ASR
!pip install -q moviepy # For audio extraction (reliable, easy to use)
!pip install -q yt-dlp
# Install the Google GenAI SDK
!pip install -q google-genai



### Audio Extraction from video

In [9]:
# Audio extraction with yt-dlp

import os
import subprocess

# --- USER INPUT ---
youtube_url = "https://www.youtube.com/watch?v=6M5VXKLf4D4" # Your video URL
output_audio_file = "extracted_audio.mp3"

# --- DOWNLOAD AND EXTRACTION WITH YT-DLP ---
print(f"Downloading and extracting audio from: {youtube_url}")

# Define yt-dlp options to extract the best quality audio and convert it to MP3
ydl_opts = [
    youtube_url,
    '-x',  # Command to extract audio only
    '--audio-format', 'mp3',
    '--audio-quality', '0',  # '0' means best possible quality
    '-o', output_audio_file, # Define the output filename
    '--force-overwrites'
]

try:
    # Execute the yt-dlp command using the subprocess module
    subprocess.run(['yt-dlp', *ydl_opts], check=True)
    print(f"✅ Audio extraction complete. File: {output_audio_file}")

except subprocess.CalledProcessError as e:
    print(f"🚨 Error during yt-dlp execution: {e}")
    print("This might indicate an issue with the video URL or network connection.")

Downloading and extracting audio from: https://www.youtube.com/watch?v=6M5VXKLf4D4
✅ Audio extraction complete. File: extracted_audio.mp3
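
Before moving on to transcription, it can be worth confirming that the download produced a usable file, since a missing or zero-byte file would otherwise only surface later as a confusing transcription error. The sketch below is one way to do that. `audio_file_ready` and the `min_bytes` threshold are illustrative assumptions, not part of the original notebook.

```python
import os

def audio_file_ready(path, min_bytes=1024):
    """Return True if `path` is an existing file of at least `min_bytes` bytes.

    `min_bytes` is an arbitrary illustrative threshold, not a yt-dlp guarantee.
    """
    return os.path.isfile(path) and os.path.getsize(path) >= min_bytes

# "extracted_audio.mp3" is the output filename used in the extraction cell
if audio_file_ready("extracted_audio.mp3"):
    print("Audio file looks usable.")
else:
    print("Audio file is missing or suspiciously small.")
```

A size check like this cannot tell whether the audio itself is valid; it only catches the most obvious failure modes early.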


### Step 2: High-Fidelity Transcription with Whisper
This step transcribes the audio with OpenAI's Whisper model, which returns the full text plus start and end timestamps for every segment.
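
Whisper's `transcribe` result includes a list of segments, each with `start` and `end` times (in seconds) and a `text` field, which is what makes the timestamps usable downstream. As a minimal sketch of working with that shape, the hypothetical helper `find_segment_at` below looks up which segment is being spoken at a given time. It is not part of the notebook, the sample values are illustrative, and it assumes the segments are sorted by start time.

```python
from bisect import bisect_right

def find_segment_at(segments, t):
    """Return the segment whose [start, end) span contains time `t` (seconds), or None.

    Assumes `segments` is sorted by 'start'.
    """
    starts = [seg["start"] for seg in segments]
    i = bisect_right(starts, t) - 1  # index of the last segment starting at or before t
    if i >= 0 and t < segments[i]["end"]:
        return segments[i]
    return None

# Toy data in the same shape as result["segments"]; values are illustrative
sample_segments = [
    {"start": 0.0, "end": 5.8, "text": " first segment"},
    {"start": 5.8, "end": 9.7, "text": " second segment"},
    {"start": 9.7, "end": 12.4, "text": " third segment"},
]

print(find_segment_at(sample_segments, 6.0))
```

The same lookup could be used to jump from a point in the video to the corresponding transcript text.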

### Transcription

In [10]:
import whisper
import torch
import json

# Check for GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load the model: 'base' is faster, 'medium' is slower but more accurate.
model_size = "medium"
print(f"Loading Whisper model: {model_size}...")
model = whisper.load_model(model_size, device=device)

# Run the transcription
print("Starting transcription...")
result = model.transcribe(
    output_audio_file,
    verbose=True,
    word_timestamps=False  # Keep segment-level timestamps for faster processing
)

# Extract the segments
transcript_segments = result["segments"]
full_text = result["text"]

print("\n--- FIRST 5 SEGMENTS ---")
for i in range(min(5, len(transcript_segments))):
    seg = transcript_segments[i]
    print(f"[{seg['start']:.2f} - {seg['end']:.2f}]: {seg['text'].strip()}")

print(f"\n✅ Transcription complete. Total segments: {len(transcript_segments)}")

Using device: cuda
Loading Whisper model: medium...


100%|█████████████████████████████████████| 1.42G/1.42G [00:36<00:00, 41.3MiB/s]


Starting transcription...
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:05.800]  Ever wondered how Google translates an entire web page to a different language in a matter of seconds?
[00:05.800 --> 00:09.700]  Or your phone gallery group's images based on their location?
[00:09.700 --> 00:12.400]  All of this is a product of deep learning.
[00:12.400 --> 00:15.300]  But what exactly is deep learning?
[00:15.300 --> 00:21.100]  Deep learning is a subset of machine learning, which in turn is a subset of artificial intelligence.
[00:21.100 --> 00:26.400]  Artificial intelligence is a technique that enables a machine to mimic human behavior.
[00:26.500 --> 00:31.600]  Machine learning is a technique to achieve AI through algorithms trained with data.
[00:31.600 --> 00:37.900]  And finally, deep learning is a type of machine learning inspired by the structure of the human brain.
[00:37.900 --> 00:4

### Step 3: Segmentation and Prompt Preparation
Long transcripts are hard for an LLM to summarize in detail in a single pass and can exceed its context window. We therefore group the transcript into smaller chunks (e.g., 5-minute segments) and ask the LLM to summarize each chunk separately.

### Segment the Transcript

In [23]:
# Group Whisper segments into larger chunks for the LLM
def segment_transcript(segments, max_chunk_duration=300):  # 300 seconds = 5 minutes
    chunks = []
    current_chunk = []

    if not segments:
        return chunks

    current_start = segments[0]['start']

    for seg in segments:
        # Time from the start of the current chunk to the end of this segment
        elapsed = seg['end'] - current_start

        if elapsed > max_chunk_duration and current_chunk:
            # Adding this segment would exceed the limit: finalize the current chunk
            chunks.append({
                'start': current_start,
                'end': current_chunk[-1]['end'],
                'text': " ".join(s['text'].strip() for s in current_chunk)
            })
            # Start a new chunk with this segment
            current_chunk = [seg]
            current_start = seg['start']
        else:
            current_chunk.append(seg)

    # Add the last chunk
    if current_chunk:
        chunks.append({
            'start': current_start,
            'end': current_chunk[-1]['end'],
            'text': " ".join(s['text'].strip() for s in current_chunk)
        })

    return chunks

# Run the segmentation on your Whisper output
llm_chunks = segment_transcript(transcript_segments, max_chunk_duration=300)

print(f"Video length: {transcript_segments[-1]['end'] / 60:.2f} minutes")
print(f"Divided into {len(llm_chunks)} LLM chunks.")
print(f"Example Chunk 1 Start: {llm_chunks[0]['start']:.2f}, End: {llm_chunks[0]['end']:.2f}")

Video length: 5.69 minutes
Divided into 2 LLM chunks.
Example Chunk 1 Start: 0.00, End: 296.50
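
To sanity-check the split before spending API calls on it, it can help to look at each chunk's duration and word count. A minimal sketch follows. `chunk_stats` is a hypothetical helper, and the toy chunks only mimic the shape returned by `segment_transcript`, with made-up timings and placeholder text.

```python
def chunk_stats(chunks, max_chunk_duration=300):
    """Return per-chunk duration (seconds), word count, and an over-limit flag."""
    stats = []
    for chunk in chunks:
        duration = chunk["end"] - chunk["start"]
        stats.append({
            "duration_s": duration,
            "words": len(chunk["text"].split()),
            "over_limit": duration > max_chunk_duration,
        })
    return stats

# Toy chunks with the same keys as segment_transcript's output; values are made up
toy_chunks = [
    {"start": 0.0, "end": 290.0, "text": "placeholder text for chunk one"},
    {"start": 290.0, "end": 340.0, "text": "placeholder text for chunk two"},
]

for i, s in enumerate(chunk_stats(toy_chunks), start=1):
    print(f"Chunk {i}: {s['duration_s']:.1f}s, {s['words']} words, over limit: {s['over_limit']}")
```

With the grouping logic above, a chunk should only exceed the limit when a single Whisper segment is itself longer than `max_chunk_duration`, so an `over_limit` flag is mostly a guard against unusual input. The final chunk can be much shorter than the others.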


### Step 4: Summarization with Gemini (The AI Core)
We use the Gemini API to produce an abstractive summary of each chunk. Set your API key in the `API_KEY` variable before running the summarization cell.
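
Pasting the key directly into a notebook makes it easy to leak when the notebook is shared. One common alternative is to read it from an environment variable. The sketch below assumes the key is stored in a variable named `GEMINI_API_KEY`; both that name and the `load_api_key` helper are assumptions for illustration, not part of the original notebook.

```python
import os

def load_api_key(env_var="GEMINI_API_KEY"):
    """Read the API key from an environment variable and fail early if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this cell.")
    return key

# Possible usage in the summarization cell (commented out so this block runs without a key):
# API_KEY = load_api_key()
```

Failing early with a clear message tends to be easier to debug than an authentication error raised from inside the first API call.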

### LLM Summarization

In [24]:
# UNINSTALL the old, deprecated SDK
!pip uninstall -y google-generativeai

# INSTALL the new, official, actively maintained SDK
!pip install -q -U google-genai

In [32]:
from google import genai
from google.genai import types

API_KEY = "<Your API Key>"
client = genai.Client(api_key=API_KEY)

SYSTEM_PROMPT = (
    "You are an expert archivist generating a detailed, continuous text description "
    "of a video's content based on its transcript. Your summary must be abstractive, "
    "highly detailed, and strictly confined to the time segment provided. "
    "Focus on key ideas, topics, and sequence of events. Do NOT use markdown headings or bullet points."
)

final_summary_data = []

for i, chunk in enumerate(llm_chunks):
    start_time_min = chunk['start'] / 60
    end_time_min = chunk['end'] / 60
    print(f"Processing Chunk {i+1}/{len(llm_chunks)}: [{start_time_min:.2f}m - {end_time_min:.2f}m]...")

    user_message = (
        f"The video segment runs from minute {start_time_min:.2f} to minute {end_time_min:.2f}. "
        f"Generate a detailed narrative description of this segment based on the following transcript:\n\n---\n{chunk['text']}"
    )

    try:
        # Single-turn request; the system prompt is passed as a system instruction
        response = client.models.generate_content(
            model="gemini-2.5-flash",
            contents=user_message,
            config=types.GenerateContentConfig(system_instruction=SYSTEM_PROMPT)
        )
        # response.text can be None, for example when the response is blocked
        if not response.text:
            raise ValueError("the model returned no text")
        summary_text = response.text.strip()

        final_summary_data.append({
            'start_s': chunk['start'],
            'end_s': chunk['end'],
            'start_m': start_time_min,
            'end_m': end_time_min,
            'description': summary_text
        })

        print(f"Chunk {i+1} summarized. (Output length: {len(summary_text)} chars)")

    except Exception as e:
        print(f"🚨 Error processing chunk {i+1}: {e}")
        final_summary_data.append({
            'start_s': chunk['start'],
            'end_s': chunk['end'],
            'start_m': start_time_min,
            'end_m': end_time_min,
            'description': "ERROR: Could not generate summary for this segment."
        })

print("✅ LLM Summarization complete.")


Processing Chunk 1/2: [0.00m - 4.94m]...
Chunk 1 summarized. (Output length: 3588 chars)
Processing Chunk 2/2: [4.94m - 5.69m]...
Chunk 2 summarized. (Output length: 987 chars)
✅ LLM Summarization complete.
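
The loop above records an error placeholder as soon as a request fails. For transient failures such as rate limits or network hiccups, one common alternative is to retry a few times with exponential backoff first. The sketch below is generic and does not depend on the Gemini SDK. `call_with_retries` and its parameters are hypothetical names, not part of the notebook.

```python
import time

def call_with_retries(fn, attempts=3, base_delay=1.0):
    """Call `fn()` up to `attempts` times, doubling the delay after each failure.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
```

To use it, wrap the API request in a zero-argument function or lambda and pass that as `fn`. Note that this sketch retries on any exception; in practice you would probably want to retry only on errors that are likely to be transient.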


### Step 5: Final Output and Archival
This step stitches all the individual LLM summaries together into a final, coherent document.

### Final Output Generation

In [34]:
import datetime
# --- FUNCTION TO FORMAT TIME ---
def format_time_hms(seconds):
    hours, remainder = divmod(int(seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

# --- GENERATE FINAL TEXT REPORT ---
final_report_lines = [
    "VIDEO FOOTAGE DETAILED TEXT ARCHIVE",
    f"Source: {youtube_url}",
    f"Generated on: {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
    "-" * 60,
    "\n"
]
for item in final_summary_data:
    start_time_str = format_time_hms(item['start_s'])
    end_time_str = format_time_hms(item['end_s'])

    # Format the detailed, time-stamped description
    summary_line = f"[{start_time_str} - {end_time_str}]: {item['description']}\n"
    final_report_lines.append(summary_line)

final_report_text = "\n".join(final_report_lines)

# --- DISPLAY AND SAVE ---
print("\n" + "=" * 60)
print("             FINAL DETAILED TEXT SUMMARY")
print("=" * 60)
print(final_report_text)
print("=" * 60)

# Save the report to a small text file
archive_filename = "video_summary_archive.txt"
with open(archive_filename, "w", encoding="utf-8") as f:
    f.write(final_report_text)

# Download the file from Colab
from google.colab import files
files.download(archive_filename)

print(f"\n✅ SUCCESS! Your detailed text archive file ({archive_filename}) is ready and downloaded.")


             FINAL DETAILED TEXT SUMMARY
VIDEO FOOTAGE DETAILED TEXT ARCHIVE
Source: https://www.youtube.com/watch?v=6M5VXKLf4D4
Generated on: 2025-10-15 18:27:46
------------------------------------------------------------


[00:00:00 - 00:04:56]: The video segment commences by illustrating the practical applications of deep learning through common examples such as Google's real-time web page translation and phone galleries organizing images by location. It then systematically defines deep learning, positioning it as a subset of machine learning, which itself is a subset of artificial intelligence. Artificial intelligence is characterized as a technique enabling machines to mimic human behavior, while machine learning is a method to achieve AI by training algorithms with data. Deep learning, specifically, is described as a type of machine learning inspired by the human brain's structure, termed an artificial neural network.

A detailed comparison between deep learning and machine lea


✅ SUCCESS! Your detailed text archive file (video_summary_archive.txt) is ready and downloaded.
