
# YouTube Summarizer + RAG Prompts

Welcome! This notebook demonstrates a **GenAI-powered system** that can:
- **Summarize YouTube videos** using Gemini (gemini-1.5-flash-latest)
- **Ask questions** about the video using **Retrieval-Augmented Generation (RAG)**
- Store and retrieve content using **Chroma Vector DB** and **SentenceTransformer**

---

## Problem statement

**Problem:** YouTube videos are rich in information, but it's time-consuming to extract insights or answer questions.

**Goal:** Build an AI assistant that can:
- Automatically summarize any video
- Let users ask focused questions using retrieved context

**GenAI Solution:** We use Google Gemini for summarization and Q&A, enhanced with embedding-based retrieval.

---

## GenAI Capabilities Used

| Capability            | Description                                      |
|----------------------|--------------------------------------------------|
| Few-shot prompting   | Gemini is guided by examples for better output   |
| Structured output    | Summaries returned in structured JSON format     |
| Long context window  | Handles long video transcripts with chunking     |
| RAG (Retrieval-Augmented Generation) | Retrieves relevant transcript context for Q&A |

---

## Notebook Guide

1. **Setup & Installation** – Install libraries, configure Gemini, ChromaDB, SBERT
2. **Transcript Extraction** – Fetch and process YouTube transcripts
3. **Summarization (Gemini)** – Few-shot + chunked summaries in JSON
4. **Embedding & Storage** – Use SBERT + Chroma to store transcripts
5. **RAG Q&A** – Ask questions and get contextual answers using Gemini

---



## Install Required Libraries

### Code Explanation
Install the libraries for the Gemini API, the YouTube transcript API, ChromaDB, and Sentence Transformers.

In [1]:
pip install google-generativeai youtube-transcript-api chromadb sentence-transformers


Note: you may need to restart the kernel to use updated packages.


## Setup Chroma Vector DB and Embeddings

### Code Explanation
Import the secrets client and modules, configure the Chroma client, and load the embedding model.

In [2]:

# import secrets and modules
from kaggle_secrets import UserSecretsClient
import google.generativeai as genai
import chromadb
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer

# Step 1: Secure Gemini API key
GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")
genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel("gemini-1.5-flash-latest")

# Step 2: Chroma client setup
# chromadb.Client() creates an in-memory client, so stored data is lost when the kernel restarts
chroma_client = chromadb.Client()

collection = chroma_client.get_or_create_collection(name="youtube_transcripts")

# Step 3: Load embedding model (after verifying transformers and huggingface_hub versions)
embedder = SentenceTransformer("all-MiniLM-L6-v2")
print("Setup complete.")


Setup complete.


## Summarize Transcript with Gemini

### Code Explanation
This block defines the core functions that drive the summarization and retrieval workflow using GenAI (Gemini) and ChromaDB.

**extract_video_id(url)**

This function extracts the 11-character video ID from a standard YouTube URL using a regular expression, and returns None if the URL contains no match. The transcript API needs this ID rather than the full URL.
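
To see what the pattern accepts, the sketch below applies the same regular expression to a few URL shapes. The short-link and query-parameter URLs are illustrative variations on the video used later in this notebook. Note that the pattern does not check the host name, so any URL with an 11-character path segment would also match.

```python
import re

# Same pattern as in extract_video_id: an 11-character ID after "v=" or "/"
VIDEO_ID_PATTERN = re.compile(r"(?:v=|\/)([0-9A-Za-z_-]{11})")

def extract_video_id(url):
    match = VIDEO_ID_PATTERN.search(url)
    return match.group(1) if match else None

sample_urls = [
    "https://www.youtube.com/watch?v=u1wNu7zELjE",        # standard watch URL
    "https://youtu.be/u1wNu7zELjE",                       # short link
    "https://www.youtube.com/watch?v=u1wNu7zELjE&t=30s",  # extra query parameters
    "https://example.com/",                               # no ID present, returns None
]
for url in sample_urls:
    print(url, "->", extract_video_id(url))
```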

**get_transcript(video_id)**

This function fetches the transcript of a YouTube video using YouTubeTranscriptApi and joins the individual text entries into a single string. If no transcript is found for the video, it returns None instead of raising.
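
The joining step can be tried without network access. The sketch below assumes the entry shape that the transcript call used in this notebook returns (a list of dicts with text, start, and duration keys). The sample entries are made up, and `join_entries` is a hypothetical helper that also collapses line breaks, which the notebook's version does not do.

```python
# Made-up entries in the assumed shape: one dict per caption line
sample_entries = [
    {"text": "hello and welcome", "start": 0.0, "duration": 2.1},
    {"text": "to this video\nabout planning", "start": 2.1, "duration": 3.4},
    {"text": "[Music]", "start": 5.5, "duration": 1.0},
]

def join_entries(entries):
    """Join caption lines into one string, collapsing line breaks and repeated spaces."""
    joined = " ".join(entry["text"] for entry in entries)
    return " ".join(joined.split())

print(join_entries(sample_entries))
# -> hello and welcome to this video about planning [Music]
```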

**chunk_text(text, max_chars=3000)**

This function splits long transcripts into chunks of about 3,000 characters each (the max_chars default). It prefers to cut just after the last period within the limit so that sentences stay intact, and falls back to a fixed-width cut when no period is found. Chunking is needed because LLMs have context length limits and tend to perform better on well-formed input.
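
A small worked example makes the splitting rule concrete. The sketch below follows the same strategy with a tiny max_chars of 12, chosen only so the result is easy to check by hand. The sample sentence is made up.

```python
def chunk_text(text, max_chars=3000):
    """Split text into chunks, cutting after a period when one falls within the limit."""
    chunks = []
    while len(text) > max_chars:
        split_at = text.rfind(".", 0, max_chars)
        # Cut just after the last period in range, or hard-cut at max_chars if there is none
        end = split_at + 1 if split_at != -1 else max_chars
        chunks.append(text[:end].strip())
        text = text[end:].strip()
    if text:
        chunks.append(text)
    return chunks

sample = "One. Two three. Four five."
chunks = chunk_text(sample, max_chars=12)
print(chunks)
# -> ['One.', 'Two three.', 'Four five.']
```

Each chunk ends on a sentence boundary because every 12-character window in this sample contains a period.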

**summarize_text(transcript)**

This function:
- Uses few-shot prompting to guide Gemini with a sample summary format.
- Breaks the transcript into chunks and summarizes each one individually using Gemini.
- Waits 3 seconds between calls to respect API rate limits.
- Combines the partial summaries into one final structured JSON summary using another Gemini call.
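
The prompt-assembly step can be inspected on its own, without calling Gemini. The sketch below rebuilds the per-chunk prompt from the notebook's example pair. `build_chunk_prompt` is a hypothetical helper name, and the sample chunk is made up.

```python
import json

# Example pair taken from the notebook's few-shot example
EXAMPLE_TRANSCRIPT = (
    "This video discusses tips for effective time management, "
    "such as planning and prioritizing tasks."
)
EXAMPLE_SUMMARY = {
    "title": "Effective Time Management",
    "summary": "The video offers strategies for organizing time effectively.",
    "key_points": [
        "Use a planner to schedule your day.",
        "Prioritize high-impact tasks.",
        "Limit distractions and multitasking.",
    ],
}

def build_chunk_prompt(chunk):
    """Assemble the prompt: instruction, worked example, then the chunk to summarize."""
    return (
        "You are a helpful assistant. Summarize the transcript chunk below in JSON format.\n\n"
        "Example:\n"
        f"Transcript: {EXAMPLE_TRANSCRIPT}\n"
        f"Output: {json.dumps(EXAMPLE_SUMMARY, indent=2)}\n\n"
        "Transcript:\n"
        f"{chunk}\n"
    )

prompt = build_chunk_prompt("Today we cover three ways to speed up a Python loop.")
print(prompt)
```

Printing the prompt before sending it is a cheap way to confirm that the example output is valid JSON and that the chunk lands after the example.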

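**embed_and_store(video_id, transcript, summary)**

This function generates an embedding (vector representation) of the transcript using SentenceTransformer and stores it in ChromaDB under the video ID, with the summary as metadata. The whole transcript is encoded as a single vector. Since all-MiniLM-L6-v2 truncates long inputs (256 word pieces by default), that vector mostly reflects the opening of a long video. The stored record enables semantic search later during Q&A.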
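
Storing one record per chunk is a common variation on this step, although it is not what this notebook does. The sketch below only prepares the ids, documents, and metadata in plain Python. `build_chunk_records`, the "video_id-index" id scheme, and the chunk_index metadata key are all hypothetical, and the sample chunks are made up.

```python
def build_chunk_records(video_id, chunks):
    """Return parallel lists of ids, documents, and metadata, one entry per chunk."""
    ids = [f"{video_id}-{i}" for i in range(len(chunks))]
    metadatas = [{"video_id": video_id, "chunk_index": i} for i in range(len(chunks))]
    return ids, list(chunks), metadatas

ids, documents, metadatas = build_chunk_records(
    "u1wNu7zELjE",
    ["First part of the transcript.", "Second part of the transcript."],
)
# These lists line up with the ids=, documents=, and metadatas= arguments
# that the notebook already passes to collection.add
print(ids)
# -> ['u1wNu7zELjE-0', 'u1wNu7zELjE-1']
```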

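**retrieve_similar_context(query)**

This function performs a vector similarity search in ChromaDB. It embeds the user's question, retrieves the single closest stored document (n_results=1), and returns it as context for Gemini to answer the question. Because each video is stored as one document, the returned context is a full transcript rather than a short passage.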
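
What a similarity search does can be shown with a few toy vectors. The sketch below is a brute-force illustration in plain Python, not how Chroma works internally: Chroma uses its own index and configured distance metric, and cosine similarity is an assumption chosen here for clarity. The vectors and documents are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_document(query_vector, stored):
    """Return the document whose vector is most similar to the query, or "" if nothing is stored."""
    if not stored:
        return ""
    best = max(stored, key=lambda item: cosine_similarity(query_vector, item["vector"]))
    return best["document"]

stored = [
    {"document": "Transcript about time management", "vector": [0.9, 0.1, 0.0]},
    {"document": "Transcript about film making", "vector": [0.1, 0.8, 0.3]},
]
print(nearest_document([0.2, 0.9, 0.2], stored))
# -> Transcript about film making
```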

**answer_question_with_rag(query)**

This function runs a Retrieval-Augmented Generation (RAG) workflow. It:
- Retrieves relevant transcript context using the query.
- Prompts Gemini with that context and the question.
- Outputs a concise, context-grounded answer.
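
The control flow of this workflow can be exercised without ChromaDB or Gemini by passing the retrieval and generation steps in as functions. In the sketch below, `answer_with_rag`, `fake_retrieve`, and `fake_generate` are hypothetical stand-ins, and the context sentence is made up.

```python
def answer_with_rag(query, retrieve, generate):
    """Retrieve context for the query, build a grounded prompt, and return the model's answer."""
    context = retrieve(query)
    if not context:
        return "Sorry, no relevant video transcript found."
    prompt = (
        "You are answering a question using this transcript context:\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{query}\n\n"
        "Give a concise, helpful answer based only on the context."
    )
    return generate(prompt)

# Stand-ins so the flow runs without a vector store or an API key
def fake_retrieve(query):
    return "The speaker recommends planning each day the night before."

def fake_generate(prompt):
    return f"(model reply to a prompt of {len(prompt)} characters)"

print(answer_with_rag("What does the speaker recommend?", fake_retrieve, fake_generate))
print(answer_with_rag("Anything?", lambda q: "", fake_generate))
```

The second call shows the early return: when retrieval yields nothing, the model is never called.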


In [3]:
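# Helper functions for transcript extraction, summarization, storage, and retrieval

# Imports used by the helpers below
import json
import re
import time
from youtube_transcript_api import YouTubeTranscriptApi, NoTranscriptFound, TranscriptsDisabled

def extract_video_id(url):
    # Capture the 11-character ID that follows "v=" or a "/"
    match = re.search(r"(?:v=|\/)([0-9A-Za-z_-]{11})", url)
    return match.group(1) if match else None

def get_transcript(video_id):
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        return " ".join([entry['text'] for entry in transcript])
    except (NoTranscriptFound, TranscriptsDisabled):
        # No transcript in the requested language, or captions are disabled for this video
        return None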

def chunk_text(text, max_chars=3000):
    chunks = []
    while len(text) > max_chars:
        # Prefer to cut just after the last period within the limit
        split_at = text.rfind('.', 0, max_chars)
        # No period found: hard-cut at exactly max_chars so no chunk exceeds the limit
        end = split_at + 1 if split_at != -1 else max_chars
        chunks.append(text[:end].strip())
        text = text[end:].strip()
    if text:
        chunks.append(text)
    return chunks

def summarize_text(transcript):
    examples = [{
        "transcript": "This video discusses tips for effective time management, such as planning and prioritizing tasks.",
        "summary": {
            "title": "Effective Time Management",
            "summary": "The video offers strategies for organizing time effectively.",
            "key_points": [
                "Use a planner to schedule your day.",
                "Prioritize high-impact tasks.",
                "Limit distractions and multitasking."
            ]
        }
    }]
    chunks = chunk_text(transcript)
    partial_summaries = []

    for i, chunk in enumerate(chunks):
        print(f"Summarizing chunk {i + 1}/{len(chunks)}...")
        time.sleep(3)  # pause before each call to stay under the API rate limit
        prompt = f"""
You are a helpful assistant. Summarize the transcript chunk below in JSON format.

Example:
Transcript: {examples[0]['transcript']}
Output: {json.dumps(examples[0]['summary'], indent=2)}

Transcript:
{chunk}
"""
        response = model.generate_content(prompt)
        partial_summaries.append(response.text.strip())

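    # Join the partial summaries as text instead of interpolating a Python list repr
    joined_summaries = "\n\n".join(partial_summaries)
    final_prompt = f"""
Combine the following partial summaries into a single final JSON object with keys: title, summary, key_points.

Summaries:
{joined_summaries}
"""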
    final_response = model.generate_content(final_prompt)
    return final_response.text.strip()

def embed_and_store(video_id, transcript, summary):
    embedding = embedder.encode([transcript])[0]
    collection.add(
        documents=[transcript],
        embeddings=[embedding.tolist()],
        metadatas=[{"video_id": video_id, "summary": summary}],
        ids=[video_id]
    )

def retrieve_similar_context(query):
    embedding = embedder.encode([query])[0]
    results = collection.query(query_embeddings=[embedding.tolist()], n_results=1)
    # "documents" holds one list per query; that list is empty when nothing is stored
    docs = results.get("documents") or [[]]
    return docs[0][0] if docs[0] else ""

def answer_question_with_rag(query):
    context = retrieve_similar_context(query)
    if not context:
        return "Sorry, no relevant video transcript found."

    prompt = f"""
You are answering a question using this transcript context:

Context:
{context}

Question:
{query}

Give a concise, helpful answer based only on the context.
"""
    response = model.generate_content(prompt)
    return response.text.strip()

### Code Explanation
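The cells below import the required modules, then prompt for a YouTube URL, extract the video ID, fetch the transcript, and summarize it with Gemini. The summary is printed as formatted JSON when it parses and as raw text otherwise, and the transcript and summary are then stored in the vector DB.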

In [4]:
import google.generativeai as genai
from youtube_transcript_api import YouTubeTranscriptApi, NoTranscriptFound
from sentence_transformers import SentenceTransformer
import chromadb
import json
import re
import time

In [5]:
#Input the video URL

url = input("Paste a YouTube video URL: ")
video_id = extract_video_id(url)
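# Skip the transcript request when no video ID could be extracted from the URL
transcript = get_transcript(video_id) if video_id else None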

if transcript:
    print("Transcript retrieved. Generating summary...")
    summary = summarize_text(transcript)
    
    try:
        summary_json = json.loads(summary)
        print(json.dumps(summary_json, indent=2))
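    except json.JSONDecodeError:
        # Gemini may wrap the JSON in a Markdown code fence; print the raw text in that case
        print(summary)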
    
    embed_and_store(video_id, transcript, summary)
    print("\nTranscript + summary stored in vector DB!")
else:
    print("Transcript not available.")


Paste a YouTube video URL:  https://www.youtube.com/watch?v=u1wNu7zELjE


Transcript retrieved. Generating summary...
Summarizing chunk 1/17...
Summarizing chunk 2/17...
Summarizing chunk 3/17...
Summarizing chunk 4/17...
Summarizing chunk 5/17...
Summarizing chunk 6/17...
Summarizing chunk 7/17...
Summarizing chunk 8/17...
Summarizing chunk 9/17...
Summarizing chunk 10/17...
Summarizing chunk 11/17...
Summarizing chunk 12/17...
Summarizing chunk 13/17...
Summarizing chunk 14/17...
Summarizing chunk 15/17...
Summarizing chunk 16/17...
Summarizing chunk 17/17...
```json
{
  "title": "Gautham Vasudev Menon: A Comprehensive Overview of his Career and Films",
  "summary": "This compilation summarizes various interviews and discussions related to filmmaker Gautham Vasudev Menon, encompassing his career, filmmaking process, collaborations with actors like Simbu and Mammootty, the emotional impact of his films, and his personal life.  The summaries include discussions about specific films such as *Vinnaithaandi Varuvaayaa*, *Vettaiyaadu Vilaiyaadu*, *Vaaranam Aayir


Transcript + summary stored in vector DB!
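
The printed summary above begins with a Markdown json fence, so json.loads could not parse it and the raw text was printed instead. A more tolerant parser could strip the fence first. In the minimal sketch below, `parse_json_response` is a hypothetical helper, not part of the notebook or of the Gemini SDK, and the sample response is made up.

```python
import json

FENCE = "`" * 3  # three backticks, built this way so this example's own fence stays intact

def parse_json_response(text):
    """Parse model output as JSON, tolerating an optional Markdown code fence around it."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (the fence, optionally followed by "json")
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # Drop the closing fence if present
        if cleaned.rstrip().endswith(FENCE):
            cleaned = cleaned.rstrip()[: -len(FENCE)]
    return json.loads(cleaned)

raw = FENCE + 'json\n{"title": "Demo", "key_points": ["a", "b"]}\n' + FENCE
print(parse_json_response(raw)["title"])
# -> Demo
```

Unfenced JSON passes through unchanged, and text that is not JSON still raises json.JSONDecodeError, so the caller's fallback to printing the raw text keeps working.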
