<a href="https://colab.research.google.com/github/VaishnaviThirumala07/Epoch/blob/main/app.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install gradio opencv-python transformers accelerate pillow git+https://github.com/openai/whisper.git

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-1bio_ulz
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-1bio_ulz
  Resolved https://github.com/openai/whisper.git to commit dd985ac4b90cafeef8712f2998d62c59c3e62d22
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting gradio
  Downloading gradio-5.29.1-py3-none-any.whl.metadata (16 kB)
Collecting aiofiles<25.0,>=22.0 (from gradio)
  Downloading aiofiles-24.1.0-py3-none-any.whl.metadata (10 kB)
Collecting fastapi<1.0,>=0.115.2 (from gradio)
  Downloading fastapi-0.115.12-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.5.0-py3-none-any.whl.metadata (3.0 kB)
Collecting gradio-client==1.10.1 (from gradio)
  Downloading gradio_client-1.10.1-
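
Because the install log above is cut off, it can help to confirm that the packages actually resolved before running the main cell. The following is a minimal sketch, not part of the original notebook: `missing_modules` is a hypothetical helper name, and the module list simply mirrors the top-level imports used in the next cell.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    # find_spec returns None when a top-level module is not installed
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level import names used by the main cell (assumed list, adjust as needed)
required = ["gradio", "cv2", "torch", "PIL", "whisper", "transformers"]
not_found = missing_modules(required)
if not_found:
    print("Missing modules:", ", ".join(not_found))
else:
    print("All required modules are importable.")
```

If anything is reported missing, re-running the `pip install` cell (or restarting the runtime) before continuing is usually the simplest fix.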

In [None]:
import gradio as gr
import cv2
import torch
from PIL import Image
import whisper
from transformers import (
    BlipProcessor, BlipForConditionalGeneration,
    T5Tokenizer, T5ForConditionalGeneration
)

# Load models
device = 'cuda' if torch.cuda.is_available() else 'cpu'

blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to(device)

t5_tokenizer = T5Tokenizer.from_pretrained("t5-small")
t5_model = T5ForConditionalGeneration.from_pretrained("t5-small").to(device)

whisper_model = whisper.load_model("base")

# Main processing function
def process_video(video_file):
    try:
        # gr.File may hand over a plain path string or an object exposing .name
        video_path = getattr(video_file, "name", video_file)

        # --- AUDIO TRANSCRIPTION ---
        audio_transcript = whisper_model.transcribe(video_path)['text']

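        # --- FRAME EXTRACTION (in-memory, one frame every 2 seconds) ---
        cap = cv2.VideoCapture(video_path)
        # Some containers report 0 FPS; assume 30 so the modulo below never divides by zero
        fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
        frame_interval = max(1, round(fps * 2))  # frames between samples (about 2 seconds)
        frames = []
        count = 0
        success, frame = cap.read()

        while success:
            if count % frame_interval == 0: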
                pil_img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                frames.append(pil_img)
            success, frame = cap.read()
            count += 1
        cap.release()

        if not frames:
            return "No frames extracted.", "No transcription.", "No summary.", "No highlights."

        # --- CAPTION FRAMES in batch ---
        inputs = blip_processor(images=frames, return_tensors="pt", padding=True).to(device)
        outputs = blip_model.generate(**inputs)
        captions = [blip_processor.decode(out, skip_special_tokens=True) for out in outputs]

        # --- COMBINE & SUMMARIZE ---
        combined = "summarize: " + " ".join(captions) + " " + audio_transcript
        t5_input = t5_tokenizer(combined, return_tensors="pt", max_length=512, truncation=True).to(device)
        summary_ids = t5_model.generate(t5_input["input_ids"], max_length=100)
        summary = t5_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

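        # --- HIGHLIGHTS from transcription ---
        # Note: t5-small is not instruction-tuned. With the "summarize: " prefix added below,
        # this prompt effectively produces a second summary of the transcript alone.
        highlight_prompt = "Extract 3 important moments from this content: " + audio_transcript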
        hl_input = t5_tokenizer("summarize: " + highlight_prompt, return_tensors="pt", max_length=512, truncation=True).to(device)
        hl_ids = t5_model.generate(hl_input["input_ids"], max_length=100)
        highlights = t5_tokenizer.decode(hl_ids[0], skip_special_tokens=True)

        return "\n".join(captions), audio_transcript, summary, highlights

    except Exception as e:
        return f"Error: {str(e)}", "", "", ""

# Gradio UI
gr.Interface(
    fn=process_video,
    inputs=gr.File(label="Upload MP4 Video", file_types=[".mp4"]),
    outputs=[
        gr.Textbox(label="All Captions"),
        gr.Textbox(label="Audio Transcript"),
        gr.Textbox(label="Summary"),
        gr.Textbox(label="Highlights"),
    ],
    title="Audio & Video Captioning, Summarization & Highlights",
    description="Captions visual frames (BLIP), transcribes audio (Whisper), summarizes and extracts highlights (T5)."
).launch()

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`



100%|████████████████████████████████████████| 139M/139M [00:00<00:00, 188MiB/s]


It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://6f92c31490d365cdd0.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
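
The frame-extraction step in `process_video` keeps roughly one frame every 2 seconds. The sampling rule can be isolated and checked without a video file. The sketch below is illustrative only: `sample_frame_indices` and its `fallback_fps` parameter are hypothetical names, and the 30 fps fallback is an assumption for files that report no frame rate.

```python
def sample_frame_indices(total_frames, fps, every_seconds=2.0, fallback_fps=30.0):
    """Return the frame indices to keep when sampling one frame per `every_seconds`."""
    if total_frames <= 0:
        return []
    # Use the fallback rate when the container reports a missing or invalid FPS
    effective_fps = fps if fps and fps > 0 else fallback_fps
    # Round to a whole number of frames so fractional rates such as 29.97 work
    step = max(1, round(effective_fps * every_seconds))
    return list(range(0, total_frames, step))


# A 10-second clip at 30 fps yields five sampled frames
print(sample_frame_indices(300, 30.0))
```

Working with an integer step avoids the floating-point modulo on fractional frame rates and makes the number of frames sent to BLIP easy to predict.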



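
The summarization step tokenizes with `max_length=512, truncation=True`, so anything past the first 512 tokens of the combined captions and transcript is dropped. One possible workaround is to split long text into chunks and summarize each one separately. This is a rough sketch under stated assumptions: `chunk_words` and `summarize_long` are hypothetical helpers, word count is only an approximation of T5 token count, and the 300-word default is an arbitrary conservative guess.

```python
def chunk_words(text, max_words=300):
    """Split text into consecutive chunks of at most `max_words` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_long(text, summarize_fn, max_words=300):
    """Summarize each chunk with `summarize_fn` and join the partial summaries."""
    # summarize_fn is supplied by the caller, e.g. a wrapper around the T5 model
    return " ".join(summarize_fn(chunk) for chunk in chunk_words(text, max_words))


# Stand-in summarizer for demonstration: keeps the first word of each chunk
print(summarize_long("a b c d e", lambda chunk: chunk.split()[0], max_words=2))
```

In the notebook, `summarize_fn` could wrap the existing T5 tokenize, generate, and decode calls, so each chunk stays within the model's input limit.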