# Video transcription with OpenAI Whisper

This notebook will:
- Install and set up OpenAI Whisper
- Configure two local video files
- Transcribe them with timestamps
- Save the timestamped transcript segments to a JSONL file for later use (e.g. in a vector or graph DB)

Whisper docs: [openai/whisper GitHub repo](https://github.com/openai/whisper)



In [2]:
# Install Whisper (run this once in your environment)
# If you are in a virtualenv, make sure it is activated before running this.

%pip install -U openai-whisper

# On macOS, Whisper also needs ffmpeg installed at the system level.
# If you don't have it yet, install via Homebrew in a terminal (NOT in this cell):
#   brew install ffmpeg



Collecting openai-whisper
  Downloading openai_whisper-20250625.tar.gz (803 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting more-itertools (from openai-whisper)
  Using cached more_itertools-10.8.0-py3-none-any.whl.metadata (39 kB)
Collecting numba (from openai-whisper)
  Downloading numba-0.62.1-cp313-cp313-macosx_12_0_arm64.whl.metadata (2.8 kB)
Collecting numpy (from openai-whisper)
  Downloading numpy-2.3.5-cp313-cp313-macosx_14_0_arm64.whl.metadata (62 kB)
Collecting tiktoken (from openai-whisper)
  Downloading tiktoken-0.12.0-cp313-cp313-macosx_11_0_arm64.whl.metadata (6.7 kB)
Collecting torch (from openai-whisper)
  Downloading torch-2.9.1-cp313-none-macosx_11_0_arm64.whl.metadata (30 kB)
Collecting tqdm (from openai-wh

In [17]:
import shutil
from pathlib import Path

import torch
import whisper

# Choose the Whisper model size.
# Options include: "tiny", "base", "small", "medium", "large", "turbo".
# See model table in the Whisper README: https://github.com/openai/whisper
MODEL_NAME = "large"  # adjust if you want faster ("tiny") or more accurate ("medium"/"large")

# Device selection
# NOTE: Whisper on Apple Silicon (MPS) can sometimes produce NaNs.
# To keep things stable, we force CPU here. Set FORCE_DEVICE to None to auto-detect.
FORCE_DEVICE = "cpu"  # options: "cpu", "cuda", "mps", or None for auto

if FORCE_DEVICE is not None:
    DEVICE = FORCE_DEVICE
elif torch.cuda.is_available():
    DEVICE = "cuda"
elif getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available():
    DEVICE = "mps"  # Apple Silicon GPU
else:
    DEVICE = "cpu"

print(f"Using device: {DEVICE}")

# Check that ffmpeg is available on the system
ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path is None:
    print("ffmpeg was NOT found on PATH. Please install it, e.g. with: brew install ffmpeg")
else:
    print(f"ffmpeg found at: {ffmpeg_path}")

print(f"Loading Whisper model: {MODEL_NAME} on {DEVICE} ...")
model = whisper.load_model(MODEL_NAME, device=DEVICE)
print("Model loaded.")



Using device: cpu
ffmpeg found at: /opt/homebrew/bin/ffmpeg
Loading Whisper model: large on cpu ...
Model loaded.


In [18]:
# Configure paths to your two local video files.
# Edit these strings to point to the actual files on your Mac.

video1_path = Path("/Users/adi/digdir-video-ai/videos/video1.mp4")
video2_path = Path("/Users/adi/digdir-video-ai/videos/video2.mp4")

videos = [
    {"id": "video1", "path": video1_path},
    {"id": "video2", "path": video2_path},
]

for v in videos:
    print(f"{v['id']}: exists={v['path'].exists()} -> {v['path']}")



video1: exists=True -> /Users/adi/digdir-video-ai/videos/video1.mp4
video2: exists=True -> /Users/adi/digdir-video-ai/videos/video2.mp4


In [19]:
import json
from typing import Optional, Dict, Any


def transcribe_video(
    video_path: Path,
    *,
    language: Optional[str] = None,
    model: "whisper.Whisper" = model,
) -> Dict[str, Any]:
    """Transcribe a single video file with Whisper, keeping segment timestamps.

    Returns the full Whisper result dict, which includes:
    - "text": full transcript
    - "segments": list of {id, start, end, text, ...}
    """
    print(f"Transcribing: {video_path}")
    result = model.transcribe(
        str(video_path),
        language=language,  # set e.g. "en" if you know it's English, or leave None for auto-detect
        verbose=False,
        word_timestamps=False,  # set to True if you want per-word timestamps (requires newer Whisper)
    )
    print(f"Transcription finished for {video_path}")
    return result
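
If `word_timestamps=True` is passed to `model.transcribe`, each segment in the result also carries a `"words"` list whose entries have `"word"`, `"start"` and `"end"` keys. The sketch below shows one way to flatten those entries into a single list. `flatten_words` is a hypothetical helper that is not part of this notebook, and `sample_result` is a hand-written stand-in for a real Whisper result.

```python
from typing import Any, Dict, List


def flatten_words(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten per-word timestamps from a Whisper result into one list.

    Segments without a "words" key (word_timestamps=False) contribute nothing.
    """
    words: List[Dict[str, Any]] = []
    for seg in result.get("segments", []):
        for w in seg.get("words", []):
            words.append(
                {
                    "segment_id": seg.get("id"),
                    "word": w["word"].strip(),  # Whisper words carry a leading space
                    "start": w["start"],
                    "end": w["end"],
                }
            )
    return words


# Hand-written stand-in for a Whisper result (illustrative values only).
sample_result = {
    "segments": [
        {
            "id": 0,
            "words": [
                {"word": " Vi", "start": 0.0, "end": 0.4},
                {"word": " prøver", "start": 0.4, "end": 0.9},
            ],
        },
        {"id": 1},  # segment without word-level data
    ]
}

flat_words = flatten_words(sample_result)
print(flat_words)
```

With a real transcription you would call `flatten_words(transcribe_video(path))` after switching `word_timestamps` to `True` in `transcribe_video`.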



In [20]:
# Transcribe both videos and save transcripts with segment-level timelines to JSONL.

output_dir = Path("transcripts")
output_dir.mkdir(exist_ok=True)

jsonl_path = output_dir / "transcripts_segments.jsonl"
all_segments = []

for v in videos:
    result = transcribe_video(v["path"])
    segments = result.get("segments", [])
    for seg in segments:
        record = {
            "video_id": v["id"],
            "video_path": str(v["path"]),
            "segment_id": seg.get("id"),
            "start": seg.get("start"),   # seconds from start of video
            "end": seg.get("end"),       # seconds from start of video
            "text": seg.get("text"),
        }
        all_segments.append(record)

with jsonl_path.open("w", encoding="utf-8") as f:
    for rec in all_segments:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

print(f"Saved {len(all_segments)} segments to {jsonl_path.resolve()}")



Transcribing: /Users/adi/digdir-video-ai/videos/video1.mp4




Detected language: Norwegian


100%|██████████| 372268/372268 [20:08<00:00, 308.00frames/s]


Transcription finished for /Users/adi/digdir-video-ai/videos/video1.mp4
Transcribing: /Users/adi/digdir-video-ai/videos/video2.mp4
Detected language: Norwegian


100%|██████████| 370700/370700 [15:30<00:00, 398.46frames/s]

Transcription finished for /Users/adi/digdir-video-ai/videos/video2.mp4
Saved 1660 segments to /Users/adi/digdir-video-ai/transcripts/transcripts_segments.jsonl
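
The `start` and `end` fields in the JSONL file are plain seconds from the start of the video. For display it can help to render them as `HH:MM:SS`. A minimal sketch follows; `format_timestamp` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds from the start of a video as HH:MM:SS."""
    total = int(seconds)  # drop the fractional part
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


print(format_timestamp(3057.1))  # 00:50:57
```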





In [21]:
# Install LightRAG (run this once)
# This uses the official LightRAG package from https://github.com/HKUDS/LightRAG

%pip install "lightrag-hku[api]"



Collecting lightrag-hku[api]
  Downloading lightrag_hku-1.4.9.8-py3-none-any.whl.metadata (83 kB)
Collecting aiohttp (from lightrag-hku[api])
  Downloading aiohttp-3.13.2-cp313-cp313-macosx_11_0_arm64.whl.metadata (8.1 kB)
Collecting configparser (from lightrag-hku[api])
  Downloading configparser-7.2.0-py3-none-any.whl.metadata (5.5 kB)
Collecting future (from lightrag-hku[api])
  Downloading future-1.0.0-py3-none-any.whl.metadata (4.0 kB)
Collecting google-genai<2.0.0,>=1.0.0 (from lightrag-hku[api])
  Downloading google_genai-1.52.0-py3-none-any.whl.metadata (46 kB)
Collecting json_repair (from lightrag-hku[api])
  Downloading json_repair-0.54.1-py3-none-any.whl.metadata (12 kB)
Collecting nano-vectordb (from lightrag-hku[api])
  Downloading nano_vectordb-0.0.4.3-py3-none-any.whl.metadata (3.7 kB)
Collecting pandas<2.4.0,>=2.0.0 (from lightrag-hku[api])
  Downloading pandas-2.3.3-cp313-cp313-macosx_11_0_arm64.whl.metadata (91 kB)
Collecting pipmaster (from lightrag-hku[api])
  Downl

In [6]:
%pip install python-dotenv
%pip install psycopg2-binary neo4j
from dotenv import load_dotenv
load_dotenv()

Note: you may need to restart the kernel to use updated packages.
Collecting psycopg2-binary
  Downloading psycopg2_binary-2.9.11-cp313-cp313-macosx_11_0_arm64.whl.metadata (4.9 kB)
Collecting neo4j
  Downloading neo4j-6.0.3-py3-none-any.whl.metadata (5.2 kB)
Downloading psycopg2_binary-2.9.11-cp313-cp313-macosx_11_0_arm64.whl (3.9 MB)
Downloading neo4j-6.0.3-py3-none-any.whl (325 kB)
Installing collected packages: psycopg2-binary, neo4j
Successfully installed neo4j-6.0.3 psycopg2-binary-2.9.11
Note: you may need to restart the kernel to use updated packages.


True

In [12]:
# Configure LightRAG and external store connection settings

import os
import asyncio

from lightrag import LightRAG, QueryParam
from lightrag.llm.openai import gpt_4o_mini_complete, openai_embed
from lightrag.kg.shared_storage import initialize_pipeline_status


LIGHTRAG_WORK_DIR = "./lightrag_store"
os.makedirs(LIGHTRAG_WORK_DIR, exist_ok=True)

# OpenAI API key (you will set this in your shell, not in the notebook)
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    print("WARNING: OPENAI_API_KEY is not set. Set it before using LightRAG.")
else:
    print("OPENAI_API_KEY is set.")

# Postgres and Neo4j connection details (you will configure these via Docker)
POSTGRES_DSN = os.getenv("POSTGRES_DSN")  # e.g. postgresql://user:pass@localhost:5432/dbname
NEO4J_URI = os.getenv("NEO4J_URI")        # e.g. bolt://localhost:7687
NEO4J_USER = os.getenv("NEO4J_USER")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD")

print(f"POSTGRES_DSN set: {bool(POSTGRES_DSN)}")
print(f"NEO4J_URI set: {bool(NEO4J_URI)}")


async def init_lightrag() -> LightRAG:
    """Initialize a LightRAG instance backed by the local workspace directory."""
    rag = LightRAG(
        working_dir=LIGHTRAG_WORK_DIR,
        embedding_func=openai_embed,
        llm_model_func=gpt_4o_mini_complete,
    )
    await rag.initialize_storages()
    # Ensure pipeline_status namespace is initialized so ainsert/aquery work
    await initialize_pipeline_status()
    return rag


rag = await init_lightrag()
print("LightRAG initialized.")



INFO: [_] Created new empty graph file: ./lightrag_store/graph_chunk_entity_relation.graphml
INFO:nano-vectordb:Init {'embedding_dim': 1536, 'metric': 'cosine', 'storage_file': './lightrag_store/vdb_entities.json'} 0 data
INFO:nano-vectordb:Init {'embedding_dim': 1536, 'metric': 'cosine', 'storage_file': './lightrag_store/vdb_relationships.json'} 0 data
INFO:nano-vectordb:Init {'embedding_dim': 1536, 'metric': 'cosine', 'storage_file': './lightrag_store/vdb_chunks.json'} 0 data


OPENAI_API_KEY is set.
POSTGRES_DSN set: True
NEO4J_URI set: True
LightRAG initialized.


In [13]:
# Optional: quick connectivity checks for Postgres and Neo4j
# Requires `psycopg2` (for Postgres) and `neo4j` Python driver to be installed.

import importlib


def check_postgres():
    if not POSTGRES_DSN:
        print("POSTGRES_DSN not set; skipping Postgres check.")
        return
    try:
        psycopg2 = importlib.import_module("psycopg2")
    except ImportError:
        print("psycopg2 is not installed; skipping Postgres check.")
        return

    try:
        with psycopg2.connect(POSTGRES_DSN) as conn:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
        print("Postgres connectivity OK.")
    except Exception as e:
        print("Postgres connectivity failed:", e)


def check_neo4j():
    if not (NEO4J_URI and NEO4J_USER and NEO4J_PASSWORD):
        print("NEO4J_* env vars not fully set; skipping Neo4j check.")
        return
    try:
        from neo4j import GraphDatabase
    except ImportError:
        print("neo4j Python driver is not installed; skipping Neo4j check.")
        return

    try:
        # Use the driver as a context manager so its connections are always closed.
        with GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD)) as driver:
            with driver.session() as session:
                session.run("RETURN 1").single()
        print("Neo4j connectivity OK.")
    except Exception as e:
        print("Neo4j connectivity failed:", e)


check_postgres()
check_neo4j()



Postgres connectivity OK.
Neo4j connectivity OK.


In [14]:
# Build LightRAG-ready contexts from the timestamped transcript segments

import json
from pathlib import Path

transcript_jsonl_path = Path("transcripts/transcripts_segments.jsonl")

contexts: list[str] = []
contexts_meta: list[dict] = []

if not transcript_jsonl_path.exists():
    raise FileNotFoundError(f"Transcript file not found: {transcript_jsonl_path}")

with transcript_jsonl_path.open("r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        text = (rec.get("text") or "").strip()
        if not text:
            continue

        meta = {
            "video_id": rec.get("video_id"),
            "video_path": rec.get("video_path"),
            "segment_id": rec.get("segment_id"),
            "start": rec.get("start"),
            "end": rec.get("end"),
        }

        header = (
            f"[video_id={meta['video_id']};start={meta['start']};"
            f"end={meta['end']};segment_id={meta['segment_id']}] "
        )
        context = header + text

        meta["text"] = text
        meta["context"] = context

        contexts.append(context)
        contexts_meta.append(meta)

print(f"Loaded {len(contexts)} contexts from {transcript_jsonl_path}.")
if contexts:
    print("Example context:\n", contexts[0][:300], "...")



Loaded 1660 contexts from transcripts/transcripts_segments.jsonl.
Example context:
 [video_id=video1;start=0.0;end=29.98;segment_id=0] Vi prøver å ta vare på disse opptakene og hva som er viktig å gjøre. ...


In [15]:
# Insert the prepared contexts into LightRAG

if not contexts:
    raise ValueError("No contexts loaded; run the previous cell first.")

print(f"Inserting {len(contexts)} contexts into LightRAG (this may take a while)...")

# We join the contexts into a single large document so LightRAG can re-chunk as needed.
joined_contexts = "\n\n".join(contexts)

await rag.ainsert(joined_contexts)

print("LightRAG insertion complete.")



INFO: Processing 1 document(s)
INFO: Extracting stage 1/1: unknown_source
INFO: Processing d-id: doc-879378d49c3aba1929a7b43d4e9914f9
INFO: Embedding func: 8 new workers initialized (Timeouts: Func: 30s, Worker: 60s, Health Check: 75s)


Inserting 1660 contexts into LightRAG (this may take a while)...


INFO: LLM func: 4 new workers initialized (Timeouts: Func: 180s, Worker: 360s, Health Check: 375s)
INFO:  == LLM cache == saving: default:extract:00ee9fc772ea4f26218812a081da2a65
INFO:  == LLM cache == saving: default:extract:6d1d8bdabb0fcb65f1e786315656b367
INFO: Chunk 1 of 62 extracted 5 Ent + 4 Rel chunk-6619ecb638d4499a9c23b6b8bd388a40
INFO:  == LLM cache == saving: default:extract:c04e39a2793987fc7c0aa15d0e6e9a34
INFO:  == LLM cache == saving: default:extract:6bc700e186f181f0d35ecf9863168a69
INFO:  == LLM cache == saving: default:extract:b4d7ffe5c7e88c108308e358b0f5ab59
INFO: Chunk 2 of 62 extracted 7 Ent + 6 Rel chunk-b340c913e3e67b8dc1a8277ea9eec1e3
INFO:  == LLM cache == saving: default:extract:9e1f3a39edd1b61c24326632c6c05016
INFO:  == LLM cache == saving: default:extract:e5b96caf9c09438da51842769c357ae4
INFO: Chunk 3 of 62 extracted 11 Ent + 8 Rel chunk-5d4cb93be36b654c9f9a556c667e0916
INFO:  == LLM cache == saving: default:extract:6856688d894cdfc6f46d183a919096cc
INFO: Chunk

LightRAG insertion complete.
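
Because LightRAG re-chunks the joined document, the only link from a chunk back to a video position is the bracketed header that was prepended to every segment. The sketch below recovers that metadata from arbitrary text with a regular expression. `HEADER_RE` and `parse_headers` are hypothetical names, and the pattern assumes `start` and `end` are always numeric (a header containing `None` would not match).

```python
import re
from typing import Dict, List, Union

# Matches headers such as "[video_id=video1;start=0.0;end=29.98;segment_id=0]".
HEADER_RE = re.compile(
    r"\[video_id=(?P<video_id>[^;\]]+);"
    r"start=(?P<start>[0-9.]+);"
    r"end=(?P<end>[0-9.]+);"
    r"segment_id=(?P<segment_id>\d+)\]"
)


def parse_headers(chunk_text: str) -> List[Dict[str, Union[str, float, int]]]:
    """Return the metadata of every segment header found in chunk_text."""
    found: List[Dict[str, Union[str, float, int]]] = []
    for m in HEADER_RE.finditer(chunk_text):
        found.append(
            {
                "video_id": m.group("video_id"),
                "start": float(m.group("start")),
                "end": float(m.group("end")),
                "segment_id": int(m.group("segment_id")),
            }
        )
    return found


example_chunk = (
    "[video_id=video1;start=0.0;end=29.98;segment_id=0] "
    "Vi prøver å ta vare på disse opptakene og hva som er viktig å gjøre."
)
parsed_headers = parse_headers(example_chunk)
print(parsed_headers)
```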


In [16]:
# Semantic search helper: find relevant segments and return video/timestamp metadata

import numpy as np

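# Base URL for building clickable links; adjust to your eventual video player.
VIDEO_BASE_URL = os.getenv("VIDEO_BASE_URL")
if not VIDEO_BASE_URL:
    print("WARNING: VIDEO_BASE_URL is not set; generated URLs will start with 'None/'.")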

# Pre-compute embeddings for all contexts using the same embedding function as LightRAG
print("Computing embeddings for contexts (one-time)...")
context_embeddings = await rag.embedding_func(contexts)
context_embeddings = np.array(context_embeddings, dtype="float32")
context_norms = np.linalg.norm(context_embeddings, axis=1) + 1e-8


async def search_segments_async(query: str, top_k: int = 5):
    """Semantic search over transcript segments.

    Returns a list of dicts with video_id, start, end, text, score, and a URL
    you can later hook up to a player.
    """

    query_emb = await rag.embedding_func([query])
    query_emb = np.array(query_emb, dtype="float32")[0]
    query_norm = np.linalg.norm(query_emb) + 1e-8

    # Cosine similarity between query and all context embeddings
    sims = (context_embeddings @ query_emb) / (context_norms * query_norm)
    top_idx = np.argsort(-sims)[:top_k]

    results = []
    for idx in top_idx:
        meta = contexts_meta[int(idx)].copy()
        meta["score"] = float(sims[idx])
        # Build a simple URL that encodes video + start time
        meta["url"] = f"{VIDEO_BASE_URL}/{meta['video_id']}?t={int(meta['start'])}"
        results.append(meta)

    # Pretty-print results for quick inspection
    for r in results:
        print(
            f"- video_id={r['video_id']} start={r['start']:.2f}s end={r['end']:.2f}s score={r['score']:.3f}"
        )
        print(f"  url: {r['url']}")
        print(f"  text: {r['text'][:200]}...")
        print()

    return results


# Example usage in a notebook cell:
#   await search_segments_async("data sharing platform", top_k=5)



Computing embeddings for contexts (one-time)...
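
The cell above sends all contexts to the embedding function in a single call. If that turns out to be too large for one request, the texts can be embedded in fixed-size batches instead. This is a sketch under assumptions: `embed_in_batches` is a hypothetical helper, the batch size of 64 is arbitrary, and `fake_embed` only stands in for an async embedding function that takes a list of strings and returns one vector per string.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

EmbedFunc = Callable[[List[str]], Awaitable[Sequence[Sequence[float]]]]


async def embed_in_batches(
    embed: EmbedFunc, texts: List[str], batch_size: int = 64
) -> List[List[float]]:
    """Embed texts in fixed-size batches, preserving input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors: List[List[float]] = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        batch_vectors = await embed(batch)
        vectors.extend(list(v) for v in batch_vectors)
    return vectors


# Stand-in embedder for demonstration only: one number per text (its length).
async def fake_embed(batch: List[str]) -> List[List[float]]:
    return [[float(len(t))] for t in batch]


demo_vectors = asyncio.run(
    embed_in_batches(fake_embed, ["a", "bb", "ccc"], batch_size=2)
)
print(demo_vectors)  # [[1.0], [2.0], [3.0]]
```

Inside a notebook, where an event loop is already running, use `await embed_in_batches(...)` instead of `asyncio.run(...)`.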


In [17]:
# Export chunks + metadata for external vector/graph stores (PostgreSQL, Neo4j, etc.)

from pathlib import Path

output_dir = Path("transcripts")
output_dir.mkdir(exist_ok=True)

export_jsonl_path = output_dir / "chunks_for_db.jsonl"

records_written = 0
with export_jsonl_path.open("w", encoding="utf-8") as f:
    for meta in contexts_meta:
        rec = {
            "video_id": meta["video_id"],
            "segment_id": meta["segment_id"],
            "start": meta["start"],
            "end": meta["end"],
            "text": meta["text"],
            # Same URL pattern used in search_segments_async
            "url": f"{VIDEO_BASE_URL}/{meta['video_id']}?t={int(meta['start'])}",
        }
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
        records_written += 1

print(f"Wrote {records_written} records to {export_jsonl_path.resolve()}.")
print("You can load this JSONL into Postgres, Neo4j, or another vector DB as needed.")



Wrote 1660 records to /Users/adi/digdir-video-ai/transcripts/chunks_for_db.jsonl.
You can load this JSONL into Postgres, Neo4j, or another vector DB as needed.
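
The same URL pattern is written out by hand in both the search helper and the export cell. One option is to move it into a small function so the two cannot drift apart. `build_segment_url` is a hypothetical helper, and the base URL in the example is a placeholder.

```python
def build_segment_url(base_url: str, video_id: str, start: float) -> str:
    """Build '<base>/<video_id>?t=<whole seconds>' without doubling slashes."""
    return f"{base_url.rstrip('/')}/{video_id}?t={int(start)}"


print(build_segment_url("https://example.org/videos/", "video2", 3057.1))
# https://example.org/videos/video2?t=3057
```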


In [18]:
# Load exported chunks into PostgreSQL (vector-ready table)

import json
import importlib
from pathlib import Path

if not POSTGRES_DSN:
    raise RuntimeError("POSTGRES_DSN is not set. Export POSTGRES_DSN before running this cell.")

try:
    psycopg2 = importlib.import_module("psycopg2")
except ImportError:
    raise ImportError("psycopg2 is not installed. Install it with `%pip install psycopg2-binary`.")

export_jsonl_path = Path("transcripts/chunks_for_db.jsonl")
if not export_jsonl_path.exists():
    raise FileNotFoundError(f"Export file not found: {export_jsonl_path}")

records: list[dict] = []
with export_jsonl_path.open("r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        records.append(json.loads(line))

print(f"Loaded {len(records)} records from {export_jsonl_path}.")

with psycopg2.connect(POSTGRES_DSN) as conn:
    with conn.cursor() as cur:
        # Create a simple table for video segments (one row per chunk)
        cur.execute(
            """
            CREATE TABLE IF NOT EXISTS video_segments (
                id SERIAL PRIMARY KEY,
                video_id TEXT NOT NULL,
                segment_id INTEGER,
                start DOUBLE PRECISION,
                "end" DOUBLE PRECISION,
                text TEXT,
                url TEXT,
                UNIQUE (video_id, segment_id, start)
            );
            """
        )

        insert_sql = """
            INSERT INTO video_segments (video_id, segment_id, start, "end", text, url)
            VALUES (%(video_id)s, %(segment_id)s, %(start)s, %(end)s, %(text)s, %(url)s)
            ON CONFLICT (video_id, segment_id, start) DO UPDATE
            SET "end" = EXCLUDED."end", text = EXCLUDED.text, url = EXCLUDED.url;
        """
        cur.executemany(insert_sql, records)
    conn.commit()

print(f"Inserted/updated {len(records)} rows into video_segments.")
print("You can now attach pgvector or another embedding column if desired.")



Loaded 1660 records from transcripts/chunks_for_db.jsonl.
Inserted/updated 1660 rows into video_segments.
You can now attach pgvector or another embedding column if desired.
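
The `ON CONFLICT ... DO UPDATE` clause is what makes the load cell safe to re-run: a second insert with the same `(video_id, segment_id, start)` key updates the existing row instead of adding a duplicate. The sketch below demonstrates that behavior with an in-memory SQLite database as a stand-in for Postgres, so it runs without a server. SQLite uses `:name` placeholders where psycopg2 uses `%(name)s`, and the `url` column is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres connection
conn.execute(
    """
    CREATE TABLE video_segments (
        id INTEGER PRIMARY KEY,
        video_id TEXT NOT NULL,
        segment_id INTEGER,
        start REAL,
        "end" REAL,
        text TEXT,
        UNIQUE (video_id, segment_id, start)
    )
    """
)

upsert_sql = """
    INSERT INTO video_segments (video_id, segment_id, start, "end", text)
    VALUES (:video_id, :segment_id, :start, :end, :text)
    ON CONFLICT (video_id, segment_id, start) DO UPDATE
    SET "end" = excluded."end", text = excluded.text
"""

# Two versions of the same segment (illustrative text values).
first = {"video_id": "video1", "segment_id": 0, "start": 0.0, "end": 29.98, "text": "draft"}
second = {**first, "text": "corrected"}

conn.execute(upsert_sql, first)
conn.execute(upsert_sql, second)  # same key: updates instead of inserting
conn.commit()

rows = conn.execute("SELECT video_id, segment_id, text FROM video_segments").fetchall()
print(rows)  # [('video1', 0, 'corrected')]
```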


In [19]:
# Load exported chunks into Neo4j as Video / VideoSegment graph

import json
import importlib
from pathlib import Path

if not (NEO4J_URI and NEO4J_USER and NEO4J_PASSWORD):
    raise RuntimeError("NEO4J_URI / NEO4J_USER / NEO4J_PASSWORD must be set before running this cell.")

try:
    from neo4j import GraphDatabase
except ImportError as exc:
    raise ImportError(
        "neo4j Python driver is not installed. Install it with `%pip install neo4j`."
    ) from exc

export_jsonl_path = Path("transcripts/chunks_for_db.jsonl")
if not export_jsonl_path.exists():
    raise FileNotFoundError(f"Export file not found: {export_jsonl_path}")

records: list[dict] = []
with export_jsonl_path.open("r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        records.append(json.loads(line))

print(f"Loaded {len(records)} records from {export_jsonl_path}.")


def _ingest_segment_tx(tx, rec: dict):
    tx.run(
        """
        MERGE (v:Video {video_id: $video_id})
        MERGE (s:VideoSegment {
            video_id: $video_id,
            segment_id: $segment_id,
            start: $start
        })
        SET s.end = $end,
            s.text = $text,
            s.url = $url
        MERGE (v)-[:HAS_SEGMENT]->(s)
        """,
        **rec,
    )


# Use the driver as a context manager so it is closed even if ingestion fails.
with GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD)) as driver:
    with driver.session() as session:
        for rec in records:
            session.execute_write(_ingest_segment_tx, rec)

print(f"Upserted {len(records)} VideoSegment nodes (and Video parents) into Neo4j.")



Loaded 1660 records from transcripts/chunks_for_db.jsonl.
Upserted 1660 VideoSegment nodes (and Video parents) into Neo4j.
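
The cell above opens one write transaction per record. For larger exports, a common alternative is to send rows in batches and let Cypher's `UNWIND` iterate over them inside a single transaction. The sketch below is one possible shape, not what this notebook ran: `BATCH_QUERY`, `chunked` and `ingest_in_batches` are hypothetical names, and the batch size of 500 is an arbitrary assumption.

```python
from typing import Any, Dict, Iterator, List

# Same MERGE logic as the per-record query, applied to each row of $rows.
BATCH_QUERY = """
UNWIND $rows AS row
MERGE (v:Video {video_id: row.video_id})
MERGE (s:VideoSegment {
    video_id: row.video_id,
    segment_id: row.segment_id,
    start: row.start
})
SET s.end = row.end,
    s.text = row.text,
    s.url = row.url
MERGE (v)-[:HAS_SEGMENT]->(s)
"""


def chunked(items: List[Dict[str, Any]], size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for i in range(0, len(items), size):
        yield items[i : i + size]


def _ingest_batch_tx(tx, rows: List[Dict[str, Any]]) -> None:
    # One query per batch; the rows are passed as a single list parameter.
    tx.run(BATCH_QUERY, rows=rows)


def ingest_in_batches(session, records: List[Dict[str, Any]], batch_size: int = 500) -> int:
    """Write records in batches; returns the number of transactions used."""
    n_batches = 0
    for batch in chunked(records, batch_size):
        session.execute_write(_ingest_batch_tx, batch)
        n_batches += 1
    return n_batches
```

It would be called as `ingest_in_batches(session, records)` inside a `with driver.session() as session:` block.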


In [20]:
# Example semantic query
query = "Når snakker de om datadeling og plattform?"
results = await search_segments_async(query, top_k=5)

len(results)

- video_id=video2 start=3057.10s end=3061.10s score=0.591
  url: file:///Users/adi/digdir-video-ai/videos/video2?t=3057
  text: Vi snakker om datamesh, vi snakker om datafabrik og så videre....

- video_id=video2 start=3027.10s end=3033.10s score=0.574
  url: file:///Users/adi/digdir-video-ai/videos/video2?t=3027
  text: Det som er viktig for oss er at vi får en plattform og har en plattform som vi kan bygge videre på....

- video_id=video2 start=3039.10s end=3057.10s score=0.567
  url: file:///Users/adi/digdir-video-ai/videos/video2?t=3039
  text: så tror vi jo veldig på at vi kan utvide funksjonaliteten og gjøre den plattformen egnet for fremtidige eller moderne dataarkitekturer fremover....

- video_id=video1 start=2231.58s end=2232.58s score=0.548
  url: file:///Users/adi/digdir-video-ai/videos/video1?t=2231
  text: det er å dele datadelen....

- video_id=video2 start=2443.10s end=2446.10s score=0.539
  url: file:///Users/adi/digdir-video-ai/videos/video2?t=2443
  text: og det som 

5
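
Several of the hits above are neighbouring segments of the same passage in `video2`. For playback it can be more useful to merge such hits into one continuous time range. The sketch below does that; `merge_adjacent_hits` is a hypothetical helper and the 10-second `max_gap` is an arbitrary assumption. The sample data copies times and scores from four of the hits printed above.

```python
from typing import Any, Dict, List


def merge_adjacent_hits(
    hits: List[Dict[str, Any]], max_gap: float = 10.0
) -> List[Dict[str, Any]]:
    """Merge hits from the same video whose time ranges lie within max_gap seconds."""
    merged: List[Dict[str, Any]] = []
    for hit in sorted(hits, key=lambda h: (h["video_id"], h["start"])):
        last = merged[-1] if merged else None
        if (
            last is not None
            and last["video_id"] == hit["video_id"]
            and hit["start"] - last["end"] <= max_gap
        ):
            # Extend the current range and keep the best score seen so far.
            last["end"] = max(last["end"], hit["end"])
            last["score"] = max(last["score"], hit["score"])
        else:
            merged.append({k: hit[k] for k in ("video_id", "start", "end", "score")})
    return merged


# Times and scores copied from four of the hits in the search output.
sample_hits = [
    {"video_id": "video2", "start": 3057.10, "end": 3061.10, "score": 0.591},
    {"video_id": "video2", "start": 3027.10, "end": 3033.10, "score": 0.574},
    {"video_id": "video2", "start": 3039.10, "end": 3057.10, "score": 0.567},
    {"video_id": "video1", "start": 2231.58, "end": 2232.58, "score": 0.548},
]

merged_hits = merge_adjacent_hits(sample_hits)
for rng in merged_hits:
    print(rng)
```

With these inputs the three `video2` hits collapse into a single range from 3027.1 s to 3061.1 s, while the `video1` hit stays on its own.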

In [24]:
hit = results[0]  # or any index
hit["video_id"], hit["start"], hit["text"][:120]

('video2',
 3057.1,
 'Vi snakker om datamesh, vi snakker om datafabrik og så videre.')

In [25]:
import uuid
from pathlib import Path

from IPython.display import HTML


def show_video_segment(video_path: str, start: float, width: int = 640):
    """Embed a local video in the notebook and seek to `start` seconds.

    Works best when the video is inside the project directory (e.g. `videos/video2.mp4`).
    An absolute path is made relative to the current working directory when possible
    so that Jupyter can serve the file; otherwise it is used unchanged.
    """
    p = Path(video_path)
    try:
        # Try to make the path relative to the notebook root so Jupyter can serve it
        p_rel = p.relative_to(Path.cwd())
    except ValueError:
        # Fall back to the original path
        p_rel = p

    src = p_rel.as_posix()
    # Unique element id so that several players can coexist in one notebook
    player_id = f"segment_player_{uuid.uuid4().hex}"

    # The script body sits in its own block so `v` is not redeclared on repeated calls
    return HTML(f"""
    <video id="{player_id}" width="{width}" controls>
      <source src="{src}" type="video/mp4">
      Your browser does not support the video tag.
    </video>
    <script>
      {{
        const v = document.getElementById('{player_id}');
        v.addEventListener('loadedmetadata', () => {{
          v.currentTime = {int(start)};
        }});
      }}
    </script>
    """)

In [26]:
show_video_segment(hit["video_path"], hit["start"])