In [1]:
import re
from keybert import KeyBERT

In [2]:
# Load raw transcription from file
with open("../data/transcriptions/transcript.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()

# Parse into list of dictionaries
utterances = []
pattern = r"\[(\d+\.\d+)s - (\d+\.\d+)s\] (.+)"

for line in lines:
    match = re.match(pattern, line.strip())
    if match:
        start, end, text = match.groups()
        utterances.append({
            "start": float(start),
            "end": float(end),
            "text": text.strip()
        })

print(f"Loaded {len(utterances)} utterances")
utterances[:3]  # Preview

Loaded 819 utterances


[{'start': 0.16,
  'end': 10.92,
  'text': 'We have been a misunderstood and badly mocked org for a long time. Like when we started, we like announced the org at the end of 2015 and'},
 {'start': 10.92,
  'end': 21.04,
  'text': 'said we were going to work on AGI. Like people thought we were batshit insane. Yeah, you know, like I, I remember at the time a eminent AI scientist at a'},
 {'start': 22.32,
  'end': 32.4,
  'text': "large industrial AI lab was like dming individual reporters being like, you know, these people aren't very good and it's ridiculous to talk about AGI and I can't believe you're giving"}]
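
The pattern in the cell above requires a decimal point in both timestamps and silently drops any line that does not match. A minimal sketch of a more permissive parser follows, assuming the same `[start s - end s] text` layout; `LINE_PATTERN` and `parse_transcript_lines` are hypothetical names, not part of the notebook. It also returns the skipped lines so that format problems are visible.

```python
import re

# Hypothetical, more permissive variant of the notebook's pattern:
# the fractional part is optional and spacing around the dash is flexible.
LINE_PATTERN = re.compile(r"\[(\d+(?:\.\d+)?)s\s*-\s*(\d+(?:\.\d+)?)s\]\s*(.+)")


def parse_transcript_lines(lines):
    """Return (utterances, skipped): parsed dicts plus raw lines that did not match."""
    utterances, skipped = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines silently
        match = LINE_PATTERN.match(line)
        if match is None:
            skipped.append(line)  # keep for inspection instead of dropping
            continue
        start, end, text = match.groups()
        utterances.append({"start": float(start), "end": float(end), "text": text.strip()})
    return utterances, skipped


# Illustrative sample lines, not taken from the real transcript file
sample = [
    "[0.16s - 10.92s] We have been a misunderstood org.\n",
    "[11s - 21.04s] said we were going to work on AGI.\n",
    "not a transcript line\n",
]
parsed, skipped = parse_transcript_lines(sample)
print(f"Parsed {len(parsed)} utterances, skipped {len(skipped)} lines")
```

Comparing `len(skipped)` against the number of lines in the file is a quick way to check whether the transcription format matches what the pattern expects.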

In [4]:
# Join all utterance text into a single document
full_text = " ".join([utt["text"] for utt in utterances])

# Initialize KeyBERT model (uses all-MiniLM-L6-v2 by default)
kw_model = KeyBERT()

# Extract top 40 keyphrases (1 to 3 words), excluding stopwords
keywords = kw_model.extract_keywords(
    full_text,
    keyphrase_ngram_range=(1, 3),
    stop_words='english',
    top_n=40
)

# Store just the keyword strings
global_keywords = {phrase.lower() for phrase, _score in keywords}

print("Top extracted keywords from transcript:")
for kw in global_keywords:
    print("-", kw)

Top extracted keywords from transcript:
- conversation agi built
- better agi exists
- openai agi
- worry agi technologies
- agi really
- investors know agi
- agi coming real
- great concerns agi
- intelligent agi wrong
- chief scientist openai
- org openai
- agi super intelligent
- like openai agi
- agi think
- humans think openai
- agi great concerns
- openai like folks
- openai thinking
- agi time thought
- agi isn remarkable
- super intelligent agi
- org openai went
- agi openai deepmind
- open ai started
- agi exists
- openai agi created
- people better agi
- concerns agi great
- ways think agi
- intelligent agi
- looking forward agi
- altman ceo openai
- think agi
- agi created
- agi openai
- concerns agi
- think people openai
- agi think lessons
- ceo openai thing
- build agi openai
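
Many of the phrases above are reorderings of the same words (for example "agi great concerns", "great concerns agi", and "concerns agi great"). One possible way to reduce this redundancy, sketched below, is to keep only the best-scoring phrase for each distinct set of words. `dedupe_keyphrases` is a hypothetical helper, and the scores in the example are made up for illustration; the real ones come from the `(phrase, score)` tuples that `extract_keywords` returns. KeyBERT's `extract_keywords` also accepts `use_mmr=True` with a `diversity` value, which may be a simpler route to more varied phrases.

```python
def dedupe_keyphrases(scored_keywords):
    """Keep the highest-scoring phrase for each distinct set of words."""
    best = {}
    for phrase, score in scored_keywords:
        key = frozenset(phrase.lower().split())  # word order is ignored
        if key not in best or score > best[key][1]:
            best[key] = (phrase.lower(), score)
    # Highest score first
    return sorted(best.values(), key=lambda item: item[1], reverse=True)


# Made-up scores for illustration only
example = [
    ("agi great concerns", 0.61),
    ("great concerns agi", 0.63),
    ("concerns agi great", 0.60),
    ("chief scientist openai", 0.55),
]
deduped = dedupe_keyphrases(example)
for phrase, score in deduped:
    print(f"{phrase}: {score:.2f}")
```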


In [5]:
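def score_utterance(text, keyword_set):
    """Score an utterance by keyphrase hits plus a length bonus."""
    text_lower = text.lower()
    score = 0
    for keyword in keyword_set:
        # Substring match: the whole keyphrase must appear verbatim in the utterance
        if keyword in text_lower:
            score += 1  # Increase this weight to favor keyword hits
    return score + len(text.split()) * 0.1  # length bonus: 0.1 per word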

# Score each utterance
for utt in utterances:
    utt["score"] = score_utterance(utt["text"], global_keywords)

# Sort by score descending
utterances_sorted = sorted(utterances, key=lambda x: x["score"], reverse=True)

# Select and preview the top 20 important utterances
top_n = 20
important_segments = utterances_sorted[:top_n]

print("\nTop Important Segments:")
for seg in important_segments:
    print(f"[{seg['start']}s - {seg['end']}s] {seg['text']} (score: {seg['score']:.2f})")



Top Important Segments:
[1861.53s - 1872.69s] wanted, it wrote some code and that was it. Now you can have this back and forth dialogue where you can say, no, no, I meant this, or no, no, fix this bug or no, no, do this. And then of course the next version is the system can debug (score: 5.00)
[5817.89s - 5828.01s] in the world? I think the world is going to find out that if you can have 10 times as much code at the same price, you can just use even more to write even more code. The world just needs way more code. It is true that a lot (score: 5.00)
[4744.14s - 4754.19s] if created has a lot of power. How do you think we're doing? Like, honest. How do you think we're doing so far? Like, how do you think our decisions are? Like, do you think we're making things not better or worse? What can we do better? Well, the (score: 4.90)
[7670.59s - 7680.79s] about it. But it kind of reveals the fragility of our economic system. We may not be done. That may have been like the gun shown falling o
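
In the output above, the first segment is 50 words long and contains none of the extracted keyphrases, so its score of 5.00 is entirely the length bonus (50 × 0.1). Multi-word keyphrases rarely appear verbatim inside a roughly ten-second utterance, which lets length dominate the ranking. The sketch below shows one possible alternative, not the notebook's method: match on the individual words of the keyphrases and cap the length bonus. The function names, `length_weight`, and `max_length_bonus` are hypothetical choices.

```python
def keyword_tokens(keyphrases):
    """Collect the distinct words that appear in any keyphrase."""
    return {word for phrase in keyphrases for word in phrase.lower().split()}


def score_utterance_by_tokens(text, tokens, length_weight=0.02, max_length_bonus=0.5):
    """Count distinct keyword tokens in the text, plus a small capped length bonus."""
    # Strip common punctuation so "AGI." still matches "agi"
    words = [w.strip(".,!?;:\"'()").lower() for w in text.split()]
    hits = len(tokens & set(words))
    length_bonus = min(len(words) * length_weight, max_length_bonus)
    return hits + length_bonus


# Illustrative inputs
tokens = keyword_tokens({"openai agi", "chief scientist openai"})
short_score = score_utterance_by_tokens("OpenAI said it would work on AGI.", tokens)
long_score = score_utterance_by_tokens("no " * 50, tokens)
print(f"short on-topic: {short_score:.2f}, long off-topic: {long_score:.2f}")
```

With these settings a long utterance without keyword hits can never outrank one that contains at least one keyword token. Keyphrases that include generic words such as "think" or "like" would still inflate the hit count, so filtering the token set may be worthwhile.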

In [None]:
# Define output path
output_path = "../data/summaries/key_topics.txt"

# Save the selected top segments to file
with open(output_path, "w", encoding="utf-8") as f:
    for seg in important_segments:
        f.write(f"[{seg['start']}s - {seg['end']}s] {seg['text']}\n")

print(f"Saved top {len(important_segments)} important segments to: {output_path}")

In [7]:
# Define the context window in seconds (±2 minutes)
context_window = 120

# Store final contextual segments
contextual_paragraphs = []

for seg in important_segments:
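    # Clamp at 0 so segments near the start do not produce negative timestamps
    start_context = max(0.0, seg["start"] - context_window)
    end_context = seg["end"] + context_window

    # Get all utterances that fall entirely within this window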
    context_utts = [
        utt["text"]
        for utt in utterances
        if utt["start"] >= start_context and utt["end"] <= end_context
    ]

    # Join into a single paragraph
    paragraph = " ".join(context_utts)

    # Store with timestamp info
    contextual_paragraphs.append({
        "start": start_context,
        "end": end_context,
        "paragraph": paragraph
    })

# Print all contextual paragraphs
for idx, para in enumerate(contextual_paragraphs, 1):
    start_min = int(para["start"] // 60)
    start_sec = int(para["start"] % 60)
    end_min = int(para["end"] // 60)
    end_sec = int(para["end"] % 60)
    print(f"\nSegment {idx} [{start_min:02}:{start_sec:02} - {end_min:02}:{end_sec:02}]:\n{para['paragraph']}")



Segment 1 [29:01 - 33:12]:
That's a big one. Yeah. Yeah. But there's still some parallels that don't break down. There is something deeply. Because it's trained on human data. There's. It feels like it's a way to learn about ourselves by interacting with it. Some of it as the smarter and smarter it gets, the more it represents, the more it feels like another human in terms of the kind of way you would phrase a prompt to get the kind of thing you want back. And that's interesting because that is the art form. As you collaborate with it as an assistant, this becomes more relevant. For now, this is relevant everywhere. But it's also very relevant for programming, for example. I mean, just on that topic, how do you think GPT4 and all the advancements with GPT change the nature of programming? Today's Monday. We launched the previous Tuesday, so it's been six Days, the degree wild, the degree to which it has already changed programming and what I have observed from how my friends are creat
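
Because every window spans about four minutes, important segments that lie close together produce largely identical paragraphs. A minimal sketch of merging overlapping windows before collecting text is shown below; `merge_windows` and `build_paragraphs` are hypothetical helpers, and the sample utterances are illustrative rather than taken from the transcript.

```python
def merge_windows(windows):
    """Merge overlapping (start, end) intervals into a sorted list of tuples."""
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous window: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def build_paragraphs(utterances, segments, context_window=120):
    """Build one paragraph per merged context window."""
    windows = [
        (max(0.0, seg["start"] - context_window), seg["end"] + context_window)
        for seg in segments
    ]
    paragraphs = []
    for start, end in merge_windows(windows):
        # Same containment rule as the notebook: utterance fully inside the window
        texts = [u["text"] for u in utterances if u["start"] >= start and u["end"] <= end]
        paragraphs.append({"start": start, "end": end, "paragraph": " ".join(texts)})
    return paragraphs


# Illustrative data
utts = [
    {"start": 0.0, "end": 10.0, "text": "a"},
    {"start": 10.0, "end": 20.0, "text": "b"},
    {"start": 100.0, "end": 110.0, "text": "c"},
    {"start": 1000.0, "end": 1010.0, "text": "d"},
]
paras = build_paragraphs(utts, segments=[utts[1], utts[2], utts[3]])
for p in paras:
    print(f"[{p['start']}s - {p['end']}s] {p['paragraph']}")
```

The merged paragraphs come out in chronological order rather than score order, which usually reads better in a summary file.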

In [9]:
# Define output path
output_path = "../data/summaries/contextual_key_topics.txt"

# Save contextual paragraphs in [start - end] format in seconds
with open(output_path, "w", encoding="utf-8") as f:
    for para in contextual_paragraphs:
        start_sec = round(para["start"], 2)
        end_sec = round(para["end"], 2)
        f.write(f"[{start_sec}s - {end_sec}s] {para['paragraph']}\n")

print(f"Saved contextual segments in seconds format to: {output_path}")


Saved contextual segments in seconds format to: ../data/summaries/contextual_key_topics.txt
