In [1]:
import os
import pandas as pd

In [2]:
# Load clustered transcript segments
segments_path = "../data/transcriptions/clustered_segments.csv"
df_segments = pd.read_csv(segments_path)

# Load topic summaries
summaries_path = "../data/summaries/topic_summaries.csv"
df_summaries = pd.read_csv(summaries_path)

print("Loaded clustered segments and topic summaries.")
print("\n🧩 Segments Preview:")
print(df_segments.head())

print("\nSummaries Preview:")
print(df_summaries.head())


Loaded clustered segments and topic summaries.

🧩 Segments Preview:
   cluster            timestamp  \
0        0       0.16s - 10.92s   
1        0  1112.21s - 1122.77s   
2        0  1153.83s - 1164.26s   
3        0    125.81s - 136.06s   
4        0  1344.76s - 1355.68s   

                                                text  
0  We have been a misunderstood and badly mocked ...  
1  Second, we are building in public and we are p...  
2  the technology and shape it with us and provid...  
3  humans to create, to flourish, to escape the w...  
4  and I also like, I get why this is such an imp...  

Summaries Preview:
   cluster                                            summary
0        0  "We have been a misunderstood and badly mocked...
1        1  Some of the things that seem like they should ...
2        2  "I never, never really thought I would get the...
3        3  There's all these ways that the kind of analog...
4        4  Chat GPT seemed to struggle to understand how ...

In [3]:
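# For each topic, take the first listed timestamp as the "start time" for that topic.
# The timestamps are strings, so .first() returns the first row in file order;
# that is the earliest segment only if the CSV rows are sorted chronologically.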
topic_starts = (
    df_segments.groupby("cluster")["timestamp"]
    .first()
    .reset_index()
    .rename(columns={"timestamp": "start_time"})
)

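# Merge start times with summaries
final_output = pd.merge(df_summaries, topic_starts, on="cluster")
# start_time is a string column, so this sort is lexicographic rather than numeric
# (for example, "1006.78s" sorts before "125.81s")
final_output = final_output.sort_values(by="start_time").reset_index(drop=True)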

# Display result
print("✅ Final topic summary timeline:")
for _, row in final_output.iterrows():
    print(f"[{row['start_time']}] Topic {row['cluster']}: {row['summary']}")

✅ Final topic summary timeline:
[0.16s - 10.92s] Topic 0: "We have been a misunderstood and badly mocked org for a long time," he says. "We put out things that are going to be deeply humans to create"
[10.92s - 21.04s] Topic 7: When I was a little kid, I thought building AI would be like different than the number of characters it said nice about some other person. If you hand people an AGI and that's what they want to do, I wouldn't have believed you. We have to be involved.
[1006.78s - 1016.94s] Topic 6: Jordan Peterson asked the system to can you rewrite it with an equal number, equal length string? And he showed that the response that contained positive things about Biden was much longer or longer than that about Trump.
[1027.04s - 1037.48s] Topic 4: Chat GPT seemed to struggle to understand how to help us find the good things and the bad things. The collective intelligence and ability of the outside world is imperfect. We want to make our mistakes while the stakes are low.
[1037.48
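
The timeline above is ordered by the timestamp strings, which is why a topic starting at 1006.78s is listed directly after one starting at 10.92s. A minimal sketch of a chronological ordering follows, assuming every timestamp has the form "<start>s - <end>s" seen in the preview. The helper names `start_seconds` and `build_timeline` and the sample rows are hypothetical and only illustrate the idea; they are not part of the original notebook.

```python
import pandas as pd


def start_seconds(timestamp: str) -> float:
    """Return the start of a '<start>s - <end>s' range as seconds (assumed format)."""
    start_part = timestamp.split(" - ")[0]
    return float(start_part.strip().rstrip("s"))


def build_timeline(segments: pd.DataFrame, summaries: pd.DataFrame) -> pd.DataFrame:
    """Attach each topic's numerically earliest segment and sort topics by it."""
    segments = segments.assign(start_seconds=segments["timestamp"].map(start_seconds))
    # idxmin returns the row label of the earliest segment within each cluster
    earliest_rows = segments.groupby("cluster")["start_seconds"].idxmin()
    topic_starts = (
        segments.loc[earliest_rows, ["cluster", "timestamp", "start_seconds"]]
        .rename(columns={"timestamp": "start_time"})
    )
    return (
        summaries.merge(topic_starts, on="cluster")
        .sort_values("start_seconds")
        .reset_index(drop=True)
    )


# Hypothetical rows shaped like clustered_segments.csv and topic_summaries.csv
sample_segments = pd.DataFrame({
    "cluster": [0, 0, 1, 2, 2],
    "timestamp": [
        "1112.21s - 1122.77s",
        "125.81s - 136.06s",
        "10.92s - 21.04s",
        "1006.78s - 1016.94s",
        "1344.76s - 1355.68s",
    ],
})
sample_summaries = pd.DataFrame({
    "cluster": [0, 1, 2],
    "summary": ["topic zero", "topic one", "topic two"],
})

timeline = build_timeline(sample_segments, sample_summaries)
for _, row in timeline.iterrows():
    print(f"[{row['start_time']}] Topic {row['cluster']}: {row['summary']}")
```

Keeping the original string in `start_time` preserves the display format, while the separate `start_seconds` column carries the numeric value used for grouping and sorting.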

In [4]:
import json

# Save as plain text
txt_output_path = "../data/summaries/final_summary.txt"
with open(txt_output_path, "w", encoding="utf-8") as f:
    for _, row in final_output.iterrows():
        f.write(f"[{row['start_time']}] Topic {row['cluster']}: {row['summary']}\n")