# 🧪 StreamSage Data Lab: Oracle (Subtitle Engineering)

**Goal**: Master the art of preparing unstructured text (movie subtitles) for RAG (Retrieval Augmented Generation).

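**The Problem**: Raw `.srt` files are messy. They contain HTML tags and sound-effect annotations, and the dialogue is split into short lines that each cover only a couple of seconds. Fed directly to an LLM, these fragments carry too little context to retrieve from or answer from reliably.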

**The Solution**: 
1. **Clean**: Remove artifacts.
2. **Merge**: Combine lines into coherent sentences.
3. **Chunk**: Create 5-minute "Time Windows" with overlap.

**Outcome**: A clean dataset ready for Vector Embedding.

In [None]:
# 1. Setup & Imports
!pip install pysrt sentence-transformers pandas matplotlib

import pysrt
import re
import pandas as pd
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer

print("Libraries installed!")

## 2. Get Data
We'll download a sample `.srt` file (e.g., *Big Buck Bunny*, an open-source movie) to practice on.

In [None]:
# Download sample subtitle
!wget -O sample.srt https://raw.githubusercontent.com/AGiuliani/Whisper-Subtitles-Generation/main/test/test.srt

# Load it
subs = pysrt.open('sample.srt')
print(f"Loaded {len(subs)} subtitle lines.")

# Show raw data
print("\n--- Raw Data Sample ---")
for i in range(min(5, len(subs))):  # don't index past the end of a short file
    print(f"[{subs[i].start} -> {subs[i].end}] {subs[i].text}")

## 3. Cleaning Lab
**Task**: Write a function to clean the text.
- Remove HTML tags (`<i>`, `<b>`).
- Remove sound effects (anything in `[]` or `()`).
- Fix multiple spaces.

In [None]:
def clean_text(text):
    # 1. Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # 2. Remove sound effects [Music], (Laughs), including ones that wrap across lines
    text = re.sub(r'\[[^\]]*\]', '', text)
    text = re.sub(r'\([^)]*\)', '', text)
    # 3. Remove music notes
    text = re.sub(r'[♪♫]', '', text)
    # 4. Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Test it
dirty_sample = "<i>(Music playing)</i> Hello <b>World</b>! [Gunshot]"
print(f"Dirty: {dirty_sample}")
print(f"Clean: {clean_text(dirty_sample)}")
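
The **Merge** step from the plan (combining short lines into coherent sentences) has no dedicated cell in this notebook. Here is a minimal sketch of one way to do it, assuming a sentence ends when a cleaned line finishes with `.`, `!`, or `?`. `merge_lines` is a hypothetical helper, not part of `pysrt`.

```python
def merge_lines(lines):
    """Join consecutive subtitle lines until one ends with ., ! or ?"""
    sentences = []
    buffer = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip lines that were emptied by cleaning
        buffer.append(line)
        # Assumption: terminal punctuation marks the end of a sentence
        if line.endswith(('.', '!', '?')):
            sentences.append(' '.join(buffer))
            buffer = []
    # Keep any trailing fragment that never reached terminal punctuation
    if buffer:
        sentences.append(' '.join(buffer))
    return sentences


# Hypothetical cleaned lines (one was emptied by cleaning)
lines = ["I told you", "we should have left earlier.", "", "Really?", "Yes, and now"]
merged = merge_lines(lines)
print(merged)
```

This heuristic is deliberately simple: abbreviations such as "Dr." or a trailing "..." would end a sentence early, so treat it as a starting point rather than a finished sentence splitter.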

## 4. Feature Engineering: Sliding Window Chunking

**Concept**: We can't search line-by-line (too short). We can't search the whole movie (too long).
We need **Windows**.

- **Window Size**: 300 seconds (5 mins)
- **Overlap**: 30 seconds

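Why overlap? Imagine a line starts at 4:59 and ends at 5:01. The chunker below keeps only lines that fit entirely inside a window, so without overlap that line would straddle the 5:00 boundary and be dropped. With a 30-second overlap, the next window starts at 4:30 and captures the line whole.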
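
The window arithmetic can be checked on its own before touching any subtitle data. The sketch below assumes a hypothetical 10-minute file and a hypothetical helper named `window_bounds`; it lists the window boundaries and shows which window fully contains a line spoken from 4:59 to 5:01.

```python
def window_bounds(total_seconds, window_size=300, overlap=30):
    """Return (start, end) pairs for overlapping windows covering total_seconds."""
    step = window_size - overlap  # how far each window slides forward
    if step <= 0:
        raise ValueError("overlap must be smaller than window_size")
    bounds = []
    start = 0
    while start < total_seconds:
        bounds.append((start, start + window_size))
        start += step
    return bounds


# Hypothetical 10-minute (600 s) file
bounds = window_bounds(600)
print(bounds)

# A line spoken from 4:59 to 5:01 (299 s to 301 s)
line_start, line_end = 299, 301
containing = [(s, e) for s, e in bounds if s <= line_start and line_end <= e]
print(containing)
```

Only the second window, which starts at 270 s, contains the whole line. The last window extends past the end of the file, which is harmless because no subtitles exist there.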

In [None]:
def create_chunks(subs, window_size=300, overlap=30):
    chunks = []
    
    # Total duration in seconds (pysrt's .ordinal is in milliseconds)
    end_time = subs[-1].end.ordinal / 1000
    
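    # Guard against a window that never advances (would loop forever)
    if overlap >= window_size:
        raise ValueError("overlap must be smaller than window_size")

    current_start = 0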
    
    while current_start < end_time:
        current_end = current_start + window_size
        
        # Collect text in this window
        window_text = []
        for sub in subs:
            sub_start = sub.start.ordinal / 1000
            sub_end = sub.end.ordinal / 1000
            
            # Check if sub is inside window
            if sub_start >= current_start and sub_end <= current_end:
                cleaned = clean_text(sub.text)
                if cleaned:
                    window_text.append(cleaned)
        
        # Save chunk (windows with no dialogue are skipped)
        if window_text:
            text = ' '.join(window_text)
            chunks.append({
                'start': current_start,
                'end': current_end,
                'text': text,
                'char_count': len(text)
            })
            
        # Slide window
        current_start += (window_size - overlap)
        
    return pd.DataFrame(chunks)

# Run it (using smaller window for this short demo file)
df_chunks = create_chunks(subs, window_size=30, overlap=5)
df_chunks.head()

## 5. Visualization
Let's look at the distribution of chunk sizes, measured in characters. Are the chunks too big? Too small?

In [None]:
plt.figure(figsize=(10, 5))
plt.hist(df_chunks['char_count'], bins=20, color='purple', alpha=0.7)
plt.title('Distribution of Chunk Sizes (Characters)')
plt.xlabel('Characters')
plt.ylabel('Count')
plt.show()
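
The histogram shows the shape, and a few summary numbers make the "too big / too small" question concrete. Here is a minimal sketch using only the standard library. `summarize_lengths` is a hypothetical helper, and `max_chars` is a hypothetical budget you would choose for your embedding model, not a value taken from this notebook.

```python
import statistics


def summarize_lengths(char_counts, max_chars):
    """Summarize chunk lengths and count chunks above a character budget."""
    return {
        'chunks': len(char_counts),
        'min': min(char_counts),
        'median': statistics.median(char_counts),
        'max': max(char_counts),
        # Chunks over the budget are candidates for a smaller window_size
        'over_budget': sum(1 for n in char_counts if n > max_chars),
    }


# Hypothetical character counts; in the notebook you could pass
# df_chunks['char_count'].tolist() instead
example_counts = [420, 510, 1350, 980, 75]
summary = summarize_lengths(example_counts, max_chars=1000)
print(summary)
```

Sentence-transformers models truncate input beyond their maximum sequence length, which `model.max_seq_length` reports in tokens rather than characters, so a character budget like this is only a rough proxy.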

## 6. Vector Embeddings (Preview)
Now that we have clean chunks, let's turn them into vectors (arrays of numbers) using a pre-trained model.
`ChromaDB` performs a similar embedding step under the hood when you add documents without supplying your own vectors.

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed the first chunk
sample_text = df_chunks.iloc[0]['text']
vector = model.encode(sample_text)

print(f"Text: {sample_text[:50]}...")
print(f"Vector Shape: {vector.shape}")
print(f"First 10 dimensions: {vector[:10]}")
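
Once every chunk is a vector, retrieval comes down to comparing a query vector against the chunk vectors. Here is a minimal sketch of cosine similarity on made-up 3-dimensional vectors (real `all-MiniLM-L6-v2` vectors have 384 dimensions). The vectors and chunk names are illustrative, not model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embedded chunks
chunk_vectors = {
    'chunk_0': [1.0, 0.0, 0.0],
    'chunk_1': [0.6, 0.8, 0.0],
    'chunk_2': [0.0, 0.0, 1.0],
}
query_vector = [0.8, 0.6, 0.0]

# Rank chunks from most to least similar to the query
ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
    reverse=True,
)
print(ranked)
```

A vector database performs the same kind of ranking, typically with an index that avoids comparing the query against every stored vector.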

## ✅ Next Steps
1. Download this notebook.
2. Upload to Google Colab.
3. Try it with **YOUR** favorite movie's `.srt` file.
4. Adjust `window_size` and see how it changes the chunks.