# Topic Modeling with BERTopic

This notebook demonstrates automated topic discovery using **BERTopic** - a transformer-based topic modeling technique that leverages sentence embeddings and clustering.

## What is BERTopic?

BERTopic combines:
1. **Sentence Transformers** - Create semantic embeddings
2. **UMAP** - Reduce dimensionality while preserving structure
3. **HDBSCAN** - Cluster similar documents
4. **c-TF-IDF** - Extract topic keywords

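**Advantages over LDA:**
- 🎯 Topics tend to be more semantically coherent, because documents are grouped by embedding similarity rather than word co-occurrence alone
- 📊 The number of topics comes out of the clustering step instead of being fixed in advance
- 🔍 Contextual embeddings capture meaning that bag-of-words counts miss
- 📈 Topic keywords are often easier to interpret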

Let's dive in!

## Section 1: Setup and Imports

## Section 2: Step 1 — Basic Concepts

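The BERTopic pipeline runs four stages in order:
- Embeddings (Sentence-Transformers): each document becomes a dense vector
- UMAP (dimensionality reduction): the vectors are projected to fewer dimensions
- HDBSCAN (clustering): dense groups of documents become topics, and documents that fit no group get the outlier label `-1`
- c-TF-IDF (topic words): words are scored per cluster to describe each topic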
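
The last stage, c-TF-IDF, can be sketched in plain Python. This is a simplified illustration rather than BERTopic's own implementation: it assumes whitespace tokenization with no stop-word removal, and it uses the weighting described in the BERTopic documentation, `tf(t, c) * log(1 + A / f(t))`, where `tf(t, c)` is the normalized frequency of word `t` in topic `c`, `f(t)` is its frequency across all topics, and `A` is the average number of words per topic. The function name `c_tf_idf` and the toy documents are made up for this example.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score words per topic with a simplified class-based TF-IDF.

    docs_per_topic maps a topic id to the list of documents assigned to it.
    """
    # Treat every topic as one long document and count its words.
    counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }
    # f(t): frequency of each word across all topics.
    total = Counter()
    for topic_counts in counts.values():
        total.update(topic_counts)
    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for topic, topic_counts in counts.items():
        n_words = sum(topic_counts.values())
        scores[topic] = {
            word: (freq / n_words) * math.log(1 + avg_words / total[word])
            for word, freq in topic_counts.items()
        }
    return scores


# Hypothetical toy input: two topics with two short documents each.
toy_topics = {
    0: ["the market fell", "the market rallied on earnings"],
    1: ["team won the match", "coach praised the team"],
}
scores = c_tf_idf(toy_topics)
for topic, word_scores in scores.items():
    print(topic, max(word_scores, key=word_scores.get))
```

A word that is frequent inside one topic but rare elsewhere gets the highest score, which is why "the" ranks below "market" and "team" here even though it is just as frequent within each topic.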

In [None]:
# Install BERTopic if needed
# !pip install bertopic

import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Add the parent directory to sys.path so the src package can be imported
sys.path.insert(0, str(Path.cwd().parent))

# Import our module
from src.semantic.transformers_enhanced import BERTopicClustering

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 8)

print("✓ Setup complete!")

## Section 3: Step 2 — Core Implementation

We'll fit BERTopic on the sample news dataset, loading `sample_news.csv` if it exists and generating it otherwise.

In [None]:
# Load the sample data, generating it first if the CSV is missing
# (pd, sys and Path are already imported in the setup cell)
sample_path = Path.cwd() / 'sample_news.csv'
if not sample_path.exists():
    # sample_data is a local module expected next to this notebook
    sys.path.insert(0, str(Path.cwd()))
    from sample_data import generate_news_dataset
    df = generate_news_dataset(n_docs=120)
    df.to_csv(sample_path, index=False)
else:
    df = pd.read_csv(sample_path)

print(df.head())
print(f"Documents: {len(df)}")

In [None]:
# Fit BERTopic (BERTopicClustering is already imported in the setup cell)
bertopic = BERTopicClustering(embedding_model='sentence-transformers/all-MiniLM-L6-v2')

# One topic id per document; -1 marks outliers
topics, probs = bertopic.model.fit_transform(df['text'].tolist())

n_topics = len({t for t in topics if t != -1})
print(f"Discovered {n_topics} topics (excluding outliers)")

## Section 4: Step 3 — Practical Examples

Explore topics, top words, representative documents, and mapping back to the dataset.
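
Mapping topics back to documents is plain bookkeeping once there is one topic id per document. The sketch below uses hand-written ids instead of real model output, so it runs without BERTopic; `toy_docs` and `toy_topic_ids` are made up for illustration, and `-1` is treated as the outlier label, as in the rest of this notebook.

```python
from collections import Counter, defaultdict

# Hypothetical documents and topic assignments (not real model output).
toy_docs = [
    "Stocks rallied after the earnings report",
    "The striker scored twice in the final",
    "Central bank holds interest rates",
    "Local bakery wins a design award",
    "Coach praises the defence after the win",
]
toy_topic_ids = [0, 1, 0, -1, 1]  # one id per document, -1 = outlier

# Topic sizes, ignoring outliers.
sizes = Counter(t for t in toy_topic_ids if t != -1)

# Documents grouped by topic id.
docs_by_topic = defaultdict(list)
for doc, topic in zip(toy_docs, toy_topic_ids):
    docs_by_topic[topic].append(doc)

# Share of documents that were not assigned to any topic.
outlier_share = toy_topic_ids.count(-1) / len(toy_topic_ids)

print(f"Topics found: {len(sizes)}")
print(f"Outlier share: {outlier_share:.0%}")
for topic, size in sizes.most_common():
    print(topic, size, docs_by_topic[topic][0])
```

If `bertopic.model` is a standard `BERTopic` instance, `get_representative_docs(topic_id)` should return the documents the model itself considers most typical for a topic, which is usually a better sample than the first document in each group.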

In [None]:
# Top words per topic
topics_info = bertopic.model.get_topic_info()
print(topics_info.head())

# Show the top 10 words for the first non-outlier topic
topic_ids = [t for t in topics_info['Topic'].tolist() if t != -1]
if topic_ids:
    first_topic = topic_ids[0]
    print(f"\nTop words for topic {first_topic}:")
    for word, score in bertopic.model.get_topic(first_topic)[:10]:
        print(f"  {word:20s} {score:.4f}")

# Assign topics to the dataframe
df['topic'] = topics
print(df[['text', 'topic']].head(10))

## Section 5: Step 4 — Visualization and Analysis

Visualize topic similarity, hierarchy, and trends over time.
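
At its core, a topic trend is a count of documents per topic within each time bin. The sketch below shows only that counting step on made-up `(date, topic id)` pairs, with monthly bins chosen for illustration. BERTopic's `topics_over_time` does more than this, including deriving topic words per time bin.

```python
from collections import Counter
from datetime import date

# Hypothetical (date, topic id) pairs; -1 marks outliers.
toy_records = [
    (date(2024, 1, 3), 0),
    (date(2024, 1, 17), 1),
    (date(2024, 1, 25), 0),
    (date(2024, 2, 2), 1),
    (date(2024, 2, 14), 1),
    (date(2024, 2, 20), -1),
]

# Count documents per (month, topic), skipping outliers.
monthly_counts = Counter(
    (d.strftime("%Y-%m"), topic)
    for d, topic in toy_records
    if topic != -1
)

for (month, topic), n_docs in sorted(monthly_counts.items()):
    print(month, f"topic {topic}", n_docs)
```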

In [None]:
# Visualize topic similarity (interactive plot; may not render in all environments)
try:
    fig = bertopic.model.visualize_topics()
    fig.show()
except Exception as e:
    print(f"Visualization not available in this environment: {e}")

# Visualize hierarchy
try:
    fig_h = bertopic.model.visualize_hierarchy()
    fig_h.show()
except Exception as e:
    print(f"Hierarchy visualization not available: {e}")

# Topic trends over time (using sample timestamps)
if 'timestamp' in df.columns:
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    try:
        # The computation itself can fail too, so it sits inside the try block
        topics_over_time = bertopic.model.topics_over_time(
            df['text'].tolist(), df['timestamp'].tolist()
        )
        fig_t = bertopic.model.visualize_topics_over_time(topics_over_time)
        fig_t.show()
    except Exception as e:
        print(f"Topics over time not available: {e}")