# BYU Pathway Questions Topic Modeling - Google Colab Edition

This notebook analyzes student questions using AI-powered topic modeling with OpenAI embeddings and BERTopic.

## 📋 What you'll need:
1. **OpenAI API Key** - Get one from [platform.openai.com](https://platform.openai.com)
2. **Questions file** - A .txt file with one question per line
3. **About 5-10 minutes** for analysis to complete

---

## 🚀 Step 1: Install Required Libraries

Run this cell to install all necessary packages:

In [None]:
!pip install "bertopic>=0.15.0" "openai>=1.0.0" "umap-learn>=0.5.0" "hdbscan>=0.8.0" "plotly>=5.0.0" "scikit-learn>=1.0.0" "pandas>=1.3.0" "numpy>=1.21.0"
print("✅ Installation complete!")
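If a later import fails, it can help to confirm what pip actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the install command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they appear in the pip command above
REQUIRED = ["bertopic", "openai", "umap-learn", "hdbscan",
            "plotly", "scikit-learn", "pandas", "numpy"]

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Print one line per package so missing ones are easy to spot
for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:<14} {ver or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` suggests the install cell should be re-run before continuing.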

## 📦 Step 2: Import Libraries

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Topic modeling libraries
from bertopic import BERTopic
from openai import OpenAI
import umap.umap_ as umap
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

# Utilities
import os
import re
from datetime import datetime
import getpass

# Try to import Google Colab utilities if available
try:
    from google.colab import files
    from google.colab import drive
    IN_COLAB = True
    print("✅ Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("✅ Running in local environment")

print("✅ Libraries imported successfully!")

## 🔑 Step 3: Configure OpenAI API Key

Enter your OpenAI API key when prompted:

In [None]:
# Get OpenAI API key securely
print("🔑 Please enter your OpenAI API key:")
print("   Get your key from: https://platform.openai.com/api-keys")
api_key = getpass.getpass("API Key: ")

# Initialize OpenAI client
client = OpenAI(api_key=api_key)

# Test the API key
try:
    response = client.models.list()
    print("✅ API key is valid and working!")
except Exception as e:
    print(f"❌ API key error: {e}")
    print("Please check your API key and try again.")
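For repeated runs outside Colab, typing the key each time can get tedious. As a hedged alternative, the sketch below reads the key from the `OPENAI_API_KEY` environment variable when it is set and falls back to the prompt otherwise. The helper name `resolve_api_key` and its parameters are hypothetical and not part of this notebook.

```python
import getpass
import os

def resolve_api_key(env=None, prompt=getpass.getpass):
    """Return the key from OPENAI_API_KEY if set, otherwise prompt for it."""
    env = os.environ if env is None else env
    key = (env.get("OPENAI_API_KEY") or "").strip()
    if key:
        return key
    # Fall back to a hidden prompt, as in the cell above
    return prompt("API Key: ").strip()

# Possible usage in place of the getpass call above:
# api_key = resolve_api_key()
```

Passing `env` and `prompt` as parameters keeps the helper easy to exercise without touching the real environment or blocking on input.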

## 📤 Step 4: Upload Your Questions File

Upload a .txt file with one question per line:

In [None]:
# Upload or load questions file
if IN_COLAB:
    print("📤 Please upload your questions file (.txt format, one question per line):")
    uploaded = files.upload()
    
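    if not uploaded:
        # Without this check, `content` would be undefined further down
        raise ValueError("No file uploaded. Re-run this cell and choose a .txt file.")

    filename = list(uploaded.keys())[0]
    print(f"\n📁 Processing file: {filename}")

    # Read questions
    with open(filename, 'r', encoding='utf-8') as f:
        content = f.read()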
else:
    # For local environments, use a file path or sample data
    print("💻 Local environment detected.")
    print("Please ensure your questions file is in the same directory as this notebook.")
    filename = input("Enter the filename (e.g., 'questions.txt'): ").strip()
    
    if not filename:
        # Use sample questions for demonstration
        print("🎯 Using sample questions for demonstration...")
        content = """How do I register for classes?
What financial aid options are available?
How do I access my transcripts?
What are the graduation requirements?
How do I contact my academic advisor?
What is the refund policy?
How do I change my major?
What are the library hours?
How do I access online resources?
What technical requirements do I need?"""
        filename = "sample_questions.txt"
    else:
        try:
            with open(filename, 'r', encoding='utf-8') as f:
                content = f.read()
        except FileNotFoundError:
            print(f"❌ File '{filename}' not found. Using sample questions instead.")
            content = """How do I register for classes?
What financial aid options are available?
How do I access my transcripts?
What are the graduation requirements?
How do I contact my academic advisor?"""
            filename = "sample_questions.txt"

# Clean and process questions
questions = [line.strip() for line in content.split('\n') if line.strip()]
questions = [q for q in questions if len(q) > 10]  # Remove very short questions

print(f"✅ Loaded {len(questions)} questions from {filename}")

# Show preview
print("\n📖 First 5 questions:")
for i, q in enumerate(questions[:5], 1):
    print(f"{i}. {q}")

if len(questions) < 10:
    print("⚠️  Warning: Fewer than 10 questions detected. Consider adding more questions for better analysis.")
else:
    print(f"\n✅ Ready to analyze {len(questions)} questions!")
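Question files collected from real students often contain repeated lines. The sketch below applies the same cleaning rules as the cell above (strip whitespace, drop lines of 10 characters or fewer) and optionally removes case-insensitive duplicates. The function name `clean_questions` and the `dedupe` option are assumptions added for illustration; the notebook itself does not deduplicate.

```python
def clean_questions(text, min_length=10, dedupe=True):
    """Split raw text into questions, one per line, dropping blanks and short lines."""
    seen = set()
    cleaned = []
    for line in text.splitlines():
        q = line.strip()
        if len(q) <= min_length:  # same threshold as the cell above
            continue
        key = q.casefold()        # compare case-insensitively
        if dedupe and key in seen:
            continue
        seen.add(key)
        cleaned.append(q)
    return cleaned

sample = (
    "How do I register for classes?\n"
    "\n"
    "how do i register for classes?\n"
    "Help?\n"
    "What is the refund policy?\n"
)
print(clean_questions(sample))
```

Whether to deduplicate is a judgment call: repeated questions carry information about how common a concern is, so dropping them changes the topic sizes reported later.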

## 🧠 Step 5: Run Topic Modeling Analysis

This will take 5-10 minutes depending on the number of questions:

In [None]:
print("🚀 Starting topic modeling analysis...\n")
print(f"📊 Analyzing {len(questions)} questions")
print("⏳ This will take several minutes...\n")

# Configuration
embedding_model = "text-embedding-3-large"
chat_model = "gpt-4o-mini"
min_cluster_size = min(5, max(2, len(questions) // 20))  # Adaptive cluster size
batch_size = 100

print(f"📋 Configuration:")
print(f"   • Embedding Model: {embedding_model}")
print(f"   • Chat Model: {chat_model}")
print(f"   • Min Cluster Size: {min_cluster_size}")
print(f"   • Batch Size: {batch_size}\n")

# Step 1: Generate embeddings with improved error handling
print("🔄 Step 1/4: Generating embeddings with OpenAI...")
embeddings = []
embedded_questions = []  # Questions whose batch embedded successfully
failed_questions = []    # Questions from batches that failed
failed_batches = []
total_batches = (len(questions) - 1) // batch_size + 1

for i in range(0, len(questions), batch_size):
    batch = questions[i:i + batch_size]
    batch_num = (i // batch_size) + 1

    print(f"   Processing batch {batch_num}/{total_batches} ({len(batch)} questions)...")

    try:
        response = client.embeddings.create(
            input=batch,
            model=embedding_model
        )

        batch_embeddings = [data.embedding for data in response.data]
        embeddings.extend(batch_embeddings)
        embedded_questions.extend(batch)

        # Progress update
        if batch_num % 5 == 0:
            print(f"   ✅ Processed {batch_num}/{total_batches} batches successfully")

    except Exception as e:
        print(f"   ❌ Error processing batch {batch_num}: {e}")
        failed_batches.append(batch_num)
        failed_questions.extend(batch)
        # Continue processing remaining batches
        continue

# Move questions from failed batches to the end so that
# questions[:len(embeddings)] stays aligned with the embeddings
questions = embedded_questions + failed_questions

if failed_batches:
    print(f"   ⚠️ Failed to process {len(failed_batches)} batches: {failed_batches}")
    print(f"   ✅ Successfully processed {len(embeddings)} questions")
else:
    print(f"   ✅ All batches processed successfully!")

embeddings = np.array(embeddings)
print(f"✅ Generated {len(embeddings)} embeddings (shape: {embeddings.shape})\n")

# Validate embeddings before proceeding
if len(embeddings) < len(questions) * 0.8:  # If we lost more than 20% of questions
    print("⚠️ Warning: Significant number of questions failed embedding generation.")
    print("Consider retrying or checking your API key and quota.")

# Step 2: Dimensionality reduction and clustering
print("🔄 Step 2/4: Reducing dimensions and clustering...")

# UMAP for clustering (5D) - adaptive neighbors based on dataset size
n_neighbors = min(15, max(5, len(questions) // 10))
print(f"   • UMAP dimensionality reduction (n_neighbors={n_neighbors})...")

umap_model = umap.UMAP(
    n_neighbors=n_neighbors,
    n_components=5,
    random_state=42,
    metric='cosine'
)
umap_embeddings = umap_model.fit_transform(embeddings)

# HDBSCAN clustering with adaptive parameters
print(f"   • HDBSCAN clustering (min_cluster_size={min_cluster_size})...")
hdbscan_model = HDBSCAN(
    min_cluster_size=min_cluster_size,
    metric='euclidean',
    cluster_selection_method='eom',
    prediction_data=True  # Allows for predicting cluster membership of new points
)
cluster_labels = hdbscan_model.fit_predict(umap_embeddings)

# Analyze clustering results
n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
n_noise = list(cluster_labels).count(-1)
categorization_rate = (len(embeddings) - n_noise) / len(embeddings) * 100

print(f"✅ Clustering Results:")
print(f"   • Clusters found: {n_clusters}")
print(f"   • Noise points: {n_noise} ({n_noise/len(embeddings)*100:.1f}%)")
print(f"   • Questions categorized: {categorization_rate:.1f}%")

# Provide feedback on clustering quality
if categorization_rate < 50:
    print("   ⚠️ Low categorization rate. Consider reducing min_cluster_size.")
elif categorization_rate > 90:
    print("   ✅ Excellent categorization rate!")
else:
    print("   ✅ Good clustering results!")

print()

# Step 3: Create BERTopic model
print("🔄 Step 3/4: Training BERTopic model...")

vectorizer_model = CountVectorizer(stop_words="english", max_features=1000)
topic_model = BERTopic(
    embedding_model=None,  # We provide embeddings directly
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    verbose=False
)

# Fit the model
topics, probs = topic_model.fit_transform(questions[:len(embeddings)], embeddings)
print(f"✅ BERTopic model trained successfully")
print(f"   • Topic assignments created for {len(topics)} questions\n")

# Step 4: Enhance topic labels with OpenAI
print("🔄 Step 4/4: Enhancing topic labels with AI...")

topic_info = topic_model.get_topic_info()
enhanced_labels = {}
failed_labels = []

for topic_id in topic_info['Topic'].unique():
    if topic_id == -1:  # Skip noise
        enhanced_labels[topic_id] = "Uncategorized"
        continue
    
    try:
        keywords = topic_model.get_topic(topic_id)[:10]
        keyword_str = ", ".join([word for word, _ in keywords])
        
        print(f"   • Labeling topic {topic_id} (keywords: {keyword_str[:50]}...)")
        
        prompt = f"""Based on these keywords from student questions: {keyword_str}

Create a clear, concise topic label (2-4 words) that describes the main theme.
Focus on what students are asking about. Examples: "Course Registration", "Financial Aid", "Technical Support"

Topic label:"""
        
        response = client.chat.completions.create(
            model=chat_model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=20,
            temperature=0.3
        )
        
        enhanced_labels[topic_id] = response.choices[0].message.content.strip().strip('"')
        
    except Exception as e:
        print(f"   ❌ Failed to label topic {topic_id}: {e}")
        failed_labels.append(topic_id)
        enhanced_labels[topic_id] = f"Topic {topic_id}"

if failed_labels:
    print(f"   ⚠️ Used default labels for {len(failed_labels)} topics: {failed_labels}")

print("\n✅ Topic labeling complete!")
print("🎉 Analysis finished! Preparing results...\n")
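The "adaptive" parameters in the cell above are simple clamped formulas, and it can be useful to see what they evaluate to for different dataset sizes. The sketch below reproduces the two expressions from the notebook; the wrapper function `adaptive_params` is a hypothetical name added for illustration.

```python
def adaptive_params(n_questions):
    """Reproduce the min_cluster_size and n_neighbors formulas used above."""
    min_cluster_size = min(5, max(2, n_questions // 20))
    n_neighbors = min(15, max(5, n_questions // 10))
    return min_cluster_size, n_neighbors

# Show how the parameters change with dataset size
for n in (10, 50, 100, 150, 1000):
    mcs, nn = adaptive_params(n)
    print(f"{n:>5} questions -> min_cluster_size={mcs}, n_neighbors={nn}")
```

Both values saturate quickly: `min_cluster_size` reaches its cap of 5 at 100 questions and `n_neighbors` reaches 15 at 150 questions. For much larger datasets, raising the cap on `min_cluster_size` may yield fewer, broader topics.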

## 📊 Step 6: View Analysis Results

In [None]:
# Create results DataFrame with proper alignment
# Handle case where some questions might have failed embedding generation
processed_questions = questions[:len(embeddings)]  # Align with successful embeddings

results_df = pd.DataFrame({
    'Question': processed_questions,
    'Topic_ID': topics,
    'Probability': probs,
    'Topic_Name': [enhanced_labels.get(topic_id, f"Topic {topic_id}") for topic_id in topics]
})

# Add UMAP coordinates for visualization
umap_viz = umap.UMAP(
    n_neighbors=min(15, max(5, len(embeddings) // 10)), 
    n_components=2, 
    random_state=42, 
    metric='cosine'
)
viz_embeddings = umap_viz.fit_transform(embeddings)
results_df['UMAP_X'] = viz_embeddings[:, 0]
results_df['UMAP_Y'] = viz_embeddings[:, 1]

# Display summary statistics
print("📈 ANALYSIS SUMMARY")
print("=" * 50)
print(f"📊 Total Questions Processed: {len(results_df)}")
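# Count real topics only; "Uncategorized" (Topic_ID -1) is noise, not a topic
print(f"🏷️  Topics Discovered: {results_df.loc[results_df['Topic_ID'] != -1, 'Topic_Name'].nunique()}")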

# Calculate categorization stats
categorized_count = len(results_df[results_df['Topic_ID'] != -1])
uncategorized_count = len(results_df[results_df['Topic_ID'] == -1])
categorization_rate = categorized_count / len(results_df) * 100

print(f"✅ Questions Categorized: {categorized_count} ({categorization_rate:.1f}%)")
print(f"❓ Uncategorized (Noise): {uncategorized_count} ({uncategorized_count/len(results_df)*100:.1f}%)")
print(f"🎯 Average Confidence: {results_df['Probability'].mean():.2f}")

# Show if any questions were dropped due to embedding failures
if len(processed_questions) < len(questions):
    dropped_count = len(questions) - len(processed_questions)
    print(f"⚠️  Questions Dropped (embedding failed): {dropped_count}")

print()

# Show topics found
topic_counts = results_df.groupby('Topic_Name').size().reset_index(name='Count')
topic_counts = topic_counts.sort_values('Count', ascending=False)

print("🏆 TOP TOPICS DISCOVERED:")
print("-" * 40)
for _, row in topic_counts.head(10).iterrows():
    percentage = (row['Count'] / len(results_df)) * 100
    print(f"📌 {row['Topic_Name']:<30} {row['Count']:>3} questions ({percentage:.1f}%)")

# Quality assessment
print(f"\n📊 ANALYSIS QUALITY:")
print(f"   • Cluster Count: {'Excellent' if 5 <= len(topic_counts) <= 20 else 'Good' if len(topic_counts) > 0 else 'Poor'}")
print(f"   • Categorization: {'Excellent' if categorization_rate > 80 else 'Good' if categorization_rate > 60 else 'Fair'}")
print(f"   • Data Coverage: {'Complete' if len(processed_questions) == len(questions) else f'{len(processed_questions)}/{len(questions)} processed'}")

print("\n✅ Results processed successfully!")

## 📊 Step 7: Interactive Visualizations

In [None]:
# Topic Distribution Bar Chart
print("📊 Creating topic distribution chart...")

fig = px.bar(
    topic_counts.head(15),
    x='Count',
    y='Topic_Name',
    orientation='h',
    title=f"📊 Topic Distribution - Top 15 Topics ({len(results_df)} questions total)",
    color='Count',
    color_continuous_scale='Viridis',
    height=600
)

fig.update_layout(
    showlegend=False,
    yaxis_title=None,
    xaxis_title="Number of Questions",
    title_x=0.5
)

fig.show()
print("✅ Bar chart created!\n")

In [None]:
# Interactive Scatter Plot
print("🗺️ Creating interactive question clusters map...")

# Filter out uncategorized for cleaner visualization
categorized_df = results_df[results_df['Topic_ID'] != -1].copy()

# Truncate long questions for hover display
categorized_df['Question_Short'] = [
    q[:100] + '...' if len(q) > 100 else q for q in categorized_df['Question']
]

fig = px.scatter(
    categorized_df,
    x='UMAP_X',
    y='UMAP_Y',
    color='Topic_Name',
    custom_data=['Question_Short', 'Topic_Name', 'Probability'],
    title=f"🗺️ Question Clusters Map - {len(categorized_df)} Categorized Questions (Hover to see questions)",
    width=900,
    height=700
)

# Customize hover template. Plotly Express splits custom_data per trace,
# so each point shows its own question (a single hovertext list passed to
# update_traces would be reused by every trace and mislabel points).
fig.update_traces(
    hovertemplate='<b>%{customdata[0]}</b><br>' +
                  'Topic: %{customdata[1]}<br>' +
                  'Confidence: %{customdata[2]:.2f}<br>' +
                  '<extra></extra>'
)

fig.update_layout(
    showlegend=True,
    title_x=0.5,
    legend=dict(
        orientation="v",
        yanchor="top",
        y=1,
        xanchor="left",
        x=1.01
    ),
    xaxis_title="UMAP Dimension 1",
    yaxis_title="UMAP Dimension 2"
)

fig.show()
print("✅ Scatter plot created! Hover over points to see individual questions.\n")

## 🔍 Step 8: Explore Questions by Topic

In [None]:
# Interactive topic exploration
print("🔍 Explore questions by topic:\n")

# Show available topics
available_topics = sorted(results_df['Topic_Name'].unique().tolist())
print("📋 Available topics:")
for i, topic in enumerate(available_topics, 1):
    count = len(results_df[results_df['Topic_Name'] == topic])
    print(f"{i:2d}. {topic:<30} ({count} questions)")

print("\n" + "="*60)
print("💡 To explore a specific topic, run the next cell and enter the topic number.")

In [None]:
# Topic-specific question browser
try:
    topic_num = int(input("Enter topic number to explore (1-{}):".format(len(available_topics))))
    
    if 1 <= topic_num <= len(available_topics):
        selected_topic = available_topics[topic_num - 1]
        topic_questions = results_df[results_df['Topic_Name'] == selected_topic]
        
        print(f"\n🏷️  TOPIC: {selected_topic}")
        print(f"📊 {len(topic_questions)} questions in this topic\n")
        
        # Sort by confidence
        topic_questions = topic_questions.sort_values('Probability', ascending=False)
        
        print("📝 Questions (sorted by confidence):")
        print("-" * 60)
        
        for i, (_, row) in enumerate(topic_questions.iterrows(), 1):
            confidence = row['Probability']
            question = row['Question']
            
            print(f"{i:2d}. [{confidence:.2f}] {question}")
            
            if i >= 20:  # Limit to first 20
                remaining = len(topic_questions) - 20
                if remaining > 0:
                    print(f"\n... and {remaining} more questions in this topic")
                break
    else:
        print("❌ Invalid topic number!")
        
except ValueError:
    print("❌ Please enter a valid number!")
except KeyboardInterrupt:
    print("\n⏹️  Exploration cancelled.")

## 💾 Step 9: Export Results

In [None]:
# Export results to CSV
timestamp = datetime.now().strftime("%Y%m%d_%H%M")
csv_filename = f"pathway_questions_analysis_{timestamp}.csv"

# Export full results
results_df.to_csv(csv_filename, index=False)
print(f"✅ Full analysis exported to: {csv_filename}")

# Create summary report
summary_filename = f"analysis_summary_{timestamp}.txt"
with open(summary_filename, 'w', encoding='utf-8') as f:
    f.write("BYU PATHWAY QUESTIONS ANALYSIS SUMMARY\n")
    f.write("=" * 50 + "\n\n")
    f.write(f"Analysis Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
    f.write(f"Total Questions: {len(results_df)}\n")
    f.write(f"Topics Found: {results_df.loc[results_df['Topic_ID'] != -1, 'Topic_Name'].nunique()}\n")
    f.write(f"Questions Categorized: {len(results_df[results_df['Topic_ID'] != -1])} ({len(results_df[results_df['Topic_ID'] != -1])/len(results_df)*100:.1f}%)\n")
    f.write(f"Average Confidence: {results_df['Probability'].mean():.2f}\n\n")
    
    f.write("TOP TOPICS:\n")
    f.write("-" * 30 + "\n")
    for _, row in topic_counts.head(15).iterrows():
        f.write(f"{row['Topic_Name']:<30} {row['Count']:>3} questions\n")

print(f"✅ Summary report saved to: {summary_filename}")

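# Download files (Colab only; in a local environment `files` is not imported)
if IN_COLAB:
    print("\n📥 Downloading files...")
    files.download(csv_filename)
    files.download(summary_filename)
    print("\n🎉 Analysis complete! Your files have been downloaded.")
else:
    print(f"\n🎉 Analysis complete! Files saved in: {os.getcwd()}")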
print("\n📋 What you received:")
print(f"   • {csv_filename} - Full analysis with all questions, topics, and confidence scores")
print(f"   • {summary_filename} - Summary report with key insights")
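The exported CSV can be inspected later without re-running the analysis. The sketch below uses only the standard library and assumes the column name `Topic_Name` written by the export cell above; the helper name `topic_summary` is hypothetical.

```python
import csv
from collections import Counter

def topic_summary(csv_path):
    """Return (topic, count, percent) tuples from an exported CSV, largest first."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    counts = Counter(row["Topic_Name"] for row in rows)
    total = len(rows)
    return [
        (name, count, round(count / total * 100, 1))
        for name, count in counts.most_common()
    ]

# Possible usage after the export cell has run:
# for name, count, pct in topic_summary(csv_filename):
#     print(f"{name:<30} {count:>3} questions ({pct}%)")
```

Reading the file back this way also serves as a quick check that the export completed and contains the expected columns.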

## 🎯 Analysis Complete!

### What you accomplished:
- ✅ **Processed** your questions using state-of-the-art AI embeddings
- ✅ **Discovered** meaningful topic clusters automatically
- ✅ **Generated** AI-enhanced topic labels
- ✅ **Created** interactive visualizations
- ✅ **Exported** results for further analysis

### Next Steps:
1. **Review** the downloaded CSV file with all results
2. **Share** insights with your team using the summary report
3. **Use** the topic clusters to improve student support
4. **Re-run** this analysis with new questions as they come in

---

### 🔄 Want to analyze more questions?
Simply restart this notebook and upload a new file!

### 🌐 Need a persistent dashboard?
Consider using the **Streamlit web app** version for ongoing analysis and team sharing.

---

*Built with ❤️ for BYU Pathway • Powered by OpenAI, BERTopic, and Google Colab*